use of org.deeplearning4j.text.tokenization.tokenizer.Tokenizer in project deeplearning4j by deeplearning4j.
the class Windows method windows.
/**
 * Constructs a list of windows from the given text, using a window size of 5.
 * Note that padding for each window is created as well.
 * @param words the text to tokenize and construct windows from
 * @param tokenizerFactory the tokenizer factory to use
 * @return the list of windows for the tokenized string
 */
public static List<Window> windows(String words, TokenizerFactory tokenizerFactory) {
    Tokenizer tokenizer = tokenizerFactory.create(words);
    // Collect every token before building the windows
    List<String> list = new ArrayList<>();
    while (tokenizer.hasMoreTokens()) {
        list.add(tokenizer.nextToken());
    }
    // 5 is the window size passed to the list-based overload
    return windows(list, 5);
}
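The Javadoc above mentions windowing with padding but the snippet delegates that work to an overload not shown here. The following is a minimal, JDK-only sketch of the general idea, not the deeplearning4j implementation: the class name WindowSketch, the whitespace tokenizer, and the padding markers "<s>" and "</s>" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSketch {

    // Hypothetical padding markers; an assumption, not taken from the snippet above.
    static final String PAD_START = "<s>";
    static final String PAD_END = "</s>";

    /** Splits on whitespace; stands in for the Tokenizer loop in the snippet above. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /**
     * Builds one window per token, centered on that token.
     * Positions that fall outside the token list are filled with padding.
     * Assumes an odd windowSize so the focus token sits in the middle.
     */
    public static List<List<String>> windows(List<String> tokens, int windowSize) {
        int half = windowSize / 2;
        List<List<String>> result = new ArrayList<>();
        for (int center = 0; center < tokens.size(); center++) {
            List<String> window = new ArrayList<>(windowSize);
            for (int i = center - half; i <= center + half; i++) {
                if (i < 0) {
                    window.add(PAD_START);
                } else if (i >= tokens.size()) {
                    window.add(PAD_END);
                } else {
                    window.add(tokens.get(i));
                }
            }
            result.add(window);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("PEARSON CONCENTRATES ON FOUR SECTORS");
        for (List<String> window : windows(tokens, 5)) {
            System.out.println(window);
        }
    }
}
```

Under these assumptions, every token yields exactly one window, so the first and last windows are the ones that contain padding.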
use of org.deeplearning4j.text.tokenization.tokenizer.Tokenizer in project deeplearning4j by deeplearning4j.
the class DefaultDocumentIteratorTest method testDocumentIterator.
@Test
public void testDocumentIterator() throws Exception {
    ClassPathResource reuters5250 = new ClassPathResource("/reuters/5250");
    File f = reuters5250.getFile();
    DocumentIterator iter = new FileDocumentIterator(f.getAbsolutePath());
    TokenizerFactory t = new DefaultTokenizerFactory();
    // The document is expected to begin with: PEARSON CONCENTRATES ON FOUR SECTORS
    String[] list = "PEARSON CONCENTRATES ON FOUR SECTORS".split(" ");
    // try-with-resources closes the stream even if an assertion fails
    try (InputStream doc = iter.nextDocument()) {
        Tokenizer next = t.create(doc);
        int count = 0;
        while (next.hasMoreTokens() && count < list.length) {
            String token = next.nextToken();
            assertEquals(list[count++], token);
        }
    }
}