
Example 16 with Tokenizer

Use of org.deeplearning4j.text.tokenization.tokenizer.Tokenizer in project deeplearning4j by deeplearning4j.

From the class Windows, method windows.

/**
 * Constructs a list of windows using a fixed window size of 5.
 * Note that padding for each window is created as well.
 * @param words the words to tokenize and construct windows from
 * @param tokenizerFactory tokenizer factory to use
 * @return the list of windows for the tokenized string
 */
public static List<Window> windows(String words, TokenizerFactory tokenizerFactory) {
    Tokenizer tokenizer = tokenizerFactory.create(words);
    List<String> list = new ArrayList<>();
    // Drain the tokenizer into a list before windowing
    while (tokenizer.hasMoreTokens()) {
        list.add(tokenizer.nextToken());
    }
    // Delegate to the list-based overload with a hard-coded window size of 5
    return windows(list, 5);
}
Also used : ArrayList(java.util.ArrayList) StringTokenizer(java.util.StringTokenizer) DefaultStreamTokenizer(org.deeplearning4j.text.tokenization.tokenizer.DefaultStreamTokenizer) Tokenizer(org.deeplearning4j.text.tokenization.tokenizer.Tokenizer)
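
The tokenize-then-window pattern above can be sketched without deeplearning4j on the classpath. The following is only an illustration under stated assumptions: WindowSketch is a hypothetical class, java.util.StringTokenizer stands in for the Tokenizer produced by a TokenizerFactory, a window is a plain List<String> rather than a Window object, and the padding markers "<s>" and "</s>" are assumed, not taken from the snippet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not part of deeplearning4j: tokenizes a string and
// builds fixed-size context windows, padding at the sentence boundaries.
class WindowSketch {

    // Assumed padding markers; the real Windows class may use different ones.
    static final String PAD_BEGIN = "<s>";
    static final String PAD_END = "</s>";

    // Drain a tokenizer into a list, mirroring the hasMoreTokens/nextToken loop.
    static List<String> tokenize(String words) {
        StringTokenizer tokenizer = new StringTokenizer(words);
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Build one window per token, centered on that token.
    // windowSize is assumed to be odd so the window is symmetric.
    static List<List<String>> windows(String words, int windowSize) {
        List<String> tokens = tokenize(words);
        int half = windowSize / 2;
        List<List<String>> result = new ArrayList<>();
        for (int center = 0; center < tokens.size(); center++) {
            List<String> window = new ArrayList<>(windowSize);
            for (int i = center - half; i <= center + half; i++) {
                if (i < 0) {
                    window.add(PAD_BEGIN);      // position before the first token
                } else if (i >= tokens.size()) {
                    window.add(PAD_END);        // position after the last token
                } else {
                    window.add(tokens.get(i));
                }
            }
            result.add(window);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same default window size (5) as the snippet above.
        for (List<String> window : windows("the quick brown fox", 5)) {
            System.out.println(window);
        }
    }
}
```

Each token yields exactly one window, so a four-word input produces four windows of five entries each, with padding filling the slots that fall outside the sentence.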

Example 17 with Tokenizer

Use of org.deeplearning4j.text.tokenization.tokenizer.Tokenizer in project deeplearning4j by deeplearning4j.

From the class DefaultDocumentIteratorTest, method testDocumentIterator.

@Test
public void testDocumentIterator() throws Exception {
    ClassPathResource reuters5250 = new ClassPathResource("/reuters/5250");
    File f = reuters5250.getFile();
    DocumentIterator iter = new FileDocumentIterator(f.getAbsolutePath());
    TokenizerFactory t = new DefaultTokenizerFactory();
    // Expected leading tokens of the document
    String[] list = "PEARSON CONCENTRATES ON FOUR SECTORS".split(" ");
    // try-with-resources closes the stream even if an assertion fails
    try (InputStream doc = iter.nextDocument()) {
        Tokenizer next = t.create(doc);
        int count = 0;
        while (next.hasMoreTokens() && count < list.length) {
            String token = next.nextToken();
            assertEquals(list[count++], token);
        }
    }
}
Also used : DefaultTokenizerFactory(org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory) TokenizerFactory(org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory) InputStream(java.io.InputStream) File(java.io.File) Tokenizer(org.deeplearning4j.text.tokenization.tokenizer.Tokenizer) ClassPathResource(org.datavec.api.util.ClassPathResource) Test(org.junit.Test)
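
The stream-based check in this test (read the leading tokens of a document and compare them with an expected phrase) can be approximated with the standard library alone. This is a rough sketch, not the deeplearning4j implementation: StreamTokenSketch and firstTokens are hypothetical names, java.util.Scanner stands in for the stream tokenizer, an in-memory stream replaces the Reuters file, and the text after the first five words in main is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper, not part of deeplearning4j: reads whitespace-separated
// tokens from an InputStream, stopping after a fixed number of tokens.
class StreamTokenSketch {

    static List<String> firstTokens(InputStream in, int limit) {
        List<String> tokens = new ArrayList<>();
        // Closing the Scanner also closes the underlying stream.
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            while (scanner.hasNext() && tokens.size() < limit) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Illustrative document body; only the first five words come from the test above.
        String text = "PEARSON CONCENTRATES ON FOUR SECTORS and some trailing text";
        InputStream doc = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        System.out.println(firstTokens(doc, 5));
    }
}
```

Bounding the loop by a token limit mirrors the `count < list.length` condition in the test, which compares only the start of the document and ignores everything after it.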

Aggregations

Tokenizer (org.deeplearning4j.text.tokenization.tokenizer.Tokenizer) 17
ArrayList (java.util.ArrayList) 5
DefaultStreamTokenizer (org.deeplearning4j.text.tokenization.tokenizer.DefaultStreamTokenizer) 5
Test (org.junit.Test) 5
StringTokenizer (java.util.StringTokenizer) 4
File (java.io.File) 2
ClassPathResource (org.datavec.api.util.ClassPathResource) 2
TokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory) 2
InputStream (java.io.InputStream) 1
List (java.util.List) 1
Pair (org.deeplearning4j.berkeley.Pair) 1
Sequence (org.deeplearning4j.models.sequencevectors.sequence.Sequence) 1
VocabWord (org.deeplearning4j.models.word2vec.VocabWord) 1
BasicLineIterator (org.deeplearning4j.text.sentenceiterator.BasicLineIterator) 1
SentenceIterator (org.deeplearning4j.text.sentenceiterator.SentenceIterator) 1
DefaultTokenizer (org.deeplearning4j.text.tokenization.tokenizer.DefaultTokenizer) 1
JapaneseTokenizer (org.deeplearning4j.text.tokenization.tokenizer.JapaneseTokenizer) 1
NGramTokenizer (org.deeplearning4j.text.tokenization.tokenizer.NGramTokenizer) 1
UimaTokenizer (org.deeplearning4j.text.tokenization.tokenizer.UimaTokenizer) 1
DefaultTokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory) 1