
Example 46 with TokenizerFactory

Use of org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory in project deeplearning4j by deeplearning4j.

In the class DefaulTokenizerTests, method testDefaultTokenizer1.

@Test
public void testDefaultTokenizer1() throws Exception {
    String toTokenize = "Mary had a little lamb.";
    TokenizerFactory t = new DefaultTokenizerFactory();
    Tokenizer tokenizer = t.create(toTokenize);
    Tokenizer tokenizer2 = t.create(new ByteArrayInputStream(toTokenize.getBytes()));
    int position = 1;
    while (tokenizer2.hasMoreTokens()) {
        String tok1 = tokenizer.nextToken();
        String tok2 = tokenizer2.nextToken();
        log.info("Position: [" + position + "], token1: '" + tok1 + "', token 2: '" + tok2 + "'");
        position++;
        assertEquals(tok1, tok2);
    }
    ClassPathResource resource = new ClassPathResource("reuters/5250");
    String str = FileUtils.readFileToString(resource.getFile());
    int stringCount = t.create(str).countTokens();
    int stringCount2 = t.create(resource.getInputStream()).countTokens();
    assertTrue(Math.abs(stringCount - stringCount2) < 2);
}
Also used : DefaultTokenizerFactory(org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory) TokenizerFactory(org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory) ByteArrayInputStream(java.io.ByteArrayInputStream) ClassPathResource(org.datavec.api.util.ClassPathResource) Test(org.junit.Test)
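
The test above checks that tokenizing a String and tokenizing a stream of the same bytes yield the same tokens. Note that it calls getBytes() without a charset, so the bytes depend on the platform default encoding. The following is a minimal sketch of the same consistency check using only JDK classes. The class StringVsStreamTokens and its methods are hypothetical stand-ins and are not DL4J's tokenizers. The sketch pins UTF-8 on both sides as an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not part of DL4J: compares String-based and
// stream-based tokenization of the same text.
final class StringVsStreamTokens {

    // Splits on whitespace using the JDK's StringTokenizer.
    static List<String> fromString(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Reads the whole stream, decodes it as UTF-8 (assumed encoding),
    // then tokenizes it exactly like the String variant.
    static List<String> fromStream(InputStream in) {
        try {
            return fromString(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Mary had a little lamb.";
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        // Both paths see the same characters, so the token lists match.
        System.out.println(fromString(text).equals(fromStream(in))); // prints "true"
    }
}
```

Encoding and decoding with the same explicit charset keeps the two token lists identical regardless of the platform default.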

Example 47 with TokenizerFactory

Use of org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory in project deeplearning4j by deeplearning4j.

In the class TokenizerFunction, method getTokenizerFactory.

private TokenizerFactory getTokenizerFactory() {
    try {
        TokenPreProcess tokenPreProcessInst = null;
        // token preprocess CAN be undefined
        if (tokenizerPreprocessorClazz != null && !tokenizerPreprocessorClazz.isEmpty()) {
            Class<? extends TokenPreProcess> clazz = (Class<? extends TokenPreProcess>) Class.forName(tokenizerPreprocessorClazz);
            tokenPreProcessInst = clazz.newInstance();
        }
        Class<? extends TokenizerFactory> clazz2 = (Class<? extends TokenizerFactory>) Class.forName(tokenizerFactoryClazz);
        tokenizerFactory = clazz2.newInstance();
        if (tokenPreProcessInst != null)
            tokenizerFactory.setTokenPreProcessor(tokenPreProcessInst);
        if (nGrams > 1) {
            tokenizerFactory = new NGramTokenizerFactory(tokenizerFactory, nGrams, nGrams);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return tokenizerFactory;
}
Also used : TokenizerFactory(org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory) NGramTokenizerFactory(org.deeplearning4j.text.tokenization.tokenizerfactory.NGramTokenizerFactory) TokenPreProcess(org.deeplearning4j.text.tokenization.tokenizer.TokenPreProcess)
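
The method above instantiates a factory and an optional preprocessor from class names, using Class.newInstance() (deprecated since Java 9) and printing the stack trace on failure. The following is a minimal sketch of the same reflective-loading pattern with getDeclaredConstructor().newInstance() and a propagated exception. All types here (ReflectiveFactoryLoader, Factory, Preprocessor, WhitespaceFactory, LowerCase) are hypothetical stand-ins so the example runs without DL4J. They are not the library's API, and n-gram wrapping is left out.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch of loading a factory and an optional preprocessor by class name.
final class ReflectiveFactoryLoader {

    // Stand-in for a token preprocessor.
    interface Preprocessor {
        String apply(String token);
    }

    // Stand-in for a tokenizer factory that accepts an optional preprocessor.
    interface Factory {
        void setPreprocessor(Preprocessor p);
        String[] tokenize(String text);
    }

    public static final class LowerCase implements Preprocessor {
        public String apply(String token) {
            return token.toLowerCase(Locale.ROOT);
        }
    }

    public static final class WhitespaceFactory implements Factory {
        private Preprocessor pre; // may stay null: preprocessing is optional

        public void setPreprocessor(Preprocessor p) {
            this.pre = p;
        }

        public String[] tokenize(String text) {
            String[] parts = text.trim().split("\\s+");
            if (pre != null) {
                for (int i = 0; i < parts.length; i++) {
                    parts[i] = pre.apply(parts[i]);
                }
            }
            return parts;
        }
    }

    // Loads className, checks it is a subtype of type, and calls its no-arg constructor.
    static <T> T instantiate(String className, Class<T> type) {
        try {
            return Class.forName(className).asSubclass(type)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Surface the failure to the caller instead of returning a half-built object.
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    // The preprocessor class name may be null or empty, mirroring the optional setting above.
    static Factory load(String factoryClass, String preprocessorClass) {
        Factory factory = instantiate(factoryClass, Factory.class);
        if (preprocessorClass != null && !preprocessorClass.isEmpty()) {
            factory.setPreprocessor(instantiate(preprocessorClass, Preprocessor.class));
        }
        return factory;
    }

    public static void main(String[] args) {
        Factory f = load(WhitespaceFactory.class.getName(), LowerCase.class.getName());
        System.out.println(Arrays.toString(f.tokenize("Mary HAD a Lamb"))); // prints "[mary, had, a, lamb]"
    }
}
```

Using asSubclass replaces the unchecked cast with a runtime type check, and throwing on failure means the caller never receives a null or partially configured factory.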

Aggregations

TokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory): 47
Test (org.junit.Test): 42
DefaultTokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory): 40
CommonPreprocessor (org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor): 29
File (java.io.File): 28
ClassPathResource (org.datavec.api.util.ClassPathResource): 28
BasicLineIterator (org.deeplearning4j.text.sentenceiterator.BasicLineIterator): 24
SentenceIterator (org.deeplearning4j.text.sentenceiterator.SentenceIterator): 22
INDArray (org.nd4j.linalg.api.ndarray.INDArray): 20
VocabWord (org.deeplearning4j.models.word2vec.VocabWord): 19
Word2Vec (org.deeplearning4j.models.word2vec.Word2Vec): 12
UimaSentenceIterator (org.deeplearning4j.text.sentenceiterator.UimaSentenceIterator): 11
ArrayList (java.util.ArrayList): 10
AbstractCache (org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache): 8
Ignore (org.junit.Ignore): 8
AggregatingSentenceIterator (org.deeplearning4j.text.sentenceiterator.AggregatingSentenceIterator): 7
FileSentenceIterator (org.deeplearning4j.text.sentenceiterator.FileSentenceIterator): 7
InMemoryLookupTable (org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable): 6
WordVectors (org.deeplearning4j.models.embeddings.wordvectors.WordVectors): 6
AbstractSequenceIterator (org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator): 6