
Example 1 with VocabConstructor

Use of org.deeplearning4j.models.word2vec.wordstore.VocabConstructor in project deeplearning4j by deeplearning4j.

Class BaseTextVectorizer, method buildVocab.

public void buildVocab() {
    if (vocabCache == null)
        vocabCache = new AbstractCache.Builder<VocabWord>().build();
    SentenceTransformer transformer = new SentenceTransformer.Builder().iterator(this.iterator).tokenizerFactory(tokenizerFactory).build();
    AbstractSequenceIterator<VocabWord> iterator = new AbstractSequenceIterator.Builder<>(transformer).build();
    VocabConstructor<VocabWord> constructor = new VocabConstructor.Builder<VocabWord>().addSource(iterator, minWordFrequency).setTargetVocabCache(vocabCache).setStopWords(stopWords).allowParallelTokenization(isParallel).build();
    constructor.buildJointVocabulary(false, true);
}
Also used: AbstractSequenceIterator (org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator), VocabConstructor (org.deeplearning4j.models.word2vec.wordstore.VocabConstructor), VocabWord (org.deeplearning4j.models.word2vec.VocabWord), SentenceTransformer (org.deeplearning4j.models.sequencevectors.transformers.impl.SentenceTransformer), AbstractCache (org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache)
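
As a rough mental model of what addSource(iterator, minWordFrequency) followed by buildJointVocabulary achieves, the sketch below counts tokens and drops those seen fewer than minWordFrequency times. It uses only the JDK and is not deeplearning4j's implementation; the class VocabSketch, its buildVocab method, and the whitespace tokenization are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for vocabulary construction; not part of deeplearning4j.
class VocabSketch {

    // Counts lower-cased, whitespace-separated tokens and keeps only those
    // that occur at least minWordFrequency times.
    static Map<String, Integer> buildVocab(List<String> sentences, int minWordFrequency) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            for (String token : sentence.toLowerCase().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        // Drop rare tokens, mirroring the role of the minimum word frequency.
        counts.values().removeIf(count -> count < minWordFrequency);
        return counts;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("the day was long", "the night was short");
        System.out.println(buildVocab(corpus, 2)); // prints {the=2, was=2}
    }
}
```

With a minimum frequency of 1, as in the tests further down, every token in the toy corpus would be kept.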

Example 2 with VocabConstructor

Use of org.deeplearning4j.models.word2vec.wordstore.VocabConstructor in project deeplearning4j by deeplearning4j.

Class InMemoryLookupTableTest, method testConsumeOnNonEqualVocabs.

@Test
public void testConsumeOnNonEqualVocabs() throws Exception {
    TokenizerFactory t = new DefaultTokenizerFactory();
    t.setTokenPreProcessor(new CommonPreprocessor());
    AbstractCache<VocabWord> cacheSource = new AbstractCache.Builder<VocabWord>().build();
    ClassPathResource resource = new ClassPathResource("big/raw_sentences.txt");
    BasicLineIterator underlyingIterator = new BasicLineIterator(resource.getFile());
    SentenceTransformer transformer = new SentenceTransformer.Builder().iterator(underlyingIterator).tokenizerFactory(t).build();
    AbstractSequenceIterator<VocabWord> sequenceIterator = new AbstractSequenceIterator.Builder<>(transformer).build();
    VocabConstructor<VocabWord> vocabConstructor = new VocabConstructor.Builder<VocabWord>().addSource(sequenceIterator, 1).setTargetVocabCache(cacheSource).build();
    vocabConstructor.buildJointVocabulary(false, true);
    assertEquals(244, cacheSource.numWords());
    InMemoryLookupTable<VocabWord> mem1 = (InMemoryLookupTable<VocabWord>) new InMemoryLookupTable.Builder<VocabWord>().vectorLength(100).cache(cacheSource).build();
    mem1.resetWeights(true);
    AbstractCache<VocabWord> cacheTarget = new AbstractCache.Builder<VocabWord>().build();
    FileLabelAwareIterator labelAwareIterator = new FileLabelAwareIterator.Builder().addSourceFolder(new ClassPathResource("/paravec/labeled").getFile()).build();
    transformer = new SentenceTransformer.Builder().iterator(labelAwareIterator).tokenizerFactory(t).build();
    sequenceIterator = new AbstractSequenceIterator.Builder<>(transformer).build();
    VocabConstructor<VocabWord> vocabTransfer = new VocabConstructor.Builder<VocabWord>().addSource(sequenceIterator, 1).setTargetVocabCache(cacheTarget).build();
    vocabTransfer.buildMergedVocabulary(cacheSource, true);
    // the +3 accounts for 3 additional entries in the target VocabCache: the labels
    assertEquals(cacheSource.numWords() + 3, cacheTarget.numWords());
    InMemoryLookupTable<VocabWord> mem2 = (InMemoryLookupTable<VocabWord>) new InMemoryLookupTable.Builder<VocabWord>().vectorLength(100).cache(cacheTarget).seed(18).build();
    mem2.resetWeights(true);
    assertNotEquals(mem1.vector("day"), mem2.vector("day"));
    mem2.consume(mem1);
    assertEquals(mem1.vector("day"), mem2.vector("day"));
    assertTrue(mem1.syn0.rows() < mem2.syn0.rows());
    assertEquals(mem1.syn0.rows() + 3, mem2.syn0.rows());
}
Also used: TokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory), DefaultTokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory), BasicLineIterator (org.deeplearning4j.text.sentenceiterator.BasicLineIterator), VocabConstructor (org.deeplearning4j.models.word2vec.wordstore.VocabConstructor), FileLabelAwareIterator (org.deeplearning4j.text.documentiterator.FileLabelAwareIterator), VocabWord (org.deeplearning4j.models.word2vec.VocabWord), SentenceTransformer (org.deeplearning4j.models.sequencevectors.transformers.impl.SentenceTransformer), AbstractCache (org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache), ClassPathResource (org.datavec.api.util.ClassPathResource), CommonPreprocessor (org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor), AbstractSequenceIterator (org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator), Test (org.junit.Test)
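
The assertions in this test describe an observable contract: after mem2.consume(mem1), words known to both tables carry mem1's vectors, while the extra label entries in the larger table remain. The toy class below reproduces only that observable behaviour with plain maps. ToyLookupTable and its methods are hypothetical, and matching entries by word is an assumption; the library may well copy rows by index instead.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy lookup table; it only mimics the behaviour the test asserts.
class ToyLookupTable {
    final Map<String, double[]> vectors = new HashMap<>();

    void put(String word, double... vector) {
        vectors.put(word, vector.clone());
    }

    // Copies the vector of every word in 'other' that this table also knows.
    // Entries that exist only in this table (e.g. labels) are left untouched.
    void consume(ToyLookupTable other) {
        for (Map.Entry<String, double[]> entry : other.vectors.entrySet()) {
            if (vectors.containsKey(entry.getKey())) {
                vectors.put(entry.getKey(), entry.getValue().clone());
            }
        }
    }

    public static void main(String[] args) {
        ToyLookupTable mem1 = new ToyLookupTable();
        mem1.put("day", 0.1, 0.2);

        ToyLookupTable mem2 = new ToyLookupTable();
        mem2.put("day", 0.9, 0.8);
        mem2.put("LABEL_0", 0.5, 0.5); // extra entry, standing in for a label

        mem2.consume(mem1);
        System.out.println(Arrays.toString(mem2.vectors.get("day"))); // prints [0.1, 0.2]
        System.out.println(mem2.vectors.size());                      // prints 2
    }
}
```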

Example 3 with VocabConstructor

Use of org.deeplearning4j.models.word2vec.wordstore.VocabConstructor in project deeplearning4j by deeplearning4j.

Class InMemoryLookupTableTest, method testConsumeOnEqualVocabs.

@Test
public void testConsumeOnEqualVocabs() throws Exception {
    TokenizerFactory t = new DefaultTokenizerFactory();
    t.setTokenPreProcessor(new CommonPreprocessor());
    AbstractCache<VocabWord> cacheSource = new AbstractCache.Builder<VocabWord>().build();
    ClassPathResource resource = new ClassPathResource("big/raw_sentences.txt");
    BasicLineIterator underlyingIterator = new BasicLineIterator(resource.getFile());
    SentenceTransformer transformer = new SentenceTransformer.Builder().iterator(underlyingIterator).tokenizerFactory(t).build();
    AbstractSequenceIterator<VocabWord> sequenceIterator = new AbstractSequenceIterator.Builder<>(transformer).build();
    VocabConstructor<VocabWord> vocabConstructor = new VocabConstructor.Builder<VocabWord>().addSource(sequenceIterator, 1).setTargetVocabCache(cacheSource).build();
    vocabConstructor.buildJointVocabulary(false, true);
    assertEquals(244, cacheSource.numWords());
    InMemoryLookupTable<VocabWord> mem1 = (InMemoryLookupTable<VocabWord>) new InMemoryLookupTable.Builder<VocabWord>().vectorLength(100).cache(cacheSource).seed(17).build();
    mem1.resetWeights(true);
    InMemoryLookupTable<VocabWord> mem2 = (InMemoryLookupTable<VocabWord>) new InMemoryLookupTable.Builder<VocabWord>().vectorLength(100).cache(cacheSource).seed(15).build();
    mem2.resetWeights(true);
    assertNotEquals(mem1.vector("day"), mem2.vector("day"));
    mem2.consume(mem1);
    assertEquals(mem1.vector("day"), mem2.vector("day"));
}
Also used: TokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory), DefaultTokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory), BasicLineIterator (org.deeplearning4j.text.sentenceiterator.BasicLineIterator), VocabConstructor (org.deeplearning4j.models.word2vec.wordstore.VocabConstructor), VocabWord (org.deeplearning4j.models.word2vec.VocabWord), SentenceTransformer (org.deeplearning4j.models.sequencevectors.transformers.impl.SentenceTransformer), AbstractCache (org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache), ClassPathResource (org.datavec.api.util.ClassPathResource), CommonPreprocessor (org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor), AbstractSequenceIterator (org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator), Test (org.junit.Test)

Example 4 with VocabConstructor

Use of org.deeplearning4j.models.word2vec.wordstore.VocabConstructor in project deeplearning4j by deeplearning4j.

Class AbstractCoOccurrencesTest, method testFit1.

@Test
public void testFit1() throws Exception {
    ClassPathResource resource = new ClassPathResource("other/oneline.txt");
    File file = resource.getFile();
    AbstractCache<VocabWord> vocabCache = new AbstractCache.Builder<VocabWord>().build();
    BasicLineIterator underlyingIterator = new BasicLineIterator(file);
    TokenizerFactory t = new DefaultTokenizerFactory();
    t.setTokenPreProcessor(new CommonPreprocessor());
    SentenceTransformer transformer = new SentenceTransformer.Builder().iterator(underlyingIterator).tokenizerFactory(t).build();
    AbstractSequenceIterator<VocabWord> sequenceIterator = new AbstractSequenceIterator.Builder<>(transformer).build();
    VocabConstructor<VocabWord> constructor = new VocabConstructor.Builder<VocabWord>().addSource(sequenceIterator, 1).setTargetVocabCache(vocabCache).build();
    constructor.buildJointVocabulary(false, true);
    AbstractCoOccurrences<VocabWord> coOccurrences = new AbstractCoOccurrences.Builder<VocabWord>().iterate(sequenceIterator).vocabCache(vocabCache).symmetric(false).windowSize(15).build();
    coOccurrences.fit();
    // Collect the co-occurring word pairs from the iterator
    Iterator<Pair<Pair<VocabWord, VocabWord>, Double>> iterator = coOccurrences.iterator();
    assertNotNull(iterator);
    int cnt = 0;
    List<Pair<VocabWord, VocabWord>> list = new ArrayList<>();
    while (iterator.hasNext()) {
        Pair<Pair<VocabWord, VocabWord>, Double> pair = iterator.next();
        list.add(pair.getFirst());
        cnt++;
    }
    log.info("CoOccurrences: " + list);
    assertEquals(16, list.size());
    assertEquals(16, cnt);
}
Also used: BasicLineIterator (org.deeplearning4j.text.sentenceiterator.BasicLineIterator), ArrayList (java.util.ArrayList), VocabWord (org.deeplearning4j.models.word2vec.VocabWord), AbstractCache (org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache), CommonPreprocessor (org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor), Pair (org.deeplearning4j.berkeley.Pair), TokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory), DefaultTokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory), VocabConstructor (org.deeplearning4j.models.word2vec.wordstore.VocabConstructor), SentenceTransformer (org.deeplearning4j.models.sequencevectors.transformers.impl.SentenceTransformer), ClassPathResource (org.datavec.api.util.ClassPathResource), AbstractSequenceIterator (org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator), File (java.io.File), Test (org.junit.Test)
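
To make the windowSize and symmetric(false) options more concrete, here is a minimal JDK-only sketch of windowed co-occurrence counting. It pairs each token only with the tokens that follow it inside the window and weights each pair by the inverse of the distance. Both choices are assumptions in the style of GloVe-like counting, and CoOccurrenceSketch is a hypothetical class, not the AbstractCoOccurrences implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of one-directional, windowed co-occurrence counting.
class CoOccurrenceSketch {

    // Returns a map from "left right" token pairs to their accumulated weight.
    // Only tokens to the right of position i, at most windowSize steps away, are paired with it.
    static Map<String, Double> count(String[] tokens, int windowSize) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            int end = Math.min(tokens.length, i + windowSize + 1);
            for (int j = i + 1; j < end; j++) {
                // Closer neighbours contribute more: weight = 1 / distance (assumed weighting).
                weights.merge(tokens[i] + " " + tokens[j], 1.0 / (j - i), Double::sum);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "b", "c"};
        System.out.println(count(tokens, 2)); // prints {a b=1.0, a c=0.5, b c=1.0}
    }
}
```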

Example 5 with VocabConstructor

Use of org.deeplearning4j.models.word2vec.wordstore.VocabConstructor in project deeplearning4j by deeplearning4j.

Class SequenceVectorsTest, method testAbstractW2VModel.

@Test
public void testAbstractW2VModel() throws Exception {
    ClassPathResource resource = new ClassPathResource("big/raw_sentences.txt");
    File file = resource.getFile();
    logger.info("dtype: {}", Nd4j.dataType());
    AbstractCache<VocabWord> vocabCache = new AbstractCache.Builder<VocabWord>().build();
    /*
            First we build a line iterator over the corpus file
         */
    BasicLineIterator underlyingIterator = new BasicLineIterator(file);
    /*
            Now we need a way to convert lines into Sequences of VocabWords.
            In this example that is SentenceTransformer
         */
    TokenizerFactory t = new DefaultTokenizerFactory();
    t.setTokenPreProcessor(new CommonPreprocessor());
    SentenceTransformer transformer = new SentenceTransformer.Builder().iterator(underlyingIterator).tokenizerFactory(t).build();
    /*
            And we wrap that transformer in an AbstractSequenceIterator
         */
    AbstractSequenceIterator<VocabWord> sequenceIterator = new AbstractSequenceIterator.Builder<>(transformer).build();
    /*
            Now we build the vocabulary from the sequence iterator.
            This phase can be skipped by setting SequenceVectors.resetModel(true), in which case the vocabulary is built internally
        */
    VocabConstructor<VocabWord> constructor = new VocabConstructor.Builder<VocabWord>().addSource(sequenceIterator, 5).setTargetVocabCache(vocabCache).build();
    constructor.buildJointVocabulary(false, true);
    assertEquals(242, vocabCache.numWords());
    assertEquals(634303, vocabCache.totalWordOccurrences());
    VocabWord wordz = vocabCache.wordFor("day");
    logger.info("Wordz: " + wordz);
    /*
            Time to build a WeightLookupTable instance for our new model
        */
    WeightLookupTable<VocabWord> lookupTable = new InMemoryLookupTable.Builder<VocabWord>().lr(0.025).vectorLength(150).useAdaGrad(false).cache(vocabCache).build();
    /*
            Calling resetWeights() here is needed only if SequenceVectors.resetModel() is set to false;
            if it is set to true, the weights are reset internally
        */
    lookupTable.resetWeights(true);
    /*
            Now we can build a SequenceVectors model that suits our needs
         */
    SequenceVectors<VocabWord> vectors = new SequenceVectors.Builder<VocabWord>(new VectorsConfiguration()).minWordFrequency(5).lookupTable(lookupTable).iterate(sequenceIterator).vocabCache(vocabCache).batchSize(250).iterations(1).epochs(1).resetModel(false).trainElementsRepresentation(true).trainSequencesRepresentation(false).build();
    /*
            Now that all options are set, we just call fit()
         */
    logger.info("Starting training...");
    vectors.fit();
    logger.info("Training finished.");
    /*
            As soon as fit() returns, the model is considered built and we can test it.
            Please note: all similarity context is handled via SequenceElement labels, so if you're using SequenceVectors to build models for complex
            objects/relations, take care of label uniqueness and meaning yourself.
         */
    double sim = vectors.similarity("day", "night");
    logger.info("Day/night similarity: " + sim);
    assertTrue(sim > 0.6d);
    Collection<String> labels = vectors.wordsNearest("day", 10);
    logger.info("Nearest labels to 'day': " + labels);
}
Also used: BasicLineIterator (org.deeplearning4j.text.sentenceiterator.BasicLineIterator), VectorsConfiguration (org.deeplearning4j.models.embeddings.loader.VectorsConfiguration), VocabWord (org.deeplearning4j.models.word2vec.VocabWord), AbstractCache (org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache), CommonPreprocessor (org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor), InMemoryLookupTable (org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable), TokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory), DefaultTokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory), VocabConstructor (org.deeplearning4j.models.word2vec.wordstore.VocabConstructor), SentenceTransformer (org.deeplearning4j.models.sequencevectors.transformers.impl.SentenceTransformer), ClassPathResource (org.datavec.api.util.ClassPathResource), AbstractSequenceIterator (org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator), File (java.io.File), Test (org.junit.Test)
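
The test checks that similarity("day", "night") exceeds 0.6. Assuming the similarity in question is cosine similarity between the two word vectors (a common choice for word embeddings, not confirmed by the snippet itself), the computation can be sketched with plain arrays. CosineSketch is a hypothetical helper, not a deeplearning4j class.

```java
// Hypothetical helper illustrating cosine similarity between two embedding vectors.
class CosineSketch {

    // Returns dot(a, b) / (|a| * |b|); the result is NaN if either vector has zero length.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] same = {1.0, 0.0};
        double[] orthogonal = {0.0, 1.0};
        System.out.println(cosine(same, same));       // prints 1.0
        System.out.println(cosine(same, orthogonal)); // prints 0.0
    }
}
```

Values near 1 mean the vectors point in almost the same direction, so a threshold such as 0.6 asks for clearly related, though not identical, embeddings.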

Aggregations

AbstractSequenceIterator (org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator): 5
SentenceTransformer (org.deeplearning4j.models.sequencevectors.transformers.impl.SentenceTransformer): 5
VocabWord (org.deeplearning4j.models.word2vec.VocabWord): 5
VocabConstructor (org.deeplearning4j.models.word2vec.wordstore.VocabConstructor): 5
AbstractCache (org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache): 5
ClassPathResource (org.datavec.api.util.ClassPathResource): 4
BasicLineIterator (org.deeplearning4j.text.sentenceiterator.BasicLineIterator): 4
CommonPreprocessor (org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor): 4
DefaultTokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory): 4
TokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory): 4
Test (org.junit.Test): 4
File (java.io.File): 2
ArrayList (java.util.ArrayList): 1
Pair (org.deeplearning4j.berkeley.Pair): 1
InMemoryLookupTable (org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable): 1
VectorsConfiguration (org.deeplearning4j.models.embeddings.loader.VectorsConfiguration): 1
FileLabelAwareIterator (org.deeplearning4j.text.documentiterator.FileLabelAwareIterator): 1