Example 1 with AbstractSequenceIterator

Use of org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator in project deeplearning4j by deeplearning4j.

In the class VocabConstructorTest, method testBuildJointVocabulary2.

@Test
public void testBuildJointVocabulary2() throws Exception {
    File inputFile = new ClassPathResource("big/raw_sentences.txt").getFile();
    SentenceIterator iter = new BasicLineIterator(inputFile);
    VocabCache<VocabWord> cache = new AbstractCache.Builder<VocabWord>().build();
    // t is a TokenizerFactory field of the enclosing test class (its declaration is not shown here)
    SentenceTransformer transformer = new SentenceTransformer.Builder().iterator(iter).tokenizerFactory(t).build();
    AbstractSequenceIterator<VocabWord> sequenceIterator = new AbstractSequenceIterator.Builder<>(transformer).build();
    VocabConstructor<VocabWord> constructor = new VocabConstructor.Builder<VocabWord>().addSource(sequenceIterator, 5).useAdaGrad(false).setTargetVocabCache(cache).build();
    constructor.buildJointVocabulary(false, true);
    //        assertFalse(cache.hasToken("including"));
    assertEquals(242, cache.numWords());
    assertEquals("i", cache.wordAtIndex(1));
    assertEquals("it", cache.wordAtIndex(0));
    assertEquals(634303, cache.totalWordOccurrences());
}
Also used : BasicLineIterator(org.deeplearning4j.text.sentenceiterator.BasicLineIterator) VocabWord(org.deeplearning4j.models.word2vec.VocabWord) SentenceTransformer(org.deeplearning4j.models.sequencevectors.transformers.impl.SentenceTransformer) AbstractCache(org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache) ClassPathResource(org.datavec.api.util.ClassPathResource) SentenceIterator(org.deeplearning4j.text.sentenceiterator.SentenceIterator) AbstractSequenceIterator(org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator) File(java.io.File) Test(org.junit.Test)
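
The second argument to addSource(sequenceIterator, 5) reads as a minimum word frequency (Example 5 passes minWordFrequency in the same position). The plain-Java sketch below illustrates that kind of frequency cut-off on a tiny corpus. It is not DL4J's implementation: the class name MinFrequencyVocab, the build method, and the whitespace tokenization are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of deeplearning4j.
final class MinFrequencyVocab {

    // Counts lower-cased, whitespace-separated tokens and keeps only those
    // that occur at least minFrequency times across all sentences.
    public static Map<String, Integer> build(List<String> sentences, int minFrequency) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            for (String token : sentence.trim().toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        // Drop every token whose count is below the threshold.
        counts.values().removeIf(count -> count < minFrequency);
        return counts;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("it is a day", "it is a night", "i like it");
        System.out.println(build(sentences, 2)); // prints {it=3, is=2, a=2}
    }
}
```

With a threshold of 0 or 1 every token survives, which is consistent with Example 2 passing 0 and ending up with a slightly larger vocabulary than Example 1.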

Example 2 with AbstractSequenceIterator

Use of org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator in project deeplearning4j by deeplearning4j.

In the class VocabConstructorTest, method testBuildJointVocabulary1.

@Test
public void testBuildJointVocabulary1() throws Exception {
    File inputFile = new ClassPathResource("big/raw_sentences.txt").getFile();
    SentenceIterator iter = new BasicLineIterator(inputFile);
    VocabCache<VocabWord> cache = new AbstractCache.Builder<VocabWord>().build();
    // t is a TokenizerFactory field of the enclosing test class (its declaration is not shown here)
    SentenceTransformer transformer = new SentenceTransformer.Builder().iterator(iter).tokenizerFactory(t).build();
    // Wrap the transformer in an AbstractSequenceIterator
    AbstractSequenceIterator<VocabWord> sequenceIterator = new AbstractSequenceIterator.Builder<>(transformer).build();
    VocabConstructor<VocabWord> constructor = new VocabConstructor.Builder<VocabWord>().addSource(sequenceIterator, 0).useAdaGrad(false).setTargetVocabCache(cache).build();
    constructor.buildJointVocabulary(true, false);
    assertEquals(244, cache.numWords());
    assertEquals(0, cache.totalWordOccurrences());
}
Also used : BasicLineIterator(org.deeplearning4j.text.sentenceiterator.BasicLineIterator) VocabWord(org.deeplearning4j.models.word2vec.VocabWord) SentenceTransformer(org.deeplearning4j.models.sequencevectors.transformers.impl.SentenceTransformer) AbstractCache(org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache) ClassPathResource(org.datavec.api.util.ClassPathResource) SentenceIterator(org.deeplearning4j.text.sentenceiterator.SentenceIterator) AbstractSequenceIterator(org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator) File(java.io.File) Test(org.junit.Test)

Example 3 with AbstractSequenceIterator

Use of org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator in project deeplearning4j by deeplearning4j.

In the class SequenceVectorsTest, method testInternalVocabConstruction.

@Test
public void testInternalVocabConstruction() throws Exception {
    ClassPathResource resource = new ClassPathResource("big/raw_sentences.txt");
    File file = resource.getFile();
    BasicLineIterator underlyingIterator = new BasicLineIterator(file);
    TokenizerFactory t = new DefaultTokenizerFactory();
    t.setTokenPreProcessor(new CommonPreprocessor());
    SentenceTransformer transformer = new SentenceTransformer.Builder().iterator(underlyingIterator).tokenizerFactory(t).build();
    AbstractSequenceIterator<VocabWord> sequenceIterator = new AbstractSequenceIterator.Builder<>(transformer).build();
    SequenceVectors<VocabWord> vectors = new SequenceVectors.Builder<VocabWord>(new VectorsConfiguration()).minWordFrequency(5).iterate(sequenceIterator).batchSize(250).iterations(1).epochs(1).resetModel(false).trainElementsRepresentation(true).build();
    logger.info("Fitting model...");
    vectors.fit();
    logger.info("Model ready...");
    double sim = vectors.similarity("day", "night");
    logger.info("Day/night similarity: " + sim);
    assertTrue(sim > 0.6d);
    Collection<String> labels = vectors.wordsNearest("day", 10);
    logger.info("Nearest labels to 'day': " + labels);
}
Also used : BasicLineIterator(org.deeplearning4j.text.sentenceiterator.BasicLineIterator) TokenizerFactory(org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory) DefaultTokenizerFactory(org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory) VectorsConfiguration(org.deeplearning4j.models.embeddings.loader.VectorsConfiguration) VocabWord(org.deeplearning4j.models.word2vec.VocabWord) SentenceTransformer(org.deeplearning4j.models.sequencevectors.transformers.impl.SentenceTransformer) ClassPathResource(org.datavec.api.util.ClassPathResource) CommonPreprocessor(org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor) AbstractSequenceIterator(org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator) File(java.io.File) Test(org.junit.Test)
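
The test above asserts that vectors.similarity("day", "night") exceeds 0.6. Assuming that similarity is the cosine of the angle between the two word vectors (an assumption about the library, not something the snippet states), the computation can be sketched in plain Java as follows. The class CosineSimilaritySketch and its cosine method are hypothetical names used only for this illustration.

```java
// Hypothetical helper, not part of deeplearning4j.
final class CosineSimilaritySketch {

    // Returns dot(a, b) / (|a| * |b|), or 0.0 if either vector has zero length.
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // Convention chosen for this sketch to avoid dividing by zero.
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up three-dimensional vectors, purely for demonstration.
        double[] day = {0.9, 0.1, 0.3};
        double[] night = {0.8, 0.2, 0.4};
        System.out.println("similarity: " + cosine(day, night));
    }
}
```

The result ranges from -1 to 1, so a threshold such as 0.6 asks for vectors that point in broadly the same direction.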

Example 4 with AbstractSequenceIterator

Use of org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator in project deeplearning4j by deeplearning4j.

In the class SequenceVectorsTest, method testDeepWalk.

@Test
@Ignore
public void testDeepWalk() throws Exception {
    Heartbeat.getInstance().disableHeartbeat();
    AbstractCache<Blogger> vocabCache = new AbstractCache.Builder<Blogger>().build();
    Graph<Blogger, Double> graph = buildGraph();
    GraphWalker<Blogger> walker = new PopularityWalker.Builder<>(graph).setNoEdgeHandling(NoEdgeHandling.RESTART_ON_DISCONNECTED).setWalkLength(40).setWalkDirection(WalkDirection.FORWARD_UNIQUE).setRestartProbability(0.05).setPopularitySpread(10).setPopularityMode(PopularityMode.MAXIMUM).setSpreadSpectrum(SpreadSpectrum.PROPORTIONAL).build();
    /*
        GraphWalker<Blogger> walker = new RandomWalker.Builder<Blogger>(graph)
                .setNoEdgeHandling(NoEdgeHandling.RESTART_ON_DISCONNECTED)
                .setWalkLength(40)
                .setWalkDirection(WalkDirection.RANDOM)
                .setRestartProbability(0.05)
                .build();
        */
    GraphTransformer<Blogger> graphTransformer = new GraphTransformer.Builder<>(graph).setGraphWalker(walker).shuffleOnReset(true).setVocabCache(vocabCache).build();
    Blogger blogger = graph.getVertex(0).getValue();
    assertEquals(119, blogger.getElementFrequency(), 0.001);
    logger.info("Blogger: " + blogger);
    AbstractSequenceIterator<Blogger> sequenceIterator = new AbstractSequenceIterator.Builder<>(graphTransformer).build();
    WeightLookupTable<Blogger> lookupTable = new InMemoryLookupTable.Builder<Blogger>().lr(0.025).vectorLength(150).useAdaGrad(false).cache(vocabCache).seed(42).build();
    lookupTable.resetWeights(true);
    SequenceVectors<Blogger> vectors = new SequenceVectors.Builder<Blogger>(new VectorsConfiguration()).lookupTable(lookupTable).iterate(sequenceIterator).vocabCache(vocabCache).batchSize(1000).iterations(1).epochs(10).resetModel(false).trainElementsRepresentation(true).trainSequencesRepresentation(false).elementsLearningAlgorithm(new SkipGram<Blogger>()).learningRate(0.025).layerSize(150).sampling(0).negativeSample(0).windowSize(4).workers(6).seed(42).build();
    vectors.fit();
    vectors.setModelUtils(new FlatModelUtils());
    //     logger.info("12: " + Arrays.toString(vectors.getWordVector("12")));
    double sim = vectors.similarity("12", "72");
    Collection<String> list = vectors.wordsNearest("12", 20);
    logger.info("12->72: " + sim);
    printWords("12", list, vectors);
    assertTrue(sim > 0.10);
    assertFalse(Double.isNaN(sim));
}
Also used : VectorsConfiguration(org.deeplearning4j.models.embeddings.loader.VectorsConfiguration) AbstractCache(org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache) AbstractSequenceIterator(org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator) FlatModelUtils(org.deeplearning4j.models.embeddings.reader.impl.FlatModelUtils) Ignore(org.junit.Ignore) Test(org.junit.Test)
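
The DeepWalk test turns graph walks into sequences that SequenceVectors then trains on like sentences. As a rough illustration of how a walker produces such a sequence, here is a plain-Java sketch of a uniform random walk with restart. It is not PopularityWalker or RandomWalker from DL4J: the class RandomWalkSketch, its walk method, and the exact restart semantics (jump back to the start vertex on a dead end or with the given probability) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not part of deeplearning4j.
final class RandomWalkSketch {

    // Produces a walk of exactly `length` vertices beginning at `start`.
    // At each step the walk returns to `start` if the current vertex has no
    // outgoing edges or with probability `restartProbability`; otherwise it
    // moves to a uniformly chosen neighbour.
    public static List<Integer> walk(Map<Integer, List<Integer>> adjacency, int start,
                                     int length, double restartProbability, Random rng) {
        if (length < 1) {
            throw new IllegalArgumentException("length must be at least 1");
        }
        List<Integer> path = new ArrayList<>(length);
        int current = start;
        path.add(current);
        while (path.size() < length) {
            List<Integer> neighbours = adjacency.getOrDefault(current, Collections.emptyList());
            if (neighbours.isEmpty() || rng.nextDouble() < restartProbability) {
                current = start;
            } else {
                current = neighbours.get(rng.nextInt(neighbours.size()));
            }
            path.add(current);
        }
        return path;
    }

    public static void main(String[] args) {
        // Small made-up directed graph; vertex 3 is a dead end.
        Map<Integer, List<Integer>> adjacency = new HashMap<>();
        adjacency.put(0, Arrays.asList(1, 2));
        adjacency.put(1, Arrays.asList(2, 3));
        adjacency.put(2, Arrays.asList(0));
        adjacency.put(3, Collections.emptyList());
        System.out.println(walk(adjacency, 0, 40, 0.05, new Random(42)));
    }
}
```

Each returned list plays the role of one "sentence" of vertex ids; a fixed seed keeps the walks reproducible, much as the test fixes seed(42) for training.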

Example 5 with AbstractSequenceIterator

Use of org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator in project deeplearning4j by deeplearning4j.

In the class BaseTextVectorizer, method buildVocab.

public void buildVocab() {
    if (vocabCache == null)
        vocabCache = new AbstractCache.Builder<VocabWord>().build();
    SentenceTransformer transformer = new SentenceTransformer.Builder().iterator(this.iterator).tokenizerFactory(tokenizerFactory).build();
    // Note: this local variable shadows the this.iterator field used on the previous line
    AbstractSequenceIterator<VocabWord> iterator = new AbstractSequenceIterator.Builder<>(transformer).build();
    VocabConstructor<VocabWord> constructor = new VocabConstructor.Builder<VocabWord>().addSource(iterator, minWordFrequency).setTargetVocabCache(vocabCache).setStopWords(stopWords).allowParallelTokenization(isParallel).build();
    constructor.buildJointVocabulary(false, true);
}
Also used : AbstractSequenceIterator(org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator) VocabConstructor(org.deeplearning4j.models.word2vec.wordstore.VocabConstructor) VocabWord(org.deeplearning4j.models.word2vec.VocabWord) SentenceTransformer(org.deeplearning4j.models.sequencevectors.transformers.impl.SentenceTransformer) AbstractCache(org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache)

Aggregations

AbstractSequenceIterator (org.deeplearning4j.models.sequencevectors.iterators.AbstractSequenceIterator) 12
SentenceTransformer (org.deeplearning4j.models.sequencevectors.transformers.impl.SentenceTransformer) 11
VocabWord (org.deeplearning4j.models.word2vec.VocabWord) 11
Test (org.junit.Test) 11
ClassPathResource (org.datavec.api.util.ClassPathResource) 10
AbstractCache (org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache) 10
BasicLineIterator (org.deeplearning4j.text.sentenceiterator.BasicLineIterator) 10
File (java.io.File) 6
CommonPreprocessor (org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor) 6
DefaultTokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory) 6
TokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory) 6
VocabConstructor (org.deeplearning4j.models.word2vec.wordstore.VocabConstructor) 5
VectorsConfiguration (org.deeplearning4j.models.embeddings.loader.VectorsConfiguration) 4
FileLabelAwareIterator (org.deeplearning4j.text.documentiterator.FileLabelAwareIterator) 2
SentenceIterator (org.deeplearning4j.text.sentenceiterator.SentenceIterator) 2
Ignore (org.junit.Ignore) 2
ArrayList (java.util.ArrayList) 1
Pair (org.deeplearning4j.berkeley.Pair) 1
InMemoryLookupTable (org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable) 1
GloVe (org.deeplearning4j.models.embeddings.learning.impl.elements.GloVe) 1