Example 1 with Word2Vec

use of org.deeplearning4j.models.word2vec.Word2Vec in project deeplearning4j by deeplearning4j.

the class WordVectorSerializer method loadFullModel.

/**
     * This method loads a full w2v model, previously saved with a writeFullModel() call
     *
     * Deprecation note: Please consider using the readWord2VecModel() or loadStaticModel() methods instead
     *
     * @param path - path to the previously stored w2v json model
     * @return - Word2Vec instance
     */
@Deprecated
public static Word2Vec loadFullModel(@NonNull String path) throws FileNotFoundException {
    /*
            // TODO: implementation is in progress
            We need to restore:
                     1. WeightLookupTable, including the syn0 and syn1 matrices
                     2. VocabCache, marked as SPECIAL to avoid accidental word removals
         */
    BasicLineIterator iterator = new BasicLineIterator(new File(path));
    // first 3 lines should be processed separately
    String confJson = iterator.nextSentence();
    log.info("Word2Vec conf. JSON: " + confJson);
    VectorsConfiguration configuration = VectorsConfiguration.fromJson(confJson);
    // skip the expTable line: expTable is reproduced exactly on subsequent runs
    // as long as its size is unchanged, so the line is only consumed to advance the iterator
    iterator.nextSentence();
    // skip the nTable line as well
    iterator.nextSentence();
    if (configuration.getNegative() > 0) {
        // TODO: we should probably parse negTable, but it's not required until vocab
        // changes are introduced; on a predefined vocab it reproduces the exact same
        // nTable (and the same goes for expTable)
    }
    /*
                Since we're restoring the vocab from a previously serialized model, minWordFrequency has already been applied to its vocabulary, so it should NOT be truncated.
                That's why minWordFrequency is set to the configuration value, while SPECIAL is applied to each word to avoid truncation.
         */
    VocabularyHolder holder = new VocabularyHolder.Builder()
                    .minWordFrequency(configuration.getMinWordFrequency())
                    .hugeModelExpected(configuration.isHugeModelExpected())
                    .scavengerActivationThreshold(configuration.getScavengerActivationThreshold())
                    .scavengerRetentionDelay(configuration.getScavengerRetentionDelay())
                    .build();
    AbstractCache<VocabWord> vocabCache = new AbstractCache.Builder<VocabWord>().build();
    while (iterator.hasNext()) {
        String wordJson = iterator.nextSentence();
        VocabularyWord word = VocabularyWord.fromJson(wordJson);
        word.setSpecial(true);
        VocabWord vw = new VocabWord(word.getCount(), word.getWord());
        // the index stored with the serialized Huffman node is authoritative
        vw.setIndex(word.getHuffmanNode().getIdx());
        vw.setCodeLength(word.getHuffmanNode().getLength());
        vw.setPoints(arrayToList(word.getHuffmanNode().getPoint(), word.getHuffmanNode().getLength()));
        vw.setCodes(arrayToList(word.getHuffmanNode().getCode(), word.getHuffmanNode().getLength()));
        vocabCache.addToken(vw);
        vocabCache.addWordToIndex(vw.getIndex(), vw.getLabel());
        vocabCache.putVocabWord(vw.getWord());
    }
    // at this moment the vocab is restored, and it's time to rebuild the Huffman tree;
    // since the word counters are equal, the Huffman tree will be equal too
    //holder.updateHuffmanCodes();
    // we definitely don't need the UNK word in this scenario
    //holder.transferBackToVocabCache(vocabCache, false);
    // now it's time to transfer the syn0/syn1/syn1Neg values
    InMemoryLookupTable lookupTable = (InMemoryLookupTable) new InMemoryLookupTable.Builder()
                    .negative(configuration.getNegative())
                    .useAdaGrad(configuration.isUseAdaGrad())
                    .lr(configuration.getLearningRate())
                    .cache(vocabCache)
                    .vectorLength(configuration.getLayersSize())
                    .build();
    // we create all arrays
    lookupTable.resetWeights(true);
    iterator.reset();
    // we should skip 3 lines from file
    iterator.nextSentence();
    iterator.nextSentence();
    iterator.nextSentence();
    // now, for each word from the serialized model, we'll just transfer the actual values
    while (iterator.hasNext()) {
        String wordJson = iterator.nextSentence();
        VocabularyWord word = VocabularyWord.fromJson(wordJson);
        // syn0 transfer
        INDArray syn0 = lookupTable.getSyn0().getRow(vocabCache.indexOf(word.getWord()));
        syn0.assign(Nd4j.create(word.getSyn0()));
        // syn1 transfer
        // syn1 values are normally accessed via tree points, but since our goal is just deserialization, we can copy them row by row
        INDArray syn1 = lookupTable.getSyn1().getRow(vocabCache.indexOf(word.getWord()));
        syn1.assign(Nd4j.create(word.getSyn1()));
        // syn1Neg transfer
        if (configuration.getNegative() > 0) {
            INDArray syn1Neg = lookupTable.getSyn1Neg().getRow(vocabCache.indexOf(word.getWord()));
            syn1Neg.assign(Nd4j.create(word.getSyn1Neg()));
        }
    }
    Word2Vec vec = new Word2Vec.Builder(configuration)
                    .vocabCache(vocabCache)
                    .lookupTable(lookupTable)
                    .resetModel(false)
                    .build();
    vec.setModelUtils(new BasicModelUtils());
    return vec;
}
Also used : BasicLineIterator(org.deeplearning4j.text.sentenceiterator.BasicLineIterator) VocabWord(org.deeplearning4j.models.word2vec.VocabWord) AbstractCache(org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache) VocabularyHolder(org.deeplearning4j.models.word2vec.wordstore.VocabularyHolder) InMemoryLookupTable(org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable) INDArray(org.nd4j.linalg.api.ndarray.INDArray) BasicModelUtils(org.deeplearning4j.models.embeddings.reader.impl.BasicModelUtils) AtomicInteger(java.util.concurrent.atomic.AtomicInteger) VocabularyWord(org.deeplearning4j.models.word2vec.wordstore.VocabularyWord) StaticWord2Vec(org.deeplearning4j.models.word2vec.StaticWord2Vec) Word2Vec(org.deeplearning4j.models.word2vec.Word2Vec) ZipFile(java.util.zip.ZipFile)
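
A minimal usage sketch for this deprecated loader (hedged: the model path below is a placeholder, not a file shipped with the project). The nearest-neighbour query is served by the BasicModelUtils instance attached above:

// assumption: the file was previously saved with writeFullModel()
Word2Vec vec = WordVectorSerializer.loadFullModel("/path/to/w2v_full_model.txt");
// Collection is java.util.Collection
Collection<String> nearest = vec.wordsNearest("day", 10);
System.out.println("Closest to 'day': " + nearest);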

Example 2 with Word2Vec

use of org.deeplearning4j.models.word2vec.Word2Vec in project deeplearning4j by deeplearning4j.

the class WordVectorSerializer method readBinaryModel.

/**
     * Read a binary word2vec file.
     *
     * @param modelFile
     *            the File to read
     * @param linebreaks
     *            if true, the reader expects each word/vector pair to be on a separate line,
     *            terminated by a line break
     * @param normalize
     *            if true, each vector is normalized to unit length as it is read
     * @return a {@link Word2Vec} model
     * @throws NumberFormatException
     * @throws IOException
     * @throws FileNotFoundException
     */
private static Word2Vec readBinaryModel(File modelFile, boolean linebreaks, boolean normalize) throws NumberFormatException, IOException {
    InMemoryLookupTable<VocabWord> lookupTable;
    VocabCache<VocabWord> cache;
    INDArray syn0;
    int words, size;
    int originalFreq = Nd4j.getMemoryManager().getOccasionalGcFrequency();
    boolean originalPeriodic = Nd4j.getMemoryManager().isPeriodicGcActive();
    if (originalPeriodic)
        Nd4j.getMemoryManager().togglePeriodicGc(false);
    Nd4j.getMemoryManager().setOccasionalGcFrequency(50000);
    try (BufferedInputStream bis = new BufferedInputStream(GzipUtils.isCompressedFilename(modelFile.getName()) ? new GZIPInputStream(new FileInputStream(modelFile)) : new FileInputStream(modelFile));
        DataInputStream dis = new DataInputStream(bis)) {
        words = Integer.parseInt(readString(dis));
        size = Integer.parseInt(readString(dis));
        syn0 = Nd4j.create(words, size);
        cache = new AbstractCache<>();
        printOutProjectedMemoryUse(words, size, 1);
        lookupTable = (InMemoryLookupTable<VocabWord>) new InMemoryLookupTable.Builder<VocabWord>()
                        .cache(cache)
                        .useHierarchicSoftmax(false)
                        .vectorLength(size)
                        .build();
        String word;
        float[] vector = new float[size];
        for (int i = 0; i < words; i++) {
            word = readString(dis);
            log.trace("Loading " + word + " at index " + i);
            for (int j = 0; j < size; j++) {
                vector[j] = readFloat(dis);
            }
            syn0.putRow(i, normalize ? Transforms.unitVec(Nd4j.create(vector)) : Nd4j.create(vector));
            VocabWord vw = new VocabWord(1.0, word);
            vw.setIndex(cache.numWords());
            cache.addToken(vw);
            cache.addWordToIndex(vw.getIndex(), vw.getLabel());
            cache.putVocabWord(word);
            if (linebreaks) {
                // line break
                dis.readByte();
            }
            Nd4j.getMemoryManager().invokeGcOccasionally();
        }
    } finally {
        if (originalPeriodic)
            Nd4j.getMemoryManager().togglePeriodicGc(true);
        Nd4j.getMemoryManager().setOccasionalGcFrequency(originalFreq);
    }
    lookupTable.setSyn0(syn0);
    Word2Vec ret = new Word2Vec.Builder()
                    .useHierarchicSoftmax(false)
                    .resetModel(false)
                    .layerSize(syn0.columns())
                    .allowParallelTokenization(true)
                    .elementsLearningAlgorithm(new SkipGram<VocabWord>())
                    .learningRate(0.025)
                    .windowSize(5)
                    .workers(1)
                    .build();
    ret.setVocab(cache);
    ret.setLookupTable(lookupTable);
    return ret;
}
Also used : VocabWord(org.deeplearning4j.models.word2vec.VocabWord) GZIPInputStream(java.util.zip.GZIPInputStream) INDArray(org.nd4j.linalg.api.ndarray.INDArray) StaticWord2Vec(org.deeplearning4j.models.word2vec.StaticWord2Vec) Word2Vec(org.deeplearning4j.models.word2vec.Word2Vec)
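
Since readBinaryModel() is private, callers typically go through the public readWord2VecModel() entry point, which detects the file format and dispatches accordingly. A hedged sketch; the file name below is an assumption, not a bundled resource:

// assumption: a local copy of a binary (optionally gzipped) word2vec model
Word2Vec googleVec = WordVectorSerializer.readWord2VecModel(new File("GoogleNews-vectors-negative300.bin.gz"));
double sim = googleVec.similarity("king", "queen");
System.out.println("similarity(king, queen) = " + sim);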

Example 3 with Word2Vec

use of org.deeplearning4j.models.word2vec.Word2Vec in project deeplearning4j by deeplearning4j.

the class WordVectorSerializer method readTextModel.

/**
     * Read a text-format word2vec file.
     *
     * @param modelFile the text model file to read
     * @return a {@link Word2Vec} instance built from the file contents
     * @throws FileNotFoundException
     * @throws IOException
     * @throws NumberFormatException
     */
private static Word2Vec readTextModel(File modelFile) throws IOException, NumberFormatException {
    InMemoryLookupTable lookupTable;
    VocabCache cache;
    INDArray syn0;
    Word2Vec ret = new Word2Vec();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(GzipUtils.isCompressedFilename(modelFile.getName()) ? new GZIPInputStream(new FileInputStream(modelFile)) : new FileInputStream(modelFile), "UTF-8"))) {
        String line = reader.readLine();
        String[] initial = line.split(" ");
        int words = Integer.parseInt(initial[0]);
        int layerSize = Integer.parseInt(initial[1]);
        syn0 = Nd4j.create(words, layerSize);
        cache = new InMemoryLookupCache(false);
        int currLine = 0;
        while ((line = reader.readLine()) != null) {
            String[] split = line.split(" ");
            assert split.length == layerSize + 1;
            String word = split[0].replaceAll(whitespaceReplacement, " ");
            float[] vector = new float[split.length - 1];
            for (int i = 1; i < split.length; i++) {
                vector[i - 1] = Float.parseFloat(split[i]);
            }
            syn0.putRow(currLine, Nd4j.create(vector));
            cache.addWordToIndex(cache.numWords(), word);
            cache.addToken(new VocabWord(1, word));
            cache.putVocabWord(word);
            currLine++;
        }
        lookupTable = (InMemoryLookupTable) new InMemoryLookupTable.Builder().cache(cache).vectorLength(layerSize).build();
        lookupTable.setSyn0(syn0);
        ret.setVocab(cache);
        ret.setLookupTable(lookupTable);
    }
    return ret;
}
Also used : VocabWord(org.deeplearning4j.models.word2vec.VocabWord) InMemoryLookupCache(org.deeplearning4j.models.word2vec.wordstore.inmemory.InMemoryLookupCache) GZIPInputStream(java.util.zip.GZIPInputStream) InMemoryLookupTable(org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable) INDArray(org.nd4j.linalg.api.ndarray.INDArray) VocabCache(org.deeplearning4j.models.word2vec.wordstore.VocabCache) StaticWord2Vec(org.deeplearning4j.models.word2vec.StaticWord2Vec) Word2Vec(org.deeplearning4j.models.word2vec.Word2Vec)
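
For reference, the text layout readTextModel() parses is a header line of the form "<wordCount> <layerSize>", followed by one word per line with its whitespace-separated vector. A self-contained sketch that writes such a file (the loader itself is private; the serializer's public read methods, e.g. readWord2VecModel(), are the usual entry points for files like this):

// writes a minimal two-word model in the expected text layout
File tmp = File.createTempFile("w2v", ".txt");
try (PrintWriter pw = new PrintWriter(tmp, "UTF-8")) {   // java.io.PrintWriter
    pw.println("2 3");                 // header: 2 words, layer size 3
    pw.println("day 0.1 0.2 0.3");     // word followed by its vector
    pw.println("night 0.4 0.5 0.6");
}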

Example 4 with Word2Vec

use of org.deeplearning4j.models.word2vec.Word2Vec in project deeplearning4j by deeplearning4j.

the class WordVectorSerializer method readWord2VecFromText.

/**
     * This method allows you to read a Word2Vec model from externally originated vectors and syn1.
     * So, technically, this method is compatible with any other w2v implementation
     *
     * @param vectors   text file with words and their weights, aka syn0
     * @param hs        text file with the HS layers, aka syn1
     * @param h_codes   text file with the Huffman tree codes
     * @param h_points  text file with the Huffman tree points
     * @param configuration the VectorsConfiguration to build the model with
     * @return a restored Word2Vec model
     */
public static Word2Vec readWord2VecFromText(@NonNull File vectors, @NonNull File hs, @NonNull File h_codes, @NonNull File h_points, @NonNull VectorsConfiguration configuration) throws IOException {
    // first we load syn0
    Pair<InMemoryLookupTable, VocabCache> pair = loadTxt(vectors);
    InMemoryLookupTable lookupTable = pair.getFirst();
    lookupTable.setNegative(configuration.getNegative());
    if (configuration.getNegative() > 0)
        lookupTable.initNegative();
    VocabCache<VocabWord> vocab = (VocabCache<VocabWord>) pair.getSecond();
    // now we load syn1
    List<INDArray> rows = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(hs))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] split = line.split(" ");
            double[] array = new double[split.length];
            for (int i = 0; i < split.length; i++) {
                array[i] = Double.parseDouble(split[i]);
            }
            rows.add(Nd4j.create(array));
        }
    }
    // it's possible to have a full model without syn1
    if (rows.size() > 0) {
        INDArray syn1 = Nd4j.vstack(rows);
        lookupTable.setSyn1(syn1);
    }
    // now we transform the mappings into Huffman tree points
    try (BufferedReader reader = new BufferedReader(new FileReader(h_points))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] split = line.split(" ");
            VocabWord word = vocab.wordFor(decodeB64(split[0]));
            List<Integer> points = new ArrayList<>();
            for (int i = 1; i < split.length; i++) {
                points.add(Integer.parseInt(split[i]));
            }
            word.setPoints(points);
        }
    }
    // now we transform the mappings into Huffman tree codes
    try (BufferedReader reader = new BufferedReader(new FileReader(h_codes))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] split = line.split(" ");
            VocabWord word = vocab.wordFor(decodeB64(split[0]));
            List<Byte> codes = new ArrayList<>();
            for (int i = 1; i < split.length; i++) {
                codes.add(Byte.parseByte(split[i]));
            }
            word.setCodes(codes);
            word.setCodeLength((short) codes.size());
        }
    }
    Word2Vec.Builder builder = new Word2Vec.Builder(configuration).vocabCache(vocab).lookupTable(lookupTable).resetModel(false);
    TokenizerFactory factory = getTokenizerFactory(configuration);
    if (factory != null)
        builder.tokenizerFactory(factory);
    Word2Vec w2v = builder.build();
    return w2v;
}
Also used : TokenizerFactory(org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory) ArrayList(java.util.ArrayList) VocabWord(org.deeplearning4j.models.word2vec.VocabWord) AtomicInteger(java.util.concurrent.atomic.AtomicInteger) InMemoryLookupTable(org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable) INDArray(org.nd4j.linalg.api.ndarray.INDArray) VocabCache(org.deeplearning4j.models.word2vec.wordstore.VocabCache) StaticWord2Vec(org.deeplearning4j.models.word2vec.StaticWord2Vec) Word2Vec(org.deeplearning4j.models.word2vec.Word2Vec)
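
A hedged usage sketch: the four input files are assumed exports from an external word2vec implementation (the names below are placeholders), passed in the same order as the parameters documented above:

VectorsConfiguration conf = new VectorsConfiguration();   // defaults; adjust to match the export
Word2Vec w2v = WordVectorSerializer.readWord2VecFromText(
                new File("syn0.txt"),       // words and their weights
                new File("syn1.txt"),       // HS layers
                new File("codes.txt"),      // Huffman tree codes
                new File("huffman.txt"),    // Huffman tree points
                conf);
System.out.println(w2v.wordsNearest("day", 5));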

Example 5 with Word2Vec

use of org.deeplearning4j.models.word2vec.Word2Vec in project deeplearning4j by deeplearning4j.

the class WordVectorSerializer method readWord2Vec.

/**
     * This method restores a Word2Vec model previously saved with writeWord2VecModel
     *
     * PLEASE NOTE: This method loads the FULL model, so don't use it if you're only going to work with the weights.
     *
     * @param file the model file previously written by writeWord2VecModel
     * @return the restored Word2Vec model
     * @throws IOException
     */
@Deprecated
public static Word2Vec readWord2Vec(File file) throws IOException {
    File tmpFileSyn0 = File.createTempFile("word2vec", "0");
    File tmpFileSyn1 = File.createTempFile("word2vec", "1");
    File tmpFileC = File.createTempFile("word2vec", "c");
    File tmpFileH = File.createTempFile("word2vec", "h");
    File tmpFileF = File.createTempFile("word2vec", "f");
    tmpFileSyn0.deleteOnExit();
    tmpFileSyn1.deleteOnExit();
    tmpFileH.deleteOnExit();
    tmpFileC.deleteOnExit();
    tmpFileF.deleteOnExit();
    int originalFreq = Nd4j.getMemoryManager().getOccasionalGcFrequency();
    boolean originalPeriodic = Nd4j.getMemoryManager().isPeriodicGcActive();
    if (originalPeriodic)
        Nd4j.getMemoryManager().togglePeriodicGc(false);
    Nd4j.getMemoryManager().setOccasionalGcFrequency(50000);
    try (ZipFile zipFile = new ZipFile(file)) {
        ZipEntry syn0 = zipFile.getEntry("syn0.txt");
        InputStream stream = zipFile.getInputStream(syn0);
        Files.copy(stream, Paths.get(tmpFileSyn0.getAbsolutePath()), StandardCopyOption.REPLACE_EXISTING);
        ZipEntry syn1 = zipFile.getEntry("syn1.txt");
        stream = zipFile.getInputStream(syn1);
        Files.copy(stream, Paths.get(tmpFileSyn1.getAbsolutePath()), StandardCopyOption.REPLACE_EXISTING);
        ZipEntry codes = zipFile.getEntry("codes.txt");
        stream = zipFile.getInputStream(codes);
        Files.copy(stream, Paths.get(tmpFileC.getAbsolutePath()), StandardCopyOption.REPLACE_EXISTING);
        ZipEntry huffman = zipFile.getEntry("huffman.txt");
        stream = zipFile.getInputStream(huffman);
        Files.copy(stream, Paths.get(tmpFileH.getAbsolutePath()), StandardCopyOption.REPLACE_EXISTING);
        ZipEntry config = zipFile.getEntry("config.json");
        stream = zipFile.getInputStream(config);
        StringBuilder builder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(stream))) {
            String line;
            while ((line = reader.readLine()) != null) {
                builder.append(line);
            }
        }
        VectorsConfiguration configuration = VectorsConfiguration.fromJson(builder.toString().trim());
        // we read first 4 files as w2v model
        Word2Vec w2v = readWord2VecFromText(tmpFileSyn0, tmpFileSyn1, tmpFileC, tmpFileH, configuration);
        // we read word frequencies from frequencies.txt; note that this file may be absent
        ZipEntry frequencies = zipFile.getEntry("frequencies.txt");
        if (frequencies != null) {
            stream = zipFile.getInputStream(frequencies);
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(stream))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] split = line.split(" ");
                    VocabWord word = w2v.getVocab().tokenFor(decodeB64(split[0]));
                    word.setElementFrequency((long) Double.parseDouble(split[1]));
                    word.setSequencesCount((long) Double.parseDouble(split[2]));
                }
            }
        }
        ZipEntry zsyn1Neg = zipFile.getEntry("syn1Neg.txt");
        if (zsyn1Neg != null) {
            stream = zipFile.getInputStream(zsyn1Neg);
            try (InputStreamReader isr = new InputStreamReader(stream);
                BufferedReader reader = new BufferedReader(isr)) {
                String line = null;
                List<INDArray> rows = new ArrayList<>();
                while ((line = reader.readLine()) != null) {
                    String[] split = line.split(" ");
                    double[] array = new double[split.length];
                    for (int i = 0; i < split.length; i++) {
                        array[i] = Double.parseDouble(split[i]);
                    }
                    rows.add(Nd4j.create(array));
                }
                // it's possible to have full model without syn1Neg
                if (rows.size() > 0) {
                    INDArray syn1Neg = Nd4j.vstack(rows);
                    ((InMemoryLookupTable) w2v.getLookupTable()).setSyn1Neg(syn1Neg);
                }
            }
        }
        return w2v;
    } finally {
        if (originalPeriodic)
            Nd4j.getMemoryManager().togglePeriodicGc(true);
        Nd4j.getMemoryManager().setOccasionalGcFrequency(originalFreq);
    }
}
Also used : GZIPInputStream(java.util.zip.GZIPInputStream) ZipEntry(java.util.zip.ZipEntry) ArrayList(java.util.ArrayList) VocabWord(org.deeplearning4j.models.word2vec.VocabWord) InMemoryLookupTable(org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable) ZipFile(java.util.zip.ZipFile) INDArray(org.nd4j.linalg.api.ndarray.INDArray) StaticWord2Vec(org.deeplearning4j.models.word2vec.StaticWord2Vec) Word2Vec(org.deeplearning4j.models.word2vec.Word2Vec)
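
A hedged roundtrip sketch: writeWord2VecModel() produces the zip archive (syn0.txt, syn1.txt, codes.txt, huffman.txt, config.json, plus the optional frequencies.txt and syn1Neg.txt) that this reader consumes. Here vec is assumed to be an already trained Word2Vec instance and "model.zip" is a placeholder path:

// save the full model, then restore it through the deprecated reader above
WordVectorSerializer.writeWord2VecModel(vec, new File("model.zip"));
Word2Vec restored = WordVectorSerializer.readWord2Vec(new File("model.zip"));
System.out.println(restored.similarity("day", "night"));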

Aggregations

Word2Vec (org.deeplearning4j.models.word2vec.Word2Vec): 19
INDArray (org.nd4j.linalg.api.ndarray.INDArray): 13
TokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory): 12
VocabWord (org.deeplearning4j.models.word2vec.VocabWord): 11
Test (org.junit.Test): 11
SentenceIterator (org.deeplearning4j.text.sentenceiterator.SentenceIterator): 10
DefaultTokenizerFactory (org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory): 10
CommonPreprocessor (org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor): 9
File (java.io.File): 8
ClassPathResource (org.datavec.api.util.ClassPathResource): 8
StaticWord2Vec (org.deeplearning4j.models.word2vec.StaticWord2Vec): 8
ArrayList (java.util.ArrayList): 7
BasicLineIterator (org.deeplearning4j.text.sentenceiterator.BasicLineIterator): 7
InMemoryLookupTable (org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable): 6
GZIPInputStream (java.util.zip.GZIPInputStream): 5
UimaSentenceIterator (org.deeplearning4j.text.sentenceiterator.UimaSentenceIterator): 5
ZipFile (java.util.zip.ZipFile): 4
BasicModelUtils (org.deeplearning4j.models.embeddings.reader.impl.BasicModelUtils): 4
AtomicInteger (java.util.concurrent.atomic.AtomicInteger): 3
ZipEntry (java.util.zip.ZipEntry): 3