Example 1 with MultiLevelMphf

Use of zemberek.core.hash.MultiLevelMphf in the zemberek-nlp project by ahmetaa.

Class CompressedCharNgramModel, method load.

public static CompressedCharNgramModel load(InputStream is) throws IOException {
    // try-with-resources closes the wrapping stream, and with it `is`, on exit.
    try (DataInputStream dis = new DataInputStream(new BufferedInputStream(is))) {
        // Header: n-gram order, then the model id.
        int order = dis.readInt();
        String modelId = dis.readUTF();
        // Arrays are indexed by n-gram order; index 0 is left unused.
        MultiLevelMphf[] mphfs = new MultiLevelMphf[order + 1];
        ProbData[] probDatas = new ProbData[order + 1];
        DoubleLookup[] lookups = new DoubleLookup[order + 1];
        // Each order is read in this sequence: lookup table, probability data, hash function.
        for (int i = 1; i <= order; i++) {
            lookups[i] = DoubleLookup.getLookup(dis);
            probDatas[i] = new ProbData(dis);
            mphfs[i] = MultiLevelMphf.deserialize(dis);
        }
        return new CompressedCharNgramModel(order, modelId, mphfs, probDatas, lookups);
    }
}
Also used : BufferedInputStream(java.io.BufferedInputStream) DoubleLookup(zemberek.core.quantization.DoubleLookup) DataInputStream(java.io.DataInputStream) MultiLevelMphf(zemberek.core.hash.MultiLevelMphf)
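
The load method above only works if the fields are read in exactly the order they were written. A minimal sketch of that round trip using only java.io follows. The class OrderedStreamSketch, its Model holder and the single int payload per order are hypothetical stand-ins for illustration, not part of zemberek-nlp.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration; not part of zemberek-nlp.
final class OrderedStreamSketch {

    // Simple holder for what was read back.
    static final class Model {
        int order;
        String id;
        int[] perOrder; // index 0 unused, as in the example above
    }

    // Writes: order (int), id (UTF), then one int payload for each order 1..order.
    static byte[] write(String id, int[] perOrder) {
        int order = perOrder.length - 1;
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bytes)) {
            dos.writeInt(order);
            dos.writeUTF(id);
            for (int i = 1; i <= order; i++) {
                dos.writeInt(perOrder[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in the same sequence they were written.
    static Model read(byte[] data) {
        try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(data))) {
            Model m = new Model();
            m.order = dis.readInt();
            m.id = dis.readUTF();
            m.perOrder = new int[m.order + 1];
            for (int i = 1; i <= m.order; i++) {
                m.perOrder[i] = dis.readInt();
            }
            return m;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Model m = read(write("demo", new int[] {0, 10, 20, 30}));
        System.out.println(m.id + " order=" + m.order);
    }
}
```

Swapping any two reads (for example reading the id before the order) would misinterpret the bytes, which is why the reader and writer have to agree on the layout.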

Example 2 with MultiLevelMphf

Use of zemberek.core.hash.MultiLevelMphf in the zemberek-nlp project by ahmetaa.

Class CompressedCharNgramModel, method compress.

public static void compress(MapBasedCharNgramLanguageModel model, File output) throws IOException {
    Mphf[] mphfs = new MultiLevelMphf[model.getOrder() + 1];
    DoubleLookup[] lookups = new DoubleLookup[model.getOrder() + 1];
    try (DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(output)))) {
        dos.writeInt(model.getOrder());
        dos.writeUTF(model.getId());
        for (int i = 1; i <= model.getOrder(); i++) {
            // Gather the log-probability values of order i through a histogram.
            Histogram<Double> histogram = new Histogram<>();
            histogram.add(model.gramLogProbs[i].values.values());
            double[] lookup = new double[histogram.size()];
            int j = 0;
            for (Double key : histogram) {
                lookup[j] = key;
                j++;
            }
            // Quantize those values with linear binning; each index is written as one byte below.
            Quantizer quantizer = BinningQuantizer.linearBinning(lookup, 8);
            lookups[i] = quantizer.getDequantizer();
            List<String> keys = Lists.newArrayList(model.gramLogProbs[i].values.keySet());
            int[] fingerprints = new int[keys.size()];
            int[] probabilityIndexes = new int[keys.size()];
            // Build a minimal perfect hash over the n-grams, then fill the slot of each key.
            mphfs[i] = MultiLevelMphf.generate(new StringListKeyProvider(keys));
            for (final String key : keys) {
                final int index = mphfs[i].get(key);
                // The fingerprint comes from a separate hash call (second argument -1), then masked.
                fingerprints[index] = MultiLevelMphf.hash(key, -1) & FINGER_PRINT_MASK;
                probabilityIndexes[index] = quantizer.getQuantizationIndex(model.gramLogProbs[i].values.get(key));
            }
            // Per-order layout: dequantizer table, key count, (fingerprint, index) pairs, hash function.
            lookups[i].save(dos);
            dos.writeInt(keys.size());
            for (int k = 0; k < keys.size(); k++) {
                dos.writeShort(fingerprints[k] & 0xffff);
                dos.writeByte(probabilityIndexes[k]);
            }
            mphfs[i].serialize(dos);
        }
    }
}
Also used : Histogram(zemberek.core.collections.Histogram) MultiLevelMphf(zemberek.core.hash.MultiLevelMphf) Mphf(zemberek.core.hash.Mphf) DataOutputStream(java.io.DataOutputStream) FileOutputStream(java.io.FileOutputStream) Quantizer(zemberek.core.quantization.Quantizer) BinningQuantizer(zemberek.core.quantization.BinningQuantizer) DoubleLookup(zemberek.core.quantization.DoubleLookup) BufferedOutputStream(java.io.BufferedOutputStream)
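
The compress method above stores one byte per probability by quantizing log-probabilities into bins. The sketch below shows one way an 8-bit linear binning could work: equal-width bins between the minimum and maximum value, with the bin midpoint as the dequantized value. This is an assumption for illustration only; LinearBinningSketch is a hypothetical class, and Zemberek's BinningQuantizer may choose bins and representative values differently.

```java
import java.util.Arrays;

// Hypothetical illustration; not Zemberek's BinningQuantizer.
final class LinearBinningSketch {
    private final double min;
    private final double step;
    private final double[] table; // bin index -> representative value

    LinearBinningSketch(double[] values, int bits) {
        int bins = 1 << bits;
        double lo = Arrays.stream(values).min().orElse(0.0);
        double hi = Arrays.stream(values).max().orElse(0.0);
        this.min = lo;
        // Fall back to a step of 1.0 when all values are equal, to avoid dividing by zero.
        this.step = hi > lo ? (hi - lo) / bins : 1.0;
        this.table = new double[bins];
        for (int i = 0; i < bins; i++) {
            table[i] = lo + (i + 0.5) * step; // midpoint of bin i
        }
    }

    // Maps a value to its bin index, clamped to the valid range.
    int quantize(double value) {
        int index = (int) ((value - min) / step);
        return Math.max(0, Math.min(table.length - 1, index));
    }

    // Maps a bin index back to the representative value of that bin.
    double dequantize(int index) {
        return table[index];
    }

    public static void main(String[] args) {
        double[] logProbs = {-10.0, -5.0, 0.0};
        LinearBinningSketch q = new LinearBinningSketch(logProbs, 8);
        for (double v : logProbs) {
            int index = q.quantize(v);
            System.out.println(v + " -> " + index + " -> " + q.dequantize(index));
        }
    }
}
```

With 8 bits every index fits in the range 0..255, so it can be written with a single writeByte call, at the cost of a reconstruction error of at most half a bin width.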

Example 3 with MultiLevelMphf

Use of zemberek.core.hash.MultiLevelMphf in the zemberek-nlp project by ahmetaa.

Class LossyIntLookup, method generate.

/**
 * Generates a LossyIntLookup from a String->Float lookup
 */
public static LossyIntLookup generate(FloatValueMap<String> lookup) {
    List<String> keyList = lookup.getKeyList();
    StringHashKeyProvider provider = new StringHashKeyProvider(keyList);
    MultiLevelMphf mphf = MultiLevelMphf.generate(provider);
    int[] data = new int[keyList.size() * 2];
    for (String s : keyList) {
        int index = mphf.get(s);
        // fingerprint
        data[index * 2] = getFingerprint(s);
        // data in int form
        data[index * 2 + 1] = Float.floatToIntBits(lookup.get(s));
    }
    return new LossyIntLookup(mphf, data);
}
Also used : StringHashKeyProvider(zemberek.core.hash.StringHashKeyProvider) MultiLevelMphf(zemberek.core.hash.MultiLevelMphf)
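
The generate method above stores a fingerprint next to each value because a minimal perfect hash does not keep the keys: an unknown key still maps to some slot. A minimal sketch of the lookup side follows. LossyLookupSketch, its brute-force seed search (a toy stand-in for MultiLevelMphf that is only practical for small key sets), the hash function and the 16-bit fingerprint width are all assumptions for illustration, not Zemberek's implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of zemberek-nlp.
final class LossyLookupSketch {
    private static final int FINGERPRINT_MASK = 0xffff;
    private final int seed;
    private final int slots;
    private final int[] data; // [fingerprint, float bits] for each slot

    LossyLookupSketch(List<String> keys, float[] values) {
        if (keys.isEmpty() || keys.size() != values.length) {
            throw new IllegalArgumentException("need one value per key and at least one key");
        }
        this.slots = keys.size();
        this.seed = findSeed(keys);
        this.data = new int[slots * 2];
        for (int i = 0; i < slots; i++) {
            int slot = slot(keys.get(i), seed, slots);
            data[slot * 2] = fingerprint(keys.get(i));
            data[slot * 2 + 1] = Float.floatToIntBits(values[i]);
        }
    }

    // Returns the stored value, or defaultValue when the fingerprint does not match.
    // An unknown key whose fingerprint happens to match yields a wrong value (hence "lossy").
    float get(String key, float defaultValue) {
        int slot = slot(key, seed, slots);
        if (data[slot * 2] != fingerprint(key)) {
            return defaultValue;
        }
        return Float.intBitsToFloat(data[slot * 2 + 1]);
    }

    // Seeded FNV-style string hash followed by a 32-bit avalanche step.
    static int hash(String s, int seed) {
        int h = seed;
        for (int i = 0; i < s.length(); i++) {
            h = (h ^ s.charAt(i)) * 0x01000193;
        }
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // The fingerprint uses a different seed than the slot hash.
    static int fingerprint(String key) {
        return hash(key, -1) & FINGERPRINT_MASK;
    }

    static int slot(String key, int seed, int slots) {
        return Math.floorMod(hash(key, seed), slots);
    }

    // Toy minimal perfect hash: try seeds until every key lands in its own slot.
    static int findSeed(List<String> keys) {
        int n = keys.size();
        for (int candidate = 0; candidate < 1_000_000; candidate++) {
            boolean[] used = new boolean[n];
            boolean ok = true;
            for (String key : keys) {
                int slot = slot(key, candidate, n);
                if (used[slot]) {
                    ok = false;
                    break;
                }
                used[slot] = true;
            }
            if (ok) {
                return candidate;
            }
        }
        throw new IllegalStateException("no collision-free seed found (duplicate keys?)");
    }

    public static void main(String[] args) {
        LossyLookupSketch lookup = new LossyLookupSketch(
                Arrays.asList("ab", "bc", "cd", "de"), new float[] {0.5f, 1.5f, 2.5f, 3.5f});
        System.out.println(lookup.get("bc", -1f));
    }
}
```

Known keys always return their exact value. An unknown key returns the default unless its fingerprint collides with the one stored in its slot, which for a 16-bit fingerprint happens with probability of roughly 1 in 65536.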

Aggregations

MultiLevelMphf (zemberek.core.hash.MultiLevelMphf): 3
DoubleLookup (zemberek.core.quantization.DoubleLookup): 2
BufferedInputStream (java.io.BufferedInputStream): 1
BufferedOutputStream (java.io.BufferedOutputStream): 1
DataInputStream (java.io.DataInputStream): 1
DataOutputStream (java.io.DataOutputStream): 1
FileOutputStream (java.io.FileOutputStream): 1
Histogram (zemberek.core.collections.Histogram): 1
Mphf (zemberek.core.hash.Mphf): 1
StringHashKeyProvider (zemberek.core.hash.StringHashKeyProvider): 1
BinningQuantizer (zemberek.core.quantization.BinningQuantizer): 1
Quantizer (zemberek.core.quantization.Quantizer): 1