Example 11 with RAMOutputStream

Use of org.apache.lucene.store.RAMOutputStream in project lucene-solr by apache.

The class Lucene54DocValuesConsumer, method addTermsDict.

/** expert: writes a value dictionary for a sorted/sortedset field */
private void addTermsDict(FieldInfo field, final Iterable<BytesRef> values) throws IOException {
    // first check if it's a "fixed-length" terms dict, and compressibility if so
    int minLength = Integer.MAX_VALUE;
    int maxLength = Integer.MIN_VALUE;
    long numValues = 0;
    BytesRefBuilder previousValue = new BytesRefBuilder();
    // only valid for fixed-width data, as we have a choice there
    long prefixSum = 0;
    for (BytesRef v : values) {
        minLength = Math.min(minLength, v.length);
        maxLength = Math.max(maxLength, v.length);
        if (minLength == maxLength) {
            int termPosition = (int) (numValues & INTERVAL_MASK);
            if (termPosition == 0) {
                // first term in block, save it away to compare against the last term later
                previousValue.copyBytes(v);
            } else if (termPosition == INTERVAL_COUNT - 1) {
                // last term in block, accumulate shared prefix against first term
                prefixSum += StringHelper.bytesDifference(previousValue.get(), v);
            }
        }
        numValues++;
    }
    // prefix compression costs up to 2 bytes per term (to store suffix lengths),
    // so if we share at least 3 bytes on average, always compress.
    if (minLength == maxLength && prefixSum <= 3 * (numValues >> INTERVAL_SHIFT)) {
        // no index needed: not very compressible, direct addressing by mult
        addBinaryField(field, values);
    } else if (numValues < REVERSE_INTERVAL_COUNT) {
        // low cardinality: waste a few KB of ram, but can't really use fancy index etc
        addBinaryField(field, values);
    } else {
        // we don't have to handle the empty case
        assert numValues > 0;
        // header
        meta.writeVInt(field.number);
        meta.writeByte(Lucene54DocValuesFormat.BINARY);
        meta.writeVInt(BINARY_PREFIX_COMPRESSED);
        meta.writeLong(-1L);
        // now write the bytes: sharing prefixes within a block
        final long startFP = data.getFilePointer();
        // currently, we have to store the delta from expected for every 1/nth term
        // we could avoid this, but it's not much and less overall RAM than the previous approach!
        RAMOutputStream addressBuffer = new RAMOutputStream();
        MonotonicBlockPackedWriter termAddresses = new MonotonicBlockPackedWriter(addressBuffer, MONOTONIC_BLOCK_SIZE);
        // buffers up 16 terms
        RAMOutputStream bytesBuffer = new RAMOutputStream();
        // buffers up block header
        RAMOutputStream headerBuffer = new RAMOutputStream();
        BytesRefBuilder lastTerm = new BytesRefBuilder();
        lastTerm.grow(maxLength);
        long count = 0;
        int[] suffixDeltas = new int[INTERVAL_COUNT];
        for (BytesRef v : values) {
            int termPosition = (int) (count & INTERVAL_MASK);
            if (termPosition == 0) {
                termAddresses.add(data.getFilePointer() - startFP);
                // abs-encode first term
                headerBuffer.writeVInt(v.length);
                headerBuffer.writeBytes(v.bytes, v.offset, v.length);
                lastTerm.copyBytes(v);
            } else {
                // prefix-code: we only share at most 255 characters, to encode the length as a single
                // byte and have random access. Larger terms just get less compression.
                int sharedPrefix = Math.min(255, StringHelper.bytesDifference(lastTerm.get(), v));
                bytesBuffer.writeByte((byte) sharedPrefix);
                bytesBuffer.writeBytes(v.bytes, v.offset + sharedPrefix, v.length - sharedPrefix);
                // we can encode one smaller, because terms are unique.
                suffixDeltas[termPosition] = v.length - sharedPrefix - 1;
            }
            count++;
            // flush block
            if ((count & INTERVAL_MASK) == 0) {
                flushTermsDictBlock(headerBuffer, bytesBuffer, suffixDeltas);
            }
        }
        // flush the final, partially filled block
        int leftover = (int) (count & INTERVAL_MASK);
        if (leftover > 0) {
            Arrays.fill(suffixDeltas, leftover, suffixDeltas.length, 0);
            flushTermsDictBlock(headerBuffer, bytesBuffer, suffixDeltas);
        }
        final long indexStartFP = data.getFilePointer();
        // write addresses of indexed terms
        termAddresses.finish();
        addressBuffer.writeTo(data);
        addressBuffer = null;
        termAddresses = null;
        meta.writeVInt(minLength);
        meta.writeVInt(maxLength);
        meta.writeVLong(count);
        meta.writeLong(startFP);
        meta.writeLong(indexStartFP);
        meta.writeVInt(PackedInts.VERSION_CURRENT);
        meta.writeVInt(MONOTONIC_BLOCK_SIZE);
        addReverseTermIndex(field, values, maxLength);
    }
}
Also used: MonotonicBlockPackedWriter (org.apache.lucene.util.packed.MonotonicBlockPackedWriter), BytesRefBuilder (org.apache.lucene.util.BytesRefBuilder), RAMOutputStream (org.apache.lucene.store.RAMOutputStream), BytesRef (org.apache.lucene.util.BytesRef)
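
The decision at the top of addTermsDict samples one shared prefix per 16-term block (the first term against the last term of each block) and only prefix-compresses when more than 3 bytes are shared on average. Below is a minimal standalone sketch of that heuristic in plain Java; the INTERVAL_* constants and the sharedPrefixLength helper are assumptions standing in for Lucene54DocValuesFormat's constants and StringHelper.bytesDifference, inferred from the "buffers up 16 terms" comment above:

import java.util.List;

class PrefixHeuristicSketch {
    static final int INTERVAL_SHIFT = 4; // assumed: 16 terms per block, per the source comment
    static final int INTERVAL_COUNT = 1 << INTERVAL_SHIFT;
    static final int INTERVAL_MASK = INTERVAL_COUNT - 1;

    // Length of the common prefix of two byte arrays (stand-in for StringHelper.bytesDifference).
    static int sharedPrefixLength(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    // True when prefix compression pays off: more than 3 shared bytes per sampled block on average.
    static boolean worthCompressing(List<byte[]> sortedFixedWidthTerms) {
        long prefixSum = 0;
        long numValues = 0;
        byte[] blockFirst = null;
        for (byte[] term : sortedFixedWidthTerms) {
            int termPosition = (int) (numValues & INTERVAL_MASK);
            if (termPosition == 0) {
                blockFirst = term; // first term in block: remember it
            } else if (termPosition == INTERVAL_COUNT - 1) {
                prefixSum += sharedPrefixLength(blockFirst, term); // last term in block: sample prefix
            }
            numValues++;
        }
        return prefixSum > 3 * (numValues >> INTERVAL_SHIFT);
    }
}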

Example 12 with RAMOutputStream

Use of org.apache.lucene.store.RAMOutputStream in project lucene-solr by apache.

The class BKDWriter, method packIndex.

/** Packs the two arrays, representing a balanced binary tree, into a compact byte[] structure. */
private byte[] packIndex(long[] leafBlockFPs, byte[] splitPackedValues) throws IOException {
    int numLeaves = leafBlockFPs.length;
    // If the index is not a fully balanced binary tree, the leaf nodes may straddle the two bottom
    // levels of the binary tree; rotate the leaf FPs so in-order traversal matches on-disk order:
    if (numDims == 1 && numLeaves > 1) {
        int levelCount = 2;
        while (true) {
            if (numLeaves >= levelCount && numLeaves <= 2 * levelCount) {
                int lastLevel = 2 * (numLeaves - levelCount);
                assert lastLevel >= 0;
                if (lastLevel != 0) {
                    // Last level is partially filled, so we must rotate the leaf FPs to match.  We do this here, after loading
                    // at read-time, so that we can still delta code them on disk at write:
                    long[] newLeafBlockFPs = new long[numLeaves];
                    System.arraycopy(leafBlockFPs, lastLevel, newLeafBlockFPs, 0, leafBlockFPs.length - lastLevel);
                    System.arraycopy(leafBlockFPs, 0, newLeafBlockFPs, leafBlockFPs.length - lastLevel, lastLevel);
                    leafBlockFPs = newLeafBlockFPs;
                }
                break;
            }
            levelCount *= 2;
        }
    }
    // Reused while packing the index:
    RAMOutputStream writeBuffer = new RAMOutputStream();
    // This is the "file" we append the byte[] to:
    List<byte[]> blocks = new ArrayList<>();
    byte[] lastSplitValues = new byte[bytesPerDim * numDims];
    int totalSize = recursePackIndex(writeBuffer, leafBlockFPs, splitPackedValues, 0L, blocks, 1, lastSplitValues, new boolean[numDims], false);
    // Compact the byte[] blocks into single byte index:
    byte[] index = new byte[totalSize];
    int upto = 0;
    for (byte[] block : blocks) {
        System.arraycopy(block, 0, index, upto, block.length);
        upto += block.length;
    }
    assert upto == totalSize;
    return index;
}
Also used: RAMOutputStream (org.apache.lucene.store.RAMOutputStream), ArrayList (java.util.ArrayList)
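
The rotation branch only fires for a one-dimensional tree whose last level is partially filled: the leaf file pointers are rotated left by lastLevel slots so that in-order traversal matches the on-disk order. A minimal sketch with the rotation factored out and a worked example in main:

import java.util.Arrays;

class LeafRotationSketch {
    // Rotate leafBlockFPs left by lastLevel slots, as packIndex does above
    // when the last level of the binary tree is only partially filled.
    static long[] rotate(long[] leafBlockFPs, int lastLevel) {
        long[] rotated = new long[leafBlockFPs.length];
        System.arraycopy(leafBlockFPs, lastLevel, rotated, 0, leafBlockFPs.length - lastLevel);
        System.arraycopy(leafBlockFPs, 0, rotated, leafBlockFPs.length - lastLevel, lastLevel);
        return rotated;
    }

    public static void main(String[] args) {
        // 6 leaves: levelCount = 4 (4 <= 6 <= 8), so lastLevel = 2 * (6 - 4) = 4.
        long[] fps = { 10, 20, 30, 40, 50, 60 };
        System.out.println(Arrays.toString(rotate(fps, 4))); // [50, 60, 10, 20, 30, 40]
    }
}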

Example 13 with RAMOutputStream

Use of org.apache.lucene.store.RAMOutputStream in project lucene-solr by apache.

The class Lucene70DocValuesConsumer, method addTermsDict.

private void addTermsDict(SortedSetDocValues values) throws IOException {
    final long size = values.getValueCount();
    meta.writeVLong(size);
    meta.writeInt(Lucene70DocValuesFormat.TERMS_DICT_BLOCK_SHIFT);
    RAMOutputStream addressBuffer = new RAMOutputStream();
    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
    long numBlocks = (size + Lucene70DocValuesFormat.TERMS_DICT_BLOCK_MASK) >>> Lucene70DocValuesFormat.TERMS_DICT_BLOCK_SHIFT;
    DirectMonotonicWriter writer = DirectMonotonicWriter.getInstance(meta, addressBuffer, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
    BytesRefBuilder previous = new BytesRefBuilder();
    long ord = 0;
    long start = data.getFilePointer();
    int maxLength = 0;
    TermsEnum iterator = values.termsEnum();
    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
        if ((ord & Lucene70DocValuesFormat.TERMS_DICT_BLOCK_MASK) == 0) {
            writer.add(data.getFilePointer() - start);
            data.writeVInt(term.length);
            data.writeBytes(term.bytes, term.offset, term.length);
        } else {
            final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
            final int suffixLength = term.length - prefixLength;
            // terms are unique
            assert suffixLength > 0;
            data.writeByte((byte) (Math.min(prefixLength, 15) | (Math.min(15, suffixLength - 1) << 4)));
            if (prefixLength >= 15) {
                data.writeVInt(prefixLength - 15);
            }
            if (suffixLength >= 16) {
                data.writeVInt(suffixLength - 16);
            }
            data.writeBytes(term.bytes, term.offset + prefixLength, term.length - prefixLength);
        }
        maxLength = Math.max(maxLength, term.length);
        previous.copyBytes(term);
        ++ord;
    }
    writer.finish();
    meta.writeInt(maxLength);
    meta.writeLong(start);
    meta.writeLong(data.getFilePointer() - start);
    start = data.getFilePointer();
    addressBuffer.writeTo(data);
    meta.writeLong(start);
    meta.writeLong(data.getFilePointer() - start);
    // Now write the reverse terms index
    writeTermsIndex(values);
}
Also used: BytesRefBuilder (org.apache.lucene.util.BytesRefBuilder), RAMOutputStream (org.apache.lucene.store.RAMOutputStream), DirectMonotonicWriter (org.apache.lucene.util.packed.DirectMonotonicWriter), BytesRef (org.apache.lucene.util.BytesRef), TermsEnum (org.apache.lucene.index.TermsEnum)
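
Each non-block-start term above is stored behind a single header byte: the low nibble holds min(prefixLength, 15) and the high nibble holds min(suffixLength - 1, 15), with VInt continuations when either field saturates. The sketch below round-trips that header in plain Java; the decode side is a reconstruction inferred from the encoder, not the actual producer code:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

class PrefixSuffixHeaderSketch {
    // Lucene-style VInt: 7 payload bits per byte, high bit = continuation.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    // Encode one prefix/suffix length pair the way addTermsDict does.
    static void encode(ByteArrayOutputStream out, int prefixLength, int suffixLength) {
        out.write(Math.min(prefixLength, 15) | (Math.min(15, suffixLength - 1) << 4));
        if (prefixLength >= 15) writeVInt(out, prefixLength - 15);
        if (suffixLength >= 16) writeVInt(out, suffixLength - 16);
    }

    // Decode it back: a saturated low nibble (15) means "15 plus a VInt";
    // a saturated high nibble means the suffix is "16 plus a VInt".
    static int[] decode(ByteArrayInputStream in) {
        int token = in.read();
        int prefixLength = token & 0x0F;
        int suffixLength = 1 + (token >>> 4);
        if (prefixLength == 15) prefixLength += readVInt(in);
        if (suffixLength == 16) suffixLength += readVInt(in);
        return new int[] { prefixLength, suffixLength };
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encode(out, 20, 3); // prefix saturates the nibble, suffix does not
        int[] decoded = decode(new ByteArrayInputStream(out.toByteArray()));
        System.out.println(decoded[0] + " " + decoded[1]); // 20 3
    }
}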

Example 14 with RAMOutputStream

Use of org.apache.lucene.store.RAMOutputStream in project lucene-solr by apache.

The class Lucene70DocValuesConsumer, method writeTermsIndex.

private void writeTermsIndex(SortedSetDocValues values) throws IOException {
    final long size = values.getValueCount();
    meta.writeInt(Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_SHIFT);
    long start = data.getFilePointer();
    long numBlocks = 1L + ((size + Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_MASK) >>> Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_SHIFT);
    RAMOutputStream addressBuffer = new RAMOutputStream();
    DirectMonotonicWriter writer = DirectMonotonicWriter.getInstance(meta, addressBuffer, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
    TermsEnum iterator = values.termsEnum();
    BytesRefBuilder previous = new BytesRefBuilder();
    long offset = 0;
    long ord = 0;
    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
        if ((ord & Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_MASK) == 0) {
            writer.add(offset);
            int sortKeyLength = StringHelper.sortKeyLength(previous.get(), term);
            offset += sortKeyLength;
            data.writeBytes(term.bytes, term.offset, sortKeyLength);
        } else if ((ord & Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_MASK) == Lucene70DocValuesFormat.TERMS_DICT_REVERSE_INDEX_MASK) {
            previous.copyBytes(term);
        }
        ++ord;
    }
    writer.add(offset);
    writer.finish();
    meta.writeLong(start);
    meta.writeLong(data.getFilePointer() - start);
    start = data.getFilePointer();
    addressBuffer.writeTo(data);
    meta.writeLong(start);
    meta.writeLong(data.getFilePointer() - start);
}
Also used: BytesRefBuilder (org.apache.lucene.util.BytesRefBuilder), RAMOutputStream (org.apache.lucene.store.RAMOutputStream), DirectMonotonicWriter (org.apache.lucene.util.packed.DirectMonotonicWriter), BytesRef (org.apache.lucene.util.BytesRef), TermsEnum (org.apache.lucene.index.TermsEnum)
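
writeTermsIndex stores, for each indexed term, only the shortest prefix that still sorts strictly after the previously remembered term. A plain-Java stand-in for StringHelper.sortKeyLength, assumed here to return one byte past the common prefix (an assumption for illustration, not the verified Lucene implementation):

class SortKeyLengthSketch {
    // Shortest prefix of 'current' that still compares greater than 'prior':
    // one byte past the common prefix. Safe for unique, sorted terms, where
    // 'current' always differs from 'prior' within current.length bytes.
    static int sortKeyLength(byte[] prior, byte[] current) {
        int n = Math.min(prior.length, current.length);
        int i = 0;
        while (i < n && prior[i] == current[i]) {
            i++;
        }
        return i + 1;
    }

    public static void main(String[] args) {
        byte[] prior = "lucene".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        byte[] current = "lucky".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(sortKeyLength(prior, current)); // 4: "luck" already sorts after "lucene"
    }
}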

Example 15 with RAMOutputStream

Use of org.apache.lucene.store.RAMOutputStream in project lucene-solr by apache.

The class TestLucene70DocValuesFormat, method testSortedNumericAroundBlockSize.

@Slow
public void testSortedNumericAroundBlockSize() throws IOException {
    final int frontier = 1 << Lucene70DocValuesFormat.DIRECT_MONOTONIC_BLOCK_SHIFT;
    for (int maxDoc = frontier - 1; maxDoc <= frontier + 1; ++maxDoc) {
        final Directory dir = newDirectory();
        IndexWriter w = new IndexWriter(dir, newIndexWriterConfig().setMergePolicy(newLogMergePolicy()));
        RAMFile buffer = new RAMFile();
        RAMOutputStream out = new RAMOutputStream(buffer, false);
        Document doc = new Document();
        SortedNumericDocValuesField field1 = new SortedNumericDocValuesField("snum", 0L);
        doc.add(field1);
        SortedNumericDocValuesField field2 = new SortedNumericDocValuesField("snum", 0L);
        doc.add(field2);
        for (int i = 0; i < maxDoc; ++i) {
            long s1 = random().nextInt(100);
            long s2 = random().nextInt(100);
            field1.setLongValue(s1);
            field2.setLongValue(s2);
            w.addDocument(doc);
            out.writeVLong(Math.min(s1, s2));
            out.writeVLong(Math.max(s1, s2));
        }
        out.close();
        w.forceMerge(1);
        DirectoryReader r = DirectoryReader.open(w);
        w.close();
        LeafReader sr = getOnlyLeafReader(r);
        assertEquals(maxDoc, sr.maxDoc());
        SortedNumericDocValues values = sr.getSortedNumericDocValues("snum");
        assertNotNull(values);
        RAMInputStream in = new RAMInputStream("", buffer);
        for (int i = 0; i < maxDoc; ++i) {
            assertEquals(i, values.nextDoc());
            assertEquals(2, values.docValueCount());
            assertEquals(in.readVLong(), values.nextValue());
            assertEquals(in.readVLong(), values.nextValue());
        }
        r.close();
        dir.close();
    }
}
Also used: RAMFile (org.apache.lucene.store.RAMFile), SortedNumericDocValuesField (org.apache.lucene.document.SortedNumericDocValuesField), SortedNumericDocValues (org.apache.lucene.index.SortedNumericDocValues), LeafReader (org.apache.lucene.index.LeafReader), IndexWriter (org.apache.lucene.index.IndexWriter), RandomIndexWriter (org.apache.lucene.index.RandomIndexWriter), DirectoryReader (org.apache.lucene.index.DirectoryReader), RAMInputStream (org.apache.lucene.store.RAMInputStream), RAMOutputStream (org.apache.lucene.store.RAMOutputStream), Document (org.apache.lucene.document.Document), Directory (org.apache.lucene.store.Directory)
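
The test keeps its expected values in a RAMFile side channel: it writes them through a RAMOutputStream up front, then re-reads the same buffer through a RAMInputStream for the assertions. A minimal round-trip sketch of that pattern, using only the store classes the test already imports:

import java.io.IOException;
import org.apache.lucene.store.RAMFile;
import org.apache.lucene.store.RAMInputStream;
import org.apache.lucene.store.RAMOutputStream;

class RAMFileRoundTripSketch {
    public static void main(String[] args) throws IOException {
        RAMFile buffer = new RAMFile();
        // 'false' disables checksum tracking, matching the test above.
        RAMOutputStream out = new RAMOutputStream(buffer, false);
        for (long v = 0; v < 5; v++) {
            out.writeVLong(v * 1000);
        }
        out.close();

        // The name argument is a descriptive label for the stream.
        RAMInputStream in = new RAMInputStream("round-trip", buffer);
        for (long v = 0; v < 5; v++) {
            System.out.println(in.readVLong()); // 0, 1000, 2000, 3000, 4000
        }
        in.close();
    }
}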

Aggregations

RAMOutputStream (org.apache.lucene.store.RAMOutputStream): 27
RAMFile (org.apache.lucene.store.RAMFile): 21
IndexOutput (org.apache.lucene.store.IndexOutput): 16
RAMInputStream (org.apache.lucene.store.RAMInputStream): 15
BufferedChecksumIndexInput (org.apache.lucene.store.BufferedChecksumIndexInput): 11
ChecksumIndexInput (org.apache.lucene.store.ChecksumIndexInput): 11
IndexInput (org.apache.lucene.store.IndexInput): 7
CorruptIndexException (org.apache.lucene.index.CorruptIndexException): 5
BytesRef (org.apache.lucene.util.BytesRef): 5
BytesRefBuilder (org.apache.lucene.util.BytesRefBuilder): 5
IOException (java.io.IOException): 4
Document (org.apache.lucene.document.Document): 4
DirectoryReader (org.apache.lucene.index.DirectoryReader): 4
IndexWriter (org.apache.lucene.index.IndexWriter): 4
LeafReader (org.apache.lucene.index.LeafReader): 4
RandomIndexWriter (org.apache.lucene.index.RandomIndexWriter): 4
Directory (org.apache.lucene.store.Directory): 4
TreeSet (java.util.TreeSet): 2
SortedNumericDocValuesField (org.apache.lucene.document.SortedNumericDocValuesField): 2
SortedSetDocValuesField (org.apache.lucene.document.SortedSetDocValuesField): 2