Example 1 with SummarizerFactory

use of org.apache.accumulo.core.summary.SummarizerFactory in project accumulo by apache.

the class MajorCompactionRequest method getSummaries.

/**
 * Returns all summaries present in each file.
 *
 * <p>
 * This method can only be called from {@link CompactionStrategy#gatherInformation(MajorCompactionRequest)}. Unfortunately, {@code gatherInformation()} is not
 * called before {@link CompactionStrategy#shouldCompact(MajorCompactionRequest)}. Therefore {@code shouldCompact()} should just return true when a
 * compaction strategy wants to use summary information.
 *
 * <p>
 * When using summaries to make compaction decisions, it's important to ensure that all summary data fits in the tablet server summary cache. The size of this
 * cache is configured by {@code tserver.cache.summary.size}. It's also important to use the summarySelector predicate to retrieve only the needed summary data.
 * Otherwise unneeded summary data could be brought into the cache.
 *
 * <p>
 * Some files may contain data outside of a tablet's range. When {@link Summarizer}s generate small amounts of summary data, multiple summaries may be stored
 * within a file for different row ranges. This will allow more accurate summaries to be returned for the case where a file has data outside a tablet's range.
 * However, some summary data outside of the tablet's range may still be included. When this happens {@link FileStatistics#getExtra()} will be non-zero. Also,
 * it's good to be aware of the other potential causes of inaccuracies; see {@link FileStatistics#getInaccurate()}.
 *
 * <p>
 * When this method is called with multiple files, it will automatically merge summary data using {@link Combiner#merge(Map, Map)}. If summary information is
 * needed for each file, then just call this method for each file.
 *
 * <p>
 * Writing a compaction strategy that uses summary information is a bit tricky. See the source code for {@link TooManyDeletesCompactionStrategy} as an example
 * of a compaction strategy.
 *
 * @see Summarizer
 * @see TableOperations#addSummarizers(String, SummarizerConfiguration...)
 * @see AccumuloFileOutputFormat#setSummarizers(org.apache.hadoop.mapred.JobConf, SummarizerConfiguration...)
 * @see WriterOptions#withSummarizers(SummarizerConfiguration...)
 */
public List<Summary> getSummaries(Collection<FileRef> files, Predicate<SummarizerConfiguration> summarySelector) throws IOException {
    Preconditions.checkState(volumeManager != null, "Getting summaries is not supported at this time. It's only supported when CompactionStrategy.gatherInformation() is called.");
    SummaryCollection sc = new SummaryCollection();
    SummarizerFactory factory = new SummarizerFactory(tableConfig);
    for (FileRef file : files) {
        FileSystem fs = volumeManager.getVolumeByPath(file.path()).getFileSystem();
        Configuration conf = CachedConfiguration.getInstance();
        SummaryCollection fsc = SummaryReader.load(fs, conf, tableConfig, factory, file.path(), summarySelector, summaryCache, indexCache).getSummaries(Collections.singletonList(new Gatherer.RowRange(extent)));
        sc.merge(fsc, factory);
    }
    return sc.getSummaries();
}
Also used : SummarizerConfiguration(org.apache.accumulo.core.client.summary.SummarizerConfiguration) Configuration(org.apache.hadoop.conf.Configuration) AccumuloConfiguration(org.apache.accumulo.core.conf.AccumuloConfiguration) CachedConfiguration(org.apache.accumulo.core.util.CachedConfiguration) FileRef(org.apache.accumulo.server.fs.FileRef) FileSystem(org.apache.hadoop.fs.FileSystem) SummarizerFactory(org.apache.accumulo.core.summary.SummarizerFactory) SummaryCollection(org.apache.accumulo.core.summary.SummaryCollection)
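
The loop in Example 1 loads summaries one file at a time, filters them with summarySelector, and merges them into a single collection. The sketch below illustrates that select-then-merge pattern with plain Java maps only. It is not Accumulo code: the class SummaryMergeSketch, the method mergeSelected, the use of a summarizer name as the selection key, and the assumption that counters combine by addition are all hypothetical and chosen for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in for per-file summary merging; not part of Accumulo. */
final class SummaryMergeSketch {

    /**
     * Merges per-file counters for the summarizers accepted by the selector.
     * Outer key: summarizer name. Inner map: counter name to value.
     * Assumption: counters are combined by addition.
     */
    static Map<String, Map<String, Long>> mergeSelected(
            List<Map<String, Map<String, Long>>> perFile, Predicate<String> selector) {
        Map<String, Map<String, Long>> merged = new HashMap<>();
        for (Map<String, Map<String, Long>> fileSummaries : perFile) {
            for (Map.Entry<String, Map<String, Long>> e : fileSummaries.entrySet()) {
                if (!selector.test(e.getKey())) {
                    continue; // skip summaries the caller did not ask for
                }
                Map<String, Long> target =
                        merged.computeIfAbsent(e.getKey(), k -> new HashMap<>());
                // Sum each counter into the running total for this summarizer.
                e.getValue().forEach((counter, value) -> target.merge(counter, value, Long::sum));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Long>> fileA =
                Map.of("deletes", Map.of("total", 100L, "deletes", 40L));
        Map<String, Map<String, Long>> fileB =
                Map.of("deletes", Map.of("total", 50L, "deletes", 35L));
        System.out.println(mergeSelected(List.of(fileA, fileB), name -> name.equals("deletes")));
    }
}
```

Filtering before merging mirrors the advice in the Javadoc: data that the selector rejects is never accumulated, so it does not take up space alongside the summaries that are actually needed.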

Example 2 with SummarizerFactory

use of org.apache.accumulo.core.summary.SummarizerFactory in project accumulo by apache.

the class RFileSummariesRetriever method read.

@Override
public Collection<Summary> read() throws IOException {
    SummarizerFactory factory = new SummarizerFactory();
    AccumuloConfiguration acuconf = DefaultConfiguration.getInstance();
    Configuration conf = in.getFileSystem().getConf();
    RFileSource[] sources = in.getSources();
    try {
        SummaryCollection all = new SummaryCollection();
        for (RFileSource source : sources) {
            SummaryReader fileSummary = SummaryReader.load(conf, acuconf, source.getInputStream(), source.getLength(), summarySelector, factory);
            SummaryCollection sc = fileSummary.getSummaries(Collections.singletonList(new Gatherer.RowRange(startRow, endRow)));
            all.merge(sc, factory);
        }
        return all.getSummaries();
    } finally {
        for (RFileSource source : sources) {
            source.getInputStream().close();
        }
    }
}
Also used : SummarizerConfiguration(org.apache.accumulo.core.client.summary.SummarizerConfiguration) DefaultConfiguration(org.apache.accumulo.core.conf.DefaultConfiguration) AccumuloConfiguration(org.apache.accumulo.core.conf.AccumuloConfiguration) Configuration(org.apache.hadoop.conf.Configuration) SummaryReader(org.apache.accumulo.core.summary.SummaryReader) SummarizerFactory(org.apache.accumulo.core.summary.SummarizerFactory) SummaryCollection(org.apache.accumulo.core.summary.SummaryCollection)
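
One property of the finally block in Example 2 is worth noting: if close() throws for one source, the loop ends there and the remaining streams are left open. The following is a minimal sketch of one way to close every resource regardless, using only the JDK. The class CloseAllSketch and the method closeAll are hypothetical names and are not part of Accumulo.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

/** Hypothetical helper; not part of Accumulo. */
final class CloseAllSketch {

    /**
     * Attempts to close every resource. The first IOException is rethrown
     * after the loop; later ones are attached to it as suppressed exceptions.
     */
    static void closeAll(List<? extends Closeable> resources) throws IOException {
        IOException first = null;
        for (Closeable c : resources) {
            try {
                c.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    public static void main(String[] args) {
        Closeable failing = () -> { throw new IOException("close failed"); };
        Closeable fine = () -> System.out.println("closed");
        try {
            closeAll(List.of(failing, fine));
        } catch (IOException e) {
            System.out.println("reported: " + e.getMessage());
        }
    }
}
```

With a helper like this, the finally block could collect the input streams into a list and pass them to closeAll in a single call.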

Aggregations

SummarizerConfiguration (org.apache.accumulo.core.client.summary.SummarizerConfiguration) 2
AccumuloConfiguration (org.apache.accumulo.core.conf.AccumuloConfiguration) 2
SummarizerFactory (org.apache.accumulo.core.summary.SummarizerFactory) 2
SummaryCollection (org.apache.accumulo.core.summary.SummaryCollection) 2
Configuration (org.apache.hadoop.conf.Configuration) 2
DefaultConfiguration (org.apache.accumulo.core.conf.DefaultConfiguration) 1
SummaryReader (org.apache.accumulo.core.summary.SummaryReader) 1
CachedConfiguration (org.apache.accumulo.core.util.CachedConfiguration) 1
FileRef (org.apache.accumulo.server.fs.FileRef) 1
FileSystem (org.apache.hadoop.fs.FileSystem) 1