Examples with DataSchema - io.druid.segment.indexing.DataSchema

Example 31 with DataSchema

use of io.druid.segment.indexing.DataSchema in project druid by druid-io.

the class IndexGeneratorJobTest method setUp.

@Before
public void setUp() throws Exception {
    mapper = HadoopDruidIndexerConfig.JSON_MAPPER;
    mapper.registerSubtypes(new NamedType(HashBasedNumberedShardSpec.class, "hashed"));
    mapper.registerSubtypes(new NamedType(SingleDimensionShardSpec.class, "single"));
    dataFile = temporaryFolder.newFile();
    tmpDir = temporaryFolder.newFolder();
    HashMap<String, Object> inputSpec = new HashMap<String, Object>();
    inputSpec.put("paths", dataFile.getCanonicalPath());
    inputSpec.put("type", "static");
    if (inputFormatName != null) {
        inputSpec.put("inputFormat", inputFormatName);
    }
    if (SequenceFileInputFormat.class.getName().equals(inputFormatName)) {
        writeDataToLocalSequenceFile(dataFile, data);
    } else {
        FileUtils.writeLines(dataFile, data);
    }
    config = new HadoopDruidIndexerConfig(new HadoopIngestionSpec(
        new DataSchema(datasourceName, mapper.convertValue(inputRowParser, Map.class), aggs, new UniformGranularitySpec(Granularities.DAY, Granularities.NONE, ImmutableList.of(this.interval)), mapper),
        new HadoopIOConfig(ImmutableMap.copyOf(inputSpec), null, tmpDir.getCanonicalPath()),
        new HadoopTuningConfig(tmpDir.getCanonicalPath(), null, null, null, null, maxRowsInMemory, false, false, false, false,
            // verifies that set num reducers is ignored
            ImmutableMap.of(JobContext.NUM_REDUCES, "0"), false, useCombiner, null, buildV9Directly, null, forceExtendableShardSpecs, false)));
    config.setShardSpecs(loadShardSpecs(partitionType, shardInfoForEachSegment));
    config = HadoopDruidIndexerConfig.fromSpec(config.getSchema());
}

Also used : HashBasedNumberedShardSpec(io.druid.timeline.partition.HashBasedNumberedShardSpec) HashMap(java.util.HashMap) NamedType(com.fasterxml.jackson.databind.jsontype.NamedType) SequenceFileInputFormat(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat) DataSchema(io.druid.segment.indexing.DataSchema) UniformGranularitySpec(io.druid.segment.indexing.granularity.UniformGranularitySpec) SingleDimensionShardSpec(io.druid.timeline.partition.SingleDimensionShardSpec) Before(org.junit.Before)

Example 32 with DataSchema

use of io.druid.segment.indexing.DataSchema in project druid by druid-io.

the class JobHelperTest method setup.

@Before
public void setup() throws Exception {
    tmpDir = temporaryFolder.newFile();
    dataFile = temporaryFolder.newFile();
    config = new HadoopDruidIndexerConfig(new HadoopIngestionSpec(
        new DataSchema("website",
            HadoopDruidIndexerConfig.JSON_MAPPER.convertValue(new StringInputRowParser(new CSVParseSpec(new TimestampSpec("timestamp", "yyyyMMddHH", null), new DimensionsSpec(DimensionsSpec.getDefaultSchemas(ImmutableList.of("host")), null, null), null, ImmutableList.of("timestamp", "host", "visited_num")), null), Map.class),
            new AggregatorFactory[] { new LongSumAggregatorFactory("visited_num", "visited_num") },
            new UniformGranularitySpec(Granularities.DAY, Granularities.NONE, ImmutableList.of(this.interval)),
            HadoopDruidIndexerConfig.JSON_MAPPER),
        new HadoopIOConfig(ImmutableMap.<String, Object>of("paths", dataFile.getCanonicalPath(), "type", "static"), null, tmpDir.getCanonicalPath()),
        new HadoopTuningConfig(tmpDir.getCanonicalPath(), null, null, null, null, null, false, false, false, false,
            // Map of job properties
            ImmutableMap.of("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem", "fs.s3.awsAccessKeyId", "THISISMYACCESSKEY"), false, false, null, null, null, false, false)));
}

Also used : LongSumAggregatorFactory(io.druid.query.aggregation.LongSumAggregatorFactory) AggregatorFactory(io.druid.query.aggregation.AggregatorFactory) DataSchema(io.druid.segment.indexing.DataSchema) UniformGranularitySpec(io.druid.segment.indexing.granularity.UniformGranularitySpec) CSVParseSpec(io.druid.data.input.impl.CSVParseSpec) StringInputRowParser(io.druid.data.input.impl.StringInputRowParser) TimestampSpec(io.druid.data.input.impl.TimestampSpec) DimensionsSpec(io.druid.data.input.impl.DimensionsSpec) Before(org.junit.Before)

Example 33 with DataSchema

use of io.druid.segment.indexing.DataSchema in project druid by druid-io.

the class RealtimeManager method start.

@LifecycleStart
public void start() throws IOException {
    for (final FireDepartment fireDepartment : fireDepartments) {
        final DataSchema schema = fireDepartment.getDataSchema();
        final FireChief chief = new FireChief(fireDepartment, conglomerate);
        Map<Integer, FireChief> partitionChiefs = chiefs.get(schema.getDataSource());
        if (partitionChiefs == null) {
            partitionChiefs = new HashMap<>();
            chiefs.put(schema.getDataSource(), partitionChiefs);
        }
        partitionChiefs.put(fireDepartment.getTuningConfig().getShardSpec().getPartitionNum(), chief);
        chief.setName(String.format("chief-%s[%s]", schema.getDataSource(), fireDepartment.getTuningConfig().getShardSpec().getPartitionNum()));
        chief.setDaemon(true);
        chief.start();
    }
}

Also used : DataSchema(io.druid.segment.indexing.DataSchema) LifecycleStart(io.druid.java.util.common.lifecycle.LifecycleStart)
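
The null-check-then-put idiom that start() uses to group chiefs by data source and partition number can also be written with Map.computeIfAbsent. The following is a minimal sketch of that pattern only, not the RealtimeManager implementation: the class ChiefRegistry and its methods are hypothetical, and a plain String stands in for FireChief.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the chiefs bookkeeping; not part of Druid.
class ChiefRegistry {
    // dataSource -> (partition number -> chief name); String replaces FireChief here
    private final Map<String, Map<Integer, String>> chiefs = new HashMap<>();

    // Creates the inner map on first use, then stores the chief under its partition number.
    void register(String dataSource, int partitionNum) {
        String name = String.format("chief-%s[%s]", dataSource, partitionNum);
        chiefs.computeIfAbsent(dataSource, k -> new HashMap<>()).put(partitionNum, name);
    }

    // Returns the chief name, or null when the data source or partition is unknown.
    String lookup(String dataSource, int partitionNum) {
        Map<Integer, String> partitionChiefs = chiefs.get(dataSource);
        return partitionChiefs == null ? null : partitionChiefs.get(partitionNum);
    }

    public static void main(String[] args) {
        ChiefRegistry registry = new ChiefRegistry();
        registry.register("website", 0);
        registry.register("website", 1);
        System.out.println(registry.lookup("website", 1)); // chief-website[1]
    }
}
```

computeIfAbsent requires Java 8 or later; the explicit null check in the example above behaves the same way for a single-threaded HashMap.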

Example 34 with DataSchema

use of io.druid.segment.indexing.DataSchema in project hive by apache.

the class DruidOutputFormat method getHiveRecordWriter.

@Override
public FileSinkOperator.RecordWriter getHiveRecordWriter(JobConf jc, Path finalOutPath, Class<? extends Writable> valueClass, boolean isCompressed, Properties tableProperties, Progressable progress) throws IOException {
    final String segmentGranularity = tableProperties.getProperty(Constants.DRUID_SEGMENT_GRANULARITY) != null ? tableProperties.getProperty(Constants.DRUID_SEGMENT_GRANULARITY) : HiveConf.getVar(jc, HiveConf.ConfVars.HIVE_DRUID_INDEXING_GRANULARITY);
    final int targetNumShardsPerGranularity = Integer.parseUnsignedInt(tableProperties.getProperty(Constants.DRUID_TARGET_SHARDS_PER_GRANULARITY, "0"));
    final int maxPartitionSize = targetNumShardsPerGranularity > 0 ? -1 : HiveConf.getIntVar(jc, HiveConf.ConfVars.HIVE_DRUID_MAX_PARTITION_SIZE);
    // If datasource is in the table properties, it is an INSERT/INSERT OVERWRITE as the datasource
    // name was already persisted. Otherwise, it is a CT/CTAS and we need to get the name from the
    // job properties that are set by configureOutputJobProperties in the DruidStorageHandler
    final String dataSource = tableProperties.getProperty(Constants.DRUID_DATA_SOURCE) == null ? jc.get(Constants.DRUID_DATA_SOURCE) : tableProperties.getProperty(Constants.DRUID_DATA_SOURCE);
    final String segmentDirectory = jc.get(Constants.DRUID_SEGMENT_INTERMEDIATE_DIRECTORY);
    final GranularitySpec granularitySpec = new UniformGranularitySpec(Granularity.fromString(segmentGranularity), Granularity.fromString(tableProperties.getProperty(Constants.DRUID_QUERY_GRANULARITY) == null ? "NONE" : tableProperties.getProperty(Constants.DRUID_QUERY_GRANULARITY)), null);
    final String columnNameProperty = tableProperties.getProperty(serdeConstants.LIST_COLUMNS);
    final String columnTypeProperty = tableProperties.getProperty(serdeConstants.LIST_COLUMN_TYPES);
    if (StringUtils.isEmpty(columnNameProperty) || StringUtils.isEmpty(columnTypeProperty)) {
        throw new IllegalStateException(String.format("List of columns names [%s] or columns type [%s] is/are not present", columnNameProperty, columnTypeProperty));
    }
    ArrayList<String> columnNames = new ArrayList<String>();
    for (String name : columnNameProperty.split(",")) {
        columnNames.add(name);
    }
    if (!columnNames.contains(DruidStorageHandlerUtils.DEFAULT_TIMESTAMP_COLUMN)) {
        throw new IllegalStateException("Timestamp column ('" + DruidStorageHandlerUtils.DEFAULT_TIMESTAMP_COLUMN + "') not specified in create table; list of columns is : " + tableProperties.getProperty(serdeConstants.LIST_COLUMNS));
    }
    ArrayList<TypeInfo> columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
    final boolean approximationAllowed = HiveConf.getBoolVar(jc, HiveConf.ConfVars.HIVE_DRUID_APPROX_RESULT);
    // Default, all columns that are not metrics or timestamp, are treated as dimensions
    final List<DimensionSchema> dimensions = new ArrayList<>();
    ImmutableList.Builder<AggregatorFactory> aggregatorFactoryBuilder = ImmutableList.builder();
    for (int i = 0; i < columnTypes.size(); i++) {
        final PrimitiveObjectInspector.PrimitiveCategory primitiveCategory = ((PrimitiveTypeInfo) columnTypes.get(i)).getPrimitiveCategory();
        AggregatorFactory af;
        switch(primitiveCategory) {
            case BYTE:
            case SHORT:
            case INT:
            case LONG:
                af = new LongSumAggregatorFactory(columnNames.get(i), columnNames.get(i));
                break;
            case FLOAT:
            case DOUBLE:
                af = new DoubleSumAggregatorFactory(columnNames.get(i), columnNames.get(i));
                break;
            case DECIMAL:
                if (approximationAllowed) {
                    af = new DoubleSumAggregatorFactory(columnNames.get(i), columnNames.get(i));
                } else {
                    throw new UnsupportedOperationException(String.format("Druid does not support decimal column type. " + "Either cast column [%s] to double or Enable Approximate Result for Druid by setting property [%s] to true", columnNames.get(i), HiveConf.ConfVars.HIVE_DRUID_APPROX_RESULT.varname));
                }
                break;
            case TIMESTAMP:
                // Granularity column
                String tColumnName = columnNames.get(i);
                if (!tColumnName.equals(Constants.DRUID_TIMESTAMP_GRANULARITY_COL_NAME)) {
                    throw new IOException("Dimension " + tColumnName + " does not have STRING type: " + primitiveCategory);
                }
                continue;
            case TIMESTAMPLOCALTZ:
                // Druid timestamp column
                String tLocalTZColumnName = columnNames.get(i);
                if (!tLocalTZColumnName.equals(DruidStorageHandlerUtils.DEFAULT_TIMESTAMP_COLUMN)) {
                    throw new IOException("Dimension " + tLocalTZColumnName + " does not have STRING type: " + primitiveCategory);
                }
                continue;
            default:
                // Dimension
                String dColumnName = columnNames.get(i);
                if (PrimitiveObjectInspectorUtils.getPrimitiveGrouping(primitiveCategory) != PrimitiveGrouping.STRING_GROUP && primitiveCategory != PrimitiveObjectInspector.PrimitiveCategory.BOOLEAN) {
                    throw new IOException("Dimension " + dColumnName + " does not have STRING type: " + primitiveCategory);
                }
                dimensions.add(new StringDimensionSchema(dColumnName));
                continue;
        }
        aggregatorFactoryBuilder.add(af);
    }
    List<AggregatorFactory> aggregatorFactories = aggregatorFactoryBuilder.build();
    final InputRowParser inputRowParser = new MapInputRowParser(new TimeAndDimsParseSpec(new TimestampSpec(DruidStorageHandlerUtils.DEFAULT_TIMESTAMP_COLUMN, "auto", null), new DimensionsSpec(dimensions, Lists.newArrayList(Constants.DRUID_TIMESTAMP_GRANULARITY_COL_NAME, Constants.DRUID_SHARD_KEY_COL_NAME), null)));
    Map<String, Object> inputParser = DruidStorageHandlerUtils.JSON_MAPPER.convertValue(inputRowParser, Map.class);
    final DataSchema dataSchema = new DataSchema(Preconditions.checkNotNull(dataSource, "Data source name is null"), inputParser, aggregatorFactories.toArray(new AggregatorFactory[aggregatorFactories.size()]), granularitySpec, DruidStorageHandlerUtils.JSON_MAPPER);
    final String workingPath = jc.get(Constants.DRUID_JOB_WORKING_DIRECTORY);
    final String version = jc.get(Constants.DRUID_SEGMENT_VERSION);
    String basePersistDirectory = HiveConf.getVar(jc, HiveConf.ConfVars.HIVE_DRUID_BASE_PERSIST_DIRECTORY);
    if (Strings.isNullOrEmpty(basePersistDirectory)) {
        basePersistDirectory = System.getProperty("java.io.tmpdir");
    }
    Integer maxRowInMemory = HiveConf.getIntVar(jc, HiveConf.ConfVars.HIVE_DRUID_MAX_ROW_IN_MEMORY);
    IndexSpec indexSpec;
    if ("concise".equals(HiveConf.getVar(jc, HiveConf.ConfVars.HIVE_DRUID_BITMAP_FACTORY_TYPE))) {
        indexSpec = new IndexSpec(new ConciseBitmapSerdeFactory(), null, null, null);
    } else {
        indexSpec = new IndexSpec(new RoaringBitmapSerdeFactory(true), null, null, null);
    }
    RealtimeTuningConfig realtimeTuningConfig = new RealtimeTuningConfig(maxRowInMemory, null, null, new File(basePersistDirectory, dataSource), new CustomVersioningPolicy(version), null, null, null, indexSpec, true, 0, 0, true, null, 0L);
    LOG.debug(String.format("running with Data schema [%s] ", dataSchema));
    return new DruidRecordWriter(dataSchema, realtimeTuningConfig, DruidStorageHandlerUtils.createSegmentPusherForDirectory(segmentDirectory, jc), maxPartitionSize, new Path(workingPath, SEGMENTS_DESCRIPTOR_DIR_NAME), finalOutPath.getFileSystem(jc));
}

Also used : IndexSpec(io.druid.segment.IndexSpec) MapInputRowParser(io.druid.data.input.impl.MapInputRowParser) ImmutableList(com.google.common.collect.ImmutableList) ArrayList(java.util.ArrayList) LongSumAggregatorFactory(io.druid.query.aggregation.LongSumAggregatorFactory) StringDimensionSchema(io.druid.data.input.impl.StringDimensionSchema) DimensionSchema(io.druid.data.input.impl.DimensionSchema) PrimitiveTypeInfo(org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo) TimeAndDimsParseSpec(io.druid.data.input.impl.TimeAndDimsParseSpec) UniformGranularitySpec(io.druid.segment.indexing.granularity.UniformGranularitySpec) RoaringBitmapSerdeFactory(io.druid.segment.data.RoaringBitmapSerdeFactory) ConciseBitmapSerdeFactory(io.druid.segment.data.ConciseBitmapSerdeFactory) TimestampSpec(io.druid.data.input.impl.TimestampSpec) Path(org.apache.hadoop.fs.Path) DoubleSumAggregatorFactory(io.druid.query.aggregation.DoubleSumAggregatorFactory) IOException(java.io.IOException) AggregatorFactory(io.druid.query.aggregation.AggregatorFactory) RealtimeTuningConfig(io.druid.segment.indexing.RealtimeTuningConfig) TypeInfo(org.apache.hadoop.hive.serde2.typeinfo.TypeInfo) DataSchema(io.druid.segment.indexing.DataSchema) GranularitySpec(io.druid.segment.indexing.granularity.GranularitySpec) DimensionsSpec(io.druid.data.input.impl.DimensionsSpec) PrimitiveObjectInspector(org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector) InputRowParser(io.druid.data.input.impl.InputRowParser) CustomVersioningPolicy(io.druid.segment.realtime.plumber.CustomVersioningPolicy) File(java.io.File)
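
The core of getHiveRecordWriter is the loop that turns numeric columns into sum aggregators and treats the remaining columns as dimensions. The sketch below isolates that classification step so it can be run without Hive or Druid on the classpath. Everything in it is an assumption made for illustration: ColumnClassifier and ColumnType are hypothetical names, the enum is a simplified stand-in for Hive's PrimitiveCategory, metrics are represented as plain strings instead of AggregatorFactory instances, and the length check is added here and does not appear in the original method. Timestamp, decimal, and non-string dimension handling are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the column-to-metric/dimension split; not Hive code.
class ColumnClassifier {
    // Reduced stand-in for Hive's PrimitiveCategory (assumption for this sketch).
    enum ColumnType { BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, STRING, BOOLEAN }

    final List<String> dimensions = new ArrayList<>();
    final List<String> metrics = new ArrayList<>();

    // columnNameProperty is a comma-separated list, as in serdeConstants.LIST_COLUMNS.
    static ColumnClassifier classify(String columnNameProperty, List<ColumnType> columnTypes) {
        List<String> columnNames = Arrays.asList(columnNameProperty.split(","));
        // Sanity check added for this sketch only.
        if (columnNames.size() != columnTypes.size()) {
            throw new IllegalStateException("Column names and types differ in length");
        }
        ColumnClassifier result = new ColumnClassifier();
        for (int i = 0; i < columnTypes.size(); i++) {
            String name = columnNames.get(i);
            switch (columnTypes.get(i)) {
                case BYTE:
                case SHORT:
                case INT:
                case LONG:
                    // Integral types become a long-sum metric.
                    result.metrics.add("longSum(" + name + ")");
                    break;
                case FLOAT:
                case DOUBLE:
                    // Floating-point types become a double-sum metric.
                    result.metrics.add("doubleSum(" + name + ")");
                    break;
                default:
                    // Everything else is treated as a dimension.
                    result.dimensions.add(name);
                    break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ColumnClassifier c = classify("host,visits,latency",
            Arrays.asList(ColumnType.STRING, ColumnType.LONG, ColumnType.DOUBLE));
        System.out.println(c.dimensions + " " + c.metrics);
        // [host] [longSum(visits), doubleSum(latency)]
    }
}
```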

Aggregations

DataSchema (io.druid.segment.indexing.DataSchema): 34
UniformGranularitySpec (io.druid.segment.indexing.granularity.UniformGranularitySpec): 29
Interval (org.joda.time.Interval): 18
Test (org.junit.Test): 18
RealtimeTuningConfig (io.druid.segment.indexing.RealtimeTuningConfig): 12
File (java.io.File): 11
DimensionsSpec (io.druid.data.input.impl.DimensionsSpec): 10
TimestampSpec (io.druid.data.input.impl.TimestampSpec): 10
AggregatorFactory (io.druid.query.aggregation.AggregatorFactory): 10
LongSumAggregatorFactory (io.druid.query.aggregation.LongSumAggregatorFactory): 9
DefaultObjectMapper (io.druid.jackson.DefaultObjectMapper): 8
RealtimeIOConfig (io.druid.segment.indexing.RealtimeIOConfig): 8
StringInputRowParser (io.druid.data.input.impl.StringInputRowParser): 7
CountAggregatorFactory (io.druid.query.aggregation.CountAggregatorFactory): 7
DoubleSumAggregatorFactory (io.druid.query.aggregation.DoubleSumAggregatorFactory): 7
Before (org.junit.Before): 7
ObjectMapper (com.fasterxml.jackson.databind.ObjectMapper): 6
ImmutableMap (com.google.common.collect.ImmutableMap): 6
FireDepartment (io.druid.segment.realtime.FireDepartment): 6
Period (org.joda.time.Period): 6