
Example 1 with DefaultFileSchema

Use of com.thinkbiganalytics.discovery.model.DefaultFileSchema in the kylo project by Teradata.

From class CSVFileSchemaParser, method populateSchema:

private DefaultFileSchema populateSchema(CSVParser parser) {
    DefaultFileSchema fileSchema = new DefaultFileSchema();
    int i = 0;
    ArrayList<Field> fields = new ArrayList<>();
    for (CSVRecord record : parser) {
        // Sample at most the first 10 rows.
        if (i > 9) {
            break;
        }
        int size = record.size();
        for (int j = 0; j < size; j++) {
            DefaultField field;
            if (i == 0) {
                // First row: create the fields, named from the header row if
                // one is present, otherwise generated as Col_1, Col_2, ...
                field = new DefaultField();
                if (headerRow) {
                    field.setName(record.get(j));
                } else {
                    field.setName("Col_" + (j + 1));
                }
                fields.add(field);
            } else {
                // Subsequent rows: collect sample values for each field.
                try {
                    field = (DefaultField) fields.get(j);
                    field.getSampleValues().add(StringUtils.defaultString(record.get(j), ""));
                } catch (IndexOutOfBoundsException e) {
                    // A row has more columns than the first row did.
                    LOG.warn("Sample file has potential sparse column problem at row {} field {}", i + 1, j + 1);
                }
            }
        }
        i++;
    }
    fileSchema.setFields(fields);
    return fileSchema;
}
Also used: DefaultField (com.thinkbiganalytics.discovery.model.DefaultField), Field (com.thinkbiganalytics.discovery.schema.Field), DefaultFileSchema (com.thinkbiganalytics.discovery.model.DefaultFileSchema), ArrayList (java.util.ArrayList), CSVRecord (org.apache.commons.csv.CSVRecord)
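The column-naming rule in populateSchema (header cell if a header row exists, otherwise a generated Col_N name) can be sketched without the Kylo or Commons CSV dependencies. This is a minimal stand-in, not Kylo's API; the class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class SampleSchemaSketch {

    // Mirrors the naming rule in populateSchema: use the header cell when a
    // header row is present, otherwise generate Col_1, Col_2, ... (1-based).
    static List<String> inferColumnNames(String[] firstRow, boolean headerRow) {
        List<String> names = new ArrayList<>();
        for (int j = 0; j < firstRow.length; j++) {
            names.add(headerRow ? firstRow[j] : "Col_" + (j + 1));
        }
        return names;
    }

    public static void main(String[] args) {
        // With a header row, names come straight from the first record.
        System.out.println(inferColumnNames(new String[]{"id", "name"}, true));
        // Without one, the first record is data and names are generated.
        System.out.println(inferColumnNames(new String[]{"1", "Alice"}, false));
    }
}
```

Note that when there is no header row, the real method still consumes the first record only for field creation; its values are not added as samples, which slightly skews type derivation for headerless files.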

Example 2 with DefaultFileSchema

Use of com.thinkbiganalytics.discovery.model.DefaultFileSchema in the kylo project by Teradata.

From class CSVFileSchemaParser, method parse:

@Override
public Schema parse(InputStream is, Charset charset, TableSchemaType target) throws IOException {
    Validate.notNull(target, "target must not be null");
    Validate.notNull(is, "stream must not be null");
    Validate.notNull(charset, "charset must not be null");
    validate();
    // Parse the file
    String sampleData = ParserHelper.extractSampleLines(is, charset, numRowsToSample);
    Validate.notEmpty(sampleData, "No data in file");
    CSVFormat format = createCSVFormat(sampleData);
    try (Reader reader = new StringReader(sampleData)) {
        CSVParser parser = format.parse(reader);
        DefaultFileSchema fileSchema = populateSchema(parser);
        fileSchema.setCharset(charset.name());
        // Convert to target schema with proper derived types
        return convertToTarget(target, fileSchema);
    }
}
Also used: CSVParser (org.apache.commons.csv.CSVParser), CSVFormat (org.apache.commons.csv.CSVFormat), DefaultFileSchema (com.thinkbiganalytics.discovery.model.DefaultFileSchema), Schema (com.thinkbiganalytics.discovery.schema.Schema), DefaultHiveSchema (com.thinkbiganalytics.discovery.model.DefaultHiveSchema), DefaultTableSchema (com.thinkbiganalytics.discovery.model.DefaultTableSchema), Reader (java.io.Reader), StringReader (java.io.StringReader)
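The parse method first caps the input at numRowsToSample lines via ParserHelper.extractSampleLines before any CSV parsing happens, so format detection and schema inference only ever touch a bounded prefix of the file. A self-contained sketch of that sampling step, using only the JDK (the class and method here are hypothetical stand-ins, not Kylo's ParserHelper):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class SampleLinesSketch {

    // Hypothetical stand-in for ParserHelper.extractSampleLines: read at most
    // maxRows lines from the stream in the given charset, newline-terminated.
    static String extractSampleLines(InputStream is, Charset cs, int maxRows) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(is, cs))) {
            String line;
            int n = 0;
            while (n < maxRows && (line = r.readLine()) != null) {
                sb.append(line).append('\n');
                n++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = "a,b\n1,2\n3,4\n5,6\n".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        // Only the first two lines survive the sampling cap.
        String sample = extractSampleLines(new java.io.ByteArrayInputStream(data),
                java.nio.charset.StandardCharsets.UTF_8, 2);
        System.out.print(sample);
    }
}
```

Feeding the sampled String back through a StringReader, as parse does, keeps the CSVParser off the raw stream entirely; the original InputStream is consumed only once, by the sampler.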

Aggregations

DefaultFileSchema (com.thinkbiganalytics.discovery.model.DefaultFileSchema): 2 uses
DefaultField (com.thinkbiganalytics.discovery.model.DefaultField): 1 use
DefaultHiveSchema (com.thinkbiganalytics.discovery.model.DefaultHiveSchema): 1 use
DefaultTableSchema (com.thinkbiganalytics.discovery.model.DefaultTableSchema): 1 use
Field (com.thinkbiganalytics.discovery.schema.Field): 1 use
Schema (com.thinkbiganalytics.discovery.schema.Schema): 1 use
Reader (java.io.Reader): 1 use
StringReader (java.io.StringReader): 1 use
ArrayList (java.util.ArrayList): 1 use
CSVFormat (org.apache.commons.csv.CSVFormat): 1 use
CSVParser (org.apache.commons.csv.CSVParser): 1 use
CSVRecord (org.apache.commons.csv.CSVRecord): 1 use