Example 1 with STRING

Use of org.talend.dataprep.api.type.Type.STRING in project data-prep by Talend.

The method guessColumnType of the class XlsSchemaParser.

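/**
 * Guesses the type of a column from the types previously guessed for each of its rows.
 *
 * @param colId the column id.
 * @param columnRows all rows with previously guessed type: key=row number, value=guessed type
 * @param averageHeaderSize the average header size; rows numbered below it are skipped
 * @return the most frequent type among the remaining rows, or ANY if there are no rows or several types tie
 */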
private Type guessColumnType(Integer colId, SortedMap<Integer, String> columnRows, int averageHeaderSize) {
    // count how often each guessed type occurs, skipping the header rows
    Map<String, Long> perTypeNumber = columnRows.tailMap(averageHeaderSize).values().stream()
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    OptionalLong maxOccurrence = perTypeNumber.values().stream().mapToLong(Long::longValue).max();
    if (!maxOccurrence.isPresent()) {
        return ANY;
    }
    List<String> duplicatedMax = new ArrayList<>();
    perTypeNumber.forEach((type1, aLong) -> {
        if (aLong >= maxOccurrence.getAsLong()) {
            duplicatedMax.add(type1);
        }
    });
    String guessedType;
    if (duplicatedMax.size() == 1) {
        guessedType = duplicatedMax.get(0);
    } else {
        // several types share the highest count, so we guess ANY
        guessedType = ANY.getName();
    }
    LOGGER.debug("guessed type for column #{} is {}", colId, guessedType);
    return Type.get(guessedType);
}
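
The majority vote in guessColumnType can be tried out without the data-prep classes. The following is only a minimal sketch of the same idea, not the project's code: it assumes plain type names as strings, and the class TypeGuessSketch, the method guess, and the literal "any" fallback are hypothetical stand-ins for XlsSchemaParser, guessColumnType, and Type.ANY.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stand-in for the type-guessing logic; not part of data-prep.
public class TypeGuessSketch {

    /** Returns the single most frequent type at or after firstDataRow, else "any". */
    public static String guess(SortedMap<Integer, String> rowTypes, int firstDataRow) {
        // Count how often each type name occurs, ignoring rows before firstDataRow.
        Map<String, Long> counts = rowTypes.tailMap(firstDataRow).values().stream()
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));

        // Highest count seen; 0 when there are no data rows at all.
        long max = counts.values().stream().mapToLong(Long::longValue).max().orElse(0L);

        // Collect every type that reaches the highest count.
        List<String> winners = counts.entrySet().stream()
                .filter(e -> e.getValue() == max)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

        // A unique winner is the guess; a tie or an empty column falls back to "any".
        return winners.size() == 1 ? winners.get(0) : "any";
    }

    public static void main(String[] args) {
        SortedMap<Integer, String> rows = new TreeMap<>();
        rows.put(0, "string");   // header row, skipped below
        rows.put(1, "numeric");
        rows.put(2, "numeric");
        rows.put(3, "string");
        System.out.println(guess(rows, 1)); // prints "numeric" (2 votes against 1)

        rows.put(4, "string");
        System.out.println(guess(rows, 1)); // prints "any" (2 votes each, a tie)
    }
}
```

Because tailMap is inclusive of its key, the row whose number equals firstDataRow takes part in the vote; only rows with a smaller number are treated as header.
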
Also used : CellType(org.apache.poi.ss.usermodel.CellType) TDPException(org.talend.dataprep.exception.TDPException) DataSetErrorCodes(org.talend.dataprep.exception.error.DataSetErrorCodes) PushbackInputStream(java.io.PushbackInputStream) LoggerFactory(org.slf4j.LoggerFactory) Schema(org.talend.dataprep.schema.Schema) StringUtils(org.apache.commons.lang3.StringUtils) ArrayList(java.util.ArrayList) STRING(org.talend.dataprep.api.type.Type.STRING) Value(org.springframework.beans.factory.annotation.Value) OptionalLong(java.util.OptionalLong) HSSFDateUtil(org.apache.poi.hssf.usermodel.HSSFDateUtil) ExceptionContext(org.talend.daikon.exception.ExceptionContext) Service(org.springframework.stereotype.Service) Markers(org.talend.dataprep.log.Markers) DATE(org.talend.dataprep.api.type.Type.DATE) Map(java.util.Map) DataprepBundle.message(org.talend.dataprep.i18n.DataprepBundle.message) Cell(org.apache.poi.ss.usermodel.Cell) ANY(org.talend.dataprep.api.type.Type.ANY) WorkbookFactory(org.apache.poi.ss.usermodel.WorkbookFactory) Sheet(org.apache.poi.ss.usermodel.Sheet) Logger(org.slf4j.Logger) Iterator(java.util.Iterator) BOOLEAN(org.talend.dataprep.api.type.Type.BOOLEAN) NUMERIC(org.talend.dataprep.api.type.Type.NUMERIC) IOException(java.io.IOException) SchemaParser(org.talend.dataprep.schema.SchemaParser) StreamingReader(org.talend.dataprep.schema.xls.streaming.StreamingReader) Collectors(java.util.stream.Collectors) FormulaEvaluator(org.apache.poi.ss.usermodel.FormulaEvaluator) FileMagic(org.apache.poi.poifs.filesystem.FileMagic) List(java.util.List) Workbook(org.apache.poi.ss.usermodel.Workbook) Type(org.talend.dataprep.api.type.Type) TreeMap(java.util.TreeMap) Marker(org.slf4j.Marker) StreamingSheet(org.talend.dataprep.schema.xls.streaming.StreamingSheet) Row(org.apache.poi.ss.usermodel.Row) CommonErrorCodes(org.talend.dataprep.exception.error.CommonErrorCodes) Collections(java.util.Collections) SortedMap(java.util.SortedMap) InputStream(java.io.InputStream) 
ColumnMetadata(org.talend.dataprep.api.dataset.ColumnMetadata)