
Example 1 with ConfigException

use of org.embulk.config.ConfigException in project embulk by embulk.

the class RenameFilterPlugin method applyRegexReplaceRule.

private Schema applyRegexReplaceRule(Schema inputSchema, RegexReplaceRule rule) {
    final String match = rule.getMatch();
    final String replace = rule.getReplace();
    Schema.Builder builder = Schema.builder();
    for (Column column : inputSchema.getColumns()) {
        // TODO(dmikurube): Check if we need a kind of sanitization?
        try {
            builder.add(column.getName().replaceAll(match, replace), column.getType());
        } catch (PatternSyntaxException ex) {
            throw new ConfigException(ex);
        }
    }
    return builder.build();
}
Also used: Column (org.embulk.spi.Column), Schema (org.embulk.spi.Schema), ConfigException (org.embulk.config.ConfigException), PatternSyntaxException (java.util.regex.PatternSyntaxException)
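
The method above turns a regex syntax error into a configuration error, so a bad "match" pattern is reported as a user mistake rather than an internal failure. The following is a minimal sketch of that pattern using only the JDK. ConfigError is a hypothetical stand-in for org.embulk.config.ConfigException, and plain strings stand in for Schema columns; neither is Embulk's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.PatternSyntaxException;

final class RegexRenameSketch {
    // Hypothetical stand-in for org.embulk.config.ConfigException.
    static final class ConfigError extends RuntimeException {
        ConfigError(Throwable cause) {
            super(cause);
        }
    }

    // Applies String#replaceAll to every name; an invalid pattern is rethrown as ConfigError.
    static List<String> renameAll(List<String> names, String match, String replace) {
        List<String> renamed = new ArrayList<>();
        for (String name : names) {
            try {
                renamed.add(name.replaceAll(match, replace));
            } catch (PatternSyntaxException ex) {
                throw new ConfigError(ex);
            }
        }
        return renamed;
    }

    public static void main(String[] args) {
        // prints [u_id, u_name]
        System.out.println(renameAll(Arrays.asList("user_id", "user_name"), "^user_", "u_"));
        try {
            renameAll(Arrays.asList("user_id"), "(", "x"); // "(" is an unclosed group
        } catch (ConfigError e) {
            // prints rejected: PatternSyntaxException
            System.out.println("rejected: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

Because the pattern is only compiled inside the loop, an invalid pattern goes unnoticed when there are no columns. Only pattern syntax errors are wrapped here; a replacement string that refers to a missing capture group raises a different exception type.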

Example 2 with ConfigException

use of org.embulk.config.ConfigException in project embulk by embulk.

the class RenameFilterPlugin method applyUniqueNumberSuffixRule.

/**
 * Resolves conflicting column names by suffixing numbers.
 *
 * Conflicts are resolved by the following rules. The rules should not be changed casually because changing the
 * rules breaks compatibility.
 *
 * 1. Count all duplicates in the original column names. Indexes are counted up per original column name.
 * 2. Fix new column names from the left to the right
 *   - Try to append the current index for the original column name (with truncation if requested (not implemented))
 *     - Fix the new name if no duplication is found with fixed column names on the left and original column names
 *     - Retry with an index incremented if a duplication is found with fixed column names on the left
 *
 * Examples:
 *     [c, c1, c1,   c2, c,   c3]
 * ==> [c, c1, c1_2, c2, c_2, c3]
 *
 * If a suffixed name would itself conflict with another column, that index is simply skipped. For example:
 *     [c, c,   c_0, c_1, c_2]
 * ==> [c, c_3, c_0, c_1, c_2]
 *
 * If truncation is requested simultaneously with uniqueness (not implemented), it should work like:
 *     [co, c, co  , c  , co  , c  , ..., co  , c  , co  , c   , co  , c   ]
 * ==> [co, c, co_2, c_2, co_3, c_3, ..., co_9, c_9, c_10, c_11, c_12, c_13] (max_length:4)
 *
 *     [co, co  , co  , ..., co  , c, c  , ..., c  , co  , c  , co  , c  , co  , c   ]
 * ==> [co, co_2, co_3, ..., co_9, c, c_2, ..., c_7, c_10, c_8, c_11, c_9, c_12, c_13] (max_length:4)
 *
 * Note that a delimiter should not be omitted. Recurring conflicts may confuse users.
 *     [c, c,  c,  ..., c,   c,   c,   c,   c1, c1,  c1]
 * NG: [c, c2, c3, ..., c10, c11, c12, c13, c1, c12, c13] (not unique!)
 * ==> [c, c2, c3, ..., c10, c11, c12, c13, c1, c14, c15] (confusing)
 */
private Schema applyUniqueNumberSuffixRule(Schema inputSchema, UniqueNumberSuffixRule rule) {
    final String delimiter = rule.getDelimiter();
    final Optional<Integer> digits = rule.getDigits();
    final Optional<Integer> maxLength = rule.getMaxLength();
    final int offset = rule.getOffset();
    // |delimiter| must consist of just 1 character to check quickly that it does not contain any digit.
    if (delimiter == null || delimiter.length() != 1 || Character.isDigit(delimiter.charAt(0))) {
        throw new ConfigException("\"delimiter\" in rule \"unique_number_suffix\" must contain just 1 non-digit character");
    }
    if (maxLength.isPresent() && maxLength.get() < minimumMaxLengthInUniqueNumberSuffix) {
        throw new ConfigException("\"max_length\" in rule \"unique_number_suffix\" must be larger than " + (minimumMaxLengthInUniqueNumberSuffix - 1));
    }
    if (maxLength.isPresent() && digits.isPresent() && maxLength.get() < digits.get() + delimiter.length()) {
        throw new ConfigException("\"max_length\" in rule \"unique_number_suffix\" must be larger than \"digits\"");
    }
    int digitsOfNumberOfColumns = Integer.toString(inputSchema.getColumnCount() + offset - 1).length();
    if (maxLength.isPresent() && maxLength.get() <= digitsOfNumberOfColumns) {
        throw new ConfigException("\"max_length\" in rule \"unique_number_suffix\" must be larger than digits of ((number of columns) + \"offset\" - 1)");
    }
    if (digits.isPresent() && digits.get() <= digitsOfNumberOfColumns) {
        throw new ConfigException("\"digits\" in rule \"unique_number_suffix\" must be larger than digits of ((number of columns) + \"offset\" - 1)");
    }
    // Columns should not be truncated here initially. Uniqueness should be determined before truncation.
    // Iterate for initial states.
    HashSet<String> originalColumnNames = new HashSet<>();
    HashMap<String, Integer> columnNameCountups = new HashMap<>();
    for (Column column : inputSchema.getColumns()) {
        originalColumnNames.add(column.getName());
        columnNameCountups.put(column.getName(), offset);
    }
    Schema.Builder outputBuilder = Schema.builder();
    HashSet<String> fixedColumnNames = new HashSet<>();
    for (Column column : inputSchema.getColumns()) {
        String truncatedName = column.getName();
        if (column.getName().length() > maxLength.or(Integer.MAX_VALUE)) {
            truncatedName = column.getName().substring(0, maxLength.get());
        }
        // Conflicts with original names do not matter here.
        if (!fixedColumnNames.contains(truncatedName)) {
            // The original name is counted up.
            columnNameCountups.put(column.getName(), columnNameCountups.get(column.getName()) + 1);
            // The truncated name is fixed.
            fixedColumnNames.add(truncatedName);
            outputBuilder.add(truncatedName, column.getType());
            continue;
        }
        int index = columnNameCountups.get(column.getName());
        String concatenatedName;
        do {
            // This can be replaced with String#format(Locale.ENGLISH, ...), but Java's String#format does not
            // have variable widths ("%*d" in C's printf). It cannot be very simple with String#format.
            String differentiatorString = Integer.toString(index);
            if (digits.isPresent() && (digits.get() > differentiatorString.length())) {
                differentiatorString = Strings.repeat("0", digits.get() - differentiatorString.length()) + differentiatorString;
            }
            differentiatorString = delimiter + differentiatorString;
            concatenatedName = column.getName() + differentiatorString;
            if (concatenatedName.length() > maxLength.or(Integer.MAX_VALUE)) {
                concatenatedName = column.getName().substring(0, maxLength.get() - differentiatorString.length()) + differentiatorString;
            }
            ++index;
        // Conflicts with original names matter when creating new names with suffixes.
        } while (fixedColumnNames.contains(concatenatedName) || originalColumnNames.contains(concatenatedName));
        // The original name is counted up.
        columnNameCountups.put(column.getName(), index);
        // The concatenated&truncated name is fixed.
        fixedColumnNames.add(concatenatedName);
        outputBuilder.add(concatenatedName, column.getType());
    }
    return outputBuilder.build();
}
Also used: HashMap (java.util.HashMap), Column (org.embulk.spi.Column), Schema (org.embulk.spi.Schema), ConfigException (org.embulk.config.ConfigException), HashSet (java.util.HashSet)
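
The core of the suffixing rule is easier to follow without the "digits" and "max_length" handling. The sketch below is a simplified, JDK-only version that works on plain strings instead of Schema columns. The class and method names are hypothetical, and the offset of 1 used in main is an assumption chosen because it reproduces the two Javadoc examples above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class UniqueSuffixSketch {
    // Simplified sketch: no zero padding ("digits") and no truncation ("max_length").
    static List<String> uniquify(List<String> names, String delimiter, int offset) {
        Set<String> originalNames = new HashSet<>(names);
        Map<String, Integer> counters = new HashMap<>();
        for (String name : names) {
            counters.put(name, offset);
        }
        Set<String> fixedNames = new HashSet<>();
        List<String> output = new ArrayList<>();
        for (String name : names) {
            if (!fixedNames.contains(name)) {
                // First occurrence keeps its name; its counter still advances.
                counters.put(name, counters.get(name) + 1);
                fixedNames.add(name);
                output.add(name);
                continue;
            }
            int index = counters.get(name);
            String candidate;
            do {
                candidate = name + delimiter + index;
                ++index;
                // Skip indexes that collide with fixed names or with any original name.
            } while (fixedNames.contains(candidate) || originalNames.contains(candidate));
            counters.put(name, index);
            fixedNames.add(candidate);
            output.add(candidate);
        }
        return output;
    }

    public static void main(String[] args) {
        // prints [c, c1, c1_2, c2, c_2, c3]
        System.out.println(uniquify(Arrays.asList("c", "c1", "c1", "c2", "c", "c3"), "_", 1));
        // prints [c, c_3, c_0, c_1, c_2]
        System.out.println(uniquify(Arrays.asList("c", "c", "c_0", "c_1", "c_2"), "_", 1));
    }
}
```

In the second call, the candidate c_2 is rejected because it already exists as an original column name, so the duplicate c becomes c_3.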

Example 3 with ConfigException

use of org.embulk.config.ConfigException in project embulk by embulk.

the class TestRenameFilterPlugin method checkConfigExceptionIfUnknownRenamingOperatorName.

@Test
public void checkConfigExceptionIfUnknownRenamingOperatorName() {
    ConfigSource pluginConfig = Exec.newConfigSource().set("rules", ImmutableList.of(ImmutableMap.of("rule", "some_unknown_renaming_operator")));
    try {
        filter.transaction(pluginConfig, SCHEMA, new FilterPlugin.Control() {
            @Override
            public void run(TaskSource task, Schema schema) {
            }
        });
        fail();
    } catch (Throwable t) {
        assertTrue(t instanceof ConfigException);
    }
}
Also used: ConfigSource (org.embulk.config.ConfigSource), FilterPlugin (org.embulk.spi.FilterPlugin), Schema (org.embulk.spi.Schema), ConfigException (org.embulk.config.ConfigException), SchemaConfigException (org.embulk.spi.SchemaConfigException), TaskSource (org.embulk.config.TaskSource), Test (org.junit.Test)

Example 4 with ConfigException

use of org.embulk.config.ConfigException in project embulk by embulk.

the class TestRenameFilterPlugin method checkConfigExceptionIfUnknownListTypeOfRenamingOperator.

@Test
public void checkConfigExceptionIfUnknownListTypeOfRenamingOperator() {
    // A list [] shouldn't come as a renaming rule.
    ConfigSource pluginConfig = Exec.newConfigSource().set("rules", ImmutableList.of(ImmutableList.of("listed_operator1", "listed_operator2")));
    try {
        filter.transaction(pluginConfig, SCHEMA, new FilterPlugin.Control() {
            @Override
            public void run(TaskSource task, Schema schema) {
            }
        });
        fail();
    } catch (Throwable t) {
        assertTrue(t instanceof ConfigException);
    }
}
Also used: ConfigSource (org.embulk.config.ConfigSource), FilterPlugin (org.embulk.spi.FilterPlugin), Schema (org.embulk.spi.Schema), ConfigException (org.embulk.config.ConfigException), SchemaConfigException (org.embulk.spi.SchemaConfigException), TaskSource (org.embulk.config.TaskSource), Test (org.junit.Test)

Example 5 with ConfigException

use of org.embulk.config.ConfigException in project embulk by embulk.

the class TestRenameFilterPlugin method checkConfigExceptionIfUnknownStringTypeOfRenamingOperator.

@Test
public void checkConfigExceptionIfUnknownStringTypeOfRenamingOperator() {
    // A simple string shouldn't come as a renaming rule.
    ConfigSource pluginConfig = Exec.newConfigSource().set("rules", ImmutableList.of("string_rule"));
    try {
        filter.transaction(pluginConfig, SCHEMA, new FilterPlugin.Control() {
            @Override
            public void run(TaskSource task, Schema schema) {
            }
        });
        fail();
    } catch (Throwable t) {
        assertTrue(t instanceof ConfigException);
    }
}
Also used: ConfigSource (org.embulk.config.ConfigSource), FilterPlugin (org.embulk.spi.FilterPlugin), Schema (org.embulk.spi.Schema), ConfigException (org.embulk.config.ConfigException), SchemaConfigException (org.embulk.spi.SchemaConfigException), TaskSource (org.embulk.config.TaskSource), Test (org.junit.Test)
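
These tests call fail() inside the try block and then catch Throwable, so when nothing is thrown the AssertionError from fail() is itself caught and the test fails only through the generic instanceof assertion. A tighter way to express the same expectation is sketched below with the JDK alone. The expectThrows helper is hypothetical and not part of Embulk; JUnit 4.13 and later ship a comparable Assert.assertThrows.

```java
final class ExpectThrowsSketch {
    // Hypothetical helper: runs the body and returns the exception if it has the expected type.
    static <T extends Throwable> T expectThrows(Class<T> expected, Runnable body) {
        try {
            body.run();
        } catch (Throwable t) {
            if (expected.isInstance(t)) {
                return expected.cast(t);
            }
            // A different exception type is reported with the original as the cause.
            throw new AssertionError("expected " + expected.getName() + " but got " + t, t);
        }
        // Thrown outside the try block, so it cannot be swallowed by the catch above.
        throw new AssertionError("expected " + expected.getName() + " but nothing was thrown");
    }

    public static void main(String[] args) {
        IllegalArgumentException e = expectThrows(IllegalArgumentException.class, () -> {
            throw new IllegalArgumentException("unknown rule");
        });
        // prints unknown rule
        System.out.println(e.getMessage());
    }
}
```

Returning the caught exception lets a test also check its message or cause, which the instanceof check alone does not allow.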

Aggregations

ConfigException (org.embulk.config.ConfigException) 11
Schema (org.embulk.spi.Schema) 8
Column (org.embulk.spi.Column) 5
SchemaConfigException (org.embulk.spi.SchemaConfigException) 5
ConfigSource (org.embulk.config.ConfigSource) 3
TaskSource (org.embulk.config.TaskSource) 3
FilterPlugin (org.embulk.spi.FilterPlugin) 3
Test (org.junit.Test) 3
ImmutableList (com.google.common.collect.ImmutableList) 2
HashMap (java.util.HashMap) 1
HashSet (java.util.HashSet) 1
PatternSyntaxException (java.util.regex.PatternSyntaxException) 1
TimestampFormatter (org.embulk.spi.time.TimestampFormatter) 1
TimestampParser (org.embulk.spi.time.TimestampParser) 1
BooleanType (org.embulk.spi.type.BooleanType) 1
DoubleType (org.embulk.spi.type.DoubleType) 1
JsonType (org.embulk.spi.type.JsonType) 1
LongType (org.embulk.spi.type.LongType) 1
StringType (org.embulk.spi.type.StringType) 1
TimestampType (org.embulk.spi.type.TimestampType) 1