Search in sources :

Example 1 with InlineTable

use of org.dmg.pmml.InlineTable in project jpmml-sparkml by jpmml.

the class VectorIndexerModelConverter method encodeFeatures.

@Override
public List<Feature> encodeFeatures(SparkMLEncoder encoder) {
    VectorIndexerModel transformer = getTransformer();
    List<Feature> features = encoder.getFeatures(transformer.getInputCol());
    int numFeatures = transformer.numFeatures();
    if (numFeatures != features.size()) {
        throw new IllegalArgumentException("Expected " + numFeatures + " features, got " + features.size() + " features");
    }
    Map<Integer, Map<Double, Integer>> categoryMaps = transformer.javaCategoryMaps();
    List<Feature> result = new ArrayList<>();
    for (int i = 0; i < numFeatures; i++) {
        Feature feature = features.get(i);
        Map<Double, Integer> categoryMap = categoryMaps.get(i);
        if (categoryMap != null) {
            List<String> categories = new ArrayList<>();
            List<String> values = new ArrayList<>();
            DocumentBuilder documentBuilder = DOMUtil.createDocumentBuilder();
            InlineTable inlineTable = new InlineTable();
            List<String> columns = Arrays.asList("input", "output");
            List<Map.Entry<Double, Integer>> entries = new ArrayList<>(categoryMap.entrySet());
            Collections.sort(entries, VectorIndexerModelConverter.COMPARATOR);
            for (Map.Entry<Double, Integer> entry : entries) {
                String category = ValueUtil.formatValue(entry.getKey());
                categories.add(category);
                String value = ValueUtil.formatValue(entry.getValue());
                values.add(value);
                Row row = DOMUtil.createRow(documentBuilder, columns, Arrays.asList(category, value));
                inlineTable.addRows(row);
            }
            encoder.toCategorical(feature.getName(), categories);
            MapValues mapValues = new MapValues().addFieldColumnPairs(new FieldColumnPair(feature.getName(), columns.get(0))).setOutputColumn(columns.get(1)).setInlineTable(inlineTable);
            DerivedField derivedField = encoder.createDerivedField(formatName(transformer, i), OpType.CATEGORICAL, DataType.INTEGER, mapValues);
            result.add(new CategoricalFeature(encoder, derivedField, values));
        } else {
            result.add((ContinuousFeature) feature);
        }
    }
    return result;
}
Also used : InlineTable(org.dmg.pmml.InlineTable) ArrayList(java.util.ArrayList) FieldColumnPair(org.dmg.pmml.FieldColumnPair) Feature(org.jpmml.converter.Feature) ContinuousFeature(org.jpmml.converter.ContinuousFeature) CategoricalFeature(org.jpmml.converter.CategoricalFeature) CategoricalFeature(org.jpmml.converter.CategoricalFeature) DocumentBuilder(javax.xml.parsers.DocumentBuilder) MapValues(org.dmg.pmml.MapValues) VectorIndexerModel(org.apache.spark.ml.feature.VectorIndexerModel) Row(org.dmg.pmml.Row) Map(java.util.Map) DerivedField(org.dmg.pmml.DerivedField)

Example 2 with InlineTable

use of org.dmg.pmml.InlineTable in project jpmml-r by jpmml.

the class FormulaUtil method createMapValues.

private static MapValues createMapValues(FieldName name, Map<String, String> mapping, List<String> categories) {
    Set<String> inputs = new LinkedHashSet<>(mapping.keySet());
    Set<String> outputs = new LinkedHashSet<>(mapping.values());
    for (String category : categories) {
        // Assume disjoint input and output value spaces
        if (outputs.contains(category)) {
            continue;
        }
        mapping.put(category, category);
    }
    List<String> columns = Arrays.asList("from", "to");
    InlineTable inlineTable = new InlineTable();
    DocumentBuilder documentBuilder = DOMUtil.createDocumentBuilder();
    Collection<Map.Entry<String, String>> entries = mapping.entrySet();
    for (Map.Entry<String, String> entry : entries) {
        Row row = DOMUtil.createRow(documentBuilder, columns, Arrays.asList(entry.getKey(), entry.getValue()));
        inlineTable.addRows(row);
    }
    MapValues mapValues = new MapValues().addFieldColumnPairs(new FieldColumnPair(name, columns.get(0))).setOutputColumn(columns.get(1)).setInlineTable(inlineTable);
    return mapValues;
}
Also used : LinkedHashSet(java.util.LinkedHashSet) InlineTable(org.dmg.pmml.InlineTable) FieldColumnPair(org.dmg.pmml.FieldColumnPair) DocumentBuilder(javax.xml.parsers.DocumentBuilder) MapValues(org.dmg.pmml.MapValues) Row(org.dmg.pmml.Row) LinkedHashMap(java.util.LinkedHashMap) Map(java.util.Map)

Example 3 with InlineTable

use of org.dmg.pmml.InlineTable in project jpmml-sparkml by jpmml.

the class CountVectorizerModelConverter method encodeFeatures.

@Override
public List<Feature> encodeFeatures(SparkMLEncoder encoder) {
    CountVectorizerModel transformer = getTransformer();
    DocumentFeature documentFeature = (DocumentFeature) encoder.getOnlyFeature(transformer.getInputCol());
    ParameterField documentField = new ParameterField(FieldName.create("document"));
    ParameterField termField = new ParameterField(FieldName.create("term"));
    TextIndex textIndex = new TextIndex(documentField.getName()).setTokenize(Boolean.TRUE).setWordSeparatorCharacterRE(documentFeature.getWordSeparatorRE()).setLocalTermWeights(transformer.getBinary() ? TextIndex.LocalTermWeights.BINARY : null).setExpression(new FieldRef(termField.getName()));
    Set<DocumentFeature.StopWordSet> stopWordSets = documentFeature.getStopWordSets();
    for (DocumentFeature.StopWordSet stopWordSet : stopWordSets) {
        if (stopWordSet.isEmpty()) {
            continue;
        }
        DocumentBuilder documentBuilder = DOMUtil.createDocumentBuilder();
        String tokenRE;
        String wordSeparatorRE = documentFeature.getWordSeparatorRE();
        switch(wordSeparatorRE) {
            case "\\s+":
                tokenRE = "(^|\\s+)\\p{Punct}*(" + JOINER.join(stopWordSet) + ")\\p{Punct}*(\\s+|$)";
                break;
            case "\\W+":
                tokenRE = "(\\W+)(" + JOINER.join(stopWordSet) + ")(\\W+)";
                break;
            default:
                throw new IllegalArgumentException("Expected \"\\s+\" or \"\\W+\" as splitter regex pattern, got \"" + wordSeparatorRE + "\"");
        }
        InlineTable inlineTable = new InlineTable().addRows(DOMUtil.createRow(documentBuilder, Arrays.asList("string", "stem", "regex"), Arrays.asList(tokenRE, " ", "true")));
        TextIndexNormalization textIndexNormalization = new TextIndexNormalization().setCaseSensitive(stopWordSet.isCaseSensitive()).setRecursive(// Handles consecutive matches. See http://stackoverflow.com/a/25085385
        Boolean.TRUE).setInlineTable(inlineTable);
        textIndex.addTextIndexNormalizations(textIndexNormalization);
    }
    DefineFunction defineFunction = new DefineFunction("tf" + "@" + String.valueOf(CountVectorizerModelConverter.SEQUENCE.getAndIncrement()), OpType.CONTINUOUS, null).setDataType(DataType.INTEGER).addParameterFields(documentField, termField).setExpression(textIndex);
    encoder.addDefineFunction(defineFunction);
    List<Feature> result = new ArrayList<>();
    String[] vocabulary = transformer.vocabulary();
    for (int i = 0; i < vocabulary.length; i++) {
        String term = vocabulary[i];
        if (TermUtil.hasPunctuation(term)) {
            throw new IllegalArgumentException(term);
        }
        result.add(new TermFeature(encoder, defineFunction, documentFeature, term));
    }
    return result;
}
Also used : InlineTable(org.dmg.pmml.InlineTable) FieldRef(org.dmg.pmml.FieldRef) TextIndex(org.dmg.pmml.TextIndex) DocumentFeature(org.jpmml.sparkml.DocumentFeature) ArrayList(java.util.ArrayList) Feature(org.jpmml.converter.Feature) DocumentFeature(org.jpmml.sparkml.DocumentFeature) TermFeature(org.jpmml.sparkml.TermFeature) TermFeature(org.jpmml.sparkml.TermFeature) TextIndexNormalization(org.dmg.pmml.TextIndexNormalization) CountVectorizerModel(org.apache.spark.ml.feature.CountVectorizerModel) DocumentBuilder(javax.xml.parsers.DocumentBuilder) DefineFunction(org.dmg.pmml.DefineFunction) ParameterField(org.dmg.pmml.ParameterField)

Example 4 with InlineTable

use of org.dmg.pmml.InlineTable in project jpmml-sparkml by jpmml.

the class ClassificationModelConverter method registerOutputFields.

@Override
public List<OutputField> registerOutputFields(Label label, SparkMLEncoder encoder) {
    T model = getTransformer();
    CategoricalLabel categoricalLabel = (CategoricalLabel) label;
    List<OutputField> result = new ArrayList<>();
    String predictionCol = model.getPredictionCol();
    OutputField pmmlPredictedField = ModelUtil.createPredictedField(FieldName.create("pmml(" + predictionCol + ")"), categoricalLabel.getDataType(), OpType.CATEGORICAL);
    result.add(pmmlPredictedField);
    List<String> categories = new ArrayList<>();
    DocumentBuilder documentBuilder = DOMUtil.createDocumentBuilder();
    InlineTable inlineTable = new InlineTable();
    List<String> columns = Arrays.asList("input", "output");
    for (int i = 0; i < categoricalLabel.size(); i++) {
        String value = categoricalLabel.getValue(i);
        String category = String.valueOf(i);
        categories.add(category);
        Row row = DOMUtil.createRow(documentBuilder, columns, Arrays.asList(value, category));
        inlineTable.addRows(row);
    }
    MapValues mapValues = new MapValues().addFieldColumnPairs(new FieldColumnPair(pmmlPredictedField.getName(), columns.get(0))).setOutputColumn(columns.get(1)).setInlineTable(inlineTable);
    final OutputField predictedField = new OutputField(FieldName.create(predictionCol), DataType.DOUBLE).setOpType(OpType.CATEGORICAL).setResultFeature(ResultFeature.TRANSFORMED_VALUE).setExpression(mapValues);
    result.add(predictedField);
    Feature feature = new CategoricalFeature(encoder, predictedField.getName(), predictedField.getDataType(), categories) {

        @Override
        public ContinuousFeature toContinuousFeature() {
            PMMLEncoder encoder = ensureEncoder();
            return new ContinuousFeature(encoder, getName(), getDataType());
        }
    };
    encoder.putOnlyFeature(predictionCol, feature);
    if (model instanceof HasProbabilityCol) {
        HasProbabilityCol hasProbabilityCol = (HasProbabilityCol) model;
        String probabilityCol = hasProbabilityCol.getProbabilityCol();
        List<Feature> features = new ArrayList<>();
        for (int i = 0; i < categoricalLabel.size(); i++) {
            String value = categoricalLabel.getValue(i);
            OutputField probabilityField = ModelUtil.createProbabilityField(FieldName.create(probabilityCol + "(" + value + ")"), DataType.DOUBLE, value);
            result.add(probabilityField);
            features.add(new ContinuousFeature(encoder, probabilityField.getName(), probabilityField.getDataType()));
        }
        encoder.putFeatures(probabilityCol, features);
    }
    return result;
}
Also used : InlineTable(org.dmg.pmml.InlineTable) HasProbabilityCol(org.apache.spark.ml.param.shared.HasProbabilityCol) PMMLEncoder(org.jpmml.converter.PMMLEncoder) ArrayList(java.util.ArrayList) FieldColumnPair(org.dmg.pmml.FieldColumnPair) ResultFeature(org.dmg.pmml.ResultFeature) ContinuousFeature(org.jpmml.converter.ContinuousFeature) Feature(org.jpmml.converter.Feature) CategoricalFeature(org.jpmml.converter.CategoricalFeature) CategoricalFeature(org.jpmml.converter.CategoricalFeature) ContinuousFeature(org.jpmml.converter.ContinuousFeature) DocumentBuilder(javax.xml.parsers.DocumentBuilder) MapValues(org.dmg.pmml.MapValues) CategoricalLabel(org.jpmml.converter.CategoricalLabel) OutputField(org.dmg.pmml.OutputField) Row(org.dmg.pmml.Row)

Aggregations

DocumentBuilder (javax.xml.parsers.DocumentBuilder)4 InlineTable (org.dmg.pmml.InlineTable)4 ArrayList (java.util.ArrayList)3 FieldColumnPair (org.dmg.pmml.FieldColumnPair)3 MapValues (org.dmg.pmml.MapValues)3 Row (org.dmg.pmml.Row)3 Feature (org.jpmml.converter.Feature)3 Map (java.util.Map)2 CategoricalFeature (org.jpmml.converter.CategoricalFeature)2 ContinuousFeature (org.jpmml.converter.ContinuousFeature)2 LinkedHashMap (java.util.LinkedHashMap)1 LinkedHashSet (java.util.LinkedHashSet)1 CountVectorizerModel (org.apache.spark.ml.feature.CountVectorizerModel)1 VectorIndexerModel (org.apache.spark.ml.feature.VectorIndexerModel)1 HasProbabilityCol (org.apache.spark.ml.param.shared.HasProbabilityCol)1 DefineFunction (org.dmg.pmml.DefineFunction)1 DerivedField (org.dmg.pmml.DerivedField)1 FieldRef (org.dmg.pmml.FieldRef)1 OutputField (org.dmg.pmml.OutputField)1 ParameterField (org.dmg.pmml.ParameterField)1