Example 1 with StringClusters

Use of org.talend.dataquality.record.linkage.analyzer.StringClusters in the project data-prep by Talend.

In the class ClusterParameters, method getParameters.

@Override
public GenericParameter getParameters(final String columnId, final DataSet content) {
    // Analyze clusters service
    StringsClusterAnalyzer clusterAnalyzer = new StringsClusterAnalyzer();
    clusterAnalyzer.withPostMerges(new PostMerge(AttributeMatcherType.DOUBLE_METAPHONE, 0.8f));
    clusterAnalyzer.init();
    content.getRecords().forEach(row -> {
        String value = row.get(columnId);
        clusterAnalyzer.analyze(value);
    });
    // TDP-5860: this uses Soundex (a phonetic algorithm for indexing names by sound, as pronounced in English),
    // so it can log an IllegalArgumentException if a character is not mapped.
    // See SoundexMatcher on the DQ side.
    clusterAnalyzer.end();
    // Build results
    final Clusters.Builder builder = Clusters.builder().title(DataprepBundle.message("parameter.textclustering.title.1")).title(DataprepBundle.message("parameter.textclustering.title.2"));
    final StringClusters result = clusterAnalyzer.getResult().get(0);
    for (StringClusters.StringCluster cluster : result) {
        // String clustering may cluster null / empty values, however not interesting for data prep.
        if (!StringUtils.isEmpty(cluster.survivedValue)) {
            final ClusterItem.Builder currentCluster = ClusterItem.builder();
            for (String value : cluster.originalValues) {
                currentCluster.parameter(new ConstantParameter(value, ParameterType.BOOLEAN));
            }
            currentCluster.replace(Parameter.parameter(LocaleContextHolder.getLocale()).setName("replaceValue").setType(ParameterType.STRING).setDefaultValue(cluster.survivedValue).build(null));
            builder.cluster(currentCluster);
        }
    }
    return new GenericParameter("cluster", builder.build());
}
Also used: StringClusters(org.talend.dataquality.record.linkage.analyzer.StringClusters), StringsClusterAnalyzer(org.talend.dataquality.record.linkage.analyzer.StringsClusterAnalyzer), GenericParameter(org.talend.dataprep.transformation.api.action.dynamic.GenericParameter), ClusterItem(org.talend.dataprep.parameters.ClusterItem), PostMerge(org.talend.dataquality.record.linkage.analyzer.PostMerge), Clusters(org.talend.dataprep.parameters.Clusters)
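
The TDP-5860 comment in the method above says that Soundex can fail with an IllegalArgumentException when a character is not mapped. The following is a minimal, self-contained sketch of classic American Soundex that shows why this can happen. It is not the SoundexMatcher from Talend Data Quality: the class name SoundexSketch, the method encode and the code table are assumptions made for illustration only, and the real library may behave differently.

```java
import java.util.Locale;

// Hypothetical illustration only; not the Talend DQ SoundexMatcher.
final class SoundexSketch {

    // Soundex digit for each letter A..Z.
    // '0' marks vowels and Y (dropped, but they separate repeated codes),
    // '-' marks H and W (dropped, and they do NOT separate repeated codes).
    private static final String CODES = "0123012-02245501262301-202";

    private SoundexSketch() {
    }

    // Returns the 4-character Soundex key, or "" for null/empty input.
    static String encode(String input) {
        if (input == null || input.isEmpty()) {
            return "";
        }
        String upper = input.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(4);
        char previous = codeOf(upper.charAt(0));
        out.append(upper.charAt(0)); // the first letter is kept as-is
        for (int i = 1; i < upper.length() && out.length() < 4; i++) {
            char code = codeOf(upper.charAt(i));
            if (code == '-') {
                continue; // H or W: skip without resetting the previous code
            }
            if (code != '0' && code != previous) {
                out.append(code);
            }
            previous = code;
        }
        while (out.length() < 4) {
            out.append('0'); // pad short keys with zeros
        }
        return out.toString();
    }

    // Only A..Z are mapped; anything else is rejected.
    private static char codeOf(char c) {
        if (c < 'A' || c > 'Z') {
            throw new IllegalArgumentException("The character is not mapped: " + c);
        }
        return CODES.charAt(c - 'A');
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert")); // prints "R163"
        System.out.println(encode("Rupert")); // prints "R163"
        try {
            encode("Jos\u00e9"); // accented 'e' is outside A..Z
        } catch (IllegalArgumentException e) {
            System.out.println("Skipped value: " + e.getMessage());
        }
    }
}
```

In this sketch "Robert" and "Rupert" share the same key, which is the property a phonetic matcher relies on to put them in one cluster. A value containing an accented or non-Latin character is rejected, so a caller would need to catch the exception or normalize the value first. The loop stops once four characters are produced, so an unmapped character late in a long value may go unnoticed.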
