Example 1 with TextClassificationException

Use of org.dkpro.tc.api.exception.TextClassificationException in project dkpro-tc by dkpro.

Class BrownClusterFeature, method init.

private void init() throws TextClassificationException {
    if (map != null) {
        return;
    }
    map = new HashMap<String, String>();
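    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bf = openFile()) {
        String line;
        while ((line = bf.readLine()) != null) {
            // Each line holds tab-separated fields; the second field is used as the key
            String[] split = line.split("\t");
            map.put(split[1], split[0]);
        }
    } catch (Exception e) {
        throw new TextClassificationException(e);
    }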
}
Also used : TextClassificationException(org.dkpro.tc.api.exception.TextClassificationException) BufferedReader(java.io.BufferedReader) ResourceInitializationException(org.apache.uima.resource.ResourceInitializationException)
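
The load-and-wrap pattern in Example 1 can be tried outside of DKPro TC with a small stand-alone sketch. Everything named here is hypothetical and not part of the project: LookupLoadException stands in for a domain-specific checked exception, ClusterMapLoader for the feature class, and the sample lines are made up. Unlike the original, this sketch skips lines with fewer than two fields and catches only IOException; both are assumptions of the sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a domain-specific checked exception.
class LookupLoadException extends Exception {
    LookupLoadException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical loader; not part of dkpro-tc.
class ClusterMapLoader {
    // Reads "value<TAB>key" lines into a key -> value map.
    static Map<String, String> load(BufferedReader reader) throws LookupLoadException {
        Map<String, String> result = new HashMap<>();
        // try-with-resources closes the reader on success and on failure
        try (BufferedReader in = reader) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] split = line.split("\t");
                if (split.length < 2) {
                    continue; // assumption of this sketch: skip malformed lines
                }
                result.put(split[1], split[0]);
            }
        } catch (IOException e) {
            // Wrap the low-level failure, keeping the original cause
            throw new LookupLoadException("Could not read lookup file", e);
        }
        return result;
    }

    public static void main(String[] args) throws LookupLoadException {
        BufferedReader reader = new BufferedReader(new StringReader("0110\tdog\n0111\tcat\n"));
        System.out.println(load(reader).get("dog")); // prints 0110
    }
}
```

Building the map in a local variable and returning it only after a successful read means a caller never sees a half-filled map.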

Example 2 with TextClassificationException

Use of org.dkpro.tc.api.exception.TextClassificationException in project dkpro-tc by dkpro.

Class SimilarityPairFeatureExtractor, method extract.

@Override
public Set<Feature> extract(JCas view1, JCas view2) throws TextClassificationException {
    try {
        double similarity;
        switch(textSimilarityResource.getMode()) {
            case text:
                similarity = textSimilarityResource.getSimilarity(view1.getDocumentText(), view2.getDocumentText());
                break;
            case jcas:
                similarity = ((JCasTextSimilarityMeasure) textSimilarityResource).getSimilarity(view1, view2);
                break;
            default:
                List<String> f1 = getItems(view1);
                List<String> f2 = getItems(view2);
                // Remove null and "_" tokens
                f1.removeIf(item -> item == null || item.equals("_"));
                f2.removeIf(item -> item == null || item.equals("_"));
                similarity = textSimilarityResource.getSimilarity(f1, f2);
        }
        return new Feature("Similarity" + textSimilarityResource.getName(), similarity, FeatureType.NUMERIC).asSet();
    } catch (FeaturePathException | SimilarityException e) {
        throw new TextClassificationException(e);
    }
}
Also used : TextClassificationException(org.dkpro.tc.api.exception.TextClassificationException) FeaturePathException(de.tudarmstadt.ukp.dkpro.core.api.featurepath.FeaturePathException) SimilarityException(dkpro.similarity.algorithms.api.SimilarityException) Feature(org.dkpro.tc.api.features.Feature)

Example 3 with TextClassificationException

Use of org.dkpro.tc.api.exception.TextClassificationException in project dkpro-tc by dkpro.

Class CosineFeatureExtractor, method extract.

@Override
public Set<Feature> extract(JCas view1, JCas view2) throws TextClassificationException {
    try {
        TextClassificationTarget aTarget1 = JCasUtil.selectSingle(view1, TextClassificationTarget.class);
        TextClassificationTarget aTarget2 = JCasUtil.selectSingle(view2, TextClassificationTarget.class);
        // Note: getSimilarity(String, String) is *not* a convenience
        // method for getSimilarity(Collection<String>, Collection<String>).
        Set<String> text1 = NGramUtils.getDocumentNgrams(view1, aTarget1, true, false, 1, 1, stopwords, ngramAnnotationType).getKeys();
        Set<String> text2 = NGramUtils.getDocumentNgrams(view2, aTarget2, true, false, 1, 1, stopwords, ngramAnnotationType).getKeys();
        double similarity = measure.getSimilarity(text1, text2);
        // Temporary fix for DKPro Similarity Issue 30
        if (Double.isNaN(similarity)) {
            similarity = 0.0;
        }
        return new Feature("Similarity" + measure.getName(), similarity, FeatureType.NUMERIC).asSet();
    } catch (SimilarityException e) {
        throw new TextClassificationException(e);
    }
}
Also used : TextClassificationException(org.dkpro.tc.api.exception.TextClassificationException) TextClassificationTarget(org.dkpro.tc.api.type.TextClassificationTarget) SimilarityException(dkpro.similarity.algorithms.api.SimilarityException) Feature(org.dkpro.tc.api.features.Feature)
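
The NaN guard in Example 3 is easier to see with a toy measure. The sketch below is not the DKPro Similarity implementation, and the source does not say what Issue 30 actually is; SetCosine and both of its methods are hypothetical. It only shows one way a cosine-style measure can produce NaN (a 0/0 division when an input is empty) and how the same guard maps that case to 0.0.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy measure; not part of DKPro Similarity.
class SetCosine {
    // Cosine similarity of two sets treated as binary vectors:
    // |A intersect B| / sqrt(|A| * |B|). Yields NaN when both counts are zero (0 / 0.0).
    static double rawCosine(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return shared.size() / Math.sqrt((double) a.size() * b.size());
    }

    // Same guard as in the example: treat an undefined similarity as 0.0.
    static double safeCosine(Set<String> a, Set<String> b) {
        double similarity = rawCosine(a, b);
        return Double.isNaN(similarity) ? 0.0 : similarity;
    }
}
```

Mapping NaN to 0.0 keeps the feature numeric, at the cost of making "undefined" indistinguishable from "no overlap".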

Example 4 with TextClassificationException

Use of org.dkpro.tc.api.exception.TextClassificationException in project dkpro-tc by dkpro.

Class PhoneticNGramMC, method getDocumentPhoneticNgrams.

public static FrequencyDistribution<String> getDocumentPhoneticNgrams(JCas jcas, Annotation target, int minN, int maxN) throws TextClassificationException {
    StringEncoder encoder;
    String languageCode = jcas.getDocumentLanguage();
    // Constant-first equals avoids a NullPointerException if no language is set
    if ("en".equals(languageCode)) {
        encoder = new Soundex();
    } else if ("de".equals(languageCode)) {
        encoder = new ColognePhonetic();
    } else {
        throw new TextClassificationException("Language code '" + languageCode + "' not supported by phonetic ngrams FE.");
    }
    FrequencyDistribution<String> phoneticNgrams = new FrequencyDistribution<String>();
    for (Sentence s : selectCovered(jcas, Sentence.class, target)) {
        List<String> phoneticStrings = new ArrayList<String>();
        for (Token t : selectCovered(jcas, Token.class, s)) {
            try {
                phoneticStrings.add(encoder.encode(t.getCoveredText()));
            } catch (EncoderException e) {
                throw new TextClassificationException(e);
            }
        }
        String[] array = phoneticStrings.toArray(new String[phoneticStrings.size()]);
        for (List<String> ngram : new NGramStringListIterable(array, minN, maxN)) {
            phoneticNgrams.inc(StringUtils.join(ngram, NGRAM_GLUE));
        }
    }
    return phoneticNgrams;
}
Also used : TextClassificationException(org.dkpro.tc.api.exception.TextClassificationException) ArrayList(java.util.ArrayList) Token(de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token) ColognePhonetic(org.apache.commons.codec.language.ColognePhonetic) NGramStringListIterable(de.tudarmstadt.ukp.dkpro.core.ngrams.util.NGramStringListIterable) StringEncoder(org.apache.commons.codec.StringEncoder) Soundex(org.apache.commons.codec.language.Soundex) EncoderException(org.apache.commons.codec.EncoderException) Sentence(de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence) FrequencyDistribution(de.tudarmstadt.ukp.dkpro.core.api.frequency.util.FrequencyDistribution)
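
The n-gram step in Example 4 (joining every run of minN to maxN consecutive encoded tokens with a glue string) can be sketched with plain Java collections. NGramSketch is hypothetical and is not the NGramStringListIterable class from DKPro Core, whose iteration order may differ; the "_" glue and the token strings are placeholders, not real phonetic codes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sliding-window n-gram generator; not part of DKPro Core.
class NGramSketch {
    // Returns all n-grams with minN <= n <= maxN, shorter n-grams first,
    // each n-gram joined with the given glue string.
    static List<String> ngrams(List<String> tokens, int minN, int maxN, String glue) {
        List<String> result = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int start = 0; start + n <= tokens.size(); start++) {
                result.add(String.join(glue, tokens.subList(start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> encoded = Arrays.asList("A1", "B2", "C3"); // placeholder codes
        System.out.println(ngrams(encoded, 1, 2, "_"));
        // prints [A1, B2, C3, A1_B2, B2_C3]
    }
}
```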

Example 5 with TextClassificationException

Use of org.dkpro.tc.api.exception.TextClassificationException in project dkpro-tc by dkpro.

Class LuceneCPMetaCollectorBase, method process.

@Override
public void process(JCas jcas) throws AnalysisEngineProcessException {
    JCas view1;
    JCas view2;
    try {
        view1 = jcas.getView(Constants.PART_ONE);
        view2 = jcas.getView(Constants.PART_TWO);
    } catch (Exception e) {
        throw new AnalysisEngineProcessException(e);
    }
    List<JCas> jcases = new ArrayList<JCas>();
    jcases.add(view1);
    jcases.add(view2);
    FrequencyDistribution<String> view1NGrams;
    FrequencyDistribution<String> view2NGrams;
    FrequencyDistribution<String> documentNGrams;
    try {
        TextClassificationTarget aTarget1 = JCasUtil.selectSingle(view1, TextClassificationTarget.class);
        TextClassificationTarget aTarget2 = JCasUtil.selectSingle(view2, TextClassificationTarget.class);
        view1NGrams = getNgramsFDView1(view1, aTarget1);
        view2NGrams = getNgramsFDView2(view2, aTarget2);
        documentNGrams = getNgramsFD(jcases);
    } catch (TextClassificationException e) {
        throw new AnalysisEngineProcessException(e);
    }
    for (String ngram : documentNGrams.getKeys()) {
        for (int i = 0; i < documentNGrams.getCount(ngram); i++) {
            addField(getFieldName(), ngram);
        }
    }
    for (String ngram : view1NGrams.getKeys()) {
        for (int i = 0; i < view1NGrams.getCount(ngram); i++) {
            addField(getFieldNameView1(), ngram);
        }
    }
    for (String ngram : view2NGrams.getKeys()) {
        for (int i = 0; i < view2NGrams.getCount(ngram); i++) {
            addField(getFieldNameView2(), ngram);
        }
    }
    for (String ngram1 : view1NGrams.getKeys()) {
        for (String ngram2 : view2NGrams.getKeys()) {
            int combinedSize = ngram1.split(NGRAM_GLUE).length + ngram2.split(NGRAM_GLUE).length;
            if (combinedSize <= getNgramMaxNCombo() && combinedSize >= getNgramMinNCombo()) {
                // count is the product of both term frequencies (total term freq);
                // set count = 1 instead to record doc freq
                long count = view1NGrams.getCount(ngram1) * view2NGrams.getCount(ngram2);
                for (int i = 0; i < count; i++) {
                    addField(getFieldNameCombo(), ngram1 + ComboUtils.JOINT + ngram2);
                }
            }
        }
    }
}
Also used : TextClassificationException(org.dkpro.tc.api.exception.TextClassificationException) ArrayList(java.util.ArrayList) TextClassificationTarget(org.dkpro.tc.api.type.TextClassificationTarget) JCas(org.apache.uima.jcas.JCas) AnalysisEngineProcessException(org.apache.uima.analysis_engine.AnalysisEngineProcessException)
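
The last loop of Example 5 pairs every n-gram of the first view with every n-gram of the second, keeps pairs whose combined length lies within the configured bounds, and weights each pair by the product of the two counts. The sketch below reproduces only that arithmetic with plain maps. ComboCountSketch, the "_" glue and the "+" joint are hypothetical placeholders, not the project's NGRAM_GLUE or ComboUtils.JOINT values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-alone version of the pair-combination step.
class ComboCountSketch {
    static Map<String, Long> combine(Map<String, Long> view1, Map<String, Long> view2,
            String glue, String joint, int minCombo, int maxCombo) {
        Map<String, Long> combos = new LinkedHashMap<>();
        String glueRegex = Pattern.quote(glue); // split() expects a regex
        for (Map.Entry<String, Long> e1 : view1.entrySet()) {
            for (Map.Entry<String, Long> e2 : view2.entrySet()) {
                // Combined size = number of tokens in both n-grams
                int combinedSize = e1.getKey().split(glueRegex).length
                        + e2.getKey().split(glueRegex).length;
                if (combinedSize >= minCombo && combinedSize <= maxCombo) {
                    // Product of the counts, i.e. total term frequency of the pair
                    combos.put(e1.getKey() + joint + e2.getKey(),
                            e1.getValue() * e2.getValue());
                }
            }
        }
        return combos;
    }
}
```

For example, with view 1 counts {a: 2, a_b: 1}, view 2 counts {c: 3} and both bounds set to 2, only the pair of "a" and "c" qualifies, with a count of 2 * 3 = 6. Storing the product in a map avoids the repeated addField calls of the original, which exist there because the counts are written into a Lucene index one occurrence at a time.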

Aggregations

TextClassificationException (org.dkpro.tc.api.exception.TextClassificationException) 25
ArrayList (java.util.ArrayList) 10
TextClassificationTarget (org.dkpro.tc.api.type.TextClassificationTarget) 7
AnalysisEngineProcessException (org.apache.uima.analysis_engine.AnalysisEngineProcessException) 6
IOException (java.io.IOException) 5
Feature (org.dkpro.tc.api.features.Feature) 5
File (java.io.File) 4
JCas (org.apache.uima.jcas.JCas) 4
ResourceInitializationException (org.apache.uima.resource.ResourceInitializationException) 4
FeatureExtractorResource_ImplBase (org.dkpro.tc.api.features.FeatureExtractorResource_ImplBase) 4
JCasId (org.dkpro.tc.api.type.JCasId) 4
TextClassificationOutcome (org.dkpro.tc.api.type.TextClassificationOutcome) 4
CASException (org.apache.uima.cas.CASException) 3
PairFeatureExtractor (org.dkpro.tc.api.features.PairFeatureExtractor) 3
FrequencyDistribution (de.tudarmstadt.ukp.dkpro.core.api.frequency.util.FrequencyDistribution) 2
Token (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token) 2
SimilarityException (dkpro.similarity.algorithms.api.SimilarityException) 2
HashSet (java.util.HashSet) 2
FeatureExtractor (org.dkpro.tc.api.features.FeatureExtractor) 2
Instance (org.dkpro.tc.api.features.Instance) 2