Example 1 with ColognePhonetic

Use of org.apache.commons.codec.language.ColognePhonetic in project dkpro-tc by dkpro.

In the class PhoneticNGramMC, method getDocumentPhoneticNgrams.

public static FrequencyDistribution<String> getDocumentPhoneticNgrams(JCas jcas, Annotation target, int minN, int maxN) throws TextClassificationException {
    // Pick the phonetic encoder from the document language: Soundex for English, Cologne phonetics for German.
    StringEncoder encoder;
    String languageCode = jcas.getDocumentLanguage();
    if (languageCode.equals("en")) {
        encoder = new Soundex();
    } else if (languageCode.equals("de")) {
        encoder = new ColognePhonetic();
    } else {
        throw new TextClassificationException("Language code '" + languageCode + "' not supported by phonetic ngrams FE.");
    }
    FrequencyDistribution<String> phoneticNgrams = new FrequencyDistribution<String>();
    // Work sentence by sentence so that no n-gram spans a sentence boundary.
    for (Sentence s : selectCovered(jcas, Sentence.class, target)) {
        // Replace every token of the sentence with its phonetic code.
        List<String> phoneticStrings = new ArrayList<String>();
        for (Token t : selectCovered(jcas, Token.class, s)) {
            try {
                phoneticStrings.add(encoder.encode(t.getCoveredText()));
            } catch (EncoderException e) {
                throw new TextClassificationException(e);
            }
        }
        // Count every n-gram of length minN..maxN over the code sequence.
        String[] array = phoneticStrings.toArray(new String[phoneticStrings.size()]);
        for (List<String> ngram : new NGramStringListIterable(array, minN, maxN)) {
            phoneticNgrams.inc(StringUtils.join(ngram, NGRAM_GLUE));
        }
    }
    return phoneticNgrams;
}
Also used:
TextClassificationException (org.dkpro.tc.api.exception.TextClassificationException)
ArrayList (java.util.ArrayList)
Token (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token)
ColognePhonetic (org.apache.commons.codec.language.ColognePhonetic)
NGramStringListIterable (de.tudarmstadt.ukp.dkpro.core.ngrams.util.NGramStringListIterable)
StringEncoder (org.apache.commons.codec.StringEncoder)
Soundex (org.apache.commons.codec.language.Soundex)
EncoderException (org.apache.commons.codec.EncoderException)
Sentence (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence)
FrequencyDistribution (de.tudarmstadt.ukp.dkpro.core.api.frequency.util.FrequencyDistribution)
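
The n-gram counting step of the method above can be tried without DKPro or Commons Codec on the classpath. The following is a minimal sketch under stated assumptions, not the dkpro-tc implementation: the class PhoneticNgramSketch and the method countNgrams are hypothetical names, a plain Map stands in for FrequencyDistribution, a nested loop stands in for NGramStringListIterable, the glue "_" is an assumed value for NGRAM_GLUE, and the input codes are placeholder strings rather than the output of a real encoder. It handles the codes of a single sentence; the original repeats this for each sentence.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the n-gram counting part of getDocumentPhoneticNgrams.
class PhoneticNgramSketch {

    // Counts every contiguous n-gram with minN <= n <= maxN over the given
    // phonetic codes, joining the codes of one n-gram with the glue string.
    static Map<String, Integer> countNgrams(List<String> codes, int minN, int maxN, String glue) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int n = Math.max(1, minN); n <= maxN; n++) {
            for (int start = 0; start + n <= codes.size(); start++) {
                String ngram = String.join(glue, codes.subList(start, start + n));
                counts.merge(ngram, 1, Integer::sum); // increment, starting at 1
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder codes for the tokens of one sentence (assumed, not computed here).
        List<String> codes = Arrays.asList("R163", "T522", "R163", "T522");
        Map<String, Integer> counts = countNgrams(codes, 1, 2, "_");
        counts.forEach((ngram, count) -> System.out.println(ngram + " -> " + count));
    }
}
```

Because the codes, not the surface tokens, are counted, two spellings that an encoder maps to the same code contribute to the same n-gram.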
