Example 6 with NGramStringListIterable

Use of de.tudarmstadt.ukp.dkpro.core.ngrams.util.NGramStringListIterable in project dkpro-tc by dkpro.

From the class NGramUtils, method getAnnotationNgrams.

public static FrequencyDistribution<String> getAnnotationNgrams(JCas jcas, Annotation focusAnnotation, boolean lowerCaseNGrams, boolean filterPartialMatches, int minN, int maxN, Set<String> stopwords) {
    FrequencyDistribution<String> annoNgrams = new FrequencyDistribution<String>();
    // If the focusAnnotation contains sentence annotations, extract the n-grams sentence-wise;
    // if not, extract them from all tokens in the focusAnnotation
    if (selectCovered(jcas, Sentence.class, focusAnnotation).size() > 0) {
        for (Sentence s : selectCovered(jcas, Sentence.class, focusAnnotation)) {
            for (List<String> ngram : new NGramStringListIterable(toText(selectCovered(Token.class, s)), minN, maxN)) {
                if (lowerCaseNGrams) {
                    ngram = lower(ngram);
                }
                if (passesNgramFilter(ngram, stopwords, filterPartialMatches)) {
                    String ngramString = StringUtils.join(ngram, NGRAM_GLUE);
                    annoNgrams.inc(ngramString);
                }
            }
        }
    } else {
        for (List<String> ngram : new NGramStringListIterable(toText(selectCovered(Token.class, focusAnnotation)), minN, maxN)) {
            if (lowerCaseNGrams) {
                ngram = lower(ngram);
            }
            if (passesNgramFilter(ngram, stopwords, filterPartialMatches)) {
                String ngramString = StringUtils.join(ngram, NGRAM_GLUE);
                annoNgrams.inc(ngramString);
            }
        }
    }
    return annoNgrams;
}
Also used: Sentence (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence), NGramStringListIterable (de.tudarmstadt.ukp.dkpro.core.ngrams.util.NGramStringListIterable), FrequencyDistribution (de.tudarmstadt.ukp.dkpro.core.api.frequency.util.FrequencyDistribution)
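
To make the counting logic in getAnnotationNgrams easier to follow without a UIMA pipeline, here is a minimal plain-Java sketch of the same idea: enumerate every n-gram of length minN to maxN over a token list, optionally lowercase it, filter it against a stopword set, and count the joined string. It is not the DKPro implementation. The class NGramSketch, the method names, and the "_" separator are hypothetical, and the filter semantics are an assumption (with filterPartialMatches set, any stopword rejects the n-gram; otherwise only n-grams made up entirely of stopwords are rejected). The sketch also assumes minN >= 1 and works on a single token list, so the caller would invoke it once per sentence to avoid n-grams that cross sentence boundaries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class NGramSketch {
    // Assumed separator; the actual value of NGRAM_GLUE is not shown in the example above.
    static final String GLUE = "_";

    // Counts all n-grams with minN <= n <= maxN in one token list (assumes minN >= 1).
    static Map<String, Integer> countNgrams(List<String> tokens, int minN, int maxN,
            boolean lowerCase, Set<String> stopwords, boolean filterPartialMatches) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int n = minN; n <= maxN && start + n <= tokens.size(); n++) {
                List<String> ngram = new ArrayList<>();
                for (String token : tokens.subList(start, start + n)) {
                    ngram.add(lowerCase ? token.toLowerCase(Locale.ROOT) : token);
                }
                if (passesFilter(ngram, stopwords, filterPartialMatches)) {
                    // Join the tokens and increment the count for the resulting string.
                    counts.merge(String.join(GLUE, ngram), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Assumed filter semantics: strict mode rejects an n-gram containing any stopword;
    // lenient mode rejects it only if every token is a stopword.
    static boolean passesFilter(List<String> ngram, Set<String> stopwords,
            boolean filterPartialMatches) {
        long kept = ngram.stream().filter(t -> !stopwords.contains(t)).count();
        return filterPartialMatches ? kept == ngram.size() : kept > 0;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "cat", "saw", "the", "cat");
        System.out.println(countNgrams(tokens, 1, 2, true, Set.of("the"), false));
    }
}
```

In this sketch the stopword check runs after lowercasing, mirroring the order of the two if-blocks in the original method, so the stopword set is expected to be lowercase whenever lowercasing is enabled.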

Aggregations

FrequencyDistribution (de.tudarmstadt.ukp.dkpro.core.api.frequency.util.FrequencyDistribution) 6
NGramStringListIterable (de.tudarmstadt.ukp.dkpro.core.ngrams.util.NGramStringListIterable) 6
Sentence (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence) 5
ArrayList (java.util.ArrayList) 4
POS (de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS) 2
Token (de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token) 2
EncoderException (org.apache.commons.codec.EncoderException) 1
StringEncoder (org.apache.commons.codec.StringEncoder) 1
ColognePhonetic (org.apache.commons.codec.language.ColognePhonetic) 1
Soundex (org.apache.commons.codec.language.Soundex) 1
TextClassificationException (org.dkpro.tc.api.exception.TextClassificationException) 1