Example 6 with IStemmer

Use of morfologik.stemming.IStemmer in the project languagetool by languagetool-org.

Class CatalanTagger, method tag.

@Override
public List<AnalyzedTokenReadings> tag(final List<String> sentenceTokens) throws IOException {
    final List<AnalyzedTokenReadings> tokenReadings = new ArrayList<>();
    int pos = 0;
    final IStemmer dictLookup = new DictionaryLookup(getDictionary());
    for (String word : sentenceTokens) {
        // This hack allows all rules and dictionary entries to work with
        // typewriter apostrophe
        boolean containsTypewriterApostrophe = false;
        if (word.length() > 1) {
            if (word.contains("'")) {
                containsTypewriterApostrophe = true;
            }
            word = word.replace("’", "'");
        }
        final List<AnalyzedToken> l = new ArrayList<>();
        final String lowerWord = word.toLowerCase(conversionLocale);
        final boolean isLowercase = word.equals(lowerWord);
        final boolean isMixedCase = StringTools.isMixedCase(word);
        List<AnalyzedToken> taggerTokens = asAnalyzedTokenListForTaggedWords(word, getWordTagger().tag(word));
        // normal case:
        addTokens(taggerTokens, l);
        // word with lowercase word tags:
        if (!isLowercase && !isMixedCase) {
            List<AnalyzedToken> lowerTaggerTokens = asAnalyzedTokenListForTaggedWords(word, getWordTagger().tag(lowerWord));
            addTokens(lowerTaggerTokens, l);
        }
        // additional tagging with prefixes
        if (l.isEmpty() && !isMixedCase) {
            addTokens(additionalTags(word, dictLookup), l);
        }
        // nothing matched: keep the word as an untagged token so it still yields a reading
        if (l.isEmpty()) {
            l.add(new AnalyzedToken(word, null, null));
        }
        AnalyzedTokenReadings atr = new AnalyzedTokenReadings(l, pos);
        if (containsTypewriterApostrophe) {
            List<ChunkTag> listChunkTags = new ArrayList<>();
            listChunkTags.add(new ChunkTag("containsTypewriterApostrophe"));
            atr.setChunkTags(listChunkTags);
        }
        tokenReadings.add(atr);
        // advance the start position passed to the next token's readings
        pos += word.length();
    }
    return tokenReadings;
}
Also used: ChunkTag (org.languagetool.chunking.ChunkTag), AnalyzedToken (org.languagetool.AnalyzedToken), IStemmer (morfologik.stemming.IStemmer), ArrayList (java.util.ArrayList), DictionaryLookup (morfologik.stemming.DictionaryLookup), AnalyzedTokenReadings (org.languagetool.AnalyzedTokenReadings)
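
The lookup order in the method above (normalize the apostrophe, try the word as written, then retry in lowercase unless the word is mixed case) can be sketched without LanguageTool or Morfologik on the classpath. This is a simplified illustration, not the library's code: the class LookupOrderSketch, the map DICT, its entries and tag strings, and the isMixedCase approximation are all hypothetical stand-ins, and the prefix-based additionalTags step is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the tagger: a plain map from word form to tags.
class LookupOrderSketch {

    // Toy dictionary; entries and tag names are invented for illustration.
    static final Map<String, List<String>> DICT = Map.of(
            "l'", List.of("TAG_ARTICLE"),
            "casa", List.of("TAG_NOUN"));

    // Replace the typographic apostrophe (U+2019) with the typewriter one,
    // but only for tokens longer than one character, as in the method above.
    static String normalizeApostrophe(String word) {
        return word.length() > 1 ? word.replace("\u2019", "'") : word;
    }

    // Rough approximation of a mixed-case check (an assumption, not the
    // StringTools implementation): neither all lowercase, nor all
    // uppercase, nor capitalized.
    static boolean isMixedCase(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        String upper = word.toUpperCase(Locale.ROOT);
        if (word.equals(lower) || word.equals(upper)) {
            return false;
        }
        String capitalized = upper.substring(0, 1) + lower.substring(1);
        return !word.equals(capitalized);
    }

    // Tags for the word as written, plus tags for its lowercase form when
    // the word is capitalized or all uppercase. An empty result marks the
    // point where the original falls back to an untagged token.
    static List<String> tags(String rawWord) {
        String word = normalizeApostrophe(rawWord);
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>(DICT.getOrDefault(word, List.of()));
        if (!word.equals(lower) && !isMixedCase(word)) {
            result.addAll(DICT.getOrDefault(lower, List.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tags("Casa"));
        System.out.println(tags("caSa"));
        System.out.println(tags("l\u2019"));
    }
}
```

In this sketch, a capitalized or all-uppercase word picks up the tags of its lowercase entry, while a mixed-case word does not.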

Example 7 with IStemmer

Use of morfologik.stemming.IStemmer in the project languagetool by languagetool-org.

Class PolishSynthesizer, method synthesize.

@Override
public final String[] synthesize(final AnalyzedToken token, final String posTag) throws IOException {
    if (posTag == null) {
        return null;
    }
    final IStemmer synthesizer = new DictionaryLookup(getDictionary());
    boolean isNegated = false;
    if (token.getPOSTag() != null) {
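        // && binds tighter than ||, so the comparative/superlative checks
        // only guard the negation inherited from the token's own tag
        isNegated = posTag.indexOf(NEGATION_TAG) > 0
            || (token.getPOSTag().indexOf(NEGATION_TAG) > 0
                && !(posTag.indexOf(COMP_TAG) > 0)
                && !(posTag.indexOf(SUP_TAG) > 0));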
    }
    if (posTag.indexOf('+') > 0) {
        return synthesize(token, posTag, true);
    }
    final List<String> forms = getWordForms(token, posTag, isNegated, synthesizer);
    return forms.toArray(new String[forms.size()]);
}
Also used: IStemmer (morfologik.stemming.IStemmer), DictionaryLookup (morfologik.stemming.DictionaryLookup)
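
Two details of the isNegated expression in the method above are easy to misread: `&&` binds tighter than `||`, and `indexOf(...) > 0` is false when the fragment sits at position 0. The sketch below, under stated assumptions, compares the expression as written with an explicitly grouped version. The class NegationFlagSketch is hypothetical, and the three constant values are placeholders chosen for illustration, not values confirmed by the snippet.

```java
// Hypothetical illustration of how the negation flag is evaluated.
class NegationFlagSketch {

    // Placeholder tag fragments (assumptions for this sketch only).
    static final String NEGATION_TAG = ":neg";
    static final String COMP_TAG = "com";
    static final String SUP_TAG = "sup";

    // Same shape as the original expression, without extra parentheses.
    static boolean asWritten(String posTag, String tokenTag) {
        return posTag.indexOf(NEGATION_TAG) > 0 || tokenTag.indexOf(NEGATION_TAG) > 0 && !(posTag.indexOf(COMP_TAG) > 0) && !(posTag.indexOf(SUP_TAG) > 0);
    }

    // Explicit grouping that follows Java's operator precedence.
    static boolean grouped(String posTag, String tokenTag) {
        // Negation requested directly by the target tag.
        boolean requested = posTag.indexOf(NEGATION_TAG) > 0;
        // Negation carried over from the token, unless the target tag
        // contains the comparative or superlative fragment.
        boolean inherited = tokenTag.indexOf(NEGATION_TAG) > 0
                && !(posTag.indexOf(COMP_TAG) > 0)
                && !(posTag.indexOf(SUP_TAG) > 0);
        return requested || inherited;
    }

    public static void main(String[] args) {
        String[] tags = {"adj:sg:pos:neg", "adj:sg:pos", "adj:sg:com", "adj:sg:sup", ":neg"};
        for (String posTag : tags) {
            for (String tokenTag : tags) {
                System.out.println(posTag + " / " + tokenTag + " -> "
                        + asWritten(posTag, tokenTag) + " " + grouped(posTag, tokenTag));
            }
        }
    }
}
```

Because the comparison is `> 0` and not `>= 0`, a tag that begins with the fragment does not count as a match in either version; whether that is intended cannot be determined from the snippet alone.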

Aggregations

IStemmer (morfologik.stemming.IStemmer): 7
ArrayList (java.util.ArrayList): 6
DictionaryLookup (morfologik.stemming.DictionaryLookup): 5
Matcher (java.util.regex.Matcher): 3
Pattern (java.util.regex.Pattern): 2
WordData (morfologik.stemming.WordData): 2
AnalyzedToken (org.languagetool.AnalyzedToken): 2
IOException (java.io.IOException): 1
InputStream (java.io.InputStream): 1
HashSet (java.util.HashSet): 1
Nullable (org.jetbrains.annotations.Nullable): 1
AnalyzedTokenReadings (org.languagetool.AnalyzedTokenReadings): 1
ChunkTag (org.languagetool.chunking.ChunkTag): 1