Example 1 with PhoneticAttribute

Use of zemberek.core.turkish.PhoneticAttribute in project zemberek-nlp by ahmetaa.

Class SuffixSurfaceNodeGenerator, method generate.

public List<SuffixSurfaceNode> generate(EnumSet<PhoneticAttribute> attrs, EnumSet<PhoneticExpectation> expectations, SuffixData suffixData, SuffixForm suffixForm) {
    List<SuffixToken> tokenList = Lists.newArrayList(new SuffixStringTokenizer(suffixForm.generation));
    // zero length token
    if (tokenList.size() == 0) {
        return Lists.newArrayList(new SuffixSurfaceNode(suffixForm, "", attrs.clone(), expectations.clone(), suffixData, suffixForm.terminationType));
    }
    List<SuffixSurfaceNode> forms = new ArrayList<SuffixSurfaceNode>(1);
    // generation of forms. normally only one form is generated. But in situations like cI~k, two Forms are generated.
    TurkishLetterSequence seq = new TurkishLetterSequence();
    int index = 0;
    for (SuffixToken token : tokenList) {
        EnumSet<PhoneticAttribute> formAttrs = defineMorphemicAttributes(seq, attrs);
        switch(token.type) {
            case LETTER:
                seq.append(token.letter);
                if (index == tokenList.size() - 1) {
                    forms.add(new SuffixSurfaceNode(suffixForm, seq.toString(), defineMorphemicAttributes(seq, attrs), suffixForm.terminationType));
                }
                break;
            case A_WOVEL:
                if (index == 0 && attrs.contains(LastLetterVowel)) {
                    break;
                }
                TurkicLetter lA = TurkicLetter.UNDEFINED;
                if (formAttrs.contains(LastVowelBack)) {
                    lA = L_a;
                } else if (formAttrs.contains(LastVowelFrontal)) {
                    lA = L_e;
                }
                if (lA == TurkicLetter.UNDEFINED) {
                    throw new IllegalArgumentException("Cannot generate A form!");
                }
                seq.append(lA);
                if (index == tokenList.size() - 1) {
                    forms.add(new SuffixSurfaceNode(suffixForm, seq.toString(), defineMorphemicAttributes(seq, attrs), suffixForm.terminationType));
                }
                break;
            case I_WOVEL:
                if (index == 0 && attrs.contains(LastLetterVowel)) {
                    break;
                }
                TurkicLetter li = TurkicLetter.UNDEFINED;
                if (formAttrs.containsAll(Arrays.asList(LastVowelBack, LastVowelRounded))) {
                    li = L_u;
                } else if (formAttrs.containsAll(Arrays.asList(LastVowelBack, LastVowelUnrounded))) {
                    li = L_ii;
                } else if (formAttrs.containsAll(Arrays.asList(LastVowelFrontal, LastVowelRounded))) {
                    li = L_uu;
                } else if (formAttrs.containsAll(Arrays.asList(LastVowelFrontal, LastVowelUnrounded))) {
                    li = L_i;
                }
                if (li == TurkicLetter.UNDEFINED) {
                    throw new IllegalArgumentException("Cannot generate I form!");
                }
                seq.append(li);
                if (index == tokenList.size() - 1) {
                    forms.add(new SuffixSurfaceNode(suffixForm, seq.toString(), defineMorphemicAttributes(seq, attrs), suffixForm.terminationType));
                }
                break;
            case APPEND:
                if (formAttrs.contains(LastLetterVowel)) {
                    seq.append(token.letter);
                }
                if (index == tokenList.size() - 1) {
                    forms.add(new SuffixSurfaceNode(suffixForm, seq.toString(), defineMorphemicAttributes(seq, attrs), suffixForm.terminationType));
                }
                break;
            case DEVOICE_FIRST:
                TurkicLetter ld = token.letter;
                if (formAttrs.contains(LastLetterVoiceless)) {
                    ld = Turkish.Alphabet.devoice(token.letter);
                }
                seq.append(ld);
                if (index == tokenList.size() - 1) {
                    forms.add(new SuffixSurfaceNode(suffixForm, seq.toString(), defineMorphemicAttributes(seq, attrs), suffixForm.terminationType));
                }
                break;
            case VOICE_LAST:
                ld = token.letter;
                seq.append(ld);
                if (index == tokenList.size() - 1) {
                    forms.add(new SuffixSurfaceNode(suffixForm, seq.toString(), defineMorphemicAttributes(seq, attrs), EnumSet.of(PhoneticExpectation.ConsonantStart), suffixData, suffixForm.terminationType));
                    seq.changeLast(Turkish.Alphabet.voice(token.letter));
                    forms.add(new SuffixSurfaceNode(suffixForm, seq.toString(), defineMorphemicAttributes(seq, attrs), EnumSet.of(PhoneticExpectation.VowelStart), suffixData, TerminationType.NON_TERMINAL));
                }
                break;
        }
        index++;
    }
    return forms;
}
Also used : TurkicLetter(zemberek.core.turkish.TurkicLetter) TurkishLetterSequence(zemberek.core.turkish.TurkishLetterSequence) ArrayList(java.util.ArrayList) PhoneticAttribute(zemberek.core.turkish.PhoneticAttribute) SuffixSurfaceNode(zemberek.morphology.lexicon.graph.SuffixSurfaceNode)
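
The A_WOVEL and I_WOVEL branches above pick a vowel from the backness and roundedness of the preceding vowel. The following is a minimal, self-contained sketch of that selection rule only. It does not use the zemberek classes: `VowelHarmonySketch`, `resolveA` and `resolveI` are hypothetical names, and the sketch assumes lowercase input and works on plain strings rather than `PhoneticAttribute` sets.

```java
// Hypothetical sketch, not part of zemberek-nlp: two-way (A) and four-way (I)
// vowel harmony resolved from the last vowel of a lowercase stem.
class VowelHarmonySketch {
    // Non-ASCII letters are written as escapes so the file compiles under any encoding.
    private static final String BACK = "a\u0131ou";         // a, dotless i, o, u
    private static final String FRONT = "ei\u00f6\u00fc";   // e, i, o-umlaut, u-umlaut
    private static final String ROUNDED = "ou\u00f6\u00fc"; // o, u, o-umlaut, u-umlaut

    // Returns the last vowel of the stem, or 0 when the stem has no vowel.
    static char lastVowel(String stem) {
        for (int i = stem.length() - 1; i >= 0; i--) {
            char c = stem.charAt(i);
            if (BACK.indexOf(c) >= 0 || FRONT.indexOf(c) >= 0) {
                return c;
            }
        }
        return 0;
    }

    // A token: 'a' after a back vowel, 'e' after a front vowel.
    static char resolveA(String stem) {
        char v = lastVowel(stem);
        if (v == 0) {
            throw new IllegalArgumentException("Cannot generate A form!");
        }
        return BACK.indexOf(v) >= 0 ? 'a' : 'e';
    }

    // I token: chosen from backness and roundedness of the last vowel.
    static char resolveI(String stem) {
        char v = lastVowel(stem);
        if (v == 0) {
            throw new IllegalArgumentException("Cannot generate I form!");
        }
        boolean back = BACK.indexOf(v) >= 0;
        boolean rounded = ROUNDED.indexOf(v) >= 0;
        if (back) {
            return rounded ? 'u' : '\u0131';
        }
        return rounded ? '\u00fc' : 'i';
    }

    public static void main(String[] args) {
        // Builds a plural-like surface form from the template "lAr".
        System.out.println("kitap" + "l" + resolveA("kitap") + "r");
        System.out.println("ev" + "l" + resolveA("ev") + "r");
    }
}
```

Like the generator above, the sketch throws when no vowel is available to decide the form, instead of guessing a default.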

Example 2 with PhoneticAttribute

Use of zemberek.core.turkish.PhoneticAttribute in project zemberek-nlp by ahmetaa.

Class AttributesHelper, method getMorphemicAttributes.

public static AttributeSet<PhoneticAttribute> getMorphemicAttributes(CharSequence seq, AttributeSet<PhoneticAttribute> predecessorAttrs) {
    if (seq.length() == 0) {
        return predecessorAttrs.copy();
    }
    AttributeSet<PhoneticAttribute> attrs = new AttributeSet<>();
    if (alphabet.containsVowel(seq)) {
        TurkicLetter last = alphabet.getLastLetter(seq);
        if (last.isVowel()) {
            attrs.add(LastLetterVowel);
        } else {
            attrs.add(LastLetterConsonant);
        }
        TurkicLetter lastVowel = last.isVowel() ? last : alphabet.getLastVowel(seq);
        if (lastVowel.isFrontal()) {
            attrs.add(LastVowelFrontal);
        } else {
            attrs.add(LastVowelBack);
        }
        if (lastVowel.isRounded()) {
            attrs.add(LastVowelRounded);
        } else {
            attrs.add(LastVowelUnrounded);
        }
        if (alphabet.getFirstLetter(seq).isVowel()) {
            attrs.add(FirstLetterVowel);
        } else {
            attrs.add(FirstLetterConsonant);
        }
    } else {
        // we transfer vowel attributes from the predecessor attributes.
        attrs.copyFrom(predecessorAttrs);
        attrs.addAll(NO_VOWEL_ATTRIBUTES);
        attrs.remove(LastLetterVowel);
        attrs.remove(ExpectsConsonant);
    }
    TurkicLetter last = alphabet.getLastLetter(seq);
    if (last.isVoiceless()) {
        attrs.add(LastLetterVoiceless);
        if (last.isStopConsonant()) {
            // kitap
            attrs.add(LastLetterVoicelessStop);
        }
    } else {
        attrs.add(LastLetterVoiced);
    }
    return attrs;
}
Also used : TurkicLetter(zemberek.core.turkish.TurkicLetter) AttributeSet(zemberek.morphology._morphotactics.AttributeSet) PhoneticAttribute(zemberek.core.turkish.PhoneticAttribute)
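
To make the attribute computation above easier to follow, here is a reduced sketch using `java.util.EnumSet` in place of zemberek's `AttributeSet`. Everything in it is an assumption made for illustration: the `Attr` enum is a hypothetical subset of `PhoneticAttribute`, the letter classes are hard-coded for lowercase input, and in the no-vowel branch the sketch clears the predecessor's voicing attributes before recomputing them, which the original method does not do explicitly.

```java
import java.util.EnumSet;

// Hypothetical sketch, not part of zemberek-nlp: derives a small set of
// phonetic attributes from a lowercase surface string.
class PhoneticAttrSketch {
    enum Attr {
        LastLetterVowel, LastLetterConsonant,
        LastVowelFrontal, LastVowelBack,
        LastVowelRounded, LastVowelUnrounded,
        LastLetterVoiceless, LastLetterVoicelessStop, LastLetterVoiced
    }

    private static final String FRONT = "ei\u00f6\u00fc";
    private static final String VOWELS = "a\u0131ou" + FRONT;
    private static final String ROUNDED = "ou\u00f6\u00fc";
    private static final String VOICELESS = "\u00e7fhkps\u015ft"; // c-cedilla f h k p s s-cedilla t
    private static final String VOICELESS_STOPS = "\u00e7kpt";

    static EnumSet<Attr> attributesOf(String seq, EnumSet<Attr> predecessor) {
        if (seq.isEmpty()) {
            return EnumSet.copyOf(predecessor);
        }
        // Index of the last vowel, or -1 when the sequence has none.
        int lastVowelIndex = -1;
        for (int i = seq.length() - 1; i >= 0; i--) {
            if (VOWELS.indexOf(seq.charAt(i)) >= 0) {
                lastVowelIndex = i;
                break;
            }
        }
        EnumSet<Attr> attrs;
        if (lastVowelIndex >= 0) {
            attrs = EnumSet.noneOf(Attr.class);
            char v = seq.charAt(lastVowelIndex);
            attrs.add(lastVowelIndex == seq.length() - 1
                    ? Attr.LastLetterVowel : Attr.LastLetterConsonant);
            attrs.add(FRONT.indexOf(v) >= 0 ? Attr.LastVowelFrontal : Attr.LastVowelBack);
            attrs.add(ROUNDED.indexOf(v) >= 0 ? Attr.LastVowelRounded : Attr.LastVowelUnrounded);
        } else {
            // No vowel: inherit vowel attributes from the predecessor.
            attrs = EnumSet.copyOf(predecessor);
            attrs.remove(Attr.LastLetterVowel);
            attrs.add(Attr.LastLetterConsonant);
            // Sketch-only choice: drop stale voicing attributes before recomputing.
            attrs.removeAll(EnumSet.of(Attr.LastLetterVoiceless,
                    Attr.LastLetterVoicelessStop, Attr.LastLetterVoiced));
        }
        char last = seq.charAt(seq.length() - 1);
        if (VOICELESS.indexOf(last) >= 0) {
            attrs.add(Attr.LastLetterVoiceless);
            if (VOICELESS_STOPS.indexOf(last) >= 0) {
                attrs.add(Attr.LastLetterVoicelessStop);
            }
        } else {
            attrs.add(Attr.LastLetterVoiced);
        }
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(attributesOf("kitap", EnumSet.noneOf(Attr.class)));
    }
}
```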

Example 3 with PhoneticAttribute

Use of zemberek.core.turkish.PhoneticAttribute in project zemberek-nlp by ahmetaa.

Class InterpretingAnalyzer, method advance.

// for all allowed matching outgoing transitions, new paths are generated.
// Transition conditions are used for checking if a search path is allowed to pass a transition.
private List<SearchPath> advance(SearchPath path, AnalysisDebugData debugData) {
    List<SearchPath> newPaths = new ArrayList<>(2);
    // for all outgoing transitions.
    for (MorphemeTransition transition : path.currentState.getOutgoing()) {
        SuffixTransition suffixTransition = (SuffixTransition) transition;
        // if tail is empty and this transition's surface is not empty, no need to check.
        if (path.tail.isEmpty() && suffixTransition.hasSurfaceForm()) {
            if (debugData != null) {
                debugData.rejectedTransitions.put(path, new RejectedTransition(suffixTransition, "Empty surface expected."));
            }
            continue;
        }
        String surface = SurfaceTransition.generate(suffixTransition, path.phoneticAttributes);
        // no need to go further if generated surface form is not a prefix of the path's tail.
        if (!path.tail.startsWith(surface)) {
            if (debugData != null) {
                debugData.rejectedTransitions.put(path, new RejectedTransition(suffixTransition, "Surface Mismatch:" + surface));
            }
            continue;
        }
        // if transition condition fails, add it to debug data.
        if (debugData != null && suffixTransition.getCondition() != null) {
            Condition condition = suffixTransition.getCondition();
            Condition failed;
            if (condition instanceof CombinedCondition) {
                failed = ((CombinedCondition) condition).getFailingCondition(path);
            } else {
                failed = condition.accept(path) ? null : condition;
            }
            if (failed != null) {
                debugData.rejectedTransitions.put(path, new RejectedTransition(suffixTransition, "Condition → " + failed.toString()));
            }
        }
        // check conditions.
        if (!suffixTransition.canPass(path)) {
            continue;
        }
        // epsilon transition. Add and continue. Use existing attributes.
        if (!suffixTransition.hasSurfaceForm()) {
            newPaths.add(path.getCopy(new SurfaceTransition("", suffixTransition), path.phoneticAttributes));
            continue;
        }
        SurfaceTransition surfaceTransition = new SurfaceTransition(surface, suffixTransition);
        // if tail is equal to surface, no need to calculate phonetic attributes.
        AttributeSet<PhoneticAttribute> attributes = path.tail.equals(surface) ? path.phoneticAttributes.copy() : AttributesHelper.getMorphemicAttributes(surface, path.phoneticAttributes);
        // This is required for suffixes like `cik` and `ciğ`
        // an extra attribute is added if "cik" or "ciğ" is generated and matches the tail.
        // if "cik" is generated, ExpectsConsonant attribute is added, so only a consonant starting
        // suffix can follow. Likewise, if "ciğ" is produced, a vowel starting suffix is allowed.
        attributes.remove(PhoneticAttribute.CannotTerminate);
        SuffixTemplateToken lastToken = suffixTransition.getLastTemplateToken();
        if (lastToken.type == TemplateTokenType.LAST_VOICED) {
            attributes.add(PhoneticAttribute.ExpectsConsonant);
        } else if (lastToken.type == TemplateTokenType.LAST_NOT_VOICED) {
            attributes.add(PhoneticAttribute.ExpectsVowel);
            attributes.add(PhoneticAttribute.CannotTerminate);
        }
        SearchPath p = path.getCopy(surfaceTransition, attributes);
        newPaths.add(p);
    }
    return newPaths;
}
Also used : Condition(zemberek.morphology._morphotactics.Condition) CombinedCondition(zemberek.morphology._morphotactics.CombinedCondition) SuffixTransition(zemberek.morphology._morphotactics.SuffixTransition) ArrayList(java.util.ArrayList) MorphemeTransition(zemberek.morphology._morphotactics.MorphemeTransition) PhoneticAttribute(zemberek.core.turkish.PhoneticAttribute) SuffixTemplateToken(zemberek.morphology._analyzer.SurfaceTransition.SuffixTemplateToken)
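
The core of `advance` is a prefix test: a transition survives only if its generated surface is a prefix of the unconsumed tail, and an empty surface (epsilon transition) always matches without consuming anything. The sketch below isolates that filtering step on plain strings. `PrefixAdvanceSketch` and its `advance` method are hypothetical, and the sketch leaves out conditions, debug data and phonetic attributes entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of zemberek-nlp: keeps only the candidate
// surfaces that can consume a prefix of the remaining tail.
class PrefixAdvanceSketch {
    // Returns the tail left over after each accepted candidate surface.
    static List<String> advance(String tail, List<String> candidateSurfaces) {
        List<String> remainingTails = new ArrayList<>();
        for (String surface : candidateSurfaces) {
            // Nothing left to consume, so a non-empty surface cannot match.
            if (tail.isEmpty() && !surface.isEmpty()) {
                continue;
            }
            // Surface mismatch: the candidate is not a prefix of the tail.
            if (!tail.startsWith(surface)) {
                continue;
            }
            // Accepted. An empty surface leaves the tail unchanged.
            remainingTails.add(tail.substring(surface.length()));
        }
        return remainingTails;
    }

    public static void main(String[] args) {
        System.out.println(advance("lerde", Arrays.asList("ler", "lar", "de", "")));
    }
}
```

With the tail `lerde`, the candidate `ler` is accepted and leaves `de`, the empty candidate is accepted and leaves the tail untouched, and `lar` and `de` are rejected as surface mismatches.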

Example 4 with PhoneticAttribute

Use of zemberek.core.turkish.PhoneticAttribute in project zemberek-nlp by ahmetaa.

Class StemTransitionGenerator, method handleSpecialRoots.

private List<StemTransition> handleSpecialRoots(DictionaryItem item) {
    String id = item.getId();
    AttributeSet<PhoneticAttribute> originalAttrs = calculateAttributes(item.pronunciation);
    StemTransition original, modified;
    MorphemeState unmodifiedRootState = morphotactics.getRootState(item, originalAttrs);
    switch(id) {
        case "içeri_Noun":
        case "içeri_Adj":
        case "dışarı_Adj":
        case "dışarı_Noun":
        case "dışarı_Postp":
        case "yukarı_Noun":
        case "yukarı_Adj":
        case "şura_Noun":
        case "bura_Noun":
        case "ora_Noun":
            original = new StemTransition(item.root, item, originalAttrs, unmodifiedRootState);
            MorphemeState rootForModified;
            switch(item.primaryPos) {
                case Noun:
                    rootForModified = morphotactics.nounLastVowelDropRoot_S;
                    break;
                case Adjective:
                    rootForModified = morphotactics.adjLastVowelDropRoot_S;
                    break;
                // TODO: check postpositive case. Maybe it is not required.
                case PostPositive:
                    rootForModified = morphotactics.adjLastVowelDropRoot_S;
                    break;
                default:
                    throw new IllegalStateException("No root morpheme state found for " + item);
            }
            String m = item.root.substring(0, item.root.length() - 1);
            modified = new StemTransition(m, item, calculateAttributes(m), rootForModified);
            modified.getPhoneticAttributes().add(PhoneticAttribute.ExpectsConsonant);
            modified.getPhoneticAttributes().add(PhoneticAttribute.CannotTerminate);
            return Lists.newArrayList(original, modified);
        case "ben_Pron_Pers":
        case "sen_Pron_Pers":
            original = new StemTransition(item.root, item, originalAttrs, unmodifiedRootState);
            if (item.lemma.equals("ben")) {
                modified = new StemTransition("ban", item, calculateAttributes("ban"), morphotactics.pronPers_Mod_S);
            } else {
                modified = new StemTransition("san", item, calculateAttributes("san"), morphotactics.pronPers_Mod_S);
            }
            original.getPhoneticAttributes().add(PhoneticAttribute.UnModifiedPronoun);
            modified.getPhoneticAttributes().add(PhoneticAttribute.ModifiedPronoun);
            return Lists.newArrayList(original, modified);
        case "demek_Verb":
        case "yemek_Verb":
            original = new StemTransition(item.root, item, originalAttrs, morphotactics.vDeYeRoot_S);
            switch(item.lemma) {
                case "demek":
                    modified = new StemTransition("di", item, calculateAttributes("di"), morphotactics.vDeYeRoot_S);
                    break;
                default:
                    modified = new StemTransition("yi", item, calculateAttributes("yi"), morphotactics.vDeYeRoot_S);
            }
            return Lists.newArrayList(original, modified);
        case "birbiri_Pron_Quant":
        case "çoğu_Pron_Quant":
        case "öbürü_Pron_Quant":
        case "birçoğu_Pron_Quant":
            original = new StemTransition(item.root, item, originalAttrs, morphotactics.pronQuant_S);
            switch(item.lemma) {
                case "birbiri":
                    modified = new StemTransition("birbir", item, calculateAttributes("birbir"), morphotactics.pronQuantModified_S);
                    break;
                case "çoğu":
                    modified = new StemTransition("çok", item, calculateAttributes("çok"), morphotactics.pronQuantModified_S);
                    break;
                case "öbürü":
                    modified = new StemTransition("öbür", item, calculateAttributes("öbür"), morphotactics.pronQuantModified_S);
                    break;
                default:
                    modified = new StemTransition("birçok", item, calculateAttributes("birçok"), morphotactics.pronQuantModified_S);
                    break;
            }
            original.getPhoneticAttributes().add(PhoneticAttribute.UnModifiedPronoun);
            modified.getPhoneticAttributes().add(PhoneticAttribute.ModifiedPronoun);
            return Lists.newArrayList(original, modified);
        default:
            throw new IllegalArgumentException("Lexicon Item with special stem change cannot be handled:" + item);
    }
}
Also used : StemTransition(zemberek.morphology._morphotactics.StemTransition) PhoneticAttribute(zemberek.core.turkish.PhoneticAttribute) MorphemeState(zemberek.morphology._morphotactics.MorphemeState)

Example 5 with PhoneticAttribute

Use of zemberek.core.turkish.PhoneticAttribute in project zemberek-nlp by ahmetaa.

Class StemTransitionGenerator, method generateModifiedRootNodes.

private List<StemTransition> generateModifiedRootNodes(DictionaryItem dicItem) {
    StringBuilder modifiedSeq = new StringBuilder(dicItem.pronunciation);
    AttributeSet<PhoneticAttribute> originalAttrs = calculateAttributes(dicItem.pronunciation);
    AttributeSet<PhoneticAttribute> modifiedAttrs = originalAttrs.copy();
    MorphemeState modifiedRootState = null;
    MorphemeState unmodifiedRootState = null;
    for (RootAttribute attribute : dicItem.attributes) {
        // generate other boundary attributes and modified root state.
        switch(attribute) {
            case Voicing:
                char last = alphabet.getLastChar(modifiedSeq);
                char voiced = alphabet.voice(last);
                if (last == voiced) {
                    throw new LexiconException("Voicing letter is not proper in:" + dicItem);
                }
                if (dicItem.lemma.endsWith("nk")) {
                    voiced = 'g';
                }
                modifiedSeq.setCharAt(modifiedSeq.length() - 1, voiced);
                modifiedAttrs.remove(PhoneticAttribute.LastLetterVoicelessStop);
                originalAttrs.add(PhoneticAttribute.ExpectsConsonant);
                modifiedAttrs.add(PhoneticAttribute.ExpectsVowel);
                // TODO: find a better way for this.
                modifiedAttrs.add(PhoneticAttribute.CannotTerminate);
                break;
            case Doubling:
                modifiedSeq.append(alphabet.getLastChar(modifiedSeq));
                originalAttrs.add(PhoneticAttribute.ExpectsConsonant);
                modifiedAttrs.add(PhoneticAttribute.ExpectsVowel);
                modifiedAttrs.add(PhoneticAttribute.CannotTerminate);
                break;
            case LastVowelDrop:
                TurkicLetter lastLetter = alphabet.getLastLetter(modifiedSeq);
                if (lastLetter.isVowel()) {
                    modifiedSeq.deleteCharAt(modifiedSeq.length() - 1);
                    modifiedAttrs.add(PhoneticAttribute.ExpectsConsonant);
                    modifiedAttrs.add(PhoneticAttribute.CannotTerminate);
                } else {
                    modifiedSeq.deleteCharAt(modifiedSeq.length() - 2);
                    if (!dicItem.primaryPos.equals(PrimaryPos.Verb)) {
                        originalAttrs.add(PhoneticAttribute.ExpectsConsonant);
                    } else {
                        unmodifiedRootState = morphotactics.verbLastVowelDropUnmodRoot_S;
                        modifiedRootState = morphotactics.verbLastVowelDropModRoot_S;
                    }
                    modifiedAttrs.add(PhoneticAttribute.ExpectsVowel);
                    modifiedAttrs.add(PhoneticAttribute.CannotTerminate);
                }
                break;
            case InverseHarmony:
                originalAttrs.add(PhoneticAttribute.LastVowelFrontal);
                originalAttrs.remove(PhoneticAttribute.LastVowelBack);
                modifiedAttrs.add(PhoneticAttribute.LastVowelFrontal);
                modifiedAttrs.remove(PhoneticAttribute.LastVowelBack);
                break;
            case ProgressiveVowelDrop:
                modifiedSeq.deleteCharAt(modifiedSeq.length() - 1);
                if (alphabet.containsVowel(modifiedSeq)) {
                    modifiedAttrs = calculateAttributes(modifiedSeq);
                }
                modifiedAttrs.add(PhoneticAttribute.LastLetterDropped);
                break;
            default:
                break;
        }
    }
    if (unmodifiedRootState == null) {
        unmodifiedRootState = morphotactics.getRootState(dicItem, originalAttrs);
    }
    StemTransition original = new StemTransition(dicItem.root, dicItem, originalAttrs, unmodifiedRootState);
    // if modified root state is not defined in the switch block, get it from morphotactics.
    if (modifiedRootState == null) {
        modifiedRootState = morphotactics.getRootState(dicItem, modifiedAttrs);
    }
    StemTransition modified = new StemTransition(modifiedSeq.toString(), dicItem, modifiedAttrs, modifiedRootState);
    if (original.equals(modified)) {
        return Collections.singletonList(original);
    }
    return Lists.newArrayList(original, modified);
}
Also used : RootAttribute(zemberek.core.turkish.RootAttribute) TurkicLetter(zemberek.core.turkish.TurkicLetter) StemTransition(zemberek.morphology._morphotactics.StemTransition) LexiconException(zemberek.morphology.lexicon.LexiconException) PhoneticAttribute(zemberek.core.turkish.PhoneticAttribute) MorphemeState(zemberek.morphology._morphotactics.MorphemeState)
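
The string edits behind the Voicing, Doubling and LastVowelDrop cases above can be shown on their own. This is a hedged sketch: `StemChangeSketch` and its methods are hypothetical, the voicing map (p to b, ç to c, t to d, k to ğ, with `nk` becoming `ng`) is the usual Turkish pattern and only assumed to match what `alphabet.voice` does, and no phonetic attributes or root states are produced.

```java
// Hypothetical sketch, not part of zemberek-nlp: the three stem edits as
// plain string operations on lowercase stems.
class StemChangeSketch {
    private static final String VOWELS = "a\u0131oue\u0069\u00f6\u00fc";

    // Voicing: replaces a final voiceless stop with its voiced counterpart.
    static String voiceLast(String stem) {
        char last = stem.charAt(stem.length() - 1);
        char voiced;
        switch (last) {
            case 'p': voiced = 'b'; break;
            case '\u00e7': voiced = 'c'; break;      // c-cedilla becomes c
            case 't': voiced = 'd'; break;
            case 'k':
                // "nk" endings voice to "ng", otherwise k becomes soft g.
                voiced = stem.endsWith("nk") ? 'g' : '\u011f';
                break;
            default:
                throw new IllegalArgumentException("Voicing letter is not proper in: " + stem);
        }
        return stem.substring(0, stem.length() - 1) + voiced;
    }

    // Doubling: repeats the final letter.
    static String doubleLast(String stem) {
        return stem + stem.charAt(stem.length() - 1);
    }

    // Last vowel drop: removes a final vowel, otherwise the letter before the
    // final consonant (assumed to be the vowel, as in the method above).
    static String dropLastVowel(String stem) {
        if (stem.length() < 2) {
            throw new IllegalArgumentException("Stem too short: " + stem);
        }
        StringBuilder sb = new StringBuilder(stem);
        boolean endsWithVowel = VOWELS.indexOf(stem.charAt(stem.length() - 1)) >= 0;
        sb.deleteCharAt(endsWithVowel ? sb.length() - 1 : sb.length() - 2);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(voiceLast("kitap"));
        System.out.println(doubleLast("hak"));
        System.out.println(dropLastVowel("burun"));
    }
}
```

In the original method each of these edits is paired with `ExpectsVowel` or `ExpectsConsonant` plus `CannotTerminate`, so that the modified stem is only accepted when a suitable suffix follows.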

Aggregations

PhoneticAttribute (zemberek.core.turkish.PhoneticAttribute): 16
TurkicLetter (zemberek.core.turkish.TurkicLetter): 6
ArrayList (java.util.ArrayList): 4
PhoneticExpectation (zemberek.core.turkish.PhoneticExpectation): 3
RootAttribute (zemberek.core.turkish.RootAttribute): 3
StemTransition (zemberek.morphology._morphotactics.StemTransition): 3
LexiconException (zemberek.morphology.lexicon.LexiconException): 3
StemNode (zemberek.morphology.lexicon.graph.StemNode): 3
StemTransition (zemberek.morphology.morphotactics.StemTransition): 3
TurkishLetterSequence (zemberek.core.turkish.TurkishLetterSequence): 2
MorphemeState (zemberek.morphology._morphotactics.MorphemeState): 2
RejectedTransition (zemberek.morphology.analysis.AnalysisDebugData.RejectedTransition): 2
SuffixTemplateToken (zemberek.morphology.analysis.SurfaceTransition.SuffixTemplateToken): 2
SuffixData (zemberek.morphology.lexicon.graph.SuffixData): 2
CombinedCondition (zemberek.morphology.morphotactics.CombinedCondition): 2
Condition (zemberek.morphology.morphotactics.Condition): 2
MorphemeState (zemberek.morphology.morphotactics.MorphemeState): 2
MorphemeTransition (zemberek.morphology.morphotactics.MorphemeTransition): 2
SuffixTransition (zemberek.morphology.morphotactics.SuffixTransition): 2
EnumSet (java.util.EnumSet): 1