Example 1 with CharacterGraphDecoder

Use of zemberek.normalization.CharacterGraphDecoder in the project zemberek-nlp by ahmetaa.

From the class DictionaryOperations, method findBadDictionaryItems.

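private static void findBadDictionaryItems() throws IOException {
    // 0f is presumably the maximum allowed penalty, so only zero-distance
    // matches are returned (compare the output file name below).
    CharacterGraphDecoder decoder = new CharacterGraphDecoder(0f);
    // With this matcher, letters that differ only by diacritics count as equal.
    CharMatcher matcher = CharacterGraphDecoder.DIACRITICS_IGNORING_MATCHER;
    // Keep only the first token of each dictionary line and lowercase it
    // with Turkish casing rules.
    List<String> words = TextIO.loadLinesFromResource("tr/proper-from-corpus.dict", "#")
        .stream()
        .map(s -> s.trim().replaceAll("[ ]+.+?$", "").toLowerCase(Turkish.LOCALE))
        .collect(Collectors.toList());
    decoder.addWords(words);
    // A LinkedHashSet drops duplicate report lines while keeping insertion order.
    Set<String> res = new LinkedHashSet<>();
    for (String word : words) {
        // Skip short words.
        if (word.length() < 5) {
            continue;
        }
        List<String> matches = decoder.getSuggestions(word, matcher);
        // matches.sort(Turkish.STRING_COMPARATOR_ASC);
        // More than one match means another entry differs from this word
        // only by diacritics.
        if (matches.size() > 1) {
            res.add(word + " - " + String.join(" ", matches));
        }
    }
    // Sort the report in Turkish alphabetical order before writing it out.
    List<String> r = new ArrayList<>(res);
    r.sort(Turkish.STRING_COMPARATOR_ASC);
    Files.write(Paths.get("similar-words-0-distance"), r);
}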
Also used: URLDecoder(java.net.URLDecoder) HashMap(java.util.HashMap) Regexps(zemberek.core.text.Regexps) TurkishDictionaryLoader(zemberek.morphology.lexicon.tr.TurkishDictionaryLoader) ArrayList(java.util.ArrayList) DictionaryItem(zemberek.morphology.lexicon.DictionaryItem) HashSet(java.util.HashSet) Turkish(zemberek.core.turkish.Turkish) TreeMultimap(com.google.common.collect.TreeMultimap) CharMatcher(zemberek.normalization.CharacterGraphDecoder.CharMatcher) Map(java.util.Map) PrimaryPos(zemberek.core.turkish.PrimaryPos) Log(zemberek.core.logging.Log) CharacterGraphDecoder(zemberek.normalization.CharacterGraphDecoder) Path(java.nio.file.Path) LinkedHashSet(java.util.LinkedHashSet) SecondaryPos(zemberek.core.turkish.SecondaryPos) PrintWriter(java.io.PrintWriter) Files(java.nio.file.Files) TurkishMorphology(zemberek.morphology.TurkishMorphology) Set(java.util.Set) IOException(java.io.IOException) RootAttribute(zemberek.core.turkish.RootAttribute) Collectors(java.util.stream.Collectors) List(java.util.List) Paths(java.nio.file.Paths) TextIO(zemberek.core.text.TextIO) TurkishAlphabet(zemberek.core.turkish.TurkishAlphabet) Pattern(java.util.regex.Pattern) RootLexicon(zemberek.morphology.lexicon.RootLexicon) Comparator(java.util.Comparator) Collections(java.util.Collections)
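
The idea behind the method above (report dictionary entries that become identical once diacritics are ignored) can be sketched without zemberek. The snippet below is only a rough stand-in, not how CharacterGraphDecoder works internally: it folds a handful of Turkish letters to ASCII look-alikes and groups words by the folded key. The class DiacriticCollisions, its methods fold and findCollisions, and the minLength parameter are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the zemberek-based check; plain JDK only.
class DiacriticCollisions {

    // Assumption: only these six lowercase Turkish letters are folded.
    static String fold(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            switch (c) {
                case '\u00e7': sb.append('c'); break; // c with cedilla
                case '\u011f': sb.append('g'); break; // g with breve
                case '\u0131': sb.append('i'); break; // dotless i
                case '\u00f6': sb.append('o'); break; // o with diaeresis
                case '\u015f': sb.append('s'); break; // s with cedilla
                case '\u00fc': sb.append('u'); break; // u with diaeresis
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Returns one space-joined line per group of distinct words that share
    // a folded key; words shorter than minLength are skipped.
    static List<String> findCollisions(List<String> words, int minLength) {
        Locale turkish = Locale.forLanguageTag("tr");
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String w : words) {
            String lower = w.trim().toLowerCase(turkish);
            if (lower.length() < minLength) {
                continue;
            }
            List<String> group =
                groups.computeIfAbsent(fold(lower), k -> new ArrayList<>());
            if (!group.contains(lower)) {
                group.add(lower);
            }
        }
        List<String> report = new ArrayList<>();
        for (List<String> group : groups.values()) {
            if (group.size() > 1) {
                report.add(String.join(" ", group));
            }
        }
        return report;
    }
}
```

Calling findCollisions on the words şahin, sahin, ankara, Çelik, celik with a minimum length of 5 yields two groups, "şahin sahin" and "çelik celik". Unlike the original method, this sketch emits one line per group rather than one line per word.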

Aggregations

TreeMultimap (com.google.common.collect.TreeMultimap) 1, IOException (java.io.IOException) 1, PrintWriter (java.io.PrintWriter) 1, URLDecoder (java.net.URLDecoder) 1, Files (java.nio.file.Files) 1, Path (java.nio.file.Path) 1, Paths (java.nio.file.Paths) 1, ArrayList (java.util.ArrayList) 1, Collections (java.util.Collections) 1, Comparator (java.util.Comparator) 1, HashMap (java.util.HashMap) 1, HashSet (java.util.HashSet) 1, LinkedHashSet (java.util.LinkedHashSet) 1, List (java.util.List) 1, Map (java.util.Map) 1, Set (java.util.Set) 1, Pattern (java.util.regex.Pattern) 1, Collectors (java.util.stream.Collectors) 1, Log (zemberek.core.logging.Log) 1, Regexps (zemberek.core.text.Regexps) 1