Example 1 with CharMatcher

Use of zemberek.normalization.CharacterGraphDecoder.CharMatcher in the project zemberek-nlp by ahmetaa.

The method findBadDictionaryItems of the class DictionaryOperations.

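private static void findBadDictionaryItems() throws IOException {
    // A limit of 0 keeps only the candidates that the matcher treats as identical to the input.
    CharacterGraphDecoder decoder = new CharacterGraphDecoder(0f);
    CharMatcher matcher = CharacterGraphDecoder.DIACRITICS_IGNORING_MATCHER;
    // Load the dictionary lines ("#" marks comments), keep only the text before
    // the first space and lower-case it with the Turkish locale.
    List<String> words = TextIO.loadLinesFromResource("tr/proper-from-corpus.dict", "#")
        .stream()
        .map(s -> s.trim().replaceAll("[ ]+.+?$", "").toLowerCase(Turkish.LOCALE))
        .collect(Collectors.toList());
    decoder.addWords(words);
    // LinkedHashSet removes duplicate report lines while keeping insertion order.
    Set<String> res = new LinkedHashSet<>();
    for (String word : words) {
        // Skip short words.
        if (word.length() < 5) {
            continue;
        }
        List<String> matches = decoder.getSuggestions(word, matcher);
        // matches.sort(Turkish.STRING_COMPARATOR_ASC);
        // More than one match means another dictionary entry collides with this word.
        if (matches.size() > 1) {
            res.add(word + " - " + String.join(" ", matches));
        }
    }
    // Sort the report with the Turkish collation order and write one line per word.
    List<String> r = new ArrayList<>(res);
    r.sort(Turkish.STRING_COMPARATOR_ASC);
    Files.write(Paths.get("similar-words-0-distance"), r);
}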
Also used: URLDecoder(java.net.URLDecoder) HashMap(java.util.HashMap) Regexps(zemberek.core.text.Regexps) TurkishDictionaryLoader(zemberek.morphology.lexicon.tr.TurkishDictionaryLoader) ArrayList(java.util.ArrayList) DictionaryItem(zemberek.morphology.lexicon.DictionaryItem) HashSet(java.util.HashSet) Turkish(zemberek.core.turkish.Turkish) TreeMultimap(com.google.common.collect.TreeMultimap) CharMatcher(zemberek.normalization.CharacterGraphDecoder.CharMatcher) Map(java.util.Map) PrimaryPos(zemberek.core.turkish.PrimaryPos) Log(zemberek.core.logging.Log) CharacterGraphDecoder(zemberek.normalization.CharacterGraphDecoder) Path(java.nio.file.Path) LinkedHashSet(java.util.LinkedHashSet) SecondaryPos(zemberek.core.turkish.SecondaryPos) PrintWriter(java.io.PrintWriter) Files(java.nio.file.Files) TurkishMorphology(zemberek.morphology.TurkishMorphology) Set(java.util.Set) IOException(java.io.IOException) RootAttribute(zemberek.core.turkish.RootAttribute) Collectors(java.util.stream.Collectors) List(java.util.List) Paths(java.nio.file.Paths) TextIO(zemberek.core.text.TextIO) TurkishAlphabet(zemberek.core.turkish.TurkishAlphabet) Pattern(java.util.regex.Pattern) RootLexicon(zemberek.morphology.lexicon.RootLexicon) Comparator(java.util.Comparator) Collections(java.util.Collections)
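
The method above reports dictionary words that collide with another entry once diacritics are ignored. The idea can be illustrated without zemberek by a minimal plain-Java sketch. The class DiacriticsFoldingSketch, its methods, and the letter mapping below are hypothetical and chosen for illustration; they are not the library's implementation, and the exact letter pairs that DIACRITICS_IGNORING_MATCHER accepts may differ. Non-ASCII letters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DiacriticsFoldingSketch {

    // Hypothetical mapping (an assumption, not zemberek's table):
    // fold each Turkish-specific letter to its plain Latin counterpart.
    static char foldChar(char c) {
        switch (c) {
            case '\u00E7': return 'c'; // c with cedilla
            case '\u011F': return 'g'; // g with breve
            case '\u0131': return 'i'; // dotless i
            case '\u00F6': return 'o'; // o with diaeresis
            case '\u015F': return 's'; // s with cedilla
            case '\u00FC': return 'u'; // u with diaeresis
            default: return c;
        }
    }

    // Fold every character of a (already lower-cased) word.
    static String fold(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            sb.append(foldChar(word.charAt(i)));
        }
        return sb.toString();
    }

    // Group words by folded form and keep only groups with at least two members,
    // i.e. words that become indistinguishable once diacritics are ignored.
    static Map<String, List<String>> groupByFoldedForm(List<String> words) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String w : words) {
            groups.computeIfAbsent(fold(w), k -> new ArrayList<>()).add(w);
        }
        groups.values().removeIf(g -> g.size() < 2);
        return groups;
    }

    public static void main(String[] args) {
        // Made-up sample words; the second and fifth are the ASCII spellings
        // of the first and fourth.
        List<String> words = Arrays.asList(
            "ka\u011F\u0131t", "kagit", "ankara", "\u015Feker", "seker", "izmir");
        groupByFoldedForm(words).forEach(
            (folded, group) -> System.out.println(folded + " - " + String.join(" ", group)));
    }
}
```

Grouping by a folded key finds every collision in a single pass over the word list, whereas the original method queries the decoder once per word; the sketch trades the decoder's generality for simplicity.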
