Example 11 with TurkishTokenizer

Use of zemberek.tokenization.TurkishTokenizer in the project zemberek-nlp by ahmetaa.

The deascify method of the class WordHistogram.

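static List<String> deascify(Path input) throws IOException {
    // Each line of the input file is treated as one chunk.
    List<String> chunks = Files.readAllLines(input, StandardCharsets.UTF_8);
    List<String> result = new ArrayList<>();
    TurkishTokenizer tokenizer = TurkishTokenizer.DEFAULT;
    for (String chunk : chunks) {
        // Re-join the tokens with single spaces to normalize whitespace.
        List<String> words = tokenizer.tokenizeToStrings(chunk);
        String tokenStr = String.join(" ", words);
        // Ratio of Turkish-specific letters to all non-whitespace characters.
        String withoutSpaces = chunk.replaceAll("\\s+", "");
        String turkishChrs = chunk.replaceAll("[^çÇöÖğĞüÜıİşŞâî]", "");
        double ratio = turkishChrs.length() * 1d / withoutSpaces.length();
        // Under 1% Turkish letters: assume the line was typed in ASCII and deasciify it.
        // For a blank line the ratio is NaN, the comparison is false, and the line is kept.
        if (ratio < 0.01) {
            result.add(Deasciifier.deasciify(tokenStr));
        } else {
            result.add(chunk);
        }
    }
    return result;
}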
Also used: TurkishTokenizer (zemberek.tokenization.TurkishTokenizer), ArrayList (java.util.ArrayList)
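
The deascify method above decides per line whether to deasciify by checking what fraction of the non-whitespace characters are Turkish-specific letters. A minimal sketch of just that check, with no zemberek dependency, might look like the following. The class name TurkishCharRatio and the methods turkishCharRatio and looksAsciified are hypothetical names chosen for illustration and are not part of zemberek-nlp; the character set and the 0.01 threshold are copied from the method above. The letters are written as Unicode escapes only so that the file compiles regardless of the source encoding.

```java
// Hypothetical helper (not part of zemberek-nlp) that isolates the
// Turkish-letter ratio check used in deascify.
final class TurkishCharRatio {

    // Matches every character that is NOT one of the Turkish-specific letters
    // from the original regex: c-cedilla, o-umlaut, soft g, u-umlaut (each in
    // lower and upper case), dotless i, dotted capital I, s-cedilla (lower and
    // upper case), a-circumflex and i-circumflex.
    private static final String NON_TURKISH =
        "[^\u00e7\u00c7\u00f6\u00d6\u011f\u011e\u00fc\u00dc\u0131\u0130\u015f\u015e\u00e2\u00ee]";

    // Threshold taken from the original method.
    private static final double THRESHOLD = 0.01;

    /** Fraction of non-whitespace characters that are Turkish-specific letters. */
    public static double turkishCharRatio(String line) {
        String withoutSpaces = line.replaceAll("\\s+", "");
        if (withoutSpaces.isEmpty()) {
            // Mirrors the original arithmetic, where 0.0 / 0 evaluates to NaN.
            return Double.NaN;
        }
        String turkishChars = line.replaceAll(NON_TURKISH, "");
        return (double) turkishChars.length() / withoutSpaces.length();
    }

    /** True when the line has so few Turkish letters that it was probably typed in ASCII. */
    public static boolean looksAsciified(String line) {
        // Any comparison with NaN is false, so blank lines are never flagged.
        return turkishCharRatio(line) < THRESHOLD;
    }

    public static void main(String[] args) {
        // ASCII-only spelling of a Turkish sentence.
        System.out.println(looksAsciified("bugun hava cok guzel"));
        // The same sentence with its proper Turkish letters.
        System.out.println(looksAsciified("bug\u00fcn hava \u00e7ok g\u00fczel"));
    }
}
```

Because the threshold is a ratio rather than a count, one stray Turkish letter in a long ASCII line still leaves the line below 1% and therefore flagged, while a short line with a single Turkish letter is left untouched.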

Aggregations

TurkishTokenizer (zemberek.tokenization.TurkishTokenizer): 11
Token (zemberek.tokenization.Token): 6
TurkishMorphology (zemberek.morphology.TurkishMorphology): 4
ArrayList (java.util.ArrayList): 3
Stopwatch (com.google.common.base.Stopwatch): 2
Path (java.nio.file.Path): 2
Ignore (org.junit.Ignore): 2
Test (org.junit.Test): 2
Histogram (zemberek.core.collections.Histogram): 2
SentenceAnalysis (zemberek.morphology.analysis.SentenceAnalysis): 2
File (java.io.File): 1
IOException (java.io.IOException): 1
PrintWriter (java.io.PrintWriter): 1
LinkedHashSet (java.util.LinkedHashSet): 1
WebCorpus (zemberek.corpus.WebCorpus): 1
WebDocument (zemberek.corpus.WebDocument): 1
SentenceWordAnalysis (zemberek.morphology.analysis.SentenceWordAnalysis): 1
SingleAnalysis (zemberek.morphology.analysis.SingleAnalysis): 1
WordAnalysis (zemberek.morphology.analysis.WordAnalysis): 1
TurkishSpellChecker (zemberek.normalization.TurkishSpellChecker): 1