Use of zemberek.tokenization.TurkishTokenizer in project zemberek-nlp by ahmetaa.
From the class WordHistogram, method deascify.
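// Reads a UTF-8 text file and deasciifies only the lines that appear to be
// written without Turkish-specific letters.
static List<String> deascify(Path input) throws IOException {
  List<String> chunks = Files.readAllLines(input, StandardCharsets.UTF_8);
  List<String> result = new ArrayList<>();
  TurkishTokenizer tokenizer = TurkishTokenizer.DEFAULT;
  for (String chunk : chunks) {
    // Tokenize and re-join with single spaces so the deasciifier sees clean tokens.
    List<String> words = tokenizer.tokenizeToStrings(chunk);
    String tokenStr = String.join(" ", words);
    // Share of Turkish-specific characters among all non-whitespace characters.
    String withoutSpaces = chunk.replaceAll("\\s+", "");
    String turkishChrs = chunk.replaceAll("[^çÇöÖğĞüÜıİşŞâî]", "");
    // An empty or whitespace-only line gives NaN here; NaN < 0.01 is false,
    // so such a line falls through to the else branch and is kept unchanged.
    double ratio = turkishChrs.length() * 1d / withoutSpaces.length();
    if (ratio < 0.01) {
      // Under 1% Turkish characters: treat the line as asciified text.
      // The output is built from the re-joined tokens, so spacing may differ from the input.
      result.add(Deasciifier.deasciify(tokenStr));
    } else {
      result.add(chunk);
    }
  }
  return result;
}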
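The ratio check in the method above can be tried without zemberek on the classpath. The following is a minimal sketch using only the JDK; the class name TurkishCharRatio and the helpers turkishRatio and looksAsciified are hypothetical names introduced for illustration and are not part of zemberek-nlp. It reuses the same character class and the same 0.01 threshold as the snippet.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of zemberek-nlp.
public class TurkishCharRatio {

    // Matches every character that is NOT one of the Turkish-specific letters
    // listed in the original snippet.
    private static final Pattern NON_TURKISH = Pattern.compile("[^çÇöÖğĞüÜıİşŞâî]");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Share of Turkish-specific characters among non-whitespace characters.
    // Returns NaN for an empty or whitespace-only line.
    static double turkishRatio(String line) {
        String withoutSpaces = WHITESPACE.matcher(line).replaceAll("");
        if (withoutSpaces.isEmpty()) {
            return Double.NaN;
        }
        String turkishChars = NON_TURKISH.matcher(line).replaceAll("");
        return (double) turkishChars.length() / withoutSpaces.length();
    }

    // True when fewer than 1% of the non-whitespace characters are Turkish-specific.
    // A NaN ratio compares false, so blank lines are never flagged.
    static boolean looksAsciified(String line) {
        return turkishRatio(line) < 0.01;
    }

    public static void main(String[] args) {
        System.out.println(looksAsciified("Bugun hava cok guzel"));
        System.out.println(looksAsciified("Bugün hava çok güzel"));
    }
}
```

Only lines for which looksAsciified returns true would be handed to the deasciifier; lines that already contain Turkish letters are passed through unchanged, which avoids re-processing text that is already correct.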