Use of zemberek.morphology.TurkishMorphology in project zemberek-nlp by ahmetaa.
Class DictionaryOperations, method saveLemmas.
public static void saveLemmas(int minLength) throws IOException {
  TurkishMorphology morphology = TurkishMorphology.createWithDefaults();
  // A set collapses lemmas that are shared by several dictionary items.
  Set<String> set = new HashSet<>();
  for (DictionaryItem item : morphology.getLexicon()) {
    String lemma = item.lemma;
    // Skip dummy roots, lemmas shorter than minLength, and punctuation.
    if (item.attributes.contains(RootAttribute.Dummy)) {
      continue;
    }
    if (lemma.length() < minLength) {
      continue;
    }
    if (item.primaryPos == PrimaryPos.Punctuation) {
      continue;
    }
    set.add(lemma);
  }
  List<String> list = new ArrayList<>(set);
  // Sort ascending with Zemberek's Turkish comparator, then write one lemma per line.
  list.sort(Turkish.STRING_COMPARATOR_ASC);
  Files.write(Paths.get("zemberek.vocab"), list);
}
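The filter, de-duplicate, and sort steps in saveLemmas can be tried without loading the Zemberek lexicon. The following is a minimal sketch under stated assumptions: the class LemmaFilterSketch and the helper filterLemmas are hypothetical names, plain strings stand in for DictionaryItem lemmas, and a java.text.Collator for the Turkish locale stands in for Turkish.STRING_COMPARATOR_ASC, whose exact ordering may differ.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class LemmaFilterSketch {

  // Hypothetical helper (not part of Zemberek): keeps each lemma of at least
  // minLength characters once and returns them sorted by Turkish collation.
  static List<String> filterLemmas(Collection<String> lemmas, int minLength) {
    Set<String> unique = new HashSet<>();
    for (String lemma : lemmas) {
      if (lemma.length() >= minLength) {
        unique.add(lemma);
      }
    }
    List<String> sorted = new ArrayList<>(unique);
    // Assumption: a JDK Collator is used here in place of Zemberek's comparator.
    Collator turkish = Collator.getInstance(Locale.forLanguageTag("tr"));
    sorted.sort(turkish);
    return sorted;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("kitap", "ev", "araba", "kitap");
    // "ev" is dropped by the length filter and the duplicate "kitap" is kept once.
    System.out.println(filterLemmas(input, 3));
  }
}
```

The attribute and part-of-speech checks from the original method are left out because they depend on Zemberek types.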
Use of zemberek.morphology.TurkishMorphology in project zemberek-nlp by ahmetaa.
Class ExtractTurkishCityDistrictNames, method removeZemberekDictionaryWordsFromList.
private static void removeZemberekDictionaryWordsFromList(Path input, Path out) throws IOException {
  // A LinkedHashSet drops duplicate lines while keeping the input order.
  LinkedHashSet<String> list =
      new LinkedHashSet<>(Files.readAllLines(input, StandardCharsets.UTF_8));
  System.out.println("Total amount of lines = " + list.size());
  // Build a morphology instance from an explicit list of dictionary resources.
  TurkishMorphology morphology = TurkishMorphology.create(
      RootLexicon.builder().addTextDictionaryResources(
          "tr/master-dictionary.dict",
          "tr/non-tdk.dict",
          "tr/proper.dict",
          "tr/proper-from-corpus.dict",
          "tr/abbreviations.dict").build());
  // Collect every input line that matches a dictionary lemma.
  List<String> toRemove = new ArrayList<>();
  for (DictionaryItem item : morphology.getLexicon()) {
    if (list.contains(item.lemma)) {
      toRemove.add(item.lemma);
    }
  }
  System.out.println("Total amount to remove = " + toRemove.size());
  list.removeAll(toRemove);
  // Write the remaining lines, one per line, as UTF-8.
  try (PrintWriter pw = new PrintWriter(out.toFile(), "utf-8")) {
    list.forEach(pw::println);
  }
}
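The core of removeZemberekDictionaryWordsFromList is an order-preserving set difference. The following is a minimal sketch of that step only, assuming the dictionary lemmas are already available as a set of strings; the class RemoveKnownWordsSketch and the helper removeKnown are hypothetical names, and no Zemberek classes are used.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RemoveKnownWordsSketch {

  // Hypothetical helper (not part of Zemberek): returns the input lines that
  // are not dictionary lemmas, without duplicates and in first-seen order.
  static Set<String> removeKnown(List<String> lines, Set<String> dictionaryLemmas) {
    Set<String> remaining = new LinkedHashSet<>(lines);
    remaining.removeAll(dictionaryLemmas);
    return remaining;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("Ankara", "elma", "Bolu", "elma");
    Set<String> lemmas = new HashSet<>(Arrays.asList("elma"));
    // "elma" is removed; the other lines keep their original order.
    removeKnown(lines, lemmas).forEach(System.out::println);
  }
}
```

Passing a HashSet as the argument is a deliberate choice in this sketch: AbstractSet.removeAll may call contains on its argument once per element, which is a linear scan when the argument is a List.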