Example 6 with TurkishSentenceExtractor

Use of zemberek.tokenization.TurkishSentenceExtractor in project zemberek-nlp by ahmetaa.

From the class DocumentSimilarityExperiment, method prepareCorpus.

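// Builds a de-duplicated corpus from one file or a directory tree and saves it to target.
// Besides the classes listed under "Also used", this needs java.nio.file.Files,
// java.util.stream.Stream and java.util.stream.Collectors.
public void prepareCorpus(Path root, Path target) throws IOException {
    Set<Long> hashes = new HashSet<>();
    List<Path> files = new ArrayList<>();
    if (Files.isRegularFile(root)) {
        files.add(root);
    } else {
        // Files.walk keeps directory handles open, so close the stream once the paths are collected.
        try (Stream<Path> paths = Files.walk(root)) {
            files.addAll(paths.filter(Files::isRegularFile).collect(Collectors.toList()));
        }
    }
    // Sort so that files are processed in a deterministic order.
    files.sort(Comparator.comparing(Path::toString));
    WebCorpus corpus = new WebCorpus("web-news", "all");
    int duplicateCount = 0;
    TurkishSentenceExtractor extractor = TurkishSentenceExtractor.DEFAULT;
    for (Path file : files) {
        Log.info("Adding %s", file);
        List<WebDocument> docs = WebCorpus.loadDocuments(file);
        for (WebDocument doc : docs) {
            // Split paragraphs into sentences, then normalize each resulting line.
            doc.setContent(extractor.fromParagraphs(doc.getLines()));
            doc.setContent(normalizeLines(doc.getLines()));
            // Skip documents whose hash has already been seen.
            if (hashes.contains(doc.getHash())) {
                duplicateCount++;
                continue;
            }
            // Skip very short documents; their hashes are not recorded.
            if (doc.contentLength() < 50) {
                continue;
            }
            hashes.add(doc.getHash());
            corpus.addDocument(doc);
        }
        Log.info("Total doc count = %d Duplicate count= %d", corpus.documentCount(), duplicateCount);
    }
    // Note: despite the label, this value is the number of documents kept in the corpus.
    Log.info("Total amount of files = %d", corpus.getDocuments().size());
    corpus.save(target, false);
}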
Also used: Path (java.nio.file.Path), WebDocument (zemberek.corpus.WebDocument), ArrayList (java.util.ArrayList), WebCorpus (zemberek.corpus.WebCorpus), TurkishSentenceExtractor (zemberek.tokenization.TurkishSentenceExtractor), HashSet (java.util.HashSet), LinkedHashSet (java.util.LinkedHashSet)
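
The duplicate and minimum-length filtering in prepareCorpus can be tried without zemberek on the classpath. The following is a minimal sketch using only the JDK. The class DedupSketch, the methods contentHash and keepUnique, and the use of String.hashCode() as a stand-in for WebDocument.getHash() are all assumptions made for illustration; the real hash function in zemberek-nlp may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {

    // Hypothetical stand-in for WebDocument.getHash(); the real hash may be computed differently.
    static long contentHash(String content) {
        return content.hashCode();
    }

    // Keeps the first occurrence of each document and drops documents shorter than minLength.
    // The order of checks mirrors prepareCorpus: duplicate check first, then the length check.
    static List<String> keepUnique(List<String> docs, int minLength) {
        Set<Long> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String doc : docs) {
            long hash = contentHash(doc);
            if (seen.contains(hash)) {
                continue; // same hash as a document that was already kept
            }
            if (doc.length() < minLength) {
                continue; // too short; its hash is deliberately not recorded
            }
            seen.add(hash);
            kept.add(doc);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Merhaba dunya, bugun hava guzel.",
                "kisa",
                "Merhaba dunya, bugun hava guzel.",
                "Bu baska bir belge.");
        // Prints the documents that survive both filters.
        System.out.println(keepUnique(docs, 10));
    }
}
```

Because only kept documents are recorded in the set, a short document that appears twice is skipped by the length check both times and is never counted as a duplicate, which matches the behaviour of the method above.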

Aggregations

TurkishSentenceExtractor (zemberek.tokenization.TurkishSentenceExtractor): 6, Path (java.nio.file.Path): 2, ArrayList (java.util.ArrayList): 2, Stopwatch (com.google.common.base.Stopwatch): 1, File (java.io.File): 1, HashSet (java.util.HashSet): 1, LinkedHashSet (java.util.LinkedHashSet): 1, Ignore (org.junit.Ignore): 1, Test (org.junit.Test): 1, Histogram (zemberek.core.collections.Histogram): 1, WebCorpus (zemberek.corpus.WebCorpus): 1, WebDocument (zemberek.corpus.WebDocument): 1, TurkishMorphology (zemberek.morphology.TurkishMorphology): 1, Token (zemberek.tokenization.Token): 1, TurkishTokenizer (zemberek.tokenization.TurkishTokenizer): 1