Example 1 with WordTokenizer

use of org.languagetool.tokenizers.WordTokenizer in project languagetool by languagetool-org.

the class LanguageModelTest method testPerformance.

protected void testPerformance(LuceneLanguageModel model, int ngramLength) throws Exception {
    try (FileInputStream fis = new FileInputStream(FILE)) {
        String content = StringTools.readStream(fis, "UTF-8");
        WordTokenizer wordTokenizer = new WordTokenizer();
        List<String> words = wordTokenizer.tokenize(content);
        String prevPrevWord = null;
        String prevWord = null;
        int i = 0;
        long totalMicros = 0;
        for (String word : words) {
            // skip whitespace-only tokens so they never become part of an n-gram
            if (word.trim().isEmpty()) {
                continue;
            }
            if (prevWord != null) {
                // System.nanoTime() returns nanoseconds; dividing by 1000 gives microseconds
                long t1 = System.nanoTime() / 1000;
                long count = 0;
                if (ngramLength == 2) {
                    count = model.getCount(Arrays.asList(prevWord, word));
                } else if (ngramLength == 3) {
                    if (prevPrevWord != null) {
                        count = model.getCount(Arrays.asList(prevPrevWord, prevWord, word));
                    }
                } else {
                    throw new IllegalArgumentException("ngram length not supported: " + ngramLength);
                }
                long timeMicros = (System.nanoTime() / 1000) - t1;
                long timeMillis = timeMicros / 1000;
                if (ngramLength == 2) {
                    System.out.println(count + "\t\t" + prevWord + " " + word + ": " + timeMicros + "µs = " + timeMillis + "ms");
                } else {
                    System.out.println(count + "\t\t" + prevPrevWord + " " + prevWord + " " + word + ": " + timeMicros + "µs = " + timeMillis + "ms");
                }
                // lookups made while i <= SKIP_FIRST_ITEMS are left out of the total
                if (i > SKIP_FIRST_ITEMS) {
                    totalMicros += timeMicros;
                }
                if (++i % 25 == 0) {
                    printStats(i, totalMicros);
                }
            }
            prevPrevWord = prevWord;
            prevWord = word;
        }
        printStats(i, totalMicros);
    }
}
Also used: WordTokenizer (org.languagetool.tokenizers.WordTokenizer), FileInputStream (java.io.FileInputStream)
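
The loop in Example 1 keeps the previous one or two words around so that each new word completes a bigram or trigram. As a minimal sketch of that sliding-window idea, the class below builds the n-grams from an already tokenized list. It does not depend on LanguageTool: the class name NgramWindowSketch, the method ngrams, and the hand-written token list are all assumptions made for illustration, and the model lookup (model.getCount) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NgramWindowSketch {

    // Builds every n-gram of the given length from the tokens, dropping
    // whitespace-only tokens first, as the test method in Example 1 does.
    // Hypothetical helper, not part of LanguageTool.
    public static List<List<String>> ngrams(List<String> tokens, int ngramLength) {
        if (ngramLength < 1) {
            throw new IllegalArgumentException("ngram length not supported: " + ngramLength);
        }
        List<String> words = new ArrayList<>();
        for (String token : tokens) {
            if (!token.trim().isEmpty()) {
                words.add(token);
            }
        }
        List<List<String>> result = new ArrayList<>();
        // slide a window of ngramLength words over the filtered list
        for (int end = ngramLength; end <= words.size(); end++) {
            result.add(new ArrayList<>(words.subList(end - ngramLength, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written token list that includes whitespace tokens (an assumption
        // about the input shape, chosen to exercise the whitespace filter).
        List<String> tokens = Arrays.asList("This", " ", "is", " ", "a", " ", "test");
        System.out.println(ngrams(tokens, 2));
        System.out.println(ngrams(tokens, 3));
    }
}
```

Each inner list could then be passed to a count lookup such as the model.getCount call in Example 1. Building the windows up front trades a little memory for a simpler loop than tracking prevWord and prevPrevWord by hand.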

Example 2 with WordTokenizer

use of org.languagetool.tokenizers.WordTokenizer in project languagetool by languagetool-org.

the class EnglishDisambiguationRuleTest method setUp.

@Before
public void setUp() {
    tagger = new EnglishTagger();
    tokenizer = new WordTokenizer();
    sentenceTokenizer = new SRXSentenceTokenizer(new English());
    disambiguator = new XmlRuleDisambiguator(new English());
    disamb2 = new DemoDisambiguator();
}
Also used: English (org.languagetool.language.English), DemoDisambiguator (org.languagetool.tagging.disambiguation.xx.DemoDisambiguator), XmlRuleDisambiguator (org.languagetool.tagging.disambiguation.rules.XmlRuleDisambiguator), WordTokenizer (org.languagetool.tokenizers.WordTokenizer), EnglishTagger (org.languagetool.tagging.en.EnglishTagger), SRXSentenceTokenizer (org.languagetool.tokenizers.SRXSentenceTokenizer), Before (org.junit.Before)

Example 3 with WordTokenizer

use of org.languagetool.tokenizers.WordTokenizer in project languagetool by languagetool-org.

the class EnglishTaggerTest method setUp.

@Before
public void setUp() {
    tagger = new EnglishTagger();
    tokenizer = new WordTokenizer();
}
Also used: WordTokenizer (org.languagetool.tokenizers.WordTokenizer), Before (org.junit.Before)

Example 4 with WordTokenizer

use of org.languagetool.tokenizers.WordTokenizer in project languagetool by languagetool-org.

the class CatalanTaggerTest method setUp.

@Before
public void setUp() {
    tagger = new CatalanTagger();
    tokenizer = new WordTokenizer();
}
Also used: WordTokenizer (org.languagetool.tokenizers.WordTokenizer), Before (org.junit.Before)

Example 5 with WordTokenizer

use of org.languagetool.tokenizers.WordTokenizer in project languagetool by languagetool-org.

the class SwedishTaggerTest method setUp.

@Before
public void setUp() {
    tagger = new SwedishTagger();
    tokenizer = new WordTokenizer();
}
Also used: WordTokenizer (org.languagetool.tokenizers.WordTokenizer), Before (org.junit.Before)

Aggregations

WordTokenizer (org.languagetool.tokenizers.WordTokenizer): 18
Before (org.junit.Before): 17
SRXSentenceTokenizer (org.languagetool.tokenizers.SRXSentenceTokenizer): 3
XmlRuleDisambiguator (org.languagetool.tagging.disambiguation.rules.XmlRuleDisambiguator): 2
DemoDisambiguator (org.languagetool.tagging.disambiguation.xx.DemoDisambiguator): 2
FileInputStream (java.io.FileInputStream): 1
English (org.languagetool.language.English): 1
French (org.languagetool.language.French): 1
Polish (org.languagetool.language.Polish): 1
EnglishTagger (org.languagetool.tagging.en.EnglishTagger): 1
FrenchTagger (org.languagetool.tagging.fr.FrenchTagger): 1
PolishTagger (org.languagetool.tagging.pl.PolishTagger): 1