Example 6 with Tokenizer

use of opennlp.tools.tokenize.Tokenizer in project stanbol by apache.

the class OpenNlpTokenizerEngine method computeEnhancements.

/**
     * Compute enhancements for supplied ContentItem. The results of the process
     * are expected to be stored in the metadata of the content item.
     * <p/>
     * The client (usually an {@link org.apache.stanbol.enhancer.servicesapi.EnhancementJobManager}) should take care of
     * persistent storage of the enhanced {@link org.apache.stanbol.enhancer.servicesapi.ContentItem}.
     * <p/>
     * This method tokenizes the text of the content item, sentence by sentence if
     * sentences are annotated and as a whole otherwise, and adds the resulting tokens
     * to the {@link AnalysedText}. The metadata is not changed.
     *
     * @throws org.apache.stanbol.enhancer.servicesapi.EngineException
     *          if the underlying process failed to work as
     *          expected
     */
@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
    AnalysedText at = initAnalysedText(this, analysedTextFactory, ci);
    String language = getLanguage(this, ci, true);
    Tokenizer tokenizer = getTokenizer(language);
    if (tokenizer == null) {
        // pass the language so the {} placeholder in the message is actually filled in
        log.warn("Tokenizer for language {} is no longer available. "
                + "This might happen if the model becomes unavailable during enhancement. "
                + "If this happens more often it might also indicate a bug in the used "
                + "EnhancementJobManager implementation as the availability is also checked "
                + "in the canEnhance(..) method of this Enhancement Engine.", language);
        return;
    }
    //Try to use sentences for tokenizing
    Iterator<? extends Section> sections = at.getSentences();
    if (!sections.hasNext()) {
        //if no sentences are annotated
        sections = Collections.singleton(at).iterator();
    }
    //for all sentences (or the whole Text - if no sentences available)
    while (sections.hasNext()) {
        Section section = sections.next();
        //Tokenize section
        opennlp.tools.util.Span[] tokenSpans = tokenizer.tokenizePos(section.getSpan());
        for (int i = 0; i < tokenSpans.length; i++) {
            Token token = section.addToken(tokenSpans[i].getStart(), tokenSpans[i].getEnd());
            log.trace(" > add {}", token);
        }
    }
}
Also used : AnalysedText(org.apache.stanbol.enhancer.nlp.model.AnalysedText) NlpEngineHelper.initAnalysedText(org.apache.stanbol.enhancer.nlp.utils.NlpEngineHelper.initAnalysedText) Token(org.apache.stanbol.enhancer.nlp.model.Token) Tokenizer(opennlp.tools.tokenize.Tokenizer) SimpleTokenizer(opennlp.tools.tokenize.SimpleTokenizer) Section(org.apache.stanbol.enhancer.nlp.model.Section)
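
Example 6 tokenizes the text of each section separately and passes the resulting span offsets straight to section.addToken, which suggests that the offsets are relative to the section rather than to the whole text. The following is a minimal, library-free sketch of that idea. The class SpanTokenizerSketch and its whitespace-only splitting are hypothetical stand-ins for illustration; they are not OpenNLP's or Stanbol's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a span-producing tokenizer; not OpenNLP's implementation.
class SpanTokenizerSketch {

    /** Returns {start, end} pairs (end exclusive), relative to the given text. */
    static List<int[]> tokenizePos(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means "currently not inside a token"
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    spans.add(new int[] {start, i}); // whitespace closes the open token
                    start = -1;
                }
            } else if (start < 0) {
                start = i; // first non-whitespace character opens a token
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()}); // token running to the end
        }
        return spans;
    }

    /** Shifts section-relative spans to offsets in the full text. */
    static List<int[]> toAbsolute(List<int[]> spans, int sectionStart) {
        List<int[]> shifted = new ArrayList<>();
        for (int[] span : spans) {
            shifted.add(new int[] {span[0] + sectionStart, span[1] + sectionStart});
        }
        return shifted;
    }

    public static void main(String[] args) {
        String text = "First one. Second sentence here.";
        int sectionStart = 11; // assumed start offset of the second sentence
        String section = text.substring(sectionStart);
        for (int[] span : toAbsolute(tokenizePos(section), sectionStart)) {
            System.out.println(span[0] + "-" + span[1] + " " + text.substring(span[0], span[1]));
        }
    }
}
```

Keeping spans section-relative lets the same tokenizer run on a single sentence or on the whole text without any offset bookkeeping inside the tokenizer itself.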

Example 7 with Tokenizer

use of opennlp.tools.tokenize.Tokenizer in project stanbol by apache.

the class OpenNLP method getTokenizer.

/**
     * Getter for the Tokenizer of a given language. This first tries to
     * create a {@link TokenizerME} instance if the required
     * {@link TokenizerModel} for the passed language is available. If such a
     * model is not available, it returns the {@link SimpleTokenizer} instance.
     * @param language the language or <code>null</code> to get the
     * {@link SimpleTokenizer}
     * @return the {@link Tokenizer} for the passed language.
     */
public Tokenizer getTokenizer(String language) {
    Tokenizer tokenizer = null;
    if (language != null) {
        try {
            TokenizerModel model = getTokenizerModel(language);
            if (model != null) {
                tokenizer = new TokenizerME(model);
            }
        } catch (IOException e) {
            // InvalidFormatException is a subclass of IOException, so one handler covers both
            log.warn("Unable to load Tokenizer Model for " + language + ": Will use Simple Tokenizer instead", e);
        }
    }
    if (tokenizer == null) {
        log.debug("Use Simple Tokenizer for language {}", language);
        tokenizer = SimpleTokenizer.INSTANCE;
    } else {
        log.debug("Use ME Tokenizer for language {}", language);
    }
    return tokenizer;
}
Also used : TokenizerME(opennlp.tools.tokenize.TokenizerME) IOException(java.io.IOException) Tokenizer(opennlp.tools.tokenize.Tokenizer) SimpleTokenizer(opennlp.tools.tokenize.SimpleTokenizer) TokenizerModel(opennlp.tools.tokenize.TokenizerModel) InvalidFormatException(opennlp.tools.util.InvalidFormatException)
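
The pattern in Example 7 (try a language-specific model first, then fall back to a default tokenizer when the language is null, the model is missing, or loading fails) can be sketched without any library dependency. Everything named here (TokenizerFallbackSketch, Tok, Loader, SAMPLE_LOADER) is hypothetical and only illustrates the control flow; none of it is OpenNLP or Stanbol API.

```java
import java.io.IOException;

// Hypothetical types for illustration only; none of these names come from OpenNLP or Stanbol.
class TokenizerFallbackSketch {

    interface Tok {
        String[] tokenize(String text);
    }

    /** Loads a language-specific tokenizer; returns null when no model exists. */
    interface Loader {
        Tok load(String language) throws IOException;
    }

    /** Default used when nothing better is available: split on whitespace. */
    static final Tok SIMPLE = text -> {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    };

    /** Stand-in for a model-backed tokenizer: also splits periods off as separate tokens. */
    static final Tok ENGLISH = text -> SIMPLE.tokenize(text.replace(".", " ."));

    /** Sample loader: knows "en", fails for "broken", has no model for anything else. */
    static final Loader SAMPLE_LOADER = language -> {
        if ("broken".equals(language)) {
            throw new IOException("model file is corrupt");
        }
        return "en".equals(language) ? ENGLISH : null;
    };

    static Tok getTokenizer(String language, Loader loader) {
        Tok tokenizer = null;
        if (language != null) {
            try {
                tokenizer = loader.load(language);
            } catch (IOException e) {
                // A failed load is not fatal: report it and fall through to the default.
                System.err.println("Unable to load model for " + language + ": " + e.getMessage());
            }
        }
        return tokenizer != null ? tokenizer : SIMPLE;
    }

    public static void main(String[] args) {
        for (String language : new String[] {"en", "ru", "broken", null}) {
            Tok tokenizer = getTokenizer(language, SAMPLE_LOADER);
            System.out.println(language + " -> " + (tokenizer == SIMPLE ? "simple" : "model-backed"));
        }
    }
}
```

Because the method never returns null, callers such as the tests in Examples 8 and 9 can assert on the returned instance directly instead of handling a missing tokenizer.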

Example 8 with Tokenizer

use of opennlp.tools.tokenize.Tokenizer in project stanbol by apache.

the class OpenNLPTest method testLoadEnTokenizer.

@Test
public void testLoadEnTokenizer() throws IOException {
    TokenizerModel model = openNLP.getTokenizerModel("en");
    Assert.assertNotNull(model);
    Tokenizer tokenizer = openNLP.getTokenizer("en");
    Assert.assertNotNull(tokenizer);
}
Also used : TokenizerModel(opennlp.tools.tokenize.TokenizerModel) Tokenizer(opennlp.tools.tokenize.Tokenizer) SimpleTokenizer(opennlp.tools.tokenize.SimpleTokenizer) Test(org.junit.Test)

Example 9 with Tokenizer

use of opennlp.tools.tokenize.Tokenizer in project stanbol by apache.

the class OpenNLPTest method testFallbackToSimpleTokenizer.

@Test
public void testFallbackToSimpleTokenizer() throws IOException {
    //for the tokenizer a fallback to the SimpleTokenizer is expected
    //when no model is available for the requested language
    Tokenizer tokenizer = openNLP.getTokenizer("ru");
    Assert.assertNotNull(tokenizer);
    Assert.assertEquals(SimpleTokenizer.INSTANCE, tokenizer);
}
Also used : Tokenizer(opennlp.tools.tokenize.Tokenizer) SimpleTokenizer(opennlp.tools.tokenize.SimpleTokenizer) Test(org.junit.Test)

Example 10 with Tokenizer

use of opennlp.tools.tokenize.Tokenizer in project textdb by TextDB.

the class POSTagexample method Tokenize.

public static String[] Tokenize(String sentence) throws InvalidFormatException, IOException {
    // try-with-resources closes the model stream even if loading or tokenizing fails
    try (InputStream is = new FileInputStream("./src/main/java/edu/uci/ics/texera/sandbox/OpenNLPexample/en-token.bin")) {
        TokenizerModel model = new TokenizerModel(is);
        Tokenizer tokenizer = new TokenizerME(model);
        return tokenizer.tokenize(sentence);
    }
}
Also used : FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) TokenizerME(opennlp.tools.tokenize.TokenizerME) TokenizerModel(opennlp.tools.tokenize.TokenizerModel) Tokenizer(opennlp.tools.tokenize.Tokenizer)

Aggregations

Tokenizer (opennlp.tools.tokenize.Tokenizer) 10
TokenizerModel (opennlp.tools.tokenize.TokenizerModel) 6
TokenizerME (opennlp.tools.tokenize.TokenizerME) 5
FileInputStream (java.io.FileInputStream) 4
InputStream (java.io.InputStream) 4
SimpleTokenizer (opennlp.tools.tokenize.SimpleTokenizer) 4
ArrayList (java.util.ArrayList) 2
Token (org.apache.stanbol.enhancer.nlp.model.Token) 2
Test (org.junit.Test) 2
IOException (java.io.IOException) 1
LinkedHashMap (java.util.LinkedHashMap) 1
List (java.util.List) 1
NameFinderME (opennlp.tools.namefind.NameFinderME) 1
SentenceDetectorME (opennlp.tools.sentdetect.SentenceDetectorME) 1
InvalidFormatException (opennlp.tools.util.InvalidFormatException) 1
Span (opennlp.tools.util.Span) 1
AnalysedText (org.apache.stanbol.enhancer.nlp.model.AnalysedText) 1
Section (org.apache.stanbol.enhancer.nlp.model.Section) 1
Span (org.apache.stanbol.enhancer.nlp.model.Span) 1
NerTag (org.apache.stanbol.enhancer.nlp.ner.NerTag) 1