use of opennlp.tools.tokenize.Tokenizer in project stanbol by apache.
the class OpenNlpTokenizerEngine method computeEnhancements.
/**
* Compute enhancements for supplied ContentItem. The results of the process
* are expected to be stored in the metadata of the content item.
* <p/>
* The client (usually an {@link org.apache.stanbol.enhancer.servicesapi.EnhancementJobManager}) should take care of
* persistent storage of the enhanced {@link org.apache.stanbol.enhancer.servicesapi.ContentItem}.
* <p/>
* This method tokenizes the text of the content item, sentence by sentence if
* sentences are annotated and otherwise as a whole, and adds the resulting
* tokens to the AnalysedText content part. The metadata is not changed.
*
* @throws org.apache.stanbol.enhancer.servicesapi.EngineException
* if the underlying process failed to work as
* expected
*/
@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
AnalysedText at = initAnalysedText(this, analysedTextFactory, ci);
String language = getLanguage(this, ci, true);
Tokenizer tokenizer = getTokenizer(language);
if (tokenizer == null) {
//pass the language so that the {} placeholder is actually filled in
log.warn("Tokenizer for language {} is no longer available. " + "This might happen if the model becomes unavailable during enhancement. " + "If this happens more often it might also indicate a bug in the used " + "EnhancementJobManager implementation as the availability is also checked " + "in the canEnhance(..) method of this Enhancement Engine.", language);
return;
}
//Try to use sentences for tokenizing
Iterator<? extends Section> sections = at.getSentences();
if (!sections.hasNext()) {
//if no sentences are annotated
sections = Collections.singleton(at).iterator();
}
//for all sentences (or the whole Text - if no sentences available)
while (sections.hasNext()) {
Section section = sections.next();
//Tokenize section
opennlp.tools.util.Span[] tokenSpans = tokenizer.tokenizePos(section.getSpan());
for (int i = 0; i < tokenSpans.length; i++) {
Token token = section.addToken(tokenSpans[i].getStart(), tokenSpans[i].getEnd());
log.trace(" > add {}", token);
}
}
}
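The loop above passes the offsets returned by tokenizePos straight to section.addToken, which suggests they are interpreted relative to the section text rather than the whole document. The following is a minimal, dependency-free sketch of that idea, not the OpenNLP or Stanbol implementation: SpanTokenizeSketch, its whitespace-based tokenizePos, and toAbsolute are hypothetical stand-ins used only to show how section-relative spans relate to offsets in the full text.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanTokenizeSketch {

    // Hypothetical stand-in for Tokenizer.tokenizePos: returns {start, end}
    // pairs (end exclusive) for each whitespace-separated token in the text.
    static List<int[]> tokenizePos(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means "not inside a token"
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    spans.add(new int[] {start, i});
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()});
        }
        return spans;
    }

    // Shifts section-relative spans by the section start to obtain offsets
    // into the full text.
    static List<int[]> toAbsolute(List<int[]> spans, int sectionStart) {
        List<int[]> shifted = new ArrayList<>();
        for (int[] s : spans) {
            shifted.add(new int[] {s[0] + sectionStart, s[1] + sectionStart});
        }
        return shifted;
    }

    public static void main(String[] args) {
        String text = "Hello world. Second sentence here.";
        int sectionStart = 13; // offset where the second sentence begins
        String section = text.substring(sectionStart);
        for (int[] s : toAbsolute(tokenizePos(section), sectionStart)) {
            System.out.println(s[0] + "-" + s[1] + ": " + text.substring(s[0], s[1]));
        }
    }
}
```

Working with spans instead of token strings keeps the link between each token and its position in the source text, which is what an annotation layer needs.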
use of opennlp.tools.tokenize.Tokenizer in project stanbol by apache.
the class OpenNLP method getTokenizer.
/**
* Getter for the Tokenizer of a given language. This first tries to
* create a {@link TokenizerME} instance if the required
* {@link TokenizerModel} for the given language is available. If such a
* model is not available, it returns the {@link SimpleTokenizer} instance.
* @param language the language or <code>null</code> to build a
* {@link SimpleTokenizer}
* @return the {@link Tokenizer} for the given language.
*/
public Tokenizer getTokenizer(String language) {
Tokenizer tokenizer = null;
if (language != null) {
try {
TokenizerModel model = getTokenizerModel(language);
if (model != null) {
tokenizer = new TokenizerME(model);
}
} catch (IOException e) {
//also covers InvalidFormatException, which is a subclass of IOException
log.warn("Unable to load Tokenizer Model for " + language + ": " + "Will use Simple Tokenizer instead", e);
}
}
if (tokenizer == null) {
log.debug("Use Simple Tokenizer for language {}", language);
tokenizer = SimpleTokenizer.INSTANCE;
} else {
log.debug("Use ME Tokenizer for language {}", language);
}
return tokenizer;
}
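The getter above never returns null: it prefers a language-specific tokenizer and otherwise hands back a shared fallback instance. Below is a minimal, dependency-free sketch of that lookup pattern. It does not use OpenNLP; TokenizerFallbackSketch, its registry map, and the SIMPLE fallback are hypothetical names that play the roles of the loaded models and of SimpleTokenizer.INSTANCE.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class TokenizerFallbackSketch {

    // Hypothetical shared fallback, playing the role of SimpleTokenizer.INSTANCE.
    static final Function<String, String[]> SIMPLE = text -> text.trim().split("\\s+");

    // Hypothetical registry of language-specific tokenizers, standing in for
    // tokenizers built from loaded models.
    final Map<String, Function<String, String[]>> byLanguage = new HashMap<>();

    void register(String language, Function<String, String[]> tokenizer) {
        byLanguage.put(language, tokenizer);
    }

    // Returns the language-specific tokenizer if one is registered, otherwise
    // the shared fallback. A null language also yields the fallback.
    Function<String, String[]> getTokenizer(String language) {
        Function<String, String[]> tokenizer =
                language == null ? null : byLanguage.get(language);
        return tokenizer != null ? tokenizer : SIMPLE;
    }

    public static void main(String[] args) {
        TokenizerFallbackSketch sketch = new TokenizerFallbackSketch();
        sketch.register("en", text -> text.split(" "));
        System.out.println(sketch.getTokenizer("en") == SIMPLE); // language-specific
        System.out.println(sketch.getTokenizer("ru") == SIMPLE); // fallback
    }
}
```

Because the fallback is a single shared instance, callers can compare by identity to detect that no language-specific tokenizer was found, much as the fallback test in this listing compares against SimpleTokenizer.INSTANCE.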
use of opennlp.tools.tokenize.Tokenizer in project stanbol by apache.
the class OpenNLPTest method testLoadEnTokenizer.
@Test
public void testLoadEnTokenizer() throws IOException {
TokenizerModel model = openNLP.getTokenizerModel("en");
Assert.assertNotNull(model);
Tokenizer tokenizer = openNLP.getTokenizer("en");
Assert.assertNotNull(tokenizer);
}
use of opennlp.tools.tokenize.Tokenizer in project stanbol by apache.
the class OpenNLPTest method testFallbackToSimpleTokenizer.
@Test
public void testFallbackToSimpleTokenizer() throws IOException {
//for a language without a tokenizer model ("ru" here) a fallback to the
//SimpleTokenizer is expected
Tokenizer tokenizer = openNLP.getTokenizer("ru");
Assert.assertNotNull(tokenizer);
Assert.assertEquals(SimpleTokenizer.INSTANCE, tokenizer);
}
use of opennlp.tools.tokenize.Tokenizer in project textdb by TextDB.
the class POSTagexample method Tokenize.
public static String[] Tokenize(String sentence) throws InvalidFormatException, IOException {
//try-with-resources closes the model stream even if loading or tokenizing fails
try (InputStream is = new FileInputStream("./src/main/java/edu/uci/ics/texera/sandbox/OpenNLPexample/en-token.bin")) {
TokenizerModel model = new TokenizerModel(is);
Tokenizer tokenizer = new TokenizerME(model);
return tokenizer.tokenize(sentence);
}
}