use of opennlp.tools.tokenize.Tokenizer in project textdb by TextDB.
In class POSTagexample, method Tokenize:
/**
 * Tokenizes the given sentence using the OpenNLP maximum-entropy tokenizer.
 *
 * @param sentence the text to tokenize
 * @return the tokens of {@code sentence} in order
 * @throws InvalidFormatException if the model file is not a valid tokenizer model
 * @throws IOException if the model file cannot be read
 */
public static String[] Tokenize(String sentence) throws InvalidFormatException, IOException {
    // try-with-resources closes the model stream even when TokenizerModel's
    // constructor or tokenize(..) throws; the original only closed it on the
    // success path and leaked the stream on any exception.
    try (InputStream is = new FileInputStream("./src/main/java/edu/uci/ics/textdb/sandbox/OpenNLPexample/en-token.bin")) {
        TokenizerModel model = new TokenizerModel(is);
        Tokenizer tokenizer = new TokenizerME(model);
        return tokenizer.tokenize(sentence);
    }
}
use of opennlp.tools.tokenize.Tokenizer in project stanbol by apache.
In class NEREngineCore, method extractNameOccurrences:
/**
 * Extracts named entities from {@code text} with the given OpenNLP name-finder
 * model and groups them by surface form.
 *
 * @param nameFinderModel the trained NER model to apply
 * @param text the plain text to analyse
 * @param language the language used to pick the tokenizer (sentence splitting
 *        always uses the "en" model — see below)
 * @return map from entity name to its occurrences, each with absolute character
 *         offsets into {@code text} and a three-sentence context snippet
 */
protected Map<String, List<NameOccurrence>> extractNameOccurrences(TokenNameFinderModel nameFinderModel, String text, String language) {
    // version with explicit sentence endings to reflect heading / paragraph
    // structure of an HTML or PDF document converted to text.
    // "\n\n" and ".\n" are both two characters, so this replacement is
    // length-preserving: spans detected on textWithDots remain valid
    // offsets into the original text.
    String textWithDots = text.replaceAll("\\n\\n", ".\n");
    // NOTE(review): assumes removeNonUtf8CompliantCharacters is also
    // length-preserving (replaces bad characters in place) — otherwise the
    // sentence spans below would not line up with 'text'; confirm.
    text = removeNonUtf8CompliantCharacters(text);
    // sentence detection is hard-wired to the English model regardless of
    // 'language'; only tokenization below is language-specific
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(getSentenceModel("en"));
    Span[] sentenceSpans = sentenceDetector.sentPosDetect(textWithDots);
    NameFinderME finder = new NameFinderME(nameFinderModel);
    Tokenizer tokenizer = openNLP.getTokenizer(language);
    // LinkedHashMap keeps entities in first-seen order
    Map<String, List<NameOccurrence>> nameOccurrences = new LinkedHashMap<String, List<NameOccurrence>>();
    for (int i = 0; i < sentenceSpans.length; i++) {
        String sentence = sentenceSpans[i].getCoveredText(text).toString().trim();
        // build a context by concatenating three sentences (previous, current,
        // next) to be used for similarity ranking / disambiguation and as the
        // contextual snippet in the extraction structure
        List<String> contextElements = new ArrayList<String>();
        if (i > 0) {
            CharSequence previousSentence = sentenceSpans[i - 1].getCoveredText(text);
            contextElements.add(previousSentence.toString().trim());
        }
        contextElements.add(sentence.trim());
        if (i + 1 < sentenceSpans.length) {
            CharSequence nextSentence = sentenceSpans[i + 1].getCoveredText(text);
            contextElements.add(nextSentence.toString().trim());
        }
        String context = StringUtils.join(contextElements, " ");
        // extract the names in the current sentence and store them together
        // with the current context
        Span[] tokenSpans = tokenizer.tokenizePos(sentence);
        String[] tokens = Span.spansToStrings(tokenSpans, sentence);
        Span[] nameSpans = finder.find(tokens);
        double[] probs = finder.probs();
        for (int j = 0; j < nameSpans.length; j++) {
            // map the token-level name span back to character offsets within
            // the sentence to recover the exact surface form
            String name = sentence.substring(tokenSpans[nameSpans[j].getStart()].getStart(), tokenSpans[nameSpans[j].getEnd() - 1].getEnd());
            // NOTE: With OpenNLP 1.6 the probability is now stored in the span
            double prob = nameSpans[j].getProb();
            // prob == 0.0 := unspecified
            Double confidence = prob != 0.0 ? Double.valueOf(prob) : null;
            if (confidence == null) {
                // fall back to the per-token probabilities if the span does
                // not carry one. BUGFIX: start the product at 1.0 — the
                // original multiplied into 'prob', which is known to be 0.0
                // in this branch, so the fallback confidence was always 0.0.
                prob = 1.0d;
                for (int k = nameSpans[j].getStart(); k < nameSpans[j].getEnd(); k++) {
                    prob *= probs[k];
                }
                confidence = Double.valueOf(prob);
            } else if (confidence < 0.5d) {
                // It looks like as if perceptron based models do return
                // invalid probabilities. As it is expected that Named Entities
                // with a probability < 50% are not even returned by finder.find(..)
                // we will just ignore confidence values < 0.5 here
                confidence = null;
            }
            // convert the sentence-relative start into an absolute offset
            int start = tokenSpans[nameSpans[j].getStart()].getStart();
            int absoluteStart = sentenceSpans[i].getStart() + start;
            int absoluteEnd = absoluteStart + name.length();
            NerTag nerTag = config.getNerTag(nameSpans[j].getType());
            NameOccurrence occurrence = new NameOccurrence(name, absoluteStart, absoluteEnd, nerTag.getType(), context, confidence);
            List<NameOccurrence> occurrences = nameOccurrences.get(name);
            if (occurrences == null) {
                occurrences = new ArrayList<NameOccurrence>();
            }
            occurrences.add(occurrence);
            nameOccurrences.put(name, occurrences);
        }
    }
    // forget document-level adaptive data so the next document starts clean
    finder.clearAdaptiveData();
    log.debug("{} name occurrences found: {}", nameOccurrences.size(), nameOccurrences);
    return nameOccurrences;
}
use of opennlp.tools.tokenize.Tokenizer in project stanbol by apache.
In class OpenNlpPosTaggingEngine, method tokenize:
/**
 * Tokenizes the text of the given section, registering each token with the
 * section and returning the tokens in document order.
 *
 * @param section the section whose text is tokenized
 * @param language the language used to select the tokenizer
 * @return the tokens added to {@code section}
 */
private List<Token> tokenize(Section section, String language) {
    // fixed misspelled parameter name 'langauge' -> 'language'
    Tokenizer tokenizer = getTokenizer(language);
    String text = section.getSpan();
    // assume avr. token length is 5 — pre-size the list to avoid regrowth
    List<Token> tokens = new ArrayList<Token>(text.length() / 5);
    // reuse 'text' instead of calling section.getSpan() a second time
    opennlp.tools.util.Span[] tokenSpans = tokenizer.tokenizePos(text);
    for (int i = 0; i < tokenSpans.length; i++) {
        Token token = section.addToken(tokenSpans[i].getStart(), tokenSpans[i].getEnd());
        log.trace(" > add {}", token);
        tokens.add(token);
    }
    return tokens;
}
use of opennlp.tools.tokenize.Tokenizer in project textdb by TextDB.
In class NameFinderExample, method Tokenize:
/**
 * Tokenizes the given sentence using the OpenNLP maximum-entropy tokenizer.
 *
 * @param sentence the text to tokenize
 * @return the tokens of {@code sentence} in order
 * @throws InvalidFormatException if the model file is not a valid tokenizer model
 * @throws IOException if the model file cannot be read
 */
public static String[] Tokenize(String sentence) throws InvalidFormatException, IOException {
    // try-with-resources closes the model stream even when TokenizerModel's
    // constructor or tokenize(..) throws; the original only closed it on the
    // success path and leaked the stream on any exception.
    try (InputStream is = new FileInputStream("./src/main/java/edu/uci/ics/texera/sandbox/OpenNLPexample/en-token.bin")) {
        TokenizerModel model = new TokenizerModel(is);
        Tokenizer tokenizer = new TokenizerME(model);
        return tokenizer.tokenize(sentence);
    }
}
use of opennlp.tools.tokenize.Tokenizer in project textdb by TextDB.
In class NameFinderExample, method Tokenize:
/**
 * Tokenizes the given sentence using the OpenNLP maximum-entropy tokenizer.
 *
 * @param sentence the text to tokenize
 * @return the tokens of {@code sentence} in order
 * @throws InvalidFormatException if the model file is not a valid tokenizer model
 * @throws IOException if the model file cannot be read
 */
public static String[] Tokenize(String sentence) throws InvalidFormatException, IOException {
    // try-with-resources closes the model stream even when TokenizerModel's
    // constructor or tokenize(..) throws; the original only closed it on the
    // success path and leaked the stream on any exception.
    try (InputStream is = new FileInputStream("./src/main/java/edu/uci/ics/textdb/sandbox/OpenNLPexample/en-token.bin")) {
        TokenizerModel model = new TokenizerModel(is);
        Tokenizer tokenizer = new TokenizerME(model);
        return tokenizer.tokenize(sentence);
    }
}
Aggregations