Example 26 with AnalysedText

use of org.apache.stanbol.enhancer.nlp.model.AnalysedText in project stanbol by apache.

the class OpenNlpTokenizerEngine method computeEnhancements.

/**
     * Compute enhancements for supplied ContentItem. The results of the process
     * are expected to be stored in the metadata of the content item.
     * <p/>
     * The client (usually an {@link org.apache.stanbol.enhancer.servicesapi.EnhancementJobManager}) should take care of
     * persistent storage of the enhanced {@link org.apache.stanbol.enhancer.servicesapi.ContentItem}.
     * <p/>
     * This method creates a new POSContentPart using {@link org.apache.stanbol.enhancer.engines.pos.api.POSTaggerHelper#createContentPart} from a text/plain part and
     * stores it as a new part in the content item. The metadata is not changed.
     *
     * @throws org.apache.stanbol.enhancer.servicesapi.EngineException
     *          if the underlying process failed to work as
     *          expected
     */
@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
    AnalysedText at = initAnalysedText(this, analysedTextFactory, ci);
    String language = getLanguage(this, ci, true);
    Tokenizer tokenizer = getTokenizer(language);
    if (tokenizer == null) {
        log.warn("Tokenizer for language {} is no longer available. " + "This might happen if the model becomes unavailable during enhancement. " + "If this happens more often it might also indicate a bug in the used " + "EnhancementJobManager implementation, as the availability is also checked " + "in the canEnhance(..) method of this Enhancement Engine.", language);
        return;
    }
    //Try to use sentences for tokenizing
    Iterator<? extends Section> sections = at.getSentences();
    if (!sections.hasNext()) {
        //if no sentences are annotated
        sections = Collections.singleton(at).iterator();
    }
    //for all sentences (or the whole Text - if no sentences available)
    while (sections.hasNext()) {
        Section section = sections.next();
        //Tokenize section
        opennlp.tools.util.Span[] tokenSpans = tokenizer.tokenizePos(section.getSpan());
        for (int i = 0; i < tokenSpans.length; i++) {
            Token token = section.addToken(tokenSpans[i].getStart(), tokenSpans[i].getEnd());
            log.trace(" > add {}", token);
        }
    }
}
Also used : AnalysedText(org.apache.stanbol.enhancer.nlp.model.AnalysedText) NlpEngineHelper.initAnalysedText(org.apache.stanbol.enhancer.nlp.utils.NlpEngineHelper.initAnalysedText) Token(org.apache.stanbol.enhancer.nlp.model.Token) Tokenizer(opennlp.tools.tokenize.Tokenizer) SimpleTokenizer(opennlp.tools.tokenize.SimpleTokenizer) Section(org.apache.stanbol.enhancer.nlp.model.Section)
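
The tokenizePos(..) call above returns OpenNLP Span objects whose start/end offsets are relative to the string that was tokenized, which is why the engine tokenizes one Section (sentence) at a time. A minimal, self-contained sketch of that call, using the rule-based SimpleTokenizer as a stand-in for whatever model-backed Tokenizer the engine resolves via getTokenizer(language):

import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class TokenizePosSketch {

    public static void main(String[] args) {
        String sentence = "Apache Stanbol enhances content items.";
        //SimpleTokenizer is a rule-based fallback; a model-backed TokenizerME
        //is driven through the same Tokenizer interface
        Span[] spans = SimpleTokenizer.INSTANCE.tokenizePos(sentence);
        for (Span span : spans) {
            //offsets are relative to the tokenized input string
            System.out.printf("[%d,%d) %s%n", span.getStart(), span.getEnd(),
                span.getCoveredText(sentence));
        }
    }
}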

Example 27 with AnalysedText

use of org.apache.stanbol.enhancer.nlp.model.AnalysedText in project stanbol by apache.

the class KuromojiNlpEngine method computeEnhancements.

/**
     * Compute enhancements for supplied ContentItem. The results of the process
     * are expected to be stored in the metadata of the content item.
     * <p/>
     * The client (usually an {@link org.apache.stanbol.enhancer.servicesapi.EnhancementJobManager}) should take care of
     * persistent storage of the enhanced {@link org.apache.stanbol.enhancer.servicesapi.ContentItem}.
     * <p/>
     * This method creates a new POSContentPart using {@link org.apache.stanbol.enhancer.engines.pos.api.POSTaggerHelper#createContentPart} from a text/plain part and
     * stores it as a new part in the content item. The metadata is not changed.
     *
     * @throws org.apache.stanbol.enhancer.servicesapi.EngineException
     *          if the underlying process failed to work as
     *          expected
     */
@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
    final AnalysedText at = initAnalysedText(this, analysedTextFactory, ci);
    String language = getLanguage(this, ci, false);
    if (!("ja".equals(language) || (language != null && language.startsWith("ja-")))) {
        throw new IllegalStateException("The detected language is NOT 'ja'! " + "As this is also checked within the #canEnhance(..) method this " + "indicates a bug in the used EnhancementJobManager implementation. " + "Please report this on dev@stanbol.apache.org or create a " + "JIRA issue about this.");
    }
    //start with the Tokenizer
    TokenStream tokenStream = tokenizerFactory.create(new CharSequenceReader(at.getText()));
    //build the analyzing chain by adding all TokenFilters
    for (TokenFilterFactory filterFactory : filterFactories) {
        tokenStream = filterFactory.create(tokenStream);
    }
    //Try to extract sentences based on POS tags ...
    int sentStartOffset = -1;
    //NER data
    List<NerData> nerList = new ArrayList<NerData>();
    //the next index where the NerData.context needs to be set
    int nerSentIndex = 0;
    NerData ner = null;
    OffsetAttribute offset = null;
    try {
        //required with Solr 4
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            offset = tokenStream.addAttribute(OffsetAttribute.class);
            Token token = at.addToken(offset.startOffset(), offset.endOffset());
            //Get the POS attribute and init the PosTag
            PartOfSpeechAttribute posAttr = tokenStream.addAttribute(PartOfSpeechAttribute.class);
            PosTag posTag = POS_TAG_SET.getTag(posAttr.getPartOfSpeech());
            if (posTag == null) {
                posTag = adhocTags.get(posAttr.getPartOfSpeech());
                if (posTag == null) {
                    posTag = new PosTag(posAttr.getPartOfSpeech());
                    adhocTags.put(posAttr.getPartOfSpeech(), posTag);
                    log.warn(" ... missing PosTag mapping for {}", posAttr.getPartOfSpeech());
                }
            }
            //Sentence detection by POS tag
            if (sentStartOffset < 0) {
                //no sentence is open: the previous token ended one (or this is
                //the first token), so this token starts a new sentence
                sentStartOffset = offset.startOffset();
            }
            if (posTag.hasPos(Pos.Point)) {
                Sentence sent = at.addSentence(sentStartOffset, offset.startOffset());
                //add the sentence as context to the NerData instances
                while (nerSentIndex < nerList.size()) {
                    nerList.get(nerSentIndex).context = sent.getSpan();
                    nerSentIndex++;
                }
                sentStartOffset = -1;
            }
            //POS
            token.addAnnotation(POS_ANNOTATION, Value.value(posTag));
            //NER
            NerTag nerTag = NER_TAG_SET.getTag(posAttr.getPartOfSpeech());
            if (ner != null && (nerTag == null || !ner.tag.getType().equals(nerTag.getType()))) {
                //write NER annotation
                Chunk chunk = at.addChunk(ner.start, ner.end);
                chunk.addAnnotation(NlpAnnotations.NER_ANNOTATION, Value.value(ner.tag));
                //NOTE that the fise:TextAnnotations are written later, based on the nerList
                //clean up
                ner = null;
            }
            if (nerTag != null) {
                if (ner == null) {
                    ner = new NerData(nerTag, offset.startOffset());
                    nerList.add(ner);
                }
                ner.end = offset.endOffset();
            }
            BaseFormAttribute baseFormAttr = tokenStream.addAttribute(BaseFormAttribute.class);
            MorphoFeatures morpho = null;
            if (baseFormAttr != null && baseFormAttr.getBaseForm() != null) {
                morpho = new MorphoFeatures(baseFormAttr.getBaseForm());
                //and add the posTag
                morpho.addPos(posTag);
            }
            InflectionAttribute inflectionAttr = tokenStream.addAttribute(InflectionAttribute.class);
            //inflection form and type are read but currently not mapped to any annotation
            inflectionAttr.getInflectionForm();
            inflectionAttr.getInflectionType();
            if (morpho != null) {
                //if present add the morpho
                token.addAnnotation(MORPHO_ANNOTATION, Value.value(morpho));
            }
        }
        //we still need to write the last sentence
        Sentence lastSent = null;
        if (offset != null && sentStartOffset >= 0 && offset.endOffset() > sentStartOffset) {
            lastSent = at.addSentence(sentStartOffset, offset.endOffset());
        }
        //and set the context of the remaining named entities
        while (nerSentIndex < nerList.size()) {
            if (lastSent != null) {
                nerList.get(nerSentIndex).context = lastSent.getSpan();
            } else {
                //no sentence detected
                nerList.get(nerSentIndex).context = at.getSpan();
            }
            nerSentIndex++;
        }
    } catch (IOException e) {
        throw new EngineException(this, ci, "Exception while reading from " + "AnalyzedText contentpart", e);
    } finally {
        try {
            tokenStream.close();
        } catch (IOException e) {
        /* ignore */
        }
    }
    //finally write the NER annotations to the metadata of the ContentItem
    final Graph metadata = ci.getMetadata();
    ci.getLock().writeLock().lock();
    try {
        Language lang = new Language("ja");
        for (NerData nerData : nerList) {
            IRI ta = EnhancementEngineHelper.createTextEnhancement(ci, this);
            metadata.add(new TripleImpl(ta, ENHANCER_SELECTED_TEXT, new PlainLiteralImpl(at.getSpan().substring(nerData.start, nerData.end), lang)));
            metadata.add(new TripleImpl(ta, DC_TYPE, nerData.tag.getType()));
            metadata.add(new TripleImpl(ta, ENHANCER_START, lf.createTypedLiteral(nerData.start)));
            metadata.add(new TripleImpl(ta, ENHANCER_END, lf.createTypedLiteral(nerData.end)));
            metadata.add(new TripleImpl(ta, ENHANCER_SELECTION_CONTEXT, new PlainLiteralImpl(nerData.context, lang)));
        }
    } finally {
        ci.getLock().writeLock().unlock();
    }
}
Also used : NerTag(org.apache.stanbol.enhancer.nlp.ner.NerTag) IRI(org.apache.clerezza.commons.rdf.IRI) TokenStream(org.apache.lucene.analysis.TokenStream) ArrayList(java.util.ArrayList) EngineException(org.apache.stanbol.enhancer.servicesapi.EngineException) Token(org.apache.stanbol.enhancer.nlp.model.Token) NlpEngineHelper.initAnalysedText(org.apache.stanbol.enhancer.nlp.utils.NlpEngineHelper.initAnalysedText) AnalysedText(org.apache.stanbol.enhancer.nlp.model.AnalysedText) CharSequenceReader(org.apache.commons.io.input.CharSequenceReader) PosTag(org.apache.stanbol.enhancer.nlp.pos.PosTag) Language(org.apache.clerezza.commons.rdf.Language) NlpEngineHelper.getLanguage(org.apache.stanbol.enhancer.nlp.utils.NlpEngineHelper.getLanguage) BaseFormAttribute(org.apache.lucene.analysis.ja.tokenattributes.BaseFormAttribute) TripleImpl(org.apache.clerezza.commons.rdf.impl.utils.TripleImpl) MorphoFeatures(org.apache.stanbol.enhancer.nlp.morpho.MorphoFeatures) Sentence(org.apache.stanbol.enhancer.nlp.model.Sentence) InflectionAttribute(org.apache.lucene.analysis.ja.tokenattributes.InflectionAttribute) PlainLiteralImpl(org.apache.clerezza.commons.rdf.impl.utils.PlainLiteralImpl) PartOfSpeechAttribute(org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute) IOException(java.io.IOException) Chunk(org.apache.stanbol.enhancer.nlp.model.Chunk) TokenFilterFactory(org.apache.lucene.analysis.util.TokenFilterFactory) Graph(org.apache.clerezza.commons.rdf.Graph) OffsetAttribute(org.apache.lucene.analysis.tokenattributes.OffsetAttribute)
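
The engine above consumes the analysis chain through the standard Lucene TokenStream contract: reset(), then incrementToken() until it returns false, then end() and close(), reading per-token state from attributes such as OffsetAttribute and the Kuromoji-specific PartOfSpeechAttribute. A minimal sketch of that pattern, assuming the Lucene 4.x-era JapaneseTokenizer constructor that takes a Reader (newer Lucene versions construct without a Reader and call setReader(..) instead):

import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class KuromojiStreamSketch {

    public static void main(String[] args) throws Exception {
        //Lucene 4.x style: (input, user dictionary, discardPunctuation, mode)
        JapaneseTokenizer stream = new JapaneseTokenizer(
            new StringReader("今日は良い天気です。"), null, true,
            JapaneseTokenizer.Mode.NORMAL);
        //attributes are registered once; each incrementToken() call updates them
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
        PartOfSpeechAttribute pos = stream.addAttribute(PartOfSpeechAttribute.class);
        stream.reset();
        try {
            while (stream.incrementToken()) {
                System.out.printf("%s [%d,%d] %s%n", term,
                    offset.startOffset(), offset.endOffset(),
                    pos.getPartOfSpeech());
            }
            stream.end();
        } finally {
            stream.close();
        }
    }
}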

Example 28 with AnalysedText

use of org.apache.stanbol.enhancer.nlp.model.AnalysedText in project stanbol by apache.

the class TestKuromojiNlpEngine method testEngine.

@Test
public void testEngine() throws EngineException {
    LiteralFactory lf = LiteralFactory.getInstance();
    Assert.assertEquals(EnhancementEngine.ENHANCE_ASYNC, engine.canEnhance(contentItem));
    engine.computeEnhancements(contentItem);
    //assert the results
    Map<IRI, RDFTerm> expected = new HashMap<IRI, RDFTerm>();
    expected.put(Properties.DC_CREATOR, lf.createTypedLiteral(engine.getClass().getName()));
    expected.put(Properties.ENHANCER_EXTRACTED_FROM, contentItem.getUri());
    Assert.assertEquals(16, EnhancementStructureHelper.validateAllTextAnnotations(contentItem.getMetadata(), text, expected));
    AnalysedText at = AnalysedTextUtils.getAnalysedText(contentItem);
    Assert.assertNotNull(at);
    List<Sentence> sentences = AnalysedTextUtils.asList(at.getSentences());
    Assert.assertNotNull(sentences);
    Assert.assertEquals(7, sentences.size());
    //TODO: values in the following arrays are based on the first run of the
    // engine, so they can only detect changes in the results. They cannot
    // validate that the tokenization and NER detections are correct - sorry,
    // I do not speak Japanese ...
    int[] expectedChunks = new int[] { 5, 3, 1, 0, 1, 2, 4 };
    int[] expectedTokens = new int[] { 25, 25, 25, 24, 33, 17, 32 };
    int sentIndex = 0;
    for (Sentence sent : sentences) {
        List<Chunk> sentenceNer = AnalysedTextUtils.asList(sent.getChunks());
        Assert.assertEquals(expectedChunks[sentIndex], sentenceNer.size());
        for (Chunk chunk : sentenceNer) {
            Value<NerTag> nerValue = chunk.getAnnotation(NlpAnnotations.NER_ANNOTATION);
            Assert.assertNotNull(nerValue);
            Assert.assertNotNull(nerValue.value().getType());
        }
        List<Token> tokens = AnalysedTextUtils.asList(sent.getTokens());
        Assert.assertEquals(expectedTokens[sentIndex], tokens.size());
        for (Token token : tokens) {
            Value<PosTag> posValue = token.getAnnotation(NlpAnnotations.POS_ANNOTATION);
            Assert.assertNotNull(posValue);
        }
        sentIndex++;
    }
}
Also used : IRI(org.apache.clerezza.commons.rdf.IRI) NerTag(org.apache.stanbol.enhancer.nlp.ner.NerTag) HashMap(java.util.HashMap) RDFTerm(org.apache.clerezza.commons.rdf.RDFTerm) Token(org.apache.stanbol.enhancer.nlp.model.Token) Chunk(org.apache.stanbol.enhancer.nlp.model.Chunk) LiteralFactory(org.apache.clerezza.rdf.core.LiteralFactory) AnalysedText(org.apache.stanbol.enhancer.nlp.model.AnalysedText) PosTag(org.apache.stanbol.enhancer.nlp.pos.PosTag) Sentence(org.apache.stanbol.enhancer.nlp.model.Sentence) Test(org.junit.Test)

Example 29 with AnalysedText

use of org.apache.stanbol.enhancer.nlp.model.AnalysedText in project stanbol by apache.

the class EntityLinkingEngine method canEnhance.

@Override
public int canEnhance(ContentItem ci) throws EngineException {
    log.trace("canEnhance {}", ci.getUri());
    if (isOfflineMode() && !entitySearcher.supportsOfflineMode()) {
        log.warn("{} '{}' is inactive because EntitySearcher does not support Offline mode!", getClass().getSimpleName(), getName());
        return CANNOT_ENHANCE;
    }
    String language = getLanguage(this, ci, false);
    if (language == null || textProcessingConfig.getConfiguration(language) == null) {
        log.debug("Engine {} ignores ContentItem {} because language {} is not configured.", new Object[] { getName(), ci.getUri(), language });
        return CANNOT_ENHANCE;
    }
    //we need a detected language and the AnalysedText content part
    //with Tokens.
    AnalysedText at = getAnalysedText(this, ci, false);
    return at != null && at.getTokens().hasNext() ? ENHANCE_ASYNC : CANNOT_ENHANCE;
}
Also used : NlpEngineHelper.getAnalysedText(org.apache.stanbol.enhancer.nlp.utils.NlpEngineHelper.getAnalysedText) AnalysedText(org.apache.stanbol.enhancer.nlp.model.AnalysedText)
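
The int returned by canEnhance(..) drives the calling side: CANNOT_ENHANCE skips the engine, while ENHANCE_SYNCHRONOUS and ENHANCE_ASYNC differ only in how the subsequent computeEnhancements(..) call is scheduled. A sketch of that dispatch, illustrative only and not the actual EnhancementJobManager logic:

import org.apache.stanbol.enhancer.servicesapi.ContentItem;
import org.apache.stanbol.enhancer.servicesapi.EngineException;
import org.apache.stanbol.enhancer.servicesapi.EnhancementEngine;

public class EngineDispatchSketch {

    //hypothetical helper, not the real EnhancementJobManager
    static void enhance(EnhancementEngine engine, ContentItem ci) throws EngineException {
        if (engine.canEnhance(ci) == EnhancementEngine.CANNOT_ENHANCE) {
            //engine is not applicable (e.g. unconfigured language or missing AnalysedText)
            return;
        }
        //ENHANCE_SYNCHRONOUS vs ENHANCE_ASYNC only affects scheduling;
        //both paths end in a computeEnhancements(..) call
        engine.computeEnhancements(ci);
    }
}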

Example 30 with AnalysedText

use of org.apache.stanbol.enhancer.nlp.model.AnalysedText in project stanbol by apache.

the class EntityCoReferenceEngine method extractNersAndNounPhrases.

/**
     * Extracts the NERs and the noun phrases from the given text and puts them in the given lists.
     * 
     * @param ci
     * @param ners
     * @param nounPhrases
     */
private void extractNersAndNounPhrases(ContentItem ci, Map<Integer, List<Span>> ners, List<NounPhrase> nounPhrases) {
    AnalysedText at = NlpEngineHelper.getAnalysedText(this, ci, true);
    Iterator<? extends Section> sections = at.getSentences();
    if (!sections.hasNext()) {
        // process as single sentence
        sections = Collections.singleton(at).iterator();
    }
    int sentenceCnt = 0;
    while (sections.hasNext()) {
        sentenceCnt++;
        Section section = sections.next();
        List<NounPhrase> sectionNounPhrases = new ArrayList<NounPhrase>();
        List<Span> sectionNers = new ArrayList<Span>();
        Iterator<Span> chunks = section.getEnclosed(EnumSet.of(SpanTypeEnum.Chunk));
        while (chunks.hasNext()) {
            Span chunk = chunks.next();
            Value<NerTag> ner = chunk.getAnnotation(NlpAnnotations.NER_ANNOTATION);
            if (ner != null) {
                sectionNers.add(chunk);
            }
            Value<PhraseTag> phrase = chunk.getAnnotation(NlpAnnotations.PHRASE_ANNOTATION);
            if (phrase != null && phrase.value().getCategory() == LexicalCategory.Noun) {
                sectionNounPhrases.add(new NounPhrase(chunk, sentenceCnt));
            }
        }
        for (NounPhrase nounPhrase : sectionNounPhrases) {
            Iterator<Span> tokens = section.getEnclosed(EnumSet.of(SpanTypeEnum.Token));
            while (tokens.hasNext()) {
                Span token = tokens.next();
                if (nounPhrase.containsSpan(token)) {
                    nounPhrase.addToken(token);
                }
            }
            for (Span sectionNer : sectionNers) {
                if (nounPhrase.containsSpan(sectionNer)) {
                    nounPhrase.addNerChunk(sectionNer);
                }
            }
        }
        nounPhrases.addAll(sectionNounPhrases);
        if (!sectionNers.isEmpty()) {
            ners.put(sentenceCnt, sectionNers);
        }
    }
}
Also used : NerTag(org.apache.stanbol.enhancer.nlp.ner.NerTag) NounPhrase(org.apache.stanbol.enhancer.engines.entitycoreference.datamodel.NounPhrase) ArrayList(java.util.ArrayList) PhraseTag(org.apache.stanbol.enhancer.nlp.phrase.PhraseTag) Section(org.apache.stanbol.enhancer.nlp.model.Section) Span(org.apache.stanbol.enhancer.nlp.model.Span) AnalysedText(org.apache.stanbol.enhancer.nlp.model.AnalysedText)

Aggregations

AnalysedText (org.apache.stanbol.enhancer.nlp.model.AnalysedText) 32
EngineException (org.apache.stanbol.enhancer.servicesapi.EngineException) 15
Token (org.apache.stanbol.enhancer.nlp.model.Token) 13
NlpEngineHelper.getAnalysedText (org.apache.stanbol.enhancer.nlp.utils.NlpEngineHelper.getAnalysedText) 13
IOException (java.io.IOException) 9
IRI (org.apache.clerezza.commons.rdf.IRI) 9
TripleImpl (org.apache.clerezza.commons.rdf.impl.utils.TripleImpl) 8
Sentence (org.apache.stanbol.enhancer.nlp.model.Sentence) 8
PosTag (org.apache.stanbol.enhancer.nlp.pos.PosTag) 8
Test (org.junit.Test) 7
Graph (org.apache.clerezza.commons.rdf.Graph) 6
NlpEngineHelper.initAnalysedText (org.apache.stanbol.enhancer.nlp.utils.NlpEngineHelper.initAnalysedText) 6
Language (org.apache.clerezza.commons.rdf.Language) 5
PlainLiteralImpl (org.apache.clerezza.commons.rdf.impl.utils.PlainLiteralImpl) 5
Section (org.apache.stanbol.enhancer.nlp.model.Section) 5
Span (org.apache.stanbol.enhancer.nlp.model.Span) 5
NerTag (org.apache.stanbol.enhancer.nlp.ner.NerTag) 5
ArrayList (java.util.ArrayList) 4
Chunk (org.apache.stanbol.enhancer.nlp.model.Chunk) 4
Value (org.apache.stanbol.enhancer.nlp.model.annotation.Value) 4