Examples with Chunk - org.apache.stanbol.enhancer.nlp.model.Chunk

Example 1 with Chunk

Use of org.apache.stanbol.enhancer.nlp.model.Chunk in the Apache Stanbol project.

Class PhraseBuilder, method buildPhrase.

//varargs with generic types
@SuppressWarnings("unchecked")
private void buildPhrase(Token token) {
    Token lastConsumedToken = null;
    if (valid) {
        //search backwards for the first token matching an allowed end
        //category
        int endIndex = current.size() - 1;
        while (endIndex > 0 && !checkCategories(current.get(endIndex), phraseType.getEndType())[0]) {
            endIndex--;
        }
        lastConsumedToken = current.get(endIndex);
        //NOTE: ignore phrases with a single token
        if (endIndex > 0) {
            Chunk chunk = chunkFactory.createChunk(current.get(0), lastConsumedToken);
            //TODO: add support for confidence
            chunk.addAnnotation(PHRASE_ANNOTATION, Value.value(phraseTag));
            if (log.isDebugEnabled()) {
                log.debug("  << add {} phrase {} '{}'", new Object[] { phraseType.getPhraseType().name(), chunk, chunk.getSpan() });
            }
        } else if (log.isDebugEnabled()) {
            log.debug("  >> ignore {} phrase with single {} ", phraseType.getPhraseType().name(), current.get(0));
        }
    } else if (!current.isEmpty() && log.isDebugEnabled()) {
        log.debug("  << ignore invalid {} phrase [{},{}]", new Object[] { phraseType.getPhraseType().name(), current.get(0).getStart(), current.get(current.size() - 1).getEnd() });
    }
    //cleanup
    current.clear();
    valid = false;
    if (token != null && !token.equals(lastConsumedToken)) {
        //the current token might be the start of a new phrase
        checkStart(token);
    }
}

Also used : Token(org.apache.stanbol.enhancer.nlp.model.Token) Chunk(org.apache.stanbol.enhancer.nlp.model.Chunk)
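
The backward search for the phrase end in buildPhrase can be tried out without the Stanbol types. The following is a minimal sketch, assuming each token is reduced to a plain category string and the allowed end categories to a Set; PhraseEndSketch and findEndIndex are hypothetical names for illustration, not part of Stanbol.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the end-token search in PhraseBuilder.buildPhrase.
final class PhraseEndSketch {

    /**
     * Walks backwards from the last token until a token with an allowed end
     * category is found. The loop never inspects index 0, so a result of 0
     * means "no later token qualifies"; the caller then ignores the phrase,
     * mirroring the endIndex > 0 check in the example above.
     */
    static int findEndIndex(List<String> categories, Set<String> allowedEnd) {
        int endIndex = categories.size() - 1;
        while (endIndex > 0 && !allowedEnd.contains(categories.get(endIndex))) {
            endIndex--;
        }
        return endIndex;
    }

    public static void main(String[] args) {
        Set<String> nounEnd = new HashSet<String>(Arrays.asList("Noun"));
        // Categories for something like "the big house of": the trailing
        // adposition is not an allowed end, so the phrase is cut before it.
        List<String> categories = Arrays.asList("Determiner", "Adjective", "Noun", "Adposition");
        System.out.println("end index: " + findEndIndex(categories, nounEnd));
    }
}
```

As in the original method, the sketch assumes a non-empty list; for an empty list it returns -1, which the caller would have to guard against.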

Example 2 with Chunk

Use of org.apache.stanbol.enhancer.nlp.model.Chunk in the Apache Stanbol project.

Class OpenNlpChunkingEngine, method computeEnhancements.

/**
 * Compute enhancements for supplied ContentItem. The results of the process
 * are expected to be stored in the metadata of the content item.
 * <p/>
 * The client (usually an {@link org.apache.stanbol.enhancer.servicesapi.EnhancementJobManager}) should take care of
 * persistent storage of the enhanced {@link org.apache.stanbol.enhancer.servicesapi.ContentItem}.
 *
 * @throws org.apache.stanbol.enhancer.servicesapi.EngineException
 *          if the underlying process failed to work as
 *          expected
 */
@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
    AnalysedText at = getAnalysedText(this, ci, true);
    String language = getLanguage(this, ci, true);
    isLangaugeConfigured(this, languageConfiguration, language, true);
    ChunkerME chunker = initChunker(language);
    if (chunker == null) {
        return;
    }
    //init the Phrase TagSet
    TagSet<PhraseTag> tagSet = tagSetRegistry.getTagSet(language);
    if (tagSet == null) {
        log.warn("No Phrase TagSet registered for Language '{}'. Will build an " + "adhoc set based on encountered Tags!", language);
        //for now only created to avoid checks for tagSet == null
        //TODO: in future we might want to automatically create posModels based
        //on tagged texts. However this makes no sense as long we can not
        //persist TagSets.
        tagSet = new TagSet<PhraseTag>("dummy", language);
    }
    //holds PhraseTags created for chunk tags that were not part of the TagSet
    //(will hold all PhraseTags in case no TagSet is registered)
    Map<String, PhraseTag> adhocTags = languageAdhocTags.get(language);
    if (adhocTags == null) {
        adhocTags = new HashMap<String, PhraseTag>();
        languageAdhocTags.put(language, adhocTags);
    }
    ci.getLock().writeLock().lock();
    try {
        Iterator<? extends Section> sentences = at.getSentences();
        if (!sentences.hasNext()) {
            //no sentences ... iterate over the whole text
            sentences = Collections.singleton(at).iterator();
        }
        List<String> tokenTextList = new ArrayList<String>(64);
        List<String> posList = new ArrayList<String>(64);
        List<Token> tokenList = new ArrayList<Token>(64);
        //process each sentence separately
        while (sentences.hasNext()) {
            // (1) get Tokens and POS information for the sentence
            Section sentence = sentences.next();
            Iterator<Token> tokens = sentence.getTokens();
            while (tokens.hasNext()) {
                Token token = tokens.next();
                tokenList.add(token);
                tokenTextList.add(token.getSpan());
                Value<PosTag> posValue = token.getAnnotation(POS_ANNOTATION);
                if (posValue == null) {
                    throw new EngineException("Missing POS value for Token '" + token.getSpan() + "' of ContentItem " + ci.getUri() + "(Sentence: '" + sentence.getSpan() + "'). This may " + "indicate that a POS tagging Engine is missing in " + "the EnhancementChain or that the used POS tagging " + "does not provide POS tags for each token!");
                } else {
                    posList.add(posValue.value().getTag());
                }
            }
            String[] tokenStrings = tokenTextList.toArray(new String[tokenTextList.size()]);
            String[] tokenPos = posList.toArray(new String[posList.size()]);
            if (log.isTraceEnabled()) {
                log.trace("Tokens: {}", Arrays.toString(tokenStrings));
            }
            //free memory
            tokenTextList.clear();
            //free memory
            posList.clear();
            // (2) Chunk the sentence
            String[] chunkTags = chunker.chunk(tokenStrings, tokenPos);
            double[] chunkProb = chunker.probs();
            if (log.isTraceEnabled()) {
                log.trace("Chunks: {}", Arrays.toString(chunkTags));
            }
            //free memory
            tokenStrings = null;
            //free memory
            tokenPos = null;
            // (3) Process the results and write the Annotations
            double chunkProps = 0;
            int chunkTokenCount = 0;
            PhraseTag tag = null;
            int i;
            /*
             * This assumes:
             *  - 'B-{tag}' ... for start of a new chunk
             *  - '???' ... anything else for continuing the current chunk
             *  - 'O' ... no chunk (ends current chunk)
             */
            for (i = 0; i < tokenList.size(); i++) {
                boolean start = chunkTags[i].charAt(0) == 'B';
                boolean end = tag != null && (start || chunkTags[i].charAt(0) == 'O');
                if (end) {
                    //add the current phrase
                    //add at AnalysedText level, because offsets are absolute
                    //NOTE we are already at the next token when we detect the end
                    Chunk chunk = at.addChunk(tokenList.get(i - chunkTokenCount).getStart(), tokenList.get(i - 1).getEnd());
                    chunk.addAnnotation(PHRASE_ANNOTATION, new Value<PhraseTag>(tag, chunkProps / (double) chunkTokenCount));
                    //reset the state
                    tag = null;
                    chunkTokenCount = 0;
                    chunkProps = 0;
                }
                if (start) {
                    //create the new tag (substring(2) skips the 'B-' prefix)
                    tag = getPhraseTag(tagSet, adhocTags, chunkTags[i].substring(2), language);
                }
                if (tag != null) {
                    //count this token for the current chunk
                    chunkProps = chunkProps + chunkProb[i];
                    chunkTokenCount++;
                }
            }
            if (tag != null) {
                Chunk chunk = at.addChunk(tokenList.get(i - chunkTokenCount).getStart(), tokenList.get(i - 1).getEnd());
                chunk.addAnnotation(PHRASE_ANNOTATION, new Value<PhraseTag>(tag, chunkProps / (double) chunkTokenCount));
            }
            // (4) clean up
            tokenList.clear();
        }
    } finally {
        ci.getLock().writeLock().unlock();
    }
    if (log.isTraceEnabled()) {
        logChunks(at);
    }
}

Also used : ArrayList(java.util.ArrayList) EngineException(org.apache.stanbol.enhancer.servicesapi.EngineException) Token(org.apache.stanbol.enhancer.nlp.model.Token) PhraseTag(org.apache.stanbol.enhancer.nlp.phrase.PhraseTag) Chunk(org.apache.stanbol.enhancer.nlp.model.Chunk) Section(org.apache.stanbol.enhancer.nlp.model.Section) NlpEngineHelper.getAnalysedText(org.apache.stanbol.enhancer.nlp.utils.NlpEngineHelper.getAnalysedText) AnalysedText(org.apache.stanbol.enhancer.nlp.model.AnalysedText) PosTag(org.apache.stanbol.enhancer.nlp.pos.PosTag) ChunkerME(opennlp.tools.chunker.ChunkerME)
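
Step (3) of computeEnhancements decodes the chunker's per-token tags into phrases and averages the per-token probabilities. The following is a minimal sketch of that decoding on plain arrays, under the same tag assumptions as the comment in the example ('B-{tag}' starts a chunk, 'O' ends it, anything else continues it). ChunkTagDecoderSketch and DecodedChunk are hypothetical names, and token indexes stand in for the character offsets that the engine passes to addChunk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Stanbol-free version of the tag decoding loop.
final class ChunkTagDecoderSketch {

    /** A decoded chunk covering the tokens [startToken, endToken). */
    static final class DecodedChunk {
        final int startToken;
        final int endToken;
        final String tag;
        final double confidence; // mean of the per-token probabilities

        DecodedChunk(int startToken, int endToken, String tag, double confidence) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.tag = tag;
            this.confidence = confidence;
        }

        @Override
        public String toString() {
            return tag + "[" + startToken + "," + endToken + ") conf=" + confidence;
        }
    }

    static List<DecodedChunk> decode(String[] chunkTags, double[] probs) {
        List<DecodedChunk> chunks = new ArrayList<DecodedChunk>();
        String tag = null;   // tag of the chunk currently being built
        double probSum = 0;  // sum of probabilities of its tokens
        int tokenCount = 0;  // number of tokens in it
        for (int i = 0; i < chunkTags.length; i++) {
            boolean start = chunkTags[i].charAt(0) == 'B';
            boolean end = tag != null && (start || chunkTags[i].charAt(0) == 'O');
            if (end) {
                // we are already at the next token when the end is detected
                chunks.add(new DecodedChunk(i - tokenCount, i, tag, probSum / tokenCount));
                tag = null;
                probSum = 0;
                tokenCount = 0;
            }
            if (start) {
                tag = chunkTags[i].substring(2); // skip the "B-" prefix
            }
            if (tag != null) {
                probSum += probs[i];
                tokenCount++;
            }
        }
        if (tag != null) {
            // close a chunk that runs to the end of the sentence
            int n = chunkTags.length;
            chunks.add(new DecodedChunk(n - tokenCount, n, tag, probSum / tokenCount));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] tags = { "B-NP", "I-NP", "B-VP", "O", "B-NP" };
        double[] probs = { 0.5, 1.0, 0.75, 0.9, 0.25 };
        for (DecodedChunk chunk : decode(tags, probs)) {
            System.out.println(chunk);
        }
    }
}
```

Keeping a running sum and count avoids a second pass over the tokens when a chunk is closed, which is the same design the engine uses with chunkProps and chunkTokenCount.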

Example 3 with Chunk

Use of org.apache.stanbol.enhancer.nlp.model.Chunk in the Apache Stanbol project.

Class NEREngineCore, method extractNameOccurrences.

/**
 * This method extracts NamedEntity occurrences by using existing {@link Token}s and
 * {@link Sentence}s in the parsed {@link AnalysedText}.
 * @param nameFinderModel the model used to find NamedEntities
 * @param at the Analysed Text
 * @param language the language of the text
 * @return the found named Entity Occurrences
 */
protected Map<String, List<NameOccurrence>> extractNameOccurrences(TokenNameFinderModel nameFinderModel, AnalysedText at, String language) {
    // version with explicit sentence endings to reflect heading / paragraph
    // structure of an HTML or PDF document converted to text
    NameFinderME finder = new NameFinderME(nameFinderModel);
    Map<String, List<NameOccurrence>> nameOccurrences = new LinkedHashMap<String, List<NameOccurrence>>();
    List<Section> sentences = new ArrayList<Section>();
    //Holds the tokens of the previous (pos 0), current (pos 1) and next (pos 2) sentence
    AnalysedTextUtils.appandToList(at.getSentences(), sentences);
    if (sentences.isEmpty()) {
        //no sentence annotations
        //process as a single section
        sentences.add(at);
    }
    for (int i = 0; i < sentences.size(); i++) {
        String sentence = sentences.get(i).getSpan();
        // build a context by concatenating three sentences to be used for
        // similarity ranking / disambiguation + contextual snippet in the
        // extraction structure
        List<String> contextElements = new ArrayList<String>();
        contextElements.add(sentence);
        //three sentences as context
        String context = at.getSpan().substring(sentences.get(Math.max(0, i - 1)).getStart(), sentences.get(Math.min(sentences.size() - 1, i + 1)).getEnd());
        // get the tokens, words of the current sentence
        List<Token> tokens = new ArrayList<Token>(32);
        List<String> words = new ArrayList<String>(32);
        for (Iterator<Token> it = sentences.get(i).getTokens(); it.hasNext(); ) {
            Token t = it.next();
            tokens.add(t);
            words.add(t.getSpan());
        }
        Span[] nameSpans = finder.find(words.toArray(new String[words.size()]));
        double[] probs = finder.probs();
        //int lastStartPosition = 0;
        for (int j = 0; j < nameSpans.length; j++) {
            String name = at.getSpan().substring(tokens.get(nameSpans[j].getStart()).getStart(), tokens.get(nameSpans[j].getEnd() - 1).getEnd());
            Double confidence = 1.0;
            for (int k = nameSpans[j].getStart(); k < nameSpans[j].getEnd(); k++) {
                confidence *= probs[k];
            }
            int start = tokens.get(nameSpans[j].getStart()).getStart();
            int end = start + name.length();
            NerTag nerTag = config.getNerTag(nameSpans[j].getType());
            //create the occurrence for writing fise:TextAnnotations
            NameOccurrence occurrence = new NameOccurrence(name, start, end, nerTag.getType(), context, confidence);
            List<NameOccurrence> occurrences = nameOccurrences.get(name);
            if (occurrences == null) {
                occurrences = new ArrayList<NameOccurrence>();
            }
            occurrences.add(occurrence);
            nameOccurrences.put(name, occurrences);
            //add also the NerAnnotation to the AnalysedText
            Chunk chunk = at.addChunk(start, end);
            //TODO: build AnnotationModel based on the configured Mappings
            chunk.addAnnotation(NER_ANNOTATION, Value.value(nerTag, confidence));
        }
    }
    finder.clearAdaptiveData();
    log.debug("{} name occurrences found: {}", nameOccurrences.size(), nameOccurrences);
    return nameOccurrences;
}

Also used : NerTag(org.apache.stanbol.enhancer.nlp.ner.NerTag) ArrayList(java.util.ArrayList) Token(org.apache.stanbol.enhancer.nlp.model.Token) Chunk(org.apache.stanbol.enhancer.nlp.model.Chunk) Section(org.apache.stanbol.enhancer.nlp.model.Section) Span(opennlp.tools.util.Span) LinkedHashMap(java.util.LinkedHashMap) NameFinderME(opennlp.tools.namefind.NameFinderME) List(java.util.List)
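
The inner loop of extractNameOccurrences turns a name span given in token indexes (end exclusive) into character offsets and multiplies the per-token probabilities into one confidence value. The following is a minimal sketch of just that arithmetic, assuming the token offsets are available as plain int arrays; NameSpanSketch, charRange and confidence are hypothetical names, not Stanbol or OpenNLP API.

```java
// Hypothetical, library-free version of the offset and confidence computation.
final class NameSpanSketch {

    /**
     * Character range [start, end) covered by the tokens
     * spanStart .. spanEnd - 1 (spanEnd is exclusive).
     */
    static int[] charRange(int[] tokenStarts, int[] tokenEnds, int spanStart, int spanEnd) {
        return new int[] { tokenStarts[spanStart], tokenEnds[spanEnd - 1] };
    }

    /** Product of the probabilities of all tokens in the span. */
    static double confidence(double[] probs, int spanStart, int spanEnd) {
        double confidence = 1.0;
        for (int k = spanStart; k < spanEnd; k++) {
            confidence *= probs[k];
        }
        return confidence;
    }

    public static void main(String[] args) {
        String text = "Bob Marley sang.";
        // Token offsets for: "Bob", "Marley", "sang", "."
        int[] tokenStarts = { 0, 4, 11, 15 };
        int[] tokenEnds = { 3, 10, 15, 16 };
        double[] probs = { 0.5, 0.5, 1.0, 1.0 };

        // A name span covering the first two tokens.
        int[] range = charRange(tokenStarts, tokenEnds, 0, 2);
        String name = text.substring(range[0], range[1]);
        System.out.println(name + " " + confidence(probs, 0, 2));
    }
}
```

Because the confidence is a product, longer names tend to score lower than short ones with similar per-token probabilities; the chunking engine in Example 2 uses the mean instead.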

Example 4 with Chunk

Use of org.apache.stanbol.enhancer.nlp.model.Chunk in the Apache Stanbol project.

Class AnalyzedTextSerializerAndParserTest, method setup.

@BeforeClass
public static final void setup() throws IOException {
    ci = ciFactory.createContentItem(new StringSource(text));
    textBlob = ContentItemHelper.getBlob(ci, Collections.singleton("text/plain"));
    analysedTextWithData = createAnalysedText();
    int sentence = text.indexOf('.') + 1;
    Sentence sent1 = analysedTextWithData.addSentence(0, sentence);
    expectedSentences.put(sent1, "The Stanbol enhancer can detect famous " + "cities such as Paris and people such as Bob Marley.");
    Token the = sent1.addToken(0, 3);
    expectedTokens.put(the, "The");
    the.addAnnotation(NlpAnnotations.POS_ANNOTATION, Value.value(new PosTag("PREP", Pos.Preposition), 0.85));
    Token stanbol = sent1.addToken(4, 11);
    expectedTokens.put(stanbol, "Stanbol");
    stanbol.addAnnotation(NlpAnnotations.POS_ANNOTATION, Value.value(new PosTag("PN", Pos.ProperNoun), 0.95));
    stanbol.addAnnotation(NlpAnnotations.SENTIMENT_ANNOTATION, Value.value(0.5));
    //use index to create Tokens
    int enhancerStart = sent1.getSpan().indexOf("enhancer");
    Token enhancer = sent1.addToken(enhancerStart, enhancerStart + "enhancer".length());
    expectedTokens.put(enhancer, "enhancer");
    enhancer.addAnnotation(NlpAnnotations.POS_ANNOTATION, Value.value(new PosTag("PN", Pos.ProperNoun), 0.95));
    enhancer.addAnnotation(NlpAnnotations.POS_ANNOTATION, Value.value(new PosTag("N", LexicalCategory.Noun), 0.87));
    MorphoFeatures morpho = new MorphoFeatures("enhance");
    morpho.addCase(new CaseTag("test-case-1", Case.Comitative));
    morpho.addCase(new CaseTag("test-case-2", Case.Abessive));
    morpho.addDefinitness(Definitness.Definite);
    morpho.addPerson(Person.First);
    morpho.addPos(new PosTag("PN", Pos.ProperNoun));
    morpho.addGender(new GenderTag("test-gender", Gender.Masculine));
    morpho.addNumber(new NumberTag("test-number", NumberFeature.Plural));
    morpho.addTense(new TenseTag("test-tense", Tense.Present));
    morpho.addVerbForm(new VerbMoodTag("test-verb-mood", VerbMood.ConditionalVerb));
    enhancer.addAnnotation(NlpAnnotations.MORPHO_ANNOTATION, Value.value(morpho));
    //create a chunk
    Chunk stanbolEnhancer = analysedTextWithData.addChunk(stanbol.getStart(), enhancer.getEnd());
    expectedChunks.put(stanbolEnhancer, "Stanbol enhancer");
    stanbolEnhancer.addAnnotation(NlpAnnotations.NER_ANNOTATION, Value.value(new NerTag("organization", DBPEDIA_ORGANISATION)));
    stanbolEnhancer.addAnnotation(NlpAnnotations.PHRASE_ANNOTATION, Value.value(new PhraseTag("NP", LexicalCategory.Noun), 0.98));
}

Also used : CaseTag(org.apache.stanbol.enhancer.nlp.morpho.CaseTag) NerTag(org.apache.stanbol.enhancer.nlp.ner.NerTag) Token(org.apache.stanbol.enhancer.nlp.model.Token) VerbMoodTag(org.apache.stanbol.enhancer.nlp.morpho.VerbMoodTag) Chunk(org.apache.stanbol.enhancer.nlp.model.Chunk) PhraseTag(org.apache.stanbol.enhancer.nlp.phrase.PhraseTag) PosTag(org.apache.stanbol.enhancer.nlp.pos.PosTag) NumberTag(org.apache.stanbol.enhancer.nlp.morpho.NumberTag) StringSource(org.apache.stanbol.enhancer.servicesapi.impl.StringSource) TenseTag(org.apache.stanbol.enhancer.nlp.morpho.TenseTag) MorphoFeatures(org.apache.stanbol.enhancer.nlp.morpho.MorphoFeatures) Sentence(org.apache.stanbol.enhancer.nlp.model.Sentence) GenderTag(org.apache.stanbol.enhancer.nlp.morpho.GenderTag) BeforeClass(org.junit.BeforeClass)

Example 5 with Chunk

Use of org.apache.stanbol.enhancer.nlp.model.Chunk in the Apache Stanbol project.

Class EntityCoReferenceEngineTest, method testSpatialCoref.

@Test
public void testSpatialCoref() throws EngineException, IOException {
    ContentItem ci = ciFactory.createContentItem(new StringSource(SPATIAL_TEXT));
    Graph graph = ci.getMetadata();
    IRI textEnhancement = EnhancementEngineHelper.createTextEnhancement(ci, engine);
    graph.add(new TripleImpl(textEnhancement, DC_LANGUAGE, new PlainLiteralImpl("en")));
    graph.add(new TripleImpl(textEnhancement, ENHANCER_CONFIDENCE, new PlainLiteralImpl("100.0")));
    graph.add(new TripleImpl(textEnhancement, DC_TYPE, DCTERMS_LINGUISTIC_SYSTEM));
    Entry<IRI, Blob> textBlob = ContentItemHelper.getBlob(ci, Collections.singleton("text/plain"));
    AnalysedText at = atFactory.createAnalysedText(ci, textBlob.getValue());
    Sentence sentence1 = at.addSentence(0, SPATIAL_SENTENCE_1.indexOf(".") + 1);
    Chunk angelaMerkel = sentence1.addChunk(0, "Angela Merkel".length());
    angelaMerkel.addAnnotation(NlpAnnotations.NER_ANNOTATION, Value.value(new NerTag("Angela Merkel", OntologicalClasses.DBPEDIA_PERSON)));
    Sentence sentence2 = at.addSentence(SPATIAL_SENTENCE_1.indexOf(".") + 1, SPATIAL_SENTENCE_1.length() + SPATIAL_SENTENCE_2.indexOf(".") + 1);
    int theStartIdx = sentence2.getSpan().indexOf("The");
    int germanStartIdx = sentence2.getSpan().indexOf("German");
    int chancellorStartIdx = sentence2.getSpan().indexOf("politician");
    Token the = sentence2.addToken(theStartIdx, theStartIdx + "The".length());
    the.addAnnotation(NlpAnnotations.POS_ANNOTATION, Value.value(new PosTag("The", LexicalCategory.PronounOrDeterminer, Pos.Determiner)));
    Token german = sentence2.addToken(germanStartIdx, germanStartIdx + "German".length());
    german.addAnnotation(NlpAnnotations.POS_ANNOTATION, Value.value(new PosTag("German", LexicalCategory.Adjective)));
    Token politician = sentence2.addToken(chancellorStartIdx, chancellorStartIdx + "politician".length());
    politician.addAnnotation(NlpAnnotations.POS_ANNOTATION, Value.value(new PosTag("politician", LexicalCategory.Noun)));
    Chunk theGermanChancellor = sentence2.addChunk(theStartIdx, chancellorStartIdx + "politician".length());
    theGermanChancellor.addAnnotation(NlpAnnotations.PHRASE_ANNOTATION, Value.value(new PhraseTag("The German politician", LexicalCategory.Noun)));
    engine.computeEnhancements(ci);
    Value<CorefFeature> representativeCorefValue = angelaMerkel.getAnnotation(NlpAnnotations.COREF_ANNOTATION);
    Assert.assertNotNull(representativeCorefValue);
    CorefFeature representativeCoref = representativeCorefValue.value();
    Assert.assertTrue(representativeCoref.isRepresentative());
    Assert.assertTrue(representativeCoref.getMentions().contains(theGermanChancellor));
    Value<CorefFeature> subordinateCorefValue = theGermanChancellor.getAnnotation(NlpAnnotations.COREF_ANNOTATION);
    Assert.assertNotNull(subordinateCorefValue);
    CorefFeature subordinateCoref = subordinateCorefValue.value();
    Assert.assertFalse(subordinateCoref.isRepresentative());
    Assert.assertTrue(subordinateCoref.getMentions().contains(angelaMerkel));
}

Also used : IRI(org.apache.clerezza.commons.rdf.IRI) NerTag(org.apache.stanbol.enhancer.nlp.ner.NerTag) CorefFeature(org.apache.stanbol.enhancer.nlp.coref.CorefFeature) Blob(org.apache.stanbol.enhancer.servicesapi.Blob) PlainLiteralImpl(org.apache.clerezza.commons.rdf.impl.utils.PlainLiteralImpl) Token(org.apache.stanbol.enhancer.nlp.model.Token) Chunk(org.apache.stanbol.enhancer.nlp.model.Chunk) PhraseTag(org.apache.stanbol.enhancer.nlp.phrase.PhraseTag) AnalysedText(org.apache.stanbol.enhancer.nlp.model.AnalysedText) Graph(org.apache.clerezza.commons.rdf.Graph) PosTag(org.apache.stanbol.enhancer.nlp.pos.PosTag) StringSource(org.apache.stanbol.enhancer.servicesapi.impl.StringSource) TripleImpl(org.apache.clerezza.commons.rdf.impl.utils.TripleImpl) Sentence(org.apache.stanbol.enhancer.nlp.model.Sentence) ContentItem(org.apache.stanbol.enhancer.servicesapi.ContentItem) Test(org.junit.Test)

Aggregations

Chunk (org.apache.stanbol.enhancer.nlp.model.Chunk) 9
Token (org.apache.stanbol.enhancer.nlp.model.Token) 7
NerTag (org.apache.stanbol.enhancer.nlp.ner.NerTag) 5
PosTag (org.apache.stanbol.enhancer.nlp.pos.PosTag) 5
AnalysedText (org.apache.stanbol.enhancer.nlp.model.AnalysedText) 4
Sentence (org.apache.stanbol.enhancer.nlp.model.Sentence) 4
ArrayList (java.util.ArrayList) 3
IRI (org.apache.clerezza.commons.rdf.IRI) 3
PhraseTag (org.apache.stanbol.enhancer.nlp.phrase.PhraseTag) 3
Graph (org.apache.clerezza.commons.rdf.Graph) 2
PlainLiteralImpl (org.apache.clerezza.commons.rdf.impl.utils.PlainLiteralImpl) 2
TripleImpl (org.apache.clerezza.commons.rdf.impl.utils.TripleImpl) 2
Section (org.apache.stanbol.enhancer.nlp.model.Section) 2
MorphoFeatures (org.apache.stanbol.enhancer.nlp.morpho.MorphoFeatures) 2
EngineException (org.apache.stanbol.enhancer.servicesapi.EngineException) 2
StringSource (org.apache.stanbol.enhancer.servicesapi.impl.StringSource) 2
Test (org.junit.Test) 2
IOException (java.io.IOException) 1
HashMap (java.util.HashMap) 1
LinkedHashMap (java.util.LinkedHashMap) 1