Example 1 with StanfordCoreNLP

use of edu.stanford.nlp.pipeline.StanfordCoreNLP in project textdb by TextDB.

the class NlpEntityOperator method extractNlpSpans.

/**
     * @param iField
     * @param attributeName
     * @return
     * @about This function takes an IField (TextField) and a String (the field's
     *        name) as input and uses the Stanford NLP package to process the
     *        field based on the input token type and nlpTypeIndicator. In the
     *        result spans, value is the word itself and key is the recognized
     *        token type.
     * @overview First set up a pipeline of annotators based on the
     *           nlpTypeIndicator. If the nlpTypeIndicator is "NE_ALL", set up
     *           the NamedEntityTagAnnotator; if it is "POS", only the
     *           PartOfSpeechAnnotator is needed.
     *           <p>
     *           The pipeline has to be in this order: TokenizerAnnotator,
     *           SentencesAnnotator, PartOfSpeechAnnotator, LemmaAnnotator and
     *           NamedEntityTagAnnotator.
     *           <p>
     *           In the pipeline, each token is wrapped as a CoreLabel and each
     *           sentence is wrapped as a CoreMap. Each annotator adds its
     *           annotation to the CoreMap (sentence) or CoreLabel (token) object.
     *           <p>
     *           After the pipeline has run, scan each CoreLabel (token) for its
     *           NamedEntityTagAnnotation or PartOfSpeechAnnotation, depending
     *           on the nlpTypeIndicator.
     *           <p>
     *           For each Stanford NLP annotation, get its corresponding
     *           NlpEntityType as used in this package, then check whether it
     *           equals the input token type. If it does, make it a span and
     *           add it to the returned list.
     *           <p>
     *           The NLP package has annotations for the start and end position
     *           of a token, and they match the span design exactly, so we use
     *           them directly.
     *           <p>
     *           For example: with TextField value "Microsoft, Google and
     *           Facebook are organizations while Donald Trump and Barack Obama
     *           are persons", attributeName Sentence1 and inputTokenType
     *           Organization. Since the inputTokenType requires the
     *           NamedEntity annotator in the Stanford NLP package, the
     *           nlpTypeIndicator would be set to "NE_ALL". The pipeline would
     *           be set up to cover the Named Entity Recognizer. Then get the
     *           value of NamedEntityTagAnnotation for each CoreLabel (token).
     *           If the value is the token type "Organization", it meets the
     *           requirement. In this case "Microsoft", "Google" and "Facebook"
     *           satisfy the requirement. "Donald Trump" and "Barack Obama"
     *           have token type "Person" and do not meet the requirement.
     *           For each qualified token, create a span accordingly and add it
     *           to the returned list. In this case, token "Microsoft" would be
     *           span: ["Sentence1", 0, 9, Organization, "Microsoft"]
     */
private List<Span> extractNlpSpans(IField iField, String attributeName) {
    List<Span> spanList = new ArrayList<>();
    String text = (String) iField.getValue();
    Properties props = new Properties();
    // Setup Stanford NLP pipeline based on nlpTypeIndicator
    StanfordCoreNLP pipeline = null;
    if (getNlpTypeIndicator(predicate.getNlpEntityType()).equals("POS")) {
        props.setProperty("annotators", "tokenize, ssplit, pos");
        if (posPipeline == null) {
            posPipeline = new StanfordCoreNLP(props);
        }
        pipeline = posPipeline;
    } else {
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        if (nerPipeline == null) {
            nerPipeline = new StanfordCoreNLP(props);
        }
        pipeline = nerPipeline;
    }
    Annotation documentAnnotation = new Annotation(text);
    pipeline.annotate(documentAnnotation);
    List<CoreMap> sentences = documentAnnotation.get(CoreAnnotations.SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
            String stanfordNlpConstant;
            // Extract annotations based on nlpTypeIndicator
            if (getNlpTypeIndicator(predicate.getNlpEntityType()).equals("POS")) {
                stanfordNlpConstant = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
            } else {
                stanfordNlpConstant = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
            }
            NlpEntityType nlpEntityType = mapNlpEntityType(stanfordNlpConstant);
            if (nlpEntityType == null) {
                continue;
            }
            if (predicate.getNlpEntityType().equals(NlpEntityType.NE_ALL) || predicate.getNlpEntityType().equals(nlpEntityType)) {
                int start = token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class);
                int end = token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class);
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                Span span = new Span(attributeName, start, end, nlpEntityType.toString(), word);
                if (spanList.size() >= 1 && (getNlpTypeIndicator(predicate.getNlpEntityType()).equals("NE_ALL"))) {
                    Span previousSpan = spanList.get(spanList.size() - 1);
                    if (previousSpan.getAttributeName().equals(span.getAttributeName()) && (span.getStart() - previousSpan.getEnd() <= 1) && previousSpan.getKey().equals(span.getKey())) {
                        span = mergeTwoSpans(previousSpan, span);
                        spanList.remove(spanList.size() - 1);
                    }
                }
                spanList.add(span);
            }
        }
    }
    return spanList;
}
Also used : ArrayList(java.util.ArrayList) Properties(java.util.Properties) Span(edu.uci.ics.textdb.api.span.Span) StanfordCoreNLP(edu.stanford.nlp.pipeline.StanfordCoreNLP) Annotation(edu.stanford.nlp.pipeline.Annotation) CoreLabel(edu.stanford.nlp.ling.CoreLabel) CoreAnnotations(edu.stanford.nlp.ling.CoreAnnotations) CoreMap(edu.stanford.nlp.util.CoreMap)
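
The merge step at the end of extractNlpSpans (adjacent tokens with the same entity type collapse into one span) can be tried out without the Stanford NLP models. The following is only a sketch under stated assumptions: SimpleSpan and addOrMerge are hypothetical names standing in for TextDB's Span and mergeTwoSpans, whose source is not shown here, and the merged value is assumed to be the slice of the original text that the two spans cover.

```java
import java.util.ArrayList;
import java.util.List;

final class SpanMergeSketch {

    // Hypothetical stand-in for TextDB's Span, reduced to the fields used here.
    static final class SimpleSpan {
        final String attributeName;
        final int start;   // inclusive character offset
        final int end;     // exclusive character offset
        final String key;  // recognized entity type, e.g. "Person"
        final String value;

        SimpleSpan(String attributeName, int start, int end, String key, String value) {
            this.attributeName = attributeName;
            this.start = start;
            this.end = end;
            this.key = key;
            this.value = value;
        }
    }

    // Appends span to spans. If the last span has the same attribute name and key
    // and ends at most one character before this one starts, the two are merged.
    static void addOrMerge(List<SimpleSpan> spans, SimpleSpan span, String text) {
        if (!spans.isEmpty()) {
            SimpleSpan previous = spans.get(spans.size() - 1);
            boolean sameAttribute = previous.attributeName.equals(span.attributeName);
            boolean sameKey = previous.key.equals(span.key);
            boolean adjacent = span.start - previous.end <= 1;
            if (sameAttribute && sameKey && adjacent) {
                spans.remove(spans.size() - 1);
                // Assumption: the merged value is the covered slice of the source text.
                span = new SimpleSpan(span.attributeName, previous.start, span.end, span.key,
                        text.substring(previous.start, span.end));
            }
        }
        spans.add(span);
    }

    public static void main(String[] args) {
        String text = "Donald Trump and Barack Obama are persons";
        List<SimpleSpan> spans = new ArrayList<>();
        // Token offsets are written by hand; in the real operator they come from
        // CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation.
        addOrMerge(spans, new SimpleSpan("Sentence1", 0, 6, "Person", "Donald"), text);
        addOrMerge(spans, new SimpleSpan("Sentence1", 7, 12, "Person", "Trump"), text);
        addOrMerge(spans, new SimpleSpan("Sentence1", 17, 23, "Person", "Barack"), text);
        addOrMerge(spans, new SimpleSpan("Sentence1", 24, 29, "Person", "Obama"), text);
        for (SimpleSpan s : spans) {
            System.out.println(s.key + " [" + s.start + ", " + s.end + ") " + s.value);
        }
    }
}
```

With the hand-written offsets above, "Donald" and "Trump" are one character apart and merge, while "Barack" starts five characters after "Trump" ends and therefore opens a new span.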

Example 2 with StanfordCoreNLP

use of edu.stanford.nlp.pipeline.StanfordCoreNLP in project textdb by TextDB.

the class NlpSentimentOperator method open.

@Override
public void open() throws TextDBException {
    if (cursor != CLOSED) {
        return;
    }
    if (inputOperator == null) {
        throw new DataFlowException(ErrorMessages.INPUT_OPERATOR_NOT_SPECIFIED);
    }
    inputOperator.open();
    Schema inputSchema = inputOperator.getOutputSchema();
    // check that the input attribute exists in the input schema
    if (!inputSchema.containsField(predicate.getInputAttributeName())) {
        throw new RuntimeException(String.format("input attribute %s is not in the input schema %s", predicate.getInputAttributeName(), inputSchema.getAttributeNames()));
    }
    // check if attribute type is valid
    AttributeType inputAttributeType = inputSchema.getAttribute(predicate.getInputAttributeName()).getAttributeType();
    boolean isValidType = inputAttributeType.equals(AttributeType.STRING) || inputAttributeType.equals(AttributeType.TEXT);
    if (!isValidType) {
        throw new RuntimeException(String.format("input attribute %s must have type String or Text, its actual type is %s", predicate.getInputAttributeName(), inputAttributeType));
    }
    // generate output schema by transforming the input schema
    outputSchema = transformSchema(inputOperator.getOutputSchema());
    cursor = OPENED;
    // setup NLP sentiment analysis pipeline
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
    sentimentPipeline = new StanfordCoreNLP(props);
}
Also used : AttributeType(edu.uci.ics.textdb.api.schema.AttributeType) Schema(edu.uci.ics.textdb.api.schema.Schema) DataFlowException(edu.uci.ics.textdb.api.exception.DataFlowException) Properties(java.util.Properties) StanfordCoreNLP(edu.stanford.nlp.pipeline.StanfordCoreNLP)
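
The open() method above builds a new pipeline on every call, whereas Example 1 keeps its pipelines in fields and creates them only on first use. Because constructing a StanfordCoreNLP pipeline loads models, reusing one instance per annotator list is often worthwhile. The sketch below shows one possible way to do that with only the standard library; LazyPipelineCache is a hypothetical name, and the pipeline type is left generic so the snippet compiles without CoreNLP on the classpath.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper: a thread-safe cache that builds each pipeline at most
// once per annotator list. P stands in for the pipeline type.
final class LazyPipelineCache<P> {

    private final Map<String, P> cache = new ConcurrentHashMap<>();
    private final Function<String, P> factory;

    LazyPipelineCache(Function<String, P> factory) {
        this.factory = factory;
    }

    // Returns the cached pipeline for this annotator list, creating it on first use.
    P get(String annotators) {
        return cache.computeIfAbsent(annotators, factory);
    }

    public static void main(String[] args) {
        // A string stands in for the real pipeline object in this demo.
        LazyPipelineCache<String> cache = new LazyPipelineCache<>(a -> "pipeline[" + a + "]");
        System.out.println(cache.get("tokenize, ssplit, parse, sentiment"));
    }
}
```

With CoreNLP available, the factory could set the "annotators" property on a Properties object and pass it to new StanfordCoreNLP(props), as the examples on this page do.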

Example 3 with StanfordCoreNLP

use of edu.stanford.nlp.pipeline.StanfordCoreNLP in project neo4j-nlp-stanfordnlp by graphaware.

the class PipelineBuilder method build.

public StanfordCoreNLP build() {
    properties.setProperty("annotators", annotators.toString());
    // properties.setProperty("ner.model", customNEs.toString());
    properties.setProperty("threads", String.valueOf(threadsNumber));
    StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
    return pipeline;
}
Also used : StanfordCoreNLP(edu.stanford.nlp.pipeline.StanfordCoreNLP)
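
The build() method above relies on an annotators field and a threadsNumber field that are filled in elsewhere in PipelineBuilder. As a rough illustration of how such a builder might assemble its Properties, here is a standard-library sketch; AnnotatorPropertiesSketch and its methods are hypothetical and are not the graphaware API. Only the "annotators" and "threads" property keys are taken from the example.

```java
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical builder: collects annotator names in insertion order, drops
// duplicates, and renders them as the comma-separated "annotators" property.
final class AnnotatorPropertiesSketch {

    private final Set<String> annotators = new LinkedHashSet<>();
    private int threads = 1;

    AnnotatorPropertiesSketch add(String annotator) {
        annotators.add(annotator);
        return this;
    }

    AnnotatorPropertiesSketch threads(int threads) {
        this.threads = threads;
        return this;
    }

    Properties toProperties() {
        Properties props = new Properties();
        props.setProperty("annotators", String.join(", ", annotators));
        props.setProperty("threads", String.valueOf(threads));
        return props;
    }

    public static void main(String[] args) {
        Properties props = new AnnotatorPropertiesSketch()
                .add("tokenize").add("ssplit").add("pos")
                .threads(4)
                .toProperties();
        System.out.println(props.getProperty("annotators"));
        System.out.println(props.getProperty("threads"));
    }
}
```

The resulting Properties object is what would be handed to the StanfordCoreNLP constructor. Keeping insertion order matters because, as the Javadoc in Example 1 notes, annotators have to appear in dependency order.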

Example 4 with StanfordCoreNLP

use of edu.stanford.nlp.pipeline.StanfordCoreNLP in project neo4j-nlp-stanfordnlp by graphaware.

the class StanfordTextProcessor method getPipeline.

public StanfordCoreNLP getPipeline(String name) {
    if (name == null || name.isEmpty()) {
        name = TOKENIZER;
        LOG.debug("Using default pipeline: " + name);
    }
    StanfordCoreNLP pipeline = pipelines.get(name);
    if (pipeline == null) {
        throw new RuntimeException("Pipeline: " + name + " doesn't exist");
    }
    return pipeline;
}
Also used : StanfordCoreNLP(edu.stanford.nlp.pipeline.StanfordCoreNLP)

Example 5 with StanfordCoreNLP

use of edu.stanford.nlp.pipeline.StanfordCoreNLP in project neo4j-nlp-stanfordnlp by graphaware.

the class DependencyParserTest method testTagMerging.

@Test
public void testTagMerging() throws Exception {
    StanfordCoreNLP pipeline = ((StanfordTextProcessor) textProcessor).getPipeline("default");
    String text = "Donald Trump flew yesterday to New York City";
    AnnotatedText at = textProcessor.annotateText(text, "en", PIPELINE_DEFAULT);
}
Also used : AnnotatedText(com.graphaware.nlp.domain.AnnotatedText) StanfordTextProcessor(com.graphaware.nlp.processor.stanford.StanfordTextProcessor) StanfordCoreNLP(edu.stanford.nlp.pipeline.StanfordCoreNLP) Test(org.junit.Test)

Aggregations

StanfordCoreNLP (edu.stanford.nlp.pipeline.StanfordCoreNLP): 71
Properties (java.util.Properties): 44
Annotation (edu.stanford.nlp.pipeline.Annotation): 40
CoreAnnotations (edu.stanford.nlp.ling.CoreAnnotations): 33
CoreMap (edu.stanford.nlp.util.CoreMap): 33
Test (org.junit.Test): 15
CoreLabel (edu.stanford.nlp.ling.CoreLabel): 12
SemanticGraphCoreAnnotations (edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations): 12
SemanticGraph (edu.stanford.nlp.semgraph.SemanticGraph): 10
CorefCoreAnnotations (edu.stanford.nlp.coref.CorefCoreAnnotations): 6
SemanticGraphEdge (edu.stanford.nlp.semgraph.SemanticGraphEdge): 6
StanfordTextProcessor (com.graphaware.nlp.processor.stanford.StanfordTextProcessor): 5
TreeCoreAnnotations (edu.stanford.nlp.trees.TreeCoreAnnotations): 5
PrintWriter (java.io.PrintWriter): 5
ArrayList (java.util.ArrayList): 5
AnnotatedText (com.graphaware.nlp.domain.AnnotatedText): 3
CorefChain (edu.stanford.nlp.coref.data.CorefChain): 3
GoldAnswerAnnotation (edu.stanford.nlp.ling.CoreAnnotations.GoldAnswerAnnotation): 3
SentencesAnnotation (edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation): 3
TokenSequencePattern (edu.stanford.nlp.ling.tokensregex.TokenSequencePattern): 3