
Example 16 with DocumentPreprocessor

Use of edu.stanford.nlp.process.DocumentPreprocessor in project textdb by TextDB.

The class NlpSplitOperator, method computeSentenceList.

private List<Span> computeSentenceList(Tuple inputTuple) {
    String attributeName = predicate.getInputAttributeName();
    String inputText = inputTuple.<IField>getField(attributeName).getValue().toString();
    Reader reader = new StringReader(inputText);
    // DocumentPreprocessor tokenizes the text and splits it into sentences.
    DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(reader);
    // Disable PTB3 escaping so tokens keep their original characters (e.g. brackets are not rewritten).
    documentPreprocessor.setTokenizerFactory(PTBTokenizer.PTBTokenizerFactory.newCoreLabelTokenizerFactory("ptb3Escaping=false"));
    List<Span> sentenceList = new ArrayList<>();
    int start = 0;
    int end = 0;
    String key = PropertyNameConstants.NLP_SPLIT_KEY;
    for (List<HasWord> sentence : documentPreprocessor) {
        // Rebuild the sentence text by joining its tokens with single spaces.
        String sentenceText = SentenceUtils.listToString(sentence);
        // Make span: offsets are counted over the rejoined sentences,
        // assuming one separator character between consecutive sentences.
        end = start + sentenceText.length();
        Span span = new Span(attributeName, start, end, key, sentenceText);
        sentenceList.add(span);
        start = end + 1;
    }
    return sentenceList;
}
Also used: HasWord(edu.stanford.nlp.ling.HasWord) StringReader(java.io.StringReader) ArrayList(java.util.ArrayList) Reader(java.io.Reader) IField(edu.uci.ics.texera.api.field.IField) DocumentPreprocessor(edu.stanford.nlp.process.DocumentPreprocessor) Span(edu.uci.ics.texera.api.span.Span)
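
The offset arithmetic in the loop above can be tried in isolation. The following is a minimal sketch, not the project's code: `SimpleSpan` and `computeSpans` are hypothetical stand-ins for the project's `Span` class and the method body, and the sentence strings are passed in already joined so that no Stanford NLP dependency is needed. It shows that each span covers the half-open range [start, end) and that one character is skipped between sentences. Because the sentence text is rebuilt from tokens joined by single spaces, these offsets index into the rejoined text and may not line up with the original input wherever tokenization changed the spacing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpanOffsetSketch {

    // Hypothetical stand-in for the project's Span class: range [start, end) plus the text.
    static final class SimpleSpan {
        final int start;
        final int end;
        final String text;

        SimpleSpan(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ") " + text;
        }
    }

    // Mirrors the loop in computeSentenceList: accumulate offsets sentence by sentence.
    static List<SimpleSpan> computeSpans(List<String> sentences) {
        List<SimpleSpan> spans = new ArrayList<>();
        int start = 0;
        for (String sentenceText : sentences) {
            int end = start + sentenceText.length();
            spans.add(new SimpleSpan(start, end, sentenceText));
            // Skip one character, assumed to separate consecutive sentences.
            start = end + 1;
        }
        return spans;
    }

    public static void main(String[] args) {
        // Sentences as they would look after joining tokens with single spaces.
        List<String> sentences = Arrays.asList("Hello world .", "How are you ?");
        for (SimpleSpan span : computeSpans(sentences)) {
            System.out.println(span);
        }
    }
}
```

If offsets into the original input are needed, one option is to read character positions from the tokens themselves rather than from the rejoined string; whether that fits this operator is a design decision for the project.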

Aggregations

DocumentPreprocessor (edu.stanford.nlp.process.DocumentPreprocessor) 16
HasWord (edu.stanford.nlp.ling.HasWord) 13
StringReader (java.io.StringReader) 8
TaggedWord (edu.stanford.nlp.ling.TaggedWord) 5
MaxentTagger (edu.stanford.nlp.tagger.maxent.MaxentTagger) 5
CoreLabel (edu.stanford.nlp.ling.CoreLabel) 3
LexicalizedParser (edu.stanford.nlp.parser.lexparser.LexicalizedParser) 3
Tree (edu.stanford.nlp.trees.Tree) 3
Reader (java.io.Reader) 3
ArrayList (java.util.ArrayList) 3
ParserQuery (edu.stanford.nlp.parser.common.ParserQuery) 2
CoreLabelTokenFactory (edu.stanford.nlp.process.CoreLabelTokenFactory) 2
GrammaticalStructure (edu.stanford.nlp.trees.GrammaticalStructure) 2
Pair (edu.stanford.nlp.util.Pair) 2
Timing (edu.stanford.nlp.util.Timing) 2
BufferedReader (java.io.BufferedReader) 2
File (java.io.File) 2
PrintWriter (java.io.PrintWriter) 2
Map (java.util.Map) 2
Twokenize (cmu.arktweetnlp.Twokenize) 1