use of edu.stanford.nlp.process.DocumentPreprocessor in project CoreNLP by stanfordnlp.
the class MaxentTagger method runTagger.
/**
* This method runs the tagger on the provided reader and writer.
*
* It takes input from the given {@code reader}, applies the
* tagger to it one sentence at a time (determined using
* documentPreprocessor), and writes the output to the given
* {@code writer}.
*
* The document is broken into sentences using the sentence
* processor determined in the tagger's TaggerConfig.
*
* {@code tagInside} makes the tagger run in XML mode. If set
* to non-empty, instead of processing the document as one large
* text blob, it considers each region in between the given tag to
* be a separate text blob.
*/
public void runTagger(BufferedReader reader, BufferedWriter writer, String tagInside, OutputStyle outputStyle) throws IOException {
  String sentenceDelimiter = config.getSentenceDelimiter();
  if (sentenceDelimiter != null && sentenceDelimiter.equals("newline")) {
    sentenceDelimiter = "\n";
  }
  final TokenizerFactory<? extends HasWord> tokenizerFactory = chooseTokenizerFactory();
  // Now we do everything through the doc preprocessor
  final DocumentPreprocessor docProcessor;
  if (tagInside.length() > 0) {
    // XML mode: each region between the given tags is treated as a separate text blob
    docProcessor = new DocumentPreprocessor(reader, DocumentPreprocessor.DocType.XML);
    docProcessor.setElementDelimiter(tagInside);
  } else {
    docProcessor = new DocumentPreprocessor(reader);
    docProcessor.setSentenceDelimiter(sentenceDelimiter);
  }
  if (config.keepEmptySentences()) {
    docProcessor.setKeepEmptySentences(true);
  }
  docProcessor.setTokenizerFactory(tokenizerFactory);
  runTagger(docProcessor, writer, outputStyle);
}
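For context, a minimal call site for runTagger might look like the sketch below. The model path and input file name are placeholders, and it assumes OutputStyle is the enum from edu.stanford.nlp.sequences.PlainTextDocumentReaderAndWriter:
MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");  // placeholder path
try (BufferedReader in = new BufferedReader(
         new InputStreamReader(new FileInputStream("input.txt"), StandardCharsets.UTF_8));
     BufferedWriter out = new BufferedWriter(
         new OutputStreamWriter(System.out, StandardCharsets.UTF_8))) {
  // An empty tagInside string keeps plain-text mode; e.g. "p" would tag only
  // the text inside <p>...</p> regions of an XML document.
  tagger.runTagger(in, out, "", OutputStyle.SLASH_TAGS);
}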
use of edu.stanford.nlp.process.DocumentPreprocessor in project CoreNLP by stanfordnlp.
the class TaggerDemo2 method main.
public static void main(String[] args) throws Exception {
  if (args.length != 2) {
    log.info("usage: java TaggerDemo2 modelFile fileToTag");
    return;
  }
  MaxentTagger tagger = new MaxentTagger(args[0]);
  TokenizerFactory<CoreLabel> ptbTokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "untokenizable=noneKeep");
  BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(args[1]), "utf-8"));
  PrintWriter pw = new PrintWriter(new OutputStreamWriter(System.out, "utf-8"));
  DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(r);
  documentPreprocessor.setTokenizerFactory(ptbTokenizerFactory);
  for (List<HasWord> sentence : documentPreprocessor) {
    List<TaggedWord> tSentence = tagger.tagSentence(sentence);
    pw.println(SentenceUtils.listToString(tSentence, false));
  }
  // Print the adjectives in one more sentence. This shows how to get at words and tags in a tagged sentence.
  List<HasWord> sent = SentenceUtils.toWordList("The", "slimy", "slug", "crawled", "over", "the", "long", ",", "green", "grass", ".");
  List<TaggedWord> taggedSent = tagger.tagSentence(sent);
  for (TaggedWord tw : taggedSent) {
    if (tw.tag().startsWith("JJ")) {
      pw.println(tw.word());
    }
  }
  pw.close();
}
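If the input is already in memory as a string, the reader/writer plumbing above can be skipped entirely; a shorter sketch using MaxentTagger's tagString, which tokenizes internally and returns the word_tag pairs as a single string (model path again a placeholder):
MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");  // placeholder path
String tagged = tagger.tagString("The slimy slug crawled over the long, green grass.");
System.out.println(tagged);  // e.g. "The_DT slimy_JJ slug_NN crawled_VBD ..."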
use of edu.stanford.nlp.process.DocumentPreprocessor in project textdb by TextDB.
the class NlpSplitOperator method computeSentenceList.
private List<Span> computeSentenceList(Tuple inputTuple) {
  String inputText = inputTuple.<IField>getField(predicate.getInputAttributeName()).getValue().toString();
  Reader reader = new StringReader(inputText);
  DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(reader);
  List<Span> sentenceList = new ArrayList<Span>();
  int start = 0;
  int end = 0;
  String key = PropertyNameConstants.NLP_SPLIT_KEY;
  String attributeName = predicate.getInputAttributeName();
  for (List<HasWord> sentence : documentPreprocessor) {
    String sentenceText = Sentence.listToString(sentence);
    // Make a span covering this sentence; assumes one separator character between sentences
    end = start + sentenceText.length();
    Span span = new Span(attributeName, start, end, key, sentenceText);
    sentenceList.add(span);
    start = end + 1;
  }
  return sentenceList;
}
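One caveat worth noting: the span offsets above are computed from the re-joined token string, which PTB-style normalization (quotes, brackets) can make differ from the original surface text, so they are approximate rather than exact character positions. A self-contained sketch of the same splitting idea, with the textdb-specific types omitted:
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.SentenceUtils;
import edu.stanford.nlp.process.DocumentPreprocessor;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SentenceSplitSketch {
  public static List<String> split(String text) {
    List<String> sentences = new ArrayList<>();
    // DocumentPreprocessor yields one tokenized List<HasWord> per detected sentence.
    for (List<HasWord> sentence : new DocumentPreprocessor(new StringReader(text))) {
      sentences.add(SentenceUtils.listToString(sentence));
    }
    return sentences;
  }
}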
use of edu.stanford.nlp.process.DocumentPreprocessor in project uuusa by aghie.
the class Processor method process.
public List<SentimentDependencyGraph> process(String text) {
  // HashMap<String, String> emoLookupTable = new HashMap<String,String>();
  // for (String emoticon : emoticons){
  //   System.out.println(emoticon);
  //   String emouuid = UUID.randomUUID().toString();
  //   text.replaceAll(emoticon, emouuid);
  //   emoLookupTable.put(emouuid, emoticon);
  // }
  List<SentimentDependencyGraph> sdgs = new ArrayList<SentimentDependencyGraph>();
  DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(text.concat(" ")));
  dp.setTokenizerFactory(PTBTokenizer.factory(new WordTokenFactory(), "ptb3Escaping=false"));
  for (List<HasWord> sentence : dp) {
    List<String> words = sentence.stream().map(w -> w.toString()).collect(Collectors.toList());
    // System.out.println("text: " + text);
    List<String> tokens = this.tokenizer.tokenize(String.join(" ", words));
    // System.out.println("tokens: " + tokens);
    List<TaggedTokenInformation> ttis = this.tagger.tag(tokens);
    sdgs.add(this.parser.parse(ttis));
  }
  // this.parser.parse(ttis);
  return sdgs;
}
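The detail doing the work here is the tokenizer option string: ptb3Escaping=false keeps raw characters (no -LRB-/-RRB- style escapes for brackets), which matters because the tokens are re-joined and handed to an external tokenizer and tagger. A stripped-down sketch of just that configuration:
DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader("A (small) test."));
dp.setTokenizerFactory(PTBTokenizer.factory(new WordTokenFactory(), "ptb3Escaping=false"));
for (List<HasWord> sentence : dp) {
  System.out.println(sentence);  // prints "(" and ")" literally, not -LRB-/-RRB-
}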
use of edu.stanford.nlp.process.DocumentPreprocessor in project CoreNLP by stanfordnlp.
the class MaxentTagger method tokenizeText.
/**
* Reads data from r, tokenizes it with the given tokenizer, and
* returns a List of Lists of objects that extend HasWord, which can then be
* fed into tagSentence.
*
* @param r Reader where untokenized text is read
* @param tokenizerFactory Tokenizer. This can be {@code null} in which case
* the default English tokenizer (PTBTokenizerFactory) is used.
* @return List of tokenized sentences
*/
public static List<List<HasWord>> tokenizeText(Reader r, TokenizerFactory<? extends HasWord> tokenizerFactory) {
  DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(r);
  if (tokenizerFactory != null) {
    documentPreprocessor.setTokenizerFactory(tokenizerFactory);
  }
  List<List<HasWord>> out = Generics.newArrayList();
  for (List<HasWord> item : documentPreprocessor) {
    out.add(item);
  }
  return out;
}
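A hypothetical call site, passing null for the factory so the default English PTBTokenizer is used (tagger loaded as in the earlier examples, model path a placeholder):
MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");  // placeholder path
List<List<HasWord>> sentences = MaxentTagger.tokenizeText(new StringReader("First sentence. And a second one."), null);
for (List<HasWord> sentence : sentences) {
  System.out.println(tagger.tagSentence(sentence));
}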