Example 6 with Sentence

Use of edu.illinois.cs.cogcomp.lbjava.nlp.Sentence in the cogcomp-nlp project by CogComp.

From the class IllinoisTokenizer, method tokenizeSentence.

/**
 * Given a sentence, return its tokens and their character offsets.
 *
 * @param sentence the plain text sentence to tokenize
 * @return a pair holding the ordered tokens of the sentence and, in parallel, their start and
 *         end character offsets (one-past-the-end indexing)
 */
@Override
public Pair<String[], IntPair[]> tokenizeSentence(String sentence) {
    Sentence lbjSentence = new Sentence(sentence);
    LinkedVector wordSplit = lbjSentence.wordSplit();
    String[] output = new String[wordSplit.size()];
    IntPair[] offsets = new IntPair[wordSplit.size()];
    for (int i = 0; i < output.length; i++) {
        LinkedChild linkedChild = wordSplit.get(i);
        output[i] = linkedChild.toString();
        // end + 1 yields the one-past-the-end offset promised in the Javadoc
        offsets[i] = new IntPair(linkedChild.start, linkedChild.end + 1);
    }
    return new Pair<>(output, offsets);
}
Also used: LinkedVector (edu.illinois.cs.cogcomp.lbjava.parse.LinkedVector), LinkedChild (edu.illinois.cs.cogcomp.lbjava.parse.LinkedChild), Sentence (edu.illinois.cs.cogcomp.lbjava.nlp.Sentence), IntPair (edu.illinois.cs.cogcomp.core.datastructures.IntPair), Pair (edu.illinois.cs.cogcomp.core.datastructures.Pair)
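
The offset convention in the method above (start inclusive, end one past the last character) is the same one String.substring uses, so each returned pair can be passed to substring directly. The following is a minimal sketch of that convention using only the JDK. It does not use the LBJava Sentence class: the class name OffsetSketch, the method whitespaceSpans, and the whitespace-only splitting rule are hypothetical stand-ins, and the real tokenizer likely splits text differently (for example around punctuation).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of cogcomp-nlp.
class OffsetSketch {

    /**
     * Splits text on whitespace and returns {start, end} pairs,
     * where end is one past the last character of the token.
     */
    static List<int[]> whitespaceSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip any whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= text.length()) {
                break;
            }
            int start = i;
            // Advance to the first whitespace character after the token.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            // Index of the token's last character (inclusive end).
            int inclusiveEnd = i - 1;
            // Adding 1 converts to one-past-the-end, mirroring "end + 1" above.
            spans.add(new int[] {start, inclusiveEnd + 1});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "The cat sat.";
        for (int[] span : whitespaceSpans(text)) {
            // substring(start, end) takes an exclusive end, so the span fits as-is.
            System.out.println(text.substring(span[0], span[1])
                    + " [" + span[0] + ", " + span[1] + ")");
        }
    }
}
```

With exclusive ends, a token's length is simply end minus start, and adjacent tokens can be compared without off-by-one adjustments.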

Aggregations

Sentence (edu.illinois.cs.cogcomp.lbjava.nlp.Sentence) 6
SentenceSplitter (edu.illinois.cs.cogcomp.lbjava.nlp.SentenceSplitter) 5
IntPair (edu.illinois.cs.cogcomp.core.datastructures.IntPair) 4
LinkedVector (edu.illinois.cs.cogcomp.lbjava.parse.LinkedVector) 4
Word (edu.illinois.cs.cogcomp.lbjava.nlp.Word) 2
ArrayList (java.util.ArrayList) 2
StringTokenizer (java.util.StringTokenizer) 2
Pair (edu.illinois.cs.cogcomp.core.datastructures.Pair) 1
SpanLabelView (edu.illinois.cs.cogcomp.core.datastructures.textannotation.SpanLabelView) 1
TextAnnotation (edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation) 1
LinkedChild (edu.illinois.cs.cogcomp.lbjava.parse.LinkedChild) 1
FileNotFoundException (java.io.FileNotFoundException) 1
HashSet (java.util.HashSet) 1
Vector (java.util.Vector) 1
Matcher (java.util.regex.Matcher) 1
Pattern (java.util.regex.Pattern) 1
Test (org.junit.Test) 1