
Example 1 with Sentence

use of edu.illinois.cs.cogcomp.core.datastructures.textannotation.Sentence in project cogcomp-nlp by CogComp.

the class ThaiTokenizer method main.

public static void main(String[] args) {
    String text = "สตาร์คราฟต์   เป็นวิดีโอเกมประเภทวางแผนเรียลไทม์และบันเทิงคดีวิทยาศาสตร์การทหาร พัฒนาและจัดจำหน่ายโดยบลิซซาร์ด เอ็นเตอร์เทนเมนต์ ออกบนระบบปฏิบัติการไมโครซอฟท์ วินโดวส์เมื่อวันที่ 31 มีนาคม 2541 ต่อมา เกมขยายเป็นแฟรนไชส์ และเป็นเกมแรกของซีรีส์สตาร์คราฟต์ รุ่นแมคโอเอสออกในเดือนมีนาคม 2542 และรุ่นดัดแปลงนินเทนโด 64 ซึ่งพัฒนาร่วมกับแมสมีเดีย ออกในวันที่ 13 มิถุนายน 2543 การพัฒนาเกมนี้เริ่มขึ้นไม่นานหลังวอร์คราฟต์ 2: ไทด์สออฟดาร์กเนส ออกในปี 2538 สตาร์คราฟต์เปิดตัวในงานอี3 ปี 2539 ซึ่งเป็นที่ชื่นชอบน้อยกว่าวอร์คราฟต์ 2 ฉะนั้น โครงการจึงถูกพลิกโฉมทั้งหมดแล้วแสดงต่อสาธารณะในต้นปี 2540 ซึ่งได้รับการตอบรับดีกว่ามาก";
    text = "    2507  การสืบสวนของคณะกรรมการสมาชิกผู้แทนราษฎรสหรัฐว่าด้วยการลอบสังหารประธานาธิบดี (hsca) ระหว่าง - พศ 2522  และการสืบสวนของรัฐบาล สรุปว่าประธานาธิบดีถูกลอบสังหารโดยลี ฮาร์วีย์ ออสวอลด์ ซึ่งในเวล\n";
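    // Note: the second assignment to text replaces the first, so only the second sample is tokenized.
    ThaiTokenizer token = new ThaiTokenizer();
    TextAnnotation ta = token.getTextAnnotation(text);
    // Print the tokenized text of each detected sentence.
    for (Sentence sen : ta.sentences()) {
        System.out.println(sen.getTokenizedText());
    }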
}
Also used : TextAnnotation(edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation) Sentence(edu.illinois.cs.cogcomp.core.datastructures.textannotation.Sentence)

Example 2 with Sentence

use of edu.illinois.cs.cogcomp.core.datastructures.textannotation.Sentence in project cogcomp-nlp by CogComp.

the class ReferenceUtils method createNerDataStructuresForText.

public Data createNerDataStructuresForText(TextAnnotation ta) {
    ArrayList<LinkedVector> sentences = new ArrayList<>();
    String[] tokens = ta.getTokens();
    int[] tokenindices = new int[tokens.length];
    int tokenIndex = 0;
    int neWordIndex = 0;
    for (int i = 0; i < ta.getNumberOfSentences(); i++) {
        Sentence sentence = ta.getSentence(i);
        String[] wtoks = sentence.getTokens();
        LinkedVector words = new LinkedVector();
        for (String w : wtoks) {
            if (w.length() > 0) {
                NEWord.addTokenToSentence(words, w, "unlabeled");
                tokenindices[neWordIndex] = tokenIndex;
                neWordIndex++;
            } else {
                throw new IllegalStateException("Bad (zero length) token.");
            }
            tokenIndex++;
        }
        if (words.size() > 0)
            sentences.add(words);
    }
    // Wrap the sentences in a single-document Data object; no annotation is performed here.
    Data data = new Data(new NERDocument(sentences, "input"));
    return data;
}
Also used : LinkedVector(edu.illinois.cs.cogcomp.lbjava.parse.LinkedVector) ArrayList(java.util.ArrayList) Data(edu.illinois.cs.cogcomp.ner.LbjTagger.Data) NERDocument(edu.illinois.cs.cogcomp.ner.LbjTagger.NERDocument) Sentence(edu.illinois.cs.cogcomp.core.datastructures.textannotation.Sentence)

Example 3 with Sentence

use of edu.illinois.cs.cogcomp.core.datastructures.textannotation.Sentence in project cogcomp-nlp by CogComp.

the class VerbVoiceIndicator method getWordFeatures.

@Override
public Set<Feature> getWordFeatures(TextAnnotation ta, int wordPosition) throws EdisonException {
    Sentence sentence = ta.getSentenceFromToken(wordPosition);
    int sentenceStart = sentence.getStartSpan();
    int predicatePosition = wordPosition - sentenceStart;
    Tree<String> tree = ParseHelper.getParseTree(parseViewName, sentence);
    Tree<Pair<String, IntPair>> spanLabeledTree = ParseUtils.getSpanLabeledTree(tree);
    Tree<Pair<String, IntPair>> currentNode = spanLabeledTree.getYield().get(predicatePosition).getParent();
    String f = getVoice(currentNode);
    return new LinkedHashSet<Feature>(Collections.singletonList(DiscreteFeature.create(f)));
}
Also used : LinkedHashSet(java.util.LinkedHashSet) Sentence(edu.illinois.cs.cogcomp.core.datastructures.textannotation.Sentence) IntPair(edu.illinois.cs.cogcomp.core.datastructures.IntPair) Pair(edu.illinois.cs.cogcomp.core.datastructures.Pair)
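
The index arithmetic in getWordFeatures (document-level wordPosition minus the sentence's start token) can be illustrated without the library. The following is a minimal sketch in plain Java, assuming sentence boundaries are available as an ascending array of first-token indices; the class SentenceOffsets and the method locate are hypothetical names, not part of cogcomp-nlp.

```java
import java.util.Arrays;

/** Hypothetical helper: maps a document-level token index to a sentence-relative one. */
final class SentenceOffsets {

    /**
     * @param sentenceStarts document-level index of the first token of each sentence, ascending
     * @param wordPosition   document-level token index
     * @return a two-element array: { sentenceIndex, positionWithinSentence }
     */
    static int[] locate(int[] sentenceStarts, int wordPosition) {
        if (sentenceStarts.length == 0 || wordPosition < sentenceStarts[0]) {
            throw new IllegalArgumentException("wordPosition precedes the first sentence");
        }
        int idx = Arrays.binarySearch(sentenceStarts, wordPosition);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1;
        // the containing sentence is the one just before the insertion point.
        int sentence = idx >= 0 ? idx : -idx - 2;
        // Same subtraction as predicatePosition = wordPosition - sentenceStart.
        return new int[] {sentence, wordPosition - sentenceStarts[sentence]};
    }

    public static void main(String[] args) {
        int[] starts = {0, 5, 9}; // three sentences starting at tokens 0, 5 and 9
        System.out.println(Arrays.toString(locate(starts, 6))); // prints [1, 1]
    }
}
```

In the library itself the lookup is done by ta.getSentenceFromToken(wordPosition); the sketch only shows why the subtraction yields an index that is valid within the sentence's parse tree yield.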

Example 4 with Sentence

use of edu.illinois.cs.cogcomp.core.datastructures.textannotation.Sentence in project cogcomp-nlp by CogComp.

the class NERAnnotator method addView.

/**
 * Generates the view representing the list of extracted entities and adds it to the
 * {@link TextAnnotation}.
 */
@Override
public void addView(TextAnnotation ta) {
    // convert this data structure into one the NER package can deal with.
    ArrayList<LinkedVector> sentences = new ArrayList<>();
    String[] tokens = ta.getTokens();
    int[] tokenindices = new int[tokens.length];
    int tokenIndex = 0;
    int neWordIndex = 0;
    for (int i = 0; i < ta.getNumberOfSentences(); i++) {
        Sentence sentence = ta.getSentence(i);
        String[] wtoks = sentence.getTokens();
        LinkedVector words = new LinkedVector();
        for (String w : wtoks) {
            if (w.length() > 0) {
                NEWord.addTokenToSentence(words, w, "unlabeled");
                tokenindices[neWordIndex] = tokenIndex;
                neWordIndex++;
            } else {
                logger.error("Bad (zero length) token.");
            }
            tokenIndex++;
        }
        if (words.size() > 0)
            sentences.add(words);
    }
    // Do the annotation.
    Data data = new Data(new NERDocument(sentences, "input"));
    try {
        ExpressiveFeaturesAnnotator.annotate(data);
        Decoder.annotateDataBIO(data, t1, t2);
    } catch (Exception e) {
        logger.error("Cannot annotate the text, the exception was: ", e);
        return;
    }
    // now we have the parsed entities, construct the view object.
    ArrayList<LinkedVector> nerSentences = data.documents.get(0).sentences;
    SpanLabelView nerView = new SpanLabelView(getViewName(), ta);
    // the data always has a single document
    // each LinkedVector in data corresponds to a sentence.
    int tokenoffset = 0;
    for (LinkedVector vector : nerSentences) {
        boolean open = false;
        // there should be a 1:1 mapping btw sentence tokens in record and words/predictions
        // from NER.
        int startIndex = -1;
        String label = null;
        for (int j = 0; j < vector.size(); j++, tokenoffset++) {
            NEWord neWord = (NEWord) (vector.get(j));
            String prediction = neWord.neTypeLevel2;
            // inefficient, use enums, or nominalized indexes for this sort of thing.
            if (prediction.startsWith("B-")) {
                startIndex = tokenoffset;
                label = prediction.substring(2);
                open = true;
            } else if (j > 0) {
                String previous_prediction = ((NEWord) vector.get(j - 1)).neTypeLevel2;
                if (prediction.startsWith("I-") && (!previous_prediction.endsWith(prediction.substring(2)))) {
                    startIndex = tokenoffset;
                    label = prediction.substring(2);
                    open = true;
                }
            }
            if (open) {
                boolean close = false;
                if (j == vector.size() - 1) {
                    close = true;
                } else {
                    String next_prediction = ((NEWord) vector.get(j + 1)).neTypeLevel2;
                    if (next_prediction.startsWith("B-"))
                        close = true;
                    if (next_prediction.equals("O"))
                        close = true;
                    if (next_prediction.indexOf('-') > -1 && (!prediction.endsWith(next_prediction.substring(2))))
                        close = true;
                }
                if (close) {
                    int s = tokenindices[startIndex];
                    /*
                     * MS: fixed bug. Originally, e was set using tokenindices[tokenoffset], but
                     * tokenoffset can reach tokens.length, which exceeds the array length.
                     * The Constituent constructor requires one-past-the-end token indexing,
                     * requiring e > s. Hence the complicated setting of endIndex/e below.
                     */
                    int endIndex = Math.min(tokenoffset + 1, tokens.length - 1);
                    int e = tokenindices[endIndex];
                    if (e <= s)
                        e = s + 1;
                    nerView.addSpanLabel(s, e, label, 1d);
                    open = false;
                }
            }
        }
    }
    ta.addView(viewName, nerView);
}
Also used : LinkedVector(edu.illinois.cs.cogcomp.lbjava.parse.LinkedVector) ArrayList(java.util.ArrayList) SpanLabelView(edu.illinois.cs.cogcomp.core.datastructures.textannotation.SpanLabelView) IOException(java.io.IOException) Sentence(edu.illinois.cs.cogcomp.core.datastructures.textannotation.Sentence)
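
The second half of addView turns per-token BIO predictions into labeled spans with one-past-the-end indexing. The following is a simplified sketch of that idea in plain Java for a single sentence, assuming tags of the form "B-X", "I-X" and "O"; the class BioSpanDecoder and its Span type are hypothetical and do not reproduce the tokenindices remapping or the SpanLabelView API used above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, library-free sketch of BIO-to-span decoding for one sentence. */
final class BioSpanDecoder {

    /** A labeled token span; end is one past the last token, so end > start always holds. */
    static final class Span {
        final int start;
        final int end;
        final String label;

        Span(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }

        @Override
        public String toString() {
            return label + "[" + start + "," + end + ")";
        }
    }

    static List<Span> decode(String[] tags) {
        List<Span> spans = new ArrayList<>();
        int start = -1;
        String label = null; // label of the currently open span, or null if none is open
        for (int i = 0; i < tags.length; i++) {
            String tag = tags[i];
            // An "I-X" tag continues the open span only if the open span has the same label X.
            boolean continues = tag.startsWith("I-") && label != null
                    && label.equals(tag.substring(2));
            if (!continues && label != null) {
                spans.add(new Span(start, i, label)); // the open span ends just before token i
                label = null;
            }
            // "B-X" always opens a span; an "I-X" that continues nothing is treated as an opener.
            if (tag.startsWith("B-") || (tag.startsWith("I-") && !continues)) {
                start = i;
                label = tag.substring(2);
            }
        }
        if (label != null) {
            spans.add(new Span(start, tags.length, label)); // close a span that reaches the end
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tags = {"B-PER", "I-PER", "O", "I-LOC", "B-LOC"};
        System.out.println(decode(tags)); // prints [PER[0,2), LOC[3,4), LOC[4,5)]
    }
}
```

Like the original, the sketch treats an "I-" tag whose label differs from the preceding token's as the start of a new span, and closes an open span when the next tag is "B-", "O", or a different label.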

Example 5 with Sentence

use of edu.illinois.cs.cogcomp.core.datastructures.textannotation.Sentence in project cogcomp-nlp by CogComp.

the class BulkTokenizer method main.

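/**
 * @param args command-line arguments, parsed by parseArgs
 * @throws IOException if reading the input files or writing the diff file fails
 */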
public static void main(String[] args) throws IOException {
    parseArgs(args);
    if (file == null) {
        System.err.println("Must provide a file or directory name on the command line.");
        return;
    }
    File[] files;
    File nf = new File(file);
    if (nf.isDirectory())
        // Use the path parsed by parseArgs; args[0] is not guaranteed to be the file argument.
        files = nf.listFiles();
    else {
        files = new File[1];
        files[0] = nf;
    }
    ArrayList<String> datas = readAllFiles(files);
    BufferedWriter fw = new BufferedWriter(new FileWriter(new File("tokenizerdiffs.out")));
    final TextAnnotationBuilder stab = new TokenizerTextAnnotationBuilder(new StatefulTokenizer());
    if (profile) {
        System.out.println("Starting profiling");
        while (true) {
            for (String data : datas) {
                stab.createTextAnnotation(data);
            }
        }
    } else {
        System.out.println("Starting new annotations");
        long nt = System.currentTimeMillis();
        ArrayList<TextAnnotation> newannotations = new ArrayList<TextAnnotation>();
        final TextAnnotationBuilder ntab = new TokenizerTextAnnotationBuilder(new StatefulTokenizer());
        for (String data : datas) {
            TextAnnotation ta = ntab.createTextAnnotation(data);
            newannotations.add(ta);
        }
        nt = System.currentTimeMillis() - nt;
        System.out.println("Starting old annotations");
        long ot = System.currentTimeMillis();
        ArrayList<TextAnnotation> oldannotations = new ArrayList<TextAnnotation>();
        final TextAnnotationBuilder tab = new TokenizerTextAnnotationBuilder(new IllinoisTokenizer());
        for (String data : datas) {
            TextAnnotation ta = tab.createTextAnnotation(data);
            oldannotations.add(ta);
        }
        ot = System.currentTimeMillis() - ot;
        System.out.println("new way = " + nt + ", old way = " + ot);
        int good = 0, bad = 0;
        for (int i = 0; i < oldannotations.size(); i++) {
            File file = files[i];
            TextAnnotation newone = newannotations.get(i);
            TextAnnotation oldone = oldannotations.get(i);
            if (newone.sentences().equals(oldone.sentences())) {
                good++;
            } else {
                bad++;
                fw.write("-" + file + "\n");
                if (verbose) {
                    List<Sentence> newsentences = newone.sentences();
                    List<Sentence> oldsentences = oldone.sentences();
                    int max = Math.max(newsentences.size(), oldsentences.size());
                    boolean sentencewritten = false;
                    for (int j = 0; j < max; j++) {
                        String news = newsentences.size() > j ? newsentences.get(j).toString() : "???";
                        String olds = oldsentences.size() > j ? oldsentences.get(j).toString() : "???";
                        if (!compareSentences(olds, news)) {
                            if (!sentencewritten) {
                                sentencewritten = true;
                                fw.write("-" + file + "\n");
                                fw.write(newone.toString() + "\n");
                            }
                            fw.write(" new : " + news + "\n old : " + olds + "\n");
                        }
                    }
                }
            }
        }
        fw.close();
        System.out.println(good + " correct, " + bad + " wrong.");
    }
}
Also used : TextAnnotationBuilder(edu.illinois.cs.cogcomp.annotation.TextAnnotationBuilder) TokenizerTextAnnotationBuilder(edu.illinois.cs.cogcomp.nlp.utility.TokenizerTextAnnotationBuilder) FileWriter(java.io.FileWriter) IllinoisTokenizer(edu.illinois.cs.cogcomp.nlp.tokenizer.IllinoisTokenizer) ArrayList(java.util.ArrayList) BufferedWriter(java.io.BufferedWriter) StatefulTokenizer(edu.illinois.cs.cogcomp.nlp.tokenizer.StatefulTokenizer) TextAnnotation(edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation) File(java.io.File) Sentence(edu.illinois.cs.cogcomp.core.datastructures.textannotation.Sentence)
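
The verbose branch of BulkTokenizer.main walks two sentence lists in parallel and pads the shorter one with "???". The following is a minimal sketch of that comparison in plain Java, assuming sentences are already rendered as strings; the class SentenceDiff is hypothetical, and plain String.equals stands in for the compareSentences helper, whose implementation is not shown in the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a position-by-position comparison of two sentence lists. */
final class SentenceDiff {

    /** Placeholder used when one list has no sentence at a given position. */
    static final String MISSING = "???";

    static List<String> diff(List<String> oldSentences, List<String> newSentences) {
        List<String> report = new ArrayList<>();
        int max = Math.max(oldSentences.size(), newSentences.size());
        for (int j = 0; j < max; j++) {
            String olds = j < oldSentences.size() ? oldSentences.get(j) : MISSING;
            String news = j < newSentences.size() ? newSentences.get(j) : MISSING;
            // Assumption: exact string equality replaces the original compareSentences check.
            if (!olds.equals(news)) {
                report.add(" new : " + news + "\n old : " + olds);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> oldOnes = Arrays.asList("a b .", "c .");
        List<String> newOnes = Arrays.asList("a b .");
        for (String entry : diff(oldOnes, newOnes)) {
            System.out.println(entry);
        }
    }
}
```

Because the comparison is positional, one extra sentence split early in a document makes every later pair differ; an alignment-based diff would avoid that, at the cost of more code.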

Aggregations

Sentence (edu.illinois.cs.cogcomp.core.datastructures.textannotation.Sentence): 5
ArrayList (java.util.ArrayList): 3
TextAnnotation (edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation): 2
LinkedVector (edu.illinois.cs.cogcomp.lbjava.parse.LinkedVector): 2
TextAnnotationBuilder (edu.illinois.cs.cogcomp.annotation.TextAnnotationBuilder): 1
IntPair (edu.illinois.cs.cogcomp.core.datastructures.IntPair): 1
Pair (edu.illinois.cs.cogcomp.core.datastructures.Pair): 1
SpanLabelView (edu.illinois.cs.cogcomp.core.datastructures.textannotation.SpanLabelView): 1
Data (edu.illinois.cs.cogcomp.ner.LbjTagger.Data): 1
NERDocument (edu.illinois.cs.cogcomp.ner.LbjTagger.NERDocument): 1
IllinoisTokenizer (edu.illinois.cs.cogcomp.nlp.tokenizer.IllinoisTokenizer): 1
StatefulTokenizer (edu.illinois.cs.cogcomp.nlp.tokenizer.StatefulTokenizer): 1
TokenizerTextAnnotationBuilder (edu.illinois.cs.cogcomp.nlp.utility.TokenizerTextAnnotationBuilder): 1
BufferedWriter (java.io.BufferedWriter): 1
File (java.io.File): 1
FileWriter (java.io.FileWriter): 1
IOException (java.io.IOException): 1
LinkedHashSet (java.util.LinkedHashSet): 1