Example 51 with HasWord

Use of edu.stanford.nlp.ling.HasWord in the CoreNLP project by stanfordnlp, from the class Tree, method dependencies().

/**
   * Return a set of TaggedWord-TaggedWord dependencies, represented as
   * Dependency objects, for the Tree.  This will only give
   * useful results if the internal tree node labels support HasWord and
   * head percolation has already been done (see percolateHeads()).
   *
   * @param f Dependencies rejected by this predicate are excluded
   * @param isConcrete If true and the labels carry indices, dependencies
   *          with distinct indices are built as UnnamedConcreteDependency
   * @param copyLabel Whether to copy the node labels when building the
   *          dependency labels, rather than reusing them
   * @param copyPosTag Whether to copy the POS tag onto the dependency labels
   * @return Set of dependencies (each a Dependency)
   */
public Set<Dependency<Label, Label, Object>> dependencies(Predicate<Dependency<Label, Label, Object>> f, boolean isConcrete, boolean copyLabel, boolean copyPosTag) {
    Set<Dependency<Label, Label, Object>> deps = Generics.newHashSet();
    for (Tree node : this) {
        // Skip leaves and unary re-writes
        if (node.isLeaf() || node.children().length < 2) {
            continue;
        }
        // Create the head label (percolateHeads has already been executed)
        Label headLabel = makeDependencyLabel(node.label(), copyLabel, isConcrete, copyPosTag);
        String headWord = ((HasWord) headLabel).word();
        if (headWord == null) {
            headWord = headLabel.value();
        }
        int headIndex = (isConcrete && (headLabel instanceof HasIndex)) ? ((HasIndex) headLabel).index() : -1;
        // every child with a different (or repeated) head is an argument
        boolean seenHead = false;
        for (Tree child : node.children()) {
            Label depLabel = makeDependencyLabel(child.label(), copyLabel, isConcrete, copyPosTag);
            String depWord = ((HasWord) depLabel).word();
            if (depWord == null) {
                depWord = depLabel.value();
            }
            int depIndex = (isConcrete && (depLabel instanceof HasIndex)) ? ((HasIndex) depLabel).index() : -1;
            if (!seenHead && headIndex == depIndex && headWord.equals(depWord)) {
                seenHead = true;
            } else {
                Dependency<Label, Label, Object> dependency = (isConcrete && depIndex != headIndex) ? new UnnamedConcreteDependency(headLabel, depLabel) : new UnnamedDependency(headLabel, depLabel);
                if (f.test(dependency)) {
                    deps.add(dependency);
                }
            }
        }
    }
    return deps;
}
Also used : HasWord(edu.stanford.nlp.ling.HasWord) CoreLabel(edu.stanford.nlp.ling.CoreLabel) Label(edu.stanford.nlp.ling.Label) HasIndex(edu.stanford.nlp.ling.HasIndex)
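
For context, here is a minimal sketch of calling dependencies(), assuming Tree.valueOf yields labels that support HasWord (LabeledScoredTreeFactory backs them with CoreLabel by default) and using CollinsHeadFinder for percolation; the tree literal and class name are invented for illustration.

import java.util.Set;
import edu.stanford.nlp.ling.Label;
import edu.stanford.nlp.trees.CollinsHeadFinder;
import edu.stanford.nlp.trees.Dependency;
import edu.stanford.nlp.trees.Tree;

public class TreeDependenciesSketch {
    public static void main(String[] args) {
        // Hypothetical input; any Tree whose labels support HasWord will do.
        Tree tree = Tree.valueOf("(ROOT (S (NP (DT The) (NN cat)) (VP (VBD sat))))");
        // Head percolation must run before dependencies() gives useful results.
        tree.percolateHeads(new CollinsHeadFinder());
        // Accept every dependency; pass a narrower predicate to filter.
        Set<Dependency<Label, Label, Object>> deps = tree.dependencies(d -> true, false, true, false);
        for (Dependency<Label, Label, Object> dep : deps) {
            System.out.println(dep);
        }
    }
}

Passing a narrower predicate in place of d -> true is how a caller would exclude, say, punctuation dependencies.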

Example 52 with HasWord

Use of edu.stanford.nlp.ling.HasWord in the CoreNLP project by stanfordnlp, from the class Tree, method percolateHeads().

/**
   * Finds the heads of the tree.  This code assumes that the label
   * stores and returns sensible values for the category, word, and tag;
   * otherwise it is a no-op.  The tree is modified in place.  The routine
   * assumes the Tree has word leaves and tag preterminals, and copies
   * their category to word and tag respectively, if they have a null
   * value.
   *
   * @param hf The headfinding algorithm to use
   */
public void percolateHeads(HeadFinder hf) {
    Label nodeLabel = label();
    if (isLeaf()) {
        // Sanity check: word() is usually set by the TreeReader.
        if (nodeLabel instanceof HasWord) {
            HasWord w = (HasWord) nodeLabel;
            if (w.word() == null) {
                w.setWord(nodeLabel.value());
            }
        }
    } else {
        for (Tree kid : children()) {
            kid.percolateHeads(hf);
        }
        final Tree head = hf.determineHead(this);
        if (head != null) {
            final Label headLabel = head.label();
            // Set the head tag.
            String headTag = (headLabel instanceof HasTag) ? ((HasTag) headLabel).tag() : null;
            if (headTag == null && head.isLeaf()) {
                // below us is a leaf
                headTag = nodeLabel.value();
            }
            // Set the head word
            String headWord = (headLabel instanceof HasWord) ? ((HasWord) headLabel).word() : null;
            if (headWord == null && head.isLeaf()) {
                // below us is a leaf
                // this might be useful despite case for leaf above in
                // case the leaf label type doesn't support word()
                headWord = headLabel.value();
            }
            // Set the head index
            int headIndex = (headLabel instanceof HasIndex) ? ((HasIndex) headLabel).index() : -1;
            if (nodeLabel instanceof HasWord) {
                ((HasWord) nodeLabel).setWord(headWord);
            }
            if (nodeLabel instanceof HasTag) {
                ((HasTag) nodeLabel).setTag(headTag);
            }
            if (nodeLabel instanceof HasIndex && headIndex >= 0) {
                ((HasIndex) nodeLabel).setIndex(headIndex);
            }
        } else {
            log.info("Head is null: " + this);
        }
    }
}
Also used : HasWord(edu.stanford.nlp.ling.HasWord) CoreLabel(edu.stanford.nlp.ling.CoreLabel) Label(edu.stanford.nlp.ling.Label) HasTag(edu.stanford.nlp.ling.HasTag) HasIndex(edu.stanford.nlp.ling.HasIndex)
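
A short follow-on sketch, under the same assumptions as the previous example (CoreLabel-backed labels from Tree.valueOf; hypothetical tree and class name): after percolation, each internal node's label carries its head word.

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.trees.CollinsHeadFinder;
import edu.stanford.nlp.trees.Tree;

public class PercolateHeadsSketch {
    public static void main(String[] args) {
        Tree tree = Tree.valueOf("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))");
        tree.percolateHeads(new CollinsHeadFinder());
        // Internal labels implementing HasWord now report the percolated head word.
        for (Tree node : tree) {
            if (!node.isLeaf() && node.label() instanceof HasWord) {
                System.out.println(node.value() + " -> " + ((HasWord) node.label()).word());
            }
        }
    }
}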

Example 53 with HasWord

Use of edu.stanford.nlp.ling.HasWord in the CoreNLP project by stanfordnlp, from the class ChineseLexiconAndWordSegmenter, method main().

/** This method lets you train and test a segmenter relative to a
   *  Treebank.
   *  <p>
   *  <i>Implementation note:</i> This method is largely cloned from
   *  LexicalizedParser's main method.  Should the two be merged so that
   *  segmenter training can't drift out of sync with the parser?
   */
public static void main(String[] args) {
    boolean train = false;
    boolean saveToSerializedFile = false;
    boolean saveToTextFile = false;
    String serializedInputFileOrUrl = null;
    String textInputFileOrUrl = null;
    String serializedOutputFileOrUrl = null;
    String textOutputFileOrUrl = null;
    String treebankPath = null;
    Treebank testTreebank = null;
    // Treebank tuneTreebank = null;
    String testPath = null;
    FileFilter testFilter = null;
    FileFilter trainFilter = null;
    String encoding = null;
    // variables needed to process the files to be parsed
    TokenizerFactory<Word> tokenizerFactory = null;
    //    DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor();
    // whether or not the input file has already been tokenized
    boolean tokenized = false;
    Function<List<HasWord>, List<HasWord>> escaper = new ChineseEscaper();
    // int tagDelimiter = -1;
    // String sentenceDelimiter = "\n";
    // boolean fromXML = false;
    int argIndex = 0;
    if (args.length < 1) {
        log.info("usage: java edu.stanford.nlp.parser.lexparser." + "LexicalizedParser parserFileOrUrl filename*");
        return;
    }
    Options op = new Options();
    op.tlpParams = new ChineseTreebankParserParams();
    // while loop through option arguments
    while (argIndex < args.length && args[argIndex].charAt(0) == '-') {
        if (args[argIndex].equalsIgnoreCase("-train")) {
            train = true;
            saveToSerializedFile = true;
            int numSubArgs = numSubArgs(args, argIndex);
            argIndex++;
            if (numSubArgs > 1) {
                treebankPath = args[argIndex];
                argIndex++;
            } else {
                throw new RuntimeException("Error: -train option must have treebankPath as first argument.");
            }
            if (numSubArgs == 2) {
                trainFilter = new NumberRangesFileFilter(args[argIndex++], true);
            } else if (numSubArgs >= 3) {
                try {
                    int low = Integer.parseInt(args[argIndex]);
                    int high = Integer.parseInt(args[argIndex + 1]);
                    trainFilter = new NumberRangeFileFilter(low, high, true);
                    argIndex += 2;
                } catch (NumberFormatException e) {
                    // maybe it's a ranges expression?
                    trainFilter = new NumberRangesFileFilter(args[argIndex], true);
                    argIndex++;
                }
            }
        } else if (args[argIndex].equalsIgnoreCase("-encoding")) {
            // sets encoding for TreebankLangParserParams
            encoding = args[argIndex + 1];
            op.tlpParams.setInputEncoding(encoding);
            op.tlpParams.setOutputEncoding(encoding);
            argIndex += 2;
        } else if (args[argIndex].equalsIgnoreCase("-loadFromSerializedFile")) {
            // load the parser from a binary serialized file
            // the next argument must be the path to the parser file
            serializedInputFileOrUrl = args[argIndex + 1];
            argIndex += 2;
        // doesn't make sense to load from TextFile -pichuan
        //      } else if (args[argIndex].equalsIgnoreCase("-loadFromTextFile")) {
        //        // load the parser from declarative text file
        //        // the next argument must be the path to the parser file
        //        textInputFileOrUrl = args[argIndex + 1];
        //        argIndex += 2;
        } else if (args[argIndex].equalsIgnoreCase("-saveToSerializedFile")) {
            saveToSerializedFile = true;
            serializedOutputFileOrUrl = args[argIndex + 1];
            argIndex += 2;
        } else if (args[argIndex].equalsIgnoreCase("-saveToTextFile")) {
            // save the parser to declarative text file
            saveToTextFile = true;
            textOutputFileOrUrl = args[argIndex + 1];
            argIndex += 2;
        } else if (args[argIndex].equalsIgnoreCase("-treebank")) {
            // the next argument is the treebank path and range for testing
            int numSubArgs = numSubArgs(args, argIndex);
            argIndex++;
            if (numSubArgs == 1) {
                testFilter = new NumberRangesFileFilter(args[argIndex++], true);
            } else if (numSubArgs > 1) {
                testPath = args[argIndex++];
                if (numSubArgs == 2) {
                    testFilter = new NumberRangesFileFilter(args[argIndex++], true);
                } else if (numSubArgs >= 3) {
                    try {
                        int low = Integer.parseInt(args[argIndex]);
                        int high = Integer.parseInt(args[argIndex + 1]);
                        testFilter = new NumberRangeFileFilter(low, high, true);
                        argIndex += 2;
                    } catch (NumberFormatException e) {
                        // maybe it's a ranges expression?
                        testFilter = new NumberRangesFileFilter(args[argIndex++], true);
                    }
                }
            }
        } else {
            int j = op.tlpParams.setOptionFlag(args, argIndex);
            if (j == argIndex) {
                log.info("Unknown option ignored: " + args[argIndex]);
                j++;
            }
            argIndex = j;
        }
    }
    // end while loop through arguments
    TreebankLangParserParams tlpParams = op.tlpParams;
    // all other arguments are order dependent and
    // are processed in order below
    ChineseLexiconAndWordSegmenter cs = null;
    if (!train && op.testOptions.verbose) {
        System.out.println("Currently " + new Date());
        printArgs(args, System.out);
    }
    if (train) {
        printArgs(args, System.out);
        // so we train a parser using the treebank
        if (treebankPath == null) {
            // the next arg must be the treebank path, since it wasn't given earlier
            treebankPath = args[argIndex];
            argIndex++;
            if (args.length > argIndex + 1) {
                try {
                    // the next two args might be the range
                    int low = Integer.parseInt(args[argIndex]);
                    int high = Integer.parseInt(args[argIndex + 1]);
                    trainFilter = new NumberRangeFileFilter(low, high, true);
                    argIndex += 2;
                } catch (NumberFormatException e) {
                    // maybe it's a ranges expression?
                    trainFilter = new NumberRangesFileFilter(args[argIndex], true);
                    argIndex++;
                }
            }
        }
        Treebank trainTreebank = makeTreebank(treebankPath, op, trainFilter);
        Index<String> wordIndex = new HashIndex<>();
        Index<String> tagIndex = new HashIndex<>();
        cs = new ChineseLexiconAndWordSegmenter(trainTreebank, op, wordIndex, tagIndex);
    } else if (textInputFileOrUrl != null) {
    // so we load the segmenter from a text grammar file
    // XXXXX fix later -pichuan
    //cs = new LexicalizedParser(textInputFileOrUrl, true, op);
    } else {
        // so we load a serialized segmenter
        if (serializedInputFileOrUrl == null) {
            // the next argument must be the path to the serialized parser
            serializedInputFileOrUrl = args[argIndex];
            argIndex++;
        }
        try {
            cs = new ChineseLexiconAndWordSegmenter(serializedInputFileOrUrl, op);
        } catch (IllegalArgumentException e) {
            log.info("Error loading segmenter, exiting...");
            System.exit(0);
        }
    }
    // the following has to go after reading parser to make sure
    // op and tlpParams are the same for train and test
    TreePrint treePrint = op.testOptions.treePrint(tlpParams);
    if (testFilter != null) {
        if (testPath == null) {
            if (treebankPath == null) {
                throw new RuntimeException("No test treebank path specified...");
            } else {
                log.info("No test treebank path specified.  Using train path: \"" + treebankPath + "\"");
                testPath = treebankPath;
            }
        }
        testTreebank = tlpParams.testMemoryTreebank();
        testTreebank.loadPath(testPath, testFilter);
    }
    op.trainOptions.sisterSplitters = Generics.newHashSet(Arrays.asList(tlpParams.sisterSplitters()));
    // -- Roger
    if (op.testOptions.verbose) {
        log.info("Lexicon is " + cs.getClass().getName());
    }
    PrintWriter pwOut = tlpParams.pw();
    PrintWriter pwErr = tlpParams.pw(System.err);
    // Now what do we do with the parser we've made
    if (saveToTextFile) {
        // save the parser to textGrammar format
        if (textOutputFileOrUrl != null) {
            saveSegmenterDataToText(cs, textOutputFileOrUrl);
        } else {
            log.info("Usage: must specify a text segmenter data output path");
        }
    }
    if (saveToSerializedFile) {
        if (serializedOutputFileOrUrl == null && argIndex < args.length) {
            // the next argument must be the path to serialize to
            serializedOutputFileOrUrl = args[argIndex];
            argIndex++;
        }
        if (serializedOutputFileOrUrl != null) {
            saveSegmenterDataToSerialized(cs, serializedOutputFileOrUrl);
        } else if (textOutputFileOrUrl == null && testTreebank == null) {
            // no saving/parsing request has been specified
            log.info("usage: " + "java edu.stanford.nlp.parser.lexparser.ChineseLexiconAndWordSegmenter" + "-train trainFilesPath [start stop] serializedParserFilename");
        }
    }
    /* --------------------- Testing part!!!! ----------------------- */
    if (op.testOptions.verbose) {
    //      printOptions(false, op);
    }
    if (testTreebank != null || (argIndex < args.length && args[argIndex].equalsIgnoreCase("-treebank"))) {
        // test parser on treebank
        if (testTreebank == null) {
            // the next argument is the treebank path and range for testing
            testTreebank = tlpParams.testMemoryTreebank();
            if (args.length < argIndex + 4) {
                testTreebank.loadPath(args[argIndex + 1]);
            } else {
                int testlow = Integer.parseInt(args[argIndex + 2]);
                int testhigh = Integer.parseInt(args[argIndex + 3]);
                testTreebank.loadPath(args[argIndex + 1], new NumberRangeFileFilter(testlow, testhigh, true));
            }
        }
    /* TODO - test segmenting on treebank. -pichuan */
    //      lp.testOnTreebank(testTreebank);
    //    } else if (argIndex >= args.length) {
    //      // no more arguments, so we just parse our own test sentence
    //      if (lp.parse(op.tlpParams.defaultTestSentence())) {
    //        treePrint.printTree(lp.getBestParse(), pwOut);
    //      } else {
    //        pwErr.println("Error. Can't parse test sentence: " +
    //              lp.parse(op.tlpParams.defaultTestSentence()));
    //      }
    }
//wsg2010: This code block doesn't actually do anything. It appears to read and tokenize a file, and then just print it.
//         There are easier ways to do that. This code was copied from an old version of LexicalizedParser.
//    else {
//      // We parse filenames given by the remaining arguments
//      int numWords = 0;
//      Timing timer = new Timing();
//      // set the tokenizer
//      if (tokenized) {
//        tokenizerFactory = WhitespaceTokenizer.factory();
//      }
//      TreebankLanguagePack tlp = tlpParams.treebankLanguagePack();
//      if (tokenizerFactory == null) {
//        tokenizerFactory = (TokenizerFactory<Word>) tlp.getTokenizerFactory();
//      }
//      documentPreprocessor.setTokenizerFactory(tokenizerFactory);
//      documentPreprocessor.setSentenceFinalPuncWords(tlp.sentenceFinalPunctuationWords());
//      if (encoding != null) {
//        documentPreprocessor.setEncoding(encoding);
//      }
//      timer.start();
//      for (int i = argIndex; i < args.length; i++) {
//        String filename = args[i];
//        try {
//          List document = null;
//          if (fromXML) {
//            document = documentPreprocessor.getSentencesFromXML(filename, sentenceDelimiter, tokenized);
//          } else {
//            document = documentPreprocessor.getSentencesFromText(filename, escaper, sentenceDelimiter, tagDelimiter);
//          }
//          log.info("Segmenting file: " + filename + " with " + document.size() + " sentences.");
//          PrintWriter pwo = pwOut;
//          if (op.testOptions.writeOutputFiles) {
//            try {
//              pwo = tlpParams.pw(new FileOutputStream(filename + ".stp"));
//            } catch (IOException ioe) {
//              ioe.printStackTrace();
//            }
//          }
//          int num = 0;
//          treePrint.printHeader(pwo, tlp.getEncoding());
//          for (Iterator it = document.iterator(); it.hasNext();) {
//            num++;
//            List sentence = (List) it.next();
//            int len = sentence.size();
//            numWords += len;
////            pwErr.println("Parsing [sent. " + num + " len. " + len + "]: " + sentence);
//            pwo.println(Sentence.listToString(sentence));
//          }
//          treePrint.printFooter(pwo);
//          if (op.testOptions.writeOutputFiles) {
//            pwo.close();
//          }
//        } catch (IOException e) {
//          pwErr.println("Couldn't find file: " + filename);
//        }
//
//      } // end for each file
//      long millis = timer.stop();
//      double wordspersec = numWords / (((double) millis) / 1000);
//      NumberFormat nf = new DecimalFormat("0.00"); // easier way!
//      pwErr.println("Segmented " + numWords + " words at " + nf.format(wordspersec) + " words per second.");
//    }
}
Also used : NumberRangeFileFilter(edu.stanford.nlp.io.NumberRangeFileFilter) NumberRangesFileFilter(edu.stanford.nlp.io.NumberRangesFileFilter) HasWord(edu.stanford.nlp.ling.HasWord) TaggedWord(edu.stanford.nlp.ling.TaggedWord) Word(edu.stanford.nlp.ling.Word) ChineseEscaper(edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper) HashIndex(edu.stanford.nlp.util.HashIndex)
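
The ChineseEscaper set up near the top of main deserves a standalone look. A minimal sketch, assuming (as its use in Chinese treebank processing suggests) that it rewrites ASCII characters in tokens into their full-width equivalents; the token and class name are invented.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper;

public class ChineseEscaperSketch {
    public static void main(String[] args) {
        // Same assignment as in main() above: ChineseEscaper is a Function over token lists.
        Function<List<HasWord>, List<HasWord>> escaper = new ChineseEscaper();
        List<HasWord> sentence = new ArrayList<>();
        sentence.add(new Word("ABC123"));  // hypothetical token containing ASCII characters
        for (HasWord w : escaper.apply(sentence)) {
            System.out.println(w.word());
        }
    }
}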

Example 54 with HasWord

Use of edu.stanford.nlp.ling.HasWord in the CoreNLP project by stanfordnlp, from the class ParserUtils, method xTree().

/**
   * Construct a fall-through tree in case we can't parse this sentence.
   *
   * @param words Words of the sentence that didn't parse
   * @return A tree with X for all the internal nodes.
   *     Preterminals have the right tag if the words are tagged.
   */
public static Tree xTree(List<? extends HasWord> words) {
    TreeFactory treeFactory = new LabeledScoredTreeFactory();
    List<Tree> lst2 = new ArrayList<>();
    for (HasWord obj : words) {
        String s = obj.word();
        Tree t = treeFactory.newLeaf(s);
        String tag = "XX";
        if (obj instanceof HasTag) {
            if (((HasTag) obj).tag() != null) {
                tag = ((HasTag) obj).tag();
            }
        }
        Tree t2 = treeFactory.newTreeNode(tag, Collections.singletonList(t));
        lst2.add(t2);
    }
    return treeFactory.newTreeNode("X", lst2);
}
Also used : HasWord(edu.stanford.nlp.ling.HasWord) TreeFactory(edu.stanford.nlp.trees.TreeFactory) LabeledScoredTreeFactory(edu.stanford.nlp.trees.LabeledScoredTreeFactory) ArrayList(java.util.ArrayList) Tree(edu.stanford.nlp.trees.Tree) HasTag(edu.stanford.nlp.ling.HasTag)
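
A quick usage sketch; the tokens are invented, and ParserUtils is assumed to live in edu.stanford.nlp.parser.common, alongside the ParserConstraint entry in the aggregation list below. Since TaggedWord implements both HasWord and HasTag, the preterminals pick up the supplied tags rather than XX.

import java.util.Arrays;
import java.util.List;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.common.ParserUtils;
import edu.stanford.nlp.trees.Tree;

public class XTreeSketch {
    public static void main(String[] args) {
        List<TaggedWord> words = Arrays.asList(new TaggedWord("parse", "VB"), new TaggedWord("failed", "VBD"));
        Tree fallback = ParserUtils.xTree(words);
        System.out.println(fallback);  // expected: (X (VB parse) (VBD failed))
    }
}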

Example 55 with HasWord

Use of edu.stanford.nlp.ling.HasWord in the CoreNLP project by stanfordnlp, from the class LexicalizedParserServer, method handleTokenize().

public void handleTokenize(String arg, OutputStream outStream) throws IOException {
    if (arg == null) {
        return;
    }
    List<? extends HasWord> tokens = parser.tokenize(arg);
    OutputStreamWriter osw = new OutputStreamWriter(outStream, "utf-8");
    for (int i = 0; i < tokens.size(); ++i) {
        HasWord word = tokens.get(i);
        if (i > 0) {
            osw.write(" ");
        }
        osw.write(word.toString());
    }
    osw.write("\n");
    osw.flush();
}
Also used : HasWord(edu.stanford.nlp.ling.HasWord) OutputStreamWriter(java.io.OutputStreamWriter)
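
The same join-with-spaces pattern works outside the server. A minimal sketch that stands in PTBTokenizer for parser.tokenize(arg), which is an assumption: the server presumably delegates to whatever tokenizer its parser carries. The input string is invented.

import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizeSketch {
    public static void main(String[] args) {
        List<Word> tokens = PTBTokenizer.newPTBTokenizer(new StringReader("It's a test.")).tokenize();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(tokens.get(i).word());
        }
        System.out.println(sb);
    }
}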

Aggregations

HasWord (edu.stanford.nlp.ling.HasWord): 58
CoreLabel (edu.stanford.nlp.ling.CoreLabel): 17
TaggedWord (edu.stanford.nlp.ling.TaggedWord): 15
ArrayList (java.util.ArrayList): 15
HasTag (edu.stanford.nlp.ling.HasTag): 13
Tree (edu.stanford.nlp.trees.Tree): 13
DocumentPreprocessor (edu.stanford.nlp.process.DocumentPreprocessor): 12
StringReader (java.io.StringReader): 12
Label (edu.stanford.nlp.ling.Label): 10
Word (edu.stanford.nlp.ling.Word): 10
List (java.util.List): 8
BufferedReader (java.io.BufferedReader): 6
MaxentTagger (edu.stanford.nlp.tagger.maxent.MaxentTagger): 5
File (java.io.File): 5
PrintWriter (java.io.PrintWriter): 5
ParserConstraint (edu.stanford.nlp.parser.common.ParserConstraint): 4
Pair (edu.stanford.nlp.util.Pair): 4
CoreAnnotations (edu.stanford.nlp.ling.CoreAnnotations): 3
HasIndex (edu.stanford.nlp.ling.HasIndex): 3
Sentence (edu.stanford.nlp.ling.Sentence): 3