Example 6 with IWord

use of com.hankcs.hanlp.corpus.document.sentence.word.IWord in project HanLP by hankcs.

the class Document method getSimpleSentenceList.

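/**
     * Gets the list of simple sentences. A compound word whose label is in the given set is split
     * into its inner simple words; any other compound word is converted to a single simple word.
     * @param labelSet labels of the compound words that should be split
     * @return one list of simple words per sentence
     */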
public List<List<Word>> getSimpleSentenceList(Set<String> labelSet) {
    List<List<Word>> simpleList = new LinkedList<List<Word>>();
    for (Sentence sentence : sentenceList) {
        List<Word> wordList = new LinkedList<Word>();
        for (IWord word : sentence.wordList) {
            if (word instanceof CompoundWord) {
                if (labelSet.contains(word.getLabel())) {
                    for (Word inner : ((CompoundWord) word).innerList) {
                        wordList.add(inner);
                    }
                } else {
                    wordList.add(((CompoundWord) word).toWord());
                }
            } else {
                wordList.add((Word) word);
            }
        }
        simpleList.add(wordList);
    }
    return simpleList;
}
Also used : CompoundWord(com.hankcs.hanlp.corpus.document.sentence.word.CompoundWord) Word(com.hankcs.hanlp.corpus.document.sentence.word.Word) IWord(com.hankcs.hanlp.corpus.document.sentence.word.IWord) List(java.util.List) LinkedList(java.util.LinkedList) Sentence(com.hankcs.hanlp.corpus.document.sentence.Sentence)
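
The flattening logic of getSimpleSentenceList can be tried without the HanLP jar. The following is a minimal sketch, not the library's code: Token, Simple and Compound are hypothetical stand-ins for IWord, Word and CompoundWord, and toSimple() assumes that collapsing a compound word means concatenating its inner values under the compound's label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class FlattenCompoundDemo {
    // Hypothetical stand-in for IWord: anything that carries a label.
    interface Token {
        String label();
    }

    // Hypothetical stand-in for Word: a value plus a part-of-speech label.
    static final class Simple implements Token {
        final String value;
        final String label;

        Simple(String value, String label) {
            this.value = value;
            this.label = label;
        }

        public String label() {
            return label;
        }

        @Override
        public String toString() {
            return value + "/" + label;
        }
    }

    // Hypothetical stand-in for CompoundWord: several simple words under one label.
    static final class Compound implements Token {
        final List<Simple> inner;
        final String label;

        Compound(List<Simple> inner, String label) {
            this.inner = inner;
            this.label = label;
        }

        public String label() {
            return label;
        }

        // Assumption: collapsing concatenates the inner values and keeps the compound label.
        Simple toSimple() {
            StringBuilder sb = new StringBuilder();
            for (Simple s : inner) {
                sb.append(s.value);
            }
            return new Simple(sb.toString(), label);
        }
    }

    // Split compounds whose label is in labelSet; collapse every other compound.
    static List<Simple> simplify(List<Token> sentence, Set<String> labelSet) {
        List<Simple> out = new ArrayList<>();
        for (Token t : sentence) {
            if (t instanceof Compound) {
                Compound c = (Compound) t;
                if (labelSet.contains(c.label())) {
                    out.addAll(c.inner);
                } else {
                    out.add(c.toSimple());
                }
            } else {
                out.add((Simple) t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> sentence = new ArrayList<>();
        sentence.add(new Compound(
                Arrays.asList(new Simple("central", "n"), new Simple("station", "n")), "nt"));
        sentence.add(new Simple("airs", "v"));

        // "nt" is in the set, so the compound is split into its parts.
        System.out.println(simplify(sentence, Collections.singleton("nt")));
        // Empty set, so the compound is collapsed into one simple word.
        System.out.println(simplify(sentence, Collections.<String>emptySet()));
    }
}
```

With these stand-ins, the first call prints [central/n, station/n, airs/v] and the second prints [centralstation/nt, airs/v].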

Example 7 with IWord

use of com.hankcs.hanlp.corpus.document.sentence.word.IWord in project HanLP by hankcs.

the class Sentence method create.

public static Sentence create(String param) {
    Pattern pattern = Pattern.compile("(\\[(([^\\s]+/[0-9a-zA-Z]+)\\s+)+?([^\\s]+/[0-9a-zA-Z]+)]/[0-9a-zA-Z]+)|([^\\s]+/[0-9a-zA-Z]+)");
    Matcher matcher = pattern.matcher(param);
    List<IWord> wordList = new LinkedList<IWord>();
    while (matcher.find()) {
        String single = matcher.group();
        IWord word = WordFactory.create(single);
        if (word == null) {
            logger.warning("Failed to construct a word from " + single);
            return null;
        }
        wordList.add(word);
    }
    return new Sentence(wordList);
}
Also used : Pattern(java.util.regex.Pattern) Matcher(java.util.regex.Matcher) LinkedList(java.util.LinkedList) IWord(com.hankcs.hanlp.corpus.document.sentence.word.IWord)
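
To see which substrings the pattern in Sentence.create hands to WordFactory, the regular expression can be run on its own. This is a minimal sketch using only java.util.regex; the class name AnnotatedTokenDemo, the method tokenize and the sample line are made up for illustration, and only the pattern string is taken from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnnotatedTokenDemo {
    // Same pattern as in Sentence.create: the first alternative matches a bracketed
    // compound such as [a/n b/n]/nt, the second a plain word/label token.
    private static final Pattern TOKEN = Pattern.compile(
            "(\\[(([^\\s]+/[0-9a-zA-Z]+)\\s+)+?([^\\s]+/[0-9a-zA-Z]+)]/[0-9a-zA-Z]+)|([^\\s]+/[0-9a-zA-Z]+)");

    // Returns every substring the pattern finds, in order of appearance.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(line);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical annotated line: one compound word followed by two simple words.
        for (String token : tokenize("[central/n station/n]/nt airs/v news/n")) {
            System.out.println(token);
        }
    }
}
```

For the sample line this prints three tokens: [central/n station/n]/nt, airs/v and news/n. Because the bracketed alternative comes first, a compound word is kept as one token instead of being cut at the inner spaces.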

Example 8 with IWord

use of com.hankcs.hanlp.corpus.document.sentence.word.IWord in project HanLP by hankcs.

the class Sentence method toString.

@Override
public String toString() {
    StringBuilder sb = new StringBuilder();
    int i = 1;
    for (IWord word : wordList) {
        sb.append(word);
        if (i != wordList.size())
            sb.append(' ');
        ++i;
    }
    return sb.toString();
}
Also used : IWord(com.hankcs.hanlp.corpus.document.sentence.word.IWord)
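
The counter in toString exists only to avoid a trailing space. On Java 8 or later the same output can be produced with Collectors.joining. The sketch below is an alternative formulation, not HanLP's code; JoinWordsDemo and joinWithSpaces are hypothetical names, and plain strings stand in for IWord instances.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class JoinWordsDemo {
    // Converts each element with String.valueOf (as StringBuilder.append(Object) does)
    // and puts a single space between neighbours, with no trailing space.
    static String joinWithSpaces(List<?> words) {
        return words.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(joinWithSpaces(Arrays.asList("central/n", "station/n", "airs/v")));
    }
}
```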

Example 9 with IWord

use of com.hankcs.hanlp.corpus.document.sentence.word.IWord in project HanLP by hankcs.

the class TestCorpusLoader method testMakeOrganizationCustomDictionary.

public void testMakeOrganizationCustomDictionary() throws Exception {
    final DictionaryMaker dictionaryMaker = new DictionaryMaker();
    CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014", new CorpusLoader.Handler() {

        @Override
        public void handle(Document document) {
            List<List<IWord>> complexSentenceList = document.getComplexSentenceList();
            for (List<IWord> wordList : complexSentenceList) {
                for (IWord word : wordList) {
                    if (word.getLabel().startsWith("nt")) {
                        dictionaryMaker.add(word);
                    }
                }
            }
        }
    });
    dictionaryMaker.saveTxtTo("data/dictionary/custom/机构名词典.txt");
}
Also used : CorpusLoader(com.hankcs.hanlp.corpus.document.CorpusLoader) DictionaryMaker(com.hankcs.hanlp.corpus.dictionary.DictionaryMaker) List(java.util.List) Document(com.hankcs.hanlp.corpus.document.Document) IWord(com.hankcs.hanlp.corpus.document.sentence.word.IWord)

Example 10 with IWord

use of com.hankcs.hanlp.corpus.document.sentence.word.IWord in project HanLP by hankcs.

the class TestCorpusLoader method testMakePersonCustomDictionary.

public void testMakePersonCustomDictionary() throws Exception {
    final DictionaryMaker dictionaryMaker = new DictionaryMaker();
    CorpusLoader.walk("D:\\JavaProjects\\CorpusToolBox\\data\\2014", new CorpusLoader.Handler() {

        @Override
        public void handle(Document document) {
            List<List<IWord>> complexSentenceList = document.getComplexSentenceList();
            for (List<IWord> wordList : complexSentenceList) {
                for (IWord word : wordList) {
                    if (word.getLabel().startsWith("nr")) {
                        dictionaryMaker.add(word);
                    }
                }
            }
        }
    });
    dictionaryMaker.saveTxtTo("data/dictionary/custom/人名词典.txt");
}
Also used : CorpusLoader(com.hankcs.hanlp.corpus.document.CorpusLoader) DictionaryMaker(com.hankcs.hanlp.corpus.dictionary.DictionaryMaker) List(java.util.List) Document(com.hankcs.hanlp.corpus.document.Document) IWord(com.hankcs.hanlp.corpus.document.sentence.word.IWord)

Aggregations

IWord (com.hankcs.hanlp.corpus.document.sentence.word.IWord) 17
LinkedList (java.util.LinkedList) 11
Word (com.hankcs.hanlp.corpus.document.sentence.word.Word) 8
List (java.util.List) 8
CompoundWord (com.hankcs.hanlp.corpus.document.sentence.word.CompoundWord) 7
CorpusLoader (com.hankcs.hanlp.corpus.document.CorpusLoader) 4
Document (com.hankcs.hanlp.corpus.document.Document) 4
Sentence (com.hankcs.hanlp.corpus.document.sentence.Sentence) 4
DictionaryMaker (com.hankcs.hanlp.corpus.dictionary.DictionaryMaker) 3
TFDictionary (com.hankcs.hanlp.corpus.dictionary.TFDictionary) 1
Matcher (java.util.regex.Matcher) 1
Pattern (java.util.regex.Pattern) 1