Example 1 with MutableCharArray

Use of org.carrot2.text.util.MutableCharArray in the lucene-solr project by Apache.

From the class DuplicatingTokenizerFactory, method getTokenizer.

@Override
public ITokenizer getTokenizer(LanguageCode language) {
    return new ITokenizer() {

        private final ExtendedWhitespaceTokenizer delegate = new ExtendedWhitespaceTokenizer();

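        @Override
        public void setTermBuffer(MutableCharArray buffer) {
            // Let the delegate write the current token image into the buffer,
            // then overwrite the buffer with that token concatenated with itself.
            delegate.setTermBuffer(buffer);
            buffer.reset(buffer.toString() + buffer.toString());
        }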

        @Override
        public void reset(Reader input) {
            delegate.reset(input);
        }

        @Override
        public short nextToken() throws IOException {
            return delegate.nextToken();
        }
    };
}
Also used: ExtendedWhitespaceTokenizer (org.carrot2.text.analysis.ExtendedWhitespaceTokenizer), ITokenizer (org.carrot2.text.analysis.ITokenizer), MutableCharArray (org.carrot2.text.util.MutableCharArray), Reader (java.io.Reader)
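
The wrapping pattern in Example 1 can be tried without the Carrot2 dependency. The following is a minimal sketch, not the Carrot2 API: TermBuffer and SimpleTokenizer are hypothetical stand-ins for MutableCharArray and ITokenizer, and the whitespace splitter is a simplified substitute for ExtendedWhitespaceTokenizer. It only illustrates how a decorator can let its delegate fill a buffer and then double the content.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicatingSketch {

    /** Hypothetical stand-in for a mutable term buffer; not the Carrot2 class. */
    static final class TermBuffer {
        private String value = "";

        void reset(String newValue) {
            value = newValue;
        }

        @Override
        public String toString() {
            return value;
        }
    }

    /** Hypothetical, reduced tokenizer contract used only for this sketch. */
    interface SimpleTokenizer {
        boolean nextToken();

        void setTermBuffer(TermBuffer buffer);
    }

    /** Simplified delegate: splits the input on runs of whitespace. */
    static SimpleTokenizer whitespace(String text) {
        String trimmed = text.trim();
        final String[] parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        return new SimpleTokenizer() {
            private int index = -1;

            @Override
            public boolean nextToken() {
                return ++index < parts.length;
            }

            @Override
            public void setTermBuffer(TermBuffer buffer) {
                buffer.reset(parts[index]);
            }
        };
    }

    /** Decorator: forwards to the delegate, then doubles the buffered token. */
    static SimpleTokenizer duplicating(SimpleTokenizer delegate) {
        return new SimpleTokenizer() {
            @Override
            public boolean nextToken() {
                return delegate.nextToken();
            }

            @Override
            public void setTermBuffer(TermBuffer buffer) {
                delegate.setTermBuffer(buffer);
                buffer.reset(buffer.toString() + buffer.toString());
            }
        };
    }

    /** Collects every token produced by the duplicating wrapper. */
    static List<String> tokens(String text) {
        SimpleTokenizer tokenizer = duplicating(whitespace(text));
        TermBuffer buffer = new TermBuffer();
        List<String> out = new ArrayList<>();
        while (tokenizer.nextToken()) {
            tokenizer.setTermBuffer(buffer);
            out.add(buffer.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("solr carrot")); // prints [solrsolr, carrotcarrot]
    }
}
```

The order of the two calls in the wrapper matters: the delegate has to populate the buffer before it is doubled, otherwise the wrapper would duplicate whatever the buffer held from the previous token.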

Example 2 with MutableCharArray

Use of org.carrot2.text.util.MutableCharArray in the lucene-solr project by Apache.

From the class LexicalResourcesCheckClusteringAlgorithm, method process.

@Override
public void process() throws ProcessingException {
    clusters = new ArrayList<>();
    if (wordsToCheck == null) {
        return;
    }
    // Test with Maltese so that the English clustering performed in other tests
    // is not affected by the test stopwords and stoplabels.
    ILexicalData lexicalData = preprocessing.lexicalDataFactory.getLexicalData(LanguageCode.MALTESE);
    for (String word : wordsToCheck.split(",")) {
        if (!lexicalData.isCommonWord(new MutableCharArray(word)) && !lexicalData.isStopLabel(word)) {
            clusters.add(new Cluster(word));
        }
    }
}
Also used: ILexicalData (org.carrot2.text.linguistic.ILexicalData), MutableCharArray (org.carrot2.text.util.MutableCharArray), Cluster (org.carrot2.core.Cluster)

Aggregations

MutableCharArray (org.carrot2.text.util.MutableCharArray): 2
Reader (java.io.Reader): 1
Cluster (org.carrot2.core.Cluster): 1
ExtendedWhitespaceTokenizer (org.carrot2.text.analysis.ExtendedWhitespaceTokenizer): 1
ITokenizer (org.carrot2.text.analysis.ITokenizer): 1
ILexicalData (org.carrot2.text.linguistic.ILexicalData): 1