
Example 81 with OffsetAttribute

Use of org.apache.lucene.analysis.tokenattributes.OffsetAttribute in project corelib by europeana.

Class QueryExtractor, method extractTokens.

private List<QueryTermPosition> extractTokens(String text) {
    List<QueryTermPosition> queryTerms = new ArrayList<>();
    // try-with-resources closes the stream even if tokenization throws
    try (TokenStream ts = analyzer.tokenStream("text", new StringReader(text))) {
        OffsetAttribute offsetAttribute = ts.addAttribute(OffsetAttribute.class);
        CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        int i = 0;
        while (ts.incrementToken()) {
            int start = offsetAttribute.startOffset();
            int end = offsetAttribute.endOffset();
            String term = charTermAttribute.toString();
            // If the term contains a colon, move the start offset past the first ':'
            // so the recorded span excludes everything up to and including it.
            if (term.contains(":")) {
                start = start + term.indexOf(":") + 1;
            }
            queryTerms.add(new QueryTermPosition(start, end, term, text.substring(start, end), i++));
        }
        ts.end();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return queryTerms;
}
Also used: TokenStream (org.apache.lucene.analysis.TokenStream), CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute), ArrayList (java.util.ArrayList), StringReader (java.io.StringReader), OffsetAttribute (org.apache.lucene.analysis.tokenattributes.OffsetAttribute), IOException (java.io.IOException)
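
The colon handling in extractTokens can be tried without Lucene. The following is a minimal sketch, not code from corelib: the class FieldPrefixOffsets, the Span type, and the whitespaceSpans splitter are hypothetical stand-ins for an analyzer that keeps a term such as "title:paris" as a single token. Only the start-offset adjustment mirrors the method above.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldPrefixOffsets {

    /** Hypothetical stand-in for a token: its text plus start/end offsets into the input. */
    static final class Span {
        final String term;
        final int start;
        final int end;

        Span(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on spaces and records offsets, mimicking what OffsetAttribute would report. */
    static List<Span> whitespaceSpans(String text) {
        List<Span> spans = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Skip any run of spaces before the next token.
            while (pos < text.length() && text.charAt(pos) == ' ') {
                pos++;
            }
            int start = pos;
            // Consume characters up to the next space or the end of the input.
            while (pos < text.length() && text.charAt(pos) != ' ') {
                pos++;
            }
            if (pos > start) {
                spans.add(new Span(text.substring(start, pos), start, pos));
            }
        }
        return spans;
    }

    /** Moves the start offset past the first ':' in the term, as extractTokens does. */
    static int adjustedStart(Span span) {
        int colon = span.term.indexOf(':');
        return colon >= 0 ? span.start + colon + 1 : span.start;
    }

    public static void main(String[] args) {
        String text = "title:paris museum";
        for (Span span : whitespaceSpans(text)) {
            int start = adjustedStart(span);
            System.out.println(span.term + " -> [" + start + ", " + span.end + ") "
                    + text.substring(start, span.end));
        }
    }
}
```

For the input "title:paris museum" the first term is reported with the span [6, 11), so the extracted substring is "paris" rather than "title:paris". Because the adjustment uses the position of the colon inside the term text, it presumably lines up with the input only when the analyzer leaves the characters before the colon unchanged.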

Example 82 with OffsetAttribute

Use of org.apache.lucene.analysis.tokenattributes.OffsetAttribute in project omegat by omegat-org.

Class BaseTokenizer, method tokenize.

protected Token[] tokenize(final String strOrig, final boolean stemsAllowed, final boolean stopWordsAllowed, final boolean filterDigits, final boolean filterWhitespace) {
    if (StringUtil.isEmpty(strOrig)) {
        return EMPTY_TOKENS_LIST;
    }
    List<Token> result = new ArrayList<Token>(64);
    try (TokenStream in = getTokenStream(strOrig, stemsAllowed, stopWordsAllowed)) {
        // addAttribute returns the attribute instance, so no separate getAttribute call is needed
        CharTermAttribute cattr = in.addAttribute(CharTermAttribute.class);
        OffsetAttribute off = in.addAttribute(OffsetAttribute.class);
        in.reset();
        while (in.incrementToken()) {
            String tokenText = cattr.toString();
            if (acceptToken(tokenText, filterDigits, filterWhitespace)) {
                result.add(new Token(tokenText, off.startOffset(), off.endOffset() - off.startOffset()));
            }
        }
        in.end();
    } catch (IOException ex) {
        Log.log(ex);
    }
    return result.toArray(new Token[result.size()]);
}
Also used: TokenStream (org.apache.lucene.analysis.TokenStream), CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute), ArrayList (java.util.ArrayList), OffsetAttribute (org.apache.lucene.analysis.tokenattributes.OffsetAttribute), Token (org.omegat.util.Token), IOException (java.io.IOException)

Aggregations

OffsetAttribute (org.apache.lucene.analysis.tokenattributes.OffsetAttribute) 82
CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute) 59
TokenStream (org.apache.lucene.analysis.TokenStream) 47
StringReader (java.io.StringReader) 36
PositionIncrementAttribute (org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute) 33
IOException (java.io.IOException) 25
ArrayList (java.util.ArrayList) 23
TypeAttribute (org.apache.lucene.analysis.tokenattributes.TypeAttribute) 17
BytesRef (org.apache.lucene.util.BytesRef) 14
PayloadAttribute (org.apache.lucene.analysis.tokenattributes.PayloadAttribute) 12
Tokenizer (org.apache.lucene.analysis.Tokenizer) 10
Reader (java.io.Reader) 9
FlagsAttribute (org.apache.lucene.analysis.tokenattributes.FlagsAttribute) 8
Analyzer (org.apache.lucene.analysis.Analyzer) 7
Token (org.apache.lucene.analysis.Token) 7
TermToBytesRefAttribute (org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute) 7
List (java.util.List) 6
PackedTokenAttributeImpl (org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl) 5
PositionLengthAttribute (org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute) 5
IndexReader (org.apache.lucene.index.IndexReader) 5