Examples with IBinaryTokenizer - org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer

Example 1 with IBinaryTokenizer

Use of org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer in project asterixdb by apache: class InMemoryInvertedIndexOpContext, method setTokenizingTupleIterator.

protected void setTokenizingTupleIterator() {
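    IBinaryTokenizer tokenizer = getTokenizerFactory().createTokenizer();
    // One comparator exists per token field; the BTree's remaining fields make up the second count.
    int tokenFieldCount = tokenCmpFactories.length;
    int remainingFieldCount = btree.getFieldCount() - tokenFieldCount;
    tupleIter = new InvertedIndexTokenizingTupleIterator(tokenFieldCount, remainingFieldCount, tokenizer);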
}

Also used: InvertedIndexTokenizingTupleIterator (org.apache.hyracks.storage.am.lsm.invertedindex.util.InvertedIndexTokenizingTupleIterator), IBinaryTokenizer (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer)
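
As a rough illustration of what a tokenizing tuple iterator might do, the sketch below emits one (token, key) entry per token of a document. It does not use the Hyracks classes: TokenizingIteratorSketch, Entry, docKey, and the whitespace splitting are all assumptions made for this example, not the behavior of InvertedIndexTokenizingTupleIterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in, not the Hyracks InvertedIndexTokenizingTupleIterator.
public class TokenizingIteratorSketch implements Iterator<TokenizingIteratorSketch.Entry> {

    // One index entry: a token paired with the key of the document it came from.
    public static final class Entry {
        public final String token;
        public final int docKey;

        Entry(String token, int docKey) {
            this.token = token;
            this.docKey = docKey;
        }
    }

    private final Iterator<String> tokens;
    private final int docKey;

    public TokenizingIteratorSketch(String text, int docKey) {
        List<String> parts = new ArrayList<>();
        // Assumption: tokens are separated by runs of whitespace.
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        this.tokens = parts.iterator();
        this.docKey = docKey;
    }

    @Override
    public boolean hasNext() {
        return tokens.hasNext();
    }

    @Override
    public Entry next() {
        // Every token is paired with the same document key.
        return new Entry(tokens.next(), docKey);
    }

    public static void main(String[] args) {
        Iterator<Entry> it = new TokenizingIteratorSketch("inverted index search", 42);
        while (it.hasNext()) {
            Entry e = it.next();
            System.out.println(e.token + " -> " + e.docKey);
        }
    }
}
```

Pairing every token with the document key is what lets an index map a token back to the documents that contain it.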

Example 2 with IBinaryTokenizer

Use of org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer in project asterixdb by apache: class AbstractTOccurrenceSearcher, method tokenizeQuery.

protected void tokenizeQuery(InvertedIndexSearchPredicate searchPred) throws HyracksDataException {
    ITupleReference queryTuple = searchPred.getQueryTuple();
    int queryFieldIndex = searchPred.getQueryFieldIndex();
    IBinaryTokenizer queryTokenizer = searchPred.getQueryTokenizer();
    // Is this a full-text query?
    // Then, the last argument is the conjunctive or disjunctive search option, not query text.
    // Thus, we need to remove the last argument.
    boolean isFullTextSearchQuery = searchPred.getIsFullTextSearchQuery();
    // Get the type of query tokenizer.
    TokenizerType queryTokenizerType = queryTokenizer.getTokenizerType();
    int tokenCountInOneField = 0;
    queryTokenAppender.reset(queryTokenFrame, true);
    queryTokenizer.reset(queryTuple.getFieldData(queryFieldIndex), queryTuple.getFieldStart(queryFieldIndex), queryTuple.getFieldLength(queryFieldIndex));
    while (queryTokenizer.hasNext()) {
        queryTokenizer.next();
        queryTokenBuilder.reset();
        tokenCountInOneField++;
        try {
            IToken token = queryTokenizer.getToken();
            // If it's a list, it can have multiple keywords in it. But, each keyword should not be a phrase.
            if (isFullTextSearchQuery) {
                if (queryTokenizerType == TokenizerType.STRING && tokenCountInOneField > 1) {
                    throw HyracksDataException.create(ErrorCode.FULLTEXT_PHRASE_FOUND);
                } else if (queryTokenizerType == TokenizerType.LIST) {
                    for (int j = 1; j < token.getTokenLength(); j++) {
                        if (DelimitedUTF8StringBinaryTokenizer.isSeparator((char) token.getData()[token.getStartOffset() + j])) {
                            throw HyracksDataException.create(ErrorCode.FULLTEXT_PHRASE_FOUND);
                        }
                    }
                }
            }
            token.serializeToken(queryTokenBuilder.getFieldData());
            queryTokenBuilder.addFieldEndOffset();
            // WARNING: assuming one frame is big enough to hold all tokens
            queryTokenAppender.append(queryTokenBuilder.getFieldEndOffsets(), queryTokenBuilder.getByteArray(), 0, queryTokenBuilder.getSize());
        } catch (IOException e) {
            throw new HyracksDataException(e);
        }
    }
}

Also used: IToken (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IToken), ITupleReference (org.apache.hyracks.dataflow.common.data.accessors.ITupleReference), TokenizerType (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.TokenizerInfo.TokenizerType), IBinaryTokenizer (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer), IOException (java.io.IOException), HyracksDataException (org.apache.hyracks.api.exceptions.HyracksDataException)
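
The reset / hasNext / next / getToken call order used in tokenizeQuery can be tried in isolation with a small stand-in. The sketch below is not the Hyracks IBinaryTokenizer: ByteTokenizerSketch, its String-returning getToken, and the rule that only an ASCII space separates tokens are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mimics the call order seen in tokenizeQuery;
// it is not the Hyracks IBinaryTokenizer.
public class ByteTokenizerSketch {
    private byte[] data;
    private int pos;          // next unread byte
    private int end;          // one past the last byte of the field
    private int tokenStart;
    private int tokenLength;

    // Point the tokenizer at one field inside a larger buffer.
    public void reset(byte[] data, int start, int length) {
        this.data = data;
        this.pos = start;
        this.end = start + length;
        this.tokenLength = 0;
    }

    // Skip separators; returns true if at least one token byte remains.
    public boolean hasNext() {
        while (pos < end && isSeparator(data[pos])) {
            pos++;
        }
        return pos < end;
    }

    // Advance over the next token; call only after hasNext() returned true.
    public void next() {
        tokenStart = pos;
        while (pos < end && !isSeparator(data[pos])) {
            pos++;
        }
        tokenLength = pos - tokenStart;
    }

    public String getToken() {
        return new String(data, tokenStart, tokenLength, StandardCharsets.UTF_8);
    }

    // Assumption: only the ASCII space byte counts as a separator here.
    private static boolean isSeparator(byte b) {
        return b == ' ';
    }

    // Drives the tokenizer with the same loop shape as tokenizeQuery.
    public static List<String> tokenize(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        ByteTokenizerSketch tokenizer = new ByteTokenizerSketch();
        tokenizer.reset(bytes, 0, bytes.length);
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasNext()) {
            tokenizer.next();
            tokens.add(tokenizer.getToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("full text  search"));
    }
}
```

Because the tokenizer is reset onto a (buffer, start, length) triple, one instance can be reused across fields without copying the bytes, which matches how the query field is passed in the example above.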

Example 3 with IBinaryTokenizer

Use of org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer in project asterixdb by apache: class HashedWordTokensDescriptor, method createEvaluatorFactory.

@Override
public IScalarEvaluatorFactory createEvaluatorFactory(final IScalarEvaluatorFactory[] args) {
    return new IScalarEvaluatorFactory() {

        private static final long serialVersionUID = 1L;

        @Override
        public IScalarEvaluator createScalarEvaluator(IHyracksTaskContext ctx) throws HyracksDataException {
            ITokenFactory tokenFactory = new HashedUTF8WordTokenFactory();
            IBinaryTokenizer tokenizer = new DelimitedUTF8StringBinaryTokenizer(true, true, tokenFactory);
            return new WordTokensEvaluator(args, ctx, tokenizer, BuiltinType.AINT32);
        }
    };
}

Also used: HashedUTF8WordTokenFactory (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.HashedUTF8WordTokenFactory), DelimitedUTF8StringBinaryTokenizer (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.DelimitedUTF8StringBinaryTokenizer), IHyracksTaskContext (org.apache.hyracks.api.context.IHyracksTaskContext), IBinaryTokenizer (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer), WordTokensEvaluator (org.apache.asterix.runtime.evaluators.common.WordTokensEvaluator), ITokenFactory (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.ITokenFactory), IScalarEvaluatorFactory (org.apache.hyracks.algebricks.runtime.base.IScalarEvaluatorFactory)

Example 4 with IBinaryTokenizer

Use of org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer in project asterixdb by apache: class WordTokensDescriptor, method createEvaluatorFactory.

@Override
public IScalarEvaluatorFactory createEvaluatorFactory(final IScalarEvaluatorFactory[] args) {
    return new IScalarEvaluatorFactory() {

        private static final long serialVersionUID = 1L;

        @Override
        public IScalarEvaluator createScalarEvaluator(IHyracksTaskContext ctx) throws HyracksDataException {
            ITokenFactory tokenFactory = new UTF8WordTokenFactory();
            IBinaryTokenizer tokenizer = new DelimitedUTF8StringBinaryTokenizer(true, true, tokenFactory);
            return new WordTokensEvaluator(args, ctx, tokenizer, BuiltinType.ASTRING);
        }
    };
}

Also used: DelimitedUTF8StringBinaryTokenizer (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.DelimitedUTF8StringBinaryTokenizer), IHyracksTaskContext (org.apache.hyracks.api.context.IHyracksTaskContext), IBinaryTokenizer (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer), WordTokensEvaluator (org.apache.asterix.runtime.evaluators.common.WordTokensEvaluator), ITokenFactory (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.ITokenFactory), IScalarEvaluatorFactory (org.apache.hyracks.algebricks.runtime.base.IScalarEvaluatorFactory), UTF8WordTokenFactory (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.UTF8WordTokenFactory)

Example 5 with IBinaryTokenizer

Use of org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer in project asterixdb by apache: class CountHashedWordTokensDescriptor, method createEvaluatorFactory.

@Override
public IScalarEvaluatorFactory createEvaluatorFactory(final IScalarEvaluatorFactory[] args) {
    return new IScalarEvaluatorFactory() {

        private static final long serialVersionUID = 1L;

        @Override
        public IScalarEvaluator createScalarEvaluator(IHyracksTaskContext ctx) throws HyracksDataException {
            ITokenFactory tokenFactory = new HashedUTF8WordTokenFactory();
            IBinaryTokenizer tokenizer = new DelimitedUTF8StringBinaryTokenizer(false, true, tokenFactory);
            return new WordTokensEvaluator(args, ctx, tokenizer, BuiltinType.AINT32);
        }
    };
}
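
Examples 3 to 5 differ only in the token factory, the output type (AINT32 vs. ASTRING), and the first boolean passed to DelimitedUTF8StringBinaryTokenizer (true in Examples 3 and 4, false in Example 5). Assuming that flag controls whether a per-token occurrence count is ignored, the three variants could be sketched as below. WordTokenVariantsSketch, its method names, and the hash function are stand-ins chosen for illustration; the hash is not the one Hyracks uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the three descriptor variants; not Hyracks code.
public class WordTokenVariantsSketch {

    // Plain word tokens: each word is kept as a string.
    public static List<String> wordTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(word);
            }
        }
        return out;
    }

    // Hashed word tokens: each word becomes an int. When countOccurrences is set,
    // the n-th occurrence of a word is hashed together with n, so repeated words
    // yield distinct values instead of identical ones.
    public static List<Integer> hashedWordTokens(String text, boolean countOccurrences) {
        Map<String, Integer> seen = new HashMap<>();
        List<Integer> out = new ArrayList<>();
        for (String word : wordTokens(text)) {
            int occurrence = seen.merge(word, 1, Integer::sum);
            // Stand-in hash: String.hashCode, optionally mixed with the occurrence number.
            int hash = countOccurrences ? 31 * word.hashCode() + occurrence : word.hashCode();
            out.add(hash);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "to be or not to be";
        System.out.println(wordTokens(text));
        System.out.println(hashedWordTokens(text, false));
        System.out.println(hashedWordTokens(text, true));
    }
}
```

Under this assumption, counting occurrences keeps duplicate words distinguishable after hashing, which matters when tokens are later treated as a set.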

Aggregations

IBinaryTokenizer (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IBinaryTokenizer) 7
WordTokensEvaluator (org.apache.asterix.runtime.evaluators.common.WordTokensEvaluator) 3
IScalarEvaluatorFactory (org.apache.hyracks.algebricks.runtime.base.IScalarEvaluatorFactory) 3
IHyracksTaskContext (org.apache.hyracks.api.context.IHyracksTaskContext) 3
DelimitedUTF8StringBinaryTokenizer (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.DelimitedUTF8StringBinaryTokenizer) 3
ITokenFactory (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.ITokenFactory) 3
HyracksDataException (org.apache.hyracks.api.exceptions.HyracksDataException) 2
ITupleReference (org.apache.hyracks.dataflow.common.data.accessors.ITupleReference) 2
HashedUTF8WordTokenFactory (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.HashedUTF8WordTokenFactory) 2
IOException (java.io.IOException) 1
ArrayList (java.util.ArrayList) 1
PermutingTupleReference (org.apache.hyracks.storage.am.common.tuples.PermutingTupleReference) 1
IInvertedIndex (org.apache.hyracks.storage.am.lsm.invertedindex.api.IInvertedIndex) 1
IInvertedIndexAccessor (org.apache.hyracks.storage.am.lsm.invertedindex.api.IInvertedIndexAccessor) 1
InvertedIndexSearchPredicate (org.apache.hyracks.storage.am.lsm.invertedindex.search.InvertedIndexSearchPredicate) 1
IToken (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.IToken) 1
TokenizerType (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.TokenizerInfo.TokenizerType) 1
UTF8WordTokenFactory (org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.UTF8WordTokenFactory) 1
InvertedIndexTokenizingTupleIterator (org.apache.hyracks.storage.am.lsm.invertedindex.util.InvertedIndexTokenizingTupleIterator) 1
PartitionedInvertedIndexTokenizingTupleIterator (org.apache.hyracks.storage.am.lsm.invertedindex.util.PartitionedInvertedIndexTokenizingTupleIterator) 1