Example 1 with DataflowException

Use of edu.uci.ics.texera.api.exception.DataflowException in project textdb by TextDB.

The close method of the ScanBasedSourceOperator class:

@Override
public void close() throws TexeraException {
    if (!isOpen) {
        return;
    }
    try {
        dataReader.close();
        isOpen = false;
    } catch (Exception e) {
        throw new DataflowException(e.getMessage(), e);
    }
}
Also used : DataflowException(edu.uci.ics.texera.api.exception.DataflowException) TexeraException(edu.uci.ics.texera.api.exception.TexeraException) StorageException(edu.uci.ics.texera.api.exception.StorageException)
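
The idiom above — check the isOpen flag, close the underlying reader, and wrap any checked exception in a DataflowException — is a common close() pattern in textdb operators. The sketch below is a minimal, hypothetical caller (not part of textdb) that drains such an operator and guarantees close() runs even if iteration fails; it assumes the IOperator interface exposes open() in addition to the getNextTuple() and close() methods shown on this page.

// Hypothetical helper: drain an operator and always release its resources.
import edu.uci.ics.texera.api.dataflow.IOperator;
import edu.uci.ics.texera.api.exception.TexeraException;
import edu.uci.ics.texera.api.tuple.Tuple;

public class OperatorDrainSketch {
    public static int consumeAll(IOperator operator) throws TexeraException {
        int count = 0;
        operator.open();
        try {
            Tuple tuple;
            while ((tuple = operator.getNextTuple()) != null) {
                count++;
            }
        } finally {
            // Safe to call unconditionally: close() returns early when the
            // operator is already closed, as in ScanBasedSourceOperator above.
            operator.close();
        }
        return count;
    }
}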

Example 2 with DataflowException

Use of edu.uci.ics.texera.api.exception.DataflowException in project textdb by TextDB.

The getNextTuple method of the TwitterConverter class:

@Override
public Tuple getNextTuple() throws TexeraException {
    if (cursor == CLOSED) {
        throw new DataflowException(ErrorMessages.OPERATOR_NOT_OPENED);
    }
    Tuple tuple;
    while ((tuple = inputOperator.getNextTuple()) != null) {
        List<IField> tweetFields = generateFieldsFromJson(tuple.getField(rawDataAttribute).getValue().toString());
        if (!tweetFields.isEmpty()) {
            cursor++;
            List<IField> tupleFields = new ArrayList<>();
            final Tuple finalTuple = tuple;
            tupleFields.addAll(tuple.getSchema().getAttributeNames().stream()
                    .filter(attrName -> !attrName.equalsIgnoreCase(rawDataAttribute))
                    .map(attrName -> finalTuple.getField(attrName, IField.class))
                    .collect(Collectors.toList()));
            tupleFields.addAll(tweetFields);
            return new Tuple(outputSchema, tupleFields);
        }
    }
    return null;
}
Also used : DateTimeField(edu.uci.ics.texera.api.field.DateTimeField) Arrays(java.util.Arrays) ZonedDateTime(java.time.ZonedDateTime) Tuple(edu.uci.ics.texera.api.tuple.Tuple) ObjectMapper(com.fasterxml.jackson.databind.ObjectMapper) TexeraException(edu.uci.ics.texera.api.exception.TexeraException) Collectors(java.util.stream.Collectors) ZoneId(java.time.ZoneId) ArrayList(java.util.ArrayList) List(java.util.List) IOperator(edu.uci.ics.texera.api.dataflow.IOperator) IField(edu.uci.ics.texera.api.field.IField) TextField(edu.uci.ics.texera.api.field.TextField) StringField(edu.uci.ics.texera.api.field.StringField) DateTimeFormatter(java.time.format.DateTimeFormatter) ErrorMessages(edu.uci.ics.texera.api.constants.ErrorMessages) DataflowException(edu.uci.ics.texera.api.exception.DataflowException) Schema(edu.uci.ics.texera.api.schema.Schema) JsonNode(com.fasterxml.jackson.databind.JsonNode) Attribute(edu.uci.ics.texera.api.schema.Attribute) IntegerField(edu.uci.ics.texera.api.field.IntegerField) AsterixSource(edu.uci.ics.texera.dataflow.source.asterix.AsterixSource)
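
TwitterConverter rebuilds each tuple by copying every field except the raw JSON attribute and appending the fields parsed from that JSON. The hypothetical helper below isolates that projection step as a sketch; it assumes outputSchema lists the kept attributes followed by the replacement attributes in the same order as the assembled field list, and it relies only on the Tuple and Schema calls already visible in the example above.

// Hypothetical projection helper: copy all fields except one attribute,
// then append replacement fields (mirrors TwitterConverter.getNextTuple).
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import edu.uci.ics.texera.api.field.IField;
import edu.uci.ics.texera.api.schema.Schema;
import edu.uci.ics.texera.api.tuple.Tuple;

public class TupleProjectionSketch {
    public static Tuple replaceAttribute(Tuple input, String dropAttribute,
            List<IField> replacementFields, Schema outputSchema) {
        List<IField> fields = new ArrayList<>();
        // Keep every field whose attribute name does not match (case-insensitive).
        fields.addAll(input.getSchema().getAttributeNames().stream()
                .filter(name -> !name.equalsIgnoreCase(dropAttribute))
                .map(name -> input.getField(name, IField.class))
                .collect(Collectors.toList()));
        fields.addAll(replacementFields);
        return new Tuple(outputSchema, fields);
    }
}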

Example 3 with DataflowException

Use of edu.uci.ics.texera.api.exception.DataflowException in project textdb by TextDB.

The tokenizeQueryWithStopwords method of the DataflowUtils class:

public static ArrayList<String> tokenizeQueryWithStopwords(String luceneAnalyzerStr, String query) {
    Analyzer luceneAnalyzer;
    if (luceneAnalyzerStr.equals(LuceneAnalyzerConstants.standardAnalyzerString())) {
        // use an empty stop word list for standard analyzer
        CharArraySet emptyStopwords = new CharArraySet(1, true);
        luceneAnalyzer = new StandardAnalyzer(emptyStopwords);
    } else if (luceneAnalyzerStr.equals(LuceneAnalyzerConstants.chineseAnalyzerString())) {
        // use the default smart chinese analyzer
        // because the smart chinese analyzer's default stopword list is simply a list of punctuations
        // https://lucene.apache.org/core/5_5_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html
        luceneAnalyzer = LuceneAnalyzerConstants.getLuceneAnalyzer(luceneAnalyzerStr);
    } else {
        throw new TexeraException("tokenizeQueryWithStopwords: analyzer " + luceneAnalyzerStr + " not recognized");
    }
    ArrayList<String> result = new ArrayList<String>();
    TokenStream tokenStream = luceneAnalyzer.tokenStream(null, new StringReader(query));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    try {
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            String token = term.toString();
            // The analyzer lower-cases each token, so locate it in a lower-cased
            // copy of the query, then take the original-case text from the query string.
            int tokenIndex = query.toLowerCase().indexOf(token);
            String actualQueryToken = query.substring(tokenIndex, tokenIndex + token.length());
            result.add(actualQueryToken);
        }
        tokenStream.close();
    } catch (IOException e) {
        throw new DataflowException(e);
    } finally {
        luceneAnalyzer.close();
    }
    return result;
}
Also used : CharArraySet(org.apache.lucene.analysis.util.CharArraySet) TokenStream(org.apache.lucene.analysis.TokenStream) IOException(java.io.IOException) Analyzer(org.apache.lucene.analysis.Analyzer) StandardAnalyzer(org.apache.lucene.analysis.standard.StandardAnalyzer) TexeraException(edu.uci.ics.texera.api.exception.TexeraException) CharTermAttribute(org.apache.lucene.analysis.tokenattributes.CharTermAttribute) StringReader(java.io.StringReader) DataflowException(edu.uci.ics.texera.api.exception.DataflowException)
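
tokenizeQueryWithStopwords drives a Lucene TokenStream through the usual reset / incrementToken / close protocol and keeps the original-case text of each token. The standalone sketch below shows the same protocol without any Texera dependency, using a StandardAnalyzer with an empty stopword set exactly as the standard-analyzer branch above does; the class and method names here are illustrative and not part of textdb.

// Standalone Lucene tokenization sketch (reset / incrementToken / close).
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class TokenizeSketch {
    public static List<String> tokenize(String text) throws IOException {
        // Empty stopword list so no tokens are dropped, matching the example above.
        try (Analyzer analyzer = new StandardAnalyzer(new CharArraySet(1, true))) {
            List<String> tokens = new ArrayList<>();
            TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(text));
            CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                tokens.add(term.toString());
            }
            tokenStream.close();
            return tokens;
        }
    }
}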

Example 4 with DataflowException

Use of edu.uci.ics.texera.api.exception.DataflowException in project textdb by TextDB.

The generatePayload method of the DataflowUtils class:

public static List<Span> generatePayload(String attributeName, String fieldValue, Analyzer luceneAnalyzer) {
    List<Span> payload = new ArrayList<>();
    try {
        TokenStream tokenStream = luceneAnalyzer.tokenStream(null, new StringReader(fieldValue));
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute positionIncrementAttribute = tokenStream.addAttribute(PositionIncrementAttribute.class);
        int tokenPositionCounter = -1;
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            tokenPositionCounter += positionIncrementAttribute.getPositionIncrement();
            int tokenPosition = tokenPositionCounter;
            int charStart = offsetAttribute.startOffset();
            int charEnd = offsetAttribute.endOffset();
            String analyzedTermStr = charTermAttribute.toString();
            String originalTermStr = fieldValue.substring(charStart, charEnd);
            payload.add(new Span(attributeName, charStart, charEnd, analyzedTermStr, originalTermStr, tokenPosition));
        }
        tokenStream.close();
    } catch (IOException e) {
        throw new DataflowException(e);
    }
    return payload;
}
Also used : TokenStream(org.apache.lucene.analysis.TokenStream) CharTermAttribute(org.apache.lucene.analysis.tokenattributes.CharTermAttribute) StringReader(java.io.StringReader) OffsetAttribute(org.apache.lucene.analysis.tokenattributes.OffsetAttribute) DataflowException(edu.uci.ics.texera.api.exception.DataflowException) IOException(java.io.IOException) Span(edu.uci.ics.texera.api.span.Span) PositionIncrementAttribute(org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute)
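
generatePayload records, for every token, its character offsets, the analyzed term, the original text, and its token position as a Span. The snippet below is a hedged usage sketch: the attribute name "content" and the input string are arbitrary, and a plain StandardAnalyzer stands in for whichever analyzer the data was indexed with; only the generatePayload signature shown above is relied on.

// Hypothetical usage of DataflowUtils.generatePayload.
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

import edu.uci.ics.texera.api.span.Span;
import edu.uci.ics.texera.dataflow.utils.DataflowUtils;

public class PayloadSketch {
    public static void main(String[] args) {
        Analyzer analyzer = new StandardAnalyzer();
        // Each Span carries the attribute name, start/end offsets, analyzed term,
        // original term, and token position produced in the loop above.
        List<Span> payload = DataflowUtils.generatePayload("content", "Texera text analytics", analyzer);
        payload.forEach(System.out::println);
        analyzer.close();
    }
}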

Example 5 with DataflowException

Use of edu.uci.ics.texera.api.exception.DataflowException in project textdb by TextDB.

The computeWordCount method of the WordCountIndexSource class:

private void computeWordCount() throws TexeraException {
    try {
        HashMap<String, Integer> wordCountMap = new HashMap<>();
        DataReader dataReader = RelationManager.getInstance().getTableDataReader(predicate.getTableName(), new MatchAllDocsQuery());
        dataReader.open();
        IndexReader luceneIndexReader = dataReader.getLuceneIndexReader();
        for (int i = 0; i < luceneIndexReader.numDocs(); i++) {
            Terms termVector = luceneIndexReader.getTermVector(i, predicate.getAttribute());
            TermsEnum termsEnum = termVector.iterator();
            while (termsEnum.next() != null) {
                String key = termsEnum.term().utf8ToString();
                wordCountMap.put(key, wordCountMap.get(key) == null
                        ? (int) termsEnum.totalTermFreq()
                        : wordCountMap.get(key) + (int) termsEnum.totalTermFreq());
            }
        }
        luceneIndexReader.close();
        dataReader.close();
        sortedWordCountMap = wordCountMap.entrySet().stream()
                .sorted((e1, e2) -> e2.getValue().compareTo(e1.getValue()))
                .collect(Collectors.toList());
        wordCountIterator = sortedWordCountMap.iterator();
    } catch (IOException e) {
        throw new DataflowException(e);
    }
}
Also used : DataReader(edu.uci.ics.texera.storage.DataReader) HashMap(java.util.HashMap) IndexReader(org.apache.lucene.index.IndexReader) Terms(org.apache.lucene.index.Terms) DataflowException(edu.uci.ics.texera.api.exception.DataflowException) IOException(java.io.IOException) MatchAllDocsQuery(org.apache.lucene.search.MatchAllDocsQuery) TermsEnum(org.apache.lucene.index.TermsEnum)
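
computeWordCount folds Lucene term frequencies into a HashMap with an explicit null check, then sorts the entries by descending count. The plain-Java sketch below reproduces that counting-and-sorting idiom without the Lucene index plumbing, using Map.merge in place of the null-check ternary; it is illustrative only and not part of textdb.

// Counting and descending-sort idiom from computeWordCount, in plain Java.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    public static List<Map.Entry<String, Integer>> countAndSort(List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) {
            // merge() initializes missing keys to 1 and otherwise adds 1,
            // replacing the wordCountMap.get(key) == null check above.
            counts.merge(term, 1, Integer::sum);
        }
        return counts.entrySet().stream()
                .sorted((e1, e2) -> e2.getValue().compareTo(e1.getValue()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(countAndSort(Arrays.asList("texera", "text", "texera")));
    }
}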

Aggregations

DataflowException (edu.uci.ics.texera.api.exception.DataflowException): 56
TexeraException (edu.uci.ics.texera.api.exception.TexeraException): 23
AttributeType (edu.uci.ics.texera.api.schema.AttributeType): 20
Schema (edu.uci.ics.texera.api.schema.Schema): 20
Tuple (edu.uci.ics.texera.api.tuple.Tuple): 18
IOException (java.io.IOException): 14
Span (edu.uci.ics.texera.api.span.Span): 11
Collectors (java.util.stream.Collectors): 10
SchemaConstants (edu.uci.ics.texera.api.constants.SchemaConstants): 9
ArrayList (java.util.ArrayList): 9
Attribute (edu.uci.ics.texera.api.schema.Attribute): 8
IOperator (edu.uci.ics.texera.api.dataflow.IOperator): 7
IField (edu.uci.ics.texera.api.field.IField): 7
ListField (edu.uci.ics.texera.api.field.ListField): 7
List (java.util.List): 7
AbstractSingleInputOperator (edu.uci.ics.texera.dataflow.common.AbstractSingleInputOperator): 6
ErrorMessages (edu.uci.ics.texera.api.constants.ErrorMessages): 5
StorageException (edu.uci.ics.texera.api.exception.StorageException): 5
IntegerField (edu.uci.ics.texera.api.field.IntegerField): 4
DataflowUtils (edu.uci.ics.texera.dataflow.utils.DataflowUtils): 4