Example 21 with DataFlowException

use of edu.uci.ics.textdb.api.exception.DataFlowException in project textdb by TextDB.

the class SimilarityJoinPredicate method joinTuples.

@Override
public Tuple joinTuples(Tuple innerTuple, Tuple outerTuple, Schema outputSchema) throws DataFlowException {
    if (similarityThreshold == 0) {
        return null;
    }
    // keep only the spans whose attribute name is the join attribute
    ListField<Span> innerSpanListField = innerTuple.getField(SchemaConstants.SPAN_LIST);
    List<Span> innerRelevantSpanList = innerSpanListField.getValue().stream().filter(span -> span.getAttributeName().equals(innerJoinAttrName)).collect(Collectors.toList());
    ListField<Span> outerSpanListField = outerTuple.getField(SchemaConstants.SPAN_LIST);
    List<Span> outerRelevantSpanList = outerSpanListField.getValue().stream().filter(span -> span.getAttributeName().equals(outerJoinAttrName)).collect(Collectors.toList());
    // get a set of span's values (since multiple spans may have the same value)
    Set<String> innerSpanValueSet = innerRelevantSpanList.stream().map(span -> span.getValue()).collect(Collectors.toSet());
    Set<String> outerSpanValueSet = outerRelevantSpanList.stream().map(span -> span.getValue()).collect(Collectors.toSet());
    // compute the result value set using the similarity function
    Set<String> resultValueSet = new HashSet<>();
    for (String innerString : innerSpanValueSet) {
        for (String outerString : outerSpanValueSet) {
            if (this.similarityFunc.calculateSimilarity(innerString, outerString) >= this.similarityThreshold) {
                resultValueSet.add(innerString);
                resultValueSet.add(outerString);
            }
        }
    }
    // return null if none of them are similar
    if (resultValueSet.isEmpty()) {
        return null;
    }
    // generate the result spans
    List<Span> resultSpans = new ArrayList<>();
    for (Span span : innerRelevantSpanList) {
        if (resultValueSet.contains(span.getValue())) {
            resultSpans.add(addFieldPrefix(span, INNER_PREFIX));
        }
    }
    for (Span span : outerRelevantSpanList) {
        if (resultValueSet.contains(span.getValue())) {
            resultSpans.add(addFieldPrefix(span, OUTER_PREFIX));
        }
    }
    return mergeTuples(innerTuple, outerTuple, outputSchema, resultSpans);
}
Also used : SchemaConstants(edu.uci.ics.textdb.api.constants.SchemaConstants) JsonProperty(com.fasterxml.jackson.annotation.JsonProperty) java.util(java.util) Attribute(edu.uci.ics.textdb.api.schema.Attribute) IDField(edu.uci.ics.textdb.api.field.IDField) DataFlowException(edu.uci.ics.textdb.api.exception.DataFlowException) Collectors(java.util.stream.Collectors) AttributeType(edu.uci.ics.textdb.api.schema.AttributeType) PredicateBase(edu.uci.ics.textdb.exp.common.PredicateBase) Schema(edu.uci.ics.textdb.api.schema.Schema) ListField(edu.uci.ics.textdb.api.field.ListField) IField(edu.uci.ics.textdb.api.field.IField) JsonCreator(com.fasterxml.jackson.annotation.JsonCreator) JsonIgnore(com.fasterxml.jackson.annotation.JsonIgnore) edu.uci.ics.textdb.api.tuple(edu.uci.ics.textdb.api.tuple) Span(edu.uci.ics.textdb.api.span.Span) PropertyNameConstants(edu.uci.ics.textdb.exp.common.PropertyNameConstants) IOperator(edu.uci.ics.textdb.api.dataflow.IOperator) NormalizedLevenshtein(info.debatty.java.stringsimilarity.NormalizedLevenshtein)
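
The core of joinTuples is the nested loop that collects every value taking part in at least one sufficiently similar pair. The following is a minimal standalone sketch of that step, not the project's code. The class SimilarityJoinSketch and its methods are hypothetical, and the similarity measure (1 minus edit distance divided by the longer length) is an assumption standing in for whatever similarityFunc computes.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration class; not part of textdb.
class SimilarityJoinSketch {

    // Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed similarity in [0, 1]: 1 - distance / length of the longer string.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / maxLen;
    }

    // Collects every inner and outer value that belongs to at least one
    // pair whose similarity reaches the threshold.
    static Set<String> similarValues(Set<String> inner, Set<String> outer, double threshold) {
        Set<String> result = new LinkedHashSet<>();
        for (String in : inner) {
            for (String out : outer) {
                if (similarity(in, out) >= threshold) {
                    result.add(in);
                    result.add(out);
                }
            }
        }
        return result;
    }
}
```

An empty result corresponds to the case where joinTuples returns null; otherwise the returned values are used to select which spans are kept.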

Example 22 with DataFlowException

use of edu.uci.ics.textdb.api.exception.DataFlowException in project textdb by TextDB.

the class KeywordMatcher method computePhraseMatchingResult.

private List<Span> computePhraseMatchingResult(Tuple inputTuple) throws DataFlowException {
    ListField<Span> payloadField = inputTuple.getField(SchemaConstants.PAYLOAD);
    List<Span> payload = payloadField.getValue();
    List<Span> relevantSpans = filterRelevantSpans(payload);
    List<Span> matchingResults = new ArrayList<>();
    for (String attributeName : this.predicate.getAttributeNames()) {
        AttributeType attributeType = this.inputSchema.getAttribute(attributeName).getAttributeType();
        String fieldValue = inputTuple.getField(attributeName).getValue().toString();
        // types other than TEXT and STRING: throw Exception for now
        if (attributeType != AttributeType.STRING && attributeType != AttributeType.TEXT) {
            throw new DataFlowException("KeywordMatcher: Fields other than STRING and TEXT are not supported yet");
        }
        // for STRING type, the query should match the fieldValue completely
        if (attributeType == AttributeType.STRING) {
            if (fieldValue.equals(predicate.getQuery())) {
                matchingResults.add(new Span(attributeName, 0, predicate.getQuery().length(), predicate.getQuery(), fieldValue));
            }
        }
        // phrase query
        if (attributeType == AttributeType.TEXT) {
            List<Span> fieldSpanList = relevantSpans.stream().filter(span -> span.getAttributeName().equals(attributeName)).collect(Collectors.toList());
            if (!isAllQueryTokensPresent(fieldSpanList, queryTokenSet)) {
                // skip this field if not all query tokens are present in its spans
                continue;
            }
            // Sort current field's span list by token offset for later use
            Collections.sort(fieldSpanList, (span1, span2) -> span1.getTokenOffset() - span2.getTokenOffset());
            List<Integer> queryTokenOffset = new ArrayList<>();
            for (int i = 0; i < queryTokensWithStopwords.size(); i++) {
                if (queryTokenList.contains(queryTokensWithStopwords.get(i))) {
                    queryTokenOffset.add(i);
                }
            }
            // position in fieldSpanList of the span currently being checked
            int iter = 0;
            while (iter < fieldSpanList.size()) {
                if (iter > fieldSpanList.size() - queryTokenList.size()) {
                    break;
                }
                // Verify that the spans starting at iter correspond to the
                // phrase query, i.e. the relative token offsets and the
                // values must match those of the query tokens.
                // flag set when a span does not match the query
                boolean isMismatchInSpan = false;
                // check every pair of adjacent query tokens
                for (int i = 0; i < queryTokenList.size() - 1; i++) {
                    Span first = fieldSpanList.get(iter + i);
                    Span second = fieldSpanList.get(iter + i + 1);
                    if (!(second.getTokenOffset() - first.getTokenOffset() == queryTokenOffset.get(i + 1) - queryTokenOffset.get(i) && first.getValue().equalsIgnoreCase(queryTokenList.get(i)) && second.getValue().equalsIgnoreCase(queryTokenList.get(i + 1)))) {
                        iter++;
                        isMismatchInSpan = true;
                        break;
                    }
                }
                if (isMismatchInSpan) {
                    continue;
                }
                int combinedSpanStartIndex = fieldSpanList.get(iter).getStart();
                int combinedSpanEndIndex = fieldSpanList.get(iter + queryTokenList.size() - 1).getEnd();
                Span combinedSpan = new Span(attributeName, combinedSpanStartIndex, combinedSpanEndIndex, predicate.getQuery(), fieldValue.substring(combinedSpanStartIndex, combinedSpanEndIndex));
                matchingResults.add(combinedSpan);
                iter = iter + queryTokenList.size();
            }
        }
    }
    return matchingResults;
}
Also used : SchemaConstants(edu.uci.ics.textdb.api.constants.SchemaConstants) Attribute(edu.uci.ics.textdb.api.schema.Attribute) Iterator(java.util.Iterator) ErrorMessages(edu.uci.ics.textdb.api.constants.ErrorMessages) AbstractSingleInputOperator(edu.uci.ics.textdb.exp.common.AbstractSingleInputOperator) DataFlowException(edu.uci.ics.textdb.api.exception.DataFlowException) Set(java.util.Set) Utils(edu.uci.ics.textdb.api.utils.Utils) Collectors(java.util.stream.Collectors) ArrayList(java.util.ArrayList) AttributeType(edu.uci.ics.textdb.api.schema.AttributeType) HashSet(java.util.HashSet) Schema(edu.uci.ics.textdb.api.schema.Schema) List(java.util.List) ListField(edu.uci.ics.textdb.api.field.ListField) Matcher(java.util.regex.Matcher) TextDBException(edu.uci.ics.textdb.api.exception.TextDBException) Pattern(java.util.regex.Pattern) Span(edu.uci.ics.textdb.api.span.Span) Collections(java.util.Collections) DataflowUtils(edu.uci.ics.textdb.exp.utils.DataflowUtils) Tuple(edu.uci.ics.textdb.api.tuple.Tuple)
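
The TEXT branch above slides a window over the offset-sorted spans and accepts a position only when both the token values and the gaps between token offsets agree with the query. Comparing gaps rather than adjacency is what lets a phrase match even though stopwords were removed from the query tokens. The sketch below isolates that idea under simplifying assumptions: Token is a hypothetical stand-in for Span, the token list is assumed to be already filtered to query tokens and sorted by offset, and findPhrase is an illustrative name that does not exist in textdb.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of textdb.
class PhraseMatchSketch {

    // Stand-in for a span: the token text and its token position in the field.
    static final class Token {
        final String value;
        final int offset;

        Token(String value, int offset) {
            this.value = value;
            this.offset = offset;
        }
    }

    // Returns the indices in tokens at which the phrase starts.
    // queryOffsets holds the position of each query token in the original
    // query, so gaps left by removed stopwords are preserved.
    static List<Integer> findPhrase(List<Token> tokens, List<String> queryTokens, List<Integer> queryOffsets) {
        List<Integer> starts = new ArrayList<>();
        int n = queryTokens.size();
        int iter = 0;
        while (iter + n <= tokens.size()) {
            boolean match = true;
            for (int i = 0; i < n; i++) {
                Token t = tokens.get(iter + i);
                boolean sameValue = t.value.equalsIgnoreCase(queryTokens.get(i));
                boolean sameGap = i == 0
                        || t.offset - tokens.get(iter + i - 1).offset == queryOffsets.get(i) - queryOffsets.get(i - 1);
                if (!sameValue || !sameGap) {
                    match = false;
                    break;
                }
            }
            if (match) {
                starts.add(iter);
                iter += n; // continue after the matched phrase
            } else {
                iter++;
            }
        }
        return starts;
    }
}
```

For the query tokens [lin, clooney] with query offsets [0, 2] (one stopword between them), the tokens lin@0, clooney@2 match, while lin@4, clooney@5 do not because the gap is 1 instead of 2.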

Example 23 with DataFlowException

use of edu.uci.ics.textdb.api.exception.DataFlowException in project textdb by TextDB.

the class Join method getNextTuple.

/**
 * Gets the next tuple, which is the join of two tuples that satisfy the
 * criteria set in the JoinPredicate. <br>
 * Example in JoinPredicate.java
 *
 * @return nextTuple
 */
@Override
public Tuple getNextTuple() throws TextDBException {
    if (cursor == CLOSED) {
        throw new DataFlowException(ErrorMessages.OPERATOR_NOT_OPENED);
    }
    // load all tuples from the inner operator into memory on the first call
    if (innerTupleList == null) {
        innerTupleList = new ArrayList<>();
        Tuple tuple;
        while ((tuple = innerOperator.getNextTuple()) != null) {
            innerTupleList.add(tuple);
        }
    }
    // load the first outer tuple
    currentOuterTuple = outerOperator.getNextTuple();
    // return null if the inner operator produced no tuples or all outer tuples have been consumed
    if (innerTupleList.isEmpty() || currentOuterTuple == null) {
        return null;
    }
    if (resultCursor >= limit + offset - 1 || limit == 0) {
        return null;
    }
    try {
        Tuple resultTuple = null;
        while (true) {
            resultTuple = computeNextMatchingTuple();
            if (resultTuple == null) {
                break;
            }
            resultCursor++;
            if (resultCursor >= offset) {
                break;
            }
        }
        return resultTuple;
    } catch (Exception e) {
        throw new DataFlowException(e.getMessage(), e);
    }
}
Also used : DataFlowException(edu.uci.ics.textdb.api.exception.DataFlowException) Tuple(edu.uci.ics.textdb.api.tuple.Tuple) TextDBException(edu.uci.ics.textdb.api.exception.TextDBException)
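
The limit and offset handling in getNextTuple can be easier to follow in isolation. The sketch below reproduces only that bookkeeping over a plain Iterator, assuming (as the comparisons in the method suggest) that resultCursor starts at -1 and counts every result produced, including skipped ones. LimitOffsetSketch is a hypothetical class, and the sum is computed as a long here to avoid integer overflow for very large limits, which is a choice of this sketch rather than something the original does.

```java
import java.util.Iterator;

// Hypothetical illustration class; not part of textdb.
class LimitOffsetSketch<T> {
    private final Iterator<T> source;
    private final int limit;
    private final int offset;
    // index of the last result pulled from the source; -1 before the first one
    private int resultCursor = -1;

    LimitOffsetSketch(Iterator<T> source, int limit, int offset) {
        this.source = source;
        this.limit = limit;
        this.offset = offset;
    }

    // Returns the next result inside the [offset, offset + limit) window,
    // or null when the window or the source is exhausted.
    T next() {
        if (limit == 0 || resultCursor >= (long) limit + offset - 1) {
            return null;
        }
        while (source.hasNext()) {
            T result = source.next();
            resultCursor++;
            if (resultCursor >= offset) {
                return result; // first result at or past the offset
            }
            // otherwise the result is skipped
        }
        return null;
    }
}
```

With limit 2 and offset 1 over the items a, b, c, d, e, successive calls return b, c and then null.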

Example 24 with DataFlowException

use of edu.uci.ics.textdb.api.exception.DataFlowException in project textdb by TextDB.

the class JoinDistancePredicate method generateIntersectionSchema.

/**
 * Create outputSchema, which is the intersection of innerOperator's schema and outerOperator's schema.
 * The attributes have to be exactly the same (name and type) to be intersected.
 *
 * InnerOperator's attributes and outerOperator's attributes must:
 * both contain the attributes to be joined.
 * both contain "_ID" attribute.
 * both contain "spanList" attribute.
 *
 * @return outputSchema
 */
private Schema generateIntersectionSchema(Schema innerOperatorSchema, Schema outerOperatorSchema) throws DataFlowException {
    List<Attribute> innerAttributes = innerOperatorSchema.getAttributes();
    List<Attribute> outerAttributes = outerOperatorSchema.getAttributes();
    List<Attribute> intersectionAttributes = innerAttributes.stream().filter(attr -> outerAttributes.contains(attr)).collect(Collectors.toList());
    Schema intersectionSchema = new Schema(intersectionAttributes.stream().toArray(Attribute[]::new));
    // check that the output schema contains the necessary attributes
    if (intersectionSchema.getAttributes().isEmpty()) {
        throw new DataFlowException("inner operator and outer operator don't share any common attributes");
    } else if (intersectionSchema.getAttribute(this.joinAttributeName) == null) {
        throw new DataFlowException("inner operator or outer operator doesn't contain join attribute");
    } else if (intersectionSchema.getAttribute(SchemaConstants._ID) == null) {
        throw new DataFlowException("inner operator or outer operator doesn't contain _ID attribute");
    } else if (intersectionSchema.getAttribute(SchemaConstants.SPAN_LIST) == null) {
        throw new DataFlowException("inner operator or outer operator doesn't contain spanList attribute");
    }
    // check if join attribute is TEXT or STRING
    AttributeType joinAttrType = intersectionSchema.getAttribute(this.joinAttributeName).getAttributeType();
    if (joinAttrType != AttributeType.TEXT && joinAttrType != AttributeType.STRING) {
        throw new DataFlowException(String.format("Join attribute %s must be either TEXT or STRING.", this.joinAttributeName));
    }
    return intersectionSchema;
}
Also used : SchemaConstants(edu.uci.ics.textdb.api.constants.SchemaConstants) JsonProperty(com.fasterxml.jackson.annotation.JsonProperty) Attribute(edu.uci.ics.textdb.api.schema.Attribute) Iterator(java.util.Iterator) DataFlowException(edu.uci.ics.textdb.api.exception.DataFlowException) Collectors(java.util.stream.Collectors) ArrayList(java.util.ArrayList) AttributeType(edu.uci.ics.textdb.api.schema.AttributeType) PredicateBase(edu.uci.ics.textdb.exp.common.PredicateBase) Schema(edu.uci.ics.textdb.api.schema.Schema) List(java.util.List) ListField(edu.uci.ics.textdb.api.field.ListField) IField(edu.uci.ics.textdb.api.field.IField) JsonCreator(com.fasterxml.jackson.annotation.JsonCreator) edu.uci.ics.textdb.api.tuple(edu.uci.ics.textdb.api.tuple) Span(edu.uci.ics.textdb.api.span.Span) PropertyNameConstants(edu.uci.ics.textdb.exp.common.PropertyNameConstants) IOperator(edu.uci.ics.textdb.api.dataflow.IOperator)
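
The schema intersection above keeps an attribute only when both its name and its type agree on the two sides, then validates the result. A rough standalone equivalent is sketched below, assuming a schema can be modelled as a map from attribute name to a type name. SchemaIntersectionSketch, the string type names, and the use of IllegalArgumentException in place of DataFlowException are all choices made for this illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of textdb.
class SchemaIntersectionSketch {

    // Schemas are modelled as attribute name -> type name (an assumption).
    static Map<String, String> intersect(Map<String, String> inner, Map<String, String> outer, String joinAttr) {
        Map<String, String> common = new LinkedHashMap<>(inner);
        // An entry survives only if outer has the same key with the same value,
        // i.e. the attribute matches in both name and type.
        common.entrySet().retainAll(outer.entrySet());

        if (common.isEmpty()) {
            throw new IllegalArgumentException("schemas share no common attributes");
        }
        if (!common.containsKey(joinAttr)) {
            throw new IllegalArgumentException("join attribute " + joinAttr + " is missing from one side");
        }
        String joinType = common.get(joinAttr);
        if (!joinType.equals("TEXT") && !joinType.equals("STRING")) {
            throw new IllegalArgumentException("join attribute " + joinAttr + " must be TEXT or STRING");
        }
        return common;
    }
}
```

An attribute that has the same name but different types on the two sides is dropped by the intersection, so using it as the join attribute fails the second check rather than the type check.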

Example 25 with DataFlowException

use of edu.uci.ics.textdb.api.exception.DataFlowException in project textdb by TextDB.

the class AbstractSink method processTuples.

@Override
public void processTuples() throws TextDBException {
    if (cursor == CLOSED) {
        throw new DataFlowException(ErrorMessages.OPERATOR_NOT_OPENED);
    }
    Tuple nextTuple;
    while ((nextTuple = inputOperator.getNextTuple()) != null) {
        processOneTuple(nextTuple);
        cursor++;
    }
}
Also used : DataFlowException(edu.uci.ics.textdb.api.exception.DataFlowException) Tuple(edu.uci.ics.textdb.api.tuple.Tuple)

Aggregations

DataFlowException (edu.uci.ics.textdb.api.exception.DataFlowException) 34
TextDBException (edu.uci.ics.textdb.api.exception.TextDBException) 13
AttributeType (edu.uci.ics.textdb.api.schema.AttributeType) 12
Schema (edu.uci.ics.textdb.api.schema.Schema) 11
Tuple (edu.uci.ics.textdb.api.tuple.Tuple) 10
Attribute (edu.uci.ics.textdb.api.schema.Attribute) 8
Span (edu.uci.ics.textdb.api.span.Span) 7
ArrayList (java.util.ArrayList) 7
SchemaConstants (edu.uci.ics.textdb.api.constants.SchemaConstants) 6
List (java.util.List) 6
Collectors (java.util.stream.Collectors) 6
StorageException (edu.uci.ics.textdb.api.exception.StorageException) 5
ListField (edu.uci.ics.textdb.api.field.ListField) 5
IOException (java.io.IOException) 5
IField (edu.uci.ics.textdb.api.field.IField) 4
Utils (edu.uci.ics.textdb.api.utils.Utils) 4
AbstractSingleInputOperator (edu.uci.ics.textdb.exp.common.AbstractSingleInputOperator) 4
Iterator (java.util.Iterator) 4
ErrorMessages (edu.uci.ics.textdb.api.constants.ErrorMessages) 3
IOperator (edu.uci.ics.textdb.api.dataflow.IOperator) 3