Example 51 with DataflowException

use of edu.uci.ics.texera.api.exception.DataflowException in project textdb by TextDB.

the class AsterixSource method open.

@Override
public void open() throws TexeraException {
    if (cursor == OPENED) {
        return;
    }
    try {
        String asterixAddress = "http://" + predicate.getHost() + ":" + predicate.getPort() + "/query/service";
        String asterixQuery = generateAsterixQuery(predicate);
        HttpResponse<JsonNode> jsonResponse = Unirest.post(asterixAddress).queryString("statement", asterixQuery).field("mode", "immediate").asJson();
        // if status is 200 OK, store the results
        if (jsonResponse.getStatus() == 200) {
            this.resultJsonArray = jsonResponse.getBody().getObject().getJSONArray("results");
        } else {
            throw new DataflowException("Send query to asterix failed: error status: " + jsonResponse.getStatusText() + ", error body: " + jsonResponse.getBody().toString());
        }
        cursor = OPENED;
    } catch (UnirestException e) {
        throw new DataflowException(e);
    }
}
Also used : DataflowException(edu.uci.ics.texera.api.exception.DataflowException) UnirestException(com.mashape.unirest.http.exceptions.UnirestException) JsonNode(com.mashape.unirest.http.JsonNode)
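
The method above turns two kinds of failure into a DataflowException: a non-200 HTTP status and a checked library exception. The sketch below isolates that pattern without Unirest. SketchDataflowException, requireOk, and wrapChecked are hypothetical names introduced for illustration, and the stand-in exception is assumed to be unchecked; none of them is part of the Texera API.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for DataflowException; assumed to be unchecked in this sketch.
class SketchDataflowException extends RuntimeException {
    SketchDataflowException(String message) {
        super(message);
    }

    SketchDataflowException(Throwable cause) {
        super(cause);
    }
}

class FailureWrapping {
    // Returns the body when the status is 200 OK; otherwise reports status text and body.
    static String requireOk(int status, String statusText, String body) {
        if (status != 200) {
            throw new SketchDataflowException(
                    "Send query to asterix failed: error status: " + statusText + ", error body: " + body);
        }
        return body;
    }

    // Runs an action and rethrows any checked exception as the stand-in exception.
    static <T> T wrapChecked(Callable<T> action) {
        try {
            return action.call();
        } catch (RuntimeException e) {
            throw e; // already unchecked, pass through unchanged
        } catch (Exception e) {
            throw new SketchDataflowException(e); // keep the original as the cause
        }
    }
}
```

Keeping the original exception as the cause preserves its stack trace for whoever handles the wrapped failure.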

Example 52 with DataflowException

use of edu.uci.ics.texera.api.exception.DataflowException in project textdb by TextDB.

the class FileExtractorUtils method extractPPTFile.

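/**
 * Extracts text from a PPT/PPTX file using Apache POI.
 *
 * @param path path of the presentation file to read
 * @return the concatenated text of all text shapes on all slides
 * @throws DataflowException if the file cannot be read
 */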
public static String extractPPTFile(Path path) throws DataflowException {
    try (FileInputStream inputStream = new FileInputStream(path.toString());
        XMLSlideShow ppt = new XMLSlideShow(inputStream)) {
        // StringBuilder suffices here: the buffer is local and never shared between threads
        StringBuilder res = new StringBuilder();
        for (XSLFSlide slide : ppt.getSlides()) {
            List<XSLFShape> shapes = slide.getShapes();
            for (XSLFShape shape : shapes) {
                if (shape instanceof XSLFTextShape) {
                    XSLFTextShape textShape = (XSLFTextShape) shape;
                    String text = textShape.getText();
                    res.append(text);
                }
            }
        }
        return res.toString();
    } catch (IOException e) {
        throw new DataflowException(e);
    }
}
Also used : XSLFSlide(org.apache.poi.xslf.usermodel.XSLFSlide) XMLSlideShow(org.apache.poi.xslf.usermodel.XMLSlideShow) XSLFTextShape(org.apache.poi.xslf.usermodel.XSLFTextShape) DataflowException(edu.uci.ics.texera.api.exception.DataflowException) XSLFShape(org.apache.poi.xslf.usermodel.XSLFShape) IOException(java.io.IOException) FileInputStream(java.io.FileInputStream)

Example 53 with DataflowException

use of edu.uci.ics.texera.api.exception.DataflowException in project textdb by TextDB.

the class FileExtractorUtils method extractWordFile.

/**
 * Extracts the text of an MS Word DOC/DOCX file.
 *
 * @param path path of the Word file to read
 * @return the extracted body text
 * @throws DataflowException if the file cannot be read or parsed
 */
public static String extractWordFile(Path path) throws DataflowException {
    try (FileInputStream inputStream = new FileInputStream(path.toString())) {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        parser.parse(inputStream, handler, metadata);
        return handler.toString();
    } catch (IOException | SAXException | TikaException e) {
        throw new DataflowException(e);
    }
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) TikaException(org.apache.tika.exception.TikaException) Metadata(org.apache.tika.metadata.Metadata) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) DataflowException(edu.uci.ics.texera.api.exception.DataflowException) IOException(java.io.IOException) FileInputStream(java.io.FileInputStream) SAXException(org.xml.sax.SAXException)

Example 54 with DataflowException

use of edu.uci.ics.texera.api.exception.DataflowException in project textdb by TextDB.

the class FileSourceOperator method getNextTuple.

@Override
public Tuple getNextTuple() throws TexeraException {
    if (cursor == CLOSED || cursor >= pathList.size()) {
        return null;
    }
    // keep reading until a file yields a tuple or the cursor reaches the end
    while (cursor < pathList.size()) {
        try {
            Path path = pathIterator.next();
            // advance the cursor together with the iterator, so that a skipped file
            // cannot leave the loop calling next() on an exhausted iterator
            cursor++;
            String extension = com.google.common.io.Files.getFileExtension(path.toString());
            String content;
            if (extension.equalsIgnoreCase("pdf")) {
                content = FileExtractorUtils.extractPDFFile(path);
            } else if (extension.equalsIgnoreCase("ppt") || extension.equalsIgnoreCase("pptx")) {
                content = FileExtractorUtils.extractPPTFile(path);
            } else if (extension.equalsIgnoreCase("doc") || extension.equalsIgnoreCase("docx")) {
                content = FileExtractorUtils.extractWordFile(path);
            } else {
                content = FileExtractorUtils.extractPlainTextFile(path);
            }
            Tuple tuple = new Tuple(outputSchema, IDField.newRandomID(), new TextField(content));
            return tuple;
        } catch (DataflowException e) {
            // ignore error and move on
            // TODO: use log4j
            System.out.println("FileSourceOperator: file read error, file is ignored. " + e.getMessage());
        }
    }
    return null;
}
Also used : Path(java.nio.file.Path) TextField(edu.uci.ics.texera.api.field.TextField) DataflowException(edu.uci.ics.texera.api.exception.DataflowException) Tuple(edu.uci.ics.texera.api.tuple.Tuple)
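
The extension dispatch in the method above can be isolated as a small lookup. This is a minimal sketch using only the JDK rather than Guava's Files.getFileExtension. ExtractorChoice, Kind, extensionOf, and kindOf are hypothetical names that do not appear in the project, and the sketch assumes the path has a file name component.

```java
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical names for illustration; not part of the Texera API.
class ExtractorChoice {
    enum Kind { PDF, PPT, WORD, PLAIN_TEXT }

    // Returns the text after the last dot of the file name, or "" if there is none.
    // Assumes path.getFileName() is non-null (the path is not a bare root).
    static String extensionOf(Path path) {
        String name = path.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }

    // Maps a path to an extractor kind, ignoring case; unknown extensions fall back to plain text.
    static Kind kindOf(Path path) {
        switch (extensionOf(path).toLowerCase(Locale.ROOT)) {
            case "pdf":
                return Kind.PDF;
            case "ppt":
            case "pptx":
                return Kind.PPT;
            case "doc":
            case "docx":
                return Kind.WORD;
            default:
                return Kind.PLAIN_TEXT;
        }
    }
}
```

Only the file name is inspected, so a dot in a parent directory name does not count as an extension.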

Example 55 with DataflowException

use of edu.uci.ics.texera.api.exception.DataflowException in project textdb by TextDB.

the class AbstractSingleInputOperator method getNextTuple.

@Override
public Tuple getNextTuple() throws TexeraException {
    if (cursor == CLOSED) {
        throw new DataflowException(ErrorMessages.OPERATOR_NOT_OPENED);
    }
    if (cursor >= limit + offset) {
        return null;
    }
    try {
        Tuple resultTuple = null;
        while (true) {
            resultTuple = computeNextMatchingTuple();
            if (resultTuple == null) {
                break;
            }
            cursor++;
            if (cursor > offset) {
                break;
            }
        }
        return resultTuple;
    } catch (Exception e) {
        throw new DataflowException(e.getMessage(), e);
    }
}
Also used : DataflowException(edu.uci.ics.texera.api.exception.DataflowException) Tuple(edu.uci.ics.texera.api.tuple.Tuple) TexeraException(edu.uci.ics.texera.api.exception.TexeraException)
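
The cursor logic in the method above skips the first offset results and then returns at most limit results. The sketch below reproduces that behavior over a plain Iterator. LimitOffsetSketch and drain are hypothetical names, the source is assumed to contain no null items, and the long arithmetic is a choice made for this sketch, not something taken from the project.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper for illustration; not part of the Texera API.
class LimitOffsetSketch<T> {
    private final Iterator<T> source;
    private final int limit;
    private final int offset;
    private long cursor = 0; // items consumed from the source so far

    LimitOffsetSketch(Iterator<T> source, int limit, int offset) {
        this.source = source;
        this.limit = limit;
        this.offset = offset;
    }

    // Returns the next item past the offset, or null once the window or the source is exhausted.
    T next() {
        // long arithmetic avoids int overflow when limit is Integer.MAX_VALUE
        if (cursor >= (long) limit + offset) {
            return null;
        }
        while (source.hasNext()) {
            T item = source.next();
            cursor++;
            if (cursor > offset) {
                return item; // first item beyond the skipped prefix
            }
        }
        return null;
    }

    // Collects every item that next() yields until it returns null.
    static <T> List<T> drain(Iterator<T> source, int limit, int offset) {
        LimitOffsetSketch<T> op = new LimitOffsetSketch<>(source, limit, offset);
        List<T> out = new ArrayList<>();
        for (T item = op.next(); item != null; item = op.next()) {
            out.add(item);
        }
        return out;
    }
}
```

For the items 1 to 5 with limit 2 and offset 1, drain returns 2 and 3.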

Aggregations

DataflowException (edu.uci.ics.texera.api.exception.DataflowException) 56
TexeraException (edu.uci.ics.texera.api.exception.TexeraException) 23
AttributeType (edu.uci.ics.texera.api.schema.AttributeType) 20
Schema (edu.uci.ics.texera.api.schema.Schema) 20
Tuple (edu.uci.ics.texera.api.tuple.Tuple) 18
IOException (java.io.IOException) 14
Span (edu.uci.ics.texera.api.span.Span) 11
Collectors (java.util.stream.Collectors) 10
SchemaConstants (edu.uci.ics.texera.api.constants.SchemaConstants) 9
ArrayList (java.util.ArrayList) 9
Attribute (edu.uci.ics.texera.api.schema.Attribute) 8
IOperator (edu.uci.ics.texera.api.dataflow.IOperator) 7
IField (edu.uci.ics.texera.api.field.IField) 7
ListField (edu.uci.ics.texera.api.field.ListField) 7
List (java.util.List) 7
AbstractSingleInputOperator (edu.uci.ics.texera.dataflow.common.AbstractSingleInputOperator) 6
ErrorMessages (edu.uci.ics.texera.api.constants.ErrorMessages) 5
StorageException (edu.uci.ics.texera.api.exception.StorageException) 5
IntegerField (edu.uci.ics.texera.api.field.IntegerField) 4
DataflowUtils (edu.uci.ics.texera.dataflow.utils.DataflowUtils) 4