Example 71 with TikaException

use of org.apache.tika.exception.TikaException in project tika by apache.

the class RFC822ParserTest method testLongHeader.

/**
 * Test for TIKA-640, increase header max beyond 10k bytes
 */
@Test
public void testLongHeader() throws Exception {
    StringBuilder inputBuilder = new StringBuilder();
    for (int i = 0; i < 2000; ++i) {
        // each appended chunk is 52 characters, so the header value totals about 104,000 bytes
        inputBuilder.append("really really really really really really long name ");
    }
    String name = inputBuilder.toString();
    byte[] data = ("From: " + name + "\r\n\r\n").getBytes(US_ASCII);
    Parser parser = new RFC822Parser();
    ContentHandler handler = new DefaultHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    try {
        parser.parse(new ByteArrayInputStream(data), handler, metadata, context);
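        fail("expected TikaException: the header exceeds the default length limit");
    } catch (TikaException expected) {
        // expected: with the default limits the parser rejects a header this long
    }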
    MimeConfig config = new MimeConfig();
    config.setMaxHeaderLen(-1);
    config.setMaxLineLen(-1);
    context.set(MimeConfig.class, config);
    parser.parse(new ByteArrayInputStream(data), handler, metadata, context);
    assertEquals(name.trim(), metadata.get(TikaCoreProperties.CREATOR));
}
Also used : TikaException(org.apache.tika.exception.TikaException) Metadata(org.apache.tika.metadata.Metadata) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ContentHandler(org.xml.sax.ContentHandler) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) Parser(org.apache.tika.parser.Parser) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) DefaultHandler(org.xml.sax.helpers.DefaultHandler) MimeConfig(org.apache.james.mime4j.stream.MimeConfig) ByteArrayInputStream(java.io.ByteArrayInputStream) ParseContext(org.apache.tika.parser.ParseContext) TesseractOCRParserTest(org.apache.tika.parser.ocr.TesseractOCRParserTest) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)
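
The test above first expects the oversized header to be rejected, then passes -1 for both limits and expects the full value back. A minimal, library-free sketch of that "negative limit disables the check" convention follows. HeaderLimitSketch, headerValue and the 10,000-character default are hypothetical names and values chosen for illustration; they are not part of Tika or mime4j.

```java
/** Illustrative sketch only: a length-limited header reader, not the mime4j implementation. */
final class HeaderLimitSketch {
    /** Hypothetical default cap, chosen here only to mirror the "10k" in the test description. */
    static final int DEFAULT_MAX_HEADER_LEN = 10_000;

    /**
     * Returns the trimmed value after the first colon of a header line.
     * A negative maxHeaderLen disables the length check (assumed convention).
     */
    static String headerValue(String headerLine, int maxHeaderLen) {
        if (maxHeaderLen >= 0 && headerLine.length() > maxHeaderLen) {
            throw new IllegalStateException(
                    "header of " + headerLine.length() + " chars exceeds limit " + maxHeaderLen);
        }
        int colon = headerLine.indexOf(':');
        // indexOf returns -1 when there is no colon, so the whole line is treated as the value
        return headerLine.substring(colon + 1).trim();
    }

    public static void main(String[] args) {
        StringBuilder line = new StringBuilder("From: ");
        for (int i = 0; i < 2000; i++) {
            line.append("really really really really really really long name ");
        }
        try {
            headerValue(line.toString(), DEFAULT_MAX_HEADER_LEN);
        } catch (IllegalStateException expected) {
            System.out.println("rejected with default limit");
        }
        System.out.println(headerValue(line.toString(), -1).length());
    }
}
```

Using a sentinel value for "no limit" keeps the configuration a plain integer, at the cost of a special case that every caller has to know about.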

Example 72 with TikaException

use of org.apache.tika.exception.TikaException in project tika by apache.

the class OutlookPSTParser method parseMailAttachments.

private void parseMailAttachments(XHTMLContentHandler xhtml, PSTMessage email, EmbeddedDocumentExtractor embeddedExtractor) throws TikaException {
    int numberOfAttachments = email.getNumberOfAttachments();
    for (int i = 0; i < numberOfAttachments; i++) {
        File tempFile = null;
        try {
            PSTAttachment attach = email.getAttachment(i);
            // Get the filename; both long and short filenames can be used for attachments
            String filename = attach.getLongFilename();
            if (filename.isEmpty()) {
                filename = attach.getFilename();
            }
            xhtml.element("p", filename);
            Metadata attachMeta = new Metadata();
            attachMeta.set(Metadata.RESOURCE_NAME_KEY, filename);
            attachMeta.set(Metadata.EMBEDDED_RELATIONSHIP_ID, filename);
            AttributesImpl attributes = new AttributesImpl();
            attributes.addAttribute("", "class", "class", "CDATA", "embedded");
            attributes.addAttribute("", "id", "id", "CDATA", filename);
            xhtml.startElement("div", attributes);
            if (embeddedExtractor.shouldParseEmbedded(attachMeta)) {
                TemporaryResources tmp = new TemporaryResources();
                try {
                    TikaInputStream tis = TikaInputStream.get(attach.getFileInputStream(), tmp);
                    embeddedExtractor.parseEmbedded(tis, xhtml, attachMeta, true);
                } finally {
                    tmp.dispose();
                }
            }
            xhtml.endElement("div");
        } catch (Exception e) {
            throw new TikaException("Unable to unpack document stream", e);
        } finally {
            if (tempFile != null)
                tempFile.delete();
        }
    }
}
Also used : AttributesImpl(org.xml.sax.helpers.AttributesImpl) TikaException(org.apache.tika.exception.TikaException) Metadata(org.apache.tika.metadata.Metadata) TemporaryResources(org.apache.tika.io.TemporaryResources) TikaInputStream(org.apache.tika.io.TikaInputStream) PSTAttachment(com.pff.PSTAttachment) File(java.io.File) PSTFile(com.pff.PSTFile) IOException(java.io.IOException) PSTException(com.pff.PSTException) SAXException(org.xml.sax.SAXException)

Example 73 with TikaException

use of org.apache.tika.exception.TikaException in project tika by apache.

the class OutlookPSTParser method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    // Use the delegate parser to parse the contained document
    EmbeddedDocumentExtractor embeddedExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
    metadata.set(Metadata.CONTENT_TYPE, MS_OUTLOOK_PST_MIMETYPE.toString());
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    TikaInputStream in = TikaInputStream.get(stream);
    PSTFile pstFile = null;
    try {
        pstFile = new PSTFile(in.getFile().getPath());
        metadata.set(Metadata.CONTENT_LENGTH, valueOf(pstFile.getFileHandle().length()));
        boolean isValid = pstFile.getFileHandle().getFD().valid();
        metadata.set("isValid", valueOf(isValid));
        if (isValid) {
            parseFolder(xhtml, pstFile.getRootFolder(), embeddedExtractor);
        }
    } catch (Exception e) {
        throw new TikaException(e.getMessage(), e);
    } finally {
        if (pstFile != null && pstFile.getFileHandle() != null) {
            try {
                pstFile.getFileHandle().close();
            } catch (IOException e) {
                // swallow the closing exception so it cannot replace an exception thrown while parsing
            }
        }
    }
    xhtml.endDocument();
}
Also used : TikaException(org.apache.tika.exception.TikaException) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) PSTFile(com.pff.PSTFile) TikaInputStream(org.apache.tika.io.TikaInputStream) IOException(java.io.IOException) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) PSTException(com.pff.PSTException) SAXException(org.xml.sax.SAXException)
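
The parse method above wraps any failure in a TikaException and closes the file handle in a finally block, ignoring errors from close(). The same shape can be sketched without any Tika or java-libpst classes. ParseFailedException, Work and WrapAndCloseSketch are hypothetical stand-ins introduced for this illustration only.

```java
import java.io.Closeable;
import java.io.IOException;

/** Illustrative sketch only: wrap-and-rethrow with cleanup in finally. */
final class WrapAndCloseSketch {
    /** Hypothetical checked exception standing in for TikaException. */
    static final class ParseFailedException extends Exception {
        ParseFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical unit of work that may fail with any exception. */
    interface Work {
        void run() throws Exception;
    }

    static void parse(Work work, Closeable resource) throws ParseFailedException {
        try {
            work.run();
        } catch (Exception e) {
            // keep the original exception as the cause so the stack trace is not lost
            throw new ParseFailedException(e.getMessage(), e);
        } finally {
            if (resource != null) {
                try {
                    resource.close();
                } catch (IOException ignored) {
                    // an exception escaping finally would replace the one thrown above
                }
            }
        }
    }
}
```

Callers then deal with a single checked exception type and can still reach the underlying failure through getCause().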

Example 74 with TikaException

use of org.apache.tika.exception.TikaException in project tika by apache.

the class AbstractPOIFSExtractor method handleEmbeddedOfficeDoc.

/**
 * Handle an office document that's embedded at the POIFS level
 */
protected void handleEmbeddedOfficeDoc(DirectoryEntry dir, String resourceName, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException {
    if (dir.hasEntry("Package")) {
        // It's OOXML (has a ZipFile):
        Entry ooxml = dir.getEntry("Package");
        try (TikaInputStream stream = TikaInputStream.get(new DocumentInputStream((DocumentEntry) ooxml))) {
            ZipContainerDetector detector = new ZipContainerDetector();
            MediaType type = null;
            try {
                //if there's a stream error while detecting...
                type = detector.detect(stream, new Metadata());
            } catch (Exception e) {
                EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                return;
            }
            handleEmbeddedResource(stream, null, dir.getName(), dir.getStorageClsid(), type.toString(), xhtml, true);
            return;
        }
    }
    // It's regular OLE2:
    // What kind of document is it?
    Metadata metadata = new Metadata();
    metadata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, dir.getName());
    if (dir.getStorageClsid() != null) {
        metadata.set(Metadata.EMBEDDED_STORAGE_CLASS_ID, dir.getStorageClsid().toString());
    }
    POIFSDocumentType type = POIFSDocumentType.detectType(dir);
    TikaInputStream embedded = null;
    String rName = (resourceName == null) ? dir.getName() : resourceName;
    try {
        if (type == POIFSDocumentType.OLE10_NATIVE) {
            try {
                // Try to un-wrap the OLE10Native record:
                Ole10Native ole = Ole10Native.createFromEmbeddedOleObject((DirectoryNode) dir);
                if (ole.getLabel() != null) {
                    metadata.set(Metadata.RESOURCE_NAME_KEY, rName + '/' + ole.getLabel());
                }
                if (ole.getCommand() != null) {
                    metadata.add(TikaCoreProperties.ORIGINAL_RESOURCE_NAME, ole.getCommand());
                }
                if (ole.getFileName() != null) {
                    metadata.add(TikaCoreProperties.ORIGINAL_RESOURCE_NAME, ole.getFileName());
                }
                byte[] data = ole.getDataBuffer();
                embedded = TikaInputStream.get(data);
            } catch (Ole10NativeException ex) {
                // Not a valid OLE10Native record, skip it
            } catch (Exception e) {
                EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                return;
            }
        } else if (type == POIFSDocumentType.COMP_OBJ) {
            try {
                //TODO: figure out if the equivalent of OLE 1.0's
                //getCommand() and getFileName() exist for OLE 2.0 to populate
                //TikaCoreProperties.ORIGINAL_RESOURCE_NAME
                // Grab the contents and process
                DocumentEntry contentsEntry;
                try {
                    contentsEntry = (DocumentEntry) dir.getEntry("CONTENTS");
                } catch (FileNotFoundException ioe) {
                    contentsEntry = (DocumentEntry) dir.getEntry("Contents");
                }
                byte[] contents = new byte[contentsEntry.getSize()];
                // close the stream once the contents have been read
                try (DocumentInputStream inp = new DocumentInputStream(contentsEntry)) {
                    inp.readFully(contents);
                }
                embedded = TikaInputStream.get(contents);
                // Try to work out what it is
                MediaType mediaType = getDetector().detect(embedded, new Metadata());
                String extension = type.getExtension();
                try {
                    MimeType mimeType = getMimeTypes().forName(mediaType.toString());
                    extension = mimeType.getExtension();
                } catch (MimeTypeException mte) {
                    // No details on this type are known
                }
                // Record what we can do about it
                metadata.set(Metadata.CONTENT_TYPE, mediaType.getType().toString());
                metadata.set(Metadata.RESOURCE_NAME_KEY, rName + extension);
            } catch (Exception e) {
                EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                return;
            }
        } else {
            metadata.set(Metadata.CONTENT_TYPE, type.getType().toString());
            metadata.set(Metadata.RESOURCE_NAME_KEY, rName + '.' + type.getExtension());
        }
        // Should we parse it?
        if (embeddedDocumentUtil.shouldParseEmbedded(metadata)) {
            if (embedded == null) {
                // Make a TikaInputStream that just
                // passes the root directory of the
                // embedded document, and is otherwise
                // empty (byte[0]):
                embedded = TikaInputStream.get(new byte[0]);
                embedded.setOpenContainer(dir);
            }
            embeddedDocumentUtil.parseEmbedded(embedded, xhtml, metadata, true);
        }
    } catch (IOException e) {
        EmbeddedDocumentUtil.recordEmbeddedStreamException(e, metadata);
    } finally {
        if (embedded != null) {
            embedded.close();
        }
    }
}
Also used : ZipContainerDetector(org.apache.tika.parser.pkg.ZipContainerDetector) Ole10Native(org.apache.poi.poifs.filesystem.Ole10Native) Metadata(org.apache.tika.metadata.Metadata) FileNotFoundException(java.io.FileNotFoundException) TikaInputStream(org.apache.tika.io.TikaInputStream) POIFSDocumentType(org.apache.tika.parser.microsoft.OfficeParser.POIFSDocumentType) IOException(java.io.IOException) DocumentInputStream(org.apache.poi.poifs.filesystem.DocumentInputStream) Ole10NativeException(org.apache.poi.poifs.filesystem.Ole10NativeException) TikaException(org.apache.tika.exception.TikaException) SAXException(org.xml.sax.SAXException) MimeTypeException(org.apache.tika.mime.MimeTypeException) MimeType(org.apache.tika.mime.MimeType) Entry(org.apache.poi.poifs.filesystem.Entry) DocumentEntry(org.apache.poi.poifs.filesystem.DocumentEntry) DirectoryEntry(org.apache.poi.poifs.filesystem.DirectoryEntry) MediaType(org.apache.tika.mime.MediaType)

Example 75 with TikaException

use of org.apache.tika.exception.TikaException in project tika by apache.

the class EMFParser method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    EmbeddedDocumentExtractor embeddedDocumentExtractor = null;
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    try {
        HemfExtractor ex = new HemfExtractor(stream);
        long lastY = -1;
        long lastX = -1;
        //derive this from the font or frame/bounds information
        long fudgeFactorX = 1000;
        StringBuilder buffer = new StringBuilder();
        for (HemfRecord record : ex) {
            if (record.getRecordType() == HemfRecordType.comment) {
                AbstractHemfComment comment = ((HemfCommentRecord) record).getComment();
                if (comment instanceof HemfCommentPublic.MultiFormats) {
                    if (embeddedDocumentExtractor == null) {
                        embeddedDocumentExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
                    }
                    handleMultiFormats((HemfCommentPublic.MultiFormats) comment, xhtml, embeddedDocumentExtractor);
                } else if (comment instanceof HemfCommentPublic.WindowsMetafile) {
                    if (embeddedDocumentExtractor == null) {
                        embeddedDocumentExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
                    }
                    handleWMF((HemfCommentPublic.WindowsMetafile) comment, xhtml, embeddedDocumentExtractor);
                }
            } else if (record.getRecordType().equals(HemfRecordType.exttextoutw)) {
                HemfText.ExtTextOutW extTextOutW = (HemfText.ExtTextOutW) record;
                if (lastY > -1 && lastY != extTextOutW.getY()) {
                    xhtml.startElement("p");
                    xhtml.characters(buffer.toString());
                    xhtml.endElement("p");
                    buffer.setLength(0);
                    lastX = -1;
                }
                if (lastX > -1 && extTextOutW.getX() - lastX > fudgeFactorX) {
                    buffer.append(" ");
                }
                String txt = extTextOutW.getText();
                buffer.append(txt);
                lastY = extTextOutW.getY();
                lastX = extTextOutW.getX();
            }
        }
        if (buffer.length() > 0) {
            xhtml.startElement("p");
            xhtml.characters(buffer.toString());
            xhtml.endElement("p");
        }
    } catch (RecordFormatException e) {
        //POI's hemfparser can throw these for "parse exceptions"
        throw new TikaException(e.getMessage(), e);
    } catch (RuntimeException e) {
        //wrap any other unchecked exception thrown while reading records in a TikaException as well
        throw new TikaException(e.getMessage(), e);
    }
    xhtml.endDocument();
}
Also used : TikaException(org.apache.tika.exception.TikaException) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) HemfRecord(org.apache.poi.hemf.record.HemfRecord) HemfCommentRecord(org.apache.poi.hemf.record.HemfCommentRecord) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) HemfText(org.apache.poi.hemf.record.HemfText) RecordFormatException(org.apache.poi.util.RecordFormatException) AbstractHemfComment(org.apache.poi.hemf.record.AbstractHemfComment) HemfCommentPublic(org.apache.poi.hemf.record.HemfCommentPublic) HemfExtractor(org.apache.poi.hemf.extractor.HemfExtractor)
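
The text handling in the parse method above starts a new paragraph whenever the Y coordinate changes and inserts a space when the horizontal gap between consecutive records exceeds a fixed threshold. A self-contained sketch of just that grouping logic is given below. TextRunGrouper and Run are hypothetical stand-ins and not POI or Tika classes, and the threshold of 1000 is simply the placeholder value that appears in the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only: group positioned text runs into paragraphs. */
final class TextRunGrouper {
    /** Hypothetical stand-in for a positioned text record. */
    static final class Run {
        final long x;
        final long y;
        final String text;

        Run(long x, long y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    static List<String> group(List<Run> runs, long gapThreshold) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        long lastX = -1;
        long lastY = -1;
        for (Run run : runs) {
            // a different Y coordinate means a new line: flush what was collected so far
            if (lastY > -1 && lastY != run.y) {
                paragraphs.add(buffer.toString());
                buffer.setLength(0);
                lastX = -1;
            }
            // a large horizontal jump within the same line is treated as a word break
            if (lastX > -1 && run.x - lastX > gapThreshold) {
                buffer.append(' ');
            }
            buffer.append(run.text);
            lastY = run.y;
            lastX = run.x;
        }
        if (buffer.length() > 0) {
            paragraphs.add(buffer.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run(0, 10, "Hello"),
                new Run(5000, 10, "world"),
                new Run(0, 20, "next"),
                new Run(100, 20, "line"));
        System.out.println(group(runs, 1000)); // prints [Hello world, nextline]
    }
}
```

Because the gap is measured between the starting X positions of consecutive runs rather than from the end of the previous run, a fixed threshold is only a rough heuristic, which matches the note in the example about deriving it from font or bounds information.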

Aggregations

TikaException (org.apache.tika.exception.TikaException): 144
IOException (java.io.IOException): 56
SAXException (org.xml.sax.SAXException): 44
InputStream (java.io.InputStream): 37
Metadata (org.apache.tika.metadata.Metadata): 35
TikaInputStream (org.apache.tika.io.TikaInputStream): 33
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler): 29
ParseContext (org.apache.tika.parser.ParseContext): 19
Test (org.junit.Test): 19
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 17
ContentHandler (org.xml.sax.ContentHandler): 17
CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream): 15
TemporaryResources (org.apache.tika.io.TemporaryResources): 15
MediaType (org.apache.tika.mime.MediaType): 14
Parser (org.apache.tika.parser.Parser): 14
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 13
ByteArrayInputStream (java.io.ByteArrayInputStream): 12
ArrayList (java.util.ArrayList): 11
File (java.io.File): 8
EmbeddedContentHandler (org.apache.tika.sax.EmbeddedContentHandler): 8