Example 6 with EmbeddedDocumentExtractor

Use of org.apache.tika.extractor.EmbeddedDocumentExtractor in the Apache Tika project.

Class OutlookPSTParser, method parse:

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    // Use the delegate parser to parse the contained document
    EmbeddedDocumentExtractor embeddedExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
    metadata.set(Metadata.CONTENT_TYPE, MS_OUTLOOK_PST_MIMETYPE.toString());
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    TikaInputStream in = TikaInputStream.get(stream);
    PSTFile pstFile = null;
    try {
        pstFile = new PSTFile(in.getFile().getPath());
        metadata.set(Metadata.CONTENT_LENGTH, valueOf(pstFile.getFileHandle().length()));
        boolean isValid = pstFile.getFileHandle().getFD().valid();
        metadata.set("isValid", valueOf(isValid));
        if (isValid) {
            parseFolder(xhtml, pstFile.getRootFolder(), embeddedExtractor);
        }
    } catch (Exception e) {
        throw new TikaException(e.getMessage(), e);
    } finally {
        if (pstFile != null && pstFile.getFileHandle() != null) {
            try {
                pstFile.getFileHandle().close();
            } catch (IOException e) {
                // swallow closing exception
            }
        }
    }
    xhtml.endDocument();
}
Also used : TikaException(org.apache.tika.exception.TikaException) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) PSTFile(com.pff.PSTFile) TikaInputStream(org.apache.tika.io.TikaInputStream) IOException(java.io.IOException) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) PSTException(com.pff.PSTException) SAXException(org.xml.sax.SAXException)

Example 7 with EmbeddedDocumentExtractor

Use of org.apache.tika.extractor.EmbeddedDocumentExtractor in the Apache Tika project.

Class EMFParser, method parse:

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    EmbeddedDocumentExtractor embeddedDocumentExtractor = null;
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    try {
        HemfExtractor ex = new HemfExtractor(stream);
        long lastY = -1;
        long lastX = -1;
        //TODO: derive this from the font or frame/bounds information instead of hardcoding it
        long fudgeFactorX = 1000;
        StringBuilder buffer = new StringBuilder();
        for (HemfRecord record : ex) {
            if (record.getRecordType() == HemfRecordType.comment) {
                AbstractHemfComment comment = ((HemfCommentRecord) record).getComment();
                if (comment instanceof HemfCommentPublic.MultiFormats) {
                    if (embeddedDocumentExtractor == null) {
                        embeddedDocumentExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
                    }
                    handleMultiFormats((HemfCommentPublic.MultiFormats) comment, xhtml, embeddedDocumentExtractor);
                } else if (comment instanceof HemfCommentPublic.WindowsMetafile) {
                    if (embeddedDocumentExtractor == null) {
                        embeddedDocumentExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
                    }
                    handleWMF((HemfCommentPublic.WindowsMetafile) comment, xhtml, embeddedDocumentExtractor);
                }
            } else if (record.getRecordType() == HemfRecordType.exttextoutw) {
                HemfText.ExtTextOutW extTextOutW = (HemfText.ExtTextOutW) record;
                if (lastY > -1 && lastY != extTextOutW.getY()) {
                    xhtml.startElement("p");
                    xhtml.characters(buffer.toString());
                    xhtml.endElement("p");
                    buffer.setLength(0);
                    lastX = -1;
                }
                if (lastX > -1 && extTextOutW.getX() - lastX > fudgeFactorX) {
                    buffer.append(" ");
                }
                String txt = extTextOutW.getText();
                buffer.append(txt);
                lastY = extTextOutW.getY();
                lastX = extTextOutW.getX();
            }
        }
        if (buffer.length() > 0) {
            xhtml.startElement("p");
            xhtml.characters(buffer.toString());
            xhtml.endElement("p");
        }
    } catch (RecordFormatException e) {
        //POI's hemfparser can throw these for "parse exceptions"
        throw new TikaException(e.getMessage(), e);
    } catch (RuntimeException e) {
        //convert any other RuntimeException to a TikaException
        throw new TikaException(e.getMessage(), e);
    }
    xhtml.endDocument();
}
Also used : TikaException(org.apache.tika.exception.TikaException) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) HemfRecord(org.apache.poi.hemf.record.HemfRecord) HemfCommentRecord(org.apache.poi.hemf.record.HemfCommentRecord) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) HemfText(org.apache.poi.hemf.record.HemfText) RecordFormatException(org.apache.poi.util.RecordFormatException) AbstractHemfComment(org.apache.poi.hemf.record.AbstractHemfComment) HemfCommentPublic(org.apache.poi.hemf.record.HemfCommentPublic) HemfExtractor(org.apache.poi.hemf.extractor.HemfExtractor)
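
The line-grouping logic in the method above can be tried out without POI. The following is a minimal sketch, assuming a hypothetical TextRun class (x, y, text) standing in for HemfText.ExtTextOutW. It starts a new line whenever Y changes and inserts a single space when the horizontal gap between runs exceeds a threshold. LineGrouper, TextRun, and groupIntoLines are illustrative names and are not part of Tika or POI.

```java
import java.util.ArrayList;
import java.util.List;

public class LineGrouper {

    // Hypothetical stand-in for HemfText.ExtTextOutW: a text run with its position.
    public static final class TextRun {
        public final long x;
        public final long y;
        public final String text;

        public TextRun(long x, long y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Groups runs into lines: a change in Y flushes the current line, and a
    // horizontal gap larger than gapThreshold inserts one space between runs.
    public static List<String> groupIntoLines(List<TextRun> runs, long gapThreshold) {
        List<String> lines = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        long lastX = -1;
        long lastY = -1;
        for (TextRun run : runs) {
            if (lastY > -1 && lastY != run.y) {
                lines.add(buffer.toString());
                buffer.setLength(0);
                lastX = -1;
            }
            if (lastX > -1 && run.x - lastX > gapThreshold) {
                buffer.append(' ');
            }
            buffer.append(run.text);
            lastX = run.x;
            lastY = run.y;
        }
        // Flush whatever is left after the last run.
        if (buffer.length() > 0) {
            lines.add(buffer.toString());
        }
        return lines;
    }
}
```

As in the original method, the gap is measured between the starting X coordinates of consecutive runs, not between the end of one run and the start of the next, so the threshold has to be tuned to the expected run width.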

Example 8 with EmbeddedDocumentExtractor

Use of org.apache.tika.extractor.EmbeddedDocumentExtractor in the Apache Tika project.

Class TNEFParser, method parse:

/**
 * Extracts properties and text from an MS Document input stream
 */
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    // We work by recursing, so get the appropriate bits
    EmbeddedDocumentExtractor embeddedExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
    // Ask POI to process the file for us
    HMEFMessage msg = new HMEFMessage(stream);
    // Set the message subject if known
    String subject = msg.getSubject();
    if (subject != null && subject.length() > 0) {
        // TODO: Move to title in Tika 2.0
        metadata.set(TikaCoreProperties.TRANSITION_SUBJECT_TO_DC_TITLE, subject);
    }
    // Recurse into the message body RTF
    MAPIAttribute attr = msg.getMessageMAPIAttribute(MAPIProperty.RTF_COMPRESSED);
    if (attr instanceof MAPIRtfAttribute) {
        MAPIRtfAttribute rtf = (MAPIRtfAttribute) attr;
        handleEmbedded("message.rtf", "application/rtf", rtf.getData(), embeddedExtractor, handler);
    }
    // Recurse into each attachment in turn
    for (Attachment attachment : msg.getAttachments()) {
        String name = attachment.getLongFilename();
        if (name == null || name.length() == 0) {
            name = attachment.getFilename();
        }
        if (name == null || name.length() == 0) {
            String ext = attachment.getExtension();
            if (ext != null) {
                name = "unknown" + ext;
            }
        }
        handleEmbedded(name, null, attachment.getContents(), embeddedExtractor, handler);
    }
}
Also used : HMEFMessage(org.apache.poi.hmef.HMEFMessage) MAPIRtfAttribute(org.apache.poi.hmef.attribute.MAPIRtfAttribute) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) MAPIAttribute(org.apache.poi.hmef.attribute.MAPIAttribute) Attachment(org.apache.poi.hmef.Attachment)
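
The attachment-naming fallback in the loop above (long filename, then short filename, then "unknown" plus the extension) can be isolated as a small helper. This is a minimal sketch and not code from Tika. AttachmentNames and resolveName are hypothetical names, and the sketch assumes the extension string already carries its leading dot.

```java
public class AttachmentNames {

    // Picks the first usable name: the long filename, then the short filename,
    // then "unknown" + extension. If no extension is available either, the
    // (null or empty) short filename is returned unchanged, as in the loop above.
    public static String resolveName(String longFilename, String filename, String extension) {
        String name = longFilename;
        if (name == null || name.isEmpty()) {
            name = filename;
        }
        if ((name == null || name.isEmpty()) && extension != null) {
            name = "unknown" + extension;
        }
        return name;
    }
}
```

A caller still has to handle a null or empty result, since an attachment may have no filename and no extension at all.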

Example 9 with EmbeddedDocumentExtractor

Use of org.apache.tika.extractor.EmbeddedDocumentExtractor in the Apache Tika project.

Class TSDParser, method parseTSDContent:

private void parseTSDContent(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) {
    CMSTimeStampedDataParser cmsTimeStampedDataParser = null;
    EmbeddedDocumentExtractor edx = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
    if (edx.shouldParseEmbedded(metadata)) {
        try {
            cmsTimeStampedDataParser = new CMSTimeStampedDataParser(stream);
            try (InputStream is = TikaInputStream.get(cmsTimeStampedDataParser.getContent())) {
                edx.parseEmbedded(is, handler, metadata, false);
            }
        } catch (Exception ex) {
            // pass the exception as the last argument so the stack trace is logged too
            LOG.error("Error in TSDParser.parseTSDContent {}", ex.getMessage(), ex);
        } finally {
            this.closeCMSParser(cmsTimeStampedDataParser);
        }
    }
}
Also used : EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) TikaInputStream(org.apache.tika.io.TikaInputStream) RereadableInputStream(org.apache.tika.utils.RereadableInputStream) InputStream(java.io.InputStream) CMSTimeStampedDataParser(org.bouncycastle.tsp.cms.CMSTimeStampedDataParser) TikaException(org.apache.tika.exception.TikaException) IOException(java.io.IOException) NoSuchAlgorithmException(java.security.NoSuchAlgorithmException) SAXException(org.xml.sax.SAXException) NoSuchProviderException(java.security.NoSuchProviderException)

Example 10 with EmbeddedDocumentExtractor

Use of org.apache.tika.extractor.EmbeddedDocumentExtractor in the Apache Tika project.

Class MockParser, method getEmbeddedDocumentExtractor:

protected EmbeddedDocumentExtractor getEmbeddedDocumentExtractor(ParseContext context) {
    EmbeddedDocumentExtractor extractor = context.get(EmbeddedDocumentExtractor.class);
    if (extractor == null) {
        Parser p = context.get(Parser.class);
        if (p == null) {
            context.set(Parser.class, new MockParser());
        }
        extractor = new ParsingEmbeddedDocumentExtractor(context);
    }
    return extractor;
}
Also used : ParsingEmbeddedDocumentExtractor(org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) Parser(org.apache.tika.parser.Parser) AbstractParser(org.apache.tika.parser.AbstractParser)
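
The lookup-with-fallback pattern in the method above relies on a context keyed by class. The following is a minimal sketch of that idea using only the standard library. TypedContext is a hypothetical class that models the get/set calls seen above; it is not Tika's ParseContext implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class TypedContext {

    // Entries are keyed by the class they were registered under.
    private final Map<Class<?>, Object> entries = new HashMap<>();

    public <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Returns the registered instance, or null when nothing was registered.
    public <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }

    // Returns the registered instance, or the supplied fallback when absent.
    public <T> T getOrDefault(Class<T> key, T fallback) {
        T value = get(key);
        return value != null ? value : fallback;
    }
}
```

Keying by class keeps the lookup type-safe without casts at the call site, which is why the method above can ask the context for EmbeddedDocumentExtractor.class and fall back to a default only when the result is null.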

Aggregations

EmbeddedDocumentExtractor (org.apache.tika.extractor.EmbeddedDocumentExtractor): 15
Metadata (org.apache.tika.metadata.Metadata): 9
TikaException (org.apache.tika.exception.TikaException): 8
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler): 8
TikaInputStream (org.apache.tika.io.TikaInputStream): 6
InputStream (java.io.InputStream): 4
ByteArrayInputStream (java.io.ByteArrayInputStream): 3
IOException (java.io.IOException): 3
CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream): 3
BufferedInputStream (java.io.BufferedInputStream): 2
ParsingEmbeddedDocumentExtractor (org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor): 2
ParseContext (org.apache.tika.parser.ParseContext): 2
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 2
ContentHandler (org.xml.sax.ContentHandler): 2
SAXException (org.xml.sax.SAXException): 2
Archive (com.github.junrar.Archive): 1
RarException (com.github.junrar.exception.RarException): 1
FileHeader (com.github.junrar.rarfile.FileHeader): 1
PSTException (com.pff.PSTException): 1
PSTFile (com.pff.PSTFile): 1