Search in sources :

Example 76 with TikaException

use of org.apache.tika.exception.TikaException in project tika by apache.

the class SummaryExtractor method parseSummaryEntryIfExists.

private void parseSummaryEntryIfExists(DirectoryNode root, String entryName) throws IOException, TikaException {
    try {
        DocumentEntry entry = (DocumentEntry) root.getEntry(entryName);
        PropertySet properties = new PropertySet(new DocumentInputStream(entry));
        if (properties.isSummaryInformation()) {
            parse(new SummaryInformation(properties));
        }
        if (properties.isDocumentSummaryInformation()) {
            parse(new DocumentSummaryInformation(properties));
        }
    } catch (FileNotFoundException e) {
    // entry does not exist, just skip it
    } catch (NoPropertySetStreamException e) {
    // no property stream, just skip it
    } catch (UnexpectedPropertySetTypeException e) {
        throw new TikaException("Unexpected HPSF document", e);
    } catch (MarkUnsupportedException e) {
        throw new TikaException("Invalid DocumentInputStream", e);
    } catch (Exception e) {
        LOG.warn("Ignoring unexpected exception while parsing summary entry {}", entryName, e);
    }
}
Also used : TikaException(org.apache.tika.exception.TikaException) SummaryInformation(org.apache.poi.hpsf.SummaryInformation) DocumentSummaryInformation(org.apache.poi.hpsf.DocumentSummaryInformation) DocumentEntry(org.apache.poi.poifs.filesystem.DocumentEntry) FileNotFoundException(java.io.FileNotFoundException) PropertySet(org.apache.poi.hpsf.PropertySet) DocumentSummaryInformation(org.apache.poi.hpsf.DocumentSummaryInformation) NoPropertySetStreamException(org.apache.poi.hpsf.NoPropertySetStreamException) DocumentInputStream(org.apache.poi.poifs.filesystem.DocumentInputStream) UnexpectedPropertySetTypeException(org.apache.poi.hpsf.UnexpectedPropertySetTypeException) MarkUnsupportedException(org.apache.poi.hpsf.MarkUnsupportedException) NoPropertySetStreamException(org.apache.poi.hpsf.NoPropertySetStreamException) UnexpectedPropertySetTypeException(org.apache.poi.hpsf.UnexpectedPropertySetTypeException) TikaException(org.apache.tika.exception.TikaException) MarkUnsupportedException(org.apache.poi.hpsf.MarkUnsupportedException) IOException(java.io.IOException) FileNotFoundException(java.io.FileNotFoundException)

Example 77 with TikaException

use of org.apache.tika.exception.TikaException in project tika by apache.

the class WMFParser method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    try {
        HwmfPicture picture = new HwmfPicture(stream);
        //to determine when to keep two text parts on the same line
        for (HwmfRecord record : picture.getRecords()) {
            Charset charset = LocaleUtil.CHARSET_1252;
            //This fix should be done within POI
            if (record.getRecordType().equals(HwmfRecordType.createFontIndirect)) {
                HwmfFont font = ((HwmfText.WmfCreateFontIndirect) record).getFont();
                charset = (font.getCharSet() == null || font.getCharSet().getCharset() == null) ? LocaleUtil.CHARSET_1252 : font.getCharSet().getCharset();
            }
            if (record.getRecordType().equals(HwmfRecordType.extTextOut)) {
                HwmfText.WmfExtTextOut textOut = (HwmfText.WmfExtTextOut) record;
                xhtml.startElement("p");
                xhtml.characters(textOut.getText(charset));
                xhtml.endElement("p");
            } else if (record.getRecordType().equals(HwmfRecordType.textOut)) {
                HwmfText.WmfTextOut textOut = (HwmfText.WmfTextOut) record;
                xhtml.startElement("p");
                xhtml.characters(textOut.getText(charset));
                xhtml.endElement("p");
            }
        }
    } catch (RecordFormatException e) {
        //POI's hwmfparser can throw these for "parse exceptions"
        throw new TikaException(e.getMessage(), e);
    } catch (RuntimeException e) {
        //convert Runtime to RecordFormatExceptions
        throw new TikaException(e.getMessage(), e);
    } catch (AssertionError e) {
        //POI's hwmfparser can throw these for parse exceptions
        throw new TikaException(e.getMessage(), e);
    }
    xhtml.endDocument();
}
Also used : TikaException(org.apache.tika.exception.TikaException) HwmfRecord(org.apache.poi.hwmf.record.HwmfRecord) Charset(java.nio.charset.Charset) HwmfText(org.apache.poi.hwmf.record.HwmfText) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) HwmfFont(org.apache.poi.hwmf.record.HwmfFont) HwmfPicture(org.apache.poi.hwmf.usermodel.HwmfPicture) RecordFormatException(org.apache.poi.util.RecordFormatException)

Example 78 with TikaException

use of org.apache.tika.exception.TikaException in project tika by apache.

the class ExcelExtractor method parse.

protected void parse(DirectoryNode root, XHTMLContentHandler xhtml, Locale locale) throws IOException, SAXException, TikaException {
    if (!root.hasEntry(WORKBOOK_ENTRY)) {
        if (root.hasEntry(BOOK_ENTRY)) {
            // Excel 5 / Excel 95 file
            // Records are in a different structure so needs a
            //  different parser to process them
            OldExcelExtractor extractor = new OldExcelExtractor(root);
            OldExcelParser.parse(extractor, xhtml);
            return;
        } else {
            // Corrupt file / very old file, just skip text extraction
            return;
        }
    }
    // If a password was supplied, use it, otherwise the default
    Biff8EncryptionKey.setCurrentUserPassword(getPassword());
    // Have the file processed in event mode
    TikaHSSFListener listener = new TikaHSSFListener(xhtml, locale, this);
    listener.processFile(root, isListenForAllRecords());
    listener.throwStoredException();
    for (Entry entry : root) {
        if (entry.getName().startsWith("MBD") && entry instanceof DirectoryEntry) {
            try {
                handleEmbeddedOfficeDoc((DirectoryEntry) entry, xhtml);
            } catch (TikaException e) {
            // ignore parse errors from embedded documents
            }
        }
    }
}
Also used : OldExcelExtractor(org.apache.poi.hssf.extractor.OldExcelExtractor) Entry(org.apache.poi.poifs.filesystem.Entry) DirectoryEntry(org.apache.poi.poifs.filesystem.DirectoryEntry) TikaException(org.apache.tika.exception.TikaException) DirectoryEntry(org.apache.poi.poifs.filesystem.DirectoryEntry)

Example 79 with TikaException

use of org.apache.tika.exception.TikaException in project tika by apache.

the class HSLFExtractor method handleSlideEmbeddedResources.

private void handleSlideEmbeddedResources(HSLFSlide slide, XHTMLContentHandler xhtml) throws TikaException, SAXException, IOException {
    List<HSLFShape> shapes;
    try {
        shapes = slide.getShapes();
    } catch (NullPointerException e) {
        // Sometimes HSLF hits problems
        // Please open POI bugs for any you come across!
        EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
        return;
    }
    for (HSLFShape shape : shapes) {
        if (shape instanceof OLEShape) {
            OLEShape oleShape = (OLEShape) shape;
            HSLFObjectData data = null;
            try {
                data = oleShape.getObjectData();
            } catch (NullPointerException e) {
                /* getObjectData throws NPE some times. */
                EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                continue;
            }
            if (data != null) {
                String objID = Integer.toString(oleShape.getObjectID());
                // Embedded Object: add a <div
                // class="embedded" id="X"/> so consumer can see where
                // in the main text each embedded document
                // occurred:
                AttributesImpl attributes = new AttributesImpl();
                attributes.addAttribute("", "class", "class", "CDATA", "embedded");
                attributes.addAttribute("", "id", "id", "CDATA", objID);
                xhtml.startElement("div", attributes);
                xhtml.endElement("div");
                InputStream dataStream = null;
                try {
                    dataStream = data.getData();
                } catch (Exception e) {
                    EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                    continue;
                }
                try (TikaInputStream stream = TikaInputStream.get(dataStream)) {
                    String mediaType = null;
                    if ("Excel.Chart.8".equals(oleShape.getProgID())) {
                        mediaType = "application/vnd.ms-excel";
                    } else {
                        MediaType mt = getTikaConfig().getDetector().detect(stream, new Metadata());
                        mediaType = mt.toString();
                    }
                    if (mediaType.equals("application/x-tika-msoffice-embedded; format=comp_obj")) {
                        try (NPOIFSFileSystem npoifs = new NPOIFSFileSystem(new CloseShieldInputStream(stream))) {
                            handleEmbeddedOfficeDoc(npoifs.getRoot(), objID, xhtml);
                        }
                    } else {
                        handleEmbeddedResource(stream, objID, objID, mediaType, xhtml, false);
                    }
                } catch (IOException e) {
                    EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                }
            }
        }
    }
}
Also used : TikaInputStream(org.apache.tika.io.TikaInputStream) CloseShieldInputStream(org.apache.tika.io.CloseShieldInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) TikaInputStream(org.apache.tika.io.TikaInputStream) IOException(java.io.IOException) HSLFObjectData(org.apache.poi.hslf.usermodel.HSLFObjectData) OLEShape(org.apache.poi.hslf.model.OLEShape) TikaException(org.apache.tika.exception.TikaException) IOException(java.io.IOException) SAXException(org.xml.sax.SAXException) NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) HSLFShape(org.apache.poi.hslf.usermodel.HSLFShape) AttributesImpl(org.xml.sax.helpers.AttributesImpl) MediaType(org.apache.tika.mime.MediaType) CloseShieldInputStream(org.apache.tika.io.CloseShieldInputStream)

Example 80 with TikaException

use of org.apache.tika.exception.TikaException in project tika by apache.

the class JackcessExtractor method handleCompoundContent.

private void handleCompoundContent(OleBlob.CompoundContent cc, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException {
    InputStream is = null;
    NPOIFSFileSystem nfs = null;
    try {
        try {
            is = cc.getStream();
        } catch (IOException e) {
            EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
            return;
        }
        try {
            nfs = new NPOIFSFileSystem(is);
        } catch (Exception e) {
            EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
            return;
        }
        handleEmbeddedOfficeDoc(nfs.getRoot(), xhtml);
    } finally {
        if (nfs != null) {
            try {
                nfs.close();
            } catch (IOException e) {
            //swallow
            }
        }
        if (is != null) {
            IOUtils.closeQuietly(is);
        }
    }
}
Also used : NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) ByteArrayInputStream(java.io.ByteArrayInputStream) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) IOException(java.io.IOException) TikaException(org.apache.tika.exception.TikaException) IOException(java.io.IOException) SAXException(org.xml.sax.SAXException)

Aggregations

TikaException (org.apache.tika.exception.TikaException)144 IOException (java.io.IOException)56 SAXException (org.xml.sax.SAXException)44 InputStream (java.io.InputStream)37 Metadata (org.apache.tika.metadata.Metadata)35 TikaInputStream (org.apache.tika.io.TikaInputStream)33 XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler)29 ParseContext (org.apache.tika.parser.ParseContext)19 Test (org.junit.Test)19 BodyContentHandler (org.apache.tika.sax.BodyContentHandler)17 ContentHandler (org.xml.sax.ContentHandler)17 CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream)15 TemporaryResources (org.apache.tika.io.TemporaryResources)15 MediaType (org.apache.tika.mime.MediaType)14 Parser (org.apache.tika.parser.Parser)14 AutoDetectParser (org.apache.tika.parser.AutoDetectParser)13 ByteArrayInputStream (java.io.ByteArrayInputStream)12 ArrayList (java.util.ArrayList)11 File (java.io.File)8 EmbeddedContentHandler (org.apache.tika.sax.EmbeddedContentHandler)8