Example 71 with XHTMLContentHandler

Use of org.apache.tika.sax.XHTMLContentHandler in the Apache Tika project.

Class AbstractDBParser, method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    connection = getConnection(stream, metadata, context);
    XHTMLContentHandler xHandler = null;
    List<String> tableNames = null;
    try {
        tableNames = getTableNames(connection, metadata, context);
    } catch (SQLException e) {
        try {
            close();
        } catch (SQLException sqlE) {
            // swallow: a failure to close must not mask the original SQLException
        }
        throw new IOExceptionWithCause(e);
    }
    for (String tableName : tableNames) {
        //add table names to parent metadata
        metadata.add(Database.TABLE_NAME, tableName);
    }
    xHandler = new XHTMLContentHandler(handler, metadata);
    xHandler.startDocument();
    try {
        for (String tableName : tableNames) {
            JDBCTableReader tableReader = getTableReader(connection, tableName, context);
            xHandler.startElement("table", "name", tableReader.getTableName());
            xHandler.startElement("thead");
            xHandler.startElement("tr");
            for (String header : tableReader.getHeaders()) {
                xHandler.startElement("th");
                xHandler.characters(header);
                xHandler.endElement("th");
            }
            xHandler.endElement("tr");
            xHandler.endElement("thead");
            xHandler.startElement("tbody");
            while (tableReader.nextRow(xHandler, context)) {
                // no-op: nextRow is handed xHandler and does the per-row work itself
            }
            xHandler.endElement("tbody");
            xHandler.endElement("table");
        }
    } finally {
        try {
            close();
        } catch (IOException | SQLException e) {
            // swallow: a failure to close must not mask an exception from the loop above
        }
        if (xHandler != null) {
            xHandler.endDocument();
        }
    }
}
Also used: IOExceptionWithCause (org.apache.commons.io.IOExceptionWithCause), SQLException (java.sql.SQLException), IOException (java.io.IOException), XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler)
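
The element nesting that the parse method above produces (table, thead, tr, th, tbody) can be tried out without Tika or a database, using only the SAX classes that ship with the JDK. The following is a minimal sketch under stated assumptions, not Tika's implementation: the class TableSaxSketch, its helper methods, and the sample data are hypothetical, and the tr/td markup for body rows is an assumption, since the excerpt does not show what JDBCTableReader.nextRow emits.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical stand-in for the table-writing part of the example above.
public class TableSaxSketch {

    private static final AttributesImpl NO_ATTRS = new AttributesImpl();

    // Opens an element without attributes and without a namespace.
    static void start(TransformerHandler h, String name) throws SAXException {
        h.startElement("", name, name, NO_ATTRS);
    }

    // Closes an element opened by start(...).
    static void end(TransformerHandler h, String name) throws SAXException {
        h.endElement("", name, name);
    }

    // Writes an element that contains only text, such as th or td.
    static void cell(TransformerHandler h, String name, String text) throws SAXException {
        start(h, name);
        h.characters(text.toCharArray(), 0, text.length());
        end(h, name);
    }

    // Serializes one table as markup by firing SAX events in the same order
    // as the example: table, thead/tr/th, then tbody.
    public static String render(String tableName, List<String> headers,
                                List<List<String>> rows) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler h = factory.newTransformerHandler();
        h.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        h.setResult(new StreamResult(out));

        h.startDocument();
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "name", "name", "CDATA", tableName);
        h.startElement("", "table", "table", attrs);

        start(h, "thead");
        start(h, "tr");
        for (String header : headers) {
            cell(h, "th", header);
        }
        end(h, "tr");
        end(h, "thead");

        start(h, "tbody");
        for (List<String> row : rows) {
            // Assumption: one tr per row and one td per column value.
            start(h, "tr");
            for (String value : row) {
                cell(h, "td", value);
            }
            end(h, "tr");
        }
        end(h, "tbody");

        end(h, "table");
        h.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("people",
                List.of("id", "name"),
                List.of(List.of("1", "Ada"))));
    }
}
```

Every start call is paired with an end call in reverse order, which mirrors how the example balances startElement and endElement around each table. XHTMLContentHandler offers shorter overloads for the same events, as the example shows; the sketch spells out the underlying SAX calls instead.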

Example 72 with XHTMLContentHandler

Use of org.apache.tika.sax.XHTMLContentHandler in the Apache Tika project.

Class JpegParser, method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    TemporaryResources tmp = new TemporaryResources();
    try {
        TikaInputStream tis = TikaInputStream.get(stream, tmp);
        new ImageMetadataExtractor(metadata).parseJpeg(tis.getFile());
        new JempboxExtractor(metadata).parse(tis);
    } finally {
        tmp.dispose();
    }
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    xhtml.endDocument();
}
Also used: JempboxExtractor (org.apache.tika.parser.image.xmp.JempboxExtractor), ImageMetadataExtractor (org.apache.tika.parser.image.ImageMetadataExtractor), TemporaryResources (org.apache.tika.io.TemporaryResources), TikaInputStream (org.apache.tika.io.TikaInputStream), XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler)

Aggregations

XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler): 72
TikaException (org.apache.tika.exception.TikaException): 26
TikaInputStream (org.apache.tika.io.TikaInputStream): 22
TemporaryResources (org.apache.tika.io.TemporaryResources): 14
CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream): 13
IOException (java.io.IOException): 12
SAXException (org.xml.sax.SAXException): 9
File (java.io.File): 6
EmbeddedDocumentExtractor (org.apache.tika.extractor.EmbeddedDocumentExtractor): 6
Metadata (org.apache.tika.metadata.Metadata): 6
BufferedInputStream (java.io.BufferedInputStream): 5
InputStream (java.io.InputStream): 5
EmbeddedContentHandler (org.apache.tika.sax.EmbeddedContentHandler): 5
ByteArrayInputStream (java.io.ByteArrayInputStream): 4
Charset (java.nio.charset.Charset): 4
ArrayList (java.util.ArrayList): 4
Map (java.util.Map): 4
MediaType (org.apache.tika.mime.MediaType): 4
OfflineContentHandler (org.apache.tika.sax.OfflineContentHandler): 4
InputStreamReader (java.io.InputStreamReader): 3