Example 96 with Metadata

use of org.apache.tika.metadata.Metadata in project tika by apache.

the class TestParsingExample method testRecursiveParserWrapperExample.

@Test
public void testRecursiveParserWrapperExample() throws IOException, SAXException, TikaException {
    List<Metadata> metadataList = parsingExample.recursiveParserWrapperExample();
    assertEquals("Number of embedded documents + 1 for the container document", 12, metadataList.size());
    Metadata m = metadataList.get(6);
    //this is the location of the embed3.txt text file within the outer .docx
    assertEquals("/embed1.zip/embed2.zip/embed3.zip/embed3.txt", m.get("X-TIKA:embedded_resource_path"));
    //it contains some html encoded content
    assertContains("When in the Course", m.get("X-TIKA:content"));
}
Also used : Metadata(org.apache.tika.metadata.Metadata) Test(org.junit.Test)

Example 97 with Metadata

use of org.apache.tika.metadata.Metadata in project tika by apache.

the class ChmParser method parsePage.

private void parsePage(byte[] byteObject, Parser htmlParser, ContentHandler xhtml, ParseContext context) throws TikaException {
    // throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    // -1
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml));
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlParser.parse(stream, handler, metadata, context);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ByteArrayInputStream(java.io.ByteArrayInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) IOException(java.io.IOException) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) ContentHandler(org.xml.sax.ContentHandler) SAXException(org.xml.sax.SAXException)

Example 98 with Metadata

use of org.apache.tika.metadata.Metadata in project tika by apache.

the class TSDParser method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    //Try to parse TSD file
    try (RereadableInputStream ris = new RereadableInputStream(stream, 2048, true, true)) {
        Metadata TSDAndEmbeddedMetadata = new Metadata();
        List<TSDMetas> tsdMetasList = this.extractMetas(ris);
        this.buildMetas(tsdMetasList, metadata != null && metadata.size() > 0 ? TSDAndEmbeddedMetadata : metadata);
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        ris.rewind();
        //Try to parse embedded file in TSD file
        this.parseTSDContent(ris, handler, TSDAndEmbeddedMetadata, context);
        xhtml.endDocument();
    }
}
Also used : RereadableInputStream(org.apache.tika.utils.RereadableInputStream) Metadata(org.apache.tika.metadata.Metadata) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler)
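
The parse method above reads the stream once to extract metadata, rewinds it, and then reads it again for the content. A minimal JDK-only sketch of that read-then-rewind pattern follows. It uses BufferedInputStream mark/reset as a stand-in for Tika's RereadableInputStream (an assumption made to keep the sketch free of dependencies); the class RewindSketch and the method readTwice are hypothetical names, and Java 11 or later is assumed for readNBytes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class RewindSketch {

    // Hypothetical helper: first pass reads up to headerLength bytes,
    // then the stream is rewound and a second pass reads everything.
    static String[] readTwice(InputStream raw, int headerLength) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        // Remember the start; the mark stays valid while at most headerLength bytes are read.
        in.mark(headerLength);
        byte[] header = in.readNBytes(headerLength);
        // Rewind to the marked position, analogous to ris.rewind() in the example above.
        in.reset();
        byte[] all = in.readAllBytes();
        return new String[] {
            new String(header, StandardCharsets.UTF_8),
            new String(all, StandardCharsets.UTF_8)
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "HEADERbody".getBytes(StandardCharsets.UTF_8);
        String[] passes = readTwice(new ByteArrayInputStream(data), 6);
        System.out.println(passes[0]);
        System.out.println(passes[1]);
    }
}
```

Unlike this sketch, the mark/reset approach only guarantees a rewind within the declared read limit, so it suits small header reads rather than rereading a whole large stream.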

Example 99 with Metadata

use of org.apache.tika.metadata.Metadata in project tika by apache.

the class TIAParsingExample method parseURLStream.

public static void parseURLStream(String address) throws Exception {
    Parser parser = new AutoDetectParser();
    ContentHandler handler = new DefaultHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    try (InputStream stream = new GZIPInputStream(new URL(address).openStream())) {
        parser.parse(stream, handler, metadata, context);
    }
}
Also used : GZIPInputStream(java.util.zip.GZIPInputStream) ByteArrayInputStream(java.io.ByteArrayInputStream) TikaInputStream(org.apache.tika.io.TikaInputStream) FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ContentHandler(org.xml.sax.ContentHandler) LinkContentHandler(org.apache.tika.sax.LinkContentHandler) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) URL(java.net.URL) Parser(org.apache.tika.parser.Parser) XMLParser(org.apache.tika.parser.xml.XMLParser) HtmlParser(org.apache.tika.parser.html.HtmlParser) TXTParser(org.apache.tika.parser.txt.TXTParser) CompositeParser(org.apache.tika.parser.CompositeParser) DefaultHandler(org.xml.sax.helpers.DefaultHandler)
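
The parseURLStream example wraps the raw stream in a GZIPInputStream before handing it to the parser, with a DefaultHandler receiving the SAX events. The sketch below shows the same wrapping pattern using only the JDK. A plain SAX parser stands in for Tika's AutoDetectParser, and the input is gzipped in memory instead of being fetched from a URL; both are assumptions made so the sketch runs without Tika or network access. The class GzipSaxSketch and its methods gzip and countElements are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class GzipSaxSketch {

    // Hypothetical helper: gzip a string in memory so no network access is needed.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Decompress the stream, feed it to a SAX parser, and count start-element events.
    static int countElements(InputStream compressed) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                count[0]++;
            }
        };
        // try-with-resources closes the decompressing stream, as in the example above.
        try (InputStream stream = new GZIPInputStream(compressed)) {
            SAXParserFactory.newInstance().newSAXParser().parse(stream, handler);
        }
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] gz = gzip("<a><b/><b/></a>");
        System.out.println(countElements(new ByteArrayInputStream(gz))); // prints 3
    }
}
```

With Tika on the classpath, the SAX parser call would presumably be replaced by parser.parse(stream, handler, metadata, context) as in the example above; a bare DefaultHandler discards all events, so a subclass like the one here is needed to observe anything.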

Example 100 with Metadata

use of org.apache.tika.metadata.Metadata in project tika by apache.

the class TIAParsingExample method useAutoDetectParser.

public static void useAutoDetectParser() throws Exception {
    InputStream stream = new ByteArrayInputStream(new byte[0]);
    ContentHandler handler = new DefaultHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    Parser parser = new AutoDetectParser();
    parser.parse(stream, handler, metadata, context);
}
Also used : ByteArrayInputStream(java.io.ByteArrayInputStream) GZIPInputStream(java.util.zip.GZIPInputStream) TikaInputStream(org.apache.tika.io.TikaInputStream) FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ContentHandler(org.xml.sax.ContentHandler) LinkContentHandler(org.apache.tika.sax.LinkContentHandler) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) DefaultHandler(org.xml.sax.helpers.DefaultHandler) Parser(org.apache.tika.parser.Parser) XMLParser(org.apache.tika.parser.xml.XMLParser) HtmlParser(org.apache.tika.parser.html.HtmlParser) TXTParser(org.apache.tika.parser.txt.TXTParser) CompositeParser(org.apache.tika.parser.CompositeParser)

Aggregations

Metadata (org.apache.tika.metadata.Metadata): 643
Test (org.junit.Test): 467
InputStream (java.io.InputStream): 318
ParseContext (org.apache.tika.parser.ParseContext): 281
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 268
TikaTest (org.apache.tika.TikaTest): 257
ContentHandler (org.xml.sax.ContentHandler): 228
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 151
ByteArrayInputStream (java.io.ByteArrayInputStream): 141
Parser (org.apache.tika.parser.Parser): 134
TikaInputStream (org.apache.tika.io.TikaInputStream): 131
IOException (java.io.IOException): 62
DefaultHandler (org.xml.sax.helpers.DefaultHandler): 59
TikaException (org.apache.tika.exception.TikaException): 46
ExcelParserTest (org.apache.tika.parser.microsoft.ExcelParserTest): 36
WordParserTest (org.apache.tika.parser.microsoft.WordParserTest): 36
StringWriter (java.io.StringWriter): 33
Tika (org.apache.tika.Tika): 28
FileInputStream (java.io.FileInputStream): 27
MediaType (org.apache.tika.mime.MediaType): 27