Example 81 with BodyContentHandler

use of org.apache.tika.sax.BodyContentHandler in project tika by apache.

the class ArParserTest method testArParsing.

@Test
public void testArParsing() throws Exception {
    Parser parser = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
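    // recursingContext is not declared in this snippet; it is presumably a ParseContext field of the enclosing test class.
    try (InputStream stream = ArParserTest.class.getResourceAsStream("/test-documents/testARofText.ar")) {
        parser.parse(stream, handler, metadata, recursingContext);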
    }
    assertEquals("application/x-archive", metadata.get(Metadata.CONTENT_TYPE));
    String content = handler.toString();
    assertContains("testTXT.txt", content);
    assertContains("Test d'indexation de Txt", content);
    assertContains("http://www.apache.org", content);
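    // handler and metadata are reused here, so text from the first archive is still present in the handler.
    try (InputStream stream = ArParserTest.class.getResourceAsStream("/test-documents/testARofSND.ar")) {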
        parser.parse(stream, handler, metadata, recursingContext);
    }
    assertEquals("application/x-archive", metadata.get(Metadata.CONTENT_TYPE));
    content = handler.toString();
    assertContains("testAU.au", content);
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) ContentHandler(org.xml.sax.ContentHandler) Parser(org.apache.tika.parser.Parser) Test(org.junit.Test)

Example 82 with BodyContentHandler

use of org.apache.tika.sax.BodyContentHandler in project tika by apache.

the class ChmParser method parsePage.

private void parsePage(byte[] byteObject, Parser htmlParser, ContentHandler xhtml, ParseContext context) throws TikaException {
    Metadata metadata = new Metadata();
    // Pass only the page's body events on to xhtml, and keep the nested
    // startDocument/endDocument events from reaching it.
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml));
    try (InputStream stream = new ByteArrayInputStream(byteObject)) {
        htmlParser.parse(stream, handler, metadata, context);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Deliberately ignored: pushback overflow from tagsoup
    }
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ByteArrayInputStream(java.io.ByteArrayInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) IOException(java.io.IOException) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) ContentHandler(org.xml.sax.ContentHandler) SAXException(org.xml.sax.SAXException)

Example 83 with BodyContentHandler

use of org.apache.tika.sax.BodyContentHandler in project tika by apache.

the class TIAParsingExample method testTeeContentHandler.

public static void testTeeContentHandler(String filename) throws Exception {
    InputStream stream = new ByteArrayInputStream(new byte[0]);
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    Parser parser = new AutoDetectParser();
    LinkContentHandler linkCollector = new LinkContentHandler();
    try (OutputStream output = new FileOutputStream(new File(filename))) {
        ContentHandler handler = new TeeContentHandler(new BodyContentHandler(output), linkCollector);
        parser.parse(stream, handler, metadata, context);
    }
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) GZIPInputStream(java.util.zip.GZIPInputStream) ByteArrayInputStream(java.io.ByteArrayInputStream) TikaInputStream(org.apache.tika.io.TikaInputStream) FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) OutputStream(java.io.OutputStream) FileOutputStream(java.io.FileOutputStream) Metadata(org.apache.tika.metadata.Metadata) ContentHandler(org.xml.sax.ContentHandler) LinkContentHandler(org.apache.tika.sax.LinkContentHandler) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) Parser(org.apache.tika.parser.Parser) XMLParser(org.apache.tika.parser.xml.XMLParser) HtmlParser(org.apache.tika.parser.html.HtmlParser) TXTParser(org.apache.tika.parser.txt.TXTParser) CompositeParser(org.apache.tika.parser.CompositeParser) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) ParseContext(org.apache.tika.parser.ParseContext) File(java.io.File)
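
Example 83 combines a text-writing handler and a link collector behind a single TeeContentHandler, so one parse feeds both. The sketch below illustrates that forwarding idea using only the JDK's SAX classes. It is not Tika's implementation: the names TeeSketch, Tee, TextCollector and LinkCollector are hypothetical stand-ins, and the sample markup is made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class TeeSketch {

    /** Hypothetical tee: forwards the events it receives to every delegate. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] delegates;

        Tee(DefaultHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) throws SAXException {
            for (DefaultHandler d : delegates) {
                d.startElement(uri, local, qName, atts);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler d : delegates) {
                d.characters(ch, start, length);
            }
        }
    }

    /** Accumulates character data, roughly the role of a text-extracting handler. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Records the href attribute of every <a> element. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            String href = atts.getValue("href");
            if ("a".equals(qName) && href != null) {
                links.add(href);
            }
        }
    }

    /** Parses well-formed XML text with the JDK's default SAX parser. */
    static void parse(String xml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }

    public static void main(String[] args) throws Exception {
        TextCollector text = new TextCollector();
        LinkCollector links = new LinkCollector();
        // One parse, two consumers.
        parse("<p>See <a href=\"http://www.apache.org\">Apache</a></p>", new Tee(text, links));
        System.out.println(text.text);
        System.out.println(links.links);
    }
}
```

The design point is that the input is read once: each handler sees the same event stream and keeps only what it cares about.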

Example 84 with BodyContentHandler

use of org.apache.tika.sax.BodyContentHandler in project tika by apache.

the class StringsParserTest method testParse.

@Test
public void testParse() throws Exception {
    assumeTrue(canRun());
    String resource = "/test-documents/testOCTET_header.dbase3";
    String[] content = { "CLASSNO", "TITLE", "ITEMNO", "LISTNO", "LISTDATE" };
    String[] met_attributes = { "min-len", "encoding", "strings:file_output" };
    StringsConfig stringsConfig = new StringsConfig();
    FileConfig fileConfig = new FileConfig();
    Parser parser = new StringsParser();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    context.set(StringsConfig.class, stringsConfig);
    context.set(FileConfig.class, fileConfig);
    try (InputStream stream = StringsParserTest.class.getResourceAsStream(resource)) {
        parser.parse(stream, handler, metadata, context);
    } catch (Exception e) {
        // The exception is only printed, so a parse failure surfaces later as a failed content assertion.
        e.printStackTrace();
    }
    // Content
    for (String word : content) {
        assertTrue(handler.toString().contains(word));
    }
    // Metadata
    // Note: the result of Arrays.equals is discarded, so this line asserts nothing as written.
    Arrays.equals(met_attributes, metadata.names());
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) ContentHandler(org.xml.sax.ContentHandler) ExternalParser(org.apache.tika.parser.external.ExternalParser) Parser(org.apache.tika.parser.Parser) Test(org.junit.Test)
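
The last statement of Example 84 calls Arrays.equals without using its return value, so the metadata names are never actually checked. Arrays.equals is also sensitive to order and length. The sketch below shows one possible order-insensitive check. The helper containsAll and the class NamesCheckSketch are hypothetical, and the "actual" names are invented for illustration, since the snippet does not show which names the parser returns or in what order.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class NamesCheckSketch {

    /** Returns true if every expected name occurs in actual, regardless of order or extra entries. */
    static boolean containsAll(String[] expected, String[] actual) {
        Set<String> actualSet = new HashSet<>(Arrays.asList(actual));
        return actualSet.containsAll(Arrays.asList(expected));
    }

    public static void main(String[] args) {
        String[] expected = { "min-len", "encoding", "strings:file_output" };
        // Hypothetical names: same entries in a different order, plus one extra key.
        String[] actual = { "encoding", "Content-Type", "strings:file_output", "min-len" };

        // Element-by-element comparison: false here because order and length differ.
        System.out.println(Arrays.equals(expected, actual));
        // Set-based comparison: true because all expected names are present.
        System.out.println(containsAll(expected, actual));
    }
}
```

In a test, a result like this would be wrapped in an assertion such as assertTrue; whether a subset check or an exact match is the right expectation depends on what the parser is meant to guarantee.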

Example 85 with BodyContentHandler

use of org.apache.tika.sax.BodyContentHandler in project tika by apache.

the class TXTParserTest method testEmptyText.

@Test
public void testEmptyText() throws Exception {
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    // parser is not declared in this snippet; it is presumably a field of the enclosing test class.
    parser.parse(new ByteArrayInputStream(new byte[0]), handler, metadata, new ParseContext());
    assertEquals("text/plain; charset=UTF-8", metadata.get(Metadata.CONTENT_TYPE));
    assertEquals("\n", handler.toString());
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ByteArrayInputStream(java.io.ByteArrayInputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) WriteOutContentHandler(org.apache.tika.sax.WriteOutContentHandler) ContentHandler(org.xml.sax.ContentHandler) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)

Aggregations

BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 261
Metadata (org.apache.tika.metadata.Metadata): 252
Test (org.junit.Test): 213
ContentHandler (org.xml.sax.ContentHandler): 206
InputStream (java.io.InputStream): 194
ParseContext (org.apache.tika.parser.ParseContext): 176
TikaTest (org.apache.tika.TikaTest): 117
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 92
Parser (org.apache.tika.parser.Parser): 84
ByteArrayInputStream (java.io.ByteArrayInputStream): 66
TikaInputStream (org.apache.tika.io.TikaInputStream): 66
TikaException (org.apache.tika.exception.TikaException): 25
ExcelParserTest (org.apache.tika.parser.microsoft.ExcelParserTest): 24
WordParserTest (org.apache.tika.parser.microsoft.WordParserTest): 24
IOException (java.io.IOException): 23
EmptyParser (org.apache.tika.parser.EmptyParser): 15
OfficeParser (org.apache.tika.parser.microsoft.OfficeParser): 15
SAXException (org.xml.sax.SAXException): 15
MediaType (org.apache.tika.mime.MediaType): 11
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler): 10