
Example 6 with ParseContext

Use of org.apache.tika.parser.ParseContext in project tika by apache.

In the class TestParsers, method testEXCELExtraction.

@Test
public void testEXCELExtraction() throws Exception {
    final String expected = "Numbers and their Squares";
    File file = getResourceAsFile("/test-documents/testEXCEL.xls");
    String s1 = tika.parseToString(file);
    assertTrue("Text does not contain '" + expected + "'", s1.contains(expected));
    Parser parser = tika.getParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = new FileInputStream(file)) {
        parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
    }
    assertEquals("Simple Excel document", metadata.get(TikaCoreProperties.TITLE));
}
Also used : FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) File(java.io.File) Parser(org.apache.tika.parser.Parser) DefaultHandler(org.xml.sax.helpers.DefaultHandler) Test(org.junit.Test)

Example 7 with ParseContext

Use of org.apache.tika.parser.ParseContext in project tika by apache.

In the class TestParsers, method testWORDxtraction.

@Test
public void testWORDxtraction() throws Exception {
    File file = getResourceAsFile("/test-documents/testWORD.doc");
    Parser parser = tika.getParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = new FileInputStream(file)) {
        parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
    }
    assertEquals("Sample Word Document", metadata.get(TikaCoreProperties.TITLE));
}
Also used : FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) File(java.io.File) Parser(org.apache.tika.parser.Parser) DefaultHandler(org.xml.sax.helpers.DefaultHandler) Test(org.junit.Test)

Example 8 with ParseContext

Use of org.apache.tika.parser.ParseContext in project tika by apache.

In the class EmbeddedDocumentUtilTest, method testAutomaticAdditionOfAutoDetectParserIfForgotten.

@Test
public void testAutomaticAdditionOfAutoDetectParserIfForgotten() throws Exception {
    String needle = "When in the Course";
    //TIKA-2096
    TikaTest.XMLResult xmlResult = getXML("test_recursive_embedded.doc", new ParseContext());
    assertContains(needle, xmlResult.xml);
    ParseContext context = new ParseContext();
    context.set(Parser.class, new EmptyParser());
    xmlResult = getXML("test_recursive_embedded.doc", context);
    assertNotContained(needle, xmlResult.xml);
}
Also used : ParseContext(org.apache.tika.parser.ParseContext) EmptyParser(org.apache.tika.parser.EmptyParser) TikaTest(org.apache.tika.TikaTest) Test(org.junit.Test)
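
The test in Example 8 depends on the context behaving as a type-keyed lookup: the component handling embedded documents asks the context for an entry by class and falls back to a default when none was set. The following is a minimal sketch of that idea using only JDK classes. It is not Tika's implementation; TypedContextSketch, EmbeddedHandler and parseOuter are hypothetical names introduced for illustration, and the fallback behaviour is an assumption modelled on what the test asserts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a type-keyed context. NOT Tika's ParseContext.
class TypedContextSketch {
    // Entries are stored under the fully qualified name of the key class.
    private final Map<String, Object> entries = new HashMap<>();

    // Registers (or, for a null value, removes) the entry for the given key class.
    <T> void set(Class<T> key, T value) {
        if (value != null) {
            entries.put(key.getName(), value);
        } else {
            entries.remove(key.getName());
        }
    }

    // Returns the entry for the key class, or null if none was set.
    <T> T get(Class<T> key) {
        return key.cast(entries.get(key.getName()));
    }

    // Returns the entry for the key class, or the supplied default.
    <T> T get(Class<T> key, T defaultValue) {
        T value = get(key);
        return value != null ? value : defaultValue;
    }

    // Hypothetical stand-in for "the parser used on embedded documents".
    interface EmbeddedHandler {
        String handle(String embedded);
    }

    // Assumed behaviour: with no handler in the context, embedded text is kept.
    static String parseOuter(String outer, String embedded, TypedContextSketch context) {
        EmbeddedHandler fallback = text -> text;
        EmbeddedHandler handler = context.get(EmbeddedHandler.class, fallback);
        return outer + handler.handle(embedded);
    }

    public static void main(String[] args) {
        TypedContextSketch context = new TypedContextSketch();
        // Nothing registered: the fallback handler keeps the embedded text.
        System.out.println(parseOuter("outer", " + embedded", context)); // prints "outer + embedded"

        // An explicitly registered "empty" handler suppresses the embedded text.
        context.set(EmbeddedHandler.class, text -> "");
        System.out.println(parseOuter("outer", " + embedded", context)); // prints "outer"
    }
}
```

In this sketch an empty context keeps the embedded text, while registering a handler that returns nothing drops it, which mirrors the assertContains / assertNotContained pair in the test above.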

Example 9 with ParseContext

Use of org.apache.tika.parser.ParseContext in project tika by apache.

In the class MyFirstTika, method parseUsingAutoDetect.

public static String parseUsingAutoDetect(String filename, TikaConfig tikaConfig, Metadata metadata) throws Exception {
    System.out.println("Handling using AutoDetectParser: [" + filename + "]");
    AutoDetectParser parser = new AutoDetectParser(tikaConfig);
    ContentHandler handler = new BodyContentHandler();
    // Close the stream once parsing finishes so the file handle is not leaked.
    try (TikaInputStream stream = TikaInputStream.get(new File(filename), metadata)) {
        parser.parse(stream, handler, metadata, new ParseContext());
    }
    return handler.toString();
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ParseContext(org.apache.tika.parser.ParseContext) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) TikaInputStream(org.apache.tika.io.TikaInputStream) File(java.io.File) ContentHandler(org.xml.sax.ContentHandler)

Example 10 with ParseContext

Use of org.apache.tika.parser.ParseContext in project tika by apache.

In the class ParsingExample, method parseEmbeddedExample.

/**
 * This example shows how to extract content from the outer document and all
 * embedded documents. The key is to specify a {@link Parser} in the {@link ParseContext}.
 *
 * @return content, including from embedded documents
 * @throws IOException
 * @throws SAXException
 * @throws TikaException
 */
public String parseEmbeddedExample() throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) {
        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) AutoDetectParser(org.apache.tika.parser.AutoDetectParser)

Aggregations

ParseContext (org.apache.tika.parser.ParseContext): 336
Metadata (org.apache.tika.metadata.Metadata): 281
Test (org.junit.Test): 260
InputStream (java.io.InputStream): 195
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 195
TikaTest (org.apache.tika.TikaTest): 186
ContentHandler (org.xml.sax.ContentHandler): 163
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 117
Parser (org.apache.tika.parser.Parser): 107
ByteArrayInputStream (java.io.ByteArrayInputStream): 91
TikaInputStream (org.apache.tika.io.TikaInputStream): 77
DefaultHandler (org.xml.sax.helpers.DefaultHandler): 52
ExcelParserTest (org.apache.tika.parser.microsoft.ExcelParserTest): 31
WordParserTest (org.apache.tika.parser.microsoft.WordParserTest): 31
TikaException (org.apache.tika.exception.TikaException): 29
StringWriter (java.io.StringWriter): 26
IOException (java.io.IOException): 24
SAXException (org.xml.sax.SAXException): 24
CompositeParser (org.apache.tika.parser.CompositeParser): 22
FileInputStream (java.io.FileInputStream): 19