Example 86 with ParseContext

use of org.apache.tika.parser.ParseContext in project tika by apache.

the class BundleIT method testForkParser.

@Test
public void testForkParser() throws Exception {
    ForkParser parser = new ForkParser(Activator.class.getClassLoader(), defaultParser);
    String data = "<!DOCTYPE html>\n<html><body><p>test <span>content</span></p></body></html>";
    InputStream stream = new ByteArrayInputStream(data.getBytes(UTF_8));
    Writer writer = new StringWriter();
    ContentHandler contentHandler = new BodyContentHandler(writer);
    Metadata metadata = new Metadata();
    MediaType type = contentTypeDetector.detect(stream, metadata);
    // JUnit expects (expected, actual); the original had the arguments swapped.
    assertEquals("text/html", type.toString());
    metadata.add(Metadata.CONTENT_TYPE, type.toString());
    ParseContext parseCtx = new ParseContext();
    parser.parse(stream, contentHandler, metadata, parseCtx);
    writer.flush();
    String content = writer.toString();
    assertTrue(content.length() > 0);
    assertEquals("test content", content.trim());
}
Also used : ForkParser(org.apache.tika.fork.ForkParser) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) StringWriter(java.io.StringWriter) Activator(org.apache.tika.parser.internal.Activator) ByteArrayInputStream(java.io.ByteArrayInputStream) JarInputStream(java.util.jar.JarInputStream) FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) MediaType(org.apache.tika.mime.MediaType) Writer(java.io.Writer) ContentHandler(org.xml.sax.ContentHandler) Test(org.junit.Test)

Example 87 with ParseContext

use of org.apache.tika.parser.ParseContext in project tika by apache.

the class BundleIT method testTikaBundle.

@Test
public void testTikaBundle() throws Exception {
    Tika tika = new Tika();
    // Package extraction
    ContentHandler handler = new BodyContentHandler();
    Parser parser = tika.getParser();
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    try (InputStream stream = new FileInputStream("src/test/resources/test-documents.zip")) {
        parser.parse(stream, handler, new Metadata(), context);
    }
    String content = handler.toString();
    assertTrue(content.contains("testEXCEL.xls"));
    assertTrue(content.contains("Sample Excel Worksheet"));
    assertTrue(content.contains("testHTML.html"));
    assertTrue(content.contains("Test Indexation Html"));
    assertTrue(content.contains("testOpenOffice2.odt"));
    assertTrue(content.contains("This is a sample Open Office document"));
    assertTrue(content.contains("testPDF.pdf"));
    assertTrue(content.contains("Apache Tika"));
    assertTrue(content.contains("testPPT.ppt"));
    assertTrue(content.contains("Sample Powerpoint Slide"));
    assertTrue(content.contains("testRTF.rtf"));
    assertTrue(content.contains("indexation Word"));
    assertTrue(content.contains("testTXT.txt"));
    assertTrue(content.contains("Test d'indexation de Txt"));
    assertTrue(content.contains("testWORD.doc"));
    assertTrue(content.contains("This is a sample Microsoft Word Document"));
    assertTrue(content.contains("testXML.xml"));
    assertTrue(content.contains("Rida Benjelloun"));
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ByteArrayInputStream(java.io.ByteArrayInputStream) JarInputStream(java.util.jar.JarInputStream) FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) ParseContext(org.apache.tika.parser.ParseContext) Metadata(org.apache.tika.metadata.Metadata) Tika(org.apache.tika.Tika) ContentHandler(org.xml.sax.ContentHandler) Parser(org.apache.tika.parser.Parser) CompositeParser(org.apache.tika.parser.CompositeParser) DefaultParser(org.apache.tika.parser.DefaultParser) ForkParser(org.apache.tika.fork.ForkParser) TesseractOCRParser(org.apache.tika.parser.ocr.TesseractOCRParser) Test(org.junit.Test)
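
The call context.set(Parser.class, parser) in the example above registers an object under a class key so that code further down the parse can look it up by type. The sketch below illustrates that lookup pattern using only the JDK. TypedContextSketch and TypedContext are hypothetical names for a simplified stand-in; this is not Tika's ParseContext and makes no claim about how Tika implements it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a type-keyed context (not Tika's ParseContext). */
public class TypedContextSketch {

    /** Maps a class or interface to the single instance registered for it. */
    static final class TypedContext {
        private final Map<String, Object> entries = new HashMap<>();

        /** Registers value under key; a null value removes the entry. */
        <T> void set(Class<T> key, T value) {
            if (value == null) {
                entries.remove(key.getName());
            } else {
                entries.put(key.getName(), value);
            }
        }

        /** Returns the instance registered for key, or null if there is none. */
        <T> T get(Class<T> key) {
            return key.cast(entries.get(key.getName()));
        }

        /** Returns the instance registered for key, or defaultValue if there is none. */
        <T> T get(Class<T> key, T defaultValue) {
            T value = get(key);
            return value != null ? value : defaultValue;
        }
    }

    public static void main(String[] args) {
        TypedContext context = new TypedContext();
        // A String stands in for the parser object registered in the example above.
        context.set(CharSequence.class, "placeholder for a parser instance");
        System.out.println(context.get(CharSequence.class));
        // Nothing was registered for Integer, so the default is returned.
        System.out.println(context.get(Integer.class, 42));
    }
}
```

Keying on the class means a caller asks for "the Parser" or "the SAXParser" rather than for a string name, and the cast happens in one place.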

Example 88 with ParseContext

use of org.apache.tika.parser.ParseContext in project tika by apache.

the class TesseractOCRParser method extractHOCROutput.

private void extractHOCROutput(InputStream is, ParseContext parseContext, XHTMLContentHandler xhtml) throws TikaException, IOException, SAXException {
    if (parseContext == null) {
        parseContext = new ParseContext();
    }
    SAXParser parser = parseContext.getSAXParser();
    xhtml.startElement("div", "class", "ocr");
    parser.parse(is, new OfflineContentHandler(new HOCRPassThroughHandler(xhtml)));
    xhtml.endElement("div");
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) ParseContext(org.apache.tika.parser.ParseContext) SAXParser(javax.xml.parsers.SAXParser)

Example 89 with ParseContext

use of org.apache.tika.parser.ParseContext in project tika by apache.

the class TikaConfigSerializer method serialize.

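/**
 * Serializes the given config as an XML document and writes it to the writer.
 *
 * @param config config to serialize
 * @param mode serialization mode
 * @param writer writer that receives the XML
 * @param charset charset declared in the XML output
 * @throws Exception if the document cannot be built or written
 */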
public static void serialize(TikaConfig config, Mode mode, Writer writer, Charset charset) throws Exception {
    DocumentBuilder docBuilder = new ParseContext().getDocumentBuilder();
    // root elements
    Document doc = docBuilder.newDocument();
    Element rootElement = doc.createElement("properties");
    doc.appendChild(rootElement);
    addMimeComment(mode, rootElement, doc);
    addServiceLoader(mode, rootElement, doc, config);
    addExecutorService(mode, rootElement, doc, config);
    addEncodingDetectors(mode, rootElement, doc, config);
    addTranslator(mode, rootElement, doc, config);
    addDetectors(mode, rootElement, doc, config);
    addParsers(mode, rootElement, doc, config);
    // TODO Service Loader section
    // now write
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
    transformer.setOutputProperty(OutputKeys.ENCODING, charset.name());
    DOMSource source = new DOMSource(doc);
    StreamResult result = new StreamResult(writer);
    transformer.transform(source, result);
}
Also used : DOMSource(javax.xml.transform.dom.DOMSource) TransformerFactory(javax.xml.transform.TransformerFactory) Transformer(javax.xml.transform.Transformer) StreamResult(javax.xml.transform.stream.StreamResult) DocumentBuilder(javax.xml.parsers.DocumentBuilder) Element(org.w3c.dom.Element) ParseContext(org.apache.tika.parser.ParseContext) Document(org.w3c.dom.Document)
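
The build-then-transform sequence in the method above can be tried without Tika on the classpath. The following is a minimal sketch under that assumption: it obtains the DocumentBuilder from DocumentBuilderFactory instead of ParseContext.getDocumentBuilder(), and the class name DomSerializeSketch and the child element "parsers" are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical JDK-only sketch of building a DOM tree and serializing it. */
public class DomSerializeSketch {

    static String buildAndSerialize() {
        try {
            // Assumption: plain JAXP factory instead of Tika's ParseContext.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();

            // Root element plus one illustrative child.
            Element root = doc.createElement("properties");
            doc.appendChild(root);
            root.appendChild(doc.createElement("parsers"));

            // Identity transform: copies the DOM tree to the writer as text.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildAndSerialize());
    }
}
```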

Example 90 with ParseContext

use of org.apache.tika.parser.ParseContext in project tika by apache.

the class ImageMetadataExtractor method parseRawXMP.

public void parseRawXMP(byte[] xmpData) throws IOException, SAXException, TikaException {
    XMPMetadata xmp = null;
    try (InputStream decoded = new ByteArrayInputStream(xmpData)) {
        Document dom = new ParseContext().getDocumentBuilder().parse(decoded);
        if (dom != null) {
            xmp = new XMPMetadata(dom);
        }
    } catch (IOException | SAXException e) {
        // Ignore unparseable XMP: xmp stays null and the extraction below is skipped.
    }
    if (xmp != null) {
        JempboxExtractor.extractDublinCore(xmp, metadata);
        JempboxExtractor.extractXMPMM(xmp, metadata);
    }
}
Also used : XMPMetadata(org.apache.jempbox.xmp.XMPMetadata) ByteArrayInputStream(java.io.ByteArrayInputStream) InputStream(java.io.InputStream) ParseContext(org.apache.tika.parser.ParseContext) IOException(java.io.IOException) Document(org.w3c.dom.Document) SAXException(org.xml.sax.SAXException)
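
The method above treats malformed XMP as "no metadata" rather than as a failure. A minimal JDK-only sketch of that lenient parse step might look like the following. LenientXmlSketch and parseOrNull are hypothetical names, and the DocumentBuilder comes from DocumentBuilderFactory rather than from ParseContext, so any parser configuration that Tika applies is not reproduced here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

/** Hypothetical sketch: parse bytes into a DOM tree, returning null on bad input. */
public class LenientXmlSketch {

    static Document parseOrNull(byte[] xml) {
        DocumentBuilder builder;
        try {
            // Assumption: plain JAXP factory instead of Tika's ParseContext.
            builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable XML parser", e);
        }
        try (InputStream in = new ByteArrayInputStream(xml)) {
            return builder.parse(in);
        } catch (IOException | SAXException e) {
            // Malformed or unreadable input is treated as "nothing to extract".
            return null;
        }
    }

    public static void main(String[] args) {
        Document ok = parseOrNull("<a><b/></a>".getBytes(StandardCharsets.UTF_8));
        Document bad = parseOrNull("<a>".getBytes(StandardCharsets.UTF_8));
        System.out.println(ok != null ? ok.getDocumentElement().getTagName() : "null");
        System.out.println(bad == null ? "malformed input skipped" : "unexpected document");
    }
}
```

Returning null keeps the caller's null check as the single place that decides whether extraction runs, at the cost of hiding why the parse failed; logging the exception is an alternative when that matters.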

Aggregations

ParseContext (org.apache.tika.parser.ParseContext) 338
Metadata (org.apache.tika.metadata.Metadata) 283
Test (org.junit.Test) 260
InputStream (java.io.InputStream) 195
BodyContentHandler (org.apache.tika.sax.BodyContentHandler) 195
TikaTest (org.apache.tika.TikaTest) 186
ContentHandler (org.xml.sax.ContentHandler) 164
AutoDetectParser (org.apache.tika.parser.AutoDetectParser) 118
Parser (org.apache.tika.parser.Parser) 109
ByteArrayInputStream (java.io.ByteArrayInputStream) 92
TikaInputStream (org.apache.tika.io.TikaInputStream) 77
DefaultHandler (org.xml.sax.helpers.DefaultHandler) 52
ExcelParserTest (org.apache.tika.parser.microsoft.ExcelParserTest) 31
WordParserTest (org.apache.tika.parser.microsoft.WordParserTest) 31
TikaException (org.apache.tika.exception.TikaException) 30
StringWriter (java.io.StringWriter) 26
IOException (java.io.IOException) 25
SAXException (org.xml.sax.SAXException) 25
CompositeParser (org.apache.tika.parser.CompositeParser) 22
TeeContentHandler (org.apache.tika.sax.TeeContentHandler) 20