
Example 51 with TransformerHandler

Use of javax.xml.transform.sax.TransformerHandler in the Apache Tika project.

Class OutlookParserTest, method testOutlookHTMLfromRTF.

@Test
public void testOutlookHTMLfromRTF() throws Exception {
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    // Check the HTML version
    StringWriter sw = new StringWriter();
    SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
    handler.setResult(new StreamResult(sw));
    try (InputStream stream = OutlookParserTest.class.getResourceAsStream("/test-documents/test-outlook2003.msg")) {
        parser.parse(stream, handler, metadata, new ParseContext());
    }
    // The HTML version should have been processed, so check
    // that we got some of the links
    String content = sw.toString().replaceAll("<p>\\s+", "<p>");
    assertContains("<dd>New Outlook User</dd>", content);
    assertContains("designed <i>to help you", content);
    assertContains("<p><a href=\"http://r.office.microsoft.com/r/rlidOutlookWelcomeMail10?clid=1033\">Cached Exchange Mode</a>", content);
    // Link - check text around it, and the link itself
    assertContains("sign up for a free subscription", content);
    assertContains("Office Newsletter", content);
    assertContains("newsletter will be sent to you", content);
    assertContains("http://r.office.microsoft.com/r/rlidNewsletterSignUp?clid=1033", content);
    // Make sure we don't have nested html docs
    assertEquals(2, content.split("<body>").length);
    assertEquals(2, content.split("<\\/body>").length);
}
Also used : TransformerHandler(javax.xml.transform.sax.TransformerHandler) StringWriter(java.io.StringWriter) StreamResult(javax.xml.transform.stream.StreamResult) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) SAXTransformerFactory(javax.xml.transform.sax.SAXTransformerFactory) ParseContext(org.apache.tika.parser.ParseContext) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) Parser(org.apache.tika.parser.Parser) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)
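
The test above hands the TransformerHandler to a Tika parser, which then drives it with SAX events. As a minimal sketch of the same serialization step without Tika, the following feeds a few hand-written SAX events into an identity TransformerHandler and collects the XML in a StringWriter. The class name IdentityHandlerSketch and the element names are hypothetical; only JDK classes are used, and the default JDK transformer factory is assumed.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical class name, for illustration only.
class IdentityHandlerSketch {

    // Pushes a tiny SAX event stream through an identity TransformerHandler
    // and returns the serialized XML text.
    static String serialize() {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            // Output properties are set before setResult(), as in the test above.
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter sw = new StringWriter();
            handler.setResult(new StreamResult(sw));

            // These calls stand in for the events a parser would emit.
            handler.startDocument();
            handler.startElement("", "body", "body", new AttributesImpl());
            handler.startElement("", "p", "p", new AttributesImpl());
            char[] text = "hello".toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement("", "p", "p");
            handler.endElement("", "body", "body");
            handler.endDocument();
            return sw.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not serialize SAX events", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize());
    }
}
```

Because the handler is just a ContentHandler, anything that produces SAX events (a parser, a filter, or hand-written calls as here) can write into it.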

Example 52 with TransformerHandler

Use of javax.xml.transform.sax.TransformerHandler in the Apache Sling project.

Class SlingTransformer, method setXMLConsumer.

@Override
protected void setXMLConsumer(XMLConsumer consumer) {
    TransformerHandler transformerHandler;
    try {
        transformerHandler = this.createTransformerHandler();
    } catch (Exception ex) {
        throw new RuntimeException("Could not initialize transformer handler.", ex);
    }
    final Map<String, Object> map = this.getLogicSheetParameters();
    if (map != null) {
        final Transformer transformer = transformerHandler.getTransformer();
        for (Entry<String, Object> entry : map.entrySet()) {
            transformer.setParameter(entry.getKey(), entry.getValue());
        }
    }
    final SAXResult result = new SAXResult();
    result.setHandler(consumer);
    // According to TrAX specs, all TransformerHandlers are LexicalHandlers
    result.setLexicalHandler(consumer);
    transformerHandler.setResult(result);
    super.setXMLConsumer(new XMLConsumerAdapter(transformerHandler, transformerHandler));
}
Also used : XMLConsumerAdapter(org.apache.cocoon.pipeline.component.sax.XMLConsumerAdapter) TransformerHandler(javax.xml.transform.sax.TransformerHandler) Transformer(javax.xml.transform.Transformer) AbstractTransformer(org.apache.cocoon.pipeline.component.sax.AbstractTransformer) SAXResult(javax.xml.transform.sax.SAXResult)
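
The method above routes the transformer's output to another SAX consumer through a SAXResult rather than to a stream. A minimal sketch of that wiring, assuming an identity TransformerHandler in place of Sling's stylesheet-backed one and a DefaultHandler2 in place of the Cocoon XMLConsumer, might look like this. The class name SaxResultChainSketch, the parameter name "greeting", and the element names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical class name, for illustration only.
class SaxResultChainSketch {

    // Sends SAX events through a TransformerHandler whose result is another
    // SAX handler, and returns the element names the downstream handler saw.
    static List<String> forward() {
        final List<String> seen = new ArrayList<>();
        // DefaultHandler2 is both a ContentHandler and a LexicalHandler.
        DefaultHandler2 downstream = new DefaultHandler2() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                seen.add(qName);
            }
        };
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            // An identity transform has no stylesheet to read this parameter;
            // the call only mirrors the setParameter loop above.
            handler.getTransformer().setParameter("greeting", "hello");

            SAXResult result = new SAXResult();
            result.setHandler(downstream);
            result.setLexicalHandler(downstream);
            handler.setResult(result);

            handler.startDocument();
            handler.startElement("", "root", "root", new AttributesImpl());
            handler.startElement("", "item", "item", new AttributesImpl());
            handler.endElement("", "item", "item");
            handler.endElement("", "root", "root");
            handler.endDocument();
            return seen;
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not initialize transformer handler.", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(forward());
    }
}
```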

Example 53 with TransformerHandler

Use of javax.xml.transform.sax.TransformerHandler in the Apache Tika project.

Class TikaGUI, method getHtmlHandler.

/**
 * Creates and returns a content handler that turns XHTML input to
 * simplified HTML output that can be correctly parsed and displayed
 * by {@link JEditorPane}.
 * <p>
 * The returned content handler is set to output <code>html</code>
 * to the given writer. The XHTML namespace is removed from the output
 * to prevent the serializer from using the &lt;tag/&gt; empty element
 * syntax that causes extra "&gt;" characters to be displayed.
 * The &lt;head&gt; tags are dropped to prevent the serializer from
 * generating a &lt;META&gt; content type tag that makes
 * {@link JEditorPane} fail thinking that the document character set
 * is inconsistent.
 * <p>
 * Additionally, it will use ImageSavingParser to re-write embedded:(image)
 * image links to be file:///(temporary file) so that they can be loaded.
 *
 * @param writer output writer
 * @return HTML content handler
 * @throws TransformerConfigurationException if an error occurs
 */
private ContentHandler getHtmlHandler(Writer writer) throws TransformerConfigurationException {
    SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
    handler.setResult(new StreamResult(writer));
    return new ContentHandlerDecorator(handler) {

        @Override
        public void startElement(String uri, String localName, String name, Attributes atts) throws SAXException {
            if (XHTMLContentHandler.XHTML.equals(uri)) {
                uri = null;
            }
            if (!"head".equals(localName)) {
                if ("img".equals(localName)) {
                    AttributesImpl newAttrs;
                    if (atts instanceof AttributesImpl) {
                        newAttrs = (AttributesImpl) atts;
                    } else {
                        newAttrs = new AttributesImpl(atts);
                    }
                    for (int i = 0; i < newAttrs.getLength(); i++) {
                        if ("src".equals(newAttrs.getLocalName(i))) {
                            String src = newAttrs.getValue(i);
                            if (src.startsWith("embedded:")) {
                                String filename = src.substring(src.indexOf(':') + 1);
                                try {
                                    File img = imageParser.requestSave(filename);
                                    String newSrc = img.toURI().toString();
                                    newAttrs.setValue(i, newSrc);
                                } catch (IOException e) {
                                    // The HTML viewer will show a broken image too, to alert the user
                                    System.err.println("Error creating temp image file " + filename);
                                }
                            }
                        }
                    }
                    super.startElement(uri, localName, name, newAttrs);
                } else {
                    super.startElement(uri, localName, name, atts);
                }
            }
        }

        @Override
        public void endElement(String uri, String localName, String name) throws SAXException {
            if (XHTMLContentHandler.XHTML.equals(uri)) {
                uri = null;
            }
            if (!"head".equals(localName)) {
                super.endElement(uri, localName, name);
            }
        }

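        @Override
        public void startPrefixMapping(String prefix, String uri) {
            // Intentionally empty: prefix mappings are dropped along with the XHTML namespace
        }

        @Override
        public void endPrefixMapping(String prefix) {
            // Intentionally empty: see startPrefixMapping
        }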
    };
}
Also used : TransformerHandler(javax.xml.transform.sax.TransformerHandler) AttributesImpl(org.xml.sax.helpers.AttributesImpl) StreamResult(javax.xml.transform.stream.StreamResult) SAXTransformerFactory(javax.xml.transform.sax.SAXTransformerFactory) Attributes(org.xml.sax.Attributes) IOException(java.io.IOException) ContentHandlerDecorator(org.apache.tika.sax.ContentHandlerDecorator) File(java.io.File)
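
The core of the decorator above is a filter that strips the XHTML namespace and suppresses the head tags before the events reach the HTML serializer. A minimal sketch of that idea using only JDK classes is shown below, with org.xml.sax.helpers.XMLFilterImpl standing in for Tika's ContentHandlerDecorator. Two deviations from the original are assumptions of this sketch: the namespace is replaced with the empty string (the SAX convention for "no namespace") rather than null, and the image rewriting is omitted. The class name HeadDroppingSketch is hypothetical.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical class name, for illustration only.
class HeadDroppingSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Replaces the XHTML namespace with "no namespace".
    static String strip(String uri) {
        return XHTML.equals(uri) ? "" : uri;
    }

    static String render() {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
            handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter sw = new StringWriter();
            handler.setResult(new StreamResult(sw));

            // Filter that drops <head> tags and removes the XHTML namespace.
            XMLFilterImpl filter = new XMLFilterImpl() {
                @Override
                public void startElement(String uri, String localName, String qName,
                        Attributes atts) throws SAXException {
                    if (!"head".equals(localName)) {
                        super.startElement(strip(uri), localName, qName, atts);
                    }
                }

                @Override
                public void endElement(String uri, String localName, String qName)
                        throws SAXException {
                    if (!"head".equals(localName)) {
                        super.endElement(strip(uri), localName, qName);
                    }
                }
            };
            filter.setContentHandler(handler);

            // Hand-written XHTML events standing in for parser output.
            filter.startDocument();
            filter.startElement(XHTML, "html", "html", new AttributesImpl());
            filter.startElement(XHTML, "head", "head", new AttributesImpl());
            filter.endElement(XHTML, "head", "head");
            filter.startElement(XHTML, "body", "body", new AttributesImpl());
            filter.startElement(XHTML, "p", "p", new AttributesImpl());
            char[] text = "hi".toCharArray();
            filter.characters(text, 0, text.length);
            filter.endElement(XHTML, "p", "p");
            filter.endElement(XHTML, "body", "body");
            filter.endElement(XHTML, "html", "html");
            filter.endDocument();
            return sw.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not render HTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render());
    }
}
```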

Example 54 with TransformerHandler

Use of javax.xml.transform.sax.TransformerHandler in the Apache Tika project.

Class TikaGUI, method getXmlContentHandler.

private ContentHandler getXmlContentHandler(Writer writer) throws TransformerConfigurationException {
    SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
    handler.setResult(new StreamResult(writer));
    return handler;
}
Also used : TransformerHandler(javax.xml.transform.sax.TransformerHandler) StreamResult(javax.xml.transform.stream.StreamResult) SAXTransformerFactory(javax.xml.transform.sax.SAXTransformerFactory)

Example 55 with TransformerHandler

Use of javax.xml.transform.sax.TransformerHandler in the Apache Tika project.

Class HtmlParserTest, method makeHtmlTransformer.

/**
 * Creates a ContentHandler that transforms SAX events into textual HTML output
 * and writes it to {@code writer}, which is typically a StringWriter.
 *
 * @param writer where to write the resulting HTML text
 * @return ContentHandler suitable for passing to parse() methods
 * @throws Exception if the transformer handler cannot be created
 */
private ContentHandler makeHtmlTransformer(Writer writer) throws Exception {
    SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
    handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
    handler.setResult(new StreamResult(writer));
    return handler;
}
Also used : TransformerHandler(javax.xml.transform.sax.TransformerHandler) StreamResult(javax.xml.transform.stream.StreamResult) SAXTransformerFactory(javax.xml.transform.sax.SAXTransformerFactory)
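
Examples 54 and 55 differ mainly in the value of OutputKeys.METHOD. As a rough sketch of what that setting changes, the following serializes the same SAX events once per method name. With the JDK's built-in serializer, the xml method is expected to write an empty element as <br/> while the html method writes <br>; other transformer implementations may format the output differently. The class name OutputMethodSketch and the sample events are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical class name, for illustration only.
class OutputMethodSketch {

    // Serializes <p>a<br>b</p> using the given output method ("xml" or "html").
    static String serialize(String method) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, method);
            handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sw = new StringWriter();
            handler.setResult(new StreamResult(sw));

            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            handler.characters(new char[] {'a'}, 0, 1);
            // An element with no content: this is where the two methods differ.
            handler.startElement("", "br", "br", new AttributesImpl());
            handler.endElement("", "br", "br");
            handler.characters(new char[] {'b'}, 0, 1);
            handler.endElement("", "p", "p");
            handler.endDocument();
            return sw.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not serialize with method " + method, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("xml:  " + serialize("xml"));
        System.out.println("html: " + serialize("html"));
    }
}
```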

Aggregations

TransformerHandler (javax.xml.transform.sax.TransformerHandler): 84
StreamResult (javax.xml.transform.stream.StreamResult): 57
SAXTransformerFactory (javax.xml.transform.sax.SAXTransformerFactory): 51
TransformerConfigurationException (javax.xml.transform.TransformerConfigurationException): 33
Transformer (javax.xml.transform.Transformer): 29
IOException (java.io.IOException): 23
SAXException (org.xml.sax.SAXException): 22
AttributesImpl (org.xml.sax.helpers.AttributesImpl): 17
StringWriter (java.io.StringWriter): 13
SAXResult (javax.xml.transform.sax.SAXResult): 13
File (java.io.File): 11
XMLReader (org.xml.sax.XMLReader): 11
FileOutputStream (java.io.FileOutputStream): 10
Test (org.junit.Test): 10
InputStream (java.io.InputStream): 9
ByteArrayOutputStream (java.io.ByteArrayOutputStream): 8
OutputStream (java.io.OutputStream): 8
ContentHandler (org.xml.sax.ContentHandler): 8
InputSource (org.xml.sax.InputSource): 8
Metadata (org.apache.tika.metadata.Metadata): 7