Example 1 with RichTextContentHandler

Use of org.apache.tika.sax.RichTextContentHandler in the Apache Tika project.

In class UnpackerResource, method process.

private Map<String, byte[]> process(InputStream is, @Context HttpHeaders httpHeaders, @Context UriInfo info, boolean saveAll) throws Exception {
    Metadata metadata = new Metadata();
    ParseContext pc = new ParseContext();
    Parser parser = TikaResource.createParser();
    if (parser instanceof DigestingParser) {
        //no need to digest for unwrapping
        parser = ((DigestingParser) parser).getWrappedParser();
    }
    TikaResource.fillMetadata(parser, metadata, pc, httpHeaders.getRequestHeaders());
    TikaResource.logRequest(LOG, info, metadata);
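    // Capture the body text only when the caller asked for everything (saveAll);
    // otherwise a no-op handler discards the SAX events.
    ContentHandler ch;
    ByteArrayOutputStream text = new ByteArrayOutputStream();
    if (saveAll) {
        ch = new BodyContentHandler(new RichTextContentHandler(new OutputStreamWriter(text, UTF_8)));
    } else {
        ch = new DefaultHandler();
    }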
    Map<String, byte[]> files = new HashMap<>();
    MutableInt count = new MutableInt();
    pc.set(EmbeddedDocumentExtractor.class, new MyEmbeddedDocumentExtractor(count, files));
    TikaResource.parse(parser, LOG, info.getPath(), is, ch, metadata, pc);
    // No embedded files were found and no text was requested: respond with 204 No Content.
    if (count.intValue() == 0 && !saveAll) {
        throw new WebApplicationException(Response.Status.NO_CONTENT);
    }
    if (saveAll) {
        files.put(TEXT_FILENAME, text.toByteArray());
        ByteArrayOutputStream metaStream = new ByteArrayOutputStream();
        metadataToCsv(metadata, metaStream);
        files.put(META_FILENAME, metaStream.toByteArray());
    }
    return files;
}
Also used: BodyContentHandler(org.apache.tika.sax.BodyContentHandler) WebApplicationException(javax.ws.rs.WebApplicationException) HashMap(java.util.HashMap) Metadata(org.apache.tika.metadata.Metadata) DigestingParser(org.apache.tika.parser.DigestingParser) ByteArrayOutputStream(java.io.ByteArrayOutputStream) ContentHandler(org.xml.sax.ContentHandler) RichTextContentHandler(org.apache.tika.sax.RichTextContentHandler) Parser(org.apache.tika.parser.Parser) OfficeParser(org.apache.tika.parser.microsoft.OfficeParser) DefaultHandler(org.xml.sax.helpers.DefaultHandler) MutableInt(org.apache.commons.lang.mutable.MutableInt) ParseContext(org.apache.tika.parser.ParseContext) OutputStreamWriter(java.io.OutputStreamWriter)
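
The core pattern in Example 1 (a SAX handler that writes text through an OutputStreamWriter into a ByteArrayOutputStream, whose bytes are read back afterwards) can be tried without Tika on the classpath. The following is a minimal sketch using only JDK SAX classes. WriterHandlerSketch, WriterTextHandler and extractText are hypothetical names introduced for illustration; WriterTextHandler is a simplified stand-in and is not Tika's RichTextContentHandler. It assumes the writer should be flushed at endDocument so that no buffered characters are lost before toByteArray() is called.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriterHandlerSketch {

    // Hypothetical stand-in for a writer-backed content handler (not Tika's class).
    static class WriterTextHandler extends DefaultHandler {
        private final Writer writer;

        WriterTextHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                // Forward character data straight to the underlying writer.
                writer.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException("Error writing text", e);
            }
        }

        @Override
        public void endDocument() throws SAXException {
            try {
                // OutputStreamWriter buffers internally; flush before the bytes are read.
                writer.flush();
            } catch (IOException e) {
                throw new SAXException("Error flushing text", e);
            }
        }
    }

    // Parses the given XML and returns the character content as UTF-8 bytes.
    static byte[] extractText(String xml) {
        ByteArrayOutputStream text = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(text, StandardCharsets.UTF_8);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new WriterTextHandler(writer));
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return text.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = extractText("<doc><p>Hello</p><p>World</p></doc>");
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

As in process above, the caller keeps a reference to the ByteArrayOutputStream and reads its contents only after parsing has finished.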

Example 2 with RichTextContentHandler

Use of org.apache.tika.sax.RichTextContentHandler in the Apache Tika project.

In class TikaResource, method produceText.

public StreamingOutput produceText(final InputStream is, MultivaluedMap<String, String> httpHeaders, final UriInfo info) {
    final Parser parser = createParser();
    final Metadata metadata = new Metadata();
    final ParseContext context = new ParseContext();
    fillMetadata(parser, metadata, context, httpHeaders);
    fillParseContext(context, httpHeaders, parser);
    logRequest(LOG, info, metadata);
    // Parsing is deferred: it runs only when write() is called with the response stream.
    return new StreamingOutput() {

        @Override
        public void write(OutputStream outputStream) throws IOException, WebApplicationException {
            Writer writer = new OutputStreamWriter(outputStream, UTF_8);
            BodyContentHandler body = new BodyContentHandler(new RichTextContentHandler(writer));
            parse(parser, LOG, info.getPath(), is, body, metadata, context);
        }
    };
}
Also used: BodyContentHandler(org.apache.tika.sax.BodyContentHandler) RichTextContentHandler(org.apache.tika.sax.RichTextContentHandler) OutputStream(java.io.OutputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) StreamingOutput(javax.ws.rs.core.StreamingOutput) OutputStreamWriter(java.io.OutputStreamWriter) Writer(java.io.Writer) Parser(org.apache.tika.parser.Parser) HtmlParser(org.apache.tika.parser.html.HtmlParser) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) DigestingParser(org.apache.tika.parser.DigestingParser)

Aggregations

OutputStreamWriter (java.io.OutputStreamWriter): 2
Metadata (org.apache.tika.metadata.Metadata): 2
DigestingParser (org.apache.tika.parser.DigestingParser): 2
ParseContext (org.apache.tika.parser.ParseContext): 2
Parser (org.apache.tika.parser.Parser): 2
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 2
RichTextContentHandler (org.apache.tika.sax.RichTextContentHandler): 2
ByteArrayOutputStream (java.io.ByteArrayOutputStream): 1
OutputStream (java.io.OutputStream): 1
Writer (java.io.Writer): 1
HashMap (java.util.HashMap): 1
WebApplicationException (javax.ws.rs.WebApplicationException): 1
StreamingOutput (javax.ws.rs.core.StreamingOutput): 1
MutableInt (org.apache.commons.lang.mutable.MutableInt): 1
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 1
HtmlParser (org.apache.tika.parser.html.HtmlParser): 1
OfficeParser (org.apache.tika.parser.microsoft.OfficeParser): 1
ContentHandler (org.xml.sax.ContentHandler): 1
DefaultHandler (org.xml.sax.helpers.DefaultHandler): 1