Example 66 with TikaInputStream

use of org.apache.tika.io.TikaInputStream in project tika by apache.

the class JournalParser method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    // TemporaryResources is Closeable; closing it deletes the spooled temp file.
    try (TemporaryResources tmp = new TemporaryResources()) {
        TikaInputStream tis = TikaInputStream.get(stream, tmp);
        File tmpFile = tis.getFile();
        GrobidRESTParser grobidParser = new GrobidRESTParser();
        grobidParser.parse(tmpFile.getAbsolutePath(), handler, metadata, context);
        PDFParser parser = new PDFParser();
        // Close the file stream once the PDF parser has finished with it.
        try (InputStream pdfStream = new FileInputStream(tmpFile)) {
            parser.parse(pdfStream, handler, metadata, context);
        }
    }
}
Also used : PDFParser(org.apache.tika.parser.pdf.PDFParser) TemporaryResources(org.apache.tika.io.TemporaryResources) TikaInputStream(org.apache.tika.io.TikaInputStream) File(java.io.File) FileInputStream(java.io.FileInputStream)

Example 67 with TikaInputStream

use of org.apache.tika.io.TikaInputStream in project tika by apache.

the class JpegParser method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    TemporaryResources tmp = new TemporaryResources();
    try {
        TikaInputStream tis = TikaInputStream.get(stream, tmp);
        new ImageMetadataExtractor(metadata).parseJpeg(tis.getFile());
        new JempboxExtractor(metadata).parse(tis);
    } finally {
        tmp.dispose();
    }
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    xhtml.endDocument();
}
Also used : JempboxExtractor(org.apache.tika.parser.image.xmp.JempboxExtractor) ImageMetadataExtractor(org.apache.tika.parser.image.ImageMetadataExtractor) TemporaryResources(org.apache.tika.io.TemporaryResources) TikaInputStream(org.apache.tika.io.TikaInputStream) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler)
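
The example above relies on TikaInputStream.getFile() spooling the stream to a temporary file and on TemporaryResources deleting that file afterwards. A minimal stdlib-only sketch of the same pattern follows; it is an analogue for illustration, not Tika's implementation, and the names SpoolSketch, FileConsumer and withSpooledFile are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SpoolSketch {

    /** Hypothetical callback for code that needs a real file rather than a stream. */
    interface FileConsumer<T> {
        T apply(Path file) throws IOException;
    }

    /**
     * Copies the stream into a temporary file, hands the file to the consumer,
     * and deletes the file afterwards, even if the consumer throws.
     */
    static <T> T withSpooledFile(InputStream in, FileConsumer<T> consumer) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return consumer.apply(tmp);
        } finally {
            // Plays the role of tmp.dispose() in the example above.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Long size = withSpooledFile(in, file -> Files.size(file));
        System.out.println("spooled bytes: " + size);
    }
}
```

Keeping the cleanup in a finally block means the temporary file is removed whether or not the consumer succeeds, which is the same guarantee the try/finally around tmp.dispose() gives.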

Example 68 with TikaInputStream

use of org.apache.tika.io.TikaInputStream in project tika by apache.

the class DetectorResource method detect.

@PUT
@Path("stream")
@Consumes("*/*")
@Produces("text/plain")
public String detect(final InputStream is, @Context HttpHeaders httpHeaders, @Context final UriInfo info) {
    Metadata met = new Metadata();
    TikaInputStream tis = TikaInputStream.get(TikaResource.getInputStream(is, httpHeaders));
    String filename = TikaResource.detectFilename(httpHeaders.getRequestHeaders());
    LOG.info("Detecting media type for Filename: {}", filename);
    met.add(Metadata.RESOURCE_NAME_KEY, filename);
    try {
        return TikaResource.getConfig().getDetector().detect(tis, met).toString();
    } catch (IOException e) {
        LOG.warn("Unable to detect MIME type for file. Reason: {}", e.getMessage(), e);
        return MediaType.OCTET_STREAM.toString();
    }
}
Also used : Metadata(org.apache.tika.metadata.Metadata) TikaInputStream(org.apache.tika.io.TikaInputStream) IOException(java.io.IOException) Path(javax.ws.rs.Path) Consumes(javax.ws.rs.Consumes) Produces(javax.ws.rs.Produces) PUT(javax.ws.rs.PUT)
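
Detection of this kind reads the first bytes of the stream and then rewinds it, which is why the input is wrapped in a TikaInputStream (it supports mark/reset). The following stdlib-only sketch shows the peek-and-reset idea for a single signature; it is a simplified illustration, not Tika's detector, and PeekSketch and guessType are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PeekSketch {

    // Leading bytes of a PDF file.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Guesses a media type from the first bytes of the stream and restores
     * the stream position, so a later parser still sees the full content.
     */
    static String guessType(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PDF_MAGIC.length);
        try {
            byte[] head = new byte[PDF_MAGIC.length];
            int read = in.readNBytes(head, 0, head.length);
            boolean isPdf = read == head.length && Arrays.equals(head, PDF_MAGIC);
            // Fall back to the generic binary type, as the example above does on failure.
            return isPdf ? "application/pdf" : "application/octet-stream";
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 ...".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(guessType(in));
    }
}
```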

Example 69 with TikaInputStream

use of org.apache.tika.io.TikaInputStream in project tika by apache.

the class RTFParserTest method testEmbeddedLinkedDocument.

//TIKA-1010 test linked embedded doc
@Test
public void testEmbeddedLinkedDocument() throws Exception {
    Set<MediaType> skipTypes = new HashSet<>();
    skipTypes.add(MediaType.parse("image/emf"));
    skipTypes.add(MediaType.parse("image/wmf"));
    TrackingHandler tracker = new TrackingHandler(skipTypes);
    try (TikaInputStream tis = TikaInputStream.get(getResourceAsStream("/test-documents/testRTFEmbeddedLink.rtf"))) {
        ContainerExtractor ex = new ParserContainerExtractor();
        assertTrue(ex.isSupported(tis));
        ex.extract(tis, ex, tracker);
    }
    //should gracefully skip link and not throw NPE, IOEx, etc
    assertEquals(0, tracker.filenames.size());
    tracker = new TrackingHandler();
    try (TikaInputStream tis = TikaInputStream.get(getResourceAsStream("/test-documents/testRTFEmbeddedLink.rtf"))) {
        ContainerExtractor ex = new ParserContainerExtractor();
        assertTrue(ex.isSupported(tis));
        ex.extract(tis, ex, tracker);
    }
    //with no skip types, the tracker should record two embedded files
    assertEquals(2, tracker.filenames.size());
}
Also used : MediaType(org.apache.tika.mime.MediaType) TikaInputStream(org.apache.tika.io.TikaInputStream) ContainerExtractor(org.apache.tika.extractor.ContainerExtractor) ParserContainerExtractor(org.apache.tika.extractor.ParserContainerExtractor) HashSet(java.util.HashSet) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)

Example 70 with TikaInputStream

use of org.apache.tika.io.TikaInputStream in project tika by apache.

the class RTFParserTest method testRegularImages.

//TIKA-1010 test regular (not "embedded") images/picts
@Test
public void testRegularImages() throws Exception {
    Parser base = new AutoDetectParser();
    ParseContext ctx = new ParseContext();
    RecursiveParserWrapper parser = new RecursiveParserWrapper(base, new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1));
    ContentHandler handler = new BodyContentHandler();
    Metadata rootMetadata = new Metadata();
    rootMetadata.add(Metadata.RESOURCE_NAME_KEY, "testRTFRegularImages.rtf");
    try (TikaInputStream tis = TikaInputStream.get(getResourceAsStream("/test-documents/testRTFRegularImages.rtf"))) {
        parser.parse(tis, handler, rootMetadata, ctx);
    }
    List<Metadata> metadatas = parser.getMetadata();
    //metadata for "testJPEG_EXIF_普林斯顿.jpg"
    Metadata meta_jpg_exif = metadatas.get(1);
    //metadata for "testJPEG_普林斯顿.jpg"
    Metadata meta_jpg = metadatas.get(3);
    assertTrue(meta_jpg_exif != null);
    assertTrue(meta_jpg != null);
    assertTrue(Arrays.asList(meta_jpg_exif.getValues("dc:subject")).contains("serbor"));
    assertTrue(meta_jpg.get("Comments").contains("Licensed to the Apache"));
    //make sure old metadata doesn't linger between objects
    assertFalse(Arrays.asList(meta_jpg.getValues("dc:subject")).contains("serbor"));
    assertEquals("false", meta_jpg.get(RTFMetadata.THUMBNAIL));
    assertEquals("false", meta_jpg_exif.get(RTFMetadata.THUMBNAIL));
    assertEquals(49, meta_jpg.names().length);
    assertEquals(113, meta_jpg_exif.names().length);
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) BasicContentHandlerFactory(org.apache.tika.sax.BasicContentHandlerFactory) ParseContext(org.apache.tika.parser.ParseContext) Metadata(org.apache.tika.metadata.Metadata) RTFMetadata(org.apache.tika.metadata.RTFMetadata) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) TikaInputStream(org.apache.tika.io.TikaInputStream) RecursiveParserWrapper(org.apache.tika.parser.RecursiveParserWrapper) ContentHandler(org.xml.sax.ContentHandler) WriteOutContentHandler(org.apache.tika.sax.WriteOutContentHandler) Parser(org.apache.tika.parser.Parser) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)

Aggregations

TikaInputStream (org.apache.tika.io.TikaInputStream): 100
Metadata (org.apache.tika.metadata.Metadata): 40
TemporaryResources (org.apache.tika.io.TemporaryResources): 28
IOException (java.io.IOException): 27
TikaException (org.apache.tika.exception.TikaException): 24
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler): 23
Test (org.junit.Test): 20
InputStream (java.io.InputStream): 19
File (java.io.File): 15
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 15
ContentHandler (org.xml.sax.ContentHandler): 14
TikaTest (org.apache.tika.TikaTest): 13
MediaType (org.apache.tika.mime.MediaType): 13
SAXException (org.xml.sax.SAXException): 13
ParseContext (org.apache.tika.parser.ParseContext): 12
ParserContainerExtractor (org.apache.tika.extractor.ParserContainerExtractor): 8
CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream): 6
NPOIFSFileSystem (org.apache.poi.poifs.filesystem.NPOIFSFileSystem): 6
EncryptedDocumentException (org.apache.tika.exception.EncryptedDocumentException): 6
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 6