Example 11 with TemporaryResources

use of org.apache.tika.io.TemporaryResources in project tika by apache.

the class RarParser method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    EmbeddedDocumentExtractor extractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
    Archive rar = null;
    try (TemporaryResources tmp = new TemporaryResources()) {
        TikaInputStream tis = TikaInputStream.get(stream, tmp);
        rar = new Archive(tis.getFile());
        if (rar.isEncrypted()) {
            throw new EncryptedDocumentException();
        }
        //Without this BodyContentHandler does not work
        xhtml.element("div", " ");
        FileHeader header = rar.nextFileHeader();
        while (header != null && !Thread.currentThread().isInterrupted()) {
            if (!header.isDirectory()) {
                try (InputStream subFile = rar.getInputStream(header)) {
                    Metadata entrydata = PackageParser.handleEntryMetadata("".equals(header.getFileNameW()) ? header.getFileNameString() : header.getFileNameW(), header.getCTime(), header.getMTime(), header.getFullUnpackSize(), xhtml);
                    if (extractor.shouldParseEmbedded(entrydata)) {
                        extractor.parseEmbedded(subFile, handler, entrydata, true);
                    }
                }
            }
            header = rar.nextFileHeader();
        }
    } catch (RarException e) {
        throw new TikaException("RarParser Exception", e);
    } finally {
        if (rar != null)
            rar.close();
    }
    xhtml.endDocument();
}
Also used : Archive(com.github.junrar.Archive) EncryptedDocumentException(org.apache.tika.exception.EncryptedDocumentException) TikaException(org.apache.tika.exception.TikaException) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) TemporaryResources(org.apache.tika.io.TemporaryResources) Metadata(org.apache.tika.metadata.Metadata) RarException(com.github.junrar.exception.RarException) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) FileHeader(com.github.junrar.rarfile.FileHeader)

Example 12 with TemporaryResources

use of org.apache.tika.io.TemporaryResources in project tika by apache.

the class ParsingEmbeddedDocumentExtractor method parseEmbedded.

public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml) throws SAXException, IOException {
    if (outputHtml) {
        AttributesImpl attributes = new AttributesImpl();
        attributes.addAttribute("", "class", "class", "CDATA", "package-entry");
        handler.startElement(XHTML, "div", "div", attributes);
    }
    String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
    if (name != null && name.length() > 0 && outputHtml) {
        handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
        char[] chars = name.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML, "h1", "h1");
    }
    // Use the delegate parser to parse this entry
    try (TemporaryResources tmp = new TemporaryResources()) {
        final TikaInputStream newStream = TikaInputStream.get(new CloseShieldInputStream(stream), tmp);
        if (stream instanceof TikaInputStream) {
            final Object container = ((TikaInputStream) stream).getOpenContainer();
            if (container != null) {
                newStream.setOpenContainer(container);
            }
        }
        DELEGATING_PARSER.parse(newStream, new EmbeddedContentHandler(new BodyContentHandler(handler)), metadata, context);
    } catch (EncryptedDocumentException ede) {
        // TODO: can we log a warning that we lack the password?
        // For now, just skip the content
    } catch (TikaException e) {
        // TODO: can we log a warning somehow?
        // Could not parse the entry, just skip the content
    }
    if (outputHtml) {
        handler.endElement(XHTML, "div", "div");
    }
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) AttributesImpl(org.xml.sax.helpers.AttributesImpl) EncryptedDocumentException(org.apache.tika.exception.EncryptedDocumentException) TikaException(org.apache.tika.exception.TikaException) TemporaryResources(org.apache.tika.io.TemporaryResources) TikaInputStream(org.apache.tika.io.TikaInputStream) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) CloseShieldInputStream(org.apache.tika.io.CloseShieldInputStream)

Example 13 with TemporaryResources

use of org.apache.tika.io.TemporaryResources in project tika by apache.

the class WebPParser method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    TemporaryResources tmp = new TemporaryResources();
    try {
        TikaInputStream tis = TikaInputStream.get(stream, tmp);
        new ImageMetadataExtractor(metadata).parseWebP(tis.getFile());
    } finally {
        tmp.dispose();
    }
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    xhtml.endDocument();
}
Also used : TemporaryResources(org.apache.tika.io.TemporaryResources) TikaInputStream(org.apache.tika.io.TikaInputStream) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler)
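
The examples on this page share one pattern: spool the input stream to a temporary file, hand that file to a library that needs one, and clean up on every exit path. The sketch below illustrates that pattern with the JDK only. It is not Tika's implementation; the names SpoolSketch, TempFiles, spool and sizeOf are hypothetical stand-ins chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

final class SpoolSketch {

    // Hypothetical stand-in for a temporary-resource tracker (not Tika's class).
    static final class TempFiles implements Closeable {
        private final List<Path> files = new ArrayList<>();

        // Copies the stream into a new temporary file and remembers it for cleanup.
        Path spool(InputStream in) throws IOException {
            Path file = Files.createTempFile("spool-", ".tmp");
            files.add(file);
            Files.copy(in, file, StandardCopyOption.REPLACE_EXISTING);
            return file;
        }

        // Deletes every file created through spool().
        @Override
        public void close() throws IOException {
            for (Path file : files) {
                Files.deleteIfExists(file);
            }
            files.clear();
        }
    }

    // Spools the stream, reads the size of the spooled file, and cleans up
    // whether or not an exception is thrown.
    static long sizeOf(InputStream in) throws IOException {
        try (TempFiles tmp = new TempFiles()) {
            Path file = tmp.spool(in);
            return Files.size(file);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(sizeOf(new ByteArrayInputStream(data)));
    }
}
```

The try-with-resources form plays the same role as the explicit try/finally with tmp.dispose() in the examples: the temporary file is removed even when the file-based step fails.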

Example 14 with TemporaryResources

use of org.apache.tika.io.TemporaryResources in project tika by apache.

the class MatParser method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    //Set MIME type as Matlab
    metadata.set(Metadata.CONTENT_TYPE, MATLAB_MIME_TYPE);
    TemporaryResources tmp = TikaInputStream.isTikaInputStream(stream) ? null : new TemporaryResources();
    try {
        // Use TIS so we can spool a temp file for parsing.
        TikaInputStream tis = TikaInputStream.get(stream, tmp);
        //Extract information from header file
        //input .mat file
        MatFileReader mfr = new MatFileReader(tis.getFile());
        //.mat header information
        MatFileHeader hdr = mfr.getMatFileHeader();
        // Example header: "MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Sun Mar  2 23:41:57 2014"
        // Break header information into its parts
        String[] parts = hdr.getDescription().split(",");
        // Guard the indexes: a malformed header may have fewer than three comma-separated parts
        if (parts.length > 2 && parts[2].contains("Created")) {
            int lastIndex1 = parts[2].lastIndexOf("Created on:");
            String dateCreated = parts[2].substring(lastIndex1 + "Created on:".length()).trim();
            metadata.set("createdOn", dateCreated);
        }
        if (parts.length > 1 && parts[1].contains("Platform")) {
            int lastIndex2 = parts[1].lastIndexOf("Platform:");
            String platform = parts[1].substring(lastIndex2 + "Platform:".length()).trim();
            metadata.set("platform", platform);
        }
        if (parts[0].contains("MATLAB")) {
            metadata.set("fileType", parts[0]);
        }
        // Get endian indicator from header file
        // Retrieve endian bytes and convert to string
        String endianBytes = new String(hdr.getEndianIndicator(), UTF_8);
        // Convert bytes to characters to string
        String endianCode = String.valueOf(endianBytes.toCharArray());
        metadata.set("endian", endianCode);
        // Text output
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.newline();
        //Loop through each variable
        for (Map.Entry<String, MLArray> entry : mfr.getContent().entrySet()) {
            String varName = entry.getKey();
            MLArray varData = entry.getValue();
            xhtml.element("p", varName + ":" + String.valueOf(varData));
            // If the variable is a structure, extract variable info from structure
            if (varData.isStruct()) {
                MLStructure mlStructure = (MLStructure) mfr.getMLArray(varName);
                xhtml.startElement("ul");
                xhtml.newline();
                for (MLArray element : mlStructure.getAllFields()) {
                    xhtml.startElement("li");
                    xhtml.characters(String.valueOf(element));
                    // If there is an embedded structure, extract variable info.
                    if (element.isStruct()) {
                        xhtml.startElement("ul");
                        // Should this actually be a recursive call?
                        xhtml.element("li", element.contentToString());
                        xhtml.endElement("ul");
                    }
                    xhtml.endElement("li");
                }
                xhtml.endElement("ul");
            }
        }
        xhtml.endDocument();
    } catch (IOException e) {
        throw new TikaException("Error parsing Matlab file with MatParser", e);
    } finally {
        if (tmp != null) {
            tmp.dispose();
        }
    }
}
Also used : MatFileReader(com.jmatio.io.MatFileReader) MLArray(com.jmatio.types.MLArray) TikaException(org.apache.tika.exception.TikaException) MatFileHeader(com.jmatio.io.MatFileHeader) TemporaryResources(org.apache.tika.io.TemporaryResources) TikaInputStream(org.apache.tika.io.TikaInputStream) IOException(java.io.IOException) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) MLStructure(com.jmatio.types.MLStructure) Map(java.util.Map)
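
The header handling in this example can be tried in isolation. The sketch below parses the example header string quoted in the code comment using the JDK only. It is a variation, not the parser's own logic: it matches each comma-separated part by its prefix rather than by fixed index, so a shorter header yields fewer fields. The class name MatHeaderSketch and the method parse are hypothetical; the key names follow the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MatHeaderSketch {

    // Splits a MAT-file description on commas and extracts the fields it recognises.
    static Map<String, String> parse(String description) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String part : description.split(",")) {
            String p = part.trim();
            if (p.startsWith("MATLAB")) {
                out.put("fileType", p);
            } else if (p.startsWith("Platform:")) {
                out.put("platform", p.substring("Platform:".length()).trim());
            } else if (p.startsWith("Created on:")) {
                out.put("createdOn", p.substring("Created on:".length()).trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String header = "MATLAB 5.0 MAT-file, Platform: MACI64, "
                + "Created on: Sun Mar  2 23:41:57 2014";
        // Print each extracted field on its own line.
        parse(header).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Matching by prefix trades strictness for robustness: fields may appear in any order, and a missing field is simply absent from the result.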

Example 15 with TemporaryResources

use of org.apache.tika.io.TemporaryResources in project tika by apache.

the class MP4Parser method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    // The MP4Parser library accepts either a File, or a byte array
    // As MP4 video files are typically large, always use a file to
    //  avoid OOMs that may occur with in-memory buffering
    TemporaryResources tmp = new TemporaryResources();
    TikaInputStream tstream = TikaInputStream.get(stream, tmp);
    try (DataSource dataSource = new DirectFileReadDataSource(tstream.getFile())) {
        try (IsoFile isoFile = new IsoFile(dataSource)) {
            tmp.addResource(isoFile);
            // Grab the file type box
            FileTypeBox fileType = getOrNull(isoFile, FileTypeBox.class);
            if (fileType != null) {
                // Identify the type
                MediaType type = MediaType.application("mp4");
                for (Map.Entry<MediaType, List<String>> e : typesMap.entrySet()) {
                    if (e.getValue().contains(fileType.getMajorBrand())) {
                        type = e.getKey();
                        break;
                    }
                }
                metadata.set(Metadata.CONTENT_TYPE, type.toString());
                if (type.getType().equals("audio")) {
                    metadata.set(XMPDM.AUDIO_COMPRESSOR, fileType.getMajorBrand().trim());
                }
            } else {
                // Some older QuickTime files lack the FileType
                metadata.set(Metadata.CONTENT_TYPE, "video/quicktime");
            }
            // Get the main MOOV box
            MovieBox moov = getOrNull(isoFile, MovieBox.class);
            if (moov == null) {
                // Bail out
                return;
            }
            XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
            xhtml.startDocument();
            // Pull out some information from the header box
            MovieHeaderBox mHeader = getOrNull(moov, MovieHeaderBox.class);
            if (mHeader != null) {
                // Get the creation and modification dates
                metadata.set(Metadata.CREATION_DATE, mHeader.getCreationTime());
                metadata.set(TikaCoreProperties.MODIFIED, mHeader.getModificationTime());
                // Get the duration
                double durationSeconds = ((double) mHeader.getDuration()) / mHeader.getTimescale();
                metadata.set(XMPDM.DURATION, DURATION_FORMAT.format(durationSeconds));
                // The timescale is normally the sampling rate
                metadata.set(XMPDM.AUDIO_SAMPLE_RATE, (int) mHeader.getTimescale());
            }
            // Get some more information from the track header
            // TODO Decide how to handle multiple tracks
            List<TrackBox> tb = moov.getBoxes(TrackBox.class);
            if (tb.size() > 0) {
                TrackBox track = tb.get(0);
                TrackHeaderBox header = track.getTrackHeaderBox();
                // Get the creation and modification dates
                metadata.set(TikaCoreProperties.CREATED, header.getCreationTime());
                metadata.set(TikaCoreProperties.MODIFIED, header.getModificationTime());
                // Get the video width and height
                metadata.set(Metadata.IMAGE_WIDTH, (int) header.getWidth());
                metadata.set(Metadata.IMAGE_LENGTH, (int) header.getHeight());
                // Get the sample information
                SampleTableBox samples = track.getSampleTableBox();
                SampleDescriptionBox sampleDesc = samples.getSampleDescriptionBox();
                if (sampleDesc != null) {
                    // Look for the first Audio Sample, if present
                    AudioSampleEntry sample = getOrNull(sampleDesc, AudioSampleEntry.class);
                    if (sample != null) {
                        XMPDM.ChannelTypePropertyConverter.convertAndSet(metadata, sample.getChannelCount());
                        //metadata.set(XMPDM.AUDIO_SAMPLE_TYPE, sample.getSampleSize());    // TODO Num -> Type mapping
                        metadata.set(XMPDM.AUDIO_SAMPLE_RATE, (int) sample.getSampleRate());
                        //metadata.set(XMPDM.AUDIO_, sample.getSamplesPerPacket());
                        //metadata.set(XMPDM.AUDIO_, sample.getBytesPerSample());
                    }
                }
            }
            // Get metadata from the User Data Box
            UserDataBox userData = getOrNull(moov, UserDataBox.class);
            if (userData != null) {
                MetaBox meta = getOrNull(userData, MetaBox.class);
                // Check for iTunes Metadata
                // See http://atomicparsley.sourceforge.net/mpeg-4files.html and
                //  http://code.google.com/p/mp4v2/wiki/iTunesMetadata for more on these
                AppleItemListBox apple = getOrNull(meta, AppleItemListBox.class);
                if (apple != null) {
                    // Title
                    AppleNameBox title = getOrNull(apple, AppleNameBox.class);
                    addMetadata(TikaCoreProperties.TITLE, metadata, title);
                    // Artist
                    AppleArtistBox artist = getOrNull(apple, AppleArtistBox.class);
                    addMetadata(TikaCoreProperties.CREATOR, metadata, artist);
                    addMetadata(XMPDM.ARTIST, metadata, artist);
                    // Album Artist
                    AppleArtist2Box artist2 = getOrNull(apple, AppleArtist2Box.class);
                    addMetadata(XMPDM.ALBUM_ARTIST, metadata, artist2);
                    // Album
                    AppleAlbumBox album = getOrNull(apple, AppleAlbumBox.class);
                    addMetadata(XMPDM.ALBUM, metadata, album);
                    // Composer
                    AppleTrackAuthorBox composer = getOrNull(apple, AppleTrackAuthorBox.class);
                    addMetadata(XMPDM.COMPOSER, metadata, composer);
                    // Genre
                    AppleGenreBox genre = getOrNull(apple, AppleGenreBox.class);
                    addMetadata(XMPDM.GENRE, metadata, genre);
                    // Year
                    AppleRecordingYear2Box year = getOrNull(apple, AppleRecordingYear2Box.class);
                    if (year != null) {
                        metadata.set(XMPDM.RELEASE_DATE, year.getValue());
                    }
                    // Track number
                    AppleTrackNumberBox trackNum = getOrNull(apple, AppleTrackNumberBox.class);
                    if (trackNum != null) {
                        metadata.set(XMPDM.TRACK_NUMBER, trackNum.getA());
                        //metadata.set(XMPDM.NUMBER_OF_TRACKS, trackNum.getB()); // TODO
                    }
                    // Disc number
                    AppleDiskNumberBox discNum = getOrNull(apple, AppleDiskNumberBox.class);
                    if (discNum != null) {
                        metadata.set(XMPDM.DISC_NUMBER, discNum.getA());
                    }
                    // Compilation
                    AppleCompilationBox compilation = getOrNull(apple, AppleCompilationBox.class);
                    if (compilation != null) {
                        metadata.set(XMPDM.COMPILATION, (int) compilation.getValue());
                    }
                    // Comment
                    AppleCommentBox comment = getOrNull(apple, AppleCommentBox.class);
                    addMetadata(XMPDM.LOG_COMMENT, metadata, comment);
                    // Encoder
                    AppleEncoderBox encoder = getOrNull(apple, AppleEncoderBox.class);
                    if (encoder != null) {
                        metadata.set(XMP.CREATOR_TOOL, encoder.getValue());
                    }
                    // As text
                    for (Box box : apple.getBoxes()) {
                        if (box instanceof Utf8AppleDataBox) {
                            xhtml.element("p", ((Utf8AppleDataBox) box).getValue());
                        }
                    }
                }
                // TODO Check for other kinds too
            }
            // All done
            xhtml.endDocument();
        }
    } finally {
        tmp.dispose();
    }
}
Also used : AudioSampleEntry(com.coremedia.iso.boxes.sampleentry.AudioSampleEntry) AppleAlbumBox(com.googlecode.mp4parser.boxes.apple.AppleAlbumBox) TikaInputStream(org.apache.tika.io.TikaInputStream) FileTypeBox(com.coremedia.iso.boxes.FileTypeBox) AppleTrackNumberBox(com.googlecode.mp4parser.boxes.apple.AppleTrackNumberBox) MetaBox(com.coremedia.iso.boxes.MetaBox) AppleCompilationBox(com.googlecode.mp4parser.boxes.apple.AppleCompilationBox) AppleArtist2Box(com.googlecode.mp4parser.boxes.apple.AppleArtist2Box) AppleRecordingYear2Box(com.googlecode.mp4parser.boxes.apple.AppleRecordingYear2Box) AppleGenreBox(com.googlecode.mp4parser.boxes.apple.AppleGenreBox) Utf8AppleDataBox(com.googlecode.mp4parser.boxes.apple.Utf8AppleDataBox) MediaType(org.apache.tika.mime.MediaType) List(java.util.List) SampleDescriptionBox(com.coremedia.iso.boxes.SampleDescriptionBox) TrackHeaderBox(com.coremedia.iso.boxes.TrackHeaderBox) IsoFile(com.coremedia.iso.IsoFile) AppleCommentBox(com.googlecode.mp4parser.boxes.apple.AppleCommentBox) UserDataBox(com.coremedia.iso.boxes.UserDataBox) MovieHeaderBox(com.coremedia.iso.boxes.MovieHeaderBox) TemporaryResources(org.apache.tika.io.TemporaryResources) AppleEncoderBox(com.googlecode.mp4parser.boxes.apple.AppleEncoderBox) AppleArtistBox(com.googlecode.mp4parser.boxes.apple.AppleArtistBox) AppleNameBox(com.googlecode.mp4parser.boxes.apple.AppleNameBox) SampleTableBox(com.coremedia.iso.boxes.SampleTableBox) TrackBox(com.coremedia.iso.boxes.TrackBox) AppleDiskNumberBox(com.googlecode.mp4parser.boxes.apple.AppleDiskNumberBox) MovieBox(com.coremedia.iso.boxes.MovieBox) Box(com.coremedia.iso.boxes.Box) AppleItemListBox(com.coremedia.iso.boxes.apple.AppleItemListBox) AppleTrackAuthorBox(com.googlecode.mp4parser.boxes.apple.AppleTrackAuthorBox) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) DataSource(com.googlecode.mp4parser.DataSource) Map(java.util.Map) HashMap(java.util.HashMap)
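
The duration step in this example divides the movie header's duration by its timescale, since the header stores the duration in timescale units and the timescale is the number of units per second. A minimal JDK-only sketch of that arithmetic follows. The class name Mp4DurationSketch and the guard against a non-positive timescale are assumptions added for illustration; the example itself performs no such check.

```java
final class Mp4DurationSketch {

    // Converts a duration expressed in timescale units into seconds.
    static double durationSeconds(long duration, long timescale) {
        if (timescale <= 0) {
            // Assumption for this sketch: reject values that would give a meaningless result.
            throw new IllegalArgumentException("timescale must be positive");
        }
        // Cast before dividing so the division is not truncated to a whole number.
        return ((double) duration) / timescale;
    }

    public static void main(String[] args) {
        // 90000 units at 600 units per second
        System.out.println(durationSeconds(90_000L, 600L));
    }
}
```

The cast to double matters: dividing two long values directly would discard the fractional seconds.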

Aggregations

TemporaryResources (org.apache.tika.io.TemporaryResources) 31
TikaInputStream (org.apache.tika.io.TikaInputStream) 30
TikaException (org.apache.tika.exception.TikaException) 15
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler) 14
File (java.io.File) 11
IOException (java.io.IOException) 8
InputStream (java.io.InputStream) 6
SAXException (org.xml.sax.SAXException) 6
FileInputStream (java.io.FileInputStream) 4
EncryptedDocumentException (org.apache.tika.exception.EncryptedDocumentException) 4
Metadata (org.apache.tika.metadata.Metadata) 4
MediaType (org.apache.tika.mime.MediaType) 4
ZipArchiveEntry (org.apache.commons.compress.archivers.zip.ZipArchiveEntry) 2
EmbeddedDocumentExtractor (org.apache.tika.extractor.EmbeddedDocumentExtractor) 2
JempboxExtractor (org.apache.tika.parser.image.xmp.JempboxExtractor) 2
IsoFile (com.coremedia.iso.IsoFile) 1
Box (com.coremedia.iso.boxes.Box) 1
FileTypeBox (com.coremedia.iso.boxes.FileTypeBox) 1
MetaBox (com.coremedia.iso.boxes.MetaBox) 1
MovieBox (com.coremedia.iso.boxes.MovieBox) 1