Example 1 with TikaMetadataExtractor

Use of ddf.catalog.transformer.common.tika.TikaMetadataExtractor in project ddf by codice.

Class TikaInputTransformer, method transform.

@Override
public Metacard transform(InputStream input, String id) throws IOException, CatalogTransformerException {
    LOGGER.debug("Transforming input stream using Tika.");
    long bytes;
    if (input == null) {
        throw new CatalogTransformerException("Cannot transform null input.");
    }
    try (TemporaryFileBackedOutputStream fileBackedOutputStream = new TemporaryFileBackedOutputStream()) {
        try {
            bytes = IOUtils.copyLarge(input, fileBackedOutputStream);
        } catch (IOException e) {
            throw new CatalogTransformerException("Could not copy bytes of content message.", e);
        }
        Metadata metadata;
        String bodyText = null;
        String metadataText;
        Metacard metacard = new MetacardImpl(commonTikaMetacardType);
        String contentType = DataType.DATASET.name();
        TikaMetadataExtractor extractor = null;
        try (InputStream inputStreamCopy = fileBackedOutputStream.asByteSource().openStream()) {
            extractor = new TikaMetadataExtractor(inputStreamCopy, previewMaxLength, metadataMaxLength);
        } catch (TikaException | RuntimeException t) {
            LOGGER.debug("Unable to extract tika metadata", t);
        }
        if (extractor != null) {
            metadataText = getMetadataXml(extractor.getMetadataXml());
            Attribute validationAttribute = null;
            if (metadataText.equals(TikaMetadataExtractor.METADATA_LIMIT_REACHED_MSG)) {
                validationAttribute = new AttributeImpl(Validation.VALIDATION_WARNINGS, Collections.singletonList(metadataText));
                metadataText = "";
            }
            bodyText = extractor.getBodyText();
            metadata = extractor.getMetadata();
            contentType = metadata.get(Metadata.CONTENT_TYPE);
            MetacardType metacardType = mergeAttributes(getMetacardType(contentType));
            metacard = MetacardCreator.createMetacard(metadata, id, metadataText, metacardType, useResourceTitleAsTitle);
            if (StringUtils.isNotBlank(bodyText)) {
                metacard.setAttribute(new AttributeImpl(Extracted.EXTRACTED_TEXT, bodyText));
                processContentMetadataExtractors(bodyText, metacard);
            }
            if (StringUtils.isNotBlank(metadataText)) {
                processMetadataExtractors(metadataText, metacard);
            }
            if (validationAttribute != null) {
                metacard.setAttribute(validationAttribute);
            }
        }
        enrichMetacard(fileBackedOutputStream, contentType, bytes, metacard);
        LOGGER.debug("Finished transforming input stream using Tika.");
        return metacard;
    }
}
Also used : TikaException(org.apache.tika.exception.TikaException) TemporaryFileBackedOutputStream(org.codice.ddf.platform.util.TemporaryFileBackedOutputStream) Attribute(ddf.catalog.data.Attribute) CloseShieldInputStream(org.apache.tika.io.CloseShieldInputStream) InputStream(java.io.InputStream) AttributeImpl(ddf.catalog.data.impl.AttributeImpl) Metadata(org.apache.tika.metadata.Metadata) CatalogTransformerException(ddf.catalog.transform.CatalogTransformerException) IOException(java.io.IOException) MetacardImpl(ddf.catalog.data.impl.MetacardImpl) MetacardType(ddf.catalog.data.MetacardType) TikaMetadataExtractor(ddf.catalog.transformer.common.tika.TikaMetadataExtractor) Metacard(ddf.catalog.data.Metacard)
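
The buffering step in the example above (copy the input once, then reopen a fresh stream for each consumer) can be illustrated with JDK classes only. This is a minimal sketch and not DDF code: BufferedReplay and its methods are hypothetical names, and an in-memory ByteArrayOutputStream stands in for DDF's TemporaryFileBackedOutputStream, which is assumed here to serve the same purpose of making a one-shot stream re-readable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not part of DDF.
public class BufferedReplay {
    private final byte[] bytes;

    // Drain the one-shot input stream into memory so it can be read repeatedly.
    public BufferedReplay(InputStream input) throws IOException {
        if (input == null) {
            throw new IllegalArgumentException("Cannot buffer null input.");
        }
        this.bytes = readAll(input);
    }

    // Number of bytes that were copied from the original stream.
    public long size() {
        return bytes.length;
    }

    // Each call returns an independent stream positioned at the first byte.
    public InputStream openStream() {
        return new ByteArrayInputStream(bytes);
    }

    // Read a stream to its end using a fixed-size chunk buffer.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream oneShot =
            new ByteArrayInputStream("hello tika".getBytes(StandardCharsets.UTF_8));
        BufferedReplay replay = new BufferedReplay(oneShot);
        // Two consumers each get their own stream over the same bytes.
        try (InputStream first = replay.openStream();
             InputStream second = replay.openStream()) {
            boolean same = Arrays.equals(readAll(first), readAll(second));
            System.out.println("identical copies: " + same + ", bytes: " + replay.size());
        }
    }
}
```

An in-memory buffer keeps the sketch short; for large inputs a file-backed buffer, as the transformer uses, avoids holding the whole content on the heap.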

Example 2 with TikaMetadataExtractor

Use of ddf.catalog.transformer.common.tika.TikaMetadataExtractor in project ddf by codice.

Class VideoInputTransformer, method transform.

@Override
public Metacard transform(InputStream input, String id) throws IOException, CatalogTransformerException {
    Metacard metacard;
    try {
        // Let Tika read the stream, then build the metacard from what it extracted.
        TikaMetadataExtractor tikaMetadataExtractor = new TikaMetadataExtractor(input);
        Metadata metadata = tikaMetadataExtractor.getMetadata();
        String metadataText = tikaMetadataExtractor.getMetadataXml();
        metacard = MetacardCreator.createMetacard(metadata, id, metadataText, metacardType);
        // This transformer always marks its output as a moving image.
        metacard.setAttribute(new AttributeImpl(Core.DATATYPE, DataType.MOVING_IMAGE.toString()));
    } catch (TikaException e) {
        // Surface Tika failures as the catalog's own checked exception.
        throw new CatalogTransformerException(e);
    }
    return metacard;
}
Also used : TikaMetadataExtractor(ddf.catalog.transformer.common.tika.TikaMetadataExtractor) Metacard(ddf.catalog.data.Metacard) TikaException(org.apache.tika.exception.TikaException) AttributeImpl(ddf.catalog.data.impl.AttributeImpl) Metadata(org.apache.tika.metadata.Metadata) CatalogTransformerException(ddf.catalog.transform.CatalogTransformerException)

Example 3 with TikaMetadataExtractor

Use of ddf.catalog.transformer.common.tika.TikaMetadataExtractor in project ddf by codice.

Class PdfInputTransformer, method transformWithExtractors.

private Metacard transformWithExtractors(InputStream input, String id) throws IOException, CatalogTransformerException {
    try (TemporaryFileBackedOutputStream fbos = new TemporaryFileBackedOutputStream()) {
        try {
            IOUtils.copy(input, fbos);
        } catch (IOException e) {
            throw new CatalogTransformerException("Could not copy bytes of content message.", e);
        }
        String plainText = null;
        try (InputStream isCopy = fbos.asByteSource().openStream()) {
            Parser parser = new AutoDetectParser();
            ContentHandler contentHandler = new ToTextContentHandler();
            TikaMetadataExtractor tikaMetadataExtractor = new TikaMetadataExtractor(parser, contentHandler);
            tikaMetadataExtractor.parseMetadata(isCopy, new ParseContext());
            plainText = contentHandler.toString();
        } catch (CatalogTransformerException e) {
            LOGGER.warn("Cannot extract metadata from pdf", e);
        }
        try (InputStream isCopy = fbos.asByteSource().openStream();
            PDDocument pdfDocument = pdDocumentGenerator.apply(isCopy)) {
            return transformPdf(id, pdfDocument, plainText);
        } catch (InvalidPasswordException e) {
            LOGGER.debug("Cannot transform encrypted pdf", e);
            return initializeMetacard(id);
        }
    }
}
Also used : TemporaryFileBackedOutputStream(org.codice.ddf.platform.util.TemporaryFileBackedOutputStream) InputStream(java.io.InputStream) CatalogTransformerException(ddf.catalog.transform.CatalogTransformerException) IOException(java.io.IOException) ContentHandler(org.xml.sax.ContentHandler) ToTextContentHandler(org.apache.tika.sax.ToTextContentHandler) Parser(org.apache.tika.parser.Parser) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) TikaMetadataExtractor(ddf.catalog.transformer.common.tika.TikaMetadataExtractor) PDDocument(org.apache.pdfbox.pdmodel.PDDocument) ParseContext(org.apache.tika.parser.ParseContext) InvalidPasswordException(org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException)

Example 4 with TikaMetadataExtractor

Use of ddf.catalog.transformer.common.tika.TikaMetadataExtractor in project ddf by codice.

Class PdfInputTransformer, method transformPdf.

private Metacard transformPdf(String id, PDDocument pdfDocument, InputStream contentInput) throws IOException, CatalogTransformerException {
    if (pdfDocument.isEncrypted()) {
        LOGGER.debug("Cannot transform encrypted pdf");
        return initializeMetacard(id);
    }
    String bodyText = null;
    String metadataXml = null;
    TikaMetadataExtractor tikaMetadataExtractor = null;
    try {
        tikaMetadataExtractor = new TikaMetadataExtractor(contentInput, previewMaxLength, metadataMaxLength);
    } catch (TikaException e) {
        throw new CatalogTransformerException(e);
    }
    metadataXml = tikaMetadataExtractor.getMetadataXml();
    Attribute validationAttribute = null;
    if (metadataXml.equals(TikaMetadataExtractor.METADATA_LIMIT_REACHED_MSG)) {
        validationAttribute = new AttributeImpl(Validation.VALIDATION_WARNINGS, Collections.singletonList(metadataXml));
        metadataXml = "";
    }
    bodyText = tikaMetadataExtractor.getBodyText();
    MetacardImpl metacard = initializeMetacard(id, bodyText, metadataXml);
    if (validationAttribute != null) {
        metacard.setAttribute(validationAttribute);
    }
    extractPdfMetadata(pdfDocument, metacard);
    pdfThumbnailGenerator.apply(pdfDocument).ifPresent(metacard::setThumbnail);
    Optional.ofNullable(geoParser.apply(pdfDocument)).ifPresent(metacard::setLocation);
    return metacard;
}
Also used : TikaMetadataExtractor(ddf.catalog.transformer.common.tika.TikaMetadataExtractor) TikaException(org.apache.tika.exception.TikaException) Attribute(ddf.catalog.data.Attribute) AttributeImpl(ddf.catalog.data.impl.AttributeImpl) CatalogTransformerException(ddf.catalog.transform.CatalogTransformerException) MetacardImpl(ddf.catalog.data.impl.MetacardImpl)
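
The limit check in the example above compares the extracted metadata against a sentinel message and, on a match, moves that message into a validation warning and blanks the metadata. A minimal sketch of that pattern with JDK classes only follows. LimitSentinel, Result, normalize and the LIMIT_REACHED value are hypothetical; the actual text of TikaMetadataExtractor.METADATA_LIMIT_REACHED_MSG is not shown in the example, so a placeholder is used.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not part of DDF.
public class LimitSentinel {
    // Placeholder sentinel; the real constant's text is an unknown here.
    static final String LIMIT_REACHED = "<metadata limit reached>";

    // Holds the cleaned metadata text plus any warnings produced while cleaning it.
    static final class Result {
        final String metadataXml;
        final List<String> warnings;

        Result(String metadataXml, List<String> warnings) {
            this.metadataXml = metadataXml;
            this.warnings = warnings;
        }
    }

    // If the extractor returned the sentinel instead of real metadata,
    // report it as a warning and replace the metadata with an empty string.
    static Result normalize(String extracted) {
        // Constant-first equals keeps this safe when extracted is null.
        if (LIMIT_REACHED.equals(extracted)) {
            return new Result("", Collections.singletonList(extracted));
        }
        return new Result(extracted, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        Result ok = normalize("<meta/>");
        Result limited = normalize(LIMIT_REACHED);
        System.out.println(ok.warnings.isEmpty() + " " + limited.metadataXml.isEmpty());
    }
}
```

Returning the text and the warnings together keeps the caller from having to track a separate nullable attribute, which is the role validationAttribute plays in the example.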

Aggregations

CatalogTransformerException (ddf.catalog.transform.CatalogTransformerException) 4
TikaMetadataExtractor (ddf.catalog.transformer.common.tika.TikaMetadataExtractor) 4
AttributeImpl (ddf.catalog.data.impl.AttributeImpl) 3
TikaException (org.apache.tika.exception.TikaException) 3
Attribute (ddf.catalog.data.Attribute) 2
Metacard (ddf.catalog.data.Metacard) 2
MetacardImpl (ddf.catalog.data.impl.MetacardImpl) 2
IOException (java.io.IOException) 2
InputStream (java.io.InputStream) 2
Metadata (org.apache.tika.metadata.Metadata) 2
TemporaryFileBackedOutputStream (org.codice.ddf.platform.util.TemporaryFileBackedOutputStream) 2
MetacardType (ddf.catalog.data.MetacardType) 1
PDDocument (org.apache.pdfbox.pdmodel.PDDocument) 1
InvalidPasswordException (org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException) 1
CloseShieldInputStream (org.apache.tika.io.CloseShieldInputStream) 1
AutoDetectParser (org.apache.tika.parser.AutoDetectParser) 1
ParseContext (org.apache.tika.parser.ParseContext) 1
Parser (org.apache.tika.parser.Parser) 1
ToTextContentHandler (org.apache.tika.sax.ToTextContentHandler) 1
ContentHandler (org.xml.sax.ContentHandler) 1