Example 1 with TikaHTMLConverter

Use of org.opensextant.xtext.converters.TikaHTMLConverter in project Xponents by OpenSextant.

From the class XText, method setup.

/**
 * If by this point you have taken items out of the requested types, the
 * corresponding converters will not be set up. E.g., if you don't want PDF or
 * HTML conversion, those resources will not be initialized.
 *
 * @throws IOException
 *             on error
 */
public void setup() throws IOException {
    defaultConversion = new DefaultConverter(maxBuffer);
    embeddedConversion = new EmbeddedContentConverter(maxBuffer);
    paths.configure();
    // Create converter instances only for the requested types.
    // If the caller has removed a file type from the list, its converter is not created.
    String mimetype = "txt";
    if (requestedFileTypes.contains(mimetype)) {
        converters.put(mimetype, new TextTranscodingConverter());
    }
    mimetype = "html";
    if (requestedFileTypes.contains(mimetype)) {
        Converter webConv = new TikaHTMLConverter(this.scrubHTML, maxHTMLBuffer);
        converters.put(mimetype, webConv);
        converters.put("htm", webConv);
        converters.put("xhtml", webConv);
        requestedFileTypes.add("htm");
        requestedFileTypes.add("xhtml");
    }
    MessageConverter emailParser = new MessageConverter();
    mimetype = "eml";
    if (requestedFileTypes.contains(mimetype)) {
        converters.put(mimetype, emailParser);
    }
    mimetype = "msg";
    if (requestedFileTypes.contains(mimetype)) {
        converters.put(mimetype, emailParser);
    }
    WebArchiveConverter webArchiveParser = new WebArchiveConverter();
    mimetype = "mht";
    /* RFC822 */
    if (requestedFileTypes.contains(mimetype)) {
        converters.put(mimetype, webArchiveParser);
    }
    ImageMetadataConverter imgConv = new ImageMetadataConverter();
    String[] imageTypes = { "jpeg", "jpg" };
    for (String img : imageTypes) {
        if (requestedFileTypes.contains(img)) {
            converters.put(img, imgConv);
        }
    }
    // For every requested type, register "<type>.txt" as an ignored file type.
    for (String t : requestedFileTypes) {
        ignoreFileType(t + ".txt");
    }
    fileFilters = requestedFileTypes.toArray(new String[requestedFileTypes.size()]);
}
Also used: ImageMetadataConverter(org.opensextant.xtext.converters.ImageMetadataConverter) WebArchiveConverter(org.opensextant.xtext.converters.WebArchiveConverter) TextTranscodingConverter(org.opensextant.xtext.converters.TextTranscodingConverter) EmbeddedContentConverter(org.opensextant.xtext.converters.EmbeddedContentConverter) TikaHTMLConverter(org.opensextant.xtext.converters.TikaHTMLConverter) MessageConverter(org.opensextant.xtext.converters.MessageConverter) DefaultConverter(org.opensextant.xtext.converters.DefaultConverter)
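
The registration pattern in setup() above (create a converter only when its type was requested, and share one instance across the HTML-like extensions) can be tried in isolation with the sketch below. It is a simplified stand-in, not Xponents code: ConverterRegistrySketch, TextConverter and build are hypothetical names, and the tag-stripping regex is only a placeholder for what a real HTML converter does.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from Xponents.
class ConverterRegistrySketch {

    /** Stand-in for a converter: turns raw input into plain text. */
    interface TextConverter extends Function<String, String> {}

    /**
     * Registers converters only for the requested types. Like the setup()
     * method above, it also adds the "htm" and "xhtml" aliases to the
     * requested set when "html" was requested.
     */
    static Map<String, TextConverter> build(Set<String> requestedTypes) {
        Map<String, TextConverter> converters = new HashMap<>();

        if (requestedTypes.contains("txt")) {
            converters.put("txt", raw -> raw); // pass-through placeholder
        }

        if (requestedTypes.contains("html")) {
            // One shared instance serves all HTML-like extensions.
            // Assumption: naive tag stripping stands in for real HTML parsing.
            TextConverter web = raw -> raw.replaceAll("<[^>]*>", "");
            for (String ext : new String[] {"html", "htm", "xhtml"}) {
                converters.put(ext, web);
                requestedTypes.add(ext);
            }
        }
        return converters;
    }

    public static void main(String[] args) {
        // The caller asked for HTML only, so no "txt" converter is created.
        Set<String> requested = new LinkedHashSet<>(Set.of("html"));
        Map<String, TextConverter> converters = build(requested);

        System.out.println(converters.containsKey("txt"));
        System.out.println(converters.get("htm") == converters.get("html"));
        System.out.println(converters.get("xhtml").apply("<p>Hi</p>"));
    }
}
```

Keying the map by extension and reusing one converter object for the aliases means a lookup by "htm" or "xhtml" costs nothing extra, and dropping "html" from the requested set skips all three entries at once.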
