Example 1 with ToHTMLContentHandler

Use of org.apache.tika.sax.ToHTMLContentHandler in the Apache Tika project.

From the class OutlookPSTParserTest, method testParse.

@Test
public void testParse() throws Exception {
    Parser pstParser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    // Serializes the parser's SAX events as HTML; toString() returns the markup.
    ContentHandler handler = new ToHTMLContentHandler();
    ParseContext context = new ParseContext();
    // Record the metadata of each embedded document so it can be checked below.
    EmbeddedTrackingExtrator trackingExtrator = new EmbeddedTrackingExtrator(context);
    context.set(EmbeddedDocumentExtractor.class, trackingExtrator);
    context.set(Parser.class, new AutoDetectParser());
    pstParser.parse(getResourceAsStream("/test-documents/testPST.pst"), handler, metadata, context);

    // Check the serialized HTML: container metadata, folder structure, embedded messages.
    String output = handler.toString();
    assertFalse(output.isEmpty());
    assertTrue(output.contains("<meta name=\"Content-Length\" content=\"271360\">"));
    assertTrue(output.contains("<meta name=\"Content-Type\" content=\"application/vnd.ms-outlook-pst\">"));
    assertTrue(output.contains("<body><div class=\"email-folder\"><h1>"));
    assertTrue(output.contains("<div class=\"embedded\" id=\"&lt;530D9CAC.5080901@gmail.com&gt;\"><h1>Re: Feature Generators</h1>"));
    assertTrue(output.contains("<div class=\"embedded\" id=\"&lt;1393363252.28814.YahooMailNeo@web140906.mail.bf1.yahoo.com&gt;\"><h1>Re: init tokenizer fails: \"Bad type in putfield/putstatic\"</h1>"));
    assertTrue(output.contains("Gary Murphy commented on TIKA-1250:"));
    assertTrue(output.contains("<div class=\"email-folder\"><h1>Racine (pour la recherche)</h1>"));

    // Check the metadata collected for the embedded messages.
    List<Metadata> metaList = trackingExtrator.trackingMetadata;
    assertEquals(6, metaList.size());
    Metadata firstMail = metaList.get(0);
    assertEquals("Jörn Kottmann", firstMail.get(TikaCoreProperties.CREATOR));
    assertEquals("Re: Feature Generators", firstMail.get(TikaCoreProperties.TITLE));
    assertEquals("kottmann@gmail.com", firstMail.get("senderEmailAddress"));
    assertEquals("users@opennlp.apache.org", firstMail.get("displayTo"));
    assertEquals("", firstMail.get("displayCC"));
    assertEquals("", firstMail.get("displayBCC"));
}
Also used: ToHTMLContentHandler (org.apache.tika.sax.ToHTMLContentHandler), Metadata (org.apache.tika.metadata.Metadata), ParseContext (org.apache.tika.parser.ParseContext), AutoDetectParser (org.apache.tika.parser.AutoDetectParser), ContentHandler (org.xml.sax.ContentHandler), Parser (org.apache.tika.parser.Parser), Test (org.junit.Test), TikaTest (org.apache.tika.TikaTest)
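
The assertions on the embedded-message ids expect `&lt;` and `&gt;` rather than raw angle brackets, presumably because the ids are written out as HTML attribute values and therefore have to be escaped. The following is a minimal standalone sketch of that kind of escaping. `AttributeEscapeSketch` and `escapeAttribute` are hypothetical names used only for illustration; this is not Tika's implementation, which may handle more cases.

```java
public class AttributeEscapeSketch {

    // Hypothetical helper (not part of Tika): escapes the characters that
    // must not appear raw inside a double-quoted HTML attribute value.
    static String escapeAttribute(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '&': sb.append("&amp;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same id as in the first embedded-message assertion of the test.
        String id = "<530D9CAC.5080901@gmail.com>";
        System.out.println("<div class=\"embedded\" id=\"" + escapeAttribute(id) + "\">");
        // prints: <div class="embedded" id="&lt;530D9CAC.5080901@gmail.com&gt;">
    }
}
```

This also suggests why the test compares against the escaped form of the id: a check for the raw `<530D9CAC.5080901@gmail.com>` inside the attribute would not match the serialized output.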

Aggregations

TikaTest (org.apache.tika.TikaTest): 1
Metadata (org.apache.tika.metadata.Metadata): 1
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 1
ParseContext (org.apache.tika.parser.ParseContext): 1
Parser (org.apache.tika.parser.Parser): 1
ToHTMLContentHandler (org.apache.tika.sax.ToHTMLContentHandler): 1
Test (org.junit.Test): 1
ContentHandler (org.xml.sax.ContentHandler): 1