
Example 51 with Tika

use of org.apache.tika.Tika in project tika by apache.

the class HtmlParserTest method testLineBreak.

/**
 * Test case for HTML content like
 * {@code <div>foo<br>bar</div>baz} that should result
 * in three whitespace-separated tokens "foo", "bar" and "baz" instead
 * of a single token "foobarbaz".
 *
 * @see <a href="https://issues.apache.org/jira/browse/TIKA-343">TIKA-343</a>
 */
@Test
public void testLineBreak() throws Exception {
    String test = "<html><body><div>foo<br>bar</div>baz</body></html>";
    String text = new Tika().parseToString(new ByteArrayInputStream(test.getBytes(US_ASCII)));
    String[] parts = text.trim().split("\\s+");
    assertEquals(3, parts.length);
    assertEquals("foo", parts[0]);
    assertEquals("bar", parts[1]);
    assertEquals("baz", parts[2]);
}
Also used : ByteArrayInputStream(java.io.ByteArrayInputStream) Tika(org.apache.tika.Tika) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)
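
The assertions in testLineBreak depend on the extracted text containing whitespace wherever the markup implied a break. A minimal sketch of that tokenization step, using only the JDK: the two input strings are hypothetical extractor outputs chosen for illustration, not text produced by Tika, and the class name TokenSplitSketch is made up.

```java
import java.util.Arrays;

class TokenSplitSketch {
    // Splits text on runs of whitespace after trimming, as the test does.
    static String[] tokens(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Hypothetical output in which <br> and </div> became line breaks.
        String separated = "foo\nbar\n\nbaz\n";
        // Hypothetical output in which the markup was dropped with no whitespace.
        String fused = "foobarbaz";

        System.out.println(Arrays.toString(tokens(separated))); // [foo, bar, baz]
        System.out.println(Arrays.toString(tokens(fused)));     // [foobarbaz]
    }
}
```

The second case is the failure mode the test guards against: with no separator in the extracted text, the split yields one token and the length assertion fails.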

Example 52 with Tika

use of org.apache.tika.Tika in project tika by apache.

the class HtmlParserTest method testCharactersDirectlyUnderBodyElement.

/**
 * Test case for TIKA-210
 *
 * @see <a href="https://issues.apache.org/jira/browse/TIKA-210">TIKA-210</a>
 */
@Test
public void testCharactersDirectlyUnderBodyElement() throws Exception {
    String test = "<html><body>test</body></html>";
    String content = new Tika().parseToString(new ByteArrayInputStream(test.getBytes(UTF_8)));
    assertEquals("test", content);
}
Also used : ByteArrayInputStream(java.io.ByteArrayInputStream) Tika(org.apache.tika.Tika) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)

Example 53 with Tika

use of org.apache.tika.Tika in project tika by apache.

the class RegexNERecogniserTest method testGetEntityTypes.

@Test
public void testGetEntityTypes() throws Exception {
    String text = "Hey, Lets meet on this Sunday or MONDAY because i am busy on Saturday";
    System.setProperty(NamedEntityParser.SYS_PROP_NER_IMPL, RegexNERecogniser.class.getName());
    Tika tika = new Tika(new TikaConfig(NamedEntityParser.class.getResourceAsStream("tika-config.xml")));
    Metadata md = new Metadata();
    tika.parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), md);
    Set<String> days = new HashSet<>(Arrays.asList(md.getValues("NER_WEEK_DAY")));
    assertTrue(days.contains("Sunday"));
    assertTrue(days.contains("MONDAY"));
    assertTrue(days.contains("Saturday"));
    // and nothing else
    assertEquals(3, days.size());
}
Also used : TikaConfig(org.apache.tika.config.TikaConfig) ByteArrayInputStream(java.io.ByteArrayInputStream) Metadata(org.apache.tika.metadata.Metadata) Tika(org.apache.tika.Tika) HashSet(java.util.HashSet) Test(org.junit.Test)
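
To make the expected result of testGetEntityTypes easier to follow, here is a rough JDK-only sketch of regex-based weekday extraction. It does not reproduce how RegexNERecogniser is configured or implemented; the pattern, the class name WeekdayRegexSketch, and the method findWeekDays are all assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WeekdayRegexSketch {
    // Hypothetical pattern: case-insensitive English weekday names as whole words.
    private static final Pattern WEEK_DAY = Pattern.compile(
            "\\b(?:sun|mon|tues|wednes|thurs|fri|satur)day\\b",
            Pattern.CASE_INSENSITIVE);

    // Collects each distinct match exactly as it appears in the text.
    static Set<String> findWeekDays(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = WEEK_DAY.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Hey, Lets meet on this Sunday or MONDAY because i am busy on Saturday";
        System.out.println(findWeekDays(text)); // [Sunday, MONDAY, Saturday]
    }
}
```

Matches keep their original casing, which is why the test looks for "MONDAY" rather than "Monday".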

Example 54 with Tika

use of org.apache.tika.Tika in project Lucee by lucee.

the class IOUtil method getMimeType.

public static String getMimeType(Resource res, String defaultValue) {
    // Pass the file name and length to Tika as detection hints.
    Metadata md = new Metadata();
    md.set(Metadata.RESOURCE_NAME_KEY, res.getName());
    md.set(Metadata.CONTENT_LENGTH, Long.toString(res.length()));
    InputStream is = null;
    try {
        Tika tika = new Tika();
        String result = tika.detect(is = res.getInputStream(), md);
        // Tika reports some formats with placeholder types containing "tika"
        // (for example application/x-tika-msoffice); prefer the extension table then.
        if (result.indexOf("tika") != -1) {
            String tmp = ResourceUtil.EXT_MT.get(ResourceUtil.getExtension(res, "").toLowerCase());
            if (!StringUtil.isEmpty(tmp))
                return tmp;
        }
        return result;
    } catch (Exception e) {
        // Detection failed: fall back to the extension table, then to the default.
        String tmp = ResourceUtil.EXT_MT.get(ResourceUtil.getExtension(res, "").toLowerCase());
        if (!StringUtil.isEmpty(tmp))
            return tmp;
        return defaultValue;
    } finally {
        // Close the stream quietly whether or not detection succeeded.
        IOUtil.closeEL(is);
    }
}
Also used : BufferedInputStream(java.io.BufferedInputStream) ByteArrayInputStream(java.io.ByteArrayInputStream) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) Tika(org.apache.tika.Tika) PageException(lucee.runtime.exp.PageException) IOException(java.io.IOException) UnsupportedEncodingException(java.io.UnsupportedEncodingException)
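
The fallback order in getMimeType can be isolated from Tika and Lucee as a small decision function. The sketch below is an approximation under stated assumptions: the extension table is a hypothetical three-entry stand-in for ResourceUtil.EXT_MT, a null detected value stands in for the exception path, and the names MimeFallbackSketch, extension, and resolve are invented for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MimeFallbackSketch {
    // Hypothetical extension table standing in for ResourceUtil.EXT_MT.
    private static final Map<String, String> EXT_MT = new HashMap<>();
    static {
        EXT_MT.put("doc", "application/msword");
        EXT_MT.put("pdf", "application/pdf");
        EXT_MT.put("txt", "text/plain");
    }

    // Returns the lower-cased extension, or "" when the name has none.
    static String extension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // detected == null models a failed detection (the catch branch above).
    static String resolve(String detected, String fileName, String defaultValue) {
        if (detected != null && !detected.contains("tika")) {
            return detected; // a concrete detected type wins
        }
        String byExtension = EXT_MT.get(extension(fileName));
        if (byExtension != null && !byExtension.isEmpty()) {
            return byExtension; // extension table is the first fallback
        }
        // Keep a placeholder type if detection produced one; otherwise use the default.
        return detected != null ? detected : defaultValue;
    }

    public static void main(String[] args) {
        System.out.println(resolve("application/x-tika-msoffice", "report.DOC", "application/octet-stream"));
        System.out.println(resolve("image/png", "picture.pdf", "application/octet-stream"));
        System.out.println(resolve(null, "notes.unknown", "application/octet-stream"));
    }
}
```

Keeping the decision separate from the I/O makes the three outcomes (detected type, extension match, default) straightforward to test without opening a stream.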

Aggregations

Tika (org.apache.tika.Tika) 54
Test (org.junit.Test) 32
Metadata (org.apache.tika.metadata.Metadata) 29
ByteArrayInputStream (java.io.ByteArrayInputStream) 14
TikaTest (org.apache.tika.TikaTest) 12
TikaConfig (org.apache.tika.config.TikaConfig) 12
File (java.io.File) 8
InputStream (java.io.InputStream) 7
URL (java.net.URL) 6
TikaInputStream (org.apache.tika.io.TikaInputStream) 5
IOException (java.io.IOException) 4
HashSet (java.util.HashSet) 4
Ignore (org.junit.Ignore) 4
FileInputStream (java.io.FileInputStream) 3
ArrayList (java.util.ArrayList) 3
HashMap (java.util.HashMap) 3
Content (org.apache.nutch.protocol.Content) 3
Before (org.junit.Before) 3
FileOutputStream (java.io.FileOutputStream) 2
UnsupportedEncodingException (java.io.UnsupportedEncodingException) 2