Example 41 with Metadata

Use of org.apache.nutch.metadata.Metadata in project nutch by apache.

In class TestContent, method testGetContentType.

/**
 * Unit tests for getContentType(String, String, byte[]) method.
 */
@Test
public void testGetContentType() throws Exception {
    Content c = null;
    Metadata p = new Metadata();
    // declared type with a charset parameter: only the media type is expected back
    c = new Content("http://www.foo.com/", "http://www.foo.com/", "".getBytes("UTF8"), "text/html; charset=UTF-8", p, conf);
    Assert.assertEquals("text/html", c.getContentType());
    // .html URL, empty body, empty declared type
    c = new Content("http://www.foo.com/foo.html", "http://www.foo.com/", "".getBytes("UTF8"), "", p, conf);
    Assert.assertEquals("text/html", c.getContentType());
    // .html URL, empty body, null declared type
    c = new Content("http://www.foo.com/foo.html", "http://www.foo.com/", "".getBytes("UTF8"), null, p, conf);
    Assert.assertEquals("text/html", c.getContentType());
    // no file extension, HTML body, empty declared type
    c = new Content("http://www.foo.com/", "http://www.foo.com/", "<html></html>".getBytes("UTF8"), "", p, conf);
    Assert.assertEquals("text/html", c.getContentType());
    // .html URL, HTML body, declared text/plain: text/html is still expected
    c = new Content("http://www.foo.com/foo.html", "http://www.foo.com/", "<html></html>".getBytes("UTF8"), "text/plain", p, conf);
    Assert.assertEquals("text/html", c.getContentType());
    // .png URL, HTML body, declared text/plain: text/html is still expected
    c = new Content("http://www.foo.com/foo.png", "http://www.foo.com/", "<html></html>".getBytes("UTF8"), "text/plain", p, conf);
    Assert.assertEquals("text/html", c.getContentType());
    // no extension, empty body, empty declared type: the octet-stream fallback is expected
    c = new Content("http://www.foo.com/", "http://www.foo.com/", "".getBytes("UTF8"), "", p, conf);
    Assert.assertEquals(MimeTypes.OCTET_STREAM, c.getContentType());
    // same inputs with a null declared type: the result must still be non-null
    c = new Content("http://www.foo.com/", "http://www.foo.com/", "".getBytes("UTF8"), null, p, conf);
    Assert.assertNotNull(c.getContentType());
}
Also used: Metadata (org.apache.nutch.metadata.Metadata), SpellCheckedMetadata (org.apache.nutch.metadata.SpellCheckedMetadata), Test (org.junit.Test)
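
The first assertion in testGetContentType expects a declared type of "text/html; charset=UTF-8" to come back as plain "text/html". A minimal stdlib-only sketch of that single step, stripping the parameters from a Content-Type value, might look like the following. The class ContentTypeSketch and the method baseType are hypothetical names used for illustration only. This is not Nutch's implementation, and it does not attempt the URL-based or content-based detection that the remaining assertions exercise.

```java
import java.util.Locale;

class ContentTypeSketch {

    /**
     * Returns the bare media type of a Content-Type value (hypothetical helper),
     * or null when no usable type is present.
     */
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Everything after the first ';' is a parameter such as charset=...
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType).trim();
        // Media type names are case-insensitive, so normalise to lower case.
        return base.isEmpty() ? null : base.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(baseType("text/html; charset=UTF-8")); // prints "text/html"
    }
}
```

Returning null for a missing or empty value leaves the fallback decision to the caller, which mirrors how the test treats "" and null as "no declared type".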

Example 42 with Metadata

Use of org.apache.nutch.metadata.Metadata in project nutch by apache.

In class TestEncodingDetector, method testGuessing.

@Test
public void testGuessing() {
    // first disable auto detection
    conf.setInt(EncodingDetector.MIN_CONFIDENCE_KEY, -1);
    Metadata metadata = new Metadata();
    EncodingDetector detector;
    Content content;
    String encoding;
    content = new Content("http://www.example.com", "http://www.example.com/", contentInOctets, "text/plain", metadata, conf);
    detector = new EncodingDetector(conf);
    detector.autoDetectClues(content, true);
    encoding = detector.guessEncoding(content, "windows-1252");
    // no information is available, so it should return default encoding
    Assert.assertEquals("windows-1252", encoding.toLowerCase());
    // a charset given in the Content-Type header should be used instead of the default
    metadata.clear();
    metadata.set(Response.CONTENT_TYPE, "text/plain; charset=UTF-16");
    content = new Content("http://www.example.com", "http://www.example.com/", contentInOctets, "text/plain", metadata, conf);
    detector = new EncodingDetector(conf);
    detector.autoDetectClues(content, true);
    encoding = detector.guessEncoding(content, "windows-1252");
    Assert.assertEquals("utf-16", encoding.toLowerCase());
    // with no header charset, an explicitly added clue should be used instead of the default
    metadata.clear();
    content = new Content("http://www.example.com", "http://www.example.com/", contentInOctets, "text/plain", metadata, conf);
    detector = new EncodingDetector(conf);
    detector.autoDetectClues(content, true);
    detector.addClue("windows-1254", "sniffed");
    encoding = detector.guessEncoding(content, "windows-1252");
    Assert.assertEquals("windows-1254", encoding.toLowerCase());
    // enable autodetection
    conf.setInt(EncodingDetector.MIN_CONFIDENCE_KEY, 50);
    metadata.clear();
    metadata.set(Response.CONTENT_TYPE, "text/plain; charset=UTF-16");
    content = new Content("http://www.example.com", "http://www.example.com/", contentInOctets, "text/plain", metadata, conf);
    detector = new EncodingDetector(conf);
    detector.autoDetectClues(content, true);
    detector.addClue("utf-32", "sniffed");
    encoding = detector.guessEncoding(content, "windows-1252");
    // the test expects utf-8 here, not the header's UTF-16 or the added utf-32 clue
    Assert.assertEquals("utf-8", encoding.toLowerCase());
}
Also used: Content (org.apache.nutch.protocol.Content), Metadata (org.apache.nutch.metadata.Metadata), Test (org.junit.Test)
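
With auto detection disabled, testGuessing expects the charset from the Content-Type header ("utf-16") when one is present and the supplied default ("windows-1252") otherwise. A rough stdlib-only sketch of just that fallback rule follows. CharsetClueSketch and charsetOrDefault are hypothetical names. The sketch is not Nutch's EncodingDetector: it does not model added clues, confidence thresholds, or byte-level auto detection.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetClueSketch {

    // Matches a charset parameter such as: charset=UTF-16 or charset="UTF-16"
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the lower-cased charset named in a Content-Type value,
     * or defaultEncoding when the header is absent or names no charset.
     */
    static String charsetOrDefault(String contentType, String defaultEncoding) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                return m.group(1).toLowerCase(Locale.ROOT);
            }
        }
        return defaultEncoding;
    }

    public static void main(String[] args) {
        // Header names a charset, so the default is ignored.
        System.out.println(charsetOrDefault("text/plain; charset=UTF-16", "windows-1252")); // prints "utf-16"
        // No header at all, so the default is returned.
        System.out.println(charsetOrDefault(null, "windows-1252")); // prints "windows-1252"
    }
}
```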

Aggregations

Metadata (org.apache.nutch.metadata.Metadata) 42
Configuration (org.apache.hadoop.conf.Configuration) 20
NutchConfiguration (org.apache.nutch.util.NutchConfiguration) 20
ParseData (org.apache.nutch.parse.ParseData) 19
Content (org.apache.nutch.protocol.Content) 18
Test (org.junit.Test) 17
Text (org.apache.hadoop.io.Text) 16
Parse (org.apache.nutch.parse.Parse) 16
ParseImpl (org.apache.nutch.parse.ParseImpl) 15
CrawlDatum (org.apache.nutch.crawl.CrawlDatum) 14
Inlinks (org.apache.nutch.crawl.Inlinks) 11
Outlink (org.apache.nutch.parse.Outlink) 10
ParseStatus (org.apache.nutch.parse.ParseStatus) 9
NutchDocument (org.apache.nutch.indexer.NutchDocument) 7
ParseResult (org.apache.nutch.parse.ParseResult) 7
FileInputStream (java.io.FileInputStream) 5
IOException (java.io.IOException) 5
File (java.io.File) 4
ArrayList (java.util.ArrayList) 4
ParseUtil (org.apache.nutch.parse.ParseUtil) 4