Example 41 with Content

use of org.apache.nutch.protocol.Content in project nutch by apache.

the class TestMetatagParser method parseMeta.

public Metadata parseMeta(String fileName, Configuration conf) {
    Metadata metadata = null;
    try {
        String urlString = "file:" + sampleDir + fileSeparator + fileName;
        Protocol protocol = new ProtocolFactory(conf).getProtocol(urlString);
        Content content = protocol.getProtocolOutput(new Text(urlString), new CrawlDatum()).getContent();
        Parse parse = new ParseUtil(conf).parse(content).get(content.getUrl());
        metadata = parse.getData().getParseMeta();
    } catch (Exception e) {
        e.printStackTrace();
        Assert.fail(e.toString());
    }
    return metadata;
}
Also used : ProtocolFactory(org.apache.nutch.protocol.ProtocolFactory) ParseUtil(org.apache.nutch.parse.ParseUtil) Content(org.apache.nutch.protocol.Content) Parse(org.apache.nutch.parse.Parse) Metadata(org.apache.nutch.metadata.Metadata) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Protocol(org.apache.nutch.protocol.Protocol)

Example 42 with Content

use of org.apache.nutch.protocol.Content in project nutch by apache.

the class TestFeedParser method testIt.

/**
 * <p>
 * The test method checks the following two assertions:
 * </p>
 *
 * <ul>
 * <li>There are 2 outlinks read from the sample RSS file</li>
 * <li>The 2 outlinks read are in fact the correct outlinks from the sample
 * file</li>
 * </ul>
 */
@Test
public void testIt() throws ProtocolException, ParseException {
    String urlString;
    Protocol protocol;
    Content content;
    Parse parse;
    Configuration conf = NutchConfiguration.create();
    for (int i = 0; i < sampleFiles.length; i++) {
        urlString = "file:" + sampleDir + fileSeparator + sampleFiles[i];
        protocol = new ProtocolFactory(conf).getProtocol(urlString);
        content = protocol.getProtocolOutput(new Text(urlString), new CrawlDatum()).getContent();
        parse = new ParseUtil(conf).parseByExtensionId("parse-tika", content).get(content.getUrl());
        // Check that there are 2 outlinks. Unlike the original parse-rss,
        // Tika ignores the URL and description of the channel
        // (http://test.channel.com), so only the other two links remain:
        // http://www-scf.usc.edu/~mattmann/
        // http://www.nutch.org
        ParseData theParseData = parse.getData();
        Outlink[] theOutlinks = theParseData.getOutlinks();
        Assert.assertEquals("There aren't 2 outlinks read!", 2, theOutlinks.length);
        // now check to make sure that those are the two outlinks
        boolean hasLink1 = false, hasLink2 = false;
        for (int j = 0; j < theOutlinks.length; j++) {
            if (theOutlinks[j].getToUrl().equals("http://www-scf.usc.edu/~mattmann/")) {
                hasLink1 = true;
            }
            if (theOutlinks[j].getToUrl().equals("http://www.nutch.org/")) {
                hasLink2 = true;
            }
        }
        if (!hasLink1 || !hasLink2) {
            Assert.fail("Outlinks read from sample rss file are not correct!");
        }
    }
}
Also used : Outlink(org.apache.nutch.parse.Outlink) NutchConfiguration(org.apache.nutch.util.NutchConfiguration) Configuration(org.apache.hadoop.conf.Configuration) ParseUtil(org.apache.nutch.parse.ParseUtil) Parse(org.apache.nutch.parse.Parse) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) ProtocolFactory(org.apache.nutch.protocol.ProtocolFactory) ParseData(org.apache.nutch.parse.ParseData) Content(org.apache.nutch.protocol.Content) Protocol(org.apache.nutch.protocol.Protocol) Test(org.junit.Test)
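
The inner loop in testIt checks that both expected URLs appear among the parsed outlinks. The same membership check can be sketched with a set. This is only an illustration on plain strings so that it runs without Nutch: the class OutlinkCheckSketch and the method containsAll are hypothetical names, and in the real test the strings would come from Outlink.getToUrl().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of Nutch.
class OutlinkCheckSketch {

    // Returns true when every expected URL occurs among the found URLs.
    static boolean containsAll(String[] foundUrls, List<String> expectedUrls) {
        Set<String> found = new HashSet<>(Arrays.asList(foundUrls));
        return found.containsAll(expectedUrls);
    }

    public static void main(String[] args) {
        // Stand-ins for the values returned by Outlink.getToUrl().
        String[] found = {
            "http://www-scf.usc.edu/~mattmann/",
            "http://www.nutch.org/"
        };
        List<String> expected = Arrays.asList(
            "http://www-scf.usc.edu/~mattmann/",
            "http://www.nutch.org/");
        System.out.println(containsAll(found, expected)); // prints "true"
    }
}
```

A set keeps the check to one call however many URLs are expected, at the cost of no longer reporting which URL was missing.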

Example 43 with Content

use of org.apache.nutch.protocol.Content in project nutch by apache.

the class TestImageMetadata method testIt.

@Test
public void testIt() throws ProtocolException, ParseException {
    String urlString;
    Protocol protocol;
    Content content;
    Parse parse;
    for (int i = 0; i < sampleFiles.length; i++) {
        urlString = "file:" + sampleDir + fileSeparator + sampleFiles[i];
        Configuration conf = NutchConfiguration.create();
        protocol = new ProtocolFactory(conf).getProtocol(urlString);
        content = protocol.getProtocolOutput(new Text(urlString), new CrawlDatum()).getContent();
        parse = new ParseUtil(conf).parseByExtensionId("parse-tika", content).get(content.getUrl());
        Assert.assertEquals("121", parse.getData().getMeta("width"));
        Assert.assertEquals("48", parse.getData().getMeta("height"));
    }
}
Also used : ProtocolFactory(org.apache.nutch.protocol.ProtocolFactory) NutchConfiguration(org.apache.nutch.util.NutchConfiguration) Configuration(org.apache.hadoop.conf.Configuration) ParseUtil(org.apache.nutch.parse.ParseUtil) Content(org.apache.nutch.protocol.Content) Parse(org.apache.nutch.parse.Parse) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Protocol(org.apache.nutch.protocol.Protocol) Test(org.junit.Test)

Example 44 with Content

use of org.apache.nutch.protocol.Content in project nutch by apache.

the class ZipParser method main.

public static void main(String[] args) throws IOException {
    if (args.length < 1) {
        System.out.println("ZipParser <zip_file>");
        System.exit(1);
    }
    File file = new File(args[0]);
    String url = "file:" + file.getCanonicalPath();
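    byte[] bytes = new byte[(int) file.length()];
    // Read the whole file: a single read() call may return fewer bytes than
    // requested, and available() is only an estimate.
    try (FileInputStream in = new FileInputStream(file)) {
        int offset = 0;
        while (offset < bytes.length) {
            int n = in.read(bytes, offset, bytes.length - offset);
            if (n < 0) {
                break;
            }
            offset += n;
        }
    }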
    Configuration conf = NutchConfiguration.create();
    ZipParser parser = new ZipParser();
    parser.setConf(conf);
    Metadata meta = new Metadata();
    meta.add(Response.CONTENT_LENGTH, "" + file.length());
    ParseResult parseResult = parser.getParse(new Content(url, url, bytes, "application/zip", meta, conf));
    Parse p = parseResult.get(url);
    System.out.println(parseResult.size());
    System.out.println("Parse Text:");
    System.out.println(p.getText());
    System.out.println("Parse Data:");
    System.out.println(p.getData());
}
Also used : NutchConfiguration(org.apache.nutch.util.NutchConfiguration) Configuration(org.apache.hadoop.conf.Configuration) ParseResult(org.apache.nutch.parse.ParseResult) Content(org.apache.nutch.protocol.Content) Parse(org.apache.nutch.parse.Parse) Metadata(org.apache.nutch.metadata.Metadata) File(java.io.File) FileInputStream(java.io.FileInputStream)
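
The main method above loads the whole zip file into a byte array before wrapping it in a Content object. As a minimal sketch of just that loading step, using only the JDK: the class ReadAllBytesSketch and the method readFully are hypothetical names and not part of Nutch. The loop is there because InputStream.read may return fewer bytes than the buffer can hold.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch: loads a stream completely into memory.
class ReadAllBytesSketch {

    // Reads until end of stream, copying each chunk into a growing buffer.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Write five bytes to a temporary file, then read them back.
        Path tmp = Files.createTempFile("sketch", ".bin");
        try {
            Files.write(tmp, new byte[] {1, 2, 3, 4, 5});
            try (InputStream in = Files.newInputStream(tmp)) {
                byte[] bytes = readFully(in);
                System.out.println(bytes.length); // prints 5
            }
        } finally {
            Files.delete(tmp);
        }
    }
}
```

For a regular file, java.nio.file.Files.readAllBytes does the same job in one call; the explicit loop is shown because it also works for streams whose length is not known in advance.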

Example 45 with Content

use of org.apache.nutch.protocol.Content in project nutch by apache.

the class TestHTMLLanguageParser method getContent.

private Content getContent(String text) {
    Metadata meta = new Metadata();
    meta.add("Content-Type", "text/html");
    return new Content(URL, BASE, text.getBytes(), "text/html", meta, NutchConfiguration.create());
}
Also used : Content(org.apache.nutch.protocol.Content) Metadata(org.apache.nutch.metadata.Metadata)
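
The getContent helper above calls text.getBytes() with no argument, which encodes the string with the JVM's default charset. A minimal sketch of the explicit alternative, using only the JDK, is shown below. The class ExplicitCharsetSketch and the method utf8Bytes are hypothetical names, and the choice of UTF-8 is an assumption, not something the original test states.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of Nutch.
class ExplicitCharsetSketch {

    // Encodes the text as UTF-8 regardless of the platform default charset.
    static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "h\u00e9llo" has 5 characters; the accented letter takes 2 bytes in UTF-8.
        byte[] bytes = utf8Bytes("h\u00e9llo");
        System.out.println(bytes.length); // prints 6
    }
}
```

Fixing the charset makes the bytes passed to the parser the same on every machine, which matters when the test text contains non-ASCII characters.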

Aggregations

Content (org.apache.nutch.protocol.Content) 51
Text (org.apache.hadoop.io.Text) 30
Parse (org.apache.nutch.parse.Parse) 29
CrawlDatum (org.apache.nutch.crawl.CrawlDatum) 27
Configuration (org.apache.hadoop.conf.Configuration) 23
Metadata (org.apache.nutch.metadata.Metadata) 23
NutchConfiguration (org.apache.nutch.util.NutchConfiguration) 22
ParseUtil (org.apache.nutch.parse.ParseUtil) 20
Test (org.junit.Test) 19
Protocol (org.apache.nutch.protocol.Protocol) 17
ProtocolFactory (org.apache.nutch.protocol.ProtocolFactory) 16
ParseData (org.apache.nutch.parse.ParseData) 8
ProtocolOutput (org.apache.nutch.protocol.ProtocolOutput) 8
ParseResult (org.apache.nutch.parse.ParseResult) 7
URL (java.net.URL) 6
File (java.io.File) 5
FileInputStream (java.io.FileInputStream) 5
IOException (java.io.IOException) 5
Outlink (org.apache.nutch.parse.Outlink) 5
HashMap (java.util.HashMap) 4