Examples with CrawlDatum - org.apache.nutch.crawl.CrawlDatum

Example 21 with CrawlDatum

use of org.apache.nutch.crawl.CrawlDatum in project nutch by apache.

the class MimeTypeIndexingFilterTest method testMissingConfigFile.

@Test
public void testMissingConfigFile() throws Exception {
    String file = conf.get(MimeTypeIndexingFilter.MIMEFILTER_REGEX_FILE, "");
    Assert.assertEquals(String.format("Property %s must not be present in the the configuration file", MimeTypeIndexingFilter.MIMEFILTER_REGEX_FILE), "", file);
    filter.setConf(conf);
    // property not set so in this cases all documents must pass the filter
    for (int i = 0; i < parses.length; i++) {
        NutchDocument doc = filter.filter(new NutchDocument(), parses[i], new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());
        Assert.assertNotNull("All documents must be allowed by default", doc);
    }
}

Also used : NutchDocument(org.apache.nutch.indexer.NutchDocument) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Inlinks(org.apache.nutch.crawl.Inlinks) Test(org.junit.Test)

Example 22 with CrawlDatum

use of org.apache.nutch.crawl.CrawlDatum in project nutch by apache.

the class TestExtParser method setUp.

@Before
protected void setUp() throws ProtocolException, IOException {
    // prepare a temp file with expectedText as its content
    // This system property is defined in ./src/plugin/build-plugin.xml
    String path = System.getProperty("test.data");
    if (path != null) {
        File tempDir = new File(path);
        if (!tempDir.exists())
            tempDir.mkdir();
        tempFile = File.createTempFile("nutch.test.plugin.ExtParser.", ".txt", tempDir);
    } else {
        // otherwise in java.io.tmpdir
        tempFile = File.createTempFile("nutch.test.plugin.ExtParser.", ".txt");
    }
    urlString = tempFile.toURI().toURL().toString();
    FileOutputStream fos = new FileOutputStream(tempFile);
    fos.write(expectedText.getBytes());
    fos.close();
    // get nutch content
    Protocol protocol = new ProtocolFactory(NutchConfiguration.create()).getProtocol(urlString);
    content = protocol.getProtocolOutput(new Text(urlString), new CrawlDatum()).getContent();
    protocol = null;
}

Also used : ProtocolFactory(org.apache.nutch.protocol.ProtocolFactory) FileOutputStream(java.io.FileOutputStream) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Protocol(org.apache.nutch.protocol.Protocol) File(java.io.File) Before(org.junit.Before)

Example 23 with CrawlDatum

use of org.apache.nutch.crawl.CrawlDatum in project nutch by apache.

the class TestRTFParser method testIt.

@Ignore("There seems to be an issue with line 71 e.g. text.trim()")
@Test
public void testIt() throws ProtocolException, ParseException {
    String urlString;
    Protocol protocol;
    Content content;
    Parse parse;
    Configuration conf = NutchConfiguration.create();
    urlString = "file:" + sampleDir + fileSeparator + rtfFile;
    protocol = new ProtocolFactory(conf).getProtocol(urlString);
    content = protocol.getProtocolOutput(new Text(urlString), new CrawlDatum()).getContent();
    parse = new ParseUtil(conf).parseByExtensionId("parse-tika", content).get(content.getUrl());
    String text = parse.getText();
    Assert.assertEquals("The quick brown fox jumps over the lazy dog", text.trim());
    String title = parse.getData().getTitle();
    Metadata meta = parse.getData().getParseMeta();
    Assert.assertEquals("test rft document", title);
    Assert.assertEquals("tests", meta.get(DublinCore.SUBJECT));
}

Also used : ProtocolFactory(org.apache.nutch.protocol.ProtocolFactory) NutchConfiguration(org.apache.nutch.util.NutchConfiguration) Configuration(org.apache.hadoop.conf.Configuration) ParseUtil(org.apache.nutch.parse.ParseUtil) Content(org.apache.nutch.protocol.Content) Parse(org.apache.nutch.parse.Parse) Metadata(org.apache.nutch.metadata.Metadata) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Protocol(org.apache.nutch.protocol.Protocol) Ignore(org.junit.Ignore) Test(org.junit.Test)

Example 24 with CrawlDatum

use of org.apache.nutch.crawl.CrawlDatum in project nutch by apache.

the class TestZipParser method testIt.

@Test
public void testIt() throws ProtocolException, ParseException {
    String urlString;
    Protocol protocol;
    Content content;
    Parse parse;
    Configuration conf = NutchConfiguration.create();
    for (int i = 0; i < sampleFiles.length; i++) {
        urlString = "file:" + sampleDir + fileSeparator + sampleFiles[i];
        protocol = new ProtocolFactory(conf).getProtocol(urlString);
        content = protocol.getProtocolOutput(new Text(urlString), new CrawlDatum()).getContent();
        parse = new ParseUtil(conf).parseByExtensionId("parse-zip", content).get(content.getUrl());
        Assert.assertTrue("Extracted text does not start with <" + expectedText + ">: <" + parse.getText() + ">", parse.getText().startsWith(expectedText));
    }
}

Also used : ProtocolFactory(org.apache.nutch.protocol.ProtocolFactory) NutchConfiguration(org.apache.nutch.util.NutchConfiguration) Configuration(org.apache.hadoop.conf.Configuration) ParseUtil(org.apache.nutch.parse.ParseUtil) Content(org.apache.nutch.protocol.Content) Parse(org.apache.nutch.parse.Parse) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Protocol(org.apache.nutch.protocol.Protocol) Test(org.junit.Test)

Example 25 with CrawlDatum

use of org.apache.nutch.crawl.CrawlDatum in project nutch by apache.

the class File method main.

/**
 * Quick way for running this class. Useful for debugging.
 */
public static void main(String[] args) throws Exception {
    int maxContentLength = Integer.MIN_VALUE;
    boolean dumpContent = false;
    String urlString = null;
    String usage = "Usage: File [-maxContentLength L] [-dumpContent] url";
    if (args.length == 0) {
        System.err.println(usage);
        System.exit(-1);
    }
    for (int i = 0; i < args.length; i++) {
        if (args[i].equals("-maxContentLength")) {
            maxContentLength = Integer.parseInt(args[++i]);
        } else if (args[i].equals("-dumpContent")) {
            dumpContent = true;
        } else if (i != args.length - 1) {
            System.err.println(usage);
            System.exit(-1);
        } else
            urlString = args[i];
    }
    File file = new File();
    file.setConf(NutchConfiguration.create());
    if (// set maxContentLength
    maxContentLength != Integer.MIN_VALUE)
        file.setMaxContentLength(maxContentLength);
    // set log level
    // LOG.setLevel(Level.parse((new String(logLevel)).toUpperCase()));
    ProtocolOutput output = file.getProtocolOutput(new Text(urlString), new CrawlDatum());
    Content content = output.getContent();
    System.err.println("URL: " + content.getUrl());
    System.err.println("Status: " + output.getStatus());
    System.err.println("Content-Type: " + content.getContentType());
    System.err.println("Content-Length: " + content.getMetadata().get(Response.CONTENT_LENGTH));
    System.err.println("Last-Modified: " + content.getMetadata().get(Response.LAST_MODIFIED));
    String redirectLocation = content.getMetadata().get("Location");
    if (redirectLocation != null) {
        System.err.println("Location: " + redirectLocation);
    }
    if (dumpContent) {
        System.out.print(new String(content.getContent()));
    }
    file = null;
}

Also used : ProtocolOutput(org.apache.nutch.protocol.ProtocolOutput) Content(org.apache.nutch.protocol.Content) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text)

Aggregations

CrawlDatum (org.apache.nutch.crawl.CrawlDatum)66 Text (org.apache.hadoop.io.Text)60 Test (org.junit.Test)31 Inlinks (org.apache.nutch.crawl.Inlinks)25 Configuration (org.apache.hadoop.conf.Configuration)24 ParseData (org.apache.nutch.parse.ParseData)22 ParseImpl (org.apache.nutch.parse.ParseImpl)21 NutchDocument (org.apache.nutch.indexer.NutchDocument)20 NutchConfiguration (org.apache.nutch.util.NutchConfiguration)20 Content (org.apache.nutch.protocol.Content)19 Parse (org.apache.nutch.parse.Parse)15 Metadata (org.apache.nutch.metadata.Metadata)14 ParseStatus (org.apache.nutch.parse.ParseStatus)14 ParseUtil (org.apache.nutch.parse.ParseUtil)13 Protocol (org.apache.nutch.protocol.Protocol)13 ProtocolFactory (org.apache.nutch.protocol.ProtocolFactory)13 URL (java.net.URL)11 Outlink (org.apache.nutch.parse.Outlink)11 IOException (java.io.IOException)7 ArrayList (java.util.ArrayList)5