Examples with NutchDocument - org.apache.nutch.indexer.NutchDocument

Example 1 with NutchDocument

Use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

In class TestElasticIndexWriter, method testBackoffPolicy.

@Test
public void testBackoffPolicy() throws IOException {
    // set a non-zero "max-retry" value, implying the cluster is saturated
    maxNumFailures = 5;
    conf.setInt(ElasticConstants.EXPONENTIAL_BACKOFF_RETRIES, maxNumFailures);
    int numDocs = 10;
    conf.setInt(ElasticConstants.MAX_BULK_DOCS, numDocs);
    Job job = Job.getInstance(conf);
    testIndexWriter.setConf(conf);
    testIndexWriter.open(conf, "name");
    NutchDocument doc = new NutchDocument();
    doc.add("id", "http://www.example.com");
    // pretend the remote cluster is "saturated"
    clusterSaturated = true;
    Assert.assertFalse(bulkRequestSuccessful);
    // write enough docs to initiate one bulk request
    for (int i = 0; i < numDocs; i++) {
        testIndexWriter.write(doc);
    }
    testIndexWriter.close();
    // the BulkProcessor should have retried `maxNumFailures + 1` times, then succeeded
    Assert.assertTrue(bulkRequestSuccessful);
}

Also used : NutchDocument(org.apache.nutch.indexer.NutchDocument) Job(org.apache.hadoop.mapreduce.Job) Test(org.junit.Test)
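
The retry behaviour that testBackoffPolicy exercises can be illustrated without Elasticsearch. The following is a minimal, self-contained sketch and not the Nutch or Elasticsearch implementation: the class BackoffSketch, its methods, and the doubling delay schedule are all hypothetical names and assumptions made for illustration. It only shows why a request that is rejected maxNumFailures times succeeds on attempt maxNumFailures + 1 when that many retries are allowed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; not part of Nutch or the Elasticsearch client.
class BackoffSketch {

    /** Delay before each retry: initialMillis, then doubling (assumed schedule). */
    static List<Long> exponentialDelays(long initialMillis, int maxRetries) {
        List<Long> delays = new ArrayList<>();
        long delay = initialMillis;
        for (int i = 0; i < maxRetries; i++) {
            delays.add(delay);
            delay *= 2;
        }
        return delays;
    }

    /**
     * Sends the request once, then retries up to maxRetries times.
     * Returns the number of attempts made, or -1 if every attempt failed.
     */
    static int attemptsUntilSuccess(BooleanSupplier request, int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries + 1; attempt++) {
            if (request.getAsBoolean()) {
                return attempt;
            }
            // a real client would wait for the next backoff delay here
        }
        return -1;
    }

    public static void main(String[] args) {
        int maxNumFailures = 5;
        int[] failuresLeft = { maxNumFailures };
        // stand-in for a "saturated" cluster: rejects the first maxNumFailures requests
        BooleanSupplier request = () -> failuresLeft[0]-- <= 0;
        System.out.println("attempts: " + attemptsUntilSuccess(request, maxNumFailures));
        System.out.println("delays (ms): " + exponentialDelays(50, maxNumFailures));
    }
}
```

If the retry budget were smaller than the number of rejections, attemptsUntilSuccess would return -1, which corresponds to the bulk request never being marked successful.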

Example 2 with NutchDocument

Use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

In class TestElasticIndexWriter, method testBulkMaxLength.

@Test
public void testBulkMaxLength() throws IOException {
    String key = "id";
    String value = "http://www.example.com";
    int defaultMaxBulkLength = conf.getInt(ElasticConstants.MAX_BULK_LENGTH, 2500500);
    // Test that MAX_BULK_LENGTH is respected by lowering it 10x
    int testMaxBulkLength = defaultMaxBulkLength / 10;
    // This number is somewhat arbitrary, but must be a function of:
    // - testMaxBulkLength
    // - approximate size of each doc
    int numDocs = testMaxBulkLength / (key.length() + value.length());
    conf.setInt(ElasticConstants.MAX_BULK_LENGTH, testMaxBulkLength);
    Job job = Job.getInstance(conf);
    testIndexWriter.setConf(conf);
    testIndexWriter.open(conf, "name");
    NutchDocument doc = new NutchDocument();
    doc.add(key, value);
    Assert.assertFalse(bulkRequestSuccessful);
    for (int i = 0; i < numDocs; i++) {
        testIndexWriter.write(doc);
    }
    testIndexWriter.close();
    Assert.assertTrue(bulkRequestSuccessful);
}

Also used : NutchDocument(org.apache.nutch.indexer.NutchDocument) Job(org.apache.hadoop.mapreduce.Job) Test(org.junit.Test)
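
The idea behind testBulkMaxLength is that a bulk request is sent once the buffered documents reach a size limit, not only a document count. The sketch below is a hypothetical, self-contained illustration of that flushing rule. BulkBatcherSketch and its size estimate (key length plus value length, as in the test's numDocs calculation) are assumptions; the real writer delegates this to the Elasticsearch bulk processor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of size- and count-based bulk flushing; not Nutch code.
class BulkBatcherSketch {
    private final int maxBulkDocs;
    private final int maxBulkLength;
    private final List<String> pending = new ArrayList<>();
    private int pendingLength = 0;
    private int flushCount = 0;

    BulkBatcherSketch(int maxBulkDocs, int maxBulkLength) {
        this.maxBulkDocs = maxBulkDocs;
        this.maxBulkLength = maxBulkLength;
    }

    /** Buffers one single-field document and flushes when either limit is reached. */
    void write(String key, String value) {
        pending.add(key + "=" + value);
        // rough size estimate: characters in key and value only
        pendingLength += key.length() + value.length();
        if (pending.size() >= maxBulkDocs || pendingLength >= maxBulkLength) {
            flush();
        }
    }

    /** "Sends" the buffered documents as one bulk request. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        flushCount++;
        pending.clear();
        pendingLength = 0;
    }

    /** Flushes whatever is still buffered, like closing the writer. */
    void close() {
        flush();
    }

    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        // a high doc limit, so only the length limit triggers flushes
        BulkBatcherSketch batcher = new BulkBatcherSketch(1000, 100);
        for (int i = 0; i < 12; i++) {
            batcher.write("id", "http://www.example.com");
        }
        batcher.close();
        System.out.println("bulk requests sent: " + batcher.flushCount());
    }
}
```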

Example 3 with NutchDocument

Use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

In class TestJexlIndexingFilter, method testBlockNotMatchingDocuments.

@Test
public void testBlockNotMatchingDocuments() throws Exception {
    Configuration conf = NutchConfiguration.create();
    conf.set("index.jexl.filter", "doc.lang=='en'");
    JexlIndexingFilter filter = new JexlIndexingFilter();
    filter.setConf(conf);
    Assert.assertNotNull(filter);
    NutchDocument doc = new NutchDocument();
    String title = "The Foo Page";
    Outlink[] outlinks = new Outlink[] { new Outlink("http://foo.com/", "Foo") };
    Metadata metaData = new Metadata();
    metaData.add("Language", "en/us");
    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, outlinks, metaData);
    ParseImpl parse = new ParseImpl("this is a sample foo bar page. hope you enjoy it.", parseData);
    CrawlDatum crawlDatum = new CrawlDatum();
    crawlDatum.setFetchTime(100L);
    Inlinks inlinks = new Inlinks();
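    // the document's lang is "ru", so the expression doc.lang=='en' does not match
    doc.add("lang", "ru");
    NutchDocument result = filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"), crawlDatum, inlinks);
    // a non-matching document is blocked: the filter returns null
    Assert.assertNull(result);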
}

Also used : Outlink(org.apache.nutch.parse.Outlink) NutchConfiguration(org.apache.nutch.util.NutchConfiguration) Configuration(org.apache.hadoop.conf.Configuration) NutchDocument(org.apache.nutch.indexer.NutchDocument) ParseData(org.apache.nutch.parse.ParseData) Metadata(org.apache.nutch.metadata.Metadata) ParseImpl(org.apache.nutch.parse.ParseImpl) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Inlinks(org.apache.nutch.crawl.Inlinks) Test(org.junit.Test)

Example 4 with NutchDocument

Use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

In class TestLinksIndexingFilter, method testFilterInlinks.

@Test
public void testFilterInlinks() throws Exception {
    // keep only inlinks that come from a different host than the page itself
    conf.set(LinksIndexingFilter.LINKS_INLINKS_HOST, "true");
    filter.setConf(conf);
    Inlinks inlinks = new Inlinks();
    inlinks.add(new Inlink("http://www.test.com", "test"));
    inlinks.add(new Inlink("http://www.example.com", "example"));
    NutchDocument doc = filter.filter(new NutchDocument(), new ParseImpl("text", new ParseData(new ParseStatus(), "title", new Outlink[0], metadata)), new Text("http://www.example.com/"), new CrawlDatum(), inlinks);
    Assert.assertEquals(1, doc.getField("inlinks").getValues().size());
    Assert.assertEquals("Filter inlinks, allow only those from a different host", "http://www.test.com", doc.getFieldValue("inlinks"));
}

Also used : ParseStatus(org.apache.nutch.parse.ParseStatus) NutchDocument(org.apache.nutch.indexer.NutchDocument) ParseData(org.apache.nutch.parse.ParseData) ParseImpl(org.apache.nutch.parse.ParseImpl) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Inlinks(org.apache.nutch.crawl.Inlinks) Inlink(org.apache.nutch.crawl.Inlink) Test(org.junit.Test)

Example 5 with NutchDocument

Use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

In class TestLinksIndexingFilter, method testIndexHostsOnlyAndFilterOutlinks.

@Test
public void testIndexHostsOnlyAndFilterOutlinks() throws Exception {
    conf = NutchConfiguration.create();
    // index only the host portion of each link
    conf.set(LinksIndexingFilter.LINKS_ONLY_HOSTS, "true");
    // filter outlinks by host before indexing
    conf.set(LinksIndexingFilter.LINKS_OUTLINKS_HOST, "true");
    Outlink[] outlinks = generateOutlinks(true);
    filter.setConf(conf);
    NutchDocument doc = filter.filter(new NutchDocument(), new ParseImpl("text", new ParseData(new ParseStatus(), "title", outlinks, metadata)), new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());
    Assert.assertEquals(1, doc.getField("outlinks").getValues().size());
    Assert.assertEquals("Index only the host portion of the outlinks after filtering", new URL("http://www.test.com").getHost(), doc.getFieldValue("outlinks"));
}

Also used : Outlink(org.apache.nutch.parse.Outlink) ParseStatus(org.apache.nutch.parse.ParseStatus) NutchDocument(org.apache.nutch.indexer.NutchDocument) ParseData(org.apache.nutch.parse.ParseData) ParseImpl(org.apache.nutch.parse.ParseImpl) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Inlinks(org.apache.nutch.crawl.Inlinks) URL(java.net.URL) Test(org.junit.Test)
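
The two TestLinksIndexingFilter examples above assert on host-based behaviour: links from the page's own host are dropped, and optionally only the host portion of each remaining link is kept. The sketch below reproduces that logic with plain java.net.URI. It is a hypothetical illustration, not the LinksIndexingFilter source; the class and method names are assumptions.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of host-based link filtering; not the Nutch implementation.
class HostFilterSketch {

    /** Keeps only links whose host differs from the host of pageUrl. */
    static List<String> linksFromOtherHosts(String pageUrl, List<String> links) {
        String pageHost = URI.create(pageUrl).getHost();
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            String host = URI.create(link).getHost();
            if (host != null && !host.equalsIgnoreCase(pageHost)) {
                kept.add(link);
            }
        }
        return kept;
    }

    /** Reduces each link to its host portion, skipping links without a host. */
    static List<String> hostsOnly(List<String> links) {
        List<String> hosts = new ArrayList<>();
        for (String link : links) {
            String host = URI.create(link).getHost();
            if (host != null) {
                hosts.add(host);
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        List<String> inlinks = List.of("http://www.test.com", "http://www.example.com");
        List<String> filtered = linksFromOtherHosts("http://www.example.com/", inlinks);
        System.out.println(filtered);
        System.out.println(hostsOnly(filtered));
    }
}
```

With the same inputs as testFilterInlinks, only the link from www.test.com survives the filter, matching the single value the test expects in the "inlinks" field.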

Aggregations

NutchDocument (org.apache.nutch.indexer.NutchDocument) 42
Test (org.junit.Test) 36
Text (org.apache.hadoop.io.Text) 21
CrawlDatum (org.apache.nutch.crawl.CrawlDatum) 21
Inlinks (org.apache.nutch.crawl.Inlinks) 21
Configuration (org.apache.hadoop.conf.Configuration) 19
NutchConfiguration (org.apache.nutch.util.NutchConfiguration) 19
ParseData (org.apache.nutch.parse.ParseData) 17
ParseImpl (org.apache.nutch.parse.ParseImpl) 17
ParseStatus (org.apache.nutch.parse.ParseStatus) 11
Outlink (org.apache.nutch.parse.Outlink) 10
Metadata (org.apache.nutch.metadata.Metadata) 8
Inlink (org.apache.nutch.crawl.Inlink) 5
URL (java.net.URL) 3
Date (java.util.Date) 3
Job (org.apache.hadoop.mapreduce.Job) 3
IndexingException (org.apache.nutch.indexer.IndexingException) 2
BasicIndexingFilter (org.apache.nutch.indexer.basic.BasicIndexingFilter) 2
BufferedReader (java.io.BufferedReader) 1
IOException (java.io.IOException) 1