Examples with NutchDocument - org.apache.nutch.indexer.NutchDocument

Example 16 with NutchDocument

use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

the class TestElasticIndexWriter method testBackoffPolicy.

@Test
public void testBackoffPolicy() throws IOException {
    // set a non-zero "max-retry" value, implying the cluster is saturated
    maxNumFailures = 5;
    conf.setInt(ElasticConstants.EXPONENTIAL_BACKOFF_RETRIES, maxNumFailures);
    int numDocs = 10;
    conf.setInt(ElasticConstants.MAX_BULK_DOCS, numDocs);
    Job job = Job.getInstance(conf);
    testIndexWriter.setConf(conf);
    testIndexWriter.open(conf, "name");
    NutchDocument doc = new NutchDocument();
    doc.add("id", "http://www.example.com");
    // pretend the remote cluster is "saturated"
    clusterSaturated = true;
    Assert.assertFalse(bulkRequestSuccessful);
    // write enough docs to initiate one bulk request
    for (int i = 0; i < numDocs; i++) {
        testIndexWriter.write(doc);
    }
    testIndexWriter.close();
    // the BulkProcessor should have retried `maxNumFailures + 1` times, then succeeded
    Assert.assertTrue(bulkRequestSuccessful);
}

Also used : NutchDocument(org.apache.nutch.indexer.NutchDocument) Job(org.apache.hadoop.mapreduce.Job) Test(org.junit.Test)
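
The retry behaviour this test expects (one initial attempt plus up to `maxNumFailures` retries, with a growing wait between them) can be sketched roughly as follows. This is only an illustration under stated assumptions: `BackoffSketch`, `delays`, and `attemptsUntilSuccess` are hypothetical names, the doubling schedule is an assumption, and none of this is the Elasticsearch or Nutch implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of exponential backoff; not Elasticsearch or Nutch code.
class BackoffSketch {

    // Delay before each retry: initialMillis first, then doubled every time (assumed schedule).
    static List<Long> delays(long initialMillis, int maxRetries) {
        List<Long> result = new ArrayList<>();
        long delay = initialMillis;
        for (int i = 0; i < maxRetries; i++) {
            result.add(delay);
            delay *= 2;
        }
        return result;
    }

    // Runs the attempt once, then retries up to maxRetries times, sleeping between tries.
    // Returns the total number of attempts made, or -1 if every attempt failed.
    static int attemptsUntilSuccess(BooleanSupplier attempt, long initialMillis, int maxRetries) {
        if (attempt.getAsBoolean()) {
            return 1;
        }
        int attempts = 1;
        for (long delay : delays(initialMillis, maxRetries)) {
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return -1;
            }
            attempts++;
            if (attempt.getAsBoolean()) {
                return attempts;
            }
        }
        return -1;
    }
}
```

With five retries allowed, an operation that fails five times and then succeeds is reported as six attempts, which matches the `maxNumFailures + 1` count mentioned in the test comment.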

Example 17 with NutchDocument

use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

the class TestElasticIndexWriter method testBulkMaxLength.

@Test
public void testBulkMaxLength() throws IOException {
    String key = "id";
    String value = "http://www.example.com";
    int defaultMaxBulkLength = conf.getInt(ElasticConstants.MAX_BULK_LENGTH, 2500500);
    // Test that MAX_BULK_LENGTH is respected by lowering it 10x
    int testMaxBulkLength = defaultMaxBulkLength / 10;
    // This number is somewhat arbitrary, but must be a function of:
    // - testMaxBulkLength
    // - approximate size of each doc
    int numDocs = testMaxBulkLength / (key.length() + value.length());
    conf.setInt(ElasticConstants.MAX_BULK_LENGTH, testMaxBulkLength);
    Job job = Job.getInstance(conf);
    testIndexWriter.setConf(conf);
    testIndexWriter.open(conf, "name");
    NutchDocument doc = new NutchDocument();
    doc.add(key, value);
    Assert.assertFalse(bulkRequestSuccessful);
    for (int i = 0; i < numDocs; i++) {
        testIndexWriter.write(doc);
    }
    testIndexWriter.close();
    Assert.assertTrue(bulkRequestSuccessful);
}

Also used : NutchDocument(org.apache.nutch.indexer.NutchDocument) Job(org.apache.hadoop.mapreduce.Job) Test(org.junit.Test)
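
The idea being tested (a bulk request is sent once the buffered documents reach a size limit, and `close()` sends whatever is left) can be sketched as below. This is a minimal illustration, not the writer's actual logic: `BulkBufferSketch` and its methods are hypothetical, and measuring a document as `key.length() + value.length()` simply mirrors the rough estimate used in the test above.

```java
// Hypothetical sketch of size-triggered bulk flushing; not the Nutch ElasticIndexWriter.
class BulkBufferSketch {
    private final int maxBulkDocs;
    private final int maxBulkLength;
    private int bufferedDocs = 0;
    private int bufferedLength = 0;
    private int flushes = 0;

    BulkBufferSketch(int maxBulkDocs, int maxBulkLength) {
        this.maxBulkDocs = maxBulkDocs;
        this.maxBulkLength = maxBulkLength;
    }

    // Buffers one single-field document and flushes when either limit is reached.
    void write(String key, String value) {
        bufferedDocs++;
        bufferedLength += key.length() + value.length(); // rough size estimate (assumption)
        if (bufferedDocs >= maxBulkDocs || bufferedLength >= maxBulkLength) {
            flush();
        }
    }

    // Sends any remaining buffered documents.
    void close() {
        if (bufferedDocs > 0) {
            flush();
        }
    }

    // Stands in for sending one bulk request; here it only counts and resets the buffer.
    private void flush() {
        flushes++;
        bufferedDocs = 0;
        bufferedLength = 0;
    }

    int flushCount() {
        return flushes;
    }
}
```

In this sketch, lowering the length limit makes a flush happen after fewer documents, which is the effect the test relies on when it divides the default limit by ten.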

Example 18 with NutchDocument

use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

the class MimeTypeIndexingFilterTest method testAllowOnlyImages.

@Test
public void testAllowOnlyImages() throws Exception {
    conf.set(MimeTypeIndexingFilter.MIMEFILTER_REGEX_FILE, "allow-images.txt");
    filter.setConf(conf);
    for (int i = 0; i < parses.length; i++) {
        NutchDocument doc = filter.filter(new NutchDocument(), parses[i], new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());
        if (MIME_TYPES[i].contains("image")) {
            Assert.assertNotNull("Allow only images", doc);
        } else {
            Assert.assertNull("Block everything else", doc);
        }
    }
}

Also used : NutchDocument(org.apache.nutch.indexer.NutchDocument) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Inlinks(org.apache.nutch.crawl.Inlinks) Test(org.junit.Test)

Example 19 with NutchDocument

use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

the class MimeTypeIndexingFilterTest method testBlockHTML.

@Test
public void testBlockHTML() throws Exception {
    conf.set(MimeTypeIndexingFilter.MIMEFILTER_REGEX_FILE, "block-html.txt");
    filter.setConf(conf);
    for (int i = 0; i < parses.length; i++) {
        NutchDocument doc = filter.filter(new NutchDocument(), parses[i], new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());
        if (MIME_TYPES[i].contains("html")) {
            Assert.assertNull("Block only HTML documents", doc);
        } else {
            Assert.assertNotNull("Allow everything else", doc);
        }
    }
}

Example 20 with NutchDocument

use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

the class MimeTypeIndexingFilterTest method testMissingConfigFile.

@Test
public void testMissingConfigFile() throws Exception {
    String file = conf.get(MimeTypeIndexingFilter.MIMEFILTER_REGEX_FILE, "");
    Assert.assertEquals(String.format("Property %s must not be present in the configuration file", MimeTypeIndexingFilter.MIMEFILTER_REGEX_FILE), "", file);
    filter.setConf(conf);
    // property not set, so in this case all documents must pass the filter
    for (int i = 0; i < parses.length; i++) {
        NutchDocument doc = filter.filter(new NutchDocument(), parses[i], new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());
        Assert.assertNotNull("All documents must be allowed by default", doc);
    }
}
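
The three MimeTypeIndexingFilterTest methods above share one contract: the filter returns the document when its MIME type is allowed, returns null when it is blocked, and lets everything through when no rule is configured. A minimal sketch of that contract follows. `MimeRuleSketch` is a hypothetical class using a single regular expression and an allow/block flag; the format of the real rules files (`allow-images.txt`, `block-html.txt`) is not shown in the source and is not reproduced here.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of an allow/block MIME-type rule; not the Nutch MimeTypeIndexingFilter.
class MimeRuleSketch {
    private final Pattern pattern;       // null means "no rule configured"
    private final boolean allowOnMatch;  // true: keep only matches; false: drop matches

    MimeRuleSketch(String regex, boolean allowOnMatch) {
        this.pattern = (regex == null) ? null : Pattern.compile(regex);
        this.allowOnMatch = allowOnMatch;
    }

    // Returns the document unchanged if it passes, or null if it is filtered out.
    <T> T filter(T doc, String mimeType) {
        if (pattern == null) {
            return doc; // no rule: every document is allowed by default
        }
        boolean matches = pattern.matcher(mimeType).find();
        return (matches == allowOnMatch) ? doc : null;
    }
}
```

For example, `new MimeRuleSketch("image", true)` keeps a document typed `image/png` and drops one typed `text/html`, while `new MimeRuleSketch("html", false)` does the opposite for HTML.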

Aggregations

NutchDocument (org.apache.nutch.indexer.NutchDocument): 37
Test (org.junit.Test): 33
Text (org.apache.hadoop.io.Text): 20
CrawlDatum (org.apache.nutch.crawl.CrawlDatum): 20
Inlinks (org.apache.nutch.crawl.Inlinks): 20
Configuration (org.apache.hadoop.conf.Configuration): 17
NutchConfiguration (org.apache.nutch.util.NutchConfiguration): 17
ParseData (org.apache.nutch.parse.ParseData): 16
ParseImpl (org.apache.nutch.parse.ParseImpl): 16
ParseStatus (org.apache.nutch.parse.ParseStatus): 10
Outlink (org.apache.nutch.parse.Outlink): 9
Metadata (org.apache.nutch.metadata.Metadata): 7
Inlink (org.apache.nutch.crawl.Inlink): 5
URL (java.net.URL): 3
Job (org.apache.hadoop.mapreduce.Job): 3
IndexingException (org.apache.nutch.indexer.IndexingException): 2
BasicIndexingFilter (org.apache.nutch.indexer.basic.BasicIndexingFilter): 2
BufferedReader (java.io.BufferedReader): 1
IOException (java.io.IOException): 1
InputStreamReader (java.io.InputStreamReader): 1