Search in sources :

Example 1 with BasicIndexingFilter

use of org.apache.nutch.indexer.basic.BasicIndexingFilter in project nutch by apache.

the class TestBasicIndexingFilter method testBasicIndexingFilter.

@Test
public void testBasicIndexingFilter() throws Exception {
    Configuration conf = NutchConfiguration.create();
    conf.setInt("indexer.max.title.length", 10);
    conf.setBoolean("indexer.add.domain", true);
    conf.setInt("indexer.max.content.length", 20);
    BasicIndexingFilter filter = new BasicIndexingFilter();
    filter.setConf(conf);
    Assert.assertNotNull(filter);
    NutchDocument doc = new NutchDocument();
    String title = "The Foo Page";
    Outlink[] outlinks = new Outlink[] { new Outlink("http://foo.com/", "Foo") };
    Metadata metaData = new Metadata();
    metaData.add("Language", "en/us");
    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title, outlinks, metaData);
    ParseImpl parse = new ParseImpl("this is a sample foo bar page. hope you enjoy it.", parseData);
    CrawlDatum crawlDatum = new CrawlDatum();
    crawlDatum.setFetchTime(100L);
    Inlinks inlinks = new Inlinks();
    try {
        filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"), crawlDatum, inlinks);
    } catch (Exception e) {
        e.printStackTrace();
        Assert.fail(e.getMessage());
    }
    Assert.assertNotNull(doc);
    Assert.assertEquals("test title, expect \"The Foo Pa\"", "The Foo Pa", doc.getField("title").getValues().get(0));
    Assert.assertEquals("test domain, expect \"apache.org\"", "apache.org", doc.getField("domain").getValues().get(0));
    Assert.assertEquals("test host, expect \"nutch.apache.org\"", "nutch.apache.org", doc.getField("host").getValues().get(0));
    Assert.assertEquals("test url, expect \"http://nutch.apache.org/index.html\"", "http://nutch.apache.org/index.html", doc.getField("url").getValues().get(0));
    Assert.assertEquals("test content", "this is a sample foo", doc.getField("content").getValues().get(0));
    Assert.assertEquals("test fetch time", new Date(100L), (Date) doc.getField("tstamp").getValues().get(0));
}
Also used : Outlink(org.apache.nutch.parse.Outlink) NutchConfiguration(org.apache.nutch.util.NutchConfiguration) Configuration(org.apache.hadoop.conf.Configuration) NutchDocument(org.apache.nutch.indexer.NutchDocument) Metadata(org.apache.nutch.metadata.Metadata) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Inlinks(org.apache.nutch.crawl.Inlinks) Date(java.util.Date) ParseData(org.apache.nutch.parse.ParseData) ParseImpl(org.apache.nutch.parse.ParseImpl) BasicIndexingFilter(org.apache.nutch.indexer.basic.BasicIndexingFilter) Test(org.junit.Test)

Example 2 with BasicIndexingFilter

use of org.apache.nutch.indexer.basic.BasicIndexingFilter in project nutch by apache.

the class TestIndexReplace method parseAndFilterFile.

/**
 * Run a test file through the Nutch parser and index filters.
 *
 * @param fileName
 * @param conf
 * @return the Nutch document with the replace indexer applied
 */
public NutchDocument parseAndFilterFile(String fileName, Configuration conf) {
    NutchDocument doc = new NutchDocument();
    BasicIndexingFilter basicIndexer = new BasicIndexingFilter();
    basicIndexer.setConf(conf);
    Assert.assertNotNull(basicIndexer);
    MetadataIndexer metaIndexer = new MetadataIndexer();
    metaIndexer.setConf(conf);
    Assert.assertNotNull(basicIndexer);
    ReplaceIndexer replaceIndexer = new ReplaceIndexer();
    replaceIndexer.setConf(conf);
    Assert.assertNotNull(replaceIndexer);
    try {
        String urlString = "file:" + sampleDir + fileSeparator + fileName;
        Text text = new Text(urlString);
        CrawlDatum crawlDatum = new CrawlDatum();
        Protocol protocol = new ProtocolFactory(conf).getProtocol(urlString);
        Content content = protocol.getProtocolOutput(text, crawlDatum).getContent();
        Parse parse = new ParseUtil(conf).parse(content).get(content.getUrl());
        crawlDatum.setFetchTime(100L);
        Inlinks inlinks = new Inlinks();
        doc = basicIndexer.filter(doc, parse, text, crawlDatum, inlinks);
        doc = metaIndexer.filter(doc, parse, text, crawlDatum, inlinks);
        doc = replaceIndexer.filter(doc, parse, text, crawlDatum, inlinks);
    } catch (Exception e) {
        e.printStackTrace();
        Assert.fail(e.toString());
    }
    return doc;
}
Also used : NutchDocument(org.apache.nutch.indexer.NutchDocument) ParseUtil(org.apache.nutch.parse.ParseUtil) Parse(org.apache.nutch.parse.Parse) MetadataIndexer(org.apache.nutch.indexer.metadata.MetadataIndexer) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Inlinks(org.apache.nutch.crawl.Inlinks) ProtocolFactory(org.apache.nutch.protocol.ProtocolFactory) Content(org.apache.nutch.protocol.Content) BasicIndexingFilter(org.apache.nutch.indexer.basic.BasicIndexingFilter) Protocol(org.apache.nutch.protocol.Protocol)

Aggregations

Text (org.apache.hadoop.io.Text)2 CrawlDatum (org.apache.nutch.crawl.CrawlDatum)2 Inlinks (org.apache.nutch.crawl.Inlinks)2 NutchDocument (org.apache.nutch.indexer.NutchDocument)2 BasicIndexingFilter (org.apache.nutch.indexer.basic.BasicIndexingFilter)2 Date (java.util.Date)1 Configuration (org.apache.hadoop.conf.Configuration)1 MetadataIndexer (org.apache.nutch.indexer.metadata.MetadataIndexer)1 Metadata (org.apache.nutch.metadata.Metadata)1 Outlink (org.apache.nutch.parse.Outlink)1 Parse (org.apache.nutch.parse.Parse)1 ParseData (org.apache.nutch.parse.ParseData)1 ParseImpl (org.apache.nutch.parse.ParseImpl)1 ParseUtil (org.apache.nutch.parse.ParseUtil)1 Content (org.apache.nutch.protocol.Content)1 Protocol (org.apache.nutch.protocol.Protocol)1 ProtocolFactory (org.apache.nutch.protocol.ProtocolFactory)1 NutchConfiguration (org.apache.nutch.util.NutchConfiguration)1 Test (org.junit.Test)1