Example 36 with ParseData

Use of org.apache.nutch.parse.ParseData in project nutch by apache.

The class TestIndexingFilters, method testFilterCacheIndexingFilter:

/**
 * Test that resetting the indexing filter order does not take effect,
 * because filter instances are cached per configuration.
 *
 * @throws IndexingException
 */
@Test
public void testFilterCacheIndexingFilter() throws IndexingException {
    Configuration conf = NutchConfiguration.create();
    conf.addResource("nutch-default.xml");
    conf.addResource("crawl-tests.xml");
    String class1 = "org.apache.nutch.indexer.basic.BasicIndexingFilter";
    conf.set(IndexingFilters.INDEXINGFILTER_ORDER, class1);
    IndexingFilters filters1 = new IndexingFilters(conf);
    NutchDocument fdoc1 = filters1.filter(new NutchDocument(), new ParseImpl("text", new ParseData(new ParseStatus(), "title", new Outlink[0], new Metadata())), new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());
    // add another index filter
    String class2 = "org.apache.nutch.indexer.metadata.MetadataIndexer";
    // set content metadata
    Metadata md = new Metadata();
    md.add("example", "data");
    // set content metadata property defined in MetadataIndexer
    conf.set("index.content.md", "example");
    // add MetadataIndexer filter
    conf.set(IndexingFilters.INDEXINGFILTER_ORDER, class1 + " " + class2);
    IndexingFilters filters2 = new IndexingFilters(conf);
    NutchDocument fdoc2 = filters2.filter(new NutchDocument(), new ParseImpl("text", new ParseData(new ParseStatus(), "title", new Outlink[0], md)), new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());
    Assert.assertEquals(fdoc1.getFieldNames().size(), fdoc2.getFieldNames().size());
}
Also used: ParseStatus (org.apache.nutch.parse.ParseStatus), NutchConfiguration (org.apache.nutch.util.NutchConfiguration), Configuration (org.apache.hadoop.conf.Configuration), ParseData (org.apache.nutch.parse.ParseData), Metadata (org.apache.nutch.metadata.Metadata), ParseImpl (org.apache.nutch.parse.ParseImpl), CrawlDatum (org.apache.nutch.crawl.CrawlDatum), Text (org.apache.hadoop.io.Text), Inlinks (org.apache.nutch.crawl.Inlinks), Test (org.junit.Test)
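The assertion at the end of the test only passes because Nutch caches plugin instances per configuration object, so setting a new filter order on an already-used Configuration has no effect. A minimal self-contained sketch of that caching behaviour (hypothetical class and property names, plain Java maps standing in for Nutch's Configuration and ObjectCache):

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class FilterCacheDemo {

    // Cache keyed on the identity of the config object, mimicking a
    // per-configuration object cache: the first lookup wins.
    static final Map<Map<String, String>, List<String>> CACHE = new IdentityHashMap<>();

    // Returns the filter chain for a config, computing it only once.
    static List<String> filtersFor(Map<String, String> conf) {
        return CACHE.computeIfAbsent(conf,
                c -> Arrays.asList(c.get("indexingfilter.order").split("\\s+")));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new IdentityHashMap<>();
        conf.put("indexingfilter.order", "BasicIndexingFilter");
        List<String> first = filtersFor(conf);

        // Resetting the order on the same config object is ignored,
        // because the cached chain is returned.
        conf.put("indexingfilter.order", "BasicIndexingFilter MetadataIndexer");
        List<String> second = filtersFor(conf);

        System.out.println(first.size() == second.size()); // prints true
    }
}
```

This is why the test compares field counts of `fdoc1` and `fdoc2`: despite `filters2` being built with two filter classes in the order property, the cached single-filter chain is reused, so both documents end up with the same fields.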

Example 37 with ParseData

Use of org.apache.nutch.parse.ParseData in project nutch by apache.

The class SegmentReader, method getStats:

public void getStats(Path segment, final SegmentReaderStats stats) throws Exception {
    long cnt = 0L;
    Text key = new Text();
    Text val = new Text();
    FileSystem fs = segment.getFileSystem(getConf());
    // ge, fe, pd are SegmentReader fields selecting which segment
    // subdirectories to scan: generate, fetch, and parse data.
    if (ge) {
        SequenceFile.Reader[] readers = SegmentReaderUtil.getReaders(new Path(segment, CrawlDatum.GENERATE_DIR_NAME), getConf());
        for (int i = 0; i < readers.length; i++) {
            while (readers[i].next(key, val)) cnt++;
            readers[i].close();
        }
        stats.generated = cnt;
    }
    if (fe) {
        Path fetchDir = new Path(segment, CrawlDatum.FETCH_DIR_NAME);
        if (fs.exists(fetchDir) && fs.getFileStatus(fetchDir).isDirectory()) {
            cnt = 0L;
            long start = Long.MAX_VALUE;
            long end = Long.MIN_VALUE;
            CrawlDatum value = new CrawlDatum();
            MapFile.Reader[] mreaders = MapFileOutputFormat.getReaders(fetchDir, getConf());
            for (int i = 0; i < mreaders.length; i++) {
                while (mreaders[i].next(key, value)) {
                    cnt++;
                    if (value.getFetchTime() < start)
                        start = value.getFetchTime();
                    if (value.getFetchTime() > end)
                        end = value.getFetchTime();
                }
                mreaders[i].close();
            }
            stats.start = start;
            stats.end = end;
            stats.fetched = cnt;
        }
    }
    if (pd) {
        Path parseDir = new Path(segment, ParseData.DIR_NAME);
        if (fs.exists(parseDir) && fs.getFileStatus(parseDir).isDirectory()) {
            cnt = 0L;
            long errors = 0L;
            ParseData value = new ParseData();
            MapFile.Reader[] mreaders = MapFileOutputFormat.getReaders(parseDir, getConf());
            for (int i = 0; i < mreaders.length; i++) {
                while (mreaders[i].next(key, value)) {
                    cnt++;
                    if (!value.getStatus().isSuccess())
                        errors++;
                }
                mreaders[i].close();
            }
            stats.parsed = cnt;
            stats.parseErrors = errors;
        }
    }
}
Also used: Path (org.apache.hadoop.fs.Path), ParseData (org.apache.nutch.parse.ParseData), FileSystem (org.apache.hadoop.fs.FileSystem), InputStreamReader (java.io.InputStreamReader), BufferedReader (java.io.BufferedReader), CrawlDatum (org.apache.nutch.crawl.CrawlDatum), Text (org.apache.hadoop.io.Text), ParseText (org.apache.nutch.parse.ParseText)
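Each branch of getStats follows the same fold: iterate every reader in the directory, count entries, and accumulate extra statistics along the way (min/max fetch time for the fetch directory, error count for the parse directory). The fetch-time aggregation can be sketched without the Hadoop MapFile types, using a plain array of timestamps (hypothetical demo class):

```java
public class FetchStatsDemo {

    // Fold over fetch times exactly as the loop over MapFile readers does:
    // count records and track the earliest and latest fetch time.
    // Returns { count, start, end }.
    static long[] fetchStats(long[] fetchTimes) {
        long cnt = 0L;
        long start = Long.MAX_VALUE;
        long end = Long.MIN_VALUE;
        for (long t : fetchTimes) {
            cnt++;
            if (t < start)
                start = t;
            if (t > end)
                end = t;
        }
        return new long[] { cnt, start, end };
    }

    public static void main(String[] args) {
        long[] stats = fetchStats(new long[] { 170L, 120L, 150L });
        // prints: 3 fetched, start=120 end=150
        System.out.println(stats[0] + " fetched, start=" + stats[1] + " end=" + stats[2]);
    }
}
```

Note that, as in getStats, an empty input leaves start at Long.MAX_VALUE and end at Long.MIN_VALUE; callers that format these values (e.g. as dates) need to handle segments with no fetched records.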

Aggregations

ParseData (org.apache.nutch.parse.ParseData): 37 uses
ParseImpl (org.apache.nutch.parse.ParseImpl): 29 uses
Text (org.apache.hadoop.io.Text): 23 uses
ParseStatus (org.apache.nutch.parse.ParseStatus): 23 uses
CrawlDatum (org.apache.nutch.crawl.CrawlDatum): 22 uses
Outlink (org.apache.nutch.parse.Outlink): 22 uses
Inlinks (org.apache.nutch.crawl.Inlinks): 19 uses
Metadata (org.apache.nutch.metadata.Metadata): 19 uses
Test (org.junit.Test): 19 uses
NutchDocument (org.apache.nutch.indexer.NutchDocument): 16 uses
Configuration (org.apache.hadoop.conf.Configuration): 14 uses
NutchConfiguration (org.apache.nutch.util.NutchConfiguration): 14 uses
Parse (org.apache.nutch.parse.Parse): 9 uses
URL (java.net.URL): 7 uses
ArrayList (java.util.ArrayList): 6 uses
ParseResult (org.apache.nutch.parse.ParseResult): 6 uses
ByteArrayInputStream (java.io.ByteArrayInputStream): 5 uses
IOException (java.io.IOException): 5 uses
Inlink (org.apache.nutch.crawl.Inlink): 5 uses
Content (org.apache.nutch.protocol.Content): 5 uses