Search in sources :

Example 36 with NutchDocument

use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

the class TestElasticIndexWriter method testBulkMaxDocs.

@Test
public void testBulkMaxDocs() throws IOException {
    int numDocs = 10;
    conf.setInt(ElasticConstants.MAX_BULK_DOCS, numDocs);
    Job job = Job.getInstance(conf);
    testIndexWriter.setConf(conf);
    testIndexWriter.open(conf, "name");
    NutchDocument doc = new NutchDocument();
    doc.add("id", "http://www.example.com");
    Assert.assertFalse(bulkRequestSuccessful);
    for (int i = 0; i < numDocs; i++) {
        testIndexWriter.write(doc);
    }
    testIndexWriter.close();
    Assert.assertTrue(bulkRequestSuccessful);
}
Also used : NutchDocument(org.apache.nutch.indexer.NutchDocument) Job(org.apache.hadoop.mapreduce.Job) Test(org.junit.Test)

Example 37 with NutchDocument

use of org.apache.nutch.indexer.NutchDocument in project nutch by apache.

the class MimeTypeIndexingFilter method main.

/**
 * Main method for invoking this tool
 *
 * @throws IOException
 * @throws IndexingException
 */
public static void main(String[] args) throws IOException, IndexingException {
    Option helpOpt = new Option("h", "help", false, "show this help message");
    Option rulesOpt = OptionBuilder.withArgName("file").hasArg().withDescription("Rules file to be used in the tests relative to the conf directory").isRequired().create("rules");
    Options options = new Options();
    options.addOption(helpOpt).addOption(rulesOpt);
    CommandLineParser parser = new GnuParser();
    HelpFormatter formatter = new HelpFormatter();
    String rulesFile;
    try {
        CommandLine line = parser.parse(options, args);
        if (line.hasOption("help") || !line.hasOption("rules")) {
            formatter.printHelp("org.apache.nutch.indexer.filter.MimeTypeIndexingFilter", options, true);
            return;
        }
        rulesFile = line.getOptionValue("rules");
    } catch (UnrecognizedOptionException e) {
        formatter.printHelp("org.apache.nutch.indexer.filter.MimeTypeIndexingFilter", options, true);
        return;
    } catch (Exception e) {
        LOG.error(StringUtils.stringifyException(e));
        e.printStackTrace();
        return;
    }
    MimeTypeIndexingFilter filter = new MimeTypeIndexingFilter();
    Configuration conf = NutchConfiguration.create();
    conf.set(MimeTypeIndexingFilter.MIMEFILTER_REGEX_FILE, rulesFile);
    filter.setConf(conf);
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null && !line.isEmpty()) {
        Metadata metadata = new Metadata();
        metadata.set(Response.CONTENT_TYPE, line);
        ParseImpl parse = new ParseImpl("text", new ParseData(new ParseStatus(), "title", new Outlink[0], metadata));
        NutchDocument doc = filter.filter(new NutchDocument(), parse, new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());
        if (doc != null) {
            System.out.print("+ ");
            System.out.println(line);
        } else {
            System.out.print("- ");
            System.out.println(line);
        }
    }
}
Also used : Outlink(org.apache.nutch.parse.Outlink) Options(org.apache.commons.cli.Options) Configuration(org.apache.hadoop.conf.Configuration) NutchConfiguration(org.apache.nutch.util.NutchConfiguration) InputStreamReader(java.io.InputStreamReader) NutchDocument(org.apache.nutch.indexer.NutchDocument) GnuParser(org.apache.commons.cli.GnuParser) Metadata(org.apache.nutch.metadata.Metadata) CrawlDatum(org.apache.nutch.crawl.CrawlDatum) Text(org.apache.hadoop.io.Text) Inlinks(org.apache.nutch.crawl.Inlinks) UnrecognizedOptionException(org.apache.commons.cli.UnrecognizedOptionException) UnrecognizedOptionException(org.apache.commons.cli.UnrecognizedOptionException) IOException(java.io.IOException) IndexingException(org.apache.nutch.indexer.IndexingException) HelpFormatter(org.apache.commons.cli.HelpFormatter) ParseStatus(org.apache.nutch.parse.ParseStatus) CommandLine(org.apache.commons.cli.CommandLine) ParseData(org.apache.nutch.parse.ParseData) BufferedReader(java.io.BufferedReader) ParseImpl(org.apache.nutch.parse.ParseImpl) Option(org.apache.commons.cli.Option) CommandLineParser(org.apache.commons.cli.CommandLineParser)

Aggregations

NutchDocument (org.apache.nutch.indexer.NutchDocument)37 Test (org.junit.Test)33 Text (org.apache.hadoop.io.Text)20 CrawlDatum (org.apache.nutch.crawl.CrawlDatum)20 Inlinks (org.apache.nutch.crawl.Inlinks)20 Configuration (org.apache.hadoop.conf.Configuration)17 NutchConfiguration (org.apache.nutch.util.NutchConfiguration)17 ParseData (org.apache.nutch.parse.ParseData)16 ParseImpl (org.apache.nutch.parse.ParseImpl)16 ParseStatus (org.apache.nutch.parse.ParseStatus)10 Outlink (org.apache.nutch.parse.Outlink)9 Metadata (org.apache.nutch.metadata.Metadata)7 Inlink (org.apache.nutch.crawl.Inlink)5 URL (java.net.URL)3 Job (org.apache.hadoop.mapreduce.Job)3 IndexingException (org.apache.nutch.indexer.IndexingException)2 BasicIndexingFilter (org.apache.nutch.indexer.basic.BasicIndexingFilter)2 BufferedReader (java.io.BufferedReader)1 IOException (java.io.IOException)1 InputStreamReader (java.io.InputStreamReader)1