Example 1 with ContentSource

Use of org.apache.lucene.benchmark.byTask.feeds.ContentSource in project lucene-solr by apache.

From the class ExtractWikipedia, method main.

public static void main(String[] args) throws Exception {
    Path wikipedia = null;
    Path outputDir = Paths.get("enwiki");
    boolean keepImageOnlyDocs = true;
    // Parse flags; --input and --output each consume the argument that follows them.
    for (int i = 0; i < args.length; i++) {
        String arg = args[i];
        if ((arg.equals("--input") || arg.equals("-i")) && i + 1 < args.length) {
            wikipedia = Paths.get(args[++i]);
        } else if ((arg.equals("--output") || arg.equals("-o")) && i + 1 < args.length) {
            outputDir = Paths.get(args[++i]);
        } else if (arg.equals("--discardImageOnlyDocs") || arg.equals("-d")) {
            keepImageOnlyDocs = false;
        }
    }
    // Without an existing input dump there is nothing to extract, so print
    // usage and stop before wikipedia is dereferenced.
    if (wikipedia == null || !Files.exists(wikipedia)) {
        printUsage();
        return;
    }
    // Configure the content source: read the dump once and pass on the image-only setting.
    Properties properties = new Properties();
    properties.setProperty("docs.file", wikipedia.toAbsolutePath().toString());
    properties.setProperty("content.source.forever", "false");
    properties.setProperty("keep.image.only.docs", String.valueOf(keepImageOnlyDocs));
    Config config = new Config(properties);
    // EnwikiContentSource is the concrete ContentSource that reads the Wikipedia dump.
    ContentSource source = new EnwikiContentSource();
    source.setConfig(config);
    // DocMaker turns the entries produced by the source into documents.
    DocMaker docMaker = new DocMaker();
    docMaker.setConfig(config, source);
    docMaker.resetInputs();
    System.out.println("Extracting Wikipedia to: " + outputDir + " using EnwikiContentSource");
    Files.createDirectories(outputDir);
    ExtractWikipedia extractor = new ExtractWikipedia(docMaker, outputDir);
    extractor.extract();
}
Also used: Path (java.nio.file.Path), ContentSource (org.apache.lucene.benchmark.byTask.feeds.ContentSource), EnwikiContentSource (org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource), DocMaker (org.apache.lucene.benchmark.byTask.feeds.DocMaker), Config (org.apache.lucene.benchmark.byTask.utils.Config), Properties (java.util.Properties)
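
The flag handling and Properties setup in main can be tried without the Lucene benchmark jar. The following is a minimal sketch under stated assumptions: the class ExtractArgs, its nested Options type, and the methods parse and toProperties are hypothetical names introduced for illustration and are not part of lucene-solr. The bounds check on the flag value and the exception for a missing input are also additions of this sketch. Only the property keys and the flag names are taken from the example above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of lucene-solr: it isolates the flag parsing
// and Properties setup from the example so they can run without Lucene.
public class ExtractArgs {

    // Parsed command-line options; input stays null when --input is absent.
    static final class Options {
        final Path input;
        final Path outputDir;
        final boolean keepImageOnlyDocs;

        Options(Path input, Path outputDir, boolean keepImageOnlyDocs) {
            this.input = input;
            this.outputDir = outputDir;
            this.keepImageOnlyDocs = keepImageOnlyDocs;
        }
    }

    // Same flags as the example; a value flag given as the last argument is
    // ignored instead of reading past the end of the array (an assumption of
    // this sketch).
    static Options parse(String[] args) {
        Path input = null;
        Path outputDir = Paths.get("enwiki");
        boolean keepImageOnlyDocs = true;
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            boolean hasValue = i + 1 < args.length;
            if ((arg.equals("--input") || arg.equals("-i")) && hasValue) {
                input = Paths.get(args[++i]);
            } else if ((arg.equals("--output") || arg.equals("-o")) && hasValue) {
                outputDir = Paths.get(args[++i]);
            } else if (arg.equals("--discardImageOnlyDocs") || arg.equals("-d")) {
                keepImageOnlyDocs = false;
            }
        }
        return new Options(input, outputDir, keepImageOnlyDocs);
    }

    // Builds the same three properties that the example hands to Config.
    static Properties toProperties(Options options) {
        if (options.input == null) {
            throw new IllegalArgumentException("missing --input");
        }
        Properties properties = new Properties();
        properties.setProperty("docs.file", options.input.toAbsolutePath().toString());
        properties.setProperty("content.source.forever", "false");
        properties.setProperty("keep.image.only.docs", String.valueOf(options.keepImageOnlyDocs));
        return properties;
    }

    public static void main(String[] args) {
        Options options = parse(args);
        if (options.input == null) {
            System.out.println("usage: ExtractArgs --input <dump> [--output <dir>] [--discardImageOnlyDocs]");
            return;
        }
        // Print the resulting configuration instead of starting an extraction.
        toProperties(options).list(System.out);
    }
}
```

Keeping parsing separate from configuration lets the missing-input case be checked in one place before any path is dereferenced.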

Aggregations

Path (java.nio.file.Path): 1
Properties (java.util.Properties): 1
ContentSource (org.apache.lucene.benchmark.byTask.feeds.ContentSource): 1
DocMaker (org.apache.lucene.benchmark.byTask.feeds.DocMaker): 1
EnwikiContentSource (org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource): 1
Config (org.apache.lucene.benchmark.byTask.utils.Config): 1