Example 11 with ParseResult

Use of org.apache.nutch.parse.ParseResult in project nutch by apache.

From the class ZipParser, method main.

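public static void main(String[] args) throws IOException {
    if (args.length < 1) {
        System.out.println("ZipParser <zip_file>");
        System.exit(1);
    }
    File file = new File(args[0]);
    String url = "file:" + file.getCanonicalPath();
    // Read the whole file into memory. available() is only an estimate and a
    // single read() may return fewer bytes than requested, so the buffer is
    // sized from file.length() and filled with readFully()
    // (needs java.io.DataInputStream). The stream is closed even on failure.
    byte[] bytes = new byte[(int) file.length()];
    try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
        in.readFully(bytes);
    }
    Configuration conf = NutchConfiguration.create();
    ZipParser parser = new ZipParser();
    parser.setConf(conf);
    Metadata meta = new Metadata();
    meta.add(Response.CONTENT_LENGTH, String.valueOf(file.length()));
    // Wrap the raw bytes as fetched content and run the zip parser over it.
    ParseResult parseResult = parser.getParse(new Content(url, url, bytes, "application/zip", meta, conf));
    // Look up the parse stored under the file's own URL.
    Parse p = parseResult.get(url);
    System.out.println(parseResult.size());
    System.out.println("Parse Text:");
    System.out.println(p.getText());
    System.out.println("Parse Data:");
    System.out.println(p.getData());
}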
Also used: NutchConfiguration (org.apache.nutch.util.NutchConfiguration), Configuration (org.apache.hadoop.conf.Configuration), ParseResult (org.apache.nutch.parse.ParseResult), Content (org.apache.nutch.protocol.Content), Parse (org.apache.nutch.parse.Parse), Metadata (org.apache.nutch.metadata.Metadata), File (java.io.File), FileInputStream (java.io.FileInputStream)
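
The main method above depends on having the complete file contents in a byte array before the Content object is built. The following is a minimal, stdlib-only sketch of that loading step in isolation. It is not part of Nutch: the class name ReadWholeFile and the helper readWholeFile are hypothetical names chosen for illustration. It loops on read() because a single call may return fewer bytes than requested.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

// Hypothetical helper class, not part of Nutch.
public class ReadWholeFile {

    /** Reads exactly file.length() bytes, looping until the buffer is full. */
    static byte[] readWholeFile(File file) throws IOException {
        long length = file.length();
        if (length > Integer.MAX_VALUE - 8) {
            throw new IOException("File too large for a single byte array: " + file);
        }
        byte[] bytes = new byte[(int) length];
        try (InputStream in = new FileInputStream(file)) {
            int offset = 0;
            while (offset < bytes.length) {
                // read() may fill only part of the remaining space.
                int n = in.read(bytes, offset, bytes.length - offset);
                if (n < 0) {
                    throw new IOException("Unexpected end of file: " + file);
                }
                offset += n;
            }
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file, then read it back.
        File tmp = File.createTempFile("read-whole-file", ".bin");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), new byte[] {1, 2, 3, 4});
        System.out.println(readWholeFile(tmp).length); // prints 4
    }
}
```

Sizing the buffer from file.length() rather than available() avoids relying on an estimate, and try-with-resources closes the stream even if a read fails.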

Aggregations

ParseResult (org.apache.nutch.parse.ParseResult): 11
Parse (org.apache.nutch.parse.Parse): 10
Metadata (org.apache.nutch.metadata.Metadata): 7
Content (org.apache.nutch.protocol.Content): 7
ParseData (org.apache.nutch.parse.ParseData): 6
Configuration (org.apache.hadoop.conf.Configuration): 5
ParseImpl (org.apache.nutch.parse.ParseImpl): 5
NutchConfiguration (org.apache.nutch.util.NutchConfiguration): 5
Map (java.util.Map): 4
Text (org.apache.hadoop.io.Text): 4
Outlink (org.apache.nutch.parse.Outlink): 4
ParseStatus (org.apache.nutch.parse.ParseStatus): 4
ByteArrayInputStream (java.io.ByteArrayInputStream): 3
FileInputStream (java.io.FileInputStream): 3
MalformedURLException (java.net.MalformedURLException): 3
URL (java.net.URL): 3
ArrayList (java.util.ArrayList): 3
CrawlDatum (org.apache.nutch.crawl.CrawlDatum): 3
ParseText (org.apache.nutch.parse.ParseText): 3
File (java.io.File): 2