Example 31 with ParseImpl

Use of org.apache.nutch.parse.ParseImpl in project nutch by apache.

Class TextProfileSignature, method main.

public static void main(String[] args) throws Exception {
    TextProfileSignature sig = new TextProfileSignature();
    sig.setConf(NutchConfiguration.create());
    // Maps each file path to the signature computed from that file's text.
    HashMap<String, byte[]> res = new HashMap<>();
    // args[0] names a directory of text files; listFiles() returns null
    // when the path is not a readable directory.
    File[] files = new File(args[0]).listFiles();
    if (files == null) {
        System.err.println("Not a readable directory: " + args[0]);
        return;
    }
    for (File file : files) {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader and its stream even if reading fails.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
            String line;
            // Join the lines with "\n", without a trailing newline.
            while ((line = br.readLine()) != null) {
                if (text.length() > 0)
                    text.append("\n");
                text.append(line);
            }
        }
        // Only the text is supplied: the first argument to calculate and the
        // second argument to the ParseImpl constructor are both null.
        byte[] signature = sig.calculate(null, new ParseImpl(text.toString(), null));
        res.put(file.toString(), signature);
    }
    // Print one "path<TAB>hex signature" line per file.
    for (String name : res.keySet()) {
        System.out.println(name + "\t" + StringUtil.toHexString(res.get(name)));
    }
}
Also used: InputStreamReader (java.io.InputStreamReader), HashMap (java.util.HashMap), FileInputStream (java.io.FileInputStream), BufferedReader (java.io.BufferedReader), ParseImpl (org.apache.nutch.parse.ParseImpl), File (java.io.File)
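
The method above follows a read, sign, print pattern: read every file in a directory as UTF-8 text, compute a signature for each, and print the path next to the hex-encoded signature. The following is a minimal sketch of that pattern using only the JDK, so that it can be run without Nutch on the classpath. It rests on assumptions: the class DirectorySignatures and its methods toHex, signatureOf and signaturesIn are hypothetical names introduced here, and MD5 over the raw text is only a stand-in for sig.calculate(...). It does not reproduce the algorithm used by TextProfileSignature. Unlike the method above, the sketch skips entries that are not regular files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class DirectorySignatures {

    // Hex-encodes a byte array, two lowercase digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Assumption: MD5 of the UTF-8 bytes stands in for the Nutch signature.
    static byte[] signatureOf(String text) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(text.getBytes(StandardCharsets.UTF_8));
    }

    // Returns path -> hex signature for every regular file directly inside dir.
    static Map<String, String> signaturesIn(Path dir) throws IOException, NoSuchAlgorithmException {
        Map<String, String> result = new TreeMap<>(); // sorted by path for stable output
        try (Stream<Path> entries = Files.list(dir)) {
            for (Path p : (Iterable<Path>) entries::iterator) {
                if (!Files.isRegularFile(p)) {
                    continue; // skip subdirectories and other non-file entries
                }
                // Join lines with "\n" and no trailing newline, as the loop above does.
                String text = String.join("\n", Files.readAllLines(p, StandardCharsets.UTF_8));
                result.put(p.toString(), toHex(signatureOf(text)));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the directory to scan.
        for (Map.Entry<String, String> e : signaturesIn(Paths.get(args[0])).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Running it as java DirectorySignatures followed by a directory path prints one tab-separated line per file. A TreeMap is used here so that the output order is deterministic; the HashMap in the method above gives no ordering guarantee.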

Aggregations

ParseImpl (org.apache.nutch.parse.ParseImpl) 31
ParseData (org.apache.nutch.parse.ParseData) 29
Text (org.apache.hadoop.io.Text) 21
CrawlDatum (org.apache.nutch.crawl.CrawlDatum) 21
ParseStatus (org.apache.nutch.parse.ParseStatus) 21
Inlinks (org.apache.nutch.crawl.Inlinks) 20
Outlink (org.apache.nutch.parse.Outlink) 17
Test (org.junit.Test) 17
NutchDocument (org.apache.nutch.indexer.NutchDocument) 16
Metadata (org.apache.nutch.metadata.Metadata) 15
Configuration (org.apache.hadoop.conf.Configuration) 13
NutchConfiguration (org.apache.nutch.util.NutchConfiguration) 13
URL (java.net.URL) 6
IOException (java.io.IOException) 5
Inlink (org.apache.nutch.crawl.Inlink) 5
Parse (org.apache.nutch.parse.Parse) 5
ByteArrayInputStream (java.io.ByteArrayInputStream) 4
ArrayList (java.util.ArrayList) 4
ParseResult (org.apache.nutch.parse.ParseResult) 4
MalformedURLException (java.net.MalformedURLException) 3