Examples with ParseUtil - org.apache.nutch.parse.ParseUtil

Example 6 with ParseUtil

Use of org.apache.nutch.parse.ParseUtil in the Apache Nutch project.

Class TestZipParser, method testIt.

@Test
public void testIt() throws ProtocolException, ParseException {
    String urlString;
    Protocol protocol;
    Content content;
    Parse parse;
    Configuration conf = NutchConfiguration.create();
    for (int i = 0; i < sampleFiles.length; i++) {
        urlString = "file:" + sampleDir + fileSeparator + sampleFiles[i];
        protocol = new ProtocolFactory(conf).getProtocol(urlString);
        content = protocol.getProtocolOutput(new Text(urlString), new CrawlDatum()).getContent();
        parse = new ParseUtil(conf).parseByExtensionId("parse-zip", content).get(content.getUrl());
        Assert.assertTrue("Extracted text does not start with <" + expectedText + ">: <" + parse.getText() + ">", parse.getText().startsWith(expectedText));
    }
}

Also used: ProtocolFactory (org.apache.nutch.protocol.ProtocolFactory), NutchConfiguration (org.apache.nutch.util.NutchConfiguration), Configuration (org.apache.hadoop.conf.Configuration), ParseUtil (org.apache.nutch.parse.ParseUtil), Content (org.apache.nutch.protocol.Content), Parse (org.apache.nutch.parse.Parse), CrawlDatum (org.apache.nutch.crawl.CrawlDatum), Text (org.apache.hadoop.io.Text), Protocol (org.apache.nutch.protocol.Protocol), Test (org.junit.Test)

Example 7 with ParseUtil

Use of org.apache.nutch.parse.ParseUtil in the Apache Nutch project.

Class TestIndexReplace, method parseAndFilterFile.

/**
 * Run a test file through the Nutch parser and index filters.
 *
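 * @param fileName name of the sample file, resolved against sampleDir
 * @param conf the Nutch configuration used to set up the protocol, parser, and index filters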
 * @return the Nutch document with the replace indexer applied
 */
public NutchDocument parseAndFilterFile(String fileName, Configuration conf) {
    NutchDocument doc = new NutchDocument();
    BasicIndexingFilter basicIndexer = new BasicIndexingFilter();
    basicIndexer.setConf(conf);
    Assert.assertNotNull(basicIndexer);
    MetadataIndexer metaIndexer = new MetadataIndexer();
    metaIndexer.setConf(conf);
    Assert.assertNotNull(metaIndexer);
    ReplaceIndexer replaceIndexer = new ReplaceIndexer();
    replaceIndexer.setConf(conf);
    Assert.assertNotNull(replaceIndexer);
    try {
        String urlString = "file:" + sampleDir + fileSeparator + fileName;
        Text text = new Text(urlString);
        CrawlDatum crawlDatum = new CrawlDatum();
        Protocol protocol = new ProtocolFactory(conf).getProtocol(urlString);
        Content content = protocol.getProtocolOutput(text, crawlDatum).getContent();
        Parse parse = new ParseUtil(conf).parse(content).get(content.getUrl());
        crawlDatum.setFetchTime(100L);
        Inlinks inlinks = new Inlinks();
        doc = basicIndexer.filter(doc, parse, text, crawlDatum, inlinks);
        doc = metaIndexer.filter(doc, parse, text, crawlDatum, inlinks);
        doc = replaceIndexer.filter(doc, parse, text, crawlDatum, inlinks);
    } catch (Exception e) {
        e.printStackTrace();
        Assert.fail(e.toString());
    }
    return doc;
}

Also used: NutchDocument (org.apache.nutch.indexer.NutchDocument), ParseUtil (org.apache.nutch.parse.ParseUtil), Parse (org.apache.nutch.parse.Parse), MetadataIndexer (org.apache.nutch.indexer.metadata.MetadataIndexer), CrawlDatum (org.apache.nutch.crawl.CrawlDatum), Text (org.apache.hadoop.io.Text), Inlinks (org.apache.nutch.crawl.Inlinks), ProtocolFactory (org.apache.nutch.protocol.ProtocolFactory), Content (org.apache.nutch.protocol.Content), BasicIndexingFilter (org.apache.nutch.indexer.basic.BasicIndexingFilter), Protocol (org.apache.nutch.protocol.Protocol)

Example 8 with ParseUtil

Use of org.apache.nutch.parse.ParseUtil in the Apache Nutch project.

Class TestHTMLLanguageParser, method testMetaHTMLParsing.

/**
 * Test parsing of language identifiers from html
 */
@Test
public void testMetaHTMLParsing() {
    try {
        ParseUtil parser = new ParseUtil(NutchConfiguration.create());
        /* loop through the test documents and validate result */
        for (int t = 0; t < docs.length; t++) {
            Content content = getContent(docs[t]);
            Parse parse = parser.parse(content).get(content.getUrl());
            Assert.assertEquals(metalanguages[t], (String) parse.getData().getParseMeta().get(Metadata.LANGUAGE));
        }
    } catch (Exception e) {
        e.printStackTrace(System.out);
        Assert.fail(e.toString());
    }
}

Also used: ParseUtil (org.apache.nutch.parse.ParseUtil), Content (org.apache.nutch.protocol.Content), Parse (org.apache.nutch.parse.Parse), Test (org.junit.Test)

Example 9 with ParseUtil

Use of org.apache.nutch.parse.ParseUtil in the Apache Nutch project.

Class TestMetatagParser, method parseMeta.

public Metadata parseMeta(String fileName, Configuration conf) {
    Metadata metadata = null;
    try {
        String urlString = "file:" + sampleDir + fileSeparator + fileName;
        Protocol protocol = new ProtocolFactory(conf).getProtocol(urlString);
        Content content = protocol.getProtocolOutput(new Text(urlString), new CrawlDatum()).getContent();
        Parse parse = new ParseUtil(conf).parse(content).get(content.getUrl());
        metadata = parse.getData().getParseMeta();
    } catch (Exception e) {
        e.printStackTrace();
        Assert.fail(e.toString());
    }
    return metadata;
}

Also used: ProtocolFactory (org.apache.nutch.protocol.ProtocolFactory), ParseUtil (org.apache.nutch.parse.ParseUtil), Content (org.apache.nutch.protocol.Content), Parse (org.apache.nutch.parse.Parse), Metadata (org.apache.nutch.metadata.Metadata), CrawlDatum (org.apache.nutch.crawl.CrawlDatum), Text (org.apache.hadoop.io.Text), Protocol (org.apache.nutch.protocol.Protocol)

Example 10 with ParseUtil

Use of org.apache.nutch.parse.ParseUtil in the Apache Nutch project.

Class TestSWFParser, method testIt.

@Test
public void testIt() throws ProtocolException, ParseException {
    String urlString;
    Protocol protocol;
    Content content;
    Parse parse;
    Configuration conf = NutchConfiguration.create();
    for (int i = 0; i < sampleFiles.length; i++) {
        urlString = "file:" + sampleDir + fileSeparator + sampleFiles[i];
        protocol = new ProtocolFactory(conf).getProtocol(urlString);
        content = protocol.getProtocolOutput(new Text(urlString), new CrawlDatum()).getContent();
        parse = new ParseUtil(conf).parse(content).get(content.getUrl());
        String text = parse.getText().replaceAll("[ \t\r\n]+", " ").trim();
        Assert.assertEquals(sampleTexts[i], text);
    }
}
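
The whitespace normalization in testIt above can be tried in isolation. The following is a minimal sketch using only the JDK; the class name WhitespaceNormalization, the helper normalize, and the sample string are hypothetical and are not part of Nutch. It shows what the expression replaceAll("[ \t\r\n]+", " ").trim() does to text before it is compared with the expected sample.

```java
public class WhitespaceNormalization {

    // Same expression as in testIt(): collapse every run of spaces, tabs,
    // carriage returns, and newlines into a single space, then strip the
    // leading and trailing whitespace.
    static String normalize(String raw) {
        return raw.replaceAll("[ \t\r\n]+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical parser output with mixed whitespace (not real SWF text).
        String raw = "  Hello\r\n\tWorld \n";
        System.out.println("[" + normalize(raw) + "]"); // prints "[Hello World]"
    }
}
```

Normalizing before the comparison means the test tolerates differences in line endings and indentation in the extracted text and only checks the words and their order.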

Aggregations

ParseUtil (org.apache.nutch.parse.ParseUtil): 17
Parse (org.apache.nutch.parse.Parse): 16
Content (org.apache.nutch.protocol.Content): 15
Text (org.apache.hadoop.io.Text): 13
CrawlDatum (org.apache.nutch.crawl.CrawlDatum): 13
Protocol (org.apache.nutch.protocol.Protocol): 11
ProtocolFactory (org.apache.nutch.protocol.ProtocolFactory): 11
Configuration (org.apache.hadoop.conf.Configuration): 10
NutchConfiguration (org.apache.nutch.util.NutchConfiguration): 10
Test (org.junit.Test): 10
Metadata (org.apache.nutch.metadata.Metadata): 4
Map (java.util.Map): 2
Inlinks (org.apache.nutch.crawl.Inlinks): 2
Outlink (org.apache.nutch.parse.Outlink): 2
ParseData (org.apache.nutch.parse.ParseData): 2
ParseException (org.apache.nutch.parse.ParseException): 2
ParseResult (org.apache.nutch.parse.ParseResult): 2
IOException (java.io.IOException): 1
URL (java.net.URL): 1
HashMap (java.util.HashMap): 1