
Example 36 with Outlink

Use of org.apache.nutch.parse.Outlink in the Apache Nutch project.

Class ZipParser, method getParse.

public ParseResult getParse(final Content content) {
    String resultText = null;
    String resultTitle = null;
    Outlink[] outlinks = null;
    List<Outlink> outLinksList = new ArrayList<Outlink>();
    try {
        final String contentLen = content.getMetadata().get(Response.CONTENT_LENGTH);
        final byte[] contentInBytes = content.getContent();
        // Check the header for null before parsing it; parsing first would throw on a missing Content-Length.
        if (contentLen != null) {
            final int len = Integer.parseInt(contentLen);
            if (LOG.isDebugEnabled()) {
                LOG.debug("ziplen: " + len);
            }
            if (contentInBytes.length != len) {
                return new ParseStatus(ParseStatus.FAILED, ParseStatus.FAILED_TRUNCATED, "Content truncated at " + contentInBytes.length + " bytes. Parser can't handle incomplete zip file.").getEmptyParseResult(content.getUrl(), getConf());
            }
        }
        ZipTextExtractor extractor = new ZipTextExtractor(getConf());
        // extract text
        resultText = extractor.extractText(new ByteArrayInputStream(contentInBytes), content.getUrl(), outLinksList);
    } catch (Exception e) {
        return new ParseStatus(ParseStatus.FAILED, "Can't be handled as Zip document. " + e).getEmptyParseResult(content.getUrl(), getConf());
    }
    if (resultText == null) {
        resultText = "";
    }
    if (resultTitle == null) {
        resultTitle = "";
    }
    outlinks = outLinksList.toArray(new Outlink[0]);
    final ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, resultTitle, outlinks, content.getMetadata());
    if (LOG.isTraceEnabled()) {
        LOG.trace("Zip file parsed successfully !!");
    }
    return ParseResult.createParseResult(content.getUrl(), new ParseImpl(resultText, parseData));
}
Also used : Outlink(org.apache.nutch.parse.Outlink) ArrayList(java.util.ArrayList) IOException(java.io.IOException) ParseStatus(org.apache.nutch.parse.ParseStatus) ByteArrayInputStream(java.io.ByteArrayInputStream) ParseData(org.apache.nutch.parse.ParseData) ParseImpl(org.apache.nutch.parse.ParseImpl)
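
The truncation guard in getParse compares the declared Content-Length with the number of bytes actually fetched. Below is a minimal, self-contained sketch of that comparison using only the JDK. The names `TruncationCheck` and `isTruncated` are hypothetical and not part of Nutch, and treating a missing or malformed header as "not truncated" is an assumption of this sketch rather than documented Nutch behavior.

```java
public class TruncationCheck {

    /**
     * Returns true only when a Content-Length value is present, parses as an
     * integer, and differs from the number of bytes actually received.
     * Assumption of this sketch: a missing or malformed header counts as
     * "length unknown", so the content is not reported as truncated.
     */
    public static boolean isTruncated(String contentLengthHeader, byte[] content) {
        if (contentLengthHeader == null) {
            return false;
        }
        try {
            int declared = Integer.parseInt(contentLengthHeader.trim());
            return declared != content.length;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] body = new byte[] {1, 2, 3};
        System.out.println(isTruncated("3", body));   // prints false
        System.out.println(isTruncated("10", body));  // prints true
        System.out.println(isTruncated(null, body));  // prints false
    }
}
```

Checking for null before calling `Integer.parseInt` keeps a missing header from surfacing as a `NumberFormatException`; whether a parser should instead reject content with no declared length is a design choice this sketch does not settle.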

Example 37 with Outlink

Use of org.apache.nutch.parse.Outlink in the Apache Nutch project.

Class FetchNodeDbInfo, method setChildNodes.

public void setChildNodes(Outlink[] links) {
    ChildNode childNode;
    for (Outlink outlink : links) {
        childNode = new ChildNode(outlink.getToUrl(), outlink.getAnchor());
        children.add(childNode);
    }
}
Also used : Outlink(org.apache.nutch.parse.Outlink)

Aggregations

Outlink (org.apache.nutch.parse.Outlink): 37
ParseData (org.apache.nutch.parse.ParseData): 22
ParseImpl (org.apache.nutch.parse.ParseImpl): 17
ParseStatus (org.apache.nutch.parse.ParseStatus): 16
URL (java.net.URL): 13
Text (org.apache.hadoop.io.Text): 13
CrawlDatum (org.apache.nutch.crawl.CrawlDatum): 11
Test (org.junit.Test): 11
Parse (org.apache.nutch.parse.Parse): 10
MalformedURLException (java.net.MalformedURLException): 9
Inlinks (org.apache.nutch.crawl.Inlinks): 9
NutchDocument (org.apache.nutch.indexer.NutchDocument): 9
Metadata (org.apache.nutch.metadata.Metadata): 9
ArrayList (java.util.ArrayList): 8
ByteArrayInputStream (java.io.ByteArrayInputStream): 7
Configuration (org.apache.hadoop.conf.Configuration): 6
NutchConfiguration (org.apache.nutch.util.NutchConfiguration): 6
IOException (java.io.IOException): 5
ParseText (org.apache.nutch.parse.ParseText): 4
Map (java.util.Map): 3