Search in sources :

Example 1 with HTMLDocumentImpl

use of org.apache.html.dom.HTMLDocumentImpl in project nutch by apache.

the class TestHeadingsParseFilter method testExtractHeadingFromNestedNodes.

@Test
public void testExtractHeadingFromNestedNodes() throws IOException, SAXException {
    conf.setStrings("headings", "h1", "h2");
    HtmlParseFilter filter = new HeadingsParseFilter();
    filter.setConf(conf);
    Content content = new Content("http://www.foo.com/", "http://www.foo.com/", "".getBytes("UTF8"), "text/html; charset=UTF-8", new Metadata(), conf);
    ParseImpl parse = new ParseImpl("foo bar", new ParseData());
    ParseResult parseResult = ParseResult.createParseResult("http://www.foo.com/", parse);
    HTMLMetaTags metaTags = new HTMLMetaTags();
    DOMFragmentParser parser = new DOMFragmentParser();
    DocumentFragment node = new HTMLDocumentImpl().createDocumentFragment();
    parser.parse(new InputSource(new ByteArrayInputStream(("<html><head><title>test header with span element</title></head><body><h1>header with <span>span element</span></h1></body></html>").getBytes())), node);
    parseResult = filter.filter(content, parseResult, metaTags, node);
    Assert.assertEquals("The h1 tag must include the content of the inner span node", "header with span element", parseResult.get(content.getUrl()).getData().getParseMeta().get("h1"));
}
Also used : InputSource(org.xml.sax.InputSource) Metadata(org.apache.nutch.metadata.Metadata) DOMFragmentParser(org.cyberneko.html.parsers.DOMFragmentParser) HTMLDocumentImpl(org.apache.html.dom.HTMLDocumentImpl) ByteArrayInputStream(java.io.ByteArrayInputStream) Content(org.apache.nutch.protocol.Content) DocumentFragment(org.w3c.dom.DocumentFragment) Test(org.junit.Test)

Example 2 with HTMLDocumentImpl

use of org.apache.html.dom.HTMLDocumentImpl in project nutch by apache.

the class HtmlParser method parseNeko.

private DocumentFragment parseNeko(InputSource input) throws Exception {
    DOMFragmentParser parser = new DOMFragmentParser();
    try {
        parser.setFeature("http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe", true);
        parser.setFeature("http://cyberneko.org/html/features/augmentations", true);
        parser.setProperty("http://cyberneko.org/html/properties/default-encoding", defaultCharEncoding);
        parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
        parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content", false);
        parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);
        parser.setFeature("http://cyberneko.org/html/features/report-errors", LOG.isTraceEnabled());
    } catch (SAXException e) {
    }
    // convert Document to DocumentFragment
    HTMLDocumentImpl doc = new HTMLDocumentImpl();
    doc.setErrorChecking(false);
    DocumentFragment res = doc.createDocumentFragment();
    DocumentFragment frag = doc.createDocumentFragment();
    parser.parse(input, frag);
    res.appendChild(frag);
    try {
        while (true) {
            frag = doc.createDocumentFragment();
            parser.parse(input, frag);
            if (!frag.hasChildNodes())
                break;
            if (LOG.isInfoEnabled()) {
                LOG.info(" - new frag, " + frag.getChildNodes().getLength() + " nodes.");
            }
            res.appendChild(frag);
        }
    } catch (Exception e) {
        LOG.error("Error: ", e);
    }
    ;
    return res;
}
Also used : HTMLDocumentImpl(org.apache.html.dom.HTMLDocumentImpl) DOMFragmentParser(org.cyberneko.html.parsers.DOMFragmentParser) DocumentFragment(org.w3c.dom.DocumentFragment) DOMException(org.w3c.dom.DOMException) MalformedURLException(java.net.MalformedURLException) IOException(java.io.IOException) SAXException(org.xml.sax.SAXException) SAXException(org.xml.sax.SAXException)

Example 3 with HTMLDocumentImpl

use of org.apache.html.dom.HTMLDocumentImpl in project nutch by apache.

the class HtmlParser method parseTagSoup.

private DocumentFragment parseTagSoup(InputSource input) throws Exception {
    HTMLDocumentImpl doc = new HTMLDocumentImpl();
    DocumentFragment frag = doc.createDocumentFragment();
    DOMBuilder builder = new DOMBuilder(doc, frag);
    org.ccil.cowan.tagsoup.Parser reader = new org.ccil.cowan.tagsoup.Parser();
    reader.setContentHandler(builder);
    reader.setFeature(org.ccil.cowan.tagsoup.Parser.ignoreBogonsFeature, true);
    reader.setFeature(org.ccil.cowan.tagsoup.Parser.bogonsEmptyFeature, false);
    reader.setProperty("http://xml.org/sax/properties/lexical-handler", builder);
    reader.parse(input);
    return frag;
}
Also used : HTMLDocumentImpl(org.apache.html.dom.HTMLDocumentImpl) DocumentFragment(org.w3c.dom.DocumentFragment) Parser(org.apache.nutch.parse.Parser) DOMFragmentParser(org.cyberneko.html.parsers.DOMFragmentParser)

Example 4 with HTMLDocumentImpl

use of org.apache.html.dom.HTMLDocumentImpl in project nutch by apache.

the class TikaParser method getParse.

@SuppressWarnings("deprecation")
public ParseResult getParse(Content content) {
    String mimeType = content.getContentType();
    boolean useBoilerpipe = getConf().get("tika.extractor", "none").equals("boilerpipe");
    String boilerpipeExtractorName = getConf().get("tika.extractor.boilerpipe.algorithm", "ArticleExtractor");
    URL base;
    try {
        base = new URL(content.getBaseUrl());
    } catch (MalformedURLException e) {
        return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
    }
    // get the right parser using the mime type as a clue
    Parser parser = tikaConfig.getParser(MediaType.parse(mimeType));
    byte[] raw = content.getContent();
    if (parser == null) {
        String message = "Can't retrieve Tika parser for mime-type " + mimeType;
        LOG.error(message);
        return new ParseStatus(ParseStatus.FAILED, message).getEmptyParseResult(content.getUrl(), getConf());
    }
    LOG.debug("Using Tika parser " + parser.getClass().getName() + " for mime-type " + mimeType);
    Metadata tikamd = new Metadata();
    HTMLDocumentImpl doc = new HTMLDocumentImpl();
    doc.setErrorChecking(false);
    DocumentFragment root = doc.createDocumentFragment();
    ContentHandler domHandler;
    // Check whether to use Tika's BoilerplateContentHandler
    if (useBoilerpipe) {
        BoilerpipeContentHandler bpHandler = new BoilerpipeContentHandler((ContentHandler) new DOMBuilder(doc, root), BoilerpipeExtractorRepository.getExtractor(boilerpipeExtractorName));
        bpHandler.setIncludeMarkup(true);
        domHandler = (ContentHandler) bpHandler;
    } else {
        DOMBuilder domBuilder = new DOMBuilder(doc, root);
        domBuilder.setUpperCaseElementNames(upperCaseElementNames);
        domBuilder.setDefaultNamespaceURI(XHTMLContentHandler.XHTML);
        domHandler = (ContentHandler) domBuilder;
    }
    LinkContentHandler linkContentHandler = new LinkContentHandler();
    ParseContext context = new ParseContext();
    TeeContentHandler teeContentHandler = new TeeContentHandler(domHandler, linkContentHandler);
    if (HTMLMapper != null)
        context.set(HtmlMapper.class, HTMLMapper);
    tikamd.set(Metadata.CONTENT_TYPE, mimeType);
    try {
        parser.parse(new ByteArrayInputStream(raw), (ContentHandler) teeContentHandler, tikamd, context);
    } catch (Exception e) {
        LOG.error("Error parsing " + content.getUrl(), e);
        return new ParseStatus(ParseStatus.FAILED, e.getMessage()).getEmptyParseResult(content.getUrl(), getConf());
    }
    HTMLMetaTags metaTags = new HTMLMetaTags();
    String text = "";
    String title = "";
    Outlink[] outlinks = new Outlink[0];
    org.apache.nutch.metadata.Metadata nutchMetadata = new org.apache.nutch.metadata.Metadata();
    // we have converted the sax events generated by Tika into a DOM object
    // so we can now use the usual HTML resources from Nutch
    // get meta directives
    HTMLMetaProcessor.getMetaTags(metaTags, root, base);
    if (LOG.isTraceEnabled()) {
        LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
    }
    // check meta directives
    if (!metaTags.getNoIndex()) {
        // okay to index
        StringBuffer sb = new StringBuffer();
        if (LOG.isTraceEnabled()) {
            LOG.trace("Getting text...");
        }
        // extract text
        utils.getText(sb, root);
        text = sb.toString();
        sb.setLength(0);
        if (LOG.isTraceEnabled()) {
            LOG.trace("Getting title...");
        }
        // extract title
        utils.getTitle(sb, root);
        title = sb.toString().trim();
    }
    if (!metaTags.getNoFollow()) {
        // okay to follow links
        // extract outlinks
        ArrayList<Outlink> l = new ArrayList<Outlink>();
        URL baseTag = base;
        String baseTagHref = tikamd.get("Content-Location");
        if (baseTagHref != null) {
            try {
                baseTag = new URL(base, baseTagHref);
            } catch (MalformedURLException e) {
                LOG.trace("Invalid <base href=\"{}\">", baseTagHref);
            }
        }
        if (LOG.isTraceEnabled()) {
            LOG.trace("Getting links (base URL = {}) ...", baseTag);
        }
        // pre-1233 outlink extraction
        // utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
        // Get outlinks from Tika
        List<Link> tikaExtractedOutlinks = linkContentHandler.getLinks();
        utils.getOutlinks(baseTag, l, tikaExtractedOutlinks);
        outlinks = l.toArray(new Outlink[l.size()]);
        if (LOG.isTraceEnabled()) {
            LOG.trace("found " + outlinks.length + " outlinks in " + content.getUrl());
        }
    }
    // populate Nutch metadata with Tika metadata
    String[] TikaMDNames = tikamd.names();
    for (String tikaMDName : TikaMDNames) {
        if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
            continue;
        String[] values = tikamd.getValues(tikaMDName);
        for (String v : values) nutchMetadata.add(tikaMDName, v);
    }
    if (outlinks.length == 0) {
        outlinks = OutlinkExtractor.getOutlinks(text, getConf());
    }
    ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
    if (metaTags.getRefresh()) {
        status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
        status.setArgs(new String[] { metaTags.getRefreshHref().toString(), Integer.toString(metaTags.getRefreshTime()) });
    }
    ParseData parseData = new ParseData(status, title, outlinks, content.getMetadata(), nutchMetadata);
    ParseResult parseResult = ParseResult.createParseResult(content.getUrl(), new ParseImpl(text, parseData));
    // run filters on parse
    ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult, metaTags, root);
    if (metaTags.getNoCache()) {
        // not okay to cache
        for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse) entry.getValue().getData().getParseMeta().set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);
    }
    return filteredParse;
}
Also used : MalformedURLException(java.net.MalformedURLException) Parse(org.apache.nutch.parse.Parse) Metadata(org.apache.tika.metadata.Metadata) ArrayList(java.util.ArrayList) URL(java.net.URL) BoilerpipeContentHandler(org.apache.tika.parser.html.BoilerpipeContentHandler) ContentHandler(org.xml.sax.ContentHandler) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) LinkContentHandler(org.apache.tika.sax.LinkContentHandler) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) ParseStatus(org.apache.nutch.parse.ParseStatus) LinkContentHandler(org.apache.tika.sax.LinkContentHandler) DocumentFragment(org.w3c.dom.DocumentFragment) HTMLMetaTags(org.apache.nutch.parse.HTMLMetaTags) Outlink(org.apache.nutch.parse.Outlink) ParseResult(org.apache.nutch.parse.ParseResult) MalformedURLException(java.net.MalformedURLException) Parser(org.apache.tika.parser.Parser) HTMLDocumentImpl(org.apache.html.dom.HTMLDocumentImpl) ByteArrayInputStream(java.io.ByteArrayInputStream) ParseData(org.apache.nutch.parse.ParseData) HtmlMapper(org.apache.tika.parser.html.HtmlMapper) ParseContext(org.apache.tika.parser.ParseContext) ParseImpl(org.apache.nutch.parse.ParseImpl) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) Map(java.util.Map) BoilerpipeContentHandler(org.apache.tika.parser.html.BoilerpipeContentHandler) Link(org.apache.tika.sax.Link)

Aggregations

HTMLDocumentImpl (org.apache.html.dom.HTMLDocumentImpl)4 DocumentFragment (org.w3c.dom.DocumentFragment)4 DOMFragmentParser (org.cyberneko.html.parsers.DOMFragmentParser)3 ByteArrayInputStream (java.io.ByteArrayInputStream)2 MalformedURLException (java.net.MalformedURLException)2 IOException (java.io.IOException)1 URL (java.net.URL)1 ArrayList (java.util.ArrayList)1 Map (java.util.Map)1 Metadata (org.apache.nutch.metadata.Metadata)1 HTMLMetaTags (org.apache.nutch.parse.HTMLMetaTags)1 Outlink (org.apache.nutch.parse.Outlink)1 Parse (org.apache.nutch.parse.Parse)1 ParseData (org.apache.nutch.parse.ParseData)1 ParseImpl (org.apache.nutch.parse.ParseImpl)1 ParseResult (org.apache.nutch.parse.ParseResult)1 ParseStatus (org.apache.nutch.parse.ParseStatus)1 Parser (org.apache.nutch.parse.Parser)1 Content (org.apache.nutch.protocol.Content)1 Metadata (org.apache.tika.metadata.Metadata)1