
Example 6 with HtmlParseData

Use of edu.uci.ics.crawler4j.parser.HtmlParseData in project mastering-java by Kingminghuang.

In class Crawler, method visit:

@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    System.out.println("URL: " + url);
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();
        String html = htmlParseData.getHtml();
        Set<WebURL> links = htmlParseData.getOutgoingUrls();
        System.out.println("Text length: " + text.length());
        System.out.println("HTML length: " + html.length());
        System.out.println("Number of outgoing links: " + links.size());
    }
}
Also used: WebURL (edu.uci.ics.crawler4j.url.WebURL), HtmlParseData (edu.uci.ics.crawler4j.parser.HtmlParseData)

Example 7 with HtmlParseData

Use of edu.uci.ics.crawler4j.parser.HtmlParseData in project yyl_example by Relucent.

In class MyCrawler, method visit:

/**
 * Called when the download of a URL completes.
 */
@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    System.out.println("URL: " + url);
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();
        String html = htmlParseData.getHtml();
        Set<WebURL> links = htmlParseData.getOutgoingUrls();
        System.out.println("Text length: " + text.length());
        System.out.println("Html length: " + html.length());
        System.out.println("Number of outgoing links: " + links.size());
    }
}
Also used: WebURL (edu.uci.ics.crawler4j.url.WebURL), HtmlParseData (edu.uci.ics.crawler4j.parser.HtmlParseData)
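
Both visit methods above check the parse data with instanceof before casting, so pages whose content is not HTML are skipped rather than causing a ClassCastException. The following is a minimal sketch of that guard, written so it compiles without crawler4j on the classpath. ParsedContent, HtmlContent, BinaryContent and ParseSummary are hypothetical stand-ins invented for illustration. They are not crawler4j types, and the summary format is likewise an assumption.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical stand-in for a parse-result supertype (NOT a crawler4j class).
interface ParsedContent {}

// Hypothetical stand-in for HTML parse data: text, raw HTML, outgoing links.
final class HtmlContent implements ParsedContent {
    private final String text;
    private final String html;
    private final Set<String> outgoingUrls;

    HtmlContent(String text, String html, Set<String> outgoingUrls) {
        this.text = text;
        this.html = html;
        this.outgoingUrls = outgoingUrls;
    }

    String getText() { return text; }
    String getHtml() { return html; }
    Set<String> getOutgoingUrls() { return outgoingUrls; }
}

// Hypothetical stand-in for any non-HTML result (for example, an image).
final class BinaryContent implements ParsedContent {}

final class ParseSummary {
    private ParseSummary() {}

    // Summarizes HTML content; returns a fallback message for anything else.
    static String describe(ParsedContent data) {
        // Guard first: the cast below is only safe for HTML content.
        if (!(data instanceof HtmlContent)) {
            return "skipped: not HTML";
        }
        HtmlContent html = (HtmlContent) data;
        return "text=" + html.getText().length()
                + " html=" + html.getHtml().length()
                + " links=" + html.getOutgoingUrls().size();
    }

    public static void main(String[] args) {
        ParsedContent page = new HtmlContent(
                "hello", "<p>hello</p>",
                Collections.singleton("https://example.com/a"));
        System.out.println(describe(page));                // text=5 html=12 links=1
        System.out.println(describe(new BinaryContent())); // skipped: not HTML
    }
}
```

Returning a string instead of printing inside the guard keeps the logic easy to check in isolation. In a real crawler the same check would sit inside visit, as in the examples above.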

Aggregations

HtmlParseData (edu.uci.ics.crawler4j.parser.HtmlParseData): 7
WebURL (edu.uci.ics.crawler4j.url.WebURL): 6
Page (edu.uci.ics.crawler4j.crawler.Page): 1
ParseData (edu.uci.ics.crawler4j.parser.ParseData): 1
UnsupportedEncodingException (java.io.UnsupportedEncodingException): 1
Header (org.apache.http.Header): 1