Example 6 with Page

use of us.codecraft.webmagic.Page in project webmagic by code4craft.

the class MockGithubDownloader method download.

@Override
public Page download(Request request, Task task) {
    Page page = new Page();
    page.setHtml(new Html(html));
    page.setRequest(new Request("https://github.com/code4craft/webmagic"));
    page.setUrl(new PlainText("https://github.com/code4craft/webmagic"));
    return page;
}
Also used : PlainText(us.codecraft.webmagic.selector.PlainText) Request(us.codecraft.webmagic.Request) Html(us.codecraft.webmagic.selector.Html) Page(us.codecraft.webmagic.Page)

Example 7 with Page

use of us.codecraft.webmagic.Page in project webmagic by code4craft.

the class SeleniumDownloader method download.

@Override
public Page download(Request request, Task task) {
    checkInit();
    WebDriver webDriver;
    try {
        webDriver = webDriverPool.get();
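    } catch (InterruptedException e) {
        logger.warn("interrupted", e);
        // Restore the interrupt flag so the caller can still observe the interruption.
        Thread.currentThread().interrupt();
        return null;
    }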
    logger.info("downloading page " + request.getUrl());
    webDriver.get(request.getUrl());
    try {
        Thread.sleep(sleepTime);
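    } catch (InterruptedException e) {
        // Log through the class logger instead of printing the stack trace, and keep the interrupt flag set.
        logger.warn("interrupted while waiting for the page to render", e);
        Thread.currentThread().interrupt();
    }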
    WebDriver.Options manage = webDriver.manage();
    Site site = task.getSite();
    if (site.getCookies() != null) {
        for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
            Cookie cookie = new Cookie(cookieEntry.getKey(), cookieEntry.getValue());
            manage.addCookie(cookie);
        }
    }
    /*
     * TODO You can add mouse event or other processes
     *
     * @author: bob.li.0718@gmail.com
     */
    WebElement webElement = webDriver.findElement(By.xpath("/html"));
    String content = webElement.getAttribute("outerHTML");
    Page page = new Page();
    page.setRawText(content);
    page.setHtml(new Html(UrlUtils.fixAllRelativeHrefs(content, request.getUrl())));
    page.setUrl(new PlainText(request.getUrl()));
    page.setRequest(request);
    webDriverPool.returnToPool(webDriver);
    return page;
}
Also used : WebDriver(org.openqa.selenium.WebDriver) Site(us.codecraft.webmagic.Site) Cookie(org.openqa.selenium.Cookie) PlainText(us.codecraft.webmagic.selector.PlainText) Html(us.codecraft.webmagic.selector.Html) Page(us.codecraft.webmagic.Page) WebElement(org.openqa.selenium.WebElement) Map(java.util.Map)
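
In the download method above, the driver is handed back with webDriverPool.returnToPool(webDriver) only on the normal path, so an exception thrown between get() and the return would presumably leave the driver checked out. The following is a minimal sketch of a borrow/return pattern with try/finally. SimplePool and withResource are hypothetical names introduced for illustration; they are not part of webmagic or Selenium, and the pool here holds plain strings instead of WebDriver instances.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

public class PoolSketch {

    /** Hypothetical fixed-size pool; stands in for a driver pool. */
    static final class SimplePool<T> {
        private final BlockingQueue<T> idle;

        SimplePool(Collection<T> resources) {
            // Capacity equals the number of resources, so offer() on return never fails.
            this.idle = new ArrayBlockingQueue<>(resources.size(), false, resources);
        }

        T get() throws InterruptedException {
            return idle.take(); // blocks until a resource is free
        }

        void returnToPool(T resource) {
            idle.offer(resource);
        }

        int available() {
            return idle.size();
        }
    }

    /** Borrows a resource, runs the work, and always returns the resource. */
    static <T, R> R withResource(SimplePool<T> pool, Function<T, R> work) {
        T resource;
        try {
            resource = pool.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag set
            return null;
        }
        try {
            return work.apply(resource);
        } finally {
            pool.returnToPool(resource); // runs on success and on exception
        }
    }

    public static void main(String[] args) {
        SimplePool<String> pool = new SimplePool<>(Arrays.asList("driver-1"));
        System.out.println(withResource(pool, d -> d + " fetched page"));
        try {
            withResource(pool, d -> { throw new IllegalStateException("boom"); });
        } catch (IllegalStateException expected) {
            // The resource was returned even though the work failed.
            System.out.println("available after failure: " + pool.available());
        }
    }
}
```

The same shape could be applied to the method above by wrapping everything after webDriverPool.get() in a try block and moving returnToPool into finally.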

Example 8 with Page

use of us.codecraft.webmagic.Page in project webmagic by code4craft.

the class SeleniumDownloaderTest method test.

@Ignore("need chrome driver")
@Test
public void test() {
    SeleniumDownloader seleniumDownloader = new SeleniumDownloader(chromeDriverPath);
    long time1 = System.currentTimeMillis();
    for (int i = 0; i < 100; i++) {
        Page page = seleniumDownloader.download(new Request("http://huaban.com/"), new Task() {

            @Override
            public String getUUID() {
                return "huaban.com";
            }

            @Override
            public Site getSite() {
                return Site.me();
            }
        });
        System.out.println(page.getHtml().$("#waterfall").links().regex(".*pins.*").all());
    }
    System.out.println(System.currentTimeMillis() - time1);
}
Also used : Site(us.codecraft.webmagic.Site) Task(us.codecraft.webmagic.Task) Request(us.codecraft.webmagic.Request) Page(us.codecraft.webmagic.Page) Ignore(org.junit.Ignore) Test(org.junit.Test)

Example 9 with Page

use of us.codecraft.webmagic.Page in project webmagic by code4craft.

the class SeleniumDownloaderTest method testBaiduWenku.

@Ignore
@Test
public void testBaiduWenku() {
    SeleniumDownloader seleniumDownloader = new SeleniumDownloader(chromeDriverPath);
    seleniumDownloader.setSleepTime(10000);
    long time1 = System.currentTimeMillis();
    Page page = seleniumDownloader.download(new Request("http://wenku.baidu.com/view/462933ff04a1b0717fd5ddc2.html"), new Task() {

        @Override
        public String getUUID() {
            return "wenku.baidu.com";
        }

        @Override
        public Site getSite() {
            return Site.me();
        }
    });
    System.out.println(page.getHtml().$("div.inner").replace("<[^<>]+>", "").replace("&nbsp;", "").all());
}
Also used : Site(us.codecraft.webmagic.Site) Task(us.codecraft.webmagic.Task) Request(us.codecraft.webmagic.Request) Page(us.codecraft.webmagic.Page) Ignore(org.junit.Ignore) Test(org.junit.Test)

Example 10 with Page

use of us.codecraft.webmagic.Page in project webmagic by code4craft.

the class ProcessorBenchmark method test.

@Ignore
@Test
public void test() {
    ModelPageProcessor modelPageProcessor = ModelPageProcessor.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"), OschinaBlog.class);
    Page page = new Page();
    page.setRequest(new Request("http://my.oschina.net/flashsword/blog"));
    page.setUrl(new PlainText("http://my.oschina.net/flashsword/blog"));
    page.setHtml(new Html(html));
    long time = System.currentTimeMillis();
    for (int i = 0; i < 1000; i++) {
        modelPageProcessor.process(page);
    }
    System.out.println(System.currentTimeMillis() - time);
    time = System.currentTimeMillis();
    for (int i = 0; i < 1000; i++) {
        modelPageProcessor.process(page);
    }
    System.out.println(System.currentTimeMillis() - time);
}
Also used : PlainText(us.codecraft.webmagic.selector.PlainText) Request(us.codecraft.webmagic.Request) Html(us.codecraft.webmagic.selector.Html) Page(us.codecraft.webmagic.Page) Ignore(org.junit.Ignore) Test(org.junit.Test)
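
The benchmark above runs the same 1000-iteration loop twice and prints each elapsed time; the first pass presumably also absorbs class loading and JIT warm-up, so the second figure is likely the more representative one. A minimal sketch of the same measurement factored into a helper follows. TimingSketch and timeMillis are hypothetical names, and the counter task is only a stand-in for modelPageProcessor.process(page).

```java
public class TimingSketch {

    /** Runs the task the given number of times and returns elapsed milliseconds. */
    static long timeMillis(Runnable task, int iterations) {
        long start = System.nanoTime(); // monotonic clock, suited to measuring intervals
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int[] counter = {0};
        Runnable work = () -> counter[0]++; // stand-in for the code under test

        // First pass: treated as warm-up. Second pass: the measured run.
        System.out.println("warm-up ms:  " + timeMillis(work, 1000));
        System.out.println("measured ms: " + timeMillis(work, 1000));
    }
}
```

For anything beyond a rough comparison, a dedicated harness is usually a better fit than hand-rolled loops, since the JIT can optimize away work whose result is unused.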

Aggregations

Page (us.codecraft.webmagic.Page) 15
Request (us.codecraft.webmagic.Request) 9
Test (org.junit.Test) 7
PlainText (us.codecraft.webmagic.selector.PlainText) 6
Site (us.codecraft.webmagic.Site) 5
Ignore (org.junit.Ignore) 3
Task (us.codecraft.webmagic.Task) 3
Html (us.codecraft.webmagic.selector.Html) 3
IOException (java.io.IOException) 2
HttpServer (com.github.dreamhead.moco.HttpServer) 1
Runnable (com.github.dreamhead.moco.Runnable) 1
UnsupportedEncodingException (java.io.UnsupportedEncodingException) 1
ArrayList (java.util.ArrayList) 1
Map (java.util.Map) 1
HttpHost (org.apache.http.HttpHost) 1
CloseableHttpResponse (org.apache.http.client.methods.CloseableHttpResponse) 1
HttpUriRequest (org.apache.http.client.methods.HttpUriRequest) 1
Cookie (org.openqa.selenium.Cookie) 1
WebDriver (org.openqa.selenium.WebDriver) 1
WebElement (org.openqa.selenium.WebElement) 1