
Example 26 with Page

use of us.codecraft.webmagic.Page in project webmagic by code4craft.

the class ModelPageProcessorTest method testExtractNoLinks.

@Test
public void testExtractNoLinks() throws Exception {
    ModelPageProcessor modelPageProcessor = ModelPageProcessor.create(null, MockModel.class);
    Page page = pageMocker.getMockPage();
    modelPageProcessor.setExtractLinks(false);
    modelPageProcessor.process(page);
    assertThat(page.getTargetRequests()).isEmpty();
}
Also used : Page(us.codecraft.webmagic.Page) Test(org.junit.Test)

Example 27 with Page

use of us.codecraft.webmagic.Page in project webmagic by code4craft.

the class PageMocker method getMockPage.

public Page getMockPage() throws IOException {
    Page page = new Page();
    page.setRawText(IOUtils.toString(PageMocker.class.getClassLoader().getResourceAsStream("html/mock-webmagic.html")));
    page.setRequest(new Request("http://webmagic.io/list/0"));
    page.setUrl(new PlainText("http://webmagic.io/list/0"));
    return page;
}
Also used : PlainText(us.codecraft.webmagic.selector.PlainText) Request(us.codecraft.webmagic.Request) Page(us.codecraft.webmagic.Page)

Example 28 with Page

use of us.codecraft.webmagic.Page in project webmagic by code4craft.

the class HttpClientDownloader method download.

@Override
public Page download(Request request, Task task) {
    if (task == null || task.getSite() == null) {
        throw new NullPointerException("task or site can not be null");
    }
    CloseableHttpResponse httpResponse = null;
    CloseableHttpClient httpClient = getHttpClient(task.getSite());
    Proxy proxy = proxyProvider != null ? proxyProvider.getProxy(task) : null;
    HttpClientRequestContext requestContext = httpUriRequestConverter.convert(request, task.getSite(), proxy);
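    // Start from a failed page so the catch and finally blocks always have a non-null Page to work with.
    Page page = Page.fail();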
    try {
        httpResponse = httpClient.execute(requestContext.getHttpUriRequest(), requestContext.getHttpClientContext());
        page = handleResponse(request, request.getCharset() != null ? request.getCharset() : task.getSite().getCharset(), httpResponse, task);
        onSuccess(request);
        logger.info("downloading page success {}", request.getUrl());
        return page;
    } catch (IOException e) {
        logger.warn("download page {} error", request.getUrl(), e);
        onError(request);
        return page;
    } finally {
        if (httpResponse != null) {
            // ensure the connection is released back to pool
            EntityUtils.consumeQuietly(httpResponse.getEntity());
        }
        if (proxyProvider != null && proxy != null) {
            proxyProvider.returnProxy(proxy, page, task);
        }
    }
}
Also used : CloseableHttpClient(org.apache.http.impl.client.CloseableHttpClient) Proxy(us.codecraft.webmagic.proxy.Proxy) CloseableHttpResponse(org.apache.http.client.methods.CloseableHttpResponse) Page(us.codecraft.webmagic.Page) IOException(java.io.IOException)
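
The control flow of download above (start from a failure value, overwrite it on success, report the outcome in finally) can be isolated in a small JDK-only sketch. Everything in it is hypothetical and is not part of webmagic: Result stands in for Page, Fetcher for the HTTP client call, and cleanupLog for the proxy-return step.

```java
import java.io.IOException;
import java.util.List;

public class FailFirstDownloadSketch {

    /** Hypothetical stand-in for Page: only records success and a body. */
    public static final class Result {
        public final boolean success;
        public final String body;

        private Result(boolean success, String body) {
            this.success = success;
            this.body = body;
        }

        public static Result fail() {
            return new Result(false, null);
        }

        public static Result ok(String body) {
            return new Result(true, body);
        }
    }

    /** Hypothetical stand-in for the HTTP call; may throw like httpClient.execute. */
    public interface Fetcher {
        String fetch(String url) throws IOException;
    }

    public static Result download(String url, Fetcher fetcher, List<String> cleanupLog) {
        // Default to a failure value so catch and finally never see null.
        Result result = Result.fail();
        try {
            result = Result.ok(fetcher.fetch(url));
            return result;
        } catch (IOException e) {
            // fetch threw before the assignment, so result is still the failure value.
            return result;
        } finally {
            // Runs on both paths and sees whichever value result holds at that point.
            cleanupLog.add(url + " success=" + result.success);
        }
    }
}
```

Because the assignment happens only after fetch returns, the finally block observes a successful Result on the normal path and the initial failure value on the exception path.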

Example 29 with Page

use of us.codecraft.webmagic.Page in project webmagic by code4craft.

the class HttpClientDownloader method handleResponse.

protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException {
    byte[] bytes = IOUtils.toByteArray(httpResponse.getEntity().getContent());
    String contentType = httpResponse.getEntity().getContentType() == null ? "" : httpResponse.getEntity().getContentType().getValue();
    Page page = new Page();
    page.setBytes(bytes);
    if (!request.isBinaryContent()) {
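        if (charset == null) {
            // No charset was supplied by the caller: detect one from the Content-Type value and the body bytes.
            charset = getHtmlCharset(contentType, bytes);
        }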
        page.setCharset(charset);
        page.setRawText(new String(bytes, charset));
    }
    page.setUrl(new PlainText(request.getUrl()));
    page.setRequest(request);
    page.setStatusCode(httpResponse.getStatusLine().getStatusCode());
    page.setDownloadSuccess(true);
    if (responseHeader) {
        page.setHeaders(HttpClientUtils.convertHeaders(httpResponse.getAllHeaders()));
    }
    return page;
}
Also used : PlainText(us.codecraft.webmagic.selector.PlainText) Page(us.codecraft.webmagic.Page)
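
The charset fallback in handleResponse can be illustrated with a JDK-only sketch. This is not webmagic's getHtmlCharset, whose body is not shown here; the class name, the regular expression, and the UTF-8 default are all assumptions made for illustration. The sketch prefers an explicit charset, then a charset parameter found in the Content-Type value, then a default.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetFallbackSketch {

    // Matches the charset parameter of a Content-Type value, e.g. "text/html; charset=UTF-8".
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Picks the explicit charset if given, else the Content-Type charset, else UTF-8 (assumed default). */
    public static Charset resolveCharset(String explicitCharset, String contentType) {
        if (explicitCharset != null) {
            return Charset.forName(explicitCharset);
        }
        if (contentType != null) {
            Matcher matcher = CHARSET_PARAM.matcher(contentType);
            if (matcher.find()) {
                try {
                    if (Charset.isSupported(matcher.group(1))) {
                        return Charset.forName(matcher.group(1));
                    }
                } catch (IllegalCharsetNameException e) {
                    // Malformed charset name in the header: fall through to the default.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the raw body bytes with the resolved charset. */
    public static String decode(byte[] bytes, String explicitCharset, String contentType) {
        return new String(bytes, resolveCharset(explicitCharset, contentType));
    }
}
```

Passing a Charset object to the String constructor avoids the checked UnsupportedEncodingException that the String-name overload used in handleResponse can throw.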

Aggregations

Page (us.codecraft.webmagic.Page): 29
Request (us.codecraft.webmagic.Request): 22
Test (org.junit.Test): 19
IOException (java.io.IOException): 11
HttpUriRequest (org.apache.http.client.methods.HttpUriRequest): 11
HttpServer (com.github.dreamhead.moco.HttpServer): 10
Runnable (com.github.dreamhead.moco.Runnable): 10
UnsupportedEncodingException (java.io.UnsupportedEncodingException): 10
PlainText (us.codecraft.webmagic.selector.PlainText): 8
Site (us.codecraft.webmagic.Site): 5
Task (us.codecraft.webmagic.Task): 5
Ignore (org.junit.Ignore): 3
Proxy (us.codecraft.webmagic.proxy.Proxy): 2
Html (us.codecraft.webmagic.selector.Html): 2
ArrayList (java.util.ArrayList): 1
Map (java.util.Map): 1
CloseableHttpResponse (org.apache.http.client.methods.CloseableHttpResponse): 1
CloseableHttpClient (org.apache.http.impl.client.CloseableHttpClient): 1
Cookie (org.openqa.selenium.Cookie): 1
WebDriver (org.openqa.selenium.WebDriver): 1