Example 1 with Request

Use of us.codecraft.webmagic.Request in project webmagic by code4craft.

Class FileCacheQueueScheduler, method readUrlFile.

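// Reloads the persisted URL file: every line is recorded as a known URL,
// and only lines past the saved cursor are queued again as requests.
private void readUrlFile() throws IOException {
    String line;
    BufferedReader fileUrlReader = null;
    try {
        fileUrlReader = new BufferedReader(new FileReader(getFileName(fileUrlAllName)));
        int lineReaded = 0;
        while ((line = fileUrlReader.readLine()) != null) {
            // Remember the URL so it is not scheduled a second time.
            urls.add(line.trim());
            lineReaded++;
            // Lines up to the cursor were already handled in an earlier run.
            if (lineReaded > cursor.get()) {
                queue.add(new Request(line));
            }
        }
    } finally {
        if (fileUrlReader != null) {
            IOUtils.closeQuietly(fileUrlReader);
        }
    }
}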
Also used : Request(us.codecraft.webmagic.Request)
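
The cursor-based replay in readUrlFile can be tried out without the webmagic classes. What follows is a minimal sketch under stated assumptions, not the project's implementation: CursorReplaySketch, seen, pending and replay are hypothetical names, plain strings stand in for Request objects, and the lines are passed in as a list rather than read from the URL file.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-in for the scheduler state; not part of webmagic.
public class CursorReplaySketch {
    // Every URL found in the log, used for duplicate checks.
    public final Set<String> seen = new LinkedHashSet<>();
    // URLs past the cursor, i.e. still waiting to be processed.
    public final Queue<String> pending = new ArrayDeque<>();

    // Replays the logged lines; cursor is the number of lines already consumed.
    public void replay(List<String> lines, int cursor) {
        int linesRead = 0;
        for (String line : lines) {
            String url = line.trim();
            seen.add(url);
            linesRead++;
            if (linesRead > cursor) {
                pending.add(url);
            }
        }
    }

    public static void main(String[] args) {
        CursorReplaySketch sketch = new CursorReplaySketch();
        List<String> log = Arrays.asList(
                "http://example.com/a",
                "http://example.com/b",
                "http://example.com/c");
        // Two lines were consumed before the restart, so only the third is re-queued.
        sketch.replay(log, 2);
        System.out.println(sketch.seen.size() + " seen, pending = " + sketch.pending);
    }
}
```

Unlike the original method, this sketch queues the trimmed URL, so the queue and the seen set hold the same string.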

Example 2 with Request

Use of us.codecraft.webmagic.Request in project webmagic by code4craft.

Class ModelPageProcessorTest, method getMockPage.

private Page getMockPage() throws IOException {
    Page page = new Page();
    page.setRawText(IOUtils.toString(getClass().getClassLoader().getResourceAsStream("html/mock-webmagic.html")));
    page.setRequest(new Request("http://webmagic.io/list/0"));
    page.setUrl(new PlainText("http://webmagic.io/list/0"));
    return page;
}
Also used : PlainText(us.codecraft.webmagic.selector.PlainText) Request(us.codecraft.webmagic.Request) Page(us.codecraft.webmagic.Page)

Example 3 with Request

Use of us.codecraft.webmagic.Request in project webmagic by code4craft.

Class BloomFilterDuplicateRemoverTest, method testMemory.

// Rough benchmark comparing a Bloom filter with a hash set for duplicate checks.
// The memory figures are deltas of Runtime.freeMemory(), so they are only
// approximate: garbage collection and heap growth can distort them.
@Ignore("long time")
@Test
public void testMemory() throws Exception {
    int times = 5000000;
    DuplicateRemover duplicateRemover = new BloomFilterDuplicateRemover(times, 0.005);
    long freeMemory = Runtime.getRuntime().freeMemory();
    long time = System.currentTimeMillis();
    for (int i = 0; i < times; i++) {
        duplicateRemover.isDuplicate(new Request(String.valueOf(i)), null);
    }
    System.out.println("Time used by bloomfilter:" + (System.currentTimeMillis() - time));
    System.out.println("Memory used by bloomfilter:" + (freeMemory - Runtime.getRuntime().freeMemory()));
    // Repeat the same workload with the hash-set based remover.
    duplicateRemover = new HashSetDuplicateRemover();
    System.gc();
    freeMemory = Runtime.getRuntime().freeMemory();
    time = System.currentTimeMillis();
    for (int i = 0; i < times; i++) {
        duplicateRemover.isDuplicate(new Request(String.valueOf(i)), null);
    }
    System.out.println("Time used by hashset:" + (System.currentTimeMillis() - time));
    System.out.println("Memory used by hashset:" + (freeMemory - Runtime.getRuntime().freeMemory()));
}
Also used : HashSetDuplicateRemover(us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover) Request(us.codecraft.webmagic.Request) DuplicateRemover(us.codecraft.webmagic.scheduler.component.DuplicateRemover) Ignore(org.junit.Ignore) Test(org.junit.Test)

Example 4 with Request

Use of us.codecraft.webmagic.Request in project webmagic by code4craft.

Class BloomFilterDuplicateRemoverTest, method testRemove.

@Test
public void testRemove() throws Exception {
    BloomFilterDuplicateRemover bloomFilterDuplicateRemover = new BloomFilterDuplicateRemover(10);
    boolean isDuplicate = bloomFilterDuplicateRemover.isDuplicate(new Request("a"), null);
    assertThat(isDuplicate).isFalse();
    isDuplicate = bloomFilterDuplicateRemover.isDuplicate(new Request("a"), null);
    assertThat(isDuplicate).isTrue();
    isDuplicate = bloomFilterDuplicateRemover.isDuplicate(new Request("b"), null);
    assertThat(isDuplicate).isFalse();
    isDuplicate = bloomFilterDuplicateRemover.isDuplicate(new Request("b"), null);
    assertThat(isDuplicate).isTrue();
}
Also used : Request(us.codecraft.webmagic.Request) Test(org.junit.Test)
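
The contract exercised by testRemove (the first check of a URL reports false, a repeated check reports true) can be illustrated with a small stand-alone Bloom filter. This is a minimal sketch under stated assumptions, not the code behind BloomFilterDuplicateRemover: BloomFilterSketch is a hypothetical class, URLs are plain strings, and the two hash values are derived from String.hashCode() in a deliberately simple way. As with any Bloom filter, a true result can be a false positive, while a false result is always correct.

```java
import java.util.BitSet;

// Hypothetical illustration of a Bloom-filter duplicate check; not part of webmagic.
public class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Returns true if the URL was probably seen before;
    // otherwise records it and returns false.
    public boolean isDuplicate(String url) {
        int h1 = url.hashCode();
        // Simplistic second hash (assumption); forced odd so the probe step is never zero.
        int h2 = (h1 * 31 + 17) | 1;
        boolean allSet = true;
        for (int i = 0; i < hashCount; i++) {
            // Double hashing: the i-th probe position is h1 + i * h2, wrapped into the bit array.
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                allSet = false;
                bits.set(index);
            }
        }
        return allSet;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        System.out.println(filter.isDuplicate("a"));
        System.out.println(filter.isDuplicate("a"));
        System.out.println(filter.isDuplicate("b"));
        System.out.println(filter.isDuplicate("b"));
    }
}
```

The trade-off against a hash set is memory: the filter stores a fixed number of bits regardless of URL length, at the cost of occasionally treating a new URL as a duplicate.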

Example 5 with Request

Use of us.codecraft.webmagic.Request in project webmagic by code4craft.

Class DelayQueueSchedulerTest, method test.

@Ignore("infinite")
@Test
public void test() {
    DelayQueueScheduler delayQueueScheduler = new DelayQueueScheduler(1, TimeUnit.SECONDS);
    delayQueueScheduler.push(new Request("1"), null);
    while (true) {
        Request poll = delayQueueScheduler.poll(null);
        System.out.println(System.currentTimeMillis() + "\t" + poll);
    }
}
Also used : Request(us.codecraft.webmagic.Request) Ignore(org.junit.Ignore) Test(org.junit.Test)
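
The delayed hand-out that the test above polls for can be approximated with the JDK's java.util.concurrent.DelayQueue. The following is a minimal sketch under stated assumptions, not the DelayQueueScheduler implementation: DelayQueueSketch and DelayedUrl are hypothetical names, a plain string stands in for the Request, and the element is delivered once rather than rescheduled.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of delay-based scheduling; not part of webmagic.
public class DelayQueueSketch {

    // A URL that becomes available only after its delay has elapsed.
    public static final class DelayedUrl implements Delayed {
        public final String url;
        private final long readyAtNanos;

        public DelayedUrl(String url, long delay, TimeUnit unit) {
            this.url = url;
            this.readyAtNanos = System.nanoTime() + unit.toNanos(delay);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            // Remaining time; zero or negative means the element is ready.
            return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DelayQueue<DelayedUrl> queue = new DelayQueue<>();
        queue.put(new DelayedUrl("1", 200, TimeUnit.MILLISECONDS));

        // poll() does not block: it returns null while the head is still delayed.
        System.out.println("immediate poll: " + queue.poll());

        // take() blocks until the delay of the head element has expired.
        DelayedUrl ready = queue.take();
        System.out.println(System.currentTimeMillis() + "\t" + ready.url);
    }
}
```

A polling loop like the one in the test would see null from a non-blocking poll() until the delay runs out, which is why the sketch uses take() to wait for the element instead of spinning.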

Aggregations

Request (us.codecraft.webmagic.Request): 45
Test (org.junit.Test): 32
Page (us.codecraft.webmagic.Page): 22
HttpUriRequest (org.apache.http.client.methods.HttpUriRequest): 13
HttpServer (com.github.dreamhead.moco.HttpServer): 12
Runnable (com.github.dreamhead.moco.Runnable): 12
IOException (java.io.IOException): 12
UnsupportedEncodingException (java.io.UnsupportedEncodingException): 11
Task (us.codecraft.webmagic.Task): 10
Ignore (org.junit.Ignore): 8
Site (us.codecraft.webmagic.Site): 6
PlainText (us.codecraft.webmagic.selector.PlainText): 6
DuplicateRemover (us.codecraft.webmagic.scheduler.component.DuplicateRemover): 4
Matcher (java.util.regex.Matcher): 2
ResultItems (us.codecraft.webmagic.ResultItems): 2
HashSetDuplicateRemover (us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover): 2
JSONObject (com.alibaba.fastjson.JSONObject): 1
URI (java.net.URI): 1
ArrayList (java.util.ArrayList): 1
Map (java.util.Map): 1