Example 1 with FileCacheQueueScheduler

Use of us.codecraft.webmagic.scheduler.FileCacheQueueScheduler in project webmagic by code4craft.

In class SpiderTest, method testGlobalSpider.

@Ignore
@Test
public void testGlobalSpider() {
    //        PageProcessor pageProcessor = new MeicanProcessor();
    //        Spider.me().pipeline(new FilePipeline()).scheduler(new FileCacheQueueScheduler(pageProcessor.getSite(),"/data/temp/webmagic/cache/")).
    //                processor(pageProcessor).run();
    SimplePageProcessor pageProcessor2 = new SimplePageProcessor("http://www.diaoyuweng.com/home.php?mod=space&uid=88304&do=thread&view=me&type=thread&from=space", "http://www.diaoyuweng.com/thread-*-1-1.html");
    System.out.println(pageProcessor2.getSite().getCharset());
    pageProcessor2.getSite().setSleepTime(500);
    Spider.create(pageProcessor2).addPipeline(new FilePipeline()).scheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")).run();
}
Also used : FilePipeline(us.codecraft.webmagic.pipeline.FilePipeline) SimplePageProcessor(us.codecraft.webmagic.processor.SimplePageProcessor) FileCacheQueueScheduler(us.codecraft.webmagic.scheduler.FileCacheQueueScheduler) Ignore(org.junit.Ignore) Test(org.junit.Test)

Example 2 with FileCacheQueueScheduler

Use of us.codecraft.webmagic.scheduler.FileCacheQueueScheduler in project webmagic by code4craft.

In class SinablogProcessorTest, method test.

@Ignore
@Test
public void test() throws IOException {
    SinaBlogProcessor sinaBlogProcessor = new SinaBlogProcessor();
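    // A pipeline handles the results once a page has been crawled.
    // By default, output goes to the /data/webmagic/ftl/[domain] directory.
    JsonFilePipeline pipeline = new JsonFilePipeline("/data/webmagic/");
    // Spider.me() is shorthand; it simply creates a new Spider instance.
    // Spider.pipeline() sets a pipeline and supports method chaining.
    // ConsolePipeline prints results to the console.
    // FileCacheQueueScheduler saves URLs and supports resuming an interrupted crawl;
    // its temporary files are written to the /data/temp/webmagic/cache directory.
    // Spider.run() starts the crawl.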
    Spider.create(sinaBlogProcessor).pipeline(new FilePipeline()).pipeline(pipeline).scheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")).run();
}
Also used : JsonFilePipeline(us.codecraft.webmagic.pipeline.JsonFilePipeline) FilePipeline(us.codecraft.webmagic.pipeline.FilePipeline) FileCacheQueueScheduler(us.codecraft.webmagic.scheduler.FileCacheQueueScheduler) SinaBlogProcessor(us.codecraft.webmagic.samples.SinaBlogProcessor) Ignore(org.junit.Ignore) Test(org.junit.Test)
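
The comments in Example 2 say that FileCacheQueueScheduler saves URLs and supports resuming an interrupted crawl. The sketch below illustrates that general idea in plain Java. It is not webmagic's implementation: the class name FileBackedUrlQueue, the file names urls.txt and cursor.txt, and the on-disk layout are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a resumable, file-backed URL queue.
// NOT webmagic's FileCacheQueueScheduler; names and file layout are assumed.
class FileBackedUrlQueue {
    private final Path urlFile;     // every URL ever queued, one per line
    private final Path cursorFile;  // number of URLs already handed out
    private final Set<String> seen = new HashSet<>();
    private final Deque<String> pending = new ArrayDeque<>();
    private int cursor = 0;

    FileBackedUrlQueue(Path dir) throws IOException {
        Files.createDirectories(dir);
        urlFile = dir.resolve("urls.txt");
        cursorFile = dir.resolve("cursor.txt");
        // Restore the position reached by a previous run, if any.
        if (Files.exists(cursorFile)) {
            String text = new String(Files.readAllBytes(cursorFile), StandardCharsets.UTF_8);
            cursor = Integer.parseInt(text.trim());
        }
        // Reload known URLs; only those past the cursor are still pending.
        if (Files.exists(urlFile)) {
            List<String> lines = Files.readAllLines(urlFile, StandardCharsets.UTF_8);
            for (int i = 0; i < lines.size(); i++) {
                seen.add(lines.get(i));
                if (i >= cursor) {
                    pending.add(lines.get(i));
                }
            }
        }
    }

    // Queue a URL unless it was seen before; append it to the URL file.
    synchronized void push(String url) throws IOException {
        if (!seen.add(url)) {
            return; // duplicate, skip
        }
        Files.write(urlFile, (url + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        pending.add(url);
    }

    // Hand out the next URL (or null if none) and persist the new cursor.
    synchronized String poll() throws IOException {
        String url = pending.poll();
        if (url != null) {
            cursor++;
            Files.write(cursorFile, Integer.toString(cursor).getBytes(StandardCharsets.UTF_8));
        }
        return url;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("url-queue-demo");
        FileBackedUrlQueue queue = new FileBackedUrlQueue(dir);
        queue.push("http://example.com/a");
        queue.push("http://example.com/b");
        System.out.println(queue.poll());
        // A second instance on the same directory continues where the first stopped.
        System.out.println(new FileBackedUrlQueue(dir).poll());
    }
}
```

Persisting a cursor instead of rewriting the URL file keeps each poll to a small write, and it lets the same file serve as the duplicate-detection record after a restart. A production scheduler would also need to handle partially written files and concurrent processes, which this sketch ignores.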

Aggregations

Ignore (org.junit.Ignore): 2
Test (org.junit.Test): 2
FilePipeline (us.codecraft.webmagic.pipeline.FilePipeline): 2
FileCacheQueueScheduler (us.codecraft.webmagic.scheduler.FileCacheQueueScheduler): 2
JsonFilePipeline (us.codecraft.webmagic.pipeline.JsonFilePipeline): 1
SimplePageProcessor (us.codecraft.webmagic.processor.SimplePageProcessor): 1
SinaBlogProcessor (us.codecraft.webmagic.samples.SinaBlogProcessor): 1