Use of us.codecraft.webmagic.pipeline.FilePipeline in project webmagic by code4craft.
From the class SpiderTest, method testSpider.
@Ignore
@Test
public void testSpider() throws InterruptedException {
    Spider me = Spider.create(new HuxiuProcessor()).addPipeline(new FilePipeline());
    me.run();
}
Use of us.codecraft.webmagic.pipeline.FilePipeline in project webmagic by code4craft.
From the class SpiderTest, method testGlobalSpider.
@Ignore
@Test
public void testGlobalSpider() {
    // PageProcessor pageProcessor = new MeicanProcessor();
    // Spider.me().pipeline(new FilePipeline()).scheduler(new FileCacheQueueScheduler(pageProcessor.getSite(),"/data/temp/webmagic/cache/")).
    // processor(pageProcessor).run();
    SimplePageProcessor pageProcessor2 = new SimplePageProcessor("http://www.diaoyuweng.com/thread-*-1-1.html");
    System.out.println(pageProcessor2.getSite().getCharset());
    pageProcessor2.getSite().setSleepTime(500);
    Spider.create(pageProcessor2)
            .addUrl("http://www.diaoyuweng.com/home.php?mod=space&uid=88304&do=thread&view=me&type=thread&from=space")
            .addPipeline(new FilePipeline())
            .scheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/"))
            .run();
}
Use of us.codecraft.webmagic.pipeline.FilePipeline in project webmagic by code4craft.
From the class SinablogProcessorTest, method test.
@Ignore
@Test
public void test() throws IOException {
    SinaBlogProcessor sinaBlogProcessor = new SinaBlogProcessor();
    // A pipeline handles the results after a page has been crawled.
    // By default, output goes to the /data/webmagic/ftl/[domain] directory.
    JsonFilePipeline pipeline = new JsonFilePipeline("/data/webmagic/");
    // Spider.me() is shorthand; it simply creates a new instance.
    // Spider.pipeline() sets a pipeline and supports chained calls.
    // ConsolePipeline prints results to the console.
    // FileCacheQueueScheduler saves URLs so an interrupted crawl can resume; temporary files are written to /data/temp/webmagic/cache.
    // Spider.run() starts the crawl.
    Spider.create(sinaBlogProcessor)
            .pipeline(new FilePipeline())
            .pipeline(pipeline)
            .scheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/"))
            .run();
}