VsCrawler Documentation
Demo for this article
1. After adding the Maven dependency and starting the demo, the log output is:
10:39:45.636 [main] WARN c.v.vscrawler.core.event.EventLoop - 程序已中止
10:39:45.641 [main] INFO c.v.v.core.config.DirectoryWatcher - 注册事件:ENTRY_MODIFY
10:39:45.641 [main] INFO c.v.v.core.config.DirectoryWatcher - 注册事件:ENTRY_DELETE
10:39:45.660 [main] INFO c.v.v.core.config.DirectoryWatcher - 监控目录:D:\workspace\vscrawler\target\classes
10:39:45.661 [main] INFO c.v.v.c.seed.BerkeleyDBSeedManager - vsCrawler配置工作目录:classpath:work
10:39:45.672 [main] INFO c.v.v.c.seed.BerkeleyDBSeedManager - vsCrawler实际工作目录:D:\workspace\vscrawler\target\classes\work
10:39:45.709 [watch-service-thread-1] INFO c.v.v.core.config.DirectoryWatcher - contextPath:work
10:39:45.710 [watch-service-thread-1] INFO c.v.v.core.config.DirectoryWatcher - directoryPath:D:\workspace\vscrawler\target\classes
10:39:45.710 [watch-service-thread-1] INFO c.v.v.core.config.DirectoryWatcher - absolutePath:D:\workspace\vscrawler\target\classes\work
10:39:45.710 [watch-service-thread-1] INFO c.v.v.core.config.DirectoryWatcher - kind:ENTRY_MODIFY
10:39:45.710 [watch-service-thread-1] INFO c.v.v.core.config.DirectoryWatcher - 修改:D:\workspace\vscrawler\target\classes\work
10:39:45.964 [main] INFO c.v.v.core.seed.LocalFileSeedSource - 没有配置初始种子
10:39:45.964 [main] INFO c.v.v.c.seed.BerkeleyDBSeedManager - import new init seeds:0
注入一个种子任务
################################################
############## VSCrawler ##############
############## 0.0.1 ##############
############## 你有一个有意思的灵魂 ##############
################################################
############## virjar ##############
################################################
10:39:45.975 [VSCrawler-Dispatch] INFO com.virjar.vscrawler.core.VSCrawler - Spider started!
10:39:45.976 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
10:39:46.100 [vsCrawlerEventLoop] INFO com.virjar.vscrawler.core.VSCrawler - 新的种子加入,激活爬虫派发线程
10:39:46.127 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
Exception in thread "VSCrawlerWorker-thread-1" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
    at org.apache.http.impl.client.DefaultRedirectStrategy.<init>(DefaultRedirectStrategy.java:76)
    at org.apache.http.impl.client.DefaultRedirectStrategy.<clinit>(DefaultRedirectStrategy.java:84)
    at com.virjar.vscrawler.core.net.DefaultHttpClientGenerator.gen(DefaultHttpClientGenerator.java:22)
    at com.virjar.vscrawler.core.net.session.CrawlerSession.<init>(CrawlerSession.java:69)
    at com.virjar.vscrawler.core.net.session.CrawlerSessionPool.createNewSession(CrawlerSessionPool.java:126)
    at com.virjar.vscrawler.core.net.session.CrawlerSessionPool.borrowOne(CrawlerSessionPool.java:157)
10:39:46.224 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
    at com.virjar.vscrawler.core.VSCrawler$SeedProcessTask.processSeed(VSCrawler.java:234)
    at com.virjar.vscrawler.core.VSCrawler$SeedProcessTask.run(VSCrawler.java:222)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 11 more
The following dependency is missing:
<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging</artifactId>
    <version>1.2</version>
</dependency>
Adding it resolves the error.
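As a quick sanity check before starting the crawler, something like the following can confirm whether a class such as org.apache.commons.logging.LogFactory can be loaded. This is only a sketch: ClasspathCheck and isPresent are hypothetical helper names of my own, not part of VSCrawler.

```java
// Pre-flight check: reports whether a class needed at runtime can be loaded.
// ClasspathCheck is a hypothetical helper, not something VSCrawler provides.
final class ClasspathCheck {

    /** Returns true if the named class can be found on the current classpath. */
    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.commons.logging.LogFactory";
        if (isPresent(name)) {
            System.out.println(name + " found");
        } else {
            System.out.println(name + " is missing, add commons-logging to pom.xml");
        }
    }
}
```

Running it from the same module as the demo shows whether the dependency made it onto the classpath, without waiting for a worker thread to fail.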
2. Next, you will find that the logback configuration file is missing. I am not very familiar with logback, so I copied one from elsewhere:
logback.xml
<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
    <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
        <encoder charset="UTF-8">
            <pattern>%d{yyyy-MM-dd HH:mm:ss} %-5level [%thread] %class{5}:%line>>%msg%n</pattern>
        </encoder>
    </appender>
    <appender name="file" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <encoder charset="UTF-8">
            <pattern>%d{yyyy-MM-dd HH:mm:ss} %-5level [%thread] %class{5}:%line>>%msg%n</pattern>
        </encoder>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${catalina.base}/logs/proxyipcenter/info.%d{yyyy-MM-dd}.log</fileNamePattern>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
    </appender>
    <root level="info">
        <appender-ref ref="console"/>
    </root>
</configuration>
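Logback looks for logback.xml on the classpath, so in a Maven project the file normally goes under src/main/resources. A small check like the one below can show whether a resource is actually visible at runtime. It is a sketch under that assumption, and ResourceCheck and locate are hypothetical names, not VSCrawler or logback APIs.

```java
import java.net.URL;

// Reports where (or whether) a configuration file is visible on the classpath.
// ResourceCheck is a hypothetical helper written for illustration only.
final class ResourceCheck {

    /** Returns the URL of a classpath resource, or null if it cannot be found. */
    static URL locate(String resourceName) {
        return ResourceCheck.class.getClassLoader().getResource(resourceName);
    }

    public static void main(String[] args) {
        // File names taken from this article; extend the list as needed.
        String[] names = {"logback.xml", "proxyclient.properties"};
        for (String name : names) {
            URL url = locate(name);
            System.out.println(name + " -> " + (url == null ? "NOT FOUND" : url.toString()));
        }
    }
}
```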
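3. Running the demo again then reports that a file is missing.
It turns out that VSCrawler integrates its dungproxy proxy by default, which needs a configuration file, proxyclient.properties: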
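# Default crawler configuration. It is merged with the entries in vsCrawler.properties, and the result is passed to each component as the effective configuration
# Maximum idle time, 25 minutes by default
sessionPool.maxIdle=25 * 60 * 1000
# Minimum idle time, 10s by default: once a session is returned, it cannot be used again for at least 10s
sessionPool.minIdl=10 * 1000
# Maximum continuous use time, one hour by default: a session that has been in use for one hour is destroyed and its user is logged out
sessionPool.maxDuration=60 * 60 * 1000
# Maximum concurrency per user. By default a session can be used by only one thread at a time, so each user crawls serially in a single thread and state never gets mixed up. This suits cases where submitting a query and fetching its result span several stateful requests
sessionPool.maxOccurs=10
# Number of active sessions. If you have n accounts and this value is m: when m<n, m accounts are kept logged in; when n<m, n users are kept logged in. In vscrawler, a session with a logged-in user is managed as a resource
sessionPool.activeUser=65535
# When the sessionPool prepares sessions asynchronously, it needs separate threads for login, session checks, and similar operations. These involve the network and can be very slow, so the number of threads in the sessionPool must be configured (a dynamic thread pool is being designed)
sessionPool.monitorThreadNumber=2
# Number of crawler threads, 10 by default
vsCrawler.threadNumber=1
# Working directory, which holds some intermediate crawler data
vsCrawler.Working.directory=classpath:work
# Initial seed file
vsCrawler.initSeedFile=
# Expected number of seeds
seedManager.expectedSeedNumber=1000000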
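The first comment in this file says the defaults are merged with the entries in vsCrawler.properties. I have not read how VSCrawler does this internally, so the following is only a rough illustration of the idea with plain java.util.Properties, where user entries override defaults. ConfigMergeSketch, merge, and parse are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustration only: overlays user settings on top of defaults.
// This is an assumption about the general idea, not VSCrawler's actual code.
final class ConfigMergeSketch {

    /** Returns a new Properties holding the defaults, overwritten by any override entries. */
    static Properties merge(Properties defaults, Properties overrides) {
        Properties merged = new Properties();
        merged.putAll(defaults);
        merged.putAll(overrides); // later entries win for duplicate keys
        return merged;
    }

    /** Parses properties-file text; in a real setup this would read from a file. */
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        Properties defaults = parse("vsCrawler.threadNumber=10\nsessionPool.monitorThreadNumber=2\n");
        Properties overrides = parse("vsCrawler.threadNumber=1\n");
        Properties merged = merge(defaults, overrides);
        System.out.println(merged.getProperty("vsCrawler.threadNumber"));          // 1
        System.out.println(merged.getProperty("sessionPool.monitorThreadNumber")); // 2
    }
}
```

Under this reading, only the keys you want to change need to appear in your own file; everything else falls back to the defaults.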
At this point the demo runs, so it is time to start testing.
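After writing the above, I found that following the demo at
http://git.oschina.net/virjar/vscrawler/tree/master/vscrawler-samples/src/main
works fine. Apart from the configuration file name being proxyclient.properties by default, there are no other problems.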