VsCrawler, Day One: Fixing the Pitfalls Hit While Testing the Demo

VsCrawler documentation

Demo for this article

1. After adding the Maven dependency and starting the demo, the log output is:

10:39:45.636 [main] WARN  c.v.vscrawler.core.event.EventLoop - 程序已中止
10:39:45.641 [main] INFO  c.v.v.core.config.DirectoryWatcher - 注册事件:ENTRY_MODIFY
10:39:45.641 [main] INFO  c.v.v.core.config.DirectoryWatcher - 注册事件:ENTRY_DELETE
10:39:45.660 [main] INFO  c.v.v.core.config.DirectoryWatcher - 监控目录:D:\workspace\vscrawler\target\classes
10:39:45.661 [main] INFO  c.v.v.c.seed.BerkeleyDBSeedManager - vsCrawler配置工作目录:classpath:work
10:39:45.672 [main] INFO  c.v.v.c.seed.BerkeleyDBSeedManager - vsCrawler实际工作目录:D:\workspace\vscrawler\target\classes\work
10:39:45.709 [watch-service-thread-1] INFO  c.v.v.core.config.DirectoryWatcher - contextPath:work
10:39:45.710 [watch-service-thread-1] INFO  c.v.v.core.config.DirectoryWatcher - directoryPath:D:\workspace\vscrawler\target\classes
10:39:45.710 [watch-service-thread-1] INFO  c.v.v.core.config.DirectoryWatcher - absolutePath:D:\workspace\vscrawler\target\classes\work
10:39:45.710 [watch-service-thread-1] INFO  c.v.v.core.config.DirectoryWatcher - kind:ENTRY_MODIFY
10:39:45.710 [watch-service-thread-1] INFO  c.v.v.core.config.DirectoryWatcher - 修改:D:\workspace\vscrawler\target\classes\work
10:39:45.964 [main] INFO  c.v.v.core.seed.LocalFileSeedSource - 没有配置初始种子
10:39:45.964 [main] INFO  c.v.v.c.seed.BerkeleyDBSeedManager - import new init seeds:0
注入一个种子任务
################################################
##############     VSCrawler      ##############
##############       0.0.1        ##############
############## 你有一个有意思的灵魂 ##############
################################################
##############       virjar       ##############
################################################10:39:45.975 [VSCrawler-Dispatch] INFO  com.virjar.vscrawler.core.VSCrawler - Spider  started!

10:39:45.976 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
10:39:46.100 [vsCrawlerEventLoop] INFO  com.virjar.vscrawler.core.VSCrawler - 新的种子加入,激活爬虫派发线程
10:39:46.127 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
Exception in thread "VSCrawlerWorker-thread-1" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
	at org.apache.http.impl.client.DefaultRedirectStrategy.<init>(DefaultRedirectStrategy.java:76)
	at org.apache.http.impl.client.DefaultRedirectStrategy.<clinit>(DefaultRedirectStrategy.java:84)
	at com.virjar.vscrawler.core.net.DefaultHttpClientGenerator.gen(DefaultHttpClientGenerator.java:22)
	at com.virjar.vscrawler.core.net.session.CrawlerSession.<init>(CrawlerSession.java:69)
	at com.virjar.vscrawler.core.net.session.CrawlerSessionPool.createNewSession(CrawlerSessionPool.java:126)
	at com.virjar.vscrawler.core.net.session.CrawlerSessionPool.borrowOne(CrawlerSessionPool.java:157)
10:39:46.224 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
	at com.virjar.vscrawler.core.VSCrawler$SeedProcessTask.processSeed(VSCrawler.java:234)
	at com.virjar.vscrawler.core.VSCrawler$SeedProcessTask.run(VSCrawler.java:222)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 11 more

The NoClassDefFoundError above means the commons-logging dependency is missing:

<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging</artifactId>
    <version>1.2</version>
</dependency>

Adding it resolves the error.
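Before rerunning the crawler, one quick way to confirm the fix is to ask the class loader for the class named in the stack trace. The following is only a minimal sketch: the class name `ClasspathCheck` and the method `isClassPresent` are made up for illustration and are not part of VSCrawler.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    public static boolean isClassPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class that HttpClient failed to load in the stack trace above
        String name = "org.apache.commons.logging.LogFactory";
        System.out.println(name + " present: " + isClassPresent(name));
    }
}
```

Run from the same module as the demo, it should report `true` once the commons-logging dependency has been added to the pom.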

 

2. Next you will find that the logback configuration file is missing. I am not very familiar with logback, so I copied one from elsewhere:

logback.xml

<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
    <!-- Console output; this is the only appender attached to the root logger below -->
    <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <charset>UTF-8</charset>
            <pattern>%d{yyyy-MM-dd HH:mm:ss} %-5level [%thread] %class{5}:%line>>%msg%n</pattern>
        </encoder>
    </appender>
    <!-- Daily rolling file appender carried over from the copied config; root does not reference it.
         ${catalina.base} is a Tomcat system property and is undefined in a standalone run. -->
    <appender name="file" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <encoder>
            <charset>UTF-8</charset>
            <pattern>%d{yyyy-MM-dd HH:mm:ss} %-5level [%thread] %class{5}:%line>>%msg%n</pattern>
        </encoder>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${catalina.base}/logs/proxyipcenter/info.%d{yyyy-MM-dd}.log</fileNamePattern>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
    </appender>
    <root level="info">
        <appender-ref ref="console"/>
    </root>
</configuration>
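logback looks for `logback.xml` on the classpath, so in a Maven project the file normally belongs under `src/main/resources`, from where it is copied to `target/classes`. As a small sketch for checking that the file really ends up there, the snippet below asks the class loader for the resource. `ResourceCheck` and `isOnClasspath` are hypothetical names used only for this illustration.

```java
import java.net.URL;

public class ResourceCheck {

    /** Returns true if the named resource is visible to this class's class loader. */
    public static boolean isOnClasspath(String resourceName) {
        URL url = ResourceCheck.class.getClassLoader().getResource(resourceName);
        return url != null;
    }

    public static void main(String[] args) {
        // Defaults to logback.xml; pass other resource names as arguments to check them too
        String[] names = args.length > 0 ? args : new String[] {"logback.xml"};
        for (String name : names) {
            System.out.println(name + " on classpath: " + isOnClasspath(name));
        }
    }
}
```

The same check works for any other configuration file that a library expects to load from the classpath.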

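3. Running the demo again then reports that this file is missing:

proxyclient.properties

It turns out that VsCrawler integrates the author's dungproxy proxy by default, which requires a configuration file named proxyclient.properties: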

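# Default crawler configuration. It is merged with the entries in vsCrawler.properties, and the result is passed to each component as the effective configuration.

# Maximum idle time; default 25 minutes
sessionPool.maxIdle=25 * 60 * 1000

# Minimum idle time; default 10 s. After a session is returned, it can be used again only after at least 10 s.
sessionPool.minIdl=10 * 1000

# Maximum continuous usage time; default one hour. A session that has been in use for one hour is destroyed and its user is logged out.
sessionPool.maxDuration=60 * 60 * 1000

# Maximum concurrency per user. By default a session can be used by only one thread at a time, so each user fetches data serially on a single thread and state cannot get mixed up. This suits scenarios where submitting a query and fetching its result span several requests that share state.
sessionPool.maxOccurs=10

# Number of active sessions. If you have n accounts and this value is m: when m < n, m accounts are kept logged in; when n < m, n users are kept logged in. In vscrawler, a session with a logged-in user is managed as a resource.
sessionPool.activeUser=65535

# When the sessionPool prepares sessions asynchronously, it needs dedicated threads for login, session checks, and similar operations. These involve the network and can be very slow, so the number of threads in the sessionPool must be configured (a dynamic thread pool is being designed).
sessionPool.monitorThreadNumber=2

# Number of crawler threads; default 10
vsCrawler.threadNumber=1

# Working directory; some intermediate crawler data is stored here
vsCrawler.Working.directory=classpath:work

# Initial seed file
vsCrawler.initSeedFile=

# Expected number of seeds
seedManager.expectedSeedNumber=1000000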
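Note that `java.util.Properties` stores a value such as `25 * 60 * 1000` as a plain string; how VSCrawler itself turns it into a number is not shown in this post. The sketch below is one assumed way to read such a value: load the text with `Properties`, then multiply the `*`-separated factors. `MillisProperty` and `parseProduct` are hypothetical names, not VSCrawler APIs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class MillisProperty {

    /** Multiplies the '*'-separated integer factors of an expression such as "25 * 60 * 1000". */
    public static long parseProduct(String expression) {
        long result = 1L;
        for (String factor : expression.split("\\*")) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            result = Math.multiplyExact(result, Long.parseLong(factor.trim()));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Inline text stands in for the real file so the sketch is self-contained
        props.load(new StringReader("sessionPool.maxIdle=25 * 60 * 1000\n"));
        String raw = props.getProperty("sessionPool.maxIdle");
        System.out.println(raw + " -> " + parseProduct(raw) + " ms");
    }
}
```

A single number such as `65535` passes through the same helper unchanged, since it is a product with one factor.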

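At this point the demo runs, so testing can begin.

After writing the above, I found that following the demo at

http://git.oschina.net/virjar/vscrawler/tree/master/vscrawler-samples/src/main

works without problems. The only catch is that the configuration file is named proxyclient.properties by default; there are no other issues.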
