I'm still a newbie myself and have only just started looking into crawlers. I want to learn about them by reading the source code of other people's crawler frameworks; if I get anything wrong, please bear with me and point it out.
Following the earlier walkthrough of crawler4j's robotstxt package, today let's look at the crawler package and the exception package.
The crawler package mainly contains the following classes:
1. Configurable: the abstract configuration class. It holds a reference to a CrawlConfig and nothing else.
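For reference, the class looks roughly like this (a paraphrase from memory rather than a verbatim copy of the source):

public abstract class Configurable {
    protected CrawlConfig config;

    protected Configurable(CrawlConfig config) {
        this.config = config;
    }

    public CrawlConfig getConfig() {
        return config;
    }
}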
2. CrawlConfig: the concrete configuration class for a crawl. It has many parameters; here I will only cover the main configurable ones.
resumableCrawling: controls whether a crawl that has been stopped can be resumed. (Enabling it lowers crawling efficiency.)
maxDepthOfCrawling: the maximum crawl depth. If the first page is at depth 0, pages reached from it are at depth 1, and so on; URLs found on pages at the maximum depth are not added to the URL queue.
maxPagesToFetch: the maximum number of pages to fetch.
politenessDelay: the delay between two consecutive requests.
includeBinaryContentInCrawling and processBinaryContentInCrawling: whether to include and process binary content such as images.
userAgentString: the crawler's user-agent name.
proxyHost and proxyPort: the proxy server's address and port. (You can read up on proxies yourself; in short, the crawler sends its HTTP request to the proxy first. If the proxy already has an up-to-date result it returns it directly; otherwise it forwards the request to the web server and returns the response.)
I won't go through the remaining parameters one by one. (There are some HTTP-connection and timeout parameters, plus a few I haven't figured out yet, such as onlineTldListUpdate.) A minimal configuration sketch using these setters follows.
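As a rough illustration of the parameters above, here is a minimal configuration sketch; the storage folder, the concrete values and the user-agent string are placeholders of mine, and the setter names should be double-checked against your crawler4j version:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class ConfigSketch {
    static CrawlConfig buildConfig() {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j");   // placeholder folder for the crawler's intermediate data
        config.setMaxDepthOfCrawling(2);                  // URLs found below depth 2 are not queued
        config.setMaxPagesToFetch(1000);                  // stop after 1000 pages
        config.setPolitenessDelay(200);                   // wait 200 ms between two requests
        config.setResumableCrawling(false);               // enabling this lowers crawling efficiency
        config.setIncludeBinaryContentInCrawling(false);  // skip images and other binary content
        config.setUserAgentString("crawler4j-demo");      // the crawler's user-agent name
        // config.setProxyHost("proxy.example.com");      // optional proxy settings
        // config.setProxyPort(8080);
        return config;
    }
}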
3. WebCrawler: the crawler class. It implements Runnable, so let's first look at the run method:
public void run() {
    onStart();
    while (true) {
        List<WebURL> assignedURLs = new ArrayList<>(50);
        isWaitingForNewURLs = true;
        frontier.getNextURLs(50, assignedURLs);
        isWaitingForNewURLs = false;
        if (assignedURLs.isEmpty()) {
            if (frontier.isFinished()) {
                return;
            }
            try {
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                logger.error("Error occurred", e);
            }
        } else {
            for (WebURL curURL : assignedURLs) {
                if (myController.isShuttingDown()) {
                    logger.info("Exiting because of controller shutdown.");
                    return;
                }
                if (curURL != null) {
                    curURL = handleUrlBeforeProcess(curURL);
                    processPage(curURL);
                    frontier.setProcessed(curURL);
                }
            }
        }
    }
}
onStart() is an empty method by default, but we can override it to do our own setup before the crawl starts. Then, whenever the crawler's own list of URLs is empty, it fetches more from the global URL queue; if that is empty too and the frontier is finished, the crawler exits. If there are URLs and the controller is not shutting down, each URL is processed, and finally the global URL manager, the Frontier, is told that the URL has been processed.
Next let's look at the page-processing method; here is its main logic:
fetchResult = pageFetcher.fetchPage(curURL);  // fetch the page and get the result
Page page = new Page(curURL);                 // create a new Page for this URL
page.setFetchResponseHeaders(fetchResult.getResponseHeaders());
page.setStatusCode(statusCode);

// status code is 200
if (!curURL.getURL().equals(fetchResult.getFetchedUrl())) {
    if (docIdServer.isSeenBefore(fetchResult.getFetchedUrl())) {
        throw new RedirectException(Level.DEBUG, "Redirect page: " + curURL + " has already been seen");
    }
    curURL.setURL(fetchResult.getFetchedUrl());
    curURL.setDocid(docIdServer.getNewDocID(fetchResult.getFetchedUrl()));
}

parser.parse(page, curURL.getURL());          // parse the fetched content into the Page
ParseData parseData = page.getParseData();
List<WebURL> toSchedule = new ArrayList<>();  // declared before the loop in the full source; restored here for readability
for (WebURL webURL : parseData.getOutgoingUrls()) {
    int newdocid = docIdServer.getDocId(webURL.getURL());
    if (newdocid > 0) {
        // This is not the first time that this Url is visited. So, we set the depth to a negative number.
        webURL.setDepth((short) -1);
        webURL.setDocid(newdocid);
    } else {
        // a new URL: prepare to add it to the URL queue
        webURL.setDocid(-1);
        webURL.setDepth((short) (curURL.getDepth() + 1));
        if (shouldVisit(page, webURL)) {      // shouldVisit can be overridden to decide which pages to crawl
            webURL.setDocid(docIdServer.getNewDocID(webURL.getURL()));
            toSchedule.add(webURL);
        }
    }
}

// add the collected URLs to the global URL queue
frontier.scheduleAll(toSchedule);

// override visit to process the fetched HTML
visit(page);
Many details are skipped here, but this is roughly how an HTTP request that returns status code 200 is handled. Since shouldVisit and visit are the hooks we are expected to override, a small subclass sketch follows.
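As a sketch of such a subclass (the class name, the domain filter and the regular expression are placeholders of mine, not something prescribed by crawler4j), a crawler that stays on one site and skips obvious binary files might look like this; visit is covered under the Page section below:

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // hypothetical filter: skip URLs that look like binary files
    private static final Pattern BINARY =
        Pattern.compile(".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz|pdf)$");

    @Override
    public void onStart() {
        // runs once before the crawl loop starts; empty by default
        logger.info("Crawler starting");
    }

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // only follow links on the placeholder domain that are not binary files
        return !BINARY.matcher(href).matches()
               && href.startsWith("https://www.example.com/");
    }
}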
4. Page: represents a single page and stores the information related to it (such as its WebURL, response headers, status code and parse data); the sketch below shows how that information is read back out in visit.
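For example, inside visit the parsed data can be read back out of the Page. This is only a sketch; the getter names below are the ones I believe crawler4j exposes, so verify them against your version:

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class PageInfoCrawler extends WebCrawler {

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();   // the URL this Page was fetched from
        int status = page.getStatusCode();        // e.g. 200

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlData = (HtmlParseData) page.getParseData();
            String text = htmlData.getText();     // plain text extracted from the page
            String html = htmlData.getHtml();     // the raw HTML
            logger.info("{} ({}): {} chars of text, {} chars of html, {} outgoing links",
                        url, status, text.length(), html.length(), htmlData.getOutgoingUrls().size());
        }
    }
}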
5. CrawlController: the crawler controller. This is the overall controller that starts the crawlers and monitors their state. Its constructor takes a CrawlConfig, a PageFetcher and a RobotstxtServer. Seeds (the pages the crawl starts from) are added with addSeed(String), and several can be added. The crawl is then started with the start method, which takes the Class object of a WebCrawler subclass and the number of crawlers to start. Let's look at this start method:
for (int i = 1; i <= numberOfCrawlers; i++) { // create the crawlers
    T crawler = crawlerFactory.newInstance();
    Thread thread = new Thread(crawler, "Crawler " + i);
    crawler.setThread(thread);
    crawler.init(i, this);
    thread.start();
    crawlers.add(crawler);
    threads.add(thread);
    logger.info("Crawler {} started", i);
}

// next, start a monitor thread
Thread monitorThread = new Thread(new Runnable() {
    @Override
    public void run() {
        try {
            synchronized (waitingLock) {
                while (true) {
                    sleep(10);
                    boolean someoneIsWorking = false;
                    for (int i = 0; i < threads.size(); i++) { // check each crawler
                        Thread thread = threads.get(i);
                        if (!thread.isAlive()) {
                            if (!shuttingDown) { // the thread died unexpectedly: recreate the crawler
                                logger.info("Thread {} was dead, I'll recreate it", i);
                                T crawler = crawlerFactory.newInstance();
                                thread = new Thread(crawler, "Crawler " + (i + 1));
                                threads.remove(i);
                                threads.add(i, thread);
                                crawler.setThread(thread);
                                crawler.init(i + 1, controller);
                                thread.start();
                                crawlers.remove(i);
                                crawlers.add(i, crawler);
                            }
                        } else if (crawlers.get(i).isNotWaitingForNewURLs()) {
                            someoneIsWorking = true;
                        }
                    }
                    boolean shut_on_empty = config.isShutdownOnEmptyQueue();
                    // shut down when no crawler is working and shutdownOnEmptyQueue is enabled
                    if (!someoneIsWorking && shut_on_empty) {
                        // Make sure again that none of the threads are alive.
                        logger.info("It looks like no thread is working, waiting for 10 seconds to make sure...");
                        sleep(10);
                        someoneIsWorking = false;
                        // check each crawler thread again
                        for (int i = 0; i < threads.size(); i++) {
                            Thread thread = threads.get(i);
                            if (thread.isAlive() && crawlers.get(i).isNotWaitingForNewURLs()) {
                                someoneIsWorking = true;
                            }
                        }
                        if (!someoneIsWorking) {
                            if (!shuttingDown) {
                                // there are still pages in the queue waiting to be crawled
                                long queueLength = frontier.getQueueLength();
                                if (queueLength > 0) {
                                    continue;
                                }
                                logger.info(
                                    "No thread is working and no more URLs are in queue waiting for another 10 seconds to make " +
                                    "sure...");
                                sleep(10);
                                // checked once more; the double check guards against a false finish
                                queueLength = frontier.getQueueLength();
                                if (queueLength > 0) {
                                    continue;
                                }
                            }
                            // all crawlers have finished: shut down the services
                            logger.info("All of the crawlers are stopped. Finishing the process...");
                            frontier.finish();
                            for (T crawler : crawlers) {
                                crawler.onBeforeExit();
                                crawlersLocalData.add(crawler.getMyLocalData());
                            }
                            logger.info("Waiting for 10 seconds before final clean up...");
                            sleep(10);
                            frontier.close();
                            docIdServer.close();
                            pageFetcher.shutDown();
                            finished = true;
                            waitingLock.notifyAll();
                            env.close();
                            return;
                        }
                    }
                }
            }
        } catch (Exception e) {
            logger.error("Unexpected Error", e);
        }
    }
});
monitorThread.start();
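Putting the pieces together, starting a crawl from outside the framework looks roughly like this; the storage folder, the seed URL and the thread count are placeholders of mine, and MyCrawler is the subclass sketched earlier:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerMain {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j");   // placeholder storage folder
        config.setPolitenessDelay(200);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        // the constructor takes CrawlConfig, PageFetcher and RobotstxtServer, as described above
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://www.example.com/");   // placeholder seed page; addSeed can be called several times
        controller.start(MyCrawler.class, 4);             // start 4 crawlers; blocks until the crawl finishes
    }
}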
There are many more details I haven't gone into. I can't help admiring the author; even just reading the code is impressive. Still, I hope to learn something from it.