Getting Started with Crawlers (1): robots.txt in crawler4j

Date: 2019-11-17


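I have only just started looking into web crawlers, and as a complete beginner I did not know where to begin. After some searching online, I found that the main open-source Java crawlers are Nutch (apache/nutch on GitHub), Heritrix (internetarchive/heritrix3 on GitHub) and Crawler4j (yasserg/crawler4j on GitHub), along with WebCollector (CrawlScript/WebCollector on GitHub) and WebMagic (code4craft/webmagic on GitHub).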

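Since I am new to crawlers, I decided to start with a small project, Crawler4j. I cloned the repository with git and then found I could not import it into Eclipse (I really am a beginner, please bear with me). After some digging I learned that it is a Maven project, so it can be imported directly as a Maven project. In the end I got the bundled examples running.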

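While getting familiar with it, I came across the robots protocol. A quick search on Baidu showed, to my surprise, that it is a protocol for crawlers. robots.txt is a plain-text file placed in the root directory of a website, so I tried fetching the robots.txt of Dianping (大众点评) and found that the file really exists. It is only a code of conduct, though, because it cannot stop "robbers" from getting in. The file syntax is easy to look up online.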
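
To make the "root directory" point concrete, here is a minimal Java sketch that derives the robots.txt address from any page URL on the same site. The class name RobotsTxtLocation and the example.com URL are illustrative assumptions and not part of crawler4j.

```java
import java.net.URI;

final class RobotsTxtLocation {
    // Returns the robots.txt URL of the site that serves the given page.
    static String forPage(String pageUrl) {
        URI page = URI.create(pageUrl);
        // Resolving an absolute path keeps scheme, host and port
        // and replaces the path and query of the page URL.
        return page.resolve("/robots.txt").toString();
    }

    public static void main(String[] args) {
        System.out.println(forPage("https://www.example.com/shop/123?x=1"));
    }
}
```

Fetching the resulting URL with any HTTP client, or simply opening it in a browser, shows the rules a site publishes.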

crawler4j supports the robots.txt protocol as well. The following classes are involved:

1. RobotstxtConfig

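This class is very simple. It holds just three fields: whether the robots protocol is enabled, the user-agent name, and the cache size (the maximum number of robots.txt files that can be cached; when this number is exceeded, the least recently used entry is replaced).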

2. HostDirectives

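This class stores the contents of one robots.txt. It mainly holds disallows and allows (both are instances of RuleSet, a class written by the crawler4j author and covered below), plus an expiration time; once that time has passed, the corresponding robots.txt must be fetched again.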
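
The expiration idea can be sketched roughly as follows. This is not the crawler4j class: the name HostRulesSketch, the 24-hour lifetime and the explicit time parameters are assumptions chosen to keep the example small and testable.

```java
import java.util.ArrayList;
import java.util.List;

final class HostRulesSketch {
    // Assumed lifetime of a cached robots.txt; the value is illustrative only.
    static final long EXPIRATION_MILLIS = 24L * 60 * 60 * 1000;

    final List<String> disallows = new ArrayList<>();
    final List<String> allows = new ArrayList<>();
    private final long timeFetched;

    HostRulesSketch(long fetchedAtMillis) {
        this.timeFetched = fetchedAtMillis;
    }

    // True once the cached rules are older than the assumed lifetime.
    boolean needsRefetch(long nowMillis) {
        return nowMillis - timeFetched > EXPIRATION_MILLIS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        HostRulesSketch rules = new HostRulesSketch(now);
        System.out.println(rules.needsRefetch(now));
    }
}
```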

3. RuleSet

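This class stores the actual robots rules. It extends TreeSet, because a TreeSet keeps its elements in natural order (here, ascending string comparison) and a prefix path must cover every path that follows it (the author really thought this through); for example, a/b covers a/b/c. The comparison works on raw strings rather than path segments, however, so a/b/c1 also covers a/b/c12. That looks like a small flaw, although robots.txt rules are themselves matched as plain prefixes. Here is the source: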

public boolean add(String str) {
  SortedSet<String> sub = headSet(str);
  if (!sub.isEmpty() && str.startsWith(sub.last())) {
    // no need to add; prefix is already present
    return false;
  }
  boolean retVal = super.add(str);
  sub = tailSet(str + "\0");
  while (!sub.isEmpty() && sub.first().startsWith(str)) {
    // remove redundant entries
    sub.remove(sub.first());
  }
  return retVal;
}
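
The behaviour is easier to see in a self-contained sketch. PrefixRuleSet below reuses the add logic shown above; its containsPrefixOf method is my own approximation of the lookup, and the sample paths are made up for illustration.

```java
import java.util.SortedSet;
import java.util.TreeSet;

final class PrefixRuleSet extends TreeSet<String> {
    private static final long serialVersionUID = 1L;

    @Override
    public boolean add(String str) {
        SortedSet<String> smaller = headSet(str);
        if (!smaller.isEmpty() && str.startsWith(smaller.last())) {
            return false; // an existing entry is already a prefix of str
        }
        boolean added = super.add(str);
        // Entries sorted after str that start with str are now redundant.
        SortedSet<String> larger = tailSet(str + "\0");
        while (!larger.isEmpty() && larger.first().startsWith(str)) {
            larger.remove(larger.first());
        }
        return added;
    }

    // Assumed lookup: true if some stored rule is a prefix of (or equal to) path.
    boolean containsPrefixOf(String path) {
        SortedSet<String> smaller = headSet(path);
        if (!smaller.isEmpty() && path.startsWith(smaller.last())) {
            return true;
        }
        return contains(path);
    }

    public static void main(String[] args) {
        PrefixRuleSet rules = new PrefixRuleSet();
        rules.add("/a/b/c");
        rules.add("/a/b"); // replaces the longer "/a/b/c"
        System.out.println(rules);

        PrefixRuleSet other = new PrefixRuleSet();
        other.add("/a/b/c1");
        // Raw string prefixing: "/a/b/c1" also covers "/a/b/c12".
        System.out.println(other.containsPrefixOf("/a/b/c12"));
    }
}
```

Because redundant entries are removed on insertion, a lookup only needs to inspect the single entry that sorts immediately before the path.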

4. RobotstxtParser

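As the name suggests, this class parses a robots.txt into a HostDirectives. It has a single static method, parse, which processes the file line by line. For each user-agent named in the file, the disallow or allow lines that follow are added to the rules only if that user-agent covers our own. If you are interested in the details of the parsing, read the source.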
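
A much simplified version of such a line-by-line parser might look like this. It is a sketch under assumptions, not crawler4j's parse method: the class RobotsParserSketch, the nested Rules type and the matching rule ("*" or a name contained in our own user-agent, case-insensitive) are all illustrative choices, and every User-agent line simply resets the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class RobotsParserSketch {
    static final class Rules {
        final List<String> disallows = new ArrayList<>();
        final List<String> allows = new ArrayList<>();
    }

    static Rules parse(String content, String myUserAgent) {
        Rules rules = new Rules();
        String agent = myUserAgent.toLowerCase(Locale.ROOT);
        boolean inMatchingGroup = false;
        for (String rawLine : content.split("\n")) {
            String line = rawLine;
            int commentIndex = line.indexOf('#');
            if (commentIndex > -1) {
                line = line.substring(0, commentIndex); // drop trailing comment
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or not a "field: value" line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                String ua = value.toLowerCase(Locale.ROOT);
                inMatchingGroup = ua.equals("*") || (!ua.isEmpty() && agent.contains(ua));
            } else if (inMatchingGroup && !value.isEmpty()) {
                if (field.equals("disallow")) {
                    rules.disallows.add(value);
                } else if (field.equals("allow")) {
                    rules.allows.add(value);
                }
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        String txt = "User-agent: *\nDisallow: /admin\nAllow: /admin/public\n";
        Rules rules = parse(txt, "crawler4j");
        System.out.println(rules.disallows + " " + rules.allows);
    }
}
```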

One part of this code is not clear to me, though:

int commentIndex = line.indexOf('#');
if (commentIndex > -1) {
  line = line.substring(0, commentIndex);
}

// remove any html markup
line = line.replaceAll("<[^>]+>", "");

I hope someone can explain it to me.
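
One possible reading, offered as a guess and not as the crawler4j author's stated intent: the first step drops everything after #, which starts a comment in robots.txt, and the second strips anything that looks like an HTML tag, presumably for servers that return markup where plain text is expected. The sketch below applies both steps to sample lines; the class name LineCleaningDemo, the sample inputs and the final trim() are my own additions.

```java
final class LineCleaningDemo {
    static String clean(String line) {
        int commentIndex = line.indexOf('#');
        if (commentIndex > -1) {
            line = line.substring(0, commentIndex); // keep only the part before '#'
        }
        // Remove anything of the form <...>, such as <b> or <br>.
        line = line.replaceAll("<[^>]+>", "");
        return line.trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("Disallow: /tmp # temporary files"));
        System.out.println(clean("<b>Disallow:</b> /private<br>"));
    }
}
```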

5. RobotstxtServer

This is the main class of the robots support. It exposes one public method:

public boolean allows(WebURL webURL) {
  if (config.isEnabled()) {
    try {
      URL url = new URL(webURL.getURL());
      String host = getHost(url);
      String path = url.getPath();

      HostDirectives directives = host2directivesCache.get(host);

      if ((directives != null) && directives.needsRefetch()) {
        synchronized (host2directivesCache) {
          host2directivesCache.remove(host);
          // Double-checked locking would be more appropriate here; otherwise remove may throw an exception
          directives = null;
        }
      }

      if (directives == null) {
        directives = fetchDirectives(url);
      }

      return directives.allows(path);
    } catch (MalformedURLException e) {
      logger.error("Bad URL in Robots.txt: " + webURL.getURL(), e);
    }
  }

  return true;
}

The code is clear: it first obtains the HostDirectives (fetching and parsing robots.txt if necessary) and then checks whether the path is allowed.

public boolean allows(String path) {
  timeLastAccessed = System.currentTimeMillis();
  return !disallows.containsPrefixOf(path) || allows.containsPrefixOf(path);
}

A path is permitted as long as allows contains a prefix of it or disallows does not.
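
The same boolean rule can be tried out with plain lists. This is a sketch only: AllowCheckSketch, its hasPrefixOf helper and the sample rules are hypothetical stand-ins for HostDirectives and RuleSet.

```java
import java.util.Arrays;
import java.util.List;

final class AllowCheckSketch {
    // True if any rule is a prefix of the path.
    static boolean hasPrefixOf(List<String> rules, String path) {
        for (String rule : rules) {
            if (path.startsWith(rule)) {
                return true;
            }
        }
        return false;
    }

    // Same expression as above: not disallowed, or explicitly allowed.
    static boolean allows(List<String> disallows, List<String> allows, String path) {
        return !hasPrefixOf(disallows, path) || hasPrefixOf(allows, path);
    }

    public static void main(String[] args) {
        List<String> disallows = Arrays.asList("/admin");
        List<String> allows = Arrays.asList("/admin/public");
        System.out.println(allows(disallows, allows, "/index.html"));
        System.out.println(allows(disallows, allows, "/admin/secret"));
        System.out.println(allows(disallows, allows, "/admin/public/a"));
    }
}
```

One consequence of this expression is that any matching allow rule wins over any matching disallow rule, however short it is. A file containing "Allow: /" therefore permits every path, which differs from the longest-match precedence some other robots.txt implementations apply.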

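Finally, this class keeps a map of the HostDirectives parsed from each host's robots.txt. Because multiple threads are involved, a lock is required when a HostDirectives is added to this map; otherwise remove could throw an exception.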

synchronized (host2directivesCache) {
  if (host2directivesCache.size() == config.getCacheSize()) {
    String minHost = null;
    long minAccessTime = Long.MAX_VALUE;
    for (Map.Entry<String, HostDirectives> entry : host2directivesCache.entrySet()) {
      if (entry.getValue().getLastAccessTime() < minAccessTime) {
        minAccessTime = entry.getValue().getLastAccessTime();
        minHost = entry.getKey();
      }
    }
    host2directivesCache.remove(minHost);
  }
  host2directivesCache.put(host, directives);
}
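
The code above scans the whole map to find the least recently accessed host. As an alternative technique, and not what crawler4j does, the JDK's LinkedHashMap in access order can evict the eldest entry automatically. The sketch below assumes a hypothetical class LruCacheSketch and wraps it with Collections.synchronizedMap for use across threads.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, not insertion order
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each put; returning true drops the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    // Every call on the returned map is synchronized on the wrapper.
    static <K, V> Map<K, V> synchronizedCache(int maxEntries) {
        return Collections.synchronizedMap(new LruCacheSketch<K, V>(maxEntries));
    }

    public static void main(String[] args) {
        Map<String, String> cache = synchronizedCache(2);
        cache.put("a.example", "rules A");
        cache.put("b.example", "rules B");
        cache.get("a.example");            // marks a.example as recently used
        cache.put("c.example", "rules C"); // evicts b.example
        System.out.println(cache.keySet());
    }
}
```

The trade-off is that every get now reorders the map and so needs the lock too, whereas the original reads the map without synchronization and only locks on writes.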