Writing a Crawler in Golang (6): Using colly

Date: 2019-11-07


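Colly is probably the best-known web crawling framework in the Golang world. Its API is clean, it is highly configurable and extensible, it supports distributed crawling, and it supports several storage backends (in-memory, Redis, MongoDB, and so on). This post records my impressions and understanding from learning to use it.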

First, install it:

❯ go get -u github.com/gocolly/colly/...

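This go get differs a little from the usual way of installing a package: the path ends with an ellipsis (...). It is a wildcard that tells go get to also fetch the package's subpackages, together with their dependencies.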

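Starting with the simplest example

Colly's documentation is fairly detailed and complete, and the _examples directory in the project contains plenty of crawler examples, so it is very easy to get started. Let's look at an example of mine first: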

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.UserAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
	)

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.OnError(func(_ *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})

	c.OnHTML(".paginator a", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnScraped(func(r *colly.Response) {
		fmt.Println("Finished", r.Request.URL)
	})

	c.Visit("https://movie.douban.com/top250?start=0&filter=")
}

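This program finds all the page links of Douban Movie Top250. As the first argument of OnHTML (the selector) describes, it looks for a tags under the element whose class is paginator and reads their href attribute.

Run it: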

❯ go run colly/doubanCrawler1.go
Visiting https://movie.douban.com/top250?start=0&filter=
Visited https://movie.douban.com/top250?start=0&filter=
Visiting https://movie.douban.com/top250?start=25&filter=
Visited https://movie.douban.com/top250?start=25&filter=
...
Finished https://movie.douban.com/top250?start=25&filter=
Finished https://movie.douban.com/top250?start=0&filter=

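The main entity in Colly is the Collector object (created with colly.NewCollector). The Collector manages network communication and runs the callbacks for each response. It accepts a number of options at initialization; in this example I set the UserAgent value. For the other options, see the official website.

A Collector accepts several kinds of callbacks, each with a different purpose. In the order they are called: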

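OnRequest. Called before the request is sent.
OnError. Called if an error occurs during the request.
OnResponse. Called after a response is received.
OnHTML. Called if the response content is HTML.
OnXML. Called if the response content is HTML or XML; its callbacks select elements with XPath. I rarely need it when writing crawlers, so I did not use it above.
OnScraped. Called after the OnXML/OnHTML callbacks have finished. The official site says "Called after OnXML callbacks", but it also applies to OnHTML, which is worth keeping in mind.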
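The ordering in the list above can be illustrated with a toy model. This is only a sketch that uses the standard library: fakeResponse, lifecycle and lifecycleString are hypothetical names invented for the illustration, not part of Colly, and the model assumes every callback type has been registered. Real Colly also runs an OnHTML callback once per element that matches its selector, which this sketch ignores.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// fakeResponse is a hypothetical, stripped-down response: just enough to
// decide which callbacks would fire.
type fakeResponse struct {
	contentType string
}

// lifecycle returns the callback names in the order this toy model fires
// them for one request. fetch stands in for the real HTTP round trip.
func lifecycle(fetch func() (fakeResponse, error)) []string {
	trace := []string{"OnRequest"} // fired before the request is sent
	resp, err := fetch()
	if err != nil {
		return append(trace, "OnError") // request failed, nothing to parse
	}
	trace = append(trace, "OnResponse")
	isHTML := strings.Contains(resp.contentType, "html")
	isXML := strings.Contains(resp.contentType, "xml")
	if isHTML {
		trace = append(trace, "OnHTML")
	}
	// Assumption: XPath callbacks run for both HTML and XML bodies.
	if isHTML || isXML {
		trace = append(trace, "OnXML")
	}
	return append(trace, "OnScraped")
}

// lifecycleString joins the trace so it is easy to print and compare.
func lifecycleString(contentType string, fail bool) string {
	return strings.Join(lifecycle(func() (fakeResponse, error) {
		if fail {
			return fakeResponse{}, errors.New("connection refused")
		}
		return fakeResponse{contentType: contentType}, nil
	}), " -> ")
}

func main() {
	fmt.Println(lifecycleString("text/html; charset=utf-8", false))
	fmt.Println(lifecycleString("application/json", false))
	fmt.Println(lifecycleString("", true))
}
```

A failed fetch stops after OnError in this model, and a response that is neither HTML nor XML goes straight from OnResponse to OnScraped.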

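Scraping item IDs and titles

The requirement is the same as in the earlier posts. First, look at part of the HTML for each item on the Douban Top250 page: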

<ol class="grid_view">
  <li>
    <div class="item">
      <div class="info">
        <div class="hd">
          <a href="https://movie.douban.com/subject/1292052/" class="">
            <span class="title">肖申克的救赎</span>
            <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
            <span class="other">&nbsp;/&nbsp;月黑高飞(港)  /  刺激 1995(台)</span>
          </a>
          <span class="playable">[可播放]</span>
        </div>
      </div>
    </div>
  </li>
  ....
</ol>

Here is how the program is written:

package main

import (
	"log"
	"strings"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.Async(true),
		colly.UserAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
	)

	c.Limit(&colly.LimitRule{DomainGlob: "*.douban.*", Parallelism: 5})

	c.OnRequest(func(r *colly.Request) {
		log.Println("Visiting", r.URL)
	})

	c.OnError(func(_ *colly.Response, err error) {
		log.Println("Something went wrong:", err)
	})

	c.OnHTML(".hd", func(e *colly.HTMLElement) {
		log.Println(strings.Split(e.ChildAttr("a", "href"), "/")[4],
			strings.TrimSpace(e.DOM.Find("span.title").Eq(0).Text()))
	})

	c.OnHTML(".paginator a", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.Visit("https://movie.douban.com/top250?start=0&filter=")
	c.Wait()
}

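If you ran the previous example, you will have noticed that it crawls synchronously and is rather slow. This time I added colly.Async(true) to colly.NewCollector, which makes crawling asynchronous. Colly also makes it very easy to control concurrency. The line c.Limit(&colly.LimitRule{DomainGlob: "*.douban.*", Parallelism: 5}) applies its rule to hosts that match the glob *.douban.* (any subdomain and any domain suffix). Note that a LimitRule only decides which requests the limits apply to; it does not by itself stop the collector from visiting other domains. To restrict which URLs get crawled, Colly has options such as colly.AllowedDomains and colly.URLFilters (the latter takes regular expressions); see the official documentation for details.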
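To get a feel for which hosts a pattern like *.douban.* covers, here is a small sketch. It assumes that the standard library's path.Match behaves like Colly's glob matcher for simple hostname patterns such as this one; matchesGlob is a hypothetical helper written for the illustration.

```go
package main

import (
	"fmt"
	"path"
)

// matchesGlob reports whether host matches pattern. path.Match is used as a
// stand-in (an assumption) for the glob matcher that Colly uses internally.
func matchesGlob(pattern, host string) bool {
	ok, err := path.Match(pattern, host)
	return err == nil && ok
}

func main() {
	hosts := []string{"movie.douban.com", "www.douban.com", "douban.com", "example.com"}
	for _, h := range hosts {
		fmt.Printf("%-18s %v\n", h, matchesGlob("*.douban.*", h))
	}
}
```

Under that assumption the bare host douban.com does not match, because the pattern requires a dot in front of douban. That is worth checking if a rule seems to have no effect.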

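The Limit call also caps parallelism at 5. Why control concurrency? The bottleneck in crawling usually comes from the target site's rate limits: if you reach a certain request rate within a period of time, you are likely to get blocked, so the request rate has to be kept under control. You should also throttle your crawler to avoid putting extra load and resource consumption on the target site.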
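What a parallelism cap of 5 means can be sketched with plain goroutines. This is not Colly's implementation, only an illustration of the idea with a buffered channel used as a semaphore. crawlAll and slowFetch are hypothetical names, and slowFetch merely sleeps instead of making an HTTP request.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// slowFetch stands in for a real HTTP request.
func slowFetch(string) { time.Sleep(20 * time.Millisecond) }

// crawlAll runs fetch for every URL with at most `parallelism` calls in
// flight and returns the highest number of simultaneous calls it observed.
func crawlAll(urls []string, parallelism int, fetch func(string)) int {
	sem := make(chan struct{}, parallelism) // one slot per allowed fetch
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight, peak := 0, 0

	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while `parallelism` fetches are running
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this fetch ends

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			fetch(u)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(u)
	}
	wg.Wait() // wait for every goroutine, the role Wait plays for an async collector
	return peak
}

func main() {
	urls := make([]string, 0, 10)
	for i := 0; i < 10; i++ {
		urls = append(urls, fmt.Sprintf("https://movie.douban.com/top250?start=%d&filter=", i*25))
	}
	fmt.Println("peak concurrent fetches:", crawlAll(urls, 5, slowFetch))
}
```

However many URLs are queued, the observed peak never exceeds the cap. Lowering the cap trades speed for a gentler request rate on the target site.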

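This example has no OnResponse callback, mainly because there would be no real logic in it. It does add a call to Wait, because when Async is true the program has to wait for all goroutines to finish before exiting. There are two OnHTML callbacks: one determines which pages get visited, and the other holds the logic that extracts the item information, namely this part: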

c.OnHTML(".hd", func(e *colly.HTMLElement) {
    log.Println(strings.Split(e.ChildAttr("a", "href"), "/")[4],
        strings.TrimSpace(e.DOM.Find("span.title").Eq(0).Text()))
})

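Colly uses goquery as its HTML parsing library, so you just follow goquery's syntax. The ChildAttr method returns the value of an attribute of a matching child element; ChildText, which is not shown here, returns a child element's text content. In this example, however, there are two span tags with class title, and ChildText would return the text of both of them concatenated. The Colly version used here provides no ChildTexts method (although it has ChildAttrs), so I read the source of ChildText and adapted it into strings.TrimSpace(e.DOM.Find("span.title").Eq(0).Text()), which gets the text of the first match only.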
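The callback takes the item ID by indexing the fifth "/"-separated segment of the href, which panics if a link ever has fewer segments. A slightly more defensive version could look like the sketch below. subjectID is a hypothetical helper, and the assumption that item links always have the form https://movie.douban.com/subject/<id>/ comes from the HTML sample shown earlier.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectID extracts the item ID from a link such as
// https://movie.douban.com/subject/1292052/. It returns false instead of
// panicking when the link does not have the expected shape.
func subjectID(href string) (string, bool) {
	// Splitting gives: "https:", "", host, "subject", id, ""
	parts := strings.Split(href, "/")
	if len(parts) < 5 || parts[3] != "subject" || parts[4] == "" {
		return "", false
	}
	return parts[4], true
}

func main() {
	for _, href := range []string{
		"https://movie.douban.com/subject/1292052/",
		"https://movie.douban.com/",
	} {
		id, ok := subjectID(href)
		fmt.Printf("%q -> id=%q ok=%v\n", href, id, ok)
	}
}
```

Inside the OnHTML callback this would be called as subjectID(e.ChildAttr("a", "href")), skipping the element when the second return value is false.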

Using XPath in Colly

If you don't like the goquery style, you can of course switch to another HTML parsing approach. Here is my example:

import "github.com/antchfx/htmlquery"

c.OnResponse(func(r *colly.Response) {
    doc, err := htmlquery.Parse(strings.NewReader(string(r.Body)))
    if err != nil {
        log.Fatal(err)
    }
    nodes := htmlquery.Find(doc, `//ol[@class="grid_view"]/li//div[@class="hd"]`)
    for _, node := range nodes {
        url := htmlquery.FindOne(node, "./a/@href")
        title := htmlquery.FindOne(node, `.//span[@class="title"]/text()`)
        log.Println(strings.Split(htmlquery.InnerText(url), "/")[4],
            htmlquery.InnerText(title))
    }
})

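This time I get the item ID and title inside the OnResponse callback. htmlquery.Parse needs an object that implements the io.Reader interface, so I used strings.NewReader(string(r.Body)). The rest of the code was written in the earlier post Writing a Crawler in Golang (5): Using XPath and can be copied over directly.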

Afterword

After trying Colly I came to like it. How about you?

Code

Original article: strconv.com/posts/use-c…

The complete code can be found at this address.