Selectors in the Scrapy Framework

Date: 2019-11-15


Selectors

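When you scrape web pages, the most common task you need to perform is extracting data from the HTML source. Several libraries are available for this: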

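BeautifulSoup is a very popular web scraping library among Python programmers. It builds a Python object based on the structure of the HTML code and copes reasonably well with bad markup, but it has one drawback: it is slow.
lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree. (lxml is not part of the Python standard library.)

Scrapy comes with its own mechanism for extracting data. These are called selectors because they "select" certain parts of the HTML document, specified by either XPath or CSS expressions.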

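XPath is a language for selecting nodes in XML documents, and it can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

Scrapy selectors are built on top of the lxml library, which means they are very similar to it in speed and parsing accuracy.

This page explains how selectors work and describes their API, which is very small and simple, unlike the lxml API, which is much larger because the lxml library can be used for many other tasks besides selecting markup documents.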

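Constructing selectors

Scrapy selectors are instances of the Selector class, constructed by passing either text or a TextResponse object. The class automatically chooses the best parsing rules (XML vs. HTML) based on the input type: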

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']

Constructing from a response:

>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']

For convenience, response objects expose a selector on the .selector attribute. It is perfectly fine to use this shortcut whenever possible:

>>> response.selector.xpath('//span/text()').extract()
[u'good']

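Using selectors

To explain how to use selectors, we will use the Scrapy shell (which provides interactive testing) and an example page hosted on the Scrapy documentation server:

http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Here is its HTML code: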

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

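First, let's open the shell:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Then, after the shell loads, the response is available as the response shell variable, and its attached selector as the response.selector attribute.

Since we are dealing with HTML, the selector automatically uses an HTML parser.

So, by looking at the HTML code of that page, let's construct an XPath that selects the text inside the title tag: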

>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]

Querying responses with XPath and CSS is so common that responses include two convenience shortcuts, response.xpath() and response.css():

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

As you can see, the .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used to quickly select nested data:

>>> response.css('img').xpath('@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

To actually extract the textual data, you must call the selector's .extract() method, as follows:

>>> response.xpath('//title/text()').extract()
[u'Example website']

If you only want the first matched element, you can call the selector's .extract_first() method:

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '

It returns None if no element is found:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first() is None
True

A default return value can be passed as an argument, to be used instead of None:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
'not-found'

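Notice that CSS selectors can select text or attribute nodes using CSS3-style pseudo-elements (::text and ::attr(name), which are Scrapy extensions rather than standard CSS):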

>>> response.css('title::text').extract()
[u'Example website']

Now we are going to get the base URL and some image links:

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

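Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here is an example: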

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print('Link number %d points to url %s and image %s' % args)

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

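Using selectors with regular expressions

Selector also has a .re() method for extracting data using regular expressions. However, unlike the .xpath() and .css() methods, .re() returns a list of unicode strings, so you cannot construct nested .re() calls.

Here is an example that extracts the image names from the HTML code above: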

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']

There is also a helper for .re() that mirrors .extract_first(), named .re_first(). Use it to extract only the first matching string:

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
u'My image 1'

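Working with relative XPaths

Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you are calling it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all the <div> elements: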

>>> divs = response.xpath('//div')

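At first, you may be tempted to use the following approach, which is wrong, because it actually extracts all <p> elements from the document, not only those inside <div> elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())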

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.extract())

Another common case is extracting all direct <p> children:

>>> for p in divs.xpath('p'):
...     print(p.extract())

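Variables in XPath expressions

XPath allows you to reference variables in your XPath expressions using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world, where you replace some arguments in your queries with placeholders such as ?, which are then substituted with the values passed along with the query.

Here is an example that matches an element based on its "id" attribute value without hard-coding it (as was done previously):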

>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()  
u'Name: My image 1 '

Here is another example, which finds the "id" attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).extract_first()
u'images'

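All variable references must have a binding value when calling .xpath(), otherwise you will get a ValueError: XPath error: exception. Bindings are supplied by passing as many named arguments as necessary.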