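Using Selector in Scrapy

Scrapy's Selector can also be used on its own to parse HTML. This post summarizes its css and xpath usage; personally, I prefer css.

The example is a page from Professor Song Tian's (嵩天) MOOC tutorial: python123.io/ws/demo.html

Parsing is a way of extracting information. The main things to extract are tag nodes, attributes, and text, and the three sections below cover each in turn.

1. Extracting tag nodes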

response = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

The text above is the page's HTML. Suppose I want to extract the p tags.

Using the CSS selector

from scrapy import Selector

selector = Selector(text=response)
p = selector.css('p').extract()
print(p)
#['<p class="title"><b>The demo python introduces several python courses.</b></p>', '<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>']

This returns every p node, as a list. If you only want the first one, use extract_first() instead of extract().

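For simple lookups this is enough. But when there are many nodes with the same tag and the one I want is not the first, this approach no longer works. The fix is to add a constraint, such as a class or an id.

For example, to extract the p node with class=course, use p[class='course']. If the node has other attributes, those can be used as constraints in the same way.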

selector = Selector(text=response)
course_p = selector.css('p[class="course"]').extract_first()
print(course_p)

#<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>

Using XPath

The overall idea with XPath is the same; only the syntax differs.

The first example, written with XPath:

selector = Selector(text=response)
first_p = selector.xpath('//p').extract_first()
print(first_p)

The second example, written with XPath:

selector = Selector(text=response)
course_p = selector.xpath('//p[@class="course"]').extract_first()
print(course_p)

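Attentive readers may notice that, for selecting tag nodes, XPath differs from CSS mainly by the extra // and @. A leading // selects matching nodes at any depth in the whole document (not only below the current node), and @ is followed by an attribute name.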

2. Extracting attributes

Sometimes we need attribute values, such as src or href.

response = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

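This is the same example; I copied it here so it is easier to follow.

Say I want the href of the first a tag.

Using CSS

Append ::attr(href) directly after the tag. attr means an attribute is being extracted, and the href in parentheses names which attribute. (::attr() is an extension provided by Scrapy's selectors, not part of standard CSS.)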

selector = Selector(text=response)
href = selector.css('a::attr(href)').extract_first()
print(href)
# http://www.icourse163.org/course/BIT-268001

To extract the href of a specific a tag, for example the second one, add a constraint in the same way.

selector = Selector(text=response)
href = selector.css('a[class="py2"]::attr(href)').extract_first()
print(href)
#http://www.icourse163.org/course/BIT-1001870001

Using XPath

The first example above:

selector = Selector(text=response)
href = selector.xpath('//a/@href').extract_first()
print(href)

The second example above:

selector = Selector(text=response)
href = selector.xpath('//a[@class="py2"]/@href').extract_first()
print(href)

3. Extracting text

response = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

Extracting the text of the first a tag

Using the CSS selector

Just append ::text after the tag; for how to select the tag itself, see the sections above.

selector = Selector(text=response)
text = selector.css('a::text').extract_first()
print(text)
#Basic Python

To select the text of a specific tag, for example the second a tag, again just add a constraint.

selector = Selector(text=response)
text = selector.css('a[class="py2"]::text').extract_first()
print(text)
#Advanced Python

Using XPath

For the first example, //a selects the a nodes and /text() then selects their text.

selector = Selector(text=response)
text = selector.xpath('//a/text()').extract_first()
print(text)

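For the second example, remember to put @ in front of the attribute when adding an XPath constraint, and to write text with trailing parentheses: text().

selector = Selector(text=response)
text = selector.xpath('//a[@class="py2"]/text()').extract_first()
print(text)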

 

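To sum up: for extraction, XPath needs the extra / and @ symbols, and even when adding a constraint it needs @ in front of the attribute name. That is why I prefer CSS: I'm lazy.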