Notes: Scrapy Selector

Scrapy version: 1.5.0

 

1. Overview

 

Scrapy's built-in Selector is built on top of lxml.

2. Usage

Documents can be parsed with either the xpath() or the css() method; both return a list:

from scrapy.selector import Selector
sel = Selector(text=body).xpath('//div[@class="ip_list"]/text()').extract()

A selector also provides a re() method for regular-expression extraction; it is used much like the re library.

3. Classes and commonly used attributes

Selector objects

class scrapy.selector.Selector(response=None, text=None, type=None)

response is an HtmlResponse or an XmlResponse object that will be used for selecting and extracting data.

text is a unicode string or utf-8 encoded text for cases when a response isn’t available. Using text and response together is undefined behavior.

type defines the selector type, it can be "html", "xml" or None (default).

If type is None, the selector automatically chooses the best type based on response type (see below), or defaults to "html" in case it is used together with text.

If type is None and a response is passed, the selector type is inferred from the response type as follows:

"html" for HtmlResponse type
"xml" for XmlResponse type
"html" for anything else
Otherwise, if type is set, the selector type will be forced and no detection will occur.

 

re(regex)

Apply the given regex and return a list of unicode strings with the matches.

regex can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex)

 

extract()

Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.

 

remove_namespaces()

Remove all namespaces, allowing the document to be traversed using namespace-less XPath expressions.

 

SelectorList objects

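The SelectorList class is a subclass of the built-in list and can be thought of as a collection of Selector objects. Calling xpath(), css(), extract() or re() on a SelectorList amounts to calling that method on each element of the list and then combining the results into a single list. Note that the results are flattened: each element's return value is not inserted as one nested item.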
