With XPath, we want to get the text and attribute values of a given element, locate a specific item in a list precisely (e.g. via id or class), and filter elements by tag or by attribute value. For example:

```python
response.xpath('//*[@id="resultList"]/div[4]/span[1]/a/@href').extract_first()
```

This selects the element whose id is resultList, descends to its 4th div child, then to that div's 1st span child, then to the a tag under it, and returns the a tag's href attribute.
With CSS, we want to get a specific element or a list of elements by CSS selector, get an element's text and attribute values, and locate a specific item in a list precisely. One question to keep in mind: how do you write the selector when an element carries several CSS classes?
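To answer that question: in standard CSS, classes are chained with no space, so `<div class="quote special">` is matched by `div.quote.special` (in Scrapy, `response.css('div.quote.special')`). The sketch below is a hedged, standard-library-only emulation of that matching rule; `ClassFinder` is a hypothetical helper, not a Scrapy API.

```python
# Emulate the CSS selector "div.quote.special" (all listed classes must be
# present on the tag) using only html.parser from the standard library.
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collects the text of tags that carry ALL of the wanted classes."""

    def __init__(self, tag, classes):
        super().__init__()
        self.tag = tag
        self.classes = set(classes)
        self._inside = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        attr_classes = set(dict(attrs).get('class', '').split())
        # "div.quote.special" means: every listed class must be present
        self._inside = self.classes <= attr_classes

    def handle_data(self, data):
        if self._inside:
            self.matches.append(data.strip())
            self._inside = False


html = '<div class="quote special">first</div><div class="quote">second</div>'
finder = ClassFinder('div', ['quote', 'special'])
finder.feed(html)
print(finder.matches)  # only the element with both classes matches
```

With a real Scrapy response, the equivalent one-liner would simply be `response.css('div.quote.special::text').extract()`.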
Scrapy XPath
| Expression | Description |
|---|---|
| nodename | Selects all child nodes of the named node. |
| / | Selects from the root node. |
| // | Selects nodes anywhere in the document that match the selection, regardless of their position. |
| . | Selects the current node. |
| .. | Selects the parent of the current node. |
| @ | Selects attributes. |
| Path expression | Result |
|---|---|
| /bookstore/book[1] | Selects the first book element that is a child of bookstore. |
| /bookstore/book[last()] | Selects the last book element that is a child of bookstore. |
| /bookstore/book[last()-1] | Selects the second-to-last book element that is a child of bookstore. |
| /bookstore/book[position()<3] | Selects the first two book elements that are children of bookstore. |
| //title[@lang] | Selects all title elements that have a lang attribute. |
| //title[@lang='eng'] | Selects all title elements whose lang attribute has the value eng. |
| /bookstore/book[price>35.00] | Selects all book children of bookstore whose price child element has a value greater than 35.00. |
| /bookstore/book[price>35.00]/title | Selects the title elements of all book children of bookstore whose price child element has a value greater than 35.00. |
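A few of the path expressions in the table above can be tried without Scrapy: the standard library's `xml.etree.ElementTree` implements a limited XPath subset (positional indices, `last()`, attribute tests). This is a rough sketch under that limitation; richer predicates such as `position()<3` or `price>35.00` need lxml or Scrapy/parsel selectors.

```python
# Try a few of the bookstore path expressions with the stdlib XPath subset.
import xml.etree.ElementTree as ET

xml = """
<bookstore>
  <book lang="eng"><title>A</title><price>29.99</price></book>
  <book lang="eng"><title>B</title><price>39.95</price></book>
  <book><title>C</title><price>49.99</price></book>
</bookstore>
"""
root = ET.fromstring(xml)

first = root.find('book[1]/title').text              # first book child
last = root.find('book[last()]/title').text          # last book child
second_last = root.find('book[last()-1]/title').text # second-to-last
with_lang = [b.find('title').text for b in root.findall('book[@lang]')]

print(first, last, second_last, with_lang)  # A C B ['A', 'B']
```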
A few simple examples:

- /html/head/title: selects the <title> element inside the <head> of an HTML document. Alternatively, response.xpath('//title') also fetches the page title, but // scans the whole document and is less efficient, so it is not recommended here.
- /html/head/title/text(): selects the text content of the <title> element above.
- //td: selects all <td> elements.
- //div[@class="mine"]: selects all div elements that have the attribute class="mine".
Scrapy uses CSS and XPath selectors to locate elements. Selectors have four basic methods:

- xpath(): returns a list of selectors, each representing a node selected by the XPath expression.
- css(): returns a list of selectors, each representing a node selected by the CSS expression.
- extract(): returns the selected content as unicode strings.
- re(): returns a list of unicode strings extracted with a regular expression.
```python
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
```
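The `re()` method from the list above can be previewed with the standard `re` module: conceptually, the selector applies the pattern to each selected node's text. This is a sketch on a plain string, not the Scrapy API itself (note one difference: `selector.re()` flattens capture groups into a single list of strings, while `re.findall` returns tuples when there are multiple groups).

```python
# Preview of regex extraction on the string a title selector would match.
import re

title_text = 'Quotes to Scrape'
print(re.findall(r'Q\w+', title_text))            # ['Quotes']
print(re.findall(r'(\w+) to (\w+)', title_text))  # [('Quotes', 'Scrape')]
```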
Problem: Scrapy did not perform the expected crawl loop.

Cause: the domain in allowed_domains was wrong and did not match the URLs to be crawled.

Fix: change the domain in allowed_domains so that it matches the crawled URLs.

The corrected code is as follows:
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy_demo7.items import ScrapyDemo7Item
from scrapy.http import Request


class ZhilianSpider(scrapy.Spider):
    name = 'zhilian'
    allowed_domains = ['zhaopin.com']
    start_urls = ['http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&sm=0&p=1']

    def parse(self, response):
        tables = response.xpath('//*[@id="newlist_list_content_table"]/table')
        for table in tables:
            item = ScrapyDemo7Item()
            first = table.xpath('./tbody/tr[1]/td[1]/div/a/@href').extract_first()
            print("first", first)
            tableRecord = table.xpath("./tr[1]")
            jobInfo = tableRecord.xpath("./td[@class='zwmc']/div/a")
            item["job_name"] = jobInfo.xpath("./text()").extract_first()
            item["company_name"] = tableRecord.xpath("./td[@class='gsmc']/a[@target='_blank']/text()").extract_first()
            item["job_provide_salary"] = tableRecord.xpath("./td[@class='zwyx']/text()").extract_first()
            item["job_location"] = tableRecord.xpath("./td[@class='gzdd']/text()").extract_first()
            item["job_release_date"] = tableRecord.xpath("./td[@class='gxsj']/span/text()").extract_first()
            item["job_url"] = jobInfo.xpath("./@href").extract_first()
            yield item
        for i in range(1, 21):
            url = "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&sm=0&p=" + str(i)
            print(url)
            yield Request(url, callback=self.parse)
```
Installing Scrapy on Windows:

```shell
C:\Users\user>pip3 install scrapy
Collecting scrapy
  Using cached Scrapy-1.4.0-py2.py3-none-any.whl
Collecting parsel>=1.1 (from scrapy)
  Using cached parsel-1.2.0-py2.py3-none-any.whl
Requirement already satisfied: service-identity in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: w3lib>=1.17.0 in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: cssselect>=0.9 in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: queuelib in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: lxml in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: PyDispatcher>=2.0.5 in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: six>=1.5.2 in d:\python362\lib\site-packages (from scrapy)
Collecting Twisted>=13.1.0 (from scrapy)
  Using cached Twisted-17.9.0.tar.bz2
Requirement already satisfied: pyOpenSSL in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: attrs in d:\python362\lib\site-packages (from service-identity->scrapy)
Requirement already satisfied: pyasn1-modules in d:\python362\lib\site-packages (from service-identity->scrapy)
Requirement already satisfied: pyasn1 in d:\python362\lib\site-packages (from service-identity->scrapy)
Requirement already satisfied: zope.interface>=4.0.2 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: constantly>=15.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: incremental>=16.10.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: Automat>=0.3.0 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: hyperlink>=17.1.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: cryptography>=2.1.4 in d:\python362\lib\site-packages (from pyOpenSSL->scrapy)
Requirement already satisfied: setuptools in d:\python362\lib\site-packages (from zope.interface>=4.0.2->Twisted>=13.1.0->scrapy)
Requirement already satisfied: cffi>=1.7; platform_python_implementation != "PyPy" in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy)
Requirement already satisfied: asn1crypto>=0.21.0 in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy)
Requirement already satisfied: idna>=2.1 in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy)
Requirement already satisfied: pycparser in d:\python362\lib\site-packages (from cffi>=1.7; platform_python_implementation != "PyPy"->cryptography>=2.1.4->pyOpenSSL->scrapy)
Installing collected packages: parsel, Twisted, scrapy
  Running setup.py install for Twisted ... done
Successfully installed Twisted-17.9.0 parsel-1.2.0 scrapy-1.4.0
```

Running scrapy without arguments lists the available commands:

```shell
C:\Users\user>scrapy
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
```

On Windows, the first run of scrapy bench fails because the win32api module is missing:

```shell
C:\Users\user>scrapy bench
2017-12-13 15:41:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-12-13 15:41:49 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2017-12-13 15:41:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
2017-12-13 15:41:50 [twisted] CRITICAL: Unhandled error in Deferred:
2017-12-13 15:41:50 [twisted] CRITICAL:
Traceback (most recent call last):
  File "d:\python362\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "d:\python362\lib\site-packages\scrapy\crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "d:\python362\lib\site-packages\scrapy\crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "d:\python362\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "d:\python362\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "d:\python362\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "d:\python362\lib\site-packages\scrapy\middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "d:\python362\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "d:\python362\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "d:\python362\lib\site-packages\scrapy\downloadermiddlewares\retry.py", line 20, in <module>
    from twisted.web.client import ResponseFailed
  File "d:\python362\lib\site-packages\twisted\web\client.py", line 42, in <module>
    from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS
  File "d:\python362\lib\site-packages\twisted\internet\endpoints.py", line 41, in <module>
    from twisted.internet.stdio import StandardIO, PipeAddress
  File "d:\python362\lib\site-packages\twisted\internet\stdio.py", line 30, in <module>
    from twisted.internet import _win32stdio
  File "d:\python362\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
    import win32api
ModuleNotFoundError: No module named 'win32api'
```
Installing pypiwin32 fixes the error, and scrapy bench then runs successfully:

```shell
C:\Users\user>pip3 install pypiwin32
Collecting pypiwin32
  Downloading pypiwin32-220-cp36-none-win32.whl (8.3MB)
    100% |████████████████████████████████| 8.3MB 34kB/s
Installing collected packages: pypiwin32
Successfully installed pypiwin32-220

C:\Users\user>scrapy bench
2017-12-13 15:49:05 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-12-13 15:49:05 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-12-13 15:49:06 [scrapy.core.engine] INFO: Spider opened
2017-12-13 15:49:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:07 [scrapy.extensions.logstats] INFO: Crawled 85 pages (at 5100 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:08 [scrapy.extensions.logstats] INFO: Crawled 157 pages (at 4320 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:09 [scrapy.extensions.logstats] INFO: Crawled 229 pages (at 4320 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:10 [scrapy.extensions.logstats] INFO: Crawled 293 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:11 [scrapy.extensions.logstats] INFO: Crawled 357 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:12 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:13 [scrapy.extensions.logstats] INFO: Crawled 469 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:14 [scrapy.extensions.logstats] INFO: Crawled 517 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:15 [scrapy.extensions.logstats] INFO: Crawled 573 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:16 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2017-12-13 15:49:16 [scrapy.extensions.logstats] INFO: Crawled 621 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 284168,
 'downloader/request_count': 629,
 'downloader/request_method_count/GET': 629,
 'downloader/response_bytes': 1976557,
 'downloader/response_count': 629,
 'downloader/response_status_count/200': 629,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2017, 12, 13, 7, 49, 17, 78107),
 'log_count/INFO': 17,
 'request_depth_max': 21,
 'response_received_count': 629,
 'scheduler/dequeued': 629,
 'scheduler/dequeued/memory': 629,
 'scheduler/enqueued': 12581,
 'scheduler/enqueued/memory': 12581,
 'start_time': datetime.datetime(2017, 12, 13, 7, 49, 6, 563037)}
2017-12-13 15:49:17 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

C:\Users\user>
```
In this tutorial, we assume you already have Scrapy installed. If not, see the installation guide.

We are going to scrape quotes.toscrape.com, a website that lists quotes by famous authors.

This tutorial will walk you through the following tasks:

Scrapy is written in Python. If you have not used Python before, you may want to get a feel for the language first, to get the most out of Scrapy.

If you are already familiar with other languages and want to learn Python quickly, we recommend reading Dive Into Python 3. Alternatively, you can follow the Python Tutorial.

If you are new to programming and want to start with Python, the online book "Learn Python The Hard Way" is very useful. You can also look at the list of Python resources for non-programmers.
Before you start scraping, you have to create a new Scrapy project. Enter the directory where you want to store your code and run:

```shell
scrapy startproject tutorial
```

This will create a tutorial directory with the following contents:

```
tutorial/
    scrapy.cfg            # project configuration file
    tutorial/             # the project's Python module; your code goes here
        __init__.py
        items.py          # project item definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you will later put your spiders
            __init__.py
```
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, how to follow links to further pages, and how to parse the downloaded pages to extract data.

This is the code for our first spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory of your project:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
```
As you can see, our spider subclasses scrapy.Spider and defines some attributes and methods:

The parse() method usually parses the response, extracting the scraped data as dicts, and also finds new URLs to follow and creates new requests (Request) from them.

To put our spider to work, go to the project's top-level directory and run:

```shell
scrapy crawl quotes
```

This command runs the spider named quotes that we just added, which will send some requests to quotes.toscrape.com. You will get output similar to this:
```
... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...
```
Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content of the respective URLs, as our parse method instructs.

Note

If you are wondering why we have not parsed the HTML yet, hold on: we will cover that soon.

Scrapy makes requests for the scrapy.Request objects returned by the spider's start_requests method. Upon receiving a response for each one, it instantiates a Response object and calls the callback method defined for the request (in this case, the parse method), passing the response as the argument.

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
```
Scrapy will call the parse() method to handle each of the requests for those URLs, even though we have not explicitly told Scrapy to do so. This happens because parse() is Scrapy's default callback, which is called for requests without an explicitly assigned callback.

The best way to learn how to extract data with Scrapy is to try out the selectors in the Scrapy shell. Run:

```shell
scrapy shell 'http://quotes.toscrape.com/page/1/'
```

Note

Remember to always enclose URLs in quotes when running Scrapy shell from the command line; otherwise URLs containing arguments (e.g. an & character) will not work.

On Windows, use double quotes instead:

```shell
scrapy shell "http://quotes.toscrape.com/page/1/"
```

You will see something like:
```
[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>
```
Using the shell, you can try selecting elements with CSS selectors:

```python
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
```

The result of running response.css('title') is a list-like object called SelectorList, a list of Selector objects that wrap XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:

```python
>>> response.css('title::text').extract()
['Quotes to Scrape']
```

There are two things to note here. One is that we added ::text to the CSS query, meaning we only want the text inside the <title> element. If we do not specify ::text, we get the full title element, tags included:

```python
>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']
```

The other thing is that calling .extract() returns a list, because we are dealing with a SelectorList. When you know you just want the first result, you can do:

```python
>>> response.css('title::text').extract_first()
'Quotes to Scrape'
```

Alternatively, you could write:

```python
>>> response.css('title::text')[0].extract()
'Quotes to Scrape'
```

However, .extract_first() returns None if no element matches the selection, avoiding an IndexError.

There is a lesson here: for most scraping code, you want it to be resilient to errors, so that when some elements cannot be found on a page, at least the other data can still be scraped.
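That fault-tolerant pattern can be sketched without Scrapy at all: take the first match if there is one, otherwise fall back to a default instead of raising IndexError. `first_or_default` below is a hypothetical helper illustrating the behavior, not a Scrapy API.

```python
# Minimal sketch of the behavior extract_first() provides.
def first_or_default(matches, default=None):
    """Return the first element of matches, or default when matches is empty."""
    return matches[0] if matches else default


assert first_or_default(['Quotes to Scrape']) == 'Quotes to Scrape'
assert first_or_default([]) is None                 # no IndexError raised
assert first_or_default([], default='not-found') == 'not-found'
print('all cases handled')
```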
Besides the extract() and extract_first() methods, you can also use the re() method to extract with regular expressions:

```python
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
```
In order to find the proper CSS selector statements to use, you can open the page in your browser and view its source code. You can also use your browser's developer tools or extensions such as Firebug (see the sections about Using Firebug for scraping and Using Firefox for scraping).

Selector Gadget is also a nice tool to quickly find the CSS selector for an element; it works in many browsers.

Besides CSS, Scrapy selectors also support XPath expressions:

```python
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'
```

XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, if you look at the related source code, you will see that CSS selectors are converted to XPath under the hood.

While perhaps not as popular as CSS selectors, XPath expressions offer more power: besides navigating the structure, they can also look at the content. Using XPath, you can select things like the link that contains the text "Next Page". This makes XPath very well suited for scraping tasks, and we encourage you to learn XPath even if you already know how to write CSS selectors; it will make scraping much easier.

We will not cover much of XPath here, but you can read using XPath with Scrapy Selectors for more information. We also recommend the tutorials to learn XPath through examples and "how to think in XPath".
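The "link whose text contains 'Next'" selection, which XPath expresses as something like `//a[contains(text(), "Next")]`, can be sketched with the standard library. ElementTree's XPath subset lacks `contains()`, so this hedged version filters the anchors manually; with lxml or Scrapy selectors you would use the XPath predicate directly.

```python
# Select links by their text content, stdlib-only.
import xml.etree.ElementTree as ET

html = """
<ul class="pager">
  <li><a href="/page/1/">Previous</a></li>
  <li class="next"><a href="/page/2/">Next</a></li>
</ul>
"""
root = ET.fromstring(html)
next_links = [a.get('href') for a in root.iter('a')
              if a.text and 'Next' in a.text]
print(next_links)  # ['/page/2/']
```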
Now that you know a bit about selection and extraction, let's complete our spider by writing the code to extract the quotes from the web page.

Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:

```html
<div class="quote">
    <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
```
Let's open up scrapy shell and play around a bit to find out how to extract the data we want:

```shell
$ scrapy shell 'http://quotes.toscrape.com'
```

We get a list of selectors for the quote elements with:

```python
>>> response.css("div.quote")
```

Each of the selectors returned by the query above allows us to run further queries over its sub-elements. Let's assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

```python
>>> quote = response.css("div.quote")[0]
```

Now, using the quote object we just created, let's extract the title, author and tags from that quote:

```python
>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'
```
Given that the tags are a list of strings, we can use the .extract() method to get all of them:

```python
>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
```

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into Python dictionaries:

```python
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
... a few more of these, omitted for brevity
>>>
```
Let's get back to our spider. Until now, it has not extracted any data, only saving whole HTML pages to local files. Let's integrate the extraction logic above into it.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
```

If you run this spider, it will output the extracted data with the log:

```
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
```
The simplest way to store the scraped data is by using Feed exports, with the following command:

```shell
scrapy crawl quotes -o quotes.json
```

That will generate a quotes.json file containing all scraped items, serialized in JSON.

For historic reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second run, you will end up with a broken JSON file. You can also use other formats, like JSON Lines:

```shell
scrapy crawl quotes -o quotes.jl
```

The JSON Lines format is useful because it is stream-like: you can easily append new records to the file. It does not have the same problem as JSON when run twice. Also, since each record is a separate line, you can process big files without having to fit everything in memory; there are tools like jq to help with that at the command line.
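Why appending is safe for JSON Lines but not for a single JSON array can be sketched in a few lines of standard-library Python: each record is one line of JSON, so appending never corrupts earlier records, and reading is naturally streaming. The file name here is illustrative.

```python
# Append records across two separate "runs", then stream them back.
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'quotes.jl')

for run in range(2):  # two separate runs, each appending one record
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps({'run': run, 'text': 'a quote'}) + '\n')

# Streaming read: one record per line, no need to load the whole file.
with open(path, encoding='utf-8') as f:
    records = [json.loads(line) for line in f]
print(len(records))  # 2
```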
In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for item pipelines was created for you when the project was created, in tutorial/pipelines.py, though you do not need to implement any item pipeline if you just want to store the scraped items.
Maybe you want to get quotes from all the pages of the website, instead of just scraping the first two pages of http://quotes.toscrape.com.

Now that you know how to extract data from pages, let's see how to follow links from them.

The first thing to do is extract the link to the page we want to follow. Examining our page, we can see that there is a link to the next page in the following element:

```html
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>
```

We can try extracting it in the shell:

```python
>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
```

This gets the anchor element, but we want its href attribute. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

```python
>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'
```
Let's now modify our spider to follow the link to the next page recursively, extracting data from it:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```
Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative), and yields a new request to the next page, registering itself as the callback (parse) to handle the data extraction there.
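The URL-resolution step that response.urljoin() performs is also available in the standard library, which makes it easy to see what the spider does with a relative link:

```python
# Resolve a relative next-page link against the current page URL.
from urllib.parse import urljoin

page_url = 'http://quotes.toscrape.com/page/1/'
next_page = '/page/2/'
print(urljoin(page_url, next_page))  # http://quotes.toscrape.com/page/2/
```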
What you see here is Scrapy's mechanism of following links: when you yield a request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when the request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page being visited.

In our example, it creates a sort of loop, following all the links to the next page until it cannot find one, which is handy for crawling blogs, forums and other sites with pagination.
As a shortcut for creating request objects, you can use response.follow:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```
Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield that request.

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes:

```python
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
```

For <a> elements there is a shortcut: response.follow uses their href attribute automatically, so the code can be shortened further:

```python
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
```
Note

response.follow(response.css('li.next a')) is not valid, because response.css returns a list-like object with selectors for all the results, not a single selector. A for loop like the ones above, or response.follow(response.css('li.next a')[0]), works fine.

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

```python
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
```
This spider will start from the main page; it will follow all the links to author pages, calling the parse_author callback for each of them, and the pagination links with the parse callback.

Here we are passing callbacks to response.follow directly as arguments, which makes the code shorter; this also works for scrapy.Request.

The parse_author callback defines a helper function that extracts data from a CSS query, and yields a Python dict with the author data.

Another interesting thing this spider demonstrates is that, even though there are many quotes from the same author, we do not need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, avoiding the problem of hammering servers because of a programming mistake. This can be configured via the DUPEFILTER_CLASS setting.
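A hedged sketch of what tuning that setting looks like in settings.py: RFPDupeFilter is Scrapy's default fingerprint-based duplicate filter, and pointing DUPEFILTER_CLASS at a subclass of it changes how duplicates are detected. The comment shows the per-request opt-out, which is often what you actually want instead of changing the filter globally.

```python
# settings.py fragment: request de-duplication configuration.
# This is the default value; replace it with the path to your own
# RFPDupeFilter subclass to customize duplicate detection.
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Per-request opt-out, used inside a spider callback:
# yield scrapy.Request(url, callback=self.parse, dont_filter=True)
```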
Hopefully by now you have a good understanding of how to use Scrapy's mechanism of following links and callbacks.

The CrawlSpider class implements a small, generic crawling engine; you can build your own crawler on top of it by customizing its link-following mechanics, among other things.

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.
You can provide command-line arguments to your spiders by using the -a option when running them:

```shell
scrapy crawl quotes -o quotes-humor.json -a tag=humor
```

By default, these arguments are passed to the Spider's __init__ method and become attributes of the spider.

In this example, the value provided for the tag argument is available via self.tag. You can use this to build the URL from the command-line argument, so that your spider only fetches quotes with a specific tag:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
If you pass tag=humor to this spider, you will notice that it only visits URLs for the humor tag, such as http://quotes.toscrape.com/tag/humor. You can learn more about handling spider arguments here.

This tutorial covered only the basics of Scrapy; there is a lot of other functionality not mentioned here. Check the "What else?" section in the Scrapy at a glance chapter for a quick overview of the most important ones.

You can continue from the table of contents to learn more about the command-line tool, spiders, selectors, and other things this tutorial has not covered. The next chapter is an example project.
http://www.cnblogs.com/-E6-/p/7213872.html
Original English documentation: https://docs.scrapy.org/en/latest/topics/commands.html

Source code on GitHub: https://github.com/scrapy/scrapy/tree/1.4
XPath and selectors:
When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this:
- BeautifulSoup is a very popular web scraping library among Python programmers which constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it’s slow.
- lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree. (lxml is not part of the Python standard library.)
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.
Scrapy selectors are built over the lxml library, which means they’re very similar in speed and parsing accuracy.
This page explains how selectors work and describes their API which is very small and simple, unlike the lxml API which is much bigger because the lxml library can be used for many other tasks, besides selecting markup documents.
For a complete reference of the selectors API see Selector reference
Scrapy selectors are instances of the Selector class, constructed by passing either text or a TextResponse object. It automatically chooses the best parsing rules (XML vs HTML) based on the input type:
```python
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
```
Constructing from text:

```python
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']
```

Constructing from response:

```python
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']
```

For convenience, response objects expose a selector on the .selector attribute; it is totally OK to use this shortcut when possible:

```python
>>> response.selector.xpath('//span/text()').extract()
[u'good']
```
To explain how to use the selectors we’ll use the Scrapy shell (which provides interactive testing) and an example page located in the Scrapy documentation server:
Here’s its HTML code:
```html
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
```
First, let’s open the shell:
```shell
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
```
Then, after the shell loads, you will have the response available as the response shell variable, and its attached selector in the response.selector attribute.
Since we’re dealing with HTML, the selector will automatically use an HTML parser.
So, by looking at the HTML code of that page, let’s construct an XPath for selecting the text inside the title tag:
```python
>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
```
Querying responses using XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():

```python
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
```
As you can see, the .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used for quickly selecting nested data:
>>> response.css('img').xpath('@src').extract()
[u'image1_thumb.jpg', u'image2_thumb.jpg', u'image3_thumb.jpg', u'image4_thumb.jpg', u'image5_thumb.jpg']
To actually extract the textual data, you must call the selector .extract() method, as follows:
>>> response.xpath('//title/text()').extract()
[u'Example website']
If you want to extract only the first matched element, you can call the selector .extract_first() method:
>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '
It returns None if no element was found:
>>> response.xpath('//div[@id="not-exists"]/text()').extract_first() is None
True
A default return value can be provided as an argument, to be used instead of None:
>>> response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
'not-found'
Notice that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:
>>> response.css('title::text').extract()
[u'Example website']
Now we’re going to get the base URL and some image links:
>>> response.xpath('//base/@href').extract()
[u'http://example.com/']
>>> response.css('base::attr(href)').extract()
[u'http://example.com/']
>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html', u'image2.html', u'image3.html', u'image4.html', u'image5.html']
>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html', u'image2.html', u'image3.html', u'image4.html', u'image5.html']
>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg', u'image2_thumb.jpg', u'image3_thumb.jpg', u'image4_thumb.jpg', u'image5_thumb.jpg']
>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg', u'image2_thumb.jpg', u'image3_thumb.jpg', u'image4_thumb.jpg', u'image5_thumb.jpg']
The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods for those selectors too. Here’s an example:
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print 'Link number %d points to url %s and image %s' % args
Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
Selector also has a .re() method for extracting data using regular expressions. However, unlike the .xpath() or .css() methods, .re() returns a list of unicode strings, so you can’t construct nested .re() calls.
Here’s an example used to extract image names from the HTML code above:
>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1', u'My image 2', u'My image 3', u'My image 4', u'My image 5']
There’s an additional helper, .re_first(), the counterpart of .extract_first() for .re(). Use it to extract just the first matching string:
>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
u'My image 1'
Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you’re calling it from.
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
>>> divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print p.extract()
This is the proper way to do it (note the dot prefixing the .//p XPath):
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print p.extract()
Another common case would be to extract all direct <p> children:
>>> for p in divs.xpath('p'):
...     print p.extract()
For more details about relative XPaths see the Location Paths section in the XPath specification.
XPath allows you to reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world, where you replace some arguments in your queries with placeholders like ?, which are then substituted with values passed with the query.
Here’s an example to match an element based on its “id” attribute value, without hard-coding it (that was shown previously):
>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()
u'Name: My image 1 '
Here’s another example, to find the “id” attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):
>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).extract_first()
u'images'
All variable references must have a binding value when calling .xpath() (otherwise you’ll get a ValueError: XPath error: exception). This is done by passing as many named arguments as necessary.
parsel, the library powering Scrapy selectors, has more details and examples on XPath variables.
Being built atop lxml, Scrapy selectors also support some EXSLT extensions and come with these pre-registered namespaces to use in XPath expressions:
prefix | namespace | usage |
---|---|---|
re | http://exslt.org/regular-expressions | regular expressions |
set | http://exslt.org/sets | set manipulation |
The test() function, for example, can prove quite useful when XPath’s starts-with() or contains() are not sufficient.
Example selecting links inside list items whose “class” attribute ends with a digit:
>>> from scrapy import Selector
>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
>>> sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']
Warning
The C library libxslt doesn’t natively support EXSLT regular expressions, so lxml’s implementation uses hooks to Python’s re module. Thus, using regexp functions in your XPath expressions may add a small performance penalty.
Set operations can be handy for excluding parts of a document tree before extracting text elements, for example.
Example extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and corresponding itemprops:
>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
...   <span itemprop="name">Kenmore White 17" Microwave</span>
...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
...   <div itemprop="aggregateRating"
...     itemscope itemtype="http://schema.org/AggregateRating">
...    Rated <span itemprop="ratingValue">3.5</span>/5
...    based on <span itemprop="reviewCount">11</span> customer reviews
...   </div>
...
...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
...     <span itemprop="price">$55.00</span>
...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
...   </div>
...
...   Product description:
...   <span itemprop="description">0.7 cubic feet countertop microwave.
...   Has six preset cooking categories and convenience features like
...   Add-A-Minute and Child Lock.</span>
...
...   Customer reviews:
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Not a happy camper</span> -
...     by <span itemprop="author">Ellie</span>,
...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1">
...       <span itemprop="ratingValue">1</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">The lamp burned out and now I have to replace
...     it. </span>
...   </div>
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Value purchase</span> -
...     by <span itemprop="author">Lucas</span>,
...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1"/>
...       <span itemprop="ratingValue">4</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">Great microwave for the price. It is small and
...     fits in my apartment.</span>
...   </div>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath('//div[@itemscope]'):
...     print "current scope:", scope.xpath('@itemtype').extract()
...     props = scope.xpath('''
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)''')
...     print "    properties:", props.extract()
...     print
...
current scope: [u'http://schema.org/Product']
    properties: [u'name', u'aggregateRating', u'offers', u'description', u'review', u'review']
current scope: [u'http://schema.org/AggregateRating']
    properties: [u'ratingValue', u'reviewCount']
current scope: [u'http://schema.org/Offer']
    properties: [u'price', u'availability']
current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']
current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']
current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']
current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']
Here we first iterate over itemscope elements, and for each one, we look for all itemprop elements and exclude those that are themselves inside another itemscope.
Here are some tips that you may find useful when using XPath with Scrapy selectors, based on this post from ScrapingHub’s blog. If you are not yet very familiar with XPath, you may want to take a look first at this XPath tutorial.
When you need to use the text content as an argument to an XPath string function, avoid using .//text() and use just . instead.
This is because the expression .//text() yields a collection of text elements – a node-set. And when a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), it results in the text of the first element only.
Example:
>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
Converting a node-set to string:
>>> sel.xpath('//a//text()').extract()  # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> sel.xpath("string(//a[1]//text())").extract()  # convert it to string
[u'Click here to go to the ']
A node converted to a string, however, puts together the text of itself plus of all its descendants:
>>> sel.xpath("//a[1]").extract()  # select the first node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").extract()  # convert it to string
[u'Click here to go to the Next Page']
So, using the .//text() node-set won’t select anything in this case:
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
[]
But using the . to mean the node works:
>>> sel.xpath("//a[contains(., 'Next Page')]").extract()
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
//node[1] selects all the nodes occurring first under their respective parents.
(//node)[1] selects all the nodes in the document, and then gets only the first of them.
Example:
>>> from scrapy import Selector
>>> sel = Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()
This gets all first <li> elements under their respective parents:
>>> xp("//li[1]")
[u'<li>1</li>', u'<li>4</li>']
And this gets the first <li> element in the whole document:
>>> xp("(//li)[1]")
[u'<li>1</li>']
This gets all first <li> elements under an <ul> parent:
>>> xp("//ul/li[1]")
[u'<li>1</li>', u'<li>4</li>']
And this gets the first <li> element under an <ul> parent in the whole document:
>>> xp("(//ul/li)[1]")
[u'<li>1</li>']
Because an element can contain multiple CSS classes, the XPath way to select elements by class is rather verbose:
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
If you use @class='someclass' you may end up missing elements that have other classes, and if you just use contains(@class, 'someclass') to make up for that, you may end up with more elements than you want, if they have a different class name that shares the string someclass.
As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can just select by class using CSS and then switch to XPath when needed:
>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').extract()
[u'2014-07-23 19:00']
This is cleaner than using the verbose XPath trick shown above. Just remember to use the . in the XPath expressions that will follow.
class scrapy.selector.Selector(response=None, text=None, type=None)
An instance of Selector is a wrapper over a response that is used to select certain parts of its content.
response is an HtmlResponse or an XmlResponse object that will be used for selecting and extracting data.
text is a unicode string or utf-8 encoded text for cases when a response isn’t available. Using text and response together is undefined behavior.
type defines the selector type; it can be "html", "xml" or None (default).
If type is None, the selector automatically chooses the best type based on the response type (see below), or defaults to "html" in case it is used together with text.
If type is None and a response is passed, the selector type is inferred from the response type as follows:
"html" for HtmlResponse type
"xml" for XmlResponse type
"html" for anything else
Otherwise, if type is set, the selector type will be forced and no detection will occur.
xpath(query)
Find nodes matching the xpath query and return the result as a SelectorList instance with all elements flattened. List elements implement the Selector interface too.
query is a string containing the XPath query to apply.
Note
For convenience, this method can be called as response.xpath()
css(query)
Apply the given CSS selector and return a SelectorList instance.
query is a string containing the CSS selector to apply.
In the background, CSS queries are translated into XPath queries using the cssselect library and run with the .xpath() method.
Note
For convenience, this method can be called as response.css()
extract()
Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.
re(regex)
Apply the given regex and return a list of unicode strings with the matches.
regex can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex)
Note
Note that re() and re_first() both decode HTML entities (except < and &).
register_namespace(prefix, uri)
Register the given namespace to be used in this Selector. Without registering namespaces you can’t select or extract data from non-standard namespaces. See the examples below.
remove_namespaces()
Remove all namespaces, allowing traversal of the document using namespace-less xpaths. See the example below.
__nonzero__()
Returns True if there is any real content selected or False otherwise. In other words, the boolean value of a Selector is given by the contents it selects.
class scrapy.selector.SelectorList
The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.
xpath(query)
Call the .xpath() method for each element in this list and return their results flattened as another SelectorList.
query is the same argument as the one in Selector.xpath()
css(query)
Call the .css() method for each element in this list and return their results flattened as another SelectorList.
query is the same argument as the one in Selector.css()
extract()
Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.
re()
Call the .re() method for each element in this list and return their results flattened, as a list of unicode strings.
Here’s a couple of Selector examples to illustrate several concepts. In all cases, we assume there is already a Selector instantiated with an HtmlResponse object like this:
sel = Selector(html_response)
Select all <h1> elements from an HTML response body, returning a list of Selector objects (i.e. a SelectorList object):
sel.xpath("//h1")
Extract the text of all <h1> elements from an HTML response body, returning a list of unicode strings:
sel.xpath("//h1").extract()         # this includes the h1 tag
sel.xpath("//h1/text()").extract()  # this excludes the h1 tag
Iterate over all <p> tags and print their class attribute:
for node in sel.xpath("//p"):
    print node.xpath("@class").extract()
Here’s a couple of examples to illustrate several concepts. In both cases we assume there is already a Selector instantiated with an XmlResponse object like this:
sel = Selector(xml_response)
Select all <product> elements from an XML response body, returning a list of Selector objects (i.e. a SelectorList object):
sel.xpath("//product")
Extract all prices from a Google Base XML feed which requires registering a namespace:
sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").extract()
When dealing with scraping projects, it is often quite convenient to get rid of namespaces altogether and just work with element names, to write simpler and more convenient XPaths. You can use the Selector.remove_namespaces() method for that.
Let’s show an example that illustrates this with the GitHub blog atom feed.
First, we open the shell with the url we want to scrape:
$ scrapy shell https://github.com/blog.atom
Once in the shell we can try selecting all <link> objects and see that it doesn’t work (because the Atom XML namespace is obfuscating those nodes):
>>> response.xpath("//link")
[]
But once we call the Selector.remove_namespaces() method, all nodes can be accessed directly by their names:
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
 <Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
 ...
If you wonder why the namespace removal procedure isn’t always called by default instead of having to call it manually, this is because of two reasons, which, in order of relevance, are:
1. Removing namespaces requires iterating over and modifying all nodes in the document, which is a reasonably expensive operation to perform by default for all documents crawled by Scrapy.
2. There can be cases where using namespaces is actually required, for instance when some element names clash between namespaces. These cases are rare, though.
https://docs.scrapy.org/en/latest/topics/selectors.html
Scrapy Learning Series (1): Querying Page Elements with CSS Selectors and XPath Selectors
This article mainly introduces creating a simple spider, and along the way covers how to select page elements (css selector, xpath selector).
Open a command line and run the following command:
scrapy startproject homelink_selling_index
The created project structure is as follows:
│  scrapy.cfg
│
└─lianjia_shub
    │  items.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py
The page elements to be scraped are shown in the figure below:
Import the namespace:
import scrapy
Define the spider:
class homelink_selling_index_spider(scrapy.Spider):
    # Define the spider's name; it is used when invoking the spider for crawling:
    # scrapy crawl <spider.name>
    name = "homelink_selling_index"
    # If no other urls are specified, the spider starts crawling from the links in start_urls
    start_urls = ["http://bj.lianjia.com/ershoufang/pg1tt2/"]

    # parse is scrapy.Spider's default entry point for handling an http response
    # parse processes every link in start_urls, one by one
    def parse(self, response):
        # Get the house listings on the current page
        #house_lis = response.css('.house-lst .info-panel')
        house_lis = response.xpath('//ul[@class="house-lst"]/li/div[@class="info-panel"]')
        # Write the results to a file (house titles will show up garbled on the
        # command line because of encoding issues)
        with open("homelink.log", "wb") as f:
            ## Using css selectors
            #average_price = response.css('.secondcon.fl li:nth-child(1)').css('.botline a::text').extract_first()
            #f.write("Average Price: " + str(average_price) + "\r\n")
            #yesterday_count = response.css('.secondcon.fl li:last-child').css('.botline strong::text').extract_first()
            #f.write("Yesterday Count: " + str(yesterday_count) + "\r\n")
            #for house_li in house_lis:
            #    link = house_li.css('a::attr("href")').extract_first()     # Get the house's link
            #    title = house_li.css('a::text').extract_first()            # Get the house's title
            #    price = house_li.css('.price .num::text').extract_first()  # Get the house's price

            # Using xpath selectors
            average_price = response.xpath('//div[@class="secondcon fl"]//li[1]/span[@class="botline"]//a/text()').extract_first()
            f.write("Average Price: " + str(average_price) + "\r\n")
            yesterday_count = response.xpath('//div[@class="secondcon fl"]//li[last()]//span[@class="botline"]/strong/text()').extract_first()
            f.write("Yesterday Count: " + str(yesterday_count) + "\r\n")
            for house_li in house_lis:
                # Note the "." at the start of the xpath; without it the query starts
                # from the document root instead of the current node
                link = house_li.xpath('.//a/@href').extract_first()
                title = house_li.xpath('.//a/text()').extract_first()
                price = house_li.xpath('.//div[@class="price"]/span[@class="num"]/text()').extract_first()
                f.write("Title: {0}\tPrice:{1}\r\n\tLink: {2}\r\n".format(title.encode('utf-8'), price, link))
Average Price: 44341
Yesterday Count: 33216
Title: 万科假日风景全明格局 南北精装三居 满五惟一	Price:660
	Link: http://bj.lianjia.com/ershoufang/xxx.html
Title: 南北通透精装三居 免税带车位 先后对花园 有钥匙	Price:910
	Link: http://bj.lianjia.com/ershoufang/xxx.html
Title: 西直门 时代之光名苑 西南四居 满五惟一 诚心出售	Price:1200
	Link: http://bj.lianjia.com/ershoufang/xxx.html
......
Through the three steps above, we can perform simple scraping of page elements. However, we have not yet made use of the many convenient and powerful features Scrapy provides, such as ItemLoader and Pipeline. These will be covered in follow-up articles.
https://www.cnblogs.com/silverbullet11/p/scrapy_series_1.html