scrapy


With XPath, get the text and the relevant attribute values of a specified tag.
Be able to pinpoint a specific item in a list (via id or class).
Filter by tag or by the value of a related attribute.

response.xpath('//*[@id="resultList"]/div[4]/span[1]/a/@href').extract_first()

Select the element whose id is resultList, go down to its 4th div child, then that div's 1st span child, then the a tag below it, and take that a tag's href attribute.
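
For comparison, roughly the same lookup can be written with a CSS selector. This is only a sketch that assumes the same resultList markup as above; nth-of-type plays the role of the [4] and [1] indexes in the XPath:

# Sketch: CSS version of the XPath above, assuming the same #resultList structure.
href = response.css('#resultList > div:nth-of-type(4) > span:nth-of-type(1) > a::attr(href)').extract_first()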




CSS: select a specific element or a list of elements by CSS class.
Get the text and the relevant attribute values of the specified tag.
Be able to pinpoint a specific item in a list.
How should the selector be written when a tag carries several CSS classes? (See the sketch below.)
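
One way to answer that last question, as a small sketch with a made-up element that carries two classes (house-lst and hot): in CSS you simply chain the class names, while in XPath you test the @class string with contains()/concat():

from scrapy.selector import Selector

# Hypothetical element carrying two classes at once.
sel = Selector(text='<div class="house-lst hot"><a href="/item/1">item</a></div>')

# CSS: chain the classes (no space) to require both on the same element.
print(sel.css('div.house-lst.hot a::attr(href)').extract_first())   # /item/1

# XPath: @class is a single string, so test a space-padded copy of it.
print(sel.xpath('//div[contains(concat(" ", normalize-space(@class), " "), " hot ")]'
                '/a/@href').extract_first())                        # /item/1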





Scrapy XPath

 

Expression  Description
nodename  Selects all child nodes of the named node.
/  Selects from the root node.
//  Selects nodes in the document from the current node that match the selection, no matter where they are.
.  Selects the current node.
..  Selects the parent of the current node.
@  Selects attributes.

 

Path expression  Result
/bookstore/book[1]  Selects the first book element that is a child of bookstore.
/bookstore/book[last()]  Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]  Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]  Selects the first two book elements that are children of bookstore.
//title[@lang]  Selects all title elements that have an attribute named lang.
//title[@lang='eng']  Selects all title elements that have a lang attribute with the value eng.
/bookstore/book[price>35.00]  Selects all book elements of bookstore whose price child element has a value greater than 35.00.
/bookstore/book[price>35.00]/title  Selects all title elements of the book elements of bookstore whose price child element has a value greater than 35.00.
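
A quick way to check these path expressions is to run them against a small hand-written document in a Scrapy Selector; the bookstore XML below is made up for illustration:

from scrapy.selector import Selector

# Made-up bookstore document matching the table above.
doc = '''<bookstore>
  <book><title lang="eng">A</title><price>29.99</price></book>
  <book><title lang="eng">B</title><price>39.95</price></book>
  <book><title lang="zh">C</title><price>49.00</price></book>
</bookstore>'''
sel = Selector(text=doc, type='xml')

print(sel.xpath('/bookstore/book[1]/title/text()').extract())            # ['A']
print(sel.xpath('/bookstore/book[last()]/title/text()').extract())       # ['C']
print(sel.xpath('//title[@lang="eng"]/text()').extract())                # ['A', 'B']
print(sel.xpath('/bookstore/book[price>35.00]/title/text()').extract())  # ['B', 'C']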

 

A few simple examples:

/html/head/title: selects the <title> tag inside the <head> element of the HTML document. Alternative: response.xpath('//title') also gets the page title, but // is less efficient and not recommended here.
/html/head/title/text(): selects the text content of the <title> element mentioned above.
//td: selects all <td> elements.
//div[@class="mine"]: selects all div elements that have the attribute class="mine".

 

Scrapy uses CSS and XPath selectors to locate elements; they expose four basic methods:
xpath(): returns a list of selectors, each representing a node selected by the given XPath expression
css(): returns a list of selectors, each representing a node selected by the given CSS expression
extract(): returns the selected content as unicode strings
re(): returns a list of unicode strings extracted by applying a regular expression

>>> response.xpath('//title/text()')  
[<Selector (text) xpath=//title/text()>]  
>>> response.css('title::text')  
[<Selector (text) xpath=//title/text()>]  
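
To see extract() and re() in action without depending on a particular page, here is a small sketch with a stand-in document (the title text is invented):

from scrapy.selector import Selector

# Stand-in page, since the shell output above does not show the actual title value.
sel = Selector(text='<html><head><title>Hello Scrapy</title></head></html>')
print(sel.xpath('//title/text()').extract())      # ['Hello Scrapy']
print(sel.css('title::text').extract_first())     # 'Hello Scrapy'
print(sel.css('title::text').re(r'(\w+) (\w+)'))  # ['Hello', 'Scrapy']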

 

Problem: Scrapy did not perform the expected loop of follow-up requests.
Fix: change the domain in allowed_domains so that it matches the URLs being crawled.
Cause: the domain in allowed_domains was written incorrectly and did not match the URLs to be crawled, so the follow-up requests were filtered out as offsite.
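
As a rough sketch of the rule (not tied to any particular site): allowed_domains should hold bare domain names only, with no scheme or path, and each requested host must fall under one of them, otherwise the request is filtered as offsite:

# Sketch: allowed_domains holds bare domains only (no scheme, no path).
# Requests whose host is not under one of these domains are filtered as "offsite".
allowed_domains = ['zhaopin.com']        # matches sou.zhaopin.com, www.zhaopin.com, ...
# allowed_domains = ['http://sou.zhaopin.com/jobs']   # wrong: scheme and path do not belong here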

The corrected code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_demo7.items import ScrapyDemo7Item
from scrapy.http import Request


class ZhilianSpider(scrapy.Spider):
    name = 'zhilian'
    allowed_domains = ['zhaopin.com']
    start_urls = ['http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&sm=0&p=1']

    def parse(self, response):
        tables = response.xpath('//*[@id="newlist_list_content_table"]/table')
        for table in tables:
            item = ScrapyDemo7Item()
            first = table.xpath('./tbody/tr[1]/td[1]/div/a/@href').extract_first()
            print("first", first)
            tableRecord = table.xpath("./tr[1]")
            jobInfo = tableRecord.xpath("./td[@class='zwmc']/div/a")
            item["job_name"] = jobInfo.xpath("./text()").extract_first()
            item["company_name"] = tableRecord.xpath("./td[@class='gsmc']/a[@target='_blank']/text()").extract_first()
            item["job_provide_salary"] = tableRecord.xpath("./td[@class='zwyx']/text()").extract_first()
            item["job_location"] = tableRecord.xpath("./td[@class='gzdd']/text()").extract_first()
            item["job_release_date"] = tableRecord.xpath("./td[@class='gxsj']/span/text()").extract_first()
            item["job_url"] = jobInfo.xpath("./@href").extract_first()
            yield item
        for i in range(1, 21):
            url = "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&sm=0&p=" + str(i)
            print(url)
            yield Request(url, callback=self.parse)

 




 

C:\Users\user>pip3 install scrapy
Collecting scrapy
  Using cached Scrapy-1.4.0-py2.py3-none-any.whl
Collecting parsel>=1.1 (from scrapy)
  Using cached parsel-1.2.0-py2.py3-none-any.whl
Requirement already satisfied: service-identity in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: w3lib>=1.17.0 in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: cssselect>=0.9 in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: queuelib in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: lxml in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: PyDispatcher>=2.0.5 in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: six>=1.5.2 in d:\python362\lib\site-packages (from scrapy)
Collecting Twisted>=13.1.0 (from scrapy)
  Using cached Twisted-17.9.0.tar.bz2
Requirement already satisfied: pyOpenSSL in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: attrs in d:\python362\lib\site-packages (from service-identity->scrapy)
Requirement already satisfied: pyasn1-modules in d:\python362\lib\site-packages (from service-identity->scrapy)
Requirement already satisfied: pyasn1 in d:\python362\lib\site-packages (from service-identity->scrapy)
Requirement already satisfied: zope.interface>=4.0.2 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: constantly>=15.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: incremental>=16.10.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: Automat>=0.3.0 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: hyperlink>=17.1.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: cryptography>=2.1.4 in d:\python362\lib\site-packages (from pyOpenSSL->scrapy)
Requirement already satisfied: setuptools in d:\python362\lib\site-packages (from zope.interface>=4.0.2->Twisted>=13.1.0->scrapy)
Requirement already satisfied: cffi>=1.7; platform_python_implementation != "PyPy" in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy)
Requirement already satisfied: asn1crypto>=0.21.0 in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy)
Requirement already satisfied: idna>=2.1 in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy)
Requirement already satisfied: pycparser in d:\python362\lib\site-packages (from cffi>=1.7; platform_python_implementation != "PyPy"->cryptography>=2.1.4->pyOpenSSL->scrapy)
Installing collected packages: parsel, Twisted, scrapy
  Running setup.py install for Twisted ... done
Successfully installed Twisted-17.9.0 parsel-1.2.0 scrapy-1.4.0

C:\Users\user>scrapy
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

C:\Users\user>scrapy bench
2017-12-13 15:41:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-12-13 15:41:49 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2017-12-13 15:41:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
2017-12-13 15:41:50 [twisted] CRITICAL: Unhandled error in Deferred:

2017-12-13 15:41:50 [twisted] CRITICAL:
Traceback (most recent call last):
  File "d:\python362\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "d:\python362\lib\site-packages\scrapy\crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "d:\python362\lib\site-packages\scrapy\crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "d:\python362\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "d:\python362\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "d:\python362\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "d:\python362\lib\site-packages\scrapy\middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "d:\python362\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "d:\python362\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "d:\python362\lib\site-packages\scrapy\downloadermiddlewares\retry.py", line 20, in <module>
    from twisted.web.client import ResponseFailed
  File "d:\python362\lib\site-packages\twisted\web\client.py", line 42, in <module>
    from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS
  File "d:\python362\lib\site-packages\twisted\internet\endpoints.py", line 41, in <module>
    from twisted.internet.stdio import StandardIO, PipeAddress
  File "d:\python362\lib\site-packages\twisted\internet\stdio.py", line 30, in <module>
    from twisted.internet import _win32stdio
  File "d:\python362\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
    import win32api
ModuleNotFoundError: No module named 'win32api'

 

C:\Users\user>pip3 install pypiwin32
Collecting pypiwin32
  Downloading pypiwin32-220-cp36-none-win32.whl (8.3MB)
    100% |████████████████████████████████| 8.3MB 34kB/s
Installing collected packages: pypiwin32
Successfully installed pypiwin32-220

C:\Users\user>
C:\Users\user>scrapy bench
2017-12-13 15:49:05 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-12-13 15:49:05 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-12-13 15:49:06 [scrapy.core.engine] INFO: Spider opened
2017-12-13 15:49:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:07 [scrapy.extensions.logstats] INFO: Crawled 85 pages (at 5100 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:08 [scrapy.extensions.logstats] INFO: Crawled 157 pages (at 4320 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:09 [scrapy.extensions.logstats] INFO: Crawled 229 pages (at 4320 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:10 [scrapy.extensions.logstats] INFO: Crawled 293 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:11 [scrapy.extensions.logstats] INFO: Crawled 357 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:12 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:13 [scrapy.extensions.logstats] INFO: Crawled 469 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:14 [scrapy.extensions.logstats] INFO: Crawled 517 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:15 [scrapy.extensions.logstats] INFO: Crawled 573 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:16 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2017-12-13 15:49:16 [scrapy.extensions.logstats] INFO: Crawled 621 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 284168,
 'downloader/request_count': 629,
 'downloader/request_method_count/GET': 629,
 'downloader/response_bytes': 1976557,
 'downloader/response_count': 629,
 'downloader/response_status_count/200': 629,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2017, 12, 13, 7, 49, 17, 78107),
 'log_count/INFO': 17,
 'request_depth_max': 21,
 'response_received_count': 629,
 'scheduler/dequeued': 629,
 'scheduler/dequeued/memory': 629,
 'scheduler/enqueued': 12581,
 'scheduler/enqueued/memory': 12581,
 'start_time': datetime.datetime(2017, 12, 13, 7, 49, 6, 563037)}
2017-12-13 15:49:17 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

C:\Users\user>

 

In this tutorial, we assume that Scrapy is already installed on your system. If that's not the case, see the installation guide.

We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.

This tutorial will walk you through these tasks:

  1. Create a new Scrapy project
  2. Write a spider to crawl a site and extract data
  3. Export the scraped data using the command line
  4. Change the spider to recursively follow links
  5. Use spider arguments

Scrapy is written in Python. If you haven't learned Python yet, you may want to get a feel for the language first, to get the most out of Scrapy.

If you're already familiar with other languages and want to learn Python quickly, we recommend reading Dive Into Python 3. Alternatively, you can follow the Python Tutorial.

If you're new to programming and want to start with Python, the online book Learn Python The Hard Way is quite useful. You can also take a look at this list of Python resources for non-programmers.

Creating a project

Before you start scraping, you have to set up a new Scrapy project. Enter a directory where you'd like to store your code and run:

scrapy startproject tutorial

This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # project configuration file
    tutorial/             # the project's Python module, where you'll put your code
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

Our first Spider

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

  • name: identifies the Spider. It must be unique within a project, that is, you can't set the same name for different Spiders.
  • start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
  • parse(): a method that will be called to handle the response downloaded for each request. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

The parse() method usually parses the response, extracting the scraped data as dicts, and also finds new URLs to follow and creates new requests (Request) from them.

How to run our spider

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl quotes

This command runs the spider with the name quotes that we've just added, which will send some requests to quotes.toscrape.com. You will get output similar to this:

... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...

Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.

Note

If you are wondering why we haven't parsed the HTML yet, hold on, we will cover that soon.

What just happened under the hood?

Scrapy schedules the scrapy.Request objects returned by the Spider's start_requests method. Upon receiving a response for each one, it instantiates a Response object and calls the callback method associated with the request (in this case, the parse method), passing the response as its argument.

A shortcut for the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

The parse() method will be called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so. This happens because parse() is Scrapy's default callback method, which is called for requests without an explicitly assigned callback.

Extracting data

The best way to learn how to extract data with Scrapy is trying selectors in the Scrapy shell. Run:

scrapy shell 'http://quotes.toscrape.com/page/1/'

Note

Remember to always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise URLs containing arguments (i.e. the & character) will not work.

On Windows, use double quotes instead:

scrapy shell "http://quotes.toscrape.com/page/1/"

You will see something like:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

Using the shell, you can try selecting elements using CSS selectors:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap XML/HTML elements and allow you to run further queries to fine-grain the selection or extract data.

To extract the text from the title above, you can do:

>>> response.css('title::text').extract()
['Quotes to Scrape']

There are two things to note here: one is that we've added ::text to the CSS query, which means we only want to select the text inside the <title> element. If we don't specify ::text, we'd get the full title element, including its tags:

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']

The other thing is that the result of calling .extract() is a list, because we're dealing with a SelectorList. When you know you just want the first result, you can do:

>>> response.css('title::text').extract_first()
'Quotes to Scrape'

Alternatively, you could write:

>>> response.css('title::text')[0].extract()
'Quotes to Scrape'

However, using .extract_first() avoids an IndexError and returns None when it doesn't find any element matching the selection.

There's a lesson here: for most scraping code, you want it to be resilient to errors; if some elements can't be found on a page and some items can't be fully extracted, you should at least still get the rest of the data.
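
As a minimal sketch of that fault-tolerant style (the markup below is invented, with one quote block deliberately missing its author element):

from scrapy.selector import Selector

body = '''
<div class="quote"><span class="text">"A quote"</span>
  <small class="author">Someone</small></div>
<div class="quote"><span class="text">"Another quote"</span></div>
'''
sel = Selector(text=body)
for quote in sel.css('div.quote'):
    text = quote.css('span.text::text').extract_first()
    # extract_first() returns None (or the given default) instead of raising IndexError
    author = quote.css('small.author::text').extract_first(default='unknown')
    print(text, '-', author)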

Besides the extract() and extract_first() methods, you can also use the re() method to extract using regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

In order to find the proper CSS selectors to use, you can open the page in your browser and view its source code. You can also use your browser's developer tools or extensions like Firebug (see the sections about Using Firebug for scraping and Using Firefox for scraping).

Selector Gadget is also a nice tool to quickly find the CSS selector for an element, and it works in many browsers.

XPath: a brief intro

Besides CSS, Scrapy selectors also support using XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, if you look at the related source code, you'll find that CSS selectors are converted to XPath under the hood.

While perhaps not as popular as CSS selectors, XPath expressions offer more power, because besides navigating the structure, they can also look at the content. Using XPath, you're able to select things like: the link that contains the text "Next Page". This makes XPath very fitting for the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.

We won't cover much of XPath here, but you can read more in using XPath with Scrapy Selectors. We also recommend the tutorial to learn XPath through examples, and the tutorial "how to think in XPath".

Extracting quotes and authors

Now that you know a bit about selection and extraction, let's complete our spider by writing the code to extract the quotes from the web page.

Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let's open up the Scrapy shell and play around a bit to find out how to extract the data we want:

$ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

>>> response.css("div.quote")

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let's assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

>>> quote = response.css("div.quote")[0]

Now, let's extract the title, author, and tags from that quote, using the quote object we just created:

>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

Given that the tags are a list of strings, we can use the .extract() method to get all of them:

>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
>>>

Extracting data in our spider

Let's get back to our spider. Until now, it hasn't extracted any data in particular, it just saved the whole HTML page to a local file. Let's integrate the extraction logic above into it.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

If you run this spider, it will output the extracted data along with the log:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

Storing the scraped data

The simplest way to store the scraped data is by using Feed exports, with the following command:

scrapy crawl quotes -o quotes.json

That will generate a quotes.json file containing all scraped items, serialized in JSON.

For historic reasons, Scrapy appends to a given file instead of overwriting its contents, so if you run this command twice without removing the file before the second run, you'll end up with a broken JSON file. You can also use other formats, like JSON Lines:

scrapy crawl quotes -o quotes.jl

The JSON Lines format is useful because it's stream-like: you can easily append new records to the file, and it doesn't have the JSON problem above when you run the command twice. Also, since each record is a separate line, you can process big files without having to fit everything into memory; there are tools like JQ to help do that from the command line.
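
Because every line is an independent JSON object, the export can be consumed one record at a time; a small sketch (assuming a quotes.jl file produced by the command above):

import json

with open('quotes.jl', encoding='utf-8') as f:
    for line in f:                 # each line is a complete JSON object
        item = json.loads(line)
        print(item['author'], '-', item['text'][:40])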

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for item pipelines was created for you when the project was set up, in tutorial/pipelines.py, though you don't need any pipelines if you just want to store the scraped items.

Following links

Maybe you want to grab quotes from all the pages of the website, instead of just the first two pages of http://quotes.toscrape.com.

Now that you know how to extract data from pages, let's see how to follow links from them.

The first thing to do is to extract the link to the page we want to follow. Examining our page, we can see the link to the next page sits in the following markup:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but what we want is its href attribute. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'

Let's now modify our spider so that it recursively follows the link to the next page, extracting data from it:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative), and yields a new request to the next page, registering the callback method (parse) to be called for it.

What you see here is Scrapy's mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page being visited.

In our example, it creates a sort of loop, following all the links to the next page until it doesn't find one, which is handy for crawling blogs, forums, and other sites with pagination.

A shortcut for creating Requests

As a shortcut for creating Request objects you can use response.follow:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes:

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically, so the code can be shortened further:

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

Note

response.follow(response.css('li.next a')) is not valid because response.css returns a list-like object with selectors for all results, not a single selector. A for loop like in the examples above, or response.follow(response.css('li.next a')[0]), works fine.

More examples and patterns

Here is another spider that illustrates callbacks and following links, this time scraping author information:

import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

This spider will start from the main page; it will follow all the links to author pages, calling the parse_author callback for each of them, and the pagination links with the parse callback as we saw before.

Here we're passing callbacks to response.follow directly as positional arguments to make the code shorter; this also works for scrapy.Request.

The parse_author callback defines a helper function to extract and clean the data from a CSS query, and yields a Python dict with the author data.

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs it has already visited, avoiding the problem of hammering servers because of a programming mistake. This behavior can be configured with the DUPEFILTER_CLASS setting.
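
For reference, the two knobs involved look roughly like this (a sketch; the value shown for DUPEFILTER_CLASS is Scrapy's default fingerprint-based filter):

# settings.py -- which duplicate-request filter to use (default shown).
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Inside a spider callback, a single request can opt out of duplicate filtering:
#     yield scrapy.Request(url, callback=self.parse, dont_filter=True)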

Hopefully by now you have a good understanding of how to use Scrapy's mechanism of following links and callbacks.

The CrawlSpider class is a generic spider that implements a small rules engine for following links; you can write your own crawlers on top of it.

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.

Using spider arguments

You can provide command-line arguments to your spiders by using the -a option when running them:

scrapy crawl quotes -o quotes-humor.json -a tag=humor

By default, these arguments are passed to the Spider's __init__ method and become spider attributes.

In this example, the value provided for the tag argument is available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass the tag=humor argument to this spider, you'll notice that it will only visit URLs for the humor tag, such as http://quotes.toscrape.com/tag/humor. You can learn more about handling spider arguments here.

Next steps

This tutorial covered only the basics of Scrapy, and there is a lot of other functionality not mentioned here. Check the "What else?" section in the Scrapy at a glance chapter for a quick overview of the most important ones.

You can learn more about the command-line tool, spiders, selectors, and other things this tutorial hasn't covered, from the table of contents. The next chapter is an example project.

http://www.cnblogs.com/-E6-/p/7213872.html

Original English documentation: https://docs.scrapy.org/en/latest/topics/commands.html
Source code on GitHub: https://github.com/scrapy/scrapy/tree/1.4

XPath and selectors:

Selectors

When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this:

  • BeautifulSoup is a very popular web scraping library among Python programmers which constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it’s slow.
  • lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree. (lxml is not part of the Python standard library.)

Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

Scrapy selectors are built over the lxml library, which means they’re very similar in speed and parsing accuracy.

This page explains how selectors work and describes their API which is very small and simple, unlike the lxml API which is much bigger because the lxml library can be used for many other tasks, besides selecting markup documents.

For a complete reference of the selectors API see Selector reference

Using selectors

Constructing selectors

Scrapy selectors are instances of Selector class constructed by passing text or a TextResponse object. It automatically chooses the best parsing rules (XML vs HTML) based on input type:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']

Constructing from response:

>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']

For convenience, response objects expose a selector on .selector attribute, it’s totally OK to use this shortcut when possible:

>>> response.selector.xpath('//span/text()').extract()
[u'good']

Using selectors

To explain how to use the selectors we’ll use the Scrapy shell (which provides interactive testing) and an example page located in the Scrapy documentation server:

Here’s its HTML code:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

First, let’s open the shell:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Then, after the shell loads, you’ll have the response available as response shell variable, and its attached selector in response.selector attribute.

Since we’re dealing with HTML, the selector will automatically use an HTML parser.

So, by looking at the HTML code of that page, let’s construct an XPath for selecting the text inside the title tag:

>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]

Querying responses using XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

As you can see, .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used for quickly selecting nested data:

>>> response.css('img').xpath('@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

To actually extract the textual data, you must call the selector .extract() method, as follows:

>>> response.xpath('//title/text()').extract()
[u'Example website']

If you want to extract only first matched element, you can call the selector .extract_first()

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '

It returns None if no element was found:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first() is None
True

A default return value can be provided as an argument, to be used instead of None:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
'not-found'

Notice that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:

>>> response.css('title::text').extract()
[u'Example website']

Now we’re going to get the base URL and some image links:

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']
>>> response.css('base::attr(href)').extract()
[u'http://example.com/']
>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']
>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']
>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']
>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods for those selectors too. Here’s an example:

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print 'Link number %d points to url %s and image %s' % args
Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

Using selectors with regular expressions

Selector also has a .re() method for extracting data using regular expressions. However, unlike using .xpath() or .css() methods, .re() returns a list of unicode strings. So you can’t construct nested .re() calls.

Here’s an example used to extract image names from the HTML code above:

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']

There’s an additional helper reciprocating .extract_first() for .re(), named .re_first(). Use it to extract just the first matching string:

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
u'My image 1'

Working with relative XPaths

Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you’re calling it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:

>>> divs = response.xpath('//div') 

At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print p.extract()

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print p.extract()

Another common case would be to extract all direct <p> children:

>>> for p in divs.xpath('p'):
...     print p.extract()

For more details about relative XPaths see the Location Paths section in the XPath specification.

Variables in XPath expressions

XPath allows you to reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world where you replace some arguments in your queries with placeholders like ?, which are then substituted with values passed with the query.

Here’s an example to match an element based on its “id” attribute value, without hard-coding it (that was shown previously):

>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()
u'Name: My image 1 '

Here’s another example, to find the “id” attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).extract_first()
u'images'

All variable references must have a binding value when calling .xpath() (otherwise you’ll get a ValueError: XPath error: exception). This is done by passing as many named arguments as necessary.

parsel, the library powering Scrapy selectors, has more details and examples on XPath variables.

Using EXSLT extensions

Being built atop lxml, Scrapy selectors also support some EXSLT extensions and come with these pre-registered namespaces to use in XPath expressions:

prefix namespace usage
re http://exslt.org/regular-expressions regular expressions
set http://exslt.org/sets set manipulation

Regular expressions

The test() function, for example, can prove quite useful when XPath’s starts-with() or contains() are not sufficient.

Example selecting links in list item with a “class” attribute ending with a digit:

>>> from scrapy import Selector
>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
>>> sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']
>>>

Warning

C library libxslt doesn’t natively support EXSLT regular expressions so lxml‘s implementation uses hooks to Python’s re module. Thus, using regexp functions in your XPath expressions may add a small performance penalty.

Set operations

These can be handy for excluding parts of a document tree before extracting text elements for example.

Example extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and corresponding itemprops:

>>> doc = """ ... <div itemscope itemtype="http://schema.org/Product"> ...  <span itemprop="name">Kenmore White 17" Microwave</span> ...  <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' /> ...  <div itemprop="aggregateRating" ...  itemscope itemtype="http://schema.org/AggregateRating"> ...  Rated <span itemprop="ratingValue">3.5</span>/5 ...  based on <span itemprop="reviewCount">11</span> customer reviews ...  </div> ... ...  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer"> ...  <span itemprop="price">$55.00</span> ...  <link itemprop="availability" href="http://schema.org/InStock" />In stock ...  </div> ... ...  Product description: ...  <span itemprop="description">0.7 cubic feet countertop microwave. ...  Has six preset cooking categories and convenience features like ...  Add-A-Minute and Child Lock.</span> ... ...  Customer reviews: ... ...  <div itemprop="review" itemscope itemtype="http://schema.org/Review"> ...  <span itemprop="name">Not a happy camper</span> - ...  by <span itemprop="author">Ellie</span>, ...  <meta itemprop="datePublished" content="2011-04-01">April 1, 2011 ...  <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating"> ...  <meta itemprop="worstRating" content = "1"> ...  <span itemprop="ratingValue">1</span>/ ...  <span itemprop="bestRating">5</span>stars ...  </div> ...  <span itemprop="description">The lamp burned out and now I have to replace ...  it. </span> ...  </div> ... ...  <div itemprop="review" itemscope itemtype="http://schema.org/Review"> ...  <span itemprop="name">Value purchase</span> - ...  by <span itemprop="author">Lucas</span>, ...  <meta itemprop="datePublished" content="2011-03-25">March 25, 2011 ...  <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating"> ...  <meta itemprop="worstRating" content = "1"/> ...  <span itemprop="ratingValue">4</span>/ ...  <span itemprop="bestRating">5</span>stars ...  </div> ...  <span itemprop="description">Great microwave for the price. It is small and ...  fits in my apartment.</span> ...  </div> ...  ... ... </div> ... """ >>> sel = Selector(text=doc, type="html") >>> for scope in sel.xpath('//div[@itemscope]'): ... print "current scope:", scope.xpath('@itemtype').extract() ... props = scope.xpath(''' ...  set:difference(./descendant::*/@itemprop, ...  .//*[@itemscope]/*/@itemprop)''') ... print " properties:", props.extract() ... print current scope: [u'http://schema.org/Product']  properties: [u'name', u'aggregateRating', u'offers', u'description', u'review', u'review'] current scope: [u'http://schema.org/AggregateRating']  properties: [u'ratingValue', u'reviewCount'] current scope: [u'http://schema.org/Offer']  properties: [u'price', u'availability'] current scope: [u'http://schema.org/Review']  properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description'] current scope: [u'http://schema.org/Rating']  properties: [u'worstRating', u'ratingValue', u'bestRating'] current scope: [u'http://schema.org/Review']  properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description'] current scope: [u'http://schema.org/Rating']  properties: [u'worstRating', u'ratingValue', u'bestRating'] >>> 

Here we first iterate over itemscope elements, and for each one, we look for all itemprops elements and exclude those that are themselves inside another itemscope.

Some XPath tips

Here are some tips that you may find useful when using XPath with Scrapy selectors, based on this post from ScrapingHub’s blog. If you are not much familiar with XPath yet, you may want to take a look first at this XPath tutorial.

Using text nodes in a condition

When you need to use the text content as argument to an XPath string function, avoid using .//text() and use just . instead.

This is because the expression .//text() yields a collection of text elements – a node-set. And when a node-set is converted to a string, which happens when it is passed as argument to a string function like contains() or starts-with(), it results in the text for the first element only.

Example:

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

Converting a node-set to string:

>>> sel.xpath('//a//text()').extract()  # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> sel.xpath("string(//a[1]//text())").extract()  # convert it to string
[u'Click here to go to the ']

A node converted to a string, however, puts together the text of itself plus of all its descendants:

>>> sel.xpath("//a[1]").extract()  # select the first node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").extract()  # convert it to string
[u'Click here to go to the Next Page']

So, using the .//text() node-set won’t select anything in this case:

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
[]

But using the . to mean the node, works:

>>> sel.xpath("//a[contains(., 'Next Page')]").extract()
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

Example:

>>> from scrapy import Selector
>>> sel = Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()

This gets all first <li> elements under whatever it is its parent:

>>> xp("//li[1]")
[u'<li>1</li>', u'<li>4</li>']

And this gets the first <li> element in the whole document:

>>> xp("(//li)[1]")
[u'<li>1</li>']

This gets all first <li> elements under an <ul> parent:

>>> xp("//ul/li[1]")
[u'<li>1</li>', u'<li>4</li>']

And this gets the first <li> element under an <ul> parent in the whole document:

>>> xp("(//ul/li)[1]")
[u'<li>1</li>']

When querying by class, consider using CSS

Because an element can contain multiple CSS classes, the XPath way to select elements by class is the rather verbose:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')] 

If you use @class='someclass' you may end up missing elements that have other classes, and if you just use contains(@class, 'someclass') to make up for that, you may end up with more elements than you want, if they have a different class name that shares the string someclass.

As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can just select by class using CSS and then switch to XPath when needed:

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').extract()
[u'2014-07-23 19:00']

This is cleaner than using the verbose XPath trick shown above. Just remember to use the . in the XPath expressions that will follow.

Built-in Selectors reference

Selector objects

class scrapy.selector.Selector(response=None, text=None, type=None)

An instance of Selector is a wrapper over response to select certain parts of its content.

response is an HtmlResponse or an XmlResponse object that will be used for selecting and extracting data.

text is a unicode string or utf-8 encoded text for cases when a response isn’t available. Using text and response together is undefined behavior.

type defines the selector type; it can be "html", "xml" or None (default).

If type is None, the selector automatically chooses the best type based on the response type (see below), or defaults to "html" in case it is used together with text.

If type is None and a response is passed, the selector type is inferred from the response type as follows: "html" for HtmlResponse, "xml" for XmlResponse, and "html" for anything else.

Otherwise, if type is set, the selector type will be forced and no detection will occur.

xpath (query)

Find nodes matching the xpath query and return the result as a SelectorList instance with all elements flattened. List elements implement Selector interface too.

query is a string containing the XPATH query to apply.

Note

For convenience, this method can be called as response.xpath()

css (query)

Apply the given CSS selector and return a SelectorList instance.

query is a string containing the CSS selector to apply.

In the background, CSS queries are translated into XPath queries using cssselect library and run .xpath() method.
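
You can watch that translation happen by calling cssselect directly; a sketch (the exact XPath string produced may differ between cssselect versions, and Scrapy's ::text / ::attr() extensions are handled by parsel on top of this):

from cssselect import HTMLTranslator

# The kind of XPath a plain CSS query is rewritten into before it is run.
print(HTMLTranslator().css_to_xpath('div.quote > span.text'))
# e.g. descendant-or-self::div[... contains(concat(' ', normalize-space(@class), ' '), ' quote ')]/span[...]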

Note

For convenience this method can be called as response.css()

extract ()

Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.

re (regex)

Apply the given regex and return a list of unicode strings with the matches.

regex can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex)

Note

Note that re() and re_first() both decode HTML entities (except &lt; and &amp;).
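
A small sketch of both call styles (the price snippet is made up):

import re
from scrapy.selector import Selector

sel = Selector(text='<span class="price">$55.00</span>')

# Passing a pattern string (compiled internally):
print(sel.xpath('//span/text()').re(r'\$([\d.]+)'))   # ['55.00']

# Passing an already-compiled pattern works the same way:
price_re = re.compile(r'\$([\d.]+)')
print(sel.xpath('//span/text()').re(price_re))        # ['55.00']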

register_namespace (prefix, uri)

Register the given namespace to be used in this Selector. Without registering namespaces you can’t select or extract data from non-standard namespaces. See examples below.

remove_namespaces ()

Remove all namespaces, allowing to traverse the document using namespace-less xpaths. See example below.

__nonzero__ ()

Returns True if there is any real content selected or False otherwise. In other words, the boolean value of a Selector is given by the contents it selects.

SelectorList objects

class scrapy.selector.SelectorList

The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.

xpath (query)

Call the .xpath() method for each element in this list and return their results flattened as another SelectorList.

query is the same argument as the one in Selector.xpath()

css (query)

Call the .css() method for each element in this list and return their results flattened as another SelectorList.

query is the same argument as the one in Selector.css()

extract ()

Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.

re (regex)

Call the .re() method for each element in this list and return their results flattened, as a list of unicode strings.

Selector examples on HTML response

Here’s a couple of Selector examples to illustrate several concepts. In all cases, we assume there is already a Selector instantiated with a HtmlResponse object like this:

sel = Selector(html_response) 
  1. Select all <h1> elements from an HTML response body, returning a list of Selector objects (ie. a SelectorList object):

    sel.xpath("//h1") 
  2. Extract the text of all <h1> elements from an HTML response body, returning a list of unicode strings:

    sel.xpath("//h1").extract()         # this includes the h1 tag
    sel.xpath("//h1/text()").extract()  # this excludes the h1 tag
  3. Iterate over all <p> tags and print their class attribute:

    for node in sel.xpath("//p"):
        print node.xpath("@class").extract()

Selector examples on XML response

Here’s a couple of examples to illustrate several concepts. In both cases we assume there is already a Selector instantiated with an XmlResponse object like this:

sel = Selector(xml_response) 
  1. Select all <product> elements from an XML response body, returning a list of Selector objects (ie. a SelectorList object):

    sel.xpath("//product") 
  2. Extract all prices from a Google Base XML feed which requires registering a namespace:

    sel.register_namespace("g", "http://base.google.com/ns/1.0")
    sel.xpath("//g:price").extract()

Removing namespaces

When dealing with scraping projects, it is often quite convenient to get rid of namespaces altogether and just work with element names, to write more simple/convenient XPaths. You can use the Selector.remove_namespaces() method for that.

Let’s show an example that illustrates this with GitHub blog atom feed.

First, we open the shell with the url we want to scrape:

$ scrapy shell https://github.com/blog.atom

Once in the shell we can try selecting all <link> objects and see that it doesn’t work (because the Atom XML namespace is obfuscating those nodes):

>>> response.xpath("//link")
[]

But once we call the Selector.remove_namespaces() method, all nodes can be accessed directly by their names:

>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
 <Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
 ...

If you wonder why the namespace removal procedure isn’t always called by default instead of having to call it manually, this is because of two reasons, which, in order of relevance, are:

  1. Removing namespaces requires to iterate and modify all nodes in the document, which is a reasonably expensive operation to perform for all documents crawled by Scrapy
  2. There could be some cases where using namespaces is actually required, in case some element names clash between namespaces. These cases are very rare though.

https://docs.scrapy.org/en/latest/topics/selectors.html

Scrapy learning series (1): querying page elements with CSS selectors and XPath selectors

This article mainly walks through creating a simple spider, and along the way introduces the two ways of selecting page elements (CSS selectors and XPath selectors).

Step 1: create the spider project

Open a command line and run the following command:

scrapy startproject homelink_selling_index

The generated project structure is as follows:

│  scrapy.cfg

│

└─lianjia_shub

    │  items.py

    │  pipelines.py

    │  settings.py

    │  __init__.py

    │

    └─spiders

            __init__.py

Step 2: define the spider (homelink_selling_index)

The page elements to be scraped are shown in the screenshot of the original article (not reproduced here).

Import the namespace:

import scrapy

Define the spider:

class homelink_selling_index_spider(scrapy.Spider):

    # Define the spider's name; it is used when invoking the spider for crawling:
    #   scrapy crawl <spider.name>
    name = "homelink_selling_index"
    # If no other URLs are specified, the spider starts crawling from the links in start_urls
    start_urls = ["http://bj.lianjia.com/ershoufang/pg1tt2/"]

    # parse is scrapy.Spider's default entry point for handling HTTP responses
    # parse processes every link in start_urls one by one
    def parse(self, response):
        # Get the list of houses on the current page
        #house_lis = response.css('.house-lst .info-panel')
        house_lis = response.xpath('//ul[@class="house-lst"]/li/div[@class="info-panel"]')
        # Write the results to a file (on the command line the house titles would show up garbled due to encoding)
        with open("homelink.log", "wb") as f:
            ## Using CSS selectors
            #average_price = response.css('.secondcon.fl li:nth-child(1)').css('.botline a::text').extract_first()
            #f.write("Average Price: " + str(average_price) + "\r\n")
            #yesterday_count = response.css('.secondcon.fl li:last-child').css('.botline strong::text').extract_first()
            #f.write("Yesterday Count: " + str(yesterday_count) + "\r\n")
            #for house_li in house_lis:
            #    link = house_li.css('a::attr("href")').extract_first()             # get the link to the house
            #    title = house_li.css('a::text').extract_first()                    # get the house title
            #    price = house_li.css('.price .num::text').extract_first()          # get the house price

            # Using XPath selectors
            average_price = response.xpath('//div[@class="secondcon fl"]//li[1]/span[@class="botline"]//a/text()').extract_first()
            f.write("Average Price: " + str(average_price) + "\r\n")
            yesterday_count = response.xpath('//div[@class="secondcon fl"]//li[last()]//span[@class="botline"]/strong/text()').extract_first()
            f.write("Yesterday Count: " + str(yesterday_count) + "\r\n")
            for house_li in house_lis:
                link = house_li.xpath('.//a/@href').extract_first()                 # note the XPath syntax: the leading "." is required, otherwise the query starts from the document root instead of the current node
                title = house_li.xpath('.//a/text()').extract_first()
                price = house_li.xpath('.//div[@class="price"]/span[@class="num"]/text()').extract_first()
                f.write("Title: {0}\tPrice:{1}\r\n\tLink: {2}\r\n".format(title.encode('utf-8'), price, link))

Step 3: check the results

Average Price: 44341
Yesterday Count: 33216
Title: 万科假日风景全明格局 南北精装三居 满五惟一	Price:660
	Link: http://bj.lianjia.com/ershoufang/xxx.html
Title: 南北通透精装三居 免税带车位 先后对花园 有钥匙	Price:910
	Link: http://bj.lianjia.com/ershoufang/xxx.html
Title: 西直门 时代之光名苑 西南四居 满五惟一 诚心出售	Price:1200
	Link: http://bj.lianjia.com/ershoufang/xxx.html
......

 

Closing remarks:

With the three steps above, we can already perform simple scraping of page elements. However, we haven't yet made real use of the many convenient and powerful features Scrapy provides, such as ItemLoader and pipelines. These will be covered in follow-up articles.

https://www.cnblogs.com/silverbullet11/p/scrapy_series_1.html
