Python Web Crawling: Downloading Files with Scrapy

Date: 2019-11-16


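The previous post introduced ImagesPipeline for downloading images. Scrapy also provides FilesPipeline for downloading files. As with ImagesPipeline, you only need to put the URLs of the files to download into a special item field; the pipeline then downloads them to local storage automatically and records information about the results in a second special item field, so you can check it in the exported output. The workflow is as follows: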

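1 In a spider, you scrape an item and put the URLs of the files to download into its file_urls field.

2 The item is returned from the spider and enters the item pipeline.

3 When the item reaches FilesPipeline, the URLs in file_urls are scheduled for download by Scrapy's standard scheduler and downloader (so scheduler and downloader middlewares are reused), but at a higher priority, so they are processed before other pages are scraped. The item stays "locked" at this pipeline stage until the files have finished downloading (or have failed for some reason).

4 When the downloads finish, another field, files, is populated with the results. It holds a list of dicts with information about each downloaded file, such as the download path, the original URL (taken from file_urls), and the file checksum. The files list keeps the same order as file_urls. If a file fails to download, an error is logged and the file does not appear in files.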
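The relationship between file_urls and files in step 4 can be sketched in plain Python, without Scrapy. This is only an illustration under stated assumptions: fill_files and the bodies mapping are hypothetical names, the example.com URLs are placeholders, and the real result entries also carry a path key that is left out here.

```python
import hashlib


def fill_files(item, bodies):
    """Populate item['files'] from item['file_urls'] (illustrative only).

    bodies is a hypothetical dict mapping URL -> downloaded bytes;
    a URL that is missing from it stands for a failed download.
    """
    item["files"] = [
        {"url": url, "checksum": hashlib.md5(bodies[url]).hexdigest()}
        for url in item["file_urls"]  # iterate in file_urls order
        if url in bodies              # failed downloads are skipped
    ]
    return item


item = {
    "file_urls": [
        "http://example.com/a.py",
        "http://example.com/b.py",  # pretend this download fails
        "http://example.com/c.py",
    ]
}
bodies = {
    "http://example.com/c.py": b"print('c')",
    "http://example.com/a.py": b"print('a')",
}
result = fill_files(item, bodies)
print([f["url"] for f in result["files"]])
```

The failed URL is simply absent from files, and the remaining entries keep the order they had in file_urls.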

Here is how to use it:

Step 1: enable FilesPipeline in the settings file settings.py

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

Step 2: set the download directory in settings.py

FILES_STORE = r'E:\scrapy_project\file_download\file'  # raw string, so that '\f' is not read as a form feed
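One caveat about writing a Windows path like the one above as a Python string: in an ordinary (non-raw) string literal, the sequence \f is a form-feed escape, so a path segment such as \file is silently corrupted. The short check below illustrates this with a shortened, made-up path.

```python
plain = 'download\file'    # '\f' is parsed as a form-feed character
raw = r'download\file'     # a raw string keeps the backslash
forward = 'download/file'  # forward slashes also work on Windows

print(repr(plain))
print(len(plain), len(raw))
```

Either a raw string or forward slashes avoids the problem.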

Step 3: define the two fields file_urls and files in items.py

import scrapy

class FileDownloadItem(scrapy.Item):
    # URLs to download; FilesPipeline reads this field
    file_urls = scrapy.Field()
    # download results; FilesPipeline fills this field
    files = scrapy.Field()

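With these three settings in place, we can look at the actual download. We will download the example code from the matplotlib website: http://matplotlib.org/examples/index.html

First, look at the page structure.

Clicking animate_decay opens its download page. The links to pages like animate_decay all sit under the <div class="toctree-wrapper compound"> element.

Index links such as "animation Examples", however, are not needed.

Based on this, we can first write the code that collects the page links: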

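def parse(self, response):
    # Requires: from scrapy import Request
    #           from scrapy.linkextractors import LinkExtractor
    # Follow only links inside the examples container and skip the index pages.
    le = LinkExtractor(restrict_xpaths='//*[@id="matplotlib-examples"]/div', deny=r'/index\.html$')
    for link in le.extract_links(response):
        yield Request(link.url, callback=self.parse_link)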

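restrict_xpaths limits link extraction to the given element, and deny filters out the index links mentioned above. What remains are the links to the individual example pages, each of which contains a file download link.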
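The effect of the deny pattern can be checked in isolation with the re module. As far as I understand, LinkExtractor applies deny patterns as regular-expression searches against each absolute URL, so the sketch below mimics that; the first two URLs are assumed examples of what the index page might link to.

```python
import re

# Same pattern as the deny argument, with the dot escaped
deny = re.compile(r'/index\.html$')

candidates = [
    'http://matplotlib.org/examples/animation/index.html',          # category index
    'http://matplotlib.org/examples/animation/animate_decay.html',  # example page
    'http://matplotlib.org/examples/api/agg_oo.html',               # example page
]

# Keep only the URLs that the deny pattern does not match
kept = [url for url in candidates if not deny.search(url)]
for url in kept:
    print(url)
```

Only the two example pages survive; the category index ending in /index.html is dropped.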

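Next, open a download page. Clicking source_code there downloads the file.

Besides pages that contain only a code link, there is a second kind of page that contains both a code link and image links.

Judging from the actual file download links, there are two forms: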

http://matplotlib.org/examples/pyplots/whats_new_99_mplot3d.py

http://matplotlib.org/mpl_examples/statistics/boxplot_demo.py

The code that builds the download link for both forms is as follows:

# Requires: import re
#           from urllib.parse import urlparse, urlunparse  (Python 2.7: from urlparse import urlparse, urlunparse)
def parse_link(self, response):
    pattern = re.compile(r'href=(.*\.py)')
    div = response.xpath('/html/body/div[4]/div[1]/div/div')
    # Note: '//p' is an absolute XPath, so this takes the first <p> of the whole page.
    p = div.xpath('//p')[0].extract()
    link = re.findall(pattern, p)[0]
    if '/' in link:
        # Page with both files and images; builds e.g. http://matplotlib.org/examples/pyplots/whats_new_99_mplot3d.py
        parts = link.split('/')
        href = 'http://matplotlib.org/' + parts[2] + '/' + parts[3] + '/' + parts[4]
    else:
        # Page with files only; builds e.g. http://matplotlib.org/mpl_examples/statistics/boxplot_demo.py
        link = link.replace('"', '')
        parsed = urlparse(response.url)
        temp = parsed.path.split('/')
        path = '/' + temp[1] + '/' + temp[2] + '/' + link
        href = urlunparse((parsed.scheme, parsed.netloc, path, '', '', ''))
    file = FileDownloadItem()
    # file_urls must be a list, even for a single URL
    file['file_urls'] = [href]
    return file
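As an alternative to assembling the URL by hand, both link forms can be resolved against the page URL with urljoin from the standard library. The sketch below is not the approach used in this post. The resolve helper is a hypothetical name, and the href values are assumptions about what the regular expression captures (including the leading quote character).

```python
from urllib.parse import urljoin


def resolve(page_url, href):
    """Resolve a captured href against the URL of the page it was found on."""
    return urljoin(page_url, href.strip('"'))


# Bare file name, relative to the current page
print(resolve('http://matplotlib.org/examples/api/agg_oo.html',
              '"agg_oo.py'))

# Relative path that climbs out of the examples directory
print(resolve('http://matplotlib.org/examples/pyplots/whats_new_99_mplot3d.html',
              '"../../mpl_examples/pyplots/whats_new_99_mplot3d.py'))
```

Inside a spider callback, response.urljoin(href) performs the same resolution against response.url.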

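The first version of this code assigned the URL as a plain string, file['file_urls']=href, and running it failed with the following error: ValueError: Missing scheme in request url: h.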

2017-11-21 22:29:53 [scrapy] ERROR: Error processing {'file_urls': u'http://matplotlib.org/examples/api/agg_oo.htmlagg_oo.py'}
Traceback (most recent call last):
  File "E:\python2.7.11\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "E:\python2.7.11\lib\site-packages\scrapy\pipelines\media.py", line 44, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "E:\python2.7.11\lib\site-packages\scrapy\pipelines\files.py", line 365, in get_media_requests
    return [Request(x) for x in item.get(self.files_urls_field, [])]
  File "E:\python2.7.11\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "E:\python2.7.11\lib\site-packages\scrapy\http\request\__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h

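The error says that the URL is missing its scheme. A URL generally has the form scheme://host:port/path?query. Here the pipeline did not find http as the scheme, only a single h. Yet according to the log we clearly generated a complete URL, so why is the scheme reported as missing? The reason is that the pipeline enabled by the following setting expects file_urls to be a list of URLs: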

ITEM_PIPELINES = {
#    'file_download.pipelines.SomePipeline': 300,
    'scrapy.pipelines.files.FilesPipeline':1,
}

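Our code, however, assigned a string to the item: file['file_urls']=href. When the pipeline iterates over that value as if it were a list, it gets single characters, and the first one is just h. The assignment therefore has to use a list: change it to file['file_urls']=[href].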
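The failure is easy to reproduce without Scrapy. The helper below, media_urls, is a hypothetical stand-in that iterates the field the same way as the list comprehension in the traceback above.

```python
def media_urls(item):
    # Same shape as the comprehension in get_media_requests:
    # iterate over whatever is stored in file_urls
    return [x for x in item.get('file_urls', [])]


url = 'http://matplotlib.org/examples/api/agg_oo.py'

wrong = media_urls({'file_urls': url})    # string: iterated character by character
right = media_urls({'file_urls': [url]})  # list: iterated URL by URL

print(wrong[0])
print(right[0])
```

With the string, the first "URL" handed to Request is the single character h, which is exactly what the error message reports.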

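The program now runs successfully. Under the download directory, a new folder named full has been created, and all downloaded files are saved in it. The file names, however, look like 00f4d142b951f072.py. They are derived from a hash of the URL. This naming scheme keeps files with the same name from colliding, but it is not readable at all, so we need to define the file names ourselves.

In FilesPipeline, the method that determines where a downloaded file is stored is file_path. Its core code is shown below.

The return value is the file path. As the code shows, every file is placed under the full folder.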

media_guid = hashlib.sha1(to_bytes(url)).hexdigest()  
media_ext = os.path.splitext(url)[1]

return 'full/%s%s' % (media_guid, media_ext)

media_guid is the SHA-1 hash of the URL and is used as the file name.

media_ext is the file extension, here .py.
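The default naming can be reproduced with the standard library alone. The function below is a simplified re-creation for illustration, not Scrapy's actual method: default_file_path is a hypothetical name, and url.encode('utf-8') stands in for Scrapy's to_bytes helper.

```python
import hashlib
import os


def default_file_path(url):
    """Mimic the default naming: full/<sha1 of url><extension>."""
    media_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    media_ext = os.path.splitext(url)[1]
    return 'full/%s%s' % (media_guid, media_ext)


path = default_file_path('http://matplotlib.org/examples/widgets/span_selector.py')
print(path)
```

Two different URLs that end in the same file name hash to different values, which is why this scheme avoids collisions.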

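Next we override file_path to generate our own file names.

Many URLs have the form shown below: widgets is the category, and the .py files that follow belong to it. We want to store all files of one category in the same folder.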

http://matplotlib.org/examples/widgets/span_selector.py

http://matplotlib.org/examples/widgets/rectangle_selector.py

http://matplotlib.org/examples/widgets/slider_demo.py

http://matplotlib.org/examples/widgets/radio_buttons.py

http://matplotlib.org/examples/widgets/menu.py

http://matplotlib.org/examples/widgets/multicursor.py

http://matplotlib.org/examples/widgets/lasso_selector_demo.py

Take the URL http://matplotlib.org/examples/widgets/span_selector.py as an example.

urlparse(request.url).path returns /examples/widgets/span_selector.py

dirname(path) returns /examples/widgets

basename(dirname(path)) returns widgets

join(basename(dirname(path)),basename(path)) returns widgets\span_selector.py (on Windows)
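These steps can be verified in a few lines. The sketch uses posixpath rather than os.path, which is my own choice so that the result uses forward slashes on every platform; the intermediate variable names are illustrative.

```python
from posixpath import basename, dirname, join
from urllib.parse import urlparse

url = 'http://matplotlib.org/examples/widgets/span_selector.py'

path = urlparse(url).path            # path component of the URL
category = basename(dirname(path))   # last directory in the path
filename = basename(path)            # file name with extension
relative = join(category, filename)  # storage path relative to the download directory

print(path)
print(category)
print(relative)
```

This prints /examples/widgets/span_selector.py, then widgets, then widgets/span_selector.py.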

Rewrite pipelines.py as follows:

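from os.path import basename, dirname
from urllib.parse import urlparse  # the original Python 2.7 code used: from urlparse import urlparse

from scrapy.pipelines.files import FilesPipeline

class FileDownloadPipeline(FilesPipeline):
    # item is passed as a keyword argument by newer Scrapy versions; it is unused here
    def file_path(self, request, response=None, info=None, *, item=None):
        # e.g. /examples/widgets/span_selector.py -> widgets/span_selector.py
        path = urlparse(request.url).path
        return '%s/%s' % (basename(dirname(path)), basename(path))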

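Running the program again, the file names are still hashes. The reason is that settings.py still contains 'scrapy.pipelines.files.FilesPipeline':1

which uses the stock FilesPipeline directly. Now that we have subclassed FilesPipeline, this setting has to be changed to point to FileDownloadPipeline: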

ITEM_PIPELINES = {
#    'file_download.pipelines.SomePipeline': 300,
#    'scrapy.pipelines.files.FilesPipeline':1,
    'file_download.pipelines.FileDownloadPipeline':1,
}

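Run it once more and the result is as expected: files of the same category are grouped into the same folder.

The file names are now far more readable than the hashed ones.

The packaged matplotlib files can be downloaded from the link below if you need them: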

https://files.cnblogs.com/files/zhanghongfeng/matplotlib.rar

The Scrapy project code is here:

https://files.cnblogs.com/files/zhanghongfeng/file_download.rar