Scraping Web Images with Python Multiprocessing

Date: 2021-08-14


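1. A Brief Introduction to Web Crawlers

When we open a web page and find useful information on it, the manual approach is to record that information by hand, one tedious step at a time. A crawler instead applies predefined rules and methods to collect the information from the page automatically. In short, it frees up your hands.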

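2. Scraping Images

Today's goal is to scrape the images from this website (wallhaven.cc, see section 4). Most of its images are high-resolution wallpapers, many of them of beautiful girls, so I want to download them and use them as my desktop background.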

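2.1 Overview

Given a web page, we first parse it, locate the tags that hold the images and the tag that holds the page count, extract the URLs from them, and then download the images. (The original post illustrates this flow with a figure.)

This relies on a few Python modules:

bs4: parses the HTML so that the relevant URLs can be extracted
requests: fetches the HTML
multiprocessing: uses multiple processes to speed up image downloads

Only bs4 is introduced briefly below.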

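3. Using the bs4 Module

Official description:
> Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
> Beautiful Soup transforms a complex HTML document into a complex tree of Python objects, and every object is one of four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

This article uses the first three: Tag, NavigableString, and BeautifulSoup.

In short, it makes parsing HTML much easier.

The following example illustrates this: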

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

3.1 Creating a BeautifulSoup Object

from bs4 import BeautifulSoup
# Pass in the HTML snippet above (html_doc)
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

3.1.1 Getting the Title

soup = BeautifulSoup(html_doc, 'html.parser')
soup.title

Output:

<title>The Dormouse's story</title>

3.1.2 Getting a Specific Tag

soup.p

Output:

<p class="title"><b>The Dormouse's story</b></p>

3.1.3 Finding All Occurrences of a Tag

soup.find_all('a')

Note that find_all finds every tag with the given name. Here it finds all the <a> tags and returns them as a list.

3.1.4 Getting an Attribute of a Tag

soup.p['class']

Output:

['title']

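3.2 Using Tag Objects

A Tag object corresponds to a tag in the original XML or HTML document and can be accessed directly by its name. What does that mean? Printing all the attributes of a tag makes it clear.

Source: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>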

soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t.attrs)

Output:

{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

As you can see, the tag t has the attributes href, class, and id. Let's print each of them:

soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t['href'])
print(t['class'])
print(t['id'])

Output:

http://example.com/elsie
['sister']
link1

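class is printed as a list because class is a multi-valued attribute.

Tag objects also have two other important attributes, name and string. Let's look at the output first to see what they do: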

soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t.name)
print(t.string)

Output:

a
Elsie

As shown, name is the name of the tag, and string is the string the tag contains.
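One caveat about string may be worth a short sketch. The snippet below uses a small hypothetical HTML fragment (not the article's html_doc) and illustrates that .string is None when a tag has more than one child, while get_text() collects all nested text.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration
html = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a  # first <a> tag
para = soup.p  # the <p> tag, which has several children (text and tags)

link_name = link.name        # the tag name
link_text = link.string      # the <a> tag has one string child, so .string returns it
para_string = para.string    # None: the <p> tag has more than one child
para_text = para.get_text()  # concatenates every nested string

print(link_name, link_text, para_string, para_text)
```

When a tag may contain nested tags, get_text() is usually the safer choice for reading its text.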

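3.3 Searching the Document Tree

Searching the document tree matters here because this article finds the content it needs by searching for specific tags when scraping images.
The most commonly used search method is `find_all`. It accepts a string, a custom regular expression, or a list; each is covered below.

3.3.1 Finding All Tags with a Given Name

Source: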

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('a'))

This finds all the <a> tags and returns them as a list:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Since the result is a list, we can iterate over it:

soup = BeautifulSoup(html_doc, 'html.parser')
for r in soup.find_all('a'):
    print(r.string)

This prints the string inside each <a></a> tag:

Elsie
Lacie
Tillie

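3.3.2 Searching with a Custom Regular Expression

import re

soup = BeautifulSoup(html_doc, 'html.parser')
for r in soup.find_all(id=re.compile(r'link(\d+)')):
    print(r)

The regular expression specifies what to match: an id consisting of link followed by digits. The three <a> tags satisfy it: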

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

3.3.3 Passing a List to Search for Several Tags at Once

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(['a','p']))

Output:

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]

That produces a lot of output. Can we add a filter?

3.3.4 Using a Filter

We need to modify the example HTML above as follows:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<a href="http://example.com/tillie" class="sister" id="link4">Tillie</a> 
<p class="story">...</p>
"""

A new <a> tag with id=link4 has been added to make later debugging easier. Next, define a filter, following the official documentation:

soup = BeautifulSoup(html_doc, 'html.parser')

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))

Output:

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>, <p class="story">...</p>]

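The result does not contain the <a> tag with id=link4 that we just added (it has an id attribute), so the filter works.

3.3.5 Using Keyword Arguments

Searching by tag name alone is often not precise enough. If you know one of the tag's attributes, you can search by that attribute instead: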

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(id='link4'))

Output:

[<a class="sister" href="http://example.com/tillie" id="link4">Tillie</a>]

The result is the <a> tag we just added.

To find every tag that has an id attribute, use find_all(id=True).
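As a minimal sketch of keyword-argument searches, the snippet below assumes a small hypothetical HTML fragment and shows id=True next to a search that combines a tag name with two attribute filters.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration
html = """
<p class="title">Title</p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/other">No id</a>
<a href="http://example.com/tillie" id="link4">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# id=True matches every tag that has an id attribute, whatever its value
with_id = soup.find_all(id=True)
ids = [tag["id"] for tag in with_id]

# Several filters can be combined; a tag must satisfy all of them
exact = soup.find_all("a", id="link4", href=re.compile("tillie"))

print(ids, exact)
```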

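3.3.6 Building a Dictionary Argument

Some attributes cannot be searched this way, for example data-* attributes, whose names are not valid Python keyword arguments. In that case, pass a dictionary through the attrs parameter to find tags with such attributes:

soup.find_all(attrs={"data-foo": "value to search for"})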
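The attrs search can be sketched as follows, assuming a hypothetical fragment with data-foo attributes. The second call passes True to match any value of the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration
html = '<div data-foo="value">foo!</div><div data-foo="other">bar</div><div>plain</div>'
soup = BeautifulSoup(html, "html.parser")

# Exact match on the attribute value
matches = soup.find_all(attrs={"data-foo": "value"})

# True matches any tag that carries the attribute at all
any_foo = soup.find_all(attrs={"data-foo": True})

print(matches, any_foo)
```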

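3.3.7 Searching by CSS Class

Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using class as a keyword argument causes a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class with the class_ parameter:

soup.find_all("a", class_="sister")

Like other keyword arguments, class_ accepts different kinds of filters: a string, a regular expression, a function, or True:

# Using a regular expression
soup.find_all(class_=re.compile("itl"))

# Using a custom filter function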
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)

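The last call returns the following (with the modified html_doc, which includes the link4 tag):

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <a class="sister" href="http://example.com/tillie" id="link4">Tillie</a>]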
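The other class_ filter types can be sketched in the same way. The fragment below is hypothetical; it includes a tag with two classes to show that a string filter matches when any one of the tag's classes matches.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration
html = '<p class="title">T</p><p class="body strikeout">B</p><a class="sister">S</a>'
soup = BeautifulSoup(html, "html.parser")

# A string matches if it equals any one of the tag's classes
by_string = [t.string for t in soup.find_all(class_="strikeout")]

# A regular expression is searched for in each class name ("title" contains "itl")
by_regex = [t.string for t in soup.find_all(class_=re.compile("itl"))]

# True matches every tag that has a class attribute
by_true = soup.find_all(class_=True)

print(by_string, by_regex, len(by_true))
```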

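That covers the basics of the bs4 module, which is enough to start scraping the images we want.

4. Scraping Web Images Step by Step

First, visit https://wallhaven.cc/ and search for some images, for example with the keywords sexy girl. The browser's address bar then shows https://wallhaven.cc/search?q=sexy girl&page=2. Trying a few other searches shows that the site's search-result URLs follow a pattern:

https://wallhaven.cc/search?q=<keyword>&<parameters>

With that known, we can use requests to fetch the page directly.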
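Because keywords can contain spaces, it may help to build the search URL programmatically instead of by hand. The sketch below uses only the standard library; build_search_url and BASE_SEARCH_URL are hypothetical names introduced for illustration, and the keyword "night city" is just an example.

```python
from urllib.parse import quote, urlencode

# Search endpoint taken from the URL pattern above
BASE_SEARCH_URL = "https://wallhaven.cc/search"


def build_search_url(keyword, page=1, **params):
    """Build a search URL; extra query parameters go in **params."""
    query = {"q": keyword, **params, "page": page}
    # quote_via=quote encodes spaces as %20 instead of '+'
    return BASE_SEARCH_URL + "?" + urlencode(query, quote_via=quote)


print(build_search_url("night city", page=2))
print(build_search_url("animals", page=3, sorting="favorites", order="desc"))
```

Encoding the query this way keeps the URL valid regardless of what the keyword contains.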

4.1 Parsing the Site's URL

I looked at the request headers in the browser's developer tools (F12), picked a few of them, and passed them to requests:

import requests

def request_client(url):
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'
    headers = {
        'user-agent': user_agent,
        'accept-ranges': 'bytes',
        'accept-language': 'zh-CN,zh;q=0.9'
    }
    req = requests.get(url, headers=headers)
    return req

print(request_client("https://wallhaven.cc/search?q=sexy%20girl").text)

The response is an HTML document. The following fragment is the part that contains an image URL:

<li>
    <figure class="thumb thumb-4y9pv7 thumb-sfw thumb-general" data-wallpaper-id="4y9pv7" style="width:300px;height:200px">
        <img alt="loading" class="lazyload" data-src="https://th.wallhaven.cc/small/4y/4y9pv7.jpg" src="" />
        <a class="preview" href="https://wallhaven.cc/w/4y9pv7" target="_blank">
        </a>
        <div class="thumb-info">
            <span class="wall-res">
                1920 x 1200
            </span>
            <a class="jsAnchor overlay-anchor wall-favs" data-href="https://wallhaven.cc/wallpaper/fav/4y9pv7">
                9
                <i class="fa fa-fw fa-star">
                </i>
            </a>
            <a class="jsAnchor thumb-tags-toggle tagged" data-href="https://wallhaven.cc/wallpaper/tags/4y9pv7" title="Tags">
                <i class="fas fa-fw fa-tags">
                </i>
            </a>
        </div>
    </figure>
</li>

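The image URL is in the data-src attribute, and the <img> tag has class=lazyload. We can use these two facts, together with a regular expression, to extract the image URL:

def get_img_url_list(soup):
    # Extract the URL and rewrite it into a link that can be downloaded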
    def get_url(tag):
        re_img = re.compile(r'data-src="(.+?\.jpg)"')
        url = re_img.findall(str(tag))[0]
        _, img_name = os.path.split(url)
        replace_content = {
            'th.wallhaven.cc': 'w.wallhaven.cc',
            '/small/': '/full/',
            img_name: 'wallhaven-' + img_name
        }
        for k, v in replace_content.items():
            url = url.replace(k, v)
        return url
    img_url_list = []
    for tag in soup.find_all("img", class_="lazyload"):
        img_url_list.append(get_url(tag))
    return img_url_list

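This step returns a list of image URLs. The code also rewrites each URL, because the URL found in the HTML is not the real image address. Opening an image and inspecting it with the browser's F12 tools shows that the address becomes:

# Actual download URL
https://w.wallhaven.cc/full/4o/wallhaven-4ozvv9.jpg
# URL in the HTML
https://th.wallhaven.cc/small/4o/4ozvv9.jpg

So the code makes the following replacements: th.wallhaven.cc ---> w.wallhaven.cc, small ---> full, 4ozvv9.jpg ---> wallhaven-4ozvv9.jpg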
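The same rewrite can also be sketched by splitting the URL into its components, so that each part is changed exactly once. This is an alternative to the chained str.replace calls, not the article's own code; thumb_to_full_url is a hypothetical name.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def thumb_to_full_url(thumb_url):
    """Rewrite a thumbnail URL into the full-size download URL."""
    parts = urlsplit(thumb_url)
    # e.g. "/small/4o/4ozvv9.jpg" -> ("/small/4o", "4ozvv9.jpg")
    directory, filename = posixpath.split(parts.path)
    # Only the leading "/small/" directory becomes "/full/"
    if directory.startswith("/small/"):
        directory = "/full/" + directory[len("/small/"):]
    new_path = posixpath.join(directory, "wallhaven-" + filename)
    # Thumbnails are served from th.wallhaven.cc, full images from w.wallhaven.cc
    host = "w.wallhaven.cc" if parts.netloc == "th.wallhaven.cc" else parts.netloc
    return urlunsplit((parts.scheme, host, new_path, parts.query, parts.fragment))


print(thumb_to_full_url("https://th.wallhaven.cc/small/4o/4ozvv9.jpg"))
```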

4.2 Getting the Page Count

This step continues to analyze the HTML we just fetched. The key fragment is:

<ul class="pagination" data-pagination='{"total":638,"current":1,"url":"https:\/\/wallhaven.cc\/search?q=animals&amp;page=1"}' role="navigation">
    <li>
        <span aria-hidden="true" original-tile="Previous Page">
    <i class="far fa-angle-double-left">
    </i>
    </span>
    </li>
    <li aria-current="page" class="current">
        <span original-title="Page 1">
            1
        </span>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=2" original-title="Page 2">
            2
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=3" original-title="Page 3">
            3
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=4" original-title="Page 4">
            4
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=5" original-title="Page 5">
            5
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=6" original-title="Page 6">
            6
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=7" original-title="Page 7">
           7
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=8" original-title="Page 8">
           8
        </a>
    </li>
    <li aria-disabled="true">
        <span>
            …
        </span>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=637" original-title="Page 637">
            637
        </a>
    </li>
    <li>
        <a href="https://wallhaven.cc/search?q=animals&amp;page=638" original-title="Page 638">
            638
        </a>
    </li>
    <li>
        <a aria-label="next" class="next" href="https://wallhaven.cc/search?q=animals&amp;page=2" rel="next">
            <i class="far fa-angle-double-right">
            </i>
        </a>
    </li>
</ul>

Looking at the <ul></ul> tag, we can see that the page count is in the data-pagination attribute, so we only need the value of that attribute:

def get_max_page(soup):
    result = soup.find('ul', class_='pagination')['data-pagination']
    to_json = json.loads(result)
    return to_json['total'] if 'total' in to_json else 1

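The return statement includes a simple check: if total is missing, it falls back to 1 so that the rest of the code can keep running. The exact page count is not critical to the result.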
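If a result page has no pagination element at all, soup.find returns None and the subscript would raise an error. A more defensive variant might look like the sketch below; get_max_page_safe is a hypothetical name, and the sample string is a shortened copy of the fragment above.

```python
import json
from bs4 import BeautifulSoup


def get_max_page_safe(soup, default=1):
    """Return the total page count, or default if it cannot be determined."""
    ul = soup.find("ul", class_="pagination")
    if ul is None or not ul.has_attr("data-pagination"):
        return default
    try:
        data = json.loads(ul["data-pagination"])
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return default
    if not isinstance(data, dict):
        return default
    return int(data.get("total", default))


sample = r"""<ul class="pagination" data-pagination='{"total":638,"current":1,"url":"https:\/\/wallhaven.cc\/search?q=animals&amp;page=1"}'></ul>"""
print(get_max_page_safe(BeautifulSoup(sample, "html.parser")))
print(get_max_page_safe(BeautifulSoup("<p>no pagination</p>", "html.parser")))
```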

4.3 Downloading Images

def getImg(img_url_list: list, save_path):
    if not os.path.isdir(save_path):
        os.makedirs(save_path)
    # Make sure the save path ends with the platform's path separator
    end_swith = '\\' if platform.system().lower() == 'windows' else '/'

    if not save_path.endswith(end_swith):
        save_path = save_path + end_swith
    # Download each image and save it to the target directory
    for img in img_url_list:
        _, save_name = os.path.split(img)
        whole_save_path = save_path + save_name
        img_content = request_client(img).content
        with open(whole_save_path, 'wb') as fw:
            fw.write(img_content)
        print("ImageUrl: %s download successfully." % img)
    return

Downloading is simple: once we have the image URL, the image can be downloaded as usual.
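The manual separator handling can be avoided with os.path.join, which picks the right separator for the platform. The sketch below is one possible variant rather than the article's code: save_images is a hypothetical name, and the download step is passed in as a fetch callable (for example lambda url: request_client(url).content) so the function can be tried without network access.

```python
import os
import posixpath
from urllib.parse import urlsplit


def save_images(img_url_list, save_dir, fetch):
    """Save each URL's content into save_dir and return the saved file paths.

    fetch is any callable that takes a URL and returns the image bytes.
    """
    os.makedirs(save_dir, exist_ok=True)
    saved = []
    for url in img_url_list:
        # Take the file name from the URL path, ignoring any query string
        filename = posixpath.basename(urlsplit(url).path)
        target = os.path.join(save_dir, filename)
        with open(target, "wb") as fw:
            fw.write(fetch(url))
        saved.append(target)
    return saved
```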

4.4 Parallel Downloads

To speed up downloading, multiprocessing is used here. To avoid saturating the machine's CPU, not all cores are used.

def run(base_url, save_path, page=1):
    url = base_url + '&page=%d' % page
    pageHtml = request_client(url).text
    img_url_list = get_img_url_list(BeautifulSoup(pageHtml, 'lxml'))
    getImg(img_url_list, save_path)

if __name__ == '__main__':
    start_time = time.time()
    baseUrl = "https://wallhaven.cc/search?q=sexy%20girls&atleast=2560x1080&sorting=favorites&order=desc"
    save_path = '/data/home/dogfei/Pictures/Wallpapers'
    baseHtml = request_client(baseUrl).text
    pages = get_max_page(BeautifulSoup(baseHtml, 'lxml'))
    # Use one core fewer than available so the CPU is not saturated (but at least one)
    cpu = max(1, cpu_count() - 1)
    print("Cpu cores: %d" % cpu)
    pages = cpu if pages > cpu else pages
    # Create a process pool
    pool = Pool(processes=cpu)
    for p in range(1, pages + 1):
        pool.apply_async(run, args=(baseUrl, save_path, p,))
    pool.close()
    pool.join()
    end_time = time.time()
    print("Total time: %.2f seconds" % (end_time - start_time))

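Not every page is downloaded here. A simple check is applied: if the total number of pages does not exceed the number of CPU cores in use, all pages are downloaded; otherwise, only as many pages as there are cores are downloaded.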
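A pool queues tasks beyond its worker count, so the number of pages does not have to be capped at the number of workers. The sketch below illustrates this and also collects each result, so that errors raised inside a worker are not silently lost. To keep the sketch easy to run, it swaps in multiprocessing.pool.ThreadPool, which offers the same apply_async interface as Pool; crawl_all_pages and fake_worker are hypothetical names.

```python
from multiprocessing.pool import ThreadPool


def crawl_all_pages(worker, total_pages, workers):
    """Run worker(page) for every page and return (results, failures)."""
    pool = ThreadPool(processes=max(1, workers))
    # Submit one task per page; the pool queues tasks beyond its worker count
    pending = [(p, pool.apply_async(worker, args=(p,)))
               for p in range(1, total_pages + 1)]
    pool.close()
    results, failures = {}, {}
    for page, async_result in pending:
        try:
            # get() re-raises any exception that occurred inside the worker
            results[page] = async_result.get()
        except Exception as exc:
            failures[page] = exc
    pool.join()
    return results, failures


def fake_worker(page):
    # Stand-in for run(base_url, save_path, page); page 3 fails on purpose
    if page == 3:
        raise ValueError("simulated failure")
    return page * 10


results, failures = crawl_all_pages(fake_worker, total_pages=5, workers=2)
print(results, list(failures))
```

With the article's Pool, the same pattern applies, provided the worker is a module-level function so it can be sent to the worker processes.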

5. Summary

Source code:

import re
import os
import json
import time
import requests
import platform
from bs4 import BeautifulSoup
from bs4 import NavigableString
from multiprocessing import Pool, cpu_count


def request_client(url):
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'
    headers = {
        'user-agent': user_agent,
        'accept-ranges': 'bytes',
        'accept-language': 'zh-CN,zh;q=0.9'
    }
    req = requests.get(url, headers=headers)
    return req


def get_max_page(soup):
    result = soup.find('ul', class_='pagination')['data-pagination']
    to_json = json.loads(result)
    return to_json['total'] if 'total' in to_json else 1


def get_img_url_list(soup):
    # Extract the URL and rewrite it into a link that can be downloaded
    def get_url(tag):
        re_img = re.compile(r'data-src="(.+?\.jpg)"')
        url = re_img.findall(str(tag))[0]
        _, img_name = os.path.split(url)
        replace_content = {
            'th.wallhaven.cc': 'w.wallhaven.cc',
            '/small/': '/full/',
            img_name: 'wallhaven-' + img_name
        }
        for k, v in replace_content.items():
            url = url.replace(k, v)
        return url
    img_url_list = []
    for tag in soup.find_all("img", class_="lazyload"):
        img_url_list.append(get_url(tag))
    return img_url_list


def getImg(img_url_list: list, save_path):
    if not os.path.isdir(save_path):
        os.makedirs(save_path)

    end_swith = '\\' if platform.system().lower() == 'windows' else '/'

    if not save_path.endswith(end_swith):
        save_path = save_path + end_swith

    for img in img_url_list:
        _, save_name = os.path.split(img)
        whole_save_path = save_path + save_name
        img_content = request_client(img).content
        with open(whole_save_path, 'wb') as fw:
            fw.write(img_content)
        print("ImageUrl: %s download successfully." % img)
    return


def run(base_url, save_path, page=1):
    url = base_url + '&page=%d' % page
    pageHtml = request_client(url).text
    img_url_list = get_img_url_list(BeautifulSoup(pageHtml, 'lxml'))
    getImg(img_url_list, save_path)


if __name__ == '__main__':
    # URL to download from
    baseUrl = "https://wallhaven.cc/search?q=sexy%20girls&atleast=2560x1080&sorting=favorites&order=desc"
    # Directory to save the images in
    save_path = '/data/home/dogfei/Pictures/Wallpapers'
    ######## Nothing below needs to be changed
    start_time = time.time()
    baseHtml = request_client(baseUrl).text
    pages = get_max_page(BeautifulSoup(baseHtml, 'lxml'))
    cpu = max(1, cpu_count() - 1)  # keep at least one worker on single-core machines
    print("Cpu cores: %d" % cpu)
    pages = cpu if pages > cpu else pages
    pool = Pool(processes=cpu)
    for p in range(1, pages + 1):
        pool.apply_async(run, args=(baseUrl, save_path, p,))
    pool.close()
    pool.join()
    end_time = time.time()
    print("Total time: %.2f seconds" % (end_time - start_time))

For further discussion, follow my WeChat official account: feelwow