Yes, today we are going to scrape images from this website. Its images are mostly high-resolution wallpapers, with plenty of beautiful girls among them, so I want to download them to use as desktop backgrounds.
A few Python modules are used here. Below, only bs4 gets a brief introduction.
We will use the first three of bs4's object types: Tag, NavigableString, and BeautifulSoup.
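As a quick sanity check (a minimal sketch, not from the original article, using a throwaway one-line document), this is what those three object types look like once something has been parsed:

from bs4 import BeautifulSoup

# a tiny made-up document, just to show the object types
soup = BeautifulSoup('<p class="title"><b>Hello</b></p>', 'html.parser')
print(type(soup))           # <class 'bs4.BeautifulSoup'>
print(type(soup.p))         # <class 'bs4.element.Tag'>
print(type(soup.b.string))  # <class 'bs4.element.NavigableString'>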
Take this HTML as an example:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
from bs4 import BeautifulSoup

# pass in the html above
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
The prettified output, which we keep in html_doc for the examples that follow:
html_doc = """ <html> <head> <title> The Dormouse's story </title> </head> <body> <p class="title"> <b> The Dormouse's story </b> </p> <p class="story"> Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a> , <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and <a class="sister" href="http://example.com/tillie" id="link3"> Tillie </a> ; and they lived at the bottom of a well. </p> <p class="story"> ... </p> </body> </html> """
soup = BeautifulSoup(html_doc, 'html.parser')
soup.title
Output:
<title>The Dormouse's story</title>
soup.p
Output:
<p class="title"><b>The Dormouse's story</b></p>
soup.find_all('a')
Note that find_all finds every occurrence of a tag. Here, for example, we look up all the a tags and get a list back.
soup.p['class']
Output:
['title']
A Tag object corresponds to a tag in the original XML or HTML document, and it can be accessed directly by the tag's name. What does that mean? Printing all of a tag's attributes makes it clear.
The source content is:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t.attrs)
The output is:
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
We can see that the tag t has the attributes href, class and id. Let's print each of them:
soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t['href'])
print(t['class'])
print(t['id'])
The output is:
http://example.com/elsie
['sister']
link1
class is printed as a list, because class is a multi-valued attribute.
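To make that concrete, here is a small sketch with a made-up snippet (not from the article): a tag carrying several CSS classes comes back as a list, while a single-valued attribute such as id stays a plain string.

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout" id="p1"></p>', 'html.parser')
print(css_soup.p['class'])  # ['body', 'strikeout'] -- class is multi-valued
print(css_soup.p['id'])     # p1 -- id is a single-valued string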
A Tag object also has two other important attributes, name and string. Let's see what they do through the output:
soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t.name)
print(t.string)
The output is:
a
Elsie
As you can see, name is the tag's name and string is the string the tag contains.
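One caveat worth knowing (a brief aside, not covered in the article): string only works cleanly when a tag contains a single string; for a tag with mixed children it returns None, and get_text() is the safer way to grab all the text.

soup = BeautifulSoup(html_doc, 'html.parser')
p = soup.find('p', class_='story')
print(p.string)      # None -- this paragraph mixes text with <a> tags
print(p.get_text())  # the paragraph's full text with the tags stripped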
Source content:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('a'))
This finds all the a tags and returns them as a list:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Since the result is a list, we can iterate over it:
soup = BeautifulSoup(html_doc, 'html.parser')
for r in soup.find_all('a'):
    print(r.string)
This prints the string contained in each <a></a> tag. The result is:
Elsie
Lacie
Tillie
import re

soup = BeautifulSoup(html_doc, 'html.parser')
for r in soup.find_all(id=re.compile(r'link(\d+)')):
    print(r)
Here a regular expression describes what to match: an id of link followed by a number. The three a tags satisfy that:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
find_all also accepts a list of tag names:

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(['a','p']))
The result is:
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]
That produces a lot of output. Can we add some filters?
html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well. </p> <a href="http://example.com/tillie" class="sister" id="link4">Tillie</a> <p class="story">...</p> """
An a tag with id=link4 has been added to make the test easier. Now define a filter, following the official documentation:
soup = BeautifulSoup(html_doc, 'html.parser')

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))
Output:
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>, <p class="story">...</p>]
The result does not include the id=link4 a tag we just added, so the filter works.
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(id='link4'))
The output is:
[<a class="sister" href="http://example.com/tillie" id="link4">Tillie</a>]
The result is exactly the a tag we just added.
If we want to search for all tags that have an id attribute, we can use find_all(id=True).
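A quick sketch of what that looks like against the html_doc above:

soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.find_all(id=True):
    print(tag.name, tag['id'])  # prints every tag that carries an id attribute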
Some attributes cannot be used as keyword arguments, for example data-* attributes. In that case, pass a dict through the attrs parameter to search for tags with such attributes:
soup.find_all(attrs={"data-foo": "value to search for"})
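Here is a minimal, self-contained sketch of that attrs search; the data-foo snippet is made up purely for illustration:

from bs4 import BeautifulSoup

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
# data-foo is not a valid Python keyword argument, so attrs is used instead
print(data_soup.find_all(attrs={"data-foo": "value"}))
# [<div data-foo="value">foo!</div>]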
soup.find_all("a", class_="story")
The class_ parameter accepts the same kinds of filters: a string, a regular expression, a function, or True:
import re

# with a regular expression
soup.find_all(class_=re.compile("itl"))

# with a custom filter function
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
The last call produces:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
At this point we have a basic grasp of the bs4 module, which is enough to go scrape the images we want.
First, visit https://wallhaven.cc/ and search for some images, for example with the keyword sexy girl. The browser's address bar becomes https://wallhaven.cc/search?q=sexy girl&page=2. Trying a few other searches shows that the search result URLs follow a pattern:
https://wallhaven.cc/search?q=<keyword>&<parameters>
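As a side note (a small sketch, not the helper used later in the article), requests can build that query string itself and takes care of URL-encoding the space in the keyword:

import requests

params = {'q': 'sexy girl', 'page': 2}
resp = requests.get('https://wallhaven.cc/search', params=params)
print(resp.url)  # something like https://wallhaven.cc/search?q=sexy+girl&page=2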
Knowing this, we can use requests to fetch the page.
Pressing F12 and looking at the request headers, we grab a few of them and call requests directly:
import requests

def request_client(url):
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'
    headers = {
        'user-agent': user_agent,
        'accept-ranges': 'bytes',
        'accept-language': 'zh-CN,zh;q=0.9'
    }
    req = requests.get(url, headers=headers)
    return req

print(request_client("https://wallhaven.cc/search?q=sexy%20girl").text)
This returns the HTML content. The fragment below is where the image address can be found:
<li>
  <figure class="thumb thumb-4y9pv7 thumb-sfw thumb-general" data-wallpaper-id="4y9pv7" style="width:300px;height:200px">
    <img alt="loading" class="lazyload" data-src="https://th.wallhaven.cc/small/4y/4y9pv7.jpg" src="" />
    <a class="preview" href="https://wallhaven.cc/w/4y9pv7" target="_blank"></a>
    <div class="thumb-info">
      <span class="wall-res">1920 x 1200</span>
      <a class="jsAnchor overlay-anchor wall-favs" data-href="https://wallhaven.cc/wallpaper/fav/4y9pv7">9 <i class="fa fa-fw fa-star"></i></a>
      <a class="jsAnchor thumb-tags-toggle tagged" data-href="https://wallhaven.cc/wallpaper/tags/4y9pv7" title="Tags"><i class="fas fa-fw fa-tags"></i></a>
    </div>
  </figure>
</li>
As you can see, the image address sits in the data-src attribute, and the <img> tag carries class=lazyload. We can use these two facts to pull the image URL out with a regular expression, as the function below does.
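Since bs4 has already parsed the tag's attributes, the raw URL could also be read straight from data-src instead of re-matching the tag's string form with a regex. A small alternative sketch (page_html is an assumed variable name for the fetched search-result HTML; the article's actual function follows):

soup = BeautifulSoup(page_html, 'html.parser')  # page_html: assumed name for the search-result HTML fetched above
for img in soup.find_all('img', class_='lazyload'):
    print(img.get('data-src'))  # e.g. https://th.wallhaven.cc/small/4y/4y9pv7.jpg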
def get_img_url_list(soup):
    # extract the url and rewrite it into a downloadable link
    def get_url(tag):
        re_img = re.compile(r'data-src="(.+?\.jpg)"')
        url = re_img.findall(str(tag))[0]
        _, img_name = os.path.split(url)
        replace_content = {
            'th.wallhaven.cc': 'w.wallhaven.cc',
            '/small/': '/full/',
            img_name: 'wallhaven-' + img_name
        }
        for k, v in replace_content.items():
            url = url.replace(k, v)
        return url

    img_url_list = []
    for tag in soup.find_all("img", class_="lazyload"):
        img_url_list.append(get_url(tag))
    return img_url_list
This step returns a list of image URLs. The code also rewrites each URL, because the one we scrape is not the real image address. Opening an image and checking in the browser devtools (F12), the address turns out to be:
# the real download address
https://w.wallhaven.cc/full/4o/wallhaven-4ozvv9.jpg
# the address in the html
https://th.wallhaven.cc/small/4o/4ozvv9.jpg
So the code makes the following replacements: th.wallhaven.cc ---> w.wallhaven.cc, small ---> full, 4ozvv9.jpg ---> wallhaven-4ozvv9.jpg.
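To make the rewrite concrete, a short sketch applying those replacements to the sample address above:

import os

url = 'https://th.wallhaven.cc/small/4o/4ozvv9.jpg'
_, img_name = os.path.split(url)  # 4ozvv9.jpg
url = (url.replace('th.wallhaven.cc', 'w.wallhaven.cc')
          .replace('/small/', '/full/')
          .replace(img_name, 'wallhaven-' + img_name))
print(url)  # https://w.wallhaven.cc/full/4o/wallhaven-4ozvv9.jpg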
The next step is to keep analyzing the HTML we fetched. Here is the key fragment:
<ul class="pagination" data-pagination='{"total":638,"current":1,"url":"https:\/\/wallhaven.cc\/search?q=animals&page=1"}' role="navigation"> <li> <span aria-hidden="true" original-tile="Previous Page"> <i class="far fa-angle-double-left"> </i> </span> </li> <li aria-current="page" class="current"> <span original-title="Page 1"> 1 </span> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=2" original-title="Page 2"> 2 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=3" original-title="Page 3"> 3 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=4" original-title="Page 4"> 4 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=5" original-title="Page 5"> 5 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=6" original-title="Page 6"> 6 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=7" original-title="Page 7"> 7 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=8" original-title="Page 8"> 8 </a> </li> <li aria-disabled="true"> <span> … </span> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=637" original-title="Page 637"> 637 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=638" original-title="Page 638"> 638 </a> </li> <li> <a aria-label="next" class="next" href="https://wallhaven.cc/search?q=animals&page=2" rel="next"> <i class="far fa-angle-double-right"> </i> </a> </li> </ul>
Looking inside the <ul></ul> tag, the page count is stored in the data-pagination attribute, so all we need is the value of that attribute:
def get_max_page(soup):
    result = soup.find('ul', class_='pagination')['data-pagination']
    to_json = json.loads(result)
    return to_json['total'] if 'total' in to_json else 1
A simple check on the return value keeps the later code running even when the total is missing; the page count does not affect our result.
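For the pagination fragment shown above, the attribute value is plain JSON, so the function would come back with 638. A trimmed check of just that parsing step:

import json

# the data-pagination value from the fragment above, trimmed to the relevant keys
pagination = '{"total":638,"current":1}'
print(json.loads(pagination)['total'])  # 638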
def getImg(img_url_list: list, save_path):
    if not os.path.isdir(save_path):
        os.makedirs(save_path)
    # normalize the save path so it ends with a separator
    end_swith = '\\' if platform.system().lower() == 'windows' else '/'
    if not save_path.endswith(end_swith):
        save_path = save_path + end_swith
    # download and save into the target directory
    for img in img_url_list:
        _, save_name = os.path.split(img)
        whole_save_path = save_path + save_name
        img_content = request_client(img).content
        with open(whole_save_path, 'wb') as fw:
            fw.write(img_content)
        print("ImageUrl: %s download successfully." % img)
    return
Downloading is simple: once we have the image address, we can download it as usual.
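For large wallpapers you could also stream the response to disk chunk by chunk instead of holding the whole file in memory. An optional sketch, not what the script below does:

import requests

def download_stream(url, dest_path):
    # stream=True keeps requests from reading the whole image into memory at once
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(dest_path, 'wb') as fw:
            for chunk in resp.iter_content(chunk_size=8192):
                fw.write(chunk)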
To speed up downloading, the script uses the multiprocessing module; to keep the machine's CPU from being maxed out, it does not use all of the cores.
def run(base_url, save_path, page=1):
    url = base_url + '&page=%d' % page
    pageHtml = request_client(url).text
    img_url_list = get_img_url_list(BeautifulSoup(pageHtml, 'lxml'))
    getImg(img_url_list, save_path)

if __name__ == '__main__':
    start_time = time.time()
    baseUrl = "https://wallhaven.cc/search?q=sexy%20girls&atleast=2560x1080&sorting=favorites&order=desc"
    save_path = '/data/home/dogfei/Pictures/Wallpapers'
    baseHtml = request_client(baseUrl).text
    pages = get_max_page(BeautifulSoup(baseHtml, 'lxml'))
    # use one core fewer than the machine has, to avoid maxing out the CPU
    cpu = cpu_count() - 1
    print("Cpu cores: %d" % cpu)
    pages = cpu if pages > cpu else pages
    # create a process pool
    pool = Pool(processes=cpu)
    for p in range(1, pages + 1):
        pool.apply_async(run, args=(baseUrl, save_path, p,))
    pool.close()
    pool.join()
    end_time = time.time()
    print("Total time: %.2f seconds" % (end_time - start_time))
Not every page gets downloaded here. A simple check is made: when the total number of pages is no more than the number of CPU cores, all pages are downloaded; otherwise only as many pages as there are cores.
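Worth noting: the Pool queues tasks beyond its number of worker processes, so if you wanted every page rather than just the first cpu pages, a small variation (reusing the script's own names) would be to skip the cap and submit all of them:

pool = Pool(processes=cpu)
for p in range(1, pages + 1):  # pages left as get_max_page()'s result, not capped at cpu
    pool.apply_async(run, args=(baseUrl, save_path, p))
pool.close()
pool.join()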
Full source:
import re
import os
import json
import time
import requests
import platform
from bs4 import BeautifulSoup
from bs4 import NavigableString
from multiprocessing import Pool, cpu_count


def request_client(url):
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'
    headers = {
        'user-agent': user_agent,
        'accept-ranges': 'bytes',
        'accept-language': 'zh-CN,zh;q=0.9'
    }
    req = requests.get(url, headers=headers)
    return req


def get_max_page(soup):
    result = soup.find('ul', class_='pagination')['data-pagination']
    to_json = json.loads(result)
    return to_json['total'] if 'total' in to_json else 1


def get_img_url_list(soup):
    # extract the url and rewrite it into a downloadable link
    def get_url(tag):
        re_img = re.compile(r'data-src="(.+?\.jpg)"')
        url = re_img.findall(str(tag))[0]
        _, img_name = os.path.split(url)
        replace_content = {
            'th.wallhaven.cc': 'w.wallhaven.cc',
            '/small/': '/full/',
            img_name: 'wallhaven-' + img_name
        }
        for k, v in replace_content.items():
            url = url.replace(k, v)
        return url

    img_url_list = []
    for tag in soup.find_all("img", class_="lazyload"):
        img_url_list.append(get_url(tag))
    return img_url_list


def getImg(img_url_list: list, save_path):
    if not os.path.isdir(save_path):
        os.makedirs(save_path)
    end_swith = '\\' if platform.system().lower() == 'windows' else '/'
    if not save_path.endswith(end_swith):
        save_path = save_path + end_swith
    for img in img_url_list:
        _, save_name = os.path.split(img)
        whole_save_path = save_path + save_name
        img_content = request_client(img).content
        with open(whole_save_path, 'wb') as fw:
            fw.write(img_content)
        print("ImageUrl: %s download successfully." % img)
    return


def run(base_url, save_path, page=1):
    url = base_url + '&page=%d' % page
    pageHtml = request_client(url).text
    img_url_list = get_img_url_list(BeautifulSoup(pageHtml, 'lxml'))
    getImg(img_url_list, save_path)


if __name__ == '__main__':
    # the search url to download from
    baseUrl = "https://wallhaven.cc/search?q=sexy%20girls&atleast=2560x1080&sorting=favorites&order=desc"
    # the directory to save images into
    save_path = '/data/home/dogfei/Pictures/Wallpapers'
    ######## nothing below needs to be changed
    start_time = time.time()
    baseHtml = request_client(baseUrl).text
    pages = get_max_page(BeautifulSoup(baseHtml, 'lxml'))
    cpu = cpu_count() - 1
    print("Cpu cores: %d" % cpu)
    pages = cpu if pages > cpu else pages
    pool = Pool(processes=cpu)
    for p in range(1, pages + 1):
        pool.apply_async(run, args=(baseUrl, save_path, p,))
    pool.close()
    pool.join()
    end_time = time.time()
    print("Total time: %.2f seconds" % (end_time - start_time))
Feel free to follow my WeChat official account, and let's learn and improve together.