再次爬取干货集中营的福利图片

时间 2020-05-06

标签再次干货集中营图片繁體版

原文原文链接

值得学习的地方html

1.utc时间转换成普通时间的函数,也就是把2015-06-05T03:54:29.403Z格式的时间转换成2015-06-05 11:54:29python

2.使用requrests获取https连接开头的图片数据json

以前爬取过干货集中营的照片,地址:https://www.cnblogs.com/sanduzxcvbnm/p/9209544.htmlapi

今天再重爬一次,换了一种通俗易懂的方式浏览器

首先分析网站函数

1.网站地址:https://gank.io/,要爬取的是网站上展现出来的图片学习

2.在网站首页有API地址,点进去查看发现是干货集中营 API 文档,网站

其中有这样一个数据显示url

尝试访问:https://gank.io/api/data/福利/10/1,出来以下json数据spa

3.经过API文档描述,能够知道连接地址中的请求个数和请求页数,获取到全部的图片信息

好比:https://gank.io/api/data/福利/700/1,请求700条数据,获取第一页,发现返回的结果中总共有668条数据(截止到20190115),第二页没有数据

再次请求:https://gank.io/api/data/福利/600/1,请求600条数据,获取第一页,发现确实有600条数据,而后再请求第二页(https://gank.io/api/data/福利/600/2),有68条数据

结合以上分析,截止到今天(20190115),共有668条数据,避免使用翻页的状况,就直接使用以下网址获取所有数据:https://gank.io/api/data/福利/700/1

其次根据获取到的数据,提取出所须要的图片连接

最后下载这些图片并保存

#!/usr/bin/env python # -*- coding: utf-8 -*- import datetime import os from _md5 import md5 import requests from requests import RequestException def get_one_page(url): try: response = requests.get(url) if response.status_code == 200: return response.json() except: return None def utc_to_local(utc_time_str, utc_format='%Y-%m-%dT%H:%M:%S.%fZ'): """ "2015-06-05T03:54:29.403Z"格式的时间转换成2015-06-05 11:54:29 """ local_format = "%Y-%m-%d %H:%M:%S" utc_dt = datetime.datetime.strptime(utc_time_str, utc_format) local_dt = utc_dt + datetime.timedelta(hours=8) time_str = local_dt.strftime(local_format) return time_str def https_to_http(url): """ 把图片连接是https的换成http """ if url[0:5] == 'https': url = url.replace(url[0:5], 'http') return url def parse_one_page(html): items = html['results'] for item in items: yield { 'id': item['_id'], 'publishedAt': utc_to_local(item['publishedAt']), 'url': https_to_http(item['url']) } # 请求图片url,获取图片二进制数据 def download_image(url): try: response = requests.get(url) if response.status_code == 200: save_image(response.content) # response.contenter二进制数据 response.text文本数据 return None except RequestException: print('请求图片出错', url) return None def save_image(content): """ 须要提早建好目录D:\\pachong\\gank\\ """ file_path = 'D:\\pachong\\gank\\{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg') if not os.path.exists(file_path): with open(file_path, 'wb') as f: f.write(content) def main(): url = 'http://gank.io/api/data/福利/700/1' html = get_one_page(url) for item in parse_one_page(html): download_image(item['url']) if __name__ == '__main__': main()

最终结果显示:

总共668条数据,实际只有558条数据,经过分析获取到的图片url,发现有些图片url连接自己已失效,使用浏览器打开这些图片连接会报以下错误:

失效连接展现: