Python Web Crawler 2 - Scraping Sina Weibo User Images

Date: 2019-11-16

This post was first published at www.litreily.top

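Truth be told, the Sina Weibo user-image crawler was the first crawler I wrote after picking up Python. I was too lazy to write it up at the time; only after finishing the Lofter crawler did I feel a write-up was worthwhile, which is how the first crawler post came about. Now that I have some free time, I'm filling in the Sina one as well.

Back to the point. Choosing to crawl Sina Weibo naturally came from a need, which is also one of the main motivations for learning: yes, nice pictures. Most posts by Sina users contain images, and more often a set of images than a single one.

To avoid infringing anyone's rights, this post uses my own Weibo account, litreily, to walk through the whole crawl. The account has few images and they are of low quality, but the approach itself works; to use it, just swap in another user ID.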

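Analyzing the Sina site

Getting the user ID

Before crawling, note that every user has a username, and each username maps to a unique integer ID, much like a student number. Mine is 2657006573. There are two ways to find the ID for a given username:

Open the target user's home page; the string of digits in the browser's address bar is the user ID
Press Ctrl-U to view the source of the target user's home page and search for "uid (note the leading double quote)

The uid can in fact be fetched automatically by the crawler from a known username, but I was new to Python and didn't think it through, so the source code below takes the user ID as its input.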
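As a small illustration of the first method, the uid can be pulled out of a profile URL with a regular expression. This is only a sketch: the helper name `uid_from_profile_url` is made up for this example, and the URL shape is assumed from the weibo.cn example used in this post.

```python
import re
from typing import Optional

# Assumed URL shape, taken from the example in this post:
#   https://weibo.cn/<uid>/profile?filter=...&page=...
_UID_IN_URL = re.compile(r'weibo\.cn/(\d+)(?:[/?]|$)')


def uid_from_profile_url(url: str) -> Optional[str]:
    """Return the numeric uid embedded in a weibo.cn profile URL, or None."""
    match = _UID_IN_URL.search(url)
    return match.group(1) if match else None


if __name__ == '__main__':
    print(uid_from_profile_url('https://weibo.cn/2657006573/profile?filter=0&page=1'))
```

A URL that carries a username instead of digits yields None, so the caller still has to fall back to looking the ID up by hand.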

Parsing the image storage parameters

All of a user's images are stored under a path like this. Really, all of them!

https://weibo.cn/{uid}/profile?filter={filter_mode}&page={page_num}

# example
https://weibo.cn/2657006573/profile?filter=0&page=1
uid: 2657006573
filter_mode: 0
page_num: 1

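Note that it is weibo.cn, not weibo.com. As for how I found this page, honestly, I've forgotten.

The URL takes three parameters: uid, filter_mode and page_num. uid is the user ID mentioned above, and page_num is simply the current page number, counting up from 1. So what is filter_mode?

No rush; let's look at the page itself first.

As the page shows, filter_mode is the filter condition, with three options:

filter=0 All posts (including text-only posts and reposts)
filter=1 Original posts (including text-only posts)
filter=2 Image posts (must contain an image; reposts included)

I usually choose original posts, because I don't want images from reposts in the results. Pick whichever suits your needs.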
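To make the three parameters concrete, here is a minimal sketch of a URL builder. The constant names and `build_profile_url` are hypothetical helpers for illustration; only the URL layout and the meaning of the three filter values come from the description above.

```python
from urllib.parse import urlencode

# Filter values as described in this post.
FILTER_ALL = 0        # all posts, including text-only posts and reposts
FILTER_ORIGINAL = 1   # original posts only
FILTER_PICTURES = 2   # posts that contain images, reposts included


def build_profile_url(uid: str, filter_mode: int = FILTER_ORIGINAL, page_num: int = 1) -> str:
    """Build the weibo.cn profile URL for one page of a user's posts."""
    if filter_mode not in (FILTER_ALL, FILTER_ORIGINAL, FILTER_PICTURES):
        raise ValueError('filter_mode must be 0, 1 or 2')
    if page_num < 1:
        raise ValueError('page_num starts at 1')
    query = urlencode({'filter': filter_mode, 'page': page_num})
    return 'https://weibo.cn/{}/profile?{}'.format(uid, query)


if __name__ == '__main__':
    print(build_profile_url('2657006573', FILTER_ALL, 1))
```

Validating the arguments up front keeps a typo in the filter value from silently producing a page with unexpected content.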

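Parsing image links

Now that we know where the parameters come from, let's look back at the page. Doesn't it feel like a bare skeleton, with barely a trace of CSS? No matter; Sina never meant to put this page in front of users. For a crawler, though, it is ideal, for these reasons:

Complete images with nothing missing; it is effectively a browsable database
Little styling and simple pages, so less traffic and faster crawling
Static pages with plain pagination; what you see is what you get
The source contains the first-image link and the album link of every post

A page like this is perfect for practice. Pay attention to the fourth point, though: what are first-image and album links? A post may contain several images, which together form an album, but this page shows only the first of them, the so-called first image. The album link points to the page that holds all images of that album.

Because my own account has no albums, I use Liu Yifei's Weibo here to illustrate the link formats for single images and albums.

The upper post in the screenshot has a single image, and its original-image link is easy to obtain. Note that it is the original we want: the page displays a thumbnail, but we of course crawl the full-size image.

The lower post contains an album; the album link is visible in Chrome DevTools to the right of the image.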

https://weibo.cn/mblog/picAll/FCQefgeAr?rl=2

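Opening the album link lists the album's images:

It shows both thumbnail links and original-image links. Now let's click an original-image link.

The URL of the page that opens differs from the original-image link shown on the album page, but is very similar to the thumbnail link there. They are:

Thumbnail: http://ww1.sinaimg.cn/thumb180/c260f7ably1fn4vd7ix0qj20rs1aj1kx.jpg
Original:  http://wx1.sinaimg.cn/large/c260f7ably1fn4vd7ix0qj20rs1aj1kx.jpg

The difference that matters is thumb180 versus large (the host also differs, ww1 versus wx1, but the code in this post keeps the thumbnail's host and changes only the size segment). Having spotted the pattern, things get much simpler: once we know the thumbnail URL, we replace the first path segment after the domain with large, instead of fetching the original-image link and following one more redirect.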
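The rewrite can be sketched as a small function that swaps the first path segment for large. `to_large_url` is a hypothetical helper, and it assumes every thumbnail URL has the two-segment path shown in the example (size segment, then file name); the crawler later in this post does the same thing with a plain string replace.

```python
from urllib.parse import urlsplit, urlunsplit


def to_large_url(thumb_url: str) -> str:
    """Replace the size segment of a sinaimg thumbnail URL with 'large'."""
    parts = urlsplit(thumb_url)
    segments = parts.path.split('/')    # ['', 'thumb180', '<name>.jpg']
    if len(segments) < 3:
        raise ValueError('unexpected image path: ' + parts.path)
    segments[1] = 'large'               # the host is left unchanged
    return urlunsplit(parts._replace(path='/'.join(segments)))


if __name__ == '__main__':
    print(to_large_url('http://ww1.sinaimg.cn/thumb180/c260f7ably1fn4vd7ix0qj20rs1aj1kx.jpg'))
```

Working on the parsed path rather than the whole string avoids accidentally touching a file name that happens to contain the size text.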

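Moreover, after several tries, the album links and thumbnail links turn out to match these regular expressions:

# 1. album link
imglist_reg = r'href="(https://weibo\.cn/mblog/picAll/.{9}\?rl=2)"'

# 2. thumbnail (literal dots are escaped so they only match '.')
img_reg = r'src="(http://w.{2}\.sinaimg\.cn/(.{6,8})/.{32,33}\.(jpg|gif))"'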

That concludes the analysis of Sina Weibo: the link formats and how to obtain them are now clear. Next we can design the crawl plan.
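As a quick check of the two patterns, the sketch below runs them against a hand-written HTML fragment. The fragment is an assumption made up for this example (real pages carry more attributes), and the patterns are written with their literal dots escaped. It mainly shows what `findall` returns: plain strings for the album pattern, and a (url, size segment, extension) tuple per image for the thumbnail pattern.

```python
import re

imglist_pattern = re.compile(r'href="(https://weibo\.cn/mblog/picAll/.{9}\?rl=2)"')
img_pattern = re.compile(r'src="(http://w.{2}\.sinaimg\.cn/(.{6,8})/.{32,33}\.(jpg|gif))"')

# Hypothetical fragment; only the two URLs are taken from this post.
sample_html = (
    '<a href="https://weibo.cn/mblog/picAll/FCQefgeAr?rl=2">album</a>'
    '<img src="http://ww1.sinaimg.cn/thumb180/c260f7ably1fn4vd7ix0qj20rs1aj1kx.jpg" alt="pic" />'
)

album_links = imglist_pattern.findall(sample_html)   # one capture group: list of strings
images = img_pattern.findall(sample_html)            # three capture groups: list of tuples

if __name__ == '__main__':
    print(album_links)
    for url, size, ext in images:
        print(size, ext, url)
```

The second and third groups are what the download code relies on: the size segment is swapped for large, and the extension is reused for the local file name.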

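Settling on a crawl plan

Based on the analysis, the following plan falls out naturally:

1. Start from the Weibo username litreily
2. Open the target user's home page and read the uid from the URL: 2657006573
3. Get the cookies of your own logged-in Weibo session (the request headers need them)
4. Crawl https://weibo.cn/2657006573/profile?filter=0&page={1,2,3,...} page by page
5. Parse each page's source to get single-image links and album links:

   Single image: take the thumbnail link directly;
   Album: fetch the album link and loop over the album page to collect every image's thumbnail link

6. Rewrite each link collected in step 5 into its original-image link and download it locally
7. Repeat steps 4-6 until there are no more images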
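The page loop of this plan can be sketched as a generator that takes the page-fetching function as an argument, so it can be tried without touching the network. This is a simplified illustration, not the crawler shown later in this post: `iter_large_urls` and the `fetch` callback are hypothetical names, images are collected from the whole page instead of per post, and duplicates (a first image that also appears on its album page) are dropped by file name.

```python
import re
from typing import Callable, Iterator, Optional

IMGLIST_PATTERN = re.compile(r'href="(https://weibo\.cn/mblog/picAll/.{9}\?rl=2)"')
IMG_PATTERN = re.compile(r'src="(http://w.{2}\.sinaimg\.cn/(.{6,8})/.{32,33}\.(jpg|gif))"')


def iter_large_urls(uid: str,
                    fetch: Callable[[str], Optional[str]],
                    filter_mode: int = 1) -> Iterator[str]:
    """Yield original-image URLs page by page until a page has no images."""
    seen = set()
    page = 1
    while True:
        url = 'https://weibo.cn/%s/profile?filter=%d&page=%d' % (uid, filter_mode, page)
        html = fetch(url)                                  # fetch one profile page
        if not html:
            return
        found = IMG_PATTERN.findall(html)                  # images shown on the page
        for album_url in IMGLIST_PATTERN.findall(html):    # images behind album links
            album_html = fetch(album_url)
            if album_html:
                found += IMG_PATTERN.findall(album_html)
        if not found:                                      # no images left: stop
            return
        for thumb_url, size, _ext in found:                # rewrite to original size
            name = thumb_url.rsplit('/', 1)[-1]
            if name not in seen:
                seen.add(name)
                yield thumb_url.replace(size, 'large')
        page += 1
```

Passing a dictionary's `get` method as `fetch` is enough to exercise the loop with canned pages; a real run would pass a function that sends the request with the login cookies.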

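Getting cookies

A few points in this plan deserve attention. The first is obtaining cookies: I haven't yet learned how to get them automatically, so for now I log in to Weibo and copy them by hand.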
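One way to keep a hand-copied cookie out of the source file is to read it from an environment variable when building the request headers. This is only a sketch: `build_headers` and the variable name `WEIBO_COOKIE` are assumptions for this example, and the User-Agent string is the one used in the source code of this post.

```python
import os
from typing import Dict, Optional

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')


def build_headers(cookie: Optional[str] = None) -> Dict[str, str]:
    """Return request headers carrying the cookie copied from a logged-in browser."""
    if cookie is None:
        # WEIBO_COOKIE is a hypothetical variable name chosen for this sketch
        cookie = os.environ.get('WEIBO_COOKIE', '')
    cookie = cookie.strip()
    if not cookie:
        raise ValueError('no cookie supplied; copy it from the browser after logging in')
    return {'User-Agent': USER_AGENT, 'Cookie': cookie}
```

Failing early on an empty cookie gives a clearer error than a crawl that quietly finds no images.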

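Downloading pages

Pages are downloaded with urllib from the Python 3 standard library. I hadn't learned requests at the time, and will probably seldom use urllib from now on.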

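import urllib.request


def _get_html(url, headers):
    """Fetch url with the given headers; return the decoded page, or None on failure."""
    try:
        req = urllib.request.Request(url, headers=headers)
        page = urllib.request.urlopen(req)
        html = page.read().decode('UTF-8')
    except Exception as e:
        # report the reason as well, so a bad cookie or network error is visible
        print("get %s failed: %s" % (url, e))
        return None
    return html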

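Choosing the save path

I wrote the code on Windows 10 but prefer working in bash, so the image directory comes in the two formats below. _get_path checks the current operating system and picks the matching path.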

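import os
import platform


def _get_path(uid):
    # pick the directory by OS name; systems other than Windows and Linux are not handled
    path = {
        'Windows': 'D:/litreily/Pictures/python/sina/' + uid,
        'Linux': '/mnt/d/litreily/Pictures/python/sina/' + uid
    }.get(platform.system())

    # create the per-user directory on first use
    if not os.path.isdir(path):
        os.makedirs(path)
    return path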

Fortunately Windows accepts the forward slash used on Linux; otherwise converting the paths in the program would be a hassle.

Downloading images

Since I went with urllib, images are downloaded with urllib.request.urlretrieve.

# image url of one page is saved in imgurls
for img in imgurls:
    imgurl = img[0].replace(img[1], 'large')
    num_imgs += 1
    try:
        urllib.request.urlretrieve(imgurl, '{}/{}.{}'.format(path, num_imgs, img[2]))
        # display the raw url of images
        print('\t%d\t%s' % (num_imgs, imgurl))
    except Exception as e:
        print(str(e))
        print('\t%d\t%s failed' % (num_imgs, imgurl))

Source code

See the source for the remaining details.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
# author: litreily
# date: 2018.02.05
"""Capture pictures from sina-weibo with user_id."""

import re
import os
import platform

import urllib
import urllib.request

from bs4 import BeautifulSoup


def _get_path(uid):
    path = {
        'Windows': 'D:/litreily/Pictures/python/sina/' + uid,
        'Linux': '/mnt/d/litreily/Pictures/python/sina/' + uid
    }.get(platform.system())

    if not os.path.isdir(path):
        os.makedirs(path)
    return path


def _get_html(url, headers):
    try:
        req = urllib.request.Request(url, headers = headers)
        page = urllib.request.urlopen(req)
        html = page.read().decode('UTF-8')
    except Exception as e:
        print("get %s failed: %s" % (url, e))
        return None
    return html


def _capture_images(uid, headers, path):
    filter_mode = 1      # 0-all 1-original 2-pictures
    num_pages = 1
    num_blogs = 0
    num_imgs = 0

    # regular expression of imgList and img
    imglist_reg = r'href="(https://weibo\.cn/mblog/picAll/.{9}\?rl=2)"'
    imglist_pattern = re.compile(imglist_reg)
    img_reg = r'src="(http://w.{2}\.sinaimg\.cn/(.{6,8})/.{32,33}\.(jpg|gif))"'
    img_pattern = re.compile(img_reg)
    
    print('start capture picture of uid:' + uid)
    while True:
        url = 'https://weibo.cn/%s/profile?filter=%s&page=%d' % (uid, filter_mode, num_pages)

        # 1. get html of each page url
        html = _get_html(url, headers)
        if html is None:
            # stop cleanly instead of passing None to BeautifulSoup
            break
        
        # 2. parse the html and find all the imgList Url of each page
        soup = BeautifulSoup(html, "html.parser")
        # <div class="c" id="M_G4gb5pY8t"><div>
        blogs = soup.body.find_all(attrs={'id':re.compile(r'^M_')}, recursive=False)
        num_blogs += len(blogs)

        imgurls = []        
        for blog in blogs:
            blog = str(blog)
            imglist_url = imglist_pattern.findall(blog)
            if not imglist_url:
                # 2.1 get img-url from blog that have only one pic
                imgurls += img_pattern.findall(blog)
            else:
                # 2.2 get img-urls from blog that have group pics
                album_html = _get_html(imglist_url[0], headers)
                if album_html:
                    imgurls += img_pattern.findall(album_html)

        if not imgurls:
            print('capture complete!')
            print('captured pages:%d, blogs:%d, imgs:%d' % (num_pages, num_blogs, num_imgs))
            print('directory:' + path)
            break

        # 3. download all the imgs from each imgList
        print('PAGE %d with %d images' % (num_pages, len(imgurls)))
        for img in imgurls:
            imgurl = img[0].replace(img[1], 'large')
            num_imgs += 1
            try:
                urllib.request.urlretrieve(imgurl, '{}/{}.{}'.format(path, num_imgs, img[2]))
                # display the raw url of images
                print('\t%d\t%s' % (num_imgs, imgurl))
            except Exception as e:
                print(str(e))
                print('\t%d\t%s failed' % (num_imgs, imgurl))
        num_pages += 1
        print('')


def main():
    # uids = ['2657006573','2173752092','3261134763','2174219060']
    uid = '2657006573'
    path = _get_path(uid)

    # cookie is from the above url->network->request headers
    cookies = ''
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
            'Cookie': cookies}

    # capture imgs from sina
    _capture_images(uid, headers, path)


if __name__ == '__main__':
    main()


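Remember to set cookies and uid in main before running!

Crawl test

Final notes

The crawler lives in the open-source project capturer; feedback is welcome
As it was my first crawler, a lot could be improved; the LOFTER crawler, by comparison, is more polished
So far I have seen no obvious anti-crawling measures on Sina Weibo, but it is still best to take only what you need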