Learning Python web scraping: scraping QQ Shuoshuo posts and turning them into a word cloud, full of memories

Date: 2019-11-17


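I taught myself Python for a while, built a website on my own with Django, and scraped a few simple sites with requests + BeautifulSoup. Over the weekend I dug a little deeper and decided to scrape the Shuoshuo (status posts) from my QQ Zone (Qzone), save the text to a txt file, and read it back to generate a word cloud.
I haven't logged in to QQ for a long time and haven't touched Qzone posts for years. They are full of memories from my school days. I read them and started laughing, and laughed until... hahaha~~
No picture, no proof:

Back then I was still so young, so witty and so funny...
Back to business. This time the stack is selenium to simulate the login + BeautifulSoup4 to extract the data + wordcloud to generate the word cloud.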

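Installing BeautifulSoup

pip install beautifulsoup4
See the official beautifulsoup4 documentation for details.
A parser is also required. I chose the html5lib parser: pip install html5lib
The table below lists the main parsers with their advantages and disadvantages: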

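Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 / 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	Best error tolerance; parses the document the way a browser does; produces HTML5-format output	Slow (although it needs no external C extension)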
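
As a rough illustration of the trade-offs in the table, the sketch below picks a parser name depending on what is installed. The helper name choose_parser and the preference order (lxml, then html5lib, then the built-in html.parser) are my own assumptions; the script in this post simply uses html5lib.

```python
import importlib.util


def choose_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, falling back to html.parser.

    Hypothetical helper: the preference order is an assumption, not a rule.
    """
    for name in preferred:
        # find_spec returns None when the package cannot be found
        if importlib.util.find_spec(name) is not None:
            return name
    # html.parser ships with Python, so it is always available
    return "html.parser"


parser_name = choose_parser()
print(parser_name)
```

The returned string can be passed as the second argument, for example BeautifulSoup(markup, choose_parser()).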

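Simulating the login with selenium

Selenium is used to simulate logging in to Qzone. Install it with pip install selenium
I use the Chrome browser, so the driver is obtained with webdriver.Chrome().
The driver for the browser also has to be downloaded and installed; otherwise the script fails with the error "chromedriver executable needs to be in PATH". I'm on a Mac and followed this article on downloading the driver: https://blog.csdn.net/zxy987872674/article/details/53082896
Windows works the same way: download the matching driver, unzip it, and put the extracted .exe file in the Python installation directory, for example D:\python. The Python installation directory must also be added to the system PATH environment variable.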
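
Before starting the browser, it can help to check whether the driver is actually reachable. The following is a minimal sketch using only the standard library; the helper name find_driver is hypothetical.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable on PATH, or None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver is not on PATH; selenium may fail to start Chrome")
else:
    print("chromedriver found at", path)
```

If this prints the warning, the "executable needs to be in PATH" error mentioned above is the likely outcome when the script runs.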

The QQ login page is http://i.qq.com. Use webdriver to open the Qzone login page:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://i.qq.com")

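Once the page is open, right-click and choose Inspect to look at the page elements. The account/password login form sits inside the frame named login_frame, so first switch to that frame with driver.switch_to.frame("login_frame"), then click the account/password login button, type in the account and password, log in, and open the Shuoshuo page. The full code is as follows: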

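from selenium import webdriver
from selenium.webdriver.common.by import By

friend = ''  # QQ number of the space to visit; **that space must allow you access**. Your own QQ number works too
user = ''  # your QQ number
pw = ''  # your QQ password

# Get the browser driver
driver = webdriver.Chrome()
# Maximize the browser window
driver.maximize_window()
# Open the QQ login page
driver.get("http://i.qq.com")

# Switch to the frame that holds the login form
driver.switch_to.frame("login_frame")

# find_element(By.ID, ...) replaces find_element_by_id, which Selenium 4.3+ no longer provides
# Click the account/password login option
driver.find_element(By.ID, "switcher_plogin").click()
# Type the QQ number into the account box
driver.find_element(By.ID, "u").send_keys(user)
# Type the password into the password box
driver.find_element(By.ID, "p").send_keys(pw)
# Click the login button
driver.find_element(By.ID, "login_button").click()
# Switch webdriver back to the top-level page
driver.switch_to.default_content()
# Go to the Shuoshuo URL; friend can be any space you are allowed to visit, here it is my own
driver.get("http://user.qzone.qq.com/" + friend + "/311")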
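
The listing above keeps the QQ number and password as string literals. One possible alternative, sketched here, reads them from environment variables so they don't end up in the source file. The variable names QQ_USER, QQ_PASSWORD and QQ_FRIEND and the helper load_credentials are hypothetical, not part of the original script.

```python
import os


def load_credentials(env=None):
    """Read login data from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    user = env.get("QQ_USER", "")
    pw = env.get("QQ_PASSWORD", "")
    # Default to the user's own space when no friend is given
    friend = env.get("QQ_FRIEND", "") or user
    if not user or not pw:
        raise ValueError("QQ_USER and QQ_PASSWORD must both be set")
    return user, pw, friend


# Example with an explicit mapping instead of the real environment
user, pw, friend = load_credentials({"QQ_USER": "10001", "QQ_PASSWORD": "secret"})
print(user, friend)
```

Calling load_credentials() with no argument reads from the real environment and raises ValueError when either value is missing.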

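At this point the QQ Shuoshuo page is open. Note that some spaces show a prompt dialog after opening, which has to be closed first by simulating a click.

Damn, I apparently used to have a Yellow Diamond membership, how scary~~ And my Qzone avatar was just as young and trendy...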

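from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

try:
    # Find the close button and dismiss the prompt dialog
    driver.find_element(By.ID, "dialog_button_111").click()
except NoSuchElementException:
    # No dialog on this space, nothing to close
    pass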

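Because the Shuoshuo content is loaded dynamically, the scroll bar has to be dragged down automatically to load everything, and then the "next page" button is clicked to load more. The code is given below.

Scraping Shuoshuo with BeautifulSoup

Press F12 to inspect the page. The posts live in the <div> with class feed_wrap, as the <li> items of its <ol>; the text of each post is in the <pre> tag inside <div class="bd">.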
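
The structure just described can be tried offline before involving a browser. The sketch below runs on a made-up HTML fragment and uses the built-in html.parser so nothing besides beautifulsoup4 is needed; the fragment and the function name extract_posts are illustrative assumptions, and the real Qzone markup may contain more than this.

```python
from bs4 import BeautifulSoup

# Made-up fragment that follows the structure described above
SAMPLE_HTML = """
<div class="feed_wrap">
  <ol>
    <li class="feed"><div class="bd"><pre>first post</pre></div></li>
    <li class="feed"><div class="bd"><pre>second post</pre></div></li>
    <li class="feed"><div class="bd"></div></li>
  </ol>
</div>
"""


def extract_posts(html, parser="html.parser"):
    """Return the text of every post that has a <pre> body."""
    soup = BeautifulSoup(html, parser)
    wrap = soup.find("div", class_="feed_wrap")
    if wrap is None or wrap.ol is None:
        return []
    posts = []
    for li in wrap.ol.find_all("li", class_="feed"):
        bd = li.find("div", class_="bd")
        # Items without a <pre> body are skipped instead of raising
        if bd is not None and bd.pre is not None:
            posts.append(bd.pre.get_text())
    return posts


print(extract_posts(SAMPLE_HTML))
```

The third list item has no <pre> tag and is skipped, which is the case a bare bd.pre.get_text() would trip over.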

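import time

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

next_num = 0  # numeric suffix of the current "next page" button id
while True:
    # Scroll down so the browser loads all the content of this page.
    # scrollBy is relative, so the five passes move by 0, 20000, ..., 80000 pixels
    for i in range(0, 5):
        height = 20000 * i
        strWord = "window.scrollBy(0," + str(height) + ")"
        driver.execute_script(strWord)
        time.sleep(2)

    # Switch to the frame that holds the posts, otherwise the elements below cannot be found
    driver.switch_to.frame("app_canvas_frame")
    # Parse the page
    content = BeautifulSoup(driver.page_source, "html5lib")
    # Find the <ol> inside the "feed_wrap" div
    ol = content.find("div", class_="feed_wrap").ol
    # Collect the <li> items with find_all
    lis = ol.find_all("li", class_="feed")

    # Write the posts to the file; mode 'a' appends without clearing earlier content
    with open('qq_word.txt', 'a', encoding='utf-8') as f:
        for li in lis:
            bd = li.find("div", class_="bd")
            # Skip items that have no <pre> text body
            if bd is None or bd.pre is None:
                continue
            # The post text is in the <pre> tag
            ss_content = bd.pre.get_text()
            f.write(ss_content + "\n")

    # On the last page the "next page" button no longer has an id, so stop
    if 'pager_next_' + str(next_num) not in driver.page_source:
        break
    # The id of the "next page" button changes on every page, so track it
    driver.find_element(By.ID, 'pager_next_' + str(next_num)).click()
    # Suffix of the next "next page" id
    next_num += 1
    # The next iteration scrolls the page first, so go back to the outer frame
    driver.switch_to.parent_frame()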

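
The loop above scrolls a fixed five times per page. A variation, sketched here as an assumption rather than the author's method, keeps scrolling until the page height stops growing. The function only needs an object with an execute_script method, so it is shown with a small stand-in driver; whether the Qzone page reports its height through document.body.scrollHeight is something I have not verified.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll until document.body.scrollHeight stops growing.

    `driver` only needs an execute_script method, as a selenium WebDriver has.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give the page time to load more posts
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds


class FakeDriver:
    """Stand-in for a browser: the page grows on the first few scrolls."""

    def __init__(self, heights):
        self.heights = list(heights)
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # A scroll advances to the next height, if there is one
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None


print(scroll_to_bottom(FakeDriver([1000, 2000, 3000]), pause=0))
```

With a real browser the call would be scroll_to_bottom(driver), made before switching into app_canvas_frame.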

The QQ Shuoshuo posts have now been scraped and saved in the file qq_word.txt.
Next, generate the word cloud.

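Word cloud

The wordcloud package generates the word cloud: pip install wordcloud
jieba could also be used here for word segmentation. I did not use it, because I feel the posts only keep their flavor when read as whole sentences; that is personal taste. With jieba segmentation you would instead see the most frequent individual words in the posts.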
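
To make the "whole sentences versus single words" choice concrete, here is a small sketch that counts whole phrases with the standard library only. The separator set and the function name phrase_frequencies are assumptions for illustration; this is not how wordcloud tokenizes internally.

```python
import re
from collections import Counter

# Split on whitespace and common Chinese/ASCII punctuation (an assumed set)
SEPARATORS = re.compile(r"[\s,.!?~;:，。！？；：、…]+")


def phrase_frequencies(text):
    """Count whole phrases, so sentences are not cut into single words."""
    phrases = [p for p in SEPARATORS.split(text) if p]
    return Counter(phrases)


sample = "今天天气真好，出去玩！\n今天天气真好。"
freqs = phrase_frequencies(sample)
print(freqs.most_common(1))
```

A mapping like this can be handed to WordCloud.generate_from_frequencies instead of calling generate on the raw text. With jieba, the phrases list would instead come from jieba.cut(text).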
Set a few wordcloud properties. Note that font_path must be set, otherwise Chinese characters are rendered as garbage.
One more reminder: if you use a virtual environment, do not run the script below inside it, or you may get the error RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are using (Ana)Conda please install python.app and replace the use of 'python' with 'pythonw'. See 'Working with Matplotlib on OSX' in the Matplotlib FAQ for more information. I ran into exactly this, so I ran deactivate to leave the virtual environment and then ran the script again.
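
If leaving the virtual environment is not convenient, another commonly used workaround is to select a non-GUI Matplotlib backend such as Agg before pyplot is used, and to write the figure to a file instead of showing it. This is a sketch of that alternative, not what the script in this post does; the output file name and the tiny 2x2 image are placeholders.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-GUI backend; select it before pyplot is used
import matplotlib.pyplot as plt

# Placeholder image; in the real script this would be the WordCloud object
image = [[0, 1], [1, 0]]

fig, ax = plt.subplots()
ax.imshow(image)
ax.axis("off")

out_path = os.path.join(tempfile.gettempdir(), "agg_demo.png")
fig.savefig(out_path)  # plt.show() would display nothing with Agg
plt.close(fig)
print(os.path.getsize(out_path) > 0)
```

Since wc.to_file('qq_word.png') already writes the image on its own, the Matplotlib display step can also simply be skipped.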

# coding:utf-8

from wordcloud import WordCloud
import matplotlib.pyplot as plt


# Generate the word cloud
def create_word_cloud(filename):
    # Read the file content
    with open("{}.txt".format(filename), encoding='utf-8') as f:
        text = f.read()
    # Configure the word cloud
    wc = WordCloud(
        # Background color
        background_color="white",
        # Maximum number of words to display
        max_words=2000,
        # Use a font installed on the machine: on Windows look in C:\Windows\Fonts\, on Mac I chose /System/Library/Fonts/PingFang.ttc
        font_path='/System/Library/Fonts/PingFang.ttc',
        height=1200,
        width=2000,
        # Maximum font size
        max_font_size=100,
        # Seed for the random generator, so the layout and colors are reproducible
        random_state=30,
    )

    myword = wc.generate(text)  # generate the word cloud
    # Display the word cloud
    plt.imshow(myword)
    plt.axis("off")
    plt.show()
    wc.to_file('qq_word.png')  # save the word cloud image


if __name__ == '__main__':
    create_word_cloud('qq_word')

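
Because font_path differs between machines, a small lookup can save some trial and error. In the sketch below only the PingFang path comes from the script above; the Windows and Linux paths and the helper name pick_font are assumptions about typical installations.

```python
import os

# Candidate fonts with Chinese glyphs. Only the PingFang path comes from the
# script above; the other two are assumptions about typical installs.
CANDIDATE_FONTS = [
    "/System/Library/Fonts/PingFang.ttc",  # macOS
    r"C:\Windows\Fonts\msyh.ttc",  # Windows (Microsoft YaHei)
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",  # Linux
]


def pick_font(candidates=CANDIDATE_FONTS):
    """Return the first font path that exists on this machine, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


font_path = pick_font()
print(font_path)
```

When the result is not None it can be passed as font_path=... to WordCloud; when it is None, a font that covers Chinese has to be located by hand.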

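That's it: the QQ Shuoshuo posts have been scraped and turned into a word cloud.
Source code on GitHub: github.com/taixiang/sp…

You are welcome to follow my blog: blog.manjiexiang.cn/
You are also welcome to follow my WeChat account: 春风十里不如认识你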