Python Web Scraping in Practice (1): Using requests and BeautifulSoup

Date: 2019-11-17


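Python basics

I previously wrote 《Python 3 极简教程.pdf》 (a minimal Python 3 tutorial), a quick start for readers who already have some programming background. After working through that series you should be able to write an API endpoint on your own and build small tools without trouble.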

requests

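requests is an HTTP library for Python, comparable to Retrofit on Android. Its features include Keep-Alive and connection pooling, cookie persistence, automatic content decompression, HTTP proxy support, SSL verification, connection timeouts, and sessions, among others. At the time of writing it is compatible with both Python 2 and Python 3. GitHub: github.com/requests/re…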

Installation

Mac:

pip3 install requests

Windows:

pip install requests

Sending requests

HTTP request methods include GET, POST, PUT, and DELETE.

import requests

# GET request
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all')

# POST request
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert')

# PUT request
response = requests.put('http://127.0.0.1:1024/developer/api/v1.0/update')

# DELETE request
response = requests.delete('http://127.0.0.1:1024/developer/api/v1.0/delete')

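A request returns a Response object, which wraps the data the server sends back in its HTTP response. Its main elements include the status code, reason phrase, response headers, response URL, response encoding, and response body.

# Status code
print(response.status_code)

# Response URL
print(response.url)

# Reason phrase
print(response.reason)

# Response body parsed as JSON
print(response.json())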
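Checking the status code by hand gets repetitive, so here is a minimal sketch of using raise_for_status() to turn 4xx/5xx responses into exceptions. The summarize helper and the /missing URL are hypothetical, and the Response objects are built by hand only so that the sketch runs without a server.

```python
import requests


def summarize(response):
    """Return a short status line, or the error text for a 4xx/5xx response."""
    try:
        # Raises requests.exceptions.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return "failed: %s" % exc
    return "ok: %s %s" % (response.status_code, response.reason)


# Hand-built Response objects stand in for real server replies (assumption for offline use)
success = requests.Response()
success.status_code = 200
success.reason = "OK"

not_found = requests.Response()
not_found.status_code = 404
not_found.reason = "Not Found"
not_found.url = "http://127.0.0.1:1024/developer/api/v1.0/missing"  # hypothetical URL

print(summarize(success))
print(summarize(not_found))
```

In real code the response would come from requests.get or one of the other request functions rather than being constructed by hand.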

Custom request headers

To add HTTP headers to a request, pass a dict to the headers keyword argument.

headers = {
    'Application-Id': '19869a66c6',
    'Content-Type': 'application/json',
}
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all/', headers=headers)
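When the same header is needed on every call, a Session can hold it once. The sketch below is only an illustration: it prepares a request without sending it, so the merged headers can be inspected offline. Splitting the headers between the session and the single request is my own choice for the example, not something the original code does.

```python
import requests

# Headers set on the session apply to every request made through it
session = requests.Session()
session.headers.update({"Application-Id": "19869a66c6"})

# Headers passed per request are merged with the session headers
request = requests.Request(
    "GET",
    "http://127.0.0.1:1024/developer/api/v1.0/all/",
    headers={"Content-Type": "application/json"},
)
prepared = session.prepare_request(request)  # builds the request, sends nothing

for name in ("Application-Id", "Content-Type", "User-Agent"):
    print(name, "=", prepared.headers.get(name))
```

To actually send it, session.send(prepared) or simply session.get(url, headers=...) would be used against a running server.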

Building query parameters

To pass data in the URL's query string, for example http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2 , requests lets you supply the parameters as a dictionary of strings through the params keyword argument.

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

You can also pass a list as a value:

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

# Response URL
print(response.url)
# Prints: http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2&key2=value3
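To see how params ends up in the URL without a server running, a request can be prepared but not sent. This is a small sketch; build_url is a hypothetical helper name, and the key3 entry is added only to show that None values are left out of the query string.

```python
import requests


def build_url(base_url, params):
    """Return the URL requests would use for a GET, without touching the network."""
    prepared = requests.Request("GET", base_url, params=params).prepare()
    return prepared.url


url = build_url(
    "http://127.0.0.1:1024/developer/api/v1.0/all",
    {"key1": "value1", "key2": ["value2", "value3"], "key3": None},
)
print(url)
```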

POST request data

If the server expects form-encoded data, pass it through the data keyword argument.

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", data=payload)

If the server expects a JSON body instead, use the json keyword argument. The value can be passed as a plain dictionary, and requests serializes it for you.

obj = {
    "article_title": "小公务员之死2"
}
response = requests.post('http://127.0.0.1:1024/developer/api/v1.0/insert', json=obj)
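The difference between data and json is easiest to see in what actually goes on the wire. The following sketch prepares both kinds of request without sending them and prints the resulting Content-Type and body; the variable names are mine.

```python
import json

import requests

url = "http://127.0.0.1:1024/developer/api/v1.0/insert"
payload = {"key1": "value1", "key2": "value2"}

# data= produces a form-encoded body
form_req = requests.Request("POST", url, data=payload).prepare()
# json= produces a JSON body and sets the JSON content type
json_req = requests.Request("POST", url, json=payload).prepare()

print(form_req.headers["Content-Type"])
print(form_req.body)
print(json_req.headers["Content-Type"])
print(json.loads(json_req.body))  # decode the body back into a dict for display
```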

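Response content

requests automatically decodes content from the server, and most Unicode charsets are decoded seamlessly. After a request is made, requests makes an educated guess about the response encoding based on the HTTP headers.

# Response content
# Body decoded to str (text is a property, not a method)
# print(response.text)
# Body parsed as JSON
print(response.json())
# Body as raw bytes (content is also a property)
# print(response.content)
# Raw response; requires stream=True on the original request
# response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', stream=True)
# print(response.raw)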

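Timeouts

If you do not specify a timeout explicitly, requests does not time out on its own. When the server fails to respond, the whole program stays blocked and cannot handle anything else.

response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', timeout=5)  # timeout in seconds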
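A timeout only helps if the resulting exception is handled. Here is a minimal sketch of a wrapper that returns None instead of crashing; fetch_json is a hypothetical helper, and the (connect, read) tuple values are arbitrary choices for illustration.

```python
import requests


def fetch_json(url, timeout=(3.05, 10)):
    """GET url and return the parsed JSON body, or None if anything goes wrong.

    timeout may be a single number or a (connect, read) tuple, in seconds.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.Timeout:
        print("request timed out: %s" % url)
    except (requests.exceptions.RequestException, ValueError) as exc:
        # Connection errors, HTTP error statuses, or a body that is not valid JSON
        print("request failed: %s" % exc)
    return None
```

Calling fetch_json('http://127.0.0.1:1024/developer/api/v1.0/all') would then give either the decoded data or None, which a scheduled crawler can simply skip until the next run.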

Proxy settings

If you hit a site too frequently, the server may well block you. requests supports sending requests through proxies.

# Proxies
proxies = {
    'http': 'http://127.0.0.1:1024',
    'https': 'http://127.0.0.1:4000',
}
response = requests.get('http://127.0.0.1:1024/developer/api/v1.0/all', proxies=proxies)

BeautifulSoup

BeautifulSoup is an HTML parsing library for Python, comparable to jsoup in Java.

Installation

BeautifulSoup 3 is no longer developed, so use BeautifulSoup 4 directly.

Mac:

pip3 install beautifulsoup4

Windows:

pip install beautifulsoup4

Installing a parser

I use html5lib, which is implemented in pure Python.

Mac:

pip3 install html5lib

Windows:

pip install html5lib

Basic usage

BeautifulSoup turns a complex HTML document into a tree structure in which every node is a Python object.

Parsing

from bs4 import BeautifulSoup

def get_html_data():
    html_doc = """ <html> <head> <title>WuXiaolong</title> </head> <body> <p>分享 Android 技术，也关注 Python 等热门技术。</p> <p>写博客的初衷：总结经验，记录本身的成长。</p> <p>你必须足够的努力，才能看起来绝不费力！专一！精致！ </p> <p class="Blog"><a href="http://wuxiaolong.me/">WuXiaolong's blog</a></p> <p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">公众号：吴小龙同窗</a> </p> <p class="GitHub"><a href="http://example.com/tillie" class="sister" id="link3">GitHub</a></p> </body> </html> """
    soup = BeautifulSoup(html_doc, "html5lib")

Tag

tag = soup.head
print(tag)  # <head><title>WuXiaolong</title></head>
print(tag.name)  # head
print(tag.title)  # <title>WuXiaolong</title>
print(soup.p)  # <p>分享 Android 技术，也关注 Python 等热门技术。</p>
print(soup.a['href'])  # href attribute of the first a tag: http://wuxiaolong.me/

Note: when several tags match, attribute access returns the first one, as with the p tag here.

Searching

print(soup.find('p'))  # <p>分享 Android 技术，也关注 Python 等热门技术。</p>

find also returns the first matching tag by default, and returns None when no node matches. To target a specific element, such as the WeChat official account entry here, specify an attribute value such as the tag's class:

# class is a Python keyword, so the argument is spelled class_.
print(soup.find('p', class_="WeChat"))
# <p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">公众号：吴小龙同窗</a> </p>

Find all p tags:

for p in soup.find_all('p'):
    print(p.string)
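One thing to watch in the loop above: .string is None whenever a tag has more than one child, which includes a link followed by a stray space. The sketch below contrasts .string with get_text() on a trimmed-down copy of the sample markup. It uses the standard-library "html.parser" instead of html5lib purely so it runs without an extra install, and the shortened text content is my own.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="Blog"><a href="http://wuxiaolong.me/">WuXiaolong\'s blog</a></p>'
    '<p class="WeChat"><a href="https://open.weixin.qq.com/qr/code?username=MrWuXiaolong">WeChat</a> </p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

blog = soup.find("p", class_="Blog")
wechat = soup.find("p", class_="WeChat")

# Exactly one child: .string drills down to the text of the nested <a>
print(repr(blog.string))
# Two children (the <a> and a trailing space): .string gives up and returns None
print(repr(wechat.string))
# get_text() joins all nested text, so it works in both cases
print(repr(wechat.get_text(strip=True)))
```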

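In practice

A while ago a user reported that my personal app had stopped working. I no longer maintain the app, but I should at least keep it running. Most people know its data is scraped (see 《手把手教你作我的app》). One benefit of scraping is that I don't have to manage the data myself; the downside is that if the source site goes down or its HTML structure changes, my parsing fails and the app has no data.

This report made me consider crawling the site's data outright, and since I was teaching myself Python it was a good chance to practice, so I got started. My original plan was to crawl with Python, insert into a local MySQL database, write my own API with Flask, call it from Android with Retrofit, and then insert into bmob through the bmob SDK. That felt laborious and unworkable. I later learned that bmob provides a RESTful API, which solves the big problem: the Python crawler can insert the data directly. What I demonstrate here inserts into a local database; with bmob you would call its RESTful API to insert the data instead.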

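Choosing a site

The demo site I chose is meiriyiwen.com/random . As you can see, each request returns a different article, which is exactly what I take advantage of: I just need to request the page on a schedule, parse the data I need, and insert it into the database.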

Creating the database

I created it directly with Navicat Premium; the command line works too.

Creating the table

Create the article table with pymysql. The table needs the fields id, article_title, article_author, and article_content. The code is below and only needs to be run once.

import pymysql


def create_table():
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn')
    # SQL statement that creates the article table
    sql = '''create table if not exists article ( id int NOT NULL AUTO_INCREMENT, article_title text, article_author text, article_content text, PRIMARY KEY (`id`) )'''
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print('create table success')
    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


if __name__ == '__main__':
    create_table()

Parsing the site

First request the page with requests, then use BeautifulSoup to parse out the nodes you need.

import requests
from bs4 import BeautifulSoup


def get_html_data():
    # GET request; set a timeout so a stalled server cannot block forever
    response = requests.get('https://meiriyiwen.com/random', timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    # Concatenate the paragraphs, keeping their <p> markup
    article_content = ''
    for content in article_contents:
        article_content = article_content + str(content)
    # Print once, after all paragraphs have been joined
    print('article_content=%s' % article_content)
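The parsing above assumes every node exists. If the site changes its HTML, find returns None and the next attribute access raises AttributeError, which is exactly the failure mode described earlier. Below is a hedged sketch of a more defensive variant that works on an HTML string and returns None when the expected structure is missing. parse_article and the sample markup are hypothetical; the sketch uses "html.parser" so it runs without html5lib, and it stores plain text rather than the `<p>` markup the original keeps.

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Extract title, author, and text from an article page, or return None."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("div", id="article_show")
    if article is None:
        return None  # page layout changed or request returned an error page
    title_tag = article.h1
    author_tag = article.find("p", class_="article_author")
    text_div = article.find("div", class_="article_text")
    if title_tag is None or text_div is None:
        return None
    paragraphs = [p.get_text(strip=True) for p in text_div.find_all("p")]
    return {
        "title": title_tag.get_text(strip=True),
        "author": author_tag.get_text(strip=True) if author_tag else "",
        "content": "\n".join(paragraphs),
    }


# Made-up markup that only imitates the id and class names used above
sample = """
<div id="article_show">
  <h1>Sample Title</h1>
  <p class="article_author"><span>Sample Author</span></p>
  <div class="article_text"><p>First paragraph.</p><p>Second paragraph.</p></div>
</div>
"""
result = parse_article(sample)
print(result)
print(parse_article("<html><body>maintenance</body></html>"))
```

Separating the parsing from the HTTP request also makes it possible to test the parser against saved pages.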

Inserting into the database

There is a simple filter here: assuming article titles on this site are unique, a row is not inserted if an article with the same title already exists.

import pymysql


def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn',
                         charset="utf8")
    # Query used to check for an existing title, and the insert statement
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the lookup
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print('--------------《%s》 insert table success-------------' % article_title)
            return True
        else:
            print('--------------《%s》 already exists-------------' % article_title)
            return False

    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)
        return False

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()
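The select-then-insert check can also be pushed into the database with a uniqueness constraint, so the duplicate test and the insert happen in one statement. The sketch below illustrates that idea with the standard-library sqlite3 module in place of MySQL, only so that it runs without a server. The UNIQUE constraint on article_title, the insert_article helper, and the INSERT OR IGNORE statement (SQLite syntax) are all my assumptions, not part of the original setup.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a unique title column (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " article_title TEXT UNIQUE,"
        " article_author TEXT,"
        " article_content TEXT)"
    )
    return conn


def insert_article(conn, title, author, content):
    """Insert a row unless the title already exists; return True if inserted."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO article"
        " (article_title, article_author, article_content) VALUES (?, ?, ?)",
        (title, author, content),  # parameters are bound, never string-formatted
    )
    conn.commit()
    return cursor.rowcount == 1


conn = open_db()
print(insert_article(conn, "Sample Title", "Sample Author", "text"))
print(insert_article(conn, "Sample Title", "Sample Author", "text"))
print(conn.execute("SELECT COUNT(*) FROM article").fetchone()[0])
```

With MySQL the same approach would need a unique index on the title column and MySQL's own syntax for ignoring duplicates, which is not shown here.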

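Scheduling

I set up a timer so the crawl runs again every so often.

import sched
import time


# Initialize the scheduler class from the sched module.
# The first argument is a function that returns the current timestamp;
# the second is a function that blocks until the scheduled time arrives.
schedule = sched.scheduler(time.time, time.sleep)


# Function triggered periodically by the scheduler
def print_time(inc):
    # TODO: do something
    print('to do something')
    # Re-schedule itself so the job keeps repeating
    schedule.enter(inc, 0, print_time, (inc,))


# Default interval: 60 s
def start(inc=60):
    # enter takes four arguments: the delay, the priority (used to order events
    # scheduled for the same time), the function to call, and a tuple of
    # arguments for that function
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == '__main__':
    # Print every 5 s
    start(5)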
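The scheduler above reschedules itself forever. For experiments it can be handy to stop after a fixed number of runs, and to swap in a fake clock so nothing actually sleeps. This is a small sketch of that variation; run_periodically and its parameters are hypothetical names, not part of the sched module.

```python
import sched
import time


def run_periodically(action, interval, repeats,
                     timefunc=time.monotonic, delayfunc=time.sleep):
    """Call action() `repeats` times, waiting `interval` seconds between calls."""
    scheduler = sched.scheduler(timefunc, delayfunc)

    def tick(remaining):
        action()
        if remaining > 1:
            # Schedule the next run only while repetitions remain
            scheduler.enter(interval, 0, tick, (remaining - 1,))

    scheduler.enter(0, 0, tick, (repeats,))
    scheduler.run()  # returns once the queue is empty


# Real clock, short interval: prints three times and then exits
run_periodically(lambda: print("to do something"), interval=0.01, repeats=3)
```

Passing custom timefunc and delayfunc arguments is what makes the function testable: a fake clock can advance instantly instead of sleeping.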

Complete code

import pymysql
import requests
from bs4 import BeautifulSoup
import sched
import time


def create_table():
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn')
    # SQL statement that creates the article table
    sql = '''create table if not exists article ( id int NOT NULL AUTO_INCREMENT, article_title text, article_author text, article_content text, PRIMARY KEY (`id`) )'''
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print('create table success')
    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host='localhost',
                         user='root',
                         password='root',
                         db='python3learn',
                         charset="utf8")
    # Query used to check for an existing title, and the insert statement
    query_sql = 'select * from article where article_title=%s'
    sql = 'insert into article (article_title,article_author,article_content) values (%s, %s, %s)'
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the lookup
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print('--------------《%s》 insert table success-------------' % article_title)
            return True
        else:
            print('--------------《%s》 already exists-------------' % article_title)
            return False

    except Exception as e:  # Roll back if an error occurs
        db.rollback()
        print(e)
        return False

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


def get_html_data():
    # GET request; set a timeout so a stalled server cannot block forever
    response = requests.get('https://meiriyiwen.com/random', timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id='article_show')
    article_title = article.h1.string
    print('article_title=%s' % article_title)
    article_author = article.find('p', class_="article_author").string
    print('article_author=%s' % article_author)
    article_contents = article.find('div', class_="article_text").find_all('p')
    # Concatenate the paragraphs, keeping their <p> markup
    article_content = ''
    for content in article_contents:
        article_content = article_content + str(content)
    # Print once, after all paragraphs have been joined
    print('article_content=%s' % article_content)

    # Insert into the database
    insert_table(article_title, article_author, article_content)


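# Initialize the scheduler class from the sched module.
# The first argument is a function that returns the current timestamp;
# the second is a function that blocks until the scheduled time arrives.
schedule = sched.scheduler(time.time, time.sleep)


# Function triggered periodically by the scheduler
def print_time(inc):
    get_html_data()
    # Re-schedule itself so the crawl keeps repeating
    schedule.enter(inc, 0, print_time, (inc,))


# Default interval: 60 s
def start(inc=60):
    # enter takes four arguments: the delay, the priority (used to order events
    # scheduled for the same time), the function to call, and a tuple of
    # arguments for that function
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == '__main__':
    # Crawl every 5 minutes
    start(60*5)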

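Open question: this only crawls a single article. What about a site with an article list, where clicking an entry opens the article detail page? Presumably you fetch the list first, then loop over the entries, parsing each detail page and inserting it into the database. I haven't worked out the best approach yet, so I'll leave it for a later post.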
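As one possible starting point for that question, and not a worked-out answer, the first half (getting the detail URLs out of a list page) might look like the sketch below. extract_detail_urls, the sample markup, and the example.com URLs are all hypothetical; a real list page would need its own selector instead of "every link on the page".

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_detail_urls(list_html, base_url):
    """Collect absolute detail-page URLs from a list page, without duplicates."""
    soup = BeautifulSoup(list_html, "html.parser")
    urls = []
    for a in soup.find_all("a", href=True):
        absolute = urljoin(base_url, a["href"])  # resolve relative links
        if absolute not in urls:
            urls.append(absolute)
    return urls


# Made-up list page for illustration only
list_html = """
<ul>
  <li><a href="/article/1">First</a></li>
  <li><a href="/article/2">Second</a></li>
  <li><a href="/article/1">First again</a></li>
</ul>
"""
detail_urls = extract_detail_urls(list_html, "https://example.com/list")
print(detail_urls)
```

Each URL in the result would then go through the same request, parse, and insert steps used for the single article, ideally with a pause between requests.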

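Finally

Although I'm learning Python purely as a hobby, I still want to put it to use, otherwise the knowledge fades quickly. Look out for the next Python article.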
