BeautifulSoup: A Quick-Start Introduction to a Handy Web Page Parser

Date: 2019-12-05

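We have covered plenty of crawler examples and techniques before, but most of those articles focused on how to fetch the content of a web page. Today let's talk about what comes next: once you have the content, how do you extract the specific information you need?

A fetched page is usually a str object. To find information in it, the most direct idea is to use the string's find method together with slicing: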

s = '<p>价格：15.7 元</p>'
start = s.find('价格：')
end = s.find(' 元')
print(s[start+3:end])  
# 15.7

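This copes with the very simplest cases, but for anything slightly more complex, writing it this way will wear you out. A more general approach is to use regular expressions: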

import re
s = '<p>价格：15.7 元</p>'
r = re.search(r'[\d.]+', s)  # raw string, so \d is not treated as a string escape
print(r.group())
# 15.7

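Regular expressions are the all-purpose tool for text parsing and can handle almost any situation. Unfortunately, mastering them takes a real learning effort. We started with one web extraction problem; after reaching for regular expressions, we now have two problems.

An HTML document is structured text that follows certain rules, and its structure can be used to simplify extraction. That is why web extraction libraries such as lxml, pyquery, and BeautifulSoup exist, and they are what we normally use to pull information out of pages. Among them, lxml parses very efficiently and supports XPath (a rule syntax for locating information in HTML); pyquery takes its name from jQuery (the well-known front-end JS library) and lets you parse pages with jQuery-like syntax. But the one we are going to talk about today is the remaining one: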

BeautifulSoup

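BeautifulSoup (bs for short below) literally means "beautiful soup". This odd name comes from Alice in Wonderland, which is also why the official site features strange illustrations and uses passages from Alice as test text.

To me, the biggest feature of bs is that it is simple and easy to use. Unlike regular expressions and XPath, you don't have to deliberately memorize a lot of special syntax, even though those can be more efficient and direct. For most Python users, being easy to use matters more than being fast. That is the main reason I use and recommend bs.

Next come a few basic bs methods, enough to get you going as soon as you finish reading. For the "bookmark it and never read it" crowd, here is a "too long; didn't read" summary first: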

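Ships with Anaconda, and can also be installed with pip
You need to specify a parser; different parsers differ in performance and fault tolerance, so results may differ too
Basic workflow: initialize a bs object from text -> locate information with find/find_all or other methods -> output or save
Searches can be chained: locate a block first, then keep searching within it
During development, watch the return type of each method; when something fails, read the error message and print more output
The official documentation is friendly and also available in Chinese; recommended reading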
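
The basic workflow in the summary can be sketched roughly as follows. The HTML snippet, the goods id, and the item and price class names are all made up for illustration; only the BeautifulSoup calls come from the library itself.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, invented for this sketch.
html = """
<div id="goods">
  <p class="item">Apple <span class="price">3.5</span></p>
  <p class="item">Pear <span class="price">4.2</span></p>
</div>
"""

# 1. Initialize a bs object from the text, naming the parser explicitly.
soup = BeautifulSoup(html, 'html.parser')

# 2. Locate a block first, then keep searching inside it.
goods = soup.find(id='goods')
items = goods.find_all('p', class_='item')

# 3. Output (or save) the extracted values.
prices = [float(p.find('span', class_='price').get_text()) for p in items]
print(prices)
```

Narrowing down to the goods block before calling find_all keeps the second search from picking up similar-looking elements elsewhere on the page.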

Installation

Installing with pip is recommended (for pip itself, see the earlier Crossin article on installing third-party Python modules):

pip install beautifulsoup4

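Note that the package name is beautifulsoup4. Without the 4 you get the old version, bs3, which exists only for compatibility and is no longer recommended. Whenever we say bs here, we mean bs4.

bs4 also comes with an Anaconda installation (see the earlier Crossin article introducing Anaconda as a Python data science environment).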

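When using bs you need to specify a "parser":

html.parser - bundled with Python, but less fault tolerant; it may lose part of the content on pages that are not well formed
lxml - fast parsing, requires separate installation
xml - also part of the lxml library, supports XML documents
html5lib - the best fault tolerance, but somewhat slower

Both lxml and html5lib need to be installed separately, but if you use Anaconda they are already included.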
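
To see how parsers can disagree, one option is to feed the same malformed markup to each of them. The sketch below is only an illustration: the broken snippet is deliberately invalid, and any parser that is not installed is skipped by catching bs4's FeatureNotFound exception.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = '<a></p>'  # deliberately invalid markup

results = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # This parser is not installed in the current environment.
        results[parser] = None

for parser, output in results.items():
    print(parser, '->', output)
```

If the outputs differ on your machine, that is the practical reason to name the parser explicitly rather than rely on whichever one bs picks by default.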

Quick start

Let's use the document from the official site as the example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

To start using bs, create a BeautifulSoup object from the text. It is best to name the parser explicitly:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Get a structural element and its attributes:

soup.title  # the title element
# <title>The Dormouse's story</title>

soup.p  # the first p element
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']  # the class attribute of the p element
# ['title']

soup.p.b  # the b element inside the p element
# <b>The Dormouse's story</b>

soup.p.parent.name  # tag name of the p element's parent node
# body
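
Beyond square-bracket access, a tag's attributes can also be read with get and attrs. The following is a small sketch on a made-up one-line document; the story and intro class values are chosen only to show that class comes back as a list.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration.
html = '<p class="story intro"><a href="http://example.com/elsie" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.a
print(link.name)          # tag name
print(link['href'])       # raises KeyError if the attribute is missing
print(link.get('title'))  # returns None if the attribute is missing
print(link.attrs)         # all attributes as a dict
print(link.string)        # the single text node inside the tag
print(soup.p['class'])    # class is multi-valued, so it is a list
```

Using get instead of square brackets is a reasonable default when an attribute may be absent on some pages.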

Not everything can be reached by simply walking the structure. Usually you search with the find and find_all methods:

soup.find_all('a')  # all a elements
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id='link3')  # the element whose id is link3
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

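find and find_all accept several search conditions at once, e.g. find('a', id='link3', class_='sister')
find returns a bs4.element.Tag object, which can itself be searched further. If several elements match, find returns only the first; if none match, it returns None.
find_all returns a list of bs4.element.Tag objects; whether it finds several or none, the result is always a list.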
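
Because find may return None while find_all always returns a list, the two need slightly different handling. Here is a minimal sketch on a shortened version of the sample document; the link9 id and the table search are made up to trigger the "not found" cases.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# Several matches: find returns only the first one.
first = soup.find('a', class_='sister')

# No match: find returns None, so check before using the result.
missing = soup.find('a', id='link9')
name = missing.get_text() if missing is not None else ''

# A Tag can be searched again; find_all gives a list to loop over.
story = soup.find('p', class_='story')
hrefs = [a['href'] for a in story.find_all('a')]

# No match with find_all: an empty list, not None.
empty = soup.find_all('table')

print(first['id'], repr(name), hrefs, len(empty))
```

Forgetting the None check is the usual source of "'NoneType' object has no attribute ..." errors when a page's layout changes.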

Output:

x = soup.find(class_='story')
x.get_text()  # visible text only
# 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.'
x.prettify()  # full content of the element
# '<p class="story">\n Once upon a time there were three little sisters; and their names were\n <a class="sister" href="http://example.com/elsie" id="link1">\n  Elsie\n </a>\n ,\n <a class="sister" href="http://example.com/lacie" id="link2">\n  Lacie\n </a>\n and\n <a class="sister" href="http://example.com/tillie" id="link3">\n  Tillie\n </a>\n ;\nand they lived at the bottom of a well.\n</p>\n'
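
The raw get_text() result keeps the newlines from the source markup. If cleaner text is wanted, get_text also accepts a separator and a strip flag, and stripped_strings yields the individual pieces. The sketch below uses a shortened, made-up snippet.

```python
from bs4 import BeautifulSoup

# Shortened, made-up snippet for illustration.
html = '<p class="story">Names:\n<a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')
x = soup.find(class_='story')

raw = x.get_text()                   # text nodes joined as they are
clean = x.get_text(' ', strip=True)  # strip each piece, join with a space
pieces = list(x.stripped_strings)    # the stripped pieces as a list

print(repr(raw))
print(repr(clean))
print(pieces)
```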

If you have front-end experience and know CSS selectors well, bs has a matching method for you:

soup.select('html head title')
# [<title>The Dormouse's story</title>]
soup.select('p > #link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
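
select always returns a list, while select_one returns the first match or None, much like find_all and find. Here is a small sketch, again on a shortened sample; the a.brother selector is made up to show the no-match case.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# Direct children of p.story that are a.sister elements.
hrefs = [a['href'] for a in soup.select('p.story > a.sister')]

# First match only, or None when nothing matches.
first = soup.select_one('a.sister')
nothing = soup.select_one('a.brother')

print(hrefs, first.get_text(), nothing)
```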

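That was a minimal quick-start introduction to BeautifulSoup, and you should now have a first idea of what bs can do. If you plan to use it in development, it is worth reading the official documentation as well. It is clearly written and has a Chinese version; reading just the first small part is enough to put it to use in your code. For finer details, look up specific methods and parameters as you need them.

Chinese documentation:

Beautiful Soup 4.2.0 documentation (www.crummy.com)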

Search for and follow: Crossin的编程教室