A Parsing Library for Web Scraping: BeautifulSoup

Date: 2019-11-07

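A parsing library lets you define rules for pulling the content you want out of a page while scraping. Common choices are regular expressions (the re module), beautifulsoup, and pyquery. Regular expressions are fully capable of matching whatever we want to scrape, but they are tedious to write, so here we use beautifulsoup.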

beautifulsoup

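Beautiful Soup is a Python library for extracting data from HTML or XML files. Working with the parser of your choice, it provides idiomatic ways to navigate, search, and modify a document. Beautiful Soup can save you hours or even days of work. Beautiful Soup 3 is no longer being developed; the official site recommends Beautiful Soup 4 for current projects.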

Installation:

pip install beautifulsoup4

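Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. One of them is lxml, which we recommend for everyday use. Another option is html5lib, a pure-Python parser that parses pages the same way a browser does: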

pip install lxml
pip install html5lib

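The table below lists the main parsers with their advantages and disadvantages. The official documentation recommends lxml because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2 you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not robust enough.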

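Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; tolerant of malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, ["lxml", "xml"])` `BeautifulSoup(markup, "xml")`	Fast; the only parser that supports XML	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Best tolerance; parses documents the way a browser does; produces HTML5 documents	Slow; requires an external Python dependency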
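
Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below is one way to do that; `make_soup` and its `preferred` argument are hypothetical names introduced here for illustration, and the only library behaviour it relies on is that BeautifulSoup raises `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper, not part of bs4. Returns (soup, parser_name).
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello, <b>world</b></p>")
print(used)           # whichever parser was actually available
print(soup.b.string)  # world
```

Keep in mind that different parsers can build different trees from the same malformed markup, so a fallback like this is best suited to reasonably well-formed pages.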

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Basic usage

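Error tolerance: BeautifulSoup's tolerance means that it can cope with incomplete HTML. When it parses markup whose tags were left unclosed, it fills in the missing closing tags, returns a BeautifulSoup object, and can print the result with standard indentation.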

For example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup=BeautifulSoup(html_doc,'lxml') # tolerant of broken markup; the second argument names the parser, here lxml
res=soup.prettify() # re-indent the document for structured output
print(res)
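
To see the tag completion in isolation, here is a minimal sketch with a deliberately truncated fragment. It uses the built-in `html.parser` instead of lxml so that it runs without extra installs; that swap is an assumption of this sketch, and other parsers may additionally wrap the fragment in html and body tags.

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in the input.
broken = "<p>An unclosed paragraph with <b>bold text"
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)
print(repaired)  # <p>An unclosed paragraph with <b>bold text</b></p>
```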

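Navigating the document tree

Navigating the tree means selecting directly by tag name. It is fast, but when several tags share the same name only the first one is returned

1. Usage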

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

print(soup.p) # when several tags share the name, only the first is returned

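2. Get a tag's name ====> soup.p.name
3. Get a tag's attributes ====> soup.p.attrs
4. Get a tag's content ====> soup.p.string # the text when p holds a single string, otherwise None
5. Nested selection ====> soup.body.a.string
6. Children and descendants ====> soup.p.contents soup.p.descendants
7. Parent and ancestors ====> soup.a.parent soup.a.parents

8. Siblings ====>

soup.a.next_sibling # the next sibling
soup.a.previous_sibling # the previous sibling
list(soup.a.next_siblings) # all following siblings; next_siblings itself is a generator
list(soup.a.previous_siblings) # all preceding siblings; previous_siblings itself is a generator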
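
One detail of sibling navigation that often surprises: the neighbour of a tag is frequently a piece of text (punctuation or a line break), not the next tag. The sketch below shows this on a shortened version of the sample markup and uses `find_next_sibling` to skip over the text. It parses with the built-in `html.parser` so it runs without lxml, which is an assumption of this sketch.

```python
from bs4 import BeautifulSoup

html = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.a
text_sibling = first.next_sibling            # the string ",\n", not the next <a>
tag_sibling = first.find_next_sibling("a")   # skips strings, returns <a id="link2">
following_ids = [a["id"] for a in first.find_next_siblings("a")]

print(repr(text_sibling))
print(tag_sibling["id"], following_ids)
```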

Worked example:

# Navigating the tree: select directly by tag name. Fast, but when several tags share the same name only the first is returned
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

#1. Usage
from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')
# soup=BeautifulSoup(open('a.html'),'lxml') # parse an HTML file instead

print(soup.p) #<p class="title" id="my p"><b class="boldest" id="bbb">The Dormouse's story</b></p>
# even when several tags share the name, only the first is returned


#2. Get a tag's name
print(soup.p.name)#p

#3. Get a tag's attributes
print(soup.p.attrs)#{'id': 'my p', 'class': ['title']}

#4. Get a tag's content
print(soup.p.string) #The Dormouse's story     the text when p holds a single string, otherwise None
print(soup.p.strings) # a generator over all the strings under p
print(soup.p.text) # all the text under p
for line in soup.stripped_strings: # every string in the document, surrounding whitespace removed
    print(line)
# Output:
#     The Dormouse's story
#     The Dormouse's story
#     Once upon a time there were three little sisters; and their names were
#     Elsie
#     ,
#     Lacie
#     and
#     Tillie
#     ;
#     and they lived at the bottom of a well.
#     ...


'''
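If a tag has more than one child, .string cannot tell which child's text it should return, so it evaluates to None. If the tag has exactly one child, .string returns that child's text. For the structure below, soup.p.string is None, while soup.p.strings still finds all the text
<p id='list-1'>
    hahaha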
    <a class='sss'>
        <span>
            <h1>aaaa</h1>
        </span>
    </a>
    <b>bbbbb</b>
</p>
'''

#5. Nested selection
print(soup.head.title.string)#The Dormouse's story
print(soup.body.a.string)#Elsie


#6. Children and descendants
print(soup.p.contents) # [<b class="boldest" id="bbb">The Dormouse's story</b>]    list of p's direct children
print(soup.p.children) # an iterator over p's direct children

for i,child in enumerate(soup.p.children):
    print(i,child)
# Output:
#     0 <b class="boldest" id="bbb">The Dormouse's story</b>

print(soup.p.descendants) # a generator over every descendant of p, tags and strings alike
for i,child in enumerate(soup.p.descendants):
    print(i,child)
# Output:
#     0 <b class="boldest" id="bbb">The Dormouse's story</b>
#     1 The Dormouse's story

#7. Parent and ancestors
print(soup.a.parent) # the parent of the first a tag
"""
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
"""
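print(soup.a.parents) # a generator over all of a's ancestors: its parent, its parent's parent, and so on


#8. Siblings

print(soup.a.next_sibling) # the next sibling
print(soup.a.previous_sibling) # the previous sibling

print(list(soup.a.next_siblings)) # all following siblings; next_siblings itself is a generator
print(list(soup.a.previous_siblings)) # all preceding siblings; previous_siblings itself is a generator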
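
The difference between `.string`, `.stripped_strings`, and `get_text()` is easiest to see on the nested structure described in the note above. The following is a minimal sketch; the markup is a lightly adapted copy of that structure, and it is parsed with the built-in `html.parser` (an assumption made so the snippet runs without lxml).

```python
from bs4 import BeautifulSoup

html = """
<p id="list-1">
    hello
    <a class="sss"><span><h1>aaaa</h1></span></a>
    <b>bbbbb</b>
</p>
"""
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.string)                     # None: p has several children
print(p.a.string)                   # aaaa: a -> span -> h1 is a single-child chain
print(list(p.stripped_strings))     # ['hello', 'aaaa', 'bbbbb']
print(p.get_text(" ", strip=True))  # hello aaaa bbbbb
```

In practice `get_text()` or `stripped_strings` is usually the safer choice when the structure of a tag is not known in advance.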

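Searching the document tree

Searching the tree mainly relies on filters, find and find_all, and CSS selectors. Pay attention to the difference between find and find_all.

Filters are comparatively weak at selecting, but they are fast
find and find_all are the methods used most often in practice
Anyone at ease with front-end CSS can use CSS selectors instead

The five filters

A filter is a way of describing the content the scraper should pick out. There are five kinds: string, regular expression, list, True, and custom function. Each is described below

Suppose we fetched this HTML from a website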

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

1. String filter

A string filter matches on the tag name

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')
# find_all is used here: it finds every match and returns them as a list. It is covered in detail later

print(soup.find_all('b'))#[<b class="boldest" id="bbb">The Dormouse's story</b>]

print(soup.find_all('a'))
"""
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

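2. Regular expressions

Regular expressions can be used anywhere a filter is accepted; just import the re module

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

import re
print(soup.find_all(re.compile('^b'))) # every tag whose name starts with b: this matches both body and b. Each tag is returned whole, contents included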

"""
[<body>
<p class="title" id="my p"><b class="boldest" id="bbb">The Dormouse's story</b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>, 
<b class="boldest" id="bbb">The Dormouse's story</b>]
"""

3. List filter

A list filter replaces the single string of the string filter with a list of strings. Any tag that matches one of the strings in the list is returned

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

print(soup.find_all(['a','b'])) # every a tag and every b tag in the document
"""
[<b class="boldest" id="bbb">The Dormouse's story</b>,
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

"""

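4. True filter

The True filter is the broadest one: it matches anything that has the name or attribute being tested, whatever its value

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

print(soup.find_all(name=True)) # anything that is a tag
print(soup.find_all(attrs={"id":True})) # every tag that has an id attribute
print(soup.find_all(name='p',attrs={"id":True})) # every p tag that has an id attribute

# find every tag and print its name
for tag in soup.find_all(True):
    print(tag.name)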

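5. Custom function

A custom function is a filter you write yourself. When none of the other filters fits, you can pass a function instead. The function must take exactly one argument, the tag, and return True for the tags to keep.

Custom functions are not needed often, but it is worth knowing they exist for the special cases where nothing else will do.

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')
# custom function: find every p tag that has a class attribute but no id attribute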
def has_class_but_no_id(tag):
    res = (tag.name == 'p' and tag.has_attr("class") and not tag.has_attr('id'))
    return res

print(soup.find_all(has_class_but_no_id))

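find and find_all

find() and find_all() are called the same way; they differ only in how far they search and in what they return

===>find() returns the first element in the document that matches, as the element itself. It returns None when nothing matches

===>find_all() returns every matching element in the document as a list. It returns an empty list when nothing matches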

find( name , attrs={} , recursive=True , text=None , **kwargs )
find_all( name , attrs={} , recursive=True , text=None , limit=None , **kwargs )
#find_all takes one more parameter than find: limit, covered below
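
The different return values matter most when nothing matches: indexing into the result of `find()` without a check raises an error, while looping over the result of `find_all()` simply does nothing. Here is a minimal sketch of both cases, using a two-link fragment invented for this example and the built-in `html.parser` (an assumption so it runs without lxml).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a href="/a">A</a><a href="/b">B</a></p>', "html.parser")

first = soup.find("a")              # a single Tag
every = soup.find_all("a")          # a list of Tags
missing_one = soup.find("img")      # None
missing_all = soup.find_all("img")  # an empty list

# Check for None before reading attributes from a find() result
img = soup.find("img")
src = img["src"] if img is not None else None

# A find_all() result can be iterated without any check
hrefs = [a["href"] for a in soup.find_all("a")]
print(first["href"], len(every), missing_one, len(missing_all), src, hrefs)
```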

The parameters are described one by one below

1. The name parameter

name is the tag name. It accepts any of the five filters described above

from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html_doc,'lxml')
# name accepts any kind of filter: string, regular expression, list, function, or True
print(soup.find_all(name=re.compile('^t')))#[<title>The Dormouse's story</title>]
print(soup.find(name=re.compile('^t')))#<title>The Dormouse's story</title>

2. The attrs parameter

attrs filters on a tag's attributes. Its values can likewise be any kind of filter

from bs4 import BeautifulSoup

soup=BeautifulSoup(html_doc,'lxml')
print(soup.find_all('p',attrs={'class':'story'})) # a list of every p tag whose class includes story
print(soup.find('p',attrs={'class':'story'})) # the first p tag that matches

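3. The recursive parameter

recursive defaults to True, which means a search on a tag covers all of its descendants. To search only the direct children and skip deeper levels, set it to False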

from bs4 import BeautifulSoup

soup=BeautifulSoup(html_doc,'lxml')

print(soup.html.find_all('a')) # a list of all three a tags
print(soup.html.find('a'))#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.html.find_all('a',recursive=False))#[]
print(soup.html.find('a',recursive=False))#None
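
The example above only shows the empty results. To make the effect of `recursive=False` more visible, the sketch below also lists what the direct children of html actually are. The tiny document is made up for this example and parsed with the built-in `html.parser` (an assumption so it runs without lxml).

```python
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# title is a grandchild of html, so only the recursive search finds it
deep = soup.html.find_all("title")
shallow = soup.html.find_all("title", recursive=False)

# With recursive=False only the direct children are examined
children = [tag.name for tag in soup.html.find_all(True, recursive=False)]

print(len(deep), len(shallow), children)  # 1 0 ['head', 'body']
```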

4. The text parameter

text searches by text content. It is rarely used on its own; it usually accompanies name or attrs to narrow the search further

from bs4 import BeautifulSoup

soup=BeautifulSoup(html_doc,'lxml')
# find the text Elsie; not very useful alone, so it is usually combined with the two parameters above
print(soup.find_all(text='Elsie'))#['Elsie']
print(soup.find(text='Elsie'))#'Elsie'
# find the a tag whose text is Elsie
print(soup.find_all('a',text='Elsie'))#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.find('a',text='Elsie'))#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
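
Since Beautiful Soup 4.4.0 the official documentation calls this argument `string`; `text` is the older name for the same thing. The sketch below uses the newer spelling and also shows that the value can be a regular expression. It assumes bs4 4.4.0 or later, uses a fragment invented for this example, and parses with the built-in `html.parser` so it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

html = '<p>Once upon a time <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Without a tag name the matches are the text nodes themselves
strings = soup.find_all(string=re.compile("^L"))
# With a tag name the matches are tags whose .string equals the value
tags = soup.find_all("a", string="Elsie")

print(strings)                      # ['Lacie']
print([tag["id"] for tag in tags])  # ['link1']
```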

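5. **kwargs

Search conditions given as keyword arguments: the key is an attribute name and the value is a filter. According to the official documentation, an attribute can be filtered with a string, a regular expression, a list, a function, or True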

from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html_doc,'lxml')
print(soup.find_all(id=re.compile('my')))#[<p class="title" id="my p"><b class="boldest" id="bbb">The Dormouse's story</b></p>]
print(soup.find_all(href=re.compile('lacie'),id=re.compile(r'\d')))#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.find_all(id=True)) # tags that have an id attribute

### Note: to search by class the keyword is class_, because class is a reserved word in Python. class_=value accepts any of the five filters
print(soup.find_all('a',class_='sister')) # a tags with the class sister
print(soup.find_all('a',class_='sister ssss')) # a tags whose class attribute is exactly "sister ssss"; the same classes in a different order do not match
print(soup.find_all(class_=re.compile('^sis'))) # every tag with a class that starts with sis
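
How `class_` treats tags with several classes is worth a closer look. The sketch below, on a fragment invented for this example, shows matching on one class, matching on the exact attribute string, the order sensitivity mentioned above, a CSS selector as an order-independent alternative, and a function used as the class filter. It uses the built-in `html.parser` so it runs without lxml, and `select` assumes a bs4 version that ships with the soupsieve package (4.7 or later).

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title">Heading</p>'
    '<a class="sister big" id="x">X</a>'
    '<a class="sister" id="y">Y</a>'
)
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Collect the id attribute of each tag (small helper for this example)."""
    return [tag["id"] for tag in tags]


one_class = ids(soup.find_all("a", class_="sister"))        # any tag carrying the class
exact = ids(soup.find_all("a", class_="sister big"))        # exact attribute string
wrong_order = ids(soup.find_all("a", class_="big sister"))  # same classes, other order
by_selector = ids(soup.select("a.big.sister"))              # CSS: order does not matter

# A function receives each class value, or None when the tag has no class
five_letters = [
    tag.name
    for tag in soup.find_all(class_=lambda c: c is not None and len(c) == 5)
]

print(one_class, exact, wrong_order, by_selector, five_letters)
```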

Note: some attribute names cannot be written as keyword arguments, but they can still be searched through attrs. HTML5 data-* attributes are one example

res = BeautifulSoup('<div data-foo="value">foo!</div>','lxml')
#print(res.find_all(data-foo="value")) # SyntaxError: keyword can't be an expression
# Passing a dictionary through the attrs parameter of find_all() works for such attributes:
print(res.find_all(attrs={"data-foo": "value"}))# [<div data-foo="value">foo!</div>]

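6. The limit parameter

limit caps the number of results. If the document is very large and we do not need every match, collecting all of them makes the search needlessly slow. Suppose we only want the first 3 matching a tags while the document contains 200 of them: limit restricts how many results are returned, much like LIMIT in SQL.

find_all() has a limit parameter and find() does not because find() only ever returns the first result, so there is nothing to limit.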

from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html_doc,'lxml')

print(soup.find_all('a',limit=3))

Extension:

find() and find_all() are probably the most frequently used methods in Beautiful Soup, so each has a shorthand form

from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html_doc,'lxml')

soup.find_all("a")
soup("a")#find_all方法的简写版本

soup.find("head").find("title")# <title>The Dormouse's story</title>
soup.head.title#find方法的简写版本

soup.title.find_all(text=True)#简写了find的版本
soup.title(text=True)#find和find_all均简写了的版本

CSS selectors

CSS selectors work much like the selectors CSS itself uses to target tags. The essentials are .class and #id

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

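#1. CSS selectors
print(soup.select('.sister')) # every tag with the class sister
print(soup.select('p.story .sister')) # .sister tags inside a p with the class story

print(soup.select('#link1')) # the tag with the id link1
print(soup.select('p.story #link1')) # the same tag, reached through its parent

print(soup.select('p.title #bbb.boldest')) # an id and a class combined on one tag

print(soup.select('p.story')[0].select('.sister')) # select can be called again on a result, but there is no need: one selector can express the whole chain

# 2. Get attributes
print(soup.select('#link1')[0].attrs)

# 3. Get content
print(soup.select('#link1')[0].get_text())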
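
Beyond .class and #id, a few other selector forms are handy for scraping. The sketch below shows the child combinator, an attribute selector, and `select_one`, which returns the first match or None instead of a list. It assumes a bs4 version that ships with the soupsieve package (4.7 or later), uses a fragment invented for this example, and parses with the built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
<p class="footer"><a href="/about" id="about">About</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Child combinator: a tags directly inside p.story
child_ids = [a["id"] for a in soup.select("p.story > a")]

# Attribute selector: href ending in "tillie"
tillie = soup.select_one('a[href$="tillie"]')

# select_one returns None when nothing matches, so no [0] indexing is needed
missing = soup.select_one("#no-such-id")

print(child_ids)          # ['link1', 'link3']
print(tillie.get_text())  # Tillie
print(missing)            # None
```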

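Of course, BeautifulSoup is a large, mature module with far more than these few methods, but the ones above are more than enough for scraping.

BeautifulSoup can modify the document tree as well as search it, but scraping has no use for modification, so it is not covered here.

For details, see the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html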