Using the Python BeautifulSoup Library

Date: 2019-11-11

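BeautifulSoup is a Python library for extracting data from HTML or XML files. It uses a parser to turn a document into a navigable tree that is easy to understand, which makes searching and modifying the document convenient.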

BeautifulSoup 3 is no longer developed. BeautifulSoup 4 is now recommended, and it has been moved into the bs4 package.

# Import it before use
from bs4 import BeautifulSoup
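
As a minimal sketch of what the import makes possible, the snippet below parses a small string and reads two values from it. The sample markup and the variable names are made up for illustration; "html.parser" is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# "html.parser" is part of the standard library, so nothing extra is installed.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text inside <title>
first_paragraph = soup.p.string  # text inside the first <p>
print(title_text, "|", first_paragraph)
```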

Parsers

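BeautifulSoup 4 is commonly used with four parsers. The third-party ones (lxml and html5lib) must be installed before use, while html.parser ships with Python: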

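# Installation commands differ by system
$ apt-get install python-lxml   # the package is named python3-lxml on current Debian/Ubuntu releases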
$ easy_install lxml
$ pip install lxml

# In PyCharm you can first write "import xxx", click Install when the error is flagged, then delete the import statement once installation finishes

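Comparison of parser advantages and disadvantages
Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(DocumentName, "html.parser")	Built into Python, no separate install; moderate speed; good tolerance of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(DocumentName, "lxml")	Fast; good tolerance of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(DocumentName, "xml") or BeautifulSoup(DocumentName, ["lxml","xml"])	Fast; the only parser that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(DocumentName, "html5lib")	Best fault tolerance; parses documents the way a browser does; produces HTML5-format documents	Slow; depends on an external Python library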
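
Because only html.parser is guaranteed to be present, one possible way to act on this table is to try the faster parsers first and fall back. The helper name make_soup and its preference order are assumptions made for this sketch, not part of the library; FeatureNotFound is the exception BeautifulSoup raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<p>Hello</p>")
print(parser_used, soup.p.string)
```

Naming the parser explicitly, rather than omitting it, also keeps results consistent between machines that have different parsers installed.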

Results produced by the different parsers:

# Results for well-formed HTML
htmldoc = "<a><p></p></a>"
print("None        :",BeautifulSoup(htmldoc))
print("html.parser :", BeautifulSoup(htmldoc, "html.parser"))
print("lxml        :", BeautifulSoup(htmldoc, "lxml"))
print("xml         :", BeautifulSoup(htmldoc, "lxml-xml"))
print("html5lib    :", BeautifulSoup(htmldoc, "html5lib"))

"""
Result:
None        : <html><body><a><p></p></a></body></html>
html.parser : <a><p></p></a>
lxml        : <html><body><a><p></p></a></body></html>
xml         : <?xml version="1.0" encoding="utf-8"?>
              <a><p/></a>
html5lib    : <html><head></head><body><a><p></p></a></body></html>
"""

# Results for malformed HTML
htmldoc = "<a></p></a>"
print("None        :",BeautifulSoup(htmldoc))
print("html.parser :", BeautifulSoup(htmldoc, "html.parser"))
print("lxml        :", BeautifulSoup(htmldoc, "lxml"))
print("xml         :", BeautifulSoup(htmldoc, "lxml-xml"))
print("html5lib    :", BeautifulSoup(htmldoc, "html5lib"))

"""
Result:
None        : <html><body><a></a></body></html>
html.parser : <a></a>
lxml        : <html><body><a></a></body></html>
xml         : <?xml version="1.0" encoding="utf-8"?>
              <a/>
html5lib    : <html><head></head><body><a><p></p></a></body></html>
"""

html5lib completes every missing tag and adds html, head, and body, producing standard HTML. The default parser, html.parser, and lxml simply ignore the erroneous tag.
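
The behaviour described for html.parser can be checked with the built-in parser alone. This is a small sketch using the same malformed string as above; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

broken = "<a></p></a>"  # a stray </p> with no matching opening tag

# html.parser drops the unmatched closing tag instead of inventing a <p>.
repaired = str(BeautifulSoup(broken, "html.parser"))
print(repaired)
```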

Encoding

Every HTML or XML document has its own encoding, but once BeautifulSoup has parsed it the document is converted to Unicode, and output is encoded as UTF-8 by default.

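This works because BeautifulSoup uses an automatic encoding-detection library to identify the document's encoding and convert it to Unicode. Detection occasionally goes wrong; .original_encoding shows which encoding was used.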

If the encoding is known in advance, passing it as the from_encoding parameter avoids wrong guesses and can also speed up parsing.

htmldoc = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(htmldoc, from_encoding="iso-8859-8")
print(soup.h1)
print(soup.original_encoding)


"""
Result:
<h1>םולש</h1>
iso-8859-8
"""

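The same idea as a self-contained sketch, assuming the built-in html.parser (the original example does not name a parser). The exact spelling reported by original_encoding may vary between BeautifulSoup versions, so the snippet prints it rather than relying on one form.

```python
from bs4 import BeautifulSoup

raw = b"<h1>\xed\xe5\xec\xf9</h1>"  # Hebrew text encoded as ISO-8859-8

# from_encoding tells BeautifulSoup which encoding to try first.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-8")

decoded = soup.h1.string           # a Unicode string after parsing
detected = soup.original_encoding  # the encoding that was actually used
print(decoded, detected)
```
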
Specifying the output encoding:

htmldoc = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(htmldoc, from_encoding="iso-8859-8")
print(soup.prettify("latin-1"))

"""
Result:
b'<h1>\n &#1501;&#1493;&#1500;&#1513;\n</h1>'
"""

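A short sketch of what happens to characters the target encoding cannot represent, again assuming html.parser. It uses Tag.encode() instead of prettify() so that no indentation is involved; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

raw = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-8")

# UTF-8 can represent Hebrew, so the characters are written out directly.
as_utf8 = soup.h1.encode("utf-8")

# Latin-1 cannot, so each character becomes a numeric character reference.
as_latin1 = soup.h1.encode("latin-1")
print(as_utf8)
print(as_latin1)
```
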
Traversing the document tree:

1. Comments, of type <class 'bs4.element.Comment'>, and replacing a Comment's content

from bs4 import BeautifulSoup, CData

htmldoc = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(htmldoc)
comment = soup.b.string
print(comment)
print(type(comment))
print(soup.b)
print(soup.b.prettify())  # a Comment is printed with its special <!-- --> formatting

# Replace the Comment with a CData block
cdata = CData("A CData block")
comment.replace_with(cdata)
print(soup.b.prettify())

"""
Result:
Hey, buddy. Want to buy a used parser?
<class 'bs4.element.Comment'>
<b><!--Hey, buddy. Want to buy a used parser?--></b>
<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
<b>
<![CDATA[A CData block]]>
</b>
"""

CData must be imported before use: from bs4 import CData
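
Since a Comment is a distinct type, one common use is to collect or strip every comment in a document. The following is a sketch under the assumption that the installed BeautifulSoup supports the string argument (4.4.0 or later); the sample markup is invented.

```python
from bs4 import BeautifulSoup, Comment

html = "<div><!-- first note --><p>Visible text</p><!-- second note --></div>"
soup = BeautifulSoup(html, "html.parser")

# A function passed as `string` is called with each string node in the tree.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [c.strip() for c in comments]

# extract() removes each comment from the tree.
for c in comments:
    c.extract()

cleaned = str(soup)
print(comment_texts)
print(cleaned)
```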

2. soup.tagName returns <class 'bs4.element.Tag'>: the first tagName tag in the document, equivalent to soup.find(tagName)

soup.div  # the first div tag in the document
soup.a  # the first a tag in the document
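
A brief sketch of the equivalence, including the case where the tag does not exist. The markup is a made-up example.

```python
from bs4 import BeautifulSoup

html = '<div><a href="/first">First</a><a href="/second">Second</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a        # attribute-style access
same_link = soup.find("a") # explicit call, same result
missing = soup.table       # no <table> in the document, so this is None

print(first_link, missing)
```

Because a missing tag yields None, chained access such as soup.table.tr raises AttributeError, so a None check is worth adding when the structure is not guaranteed.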

3. soup.tagName.get_text() / soup.tagName.text return <class 'str'>: the text content of the tag. They work on every object produced by BeautifulSoup.

soup.a.get_text()
soup.a.text
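
get_text() also accepts a separator and a strip flag, which helps when the text spans several nested tags. A small sketch with invented markup:

```python
from bs4 import BeautifulSoup

html = "<div><h1> Title </h1><p> Body text </p></div>"
soup = BeautifulSoup(html, "html.parser")

raw_text = soup.div.get_text()                   # strings joined as-is
same_text = soup.div.text                        # .text is shorthand for get_text()
tidy_text = soup.div.get_text(" ", strip=True)   # strip each piece, join with a space

print(repr(raw_text))
print(repr(tidy_text))
```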

4. soup.tagName.tagName["AttributeName"] returns an attribute value of a tag. Chain tags with . to descend level by level, and read an attribute with ["attribute name"].

soup.div.a['href']
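
Square-bracket access raises KeyError when the attribute is absent, so .get() is a safer choice for optional attributes. The sketch below uses invented markup and also shows that class is returned as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

html = '<div><a class="sister big" href="http://example.com/elsie" id="link1">Elsie</a></div>'
soup = BeautifulSoup(html, "html.parser")

link = soup.div.a
href = link["href"]        # raises KeyError if the attribute is missing
classes = link["class"]    # multi-valued attribute, returned as a list
title = link.get("title")  # None when the attribute is missing

print(href, classes, title)
```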

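5. soup.ul.contents returns a <class 'list'> whose elements can be accessed by index

Elements of the list are of type <class 'bs4.element.Tag'> or <class 'bs4.element.NavigableString'>

If a tag has a single child, .string returns its text; if it has more than one child, .string returns None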

soup.ul.contents
type(soup.ul.contents) #<class 'list'>
soup.ul.contents[0].string
type(soup.ul.contents[0])
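
The three points above can be seen together in a short sketch. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>One</li><li>Two</li></ul>", "html.parser")

items = soup.ul.contents            # direct children of <ul>
first_text = items[0].string        # <li> has a single child, so .string works
whole_list_string = soup.ul.string  # <ul> has two children, so .string is None

# Mixed content: a text node followed by a tag.
mixed = BeautifulSoup("<p>Hi <b>there</b></p>", "html.parser")
child_types = [type(child).__name__ for child in mixed.p.contents]

print(len(items), first_text, whole_list_string, child_types)
```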

6. find_all(name, attrs, recursive, text, limit, **kwargs) returns <class 'bs4.element.ResultSet'> (the text argument is named string from Beautiful Soup 4.4.0 onward)

# Find all tags named tagName
soup.find_all("tagName")
soup.find_all("a")

# Find all tags named tagName whose class is className
soup.find_all("tagName", "className")
soup.find_all("div", "Score")

# Find all tags whose id is idName
soup.find_all(id = "idName")

# Several keyword arguments can be combined to filter on multiple attributes at once
import re
soup.find_all(href=re.compile("elsie"))
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# Some attributes cannot be used as keyword arguments, such as HTML5 data-* attributes
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
# data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
# Pass a dictionary through the attrs parameter to search for tags with such attributes
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

# Find all tags that have an id attribute
soup.find_all(id = True)

# Find all tags named tagName whose class is "className"
soup.find_all("tagName", class_ = "className")
# Filter on the CSS class with a regular expression
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

# Search for strings by their text
soup.find_all(text = "textContent")
soup.find_all(text = ["Content1", "Content2"])
# Filter text with a regular expression
soup.find_all(text=re.compile("Dormouse"))

# Limit the number of results
soup.find_all("tagName", limit = 2)
elemSet = soup.find_all("div", limit = 2)
# Loop over each element
for item in elemSet:
    print(item)
    print(item.text)
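
The filters listed above can be tried end to end on a small document. This is a sketch, not the only way to call find_all: the markup is a shortened, invented variant of the "three sisters" sample, and it uses string instead of text on the assumption that Beautiful Soup 4.4.0 or later is installed.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
<div data-foo="value">foo!</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                               # by tag name
first_two = soup.find_all("a", class_="sister", limit=2)     # by class, capped at 2
elsie = soup.find_all(href=re.compile("elsie"), id="link1")  # two attributes at once
data_divs = soup.find_all(attrs={"data-foo": "value"})       # data-* needs attrs
with_id = soup.find_all(id=True)                             # any tag that has an id
dormouse = soup.find_all(string=re.compile("Dormouse"))      # matching text nodes

print(len(all_links), [a["id"] for a in first_two])
print(elsie, data_divs, len(with_id), dormouse)
```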

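7. Searching with CSS selectors returns a <class 'list'> whose elements are of type <class 'bs4.element.Tag'> (recent releases return a ResultSet, which is a subclass of list)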

htmldoc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(htmldoc, "lxml")

# Find by tag name
soup.select("title")

# Find tags nested at any depth beneath another tag
soup.select("html head title")
soup.select("head title")
soup.select("body a")

# Find direct children of a tag
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")

# Find by CSS class name
soup.select(".sister")

# Find by id
soup.select("#link1")

# Find by attribute
soup.select("a[href]")
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')
type(soup.select("a[href]")) # <class 'list'>

# Loop over the result to get each tag
links = soup.select("a[href]")  # named "links" to avoid shadowing the built-in list
for item in links:
    print(item)
    print(type(item)) # <class 'bs4.element.Tag'>
    print(item.text)
    print(item.string)
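
A runnable sketch of a few of these selectors, together with select_one(), which returns only the first match or None. The markup is a shortened, invented variant of the sample document, and html.parser is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

second = soup.select("p > a:nth-of-type(2)")  # second <a> directly under <p>
by_suffix = soup.select('a[href$="tillie"]')  # href ends with "tillie"
first_sister = soup.select_one(".sister")     # first match only
no_match = soup.select_one("table")           # None when nothing matches

print([a["id"] for a in second], by_suffix[0]["id"])
print(first_sister.get_text(), no_match)
```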