Last time, while introducing regular expressions, I shared a crawler walkthrough: scraping all the books, links, authors, and publication dates from the Douban homepage. In that walkthrough we parsed the page source with regular expressions, and overall those expressions got fairly complex. That is why today's protagonist, BeautifulSoup, comes in: a flexible and convenient HTML parsing library that is efficient and supports multiple parsers. With BeautifulSoup you can extract information from a page without writing a single regular expression.
Install it with pip:

```
pip install beautifulsoup4
```
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Versions before Python 2.7.3 / 3.2.2 handle malformed documents poorly |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; tolerant of malformed documents; the most commonly used | Requires the C-based lxml library |
lxml XML parser | BeautifulSoup(markup, "xml") | Very fast; the only parser with XML support | Requires the C-based lxml library |
html5lib | BeautifulSoup(markup, "html5lib") | Best error tolerance; parses a document the way a browser does; produces valid HTML5 | Very slow; external Python dependency |
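To get a quick feel for the differences, here is a minimal sketch that parses the same broken fragment with each HTML parser (the lxml and html5lib lines assume those packages are installed):

```python
from bs4 import BeautifulSoup

fragment = "<p>Hello"  # deliberately unclosed tag

print(BeautifulSoup(fragment, "html.parser").p)  # built in, no extra install
print(BeautifulSoup(fragment, "lxml").p)         # pip install lxml
print(BeautifulSoup(fragment, "html5lib").p)     # pip install html5lib
```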
Below is an incomplete HTML snippet: neither the body tag nor the html tag is closed.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
复制代码
Next, parse the HTML above with the lxml parser:
```python
from bs4 import BeautifulSoup  # import the package

soup = BeautifulSoup(html, 'lxml')  # create a BeautifulSoup object with the lxml parser
print(soup.prettify())    # pretty-print the document; missing tags are filled in
print(soup.title.string)  # print the text inside the <title> tag
```
Below is the prettified, error-corrected result, followed by the extracted title. Note that both the html and body tags were completed:
```
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story
```
#### (1) Selecting elements

Still using the HTML snippet from above:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
```
The result:
```
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
```
Notice that only one p tag was printed even though the HTML contains three. That is the defining behavior of tag selectors: when multiple tags match, only the first one is returned.
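A quick sanity check of that claim (a minimal sketch using `find_all`, which is covered in detail below):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(len(soup.find_all('p')))          # 3 -- the document really has three <p> tags
print(soup.p == soup.find_all('p')[0])  # True -- soup.p is just the first match
```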
Tags also expose their attributes; read them through `.attrs` or by indexing the tag directly:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
```
Both lines print the same value:

```
dromouse
dromouse
```
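Indexing the tag and going through `.attrs` are equivalent; `.attrs` simply holds the whole attribute dictionary. A small sketch (note that `class` is treated as multi-valued, so it comes back as a list):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs)  # {'class': ['title'], 'name': 'dromouse'}
```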
To get the text inside a tag, use the `.string` attribute:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)
```
Output:

```
The Dormouse's story
```
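One caveat worth knowing: `.string` only works when a tag has a single child node; on a tag with mixed content it returns `None`. A small sketch of the difference from `get_text()`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
story = soup.find(class_='story')  # contains text nodes and <a> tags
print(story.string)                # None -- the tag has several children
print(story.get_text())            # concatenates all nested text instead
```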
Tag selectors can be chained, which gives nested selection:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)
```
Output:

```
The Dormouse's story
```
To get a tag's direct children as a list, use the `.contents` attribute:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
```
The output is a list. (Note: the traversal outputs below come from a slightly different snippet in which the first `<p>` is the story paragraph and link1 wraps `<span>Elsie</span>` instead of a comment.)
```
['\n Once upon a time there were three little sisters; and their names were\n ',
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>,
'\n',
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
'\n and\n ',
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
'\n and they lived at the bottom of a well.\n ']
```
Another way is the `.children` attribute, which returns an iterator:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
```
Output:

```
<list_iterator object at 0x1064f7dd8>
0
 Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
 and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
 and they lived at the bottom of a well.
```
#### (6) Getting the parent node
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)
```
The program prints the p tag, i.e. the parent node of the a tag:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href=" " id="link1">
<span>Elsie</span>
</ a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</ a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</ a>
and they lived at the bottom of a well.
</p >
复制代码
Closely related are `.parents` (an iterator over all ancestors), `.next_sibling` / `.previous_sibling` (the adjacent nodes), and `.next_siblings` / `.previous_siblings` (iterators over all following or preceding siblings).
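A quick sketch of these traversal attributes against the same Dormouse snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
first_link = soup.a  # the first <a> tag
# .parents walks every ancestor up to the document object
print([p.name for p in first_link.parents])  # ['p', 'body', 'html', '[document]']
# .next_sibling is the node immediately after (often a text node)
print(repr(first_link.next_sibling))
# .next_siblings iterates over everything after this tag at the same level
print([s.name for s in first_link.next_siblings if s.name])  # the remaining <a> tags
```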
Everything above used tag selectors. They are very fast, but they cannot express every query we need when parsing HTML, so BeautifulSoup also provides the search methods below.
**find_all(name, attrs, recursive, text, \*\*kwargs)** searches the document by tag name, attributes, or text content. All of the tests below use this HTML:
```python
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
```
(1) Searching by tag name (the name argument)
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
```
All ul tags are returned:
```
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
```
These lookups can be nested:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
    # you can go further and read attributes, e.g. ul.find_all('li')[0]['class']
```
(2) Searching by attribute
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # 'class' is a Python keyword, so bs4 uses class_
```
Output:

```
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
```
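Besides keyword arguments, `find_all` also accepts an `attrs` dictionary, which is handy for attributes such as `name` or `data-*` that would clash with `find_all`'s own parameters:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))      # same result as id='list-1'
print(soup.find_all(attrs={'class': 'element'}))  # same result as class_='element'
```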
(3) Searching by text content (the text argument)
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))
```
Output:

```
['Foo', 'Foo']
```
Note that the matching strings are returned rather than tags, so text= is not very useful for locating elements; it is mainly for content matching.
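`text=` also accepts a compiled regular expression (newer BeautifulSoup releases offer `string=` as the preferred alias):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('^F')))  # ['Foo', 'Foo']
```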
**find(name, attrs, recursive, text, \*\*kwargs)** works like find_all, except that it returns a single element, the first match, instead of a list.
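For example, against the panel HTML above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))               # the first <ul> only
print(soup.find('ul', id='list-2'))  # name and attribute filters combine
print(soup.find('table'))            # None when nothing matches (find_all gives [])
```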
- find_parents() and find_parent(): find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
- find_next_siblings() and find_next_sibling(): find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
- find_previous_siblings() and find_previous_sibling(): find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
- find_all_next() and find_next(): find_all_next() returns all matching nodes after the current one; find_next() returns the first such node.
- find_all_previous() and find_previous(): find_all_previous() returns all matching nodes before the current one; find_previous() returns the first such node.
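A small sketch of a few of these, again on the panel HTML:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
first_li = soup.find('li')                    # <li class="element">Foo</li>
print(first_li.find_parent('ul')['id'])       # list-1
print(first_li.find_next_sibling('li'))       # <li class="element">Bar</li>
print(first_li.find_all_next('li', limit=2))  # the next two <li> tags in document order
```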
CSS selectors are also supported: just pass a selector straight to select():
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# class selectors are prefixed with '.': .panel-heading inside .panel
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))             # tag selector: <li> tags inside <ul> tags
print(soup.select('#list-2 .element'))  # '#' selects by id: .element inside #list-2
print(type(soup.select('ul')[0]))
```
Output:

```
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
```
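`select()` always returns a list; if you only want the first match, `select_one()` (available since BeautifulSoup 4.4) returns a single tag, or `None`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('#list-2 .element'))  # <li class="element">Foo</li>
```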
select() calls can be nested as well, although it is rarely necessary, since a space between selectors already expresses nesting:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
```
Output:

```
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
```
Getting attributes from the selected tags works just as before:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])  # or: print(ul.attrs['id'])
```
Getting text content with `get_text()`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
```
For more on BeautifulSoup, see the official documentation.
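To close, here is a minimal sketch of how BeautifulSoup typically slots into a crawler. It assumes the `requests` package is installed, and the URL is just a placeholder:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- substitute a page you are allowed to scrape.
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'lxml')

print(soup.title.string)  # the page title
for link in soup.find_all('a'):
    print(link.get('href'), link.get_text(strip=True))  # each link's URL and text
```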