BeautifulSoup is a flexible and convenient web page parsing library. It parses efficiently and supports multiple parsers.
With it you can extract information from web pages conveniently, without writing regular expressions.
Installation: pip3 install beautifulsoup4
Usage in detail:
Parsers supported by BeautifulSoup:
| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python, decent speed, good document fault tolerance | Poor document fault tolerance in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast, good document fault tolerance | Requires the C library (lxml) to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Very fast, the only parser that supports XML | Requires the C library (lxml) to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance, parses documents the way a browser does, generates valid HTML5 | Very slow, depends on an external Python package |
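Building on the table above, here is a minimal sketch (not from the original text) of choosing a parser with a fallback, in line with the recommendation at the end of this article to prefer lxml and use html.parser when necessary:

from bs4 import BeautifulSoup

markup = "<p>Hello, <b>world</b>"   # deliberately broken HTML

try:
    # prefer the fast lxml parser
    soup = BeautifulSoup(markup, "lxml")
except Exception:
    # bs4 raises FeatureNotFound when the requested parser is not installed;
    # fall back to the built-in parser
    soup = BeautifulSoup(markup, "html.parser")

print(soup.p.b.string)  # world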
from bs4 import BeautifulSoup

# below is an incomplete piece of HTML
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''
soup = BeautifulSoup(html,'lxml')
# the parser completes the broken HTML (fault-tolerant handling); prettify() prints the repaired document
print(soup.prettify())
# select the title tag and print its content
print(soup.title.string)
Output:
<html>
 <head>
  <title>
   The Demouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Domouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters,and their name were
   <a class="sister" href="http://examlpe.com/elele" ld="link1">
    <!--Elsle-->
   </a>
   <a class="sister" href="http://examlpe.com/lacie" ld="link2">
    <!--Elsle-->
   </a>
   <a class="sister" href="http://examlpe.com/title" ld="link3">
    <title>
    </title>
   </a>
   and they lived the bottom of a wall
  </p>
  <p clas="stuy">
   ..
  </p>
 </body>
</html>
The Demouse's story
As in the example above, soup.title selects the title tag, and .string returns its text.
Selecting elements:

from bs4 import BeautifulSoup

# the same incomplete piece of HTML as above
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
Output:
<title>The Demouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Demouse's story</title></head>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
# only the first matching result is returned
Getting the tag name:
from bs4 import BeautifulSoup

# the same incomplete piece of HTML as above
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)
Output: title
Getting attributes:
from bs4 import BeautifulSoup

# the same incomplete piece of HTML as above
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
# note: both ways of getting an attribute work - soup.p.attrs['name'] and soup.p['name']
# also pay attention to the square brackets!
Getting the content:
As shown in the examples, the .string attribute, e.g. soup.title.string, returns a tag's text content, as in the short sketch below.
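For instance, with the soup built in the previous example (the values in the comments follow from that HTML):

print(soup.title.string)  # The Demouse's story
print(soup.p.b.string)    # The Domouse's story  (soup.p is the first <p> tag)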
Nested selection:
e.g. print(soup.head.title.string), as in the sketch below.
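Each step in such a chain returns a bs4.element.Tag, so selections can be nested arbitrarily deep; a minimal sketch with the same soup:

head = soup.head              # a Tag object
print(type(head))             # <class 'bs4.element.Tag'>
print(head.title)             # <title>The Demouse's story</title>
print(head.title.string)      # The Demouse's story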
Child nodes and descendant nodes:
e.g. print(soup.p.contents). The contents attribute returns all direct children of the p tag as a list.
You can also use children. Unlike contents, children is an iterator over the direct children, so you need a loop to read its items, e.g.:
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
There is also a descendants attribute, which yields all descendant nodes (children, grandchildren, and so on); it is likewise an iterator.
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
Note: in the child/descendant examples here and the parent/ancestor examples below, syntax like soup.p selects only the first matching p tag, so the nodes obtained all belong to that first match.
Parent and ancestor nodes:
parent attribute: gets the direct parent node
parents attribute: gets all ancestor nodes (a generator; see the sketch below)
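A minimal sketch with the same soup, starting from the first <a> tag:

a = soup.a
print(a.parent)    # the <p class="story"> tag that directly contains the first <a>
for i, ancestor in enumerate(a.parents):
    print(i, ancestor.name)    # p, body, html, then [document] (the BeautifulSoup object itself)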
Sibling nodes:
next_siblings attribute: the siblings that come after the tag (a generator)
previous_siblings attribute: the siblings that come before the tag (a generator; sketch below)
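A minimal sketch, again starting from the first <a> tag of the soup above:

a = soup.a
print(list(a.next_siblings))        # nodes at the same level that come after the first <a>
print(list(a.previous_siblings))    # nodes at the same level that come before it (here, the leading text)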
--------------------------------------------------------------------------------------------------------------------
The tag selectors described above are fast, but they cannot satisfy every need when parsing an HTML document.
The find_all method:
find_all(name, attrs, recursive, text, **kwargs)
It searches the document by tag name, attributes, or text content.
Searching by name:
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<url class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>
<url class="list list-small" id="list-2">
<li lass="element">Foo</li>
<li lass="element">Bar</li>
</url>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('url'))

Output:
[<url class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>, <url class="list list-small" id="list-2">
<li lass="element">Foo</li>
<li lass="element">Bar</li>
</url>]
As you can see, the result is a list; you can loop over it and then search within each element, e.g.:
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<url class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>
<url class="list list-small" id="list-2">
<li lass="element">Foo</li>
<li lass="element">Bar</li>
</url>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
for url in soup.find_all('url'):
    print(url.find_all('li'))

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]
[<li lass="element">Foo</li>, <li lass="element">Bar</li>]
Searching by attrs:
attrs takes its parameters as a dictionary, e.g.:
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<url class="list" id="list-1" name='elements'>
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>
<url class="list list-small" id="list-2">
<li lass="element">Foo</li>
<li lass="element">Bar</li>
</url>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))  # soup.find_all(id='list-1') works as well
print(soup.find_all(attrs={'name':'elements'}))

Output:
[<url class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>]
[<url class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>]
Note: you can also search with keyword arguments such as soup.find_all(id='list-1'), but for the class attribute you must write class_='...': class is a reserved keyword in Python, so when it is used as a keyword argument for searching it has to be spelled class_ (a short sketch follows).
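A minimal sketch of the class_ form, using the same soup as in the example above:

print(soup.find_all(class_='element'))
# equivalent to:
print(soup.find_all(attrs={'class': 'element'}))
# both print: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]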
Searching by text:
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<url class="list" id="list-1" name='elements'>
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>
<url class="list list-small" id="list-2">
<li lass="element">Foo</li>
<li lass="element">Bar</li>
</url>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
# note: text= matches and returns the text strings themselves, not the tags that contain them
print(soup.find_all(text='Foo'))

Output:
['Foo', 'Foo']
The find method takes exactly the same arguments as find_all. The difference is that find_all returns all matching elements as a list, while find returns a single element: the first match, i.e. the first item of that list (see the sketch after the signature).
find(name, attrs, recursive, text, **kwargs)
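A minimal sketch of find() next to the equivalent find_all() call, on the same soup as above:

print(soup.find('url'))           # the first <url> tag (id="list-1"); find() returns None if nothing matches
print(soup.find_all('url')[0])    # the same element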
find_parents()
find_parent()
find_next_siblings()
find_next_sibling()
find_previous_siblings()
find_previous_sibling()
find_all_next()
find_next()
find_all_previous()
find_previous()
These methods all take the same kind of arguments as find() / find_all(); they differ only in which part of the tree they search: find_parents() / find_parent() look at ancestors, find_next_siblings() / find_next_sibling() and find_previous_siblings() / find_previous_sibling() look at siblings, and find_all_next() / find_next() and find_all_previous() / find_previous() look at the nodes that come after or before the current tag in the document. A short sketch follows.
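A minimal sketch of a few of them, starting from the first <li> of the soup above:

li = soup.find('li')
print(li.find_parent('url'))        # the enclosing <url id="list-1"> tag
print(li.find_next_sibling('li'))   # the next <li> at the same level: <li class="element">Bar</li>
print(li.find_all_next('li'))       # every <li> that appears after it in the document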
CSS selectors: by passing a CSS selector directly to select(), you can make a selection.
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<url class="list" id="list-1" name='elements'>
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>
<url class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</url>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
# to select by class, prefix it with a dot: .panel .panel-heading
print(soup.select('.panel .panel-heading'))
# tags are selected directly by name
print(soup.select('url li'))
# to select by id, use #
print(soup.select('#list-2 .element'))

Output:
[<div class="panel-heading">
<h4>hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Nested selection, level by level:
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<url class="list" id="list-1" name='elements'>
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>
<url class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</url>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
for url in soup.select('url'):
    print(url.select('li'))

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Getting attributes:
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<url class="list" id="list-1" name='elements'>
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>
<url class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</url>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
for url in soup.select('url'):
    print(url['id'])
    # print(url.attrs['id']) works as well

Output:
list-1
list-2
Getting the text:
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<url class="list" id="list-1" name='elements'>
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>
<url class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</url>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
for l in soup.select('li'):
    print(l.get_text())

Output:
Foo
Bar
jay
Foo
Bar
Use the lxml parser by default, and fall back to html.parser when necessary.
Tag selectors have weak filtering power but are fast.
Use find() / find_all() to query for a single result or for multiple results.
If you are familiar with CSS selectors, use select().
Remember the common ways of getting attributes and text values.