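Installation: run pip install beautifulsoup4 on the command line. The examples below use the lxml parser, which is installed separately with pip install lxml.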
Parsers supported by BeautifulSoup: Python's built-in html.parser, lxml (HTML and XML modes), and html5lib. Basic usage with lxml:

    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>The Dormousae's story</title></head>
    <body>
    <p class="title" name="drimouse"><b>The Dormousae's story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Tillie</a>;
    and they lived at the boottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())
    print(soup.title.string)
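Note that html is not a complete HTML string: the body and html tags are never closed. Calling soup = BeautifulSoup(html, 'lxml') initializes the BeautifulSoup object, and the missing closing tags are filled in during parsing. The soup.prettify() method outputs the parsed document with standard indentation, and soup.title.string prints the text content of the title node.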
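To see the repair step in isolation, here is a minimal sketch. The fragment in broken is a made-up example, and it uses Python's built-in html.parser (an assumption made so the snippet runs without lxml); the exact repaired markup can differ between parsers.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p>, <body> and <html> are never closed.
broken = "<html><head><title>Demo</title></head><body><p>Hello"

# html.parser ships with Python; the article's own examples use 'lxml'.
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree writes out the closing tags the input was missing.
repaired = str(soup)
print(repaired)

# The title text is reachable through attribute-style access.
print(soup.title.string)
```

Printing repaired shows that the serialized tree ends with the closing p, body and html tags even though the input never contained them.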
    # html is the same as above
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title)        # prints the title tag and its contents
    print(type(soup.title))  # <class 'bs4.element.Tag'>
    print(soup.head)         # prints the head tag and its contents
    print(soup.p)            # prints only the first p node and its contents

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title.name)   # prints the name of the node: title

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.attrs)          # {'class': ['title'], 'name': 'drimouse'}
    print(soup.p.attrs['name'])  # drimouse
    print(soup.p['name'])        # drimouse

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title.string)     # prints the text inside the title node
    print(soup.title.string)
    print(soup.head.title.string)
    print(soup.head.title)
    print(type(soup.head.title))
    print(type(soup.head.title.string))
    # Output, in order:
    # The Dormousae's story
    # The Dormousae's story
    # <title>The Dormousae's story</title>
    # <class 'bs4.element.Tag'>
    # <class 'bs4.element.NavigableString'>
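One detail worth knowing about .string: when a tag contains only an HTML comment, as the first link in the sample document does, the value returned is a Comment object, which is a special kind of NavigableString. The sketch below uses a shortened, hypothetical snippet and html.parser (an assumption, so that it runs without lxml).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Shortened, hypothetical version of the links in the sample document.
snippet = '<a id="link1"><!--Elsie--></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(snippet, "html.parser")

links = soup.find_all("a")
first_string = links[0].string   # the comment inside link1
second_string = links[1].string  # the plain text inside link2

for a in links:
    # Comment is a subclass of NavigableString, so test for it first.
    kind = "comment" if isinstance(a.string, Comment) else "text"
    print(a["id"], kind, a.string)
```

Checking for Comment before using the value is one way to avoid treating commented-out markup as visible page text.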
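When making a selection, you sometimes cannot reach the target node in one step. In that case, select a node first and then use it as a base to select its children, parent, siblings, and so on.
(1) Child nodes and descendant nodes:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.contents)  # direct children, as a list
    # [<b>The Dormousae's story</b>]

Method 2:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.children)  # an iterator, not a list
    for i, child in enumerate(soup.p.children):
        print(i, child)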
Output (the memory address varies between runs):
<list_iterator object at 0x000001BABACB9EF0>
0 <b>The Dormousae's story</b>
Descendant nodes:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.descendants)  # a generator over all descendants
    for i, child in enumerate(soup.p.descendants):
        print(i, child)
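The difference between children and descendants is easiest to see on nested markup. The following is a small sketch on a made-up paragraph, parsed with html.parser (an assumption, so that it runs without lxml).

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with one level of extra nesting.
snippet = "<p><b>Bold <i>both</i></b> tail</p>"
soup = BeautifulSoup(snippet, "html.parser")

# Direct children only: the <b> tag and the trailing text node.
children = list(soup.p.children)

# Every node below <p>, depth first: <b>, 'Bold ', <i>, 'both', ' tail'.
descendants = list(soup.p.descendants)

print(len(children), len(descendants))
for node in descendants:
    print(repr(node))
```

Text nodes count as nodes in both cases, which is why the descendant list is longer than the number of tags.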
(2) Parent and ancestor nodes:

    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)    # the direct parent of the first a node
    print(soup.a.parents)   # returns a generator
    print(list(enumerate(soup.a.parents)))  # all ancestors, nearest first
(3) Sibling nodes:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(list(enumerate(soup.a.next_siblings)))      # siblings after the node
    print(list(enumerate(soup.a.previous_siblings)))  # siblings before the node
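Output:
[(0, ',\n'), (1, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>), (2, ' and\n'), (3, <a href="http://example.com/title" class="sister" id="link3">Tillie</a>), (4, ';\nand they lived at the boottom of a well.')]
[(0, 'Once upon a time there were three little sisters;and their names were\n')]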
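As the output shows, sibling iteration returns the text nodes between tags as well as the tags themselves. If only tag siblings are wanted, find_next_siblings() can filter by tag name. This is a minimal sketch on a shortened, hypothetical version of the paragraph, parsed with html.parser (an assumption, so that it runs without lxml).

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the "story" paragraph.
snippet = (
    '<p>Their names were\n'
    '<a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a>;</p>'
)
soup = BeautifulSoup(snippet, "html.parser")
first = soup.a

# next_siblings yields text nodes and tags alike.
all_siblings = list(first.next_siblings)

# find_next_siblings("a") keeps only the following <a> tags.
tag_siblings = first.find_next_siblings("a")
ids = [tag["id"] for tag in tag_siblings]

print(len(all_siblings))
print(ids)
```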
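Method selectors: everything above selects nodes through attribute access. That approach is fast, but it becomes cumbersome and inflexible for more complex selections. BeautifulSoup also provides the find_all() and find() methods, which search the document by tag name, attributes, or text.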
    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</li>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('ul'))
    print(type(soup.find_all('ul')[0]))
    for ul in soup.find_all('ul'):
        print(ul.find_all('li'))
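Output: a list containing both ul elements, then <class 'bs4.element.Tag'>, then the li elements of each ul in turn: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>] and [<li class="element">Foo</li>, <li class="element">Bar</li>].
The attrs argument: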
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id': 'list-1'}))
    print(soup.find_all(attrs={'class': 'element'}))

which is equivalent to

    print(soup.find_all(id='list-1'))
    print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_
The text argument:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text='Foo'))  # newer bs4 releases prefer string='Foo'
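Text matching also accepts a compiled regular expression, and it can be combined with a tag name. The sketch below assumes a bs4 release that accepts the string argument (the newer name for text) and uses a made-up list parsed with html.parser, so that it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical list; "Food" is added to show the difference between
# an exact match and a pattern match.
snippet = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">Food</li></ul>'
)
soup = BeautifulSoup(snippet, "html.parser")

# Exact match: returns the matching text nodes, not the tags.
exact = soup.find_all(string="Foo")

# Pattern match: every text node that starts with "Foo".
by_pattern = soup.find_all(string=re.compile("^Foo"))

# With a tag name, the result is the tags whose text matches.
tags = soup.find_all("li", string=re.compile("^Foo"))

print(exact, by_pattern, tags)
```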
find(name, attrs, recursive, text, **kwargs)
find() returns a single element (the first match), while find_all() returns all matching elements.
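The two methods also differ when nothing matches: find() returns None, while find_all() returns an empty list. A short sketch of guarding against the None case, on a made-up snippet parsed with html.parser (an assumption, so that it runs without lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical one-item list.
soup = BeautifulSoup('<ul id="list-1"><li>Foo</li></ul>', "html.parser")

missing_one = soup.find("page")      # no such tag: None
missing_all = soup.find_all("page")  # no such tag: empty list

# Check for None before calling methods on the result of find().
first_li = soup.find("li")
text = first_li.get_text() if first_li is not None else ""

print(missing_one, missing_all, text)
```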
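    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find('ul'))        # the first ul element only
    print(type(soup.find('ul')))  # <class 'bs4.element.Tag'>
    print(soup.find('page'))      # None, because no such tag exists

CSS selectors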
Pass a CSS selector string directly to select() to make the selection.
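A few selector forms as a rough sketch: descendant selectors, an id selector combined with a class, and select_one() for the first match only. The markup repeats the panel example in compact form, and html.parser is used here as an assumption so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Compact copy of the panel markup used in the find_all() examples.
html = '''
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

heading = soup.select(".panel .panel-heading")    # class inside class
all_items = soup.select("ul li")                  # every li under any ul
list2_items = soup.select("#list-2 .element")     # by id, then by class
first_item = soup.select_one("#list-1 li")        # first match or None

print(len(heading), len(all_items))
print([li.get_text() for li in list2_items])
print(first_item.get_text())
```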
(1) Getting attributes

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul['id'])
        print(ul.attrs['id'])
(2) Getting text content

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
        print(li.get_text())
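Summary:
The lxml parser is recommended; fall back to html.parser when necessary.
Selecting by tag attribute (soup.title, soup.p) offers little filtering power but is fast.
Use find() to match a single result and find_all() to match multiple results.
If you are comfortable with CSS selectors, use select().
Remember the common ways to get attribute values and text.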
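As a quick reference for the last point, the following sketch collects the common ways to read attribute values and text. The snippet is a shortened, hypothetical copy of the second list, parsed with html.parser (an assumption, so that it runs without lxml).

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the second list from the examples.
snippet = (
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Foo</li><li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(snippet, "html.parser")
ul = soup.ul

ul_id = ul["id"]            # raises KeyError if the attribute is missing
ul_classes = ul["class"]    # class is multi-valued, so this is a list
ul_name = ul.get("name")    # get() returns None for a missing attribute

joined = ul.get_text()                      # all text below the tag
spaced = ul.get_text(" ", strip=True)       # with a separator
ul_string = ul.string       # None: the tag has more than one child
li_string = ul.li.string    # the single text node inside the first li

print(ul_id, ul_classes, ul_name)
print(joined, spaced, ul_string, li_string)
```

Using get() rather than square brackets is a reasonable default when an attribute may be absent, since it avoids the KeyError.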