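Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document.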
Consult the official documentation often: https://beautifulsoup.readthedocs.io/zh_CN/latest/
    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''

    # Parse the document, then print it with normalized indentation
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())
Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ; and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
Working with Beautiful Soup is much like working with a dictionary, except that tags are reached with the dot (.) operator.
    <title>The Dormouse's story</title>

    The Dormouse's story

    ['title']
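The three outputs above are consistent with the accesses sketched below. Which expressions the original example used is an assumption here, and `html.parser` is swapped in for `lxml` so that the sketch runs without extra dependencies.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body></html>
"""

# Assumption: html.parser (bundled with Python) instead of the article's lxml.
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # first <title> tag in the document
title_text = soup.title.string  # the text inside that tag
p_classes = soup.p["class"]     # attributes are read with dictionary-style keys

print(title_tag)
print(title_text)
print(p_classes)
```

Dot access returns only the first tag with that name; attribute lookup on the tag uses square brackets, which is where the dictionary comparison applies.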
Attribute access can be chained to reach the content of a nested tag.
    <p class="title"><b>The Dormouse's story</b></p>

Input:

    print(soup.p.b.string)

Output:

    The Dormouse's story
find_all(name, attrs, recursive, string, **kwargs)
name: search by tag name (the most commonly used argument).
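A common pattern is to iterate over the returned list and process each element:

    alist = soup.find_all('a')
    for a in alist:
        function(a)  # function stands for whatever processing you need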
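The loop above can be made concrete as follows. This is a minimal sketch: `describe_link` is a hypothetical stand-in for `function`, and `html.parser` replaces `lxml` so that nothing beyond `bs4` is needed.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")


def describe_link(a):
    """Hypothetical helper: return (link text, href) for one <a> tag."""
    return a.get_text(), a.get("href")


# find_all("a") returns every <a> tag in document order
links = [describe_link(a) for a in soup.find_all("a")]

for text, href in links:
    print(text, href)
```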
    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''

    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id': 'list-1'}))
    print(soup.find_all(attrs={'name': 'elements'}))

Both statements print the same output:

    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
You can also search on a tag attribute with a keyword argument. This example filters on each tag's id attribute:

    soup.find_all(id='list-2')
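As a rough illustration of the keyword form, the sketch below runs the same `id` search against a trimmed copy of the panel markup and then searches inside the match. The use of `html.parser` and the variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword arguments other than the named parameters are treated as attribute filters
matches = soup.find_all(id="list-2")

# Each result is itself a tag, so it can be searched again
items = [li.get_text() for li in matches[0].find_all("li")]

print(len(matches))
print(items)
```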
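class is a reserved word in Python, so it cannot be passed as a keyword argument. Pass it in an attribute dictionary instead:

    soup.find_all(attrs={"class": "element"})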
Alternatively, use the class_ keyword argument to search on the class attribute:

    soup.find_all("div", class_="panel-body")
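The two ways of filtering on class can be compared side by side. The sketch below assumes a trimmed version of the panel markup and `html.parser`. It also shows that a tag carrying several classes matches a search for any one of them.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Dictionary form and class_ keyword form select the same tags
by_dict = [li.get_text() for li in soup.find_all(attrs={"class": "element"})]
by_keyword = [li.get_text() for li in soup.find_all(class_="element")]

# class is multi-valued: "list list-small" matches a search for either class
list_ids = [ul["id"] for ul in soup.find_all("ul", class_="list")]
small_ids = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]

print(by_dict, by_keyword)
print(list_ids, small_ids)
```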
Quickly getting a tag's CSS selector in Google Chrome
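Pick the element with the inspector's element picker, find the corresponding markup, and right-click it. From there you can copy whatever you need; the selector option gives the element's CSS path.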
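The article does not show what to do with a copied selector. One plausible use, sketched here as an assumption, is to pass it to Beautiful Soup's `select()` or `select_one()`. The selector string below is hypothetical and hand-written for this markup, not one copied from a real page.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Hypothetical selector of the kind a browser's "Copy selector" might produce
selector = "#list-1 > li.element"

# select() returns a list of every match; select_one() returns the first match or None
items = [li.get_text() for li in soup.select(selector)]
first_small = soup.select_one("#list-2 > li")

print(items)
print(first_small.get_text())
```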
If find_all() finds no matching data, it returns an empty list.
In the same situation, find() returns None.
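The difference matters when the result is used right away. This is a minimal sketch, assuming `html.parser` and a document with no `<table>` tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story">...</p>', "html.parser")

missing_all = soup.find_all("table")  # no match: empty list
missing_one = soup.find("table")      # no match: None

print(missing_all)
print(missing_one)

# Looping over an empty list is safe, but calling a method on None
# raises AttributeError, so check the result of find() first.
text = missing_one.get_text() if missing_one is not None else ""
```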