Beautiful Soup is a Python library for extracting data from HTML or XML files. In short, it parses an HTML document into a tree of tags, so you can conveniently get at the attributes of any given tag, and it also makes it easy to crawl and parse the content of an entire site.
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers; if we don't install one of those, it falls back to Python's default parser. lxml is a Python parsing library that supports both HTML and XML, while the html5lib parser parses pages the way a browser does and produces an HTML5 document tree.
pip install beautifulsoup4
pip install html5lib
pip install lxml
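The parser you pass as the second argument to BeautifulSoup changes how broken markup gets repaired. A minimal comparison sketch (the fragment `<p>hello` is just an illustrative example; lxml and html5lib are guarded in case they are not installed):

```python
from bs4 import BeautifulSoup

broken = "<p>hello"

# Python's built-in parser -- always available, closes the tag conservatively
print(BeautifulSoup(broken, "html.parser"))   # <p>hello</p>

# lxml and html5lib are optional third-party parsers; each "repairs" the
# fragment differently (lxml wraps it in <html><body>, html5lib also adds <head>)
for parser in ("lxml", "html5lib"):
    try:
        print(parser, "->", BeautifulSoup(broken, parser))
    except Exception:
        print(parser, "is not installed")
```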
Suppose we now have a fragment of incomplete HTML, and we want to parse it with the Beautiful Soup module:
data = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b id="title">The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')
Through the methods BeautifulSoup provides, we can then get at the document's elements, attributes, links, text, and so on. BeautifulSoup can also turn an incomplete HTML document into a well-formed one. For example, let's call print(soup.prettify()) and see what it outputs:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b id="title">
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
print('title = {}'.format(soup.title))
# Output: title = <title>The Dormouse's story</title>
print('a = {}'.format(soup.a))
print('title_name = {}'.format(soup.title.name))
# Output: title_name = title
print('body_name = {}'.format(soup.body.name))
# Output: body_name = body
print('title_string = {}'.format(soup.title.string))
# Output: title_string = The Dormouse's story
print('title_parent = {}'.format(soup.title.parent))
# Output: title_parent = <head><title>The Dormouse's story</title></head>
print('p = {}'.format(soup.p))
# Output: p = <p class="title"><b id="title">The Dormouse's story</b></p>
print('p_class = {}'.format(soup.p["class"]))
# Output: p_class = ['title']
print('a_class = {}'.format(soup.a["class"]))
# Output: a_class = ['sister']
# Get all the <a> tags
print('a = {}'.format(soup.find_all('a')))
# Get all the <p> tags
print('p = {}'.format(soup.find_all('p')))
print('a_link = {}'.format(soup.find(id='title')))
# Output: a_link = <b id="title">The Dormouse's story</b>
Tag (a markup tag)
NavigableString (the text inside a tag)
BeautifulSoup (the root object representing the whole document)
Comment (a special kind of NavigableString holding the text of an HTML comment)
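Everything in a parsed tree is one of these four object types. A minimal sketch showing where each type appears, using the stdlib html.parser and a made-up one-line document:

```python
from bs4 import BeautifulSoup

# A tiny document containing a comment, a tag, and some text
soup = BeautifulSoup("<p class='title'><!-- a note --><b>text</b></p>", "html.parser")

print(type(soup))                 # the BeautifulSoup object is the tree root
print(type(soup.b))               # <b> is a Tag
print(type(soup.b.string))        # the text inside <b> is a NavigableString
comment = soup.p.contents[0]      # the first child of <p> is the comment
print(type(comment))              # a Comment object
print(comment)                    # prints just the comment text, no <!-- -->
```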
Syntax:
soup.tag_name: gets the first tag with that name from the HTML;
soup.tag_name.name: gets the name of the tag;
soup.tag_name.attrs: gets all of the tag's attributes;
soup.tag_name.string: gets the tag's text content;
soup.tag_name.parent: gets the tag's parent tag;
prettify(): formats the Beautiful Soup document tree and outputs it as Unicode, with each XML/HTML tag on its own line;
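The navigation syntax above can be seen end to end on a small document; the one-line markup here (and the use of the stdlib html.parser) is purely illustrative:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p class="title" id="p1">hello</p></body></html>',
                     "html.parser")

print(soup.p)              # the first <p> tag
print(soup.p.name)         # p
print(soup.p.attrs)        # {'class': ['title'], 'id': 'p1'}
print(soup.p.string)       # hello
print(soup.p.parent.name)  # body
print(soup.prettify())     # the whole tree, one tag per line
```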
contents: gets all child nodes as a list, which can be indexed by position;

soup = BeautifulSoup(html, "lxml")
# Returns a list
print(soup.p.contents)
# Take the first child node
print(soup.p.contents[0])
children: returns a generator over the child nodes;

for tag in soup.p.children:
    print(tag)
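The difference between contents and children is only list versus generator; both walk the same direct children (including any whitespace-only text nodes). A quick sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b> and <i>two</i></p>", "html.parser")

print(soup.p.contents)        # [<b>one</b>, ' and ', <i>two</i>] -- a real list
print(list(soup.p.children))  # same nodes, but children is a lazy generator
```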
soup.strings: yields the text of every node, including whitespace-only strings;

soup = BeautifulSoup(html, "lxml")
for content in soup.strings:
    print(repr(content))
soup.stripped_strings: yields the text of every node with the surrounding whitespace stripped out;

soup = BeautifulSoup(html, "lxml")
for tag in soup.stripped_strings:
    print(repr(tag))
find_all(): finds all descendant tags with the given name (several names can be searched at once), checks them against the filter conditions, and returns a list;

import re

soup = BeautifulSoup(html, "lxml")
print(soup.find_all('a'))
print(soup.find_all(['a', 'p']))
print(soup.find_all(re.compile('^a')))
find(): behaves much like find_all(), except that find_all() returns a list of results while find() returns the first match directly;

soup = BeautifulSoup(html, "lxml")
print(soup.find('a'))
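One practical difference worth knowing (this follows from the library's documented behavior): when nothing matches, find_all() returns an empty list while find() returns None, so the two failure cases are checked differently. A small sketch:

```python
from bs4 import BeautifulSoup

# A document that contains no <a> tags at all
soup = BeautifulSoup("<p>only a paragraph</p>", "html.parser")

print(soup.find_all("a"))  # [] -- an empty list, safe to iterate over
print(soup.find("a"))      # None -- must be checked before use

link = soup.find("a")
if link is not None:       # guard against the no-match case
    print(link["href"])
```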