[Learning Web Scraping Together] BeautifulSoup Explained in Detail

Date: 2019-11-08

Recap

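Last time, while introducing regular expressions, I shared a hands-on scraping exercise: collecting every book on the Douban homepage together with its link, author, publication date, and so on. In that exercise the page source was parsed with regular expressions, and overall those expressions turned out to be fairly complex. That is why today's protagonist is BeautifulSoup: a flexible, convenient, and efficient web page parsing library that supports several parsers. With BeautifulSoup you can extract information from web pages without writing regular expressions.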

1. Installing BeautifulSoup

pip install beautifulsoup4

2. Usage

1. Parsers

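Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents; commonly used	Requires the lxml library, which depends on C
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires the lxml library, which depends on C
html5lib	BeautifulSoup(markup, "html5lib")	Best fault tolerance; parses documents the way a browser does; produces HTML5-format documents	Slow; requires installing the separate html5lib package (pure Python, no C extension)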
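Because lxml and html5lib are optional installs, a script may want to fall back to whichever parser is available. The following is a minimal sketch and not part of the original walkthrough: the helper name make_soup and the preference order are assumptions made for illustration. It relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical helper: try parsers in order of preference and return
# the soup together with the name of the parser that actually worked.
def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")

soup, parser_used = make_soup("<p>Hello")
print(parser_used)    # depends on which parsers are installed
print(soup.p.string)
```

Since html.parser ships with Python, the last entry in the tuple should always succeed, so the RuntimeError is only a safeguard.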

2. Basic usage

Below is an incomplete HTML document: neither the body tag nor the html tag is closed.

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Next, parse the html above with the lxml parser:

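from bs4 import BeautifulSoup  # import the library
soup = BeautifulSoup(html, 'lxml')  # create the BeautifulSoup object and choose the parser
print(soup.prettify())  # pretty-print the document; the parser repairs and completes the markup
print(soup.title.string)  # print the text inside the title tag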

Below is the repaired markup followed by the title text. Note that the html and body tags have both been closed automatically:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

3. Tag selectors

(1) Selecting elements

The examples below continue to use the html defined above.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

Result:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

The result contains only one p tag, although the HTML has three. This is how tag selectors behave: when several tags match, only the first one is returned.
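The first-match behaviour can be checked against find_all(). This is a small sketch using a shortened version of the test document and the built-in html.parser (an assumption made so that no extra install is needed; 'lxml' should behave the same way here).

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "html.parser")

first_p = soup.p              # attribute access: first matching tag only
all_p = soup.find_all("p")    # every p tag, in document order

print(len(all_p))             # 3
print(first_p is all_p[0])    # True: soup.p is the first entry of find_all
```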

(2) Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

Output:

dromouse
dromouse
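Two details are worth knowing when reading attributes: class is returned as a list because it can hold several values, and indexing a missing attribute raises KeyError while get() does not. The sketch below uses a made-up one-line document for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the article's test document.
soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

print(p.attrs)             # the full attribute dictionary
print(p["name"])           # dromouse
print(p["class"])          # ['title', 'big'] (multi-valued attribute)
print(p.get("id"))         # None, where p["id"] would raise KeyError
print(p.get("id", "n/a"))  # n/a (explicit default)
```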

(3) Getting the text content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

Output:

The Dormouse's story
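Note that .string only yields text when a tag has a single child (or a single chain of nested children); for a tag with several children it is None, and get_text() is the usual alternative. A short sketch on a reduced two-paragraph document, chosen here for illustration:

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a href="#">Lacie</a> and more</p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

print(title_p.string)      # The Dormouse's story
print(story_p.string)      # None: the tag has several children
print(story_p.get_text())  # Once Lacie and more
```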

(4) Nested selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)

Output:

The Dormouse's story

(5) Getting child and descendant nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

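The output is a list. Note that the listing below was produced from a variant of the test document in which the story paragraph is the first p tag and the first link wraps a span; with the html defined earlier, soup.p is the title paragraph, so the list would contain only its b tag.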

['\n            Once upon a time there were three little sisters; and their names were\n            ', 
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>,
 '\n'
, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
, ' \n            and\n            '
, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
, '\n            and they lived at the bottom of a well.\n        ']

Another way to get the children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:

<list_iterator object at 0x1064f7dd8>
0 
            Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
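The heading also mentions descendant nodes. While .contents and .children stop at the direct children, .descendants walks the whole subtree, text nodes included. A minimal sketch on a one-line paragraph invented for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup('<p>One <a href="#"><span>Elsie</span></a> two</p>', "html.parser")
p = soup.p

children = list(p.children)        # 'One ', <a>, ' two'
descendants = list(p.descendants)  # 'One ', <a>, <span>, 'Elsie', ' two'

print(len(children))               # 3
print(len(descendants))            # 5

# Keep only the tags, dropping the text nodes.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]
print(tag_names)                   # ['a', 'span']
```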

(6) Getting the parent node

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

The program prints the p tag, which is the parent of the a tag:

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>

Similar attributes include:

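parents: iterates over all ancestors of the current tag
next_siblings: iterates over the siblings that follow the current tag
previous_siblings: iterates over the siblings that precede the current tag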
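These three attributes can be tried on a compact document. The markup below is an assumption made for illustration; it is written without whitespace between tags so that the sibling iterators yield only tags and no text nodes.

```python
from bs4 import BeautifulSoup

html = '<div><p><a id="link1">A</a><a id="link2">B</a><a id="link3">C</a></p></div>'
soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="link2")

# Walk upwards: the last ancestor is the BeautifulSoup object itself.
ancestors = [tag.name for tag in middle.parents]
after = [tag["id"] for tag in middle.next_siblings]
before = [tag["id"] for tag in middle.previous_siblings]

print(ancestors)  # ['p', 'div', '[document]']
print(after)      # ['link3']
print(before)     # ['link1']
```

In a document with line breaks between tags, the sibling iterators also yield the whitespace strings, so real code usually filters for tags first.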

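Tag selectors, covered above, are fast, but they are not flexible enough for everything we need when parsing HTML. BeautifulSoup therefore provides some additional methods.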

4. Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content. All the examples below use the following test HTML:

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

(1) Searching by tag name (name)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

This prints all the ul tags:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

The calls can also be nested:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
    # You can go one step further and read an attribute of an li: ul.find_all('li')[0]['class']

(2) Searching by attribute

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # attributes passed as a dictionary
print(soup.find_all(id='list-1'))  # the same query as a keyword argument
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

(3) Searching by text content (text)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

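Output:

['Foo', 'Foo']

The result is a list of strings rather than tags, so it is of limited use for locating elements; it is better suited to matching content.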
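If the enclosing tags are needed after all, each matched string still knows its parent. The sketch below assumes bs4 4.4.0 or later, where the argument is called string (text is its older name), and uses a shortened copy of the test HTML.

```python
import re
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li>Foo</li><li>Bar</li></ul><ul id="list-2"><li>Foo</li></ul>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all(string="Foo")               # exact text match
owners = [s.parent.parent["id"] for s in matches]   # string -> li -> ul
starts_with_b = soup.find_all(string=re.compile("^B"))  # pattern match

print(matches)        # ['Foo', 'Foo']
print(owners)         # ['list-1', 'list-2']
print(starts_with_b)  # ['Bar']
```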

find(name, attrs, recursive, text, **kwargs)

Similar to find_all(), except that find() returns a single element: the first match, or None when nothing matches.
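Because find() can return None, it is worth guarding before chaining further calls. A minimal sketch on a shortened copy of the test HTML:

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")    # first match only
missing = soup.find("table")  # no table in the document

print(first_li.string)        # Foo
print(missing)                # None

# Guard against None before calling methods on the result.
table_text = missing.get_text() if missing is not None else ""
```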

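find_parents() and find_parent(): find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.

find_next_siblings() and find_next_sibling(): find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

find_previous_siblings() and find_previous_sibling(): find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.

find_all_next() and find_next(): find_all_next() returns all matching nodes after the current node; find_next() returns the first matching one.

find_all_previous() and find_previous(): find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching one.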
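The method pairs above can be compared side by side, starting from one li element. This is a sketch on a compacted version of the panel HTML; the variable names are chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel"><div class="panel-body">'
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
    '<ul id="list-2"><li>Foo</li><li>Bar</li></ul>'
    '</div></div>'
)
soup = BeautifulSoup(html, "html.parser")
bar = soup.find_all("li")[1]  # the "Bar" item of list-1

parent_id = bar.find_parent("ul")["id"]                      # nearest ul ancestor
div_classes = [d["class"] for d in bar.find_parents("div")]  # nearest div first
next_text = bar.find_next_sibling("li").string
prev_text = bar.find_previous_sibling("li").string
later = [li.string for li in bar.find_all_next("li")]        # rest of the document
earlier = [li.string for li in bar.find_all_previous("li")]

print(parent_id, div_classes)   # list-1 [['panel-body'], ['panel']]
print(prev_text, next_text)     # Foo Jay
print(earlier, later)           # ['Foo'] ['Jay', 'Foo', 'Bar']
```

Unlike the sibling methods, find_all_next() and find_all_previous() are not limited to the same parent: here the search continues into list-2.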

CSS selectors

Pass a CSS selector directly to select() to make a selection:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Select elements with class panel-heading inside elements with class panel; prefix a class with '.'
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))  # tag selection: li tags inside ul tags
print(soup.select('#list-2 .element'))  # '#' selects by id: elements with class element inside id list-2
print(type(soup.select('ul')[0]))

Output:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

Nesting also works, although it is unnecessary here, because separating selectors with a space already expresses nesting:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Once elements have been selected, their attributes and text can be read. Getting attributes:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])  # or: print(ul.attrs['id'])

Getting the text content:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
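Attribute and text access combine naturally when building a data structure from the selection. The following sketch maps each list's id to the text of its items; the dictionary name and the use of select_one() for a single match are illustrative choices, and the HTML is a compacted copy of the test document.

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1"><li class="element">Foo</li>'
    '<li class="element">Bar</li><li class="element">Jay</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Foo</li>'
    '<li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# Map each ul id to the texts of its li.element children.
items_by_list = {
    ul["id"]: [li.get_text() for li in ul.select("li.element")]
    for ul in soup.select("ul.list")
}
print(items_by_list)

# select_one() returns the first match instead of a list.
first_small = soup.select_one("ul.list-small li")
print(first_small.get_text())
```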

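Summary

The lxml parser is recommended; fall back to html.parser when necessary.
Tag selectors offer weak filtering but are fast.
Use find() and find_all() to match a single result or multiple results.
If you are comfortable with CSS selectors, select() is the convenient choice.
Remember the common ways to read attributes and text values.

For more on using BeautifulSoup, see the official documentation.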
