Python Web Scraping Notes (6): The BeautifulSoup Library in Detail

Date: 2020-05-21

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

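1. What is BeautifulSoup?

Answer: a flexible, convenient, and efficient web page parsing library that supports multiple parsers.

With it you can extract information from web pages without writing regular expressions.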

2. Installation

pip3 install beautifulsoup4

3. Usage

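Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses documents the way a browser does; produces valid HTML5	Slow; requires an external Python package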
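The table above implies that lxml and html5lib may not be installed everywhere, while html.parser always is. A minimal sketch of falling back to the built-in parser follows; the helper name make_soup is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, else fall back to html.parser.

    make_soup is an illustrative helper, not a BeautifulSoup API.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # BeautifulSoup raises FeatureNotFound when the requested parser
        # library is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


# The markup is deliberately left unclosed; both parsers repair it.
soup = make_soup("<p>Hello<b>world")
print(soup.b.string)  # world
```

Different parsers can build different trees from the same broken markup, so it is safer to name the parser explicitly than to rely on whichever one happens to be installed.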

4. Basic Usage

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

5. Tag Selectors

The examples use the lxml parser.

(1) Selecting elements

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(soup.title.string)
print(type(soup.title))
print(soup.head)
print(soup.p)
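Attribute-style access such as soup.p returns only the first tag with that name, and a name that matches no tag yields None rather than an error. The short sketch below uses a made-up two-paragraph document and html.parser to show both behaviours.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the sample document used above.
html = '<p class="title">First</p><p class="story">Second</p>'
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p             # only the first <p> in document order
all_p = soup.find_all('p')   # every <p> in the document
missing = soup.table         # there is no <table>, so this is None

print(first_p['class'])  # ['title']
print(len(all_p))        # 2
print(missing)           # None
```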

(2) Getting the tag name

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)

(3) Getting attributes

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
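Two details of attribute access are easy to miss: class is a multi-valued attribute and comes back as a list, and square-bracket access raises KeyError for a missing attribute while get() returns None. A small sketch with illustrative markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dormouse">text</p>', 'html.parser')
p = soup.p

print(p['name'])    # dormouse  (single-valued attribute: a string)
print(p['class'])   # ['title', 'big']  (multi-valued attribute: a list)
print(p.get('id'))  # None  (missing attribute, no exception)

try:
    p['id']         # square brackets raise KeyError when the attribute is absent
except KeyError:
    print('no id attribute')
```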

(4) Getting content

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.string)
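.string only works when a tag has a single child, or a single chain of nested children as in the first paragraph above. When a tag has several children, .string is None and get_text() is the safer choice. A sketch with illustrative markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>Bold</b> and plain</p>', 'html.parser')

print(soup.b.string)      # Bold  (<b> has exactly one string child)
print(soup.p.string)      # None  (<p> has two children, so .string is ambiguous)
print(soup.p.get_text())  # Bold and plain  (all text below <p>, concatenated)
```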

(5) Nested selection

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)

(6) Children and descendants

.contents returns a tag's direct children as a list

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">
Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>
and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)  # .contents returns the tag's direct children as a list

.children is an iterator over all direct children; text between tags, including newlines, counts as a child

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)  # .children is an iterator over the direct children
for i,child in enumerate(soup.p.children):
    print(i,child)

.descendants is a generator over all descendants: children, grandchildren, and so on, text nodes included

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)  # .descendants is a generator over all descendants
for i,child in enumerate(soup.p.descendants):
    print(i,child)
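The difference between the three accessors is easier to see on a tiny document. The sketch below uses illustrative markup and counts what each one yields.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>one <a><b>two</b></a> three</p>', 'html.parser')

children = list(soup.p.children)        # 'one ', <a>, ' three'
descendants = list(soup.p.descendants)  # 'one ', <a>, <b>, 'two', ' three'

print(len(children), len(descendants))  # 3 5
print(soup.p.contents == children)      # True: same nodes, list vs iterator
```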

(7) Parent and ancestors

.parent returns the direct parent node

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)  # .parent returns the direct parent node

.parents yields every ancestor node

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parents)))  # .parents yields every ancestor node
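Printing every ancestor in full produces a lot of repeated markup. Printing only the tag names, as in this sketch with illustrative markup, makes the order clearer: nearest ancestor first, ending with the BeautifulSoup object itself.

```python
from bs4 import BeautifulSoup

html = '<html><body><p><a href="#">link</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

ancestor_names = [node.name for node in soup.a.parents]
print(ancestor_names)      # ['p', 'body', 'html', '[document]']
print(soup.a.parent.name)  # p
```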

　
(8) Siblings

.next_siblings yields the siblings that follow a node

.previous_siblings yields the siblings that precede a node

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))  # .next_siblings yields the siblings that follow
print(list(enumerate(soup.a.previous_siblings)))  # .previous_siblings yields the siblings that precede
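Siblings include text nodes, so the node right after a tag is often punctuation or whitespace rather than the next tag. The sketch below, on illustrative markup, filters the siblings down to tags with an isinstance check.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup('<p><a id="link1">A</a>, <a id="link2">B</a></p>', 'html.parser')
first = soup.a

print(repr(first.next_sibling))  # ', '  (a text node, not the second <a>)

# Keep only the siblings that are tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t['id'] for t in tag_siblings])  # ['link2']
```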

Standard Selectors

(1) find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content.

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

attrs

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
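print(soup.find_all(attrs={'id': 'list-1'}))  # match by an attribute dictionary
print(soup.find_all(attrs={'class': 'element'}))  # every tag whose class is "element"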
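Besides an attrs dictionary, find_all also accepts attribute filters as keyword arguments (class_ has a trailing underscore because class is a Python keyword), plus limit and recursive. A brief sketch on a shortened version of the sample markup:

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list" id="list-2"><li class="element">Jay</li></ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

by_dict = soup.find_all(attrs={'id': 'list-1'})  # attribute dictionary
by_kw = soup.find_all(id='list-1')               # same query as a keyword argument
by_class = soup.find_all(class_='element')       # filter on the class attribute
limited = soup.find_all('li', limit=2)           # stop after two matches

# recursive=False looks at direct children only.
direct_uls = soup.div.find_all('ul', recursive=False)
direct_lis = soup.div.find_all('li', recursive=False)

print(by_dict == by_kw)                  # True
print(len(by_class))                     # 3
print([li.string for li in limited])     # ['Foo', 'Bar']
print(len(direct_uls), len(direct_lis))  # 2 0
```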

text

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))  # text matches text nodes; it does not return tags

(2) find(name, attrs, recursive, text, **kwargs)

find returns the first matching element; find_all returns all matching elements

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))  # no <page> tag exists, so this prints None
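Because find returns None when nothing matches, calling a method on the result without checking raises AttributeError. One common pattern, sketched here on illustrative markup, is to test for None first.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul id="list-1"><li>Foo</li></ul>', 'html.parser')

found = soup.find('li')
missing = soup.find('page')  # no such tag, so this is None

# Guard against None before reading from the result.
found_text = found.get_text() if found is not None else None
missing_text = missing.get_text() if missing is not None else None

print(found_text)    # Foo
print(missing_text)  # None
```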

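(3) Other methods

find_parents() and find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the direct parent node

find_next_siblings() and find_next_sibling()

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling

find_previous_siblings() and find_previous_sibling()

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling

find_all_next() and find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node

find_all_previous() and find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node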
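The methods above are listed without examples, so here is a small sketch that exercises each family from one starting node. The markup is illustrative, and html.parser is used so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

html = ('<div id="box"><ul id="list-1">'
        '<li>Foo</li><li>Bar</li><li>Jay</li>'
        '</ul><p>End</p></div>')
soup = BeautifulSoup(html, 'html.parser')

bar = soup.find('li', string='Bar')  # start from the middle <li>

# Parents: nearest match vs. all matching ancestors (nearest first).
print(bar.find_parent('ul')['id'])                        # list-1
print([t.name for t in bar.find_parents(['ul', 'div'])])  # ['ul', 'div']

# Siblings: first match vs. all matches, in each direction.
print(bar.find_next_sibling('li').string)                        # Jay
print(bar.find_previous_sibling('li').string)                    # Foo
print([li.string for li in soup.li.find_next_siblings('li')])    # ['Bar', 'Jay']

# Document order: these also leave the current parent.
print(bar.find_next('p').string)                       # End
print([t.string for t in bar.find_all_next('li')])     # ['Jay']
print(bar.find_previous('li').string)                  # Foo
```

Unlike the sibling methods, find_next() and find_previous() walk the whole document in order, which is why find_next('p') reaches the paragraph outside the list.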

CSS Selectors

Pass a CSS selector directly to select() to make a selection

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))  # a leading . selects by class
print(soup.select('ul li'))  # ul li selects li elements nested inside ul elements
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
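select() always returns a list. When only the first match is wanted, select_one() returns a single tag or None. The sketch below also tries a child combinator and an attribute selector; it assumes a Beautiful Soup version that bundles the soupsieve selector package, and uses a shortened version of the sample markup.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list" id="list-2"><li class="element">Jay</li></ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.select_one('#list-2 .element')  # first match, or None
direct = soup.select('ul > li')              # li elements that are direct children of a ul
by_attr = soup.select('ul[id="list-1"] li')  # attribute selector plus descendant

print(first.get_text())                     # Jay
print(len(direct))                          # 3
print([li.get_text() for li in by_attr])    # ['Foo', 'Bar']
print(soup.select_one('table'))             # None
```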

(1) Getting attributes

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

(2) Getting content

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())
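get_text() keeps whatever whitespace is in the markup. Its separator and strip arguments, and the related stripped_strings generator, help when the text needs cleaning. A short sketch with illustrative markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li> Foo </li><li>Bar</li></ul>', 'html.parser')
ul = soup.ul

print(repr(ul.get_text()))                     # ' Foo Bar'  (whitespace kept as-is)
print(ul.get_text(separator=',', strip=True))  # Foo,Bar
print(list(ul.stripped_strings))               # ['Foo', 'Bar']
```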

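Summary

Prefer the lxml parser; fall back to html.parser when necessary
Tag selectors offer weak filtering but are fast
Use find() and find_all() to match a single result or multiple results
If you are familiar with CSS selectors, use select()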