Python Web Scraping Series: BeautifulSoup in Detail

Installation

pip3 install beautifulsoup4

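Parsers

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup,'html.parser') | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser | BeautifulSoup(markup,'lxml') | Fast; tolerant of malformed documents | Requires the C library to be installed
lxml XML parser | BeautifulSoup(markup,'xml') | Fast; the only parser that supports XML | Requires the C library to be installed
html5lib | BeautifulSoup(markup,'html5lib') | Best tolerance of malformed documents; parses the document the way a browser does; produces HTML5-format documents | Slow; an external Python package, although it needs no C extension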
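
Because lxml is an optional dependency, a script can fall back to the built-in parser when lxml is not installed. The following is a minimal sketch, not part of the original article: the helper name make_soup is hypothetical, and it assumes that bs4 raises FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello <b>world")
print(soup.b.string)
```

Either parser closes the unclosed tags here, so the lookup of the b tag works the same way in both cases.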

Basic usage

html = """ 
 <html dir="ltr" lang="en"><head><meta charset="utf-8"/>  <title>The Dormouse's story</title> </head><body><p class="title" name="dormouse"> <b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters;and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">   <!-- Elsie --></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
   </p>  <p class="story">   ...story go on...</p>
 """
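from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())

The html string above is missing its closing </body> and </html> tags; the parser fills them in automatically: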

<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dormouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;    and they lived at the bottom of a well
  </p>
  <p class="story">
   ...story go on...
  </p>
 </body>
</html>
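
How much is filled in depends on the parser. The example above uses lxml, which wraps fragments in html and body tags. The sketch below uses html.parser instead (an assumption made here so that it runs without extra packages); it closes unclosed tags but does not add those wrappers.

```python
from bs4 import BeautifulSoup

fragment = "<p>Hi"  # unclosed tag, no <html> or <body>

soup = BeautifulSoup(fragment, "html.parser")
repaired = str(soup)

print(repaired)           # the <p> tag has been closed
print(soup.html is None)  # html.parser does not add an <html> wrapper
```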

print(soup.title.string)

This prints the title of the HTML document:

The Dormouse's story

Tag selectors

Selecting elements

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

The output is as follows:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head>
<p class="title" name="dormouse"> <b>The Dormouse's story</b></p> # only the first p tag is returned

Getting the tag name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)

title

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

Both ways of reading the attribute return the same value:

dormouse
dormouse
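
Indexing a tag with a missing attribute raises KeyError, so get() is often the safer choice. The sketch below, which uses its own small snippet and html.parser as assumptions, also shows that class is returned as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

snippet = '<p class="title big" name="dormouse">text</p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

name = p.get("name")    # value of an existing attribute
missing = p.get("id")   # None instead of a KeyError
classes = p["class"]    # multi-valued attribute, returned as a list

print(name, missing, classes)
```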

Getting the text content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.b.string)

The Dormouse's story
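
Note that string only works when a tag has a single child. For a tag that mixes text and child tags it returns None, and get_text() is the usual alternative. A minimal sketch with an assumed snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

single = soup.b.string    # the <b> tag has exactly one text child
mixed = soup.p.string     # the <p> tag has two children, so this is None
full = soup.p.get_text()  # concatenates all text inside the tag

print(single, mixed, full)
```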

Nested selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)

The Dormouse's story

Child and descendant nodes

html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/>  <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n   <a class="sister" href="http://example.com/elsie" id="link1">   <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well\n  </p>  <p class="story">   ...story go on...</p>
 '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

The contents attribute returns the direct children as a list:

['Once upon a time there were three little sisters;and their names were\n   ', <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 'and', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';    and they lived at the bottom of a well\n  ']

children is an iterator over the same direct children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)

<list_iterator object at 0x7fe986ba07f0>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
3 and
4 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
5 ; and they lived at the bottom of a well

html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/>  <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n   <a class="sister" href="http://example.com/elsie" id="link1">   <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well\n  </p>  <p class="story">   ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

With descendants, the nested nodes (grandchildren and deeper) are printed as well:

<generator object descendants at 0x7fe986c11468>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2
3 <span>Elsie </span>
4 Elsie
5 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
6 Lacie
7 and
8 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
9 Tillie
10 ; and they lived at the bottom of a well
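
Both iterators yield text nodes as well as tags. When only the tags matter, they can be filtered with isinstance. This is a small sketch on an assumed snippet, parsed with html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = '<p>one<a href="#"><span>two</span></a>three</p>'
soup = BeautifulSoup(snippet, "html.parser")

# Direct children only: 'one', <a>, 'three'
child_count = len(list(soup.p.children))
child_tags = [c.name for c in soup.p.children if isinstance(c, Tag)]

# All nested nodes: 'one', <a>, <span>, 'two', 'three'
descendant_count = len(list(soup.p.descendants))
descendant_tags = [d.name for d in soup.p.descendants if isinstance(d, Tag)]

print(child_count, child_tags)
print(descendant_count, descendant_tags)
```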

Parent and ancestor nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)

The output is as follows:

<p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p>

Iterating over the parent tag yields its direct children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parent)))

The output is as follows:

[(0, 'Once upon a time there were three little sisters;and their names were\n   '), (1, <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>), (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (3, 'and'), (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (5, ';    and they lived at the bottom of a well\n  ')]

print(list(enumerate(soup.a.parents)))

This lists every ancestor; the last entry is the root node of the document:

[(0, <p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p>), (1, <body><p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p> <p class="story">   ...story go on...</p>
</body>), (2, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p> <p class="story">   ...story go on...</p>
</body></html>), (3, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p> <p class="story">   ...story go on...</p>
</body></html>)]
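
The last two entries in the output above print identically because the final ancestor is the BeautifulSoup object itself, which renders as the whole document. Printing the ancestor names makes the chain easier to read. A minimal sketch on an assumed snippet:

```python
from bs4 import BeautifulSoup

snippet = '<html><body><p><a href="#">x</a></p></body></html>'
soup = BeautifulSoup(snippet, "html.parser")

# Walk upward from the first <a> tag and record each ancestor's name.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```

The BeautifulSoup object reports its name as '[document]', which distinguishes it from the html tag.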

Sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))

The output is as follows:

[(0, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (1, 'and'), (2, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (3, '; and they lived at the bottom of a well\n ')]

print(list(enumerate(soup.a.previous_siblings)))

[(0, 'Once upon a time there were three little sisters;and their names were\n ')]
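
As the output shows, siblings include text nodes, so the node right after a tag is often a string rather than the next tag. To skip straight to the next matching tag, find_next_sibling() can be used. A short sketch on an assumed snippet:

```python
from bs4 import BeautifulSoup

snippet = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")
first = soup.a

direct = first.next_sibling               # the text node ' and '
next_link = first.find_next_sibling("a")  # skips the text node

print(repr(direct), next_link["id"])
```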

Standard selectors

find_all(name,attrs,recursive,text,**kwargs)
Searches the document by tag name, attributes, or text content.

name

html = """
 <div class="panel">
   <div class="panel-heading">
     <h4>Helllo</h4>
   </div>
   <div class="panel-body">
     <ul class="list" id="list-1">
       <li class="element">Foo</li>
       <li class="element">Bar</li>
       <li class="element">Jay</li>
     </ul>
     <ul class="list list-small" id="list-2">
       <li class="element">Foo</li>
       <li class="element">Bar</li>
     </ul>
   </div>
 </div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

The output is as follows:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

<class 'bs4.element.Tag'>

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

The output is as follows:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

attrs

html = '''
 <div class="panel">\n  <div class="panel-heading">\n    <h4>Helllo</h4>\n  </div>\n  <div class="panel-body">\n    <ul class="list" id="list-1" name=elements>\n      <li class="element">Foo</li>\n      <li class="element">Bar</li>\n      <li class="element">Jay</li>\n    </ul>\n    <ul class="list list-small" id="list-2">\n      <li class="element">Foo</li>\n      <li class="element">Bar</li>\n    </ul>\n  </div>\n</div>
 '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))

The output is as follows:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

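If you know the id or class, you can also pass it as a keyword argument:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))

The output is as follows:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

Because class is a reserved word in Python, the keyword is spelled class_:

print(soup.find_all(class_='element'))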

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
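
When a tag carries several classes, class_ matches if any one of them equals the given value. The sketch below uses a shortened version of the list markup and html.parser, both chosen here for illustration:

```python
from bs4 import BeautifulSoup

snippet = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

# 'list' appears in the class attribute of both <ul> tags.
all_lists = [ul["id"] for ul in soup.find_all("ul", class_="list")]

# 'list-small' appears only on the second one.
small_lists = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]

print(all_lists, small_lists)
```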

text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))

['Foo', 'Foo']
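
The text filter also accepts a compiled regular expression. In Beautiful Soup 4.4.0 and later the same argument is available under the name string, which this sketch assumes; the snippet is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

snippet = "<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

# Without a tag name, the matching text nodes themselves are returned.
matches = [str(s) for s in soup.find_all(string=re.compile(r"^Ba"))]

# Combined with a tag name, the tags whose text matches are returned.
items = [str(tag) for tag in soup.find_all("li", string="Foo")]

print(matches, items)
```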

find(name,attrs,recursive,text,**kwargs)
find returns a single element (the first match), while find_all returns all matching elements.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))

<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

print(type(soup.find('ul')))

<class 'bs4.element.Tag'>

If the element does not exist, find() returns None:

print(type(soup.find('page')))

<class 'NoneType'>
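
Since find() can return None, calling a method on its result without a check raises AttributeError. One way to guard against that is sketched below; the helper name text_of and its default parameter are hypothetical.

```python
from bs4 import BeautifulSoup


def text_of(root, name, default=""):
    """Return the text of the first matching tag, or default if none exists."""
    tag = root.find(name)
    return tag.get_text() if tag is not None else default


soup = BeautifulSoup("<ul><li>Foo</li></ul>", "html.parser")

found = text_of(soup, "li")
missing = text_of(soup, "page", default="(not found)")

print(found, missing)
```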

CSS selectors

Pass a CSS selector directly to select() to make a selection.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('ul')[0])

The output is as follows:

[<div class="panel-heading"> <h4>Helllo</h4> </div>]

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
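
Instead of indexing the list returned by select(), select_one() returns the first match directly, or None when nothing matches. The sketch below also chains class selectors and uses the child combinator. It assumes a bs4 installation with CSS selector support and uses a shortened, made-up version of the markup.

```python
from bs4 import BeautifulSoup

snippet = """
<div class="panel">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

first_item = soup.select_one("#list-1 > li.element")  # first direct child <li>
small_items = soup.select("ul.list.list-small li")    # <ul> must have both classes
no_match = soup.select_one("table")                   # None when nothing matches

print(first_item.get_text(), [li.get_text() for li in small_items], no_match)
```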

Iterating over the results:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

The output is as follows:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

The output is as follows:
list-1
list-1
list-2
list-2

Getting the text content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())

The output is as follows:
Foo
Bar
Jay
Foo
Bar
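
Real pages usually contain stray whitespace around the text. get_text() accepts a separator and a strip flag, and stripped_strings yields the cleaned pieces one by one. A minimal sketch on an assumed snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<li>  Foo <b> Bar </b> </li>", "html.parser")

# Strip each piece of text and join the non-empty pieces with a separator.
joined = soup.li.get_text("|", strip=True)

# The same pieces as a list.
pieces = list(soup.li.stripped_strings)

print(joined, pieces)
```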

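Summary:

  • Prefer the lxml parser; fall back to html.parser when necessary
  • Tag selectors offer only weak filtering but are fast
  • Use find() and find_all() to match a single result or multiple results
  • If you are familiar with CSS selectors, use select()
  • Remember the common ways of getting attributes and text values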
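
To tie the points together, the sketch below extracts the same data twice, once with find_all() and once with select(). It is an illustration only: it uses html.parser so that it runs without lxml, and the variable names are made up here.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Standard selectors: map each list id to the text of its items.
by_find = {
    ul["id"]: [li.get_text(strip=True) for li in ul.find_all("li")]
    for ul in soup.find_all("ul")
}

# CSS selectors: the same extraction written with select().
by_select = {
    ul["id"]: [li.get_text(strip=True) for li in ul.select("li")]
    for ul in soup.select("ul")
}

print(by_find)
print(by_find == by_select)
```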