Web Scraping with BeautifulSoup and CSS Selectors

Date: 2019-11-11

Tags: web scraping, beautifulsoup, css. Category: web crawlers


1. Introduction to Beautiful Soup

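2. Installing Beautiful Soup

You can install Beautiful Soup with either pip or easy_install; both commands below work. pip is the better choice on current Python setups, since easy_install is deprecated.

easy_install beautifulsoup4

pip install beautifulsoup4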

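Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in parser. The lxml parser is more capable and faster, so installing it is recommended.

Python standard library: BeautifulSoup(markup, "html.parser")

lxml HTML parser: BeautifulSoup(markup, "lxml")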
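The parser choice above can be sketched as a small fallback helper. This is only an illustration: make_soup is a hypothetical name, not part of Beautiful Soup, and the sketch assumes that requesting a parser that is not installed raises bs4.FeatureNotFound.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<p>Hello, <b>world</b></p>"


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The requested parser library is not installed on this machine.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(markup)
print(soup.b.get_text())
```

Naming the parser explicitly keeps the parse result the same from one machine to the next, since different parsers can build slightly different trees from the same markup.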

4. Creating a Beautiful Soup object

First, import the bs4 library: from bs4 import BeautifulSoup

We create a string that the later examples will use for demonstration:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

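Create the BeautifulSoup object, naming the parser explicitly so the result does not depend on which parsers happen to be installed: soup = BeautifulSoup(html, "html.parser")

We can also create the object from a local HTML file, for example: soup = BeautifulSoup(open('index.html'), "html.parser")

This line opens the local file index.html and uses its contents to build the soup object.

Now print the contents of the soup object as formatted output: print(soup.prettify())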
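A minimal sketch of both ways of building the object, from a string and from a file. The shortened markup and the temporary index.html file are stand-ins for illustration; a with block is used so the file handle is closed after parsing.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>Hi</p></body></html>"
)

# From a string, naming the parser explicitly.
soup = BeautifulSoup(html, "html.parser")

# From a file: write a throwaway file, then parse the open file handle.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        file_soup = BeautifulSoup(f, "html.parser")

print(soup.prettify())         # indented, one node per line
print(file_soup.title.string)
```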

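5. The four kinds of objects

Beautiful Soup converts a complex HTML document into a tree of Python objects. Every node in the tree is a Python object, and all of them fall into four kinds: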

Tag
NavigableString
BeautifulSoup
Comment
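The four kinds listed above can be seen side by side in a short sketch. The markup here is made up for the demonstration, with an HTML comment added so that a Comment node exists.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

markup = '<p class="title"><b>The Dormouse\'s story</b><!-- a comment --></p>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.b                   # a Tag
text = tag.string              # the text inside the tag, a NavigableString
comment = soup.p.contents[1]   # second child of <p>, a Comment

for node in (soup, tag, text, comment):
    print(type(node).__name__)
```

Comment is a subclass of NavigableString, so code that needs to skip comments has to check for Comment explicitly.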

What is a Tag? Put simply, it is one of the tags in the HTML, for example: <title>The Dormouse's story</title> ; <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

An HTML tag such as title or a, together with the content it encloses, is a Tag. Let's see how Beautiful Soup makes it easy to get at Tags.

A Tag has two important attributes, name and attrs. The name examples look like this:

print(soup.name)
print(soup.head.name)
# [document]
# head
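The attrs side of a Tag can be sketched the same way. The single-link markup is a cut-down stand-in for the example document.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(markup, "html.parser")
link = soup.a

print(link.name)          # the tag name
print(link.attrs)         # all attributes as a dict
print(link["href"])       # dictionary-style access to one attribute
print(link["class"])      # class is multi-valued, so it comes back as a list
print(link.get("title"))  # .get() returns None for a missing attribute
```

Indexing with link["title"] would raise KeyError for the missing attribute, so .get() is the safer choice when an attribute is optional.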

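7. Searching the document tree

(1) find_all( name , attrs , recursive , text , **kwargs )

(2) find( name , attrs , recursive , text , **kwargs )

The only difference from find_all() is the return value: find_all() returns a list, even when only one element matches, while find() returns the first match directly (or None when nothing matches).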
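The difference in return values can be checked with a small sketch; the two-link markup is invented for the demonstration.

```python
from bs4 import BeautifulSoup

markup = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

all_links = soup.find_all("a")        # list of every match
first_link = soup.find("a")           # just the first match
missing_all = soup.find_all("table")  # no match: empty list
missing_one = soup.find("table")      # no match: None

print(len(all_links), first_link["id"], len(missing_all), missing_one)
```

Because find() can return None, check the result before chaining attribute access onto it.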

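find_all( name , attrs , recursive , text , **kwargs )

find_all() searches all tag descendants of the current tag and checks whether each one matches the filter.

1) The name parameter

The name parameter finds every tag whose name is name; string objects are skipped automatically.

A. Passing a string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly. The example below finds all <b> tags in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]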

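B. Passing a regular expression

If you pass a regular expression object, Beautiful Soup filters tag names with the pattern's search() method. The example below finds all tags whose names start with b, so both <body> and <b> are found:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b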
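A short sketch of how anchoring changes what a regular-expression filter matches. It assumes search()-style matching, where an unanchored pattern can match anywhere in the tag name; the markup is a trimmed stand-in for the example document.

```python
import re

from bs4 import BeautifulSoup

markup = (
    "<html><head><title>T</title></head>"
    "<body><p><b>bold</b></p></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

# Anchored with ^: only names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any name that contains a "t".
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)
print(contains_t)
```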

C. Passing a list

If you pass a list, Beautiful Soup returns everything that matches any item in the list. The code below finds all <a> tags and all <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

D. Passing True

True matches any value. The code below finds every tag in the document but returns no string nodes:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

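E. Passing a function

If none of the other filters fits, you can define a function that takes a single element as its argument. The function returns True when the element matches and should be included, and False otherwise.

The function below checks the current element and returns True if it has a class attribute but no id attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Passing this function to find_all() returns all the <p> tags:

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]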

         
2) Keyword arguments

Note: if a named argument is not one of the built-in search parameters, it is treated as a filter on the tag attribute of that name. Passing an argument named id makes Beautiful Soup filter on each tag's "id" attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Passing href makes Beautiful Soup filter on each tag's "href" attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Several named arguments can be combined to filter on multiple attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

What if we want to filter by class? class is a reserved word in Python, so it cannot be used as a keyword argument name. Append an underscore and use class_ instead:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

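Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes, because a hyphen is not allowed in a Python argument name:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression (the exact wording varies by Python version)

You can still search for such attributes by passing a dictionary through the attrs parameter of find_all():

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]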

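3) The text parameter

The text parameter searches the string content of the document. Like name, it accepts a string, a regular expression, a list, or True. Beautiful Soup 4.4.0 and later also accept this argument under the name string.

soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]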
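A minimal sketch of string searches, written with the newer string argument name and therefore assuming Beautiful Soup 4.4.0 or later. The markup is a cut-down stand-in for the example document. Combining a tag name with a string filter returns the matching tags instead of the bare strings.

```python
import re

from bs4 import BeautifulSoup

markup = (
    "<p><b>The Dormouse's story</b></p>"
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

# Strings only: the results are the text nodes themselves.
exact = soup.find_all(string="Elsie")
by_regex = soup.find_all(string=re.compile("Dormouse"))

# Tag name plus string: the results are the tags whose text matches.
links_by_text = soup.find_all("a", string="Lacie")

print(exact, by_regex, links_by_text)
```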

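4) The limit parameter

find_all() returns every match, which can be slow on a large document tree. If you do not need all the results, use limit to cap how many are returned. It works like the LIMIT keyword in SQL: the search stops and returns as soon as the number of results reaches the limit.

Three tags in the document match the search below, but only two are returned because of the limit:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]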

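5) The recursive parameter

When you call find_all() on a tag, Beautiful Soup searches all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

A simple document:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

The result depends on whether recursive=False is passed. Here <title> sits inside <head>, so it is a descendant of <html> but not a direct child: soup.html.find_all("title") returns the <title> tag, while soup.html.find_all("title", recursive=False) returns an empty list.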
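The effect of recursive can be checked with a short sketch on a document shaped like the one above; the markup is condensed onto one line for convenience.

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(markup, "html.parser")

# Default: search all descendants of <html>.
deep = soup.html.find_all("title")

# recursive=False: only direct children of <html> are considered.
shallow = soup.html.find_all("title", recursive=False)
direct_children = soup.html.find_all("head", recursive=False)

print(len(deep), len(shallow), len(direct_children))
```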

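8. CSS selectors

When writing CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. We can filter elements the same way with soup.select(), which returns a list.

(1) Select by tag name

print(soup.select('title'))
# [<title>The Dormouse's story</title>]

print(soup.select('a'))

print(soup.select('b'))
# [<b>The Dormouse's story</b>]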

(2) Select by class name

print(soup.select('.sister'))

(3) Select by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

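(4) Combined selectors

Combined selectors follow the same rules as combining tag names, class names, and ids in a CSS file. For example, to find the element with id link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Selecting direct children:

print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]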

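(5) Select by attribute

A selector can also include an attribute condition, written in square brackets. The attribute and the tag name refer to the same node, so there must be no space between them, otherwise nothing will match.

print(soup.select('a[class="sister"]'))

print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Attribute conditions can be combined with the other selector forms in the same way: parts that refer to different nodes are separated by a space, and parts that refer to the same node are written without one.

print(soup.select('p a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]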

All of the select() calls above return lists, so you can iterate over the results and call get_text() on each element to get its text.
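Iterating over select() results might look like the sketch below. The markup is a fragment of the example document. select_one() and the href$= suffix selector do not appear in the text above; they are included as assumptions about a reasonably recent Beautiful Soup install, where CSS selection is handled by the Soup Sieve package.

```python
from bs4 import BeautifulSoup

markup = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(markup, "html.parser")

# Descendant combinator plus class: every a.sister inside a <p>.
names = [a.get_text() for a in soup.select("p a.sister")]

# Attribute suffix match: links whose href ends with "lacie".
hrefs = [a["href"] for a in soup.select('a[href$="lacie"]')]

# select_one() returns the first match (or None) instead of a list.
first = soup.select_one("#link1")

print(names)
print(hrefs)
print(first.get_text())
```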
