Using BeautifulSoup in Python

Date: 2019-11-13


1. What can BeautifulSoup do?

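BeautifulSoup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document.
BeautifulSoup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses a document and hands you the data you want to scrape. Because it is simple, a complete application takes very little code.
BeautifulSoup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and BeautifulSoup cannot detect it automatically. In that case you only have to state the original encoding.
BeautifulSoup sits alongside lxml and html5lib as a first-rate Python parsing tool, and it can use either of them as its underlying parser, which gives you a choice between different parsing strategies and raw speed.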
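Stating the original encoding can be sketched as follows. This is a minimal illustration rather than the author's code: the Latin-1 sample bytes are made up, and it assumes the built-in html.parser backend.

```python
from bs4 import BeautifulSoup

# Made-up sample: bytes in a legacy encoding with no charset declaration.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the input bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string                 # decoded to a Unicode str
utf8_bytes = soup.p.encode("utf-8")  # serialized back out as UTF-8 bytes

print(text)
print(utf8_bytes)
```

Inside the tree everything is Unicode; the encoding only matters when bytes come in or go out.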

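2. Installing bs4 on Windows

Download the package from the official site; the version I downloaded was 4.5.1.
Once downloaded, extract it into the Python installation directory.
Open cmd and change into the bs4 folder.
Run python setup.py build.
Run python setup.py install to complete the installation.
Afterwards, check that the installation succeeded. In practice none of this is necessary: if pip is available, simply run pip install beautifulsoup4.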
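One quick way to check the installation is to import the package and parse a trivial document. This is a minimal sketch that assumes the library was installed under its PyPI name beautifulsoup4, which is imported as bs4.

```python
import bs4
from bs4 import BeautifulSoup

# If the import above succeeds, the package is installed.
print(bs4.__version__)

# Parse a trivial document with the parser from the standard library.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
parsed_text = soup.p.string
print(parsed_text)
```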

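3. The four kinds of objects

Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four kinds:

Tag: Put simply, a Tag corresponds to one tag in the HTML. How does Beautiful Soup make Tags easy to get at? First create a soup object, for example: soup = BeautifulSoup(html_doc, "html.parser"), where html_doc is the HTML text or an open file.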

soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
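The calls above assume a soup object that the text never defines. The sketch below supplies one so the calls can be run end to end. The html_doc string is an assumption: a small "three sisters" page reconstructed to match the outputs shown, not the author's actual input.

```python
from bs4 import BeautifulSoup

# Assumed sample document, reconstructed to match the outputs shown above.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title
print(title_tag.name)           # tag name
print(title_tag.string)         # text inside the tag
print(title_tag.parent.name)    # name of the enclosing tag
print(soup.p["class"])          # class is multi-valued, so this is a list
print(soup.a["href"])           # attribute access works like a dict
print(len(soup.find_all("a")))  # number of <a> tags in the document
print(soup.find(id="link3").get_text())
```

Attribute-style access such as soup.a always returns the first matching tag; use find_all() when every match is needed.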

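BeautifulSoup: A BeautifulSoup object represents the whole content of a document. Most of the time you can treat it as a Tag object; it is a special Tag, and you can read its type and name like any other.
NavigableString: Once we have a tag, how do we get the text inside it? Use .string. Its type is NavigableString, that is, a string that can be navigated.

print(soup.p.string)
print(type(soup.p.string))  # the printed type is NavigableString

Comment: A Comment object is a special kind of NavigableString. When printed, its content does not include the comment markers, so if it is not handled carefully it can cause unexpected trouble in text processing.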

# Find a tag that contains a comment:
print(soup.a)
# Result (an a tag containing a comment):
<a class="sister" href="http://example.com/elsie" id="link1"><!--Elsie--></a>
print(soup.a.string)
# Result: Elsie
# The comment is printed without its comment markers, so it cannot be told
# apart from ordinary text. Check the type before using .string.
print(type(soup.a.string))
# Result (the type is Comment):
<class 'bs4.element.Comment'>
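The type check recommended above can be sketched like this. The helper name visible_text and the two-link fragment are illustrative assumptions, not part of the original article.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Assumed fragment: one link holds a comment, the other holds real text.
soup = BeautifulSoup(
    '<p><a id="link1"><!--Elsie--></a><a id="link2">Lacie</a></p>',
    "html.parser",
)


def visible_text(tag):
    """Return tag.string as a str, or None if it is missing or a comment."""
    value = tag.string
    if value is None or isinstance(value, Comment):
        return None
    return str(value)


first, second = soup.find_all("a")
print(type(first.string).__name__)                  # class of the comment node
print(isinstance(second.string, NavigableString))   # ordinary text node
print(visible_text(first), visible_text(second))
```

Because Comment is a subclass of NavigableString, the check has to test for Comment specifically.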

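4. Traversing the document tree

Direct children. Two attributes are involved: .contents and .children. A tag's .contents attribute returns the tag's children as a list. .children does not return a list, but we can iterate over it to get every child. Printing .children shows that it is a list iterator object.

Check its type: print(soup.children)
Result: <list_iterator object at 0x0000000000E24438>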
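The difference between the two attributes can be seen in a short sketch. It assumes a minimal head fragment rather than the full sample document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head = soup.head

# .contents is a real list, so it supports len() and indexing.
contents = head.contents
print(isinstance(contents, list), len(contents))

# .children is an iterator, so it has to be looped over.
child_names = [child.name for child in head.children]
print(child_names)
```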

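All descendants. The key attribute is .descendants. .contents and .children contain only a tag's direct children, whereas .descendants iterates recursively over all of a tag's descendants. As with .children, we have to iterate over it to get the contents.
Node content. If a tag has a single child of type NavigableString, .string returns that child. If a tag has exactly one child tag, .string returns the same result as that child's .string. In plain terms: if a tag contains no further tags, .string returns the text inside it; if it contains exactly one tag, .string returns the innermost text. If a tag contains several children, it cannot tell which child .string should refer to, so the result is None.

print(soup.head.string)
# Output: The Dormouse's story
print(soup.title.string)
# Output: The Dormouse's story
# If a tag contains several children, for example:
print(soup.html.string)
# Output: None

In that case, use the .strings and .stripped_strings attributes to walk through the pieces of text. .strings yields every string and has to be iterated over. Its output may contain many spaces or blank lines, and .stripped_strings removes that extra whitespace.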
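The following sketch puts .descendants, .strings, and .stripped_strings side by side. The small div fragment is an assumption chosen so that whitespace nodes show up.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<div>
  <p class="title"><b>The Dormouse's story</b></p>
  <p>Once upon a time</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Several children, so .string cannot pick one.
print(div.string)

# .descendants walks the whole subtree; keep only the Tag nodes here.
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]
print(descendant_tags)

raw = list(div.strings)             # includes whitespace-only strings
clean = list(div.stripped_strings)  # whitespace-only strings are dropped
print(raw)
print(clean)
```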

Parent nodes. The .parent attribute gives a node's parent; an element's .parents attribute iterates over all of its ancestors.

content = soup.head.title.string
for parent in content.parents:
    print(parent.name)
Output:
title
head
html
[document]

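Sibling nodes. Siblings are nodes on the same level as the current node. .next_sibling returns the node's next sibling and .previous_sibling the previous one; if there is no such node, the result is None. Note: in real documents a tag's .next_sibling and .previous_sibling are usually strings or whitespace, because whitespace and line breaks also count as nodes, so the result may well be blank or a newline. The .next_siblings and .previous_siblings attributes iterate over all of a node's following or preceding siblings.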
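The whitespace pitfall with siblings is easiest to see in a small example. This sketch assumes a three-link fragment with one link per line.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p>
<a id="link1">Elsie</a>
<a id="link2">Lacie</a>
<a id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="link1")
last = soup.find(id="link3")

# The immediate sibling is the newline between the tags, not the next <a>.
print(repr(first.next_sibling))

# Iterate over all siblings and keep only the tags.
following_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
preceding_ids = [s["id"] for s in last.previous_siblings if isinstance(s, Tag)]
print(following_ids)
print(preceding_ids)

# A node without siblings yields None.
print(soup.p.next_sibling)
```

Note that .previous_siblings walks backwards, so the nearest sibling comes first.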

Next and previous elements. Unlike `.next_sibling` and `.previous_sibling`, the `.next_element` and `.previous_element` attributes are not limited to siblings; they range over all nodes regardless of level.

For example, if the head node is <head><title>The Dormouse's story</title></head>, then print(soup.head.next_element) outputs <title>The Dormouse's story</title>. No level relationship is involved.
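A short comparison of .next_element and .next_sibling, as a sketch on an assumed minimal page:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>text</p></body></html>",
    "html.parser",
)
head = soup.head

# next_element follows parse order, so it descends into the first child.
print(head.next_element)
# next_sibling stays on the same level.
print(head.next_sibling)

# From the title text, parse order continues past </title></head> to <body>,
# while the text itself has no sibling at all.
title_string = soup.title.string
after = title_string.next_element
print(after.name, title_string.next_sibling)
```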

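### 5. Searching the document tree
The main method is `find_all()`, which searches all tag descendants of the current tag and checks whether each one matches the filter: `find_all( name , attrs , recursive , text , **kwargs )`
1. The name argument: name finds all tags whose name is name; string (text) nodes are ignored automatically.
    - A. Passing a string: the simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches it exactly.

print(soup.find_all("b"))
Output: [<b>The Dormouse's story</b>]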

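- B. Passing a regular expression: if you pass a compiled regular expression, Beautiful Soup filters tag names with the pattern's search() method. The example below finds every tag whose name starts with a, which in this document means the three <a> tags.

import re
for tag in soup.find_all(re.compile("^a")):
    print(tag.name)
Output:
a
a
a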

- C. Passing a list: if you pass a list, Beautiful Soup returns everything that matches any item in the list. The code below finds all <p> and <b> tags in the document.

for tag in soup.find_all(["p", "b"]):
    print(tag.name)
Output:
p
b
p
p

- D. Passing True: True matches any value. The code below finds every tag but returns no string nodes.

for tag in soup.find_all(True):
    print(tag.name)
Output:
html
head
title
body
p
b
p

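- E. Passing a function: if no other filter fits, you can define a function that takes a single element as its argument. It returns True if the element matches and should be included, and False otherwise. The function below checks the current element and returns True if it has a class attribute but no id attribute.

def has_tag_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

for tagName in soup.find_all(has_tag_but_no_id):
    print(tagName.name)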

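1. Keyword arguments
    - Some tag attributes cannot be used as keyword arguments, for example the HTML5 data-* attributes: data-foo is not a valid Python identifier, so find_all(data-foo="value") is a syntax error. Passing the attribute name as a plain string does not help either, because it is treated as a tag name and matches nothing.

data_soup = BeautifulSoup("<div data-foo = 'value'>foo!</div>", "html.parser")
print(data_soup.find_all("data-foo"))
Output: []

- Such attributes can still be searched by passing a dictionary to the attrs argument of find_all().

print(data_soup.find_all(attrs={"data-foo": "value"}))
Output: [<div data-foo="value">foo!</div>]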
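For attributes whose names are valid Python identifiers, keyword arguments work directly. The sketch below shows a few common forms on an assumed fragment; class_ carries a trailing underscore because class is a reserved word in Python.

```python
import re
from bs4 import BeautifulSoup

html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<div data-foo="value">foo!</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")                     # exact attribute value
by_href = soup.find_all(href=re.compile("elsie"))     # regular expression
by_class = soup.find_all("a", class_="sister")        # CSS class
by_data = soup.find_all(attrs={"data-foo": "value"})  # attrs dictionary

print(by_id, by_href, by_class, by_data, sep="\n")
```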

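1. The text argument: text searches the strings in the document rather than the tags. Like name, it accepts a string, a regular expression, a list, or True. Since Beautiful Soup 4.4.0 the same argument is also available under the name string.

print(soup.find_all(text="Lacie"))
Output: ['Lacie']
print(soup.find_all(text=["Lacie", "Tillie"]))
Output: ['Lacie', 'Tillie']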

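1. The recursive argument: when you call a tag's find_all() method, Beautiful Soup examines all of the tag's descendants. To search only the tag's direct children, pass `recursive=False`.
1. **Other methods similar to find_all()**
    - find_previous_siblings() and find_previous_sibling(): these two methods use .previous_siblings to iterate over the sibling nodes that precede the current tag. find_previous_siblings() returns all matching preceding siblings; find_previous_sibling() returns the first one.
    - find_all_next() and find_next(): these two methods use .next_elements to iterate over the tags and strings that follow the current tag. find_all_next() returns all matching nodes; find_next() returns the first one.
    - find_all_previous() and find_previous(): these two methods use .previous_elements to iterate over the tags and strings that precede the current node. find_all_previous() returns all matching nodes; find_previous() returns the first one.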
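The recursive argument and the related search methods listed above can be tried together. This is a sketch on an assumed compact page, not code from the original article.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>Story</b></p>"
    "<a id='link1'>Elsie</a><a id='link2'>Lacie</a><a id='link3'>Tillie</a>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# <title> is a grandchild of <html>, so only the recursive search finds it.
deep = soup.html.find_all("title")
shallow = soup.html.find_all("title", recursive=False)

middle = soup.find(id="link2")
prev_one = middle.find_previous_sibling("a")    # nearest matching sibling before
prev_all = middle.find_previous_siblings("a")   # all matching siblings before
next_one = soup.b.find_next("a")                # first match later in the document
all_next = soup.b.find_all_next("a")            # every match later in the document
before = middle.find_previous("p")              # first match earlier in the document
all_before = middle.find_all_previous("a")      # every match earlier in the document

print(len(deep), len(shallow))
print(prev_one["id"], next_one["id"], before["class"])
```

The sibling methods stay on one level of the tree, while the next and previous methods follow document order across levels.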

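### 6. CSS selectors
1. In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to select elements through soup.select(), which returns a list.

print(soup.select("title"))
Output: [<title>The Dormouse's story</title>] (returned as a list)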

1. Lookup by class name

print(soup.select(".sister"))
Output: [<a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

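1. Lookup by id and combined lookup
    - Lookup by id

print(soup.select("#link2"))
Output: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

- Combined lookup. This works the same way as combining tag names, class names, and ids when writing CSS. For example, to find the element with id link1 inside a p tag, separate the two parts with a space.

print(soup.select("p #link1"))
Output: [<a class="sister" href="http://example.com/elsie" id="link1"></a>]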

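1. Attribute lookup: a selector can also include an attribute condition, written in square brackets. The attribute belongs to the same node as the tag, so there must be no space between the tag name and the bracket; otherwise nothing will match.

print(soup.select("a[class='sister']"))  # much like selectors in jQuery
Output: [<a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]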
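The crawler in the next section relies on the child combinator >, which is not covered above. The sketch below contrasts it with the descendant (space) combinator. The markup is a made-up fragment loosely modelled on that page, and select() is assumed to be backed by the soupsieve package that ships with current Beautiful Soup releases.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one direct link and one nested link.
html = """
<div class="items">
  <span class="elems-l">
    <a class="district" href="/sale/a/">A</a>
    <b><a href="/sale/a1/">A1</a></b>
  </span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

any_depth = soup.select("span.elems-l a")      # space: any descendant
direct = soup.select("span.elems-l > a")       # ">": direct children only
by_suffix = soup.select("a[href$='a1/']")      # attribute ends with a value
first_link = soup.select_one("div.items a")    # first match or None

print([a.get_text() for a in any_depth])
print([a.get_text() for a in direct])
print(by_suffix[0].get_text(), first_link["href"])
```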

### 7. A small crawler example using BeautifulSoup
- Part of the source code follows:

# url = http://wuhan.anjuke.com/sale/
from bs4 import BeautifulSoup
import requests
import time
import pymysql
import re

url_QX = []

def url_qx(url):
    wb_data = requests.get(url).text
    soup = BeautifulSoup(wb_data, "html.parser")
    information = soup.select("div.div-border.items-list > div.items > span.elems-l > a")
    # The URLs we are after are the ones below. The selector actually matches
    # more than we need; that is fine, we only keep the ones we want.
    # <div class="items"><span class="item-title">区域：</span><span class="elems-l">
    # <span class="selected-item">所有</span>
    # <a href='http://wuhan.anjuke.com/sale/wuchanga/' class='' >武昌</a>
    # <a href='http://wuhan.anjuke.com/sale/hongshana/' class='' >洪山</a>
    # <a href='http://wuhan.anjuke.com/sale/jiangan/' class='' >江岸</a>
    # <a href='http://wuhan.anjuke.com/sale/jianghana/' class='' >江汉</a>
    # <a href='http://wuhan.anjuke.com/sale/qiaokou/' class='' >硚口</a>
    # <a href='http://wuhan.anjuke.com/sale/hanyang/' class='' >汉阳</a>
    # <a href='http://wuhan.anjuke.com/sale/dongxihu/' class='' >东西湖</a>
    # <a href='http://wuhan.anjuke.com/sale/qingshan/' class='' >青山</a>
    # <a href='http://wuhan.anjuke.com/sale/jiangxiat/' class='' >江夏</a>
    # <a href='http://wuhan.anjuke.com/sale/zhuankoukaifaqu/' class='' >沌口开发区</a>
    # <a href='http://wuhan.anjuke.com/sale/huangpiz/' class='' >黄陂</a>
    # <a href='http://wuhan.anjuke.com/sale/caidianz/' class='' >蔡甸</a>
    # <a href='http://wuhan.anjuke.com/sale/hannanz/' class='' >汉南</a>
    # <a href='http://wuhan.anjuke.com/sale/xinzhouz/' class='' >新洲</a>
    # <a href='http://wuhan.anjuke.com/sale/qitao/' class='' >其余</a></span></div><!-- 区域 end-->
    for link in information:
        data = {
            "url": link.get("href"),
            "address": link.get_text()
        }
        # print(data)
        url_QX.append(data)

url_qx("http://wuhan.anjuke.com/sale/")
# These are the districts (Wuchang, Hankou, and so on). What we ultimately
# want is every location inside each district (Shahu, for example).
url_qxTop = url_QX[0:15]
# print(url_qxTop)
url_finally = []  # temporary list that collects the extracted addresses

def url_qx_finally(urlDict):
    for district in urlDict:
        eachUrl = district["url"]
        wb_data = requests.get(eachUrl).text
        wb_soup = BeautifulSoup(wb_data, "html.parser")
        wb_information = wb_soup.select("div.div-border.items-list > div.items > span.elems-l > div.sub-items > a")
        # Taking the Wuchang district as an example, this finds all of the
        # finer-grained areas under it:
        # <div class="items"><span class="item-title">区域：</span><span class="elems-l">
        # <a href='http://wuhan.anjuke.com/sale/' class='' >所有</a><span class="selected-item">武昌</span>
        # <a href='http://wuhan.anjuke.com/sale/hongshana/' class='' >洪山</a><a href='http://wuhan.anjuke.com/sale/jiangan/' class='' >江岸</a>
        # <a href='http://wuhan.anjuke.com/sale/jianghana/' class='' >江汉</a><a href='http://wuhan.anjuke.com/sale/qiaokou/' class='' >硚口</a>
        # <a href='http://wuhan.anjuke.com/sale/hanyang/' class='' >汉阳</a><a href='http://wuhan.anjuke.com/sale/dongxihu/' class='' >东西湖</a>
        # <a href='http://wuhan.anjuke.com/sale/qingshan/' class='' >青山</a><a href='http://wuhan.anjuke.com/sale/jiangxiat/' class='' >江夏</a>
        # <a href='http://wuhan.anjuke.com/sale/zhuankoukaifaqu/' class='' >沌口开发区</a><a href='http://wuhan.anjuke.com/sale/huangpiz/' class='' >黄陂</a>
        # <a href='http://wuhan.anjuke.com/sale/caidianz/' class='' >蔡甸</a><a href='http://wuhan.anjuke.com/sale/hannanz/' class='' >汉南</a>
        # <a href='http://wuhan.anjuke.com/sale/xinzhouz/' class='' >新洲</a><a href='http://wuhan.anjuke.com/sale/qitao/' class='' >其余</a>
        # <div class="sub-items"><span class="selected-item">所有</span>
        # <span class="sub-letter-item" style="color: #f60;margin-right: 3px;">D</span>
        # <a href='http://wuhan.anjuke.com/sale/donghudongting/' class=''>东湖东亭</a>
        # <a href='http://wuhan.anjuke.com/sale/dongtingwuchanga/?from=shangquan' class=''>东亭</a>
        # <a href='http://wuhan.anjuke.com/sale/dingziqiao/?from=shangquan' class=''>丁字桥</a>
        # <span class="sub-letter-item" style="color: #f60;margin-right: 3px;">F</span>
        # <a href='http://wuhan.anjuke.com/sale/fujiapo/' class=''>傅家坡</a>
        # <span class="sub-letter-item" style="color: #f60;margin-right: 3px;">J</span>
        # <a href='http://wuhan.anjuke.com/sale/jiyuqiao/' class=''>积玉桥</a>
        # <span class="sub-letter-item" style="color: #f60;margin-right: 3px;">L</span>
        # <a href='http://wuhan.anjuke.com/sale/liangdaojie/' class=''>粮道街</a>
        # <span class="sub-letter-item" style="color: #f60;margin-right: 3px;">N</span>
        # <a href='http://wuhan.anjuke.com/sale/nanhuhuayuan/' class=''>南湖花园</a>
        # <span class="sub-letter-item" style="color: #f60;margin-right: 3px;">S</span>
        # <a href='http://wuhan.anjuke.com/sale/shuiguohu/' class=''>水果湖</a><a href='http://wuhan.anjuke.com/sale/simenkou/' class=''>司门口</a>
        # <a href='http://wuhan.anjuke.com/sale/shouyilu/?from=shangquan' class=''>首义路</a>
        # <span class="sub-letter-item" style="color: #f60;margin-right: 3px;">T</span>
        # <a href='http://wuhan.anjuke.com/sale/tuanjiedadao/' class=''>团结大道</a>
        # <span class="sub-letter-item" style="color: #f60;margin-right: 3px;">W</span>
        # <a href='http://wuhan.anjuke.com/sale/wuchanghuochezhan/' class=''>武昌火车站</a><a href='http://wuhan.anjuke.com/sale/wuchangzhoubian/' class=''>武昌周边</a>
        # <a href='http://wuhan.anjuke.com/sale/wutaizhafenghuo/' class=''>武泰闸烽火</a><a href='http://wuhan.anjuke.com/sale/wutaizha/?from=shangquan' class=''>武泰闸</a>
        # <span class="sub-letter-item" style="color: #f60;margin-right: 3px;">X</span>
        # <a href='http://wuhan.anjuke.com/sale/xiaodongmen/' class=''>小东门</a><a href='http://wuhan.anjuke.com/sale/xudong/' class=''>徐东</a>
        # <a href='http://wuhan.anjuke.com/sale/xujiapeng/' class=''>徐家棚</a>
        # <span class="sub-letter-item" style="color: #f60;margin-right: 3px;">Y</span>
        # <a href='http://wuhan.anjuke.com/sale/yangyuand/' class=''>杨园</a><a href='http://wuhan.anjuke.com/sale/yuemachangshouyi/' class=''>阅马场首义</a>
        # <span class="sub-letter-item" style="color: #f60;margin-right: 3px;">Z</span>
        # <a href='http://wuhan.anjuke.com/sale/zhonghualud/' class=''>中华路</a><a href='http://wuhan.anjuke.com/sale/ziyanglu/' class=''>紫阳路</a>
        # <a href='http://wuhan.anjuke.com/sale/zhongbeilu/' class=''>中北路</a><a href='http://wuhan.anjuke.com/sale/zhongnandingziqiao/' class=''>中南丁字桥</a></div></span></div>
        for wb_url in wb_information:
            finallyData = {
                "url": wb_url.get("href"),
                "address": wb_url.get_text()
            }
            url_finally.append(finallyData)

url_qx_finally(url_qxTop)
# At this point we have all the URLs.
# print(url_finally)

# Next, connect to MySQL.
# Open the database connection (replace the placeholder credentials).
conn = pymysql.connect(
    host="localhost",
    user="username",
    password="password",
    port=3306,
    db="mysql",
    charset="utf8"
)
# Use cursor() to get a cursor for executing statements.
cursor = conn.cursor()
# cursor.execute("DROP TABLE IF EXISTS 安居客20160917")
sql = """
create table 安居客20160917(
    房价 CHAR (30),
    地址 CHAR (200))
"""
cursor.execute(sql)
# cursor.execute("select * from 安居客20160917")
# results = cursor.fetchall()
# print(results)

for urlOrignal in url_finally:
    for i in range(1):
        url = (urlOrignal["url"] + "p{}/").format(i)
        data = requests.get(url).text
        soup = BeautifulSoup(data, "html.parser")
        addresses = soup.select("div.house-details > div > span.comm-address")
        # print(len(addresses))
        prices = soup.select("div.house-details > div > span:nth-of-type(3)")
        # print(len(prices))
        # Separate loop names so the result lists are not shadowed.
        for price_tag, address_tag in zip(prices, addresses):
            price = price_tag.get_text()
            # print(price)
            address = address_tag.get("title")
            # print(address)
            try:
                cursor.execute("insert into 安居客20160917 (房价,地址) values (%s,%s)", (price, address))
            except pymysql.MySQLError:
                conn.rollback()
        conn.commit()
        # print(url)

cursor.close()
conn.close()