Basic Usage of the Beautiful Soup Library

Demo page: https://python123.io/ws/demo.html

>>> import requests
>>> r = requests.get('https://python123.io/ws/demo.html')
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

We use soup's prettify() method to pretty-print the HTML page.
Basic Elements of the BeautifulSoup Library
A quick look at an HTML element:

<p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
</p>

In the code above, <p>..</p> is a tag (Tag);
the name of the tag is p;
class is an attribute (Attributes), and attributes are organized as key-value pairs;
the BeautifulSoup object corresponds to the whole HTML page.
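
As a quick check of that correspondence, the soup object created above has its own type and represents the whole document:

>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> soup.name
'[document]'
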
BeautifulSoup Library Parsers
Any of these parsers can process an HTML document.

Parser               Usage                              Requirement
bs4's HTML parser    BeautifulSoup(mk,'html.parser')    install the bs4 library
lxml's HTML parser   BeautifulSoup(mk,'lxml')           pip install lxml
lxml's XML parser    BeautifulSoup(mk,'xml')            pip install lxml
html5lib parser      BeautifulSoup(mk,'html5lib')       pip install html5lib
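
Any of them can be swapped in when building the soup; a minimal sketch, assuming lxml and html5lib have been installed as shown in the table:

>>> soup_lxml = BeautifulSoup(demo, 'lxml')        # fast C-based HTML parser
>>> soup_xml = BeautifulSoup(demo, 'xml')          # strict XML parsing
>>> soup_html5 = BeautifulSoup(demo, 'html5lib')   # most browser-like parsing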

Basic Elements of the BeautifulSoup Class

Basic element      Description
Tag                A tag, the basic unit of information organization; <> marks the opening and </> the closing
Name               The tag's name; the name of <p>..</p> is 'p'; accessed as <tag>.name
Attributes         The tag's attributes, organized as a dictionary; accessed as <tag>.attrs
NavigableString    The non-attribute string inside a tag, i.e. the text between <> and </>; accessed as <tag>.string
Comment            The comment part of a string inside a tag; a special Comment type (a subclass of NavigableString)
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Every tag in soup can be accessed as soup.tag; if the document contains several tags with the same name, only the first one is returned.
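
For example, the demo page contains two <p> tags, but soup.p returns only the first one:

>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
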
Accessing a tag's name:

>>> tag.name
'a'
>>> tag.parent.name
'p'
>>> tag.parent.parent.name
'body'

Accessing a tag's attributes: a dictionary is returned whether or not the tag has any attributes.

>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> tag.attrs['class']
['py1']
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>  # the tag itself has its own type, bs4.element.Tag
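
Even a tag with no attributes returns a dictionary, just an empty one; the <b> tag of the demo page, for example:

>>> soup.b.attrs
{}
>>> type(soup.b.attrs)
<class 'dict'>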

Accessing the NavigableString:

>>> tag.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>  # this string is not an ordinary str

Accessing Comment nodes:

>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>",'html.parser')
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>

If you do not need the comments when analyzing the text, you can filter them out by checking the type.
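
A minimal sketch of that filter, reusing the newsoup fragment above (Comment can be imported from bs4):

>>> from bs4 import Comment
>>> for tag in newsoup.find_all(True):
    if not isinstance(tag.string, Comment):
        print(tag.string)

    
This is not a comment
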
HTML traversal methods:
1. Downward traversal, from the root node toward the leaf nodes
2. Upward traversal, from a leaf node toward the root node
3. Sibling (parallel) traversal between nodes at the same level
Downward traversal attributes:

Attribute      Description
.contents      A list of child nodes: all children of <tag> are stored in the list
.children      An iterator over the child nodes, similar to .contents; used to loop over the children
.descendants   An iterator over the descendant nodes, containing every descendant; used for loops
>>> soup.body
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> for child in soup.body.children:
    print(child)

    


<p class="title"><b>The demo python introduces several python courses.</b></p>


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>


>>> for child in soup.body.descendants:
    print(child)

    


<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.

Upward traversal attributes:

Attribute   Description
.parent     The parent tag of a node
.parents    An iterator over a node's ancestor tags; used to loop over the ancestors
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent
>>> for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

        
p
body
html
[document]

Sibling traversal of the tag tree

Attribute            Description
.next_sibling        The next sibling tag, in HTML text order
.previous_sibling    The previous sibling tag, in HTML text order
.next_siblings       An iterator over all following sibling tags, in HTML text order
.previous_siblings   An iterator over all preceding sibling tags, in HTML text order

Important: sibling traversal takes place between nodes under the same parent node.

>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling
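
Because many siblings are NavigableString objects (the ' and ' text, the trailing '.', bare newlines), these attributes often return a string or None rather than a tag. A small sketch that keeps only real tags while walking the following siblings:

>>> import bs4
>>> for sib in soup.a.next_siblings:
    if isinstance(sib, bs4.element.Tag):
        print(sib)

    
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>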

Pretty-printing an HTML page with bs4:

print(soup.prettify())

XML: the eXtensible Markup Language, similar in format to HTML. The earliest general-purpose information markup language; highly extensible but verbose.
JSON: an information markup form for objects in the JavaScript language. JSON consists of typed key:value pairs. The information carries types, which suits processing by programs (JavaScript), and it is more concise than XML. It is mainly used for communication between nodes and does not support comments.
YAML: untyped key:value pairs. The information is untyped, the share of plain text is the highest, and readability is good. It is used for the configuration files of all kinds of systems; it supports comments and is easy to read.
The basic approach to parsing text is illustrated by the example below:

>>> for link in soup.find_all('a'):
    print(link.get('href'))

    
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001

The soup.find_all() method:
<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list containing the search results.
name: a string to match against tag names

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a')[0]
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.find_all(['a','b'])  # to search several tag names at once, pass them as a list
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>> for tag in soup.find_all(True):
    print(tag.name)

    
html
head
title
body
p
b
p
a
a

To search for all tags whose names start with 'b', including the <b> and <body> tags, pass a regular expression to find_all.
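
A minimal sketch of that search, passing a compiled regular expression as the name argument (re is the standard-library regex module):

>>> import re
>>> for tag in soup.find_all(re.compile('^b')):
    print(tag.name)

    
body
b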

attrs: a string to match against tag attribute values; a specific attribute can also be named explicitly
recursive: whether to search all descendants; defaults to True
string: a string to match against the text between <>...</>
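
Minimal sketches of these three parameters against the demo page (the results follow from the HTML shown above):

>>> len(soup.find_all('p', 'course'))       # attrs: only the <p> with class="course" matches
1
>>> soup.find_all(id='link1')               # keyword arguments match attribute values directly
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all('a', recursive=False)     # search only the direct children of soup: no <a> at that level
[]
>>> soup.find_all(string='Basic Python')    # exact match on the string content
['Basic Python']
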
<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
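
A quick check of that equivalence (both calls return the same tags from the tree):

>>> soup('a') == soup.find_all('a')
True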

Example: a crawler for the Chinese university rankings
The URL is: http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html
First, confirm that the information we want actually appears in the HTML page rather than being generated by JavaScript.
Here all of the information is visible in the HTML page, so we can sketch a first program structure.
Step 1: fetch the ranking page from the web, getHTMLText()
Step 2: extract the information from the page into a suitable data structure, fillUnivList() (the key step)
Step 3: use the data structure to display and output the results, printUnivList()

import requests
from bs4 import BeautifulSoup as bs
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        print('Failed to fetch the page')
        return ''

def fillUnivList(ulist, html):
    soup = bs(html, 'html.parser')
    tbody = soup.find('tbody')
    if tbody is None:                         # nothing to parse if the download failed
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):   # skip the newline strings between rows; this needs the bs4 import
            tds = tr('td')                    # tr('td') is shorthand for tr.find_all('td'): all <td> tags in the row
            ulist.append([tds[0].string, tds[1].string, tds[2].string])

def printUnivList(ulist, num):
    fmt = "{:^10}\t{:^6}\t{:^10}"
    print(fmt.format('排名', '学校名称', '省市'))   # rank, university name, province
    for i in range(num):
        u = ulist[i]
        print(fmt.format(u[0], u[1], u[2]))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)   # print the top 20 universities

main()
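
One optional refinement, not part of the original listing: when Chinese text is centered with ordinary spaces, the columns drift out of alignment because Chinese characters are wider than ASCII characters. Padding with the full-width space chr(12288) keeps the columns aligned; a sketch of an alternative printUnivList:

def printUnivList(ulist, num):
    # use the full-width space (chr(12288)) as the fill character so that
    # Chinese school names line up with the column headers
    fmt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(fmt.format('排名', '学校名称', '省市', chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(fmt.format(u[0], u[1], u[2], chr(12288)))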