Getting Started with BeautifulSoup4

Date: 2019-12-05


BeautifulSoup is one of the best-known Python packages for parsing HTML. It is simple and easy to use.

Installation:

pip install beautifulsoup4

Mind the spelling and capitalization, and do not install the package named BeautifulSoup: that name refers to version 3, which is no longer updated.

Common syntax

See my earlier article: BeautifulSoup: using and testing some commonly used features

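from bs4 import BeautifulSoup

# Create an instance (html holds the page source; the 'html5lib' parser must be installed separately)
soup = BeautifulSoup(html, 'html5lib')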
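
A minimal sketch of building a soup from a string. The HTML snippet is made up for illustration, and the sketch uses the standard library's 'html.parser' backend instead of 'html5lib' so that nothing besides beautifulsoup4 has to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 'html.parser' ships with Python; 'html5lib' is a separate install.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # the <title> tag's only child string
first_p = soup.p.get_text()      # text of the first <p>
print(title_text, first_p)
```

Different parsers can build slightly different trees from malformed HTML, so it is worth staying with one parser within a project.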

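Selectors

Which selector to use depends heavily on the page:

In most cases the CSS selector method select() is all you need.
If you search by attribute name and the name contains special characters such as -, it cannot be written as a keyword argument; pass it to find() through the attrs dictionary instead.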
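
A small sketch of looking up a tag by a hyphenated attribute name. The data-role attribute and the HTML are hypothetical; the point is only that such a name has to go through the attrs dictionary, while a CSS attribute selector takes it as written:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = '<div data-role="card">A</div><div data-role="footer">B</div>'
soup = BeautifulSoup(html, "html.parser")

# soup.find("div", data-role="card") would be a SyntaxError, because
# data-role is not a valid Python keyword-argument name.
card = soup.find("div", attrs={"data-role": "card"})

# A CSS attribute selector accepts the hyphenated name as written.
footer = soup.select_one('div[data-role="footer"]')

print(card.get_text(), footer.get_text())
```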

# Preferred selector: CSS selector (returns a list of tags)
results = soup.select('div[class*=hello_world] ~ div')

for tag in results:
    print(tag.string)       # the tag's single string, or None when it has more than one child
    # print(tag.get_text())     # all text inside the tag

# Exact single-tag selector: returns the first matching tag, or None if nothing matches
tag = soup.find('div', attrs={'class': 'detail-block'})
print(tag.get_text())

# Exact multi-tag selector: returns a list of tags, not text
results = soup.find_all('div', attrs={'class': 'detail-block'})

# Multi-class selector (the tag carries several classes); the key is "class*="
results = soup.select('div[class*=hello_world] ~ div')
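
To see what these calls return, here is a sketch on a made-up fragment (the class names mirror the ones above; the content is hypothetical). class*= matches when the class attribute contains the given substring, and ~ div picks the later div siblings of that match:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = """
<div class="hello_world big">first</div>
<div class="detail-block">second</div>
<div class="detail-block">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class*=hello_world: the class attribute contains "hello_world".
# "~ div": every later <div> sibling of that match.
siblings = soup.select("div[class*=hello_world] ~ div")
sibling_texts = [tag.get_text() for tag in siblings]

# find() returns the first match (or None); find_all() returns a list of tags.
first = soup.find("div", attrs={"class": "detail-block"})
all_blocks = soup.find_all("div", attrs={"class": "detail-block"})

print(sibling_texts)
print(type(first).__name__, len(all_blocks))
```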

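Getting values

tag = soup.find('a')

# Get only the tag's text content
text = tag.get_text()

# Get the tag's single child string (None if the tag has several children);
# for the full markup (e.g. <a href='sdfj'> asdfa</a>) use str(tag)
s = tag.string

# Get an attribute of the tag
link = tag['href']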
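
The difference between get_text(), .string and the full markup is easiest to see on a tag with nested children. A sketch on a made-up link (the HTML is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = '<a href="/docs">Read <b>the</b> docs</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("a")

text = tag.get_text()     # all descendant text joined together
single = tag.string       # None here: <a> has more than one child
markup = str(tag)         # the tag serialized back to HTML
link = tag["href"]        # raises KeyError if the attribute is missing
maybe = tag.get("title")  # returns None instead of raising

print(text, single, link, maybe)
```

When an attribute may be absent, tag.get() is the safer choice.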

Modifying values

Reference: Beautiful Soup (4): functions for modifying the document tree

tag = soup.find('a', attrs={'class': 'detail-block'})

# Modify an attribute
tag['href'] = 'https://google.com'

# Modify the content between <tag> and </tag>
tag.string = 'New Content'

# Delete an attribute
del tag['class']
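
A runnable sketch of the same three edits on a made-up link, printing the tag afterwards so the effect is visible (the starting HTML is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = '<a class="detail-block" href="/old">Old Content</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("a", attrs={"class": "detail-block"})

tag["href"] = "https://google.com"   # overwrite an existing attribute
tag.string = "New Content"           # replace everything between the tags
del tag["class"]                     # remove the attribute entirely

result = str(tag)
print(result)
```

The changes are made to the tree in memory; to keep them, serialize the soup with str(soup) and write it out yourself.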

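Object types

When we search for tags with selectors, BeautifulSoup returns objects of different types depending on the function used, and each type has to be handled in its own way.

Tag type (<class 'bs4.element.Tag'>):
- tag.string
- tag.get_text()
NavigableString type (bs4.element.NavigableString):
Comment type (<class 'bs4.element.Comment'>):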
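
A short sketch that produces all three types from one made-up paragraph (the HTML is hypothetical). Comment is a subclass of NavigableString, so the isinstance check for Comment has to come first:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical HTML, used only for illustration.
html = "<p>plain text<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
children = list(p.children)   # one text node followed by one comment

kinds = [type(child).__name__ for child in children]
print(type(p).__name__, kinds)

# Comment subclasses NavigableString, so test for Comment first.
for child in children:
    if isinstance(child, Comment):
        print("comment:", child.strip())
    elif isinstance(child, NavigableString):
        print("text:", child)
```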

Adding, removing, and modifying tags

Reference: Changing page content with BeautifulSoup

# Modify a tag's content
tag = soup.find('title')
tag.string = 'New Title'
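
The snippet above only covers modification. A sketch of adding and removing as well, using new_tag(), append() and decompose() on a made-up page (the HTML, the id value and the example.com URL are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html = "<html><head><title>Old</title></head><body><p id='ad'>Ad</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Modify: replace the title text.
soup.find("title").string = "New Title"

# Add: create a tag with new_tag() and attach it with append().
link = soup.new_tag("a", href="https://example.com")
link.string = "Example"
soup.body.append(link)

# Remove: decompose() deletes the tag and its contents from the tree.
soup.find("p", attrs={"id": "ad"}).decompose()

body_html = str(soup.body)
print(body_html)
```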
