Python Web Scraping: Pitfalls When Analyzing Page Structure with BeautifulSoup

Date: 2019-12-02

When I first started analyzing web pages with BeautifulSoup, I did it like this:

from bs4 import BeautifulSoup

# Read the HTML source from a file
with open("html.txt", "r", encoding='utf-8') as file:
    content = file.read()

# Replace escaped entities by hand
entity_map = {"&lt;": "<",
              "&gt;": ">",
              "&amp;": "&",
              "&quot;": "\"",
              "&copy;": "©"}
for k, v in entity_map.items():
    content = content.replace(k, v)

# Build the tag tree of the page
soup = BeautifulSoup(content, 'lxml')

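Later I found that this caused strange problems: the replacement step above is superfluous.
BeautifulSoup converts HTML entities to Unicode characters while parsing, so the text you read back from a tag is already an ordinary decoded string.
The code above can therefore be simplified to: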

with open("html.txt", "r", encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'lxml')
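
If entities ever do need to be decoded in a plain string outside of a parser, the standard library's html.unescape is likely a safer choice than a hand-written replacement table, since it knows all named and numeric character references. A minimal sketch; the sample string is invented for illustration:

```python
import html

# Hypothetical sample text containing the same entities as the table above
raw = "&lt;div&gt; Tom &amp; Jerry &copy; 2019 &quot;ok&quot;"

# html.unescape decodes named and numeric character references in one pass
decoded = html.unescape(raw)
print(decoded)
```

This is only appropriate for text that will not be parsed as HTML afterwards; decoding before parsing produces the problem described in this post.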

A concrete example:

from bs4 import BeautifulSoup  
html_str = ''' <html><body> <div> &gt; 咱们的祖国是花园 &lt;） </div> </body></html> '''
soup = BeautifulSoup(html_str, 'lxml')
print(soup.div)
print(soup.div.string)

The output is as expected:

<div>
&gt; 咱们的祖国是花园 &lt;）
</div>

> 咱们的祖国是花园 <）
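
The two printed forms differ because printing a tag serializes it back to markup, which re-escapes special characters, while .string returns the decoded text. The sketch below shows the same distinction on a smaller, made-up snippet. It assumes the built-in "html.parser" backend rather than lxml, only so that it runs without extra dependencies:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; "html.parser" is used here instead of lxml
soup = BeautifulSoup("<p>&lt;b&gt; &amp; &copy;</p>", "html.parser")

text = soup.p.string   # decoded text: entities have become real characters
markup = str(soup.p)   # serialized markup: <, > and & are escaped again

print(text)
print(markup)
```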

If we replace the entities in the string first, as in the following program:

from bs4 import BeautifulSoup  
html_str = ''' <html><body> <div> > 咱们的祖国是花园 <） </div> </body></html> '''
soup = BeautifulSoup(html_str, 'lxml')
print(soup.div)
print(soup.div.string)

Output:

<div>
&gt; 咱们的祖国是花园 
</div>

> 咱们的祖国是花园

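The two characters <） are lost as a side effect of BeautifulSoup's fault tolerance: the bare < apparently looks like the start of a malformed tag, so the parser discards it.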
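
Conversely, when literal text containing < or > has to be placed inside markup, escaping it first keeps the parser from mistaking it for a tag. A minimal sketch using the standard library's html.escape; the variable names are illustrative, and "html.parser" is assumed instead of lxml so that the snippet has no extra dependencies:

```python
import html
from bs4 import BeautifulSoup

# Literal text that must survive a round trip through HTML
literal_text = "> 咱们的祖国是花园 <）"

# Escape <, > and & before embedding the text in markup
escaped = html.escape(literal_text)

soup = BeautifulSoup(f"<div>{escaped}</div>", "html.parser")
recovered = soup.div.string  # decoded back to the original characters

print(escaped)
print(recovered == literal_text)
```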