Notes on Using BeautifulSoup in Python 3.x

Date: 2019-12-30


Official BeautifulSoup documentation (Chinese): http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

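1. When you open a URL with urlopen and pass the response to BeautifulSoup, BeautifulSoup has to work out the page's encoding on its own. UTF-8 pages are handled fine, but for a page in GBK or a similar encoding it may log: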

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

In other words, BeautifulSoup could not decode the page's bytes correctly, so the parsed text is unreliable.
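To see what "replaced with REPLACEMENT CHARACTER" means, here is a minimal sketch using only the standard library. The sample string is an assumption chosen for illustration; it stands in for the bytes of a gb2312 page.

```python
# Bytes as a gb2312 page would send them (sample text is illustrative only).
raw = "新浪首页".encode("gb2312")

# Decoding with the wrong codec: undecodable bytes become U+FFFD,
# the Unicode REPLACEMENT CHARACTER the warning refers to.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the codec the page actually uses recovers the text.
right = raw.decode("gb2312")

print(repr(wrong))
print(right)
```

The bytes themselves are intact; only the choice of codec decides whether they turn into readable text.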

For example, http://www.sina.com.cn/ uses gb2312 (why can't it just use UTF-8? What a waste of my time!!)

import urllib.request
from bs4 import BeautifulSoup
fg = urllib.request.urlopen('http://www.sina.com.cn/')
BeautifulSoup(fg)

This prints the WARNING shown above.

If you swap Sina for Baidu, it works normally. As for how to read the Sina page, see here (link in the original post).

2. Changing the encoding BeautifulSoup uses

BeautifulSoup(page, from_encoding='gb2312')
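The call above can be tried without touching the network. The following is a minimal sketch, assuming bs4 is installed; the HTML snippet is made up for illustration and stands in for the bytes that urlopen(...).read() would return from a gb2312 page.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the raw bytes of a gb2312-encoded page.
page = "<html><head><title>新浪首页</title></head><body></body></html>".encode("gb2312")

# from_encoding tells BeautifulSoup which codec to use for the bytes
# instead of leaving it to guess. The parser is named explicitly.
soup = BeautifulSoup(page, "html.parser", from_encoding="gb2312")

title = soup.title.string
print(title)
```

Note that from_encoding only applies when the markup is passed as bytes. If the markup is already a str, BeautifulSoup ignores the argument, because the text has already been decoded.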