Fixing Garbled Chinese Text in Web Pages with BeautifulSoup

Date: 2019-12-10

When the following code is run, the Chinese text in its output comes out garbled.

from bs4 import BeautifulSoup
import urllib2

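# Fetch the page as raw bytes (Python 2, urllib2)
request = urllib2.Request('http://www.163.com')
response = urllib2.urlopen(request)
html_doc = response.read()
# No encoding hint is given, so BeautifulSoup has to guess the charset
soup = BeautifulSoup(html_doc)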

print soup.find_all('a')
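
The garbling can be reproduced offline. The following is a minimal Python 3 sketch, separate from the code above: it assumes the page bytes are GBK-encoded and that they end up decoded with a Western codec (latin-1 stands in here for whatever was wrongly guessed). The sample string is made up for illustration.

```python
# Python 3 sketch. The sample text and the wrongly guessed codec are assumptions.
sample = "新闻"                 # hypothetical link text on a Chinese page
raw = sample.encode("gbk")      # bytes as a GBK-encoded page would deliver them

# latin-1 maps every byte to a character, so the wrong decode never fails;
# it silently yields unreadable text instead.
garbled = raw.decode("latin-1")

# Decoding with the codec that matches the bytes restores the text.
restored = raw.decode("gbk")

print(repr(garbled))
print(restored)
```

Because the wrong decode succeeds without raising an error, the problem only shows up when the output is read.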

Chinese pages are usually encoded as gb2312 or gbk, so passing from_encoding="gb18030" to the BeautifulSoup constructor resolves the garbling.

Note: in BeautifulSoup 3, the parameter is named fromEncoding rather than from_encoding.

from bs4 import BeautifulSoup
import urllib2

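# Fetch the page as raw bytes (Python 2, urllib2)
request = urllib2.Request('http://www.163.com')
response = urllib2.urlopen(request)
html_doc = response.read()
# Tell BeautifulSoup to decode the bytes as gb18030 instead of guessing
soup = BeautifulSoup(html_doc, from_encoding="gb18030")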

print soup.find_all('a')
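
The gb18030 hint works for pages declared as gb2312 or gbk because gb18030 is a superset of both, so bytes in either older encoding decode correctly with it. This can be checked with a small Python 3 sketch using only the standard library. Note that `normalize_charset` is a hypothetical helper written for this illustration, not part of BeautifulSoup, and the sample string is made up.

```python
def normalize_charset(declared):
    """Map a declared GB-family charset name to gb18030.

    Hypothetical helper for illustration; other names pass through lowercased.
    """
    name = declared.strip().lower()
    if name in ("gb2312", "gbk", "gb18030"):
        return "gb18030"
    return name


sample = "网易新闻"  # made-up sample text

# Bytes produced by the older codecs decode cleanly with gb18030.
for codec in ("gb2312", "gbk"):
    raw = sample.encode(codec)
    assert raw.decode(normalize_charset(codec)) == sample
```

The name returned by such a helper could then be passed as the from_encoding argument shown above.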