Problem: UnicodeDecodeError: 'gbk' codec … when using BeautifulSoup in Python 3

Date: 2019-11-13



I wanted to convert an HTML file to plain text by calling BeautifulSoup from Python 3.

This very simple code, which just opens a local file, kept failing:

    from bs4 import BeautifulSoup
    file = open('index.html')
    soup = BeautifulSoup(file, 'lxml')
    print(soup)

It raised the following error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence

Doesn't BeautifulSoup claim to handle all kinds of encodings? So why does parsing still fail?

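I searched through a lot of BeautifulSoup material and none of it solved the problem. Then I noticed what happens when the code is written like this:

    from bs4 import BeautifulSoup
    file = open('index.html')
    str1 = file.read()  # the error is raised on this line!
    soup = BeautifulSoup(str1, 'lxml')
    print(soup)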

So that's it! The problem is in reading the file, not in BeautifulSoup's parsing.
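A likely explanation, offered here as an assumption rather than something the post verified: when `open()` is called without an `encoding` argument, Python decodes with the locale's preferred encoding (GBK on a Chinese-locale Windows system), and a first byte of 0xff is consistent with a UTF-16 little-endian byte order mark (FF FE). The sketch below inspects a file's first bytes for a BOM. `sniff_bom` is a hypothetical helper name, not part of BeautifulSoup or the standard library.

```python
import codecs

# Known byte order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(path):
    """Return a codec name suggested by the file's BOM, or None if no BOM is found."""
    with open(path, "rb") as f:  # binary mode: no decoding happens here
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

The result can be passed on as `open(path, encoding=sniff_bom(path))`. When no BOM is found the helper returns `None`, which makes `open()` fall back to the locale default, so a file without a BOM still needs its encoding supplied by hand.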

After looking into why the file read fails, here is the fix, again in four lines of code:

    from bs4 import BeautifulSoup
    file = open('index.html', 'r', encoding='utf-16-le')
    soup = BeautifulSoup(file, 'lxml')
    print(soup)
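One detail worth checking, added as a side note that is not in the original post: the `utf-16-le` codec does not strip a byte order mark, so if the file begins with FF FE the decoded text starts with the character U+FEFF, whereas the generic `utf-16` codec consumes the BOM. A minimal sketch using made-up sample bytes in place of the real file:

```python
import codecs

# Sample bytes standing in for a UTF-16 LE file that starts with a BOM (FF FE).
raw = codecs.BOM_UTF16_LE + "<html></html>".encode("utf-16-le")

with_bom = raw.decode("utf-16-le")  # the BOM survives as the character U+FEFF
without_bom = raw.decode("utf-16")  # the BOM is consumed and selects the byte order

print(repr(with_bom[0]))     # '\ufeff'
print(repr(without_bom[0]))  # '<'
```

If the stray U+FEFF turns out to matter, passing `encoding='utf-16'` to `open()` may be the simpler choice for files that carry a BOM.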

After that, soup.get_text() returns the text contained in the tags.
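As a rough illustration of that last step, the sketch below calls `get_text()` on a small hypothetical snippet. The sample markup is invented, and `html.parser` is used instead of `lxml` only so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the contents of index.html.
html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# separator and strip control how the individual text fragments are joined.
text = soup.get_text(separator="\n", strip=True)
print(text)
```

With `strip=True` each fragment is trimmed before joining, which tends to give cleaner plain text than the bare `get_text()` call.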

Other notes

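If a file mixes several encodings and decoding raises an error, the undecodable bytes can be skipped as shown below (untested):

    # content must be bytes, e.g. read from a file opened in 'rb' mode
    soup = BeautifulSoup(content.decode('utf-8', 'ignore'))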
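To show what the `'ignore'` error handler does to the data before it ever reaches BeautifulSoup, here is a small standard-library sketch. The sample bytes are hypothetical, and `'replace'` is included only for comparison.

```python
# Hypothetical bytes: valid UTF-8 markup with one stray byte (0xff) in the middle.
content = b"<p>abc\xffdef</p>"

ignored = content.decode("utf-8", "ignore")    # the invalid byte is dropped silently
replaced = content.decode("utf-8", "replace")  # the invalid byte becomes U+FFFD

print(ignored)         # <p>abcdef</p>
print(repr(replaced))  # '<p>abc\ufffddef</p>'
```

Because `'ignore'` discards data without any warning, `'replace'` can be the safer option when it matters where the damage occurred.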
