I wanted to convert an HTML file to plain text by calling BeautifulSoup from Python 3.
The following very simple code, which just opens a local file, kept failing:
from bs4 import BeautifulSoup
file = open('index.html')
soup = BeautifulSoup(file, 'lxml')
print(soup)
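It raised the following encoding error:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence
Doesn't BeautifulSoup claim to handle all kinds of encodings? Why does parsing still fail?
I searched through a lot of BeautifulSoup posts without finding a fix. Then I noticed what happens if the code is written like this: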
from bs4 import BeautifulSoup
file = open('index.html')
str1 = file.read()  # the error is raised on this line!
soup = BeautifulSoup(str1, 'lxml')
print(soup)
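So that's it! The problem is in reading the file, not in BeautifulSoup's parsing. open() without an encoding argument decodes with the system default (gbk here), and the file is not gbk.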
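The failure can be reproduced without BeautifulSoup at all. The sketch below is only an illustration: the sample HTML string, the temporary file, and the helper name try_read are made up for the demo, and the file is assumed to be UTF-16 LE with a byte order mark (which would explain byte 0xff at position 0).

```python
import codecs
import os
import tempfile

# Hypothetical sample page, saved as UTF-16 LE with a byte order mark (FF FE).
html = "<html><body><p>hello</p></body></html>"
raw = codecs.BOM_UTF16_LE + html.encode("utf-16-le")

path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "wb") as f:
    f.write(raw)


def try_read(path, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as exc:
        return f"failed: {exc}"


# Passing encoding explicitly makes the demo independent of the system locale.
gbk_result = try_read(path, "gbk")      # same codec that open() picked by default
utf16_result = try_read(path, "utf-16")  # the codec that matches the file

print(gbk_result)
print(utf16_result)
```

Reading with gbk fails on the very first byte, while reading with the matching codec returns the page text, so the parser is never the culprit.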
OK, rather than dig into every reason the file read can fail, here is the fix directly. It is still four lines of code:
from bs4 import BeautifulSoup
file = open('index.html', 'r', encoding='utf-16-le')
soup = BeautifulSoup(file, 'lxml')
print(soup)
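If the encoding is not known in advance, one option is to look at the byte order mark before opening the file in text mode. This is a rough sketch using only the standard library; sniff_bom_encoding is a hypothetical helper name, and it only recognizes files that actually start with a BOM, falling back to an assumed default otherwise.

```python
import codecs
import os
import tempfile

# BOM signatures. UTF-32 LE (FF FE 00 00) is checked before UTF-16 LE (FF FE),
# because the shorter signature is a prefix of the longer one.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom_encoding(path, default="utf-8"):
    """Guess a codec name from the file's BOM; return `default` if there is none."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


# Demo with a made-up UTF-16 LE file.
html = "<html><body><p>hello</p></body></html>"
path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + html.encode("utf-16-le"))

encoding = sniff_bom_encoding(path)
with open(path, "r", encoding=encoding) as f:
    text = f.read()

# 'utf-16-le' also decodes the file, but leaves the BOM in the text as U+FEFF.
with open(path, "r", encoding="utf-16-le") as f:
    text_le = f.read()

print(encoding)
print(repr(text_le[0]))
```

The codec names "utf-16", "utf-32" and "utf-8-sig" consume the BOM while decoding, whereas "utf-16-le" keeps it as a leading U+FEFF character in the string.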
Once the soup is built, soup.get_text() returns the text inside the tags.
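As a small illustration of get_text(), here is a sketch on an inline, made-up HTML string. It assumes bs4 is installed and uses Python's built-in "html.parser" instead of lxml only so that the example has one less dependency.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for the demo.
html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Default: text nodes are concatenated exactly as they appear.
raw_text = soup.get_text()

# separator and strip put a space between text nodes and trim each one.
clean_text = soup.get_text(separator=" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```

Without a separator, text from adjacent tags runs together, so passing separator and strip=True is often the more readable choice for plain-text output.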
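If the file mixes several encodings and decoding raises an error, the undecodable bytes can be ignored as shown below (I have not tested this):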
# content is assumed to be bytes, e.g. open('index.html', 'rb').read()
soup = BeautifulSoup(content.decode('utf-8', 'ignore'), 'lxml')
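To see what the 'ignore' error handler actually does, here is a minimal sketch on a made-up byte string, using only the standard library. It also shows 'replace', which marks each bad byte instead of silently dropping it.

```python
# Valid UTF-8 ("café") followed by one stray byte 0xff that is not valid UTF-8.
data = b"caf\xc3\xa9 \xff ok"

ignored = data.decode("utf-8", "ignore")    # bad byte is dropped
replaced = data.decode("utf-8", "replace")  # bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```

The same handlers can be passed to open() through its errors parameter, for example open('index.html', encoding='utf-8', errors='ignore'). Note that 'ignore' discards data, so it hides encoding problems rather than fixing them.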