Requests: Garbled Text (Mojibake)

Date: 2019-11-10

Tags: requests, garbled text

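When I fetched a web page with Requests, the response came back as garbled text like that in the screenshot below, and I was completely baffled.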

The program looked like this:

import requests
from bs4 import BeautifulSoup

def getLinks(articleUrl):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.108 Safari/537.36 2345Explorer/8.1.0.14126"
    }
    wb_data = requests.get(articleUrl, headers=headers)
    # .text is decoded with wb_data.encoding, which is where the garbling comes from
    bsObj = BeautifulSoup(wb_data.text, "lxml")
    return bsObj

The garbled output it produced looked like this (screenshot).

How to fix it? Luckily, Google led me to a few blog posts by people who had run into this before. Have a look. ^_^

http://blog.chinaunix.net/uid-13869856-id-5747417.html

http://blog.csdn.net/a491057947/article/details/47292923#t1

There are also the official docs, which cover this in two places. (Psst: there is a Chinese version too; go find it yourself.)

http://docs.python-requests.org/en/latest/user/quickstart/#response-content

http://docs.python-requests.org/en/master/user/advanced/#compliance

My English is poor, so let's see what the Chinese version says (see the image below).

Right, that's the reading done. Time to summarize.

Approach:

1. Garbled text is nothing to be afraid of. First, find out which encodings are involved. How? Print them out.

def getLinks(articleUrl):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.108 Safari/537.36 2345Explorer/8.1.0.14126"
    }
    wb_data = requests.get(articleUrl, headers=headers)
    bsObj = BeautifulSoup(wb_data.text, "lxml")
    hrefs = bsObj.find("div", {"class": "booklist clearfix"})
    print(wb_data.headers['content-type'])  # raw Content-Type header
    print(wb_data.encoding)  # encoding Requests uses for .text, derived from the HTTP headers
    print(wb_data.apparent_encoding)  # encoding guessed from the response body
    print(requests.utils.get_encodings_from_content(wb_data.text))  # encodings declared inside the HTML (meta tags)
    return bsObj

This is what it printed:

text/html
ISO-8859-1  # encoding derived from the HTTP headers (the default when no charset is given)
UTF-8-SIG   # encoding guessed from the response body
['utf-8']   # encoding declared in the HTML meta tag

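Now it is clear why the text was garbled. The Content-Type header is text/html with no charset, so Requests falls back to ISO-8859-1 when decoding .text, while the page itself is actually UTF-8. The encoding used for decoding and the real encoding of the content do not match.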
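To make the mismatch concrete, here is a minimal sketch using only the standard library. The helper `encoding_from_content_type` is a hypothetical name that imitates the fallback rule described above; it is not the Requests implementation, and the sample HTML is made up for illustration.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Hypothetical helper: the charset a client would assume for this header."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return charset.strip("'\"")
    # No charset on a text/* type: assume ISO-8859-1, as the HTTP default says.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


# Bytes as a UTF-8 page would send them (sample content, assumed).
raw = "<title>乱码</title>".encode("utf-8")

assumed = encoding_from_content_type("text/html")
garbled = raw.decode(assumed)   # decoded with the wrong encoding
correct = raw.decode("utf-8")   # decoded with the real encoding

print(assumed)
print(repr(garbled))
print(correct)
```

Decoding with ISO-8859-1 never fails, because every byte maps to some character, which is why the result is silent garbage rather than an exception.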

2. What to do? Since they differ, make them match by changing the encoding used to decode the response.

There are two ways:

(1) Set the .encoding attribute to change the encoding used for the response by adding the following line to the code.

wb_data.encoding = 'utf-8'  # set the encoding manually

def getLinks(articleUrl):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.108 Safari/537.36 2345Explorer/8.1.0.14126"
    }
    wb_data = requests.get(articleUrl, headers=headers)
    wb_data.encoding = 'utf-8'  # set the encoding manually, before reading .text
    bsObj = BeautifulSoup(wb_data.text, "lxml")
    return bsObj

(2) Use the raw bytes in Response.content

bsObj = BeautifulSoup(wb_data.text, "lxml")
# Change wb_data.text to wb_data.content so BeautifulSoup gets the raw bytes and detects the encoding itself
bsObj = BeautifulSoup(wb_data.content, "lxml")
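Both fixes work for the same reason: the bytes of the response are never altered, only the way they are decoded. The sketch below illustrates this with a hypothetical stand-in class, `FakeResponse`. It is not the Requests class, just a small model that assumes `.text` is decoded from `.content` using `.encoding` each time it is read.

```python
class FakeResponse:
    """Hypothetical stand-in for a response object; not the Requests class."""

    def __init__(self, content, encoding):
        self.content = content    # raw bytes, never modified
        self.encoding = encoding  # encoding used when .text is read

    @property
    def text(self):
        # Decode on every access, so a changed .encoding takes effect.
        return self.content.decode(self.encoding, errors="replace")


resp = FakeResponse("编码".encode("utf-8"), "ISO-8859-1")

before = resp.text        # decoded with the wrong encoding
resp.encoding = "utf-8"   # fix (1): correct the encoding
after = resp.text         # the same bytes, decoded correctly

print(repr(before))
print(after)
print(resp.content)       # fix (2) hands these bytes to the parser directly
```

Under that assumption, setting the encoding must happen before `.text` is read, and passing `.content` sidesteps the question by leaving detection to the parser.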

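3. As one of the posts linked above shows, someone has already written the code below as a once-and-for-all fix for this kind of problem.
I applied it to my own code to see whether it works. ^_^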

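The idea is this: when the response encoding is 'ISO-8859-1', first look for the encoding declared in the meta tag of the returned HTML; if there is none, fall back to the encoding guessed from the response body (apparent_encoding).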

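def getLinks(articleUrl):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.108 Safari/537.36 2345Explorer/8.1.0.14126"
    }
    wb_data = requests.get(articleUrl, headers=headers)

    # Start from the encoding Requests derived from the HTTP headers,
    # so that encoding is always defined (the original raised NameError otherwise).
    encoding = wb_data.encoding
    if encoding is None or encoding == 'ISO-8859-1':
        # Prefer the charset declared in the HTML; otherwise use the guessed one.
        encodings = requests.utils.get_encodings_from_content(wb_data.text)
        if encodings:
            encoding = encodings[0]
        else:
            encoding = wb_data.apparent_encoding
    # Normalize the body to UTF-8 bytes before parsing.
    encode_content = wb_data.content.decode(encoding, 'replace').encode('utf-8', 'replace')

    bsObj = BeautifulSoup(encode_content, "lxml")
    return bsObj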
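The selection logic can also be tried offline. The following is a rough sketch with only the standard library: `choose_encoding` and the `META_CHARSET` pattern are hypothetical names, the regular expression is a simplification of real charset sniffing, and the `detected` argument stands in for whatever a detector such as apparent_encoding would report.

```python
import re

# Hypothetical, simplified pattern: finds charset=... inside a <meta> tag.
META_CHARSET = re.compile(rb'<meta[^>]*?charset=["\']?([\w-]+)', re.IGNORECASE)


def choose_encoding(header_encoding, body, detected=None):
    """Pick an encoding: trusted header, then HTML meta tag, then a detected guess."""
    # A header encoding other than the ISO-8859-1 default is taken as explicit.
    if header_encoding and header_encoding.upper() != "ISO-8859-1":
        return header_encoding
    match = META_CHARSET.search(body)
    if match:
        return match.group(1).decode("ascii")
    # Last resort: the detector's guess, then the header value, then UTF-8 (assumed default).
    return detected or header_encoding or "utf-8"


page = '<html><head><meta charset="utf-8"></head><body>中文</body></html>'.encode("utf-8")

encoding = choose_encoding("ISO-8859-1", page)
print(encoding)
print(page.decode(encoding, "replace"))
```

Keeping the choice in a pure function like this makes it easy to check against sample pages without any network access.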

There, that takes care of the problem. Phew, this little gremlin was quite a handful.