Python urllib.request Pitfall

Bug log: stumble once, learn from it, and try not to fall in the same place twice.

1. Background

During project development there was a requirement to fetch the image information for given tags, which meant querying the image server. The query used to be done like this:

import json
import urllib.request

url = 'http://127.0.0.1:8080/images/query/?type=%s&tags=%s' % ('yuv', '4,3,6')

print("url: " + str(url))
response = urllib.request.urlopen(url)

download_list = json.loads(response.read())
print(download_list)

With small payloads this never caused any trouble, but once the response grew large (192708 bytes in this case), the following error showed up:

Traceback (most recent call last):
  File "/Users/min/Desktop/workspace/python/Demo/fuck.py", line 12, in <module>
    download_list = json.loads(response.read())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 464, in read
    s = self._safe_read(self.length)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 618, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(144192 bytes read, 48516 more expected)

2. Analysis and Conclusion

Digging deeper, I found that in this scenario read() eventually ends up in the following method (the print statements are my own debug additions):

def _safe_read(self, amt):
    """Read the number of bytes requested, compensating for partial reads.

    Normally, we have a blocking socket, but a read() can be interrupted
    by a signal (resulting in a partial read).  Note that we cannot
    distinguish between EOF and an interrupt when zero bytes have been
    read.  IncompleteRead() will be raised in this situation.

    This function should be used when <amt> bytes "should" be present for
    reading.  If the bytes are truly not available (due to EOF), then the
    IncompleteRead exception can be used to detect the problem.
    """
    s = []
    while amt > 0:
        print("1", amt)  # debug: bytes still expected
        chunk = self.fp.read(min(amt, MAXAMOUNT))
        print(chunk)     # debug: raw chunk read from the socket
        if not chunk:
            raise IncompleteRead(b''.join(s), amt)
        s.append(chunk)
        print('2', len(chunk))  # debug: bytes received in this chunk
        amt -= len(chunk)
        print('3', amt)  # debug: bytes still missing
    return b"".join(s)

As the docstring itself admits, this method has an inherent limitation: "we cannot distinguish between EOF and an interrupt when zero bytes have been read." The DEBUG output I got was the following:

1 192708
b'[{"title": "\\u5ba4\\u5185\\u767d\\u8272\\u80cc\\u666f\\u5899+\\u6b63\\u5e38\\u5149+\\u8fd1\\u8ddd(\\u5927\\u8138)+\\u65e0\\u9762\\u90e8\\u7a7f\\u623...  # part of the output omitted here
2 144192
3 48516
1 48516
b''

Problem located. My suggestion is to avoid urllib for large transfers whenever possible and use requests instead.
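
For reference, here is a minimal sketch of the same query rewritten with requests (assuming the requests package is installed and the same endpoint as above):

import requests

url = 'http://127.0.0.1:8080/images/query/'
params = {'type': 'yuv', 'tags': '4,3,6'}

response = requests.get(url, params=params)
response.raise_for_status()      # fail loudly on HTTP errors

download_list = response.json()  # requests transparently decodes chunked/gzip bodies
print(download_list)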

Also, the response headers that urllib.request returns seem much coarser than the ones requests reports; most notably, the crucial Transfer-Encoding entry is missing. The details are as follows:

**urllib.request:**

Server: nginx/1.14.0 (Ubuntu)
Date: Mon, 17 Sep 2018 10:02:51 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 192708
Connection: close
X-Frame-Options: SAMEORIGIN
 
**requests:**
 
{'Server': 'nginx/1.14.0 (Ubuntu)', 'Date': 'Mon, 17 Sep 2018 09:55:15 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Encoding': 'gzip'}
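
For completeness, a sketch of how the two header dumps above can be reproduced (same hypothetical endpoint as earlier):

import urllib.request
import requests

url = 'http://127.0.0.1:8080/images/query/?type=yuv&tags=4,3,6'

# Headers as seen by urllib.request
with urllib.request.urlopen(url) as resp:
    print(resp.headers)

# Headers as seen by requests
print(requests.get(url).headers)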