urllib.request Pitfalls

Python bug log: treasure every pitfall, and try not to trip in the same place twice

1. Background

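During project development, one requirement was to fetch the image information for a given set of tags, which meant querying the image server. The query used to be done like this: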

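import json
import urllib.request  # a bare "import urllib" does not reliably expose urllib.request

# Query the image server for images of type "yuv" carrying tags 4, 3 and 6
url = 'http://127.0.0.1:8080/images/query/?type=%s&tags=%s' % ('yuv', '4,3,6')

print("url: " + str(url))
response = urllib.request.urlopen(url)

# read() returns the whole body as bytes; json.loads parses it into a list
download_list = json.loads(response.read())

print(download_list)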

This caused no problems while the payload was small, but once the payload grew large (192708 bytes in this case), the following error appeared:

Traceback (most recent call last):
  File "/Users/min/Desktop/workspace/python/Demo/fuck.py", line 12, in <module>
    download_list = json.loads(response.read())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 464, in read
    s = self._safe_read(self.length)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 618, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(144192 bytes read, 48516 more expected)
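The same exception can be reproduced without the real image server. The following is a minimal sketch, assuming a throwaway local server (the names serve_truncated_response, DECLARED_LENGTH and SENT_LENGTH are made up for illustration) that announces more bytes in Content-Length than it actually sends before closing the connection. It also shows that the bytes received so far remain available on the exception as exc.partial.

```python
import http.client
import socket
import threading
import urllib.request

DECLARED_LENGTH = 1000  # hypothetical value announced in Content-Length
SENT_LENGTH = 600       # hypothetical number of bytes really sent


def serve_truncated_response(listener):
    """Accept one connection, announce DECLARED_LENGTH bytes, send fewer, close."""
    conn, _ = listener.accept()
    with conn:
        request = b""
        while b"\r\n\r\n" not in request:  # read until the end of the request headers
            chunk = conn.recv(4096)
            if not chunk:
                break
            request += chunk
        head = (
            "HTTP/1.1 200 OK\r\n"
            f"Content-Length: {DECLARED_LENGTH}\r\n"
            "Connection: close\r\n"
            "\r\n"
        ).encode("ascii")
        conn.sendall(head + b"x" * SENT_LENGTH)


def fetch(url):
    """Return (body, error); error is None when the full body arrived."""
    # An empty ProxyHandler keeps any system proxy out of this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as response:
            return response.read(), None
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes that did arrive before EOF
        return exc.partial, exc


listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
thread = threading.Thread(target=serve_truncated_response, args=(listener,), daemon=True)
thread.start()

body, error = fetch(f"http://127.0.0.1:{port}/")
thread.join()
listener.close()
print(len(body), error)
```

Whether the real server behaved this way is not something this sketch proves; it only shows one condition under which urllib.request raises the same IncompleteRead.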

2. Analysis and conclusion

Digging deeper, in this situation read() eventually ends up in the following method (the print calls were added for debugging):

def _safe_read(self, amt):
    """Read the number of bytes requested, compensating for partial reads.

    Normally, we have a blocking socket, but a read() can be interrupted
    by a signal (resulting in a partial read).

    Note that we cannot distinguish between EOF and an interrupt when zero
    bytes have been read. IncompleteRead() will be raised in this
    situation.

    This function should be used when <amt> bytes "should" be present for
    reading. If the bytes are truly not available (due to EOF), then the
    IncompleteRead exception can be used to detect the problem.
    """
    s = []
    while amt > 0:
        print("1", amt)
        chunk = self.fp.read(min(amt, MAXAMOUNT))
        print(chunk)
        if not chunk:
            raise IncompleteRead(b''.join(s), amt)
        s.append(chunk)
        print('2', len(chunk))
        amt -= len(chunk)
        print('3', amt)
    return b"".join(s)

As you can see, the method has an inherent limitation, which its own docstring admits: "we cannot distinguish between EOF and an interrupt when zero bytes have been read." The debug output turned out to be:

1 192708

b'[{"title": "\\u5ba4\\u5185\\u767d\\u8272\\u80cc\\u666f\\u5899+\\u6b63\\u5e38\\u5149+\\u8fd1\\u8ddd(\\u5927\\u8138)+\\u65e0\\u9762\\u90e8\\u7a7f\\u623..... # remainder omitted here

2 144192
3 48516
1 48516
b''
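The loop can be replayed outside http.client to see exactly where it gives up. This is a rough sketch, not the library's code: safe_read is a standalone copy of the loop without the debug prints, io.BytesIO stands in for the socket file object, and MAXAMOUNT is set to the value I believe http.client uses. The stream holds 144192 bytes while 192708 are requested, matching the numbers in the output above.

```python
import io
from http.client import IncompleteRead

MAXAMOUNT = 1048576  # assumed to match http.client's per-read cap


def safe_read(fp, amt):
    """Standalone copy of the read loop: collect chunks until amt bytes arrive."""
    chunks = []
    while amt > 0:
        chunk = fp.read(min(amt, MAXAMOUNT))
        if not chunk:
            # Zero bytes back means EOF (or an interrupt): give up and
            # report what was collected plus how much is still missing.
            raise IncompleteRead(b"".join(chunks), amt)
        chunks.append(chunk)
        amt -= len(chunk)
    return b"".join(chunks)


# The stream ends after 144192 bytes although 192708 were promised.
stream = io.BytesIO(b"x" * 144192)
try:
    safe_read(stream, 192708)
    outcome = None
except IncompleteRead as exc:
    outcome = (len(exc.partial), exc.expected)

print(outcome)
```

The first read returns everything the stream has, and the second returns b'', which is the same sequence the debug prints show.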

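That pins down the problem: the first read returned 144192 bytes, and the next read returned b'' while 48516 bytes were still expected. The recommendation is therefore to avoid urllib for large transfers wherever possible and to use requests instead.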

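In addition, the response headers obtained through urllib.request seem much coarser than those obtained through requests; for example, the crucial Transfer-Encoding header is missing. Note that the two responses are not the same representation either: the one received by requests is chunked and gzip-compressed, while the one received by urllib.request declares a fixed Content-Length. The details are as follows: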

urllib.request:

Server: nginx/1.14.0 (Ubuntu)
Date: Mon, 17 Sep 2018 10:02:51 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 192708
Connection: close
X-Frame-Options: SAMEORIGIN
 
requests:
 
{'Server': 'nginx/1.14.0 (Ubuntu)', 'Date': 'Mon, 17 Sep 2018 09:55:15 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Encoding': 'gzip'}
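One plausible reason for the difference is that the two clients send different request headers. As far as I know, requests asks for gzip and keep-alive by default, while urllib.request sends Accept-Encoding: identity and Connection: close, so the server may simply be answering two different requests. The sketch below illustrates the idea with the standard library only. DemoHandler is a hypothetical stand-in for the image server that compresses only when asked to, and fetch is an illustrative helper; neither comes from the original project.

```python
import gzip
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the image server: gzips only on request."""

    def do_GET(self):
        body = json.dumps([{"title": "demo"}]).encode("utf-8")
        wants_gzip = "gzip" in self.headers.get("Accept-Encoding", "")
        if wants_gzip:
            body = gzip.compress(body)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        if wants_gzip:
            self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(url, extra_headers=None):
    """Return (response headers as dict, parsed JSON body)."""
    request = urllib.request.Request(url, headers=extra_headers or {})
    # An empty ProxyHandler keeps any system proxy out of this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        headers = dict(response.headers.items())
        raw = response.read()
    # urllib.request does not decompress for us, so do it by hand.
    if headers.get("Content-Encoding") == "gzip":
        raw = gzip.decompress(raw)
    return headers, json.loads(raw)


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

plain_headers, plain_data = fetch(url)
gzip_headers, gzip_data = fetch(url, {"Accept-Encoding": "gzip"})
server.shutdown()
server.server_close()

print(plain_headers)
print(gzip_headers)
```

Against the real server, passing the same Accept-Encoding header through urllib.request.Request would be one way to check whether the header sets converge; that has not been verified here.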