Python 3 source code:
import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen("http://php.net/")
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)
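The code is simple: it fetches the http://php.net/ page and then uses the BeautifulSoup module to strip out the HTML tags, leaving only the text. It seemed to run successfully the first time, but after that it kept hanging and then failed with: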
  File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36\lib\urllib\request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Python36\lib\urllib\request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:841)>
Yet the same page opens without any problem in Google Chrome.
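To see just the underlying cause rather than the whole traceback, the call can be wrapped so that the URLError reason is returned to the caller. This is only a small diagnostic sketch; try_fetch is a hypothetical helper name and the 10-second timeout is an arbitrary assumption.

```python
import urllib.error
import urllib.request


def try_fetch(url, timeout=10):
    """Return (body, None) on success, or (None, reason) if urllib raises URLError.

    try_fetch is a hypothetical helper, not part of the original script.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except urllib.error.URLError as err:
        # For a failed TLS handshake, err.reason is the ssl.SSLError raised
        # underneath, e.g. "EOF occurred in violation of protocol".
        return None, err.reason
```

Calling try_fetch("http://php.net/") and printing the second element of the result should show the SSL error on a machine that reproduces the problem.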
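The problem is probably that SSLv2 is disabled on the web server, while older Python libraries (Python 2.x) try to establish the connection with PROTOCOL_SSLv23 by default. In that case, you need to choose the SSL version that the request uses.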
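If the script is to stay on urllib, one way to choose the protocol version is to pass an ssl.SSLContext to urlopen. The following is a minimal sketch, assuming Python 3.7 or later (for ssl.TLSVersion). build_tls_context and fetch are hypothetical helper names, and using TLS 1.2 as the minimum is an assumption rather than something the server is known to require.

```python
import ssl
import urllib.request


def build_tls_context(minimum=ssl.TLSVersion.TLSv1_2):
    """Return a certificate-verifying context that refuses older protocol versions.

    The TLS 1.2 default is an assumed choice for illustration.
    """
    context = ssl.create_default_context()
    context.minimum_version = minimum
    return context


def fetch(url, context=None, timeout=10):
    """Fetch url with urllib, using the given SSL context for HTTPS."""
    if context is None:
        context = build_tls_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

With these helpers, fetch("https://php.net/") would return the raw page bytes, which can be handed to BeautifulSoup exactly as before.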
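To change the SSL version used for HTTPS in Requests, subclass the HTTPAdapter class and mount the subclass on a Session object. For example, to force TLSv1, the new transport adapter would look like this: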
import ssl

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager


class MyAdapter(HTTPAdapter):
    def init_poolmanager(self, connections, maxsize, block=False):
        # Build the connection pool with a fixed protocol version (TLSv1).
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_TLSv1)
Then mount it on a Requests Session object:
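import requests

s = requests.Session()
s.mount('https://', MyAdapter())
# Send the request through the session so the mounted adapter is actually used;
# urllib.request.urlopen would bypass it.
response = s.get("http://php.net/")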
Writing a generic transport adapter is just as simple: its constructor accepts any SSL protocol constant from the ssl package and uses it.
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager


class SSLAdapter(HTTPAdapter):
    '''An HTTPS Transport Adapter that uses an arbitrary SSL version.'''

    def __init__(self, ssl_version=None, **kwargs):
        self.ssl_version = ssl_version
        super(SSLAdapter, self).__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=self.ssl_version)
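As a rough usage sketch, the generic adapter could be mounted as follows. The class is repeated so that the snippet runs on its own, and PoolManager is imported from urllib3 directly (the requests.packages path is a compatibility alias for the same module). make_session is a hypothetical helper name and TLS 1.2 is an assumed choice; no request is sent here.

```python
import ssl

import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager


class SSLAdapter(HTTPAdapter):
    """An HTTPS transport adapter that uses an arbitrary SSL version."""

    def __init__(self, ssl_version=None, **kwargs):
        # Store the version first: HTTPAdapter.__init__ calls init_poolmanager.
        self.ssl_version = ssl_version
        super().__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=self.ssl_version)


def make_session(ssl_version):
    """Return a Session whose HTTPS traffic goes through SSLAdapter.

    make_session is a hypothetical helper added for illustration.
    """
    session = requests.Session()
    session.mount("https://", SSLAdapter(ssl_version=ssl_version))
    return session


# Assumed protocol choice; any ssl.PROTOCOL_* constant can be passed instead.
session = make_session(ssl.PROTOCOL_TLSv1_2)
```

Only URLs starting with https:// are routed through the adapter; plain http:// requests on the same session keep using the default one.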
Here is the failing code from above after the fix:
import ssl

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager


class MyAdapter(HTTPAdapter):
    def init_poolmanager(self, connections, maxsize, block=False):
        # Build the connection pool with a fixed protocol version (TLSv1).
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_TLSv1)


s = requests.Session()
s.mount('https://', MyAdapter())
# Fetch through the session; urllib.request.urlopen would bypass the adapter.
response = s.get("http://php.net/")
html = response.content
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)
The page text can now be fetched normally.