Python in Practice: How to Hide Your Crawler's Identity

<div class="htmledit_views">

<p>When a crawler visits a website, it should hide its identity as far as possible so that the server does not block it. In practice there are two ways to do this: delayed access and dynamic proxies. Both are explained below.</p>
<p><span style="font-size:14px;"><strong>1. Delayed access</strong></span></p>
<p>As the name suggests, delayed access means setting an interval between requests, for example one request every few seconds, so that the traffic looks more like a person browsing the site.</p>
<pre><code class="language-python">import time
import urllib.parse
import urllib.request

cnt = 0
# First strategy for hiding the crawler: pause between requests so the traffic looks more like a human visitor
while True:  # request Baidu once every 5 seconds
    url = "https://www.baidu.com"  # target URL
    param = {}  # request parameters, given as a dict
    param = urllib.parse.urlencode(param).encode('utf_8')  # URL-encode the parameters and convert them to UTF-8 bytes

    # Because a data argument is passed, urllib sends this request as a POST
    req = urllib.request.Request(url, param)
    # Set the User-Agent header so the request looks like it comes from Firefox rather than from a script
    req.add_header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0")
    response = urllib.request.urlopen(req)  # send the request

    html = response.read()  # read the response body
    result = html.decode("utf-8")  # decode it as UTF-8
    if result != "":
        cnt += 1
        print("Visit #%s to Baidu" % cnt)
    time.sleep(5)  # sleep for 5 seconds
</code></pre>
<p>Output: Baidu is requested once every 5 seconds.</p>
<p><img src="https://img-blog.csdn.net/20170615225313927?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvcXpjNzA5MTk3MDA=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center" alt=""></p>
<p><strong><span style="font-size:14px;">2. Dynamic proxies</span></strong></p>
<p>Sending requests through proxy servers is a powerful approach: the requests appear to come from different servers. It is also the most recommended of the two. Free proxy server IPs can be found by searching on Baidu.</p>
<pre><code class="language-python">import random
import urllib.request

# Proxy addresses (ip:port); free proxies can be found by searching online
ipList = ['119.6.144.73:81', '183.203.208.166:8118', '111.1.32.28:81']
cnt = 0
# Second strategy for hiding the crawler: use proxies so the requests appear to come from several servers
while True:  # keep requesting Baidu through proxy servers
    proxy = random.choice(ipList)  # pick a random proxy
    # Register the proxy for both schemes; an 'http'-only mapping is not applied to https:// URLs
    proxy_support = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})

    opener = urllib.request.build_opener(proxy_support)
    # Default headers are read from opener.addheaders (the original assigned to a non-existent add_handlers)
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0")]
    urllib.request.install_opener(opener)

    response = urllib.request.urlopen("https://www.baidu.com")  # send the request

    html = response.read()  # read the response body
    result = html.decode("utf-8")  # decode it as UTF-8
    if result != "":
        cnt += 1
        print("Visit #%s to Baidu" % cnt)
</code></pre>
<p>Output: Baidu is requested continuously.</p>
<p><img src="https://img-blog.csdn.net/20170615225529086?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvcXpjNzA5MTk3MDA=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center" alt=""></p>
</div>
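The delayed-access example above waits a fixed 5 seconds between requests. One possible variation is to randomize the pause so the request pattern is less regular. The following is only a minimal sketch: the helper names `random_delay` and `crawl_with_delay`, the 3 to 7 second range, and the injected `fetch` and `sleep` parameters are assumptions made for illustration and do not come from the original article.

```python
import random
import time


def random_delay(min_seconds=3.0, max_seconds=7.0):
    """Return a random pause length between min_seconds and max_seconds."""
    return random.uniform(min_seconds, max_seconds)


def crawl_with_delay(urls, fetch, min_seconds=3.0, max_seconds=7.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    fetch and sleep are injected (hypothetical parameters) so the loop can be
    tried out without network access or real waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(random_delay(min_seconds, max_seconds))
        results.append(fetch(url))
    return results
```

In a real crawler, `fetch` could be a small function that wraps the `urllib.request.urlopen` call from the article, and `sleep` can be left at its default of `time.sleep`.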

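Free proxies such as the ones listed in the dynamic-proxy example often stop working, and a single dead proxy would end that loop with an exception. A hedged sketch of one way to handle this is to try the proxies in random order and move on when one fails. The function names `build_proxy_opener`, `open_via_proxy`, and `fetch_with_rotation`, the 10-second timeout, and the `open_func` parameter are all hypothetical and introduced only for this illustration.

```python
import random
import urllib.request

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0"


def build_proxy_opener(proxy, user_agent=USER_AGENT):
    """Build an opener that routes http and https requests through proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def open_via_proxy(url, proxy, timeout=10):
    """Fetch url through one proxy and return the raw response body."""
    with build_proxy_opener(proxy).open(url, timeout=timeout) as response:
        return response.read()


def fetch_with_rotation(url, proxies, open_func=open_via_proxy):
    """Try the proxies in random order; return (proxy, body) for the first success."""
    candidates = list(proxies)
    random.shuffle(candidates)
    for proxy in candidates:
        try:
            return proxy, open_func(url, proxy)
        except OSError:
            # urllib.error.URLError and socket timeouts are OSError subclasses,
            # so an unreachable proxy is skipped and the next one is tried.
            continue
    raise RuntimeError("no proxy in the list could reach " + url)
```

Calling `fetch_with_rotation("https://www.baidu.com", ipList)` would then return the proxy that worked together with the response bytes, or raise `RuntimeError` if every proxy in the list failed.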