This article introduces several common anti-crawler mechanisms I have run into at work, along with strategies for dealing with them.
Some websites want to be crawled by search engines, while sites carrying sensitive information do not want to be discovered by them.
The content of a website belongs to its administrator, and search engines should respect the owner's wishes. To satisfy needs like these, there has to be a way for websites and crawlers to communicate, giving administrators a chance to state their intentions. Where there is demand there is supply, and the robots protocol was born.
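As an illustration of what that communication looks like, here is a minimal, made-up robots.txt (the paths and crawler name are hypothetical). A site serves it at its root, and well-behaved crawlers read it before fetching anything:

User-agent: *            # rules for every crawler
Disallow: /admin/        # keep crawlers out of this directory (hypothetical path)
Allow: /public/          # explicitly allowed (hypothetical path)

User-agent: Baiduspider  # rules for one specific crawler
Disallow: /              # this crawler may not fetch anything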
Scrapy obeys the robots protocol by default; to ignore it, we need to change the following line in settings.py:
ROBOTSTXT_OBEY = False
When requests come in too frequently, the site's backend inspects your request headers to decide whether the visit comes from a browser or from a program.
We simply forge the request headers to create the illusion of a browser visit.
Below are examples of changing the request headers in three crawler setups: requests, selenium + PhantomJS, and Scrapy.
import requests

url = 'https://list.tmall.com/search_product.htm?q=%B0%D7%BE%C6&type=p&vmarket=&spm=875.7931836%2FB.a2227oh.d100&from=mallfp..pc_1_searchbutton'

# Pretend to be a desktop Chrome/Edge browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063'}

content = requests.get(url, headers=headers)
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get("https://httpbin.org/get?show_env=1")
driver.get_screenshot_as_file('01.png')
driver.quit()
To have Scrapy rotate request headers randomly, first add a list of user agents in settings.py, then tell Scrapy where to fetch them from.
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]

DOWNLOADER_MIDDLEWARES = {
    'onenine.middleware.RandomUserAgentMiddleware': 400,
    # In Scrapy >= 1.0 the built-in middleware lives at
    # 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware'
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
In settings.py, the path to our custom middleware should follow the pattern:
yourproject.middlewares(file name).MiddlewareClass(class name)
Then define the request-header handling middleware in middlewares.py.
import random

from yourproject.settings import USER_AGENT_LIST  # adjust the package name to your project


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random user agent for every outgoing request
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)
If you really don't want to hunt for user-agent strings online, you can use fake-useragent to generate them:
import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}  # a random real-world user agent
url = 'URL of the page to crawl'
resp = requests.get(url, headers=headers)
The page we see in the browser has already been rendered by JavaScript, so when we try to pull data out of the raw HTML source, we find that the data we want is not there.
Three approaches are described below.
selenium+webdriver
Driving a real browser with selenium can fully simulate human behaviour and works wonders when tackling captchas. That said, selenium is generally not recommended, mainly for efficiency reasons: even though a headless browser such as PhantomJS reduces memory consumption, this approach is still much slower than the other two and does not suit distributed crawling. (The latest versions of selenium have dropped PhantomJS support.)
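As a minimal sketch of the idea, and because newer selenium no longer supports PhantomJS, the example below uses headless Chrome instead (it assumes Chrome and a matching chromedriver are installed; the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://example.com/js-rendered-page')  # placeholder URL
html = driver.page_source  # HTML after JavaScript has run
driver.quit()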
scrapy_splash is somewhat similar to selenium, but compared with it, Splash is a lightweight browser engine built on Twisted and QT that exposes a plain HTTP API, and scrapy_splash wraps it for Scrapy. Being fast and lightweight makes it easy to use in distributed crawling. Scrapy and Splash integrate very well, although Splash has to run inside Docker. Look into it yourself if you are interested. (In my use so far, JavaScript rendering is sometimes incomplete.)
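A minimal sketch of wiring scrapy-splash into a project, assuming a Splash container is already running in Docker at localhost:8050 (the settings follow the scrapy-splash README; the spider name and URL are placeholders):

# settings.py
SPLASH_URL = 'http://localhost:8050'   # address of the Splash Docker container
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# in the spider
import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = 'js_demo'  # placeholder spider name

    def start_requests(self):
        # give JavaScript a couple of seconds to render
        yield SplashRequest('https://example.com', self.parse, args={'wait': 2})

    def parse(self, response):
        # response.text now contains the rendered HTML
        self.logger.info(response.css('title::text').extract_first())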
Another approach is to analyze the page's API endpoints and fetch the data directly from them. This is the most efficient method, but also the most troublesome: some sites guard their APIs very aggressively, and one of Taobao's APIs even demands captcha verification after every four records fetched.
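As a rough sketch of the idea (the endpoint, parameters, and JSON fields below are hypothetical, of the kind you would discover by watching the browser's network panel), you request the JSON endpoint directly instead of scraping the rendered HTML:

import requests

# Hypothetical JSON endpoint found in the browser's network panel
api_url = 'https://example.com/api/search'
params = {'keyword': 'baijiu', 'page': 1}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
           'Referer': 'https://example.com/search'}

resp = requests.get(api_url, params=params, headers=headers)
data = resp.json()                     # parse the JSON body
for item in data.get('items', []):     # 'items' is a hypothetical field
    print(item.get('title'), item.get('price'))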
When facing IP bans, the only real option is to use proxy IPs.
If you are crawling purely for learning purposes, you can build your own IP pool by scraping free proxy sites and storing the IPs in a database for your own use. The catch is that free proxies tend to be unstable, short-lived, and not highly anonymous, so the pool needs constant re-checking, as in the sketch below.
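A minimal sketch of such re-checking, assuming a list of candidate proxies scraped from a free site (the addresses below are made up); only proxies that can still fetch a test page are kept:

import requests

# Hypothetical proxies scraped from a free proxy site
candidates = ['http://1.2.3.4:8080', 'http://5.6.7.8:3128']
test_url = 'https://httpbin.org/ip'   # echoes the IP the request came from

alive = []
for proxy in candidates:
    try:
        resp = requests.get(test_url,
                            proxies={'http': proxy, 'https': proxy},
                            timeout=5)
        if resp.ok:
            alive.append(proxy)       # proxy still works, keep it in the pool
    except requests.RequestException:
        pass                          # dead or too slow, drop it

print(alive)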
There are plenty of commercial proxy services, of very mixed quality and with different IP rotation models. The proxy service we currently use can switch IPs within about a second, which effectively prevents bans caused by the IP, although the price is correspondingly high.
Below is code for using a proxy API in the same three setups: requests, selenium + PhantomJS, and Scrapy.
import requests

# Target page to visit
targetUrl = "http://test.abuyun.com/proxy.php"

# Proxy server
proxyHost = "http-dyn.abuyun.com"
proxyPort = "9020"

# Proxy tunnel credentials
proxyUser = "H01234567890123D"
proxyPass = "0123456789012345"

proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort,
    "user": proxyUser,
    "pass": proxyPass,
}

proxies = {
    "http": proxyMeta,
    "https": proxyMeta,
}

resp = requests.get(targetUrl, proxies=proxies)
from selenium import webdriver

# Proxy server
proxyHost = "http-dyn.abuyun.com"
proxyPort = "9020"

# Proxy tunnel credentials
proxyUser = "H01234567890123D"
proxyPass = "0123456789012345"

service_args = [
    "--proxy-type=http",
    "--proxy=%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
    },
    "--proxy-auth=%(user)s:%(pass)s" % {
        "user": proxyUser,
        "pass": proxyPass,
    },
]

# Target page to visit
targetUrl = "http://test.abuyun.com/proxy.php"

phantomjs_path = r"./phantomjs"
driver = webdriver.PhantomJS(executable_path=phantomjs_path, service_args=service_args)
driver.get(targetUrl)

print(driver.title)
print(driver.page_source.encode("utf-8"))
driver.quit()
For Scrapy, the proxy is defined in the middlewares.py middleware file:
import base64

# Proxy server
proxyServer = "http://http-dyn.abuyun.com:9020"

# Proxy tunnel credentials
proxyUser = "H01234567890123D"
proxyPass = "0123456789012345"

# Build the Basic auth header (Python 3)
proxyAuth = "Basic " + base64.urlsafe_b64encode(
    bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
Then adjust the settings in settings.py:
DOWNLOADER_MIDDLEWARES = {
    # 'myproxies.middlewares.MyCustomDownloaderMiddleware': 543,
    # In Scrapy >= 1.0 this is 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware'
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 543,
    'yourproject.middlewares.ProxyMiddleware': 125,
}