【虫术】A Veteran Crawler Developer Shows You How to Scrape Proxy IPs

Date: 2019-12-05


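Sometimes, while reading a novel on a website, a notice pops up out of nowhere along the lines of "Suspected malicious crawling by a bot; access is temporarily unavailable", and you have to refresh the page or enter a CAPTCHA to get back in. This happens now and then, and most people have run into it. The cause is that the site has anti-crawler measures in place. It matters most when you write crawlers: if a single IP requests pages too many times per unit of time, the server refuses service. This is an IP ban triggered by access frequency, and waiting for the ban to be lifted is not a good solution. The alternative is to disguise the local IP when requesting pages, that is, to use proxy IPs, which is today's topic.

There are many proxy IP providers online, some free and some paid, such as Xici Proxy (西刺代理), Wandou Proxy (豌豆代理), and Kuaidaili (快代理). Free proxies cost nothing, but few of them work and they are unstable; paid ones may be somewhat better. Today I will only scrape the free Xici proxies, test whether each one works, and store the working IPs in MongoDB so they can be retrieved later.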

Platform: Windows

Python version: Python 3.6

**IDE:** Sublime Text

Other: Chrome browser

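The overall flow is:

Step 1: Learn how to use a proxy with requests

Step 2: Scrape IPs and ports from the Xici Proxy pages

Step 3: Test whether the scraped IPs work

Step 4: Store the working proxies in MongoDB

Step 5: Pick a random IP from the database of working IPs and return it once it passes a test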

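With requests, setting a proxy is simple: just pass the proxies parameter.

Note that in this example I installed the packet-capture tool Fiddler on my machine and used it to create an HTTP proxy service on local port 8888 (together with the Chrome extension SwitchyOmega), so the proxy address is 127.0.0.1:8888. Once this proxy is set, requests go out with the IP of the server the proxy software connects to instead of the local IP.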

import requests

proxy = '127.0.0.1:8888'
proxies = {
    'http':'http://' + proxy,
    'https':'http://' + proxy
}

try:
    response = requests.get('http://httpbin.org/get',proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error',e.args)

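Here I use http://httpbin.org/get as the test site. Visiting it returns information about the request; the origin field is the client IP, so the result tells us whether the proxy took effect. The response looks like this: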

{
    "args":{},
    "headers":{
        "Accept":"*/*",
        "Accept-Encoding":"gzip, deflate",
        "Connection":"close",
        "Host":"httpbin.org",
        "User-Agent":"python-requests/2.18.4"
    },
    "origin":"xx.xxx.xxx.xxx",
    "url":"http://httpbin.org/get"
}
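To check programmatically that a proxy took effect, one option is to compare the origin field with the machine's real address. The following is only a sketch and not part of the original script: the helper names `extract_origin` and `proxy_is_effective` and the sample addresses are hypothetical, and the functions work on the response body as a string so they can be tried without a network connection.

```python
import json


def extract_origin(body):
    """Return the 'origin' field (the client IP the server saw) from a JSON body."""
    return json.loads(body).get('origin', '')


def proxy_is_effective(body, real_ip):
    """Return True if the server saw a non-empty address list without our real IP."""
    origin = extract_origin(body)
    # Assumption: origin may hold several comma-separated addresses,
    # so every listed address is compared against the real one.
    seen = [part.strip() for part in origin.split(',') if part.strip()]
    return bool(seen) and real_ip not in seen
```

In practice `body` would be `response.text` from the request above, and `real_ip` the address reported when the same URL is requested without a proxy.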

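Next we start scraping Xici Proxy. First open the page in Chrome and locate the elements that hold the IP and port.

As you can see, Xici Proxy lists IP addresses and related information in a table, so BeautifulSoup can extract them easily. Note, however, that the scraped IPs may well repeat, especially when several listing pages are scraped into the same list, so we can use a set to remove the duplicates.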

def scrawl_xici_ip(num):
    '''
    Scrape proxy ip addresses; the source url is Xici Proxy
    '''
    ip_list = []
    for num_page in range(1,num+1):  # pages 1..num inclusive
        url = url_ip + str(num_page)
        response = requests.get(url,headers=headers)
        if response.status_code == 200:
            content = response.text
            soup = BeautifulSoup(content,'lxml')
            trs = soup.find_all('tr')
            for i in range(1,len(trs)):  # skip the table header row
                tr = trs[i]
                tds = tr.find_all('td')
                ip_item = tds[1].text + ':' + tds[2].text
                # print(ip_item)
                ip_list.append(ip_item)
            ip_list = list(set(ip_list)) # drop possible duplicate ips
            time.sleep(count_time) # wait 5 seconds between pages
    return ip_list
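Converting to a set and back removes duplicates but does not keep the order in which the proxies appeared on the page. If that order matters, a small order-preserving variant is possible. This is a sketch only; the function name `dedupe_keep_order` is hypothetical and does not appear in the original script.

```python
def dedupe_keep_order(ip_items):
    """Remove duplicate 'ip:port' strings while keeping first-seen order."""
    seen = set()      # addresses already emitted
    unique = []       # result, in original order
    for item in ip_items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique
```

It could be called once on the full list just before returning it, in place of the set conversion.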

Once the IPs from the requested pages have been scraped into a list, each one is tested in turn.

def ip_test(url_for_test,ip_info):
    '''
    Test each scraped ip; store it in MongoDB if the test succeeds
    '''
    for ip_for_test in ip_info:
        # set the proxy
        proxies = {
            'http': 'http://' + ip_for_test,
            'https': 'http://' + ip_for_test,
            }
        print(proxies)
        try:
            response = requests.get(url_for_test,headers=headers,proxies=proxies,timeout=10)
            if response.status_code == 200:
                ip = {'ip':ip_for_test}
                print(response.text)
                print('Test passed')
                write_to_MongoDB(ip)
        except Exception as e:
            # a dead or slow proxy raises here; report it and move on
            print(e)
            continue

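This uses the requests proxy setup described above. We use http://httpbin.org/ip as the test site; it returns our IP address directly. Proxies that pass the test are then stored in MongoDB.

Writing to MongoDB was covered in the previous post on scraping Qiushibaike (糗事百科): connect to the database, specify the database and collection, and insert the data.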
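Scraped table cells can contain stray whitespace or malformed values, so it may be worth filtering addresses before spending a 10-second timeout on each one. The sketch below is an optional addition, not part of the original script; the names `is_valid_proxy_address` and `build_proxies` are hypothetical, and only IPv4 addresses are assumed.

```python
import ipaddress


def is_valid_proxy_address(ip_port):
    """Return True if ip_port looks like 'a.b.c.d:port' with a port in 1-65535."""
    host, sep, port = ip_port.strip().partition(':')
    if not sep or not port.isdecimal():
        return False
    try:
        ipaddress.IPv4Address(host)   # raises ValueError for malformed addresses
    except ValueError:
        return False
    return 1 <= int(port) <= 65535


def build_proxies(ip_port):
    """Build the proxies dict that requests expects for an HTTP proxy."""
    address = 'http://' + ip_port.strip()
    return {'http': address, 'https': address}
```

A caller could skip any entry for which `is_valid_proxy_address` returns False and pass `build_proxies(ip_for_test)` as the proxies argument.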

def write_to_MongoDB(proxies):
    '''
    Store an ip that passed the test in MongoDB
    '''
    client = pymongo.MongoClient(host='localhost',port=27017)
    db = client.PROXY
    collection = db.proxies
    # insert_one replaces the deprecated collection.insert
    result = collection.insert_one(proxies)
    print(result.inserted_id)
    print('Saved to MongoDB')
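Because the same proxy can pass the test on several runs, a plain insert may store it more than once. One possible refinement, which is not in the original script, is an upsert keyed on the ip field. The function name `save_proxy` is hypothetical; it takes the collection as a parameter and relies on PyMongo's `update_one` with `upsert=True`.

```python
def save_proxy(collection, ip_port):
    """Insert the proxy if it is new; otherwise leave the existing document in place.

    `collection` is expected to be a pymongo collection, for example
    pymongo.MongoClient(host='localhost', port=27017).PROXY.proxies
    """
    return collection.update_one(
        {'ip': ip_port},             # match on the proxy address
        {'$set': {'ip': ip_port}},   # writing the same value makes repeats harmless
        upsert=True,                 # create the document when no match exists
    )
```

Passing the collection in, rather than opening a new client on every call, also avoids reconnecting to MongoDB for each proxy.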

Finally, let's run it and look at the results.


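After it had run for a while, I was lucky enough to see three proxies pass in a row and quickly took a screenshot. The fact is that these are free proxies: few of them work, and the ones that do are short-lived. Still, with a large enough crawl you can find usable ones, and for practice they are just about enough. Now let's look at what is stored in the database.

Only a few pages were scraped, valid IPs are rare, and I did not crawl much, so there are not many IPs in the database yet, but they have at least been saved. Now let's see how to pick one at random.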

def get_random_ip():
    '''
    Pick a random ip from MongoDB and re-test it before returning it
    '''
    client = pymongo.MongoClient(host='localhost',port=27017)
    db = client.PROXY
    collection = db.proxies
    items = list(collection.find())
    if not items:
        return None  # nothing stored, or every stored ip has been removed
    item = random.choice(items)
    useful_proxy = item['ip'].replace('\n','')
    proxy = {
        'http': 'http://' + useful_proxy,
        'https': 'http://' + useful_proxy,
        }
    try:
        response = requests.get(url_for_test,headers=headers,proxies=proxy,timeout=10)
        if response.status_code == 200:
            return useful_proxy
    except requests.exceptions.RequestException as e:
        print(e)
    print('Proxy {ip} is no longer valid'.format(ip=useful_proxy))
    collection.delete_one({'_id': item['_id']})
    print('Removed from MongoDB')
    return get_random_ip()  # try another stored ip

Because an IP may stop working after sitting in the database for a while, I test it again before returning it: if the test succeeds the IP is returned, otherwise it is removed from the database.

This way, whenever a proxy is needed, one can be pulled from the database.
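The retest-and-remove idea can also be written as a loop rather than a recursive call, which keeps the selection logic separate from MongoDB and from the network. The following is a sketch under assumptions: the function `pick_working_proxy` and its `is_alive` callback are hypothetical names, and the caller is expected to delete the returned dead entries from the database.

```python
import random


def pick_working_proxy(candidates, is_alive, rng=random):
    """Try candidates in random order.

    Returns (proxy, dead), where proxy is the first candidate for which
    is_alive(candidate) is true (or None if none is alive), and dead lists
    the candidates that failed along the way.
    """
    remaining = list(candidates)
    dead = []
    while remaining:
        choice = rng.choice(remaining)
        remaining.remove(choice)
        if is_alive(choice):
            return choice, dead
        dead.append(choice)
    return None, dead
```

Here `is_alive` would wrap the test request against http://httpbin.org/ip, returning False on a timeout or a non-200 status.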

The complete code is as follows:

import random
import requests
import time
import pymongo
from bs4 import BeautifulSoup

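# URL of the proxy listing to scrape; Xici Proxy is used here
url_ip = "http://www.xicidaili.com/nt/"

# Wait time setting
set_timeout = 5

# Number of pages to scrape; 2 means scrape 2 pages of ip addresses
num = 2

# Seconds to pause between page requests (passed to time.sleep)
count_time = 5

# Build the headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}

# URL used to test each ip
url_for_test = 'http://httpbin.org/ip'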

def scrawl_xici_ip(num):
    '''
    Scrape proxy ip addresses; the source url is Xici Proxy
    '''
    ip_list = []
    for num_page in range(1,num+1):  # pages 1..num inclusive
        url = url_ip + str(num_page)
        response = requests.get(url,headers=headers)
        if response.status_code == 200:
            content = response.text
            soup = BeautifulSoup(content,'lxml')
            trs = soup.find_all('tr')
            for i in range(1,len(trs)):  # skip the table header row
                tr = trs[i]
                tds = tr.find_all('td')
                ip_item = tds[1].text + ':' + tds[2].text
                # print(ip_item)
                ip_list.append(ip_item)
            ip_list = list(set(ip_list)) # drop possible duplicate ips
            time.sleep(count_time) # wait 5 seconds between pages
    return ip_list

def ip_test(url_for_test,ip_info):
    '''
    Test each scraped ip; store it in MongoDB if the test succeeds
    '''
    for ip_for_test in ip_info:
        # set the proxy
        proxies = {
            'http': 'http://' + ip_for_test,
            'https': 'http://' + ip_for_test,
            }
        print(proxies)
        try:
            response = requests.get(url_for_test,headers=headers,proxies=proxies,timeout=10)
            if response.status_code == 200:
                ip = {'ip':ip_for_test}
                print(response.text)
                print('Test passed')
                write_to_MongoDB(ip)
        except Exception as e:
            # a dead or slow proxy raises here; report it and move on
            print(e)
            continue

def write_to_MongoDB(proxies):
    '''
    Store an ip that passed the test in MongoDB
    '''
    client = pymongo.MongoClient(host='localhost',port=27017)
    db = client.PROXY
    collection = db.proxies
    # insert_one replaces the deprecated collection.insert
    result = collection.insert_one(proxies)
    print(result.inserted_id)
    print('Saved to MongoDB')

def get_random_ip():
    '''
    Pick a random ip from MongoDB and re-test it before returning it
    '''
    client = pymongo.MongoClient(host='localhost',port=27017)
    db = client.PROXY
    collection = db.proxies
    items = list(collection.find())
    if not items:
        return None  # nothing stored, or every stored ip has been removed
    item = random.choice(items)
    useful_proxy = item['ip'].replace('\n','')
    proxy = {
        'http': 'http://' + useful_proxy,
        'https': 'http://' + useful_proxy,
        }
    try:
        response = requests.get(url_for_test,headers=headers,proxies=proxy,timeout=10)
        if response.status_code == 200:
            return useful_proxy
    except requests.exceptions.RequestException as e:
        print(e)
    print('Proxy {ip} is no longer valid'.format(ip=useful_proxy))
    collection.delete_one({'_id': item['_id']})
    print('Removed from MongoDB')
    return get_random_ip()  # try another stored ip

def main():
    ip_info = scrawl_xici_ip(num)
    ip_test(url_for_test,ip_info)  # working proxies are written to MongoDB
    finally_ip = get_random_ip()
    if finally_ip:
        print('Selected ip: ' + finally_ip)
    else:
        print('No usable ip in the database')

if __name__ == '__main__':
    main()
