Python crawler log (10): scraping the Douban Top 250 with multiple processes

Date: 2019-11-17


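Another practice project. This time I set out to scrape the Douban movie Top 250 and save the data to MySQL. The data set is small and speed hardly matters, but to practise multiprocess crawling I wrote the crawler with multiple processes anyway.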

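What is multiprocessing good for? With little data it makes little difference whether you use it or not, but with a lot of data the gain in efficiency is striking. As a rough rule, n processes can finish the job up to n times faster, although real gains are usually smaller because of network limits and the cost of starting processes. My computer has 4 cores, so Pool starts 4 processes by default (you can set the number yourself, but it should not be too high), and in the best case the crawl runs about four times faster. Let's look at the code.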
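Before the full script, a minimal sketch of the idea may help. It is not part of the original crawler: fake_fetch and timed_run are hypothetical names, and time.sleep stands in for the network wait of a real request. It times the same batch of work with 1, 2 and 4 processes.

```python
import time
from multiprocessing import Pool


def fake_fetch(page):
    """Stand-in for one HTTP request: wait briefly, then return the page number."""
    time.sleep(0.1)  # pretend this is network latency
    return page


def timed_run(processes, pages):
    """Map fake_fetch over pages with a pool of the given size; return (results, seconds)."""
    started = time.perf_counter()
    with Pool(processes=processes) as pool:
        results = pool.map(fake_fetch, pages)
    return results, time.perf_counter() - started


if __name__ == '__main__':
    pages = list(range(8))
    for n in (1, 2, 4):
        results, seconds = timed_run(n, pages)
        print(n, 'process(es):', round(seconds, 2), 'seconds', results)
```

Timings vary from machine to machine, so none are shown here. The full crawler below uses the same Pool.map pattern with real requests.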

from bs4 import BeautifulSoup
import requests, get_proxy, pymysql
from multiprocessing import Pool  # needed for multiprocessing; a Pool is a pool of worker processes


douban_urls = ['https://movie.douban.com/top250?start={}&filter='.format(i) for i in range(0, 250, 25)]
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}
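proxy_list = get_proxy.get_ip_list()   # the proxy-scraping function is covered in my previous log entry
# Connect to the database; keyword arguments are used because newer PyMySQL releases require them
db = pymysql.connect(host='localhost', user='root', password='', database='doubantop250')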
cursor = db.cursor()


def get_info(douban_url):
    proxies = get_proxy.get_random_ip(proxy_list)   # pick a proxy; returns a dict
    wb_data = requests.get(douban_url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    item_list = soup.find_all('div', 'item')
    for item in item_list:
        link = item.select('div.hd > a')[0].get('href')
        title = item.select('div.hd > a > span:nth-of-type(1)')[0].text
        info = item.select('.info .bd p:nth-of-type(1)')[0].text.split('/')
        year = info[-3].split()[-1]
        country = info[-2].strip()
        types = info[-1].strip()
        star = item.select('.rating_num')[0].text
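        # For reasons I do not understand, multiprocessing kept raising InterfaceError (0, ''),
        # so ping before every statement to make sure the connection is alive (reconnecting if not)
        db.ping(reconnect=True)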
        sql = '''INSERT INTO top250(link,title,year,country,type,star) VALUES("{}","{}",{},"{}","{}",{})'''.format(link, title, year, country, types, star)
        try:
            cursor.execute(sql)
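            # commit() writes the change to the database; committing after every insert
            # keeps the rows already scraped even if a later error stops the program
            db.commit()
        except pymysql.err.ProgrammingError:
            # One record is unusual and triggers a SQL syntax error; that is fine, skip it for now
            print(sql, '\n', douban_url)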


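# This guard is required. __name__ == '__main__' is true only when the file is run directly,
# not when it is imported by another program. Where worker processes are started by
# re-importing this file (as on Windows), starting the pool outside the guard raises an error.
if __name__ == '__main__':
    pool = Pool()  # use Pool(processes=N) to set the number of processes
    # map calls get_info once for each element of the list; pass the function itself, without parentheses
    pool.map(get_info, douban_urls)
    # Close the connection inside the guard so it runs only in the main process, after all pages are done
    db.close()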
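The INSERT statement above is built with str.format, so a title that contains a double quote would break it. A common alternative is to pass the values separately and let the database driver quote them. The sketch below shows the idea with sqlite3 from the standard library as a stand-in for MySQL, so it runs without a server. The table layout, the save_movie helper and the sample row are assumptions made for illustration, not the schema used in this post. PyMySQL follows the same pattern but uses %s instead of ? as the placeholder.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL database (assumed table layout)
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE top250 '
    '(link TEXT, title TEXT, year INTEGER, country TEXT, type TEXT, star REAL)'
)


def save_movie(conn, movie):
    """Insert one movie tuple; the driver fills the placeholders and handles quoting."""
    conn.execute(
        'INSERT INTO top250 (link, title, year, country, type, star) '
        'VALUES (?, ?, ?, ?, ?, ?)',
        movie,
    )
    conn.commit()


# Made-up row whose title contains double quotes
tricky = ('https://example.com/subject/1/', 'A "quoted" title', 1994, 'USA', 'Drama', 9.7)
save_movie(conn, tricky)

rows = conn.execute('SELECT title, year FROM top250').fetchall()
print(rows)
```

With placeholders, quotes inside the data are treated as data rather than as part of the SQL text.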

Result

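Of course, because one record is unusual, the table ends up one row short. The missing record lists several dates, each followed by extra text in parentheses. That does not matter much: it is the only unusual record, so it can simply be handled separately.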
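One possible way to handle it, sketched here under assumptions: clean_year is a hypothetical helper, not part of the original script, and the sample strings are made up. It takes whatever the year expression in get_info produced and keeps only the first four-digit number, so trailing parenthesised text is dropped.

```python
import re


def clean_year(raw):
    """Return the first four-digit number found in raw as an int, or None if there is none."""
    match = re.search(r'\d{4}', raw)
    return int(match.group()) if match else None


print(clean_year('1994'))               # an ordinary value
print(clean_year('1978(re-release)'))   # a value with extra text in parentheses
print(clean_year('unknown'))            # no year at all
```

Calling year = clean_year(year) before building the SQL would let the unusual record go through the same insert as the others, assuming one year per film is enough.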