Singles' Day Special: How a Python Programmer Crawls Zhihu Users to Find a Girlfriend

Date: 2019-11-16



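Preface: this article mainly covers how the scrapy framework works and how to use it. I recommend picking up the framework only after you understand at least the basics of how Python crawlers work (don't ask me why, or I'll cry for you).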

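Singles' Day (Double Eleven) is almost here. Amid the nationwide chorus of "buy, buy, buy", the howls of us single dogs grow ever more mournful. As a Python programmer, how do you find a girl, dodge the critical hit, and win by wits? And so the following conversation took place: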

So~ today's goal is to crawl the girls of the community~ and we get to use a new trick (not really)~ the scrapy crawler framework~

1.How scrapy works

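After writing a few crawlers, we know the rough steps for getting data with one: request the page, receive the page, match the information, download the data, clean the data, and store it in a database.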
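The steps above can be sketched as a tiny hand-written crawler. This is only an illustration under loud assumptions: the page is a hard-coded string instead of a real HTTP response, the url and the `users` table are made up, and an in-memory SQLite database stands in for a real one.

```python
import re
import sqlite3

# A hard-coded page stands in for the "request page / receive page" steps,
# so the sketch runs without network access.
SAMPLE_HTML = """
<ul>
  <li class="user"> Alice </li>
  <li class="user">Bob</li>
  <li class="user">  </li>
</ul>
"""


def fetch(url):
    """Pretend to request a page; a real crawler would issue an HTTP GET here."""
    return SAMPLE_HTML


def extract(html):
    """Match the information we care about (here: user names)."""
    return re.findall(r'<li class="user">(.*?)</li>', html)


def clean(names):
    """Strip surrounding whitespace and drop empty entries."""
    return [n.strip() for n in names if n.strip()]


def store(conn, names):
    """Write the cleaned rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [(n,) for n in names])
    conn.commit()


def crawl(url, conn):
    """Run every step by hand: fetch, extract, clean, store."""
    names = clean(extract(fetch(url)))
    store(conn, names)
    return names


conn = sqlite3.connect(":memory:")
saved = crawl("https://example.com/users", conn)  # hypothetical url
```

Every step is wired together manually here, which is exactly the plumbing a framework takes off your hands.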

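scrapy is a well-known crawler framework that makes scraping web pages very convenient. So how does scrapy actually work? I read quite a few scrapy beginner tutorials online, and most of them come with this diagram.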

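_(:зゝ∠)_ I don't know whether the diagram is just that classic or programmers are simply too lazy to draw their own. The first time I saw it, Mijiang's (my) mood was like this: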

After some deep thought, I roughly understood what the diagram means. Let me give an example (yes, another one of my weird examples):

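When we want to eat, we go out, walk down the street, find a restaurant we like, and order. The waiter tells the kitchen to cook, and finally the dishes arrive at the table or get packed to go. This is what a hand-written crawler does: it has to spell out every operation needed to get the data.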

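scrapy, on the other hand, is like a food-delivery app. In the order list (spiders) you pick the dishes (items) you want from your target restaurant, and at checkout (pipeline) you fill in your delivery address (the storage method). The ordering system (scrapy engine) then asks the kitchen (downloader) of the restaurant (Internet) to cook the dishes. Since this produces multiple pickup orders (requests), the dispatcher (scheduler) assigns delivery riders to pick up from the kitchen (request) and deliver (response). All this talk is making me hungry....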

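What does that mean? When using scrapy, we only need to set up the spiders (what to crawl) and the pipeline (data cleaning and storage). There is also middlewares, which holds settings for how the components hand off to each other. We don't have to worry about the rest of the process; scrapy takes care of it.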
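To make the division of labour concrete, here is a toy simulation of that loop in plain Python. It is not scrapy's real implementation: `FAKE_WEB`, `downloader`, `spider_parse`, `ListPipeline`, and `engine` are all names invented for this sketch, and the "Internet" is just a dictionary.

```python
from collections import deque

# Hypothetical in-memory "Internet": url -> page content.
FAKE_WEB = {
    "page/1": {"item": "dish-1", "next": "page/2"},
    "page/2": {"item": "dish-2", "next": None},
}


def downloader(url):
    """Fetch the response for a request (here: a dictionary lookup)."""
    return FAKE_WEB[url]


def spider_parse(response):
    """Spider callback: yield scraped items and follow-up requests."""
    yield ("item", response["item"])
    if response["next"]:
        yield ("request", response["next"])


class ListPipeline:
    """Pipeline: decides what happens to each item (here: collect in a list)."""

    def __init__(self):
        self.items = []

    def process_item(self, item):
        self.items.append(item)


def engine(start_url, pipeline):
    """Engine: moves requests, responses, and items between the components."""
    scheduler = deque([start_url])  # queue of pending requests
    while scheduler:
        url = scheduler.popleft()
        response = downloader(url)
        for kind, value in spider_parse(response):
            if kind == "request":
                scheduler.append(value)   # new order goes back to the dispatcher
            else:
                pipeline.process_item(value)  # finished dish goes to delivery


pipeline = ListPipeline()
engine("page/1", pipeline)
```

The only parts you would write yourself are the counterparts of `spider_parse` and `ListPipeline`; the queue and the loop are what the framework provides.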

2.Creating the scrapy project

After installing scrapy, create a new project:

$ scrapy startproject zhihuxjj

I use the PyCharm IDE. Create zhihuxjj.py in the spiders directory.

In zhihuxjj.py we write our crawling rules.

3.Defining the crawl rules (spider)

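With the project created, let's look at the restaurant and dishes we want to eat... oh wait, the website and data we want to crawl.

I chose Zhihu as the platform to crawl. Zhihu does not number its users sequentially from 1 to n; instead, everyone can set their own unique profile id. So I picked a seed user and crawled the accounts he follows. You could crawl both followees and followers, but since the followers include some empty throwaway accounts, I only crawled the followee list, then used each followee's profile to crawl their followees, and so on recursively.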
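The recursive idea can be sketched without any network access as a breadth-first walk over a follow graph. The graph, the user ids, and the `max_depth` cutoff below are all made up for illustration; the `seen` set is what keeps users who follow each other from being visited forever.

```python
from collections import deque

# Hypothetical follow graph: user id -> list of users they follow.
FOLLOWEES = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}


def crawl_followees(seed, max_depth):
    """Visit the seed, its followees, their followees, ... up to max_depth hops."""
    seen = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        user, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand users at the depth limit
        for followee in FOLLOWEES.get(user, []):
            if followee not in seen:  # skip users we already visited
                seen.add(followee)
                order.append(followee)
                queue.append((followee, depth + 1))
    return order


users = crawl_followees("seed", max_depth=2)
```

In a real scrapy spider you usually do not keep such a set yourself, because the scheduler filters duplicate requests by default.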

The program is designed like this.

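The start url is a signature value in scrapy: it sets where the crawler starts. By that logic, starting from the seed user's profile would be the proper way, but since the profile link gets reused, I set the start url to the Zhihu home page. (Because the spider in this section overrides start_requests, scrapy never actually requests this url.)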

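Next comes the seed user's profile. Zhihu has plenty of big-name accounts with many followers, but people who follow many others are harder to find. I picked Huang Jixin (黄继新), a Zhihu co-founder, who presumably follows a lot of quality users (≖‿≖)✧.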

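Analyzing a profile page shows that its url is 'www.zhihu.com/people/' + user id. The information we want is fetched through callback functions (knock on the blackboard!! this is the key point!!). Two callbacks are designed here: one for a user's followee list and one for a followee's personal information.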

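Inspecting the page shown above in Chrome reveals the url for the followee list as well as the followees' user ids.

Hover the mouse over a user name.

This gives the url for an individual user's information. Analyzing the urls shows: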

Followee list url: 'https://www.zhihu.com/api/v4/members/' + 'user id' + '/followees?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=0&limit=20'
User information url: 'https://www.zhihu.com/api/v4/members/' + 'user id' + '?include=allow_message%2Cis_followed%2Cis_following%2Cis_org%2Cis_blocking%2Cemployments%2Canswer_count%2Cfollower_count%2Carticles_count%2Cgender%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'
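As a small sketch of how these two urls are put together, the helpers below build them from a user id. The function names and the `fields` parameter are my own, not part of any Zhihu client; `urllib.parse.quote` produces the same percent-encoding (`%2C` for commas, `%5B`/`%5D` for brackets) seen in the user information url.

```python
from urllib.parse import quote

API_ROOT = "https://www.zhihu.com/api/v4/members/"


def followees_url(user, offset=0, limit=20):
    """Followee list url for one page, starting at `offset`."""
    include = ("data[*].answer_count,articles_count,gender,follower_count,"
               "is_followed,is_following,badge[?(type=best_answerer)].topics")
    return f"{API_ROOT}{user}/followees?include={include}&offset={offset}&limit={limit}"


def user_url(user, fields):
    """User information url; the field list is percent-encoded, parentheses kept."""
    include = quote(",".join(fields), safe="()")
    return f"{API_ROOT}{user}?include={include}"


first_page = followees_url("jixin")
second_page = followees_url("jixin", offset=20)
profile = user_url("jixin", ["gender", "badge[?(type=best_answerer)].topics"])
```

Paging through a followee list then only means bumping `offset` by `limit` on each request.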

So, in the zhihuxjj.py file created in the previous section, write the following code.

import json
from zhihuxjj.items import ZhihuxjjItem
from scrapy import Spider,Request

class ZhihuxjjSpider(Spider):
    name = 'zhihuxjj'  # name scrapy uses to tell this spider apart from others; must be unique
    allowed_domains = ["www.zhihu.com"]  # crawl scope
    start_urls = ["https://www.zhihu.com/"]
    start_user = "jixin"
    followees_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset={offset}&limit=20'  # followee list url
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include=locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,avatar_hue,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'  # user information url
    def start_requests(self):
        # Request the seed user's followee list
        yield Request(self.followees_url.format(user=self.start_user, offset=0), callback=self.parse_fo)
        # Request the seed user's personal information (user_url only has a {user} placeholder)
        yield Request(self.user_url.format(user=self.start_user), callback=self.parse_user)

    def parse_user(self, response):
        result = json.loads(response.text)
        print(result)
        item = ZhihuxjjItem()
        item['user_name'] = result['name']
        item['sex'] = result['gender']  # gender: 1 is male, 0 is female, -1 is not set
        item['user_sign'] = result['headline']
        item['user_avatar'] = result['avatar_url_template'].format(size='xl')
        item['user_url'] = 'https://www.zhihu.com/people/' + result['url_token']
        if len(result['locations']):
            item['user_add'] = result['locations'][0]['name']
        else:
            item['user_add'] = ''
        yield item

    def parse_fo(self, response):
        results = json.loads(response.text)
        for result in results['data']:
            # Personal information of each followee
            yield Request(self.user_url.format(user=result['url_token']), callback=self.parse_user)
            # Followees of each followee; crawl depth += 1
            yield Request(self.followees_url.format(user=result['url_token'], offset=0), callback=self.parse_fo)
        if results['paging']['is_end'] is False:  # this followee list has more pages
            # Force https without mangling urls that already use https
            next_url = results['paging']['next'].replace('http://', 'https://')
            yield Request(next_url, callback=self.parse_fo)

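The key points here are how yield is used, plus assignments such as item['user_name']: assigning a scraped value to the item tells the system that this is the dish we ordered... ugh, no... the target data we want to crawl.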
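If yield is new to you, the sketch below shows the behaviour in isolation. It is plain Python, not scrapy: `parse_fo` here is a stand-in generator, and the sample payload is made up, shaped only like the fields the spider reads (`data`, `url_token`, `paging.is_end`, `paging.next`).

```python
def parse_fo(payload):
    """Generator mimicking a spider callback over an already-decoded JSON payload."""
    for entry in payload["data"]:
        yield ("user", entry["url_token"])
    if not payload["paging"]["is_end"]:
        yield ("next_page", payload["paging"]["next"])


# Made-up sample payload.
sample = {
    "data": [{"url_token": "alice"}, {"url_token": "bob"}],
    "paging": {"is_end": False, "next": "https://example.com/followees?offset=20"},
}

gen = parse_fo(sample)   # nothing has run yet; calling a generator function is lazy
first = next(gen)        # runs the body up to the first yield
rest = list(gen)         # drains the remaining results
```

A callback written this way can hand back any mix of items and follow-up requests, one at a time, and the caller decides what to do with each.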

4.Other settings

In items.py, add the fields that match the item populated in the spider.

import scrapy
class ZhihuxjjItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    user_name = scrapy.Field()
    sex  = scrapy.Field()
    user_sign = scrapy.Field()
    user_url = scrapy.Field()
    user_avatar = scrapy.Field()
    user_add = scrapy.Field()

In pipelines.py, add the code that writes to the database (how to use the database was covered in the previous article~).

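import pymysql


def dbHandle():
    # Connect to the local MySQL server; replace the credentials with your own
    conn = pymysql.connect(
        host='localhost',
        user='root',
        password='your-database-password',
        charset='utf8',
        use_unicode=False
    )
    return conn


class ZhihuxjjPipeline(object):
    def open_spider(self, spider):
        # Open one connection for the whole crawl instead of one per item
        self.dbObject = dbHandle()

    def close_spider(self, spider):
        self.dbObject.close()

    def process_item(self, item, spider):
        cursor = self.dbObject.cursor()
        sql = "insert into xiaojiejie.zhihu(user_name,sex,user_sign,user_avatar,user_url,user_add) values(%s,%s,%s,%s,%s,%s)"
        param = (item['user_name'], item['sex'], item['user_sign'], item['user_avatar'], item['user_url'], item['user_add'])
        try:
            cursor.execute(sql, param)  # write the item to the database
            self.dbObject.commit()
        except Exception as e:
            print(e)
            self.dbObject.rollback()
        finally:
            cursor.close()
        return item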

Because we use pipelines.py, we also need to uncomment ITEM_PIPELINES in settings.py; this is what links the two files together.
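After uncommenting, the setting would look roughly like this, assuming the project is named zhihuxjj and the pipeline class lives in pipelines.py as in this article. The number is an ordering value: when several pipelines are enabled, lower values run first.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    "zhihuxjj.pipelines.ZhihuxjjPipeline": 300,
}
```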

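At this point nearly everything is set up and the program can basically run. However, a scrapy project obeys robots.txt rules by default, so let's take a look at Zhihu's rules: https://www.zhihu.com/robots.txt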

Emmmmmmm, finished reading the rules? Good. Now, in settings.py, change ROBOTSTXT_OBEY to False. (runs away

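It feels like... I forgot something. Right, I forgot to set the headers. The general way to set headers is also in settings.py: uncomment DEFAULT_REQUEST_HEADERS and set browser-like headers. Zhihu requires a simulated login; if you access it as a guest, you need to add authorization. As for how to get this authorization, I, am, not, telling, you (runs away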

DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
}

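To reduce load on the server & avoid getting banned, uncomment DOWNLOAD_DELAY, which sets the download delay, and set it to 3 (the robots rules ask for 10, but 10 is really too slow _(:зゝ∠)_ Zhihu's programmer guys can't see this sentence, can't see this sentence...).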
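If you would rather read the requested delay with code than by eye, the standard library's urllib.robotparser can do it. The robots text below is a made-up sample, not Zhihu's actual file, and the example.com urls are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())  # parse from text, no network request

delay = parser.crawl_delay("*")  # delay the site asks for, in seconds
allowed = parser.can_fetch("*", "https://example.com/people/jixin")
blocked = parser.can_fetch("*", "https://example.com/private/data")
```

The value of `delay` is what you would compare against your own DOWNLOAD_DELAY.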