Scraping Zhihu user information with Scrapy and storing it in MongoDB

The general idea: start from one seed user, fetch that user's followers and followees, then fetch each of those users' followers and followees in turn.

Following this crawling flow, the spider can keep recursing and collect a large amount of user data.

We pick 轮子哥 (excited-vczh) as the seed user and use Google Chrome's developer tools to watch the page's network traffic.

From this we can see that when the browser fetches the follower and followee data, it is actually calling a Zhihu API. Looking at the data returned for each user, the url_token parameter turns out to be very important: it uniquely identifies a user, so we can use url_token to locate that user.
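For example, the followee request captured in the Network panel has the same form as the follows_url template the spider builds later (seed user excited-vczh, first page of 20; in the browser the include part shows up URL-encoded):

https://www.zhihu.com/api/v4/members/excited-vczh/followees?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=0&limit=20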

To get more detailed information about a single user, hover the mouse over the user's avatar.

A corresponding Ajax request then shows up in the Network panel:
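It corresponds to the user_url template used in the spider code below; for the seed user it looks roughly like this:

https://www.zhihu.com/api/v4/members/excited-vczh?include=allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics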

Open cmd and create the project with the command scrapy startproject zhihuuser, then start writing the code.

  1. First, set ROBOTSTXT_OBEY to False in settings.py so the spider is not constrained by robots.txt; otherwise some pages cannot be reached.
  2. Enter the zhihuuser folder and create the Zhihu spider with scrapy genspider zhihu www.zhihu.com.

  3. First test the spider by running it directly from cmd. The run stalls and the server responds with a 400 error. This is because Zhihu's servers check the User-Agent, so we modify the DEFAULT_REQUEST_HEADERS setting in settings.py and add a browser's default User-Agent (see the sketch after this list).
  4. Rewrite the initial requests:

    First, define the URL templates (user_url, follows_url, etc.) and add their include parameters (the corresponding *_query strings).

    Second, override start_requests: the seed user is 轮子哥 (start_user), the URLs are filled in with format(), and the callbacks parse_user, parse_follows and parse_followers are set for the respective requests.
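A minimal sketch of the settings.py changes from steps 1 and 3 (the User-Agent string below is only an example; use whatever your own browser sends):

# settings.py (excerpt)
ROBOTSTXT_OBEY = False   # step 1: ignore robots.txt so the API endpoints are reachable

# step 3: send a browser User-Agent so Zhihu does not answer with 400
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}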

  5. Create the item container (items.py) that holds the scraped data.

Because there are too many fields, I have not listed them all here.
6. Store each user's information.

In the spider we write the parse_user function. In it we create a UserItem() object; through the item.fields attribute we can get the fields defined in UserItem and use them to fill the item from the response data.

7. Handling the Ajax requests for followers and followees

The Ajax response contains two blocks: data and paging.

① For data, the information we mainly need is url_token. It is the unique identifier of a user, and with it we can build that user's profile URL.
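As a rough illustration (field values invented), one entry in the parsed data list looks like this; url_token is what gets plugged back into user_url:

# hypothetical entry from results['data'] after json.loads(response.text)
{'url_token': 'excited-vczh', 'name': 'vczh', 'answer_count': 0, 'follower_count': 0}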

 

    

② For paging

paging tells us whether the followers/followees Ajax request has reached the end of the list, i.e. whether the follower list is on its last page. is_end indicates whether this is the last page, and next is the URL of the next page; these two fields are what we need.

First check whether paging exists; when is_end is false, take the URL of the next page of the followees/followers list and yield a Request for it.
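As a rough illustration (the next URL here is invented), the parsed paging block looks like this; only is_end and next are used:

# hypothetical results['paging'] block after json.loads(response.text)
{'is_end': False,
 'next': 'https://www.zhihu.com/api/v4/members/excited-vczh/followees?offset=20&limit=20'}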

8. Store the user information in MongoDB, i.e. rewrite the pipeline.

The official Scrapy documentation has MongoDB pipeline code; copy it directly.

    To deduplicate users, one line needs to be changed (see process_item in the pipeline code below):

Of course, the relevant settings also need to be added in settings.py; a sketch follows below.
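A sketch of those settings, assuming a local MongoDB instance and the default pipelines.py module path (the database name zhihu is just an example):

# settings.py (excerpt)
ITEM_PIPELINES = {
    'zhihuuser.pipelines.MongoPipeline': 300,
}
MONGO_URI = 'localhost'      # read by MongoPipeline.from_crawler
MONGO_DATABASE = 'zhihu'     # example name; change to suit your setup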

 

Finally, run the spider.
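From the project directory, this is just the standard crawl command with the spider name defined above:

scrapy crawl zhihu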

 

The full code:

Spider code (zhihu.py):

# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy import Request

from zhihuuser.items import UserItem


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_user = 'excited-vczh'
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    user_query = 'allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics'

    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
    followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, offset=0, limit=20),
                      self.parse_follows)
        yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, offset=0, limit=20),
                      self.parse_followers)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()  # create a UserItem instance
        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item
        # request this user's followee list
        yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            callback=self.parse_follows)
        # request this user's follower list
        yield Request(
            self.followers_url.format(user=result.get('url_token'), include=self.followers_query, limit=20, offset=0),
            callback=self.parse_followers)

    def parse_follows(self, response):
        results = json.loads(response.text)
        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              callback=self.parse_user)
        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page, callback=self.parse_follows)

    def parse_followers(self, response):
        results = json.loads(response.text)
        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              callback=self.parse_user)
        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page, callback=self.parse_followers)

Pipeline code (pipelines.py):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


import pymongo


class MongoPipeline(object):

    collection_name = 'user'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert keyed on url_token: the filter deduplicates users, $set carries
        # the item data, and upsert=True updates an existing document or inserts
        # a new one when no match is found
        self.db[self.collection_name].update_one(
            {'url_token': item['url_token']}, {'$set': dict(item)}, upsert=True)
        return item

 items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class UserItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = Field()
    name = Field()
    employments = Field()
    url_token = Field()
    follower_count = Field()
    url = Field()
    answer_count = Field()
    headline = Field()