Notes on Building a WeChat Crawler with Scrapy

Date: 2019-12-10


Scrapy + Selenium + Chrome: a WeChat official account crawler

Overview

1. Approach to crawling WeChat official accounts:

Reference: "An Account of Crawling WeChat Official Accounts" (记一次微信公众号爬虫的经历)

2. Scrapy architecture diagram

3. Classic Scrapy tutorials

Reference:

4. Other

Reference:

"A Post to Talk You Out of Becoming a Crawler Engineer" (爬虫工程师劝退文)

Practice

1. Environment setup

Install selenium (pip install selenium)
Install chromedriver (make sure its version is compatible with your Chrome version)
beautifulsoup4
scrapy
MongoDB and pymongo

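MongoDB:

Installing and starting MongoDB

Importing and exporting MongoDB data

The specific commands are as follows:

To connect to MongoDB from Python, install mongoengine: pip install mongoengine

Starting the server: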

sudo ./mongod --port 27017 --dbpath "/software/mongodb-4.0.0/data/db" --logpath "/software/mongodb-4.0.0/log/mongodb.log" --logappend --replSet rs0

Exporting MongoDB data on Windows:

mongodump --port 27017 -d wechat -o D:\MongoDB

Importing MongoDB data on Linux:

./mongorestore -h 127.0.0.1 --port 27017 -d wechat --drop /software/mongodb-4.0.0/wechat

Note when importing data:

Do you run mongo in replica set, i.e., mongod --replSet rs0?

If yes, please remember to run in your mongo shell the command: rs.initiate()

Reference:

"Python 3 Web Crawler Development in Practice" tutorial (Python3网络爬虫开发实战教程)

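2. Obtaining cookies

Use Selenium to log in and pass verification, then save the cookies so that Scrapy can reuse them.

Reference: "Using Cookies with Selenium to Skip Login" (selenium使用cookie实现免登陆)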
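The cookie hand-off described above can be sketched roughly as follows. This is a minimal sketch, not the original author's code: the function names, the JSON file format, and the fixed wait for a manual login are all assumptions made for illustration. Selenium is imported lazily, so the helper functions work without it.

```python
import json
from pathlib import Path


def save_cookies(cookies, path):
    """Write a list of Selenium cookie dicts to a JSON file."""
    Path(path).write_text(json.dumps(cookies, ensure_ascii=False), encoding="utf-8")


def load_cookies(path):
    """Read the cookie list back from the JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def to_scrapy_cookies(cookies):
    """Reduce Selenium cookie dicts to a name -> value mapping for scrapy.Request."""
    return {c["name"]: c["value"] for c in cookies}


def login_and_save(login_url, path, wait_seconds=60):
    """Open Chrome, wait for a manual login (e.g. a QR scan), then save the cookies.

    login_url, path and wait_seconds are hypothetical parameters.
    """
    import time
    from selenium import webdriver  # needs selenium and a matching chromedriver

    driver = webdriver.Chrome()
    try:
        driver.get(login_url)
        time.sleep(wait_seconds)  # crude wait for the human to finish logging in
        save_cookies(driver.get_cookies(), path)
    finally:
        driver.quit()
```

A spider could then pass `to_scrapy_cookies(load_cookies(path))` as the `cookies` argument of its first `scrapy.Request`.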

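3. The crawler

Cookies: the Scrapy spider's initializer launches Chromedriver and obtains the cookies
Navigation: the spider's initializer uses Chromedriver to navigate to the page to be scraped
Parsing: the parse function processes the page that Chromedriver navigated to, as well as the URL of the next page
Storage: Scrapy is configured to save the data to MongoDB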
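For the storage step, one possible shape of the item pipeline is sketched below. The class name, the MONGO_URI and MONGO_DB setting keys, and the "articles" collection name are assumptions for illustration, not details from the original project; only the database name wechat comes from the commands earlier in this post.

```python
class MongoPipeline:
    """Scrapy item pipeline that stores each item in a MongoDB collection."""

    def __init__(self, mongo_uri, mongo_db, collection_name="articles"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name  # hypothetical collection name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DB are assumed custom keys in settings.py.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://127.0.0.1:27017"),
            mongo_db=crawler.settings.get("MONGO_DB", "wechat"),
        )

    def open_spider(self, spider):
        import pymongo  # imported lazily; requires pymongo to be installed

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Store a plain-dict copy and pass the item on to later pipelines.
        self.collection.insert_one(dict(item))
        return item
```

To enable such a pipeline, it would be listed in the ITEM_PIPELINES setting, for example `{"myproject.pipelines.MongoPipeline": 300}`, where the module path is hypothetical.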

Reference:

"Using Selenium in a Scrapy Crawler for User Login and Cookie Passing" (scrapy爬虫利用selenium实现用户登陆和cookie传递)

zhihu-scrapy-spider

AlipayQR.py

XMQ-BackUp

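4. Calling the crawler from Django

5. Building a search engine with Django to search the crawled data

Reference:

"Building a Search Engine with a Distributed Python Crawler: Code and Tutorial" (Python分布式爬虫打造搜索引擎代码+教程)

Environment setup:

Install elasticsearch-rtf, then run pip install mongo-connector, pip install mongo-connector[elastic5], and pip install elastic2-doc-manager

Syncing MongoDB data to Elasticsearch:

mongo-connector -m localhost:27017 -t localhost:9200 -d elastic2_doc_manager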

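Other issues

1. Locating elements on a new page with Selenium

Reference:

"Fixing Selenium's Failure to Locate Elements After a New Page Opens (Unable to locate element)" (解决Selenium弹出新页面没法定位元素问题)

"Eight Common Ways to Locate Elements with Selenium WebDriver" (Selenium Webdriver元素定位的八种经常使用方式)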
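A common cause of "Unable to locate element" after a click opens a new tab or window is that the driver is still focused on the old one. The sketch below shows one way to move focus to the new window. The helper name is hypothetical, and it only relies on the driver's `window_handles` list and `switch_to.window()`.

```python
def switch_to_new_window(driver, handles_before):
    """Switch the driver to a window that did not exist in handles_before.

    handles_before is the value of driver.window_handles captured before
    the click that opened the new page.
    """
    new_handles = [h for h in driver.window_handles if h not in handles_before]
    if not new_handles:
        raise RuntimeError("no new window was opened")
    driver.switch_to.window(new_handles[-1])
    return new_handles[-1]
```

Typical use would be to record `before = driver.window_handles`, click the link, call `switch_to_new_window(driver, before)`, and only then locate elements on the new page.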

2. Ways to connect to MongoDB with pymongo
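As a rough illustration, pymongo's MongoClient can be created with no arguments (localhost defaults), with a host and port, or with a connection URI. The sketch below builds such a URI. The helper names are hypothetical, and the wechat database and rs0 replica set in the usage note are taken from the commands earlier in this post.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host="127.0.0.1", port=27017, database=None,
                    username=None, password=None, replica_set=None):
    """Assemble a mongodb:// URI; credentials are percent-escaped."""
    auth = ""
    if username is not None:
        auth = quote_plus(username)
        if password is not None:
            auth += ":" + quote_plus(password)
        auth += "@"
    uri = f"mongodb://{auth}{host}:{port}/"
    if database:
        uri += database
    if replica_set:
        uri += f"?replicaSet={replica_set}"
    return uri


def connect(uri):
    """Create a client from a URI (requires pymongo to be installed)."""
    from pymongo import MongoClient

    # Equivalent forms for a local server without authentication:
    #   MongoClient()
    #   MongoClient("127.0.0.1", 27017)
    #   MongoClient("mongodb://127.0.0.1:27017/")
    return MongoClient(uri)
```

For the replica-set server started earlier, `build_mongo_uri(database="wechat", replica_set="rs0")` gives a URI that could be passed to `connect`.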

3. Closing the spider from a pipeline

spider.crawler.engine.close_spider(spider, 'bandwidth_exceeded')
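To show where that call might live, here is a minimal sketch of a pipeline that stops the crawl after a fixed number of items. The class name, the max_items limit, and the "item_limit_reached" reason string are assumptions for illustration. Only the close_spider call itself comes from the line above.

```python
class ItemLimitPipeline:
    """Ask the engine to close the spider once max_items items were processed."""

    def __init__(self, max_items=100):
        self.max_items = max_items  # hypothetical limit
        self.seen = 0
        self.closing = False

    def process_item(self, item, spider):
        self.seen += 1
        if self.seen >= self.max_items and not self.closing:
            # Request shutdown only once; items already in flight may still arrive.
            self.closing = True
            spider.crawler.engine.close_spider(spider, "item_limit_reached")
        return item
```

The second argument is a free-form reason string that Scrapy records as the finish reason, so any descriptive value can be used in place of 'bandwidth_exceeded'.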