Learning Python: Scrapy study notes

Date: 2019-12-19


Python Scrapy study notes

1. You can specify which pipelines process the data for each spider, though this takes a little code

First we need a decorator. It goes in the pipelines file, outside any class, because several pipelines need to use it.

import functools
import logging


def check_spider_pipeline(process_item_method):
    """Decorator for a pipeline's process_item method.

    :param process_item_method: the process_item method to wrap
    :return: the wrapped method
    """

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        # Message template for debugging. The doubled braces leave a
        # placeholder that is filled with "executing" or "skipping" below.
        msg = "{{}} {} pipeline step".format(self.__class__.__name__)

        # If this class is in the spider's pipeline set, run
        # process_item normally.
        if self.__class__ in spider.pipeline:
            logging.info(msg.format("executing"))
            return process_item_method(self, item, spider)

        # Otherwise return the item untouched (skip this pipeline step).
        logging.info(msg.format("skipping"))
        return item

    return wrapper

The decorator checks whether the spider has enabled this pipeline. The key line is:

if self.__class__ in spider.pipeline:

For this check to work, we need to declare our pipelines in the spider:

pipeline = set([
    pipelines.RentMySQLPipeline, ])

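Add this to the spider class to link the two pieces of code. Once the decorator is applied in a pipeline, it checks whether the spider has authorized that pipeline to handle its items.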

Of course, the pipelines module must be imported into the spider file before it is used.

With the two linked, apply the decorator like this:

@check_spider_pipeline
def process_item(self, item, spider):
    ...  # the pipeline's own processing goes here

That's it. Put this decorator in front of every pipeline's process_item method, then authorize the pipelines in each spider.
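To see the pieces working together without running Scrapy, here is a minimal, self-contained sketch. RentSpider, OtherPipeline and the dict item are hypothetical stand-ins rather than real Scrapy objects, and RentMySQLPipeline here only marks the item instead of writing to MySQL. Only the decorator logic follows the notes above, with the logging left out for brevity.

```python
import functools


def check_spider_pipeline(process_item_method):
    """Run process_item only if the spider lists this pipeline class."""

    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        if self.__class__ in spider.pipeline:
            return process_item_method(self, item, spider)
        return item  # not authorized by this spider: pass the item through

    return wrapper


class RentMySQLPipeline:
    """Hypothetical pipeline that marks the items it has handled."""

    @check_spider_pipeline
    def process_item(self, item, spider):
        item["stored"] = True
        return item


class OtherPipeline:
    """Hypothetical second pipeline that the spider does not authorize."""

    @check_spider_pipeline
    def process_item(self, item, spider):
        item["other"] = True
        return item


class RentSpider:
    """Stand-in for a scrapy.Spider subclass (assumption: no Scrapy here)."""

    name = "rent"
    pipeline = {RentMySQLPipeline}


spider = RentSpider()
item = {"title": "demo"}
# Feed the item through both pipelines, the way Scrapy would chain them.
for step in (RentMySQLPipeline(), OtherPipeline()):
    item = step.process_item(item, spider)

print(item)  # {'title': 'demo', 'stored': True}
```

Only the authorized pipeline touched the item; OtherPipeline returned it unchanged.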

2. With an ORM, we can get started with database operations quickly

ORM stands for object-relational mapping.

First we need some background. I use MySQL and SQLAlchemy; if you are not familiar with them, see the Runoob MySQL tutorial and an SQLAlchemy tutorial.

After the spider scrapes the data, all of it is handed to the pipelines for processing. Pipelines must be registered in settings:

ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,
    'spider.pipelines.SpiderDetailPipeline': 300,
}
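The number attached to each entry is its priority: Scrapy runs pipelines in ascending order of this value, and values are customarily kept in the 0-1000 range. Both entries above use 300, so their relative order is not made explicit. A small sketch of how distinct values would spell the order out, assuming (hypothetically) that SpiderPipeline should run before SpiderDetailPipeline; the values 300 and 400 are illustrative only:

```python
# Hypothetical priorities: lower numbers run first.
ITEM_PIPELINES = {
    'spider.pipelines.SpiderDetailPipeline': 400,
    'spider.pipelines.SpiderPipeline': 300,
}

# Sort by priority to show the order in which the pipelines would be applied.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```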

Then we need to create our own database and table in MySQL:

mysql -u root -p
create database xxx;
use xxx;
create table spider(id integer not null, primary key (id));

Once everything we need has been added, we create a mapping class for the table in our program:

from sqlalchemy import (Column, String, DateTime, create_engine,
                        Integer, Text, INT)
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

import settings

Base = declarative_base()


class Topic(Base):
    __tablename__ = 'topic'

    id = Column(Integer, primary_key=True, unique=True, autoincrement=True)
    topic_title = Column(String(256))
    topic_author = Column(String(256))
    topic_author_img = Column(String(256))
    topic_class = Column(String(256))
    topic_reply_num = Column(Integer)
    spider_time = Column(String(256))

    def __init__(self, topic_title, topic_author, topic_class,
                 topic_reply_num, spider_time, topic_author_img):
        # self.topic_id = topic_id
        self.topic_title = topic_title
        self.topic_author = topic_author
        self.topic_author_img = topic_author_img
        self.topic_class = topic_class
        self.topic_reply_num = topic_reply_num
        self.spider_time = spider_time


# Session factory bound to the engine defined in settings.
DBSession = sessionmaker(bind=settings.engine)
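The mapping code reads settings.engine, which these notes do not show. A minimal sketch of what that line in settings.py could look like, assuming the PyMySQL driver is installed; the user, password, host, port and charset are placeholders, and xxx is the database name used in the MySQL commands above:

```python
# settings.py (sketch; credentials, host and driver are assumptions)
from sqlalchemy import create_engine

# URL format: dialect+driver://user:password@host:port/database
engine = create_engine(
    'mysql+pymysql://root:password@localhost:3306/xxx?charset=utf8mb4',
    pool_recycle=3600,  # recycle connections older than an hour
)
```

create_engine does not open a connection right away; the first connection is made when a session actually talks to the database.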

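Base is the base class that all mapped classes inherit from. DBSession is a session factory created with sessionmaker; the sessions it produces make it convenient to work with the database. Next comes inserting the data. Since this is a crawler, we don't need to worry about deletes or updates. Here is the code: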

import datetime


class TesterhomeSpiderPipeline(object):
    def __init__(self):
        self.session = DBSession()

    @check_spider_pipeline
    def process_item(self, item, spider):
        # Take the first value of each field and build a Topic row from it.
        my_topic = Topic(
            topic_title=item['topic_title'][0].encode('unicode-escape'),
            topic_author=item['topic_author'][0].encode('unicode-escape'),
            topic_author_img=item['topic_author_img'][0].encode('unicode-escape'),
            topic_class=item['topic_class'][0].encode('unicode-escape'),
            topic_reply_num=item['topic_reply_num'][0].encode('unicode-escape'),
            spider_time=datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
        try:
            self.session.add(my_topic)
            self.session.commit()
        except:
            # Undo the failed transaction, then re-raise the error.
            self.session.rollback()
            raise
        finally:
            self.session.close()
        return item

Instantiating the Topic class and then calling the session's methods to insert the data completes one database operation.

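I had planned to post two blog entries a week, whether or not they had any substance, haha! But I got lazy again over the weekend, so I'm catching up on Monday!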

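The first note covers a problem I actually ran into while using the crawler. The second... well... is there to make up the numbers! I've built websites with Flask before, so SQLAlchemy was easy to pick up!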

This article is my own original work and took effort to write. Please credit the source when reposting! Thanks.