Thoughts on Deploying Distributed Crawlers with Fabric

Date: 2020-01-31

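Fabric is a tool for automated administration, task execution, and application deployment. It is common in automated operations work.

Other SSH tools such as paramiko are just as capable, of course, but Fabric is better packaged, more thoroughly documented, and simpler to use.

Chinese documentation: https://fabric-chs.readthedoc...

Fabric is a command-line tool and library written in Python that helps system administrators carry out certain tasks efficiently.
It makes running shell commands over SSH easier and more Pythonic.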

Installation

sudo pip3 install fabric3

How to run

Either of the following works:

fab -f REQUESTS_HTML.py host_type # host_type is the task function

or

python3 REQUESTS_HTML.py  # run the whole file

Hands-on examples

I'm not much of a talker (really, my writing is just average), so pasting the code directly is probably the best way to share.

First demo

from fabric.api import run,cd,env,hosts,execute
env.hosts=['root@2.2.2.2:22']
env.password='pwd2'

def host_type():
    with cd('../home'):
        run('ls')
        # Note: each run() starts a fresh shell, so this cd does not carry over to the next run()
        run('cd youboy_redis')
        run('cd Youboy && cd youboy && ls && python3 run.py')

print(execute(host_type)) # execute() runs the task on the hosts in env.hosts

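The order of execution is easy to follow: the function enters the crawler directory and runs a crawler script, just like running commands locally. The with context manager is supported here.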

A simple class wrapper

from fabric.api import run,cd,env,hosts,execute
class H():
    def __init__(self,host,pwd):
        env.hosts=[host]  # use the host argument; `hosts` is the imported decorator, not a list
        env.password=pwd

    def host_type(self):
        with cd('../home'):
            run('ls')
    def run(self):
        print(execute(self.host_type))

h = H('root@2.2.2.2:22','pwd2')
h.run()

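Yep, works fine.

Parallel execution

By default, Fabric runs a task on multiple machines serially, one host after another.
Fabric also supports parallel tasks: when the tasks on different servers do not depend on each other, running them in parallel can speed things up considerably.
So how do you turn parallel execution on?

Add the -P flag when running the fab command: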

$ fab -P host_type

Or set the env.parallel setting to True:

from fabric.api import env
env.parallel = True

If you only want a particular task to run in parallel, add the @parallel decorator to that task function:

from fabric.api import parallel
 
@parallel
def runs_in_parallel():
    pass
 
def runs_serially():
    pass

This way, runs_in_parallel() runs in parallel even when parallel mode is off.
Conversely, you can add the @serial decorator to a task function:

from fabric.api import serial
 
def runs_in_parallel():
    pass
 
@serial
def runs_serially():
    pass

This way, runs_serially() runs serially even when parallel mode is on.
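
To make the speed-up argument concrete, here is a small stdlib-only illustration of serial versus parallel execution over independent hosts. It is not how Fabric does it: Fabric 1.x runs parallel tasks in separate processes, and threads are used here only to keep the sketch short. The host names and fake_task are hypothetical, and time.sleep stands in for remote work.

```python
import time
from concurrent.futures import ThreadPoolExecutor

HOSTS = ["host1", "host2", "host3"]  # hypothetical host names


def fake_task(host):
    """Pretend to do 0.1 s of remote work on one host."""
    time.sleep(0.1)
    return "{}: done".format(host)


def run_serially(hosts):
    # One host after another: total time grows with the number of hosts.
    return [fake_task(h) for h in hosts]


def run_in_parallel(hosts):
    # All hosts at once: total time is roughly that of the slowest host.
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return list(pool.map(fake_task, hosts))


start = time.perf_counter()
serial_results = run_serially(HOSTS)
serial_elapsed = time.perf_counter() - start

start = time.perf_counter()
parallel_results = run_in_parallel(HOSTS)
parallel_elapsed = time.perf_counter() - start

print(serial_results == parallel_results)
print("serial: {:.2f}s, parallel: {:.2f}s".format(serial_elapsed, parallel_elapsed))
```

Both runs return the same results. The only difference is wall-clock time, which is why parallel mode only pays off when the per-host tasks are independent of each other.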

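Trying a few things

Configuration file

The file is named conf.yml, but its contents use INI syntax (it is read with configparser below), not YAML: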

[mysql]
host = 127.0.0.1
port = 3306
db = python
user = root
passwd = 123456
charset = utf8

[mongodb]
host = ip
port = 27017
db = QXB

[redis]
host = 127.0.0.1
port = 6379
db = 0

[server]
aliyun1_host = ["public ip", "ssh password", 22]
aliyun2_host = ["public ip", "ssh password", 22]

Reading the configuration

config.py

from configparser import ConfigParser
import json
config = ConfigParser()
config.read('./conf.yml')  # INI-style content; a name like conf.ini or conf.cfg works too


# List all sections
# print(config.sections())  # ['mysql', 'mongodb', 'redis', 'server']

conf_list = list()
for host in config.options('server'):
    str_host = config.get('server', host)
    json_host = json.loads(str_host)
    conf_list.append(json_host)

print(conf_list)
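
As a minimal sketch of where this is heading, the snippet below shows how [server] entries of this shape can be turned into user@host:port strings and a password mapping keyed by those same strings, which is the form Fabric 1.x expects for env.hosts and env.passwords. The addresses and passwords are made-up placeholders, and the configuration is read from an inline string instead of conf.yml so that the example is self-contained.

```python
import json
from configparser import ConfigParser

# Hypothetical sample data in the same shape as the [server] section above.
SAMPLE = """
[server]
aliyun1_host = ["10.0.0.1", "secret1", 22]
aliyun2_host = ["10.0.0.2", "secret2", 2222]
"""

config = ConfigParser()
config.read_string(SAMPLE)

# Each value is a JSON list: [host, ssh password, port]
servers = [json.loads(value) for _, value in config.items("server")]

# Host strings in the form user@host:port
host_strings = ["root@{}:{}".format(host, port) for host, _, port in servers]

# Passwords keyed by the full host string
passwords = {"root@{}:{}".format(host, port): pwd for host, pwd, port in servers}

print(host_strings)
```

Storing each server as a JSON list inside an INI value keeps the file flat, at the cost of having to call json.loads on every entry.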

Connecting to remote servers and some common operations

import warnings
warnings.filterwarnings("ignore")
import time
from fabric.api import * # run,cd,env,hosts,execute,sudo,settings,hide
from fabric.colors import *
from fabric.contrib.console import confirm
import config
import json
from fabric.tasks import Task

class HA():
    def __init__(self):
        self.host = "root@{host}:{port}"
        self.ssh = "root@{host}:{port}"
        self.env = env
        self.env.warn_only = True # warn instead of aborting when a command fails
        self.env.hosts = [
            self.host.format(host=host[0],port=host[2]) for host in config.conf_list]
        self.env.passwords = {
            self.ssh.format(host=host[0], port=host[2]):host[1] for host in config.conf_list}

 
        print(self.env["hosts"])

    # def Hide_all(self):
    #     with settings(hide('everything'), warn_only=True):  # suppress output
    #         result = run('ls')
    #         print(result)  # output of the command
    #         print(result.return_code)

    # def Show_all(self):
    #     with settings(show('everything'), warn_only=True):  # show all output
    #         result = run('docker')
    #         print(str(result.return_code))  # return code: 0 means success, non-zero means failure
    #         print(str(result.failed))

    # @task
    # def Prefix(self): # prefix takes a command as its argument; every command run inside the block is preceded by that command.
    #     with cd('../home'):
    #         with prefix('echo 123'):
    #             run('echo hello')


    # def Shell_env(self): # set environment variables for the remote shell
    #     with shell_env(HTTP_PROXY='1.1.1.1'):
    #         run('echo $HTTP_PROXY')


    # def Path_env(self): # adjust PATH on the remote server; only affects the current session, not other operations on the server. Several modification modes are supported
    #     with path('/tmp', 'prepend'):
    #         run("echo $PATH")
    #     run("echo $PATH")


    # def Mongo(self): # try connecting to MongoDB; not sure why it fails once a port is specified
    #     # with remote_tunnel(27017):
    #     run('mongo')


    # def Mysql(self):  # try connecting to MySQL
    #     with remote_tunnel(3306):
    #         run('mysql -u root -p password')

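    '''
    When specifying a host you can include the user name and port: username@hostname:port
    Specify on the command line which hosts a task runs on: fab mytask:hosts="host1;host2"
    Use the hosts decorator to specify which hosts the current task runs on
    env.reject_unknown_hosts controls how unknown hosts are handled. It defaults to False, which behaves like setting SSH's StrictHostKeyChecking option to no: the host key is not confirmed.
    '''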

    # @hosts('root@ip:22')
    # @task
    # def Get_Ip(self):
    #     run('ifconfig') 
    #     # return run("ip a")

    # @hosts("root@ip:22")
    # @runs_once
    # def Get_One_Ip(self):
    #     run('ifconfig')

    '''
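    A role is a way of classifying servers. Roles let you define what each server is for,
    so that different operations can run on different servers. Logically, roles group the servers into categories;
    once grouped, you can target a whole category just by giving its role name.
    Roles are then used to control which servers a task runs on.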
    '''

    # @roles('web')  # only run on hosts whose role is web
    # @task
    # def Roles_Get_Ip():
    #     run('ifconfig')
        

    # def Confirm(self): # when a step fails and you want to ask the user whether to continue, confirm is very handy; it lives in fabric.contrib.console
    #     result = confirm('Continue Anyway?')
    #     print(result)

    # def run_python(self):
    #     run("python3 trigger.py")

    @task
    @parallel
    def celery_call(): # run the celery tasks
        with cd('../home'):
            warn(yellow('----->Celery'))
            puts(green('----->puts'))
            # Note: a worker started in the foreground blocks here, so the lines below only run after it exits
            run('cd ./celery_1 && celery -A Celery worker -l info')
            time.sleep(3)
            run('python3 run_tasks.py')
    

    # @task
    # def update_file(): # upload a file to the server
    #     with settings(warn_only=True):
    #         local("tar -czf test.tar.gz config.py")
    #         result = put("test.tar.gz", "/home/test.tar.gz")
    #     if result.failed and not confirm("continue[y/n]?"):
    #         abort("put test.tar.gz failed")

    #     with settings(warn_only=True):
    #         local_file_md5 = local("md5sum test.tar.gz",capture=True).split(" ")[0]
    #         remote_file_md5 = run("md5sum /home/test.tar.gz").split(" ")[0]
    #     if local_file_md5 == remote_file_md5:
    #         print(green("local_file == remote_file"))
    #     else:
    #         print(red("local_file != remote"))
    #     run("mkdir /home/test")
    #     run("tar -zxf /home/test.tar.gz -C /home/test")

    '''
    One odd thing: self and the @task decorator cannot be used together inside a class; doing so raises an error
    '''

    # @task
    # def downloads_file(): # get a file from the server to the local machine
    #     with settings(warn_only=True):
    #         result = get("/home/celery_1", "./")
    #     if result.failed and not confirm("continue[y/n]?"):
    #         abort("get test.tar.gz failed")
    #     local("mkdir ./test")
    #     local("tar zxf ./hh.tar.gz -C ./test")

    # @task
    # @parallel
    # def scp_docker_file():
    #     with settings(warn_only=True):
    #         local("tar -czf docker.tar.gz ../docker")
    #         result = put("docker.tar.gz", "/home/docker.tar.gz")
    #     if result.failed and not confirm("continue[y/n]?"):
    #         abort("put dockerfile failed")
    #     run("mkdir /home/docker")
    #     run("tar -zxf /home/docker.tar.gz -C /home")


    def Run(self):
        execute(self.celery_call)
    

h = HA()
h.Run()

I've added code comments where I could; for more explanation, please refer to the Fabric documentation.
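
The username@hostname:port host-string format described in the comments above can be illustrated with a small parser. This is only a sketch of the format, not Fabric's own parsing code. The function name parse_host_string and its default user and port are assumptions made for the example, and IPv6 addresses are not handled.

```python
def parse_host_string(host_string, default_user="root", default_port=22):
    """Split 'user@host:port' into (user, host, port).

    The user and port parts are optional; the defaults are assumptions
    chosen for this example.
    """
    user, _, rest = host_string.rpartition("@")
    host, _, port = rest.partition(":")
    return (user or default_user, host, int(port) if port else default_port)


print(parse_host_string("root@2.2.2.2:22"))
print(parse_host_string("deploy@example.com"))
print(parse_host_string("example.com"))
```

Because the password mapping is keyed by the complete host string, the user, host, and port parts all have to match the entries in the host list exactly.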

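A look at Docker

As long as you can connect to the servers, installing and running services on them follows naturally; Docker is one example.
Here is the concrete code: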

import warnings
warnings.filterwarnings("ignore")
import time
from fabric.api import * # run,cd,env,hosts,execute,sudo,settings,hide
from fabric.colors import *
from fabric.contrib.console import confirm
import config
import json
from fabric.tasks import Task

class HA():
    def __init__(self):
        self.host = "root@{host}:{port}"
        self.ssh = "root@{host}:{port}"
        self.env = env
        self.env.warn_only = True # warn instead of aborting when a command fails
        self.env.hosts = [
            self.host.format(host=host[0],port=host[2]) for host in config.conf_list]
        self.env.passwords = {
            self.ssh.format(host=host[0], port=host[2]):host[1] for host in config.conf_list}


        print(self.env["hosts"])

    
    
    @task
    def get_docker_v(): # check the docker version
        with cd('../home'):
            run('docker version')

    @task
    def pull_images(images_name):
        with settings(warn_only=True):
            with cd("../home/"):
                try:
                    run("docker pull {}".format(images_name))
                except:
                    abort("docker pull failed")

    @task
    def push_images(images_name,username_repository,tag):
        with settings(warn_only=True):
            with cd("../home/"):
                try:
                    run("docker tag {images_name} {username_repository}:{tag}".format(images_name=images_name,username_repository=username_repository,tag=tag))
                    run("docker push {username_repository}:{tag}".format(username_repository=username_repository,tag=tag))
                except:
                    abort("docker push failed")

    @task
    def run_docker_images(images_name_tag):
        with settings(warn_only=True):
            with cd("../home/"):
                try:
                    run("docker run -p 4000:80 {}".format(images_name_tag))
                except:
                    abort("docker run failed")


    @task
    @parallel
    def execute_docker_compose():
        with settings(warn_only=True):
            with cd("../home/flask_app"):
                run("docker-compose up")


    @task
    def create_docker_service(service_name,images_name,num=4):
        with settings(warn_only=True):
            with cd("../home/"):
                run("docker service create --name {service_name} -p 4000:80 {images_name}".format(service_name=service_name,images_name=images_name))
                run("docker service scale {service_name}={num}".format(service_name=service_name,num=num))
    
    
    @task
    def stop_docker_service(service_name):
        with settings(warn_only=True):
            with cd("../home/"):
                run("docker service rm {}".format(service_name))

    def Run(self):
        # execute(self.create_docker_service,"demo","3417947630/py:hello")
        execute(self.execute_docker_compose)

h = HA()
h.Run()

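Heh, works nicely.

Summary

Using the third-party Python library Fabric to schedule and deploy applications over SSH is a solid choice. So what about using it to deploy crawlers?
If your crawler is written with the scrapy framework (other frameworks or plain scripts work the same way), you can use the file-upload approach shown earlier to copy the whole project directory to the target servers (in batch, of course),
then issue the crawl command. Remember that Fabric supports parallel execution, so this gets multiple machines crawling together.

scrapy projects can also be packaged and deployed as distributed crawlers with scrapyd, but the packaging process is somewhat tedious, whereas an SSH connection is more direct and simpler to operate.
That said, Fabric is ultimately just an automation tool for operations work: all it really does is copy code to the target servers and run the corresponding commands. It does not provide a visual crawler-management interface the way scrapyd does.

Seen this way, Fabric at least counts as one option for deploying distributed crawlers, precisely because deployment is so simple.

Those are my thoughts on this Python library. It gives us more options for deploying applications and services in the future. Go get some hands-on practice!

Reposts are welcome, but please credit the source.
Personal blog: http://www.gzky.live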