Thoughts on Implementing Rate Limiting

Date: 2019-11-09


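In a microservice architecture built on Spring Cloud, we need to add rate limiting at the gateway: for example, limiting a given IP address to 100 requests per second against a specific endpoint.

The guiding principle: meet the requirement while keeping the implementation simple and easy to maintain.

The platform's basic architecture is as follows:

nginx -> [gateway1, gateway2, …] -> [serviceA1, serviceA2, serviceB1, …]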

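1. In-memory, single-instance rate limiting

A: Start with in-memory rate limiting on a single instance. Its main advantages are a simple implementation and good performance.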
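To make the single-instance option concrete, here is a minimal sketch of an in-memory token bucket in Python. The class name, the `allow` helper, and the (IP, path) key layout are illustrative assumptions, not taken from any particular gateway, and the sketch is not thread-safe; a real gateway would need a lock or an atomic structure around the bucket.

```python
import time


class TokenBucket:
    """In-memory token bucket for one gateway process (not shared across instances)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so the logic can be exercised without sleeping
        self.tokens = capacity    # the bucket starts full
        self.last = clock()

    def try_acquire(self, requested=1):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= requested:
            self.tokens -= requested
            return True
        return False


# One bucket per (client IP, endpoint) pair; this key layout is hypothetical.
buckets = {}


def allow(ip, path, rate=100, capacity=100):
    key = (ip, path)
    if key not in buckets:
        buckets[key] = TokenBucket(rate, capacity)
    return buckets[key].try_acquire()
```

Because all state lives in process memory, every check is a few arithmetic operations with no network hop, which is where the performance advantage comes from.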

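Q: However, to improve availability and performance, I need to deploy multiple gateway instances, and those instances cannot share memory.

A: Suppose the policy is to limit endpoint A to 100 requests per second. With 2 gateways deployed behind nginx load balancing, limiting each gateway to 50 requests per second roughly meets the requirement.

Q: But if I now add another gateway instance, or one of the 2 deployed instances goes down, the original policy is no longer satisfied.

A: In that case, we need a mechanism that knows whether every gateway instance is healthy. Since the platform is based on Spring Cloud, there will certainly be a service registry. Taking consul as an example, the rate-limiting policy can be stored in consul's key/value store. By calling the registry's API at some interval (say every 30s), each gateway learns the number of currently healthy gateway instances (call it n) and dynamically adjusts its own limit to 100/n requests per second.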
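The adjustment itself is simple arithmetic. The sketch below assumes a hypothetical `count_healthy` callable that wraps whatever registry query the gateway uses; it does not call consul's real API, and the class and function names are made up for illustration.

```python
def per_instance_limit(total_per_second, healthy_instances):
    """Split a global limit evenly across the gateways currently reported healthy."""
    if healthy_instances < 1:
        raise ValueError("at least one healthy gateway is required")
    return total_per_second / healthy_instances


class GatewayLimit:
    """Holds this gateway's current share; refresh() is meant to run on a timer (e.g. every 30s)."""

    def __init__(self, total_per_second, count_healthy):
        self.total = total_per_second
        self.count_healthy = count_healthy  # hypothetical callable wrapping the registry query
        self.limit = None
        self.refresh()

    def refresh(self):
        self.limit = per_instance_limit(self.total, self.count_healthy())
        return self.limit
```

With a global limit of 100, two healthy gateways each get 50 requests per second; once the registry reports four, the next `refresh()` lowers each share to 25.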

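Q: When a gateway instance is added or crashes, the approach above leaves a short window (say 30s) during which the limit is inaccurate. Given that such failures are rare and the interval can be made shorter, this is not a real problem if the requirement is not that strict.

Q: Another problem is that this approach depends on how requests are distributed across the gateways. For example, if nginx forwards requests with weight 3 for gateway 1, weight 1 for gateway 2, and weight 1 for gateway 3, then gateway 1 must be limited to at most 60 requests per second, and gateways 2 and 3 to 20 each. The gateway's rate-limiting policy thus becomes coupled to the nginx configuration, which is poor design. Moreover, if gateway 3 then crashes, working out how gateways 1 and 2 should adjust their limits becomes fairly complicated.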
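The weighted case can be sketched as a proportional split. This assumes the gateway somehow knows the nginx weights and that nginx redistributes traffic to the surviving gateways in proportion to those weights; both are assumptions for illustration, and the need to know the weights at all is exactly the coupling described above.

```python
def weighted_limits(total_per_second, weights):
    """Split a global limit in proportion to each healthy gateway's nginx weight."""
    total_weight = sum(weights.values())
    return {name: total_per_second * w / total_weight for name, w in weights.items()}


weights = {"gateway1": 3, "gateway2": 1, "gateway3": 1}
print(weighted_limits(100, weights))

# If gateway3 drops out, every remaining share has to be recomputed.
remaining = {k: v for k, v in weights.items() if k != "gateway3"}
print(weighted_limits(100, remaining))
```

With weights 3:1:1 the shares are 60, 20, and 20; after gateway 3 is removed they become 75 and 25, so every surviving gateway has to change its limit whenever membership or weights change.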

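2. Distributed rate limiting (rate limiting as a standalone RPC service)

A: Wrap the rate-limiting function in a standalone RPC service. When the gateway receives a request, it first queries the interface exposed by the rate-limiting service and allows or rejects the request based on the result.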
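The request flow can be sketched with an in-process stand-in for the remote service. Everything here is hypothetical: `RateLimitService`, `check`, and `handle_request` are illustrative names, the RPC transport is omitted entirely, and the limiter uses a fixed one-second window counter for brevity rather than a token bucket.

```python
class RateLimitService:
    """Stand-in for the remote limiter; in a real deployment this sits behind an RPC interface."""

    def __init__(self, limit_per_second):
        self.limit = limit_per_second
        self.windows = {}  # key -> (window start second, request count)

    def check(self, key, now):
        window = int(now)
        start, count = self.windows.get(key, (window, 0))
        if start != window:  # a new one-second window begins
            start, count = window, 0
        allowed = count < self.limit
        if allowed:
            count += 1
        self.windows[key] = (start, count)
        return allowed


def handle_request(limiter, client_ip, path, now):
    """Gateway side: ask the limiter first, then forward or reject."""
    if limiter.check(f"{client_ip}:{path}", now):
        return 200  # forward to the backend service
    return 429      # Too Many Requests
```

Since all gateways consult the same counter, the limit holds globally regardless of how nginx distributes traffic; the price is one extra call per request.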

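Q: This approach first requires deploying a rate-limiting service, which adds operational cost. In addition, every request incurs an extra network round trip (gateway to rate-limiting service), so the performance bottleneck is likely to be the RPC communication between the gateway and the rate-limiting service. If the service exposes a plain HTTP interface, performance will probably be disappointing; if it exposes a binary-protocol interface (such as thrift), the gateway will need some code changes (it is, after all, built on Spring Cloud and WebFlux).

Overall, this approach is worth trying. Sentinel, Alibaba's open-source flow-control system, implements both distributed and in-memory rate limiting and looks like a good option. (I only skimmed the overview and have not studied it in depth.)

3. Distributed rate limiting based on redis

A: Use redis's single-threaded execution together with a lua script to implement distributed rate limiting. Requests from multiple gateways are still executed sequentially inside redis, so there is no concurrency problem. A single request involves several redis operations; with the token bucket algorithm, for example: read the current token count, read the time tokens were last taken, update the time and token count, and so on. A lua script makes these operations atomic and also saves the network cost of the gateway making several round trips to redis.

The key here is the lua script. In the Spring Cloud Greenwich release, spring-cloud-gateway ships a rate-limiting filter whose lua script is as follows: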

local tokens_key = KEYS[1]
local timestamp_key = KEYS[2]

local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local fill_time = capacity/rate
local ttl = math.floor(fill_time*10)

-- current number of tokens
local last_tokens = tonumber(redis.call("get", tokens_key))
if last_tokens == nil then
  last_tokens = capacity
end

-- time tokens were last taken
local last_refreshed = tonumber(redis.call("get", timestamp_key))
if last_refreshed == nil then
  last_refreshed = 0
end

local delta = math.max(0, now-last_refreshed)
-- add delta*rate new tokens and update the token count
local filled_tokens = math.min(capacity, last_tokens+(delta*rate))
local allowed = filled_tokens >= requested
local new_tokens = filled_tokens
local allowed_num = 0
if allowed then
  new_tokens = filled_tokens - requested
  allowed_num = 1
end

-- update the token count and timestamp in redis
redis.call("setex", tokens_key, ttl, new_tokens)
redis.call("setex", timestamp_key, ttl, now)

return { allowed_num, new_tokens }


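Q: In actual testing, everything works with only 1 gateway instance enabled. With multiple gateway instances, the effective limit turned out to be inaccurate, and the cause was eventually traced to unsynchronized clocks across the servers running the gateways.

A: When tokens are added to the bucket at a given rate, the formula is: rate * (current time - time tokens were last added). The current time is passed in by the gateway, so if the clocks on the gateway servers disagree, the script's logic breaks. One option is to keep the clocks perfectly synchronized at all times, which is practically impossible; the other is to use the redis server's time, i.e. change line 6 of the script, local now = tonumber(ARGV[3]), to: local now = redis.call("time")[1].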
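The effect of clock skew can be reproduced without redis. The sketch below transcribes the script's arithmetic into Python, with a plain dict standing in for redis and the TTL omitted; the `simulate` scenario (two gateways alternating requests within the same real second, one clock running ahead) is an assumption chosen for illustration, not the setup from the original test.

```python
def token_bucket_script(store, rate, capacity, now, requested=1):
    """Python transcription of the script's arithmetic; `store` stands in for redis (TTL omitted)."""
    last_tokens = store.get("tokens", capacity)
    last_refreshed = store.get("timestamp", 0)
    delta = max(0, now - last_refreshed)
    filled = min(capacity, last_tokens + delta * rate)
    allowed = filled >= requested
    store["tokens"] = filled - requested if allowed else filled
    store["timestamp"] = now
    return allowed


def simulate(skew_seconds, rate=100, capacity=100, true_now=1000):
    """Two gateways alternate 400 requests in one real second; gateway 2's clock is ahead by skew_seconds."""
    store = {}
    allowed = 0
    for i in range(400):
        # Even requests come from gateway 1, odd requests from the skewed gateway 2.
        now = true_now + (skew_seconds if i % 2 else 0)
        allowed += token_bucket_script(store, rate, capacity, now)
    return allowed
```

Under these assumptions, `simulate(0)` admits 100 of the 400 requests, while `simulate(1)` admits all 400: gateway 1 keeps writing the earlier timestamp back, so each request from gateway 2 sees a full second of elapsed time and refills the bucket.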

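Note:

The Lua scripting chapter of Redis 设计与实现 (Redis Design and Implementation) states that a lua script should not set random values. The relevant passage follows:

When a Lua script is replicated to a replica node, or written to the AOF file, Redis has to solve the following problem: if a script has random behavior or side effects, then when it runs on the replica, or is reloaded from the AOF file and re-run, its result may be completely different from the earlier run.

Consider the following code, in which get_random_number() is random. We execute it on server SERVER and store the resulting random number in the key number:
# Hypothetical example; this would not actually appear in the scripting environment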
redis> EVAL "return redis.call('set', KEYS[1], get_random_number())" 1 number
OK
redis> GET number
"10086"
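Now suppose the EVAL code is replicated to the replica node SLAVE. Because get_random_number() is random, it will very likely produce a value completely different from 10086, such as 65535:
# Hypothetical example; this would not actually appear in the scripting environment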
redis> EVAL "return redis.call('set', KEYS[1], get_random_number())" 1 number
OK
redis> GET number
"65535"
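As you can see, a writing script with randomness causes a serious problem: it breaks data consistency between the server and its replica.

The same problem occurs when a writing script with randomness is loaded from the AOF file.

Randomness is harmful only when a script with randomness performs writes.

If a script performs only read operations, randomness is harmless. For example, a script that merely executes the RANDOMKEY command is harmless; but a script that executes RANDOMKEY and then writes based on its result is harmful.

As with randomness, if a script's execution depends on any side effect, the result of each execution may differ.

To solve this problem, Redis places a strict restriction on the scripts the Lua environment can run: every script must be a side-effect-free pure function.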

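To that end, Redis takes a series of measures in the Lua environment:

It does not provide libraries that access system state (such as the system time library).

It forbids the loadfile function.

If a script executes a command with random behavior (such as RANDOMKEY) or with side effects (such as TIME) and then tries to execute a write command (such as SET), Redis stops the script and returns an error.

If a script executes a read command with random behavior (such as SMEMBERS), the output is automatically sorted lexicographically before it is returned to Redis, which guarantees an ordered result.

It replaces the original math.random and math.randomseed functions in the Lua math table with Redis's own random functions, which have this property: on every execution of a Lua script, unless math.randomseed is called explicitly, math.random always generates the same pseudo-random sequence.

After these adjustments, Redis can guarantee that an executed script:

Has no side effects.

Has no harmful randomness.

Always produces the same write commands for the same input arguments and data set.

Then I actually tested it and found that no error was raised?!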

10.201.0.30:6379> eval "local now = redis.call('time')[1]; return redis.call('set', 'time-test', now)" 0
OK
10.201.0.30:6379> get time-test
"1552628054"

So I checked the official documentation:

redis.io/commands/ev…

Note: starting with Redis 5, the replication method described in this section (scripts effects replication) is the default and does not need to be explicitly enabled.

Starting with Redis 3.2, it is possible to select an alternative replication method. Instead of replicating whole scripts, we can just replicate single write commands generated by the script. We call this script effects replication.

In this replication mode, while Lua scripts are executed, Redis collects all the commands executed by the Lua scripting engine that actually modify the dataset. When the script execution finishes, the sequence of commands that the script generated are wrapped into a MULTI / EXEC transaction and are sent to replicas and AOF.

This is useful in several ways depending on the use case:

When the script is slow to compute, but the effects can be summarized by a few write commands, it is a shame to re-compute the script on the replicas or when reloading the AOF. In this case to replicate just the effect of the script is much better.

When script effects replication is enabled, the controls about non deterministic functions are disabled. You can, for example, use the TIME or SRANDMEMBER commands inside your scripts freely at any place.

The Lua PRNG in this mode is seeded randomly at every call.

In order to enable script effects replication, you need to issue the following Lua command before any write operated by the script:
redis.replicate_commands()
The function returns true if the script effects replication was enabled, otherwise if the function was called after the script already called some write command, it returns false, and normal whole script replication is used.

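In short: starting with Redis 3.2, a new effects-based replication method is available for master-replica replication and for writing the AOF file. Instead of replicating the whole script, Redis can replicate just the individual write commands the script generated, which means a lua script can set non-deterministic values such as the system time. From Redis 5 onward, this is the default replication method.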
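The difference between the two replication modes can be illustrated with a toy model. This is not how Redis is implemented: the dicts stand in for datasets, `script` models a script that reads the time and then writes it, and all names are hypothetical.

```python
import itertools

_clock = itertools.count(1552628054)  # fake, ever-advancing server clock


def script(db, log=None):
    """Models a script that reads the time and then writes it: SET time-test <now>."""
    now = next(_clock)
    db["time-test"] = now
    if log is not None:
        log.append(("SET", "time-test", now))  # the write command the script produced


def replicate_whole_script(replica_db):
    """Whole-script replication: the replica re-runs the script and reads its own clock."""
    script(replica_db)


def replicate_effects(replica_db, log):
    """Script effects replication: the replica only replays the recorded writes."""
    for cmd, key, value in log:
        if cmd == "SET":
            replica_db[key] = value


master, replica_a, replica_b, log = {}, {}, {}, []
script(master, log)
replicate_whole_script(replica_a)  # diverges from the master
replicate_effects(replica_b, log)  # matches the master
```

In this model `replica_a` ends up with a later timestamp than the master, while `replica_b` holds exactly the master's value, which is why reading the time inside a script is safe under effects replication.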