Redis集群方案

时间 2019-11-08

标签 redis 集群方案栏目 Redis 繁體版

原文原文链接

前段时间搞了搞Redis集群，想用作推荐系统的线上存储，说来挺有趣，这边基础架构不太完善，所以须要咱们作推荐系统的本身来搭这个存储环境，就本身折腾了折腾。公司所给机器的单机性能其实挺给力，已经能够知足目前的业务需求，想作redis集群主要有如下几点考虑：node

一、扩展性，scale-out，之后数据量变得很大以后，不至于推到重来，redis虽然能够开启虚拟内存功能，单机也能提供超过物理内存上限的容量，但频繁在内存和硬盘间swap页会大大下降其性能，有点儿违背redis的设计初衷。git

二、redis是一个单线程io复用的结构，没法有效利用服务器的多核结构，若是能在一台多核机器起多个redis进程，共同提供服务，效率会更高一些。github

三、主从，数据备份和容灾。。web

所以计划作的redis集群但愿能够实现如下功能：redis

一、data sharding，支持数据切片。算法

二、主从备份，主节点写数据，主和从都提供读请求服务，而且支持主从自动切换。数据库

三、读请求作负载均衡。服务器

四、更好地，支持节点failover，数据自动迁移。架构

下面是先后经历的一个过程：app

【第一步】尝试官方方案

确定想去查看一下redis的官方集群方案，可是很遗憾，官方对cluster的声明以下：

Unfortunately Redis Cluster is currently not production ready, however you can get more information about it reading the specification or checking the partial implementation in the unstable branch of the Redis GitHub repositoriy.

Once Redis Cluster will be available, and if a Redis Cluster complaint client is available for your language, Redis Cluster will be the de facto standard for Redis partitioning.

Redis Cluster is a mix between query routing and client side partitioning.

因为这边想作生产环境部署，unstable branch目前仍是不敢用，在官方目前的版本上作提早开发又没有资源和时间，所以就放弃了。

【第二步】初步设想的方案

舍弃了官方的方案后，就想能不能本身搭一个，当时初步的想法是：用lvs作读请求的负载均衡，在客户端代码里本身写一个一致性hash算法作数据切片，配置redis主从，而且配置keepalived作主从自动切换。这个方案应该能够施行的，但当时本身遇到一些细节方面的问题，就在stackoverflow上问了一下，问题以下：

Since the redis cluster is still a work in progress, I want to build a simplied one by myselfin the current stage. The system should support data sharding,load balance and master-slave backup. A preliminary plan is as follows:

Master-slave: use multiple master-slave pairs in different locations to enhance the data security. Matsters are responsible for the write operation, while both masters and slaves can provide the read service. Datas are sent to all the masters during one write operation. Use Keepalived between the master and the slave to detect failures and switch master-slave automatically.
Data sharding: write a consistant hash on the client side to support data sharding during write/read in case the memory is not enougth in single machine.
Load balance: use LVS to redirect the read request to the corresponding server for the load balance.

My question is how to combine the LVS and the data sharding together?

For example, because of data sharding, all keys are splited and stored in server A,B and C without overlap. Considering the slave backup and other master-slave pairs, the system will contain 1(A,B,C), 2(A,B,C) , 3(A,B,C) and so on, where each one has three servers. How to configure the LVS to support the redirection in such a situation when a read request comes? Or is there other approachs in redis to achieve the same goal?

Thanks:)

有个网友给了两个建议：

You can really close to what you need by using:

twemproxy shard data across multiple redis nodes (it also supports node ejection and connection pooling)

redis slave master/slave replication

redis sentinel to handle master failover

depending on your needs you probably need some script listening to fail overs (see sentinel docs) and clean things up when a master goes down

这位网友的两个建议挺启发的，我在看redis的官方doc的时候，对twemproxy有一些印象，但当时没有太在乎，至于后者用redis sentinel作master failover，redis sentinel也是一个redis正在开发中的模块，我不太敢用。

另外，我舍弃本身的这个初步方案还有两个缘由：

一、本身在写客户端data sharding和均衡服务的时候，发现实际须要考虑的问题比开始想的要复杂一些，若是写完，其实至关于将twemproxy的功能作了一遍，造轮子的事情仍是少干。

二、功能作得有些冗余，一次读请求要通过客户端的sharding、而后还有通过lvs再到实际的服务器，不作优化的话，会增长很多延迟。

【第三步】最终的方案，以下图所示

图中画的挺明白了，就再也不多解释了。

twemproxy是twitter开源的一个数据库代理服务，能够用于memcached和redis的sharding，兼容两者的标准接口，可是对于redis的keys，dbsize等命令不支持，这个其实想一下也就明白了，这种pool内跨机作统计的命令proxy通常不会支持的。另外，twemproxy在自身与后台redis之间使用pipeline发送命令，所以性能损失比较小。可是，twemproxy对于每个客户端链接开启的mbuf有限，最大能够设置为64k，若是在客户端代理层与twemproxy之间也使用pipeline，这个pipeline不能太深，并且不支持pipeline的原子性（transaction），其实，这个时候，至关于客户端链接与redis数据库之间存在两层pipeline，分别是客户端到twemproxy的pipeline，和twemproy到后台redis服务器的pipeline，因为两者buffer深度不一致，所以不支持pipeline的transaction也就好理解了。。在引入了twemproxy，插入大规模数据的时候，有时候确实挺耗时，并且pipeline不保证原子性，丢数据时的恢复问题在客户端须要进行额外关注。对于非transaction的pipeline总丢数据，或者对于数据量比较大的key一次性取数据失败等问题，后来经查是twemproxy端timeou值设置太小，按照官方示例设置400ms，会在一次性操做大数据量的时候返回timeout失败，这个数值须要慎重根据业务（具体的，就是客户端单次命令操做的数据量）进行设置，通常2000ms差很少就够用了（能够支持一次操做接近百万的数据）。

上面的结构，将读操做的负载均衡放到了客户端代码来作，写操做控制也在客户端层的代码里，另外，对于twemproy单点、主从之间能够引入keepalived来消除单点和故障恢复。