Scenario: five agents start up and immediately throw a flood of exceptions like this:

```python
2017-11-24 13:57:54,760 - 8 - ERROR: - Error while reading from socket: ('Connection closed by server.',)
Traceback (most recent call last):
  File "./inject_agent/agent.py", line 78, in execute
    self._execute()
  File "./inject_agent/agent.py", line 87, in _execute
    task = self.get_task()
  File "./inject_agent/agent.py", line 44, in get_task
    body = self.db.redis_con.lpop(redis_conf["proxy_task_queue"])
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 1329, in lpop
    return self.execute_command('LPOP', name)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 493, in connect
    self.on_connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 561, in on_connect
    if nativestr(self.read_response()) != 'OK':
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 284, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 216, in readline
    self._read_from_socket()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 191, in _read_from_socket
    (e.args,))
```
The machines are Aliyun (Alibaba Cloud) servers, with a Redis instance deployed on one of them.
The setup: five machines consume tasks from Redis, and each agent runs 100 threads fetching tasks concurrently.
So the question: why would Redis actively close the socket?
The first suspicion was a Redis-side problem: too many connections, since 500 threads connect at the same time. But Redis comfortably handles tens of thousands of concurrent connections, and checking showed only around 2,000 connections, so that was not the cause.
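To rule out a connection-limit problem, one can compare the live client count against the server's configured `maxclients`. A minimal sketch using redis-py (the host/port and the helper name are illustrative, not from the original post):

```python
def connection_headroom(r):
    """Return (connected_clients, maxclients) for a Redis server.

    `r` is a redis-py client; INFO and CONFIG GET are standard
    Redis commands exposed by the client.
    """
    connected = r.info('clients')['connected_clients']
    maxclients = int(r.config_get('maxclients')['maxclients'])
    return connected, maxclients

# Usage (assumes redis-py is installed and a reachable instance):
# import redis
# r = redis.StrictRedis(host='10.0.0.1', port=6379)
# connected, limit = connection_headroom(r)
# print('%d / %d connections in use' % (connected, limit))
```

In this incident the count (~2,000) was nowhere near the limit, which is what pointed the investigation away from Redis itself.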
Checking the server logs from the same period:

```
Feb 18 12:28:38 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:44 i-*** kernel: printk: 227 messages suppressed.
Feb 18 12:28:44 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:52 i-*** kernel: printk: 121 messages suppressed.
Feb 18 12:28:52 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:53 i-*** kernel: printk: 351 messages suppressed.
Feb 18 12:28:53 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:59 i-*** kernel: printk: 319 messages suppressed.
```
Clearly, the number of TIME_WAIT sockets had exceeded the limit. Sure enough, the Aliyun kernel sets `net.ipv4.tcp_max_tw_buckets` to 5000, while our own servers set it to 2,000,000. After raising the parameter, the service returned to normal.
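The current value can be read straight from the standard Linux sysctl location under `/proc`; a small helper (the function name is mine, not from the original code):

```python
def read_tw_buckets(path='/proc/sys/net/ipv4/tcp_max_tw_buckets'):
    """Read the kernel's TIME_WAIT bucket limit; None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (IOError, OSError, ValueError):
        return None

# To raise the limit at runtime (as root):
#   sysctl -w net.ipv4.tcp_max_tw_buckets=2000000
# and persist it by adding the same line to /etc/sysctl.conf.
```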
We later found that this failure mode can lose tasks: the agent never receives the task, but the server has already popped it (the business logic confirmed the loss). We hit a similar problem before when using RabbitMQ. So for critical business flows, pay attention to the consumption pattern and use an acknowledgment (ack) mechanism.
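Redis lists have no built-in ack, but a common workaround is the reliable-queue pattern: atomically move each task into a per-consumer processing list with BRPOPLPUSH, and remove it only after processing succeeds; anything stranded in the processing list can be re-queued on restart. A sketch assuming redis-py (the queue names and `handle` function are hypothetical):

```python
def reliable_pop(r, task_queue, processing_queue, timeout=5):
    """Atomically move one task from task_queue to processing_queue.

    Returns the task, or None on timeout. If the consumer dies after
    this call, the task survives in processing_queue for recovery.
    """
    return r.brpoplpush(task_queue, processing_queue, timeout=timeout)

def ack(r, processing_queue, task):
    """Acknowledge: drop the finished task from the processing list.

    Argument order (name, count, value) follows redis-py >= 3.0;
    older 2.x clients used lrem(name, value, num).
    """
    r.lrem(processing_queue, 1, task)

# Usage (assumes redis-py is installed and a reachable instance):
# import redis
# r = redis.StrictRedis()
# task = reliable_pop(r, 'proxy_task_queue', 'proxy_task_processing')
# if task is not None:
#     handle(task)   # hypothetical processing step
#     ack(r, 'proxy_task_processing', task)
```

With plain LPOP, a task popped just as the connection is torn down is gone for good; with this pattern it merely sits in the processing list until acknowledged or re-queued.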