An analysis of why Redis reports "Connection closed by server"

Scenario: five agents start up and immediately throw a large number of exceptions like the following:

2017-11-24 13:57:54,760 - 8 - ERROR: - Error while reading from socket: ('Connection closed by server.',)
Traceback (most recent call last):
  File "./inject_agent/agent.py", line 78, in execute
    self._execute()
  File "./inject_agent/agent.py", line 87, in _execute
    task = self.get_task()
  File "./inject_agent/agent.py", line 44, in get_task
    body = self.db.redis_con.lpop(redis_conf["proxy_task_queue"])
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 1329, in lpop
    return self.execute_command('LPOP', name)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 493, in connect
    self.on_connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 561, in on_connect
    if nativestr(self.read_response()) != 'OK':
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 284, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 216, in readline
    self._read_from_socket()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 191, in _read_from_socket
    (e.args,))

The machines are Alibaba Cloud (Aliyun) servers, with a Redis instance deployed on one of them.

The setup: five machines pull tasks from Redis for consumption, and each agent polls with 100 threads at the same time.
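For context, here is a minimal sketch of what each agent roughly does, assuming the proxy_task_queue list from the traceback above; the connection settings and the handle_task helper are placeholders, not the original agent code:

import time
import threading
import redis

# Placeholder connection settings; the real agent reads them from redis_conf.
pool = redis.ConnectionPool(host='10.0.0.1', port=6379, db=0)
r = redis.Redis(connection_pool=pool)

def handle_task(task):
    # Hypothetical stand-in for the real injection work.
    pass

def worker():
    while True:
        # Each thread polls the queue; 100 threads per agent and 5 agents
        # means roughly 500 clients hitting Redis at the same time.
        task = r.lpop('proxy_task_queue')
        if task is None:
            time.sleep(1)
            continue
        handle_task(task)

threads = [threading.Thread(target=worker) for _ in range(100)]
for t in threads:
    t.start()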

So the question is: why would Redis actively close the socket?

The first suspicion was a problem on the Redis side, namely too many connections, since about 500 threads connect at once. But Redis handles tens of thousands of concurrent reads without trouble, and a check showed only around 2,000 connections, so that was not the cause.
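One way to verify that suspicion is to ask Redis itself how many clients are connected. A quick check with redis-py (host and port are placeholders):

import redis

r = redis.Redis(host='10.0.0.1', port=6379)

# INFO clients reports the number of currently connected clients;
# here it was only around 2,000.
print(r.info('clients')['connected_clients'])

# The server-side connection limit for comparison (10000 by default
# on recent Redis versions).
print(r.config_get('maxclients'))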

Checking the server logs from the same time window:

Feb 18 12:28:38 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:44 i-*** kernel: printk: 227 messages suppressed.
Feb 18 12:28:44 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:52 i-*** kernel: printk: 121 messages suppressed.
Feb 18 12:28:52 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:53 i-*** kernel: printk: 351 messages suppressed.
Feb 18 12:28:53 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:59 i-*** kernel: printk: 319 messages suppressed.

Clearly the number of TIME_WAIT sockets had exceeded the limit. Sure enough, the Aliyun kernel sets net.ipv4.tcp_max_tw_buckets to 5000, whereas our own servers set it to 2,000,000. After fixing the parameter, the service returned to normal.
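The two numbers involved can be cross-checked directly on the box, for example by reading /proc from Python (a sketch; the fix itself is a sysctl change to net.ipv4.tcp_max_tw_buckets):

# Current kernel limit; on the Aliyun image it was 5000.
with open('/proc/sys/net/ipv4/tcp_max_tw_buckets') as f:
    max_tw = int(f.read())

# Count IPv4 sockets currently in TIME_WAIT (state 06 in /proc/net/tcp).
time_wait = 0
with open('/proc/net/tcp') as f:
    next(f)  # skip the header line
    for line in f:
        if line.split()[3] == '06':
            time_wait += 1

print('tcp_max_tw_buckets = %d, TIME_WAIT sockets = %d' % (max_tw, time_wait))

When the limit is exceeded, the kernel logs the "time wait bucket table overflow" message seen above instead of keeping sockets in TIME_WAIT.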

We then noticed that this situation can lose tasks: the agent never receives a task even though the server side has already popped it (the business logic confirmed that tasks really were lost). We ran into a similar problem with RabbitMQ before, so for critically important workloads pay attention to how you consume, e.g. use an ack-style mechanism, as sketched below.
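On a plain Redis list, ack-like behaviour is usually approximated with the reliable-queue pattern: move each task into a per-consumer processing list with RPOPLPUSH and only delete it after the work succeeds. A minimal sketch (queue names and handle_task are illustrative, not the original agent code; note that RPOPLPUSH consumes from the opposite end of the list than LPOP):

import redis

r = redis.Redis(host='10.0.0.1', port=6379)

TASK_QUEUE = 'proxy_task_queue'
PROCESSING = 'proxy_task_queue:processing'

def handle_task(task):
    # Hypothetical stand-in for the real work.
    pass

def consume_one():
    # Atomically move a task to the processing list instead of LPOP,
    # so a crash between pop and processing cannot lose it.
    task = r.rpoplpush(TASK_QUEUE, PROCESSING)
    if task is None:
        return
    handle_task(task)
    # The "ack": remove the task only after it was handled successfully.
    # (In redis-py 2.x the argument order is lrem(name, value, num) instead.)
    r.lrem(PROCESSING, 1, task)

Tasks that sit in the processing list longer than some timeout can be pushed back onto the main queue by a recovery job, which is where the reliability actually comes from.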
