【Original】Problems You May Run Into When Using the Heartbeat Feature in RabbitMQ


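【Problem Scenario】
      A client subscribes as a consumer to a queue on a RabbitMQ server. In the AMQP Connection.Tune-Ok method, the client sets heartbeat to 0, which asks the server not to enable heartbeats. The server then stops serving because of an unexpected power loss, and the client has no way of noticing within a short time that the server has failed.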

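       When this problem first appeared, testers and application developers came to me and said: the modified rabbitmq-c library may have a serious bug; the server has been shut down, so how can the client keep working as if nothing had happened? On hearing this, I only had to ask three questions to arrive at the answer:
  • Does the application run only as a consumer?
  • Can you confirm that the server stopped serving because of an unexpected power loss?
  • Are there any intermediate routing devices between the server and the application?
The answers were: yes, yes, and no. With that, the cause was already clear. Have you figured it out?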

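【Problem Analysis】
The problem can be analyzed at the following two levels:
1. The TCP level
      At this level, the problem is a classic TCP "half-open" connection. The typical description is as follows:
A TCP connection is said to be half-open if one end has closed or aborted the connection without the other end knowing. This can happen whenever either host fails abnormally. As long as there is no attempt to transfer data across a half-open connection, the end that is still up will not detect that the other end has failed.
A common cause of a half-open connection is a client host losing power abruptly, instead of the client application being terminated normally before the host is shut down. Of course, "client host" here does not refer only to the client side.
      When this happens, the side that only receives and never sends on the TCP connection can rely only on TCP's own keepalive mechanism to check whether the link is still healthy. With the usual default settings, keepalive needs about 2 hours of idle time before it is triggered.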
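The 2-hour figure is a default rather than a fixed limit. As a minimal sketch, assuming a Linux host (the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are Linux-specific) and a hypothetical helper name enable_fast_keepalive, the keepalive timers of a single socket could be shortened like this:

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable TCP keepalive on fd and shorten its timers.
 *   idle_s  - seconds of idle time before the first probe is sent
 *   intvl_s - seconds between unanswered probes
 *   count   - unanswered probes before the kernel drops the connection
 * Returns 0 on success, -1 if any setsockopt() call fails. */
int enable_fast_keepalive(int fd, int idle_s, int intvl_s, int count)
{
    int on = 1;

    assert(idle_s > 0 && intvl_s > 0 && count > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) != 0)
        return -1;
    return 0;
}
```

With idle_s = 30, intvl_s = 10 and count = 3, a dead peer would be detected after roughly 30 + 10 * 3 = 60 seconds. This only covers the TCP level; it does not replace the AMQP heartbeat discussed next.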

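2. The AMQP level
      At this level, because the client subscribes to the queue as a consumer, it never actively sends data to the RabbitMQ server on that AMQP/TCP connection. When the server stops serving because of an unexpected power loss, the consumer receives no AMQP-level close method, so it cannot learn the state of its peer.
      One possible solution is for the client, after N consecutive receive timeouts, to send an AMQP Heartbeat frame to check whether the server is still healthy.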
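For reference, the probe itself is tiny. In AMQP 0-9-1 a heartbeat frame has frame type 8, channel 0, an empty payload and the frame-end octet 0xCE, which makes 8 bytes in total. The sketch below builds and recognizes such a frame; the names hb_frame_encode and hb_frame_is_heartbeat are hypothetical and are not part of rabbitmq-c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    HB_FRAME_TYPE = 8,    /* AMQP 0-9-1 frame type for heartbeat */
    HB_FRAME_END  = 0xCE, /* frame-end octet */
    HB_FRAME_LEN  = 8     /* 1 type + 2 channel + 4 size + 1 frame-end */
};

/* Write a heartbeat frame into buf.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t hb_frame_encode(uint8_t *buf, size_t cap)
{
    size_t i;

    assert(buf != NULL);
    if (cap < HB_FRAME_LEN)
        return 0;
    buf[0] = HB_FRAME_TYPE;
    for (i = 1; i < 7; i++)
        buf[i] = 0;       /* channel 0 (2 bytes) and payload size 0 (4 bytes) */
    buf[7] = HB_FRAME_END;
    return HB_FRAME_LEN;
}

/* Return 1 if buf starts with a well-formed heartbeat frame, 0 otherwise. */
int hb_frame_is_heartbeat(const uint8_t *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    if (len < HB_FRAME_LEN || buf[0] != HB_FRAME_TYPE || buf[7] != HB_FRAME_END)
        return 0;
    for (i = 1; i < 7; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}
```

Writing these 8 bytes to a half-open connection is what eventually surfaces the failure as a socket error on the sending side.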


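      The scenario description says that "the client sets heartbeat to 0 in the AMQP Connection.Tune-Ok method". What happens if heartbeat is set to 30 instead? The heartbeat feature is then enabled on both the server and the client. The server sends a heartbeat frame to the client whenever it has had no data to send for a period of time; and if it receives nothing at all for a period of time, it treats this as a heartbeat timeout and eventually closes the TCP connection (see here). The client likewise has to maintain a send timer and a receive timer for heartbeats, used to detect send-side and receive-side timeouts respectively.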

The amqp.h header file shows the current state of heartbeat support in rabbitmq-c:
* \param [in] heartbeat the number of seconds between heartbeat frame to 
 *             request of the broker. A value of 0 disables heartbeats. 
 *             Note rabbitmq-c only has partial support for hearts, as of 
 *             v0.4.0 heartbeats are only serviced during amqp_basic_publish(), 
 *             and amqp_simple_wait_frame()/amqp_simple_wait_frame_noblock()
In the rabbitmq-c 0.4.1 release currently on GitHub, heartbeat support is limited to the 3 APIs named above.

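      So the problem to solve can be stated as follows: after the client subscribes as a consumer to a queue on the server, and while there is no business data to process, it needs to watch for Heartbeat frames to determine whether the server is in an abnormal state (in other words, whether its own TCP connection has become "half-open").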


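【Solution】
The recommended solution is as follows:
  • The client must enable heartbeats (the basis for solving the "half-open" problem);
  • The client needs to send heartbeats when its send side is idle (because the client, as a producer, currently holds a long-lived connection to the RabbitMQ server);
  • The client needs to judge, when its receive side is idle, whether the server (or the network) is healthy by watching for heartbeat frames sent by the server (because the client, as a consumer, also holds a long-lived connection to the RabbitMQ server and never actively sends data to it).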

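Summary:
      As long as the client enables heartbeats, the server will periodically send heartbeat frames to the client when "certain conditions" are met, and will also check whether it has received a heartbeat frame once the connection has been idle for the configured time. When the client acts as a consumer, it needs to check whether any data has arrived (regular data or heartbeat frames); if nothing arrives within a certain time, the link is presumed to be broken. The application can then trigger re-establishment of the consume relationship.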
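The consumer-side check can be sketched with poll(). This is an illustration for POSIX systems only, not the author's implementation: the function recv_or_give_up and the constant LINK_SILENT are hypothetical names. It waits in fixed slices (10 seconds each in the log below) and gives up after a given number of consecutive empty slices.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

#define LINK_SILENT (-2) /* no traffic at all within the allowed time */

/* Wait for data on fd in slices of slice_ms milliseconds.
 * Returns the number of bytes read (> 0), 0 if the peer closed the
 * connection, -1 on error, or LINK_SILENT after max_slices slices
 * without any traffic. */
ssize_t recv_or_give_up(int fd, void *buf, size_t cap,
                        int slice_ms, int max_slices)
{
    struct pollfd p;
    int i;

    assert(buf != NULL && cap > 0 && slice_ms > 0 && max_slices > 0);
    p.fd = fd;
    p.events = POLLIN;

    for (i = 0; i < max_slices; i++) {
        int rc;

        p.revents = 0;
        rc = poll(&p, 1, slice_ms);
        if (rc < 0) {
            if (errno == EINTR)
                continue;            /* interrupted: count the slice, retry */
            return -1;
        }
        if (rc > 0)
            return read(fd, buf, cap); /* data, EOF, or a socket error */
    }
    return LINK_SILENT;
}
```

On LINK_SILENT the caller would close the connection and rebuild the consume relationship; with slice_ms = 10000 and max_slices = 6 that corresponds to 60 seconds of silence.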


      The following is the log output with heartbeats enabled:
Output when the network is disconnected while running as a consumer
[warn] evsignal_init: socketpair: No error
drive_machine: [conn_init]  ---  TCP 3-way handshake start! --> [172.16.81.111:5672][s:53144]
drive_machine: [conn_connecting]  ---  connection timeout 1 time on socket(53144)
drive_machine: [conn_connected]  ---  connected on socket(53144)
53144: conn_state change   connected ==> snd_protocol_header
  --> Send Protocol.Header!
53144: conn_state change   snd_protocol_header ==> rcv_connection_start_method
[53144] drive_machine: wait for Connection.Start method another 10 seconds!!
  <-- Recv Connection.Start Method frame!
53144: conn_state change   rcv_connection_start_method ==> snd_connection_start_rsp_method
  --> Send Connection.Start-Ok Method frame!
53144: conn_state change   snd_connection_start_rsp_method ==> rcv_connection_tune_method
  <-- Recv Connection.Tune Method frame!
53144: conn_state change   rcv_connection_tune_method ==> snd_connection_tune_rsp_method
  --> Send Connection.Tune-Ok Method frame!
53144: conn_state change   snd_connection_tune_rsp_method ==> snd_connection_open_method
  --> Send Connection.Open Method frame!
53144: conn_state change   snd_connection_open_method ==> rcv_connection_open_rsp_method
  <-- Recv Connection.Open-Ok Method frame!
53144: conn_state change   rcv_connection_open_rsp_method ==> snd_channel_open_method
  --> Send Channel.Open Method frame!
53144: conn_state change   snd_channel_open_method ==> rcv_channel_open_rsp_method
[53144] drive_machine: wait for Channel.Open-Ok method another 10 seconds!!
  <-- Recv Channel.Open-Ok Method frame!
53144: conn_state change   rcv_channel_open_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Queue Declaring!
53144: conn_state change   idle ==> snd_queue_declare_method
  --> Send Queue.Declare Method frame!
53144: conn_state change   snd_queue_declare_method ==> rcv_queue_declare_rsp_method
[53144] drive_machine: wait for Queue.Declare-Ok method another 10 seconds!!
  <-- Recv Queue.Declare-Ok Method frame!
53144: conn_state change   rcv_queue_declare_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Queue Binding!
53144: conn_state change   idle ==> snd_queue_bind_method
  --> Send Queue.Bind Method frame!
53144: conn_state change   snd_queue_bind_method ==> rcv_queue_bind_rsp_method
[53144] drive_machine: wait for Queue.Bind method another 10 seconds!!
  <-- Recv Queue.Bind Method frame!
need to code something!
53144: conn_state change   rcv_queue_bind_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Basic QoS!
53144: conn_state change   idle ==> snd_basic_qos_method
  --> Send Basic.Qos Method frame!
53144: conn_state change   snd_basic_qos_method ==> rcv_basic_qos_rsp_method
  <-- Recv Queue.Qos-Ok Method frame!
need to code something!
53144: conn_state change   rcv_basic_qos_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Basic Consuming!
53144: conn_state change   idle ==> snd_basic_consume_method
  --> Send Basic.Consume Method frame!
53144: conn_state change   snd_basic_consume_method ==> rcv_basic_consume_rsp_method
  <-- Recv Basic.Consume-Ok Method frame!
need to code something!
53144: conn_state change   rcv_basic_consume_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Start waiting to recv!
53144: conn_state change   idle ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
      ### Recv AMQP_FRAME_HEARTBEAT frame! ###
  <-- Recv Heartbeat frame!
53144: conn_state change   rcv_basic_deliver_method ==> snd_heartbeat
  --> Send Heartbeat frame!
53144: conn_state change   snd_heartbeat ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
      ### Recv AMQP_FRAME_HEARTBEAT frame! ###
  <-- Recv Heartbeat frame!
53144: conn_state change   rcv_basic_deliver_method ==> snd_heartbeat
  --> Send Heartbeat frame!
53144: conn_state change   snd_heartbeat ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
      ### Recv AMQP_FRAME_HEARTBEAT frame! ###
  <-- Recv Heartbeat frame!
53144: conn_state change   rcv_basic_deliver_method ==> snd_heartbeat
  --> Send Heartbeat frame!
53144: conn_state change   snd_heartbeat ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: Recv nothing for 60s!
[53144] drive_machine: Maybe network broken or rabbitmq server fucked! Plz retry consuming!
53144: conn_state change   rcv_basic_deliver_method ==> close
[53144] drive_machine: [conn_close]  ---  Connection Disconnect!
### CB: Connection Disconnect!    Msg : [Connection Disconnect]



Output when the network is disconnected while running as a producer
[warn] evsignal_init: socketpair: No error
drive_machine: [conn_init]  ---  TCP 3-way handshake start! --> [172.16.81.111:5672][s:12184]
drive_machine: [conn_connecting]  ---  connection timeout 1 time on socket(12184)
drive_machine: [conn_connected]  ---  connected on socket(12184)
12184: conn_state change   connected ==> snd_protocol_header
  --> Send Protocol.Header!
12184: conn_state change   snd_protocol_header ==> rcv_connection_start_method
  <-- Recv Connection.Start Method frame!
12184: conn_state change   rcv_connection_start_method ==> snd_connection_start_rsp_method
  --> Send Connection.Start-Ok Method frame!
12184: conn_state change   snd_connection_start_rsp_method ==> rcv_connection_tune_method
[12184] drive_machine: wait for Connection.Tune method another 10 seconds!!
  <-- Recv Connection.Tune Method frame!
12184: conn_state change   rcv_connection_tune_method ==> snd_connection_tune_rsp_method
  --> Send Connection.Tune-Ok Method frame!
12184: conn_state change   snd_connection_tune_rsp_method ==> snd_connection_open_method
  --> Send Connection.Open Method frame!
12184: conn_state change   snd_connection_open_method ==> rcv_connection_open_rsp_method
[12184] drive_machine: wait for Connection.Open-Ok method another 10 seconds!!
  <-- Recv Connection.Open-Ok Method frame!
12184: conn_state change   rcv_connection_open_rsp_method ==> snd_channel_open_method
  --> Send Channel.Open Method frame!
12184: conn_state change   snd_channel_open_method ==> rcv_channel_open_rsp_method
[12184] drive_machine: wait for Channel.Open-Ok method another 10 seconds!!
  <-- Recv Channel.Open-Ok Method frame!
12184: conn_state change   rcv_channel_open_rsp_method ==> snd_channel_confirm_select_method
  --> Send Confirm.Select Method frame!
12184: conn_state change   snd_channel_confirm_select_method ==> rcv_channel_confirm_select_rsp_method
  <-- Recv Confirm.Select-Ok Method frame!
Channel in Confirm Mode!
12184: conn_state change   rcv_channel_confirm_select_rsp_method ==> idle
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find msg to send!
12184: conn_state change   idle ==> snd_basic_publish_method
  --> Send Basic.Publish Method frame!
12184: conn_state change   snd_basic_publish_method ==> snd_basic_content_header
  --> Send Content-Header frame!
12184: conn_state change   snd_basic_content_header ==> snd_basic_content_body
  --> Send Content-Body frame!
12184: conn_state change   snd_basic_content_body ==> rcv_basic_ack_method
  <-- Recv Basic.Ack Method frame!
### CB: Publisher Confirm -- [Basic.Ack]  Delivery_Tag:[1]  multiple:[0]
12184: conn_state change   rcv_basic_ack_method ==> idle
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
12184: conn_state change   idle ==> snd_heartbeat
  --> Send Heartbeat frame!
12184: conn_state change   snd_heartbeat ==> idle
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
12184: conn_state change   idle ==> snd_heartbeat
[12184] drive_machine: Send Heartbeat failed! status = -9
12184: conn_state change   snd_heartbeat ==> close
[12184] drive_machine: [conn_close]  ---  Connection Disconnect!
### CB: Connection Disconnect!    Msg : [Connection Disconnect]