A critical business system had been throwing frequent transaction failures and then suffered a large-scale outage.
The system uses two IBM Power 740 servers running AIX 6.1 with Oracle 11gR2 RAC as database servers, two Dell PowerEdge R720 machines as application servers, and an F5 load balancer in front.
A Cisco ASA 5585 hardware firewall between the database servers and the application servers handles access control.
The application servers run in-house C programs that access the RAC database through the Oracle Client interfaces over long-lived connections.
When the fault occurred, the first thing checked was the load on the database servers: compared with normal operation the CPU load was somewhat high and the I/O load was clearly elevated, with IOPS reaching around 13,000.
(Monitoring screenshots: load under normal conditions vs. load during the incident.)
The database logs contained a large number of TNS errors:
Mon Feb 23 13:22:16 2015
***********************************************************************

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
        TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
        Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 11.2.0.3.0 - Production
  Time: 23-Feb-2015 13:22:16
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 78
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.1.32.70)(PORT=37975))

Identical entries were logged at the same moment for the other client connections from 10.1.32.70, on ports 25972, 9108, 52073 and 49148.
Those client addresses are the connections from the application server. On appsrv1, however, the very same connections still showed up as ESTABLISHED:
[sysadmin@appsrv1 ~]$ netstat -an|grep 10.1.197.15
tcp 0 0 10.1.32.70:37975 10.1.197.15:1521 ESTABLISHED
tcp 0 0 10.1.32.70:25972 10.1.197.15:1521 ESTABLISHED
tcp 0 0 10.1.32.70:9108 10.1.197.15:1521 ESTABLISHED
tcp 0 0 10.1.32.70:52073 10.1.197.15:1521 ESTABLISHED
tcp 0 0 10.1.32.70:49148 10.1.197.15:1521 ESTABLISHED
.
.
.
.
Checking the kernel parameters on the application server turned up the following setting:
net.netfilter.nf_conntrack_tcp_timeout_established = 300
This is the parameter that makes iptables purge the state of an established connection after 300 seconds of inactivity; the default is 432000 seconds (5 days). With the application's long-lived database connections sitting idle for more than 300 seconds, their conntrack entries were being purged and the subsequent traffic dropped, which matches the TNS timeouts seen on the database side.
Asked who had set this parameter, the application team replied that it had been changed during pre-launch performance testing on the advice of the application vendor; why it was changed, and why to this particular value, nobody could say.
Fine. This had to be the culprit: changing it back to the default of 432000 should solve the problem. Get the application team to tune their SQL while they are at it, and the incident report can be written.
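Before closing the ticket it is worth confirming the value actually in effect on the application server. A minimal Python sketch, assuming the nf_conntrack module is loaded so the setting is exposed under /proc/sys (the helper name is ours, purely illustrative):

# Read the conntrack timeout for established connections directly from /proc.
# Assumes the nf_conntrack module is loaded on this host.
def established_timeout():
    path = "/proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established"
    with open(path) as f:
        return int(f.read().strip())

print(established_timeout())   # 300 before the change, 432000 once the default is back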
Is that the end of the story? Of course not. After the value was adjusted things were quiet at first, but after a while the application team reported business timeouts again, with a growing frequency: from once or twice a day to ten or twenty times a day. No peace of mind at all.
Digging further, this time the problem seemed to have nothing to do with the database servers: it was the connections between this application server and the service-bus application that were failing. When the service-bus server initiated a connection to the application server, the connection was occasionally refused.
Since the application servers sit behind the F5 load balancer, the application's service status on the F5 was checked first; everything looked fine, and the F5 health checks for the application showed no anomalies. Fine, time to look at the actual traffic and see why connections were being refused.
Communication between the service-bus server and this application server uses short-lived connections, initiated by the service-bus side whenever a business call is made. The application team located a refused call, pulled the relevant port information out of the debug logs, and asked the network team to capture the corresponding packets.
The capture shows that after svcbus2 sent its request to appsrv1 the request was acknowledged as usual, but one minute later svcbus2 closed the connection. Some three minutes after that, when appsrv1 had finished processing the request and returned its data, the data was rejected by svcbus2; appsrv1 then kept retransmitting until it finally timed out.
The application team explained that the application on svcbus2 has a 60-second timeout: if the peer does not return a result within 60 seconds, the request is abandoned and the connection is closed.
The packets also show exactly that: svcbus2 sent a FIN, but because appsrv1 had not finished processing, the close was only one-sided; the connection was not fully torn down until appsrv1 finished, retransmitted its data and timed out.
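The svcbus2 side of that exchange can be pictured with a minimal Python sketch; host, port and payload below are illustrative placeholders, not taken from the real service-bus application:

import socket

# Illustrative short-lived client with a 60-second application timeout,
# mirroring the behaviour described above. Host and port are placeholders.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(60)                      # wait at most 60 seconds for the reply
s.connect(("appsrv1", 9102))
s.send("request")
try:
    reply = s.recv(4096)              # appsrv1 needs several minutes here
except socket.timeout:
    reply = None                      # give up on the call
finally:
    s.close()                         # our FIN leaves a half-closed connection that appsrv1 will still answer into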
But what does this have to do with new connections from svcbus2 to appsrv1 being refused? For a while I could not see it either. The iptables rules on appsrv1 keep the service port open at all times and should never refuse a new connection, unless there is something special about that particular new connection.
Some relevant information turned up on Red Hat's customer portal.
https://access.redhat.com/solutions/73353 iptables randomly drops new connection requests
if /proc/net/ipv4/netfilter/ip_conntrack_tcp_loose is set to 1, iptables creates a new connection-tracking entry after receiving any packet, not just packets with the SYN flag set
the ip_conntrack_tcp_loose setting is useful in firewalls for keeping already established connections unaffected if the firewall restarts, but this causes issues in the following scenario:
a client initiates a new connection with the same source port it used in a previous connection
the server already has an iptables connection-tracking entry for the same IP address and port, with states ESTABLISHED and UNREPLIED
the default rule gets hit, instead of the desired one
the packet is dropped, and the server returns icmp-host-prohibited
https://access.redhat.com/solutions/73273 Why iptables sporadically drops initial connections requests?
This behavior is caused by enabling ip_conntrack_tcp_loose sysctl parameter which enables iptables to create a conntrack entry whenever it sees any packet from any direction and not just SYN packet.
This is called connection picking and is usually used by the firewalls to ensure that the ongoing end to end sessions remain active if for some reason the firewall gets restarted. But this feature is not needed in standalone servers.
Both articles describe situations where iptables drops packets even though the rules say allow. In short: by default (when /proc/net/ipv4/netfilter/ip_conntrack_tcp_loose is 1), iptables records state even for TCP connections it has not seen from the start, so that already-established connections survive an iptables restart; but this behaviour can bite in a few corner cases.
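A quick way to see whether this loose pickup is in effect on a server is to read the flag from /proc; the path below assumes the nf_conntrack module, while older ip_conntrack-based kernels expose it as /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_loose instead:

# Check the conntrack "loose" pickup flag; "1" means any packet, not only a SYN,
# may create a new conntrack entry.
with open("/proc/sys/net/netfilter/nf_conntrack_tcp_loose") as f:
    print(f.read().strip())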
Our application hit exactly such a corner case. When svcbus2 timed out waiting for appsrv1's reply, it closed the connection. In the nf_conntrack table on appsrv1, the entry for this connection moved to CLOSE_WAIT when the FIN from svcbus2 arrived, and 60 seconds later the entry was purged. But when appsrv1 later started sending its response data, an entry for the connection reappeared in the nf_conntrack table, this time in state ESTABLISHED [UNREPLIED].
[root@appsrv1 ~]# grep 51522 /proc/net/nf_conntrack
ipv4 2 tcp 6 35 CLOSE_WAIT src=10.1.32.70 dst=10.1.41.192 sport=9102 dport=51522 src=10.1.41.192 dst=10.1.32.70 sport=51522 dport=9102 [ASSURED] mark=0 secmark=0 use=2
.
.
wait more seconds
.
.
[root@appsrv1 ~]# grep 51522 /proc/net/nf_conntrack
ipv4 2 tcp 6 431965 ESTABLISHED src=10.1.32.70 dst=10.1.41.192 sport=9102 dport=51522 [UNREPLIED] src=10.1.41.192 dst=10.1.32.70 sport=51522 dport=9102 mark=0 secmark=0 use=2
The whole sequence is easy to reproduce with a simple test:
1. Configure iptables on the server to allow TCP port 8888.
2. Use Python to create a service listening on TCP port 8888.
[root@server ~]# python
Python 2.6.6 (r266:84292, Sep 4 2013, 07:46:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> serversocket.bind(("0.0.0.0",8888))
>>> serversocket.listen(5)
>>> (clientsocket, address) = serversocket.accept()
3. From the client, initiate a connection using TCP 22222 as the source port and send some request data.
[root@client ~]# python
Python 2.6.6 (r266:84292, Sep 4 2013, 07:46:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.bind(("0.0.0.0",22222))
>>> s.connect(("1.1.1.101",8888))
>>> s.send("request",100)
Check the conntrack state on the server:
[root@server ~]# grep 103 /proc/net/nf_conntrack
ipv4 2 tcp 6 431949 ESTABLISHED src=1.1.1.103 dst=1.1.1.101 sport=22222 dport=8888 src=1.1.1.101 dst=1.1.1.103 sport=8888 dport=22222 [ASSURED] mark=0 secmark=0 use=2
Wait some seconds, then close the connection on client.
>>> s.close()
>>> exit()
Check the conntrack state on the server again:
[root@server ~]# grep 103 /proc/net/nf_conntrack
ipv4 2 tcp 6 54 CLOSE_WAIT src=1.1.1.103 dst=1.1.1.101 sport=22222 dport=8888 src=1.1.1.101 dst=1.1.1.103 sport=8888 dport=22222 [ASSURED] mark=0 secmark=0 use=2
[root@server ~]# sleep 55
[root@server ~]# grep 103 /proc/net/nf_conntrack
The entry on the server has disappeared.
4. Now have the server send its response data to the client:
>>> clientsocket.recv(1000)
'request'
>>> clientsocket.send("respond",100)
Looking at the conntrack state on the server again, there is now an ESTABLISHED [UNREPLIED] entry, while the TCP connection itself is in CLOSE_WAIT:
[root@server ~]# grep 103 /proc/net/nf_conntrack
ipv4 2 tcp 6 431996 ESTABLISHED src=1.1.1.101 dst=1.1.1.103 sport=8888 dport=22222 [UNREPLIED] src=1.1.1.103 dst=1.1.1.101 sport=22222 dport=8888 mark=0 secmark=0 use=2
[root@server ~]# netstat -ntp|grep 103
tcp 1 7 1.1.1.101:8888 1.1.1.103:22222 CLOSE_WAIT 28978/python
At this point a wireshark capture also shows the client rejecting the server's response data and sending ICMP [host administratively prohibited].
After the server's CLOSE_WAIT TCP connection times out, the ESTABLISHED [UNREPLIED] entry still remains in the nf_conntrack table, governed by the net.netfilter.nf_conntrack_tcp_timeout_established parameter (with the restored default of 432000 seconds, such a stale entry can linger for up to five days).
[root@server ~]# netstat -ntp|grep 103
[root@server ~]# grep 103 /proc/net/nf_conntrack
ipv4 2 tcp 6 431066 ESTABLISHED src=1.1.1.101 dst=1.1.1.103 sport=8888 dport=22222 [UNREPLIED] src=1.1.1.103 dst=1.1.1.101 sport=22222 dport=8888 mark=0 secmark=0 use=2
The listening port itself is still perfectly normal:
[root@server ~]# netstat -nplt|grep 8888
tcp 0 0 0.0.0.0:8888 0.0.0.0:* LISTEN 28978/python
5. Now, when the client initiates a new connection to TCP 8888 using TCP 22222 as the source port again, iptables on the server rejects the SYN: the packet matches the stale conntrack entry instead of being treated as a NEW connection, falls through to the default reject rule, and the client's connect fails with the "No route to host" error produced by the icmp-host-prohibited reject.
[root@client ~]# python
Python 2.6.6 (r266:84292, Sep 4 2013, 07:46:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.bind(("0.0.0.0",22222))
>>> s.connect(("1.1.1.101",8888))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 1, in connect
socket.error: [Errno 113] No route to host
>>>