最近作性能测试时,发现链接第三方系统时会有约1%的交易提示以下错误java
nested exception is org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
起先抱着能google就google的思路去找,结果没找到相应的解决方案,只能本身一步一步去排查了,下面记录下排查过程。python
涉及到网络的问题,先抓个包,tcpdump走一波。shell
tcpdump -i eth0 host xxx.xxx.xxx.xxx > tcp.dmp
节选了一次TCP链接的整个内容,从三次握手到报文发送到四次挥手,下面将一次通信分为三个部分进行讲解apache
时间戳 协议类型 请求方地址>接收方地址 Flags [xxx], seq xxx,......,length
协议类型:此处针对TCP/IP协议,为IP网络
Flags常见值及含义并发
标识 | 含义 |
---|---|
S | 即SYN,用于创建链接 |
. | .即ACK,可与其余组合使用好比[S.]表明SYN的ACK回执 |
P | 即PSH,指示接收方应当即将数据交给上层 |
F | 即FIN,用于断开链接 |
09:23:00.038908 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [S], seq 2590465106, win 29200, options [mss 1460,sackOK,TS val 1752507811 ecr 0,nop,wscale 7], length 0 09:23:00.039329 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [S.], seq 148046299, ack 2590465107, win 14480, options [mss 1460,sackOK,TS val 2174555693 ecr 1752507811,nop,wscale 7], length 0 09:23:00.039341 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [.], ack 1, win 229, options [nop,nop,TS val 1752507811 ecr 2174555693], length 0
客户端发送SYN=1,seq=随机生成的序号
到服务端,对应第一个包内容app
服务端响应SYN=1,seq=服务端随机序号,ack=第一步中的seq+1
给客户端,对应第二个包内容tcp
客户端发送SYN=0,seq=第一步中的seq+2,ack=服务端的随机序号+1
给服务端,对应第三个包性能
至此链接创建完成测试
TCP链接为全双工的,既能够C->S,亦可S->C
09:23:00.039406 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [P.], seq 1:167, ack 1, win 229, options [nop,nop,TS val 1752507811 ecr 2174555693], length 166 09:23:00.039426 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [P.], seq 167:2259, ack 1, win 229, options [nop,nop,TS val 1752507811 ecr 2174555693], length 2092 09:23:00.039778 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [.], ack 167, win 122, options [nop,nop,TS val 2174555693 ecr 1752507811], length 0 09:23:00.039784 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [.], ack 2259, win 155, options [nop,nop,TS val 2174555693 ecr 1752507811], length 0 09:23:00.533102 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [P.], seq 1:972, ack 2259, win 155, options [nop,nop,TS val 2174555817 ecr 1752507811], length 971 09:23:00.533114 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [.], ack 972, win 244, options [nop,nop,TS val 1752507935 ecr 2174555817], length 0
09:23:00.533121 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [F.], seq 972, ack 2259, win 155, options [nop,nop,TS val 2174555817 ecr 1752507811], length 0 09:23:00.533166 IP 192.168.1.100.51777 > 192.168.1.200.39045: Flags [F.], seq 2259, ack 973, win 244, options [nop,nop,TS val 1752507935 ecr 2174555817], length 0 09:23:00.533479 IP 192.168.1.200.39045 > 192.168.1.100.51777: Flags [.], ack 2260, win 155, options [nop,nop,TS val 2174555817 ecr 1752507935], length 0
此处由于使用了HTTP协议,且在系统头中设置了
Connection=close
,因此在请求处理完成后,服务端便会将链接关闭。
FIN
到客户端,对应包1。ACK
到服务端,对应包2。FIN
到服务端,对应包2。由于第2、第三步都是从客户端发送到服务端,TCP协议实现时在此处作了合并,将ACK和FIN一并发给对端,减小了网络消耗。
ACK
到客户端,对应包3,链接释放完成。包是抓到了,但是在性能测试时,业务量特别大,若是纯粹靠手工分析的话,怕是要找到地老天荒了,为此写了一段Python代码来分析tcpdump的包,由于本次的问题为请求链接有问题,必然是三次握手的时候存在问题,因此如下代码主要分析三次握手时的状况。
解析包内容
import re class Packet(object): '分组' timestamp = '00:00:00.000000' # 时间戳 from_addr = '0.0.0.0' # 源地址 to_addr = '0.0.0.0' # 目的地址 protocol_type = 'IP' # 协议类型 flags = '[S]' seq = '0' ack = '0' length = 0 # 数据包长度 def __init__(self, packet_str): hb = re.split(': ', packet_str) header = hb[0] body = hb[1] hs = re.split('[\s]', header) self.timestamp = hs[0] self.protocol_type = hs[1] self.from_addr = hs[2] self.to_addr = hs[4] bs = re.split(',\s', body) for b in bs: kv = re.split('\s', b, 1) k = kv[0] v = kv[1] if 'Flags' == k: self.flags = v elif 'seq' == k: if self.flags != '[P.]': self.seq = v elif 'length' == k: self.length = int(v) elif 'ack' == k: self.ack = v def __str__(self): return self.timestamp + self.protocol_type + self.from_addr + '==>' + self.to_addr + self.flags + str( self.seq) + str(self.ack) + str(self.length)
分析dmp包内容
from tcpdump.Packet import Packet client_addr = '192.168.1.100' server_addr = '192.168.1.200.39045' packet_times = {} # 链接ID:时间 first_packets = [] #三次握手第一次的包 ''' 第一步:syn=1,seq=xxx 第二步:syn=1,seq=yyy,ack=xxx+1 第三步:syn=0,seq=xxx+1,ack=yyy+1 ''' with open('C:\\Users\\xxx\\Desktop\\39045.dmp') as dmp: for line in dmp.readlines(): line = str(line.strip()) packet = Packet(line) if packet.from_addr.startswith(client_addr) and packet.flags == '[S]': first_packets.append(packet) id = packet.from_addr + packet.to_addr + packet.flags + 'ack=' + packet.ack if packet.flags != '[S.]': id = id + 'seq=' + packet.seq packet_times[id] = packet.timestamp print('请求ID(客户端IP.端口\\请求序号)\t第一步时间\t\t第二步时间\t\t第三步时间') for fpack in first_packets: sync_time = fpack.timestamp # 第一步的时间 id = fpack.to_addr + fpack.from_addr + '[S.]' + 'ack=' + str(int(fpack.seq) + 1) # 第二步:ack=第一步的seq+1 ack_time = packet_times.get(id) conn_time = None if ack_time is not None: conn_time = packet_times.get(fpack.from_addr + fpack.to_addr + '[.]ack=1seq=0') # 第三步:ack=1,seq无值 print( '[' + fpack.from_addr + '\\' + fpack.seq + ']\t' + sync_time + '\t\t' + str(ack_time) + '\t\t' + str(conn_time))
经过代码分析,发现存在部分建立链接时第二步的时间为None的,所以能够得出结论为发出去的SYN请求包没有获得服务端的回执,能够光明正大的摔锅给网络/第三方了![]