[Translation] TCP Implementation in Linux

Date: 2019-12-09


TCP Implementation in Linux: A Brief Tutorial

A brief tutorial on how the TCP protocol is implemented in the Linux kernel

Translator: 内核小王子. Original authors: Helali Bhuiyan, Mark McGinley, Tao Li, Malathi Veeraraghavan (University of Virginia)

Original article: TCP Implementation in Linux: A Brief Tutorial

A. Introduction

This document provides a brief overview of how TCP is implemented in Linux. It is not meant to be comprehensive, nor do we assert that it is without inaccuracies.

B. TCP implementation in Linux

Figures 1 and 2 show the internals of the TCP implementation in the Linux kernel. Fig. 1 shows the path taken by a new packet from the wire to a user application. The Linux kernel uses an sk_buff data structure to describe each packet. When a packet arrives at the NIC, it invokes the DMA engine to place the packet into kernel memory via empty sk_buffs stored in a ring buffer called rx_ring. An incoming packet is dropped if the ring buffer is full. When a packet is processed at higher layers, the packet data remains in the same kernel memory, avoiding any extra memory copies.
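The drop-when-full behavior of the receive ring can be illustrated with a toy model. This is only a sketch: the class name `RxRing` and its methods are hypothetical, and the real rx_ring holds pre-allocated sk_buff descriptors that the NIC fills by DMA rather than Python objects.

```python
from collections import deque


class RxRing:
    """Toy model of a fixed-size receive ring (hypothetical, not kernel code)."""

    def __init__(self, size):
        self.size = size        # number of slots in the ring
        self.slots = deque()    # packets waiting for the upper layers
        self.dropped = 0        # packets discarded because the ring was full

    def receive(self, packet):
        """Store an incoming packet, or drop it if no slot is free."""
        if len(self.slots) >= self.size:
            self.dropped += 1
            return False
        self.slots.append(packet)
        return True

    def poll(self):
        """Hand the oldest packet to the upper layer (None if the ring is empty)."""
        return self.slots.popleft() if self.slots else None
```

With a two-slot ring, a third back-to-back arrival is dropped, and a slot only becomes available again once `poll()` has drained a packet.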

Once a packet is successfully received, the NIC raises an interrupt to the CPU, which processes each incoming packet and passes it to the IP layer. The IP layer performs its processing on each packet, and passes it up to the TCP layer if it is a TCP packet. The TCP process is then scheduled to handle received packets. Each packet in TCP goes through a series of complex processing steps. The TCP state machine is updated, and finally the packet is stored inside the TCP recv buffer.

A critical parameter for tuning TCP is the size of the recv buffer at the receiver. The number of packets a TCP sender is able to have outstanding (unacknowledged) is the minimum of the congestion window (cwnd) and the receiver's advertised window (rwnd). The maximum size of the receiver's advertised window is the TCP recv buffer size. Hence, if the size of the recv buffer is smaller than the bandwidth-delay product (BDP) of the end-to-end path, the achievable throughput will be low. On the other hand, a large recv buffer allows a correspondingly large number of packets to remain outstanding, possibly exceeding the number of packets an end-to-end path can sustain. The size of the recv buffer can be set by modifying the /proc/sys/net/ipv4/tcp_rmem variable. It takes three different values, i.e., min, default, and max. The min value defines the minimum receive buffer size even when the operating system is under hard memory pressure. The default is the default size of the receive buffer, which is used together with the TCP window scaling factor to calculate the actual advertised window. The max defines the maximum size of the receive buffer.
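To make the relationship between the recv buffer and the BDP concrete, here is a small sketch in Python. The function names are hypothetical, and the three numbers passed to `parse_tcp_mem` in the usage note are illustrative values, not read from any particular system.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes for a link rate (bits/s) and an RTT (ms)."""
    return bandwidth_bps * rtt_ms // (8 * 1000)


def max_throughput_bps(window_bytes, rtt_ms):
    """Upper bound on throughput (bits/s) when at most one window is sent per RTT."""
    return window_bytes * 8 * 1000 // rtt_ms


def parse_tcp_mem(text):
    """Split a 'min default max' line, the layout used by tcp_rmem and tcp_wmem."""
    low, default, high = (int(field) for field in text.split())
    return {"min": low, "default": default, "max": high}
```

For a 1 Gbit/s path with a 50 ms round-trip time, `bdp_bytes(1_000_000_000, 50)` gives 6,250,000 bytes. A window of 131,072 bytes on the same path caps throughput at `max_throughput_bps(131072, 50)`, about 21 Mbit/s, which is why a recv buffer smaller than the BDP limits the achievable rate. `parse_tcp_mem("4096 131072 6291456")` returns the three fields by name.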

Also at the receiver, the parameter netdev_max_backlog dictates the maximum number of packets queued at a device, which are waiting to be processed by the TCP receiving process. If a newly received packet, when added to the queue, would cause the queue to exceed netdev_max_backlog, then it is discarded.

On the sender, as shown in Fig. 2, a user application writes the data into the TCP send buffer by calling the write() system call. Like the TCP recv buffer, the send buffer is a crucial parameter to get maximum throughput. The maximum size of the congestion window is related to the amount of send buffer space allocated to the TCP socket. The send buffer holds all outstanding packets (for potential retransmission) as well as all data queued to be transmitted. Therefore, the congestion window can never grow larger than the send buffer can accommodate. If the send buffer is too small, the congestion window will not fully open, limiting the throughput. On the other hand, a large send buffer allows the congestion window to grow to a large value. If not constrained by the TCP recv buffer, the number of outstanding packets will also grow as the congestion window grows, causing packet loss if the end-to-end path cannot hold the large number of outstanding packets. The size of the send buffer can be set by modifying the /proc/sys/net/ipv4/tcp_wmem variable, which also takes three different values, i.e., min, default, and max.
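The way the send buffer, the congestion window, and the advertised window jointly bound the data in flight can be written as a one-line rule. The sketch below is an illustration only: the function name is hypothetical, and all three quantities are assumed to be expressed in packets for simplicity.

```python
def sendable_packets(cwnd, rwnd, sndbuf_packets):
    """Packets a sender may have outstanding, with every limit given in packets.

    The congestion window cannot usefully exceed what the send buffer can hold,
    and the result is further capped by the receiver's advertised window.
    """
    effective_cwnd = min(cwnd, sndbuf_packets)
    return min(effective_cwnd, rwnd)
```

For example, `sendable_packets(100, 80, 40)` is limited to 40 by a small send buffer, while `sendable_packets(100, 80, 200)` is limited to 80 by the receiver's window.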

The analogue to the receiver's netdev_max_backlog is the sender's txqueuelen. The TCP layer builds packets when data is available in the send buffer, or ACK packets in response to data packets received. Each packet is pushed down to the IP layer for transmission. The IP layer enqueues each packet in an output queue (qdisc) associated with the NIC. The size of the qdisc can be modified by assigning a value to the txqueuelen variable associated with each NIC device. If the output queue is full, the attempt to enqueue a packet generates a local-congestion event, which is propagated upward to the TCP layer. The TCP congestion-control algorithm then enters into the Congestion Window Reduced (CWR) state, and reduces the congestion window by one every other ACK (known as rate halving). After a packet is successfully queued inside the output queue, the packet descriptor (sk_buff) is then placed in the output ring buffer tx_ring. When packets are available inside the ring buffer, the device driver invokes the NIC DMA engine to transmit packets onto the wire.
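The rate-halving behavior in the CWR state can be sketched as follows. This is a simplified illustration, not the kernel's code: the function name is hypothetical, and it assumes the reduction stops once the window reaches half of its value at the time of the local-congestion event.

```python
def rate_halving(cwnd, acks):
    """Return the congestion window after each of `acks` ACKs in the CWR state.

    The window shrinks by one packet on every second ACK. Stopping at half of
    the starting window is an assumption made for this sketch.
    """
    target = max(cwnd // 2, 1)
    trace = []
    for ack_number in range(1, acks + 1):
        if ack_number % 2 == 0 and cwnd > target:
            cwnd -= 1
        trace.append(cwnd)
    return trace
```

Starting from a window of 8 packets, `rate_halving(8, 10)` yields `[8, 7, 7, 6, 6, 5, 5, 4, 4, 4]`: the window falls gradually over eight ACKs and then holds at 4.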

While the above parameters dictate the flow-control profile of a connection, the congestion-control behavior can also have a large impact on the throughput. TCP uses one of several congestion control algorithms to match its sending rate with the bottleneck-link rate. Over a connectionless network, a large number of TCP flows and other types of traffic share the same bottleneck link. As the number of flows sharing the bottleneck link changes, the available bandwidth for a certain TCP flow varies. Packets get lost when the sending rate of a TCP flow is higher than the available bandwidth. On the other hand, packets are not lost due to competition with other flows in a circuit, as bandwidth is reserved. However, when a fast sender is connected to a circuit with a lower rate, packets can get lost due to buffer overflow at the switch.

When a TCP connection is set up, a TCP sender uses ACK packets as a "clock", known as ACK-clocking, to inject new packets into the network [1]. Since TCP receivers cannot send ACK packets faster than the bottleneck-link rate, a TCP sender's transmission rate while under ACK-clocking is matched with the bottleneck-link rate. In order to start the ACK-clock, a TCP sender uses the slow-start mechanism. During the slow-start phase, for each ACK packet received, a TCP sender transmits two data packets back-to-back. Since ACK packets are coming at the bottleneck-link rate, the sender is essentially transmitting data twice as fast as the bottleneck link can sustain. The slow-start phase ends when the size of the congestion window grows beyond ssthresh. In many congestion control algorithms, such as BIC [2], the initial slow start threshold (ssthresh) can be adjusted, as can other factors such as the maximum increment, to make BIC more or less aggressive. However, like changing the buffers via the sysctl function, these are system-wide changes which could adversely affect other ongoing and future connections.

A TCP sender is allowed to send the minimum of the congestion window and the receiver's advertised window number of packets. Therefore, the number of outstanding packets is doubled in each roundtrip time, unless bounded by the receiver's advertised window. As packets are being forwarded at the bottleneck-link rate, doubling the number of outstanding packets in each roundtrip time will also double the buffer occupancy inside the bottleneck switch. Eventually, there will be packet losses inside the bottleneck switch once the buffer overflows.
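The doubling described above can be traced with an idealized round-by-round model. It is a sketch under stated assumptions: one ACK per delivered packet, no losses, every ACK growing the window by one packet (so each ACK releases two packets), and all windows measured in packets. The function name and the `initial_cwnd` parameter are hypothetical.

```python
def slow_start(ssthresh, rwnd, initial_cwnd=1):
    """Trace (cwnd, packets in flight) per round trip during slow start.

    The trace ends with the first round in which cwnd exceeds ssthresh.
    """
    if initial_cwnd < 1 or rwnd < 1:
        raise ValueError("initial_cwnd and rwnd must be at least 1 packet")
    cwnd = initial_cwnd
    rounds = []
    while True:
        in_flight = min(cwnd, rwnd)   # sender is limited by both windows
        rounds.append((cwnd, in_flight))
        if cwnd > ssthresh:
            return rounds
        cwnd += in_flight             # one ACK per packet, +1 packet per ACK
```

With a large advertised window, `slow_start(16, 1000)` doubles the packets in flight every round: 1, 2, 4, 8, 16, 32. With `rwnd=10`, the packets in flight stop growing at 10 even though cwnd keeps increasing, which is the bound by the receiver's advertised window mentioned in the text.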

After packet loss occurs, a TCP sender enters into the congestion avoidance phase. During congestion avoidance, the congestion window is increased by one packet in each roundtrip time. As ACK packets are coming at the bottleneck-link rate, the congestion window keeps growing, as does the number of outstanding packets. Therefore, packets will get lost again once the number of outstanding packets grows larger than the buffer size in the bottleneck switch plus the number of packets on the wire.
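The loss condition in congestion avoidance can be expressed as a short calculation. This sketch assumes the outstanding packets equal the congestion window (no bound from the receiver's advertised window) and that all quantities are in packets; the function and parameter names are hypothetical.

```python
def rtts_until_loss(cwnd, switch_buffer, packets_on_wire):
    """Round trips of linear growth before the path can no longer hold cwnd.

    The path holds at most switch_buffer + packets_on_wire packets, and the
    congestion window grows by one packet per round trip.
    """
    capacity = switch_buffer + packets_on_wire
    rtts = 0
    while cwnd <= capacity:
        cwnd += 1
        rtts += 1
    return rtts
```

For a window of 10 packets, a switch buffer of 20 packets, and 15 packets on the wire, `rtts_until_loss(10, 20, 15)` returns 26: the window must grow from 10 to 36 before it exceeds the 35 packets the path can hold.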

There are many other parameters that are relevant to the operation of TCP in Linux, and each is at least briefly explained in the documentation included in the distribution (Documentation/networking/ip-sysctl.txt). An example of a configurable parameter in the TCP implementation is the RFC2861 congestion window restart function. RFC2861 proposes restarting the congestion window if the sender is idle for a period of time (one RTO). The purpose is to ensure that the congestion window reflects the current state of the network. If the connection has been idle, the congestion window may reflect an obsolete view of the network and so is reset. This behavior can be disabled using the sysctl tcp_slow_start_after_idle but, again, this change affects all connections system-wide.
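The effect of the tcp_slow_start_after_idle switch can be sketched as a simple decision. This is a simplified model of the behavior described in the text, not the RFC2861 algorithm or the kernel implementation in detail: the function name and the `restart_cwnd` parameter (the window the sender falls back to) are hypothetical, and the window is reset in one step.

```python
def cwnd_after_idle(cwnd, idle_ms, rto_ms, restart_cwnd, slow_start_after_idle=True):
    """Congestion window to use when a connection resumes sending after a pause.

    If the restart behavior is enabled and the connection has been idle for
    longer than one RTO, fall back to restart_cwnd; otherwise keep cwnd.
    """
    if slow_start_after_idle and idle_ms > rto_ms:
        return min(cwnd, restart_cwnd)
    return cwnd
```

With the behavior enabled, `cwnd_after_idle(64, 500, 200, 10)` returns 10, so the sender probes the network again. Passing `slow_start_after_idle=False` keeps the window at 64, which corresponds to disabling the sysctl.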
