How the Linux 3.5 Routing Subsystem Refactoring Affected Redirect Routes and the Neighbour Subsystem

Date: 2020-01-15



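A few years ago I wrote several articles about Linux dropping support for the route cache. The route cache was retired as part of a refactoring of the routing subsystem; I will not repeat the reasons here. This article covers the impact of that refactoring on Redirect routes and on the neighbour subsystem.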

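In fact, only in the last three months did I realise how large that impact is. I cannot go into the details of my work, so this is just a summary of some implementation knowledge about the open-source Linux kernel protocol stack, kept for future reference. If anyone else benefits from it, I would be honoured.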


Route entries: rtable, dst_entry and neighbour

In the IP protocol stack, sending an IP packet consists of two parts:

IP route lookup

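To send a packet successfully there must be a matching route. Finding it is the job of the route lookup logic specified by IP; the details of that lookup are not the focus of this article. On Linux, the final result of the lookup is an rtable struct object representing a route entry. Its first embedded field is a dst_entry struct, so the two can be cast to each other. The important field is rt_gateway.
rt_gateway is simply the IP address of the next hop that the packet must be handed to on its way to the destination, which is the core of hop-by-hop IP forwarding. At this point the IP route lookup is finished.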
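The cast between rtable and dst_entry works because C guarantees that a pointer to a struct, suitably converted, points to its first member. Below is a minimal sketch of that layout trick together with the next-hop choice. The `_sketch` types and the helper names are made up for illustration and are far simpler than the kernel's real definitions; the convention that a zero rt_gateway means "use the destination address" is taken from the kernel snippets quoted later in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct dst_entry_sketch {
    int refcnt;                   /* protocol-independent state */
};

struct rtable_sketch {
    struct dst_entry_sketch dst;  /* must stay the first member */
    uint32_t rt_gateway;          /* next-hop IPv4 address, 0 if none */
};

/* Upcast: the address of the first member is the address of the struct. */
static struct dst_entry_sketch *rt_to_dst(struct rtable_sketch *rt)
{
    return &rt->dst;
}

/* Downcast: only valid for a dst that really is embedded in an rtable. */
static struct rtable_sketch *dst_to_rt(struct dst_entry_sketch *dst)
{
    assert(offsetof(struct rtable_sketch, dst) == 0);
    return (struct rtable_sketch *)dst;
}

/* Address handed to neighbour resolution: gateway if set, else destination. */
static uint32_t next_hop(const struct rtable_sketch *rt, uint32_t daddr)
{
    return rt->rt_gateway ? rt->rt_gateway : daddr;
}
```

If another field were ever placed before `dst`, the downcast would silently point at the wrong memory, which is why the sketch checks the offset.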

IP neighbour resolution

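The route lookup stage has produced rt_gateway, and the next step is to map it down to layer 2. That is the job of IP neighbour resolution: rt_gateway (or the destination address itself when no gateway is set) is the neighbour, and it now has to be resolved to a hardware address. A neighbour is any network device that is logically directly connected to this host. "Logically directly connected" means that on Ethernet every device on the segment can be a neighbour of this host; what matters is which one is chosen as the next hop for the current packet. A POINTOPOINT device, on the other hand, has exactly one neighbour, the peer device, and having only one means there is no hardware address to resolve! Ignoring this difference causes a huge performance loss, as I will explain at the end of this article.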


Note:

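For convenience, the rest of this article no longer mentions rtable and always uses dst_entry for the result of a route lookup. The code below is not the actual Linux protocol stack code but pseudocode abstracted for the sake of explanation, so dst_entry here is not the kernel's dst_entry struct; it simply stands for a route entry. The reason is that dst_entry represents the protocol-independent part, and this article is also independent of any particular protocol, so the pseudocode does not use the protocol-specific rtable struct to represent a route entry.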

The refactoring of the routing subsystem in the Linux kernel

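Before Linux 3.5 the routing subsystem had a route cache hash table that cached the most recently and most frequently used dst_entry route entries (rtable for IPv4). A packet was first looked up in the route cache using its IP address tuple; on a hit the dst_entry was taken directly from the cache, otherwise the system routing table was searched.
In 3.5 the route cache is gone. The reasons are not the focus here and are covered in other articles. Removing the route cache had a side effect on the neighbour subsystem, one that turned out to be beneficial, and most of what follows is about it. Before describing that in detail, a few words about another change: how Redirect routes are implemented.
A Redirect route is necessarily a redirect of a route entry that already exists on this host. In earlier kernels, however, redirect routes were stored in various other places such as inet_peer, which coupled the routing subsystem to other parts of the protocol stack. In those kernels, wherever a Redirect route was stored, it ultimately had to enter the route cache to take effect. Once the route cache was gone, the question of where to keep Redirect routes was exposed. To "solve the Redirect route problem inside the routing subsystem", the refactored kernel keeps an exception hash table for every route entry in the routing table. A route entry, Fib_info, looks roughly like this: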

Fib_info {
    Address nexthop;
    Hash_list exception;
};

An entry in this exception table looks roughly like this:

Exception_entry {
    Match_info info;
    Address new_nexthop;
};

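With this in place, when a Redirect is received, an Exception_entry record is initialised and inserted into the corresponding exception hash table. During a route lookup, once a Fib_info has been found and before the final dst_entry is built, the exception hash table is searched using Match_info such as the packet's destination IP address (the key the real kernel uses for its next-hop exceptions). If a matching Exception_entry is found, the dst_entry is built from the new_nexthop of that Exception_entry instead of the nexthop field of the Fib_info.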
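The lookup-then-override step can be sketched in C as follows. This is only an illustration of the idea, not the kernel's code: the `_sketch` type, the helper names, the bucket count and the hash function are all invented here, and Match_info is reduced to a single destination address, which is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXC_BUCKETS 8   /* arbitrary table size for the sketch */

struct exception_entry {
    uint32_t match_daddr;          /* Match_info: here just a destination */
    uint32_t new_nexthop;          /* gateway learned from the Redirect */
    struct exception_entry *next;  /* hash-chain link */
};

struct fib_info_sketch {
    uint32_t nexthop;                                /* configured next hop */
    struct exception_entry *exception[EXC_BUCKETS];  /* per-route exceptions */
};

static unsigned int exc_hash(uint32_t daddr)
{
    return (daddr ^ (daddr >> 16)) % EXC_BUCKETS;
}

/* Called when a Redirect arrives: remember the better gateway. */
static void exception_add(struct fib_info_sketch *fi, struct exception_entry *e)
{
    unsigned int h;

    assert(e != NULL);
    h = exc_hash(e->match_daddr);
    e->next = fi->exception[h];
    fi->exception[h] = e;
}

/* Called after the FIB lookup: an exception overrides the route's next hop. */
static uint32_t select_nexthop(const struct fib_info_sketch *fi, uint32_t daddr)
{
    const struct exception_entry *e;

    for (e = fi->exception[exc_hash(daddr)]; e != NULL; e = e->next)
        if (e->match_daddr == daddr)
            return e->new_nexthop;
    return fi->nexthop;
}
```

Because the exceptions hang off the route entry itself, nothing outside the routing subsystem has to know that a Redirect was ever received.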
With that brief introduction to Redirect routes out of the way, the rest of this article is devoted to the relationship between routes and neighbours.


Side effects of the refactoring on the neighbour subsystem

The following, excerpted from the web, summarises how removing the route cache affected neighbours:
Neighbours
>Hold link-level nexthop information (for ARP, etc.)
>Routing cache pre-computed neighbours
>Remember: One “route” can refer to several nexthops
>Need to disconnect neighbours from route entries.
>Solution:
　　Make neighbour lookups cheaper (faster hash, etc.)
　　Compute neighbours at packet send time ...
　　.. instead of using precomputed reference via route
>Most of work involved removing dependencies on old setup
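In principle the two should not be tied together at all. The routing subsystem and the neighbour subsystem sit at different layers, and the sensible design is to link them only through the nexthop value of the route entry, via a single neighbour lookup interface: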

dst_entry = routing table lookup (or route cache lookup, keyed by the skb's destination)
nexthop = dst_entry.nexthop
neigh = neighbour table lookup (keyed by nexthop)

The Linux protocol stack implementation, however, was far more complicated than this, and the story starts before the 3.5 refactoring.


Before the refactoring

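Before the refactoring, because the route cache existed, any skb whose dst_entry could be found in the cache did not need a routing table lookup. The route cache rested on the assumption that the vast majority of skbs would never need the routing table; ideally all of them would hit the cache. For neighbours, the obvious approach was to bind the neighbour to the dst_entry, so that finding the dst_entry in the cache also found the neighbour. In other words, the route cache cached not only dst_entry objects but neighbours as well.
In fact, before 3.5 the dst_entry struct had a neighbour field pointing to the neighbour bound to that route entry. After a route entry was found in the route cache, the neighbour could be taken out directly and its output callback invoked.
From this we can deduce when a dst_entry and a neighbour were bound: after a routing table lookup. That is, on a route cache miss, once the routing table lookup had completed and before the result was inserted into the route cache, a neighbour binding step was executed.
Like the route cache, the neighbour subsystem maintains its own table and performs replacement, update, expiry and other state operations on it. The neighbour table and the route cache were heavily coupled. Before describing that coupling, let's look at the overall logic: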

func ip_output(skb):
        dst_entry = lookup_from_cache(skb.destination);
        if dst_entry == NULL
        then
                dst_entry = lookup_fib(skb.destination);
                nexthop = dst_entry.gateway?:skb.destination;
                neigh = lookup(neighbour_table, nexthop);
                if neigh == NULL
                then
                        neigh = create(neighbour_table, nexthop);
                        neighbour_add_timer(neigh);
                end
                dst_entry.neighbour = neigh;
                insert_into_route_cache(dst_entry);
        end
        neigh = dst_entry.neighbour;
        neigh.output(neigh, skb);
endfunc
---->TO Layer2

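Consider the following questions:
When the neighbour timer runs and a neighbour has expired, can it be deleted?
When the route cache timer runs and a route cache entry has expired, can it be deleted?
Anyone who can answer these two questions precisely understands the relationship between the routing subsystem and the neighbour subsystem well enough. Let's take the first question.
If the neighbour were deleted, the route cache entry bound to it might still exist, and later skbs matching that cache entry could no longer fetch and use the neighbour. Because a dst_entry and a neighbour are bound only on a route cache miss, rebinding cannot happen at that point. Moreover, since route entries and neighbours are in a many-to-one relationship, a neighbour cannot hold back-references to the route cache entries, so a deleted neighbour still referenced through dst_entry.neighbour would be a dangling pointer, leading to an oops and ultimately a kernel panic. The obvious answer, then, is that an expired neighbour cannot be deleted, only marked invalid, which reference counting makes possible. Now the second question.
An expired route cache entry can be deleted, but the reference count of the neighbour bound to it must be decremented, and if the count reaches 0 the neighbour is deleted as well. That neighbour is exactly the kind from the first question that could not be deleted when it expired. So the coupling between the route cache and neighbours means that the expiry and deletion of a neighbour bound to a dst_entry can only be initiated from the route cache entry, unless the neighbour is not bound to any dst_entry. The overall send logic is revised as follows: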

func ip_output(skb):
        dst_entry = lookup_from_cache(skb.destination);
        if dst_entry == NULL
        then
                dst_entry = lookup_fib(skb.destination);
                nexthop = dst_entry.gateway?:skb.destination;
                neigh = lookup(neighbour_table, nexthop);
                if neigh == NULL
                then
                        neigh = create(neighbour_table, nexthop);
                        neighbour_add_timer(neigh);
                end
                inc(neigh.refcnt);
                dst_entry.neighbour = neigh;
                insert_into_route_cache(dst_entry);
        end
        neigh = dst_entry.neighbour;
        # a neigh in the INVALID state has to be handled in the output callback
        neigh.output(neigh, skb);
endfunc
   
func neighbour_add_timer(neigh):
        inc(neigh.refcnt);
        neigh.timer.func = neighbour_timeout;
        timer_start(neigh.timer);
endfunc

func neighbour_timeout(neigh):
        cnt = dec(neigh.refcnt);
        if cnt == 0
        then
                free_neigh(neigh);
        else
                neigh.status = INVALID;
        end
endfunc

func dst_entry_timeout(dst_entry):
        neigh = dst_entry.neighbour;
        cnt = dec(neigh.refcnt);
        if cnt == 0
        then
                free_neigh(neigh);
        end
        free_dst(dst_entry);
endfunc
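
The ownership rule in the pseudocode above can be exercised with a small C sketch. It is a toy model rather than kernel code: the `_sketch` types, the `live_neighbours` counter and all function names are invented for illustration, and locking and timers are left out. What it shows is that a neighbour pinned by a cached route survives its own timeout and is only freed when the route entry lets go.

```c
#include <assert.h>
#include <stdlib.h>

enum neigh_state { NEIGH_VALID, NEIGH_INVALID };

struct neigh_sketch {
    int refcnt;
    enum neigh_state state;
};

struct dst_sketch {
    struct neigh_sketch *neighbour;  /* pre-3.5 style binding */
};

static int live_neighbours;          /* neighbours not yet freed */

static struct neigh_sketch *neigh_create_sketch(void)
{
    struct neigh_sketch *n = calloc(1, sizeof(*n));

    if (n == NULL)
        abort();
    n->refcnt = 1;                   /* reference held by the neighbour timer */
    n->state = NEIGH_VALID;
    live_neighbours++;
    return n;
}

static void neigh_put(struct neigh_sketch *n)
{
    assert(n->refcnt > 0);
    if (--n->refcnt == 0) {
        live_neighbours--;
        free(n);
    }
}

/* Binding a cached route to a neighbour pins the neighbour. */
static void dst_bind(struct dst_sketch *dst, struct neigh_sketch *n)
{
    n->refcnt++;
    dst->neighbour = n;
}

/* Neighbour timer fires: while a route pins it, it can only be invalidated. */
static void neigh_timeout(struct neigh_sketch *n)
{
    if (n->refcnt > 1)
        n->state = NEIGH_INVALID;
    neigh_put(n);
}

/* Route cache entry expires: this may be what finally frees the neighbour. */
static void dst_timeout(struct dst_sketch *dst)
{
    neigh_put(dst->neighbour);
    dst->neighbour = NULL;
}
```

With one route bound, calling `neigh_timeout` leaves the entry alive but INVALID, and only the later `dst_timeout` brings `live_neighbours` back to zero.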

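Finally, let's see what problems this causes.
If the gc parameters of the neighbour table and of the route cache are out of sync, for example neighbours expire quickly while route cache entries expire slowly, many neighbours cannot be deleted and the neighbour table fills up. In that case the route cache has to be reclaimed forcibly. This is coupling fed back from the neighbour subsystem into the routing subsystem, and the whole arrangement is a mess: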

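func create(neighbour_table, nexthop):
retry:
        neigh = alloc_neigh(nexthop);
        if neigh == NULL or neighbour_table.num > MAX
        then
                # the table is full of neighbours pinned by route cache entries:
                # force a route cache gc so they can be released, then try again
                shrink_route_cache();
                goto retry;
        end
        return neigh;
endfunc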

On the relationship between the route cache gc timer and the neighbour subsystem, a well-written article about the route cache, "Tuning Linux IPv4 route cache", says the following:
You may find documentation about those obsolete sysctl values:
net.ipv4.route.secret_interval has been removed in Linux 2.6.35; it was used to trigger an asynchronous flush at fixed interval to avoid to fill the cache.
net.ipv4.route.gc_interval has been removed in Linux 2.6.38. It is still present until Linux 3.2 but has no effect. It was used to trigger an asynchronous cleanup of the route cache. The garbage collector is now considered efficient enough for the job.
UPDATED: net.ipv4.route.gc_interval is back for Linux 3.2. It is still needed to avoid exhausting the neighbour cache because it allows to cleanup the cache periodically and not only above a given threshold. Keep it to its default value of 60.

All of this changed with the 3.5 kernel!

After the refactoring

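With the refactoring, 3.5 and later kernels dropped the route cache, so every packet requires a routing table lookup (setting aside the case where a socket caches its dst_entry). Without a route cache there is no cache expiry or replacement to deal with, and the routing subsystem becomes completely stateless. A dst_entry therefore no longer needs to be bound to a neighbour: if looking up the routing table for every packet is cheap enough, then looking up the much smaller neighbour table for every packet costs even less (although the lookup itself cannot be avoided). So the neighbour field was removed from dst_entry, and the IP send logic becomes: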

func ip_output(skb):
        dst_entry = lookup_fib(skb.destination);
        nexthop = dst_entry.gateway?:skb.destination;
        neigh = lookup(neighbour_table, nexthop);
        if neigh == NULL
        then    
                neigh = create(neighbour_table, nexthop);
        end
        neigh.output(neigh, skb);
endfunc

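Route entries are no longer tied to neighbours, so the neighbour table can run its expiry independently, and the situation where the neighbour table kept filling up because route cache gc was too slow disappears.
On top of that, the code looks a lot cleaner.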

A detail: neighbours of POINTOPOINT and LOOPBACK devices

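There is plenty of material on the Linux neighbour subsystem, but almost without exception it is about ARP: the various complicated ARP protocol operations, queue operations, the state machine and so on. Almost nothing describes neighbours outside of ARP, so this last section adds an example of that. As before, start with a question:
For an skb sent from a NOARP device such as a POINTOPOINT device, who is its neighbour?
On broadcast Ethernet, sending a packet to a remote host requires resolving the "next hop" address: every such packet leaves via a gateway, which is abstracted as an IP address on the same subnet, so ARP is needed to map it to a specific hardware address. A pointopoint device, however, has exactly one fixed peer connected to it. It has no broadcast or multicast layer 2, so the notion of a gateway does not apply, or, put differently, its next hop is the destination IP address itself.
According to the ip_output function above, the key used to look up the neighbour table is nexthop. For a pointopoint device nexthop is the skb's destination address itself, and if no entry is found, one is created with that key. Now imagine that the skbs sent through a pointopoint device cover a huge destination address space: a huge number of neighbours are created at the same time and all of them are inserted into the neighbour table, which inevitably runs into locking. In fact, all of those insertions will spin on the write side of the neighbour table's read-write lock!
The logic of neigh_create is as follows: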

struct neighbour *neigh_create(struct neigh_table *tbl, const void *pkey,
                   struct net_device *dev)
{
    struct neighbour *n1, *rc, *n = neigh_alloc(tbl);
    ......
    write_lock_bh(&tbl->lock);
    /* insert into the hash table */
    write_unlock_bh(&tbl->lock);
    ......
}

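When skbs for a huge number of destination IPs are sent through a pointopoint device, this is a bottleneck that cannot be avoided! The kernel, however, was not that dumb. It sidestepped the problem like this: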

__be32 nexthop = ((struct rtable *)dst)->rt_gateway?:ip_hdr(skb)->daddr;
if (dev->flags&(IFF_LOOPBACK|IFF_POINTOPOINT))
    nexthop = 0;

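This means that as long as the sending pointopoint device is the same and the pseudo layer-2 information (as with IPGRE) is the same, all skbs use the same neighbour regardless of their destination addresses. For an IPIP tunnel, which carries no layer-2 information at all, it means that every skb sent through a given tunnel device uses one single neighbour; since the neighbour table is keyed by the device as well as by the address, each such device ends up with exactly one entry.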
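To make the effect concrete, here is a small C sketch that counts how many neighbour entries get created when packets for many destinations leave through one device, with and without the key being collapsed to 0. It is a toy model, not kernel code: the table is a plain array, the `SK_IFF_*` flag values and all names are invented for the sketch, and each counted creation stands in for one trip through the table's write lock.

```c
#include <assert.h>
#include <stdint.h>

#define SK_IFF_LOOPBACK    0x1   /* hypothetical flag values for the sketch */
#define SK_IFF_POINTOPOINT 0x2
#define TABLE_MAX 1024

struct neigh_table_sketch {
    uint32_t keys[TABLE_MAX];
    int ifindex[TABLE_MAX];
    int num;
    int creations;   /* each creation would take the table write lock */
};

/* Key for the neighbour table: collapsed to 0 on devices with a single peer. */
static uint32_t neigh_key(unsigned int dev_flags, uint32_t gateway, uint32_t daddr)
{
    uint32_t nexthop = gateway ? gateway : daddr;

    if (dev_flags & (SK_IFF_LOOPBACK | SK_IFF_POINTOPOINT))
        nexthop = 0;
    return nexthop;
}

/* Linear lookup-or-create, enough to count how many entries get created. */
static void neigh_lookup_or_create(struct neigh_table_sketch *t, int ifindex,
                                   uint32_t key)
{
    for (int i = 0; i < t->num; i++)
        if (t->keys[i] == key && t->ifindex[i] == ifindex)
            return;
    assert(t->num < TABLE_MAX);
    t->keys[t->num] = key;
    t->ifindex[t->num] = ifindex;
    t->num++;
    t->creations++;
}

/* Send one packet to each of n distinct destinations over a single device. */
static int creations_for(unsigned int dev_flags, int n)
{
    struct neigh_table_sketch t = { .num = 0 };

    for (int i = 0; i < n; i++)
        neigh_lookup_or_create(&t, 1,
                               neigh_key(dev_flags, 0, 0x0a000001u + (uint32_t)i));
    return t.creations;
}
```

In this model, 500 destinations over a point-to-point device cost a single creation, while the same traffic without the special case costs 500.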
After the 3.5 refactoring, however, things went wrong!
Let's go straight to the 4.4 kernel:

static inline __be32 rt_nexthop(const struct rtable *rt, __be32 daddr)
{
    if (rt->rt_gateway)
        return rt->rt_gateway;
    return daddr;
}
static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    ......
    nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
    neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
    if (unlikely(!neigh))
        neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
    if (!IS_ERR(neigh)) {
        int res = dst_neigh_output(dst, neigh, skb);
        return res;
    }
    ......
}

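As you can see, the dev->flags&(IFF_LOOPBACK|IFF_POINTOPOINT) check is gone! The kernel got dumber. The behaviour analysed in the previous section can occur in post-3.5 kernels such as 4.4, and under that kind of traffic it is bound to occur.
After running into this problem, and before I had read the pre-3.5 implementation in detail, my idea was to initialise a global dummy neighbour that simply sends the packet out directly via dev_queue_xmit: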

static const struct neigh_ops dummy_direct_ops = {
    .family =        AF_INET,
    .output =        neigh_direct_output,
    .connected_output =    neigh_direct_output,
};
struct neighbour dummy_neigh;
void dummy_neigh_init()
{
    memset(&dummy_neigh, 0, sizeof(dummy_neigh));
    dummy_neigh.nud_state = NUD_NOARP;
    dummy_neigh.ops = &dummy_direct_ops;
    dummy_neigh.output = neigh_direct_output;
    dummy_neigh.hh.hh_len = 0;
}

static inline int ip_finish_output2(struct sk_buff *skb)
{
    ......
    nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
    if (dev->type == ARPHRD_TUNNEL) {
        neigh = &dummy_neigh;
    } else {
        neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
    }
    if (unlikely(!neigh))
        neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
    ......
}

Later I read the pre-3.5 implementation and found this:

if (dev->flags&(IFF_LOOPBACK|IFF_POINTOPOINT))
    nexthop = 0;

So I decided to adopt it: less code and more elegant! That led to the patch below:

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -202,6 +202,8 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s

        rcu_read_lock_bh();
        nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
+       if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
+               nexthop = 0;
        neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
        if (unlikely(!neigh))
                neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);