iptables is a tool built on the kernel-space netfilter framework for filtering IP packets and performing network address translation (NAT); it is commonly used as a firewall or for load balancing.
Put simply: iptables is a user-space command-line tool that manipulates several iptables modules in the kernel (which in turn sit on the lower-level netfilter modules) in order to filter packets or translate their network addresses.
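For instance, here is a minimal sketch of that user-space/kernel split: the command-line tool only asks the kernel for the current contents of a table, where -t selects the table and -L lists its chains and rules.

sudo iptables -t filter -L -n -v   # list every chain and rule of the kernel's filter table, with packet/byte counters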
The kube-proxy component in Kubernetes calls an iptables Go client to write user-defined iptables chains into the Linux kernel and to add custom rules to certain tables (such as the filter table or the nat table), so that every Node performs layer-4 load balancing. Because kube-proxy is deployed as a DaemonSet, this process runs on every Node, which makes the load balancing distributed. However, every time a new Service is created in the cluster, iptables rules have to be written on every Node, and since iptables stores rules in a linked list, every lookup is O(n). This is inefficient, so a Kubernetes cluster that uses iptables as its underlying load-balancing technology can only support a small to medium number of Services. In this respect iptables is less efficient than IPVS, whose data structure is a hash table; moreover, IPVS was built for load balancing in the first place and supports more balancing policies, such as rr, wrr, and lc.
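For comparison, when kube-proxy runs in IPVS mode (and the ipvsadm tool is installed; both are assumptions here, not something shown above), the virtual servers, their scheduling algorithm (rr, wrr, lc, ...), and the real servers behind them can be listed with:

sudo ipvsadm -Ln   # -L lists the IPVS virtual services and their real servers, -n keeps addresses and ports numeric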
That is not to say iptables is not worth learning. On the contrary, the official Kubernetes code uses iptables heavily, and even when IPVS is used, iptables is still needed in some situations.
These notes serve as a summary of what I have learned over the past few days.
Formula: iptables = 4 tables; each table = 5 built-in chains + user-defined chains; each chain = an ordered list of rules.
A rule matches the packet (the -m option) and tells it what to do next (the -j option): jump to the next target, or simply drop it.
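As a minimal sketch (port 8080 and the DROP action are arbitrary choices for illustration): the first command appends a rule to the INPUT chain of the filter table, where -m tcp loads the tcp match extension to select packets by destination port and -j DROP names the target for matched packets; the second command shows that a chain is just an ordered list of such rules.

sudo iptables -t filter -A INPUT -p tcp -m tcp --dport 8080 -j DROP   # match TCP packets to port 8080 and drop them
sudo iptables -t filter -L INPUT -n --line-numbers                    # print the chain's rules with their positions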
What happens when a packet arrives? (user-defined chains inside the tables are not considered here)
An incoming packet first traverses the PREROUTING chains; after the routing decision it goes either to INPUT (if it is destined for a local socket) or to FORWARD (if it is being routed through the box), while locally generated packets go through OUTPUT; any packet leaving the machine finally passes through POSTROUTING.
How does kube-proxy use iptables to implement layer-4 load balancing for a Service?
When a NodePort Service is created, the IP of every endpoint behind the Service is written into rules of user-defined chains in the kernel of every machine. The path through a NodePort Service looks like this:
# (1) The packet first enters the PREROUTING chain, which jumps to the KUBE-SERVICES chain
sudo iptables-save -t nat | grep -- '-A PREROUTING'
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES

# (2) From there it jumps to the KUBE-NODEPORTS chain
sudo iptables-save -t nat | grep -- '-A KUBE-SERVICES'
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS

# (3) The default/nginx-demo NodePort Service then jumps to the KUBE-SVC-JKOCBQALQGD3X3RT chain
sudo iptables-save -t nat | grep -- '-A KUBE-NODEPORTS'
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx-demo-1:" -m tcp --dport 32719 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx-demo-1:" -m tcp --dport 32719 -j KUBE-SVC-JKOCBQALQGD3X3RT

# (4) Looking at the KUBE-SVC-JKOCBQALQGD3X3RT chain: with probability 0.33333333349 the packet jumps to KUBE-SEP-HWWSIA644OJY5W7C;
# of the remaining 2/3, with probability 0.50000000000 it jumps to KUBE-SEP-5Z6HLG57ALXCA2BN; otherwise it jumps to KUBE-SEP-HE7NEHV2WH3AYFZT.
# In other words, the packet is sent with equal probability to KUBE-SEP-HWWSIA644OJY5W7C, KUBE-SEP-5Z6HLG57ALXCA2BN, or KUBE-SEP-HE7NEHV2WH3AYFZT.
sudo iptables-save -t nat | grep -- '-A KUBE-SVC-JKOCBQALQGD3X3RT'
-A KUBE-SVC-JKOCBQALQGD3X3RT -m comment --comment "default/nginx-demo-1:" -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-HWWSIA644OJY5W7C
-A KUBE-SVC-JKOCBQALQGD3X3RT -m comment --comment "default/nginx-demo-1:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-5Z6HLG57ALXCA2BN
-A KUBE-SVC-JKOCBQALQGD3X3RT -m comment --comment "default/nginx-demo-1:" -j KUBE-SEP-HE7NEHV2WH3AYFZT
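Each KUBE-SEP-* chain finally DNATs the packet to one concrete endpoint. The sketch below shows what such a chain typically contains; the pod IP 10.244.1.5 and port 80 are hypothetical values, not taken from the output above.

sudo iptables-save -t nat | grep -- '-A KUBE-SEP-HWWSIA644OJY5W7C'
-A KUBE-SEP-HWWSIA644OJY5W7C -s 10.244.1.5/32 -m comment --comment "default/nginx-demo-1:" -j KUBE-MARK-MASQ
-A KUBE-SEP-HWWSIA644OJY5W7C -p tcp -m comment --comment "default/nginx-demo-1:" -m tcp -j DNAT --to-destination 10.244.1.5:80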
Which tables does iptables contain?
iptables contains the following tables:
filter: This is the default table (if no -t option is passed). It contains the chains listed below; a sample rule follows the list.
INPUT(for packets destined to local sockets),
FORWARD(for packets being routed through the box),
OUTPUT(for locally-generated packets)
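For example, a typical forwarding rule in the filter table might look like this sketch (the interface names eth0/eth1 are assumptions):

sudo iptables -t filter -A FORWARD -i eth0 -o eth1 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT   # allow reply traffic that is routed through the box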
nat: The network address translation table, used for SNAT or DNAT. It contains the chains listed below; a sample rule follows the list.
PREROUTING(for altering packets as soon as they come in),
OUTPUT(for altering locally-generated packets before routing),
POSTROUTING(for altering packets as they are about to go out)
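Typical nat-table rules, as a sketch (the 10.0.0.0/24 subnet, the eth0 interface, and the backend 10.0.0.5:80 are assumptions):

sudo iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j MASQUERADE                       # source-NAT packets leaving via eth0 to the interface's address
sudo iptables -t nat -A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 10.0.0.5:80    # destination-NAT incoming traffic on port 8080 to a backend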
mangle: This table is used for specialized packet alteration. It contains the chains listed below; a sample rule follows the list.
PREROUTING(for altering incoming packets before routing)
INPUT(for packets coming into the box itself)
OUTPUT(for altering locally-generated packets before routing)
FORWARD(for altering packets being routed through the box)
POSTROUTING(for altering packets as they are about to go out)
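As a sketch of such "specialized packet alteration", the mangle table is often used to set a firewall mark that policy routing or tc can later act on (the mark value 0x1 and port 443 are arbitrary):

sudo iptables -t mangle -A PREROUTING -p tcp --dport 443 -j MARK --set-mark 0x1   # tag incoming HTTPS packets with fwmark 1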
raw: This table is used mainly for configuring exemptions from connection tracking in combination with the NOTRACK target. It registers at the netfilter hooks with higher priority and is thus called before ip_conntrack, or any other IP tables. It contains the chains:
PREROUTING(for packets arriving via any network interface)
OUTPUT(for packets generated by local processes)
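And a sketch of the raw table's main use, exempting selected traffic from connection tracking (port 53/udp is an arbitrary choice):

sudo iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK   # skip conntrack for inbound DNS queries
sudo iptables -t raw -A OUTPUT -p udp --sport 53 -j NOTRACK       # skip conntrack for the locally generated replies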