Using access between two simple services, combined with tcpdump packet captures, this article analyzes in detail how kubernetes in IPVS mode implements access to NodePort and ClusterIP type services through the service name.
Of course, the kubernetes network implementation involves a lot of background, in particular many calls into low-level Linux modules. If you are not familiar with Linux network namespaces, veth device pairs, bridges and so on, you can first read the article Docker 网络, and then the article Kubernetes kube-proxy to learn about kube-proxy's IPVS mode.
The environment in this article is based on the flannel network plugin; for the setup details, see kubernetes安装-二进制.
Role | OS | CPU Cores | Memory | Hostname | IP | Installed Components |
---|---|---|---|---|---|---|
master | 18.04.1-Ubuntu | 4 | 8G | master | 192.168.0.107 | kubectl,kube-apiserver,kube-controller-manager,kube-scheduler,etcd,flannald,kubelet,kube-proxy |
slave | 18.04.1-Ubuntu | 4 | 4G | slave | 192.168.0.114 | docker,flannald,kubelet,kube-proxy,coredns |
master node
```
$ route -n -v
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG    600    0        0 wlp3s0
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 wlp3s0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 br-471858815e83
172.30.22.0     0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.30.78.0     172.30.78.0     255.255.255.0   UG    0      0        0 flannel.1
192.168.0.0     0.0.0.0         255.255.255.0   U     600    0        0 wlp3s0
```
slave node
```
$ route -v -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG    600    0        0 wlo1
169.254.0.0     0.0.0.0         255.255.0.0     U     1000   0        0 wlo1
172.30.22.0     172.30.22.0     255.255.255.0   UG    0      0        0 flannel.1
172.30.78.0     0.0.0.0         255.255.255.0   U     0      0        0 docker0
192.168.0.0     0.0.0.0         255.255.255.0   U     600    0        0 wlo1
```
web image
A web service built with Spring Boot listens on port 8080 and provides an endpoint /header/list; when this endpoint is called, it logs information about the caller's address.
```
@RequestMapping("/header/list")
public String listHeader(HttpServletRequest request) {
    log.info("host is" + request.getHeader("host"));
    log.info("remoteAddr is " + request.getRemoteHost());
    log.info("remotePort is " + request.getRemotePort());
    return "OK";
}
```
curl image
Based on the alpine image, with only the curl command added, so that we can use it to access the web service.
```
FROM alpine:latest
RUN apk update
RUN apk add --upgrade curl
```
To make the pods start on the desired nodes for the analysis below, add a different label to each of the two nodes.
```
$ kubectl label nodes master sample=master
node/master labeled
$ kubectl label nodes slave sample=slave
node/slave labeled
```
The pods for the web service and the curl client are both on the master node. As the topology shows, this access only has to go through the docker0 bridge on the master node.
Create the web service manifest
```
$ cat > web.yml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: clientip
spec:
  #type: NodePort
  selector:
    app: clientip
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    #nodePort: 8086
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clientip-deployment
spec:
  selector:
    matchLabels:
      app: clientip
  replicas: 1
  template:
    metadata:
      labels:
        app: clientip
    spec:
      nodeSelector:
        sample: master
      containers:
      - name: clientip
        image: 192.168.0.107/k8s/client-ip-test:0.0.2
        ports:
        - containerPort: 8080
EOF
```
Create the manifest for the curl pod
```
$ cat > pod_curl.yml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: curl
spec:
  containers:
  - name: curl
    image: 192.168.0.107/k8s/curl:1.0
    command:
    - sleep
    - "3600"
  nodeSelector:
    sample: master
EOF
```
Start the services
```
$ kubectl create -f web.yml -f pod_curl.yml
$ kubectl get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
clientip-deployment-5d8b5dcb46-qprps   1/1     Running   0          4s    172.30.22.4   master   <none>           <none>
curl                                   1/1     Running   0          9s    172.30.22.3   master   <none>           <none>
$ kubectl get svc
NAME         TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
clientip     ClusterIP   10.254.0.30   <none>        8080/TCP   51s
kubernetes   ClusterIP   10.254.0.1    <none>        443/TCP    25d
```
As shown, both workloads started successfully and are running on the master node.
Start tcpdump captures on the docker0 and flannel.1 devices of the master node
```
$ tcpdump -n -vv -i docker0
$ tcpdump -n -vv -i flannel.1
```
```
$ kubectl exec -it curl curl http://clientip:8080/header/list
OK
```
Analysis of the captures
Web service log
```
2020-03-06 08:29:05.447  INFO 6 --- [nio-8080-exec-1] c.falcon.clientip.ClientIpController : host isclientip:8080
2020-03-06 08:29:05.447  INFO 6 --- [nio-8080-exec-1] c.falcon.clientip.ClientIpController : remoteAddr is 172.30.22.3
2020-03-06 08:29:05.447  INFO 6 --- [nio-8080-exec-1] c.falcon.clientip.ClientIpController : remotePort is 42000
```
docker0 capture (only the key part of the flow is excerpted)
```
172.30.22.3.47980 > 10.254.0.2.53: [bad udp cksum 0xcd6e -> 0xdae6!] 22093+ A? clientip.default.svc.cluster.local. (52)
...
10.254.0.2.53 > 172.30.22.3.47980: [udp sum ok] 22093*- q: A? clientip.default.svc.cluster.local. 1/0/0 clientip.default.svc.cluster.local. A 10.254.0.30 (102)
...
172.30.22.3.42000 > 10.254.0.30.8080: Flags [P.], cksum 0xcdbb (incorrect -> 0x95b1), seq 0:88, ack 1, win 507, options [nop,nop,TS val 3200284558 ecr 1892112994], length 88: HTTP, length: 88
        GET /header/list HTTP/1.1
        Host: clientip:8080
        User-Agent: curl/7.67.0
        Accept: */*
...
172.30.22.3.42000 > 172.30.22.4.8080: Flags [P.], cksum 0x84c2 (incorrect -> 0xdeaa), seq 1:89, ack 1, win 507, options [nop,nop,TS val 3200284558 ecr 1892112994], length 88: HTTP, length: 88
        GET /header/list HTTP/1.1
        Host: clientip:8080
        User-Agent: curl/7.67.0
        Accept: */*
...
172.30.22.4.8080 > 172.30.22.3.42000: Flags [P.], cksum 0x84dd (incorrect -> 0xe64b), seq 1:116, ack 89, win 502, options [nop,nop,TS val 1892113104 ecr 3200284558], length 115: HTTP, length: 115
        HTTP/1.1 200
        Content-Type: text/plain;charset=UTF-8
        Content-Length: 2
        Date: Fri, 06 Mar 2020 08:29:05 GMT

        OK[!http]
```
Looking at the flannel.1 capture, nothing related to the request to 172.30.22.4 shows up in this case, so it is omitted here.
The pod for the curl client is on the master node and the pod for the web service is on the slave node. As the topology shows, to reach the web service from inside the curl pod, traffic has to pass through master.docker0 -> master.flannel.1 -> master.wlp3s0 -> slave.wlo1 -> slave.flannel.1 -> slave.docker0 in turn, which can be confirmed with the route lookup sketched below.
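The cross-host leg follows directly from master's routing table shown earlier (172.30.78.0/24 via flannel.1). A minimal sketch using iproute2's `ip route get`, run on master; 172.30.78.3 is just an address in slave's pod subnet, and the expected result is noted as an assumption in the comments:

```
# on master: ask the kernel how it would route an address in slave's pod subnet;
# per the route table above this should resolve via the flannel.1 overlay device
ip route get 172.30.78.3
```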
Modify the web manifest, change the nodeSelector value to sample=slave, and redeploy the web application.
```
$ kubectl get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
clientip-deployment-68c57b7965-pmwp2   1/1     Running   0          33s   172.30.78.3   slave    <none>           <none>
curl                                   1/1     Running   0          48m   172.30.22.3   master   <none>           <none>
$ kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
clientip     ClusterIP   10.254.167.63   <none>        8080/TCP   94s
kubernetes   ClusterIP   10.254.0.1      <none>        443/TCP    25d
```
Log monitoring
Watch the web service log
```
$ kubectl logs -f clientip-deployment-68c57b7965-pmwp2
```
Watch the captures on each network device of the master node
```
$ tcpdump -n -vv -i docker0
$ tcpdump -n -vv -i flannel.1
$ tcpdump -n -vv -i wlp3s0
```
Watch the captures on each network device of the slave node
```
$ tcpdump -n -vv -i docker0
$ tcpdump -n -vv -i flannel.1
$ tcpdump -n -vv -i wlo1
```
Analysis of the captures
Web log
```
2020-03-07 11:13:22.384  INFO 6 --- [nio-8080-exec-3] c.falcon.clientip.ClientIpController : host isclientip:8080
2020-03-07 11:13:22.384  INFO 6 --- [nio-8080-exec-3] c.falcon.clientip.ClientIpController : remoteAddr is 172.30.22.3
2020-03-07 11:13:22.384  INFO 6 --- [nio-8080-exec-3] c.falcon.clientip.ClientIpController : remotePort is 51596
```
Analysis of the captures on the master's network devices (only the main flow is shown; the TCP handshake is omitted)
docker0 device
```
...
11:13:22.346481 IP (tos 0x0, ttl 64, id 28047, offset 0, flags [DF], proto UDP (17), length 80)
    172.30.22.3.35482 > 10.254.0.2.53: [bad udp cksum 0xcd6e -> 0x55df!] 3111+ A? clientip.default.svc.cluster.local. (52)
...
11:13:22.355447 IP (tos 0x0, ttl 62, id 34179, offset 0, flags [DF], proto UDP (17), length 130)
    10.254.0.2.53 > 172.30.22.3.35482: [udp sum ok] 3111*- q: A? clientip.default.svc.cluster.local. 1/0/0 clientip.default.svc.cluster.local. A 10.254.167.63 (102)
...
11:13:22.359009 IP (tos 0x0, ttl 64, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 10.254.167.63.8080: Flags [P.], cksum 0x74dd (incorrect -> 0x0f66), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
        GET /header/list HTTP/1.1
        Host: clientip:8080
        User-Agent: curl/7.67.0
        Accept: */*
...
11:13:22.372907 IP (tos 0x0, ttl 62, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    10.254.167.63.8080 > 172.30.22.3.51596: Flags [P.], cksum 0x077c (correct), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
        HTTP/1.1 200
        Content-Type: text/plain;charset=UTF-8
        Content-Length: 2
        Date: Sat, 07 Mar 2020 03:13:22 GMT

        OK[!http]
```
The response comes back from 10.254.167.63:8080, the same path the request went out on: on the way back, IPVS masq (SNAT) rewrites the real server address back to the virtual (cluster) address.
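This translation can also be observed directly in the IPVS connection table while the request is in flight. A minimal sketch, assuming ipvsadm is installed on the master node; the grep pattern only filters for this example's cluster IP:

```
# list IPVS connection entries (numeric output); while the curl request is active,
# an entry should map the virtual address 10.254.167.63:8080 to the real server
# 172.30.78.3:8080 for the client 172.30.22.3
ipvsadm -Lnc | grep 10.254.167.63
```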
flannel.1 device
```
11:13:22.359020 IP (tos 0x0, ttl 63, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 172.30.78.3.8080: Flags [P.], cksum 0xbcc1 (incorrect -> 0xc781), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
        GET /header/list HTTP/1.1
        Host: clientip:8080
        User-Agent: curl/7.67.0
        Accept: */*
...
11:13:22.372887 IP (tos 0x0, ttl 63, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 172.30.22.3.51596: Flags [P.], cksum 0xbf97 (correct), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
        HTTP/1.1 200
        Content-Type: text/plain;charset=UTF-8
        Content-Length: 2
        Date: Sat, 07 Mar 2020 03:13:22 GMT

        OK[!http]
```
wlp3s0 NIC (physical NIC)
```
...
11:13:22.359026 IP (tos 0x0, ttl 64, id 22491, offset 0, flags [none], proto UDP (17), length 190)
    192.168.0.107.33404 > 192.168.0.114.8472: [udp sum ok] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 172.30.78.3.8080: Flags [P.], cksum 0xc781 (correct), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
        GET /header/list HTTP/1.1
        Host: clientip:8080
        User-Agent: curl/7.67.0
        Accept: */*
...
11:13:22.372815 IP (tos 0x0, ttl 64, id 57065, offset 0, flags [none], proto UDP (17), length 217)
    192.168.0.114.43021 > 192.168.0.107.8472: [udp sum ok] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 172.30.22.3.51596: Flags [P.], cksum 0xbf97 (correct), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
        HTTP/1.1 200
        Content-Type: text/plain;charset=UTF-8
        Content-Length: 2
        Date: Sat, 07 Mar 2020 03:13:22 GMT

        OK[!http]
```
Summary of the data path on the master node (the order in which packets reach each device is derived from the capture timestamps; the red block, where curl starts, is the starting point)
Analysis of the captures on the slave node's network devices (only the main flow is shown; the TCP handshake is omitted)
docker0 device
```
...
11:13:22.379401 IP (tos 0x0, ttl 62, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 172.30.78.3.8080: Flags [P.], cksum 0xc781 (correct), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
        GET /header/list HTTP/1.1
...
11:13:22.389173 IP (tos 0x0, ttl 64, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 172.30.22.3.51596: Flags [P.], cksum 0xbcdc (incorrect -> 0xbf97), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
        HTTP/1.1 200
```
flannel.1 device
```
11:13:22.379392 IP (tos 0x0, ttl 63, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 172.30.78.3.8080: Flags [P.], cksum 0xc781 (correct), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
        GET /header/list HTTP/1.1
        Host: clientip:8080
        User-Agent: curl/7.67.0
        Accept: */*
...
11:13:22.389192 IP (tos 0x0, ttl 63, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 172.30.22.3.51596: Flags [P.], cksum 0xbcdc (incorrect -> 0xbf97), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
        HTTP/1.1 200
        Content-Type: text/plain;charset=UTF-8
        Content-Length: 2
        Date: Sat, 07 Mar 2020 03:13:22 GMT

        OK[!http]
```
wlo1 NIC
```
11:13:22.379300 IP (tos 0x0, ttl 64, id 22491, offset 0, flags [none], proto UDP (17), length 190)
    192.168.0.107.33404 > 192.168.0.114.8472: [udp sum ok] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 23895, offset 0, flags [DF], proto TCP (6), length 140)
    172.30.22.3.51596 > 172.30.78.3.8080: Flags [P.], cksum 0xc781 (correct), seq 1:89, ack 1, win 507, options [nop,nop,TS val 56684247 ecr 2651496809], length 88: HTTP, length: 88
        GET /header/list HTTP/1.1
        Host: clientip:8080
        User-Agent: curl/7.67.0
        Accept: */*
...
11:13:22.389223 IP (tos 0x0, ttl 64, id 57065, offset 0, flags [none], proto UDP (17), length 217)
    192.168.0.114.43021 > 192.168.0.107.8472: [udp sum ok] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 63303, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 172.30.22.3.51596: Flags [P.], cksum 0xbf97 (correct), seq 1:116, ack 89, win 502, options [nop,nop,TS val 2651496823 ecr 56684247], length 115: HTTP, length: 115
        HTTP/1.1 200
        Content-Type: text/plain;charset=UTF-8
        Content-Length: 2
        Date: Sat, 07 Mar 2020 03:13:22 GMT

        OK[!http]
...
```
Request handling on the slave node (the order is derived from the capture timestamps; the red block, where the request enters, is the starting point)
Start a single web service on slave only, set its type to NodePort with nodePort set to 8086, and access the web service from the master host with curl http://slaveIp:8086/header/list. (If we accessed it directly from slave, no data would need to travel between hosts and we could not see the packets on slave's physical NIC, so for the analysis we access it from master.)
When a NodePort type service is created, Kubernetes picks a port from the range given by the API Server parameter --service-node-port-range and assigns it to the service; you can also specify it yourself via .spec.ports[*].nodePort. Kubernetes then listens on that port on every node in the cluster.
Besides listening on the port on all nodes, kubernetes also automatically creates a ClusterIP type service for us, so after creating a NodePort service you can still access it inside the cluster by service name + service port, just as in the previous example.
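To double-check that a node really holds the node port, you can look for a listener on it. A minimal sketch, assuming ss is available and that this kube-proxy version opens a socket to hold node ports (newer releases may only create the IPVS/ipset entries without a visible listener):

```
# on any node: check whether kube-proxy is holding node port 8086
ss -ltnp | grep 8086
```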
In this case the packets do not have to travel between pods across hosts, so they never pass through flannel.1. The packet path is: master.wlp3s0 -> slave.wlo1 -> slave.docker0 -> (pod) -> slave.docker0 -> slave.wlo1 -> master.wlp3s0
Modify the web manifest
```
$ cat > web.yml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: clientip
spec:
  type: NodePort
  selector:
    app: clientip
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    nodePort: 8086
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clientip-deployment
spec:
  selector:
    matchLabels:
      app: clientip
  replicas: 1
  template:
    metadata:
      labels:
        app: clientip
    spec:
      nodeSelector:
        sample: slave
      containers:
      - name: clientip
        image: 192.168.0.107/k8s/client-ip-test:0.0.2
        ports:
        - containerPort: 8080
EOF
```
Start the service
```
$ kubectl create -f web.yml
service/clientip created
deployment.apps/clientip-deployment created
$ kubectl get pod -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
clientip-deployment-68c57b7965-28w4t   1/1     Running   0          10s   172.30.78.3   slave   <none>           <none>
$ kubectl get svc -o wide
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE   SELECTOR
clientip     NodePort    10.254.85.24   <none>        8080:8086/TCP    17s   app=clientip
kubernetes   ClusterIP   10.254.0.1     <none>        443/TCP          27d   <none>
```
Watch the web service log
```
$ kubectl logs -f clientip-deployment-68c57b7965-28w4t
```
Watch the capture on the master's wlp3s0 NIC
```
$ tcpdump -n -vv -i wlp3s0
```
Watch the captures on each network device of the slave node
```
$ tcpdump -n -vv -i docker0
$ tcpdump -n -vv -i flannel.1
$ tcpdump -n -vv -i wlo1
```
Web log
```
2020-03-08 10:15:01.498  INFO 6 --- [nio-8080-exec-2] c.falcon.clientip.ClientIpController : host is192.168.0.114:8086
2020-03-08 10:15:01.499  INFO 6 --- [nio-8080-exec-2] c.falcon.clientip.ClientIpController : remoteAddr is 172.30.78.1
2020-03-08 10:15:01.499  INFO 6 --- [nio-8080-exec-2] c.falcon.clientip.ClientIpController : remotePort is 38362
```
docker0 device capture
```
...
10:15:01.494019 IP (tos 0x0, ttl 63, id 41431, offset 0, flags [DF], proto TCP (6), length 145)
    172.30.78.1.38362 > 172.30.78.3.8080: Flags [P.], cksum 0x171b (correct), seq 0:93, ack 1, win 502, options [nop,nop,TS val 670876057 ecr 1109116899], length 93: HTTP, length: 93
        GET /header/list HTTP/1.1
        Host: 192.168.0.114:8086
        User-Agent: curl/7.58.0
        Accept: */*
...
10:15:01.503806 IP (tos 0x0, ttl 64, id 34492, offset 0, flags [DF], proto TCP (6), length 167)
    172.30.78.3.8080 > 192.168.0.107.38362: Flags [P.], cksum 0xbbce (incorrect -> 0x0f9e), seq 1:116, ack 94, win 502, options [nop,nop,TS val 1109116911 ecr 670876057], length 115: HTTP, length: 115
        HTTP/1.1 200
        Content-Type: text/plain;charset=UTF-8
        Content-Length: 2
        Date: Sun, 08 Mar 2020 02:15:01 GMT

        OK[!http]
...
```
First entry: the request enters 172.30.78.3:8080 through docker0. Note that at this point the request comes from 172.30.78.1:38362, which is exactly the remoteAddr we saw in the web container; the reason is explained in the request-flow discussion below.
Second entry: after the request has been handled, the response goes back out through docker0, and its destination has changed back to the address that actually issued the request, 192.168.0.107:38362.
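The NAT bookkeeping behind this address swap can be inspected in the kernel's connection-tracking table. A sketch, assuming the conntrack CLI (conntrack-tools) is installed on slave and that IPVS conntrack integration (net.ipv4.vs.conntrack=1, normally set by kube-proxy) is enabled; grepping for the source port just narrows the output to this request:

```
# on slave: the entry for this connection should show the original tuple
# 192.168.0.107 -> 192.168.0.114:8086 alongside the translated tuple toward
# 172.30.78.3:8080 with the rewritten source 172.30.78.1
conntrack -L 2>/dev/null | grep 38362
```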
flannel.1 device capture
```
$ tcpdump -n -vv -i flannel.1
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
```
wlo1 physical NIC capture
```
...
10:15:01.493998 IP (tos 0x0, ttl 64, id 41431, offset 0, flags [DF], proto TCP (6), length 145)
    192.168.0.107.38362 > 192.168.0.114.8086: Flags [P.], cksum 0x8928 (correct), seq 1:94, ack 1, win 502, options [nop,nop,TS val 670876057 ecr 1109116899], length 93
...
10:15:01.503827 IP (tos 0x0, ttl 63, id 34492, offset 0, flags [DF], proto TCP (6), length 167)
    192.168.0.114.8086 > 192.168.0.107.38362: Flags [P.], cksum 0x489f (correct), seq 1:116, ack 94, win 502, options [nop,nop,TS val 1109116911 ecr 670876057], length 115
...
```
master wlp3s0 NIC capture
```
...
10:15:01.447172 IP (tos 0x0, ttl 64, id 41431, offset 0, flags [DF], proto TCP (6), length 145)
    192.168.0.107.38362 > 192.168.0.114.8086: Flags [P.], cksum 0x82b1 (incorrect -> 0x8928), seq 1:94, ack 1, win 502, options [nop,nop,TS val 670876057 ecr 1109116899], length 93
...
10:15:01.460324 IP (tos 0x0, ttl 63, id 34492, offset 0, flags [DF], proto TCP (6), length 167)
    192.168.0.114.8086 > 192.168.0.107.38362: Flags [P.], cksum 0x489f (correct), seq 1:116, ack 94, win 502, options [nop,nop,TS val 1109116911 ecr 670876057], length 115
...
```
Summary of the request handling on slave (the red block, where the request enters, is the starting point)
The leg from master to slave is the same as an ordinary request and is not described again here.
In the POSTROUTING stage a masquerade is performed according to the iptables rules (for why it is performed, see the section on the reason for the masquerade below). Routing is then performed: according to slave's routing table, packets destined for 172.30.78.3:8080 have to go out through docker0, and because of how masquerade works, the source address is rewritten to the address of the docker0 device when the packet is sent.
```
$ ip addr
6: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0d:ab:b0:60 brd ff:ff:ff:ff:ff:ff
    inet 172.30.78.1/24 brd 172.30.78.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:dff:feab:b060/64 scope link
       valid_lft forever preferred_lft forever
```
That address is 172.30.78.1, which is why the requests we saw in the web log and in the docker0 capture come from 172.30.78.1.
After the web container handles the request and builds the response, the return path finds that the request came in through a masquerade, so the real originator recorded before the masquerade is looked up and the packet is addressed back to 192.168.0.107:38362. According to the routing table, packets for 192.168.0.107:38362 have to leave through the physical NIC wlo1, so the packet is handed to wlo1; before it goes out, IPVS performs its masquerade, rewriting the source address to 192.168.0.114, and sends it to master through wlo1.
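The routing decisions on slave that drive these address choices can be checked directly with iproute2's `ip route get`; the expected results noted in the comments are what slave's route table above implies:

```
# which interface and source address slave uses toward the pod:
# expected to resolve via docker0 with source 172.30.78.1
ip route get 172.30.78.3

# which interface slave uses toward the original client on master:
# expected to resolve via the physical NIC wlo1
ip route get 192.168.0.107
```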
How IPVS decides that 192.168.0.114:8086 is a cluster service
We know that IPVS makes this decision from the contents of its own hash table, so kubernetes only has to write the cluster-service information into the IPVS hash table. Use the ipvsadm tool to see what kubernetes stores in this table after a NodePort service is started (unrelated entries are omitted from the output below).
```
$ ipvsadm --list
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  localhost:8086 rr
  -> 172.30.78.3:http-alt         Masq    1      0          0
TCP  slave:8086 rr
  -> 172.30.78.3:http-alt         Masq    1      0          0
TCP  promote.cache-dns.local:http rr
  -> 172.30.78.3:http-alt         Masq    1      0          0
...
```
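The hostnames in this output (localhost, slave, promote.cache-dns.local) come from reverse name resolution; adding -n keeps the addresses numeric, and a single virtual service can be queried directly. A sketch; the exact entries depend on this example's service (192.168.0.114:8086 being node IP + nodePort, 10.254.85.24:8080 the cluster IP):

```
# numeric listing of all IPVS virtual services
ipvsadm -Ln

# list only the virtual service for the node port on slave's address
ipvsadm -Ln -t 192.168.0.114:8086
```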
Why the request is masqueraded in the POSTROUTING stage
First, look at the ipsets kubernetes creates for us when IPVS mode is used, and what each one is for.
set name | members | usage |
---|---|---|
KUBE-CLUSTER-IP | All service IP + port | Mark-Masq for cases that masquerade-all=true or clusterCIDR specified |
KUBE-LOOP-BACK | All service IP + port + IP | masquerade for solving hairpin purpose |
KUBE-EXTERNAL-IP | service external IP + port | masquerade for packages to external IPs |
KUBE-LOAD-BALANCER | load balancer ingress IP + port | masquerade for packages to load balancer type service |
KUBE-LOAD-BALANCER-LOCAL | LB ingress IP + port with externalTrafficPolicy=local | accept packages to load balancer with externalTrafficPolicy=local |
KUBE-LOAD-BALANCER-FW | load balancer ingress IP + port with loadBalancerSourceRanges | package filter for load balancer with loadBalancerSourceRanges specified |
KUBE-LOAD-BALANCER-SOURCE-CIDR | load balancer ingress IP + port + source CIDR | package filter for load balancer with loadBalancerSourceRanges specified |
KUBE-NODE-PORT-TCP | nodeport type service TCP port | masquerade for packets to nodePort(TCP) |
KUBE-NODE-PORT-LOCAL-TCP | nodeport type service TCP port with externalTrafficPolicy=local | accept packages to nodeport service with externalTrafficPolicy=local |
KUBE-NODE-PORT-UDP | nodeport type service UDP port | masquerade for packets to nodePort(UDP) |
KUBE-NODE-PORT-LOCAL-UDP | nodeport type service UDP port with externalTrafficPolicy=local | accept packages to nodeport service with externalTrafficPolicy=local |
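These sets can be inspected on any node with the ipset tool. A small sketch; which sets exist and what they contain depends on the services defined in the cluster (here, KUBE-CLUSTER-IP should hold the service cluster IP + port pairs and KUBE-LOOP-BACK the endpoint IP,port,IP triples):

```
# cluster IP + port entries matched by the KUBE-SERVICES rules
ipset list KUBE-CLUSTER-IP

# endpoint triples used for the hairpin masquerade in KUBE-POSTROUTING
ipset list KUBE-LOOP-BACK
```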
Next, we need to know how kubernetes makes use of these ipsets, so look at the rules kubernetes appends to iptables.
The output below was produced with the kube-proxy parameters iptables.masqueradeAll=false and clusterCIDR=172.30.0.0/16; with other settings the KUBE-SERVICES chain will look slightly different. The output has been trimmed to show only the kubernetes-related rules.
```
$ iptables -n -L -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

Chain KUBE-FIREWALL (0 references)
target     prot opt source               destination
KUBE-MARK-DROP  all  --  0.0.0.0/0            0.0.0.0/0

Chain KUBE-KUBELET-CANARY (0 references)
target     prot opt source               destination

Chain KUBE-LOAD-BALANCER (0 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  0.0.0.0/0            0.0.0.0/0

Chain KUBE-MARK-DROP (1 references)
target     prot opt source               destination

Chain KUBE-MARK-MASQ (2 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

Chain KUBE-NODE-PORT (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  tcp  --  0.0.0.0/0            0.0.0.0/0            /* Kubernetes nodeport TCP port for masquerade purpose */ match-set KUBE-NODE-PORT-TCP dst

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-LOOP-BACK dst,dst,src

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  -- !172.30.0.0/16        0.0.0.0/0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
KUBE-NODE-PORT  all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst
```
PREROUTING stage
In the KUBE-NODE-PORT chain, the rule checks whether the destination port of the request is in the KUBE-NODE-PORT-TCP ipset; if it is, the packet jumps to KUBE-MARK-MASQ. Look at what this ipset contains after our NodePort service is started:
```
$ ipset --list KUBE-NODE-PORT-TCP
Name: KUBE-NODE-PORT-TCP
Type: bitmap:port
Revision: 3
Header: range 0-65535
Size in memory: 8268
References: 1
Number of entries: 1
Members:
8086
```
kubernetes has indeed stored the node port of the service we created in this set.
The KUBE-MARK-MASQ chain puts the 0x4000 mark on every packet that enters it; in the POSTROUTING stage, packets carrying this mark then match the MASQUERADE rule in the KUBE-POSTROUTING chain, which is exactly the masquerade we observed above.
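The mark-and-masquerade pair can also be read back in rule-spec form, which makes the link between the two chains explicit. A sketch of the commands; the comments describe what the rules listed above should correspond to:

```
# the chain that sets the 0x4000 mark (the "MARK or 0x4000" rule above)
iptables -t nat -S KUBE-MARK-MASQ

# the chain whose MASQUERADE rule matches mark 0x4000/0x4000 in POSTROUTING
iptables -t nat -S KUBE-POSTROUTING
```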
In summary, this article has used three examples, analyzing the packets on each network device with tcpdump, to describe how kubernetes handles network requests in different situations. The last example also used the ipsets and iptables rules that kubernetes creates for us to explain how service access is implemented. For the first two examples, readers can apply the same approach together with the iptables rule chains to verify the packet flow.
In addition, in the last example the web service can also be reached through port 8086 on the cluster's master node. In that case the packets also pass through the flannel.1 devices of both nodes, but not through master.docker0, and the remoteAddr seen by the web service will be different. Only the packet path is given below; the detailed tcpdump output is omitted.
Request:
master.wlp3s0->master.flannel.1->master.wlp3s0->slave.wlo1->slave.flannel.1->slave.docker0
Response:
slave.docker0->slave.flannel.1->slave.wlo1->master.wlp3s0->master.flannel.1-> master.wlp3s0
Readers can try this variant themselves, as sketched below, and analyze the packet flow with tcpdump together with the iptables rules.
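A possible way to reproduce it, assuming the NodePort service from the last example is still running (192.168.0.107 is the master IP and 8086 the nodePort used above):

```
# on master and on slave, in separate terminals, watch the overlay device
tcpdump -n -i flannel.1

# on the master host, access the service through master's own node port
curl http://192.168.0.107:8086/header/list
```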
A final thought: why does kubernetes go to the trouble of masquerading requests that arrive through a node port? When a NodePort service is created, kubernetes does not only serve it on the nodes where the service's endpoints live; every node in the cluster serves it on that port. So when we access the cluster service from outside through a node port, the pod backing the service may well not be on the node we hit. Without the masquerade, the real endpoint would see the real client IP when handling the request, and its response would not first go back to the node the client originally contacted but straight to the client; the client would then receive a response from a server other than the address it sent the request to and would treat it as invalid. To make sure the client gets its response from the node it contacted, requests arriving at a node port from outside are therefore uniformly masqueraded; the response first returns to the node the client contacted, and that node returns it to the client. If business requirements such as auditing mean you must obtain the client's real IP, you can consider the following three approaches:
The kubernetes documentation discusses this:
Using Source IP
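One of the options discussed in that document, shown here only as an illustration (not necessarily the exact list the author had in mind), is to set externalTrafficPolicy: Local on the service; this matches the KUBE-NODE-PORT-LOCAL-TCP set in the table above: traffic is then only accepted on nodes that actually run an endpoint, no masquerade is applied, and the pod sees the real client IP. A hedged sketch in the same style as web.yml (the file name web-local.yml is just for illustration):

```
$ cat > web-local.yml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: clientip
spec:
  type: NodePort
  # only nodes running a clientip endpoint answer on the node port,
  # and the client source IP is preserved (no masquerade)
  externalTrafficPolicy: Local
  selector:
    app: clientip
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    nodePort: 8086
EOF
```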