Advanced scheduling mechanisms fall into the following two categories:
Node selectors: nodeSelector, nodeName
Node affinity: nodeAffinity
Both are handled by the scheduler's logic.
1 Node selectors
nodeSelector, nodeName, NodeAffinity
To schedule a pod onto one specific node, simply set the node name (nodeName); the pod can then only ever be placed on that node.
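A minimal sketch of the nodeName approach (the pod name here is made up for illustration; node1.test.k8s.com is one of the nodes used later in this environment). Setting spec.nodeName bypasses the scheduler entirely: the kubelet on that node runs the pod directly.
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename-demo        # hypothetical name, for illustration only
spec:
  nodeName: node1.test.k8s.com   # bind directly to this node; the scheduler is skipped
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1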
If a whole class of nodes qualifies, use nodeSelector instead: put a label on the relevant nodes and match that label in the pod spec. This greatly narrows the set of candidate nodes.
nodeSelector
Example: find the nodes labeled disktype=ssd
[root@master k8s]# mkdir schedule
[root@master k8s]# cd schedule/
[root@master schedule]# ll
total 0
[root@master schedule]# cp ../pod-demo.yaml .
[root@master schedule]#
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    imagePullPolicy: IfNotPresent
  nodeSelector:        # uses the MatchNodeSelector predicate: checks whether a node with the disktype=ssd label exists
    disktype: ssd
Check the node labels:
[root@master schedule]# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
master.test.k8s.com Ready master 2d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=master.test.k8s.com,node-role.kubernetes.io/master=
node1.test.k8s.com Ready <none> 2d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node1.test.k8s.com
node2.test.k8s.com Ready <none> 2d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node2.test.k8s.com
[root@master schedule]#
If one of the nodes is given the label, the pod is guaranteed to be created on that node.
If no node carries the label, the pod stays in the Pending state:
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
pod-demo 0/1 Pending 0 47s <none> <none> <none>
[root@master schedule]#
Scheduling cannot succeed: nodeSelector is a hard constraint, so its condition must be satisfied.
Check the details with describe:
Events:
Type     Reason            Age                From               Message
----     ------            ----               ----               -------
Warning  FailedScheduling  18s (x25 over 1m)  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
That is, until a node is given the label:
[root@master schedule]# kubectl label nodes node1.test.k8s.com disktype=ssd
node/node1.test.k8s.com labeled
Check again:
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
pod-demo 1/1 Running 0 4m 10.244.1.153 node1.test.k8s.com <none>
[root@master schedule]#
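Note that nodeSelector is only evaluated at scheduling time: removing the label afterwards (for example with the command below) does not affect the pod that is already running.
[root@master schedule]# kubectl label nodes node1.test.k8s.com disktype-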
NodeAffinity
Node affinity is similar to nodeSelector: it can restrict which nodes a pod runs on, and it can also express a preference for particular nodes.
Usage:
[root@master schedule]# kubectl explain pods.spec.affinity
KIND: Pod
VERSION: v1
RESOURCE: affinity <Object>
[root@master schedule]# kubectl explain pods.spec.affinity.nodeAffinity | grep '<'
RESOURCE: nodeAffinity <Object>
preferredDuringSchedulingIgnoredDuringExecution <[]Object> # its value is a list of objects
requiredDuringSchedulingIgnoredDuringExecution <Object>
The two kinds of node affinity:
requiredDuringSchedulingIgnoredDuringExecution: hard affinity; the condition must be satisfied
preferredDuringSchedulingIgnoredDuringExecution: soft affinity; satisfy the condition if possible, otherwise run on some other node
Define a hard affinity with requiredDuringSchedulingIgnoredDuringExecution.
The placement is decided by zone: if a node carries this label with one of the listed values, the pod is created there.
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone              # if a node carries this key with one of the values below, the pod can be created there
            operator: In
            values:
            - foo
            - bar
When applied, the pod stays Pending: this is hard affinity, and no node currently carries the zone label.
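As a quick check (the zone label is only added officially later in this walkthrough), labeling one node would let the pod schedule immediately:
[root@master schedule]# kubectl label nodes node1.test.k8s.com zone=foo
[root@master schedule]# kubectl get pods -o wide    # pod-demo should now be scheduled onto node1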
Node soft affinity
[root@master schedule]# kubectl explain pods.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution
KIND: Pod
VERSION: v1
The usage can be seen here:
preference <Object> -required-
A node selector term, associated with the corresponding weight.
weight <integer> -required- # give each preference object (which nodes to prefer) a weight
Weight associated with matching the corresponding nodeSelectorTerm, in the
range 1-100.
[root@master schedule]# cat preferred-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - foo
            - bar
        weight: 60
No label matches, but the pod still runs as usual:
[root@master schedule]# kubectl get pods
NAME READY STATUS RESTARTS AGE
pod-demo 1/1 Running 0 1m
[root@master schedule]#
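When several preference terms are listed, the scheduler adds up the weights of the terms each node satisfies and favors the node with the highest total. A rough sketch (the weights and zone values here are placeholders):
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values: ["foo"]
      - weight: 20
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values: ["bar"]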
Pod affinity
Compared with node affinity, pod affinity does not name nodes directly; it places a pod relative to where other pods are running, so a notion of "location" is needed.
If the node name defines the location, then obviously every node is different, and each node is its own unique location.
So another criterion is used: a label defines the location, and nodes carrying the same label value count as the same location. Only then can the scheduler decide which placements satisfy the affinity and the other scheduling properties.
Pod affinity also comes in hard and soft forms, as shown below:
[root@master schedule]# kubectl explain pods.spec.affinity.podAffinity
preferredDuringSchedulingIgnoredDuringExecution
requiredDuringSchedulingIgnoredDuringExecution
[root@master schedule]# kubectl explain pods.spec.affinity.podAffinity.preferredDuringSchedulingIgnoredDuringExecution
podAffinityTerm <Object> -required-
weight <integer> -required-
[root@master schedule]# kubectl explain pods.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution
labelSelector
namespaces
topologyKey
Define the pods
The first resource:
apiVersion: v1
kind: Pod
metadata:
  name: pod-first
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
Define multiple resources:
[root@master schedule]# cat pod-first.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-first
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-second
  labels:
    app: db
    tier: db
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["/bin/sh","-c","sleep 360000"]
Every node automatically gets a label named after its hostname (kubernetes.io/hostname):
[root@master schedule]# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
master.test.k8s.com Ready master 3d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=master.test.k8s.com,node-role.kubernetes.io/master=
node1.test.k8s.com Ready <none> 3d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/hostname=node1.test.k8s.com
node2.test.k8s.com Ready <none> 3d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node2.test.k8s.com
Next, define the affinity.
topologyKey: kubernetes.io/hostname means that nodes with the same hostname count as the same location; since every node's hostname is different, each node is its own location here.
As follows:
[root@master schedule]# cat pod-first.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-first
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    imagePullPolicy: IfNotPresent
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-second
  labels:
    app: backend
    tier: db
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["/bin/sh","-c","sleep 360000"]
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:     # define the (hard) pod affinity
      - labelSelector:
          matchExpressions:                               # which pods to match; binds this pod to pods carrying these labels
          - {key: app, operator: In, values: ["myapp"]}   # pods labeled app=myapp define the target location
        topologyKey: kubernetes.io/hostname
By default scheduling is balanced: the two worker nodes are treated as equal, and the priority functions favor the one with the least resource usage. pod-first can therefore land on either node, and pod-second has to follow it:
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
pod-first 1/1 Running 0 2m 10.244.2.56 node2.test.k8s.com <none>
pod-second 1/1 Running 0 35s 10.244.2.57 node2.test.k8s.com <none>
[root@master schedule]#
[root@master schedule]# kubectl describe pod pod-second
Check how it was scheduled:
---- ------ ---- ---- -------
Normal Scheduled 3m default-scheduler Successfully assigned default/pod-second to node2.test.k8s.com # clearly shows it was scheduled to node2
Normal Pulled 3m kubelet, node2.test.k8s.com Container image "busybox:latest" already present on machine
Normal Created 3m kubelet, node2.test.k8s.com Created container
Normal Started 3m kubelet, node2.test.k8s.com Started container
If soft affinity had been used, the pod might have been scheduled to another node, since the policy would not be mandatory.
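For comparison, the soft form of the same rule would look roughly like this (a sketch, not applied in this walkthrough); note that the term sits under podAffinityTerm and carries a weight:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - {key: app, operator: In, values: ["myapp"]}
          topologyKey: kubernetes.io/hostname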
Anti-affinity
Anti-affinity is the inverse: pods matching the selector must not share the same location, i.e. they must not end up under the same topologyKey value.
Change the manifest as follows:
  affinity:
    podAntiAffinity:                                      # changed to anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {key: app, operator: In, values: ["myapp"]}
        topologyKey: kubernetes.io/hostname
[root@master schedule]# kubectl apply -f pod-first.yaml
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
pod-first 1/1 Running 0 13s 10.244.1.161 node1.test.k8s.com <none>
pod-second 1/1 Running 0 13s 10.244.2.58 node2.test.k8s.com <none>
As expected, if pod-first runs on a node, pod-second must not run on that same node.
Label the nodes:
[root@master schedule]# kubectl label nodes node1.test.k8s.com zone=foo
node/node1.test.k8s.com labeled
[root@master schedule]# kubectl label nodes node2.test.k8s.com zone=foo
node/node2.test.k8s.com labeled
[root@master schedule]#
Change the topologyKey
Edit the manifest again:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {key: app, operator: In, values: ["myapp"]}
        topologyKey: zone     # nodes sharing the same zone label value count as one location, which anti-affinity rules out
Recreate the pods:
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
pod-first 1/1 Running 0 5s 10.244.2.59 node2.test.k8s.com <none>
pod-second 0/1 Pending 0 5s <none> <none> <none>
[root@master schedule]#
pod-second is Pending. When it is scheduled, its anti-affinity is evaluated against topologyKey: zone: both nodes carry zone=foo, so they both count as the same location as pod-first, and the anti-affinity rule forbids running there.
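One way to let pod-second run again (an extra experiment, not part of the original steps) is to move one node into a different zone, so it no longer counts as the same location as pod-first:
[root@master schedule]# kubectl label nodes node1.test.k8s.com zone=bar --overwrite
[root@master schedule]# kubectl get pods -o wide    # pod-second should now be able to land on node1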
Taints and tolerations
Taints turn the relationship around: with affinity the pod chooses its nodes, whereas taints give the node the say, letting it decide which pods may be scheduled onto it.
Defining taints
Taints are defined in node.spec:
[root@master schedule]# kubectl explain node.spec.taints
Look at a node's detailed description:
[root@master schedule]# kubectl get nodes node1.test.k8s.com -o yaml
Find the spec section:
spec:
  podCIDR: 10.244.1.0/24
taints is a list of objects that defines the node's taints.
Defining a taint
The key field is effect, which is required: it specifies what happens to a pod that does not tolerate the taint.
effect defines how the taint repels pods; the valid values are listed below.
[root@master schedule]# kubectl explain node.spec.taints.effect
KIND: Node
VERSION: v1
FIELD: effect <string>
DESCRIPTION:
Required. The effect of the taint on pods that do not tolerate the taint.
Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
NoSchedule: only affects the scheduling process; pods already running on the node are not affected.
PreferNoSchedule: a softer version of NoSchedule; the scheduler tries to avoid placing pods that do not tolerate the taint, but this is not guaranteed.
NoExecute: affects both scheduling and the pods already on the node; pods that cannot tolerate the taint are evicted.
When a node has taints, whether a pod can be scheduled onto it is decided by first checking which of the taints are matched by the pod's tolerations.
For example, if the first taint is matched by the first toleration, the remaining taints are checked next; for any taint that is not tolerated, its effect decides the outcome: NoSchedule only blocks new scheduling, while NoExecute also evicts running pods.
Check the taints on the nodes:
[root@master schedule]# kubectl describe node master.test.k8s.com | grep -i taints
Taints: node-role.kubernetes.io/master:NoSchedule # any pod that cannot tolerate this taint will not be scheduled here
The master carries a NoSchedule taint, which is precisely why pods are not scheduled onto the master.
That is why, among all the pods running in the cluster, none have ever been scheduled to the master: they never define a toleration for its taint.
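(If a pod did need to run on the master, it would have to carry a toleration for that taint, roughly like the sketch below; the system pods examined next carry tolerations of this kind.)
  tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"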
For example, look at kube-apiserver-master:
[root@master schedule]# kubectl describe pod kube-apiserver-master.test.k8s.com -n kube-system
The output includes the following. Tolerations lists the pod's tolerations; here the key is empty, which matches every taint, with effect NoExecute, so this pod tolerates any NoExecute taint and will not be evicted because of one:
Tolerations: :NoExecute # empty key matches all taints; every NoExecute taint is tolerated
Check the kube-proxy pod:
[root@master schedule]# kubectl describe pod kube-proxy-lncxb -n kube-system
Its tolerations are more explicit:
Tolerations:
CriticalAddonsOnly # critical add-ons
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
All of the above are taken into account when the toleration check is performed.
Add a taint to a node:
[root@master ~]# kubectl taint node node1.test.k8s.com node-type=production:NoSchedule # pods that do not tolerate this taint will not be scheduled onto this node
node/node1.test.k8s.com tainted
[root@master ~]# kubectl describe nodes node1.test.k8s.com | grep -i taint
Taints: node-type=production:NoSchedule
node1 now carries the taint; from now on, pods without a matching toleration will not be scheduled onto node1.
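For reference, the taint could later be removed again by appending a minus sign to the effect:
[root@master ~]# kubectl taint node node1.test.k8s.com node-type:NoSchedule-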
Define the following.
The 3 pods in this manifest have no tolerations:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deploy
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      release: cancary
  template:
    metadata:
      labels:
        app: myapp
        release: cancary
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v2
        ports:
        - name: http
          containerPort: 80
So they all end up on node2: no tolerations are defined for them, and they cannot tolerate node1's taint.
[root@master daemonset]# kubectl apply -f deploy.yaml
[root@master daemonset]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
myapp-deploy-86c975f8b-7x6m7 1/1 Running 0 4s 10.244.2.62 node2.test.k8s.com <none>
myapp-deploy-86c975f8b-bk9c7 1/1 Running 0 4s 10.244.2.61 node2.test.k8s.com <none>
myapp-deploy-86c975f8b-rpd84 1/1 Running 0 4s 10.244.2.60 node2.test.k8s.com <none>
Now add a taint to node2 as well and watch the effect; this time the effect is NoExecute.
As shown below, the pods all end up Pending (the running pods are evicted, and their replacements cannot be scheduled anywhere):
[root@master daemonset]# kubectl taint node node2.test.k8s.com node-type=production:NoExecute
node/node2.test.k8s.com tainted
[root@master daemonset]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
myapp-deploy-86c975f8b-4sd6c 0/1 Pending 0 11s <none> <none> <none>
myapp-deploy-86c975f8b-nf985 0/1 Pending 0 11s <none> <none> <none>
myapp-deploy-86c975f8b-vx2h2 0/1 Pending 0 11s <none> <none> <none>
[root@master daemonset]#
Add pod tolerations
All that is needed is to declare which taints the pods tolerate; each toleration is an element of a list.
[root@master daemonset]# kubectl explain pods.spec.tolerations
KIND: Pod
VERSION: v1
RESOURCE: tolerations <[]Object>
tolerationSeconds: how long a running pod keeps tolerating the taint before it is evicted. If unset, the taint is tolerated forever; zero or negative values mean evict immediately.
tolerationSeconds <integer>
TolerationSeconds represents the period of time the toleration (which must
be of effect NoExecute, otherwise this field is ignored) tolerates the
taint. By default, it is not set, which means tolerate the taint forever
(do not evict). Zero and negative values will be treated as 0 (evict
immediately) by the system.
The operator field:
operator <string>
Operator represents a key's relationship to the value. Valid operators are
Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard for
value, so that a pod can tolerate all taints of a particular category.
Exists: only checks that the taint's key exists (any value is tolerated).
Equal: the toleration's value must exactly equal the taint's value (an equality comparison).
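For example, a toleration that uses Exists and omits both value and effect would match any taint whose key is node-type, regardless of its value or effect (a sketch based on the taints used in this walkthrough):
  tolerations:
  - key: "node-type"
    operator: "Exists"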
Define the toleration for the node-type=production taint:
[root@master schedule]# cat deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deploy
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      release: cancary
  template:
    metadata:
      labels:
        app: myapp
        release: cancary
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v2
        ports:
        - name: http
          containerPort: 80
      tolerations:
      - key: "node-type"
        operator: "Equal"
        value: "production"
        effect: "NoExecute"
        tolerationSeconds: 300
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
myapp-deploy-595c744cf7-6cll6 1/1 Running 0 16s 10.244.2.65 node2.test.k8s.com <none>
myapp-deploy-595c744cf7-fwgqr 1/1 Running 0 16s 10.244.2.63 node2.test.k8s.com <none>
myapp-deploy-595c744cf7-hhdfq 1/1 Running 0 16s 10.244.2.64 node2.test.k8s.com <none>
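The pods land only on node2 because their toleration matches node2's node-type=production:NoExecute taint, while node1's node-type=production:NoSchedule taint is still not tolerated (the effect differs). Also note that tolerationSeconds: 300 means the NoExecute taint is only tolerated for 300 seconds, after which the pods would be evicted again. Adding an extra entry like the sketch below to the tolerations list (not applied above) would make node1 schedulable as well:
      - key: "node-type"
        operator: "Equal"
        value: "production"
        effect: "NoSchedule"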