Kubernetes Scheduling: Taints and Tolerations

Date: 2020-08-10

Tags: k8s, scheduling, taints, tolerations


Taints and Tolerations

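In "Node Affinity in K8S" we covered NodeAffinity, a property defined on a pod that lets the pod be scheduled onto certain nodes. A taint is the opposite: it lets a node reject pods.
Taints are used together with tolerations to keep pods away from unsuitable nodes. Once one or more taints are set on a node, a pod cannot run on that node unless it explicitly declares that it tolerates those taints. A toleration is a property of the pod that allows (allows, but does not require) the pod to run on a node carrying the taint.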

Basic usage

Setting a taint:

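kubectl taint node [node] key=value:[effect]
          [effect] is one of: [ NoSchedule | PreferNoSchedule | NoExecute ]
           NoSchedule: pods that do not tolerate the taint are never scheduled onto the node.
           PreferNoSchedule: the scheduler tries to avoid placing such pods on the node.
           NoExecute: such pods are not scheduled, and those already running on the node are evicted.

    # Example:
      kubectl taint node 10.3.1.16 test=16:NoSchedule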

Removing a taint:

    # Suppose these taints have been set:
     kubectl taint node 10.3.1.16 test=16:NoSchedule
     kubectl taint node 10.3.1.16 test=16:NoExecute

    # Remove the taint with a given key and effect:
     kubectl taint nodes node_name key:[effect]-    # (no value is needed for the key here)

    # Remove every taint with a given key, whatever the effect:
     kubectl taint nodes node_name key-

    # Examples:
     kubectl taint node 10.3.1.16 test:NoSchedule-
     kubectl taint node 10.3.1.16 test:NoExecute-
     kubectl taint node 10.3.1.16 test-
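
The argument forms used above (key=value:effect to add, key:effect- or key- to remove) can be made concrete with a small parser. This is only an illustrative sketch in Python, not kubectl's own parsing code; the function name parse_taint_arg and the returned dictionary layout are made up for this example.

```python
EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


def parse_taint_arg(arg: str) -> dict:
    """Split a taint argument into key, value, effect and a remove flag (hypothetical helper)."""
    # A trailing "-" means "remove this taint".
    remove = arg.endswith("-")
    body = arg[:-1] if remove else arg
    # The effect, if present, follows the colon.
    key_value, colon, effect = body.partition(":")
    if colon and effect not in EFFECTS:
        raise ValueError(f"unknown effect: {effect!r}")
    if not colon and not remove:
        raise ValueError("an effect is required when adding a taint")
    # The value, if present, follows the equals sign.
    key, _, value = key_value.partition("=")
    if not key:
        raise ValueError("a key is required")
    return {"key": key, "value": value, "effect": effect, "remove": remove}


for example in ("test=16:NoSchedule", "test:NoExecute-", "test-"):
    print(example, "->", parse_taint_arg(example))
```

An empty effect combined with remove=True stands for "remove this key for every effect", mirroring the key- form.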

Here is a simple example:

Add a taint to node1 whose key is key, whose value is value, and whose effect is NoSchedule. This means that no pod is scheduled onto node1 unless it explicitly declares that it tolerates this taint:

kubectl taint nodes node1  key=value:NoSchedule

Next, declare a toleration on the pod. The toleration below tolerates nodes carrying that taint, so the pod can be scheduled onto node1:

apiVersion: v1
kind: Pod
metadata:
  name: pod-taints
spec:
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  containers:
    - name: pod-taints
      image: busybox:latest

It can also be written as follows:

tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoSchedule"

The key and effect in the pod's toleration must match those of the taint, and one of the following conditions must hold:

- operator is Exists, in which case no value needs to be specified
- operator is Equal and the values are equal

If operator is not specified, it defaults to Equal.

There are also two special cases:

- An empty key with the Exists operator matches every key and value
- An empty effect matches every effect
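
The matching rules above can be summarised in a few lines of Python. This is a simplified model written for this article, not the scheduler's actual code; the function name tolerates and the use of plain dictionaries for taints and tolerations are assumptions made for illustration.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint under the rules listed above."""
    # An empty effect in the toleration matches every effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    if not key:
        # An empty key only makes sense with Exists; it then matches every key and value.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True  # Exists ignores the value
    return toleration.get("value", "") == taint.get("value", "")


taint = {"key": "key", "value": "value", "effect": "NoSchedule"}
print(tolerates({"key": "key", "operator": "Equal", "value": "value", "effect": "NoSchedule"}, taint))
print(tolerates({"key": "key", "operator": "Exists", "effect": "NoSchedule"}, taint))
print(tolerates({"operator": "Exists"}, taint))
print(tolerates({"key": "key", "value": "other", "effect": "NoSchedule"}, taint))
```

The first three calls correspond to the Equal form, the Exists form, and the "empty key with Exists" special case; the last one fails because the values differ.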

About effect

The examples above use NoSchedule as the effect. The possible values of effect are described briefly below:

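NoSchedule: if a pod does not declare a toleration for the taint, the system does not schedule it onto a node that has the taint.
PreferNoSchedule: the soft version of NoSchedule. If a pod does not declare a toleration for the taint, the system tries to avoid scheduling it onto the node, but this is not enforced.
NoExecute: defines eviction behaviour for pods, for example in response to node failures. A NoExecute taint affects pods already running on the node as follows:
- Pods without a matching toleration are evicted immediately
- Pods with a matching toleration that does not set tolerationSeconds stay on the node indefinitely
- Pods with a matching toleration that sets tolerationSeconds are evicted after the specified time

Starting with Kubernetes 1.6, an alpha feature represents node problems as taints (currently only node unreachable and node not ready, which correspond to the NodeCondition "Ready" being Unknown and False). Once the TaintBasedEvictions feature is enabled (by adding TaintBasedEvictions=true to the --feature-gates flag), the NodeController taints the node automatically, and the normal logic that evicts pods based on the "Ready" NodeCondition is disabled. Note that, to preserve the existing rate limiting of pod evictions during node failures, the system adds these taints in a rate-limited way. This prevents mass evictions in situations such as the master temporarily losing contact with the nodes. The feature works together with tolerationSeconds, which lets a pod define how long it stays on a failed node before being evicted.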

Multiple taints and multiple tolerations

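The system allows multiple taints on the same node and multiple tolerations on a pod. The Kubernetes scheduler handles them like a filter: it ignores every taint that is matched by one of the pod's tolerations, and the taints that remain determine the effect on the pod. In particular: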

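- If any remaining taint has effect=NoSchedule, the scheduler does not schedule the pod onto the node.
- If no remaining taint has the NoSchedule effect but at least one has PreferNoSchedule, the scheduler tries not to assign the pod to the node.
- If any remaining taint has the NoExecute effect, the pod is evicted if it is already running on the node, and is not scheduled onto the node if it is not yet running there.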

Here is an example:

kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule

Set two tolerations on the pod:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"

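As a result, the pod cannot be scheduled onto node1, because the third taint has no matching toleration. However, if the pod is already running on node1 when the third taint is added, it keeps running, because the third taint only has the NoSchedule effect and the pod tolerates the first two taints, including the only NoExecute one.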
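
The filtering described above can be replayed on this example with a short Python model. It is a sketch under simplifying assumptions (taints and tolerations as plain dictionaries, a reduced matching function), and the helper names remaining_taints, can_schedule and is_evicted are invented for this illustration.

```python
def tolerates(toleration, taint):
    """Simplified match: effect first, then key, then operator (Equal by default)."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if not toleration.get("key"):
        return toleration.get("operator") == "Exists"
    if toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def remaining_taints(taints, tolerations):
    """Drop every taint matched by some toleration and return what is left."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


def can_schedule(taints, tolerations):
    """A new pod is blocked by any remaining NoSchedule or NoExecute taint."""
    effects = {t["effect"] for t in remaining_taints(taints, tolerations)}
    return not effects & {"NoSchedule", "NoExecute"}


def is_evicted(taints, tolerations):
    """A pod that is already running is evicted only by a remaining NoExecute taint."""
    return any(t["effect"] == "NoExecute" for t in remaining_taints(taints, tolerations))


node1 = [
    {"key": "key1", "value": "value1", "effect": "NoSchedule"},
    {"key": "key1", "value": "value1", "effect": "NoExecute"},
    {"key": "key2", "value": "value2", "effect": "NoSchedule"},
]
pod = [
    {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoSchedule"},
    {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoExecute"},
]
print(remaining_taints(node1, pod))  # only the key2 taint is left
print(can_schedule(node1, pod))      # False: an untolerated NoSchedule taint remains
print(is_evicted(node1, pod))        # False: no untolerated NoExecute taint remains
```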

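Normally, if a taint with effect=NoExecute is added to a node, every pod running on that node without a matching toleration is evicted immediately, while pods with a matching toleration are never evicted. However, a toleration with the NoExecute effect may carry an optional tolerationSeconds field, which specifies how long (in seconds) the pod may keep running on the node after the taint is added: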

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600

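In the example above, if the pod is running and a matching taint is added to its node, the pod stays on the node for 3600 s and is then evicted. If the taint is removed within that grace period, no eviction is triggered.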
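
The timing can be worked out explicitly. The Python sketch below computes when a tolerating pod would be evicted; the function name eviction_time is made up for this example, and the treatment of zero or negative values as "evict immediately" follows the Kubernetes API description of tolerationSeconds rather than anything stated in this article.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def eviction_time(taint_added_at: datetime, toleration_seconds: Optional[int]) -> Optional[datetime]:
    """Return the eviction time for a pod that tolerates a NoExecute taint.

    None means the pod is never evicted because tolerationSeconds is unset.
    """
    if toleration_seconds is None:
        return None
    # Zero or negative values are treated as 0, meaning immediate eviction.
    return taint_added_at + timedelta(seconds=max(toleration_seconds, 0))


added = datetime(2020, 8, 10, 12, 0, 0, tzinfo=timezone.utc)
print(eviction_time(added, 3600))  # one hour after the taint was added
print(eviction_time(added, None))  # None: the pod stays indefinitely
```

If the taint is removed before the computed time, the pending eviction is simply cancelled, which this sketch does not model.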

Common use cases

Dedicated nodes

To set aside a group of nodes for the exclusive use of particular applications, add a taint like this to those nodes:

kubectl taint nodes nodename dedicated=groupName:NoSchedule

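Then add a matching toleration to the pods of those applications; pods with the right toleration may then use the tainted nodes just like any other node. To make these pods run only on the dedicated nodes, also label the nodes and require that label through nodeSelector or node affinity.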
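
A pod for such a dedicated group therefore needs both pieces: the toleration and the node selector. The Python sketch below builds such a manifest and prints it as JSON, which kubectl accepts alongside YAML. It assumes the nodes were also labelled dedicated=groupName (for instance with kubectl label nodes nodename dedicated=groupName); the label, the pod name and the image are illustrative choices, not part of the original setup.

```python
import json


def dedicated_pod(name: str, image: str, group: str) -> dict:
    """Build a Pod manifest that tolerates the dedicated taint and selects the labelled nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Lets the pod onto nodes tainted with dedicated=<group>:NoSchedule.
            "tolerations": [
                {"key": "dedicated", "operator": "Equal", "value": group, "effect": "NoSchedule"}
            ],
            # Keeps the pod off every node that lacks the (assumed) dedicated=<group> label.
            "nodeSelector": {"dedicated": group},
            "containers": [{"name": name, "image": image}],
        },
    }


print(json.dumps(dedicated_pod("dedicated-app", "busybox:latest", "groupName"), indent=2))
```

The toleration alone only permits the pod to land on the dedicated nodes; the node selector is what prevents it from landing anywhere else.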

Nodes with special hardware

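A cluster may contain a small number of nodes with special hardware, such as GPUs. Users will naturally want to keep pods that do not need this hardware off those nodes, so that pods that do need it can be scheduled there. Either of the following commands taints such a node (NoSchedule as a hard restriction, PreferNoSchedule as a soft one):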

kubectl taint nodes nodename special=true:NoSchedule
kubectl taint nodes nodename special=true:PreferNoSchedule

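Then add a matching toleration to the pods that need the hardware, so that they are allowed onto these nodes. As with dedicated nodes, labels or other attributes can also be used to identify these pods and schedule them onto the servers that have the hardware.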

Handling node failures

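As mentioned earlier, when a node fails, the TaintBasedEvictions feature can taint the node automatically so that its pods are evicted. In some scenarios, however, such as a network failure that cuts the node off from the master, an application with a lot of local state may prefer to keep running on the node in the hope that the network recovers quickly, and so avoid being evicted. The pod's toleration can then be defined like this: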

tolerations:
- key: "node.alpha.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000

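For the node-not-ready condition, set the key to node.alpha.kubernetes.io/notReady. These are the alpha-era key names; later Kubernetes releases use node.kubernetes.io/unreachable and node.kubernetes.io/not-ready.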

If a pod does not specify a toleration for node.alpha.kubernetes.io/notReady, Kubernetes automatically adds a node.alpha.kubernetes.io/notReady toleration with tolerationSeconds=300 to the pod.

Likewise, if a pod does not specify a toleration for node.alpha.kubernetes.io/unreachable, Kubernetes automatically adds a node.alpha.kubernetes.io/unreachable toleration with tolerationSeconds=300 to the pod.

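These automatically added tolerations ensure that, when a problem is detected on a node, the pod keeps running there for 5 minutes before it is evicted. Both default tolerations are added by the "DefaultTolerationSeconds" admission controller.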
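
The defaulting behaviour can be pictured with a small Python model. This is a simplified sketch of what the text describes, not the admission controller's implementation; the function name add_default_tolerations is invented here, the keys are the alpha-era names used in this article, and the exact rule for deciding whether a pod "already has" a toleration is an assumption.

```python
DEFAULT_SECONDS = 300
DEFAULT_KEYS = (
    "node.alpha.kubernetes.io/notReady",
    "node.alpha.kubernetes.io/unreachable",
)


def add_default_tolerations(tolerations: list) -> list:
    """Return the pod's tolerations plus a 300 s default for each key it does not cover."""
    result = list(tolerations)  # leave the caller's list untouched
    for key in DEFAULT_KEYS:
        # Assumption: a toleration counts if its key matches (or is empty) and
        # its effect is NoExecute or empty (an empty effect matches every effect).
        covered = any(
            t.get("key", "") in (key, "") and t.get("effect", "") in ("NoExecute", "")
            for t in tolerations
        )
        if not covered:
            result.append({
                "key": key,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": DEFAULT_SECONDS,
            })
    return result


custom = [{
    "key": "node.alpha.kubernetes.io/unreachable",
    "operator": "Exists",
    "effect": "NoExecute",
    "tolerationSeconds": 6000,
}]
for toleration in add_default_tolerations(custom):
    print(toleration["key"], toleration["tolerationSeconds"])
```

With the custom toleration from the example above, only the notReady default is added, and the pod's own 6000 s setting for unreachable is kept.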