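When deploying core components in a Kubernetes cluster, you will often use Critical Pods. But do you know what actually makes a Critical Pod special? Understanding this fully is not that simple: it involves the scheduler, the Kubelet Eviction Manager, the DaemonSet Controller, Kubelet Preemption, and more, so I will cover it in a four-part series. This first part describes how a Critical Pod behaves in the scheduler's Predicate phase, and how that compares with the behavior users expect.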
Rescheduler is officially deprecated as of Kubernetes 1.10 and will be removed in version 1.12, so this article does not discuss how Rescheduler handles Critical Pods.
Rule 1:
- The feature gate ExperimentalCriticalPodAnnotation is enabled;
- The Pod belongs to the kube-system namespace;
- The Pod carries the annotation scheduler.alpha.kubernetes.io/critical-pod="".
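Rule 2:
- The feature gates ExperimentalCriticalPodAnnotation and PodPriority are enabled;
- The Pod's Priority is set and is not less than 2 * 10^9.
For reference, the built-in priority classes have these values:
- system-node-critical priority = 2 * 10^9 + 1000;
- system-cluster-critical priority = 2 * 10^9.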
A Pod that satisfies either Rule 1 or Rule 2 is considered a Critical Pod.
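The two rules can be sketched as a small self-contained Go program. This is only an illustration of the rules as stated above, not the Kubernetes implementation: the types featureGates and podInfo and the functions matchesRule1, matchesRule2 and isCriticalPod are hypothetical names introduced for this example.

```go
package main

import "fmt"

const (
	criticalPodAnnotation  = "scheduler.alpha.kubernetes.io/critical-pod"
	systemNamespace        = "kube-system"
	systemCriticalPriority = int32(2000000000) // 2 * 10^9
)

// featureGates is a hypothetical stand-in for the cluster's feature-gate settings.
type featureGates struct {
	ExperimentalCriticalPodAnnotation bool
	PodPriority                       bool
}

// podInfo holds only the Pod fields that the two rules inspect.
type podInfo struct {
	Namespace   string
	Annotations map[string]string
	Priority    *int32 // nil means the priority is not set
}

// matchesRule1: annotation gate enabled, Pod in kube-system,
// and the critical-pod annotation present with an empty value.
func matchesRule1(g featureGates, p podInfo) bool {
	if !g.ExperimentalCriticalPodAnnotation || p.Namespace != systemNamespace {
		return false
	}
	v, ok := p.Annotations[criticalPodAnnotation]
	return ok && v == ""
}

// matchesRule2: both gates enabled, priority set and not less than 2 * 10^9.
func matchesRule2(g featureGates, p podInfo) bool {
	if !g.ExperimentalCriticalPodAnnotation || !g.PodPriority {
		return false
	}
	return p.Priority != nil && *p.Priority >= systemCriticalPriority
}

// isCriticalPod reports whether either rule holds.
func isCriticalPod(g featureGates, p podInfo) bool {
	return matchesRule1(g, p) || matchesRule2(g, p)
}

func main() {
	gates := featureGates{ExperimentalCriticalPodAnnotation: true, PodPriority: true}
	prio := int32(2000000000)

	annotated := podInfo{
		Namespace:   "kube-system",
		Annotations: map[string]string{criticalPodAnnotation: ""},
	}
	byPriority := podInfo{Namespace: "default", Priority: &prio}
	plain := podInfo{Namespace: "default"}

	fmt.Println(isCriticalPod(gates, annotated))  // true (Rule 1)
	fmt.Println(isCriticalPod(gates, byPriority)) // true (Rule 2)
	fmt.Println(isCriticalPod(gates, plain))      // false
}
```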
In the predicate phase of pod scheduling, the default scheduler registers GeneralPredicates as one of the default predicates. It does not check whether a pod is critical in order to run only EssentialPredicates for it. What does this mean?
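The relationship between GeneralPredicates and EssentialPredicates makes this clear. GeneralPredicates first calls noncriticalPredicates and then calls EssentialPredicates. So if you mark the Pods of a Deployment, StatefulSet, or similar workload (DaemonSet excepted) as Critical, the scheduler still runs the full GeneralPredicates flow, including noncriticalPredicates, whereas you would expect it to go straight to EssentialPredicates.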
    // GeneralPredicates checks whether noncriticalPredicates and EssentialPredicates pass.
    // noncriticalPredicates are the predicates that only non-critical pods need and
    // EssentialPredicates are the predicates that all pods, including critical pods, need
    func GeneralPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
        var predicateFails []algorithm.PredicateFailureReason
        fit, reasons, err := noncriticalPredicates(pod, meta, nodeInfo)
        if err != nil {
            return false, predicateFails, err
        }
        if !fit {
            predicateFails = append(predicateFails, reasons...)
        }

        fit, reasons, err = EssentialPredicates(pod, meta, nodeInfo)
        if err != nil {
            return false, predicateFails, err
        }
        if !fit {
            predicateFails = append(predicateFails, reasons...)
        }

        return len(predicateFails) == 0, predicateFails, nil
    }
noncriticalPredicates was intended to hold the extra predicate logic that applies only to non-critical pods. That logic is the PodFitsResources check.
pkg/scheduler/algorithm/predicates/predicates.go:1076

    // noncriticalPredicates are the predicates that only non-critical pods need
    func noncriticalPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
        var predicateFails []algorithm.PredicateFailureReason
        fit, reasons, err := PodFitsResources(pod, meta, nodeInfo)
        if err != nil {
            return false, predicateFails, err
        }
        if !fit {
            predicateFails = append(predicateFails, reasons...)
        }

        return len(predicateFails) == 0, predicateFails, nil
    }
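PodFitsResources checks whether the node has enough of the following resources to satisfy the pod: Allowed Pod Number, CPU, Memory, EphemeralStorage, and Extended Resources.
In other words, if you mark the Pods of a Deployment, StatefulSet, or similar workload (DaemonSet excepted) as Critical, the scheduler still checks whether these resources are sufficient when scheduling them. If they are not, the predicate fails. The Preempt phase then only runs the normal preemption logic based on the Pod's PriorityClass, with no special handling for Critical Pods. As a result, the Critical Pod may fail to schedule because no Node satisfies its resource requirements, and it stays in the Pending state.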
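To make the failure mode concrete, here is a minimal sketch of the kind of comparison a resource-fit predicate performs. It is a simplified model, not the real PodFitsResources: the types resources and node and the function fitsResources are hypothetical, and only pod count, CPU, and memory are modeled (ephemeral storage and extended resources are left out).

```go
package main

import "fmt"

// resources is a simplified stand-in for the quantities a resource-fit check compares.
type resources struct {
	MilliCPU int64
	MemoryMB int64
}

// node models what is allocatable on a node and what is already requested on it.
type node struct {
	Allocatable resources
	Requested   resources
	MaxPods     int
	PodCount    int
}

// fitsResources returns the reasons the pod does not fit; an empty result means it fits.
func fitsResources(req resources, n node) []string {
	var reasons []string
	if n.PodCount+1 > n.MaxPods {
		reasons = append(reasons, "Insufficient pods")
	}
	if n.Requested.MilliCPU+req.MilliCPU > n.Allocatable.MilliCPU {
		reasons = append(reasons, "Insufficient cpu")
	}
	if n.Requested.MemoryMB+req.MemoryMB > n.Allocatable.MemoryMB {
		reasons = append(reasons, "Insufficient memory")
	}
	return reasons
}

func main() {
	// A node with only 200m CPU left.
	busy := node{
		Allocatable: resources{MilliCPU: 4000, MemoryMB: 8192},
		Requested:   resources{MilliCPU: 3800, MemoryMB: 4096},
		MaxPods:     110,
		PodCount:    20,
	}
	// A pod marked Critical gets no exemption here: it still needs 500m CPU.
	criticalReq := resources{MilliCPU: 500, MemoryMB: 512}
	fmt.Println(fitsResources(criticalReq, busy)) // [Insufficient cpu]
}
```

If every node in the cluster looks like this one and preemption frees nothing, the pod stays Pending no matter how it was marked.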
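Users who mark a Pod as Critical, however, do not want scheduling to fail because of insufficient resources. So what if you do want to deploy a key service as Critical Pods through a Deployment, StatefulSet, or similar workload (DaemonSet excepted)? There are two approaches: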
Method 1: give the Pod the system-cluster-critical or system-node-critical Priority Class, so that it obtains resources through the scheduler's normal Preempt flow and gets scheduled.
Method 2: modify the GeneralPredicates code as shown below so that it detects whether the Pod is a Critical Pod and, if it is, skips noncriticalPredicates. The predicate phase then no longer checks Allowed Pod Number, CPU, Memory, EphemeralStorage, or Extended Resources for that Pod.

    // Imports this change needs in predicates.go (paths assumed for the Kubernetes 1.10/1.11 tree):
    //   utilfeature "k8s.io/apiserver/pkg/util/feature"
    //   "k8s.io/kubernetes/pkg/features"
    //   kubelettypes "k8s.io/kubernetes/pkg/kubelet/types"
    func GeneralPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
        var predicateFails, reasons []algorithm.PredicateFailureReason
        var fit bool
        var err error

        // **Modify**: check whether the pod is a Critical Pod; skip noncriticalPredicates if it is.
        isCriticalPod := utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
            kubelettypes.IsCriticalPod(pod)
        if !isCriticalPod {
            fit, reasons, err = noncriticalPredicates(pod, meta, nodeInfo)
            if err != nil {
                return false, predicateFails, err
            }
            if !fit {
                predicateFails = append(predicateFails, reasons...)
            }
        }

        fit, reasons, err = EssentialPredicates(pod, meta, nodeInfo)
        if err != nil {
            return false, predicateFails, err
        }
        if !fit {
            predicateFails = append(predicateFails, reasons...)
        }

        return len(predicateFails) == 0, predicateFails, nil
    }
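The difference between the stock flow and the Method 2 flow can be seen in a toy model that runs outside the scheduler. Everything in it is hypothetical (pod, nodeState, resourceCheck, essentialCheck, generalPredicates, and the skipForCritical switch); it only mimics the control flow described above and models CPU alone.

```go
package main

import "fmt"

// pod is a toy pod: a critical flag plus a CPU request in millicores.
type pod struct {
	Name       string
	Critical   bool
	CPURequest int64
}

// nodeState is a toy node with some free CPU in millicores.
type nodeState struct {
	FreeCPU int64
}

// resourceCheck plays the role of noncriticalPredicates (the resource-fit check).
func resourceCheck(p pod, n nodeState) []string {
	if p.CPURequest > n.FreeCPU {
		return []string{"Insufficient cpu"}
	}
	return nil
}

// essentialCheck plays the role of EssentialPredicates; in this toy it always passes.
func essentialCheck(p pod, n nodeState) []string {
	return nil
}

// generalPredicates runs both groups of checks. With skipForCritical=false it
// behaves like the stock flow; with skipForCritical=true it behaves like the
// Method 2 modification and skips the resource check for critical pods.
func generalPredicates(p pod, n nodeState, skipForCritical bool) (bool, []string) {
	var fails []string
	if !(skipForCritical && p.Critical) {
		fails = append(fails, resourceCheck(p, n)...)
	}
	fails = append(fails, essentialCheck(p, n)...)
	return len(fails) == 0, fails
}

func main() {
	p := pod{Name: "key-service", Critical: true, CPURequest: 500}
	n := nodeState{FreeCPU: 200}

	fit, reasons := generalPredicates(p, n, false)
	fmt.Println("stock flow:   ", fit, reasons)

	fit, reasons = generalPredicates(p, n, true)
	fmt.Println("modified flow:", fit, reasons)
}
```

In the modified flow the critical pod passes the predicate even though the node has less free CPU than it requests, which is exactly what skipping the resource check means.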
As for Method 1, Kubernetes in fact already does this for you in the Priority admission check.
    // admitPod makes sure a new pod does not set spec.Priority field. It also makes sure that
    // the PriorityClassName exists if it is provided and resolves the pod priority from the
    // PriorityClassName.
    func (p *priorityPlugin) admitPod(a admission.Attributes) error {
        ...
        if utilfeature.DefaultFeatureGate.Enabled(features.PodPriority) {
            var priority int32
            if len(pod.Spec.PriorityClassName) == 0 &&
                utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
                kubelettypes.IsCritical(a.GetNamespace(), pod.Annotations) {
                pod.Spec.PriorityClassName = scheduling.SystemClusterCritical
            }
            ...
        }
    }
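During admission, the Pod's priority is checked. If it finds that you have:
- enabled the PodPriority and ExperimentalCriticalPodAnnotation feature gates;
- placed the Pod in the kube-system namespace with the scheduler.alpha.kubernetes.io/critical-pod="" annotation;
- not set a PriorityClassName on the Pod;
then the Priority admission stage automatically assigns the SystemClusterCritical (system-cluster-critical) PriorityClass to the Pod.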
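The defaulting behavior in the admitPod excerpt can be restated as a standalone function. This is a sketch of the condition only, under the assumption that the excerpt shows the complete decision. The names gates and defaultPriorityClassName are hypothetical, and the real plugin does more work (such as resolving the numeric priority) that is not modeled here.

```go
package main

import "fmt"

const (
	criticalPodAnnotation = "scheduler.alpha.kubernetes.io/critical-pod"
	systemClusterCritical = "system-cluster-critical"
)

// gates is a hypothetical stand-in for the two feature gates involved.
type gates struct {
	PodPriority                       bool
	ExperimentalCriticalPodAnnotation bool
}

// defaultPriorityClassName returns the PriorityClassName a pod would carry after
// the defaulting step: an annotated kube-system pod without an explicit class
// gets system-cluster-critical; everything else keeps what it already had.
func defaultPriorityClassName(g gates, namespace string, annotations map[string]string, current string) string {
	if !g.PodPriority || !g.ExperimentalCriticalPodAnnotation {
		return current
	}
	if current != "" {
		return current // an explicitly set class is left untouched
	}
	if v, ok := annotations[criticalPodAnnotation]; namespace == "kube-system" && ok && v == "" {
		return systemClusterCritical
	}
	return current
}

func main() {
	g := gates{PodPriority: true, ExperimentalCriticalPodAnnotation: true}
	critical := map[string]string{criticalPodAnnotation: ""}

	// Annotated pod in kube-system with no class: defaulted.
	fmt.Println(defaultPriorityClassName(g, "kube-system", critical, ""))
	// Same annotation in another namespace: unchanged (empty).
	fmt.Println(defaultPriorityClassName(g, "default", critical, "") == "")
	// Explicit class: kept as-is.
	fmt.Println(defaultPriorityClassName(g, "kube-system", critical, "system-node-critical"))
}
```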
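Based on the analysis above, here is the recommended practice. When you deploy a key service in a Kubernetes cluster through something other than a DaemonSet (for example a Deployment or ReplicaSet), and you want preemptive scheduling to succeed even when cluster resources are short, make sure the Pod ends up with the system-cluster-critical or system-node-critical PriorityClass. Either set that PriorityClass on the Pod yourself, or enable the PodPriority and ExperimentalCriticalPodAnnotation feature gates, deploy the Pod in the kube-system namespace with the scheduler.alpha.kubernetes.io/critical-pod="" annotation, and leave PriorityClassName unset so that admission assigns system-cluster-critical for you.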
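This article described the two ways to mark a key service as Critical, explained how a Critical Pod (other than one deployed by a DaemonSet) behaves in the scheduler's Predicate phase, and gave a recommended practice.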