An In-Depth Analysis of Kubernetes Critical Pods (Part 1)

Date: 2019-11-13

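When you deploy core components in a Kubernetes cluster, you will often use Critical Pods. But do you know what actually makes a Critical Pod special? Understanding this fully is not that simple, because it touches on scheduling, the Kubelet Eviction Manager, the DaemonSet Controller, Kubelet Preemption, and more. I will cover it in a four-part series. This first part describes how Critical Pods behave in the scheduler's predicate phase, and how that differs from the behavior users expect.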

The official announcement states that "Rescheduler is deprecated as of Kubernetes 1.10 and will be removed in version 1.12", so this article does not discuss how the Rescheduler handles Critical Pods.

How to mark a Pod as a Critical Pod

Rule 1:

Enable the ExperimentalCriticalPodAnnotation feature gate;
The Pod must belong to the kube-system namespace;
The Pod must carry the annotation scheduler.alpha.kubernetes.io/critical-pod=""

Rule 2:

Enable the ExperimentalCriticalPodAnnotation and PodPriority feature gates;
The Pod's Priority is set and is not less than 2 * 10^9;

system-node-critical priority = 2 * 10^9 + 1000;
system-cluster-critical priority = 2 * 10^9;

A Pod that satisfies either Rule 1 or Rule 2 is treated as a Critical Pod.
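The two rules can be condensed into a single check. The following is a minimal sketch, not the Kubernetes source: the podInfo struct and the isCriticalPod helper are hypothetical names introduced for illustration, the feature gates are assumed to be enabled already, and the threshold is the 2 * 10^9 value from Rule 2.

```go
package main

import "fmt"

const (
	// criticalPodAnnotation is the annotation key named in Rule 1.
	criticalPodAnnotation = "scheduler.alpha.kubernetes.io/critical-pod"
	// systemCriticalPriority is the Rule 2 threshold (2 * 10^9).
	systemCriticalPriority int32 = 2 * 1000000000
)

// podInfo is a hypothetical, stripped-down stand-in for v1.Pod.
type podInfo struct {
	Namespace   string
	Annotations map[string]string
	Priority    *int32 // nil means no priority has been resolved
}

// isCriticalPod reports whether the pod satisfies Rule 1 or Rule 2.
// It assumes the relevant feature gates are already enabled.
func isCriticalPod(p podInfo) bool {
	// Rule 2: priority is set and is at least 2 * 10^9.
	if p.Priority != nil && *p.Priority >= systemCriticalPriority {
		return true
	}
	// Rule 1: kube-system namespace plus the annotation with an empty value.
	if p.Namespace != "kube-system" {
		return false
	}
	val, ok := p.Annotations[criticalPodAnnotation]
	return ok && val == ""
}

func main() {
	high := systemCriticalPriority + 1000
	byPriority := podInfo{Namespace: "default", Priority: &high}
	byAnnotation := podInfo{
		Namespace:   "kube-system",
		Annotations: map[string]string{criticalPodAnnotation: ""},
	}
	ordinary := podInfo{Namespace: "default"}

	fmt.Println(isCriticalPod(byPriority), isCriticalPod(byAnnotation), isCriticalPod(ordinary))
}
```

Note that the annotation alone is not enough in this sketch: the same annotation on a Pod outside kube-system does not make it critical.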

Schedule Critical Pod

During the predicate phase of pod scheduling, the default scheduler registers GeneralPredicates as one of its default predicates. It does not check whether a Pod is critical and then apply only EssentialPredicates to Critical Pods. What does this mean?

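The relationship between GeneralPredicates and EssentialPredicates makes this clear. GeneralPredicates first calls noncriticalPredicates and then calls EssentialPredicates. So if you mark the Pods of a Deployment, StatefulSet, or similar workload (DaemonSet excepted) as critical, the scheduler still runs the full GeneralPredicates flow, including noncriticalPredicates, whereas you would expect it to go straight to EssentialPredicates.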

// GeneralPredicates checks whether noncriticalPredicates and EssentialPredicates pass. noncriticalPredicates are the predicates
// that only non-critical pods need and EssentialPredicates are the predicates that all pods, including critical pods, need
func GeneralPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason
	fit, reasons, err := noncriticalPredicates(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	fit, reasons, err = EssentialPredicates(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}

noncriticalPredicates was intended to hold the extra predicate logic that applies only to non-critical pods. That logic is the PodFitsResources check.

pkg/scheduler/algorithm/predicates/predicates.go:1076

// noncriticalPredicates are the predicates that only non-critical pods need
func noncriticalPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason
	fit, reasons, err := PodFitsResources(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}

PodFitsResources checks whether the node can satisfy the Pod's request for each of the following resources:

Allowed Pod Number;
CPU;
Memory;
EphemeralStorage;
Extended Resources;
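The shape of this check can be sketched as follows. This is a simplified illustration under stated assumptions, not the real PodFitsResources: the resources and nodeState types and the fitsResources function are hypothetical, and the real predicate works on the scheduler's own NodeInfo and returns structured failure reasons rather than strings.

```go
package main

import "fmt"

// resources is a hypothetical, simplified resource vector.
type resources struct {
	MilliCPU         int64
	Memory           int64
	EphemeralStorage int64
	Extended         map[string]int64
}

// nodeState is a hypothetical summary of a node's capacity and current load.
type nodeState struct {
	AllowedPods int
	RunningPods int
	Allocatable resources
	Requested   resources
}

// fitsResources returns the name of every resource the node cannot supply.
// An empty result means the pod fits.
func fitsResources(pod resources, node nodeState) []string {
	var insufficient []string
	if node.RunningPods+1 > node.AllowedPods {
		insufficient = append(insufficient, "pods")
	}
	if pod.MilliCPU > node.Allocatable.MilliCPU-node.Requested.MilliCPU {
		insufficient = append(insufficient, "cpu")
	}
	if pod.Memory > node.Allocatable.Memory-node.Requested.Memory {
		insufficient = append(insufficient, "memory")
	}
	if pod.EphemeralStorage > node.Allocatable.EphemeralStorage-node.Requested.EphemeralStorage {
		insufficient = append(insufficient, "ephemeral-storage")
	}
	// Extended resources are checked by name; map iteration order is not deterministic.
	for name, want := range pod.Extended {
		free := node.Allocatable.Extended[name] - node.Requested.Extended[name]
		if want > free {
			insufficient = append(insufficient, name)
		}
	}
	return insufficient
}

func main() {
	node := nodeState{
		AllowedPods: 110,
		RunningPods: 110,
		Allocatable: resources{MilliCPU: 4000, Memory: 8 << 30},
		Requested:   resources{MilliCPU: 3500, Memory: 2 << 30},
	}
	pod := resources{MilliCPU: 1000, Memory: 1 << 30}
	fmt.Println(fitsResources(pod, node))
}
```

Like the scheduler code quoted above, the sketch collects every shortfall instead of stopping at the first one, so a single pass reports all the reasons a node was rejected.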

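In other words, if you mark the Pods of a Deployment, StatefulSet, or similar workload (DaemonSet excepted) as critical, scheduling still checks whether Allowed Pod Number, CPU, Memory, EphemeralStorage, and Extended Resources are sufficient. If they are not, the predicate fails. The Preempt phase then runs only the normal preemption logic based on the Pod's PriorityClass, with no special handling for Critical Pods. As a result, the Critical Pod may fail to schedule because no Node satisfies its resource requirements, and it stays in the Pending state.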

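Users who mark a Pod as critical, however, do not want scheduling to fail for lack of resources. So what if you do want to deploy a critical service as Critical Pods through a Deployment, StatefulSet, or similar workload (DaemonSet excepted)? There are two options: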

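Method 1: Following Rule 2 above, give the Pod the system-cluster-critical or system-node-critical Priority Class, so that it obtains resources through the scheduler's normal Preempt flow and gets scheduled.
Method 2: Following Rule 1 above, additionally modify the GeneralPredicates code as follows so that it detects whether the Pod is a Critical Pod and, if so, skips noncriticalPredicates. The predicate phase then no longer checks Allowed Pod Number, CPU, Memory, EphemeralStorage, or Extended Resources for that Pod.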

func GeneralPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason

	// **Modify**: check whether the pod is a Critical Pod; skip noncriticalPredicates if it is.
	isCriticalPod := utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
		kubelettypes.IsCriticalPod(pod)

	if !isCriticalPod {
		// Only non-critical pods go through the resource checks.
		fit, reasons, err := noncriticalPredicates(pod, meta, nodeInfo)
		if err != nil {
			return false, predicateFails, err
		}
		if !fit {
			predicateFails = append(predicateFails, reasons...)
		}
	}

	// Every pod, critical or not, must pass the essential predicates.
	fit, reasons, err := EssentialPredicates(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}
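The effect of this change can be seen in a toy model. The sketch below is only an illustration under simplifying assumptions: the pod and node types, the nonCritical and essential functions, and the skipForCritical switch are all hypothetical, with a single CPU check standing in for the resource predicate and a node-name match standing in for the essential predicates.

```go
package main

import "fmt"

// pod and node are hypothetical, minimal stand-ins for the scheduler's types.
type pod struct {
	Critical bool
	MilliCPU int64
	NodeName string // empty means the pod accepts any node
}

type node struct {
	Name         string
	FreeMilliCPU int64
}

// nonCritical models the resource check that only non-critical pods should need.
func nonCritical(p pod, n node) []string {
	if p.MilliCPU > n.FreeMilliCPU {
		return []string{"insufficient cpu"}
	}
	return nil
}

// essential models a check that every pod must pass; here only a node-name match.
func essential(p pod, n node) []string {
	if p.NodeName != "" && p.NodeName != n.Name {
		return []string{"node name mismatch"}
	}
	return nil
}

// generalPredicates collects failure reasons from both groups.
// When skipForCritical is true, critical pods bypass the non-critical group.
func generalPredicates(p pod, n node, skipForCritical bool) (bool, []string) {
	var fails []string
	if !(skipForCritical && p.Critical) {
		fails = append(fails, nonCritical(p, n)...)
	}
	fails = append(fails, essential(p, n)...)
	return len(fails) == 0, fails
}

func main() {
	busy := node{Name: "node-1", FreeMilliCPU: 100}
	critical := pod{Critical: true, MilliCPU: 500}

	fit, reasons := generalPredicates(critical, busy, false)
	fmt.Println("original behavior:", fit, reasons)

	fit, reasons = generalPredicates(critical, busy, true)
	fmt.Println("modified behavior:", fit, reasons)
}
```

With the bypass enabled, the critical pod passes the predicate phase on a node that is short of CPU, while a mismatch in the essential group still rejects it.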

Method 1 is in fact already handled for you by Kubernetes during the Priority admission check.

// admitPod makes sure a new pod does not set spec.Priority field. It also makes sure that the PriorityClassName exists if it is provided and resolves the pod priority from the PriorityClassName.
func (p *priorityPlugin) admitPod(a admission.Attributes) error {
	...
	if utilfeature.DefaultFeatureGate.Enabled(features.PodPriority) {
		var priority int32
		if len(pod.Spec.PriorityClassName) == 0 &&
			utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
			kubelettypes.IsCritical(a.GetNamespace(), pod.Annotations) {
			pod.Spec.PriorityClassName = scheduling.SystemClusterCritical
		}
            ...
}

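At admission time the Pod's Priority is checked. If the check finds that you have:

Enabled the PodPriority feature gate;
Enabled the ExperimentalCriticalPodAnnotation feature gate;
Added the critical-pod annotation (scheduler.alpha.kubernetes.io/critical-pod) to the Pod;
Deployed the Pod in the kube-system namespace;
Not manually set a custom PriorityClass;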

then the Priority admission stage automatically assigns the SystemClusterCritical (system-cluster-critical) PriorityClass to the Pod.
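The five conditions map onto a short defaulting function. What follows is a minimal sketch of that decision, assuming simplified types: gates, podSpec, and defaultPriorityClass are hypothetical names, and the real admission plugin goes on to resolve the numeric priority from the class, which is omitted here.

```go
package main

import "fmt"

const (
	criticalPodAnnotation = "scheduler.alpha.kubernetes.io/critical-pod"
	systemClusterCritical = "system-cluster-critical"
)

// gates is a hypothetical model of the two feature gates consulted at admission.
type gates struct {
	PodPriority                       bool
	ExperimentalCriticalPodAnnotation bool
}

// podSpec is a hypothetical, stripped-down view of the fields admission reads.
type podSpec struct {
	Namespace         string
	Annotations       map[string]string
	PriorityClassName string
}

// defaultPriorityClass assigns system-cluster-critical when all five
// conditions from the list above hold, and leaves the pod untouched otherwise.
func defaultPriorityClass(g gates, p *podSpec) {
	if !g.PodPriority || !g.ExperimentalCriticalPodAnnotation {
		return
	}
	if p.PriorityClassName != "" {
		return // a class set by the user is never overridden
	}
	if p.Namespace != "kube-system" {
		return
	}
	if val, ok := p.Annotations[criticalPodAnnotation]; ok && val == "" {
		p.PriorityClassName = systemClusterCritical
	}
}

func main() {
	g := gates{PodPriority: true, ExperimentalCriticalPodAnnotation: true}
	p := &podSpec{
		Namespace:   "kube-system",
		Annotations: map[string]string{criticalPodAnnotation: ""},
	}
	defaultPriorityClass(g, p)
	fmt.Println(p.PriorityClassName)
}
```

The early return on a non-empty PriorityClassName is the reason a manually chosen class disables the automatic upgrade.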

Best practices

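Based on the analysis above, the best practice is as follows. When you deploy a critical service in a Kubernetes cluster by means other than a DaemonSet (for example a Deployment or ReplicaSet), make sure of the following so that preemptive scheduling still succeeds when cluster resources are short:

Enable the PodPriority feature gate;
Enable the ExperimentalCriticalPodAnnotation feature gate;
Add the critical-pod annotation (scheduler.alpha.kubernetes.io/critical-pod) to the Pod;
Deploy the Pod in the kube-system namespace;
Do not manually set a custom PriorityClass;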

Summary

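This article described two ways to mark a critical service as a Critical Pod, explained how Critical Pods (other than those deployed by a DaemonSet) behave in the scheduler's predicate phase, and gave best practices.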