A Deep Dive into the Kubernetes CPU Manager

Date: 2019-11-13


Author: xidianwangtao@gmail.com

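Abstract: The kubelet CPU Manager is used at large scale in our production environment, so we need to understand it thoroughly in order to operate it with confidence. This article covers the CPU Manager's use cases, usage, internal mechanics, potential problems, and their fixes. I hope it is helpful.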

What does the CPU Manager do?

Anyone familiar with Docker has probably used Docker's cpuset options, which pin a container to specific CPUs and memory nodes when it starts.

--cpuset-cpus=""	CPUs in which to allow execution (0-3, 0,1)
--cpuset-mems=""	Memory nodes (MEMs) in which to allow execution (0-3, 0,1). Only effective on NUMA systems.

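Kubernetes, however, offered no comparable capability until version 1.8, which introduced the CPU Manager feature to support cpuset. From Kubernetes 1.10 through the current 1.12, the feature has been in Beta.

The CPU Manager is a module within the kubelet's Container Manager (CM). It pins selected containers to dedicated CPUs in order to improve the performance of CPU-sensitive workloads.

When should you consider using the CPU Manager?

As mentioned above, CPU-sensitive workloads can gain substantial compute performance from cpuset. So what characteristics make a workload CPU-sensitive?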

Sensitive to CPU throttling effects.
Sensitive to context switches.
Sensitive to processor cache misses.
Benefits from sharing processor resources (e.g., data and instruction caches).
Sensitive to cross-socket memory traffic.
Sensitive to, or requires, hyperthreads from the same physical CPU core.

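The Kubernetes blog post "Feature Highlight: CPU Manager" lists some concrete sample comparisons for anyone interested. Many of our company's applications are of this type, and cpuset has the side benefit of making CPU accounting easier. Of course, this will almost certainly lower the CPU utilization of the cluster as a whole, so it comes down to whether application performance is your first priority.

How to use the CPU Manager

In Kubernetes v1.8-1.9 the CPU Manager is Alpha; in v1.10-1.12 it is Beta. I have not followed the CPU Manager changelog across these releases, but I recommend using it on 1.10 or later.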

Enable CPU Manager

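Make sure the CPUManager feature gate is true in the kubelet (BETA - default=true).
The CPU Manager currently supports two policies, none and static, set through the kubelet flag --cpu-manager-policy. A dynamic policy, which would adjust a container's cpuset during its lifetime, is planned for the future.
- none: the default. It is equivalent to not enabling cpuset at all: CPU requests map to CPU shares and CPU limits map to CPU quota.
- static: enabled with --cpu-manager-policy=static. The kubelet assigns a pinned CPU set before the container starts, and takes the CPU topology into account to improve CPU affinity, as described later.
Make sure --kube-reserved and --system-reserved are configured in the kubelet so that their combined CPU reservation is greater than zero. The values need not be whole CPUs; the total is rounded up when the reserved CPUs are computed. This prevents the CPU Manager from handing out every CPU core on the node and leaving the kubelet and system processes with none.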

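Note that the CPU Manager has one more option, --cpu-manager-reconcile-period, which sets how often the CPU Manager reconciles the CPU assignments held in kubelet memory into the cpuset cgroups. If it is not set, the value of --node-status-update-frequency (default 10s) is used.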
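The rounding of the reserved CPUs can be illustrated with a small Go sketch. It only mirrors the ceiling step that appears in the NewManager listing later in this article; the function name reservedCPUCount and the millicore inputs are assumptions made for illustration, not kubelet APIs.

```go
package main

import (
	"fmt"
	"math"
)

// reservedCPUCount is a hypothetical helper: it adds the kube-reserved and
// system-reserved CPU amounts (in millicores) and rounds the sum up to whole
// CPUs, because a fractional CPU cannot be set aside exclusively.
func reservedCPUCount(kubeReservedMilli, systemReservedMilli int64) int {
	total := float64(kubeReservedMilli+systemReservedMilli) / 1000
	return int(math.Ceil(total))
}

func main() {
	// Assumed example flags: --kube-reserved=cpu=500m --system-reserved=cpu=700m
	fmt.Println(reservedCPUCount(500, 700)) // 1.2 CPUs round up to 2
}
```

With these example values, two whole CPUs are kept out of exclusive allocation even though only 1.2 CPUs were reserved.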

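Workload options

With the configuration above in place, the static CPU Manager policy is enabled, and the next step is to use it from a workload. Kubernetes requires a Pod and container that use the CPU Manager to meet the following two conditions:

The Pod's QoS class is Guaranteed;
The container's CPU request is an integer number of CPUs.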

spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"

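In every other case the CPU Manager does not assign pinned CPUs to the container; the container runs on the CPUs of the shared pool under CFS. Conceptually, the shared pool is the node's CPUCapacity - ReservedCPUs - ExclusiveCPUs. As the later discussion of NewStaticPolicy notes, the reserved CPUs themselves stay in the default CPU set and are simply never assigned exclusively.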
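The two conditions can be expressed as a short check. The following Go sketch is self-contained and deliberately does not use the Kubernetes API types; containerResources, guaranteed, and exclusiveCPUs are hypothetical names, and the Guaranteed QoS test is simplified to "requests equal non-zero limits for CPU and memory in every container".

```go
package main

import "fmt"

// containerResources is a hypothetical, simplified stand-in for one
// container's requests and limits (CPU in millicores, memory in bytes).
type containerResources struct {
	cpuRequestMilli, cpuLimitMilli int64
	memRequest, memLimit           int64
}

// guaranteed reports whether every container sets requests equal to
// non-zero limits for both CPU and memory (simplified Guaranteed QoS).
func guaranteed(pod []containerResources) bool {
	for _, c := range pod {
		if c.cpuLimitMilli == 0 || c.memLimit == 0 {
			return false
		}
		if c.cpuRequestMilli != c.cpuLimitMilli || c.memRequest != c.memLimit {
			return false
		}
	}
	return len(pod) > 0
}

// exclusiveCPUs returns how many CPUs container idx would get pinned under
// the static policy, or 0 if it would stay in the shared pool.
func exclusiveCPUs(pod []containerResources, idx int) int64 {
	if !guaranteed(pod) {
		return 0
	}
	milli := pod[idx].cpuRequestMilli
	if milli%1000 != 0 {
		return 0 // fractional request: not eligible for pinning
	}
	return milli / 1000
}

func main() {
	const mi = 1 << 20
	// The nginx container from the manifest above: cpu "2", memory "200Mi".
	nginx := []containerResources{{2000, 2000, 200 * mi, 200 * mi}}
	fmt.Println(exclusiveCPUs(nginx, 0)) // 2

	// Guaranteed, but with a fractional CPU request: stays in the shared pool.
	fractional := []containerResources{{1500, 1500, 200 * mi, 200 * mi}}
	fmt.Println(exclusiveCPUs(fractional, 0)) // 0
}
```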

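CPU Manager workflow

When the CPU Manager assigns CPUs to a qualifying container, it follows the CPU topology as closely as it can, that is, it takes CPU affinity into account. It selects CPUs in the following order of preference (a logical CPU is a hyperthread):

If the number of logical CPUs the container requests is no less than the number of logical CPUs in a single CPU socket, all the logical CPUs of a whole socket are assigned to the container first.
If the number of logical CPUs still outstanding is no less than the number of logical CPUs provided by a single physical CPU core, all the logical CPUs of a whole physical core are assigned to the container next.
Any logical CPUs still outstanding are chosen from a list of logical CPUs sorted by the following criteria:
- number of CPUs available on the same socket
- number of CPUs available on the same core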

pkg/kubelet/cm/cpumanager/cpu_assignment.go:149

func takeByTopology(topo *topology.CPUTopology, availableCPUs cpuset.CPUSet, numCPUs int) (cpuset.CPUSet, error) {
	acc := newCPUAccumulator(topo, availableCPUs, numCPUs)
	if acc.isSatisfied() {
		return acc.result, nil
	}
	if acc.isFailed() {
		return cpuset.NewCPUSet(), fmt.Errorf("not enough cpus available to satisfy request")
	}

	// Algorithm: topology-aware best-fit
	// 1. Acquire whole sockets, if available and the container requires at
	//    least a socket's-worth of CPUs.
	for _, s := range acc.freeSockets() {
		if acc.needs(acc.topo.CPUsPerSocket()) {
			glog.V(4).Infof("[cpumanager] takeByTopology: claiming socket [%d]", s)
			acc.take(acc.details.CPUsInSocket(s))
			if acc.isSatisfied() {
				return acc.result, nil
			}
		}
	}

	// 2. Acquire whole cores, if available and the container requires at least
	//    a core's-worth of CPUs.
	for _, c := range acc.freeCores() {
		if acc.needs(acc.topo.CPUsPerCore()) {
			glog.V(4).Infof("[cpumanager] takeByTopology: claiming core [%d]", c)
			acc.take(acc.details.CPUsInCore(c))
			if acc.isSatisfied() {
				return acc.result, nil
			}
		}
	}

	// 3. Acquire single threads, preferring to fill partially-allocated cores
	//    on the same sockets as the whole cores we have already taken in this
	//    allocation.
	for _, c := range acc.freeCPUs() {
		glog.V(4).Infof("[cpumanager] takeByTopology: claiming CPU [%d]", c)
		if acc.needs(1) {
			acc.take(cpuset.NewCPUSet(c))
		}
		if acc.isSatisfied() {
			return acc.result, nil
		}
	}

	return cpuset.NewCPUSet(), fmt.Errorf("failed to allocate cpus")
}
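To make the three phases easier to follow without the kubelet's accumulator types, here is a simplified, self-contained Go sketch. It is not the upstream implementation: cpuInfo and takeSketch are hypothetical names, core IDs are assumed to be unique across sockets, and the last phase simply takes leftover threads in ascending CPU ID order instead of applying the upstream sort.

```go
package main

import (
	"fmt"
	"sort"
)

// cpuInfo is a hypothetical topology entry for one logical CPU.
// Core IDs are assumed to be unique across sockets.
type cpuInfo struct{ socket, core int }

// takeSketch picks n logical CPUs from free: whole free sockets first,
// then whole free cores, then single threads in ascending CPU ID order.
func takeSketch(topo map[int]cpuInfo, free []int, n int) ([]int, error) {
	if n < 0 || n > len(free) {
		return nil, fmt.Errorf("not enough cpus available to satisfy request")
	}
	avail := map[int]bool{}
	for _, c := range free {
		avail[c] = true
	}
	result := []int{}

	// freeGroups returns the sockets or cores (chosen by key) whose logical
	// CPUs are all still available, in ascending ID order.
	freeGroups := func(key func(cpuInfo) int) [][]int {
		members := map[int][]int{}
		for cpu, info := range topo {
			members[key(info)] = append(members[key(info)], cpu)
		}
		ids := []int{}
		for id := range members {
			ids = append(ids, id)
		}
		sort.Ints(ids)
		groups := [][]int{}
		for _, id := range ids {
			allFree := true
			for _, cpu := range members[id] {
				allFree = allFree && avail[cpu]
			}
			if allFree {
				groups = append(groups, members[id])
			}
		}
		return groups
	}
	take := func(cpus []int) {
		for _, c := range cpus {
			delete(avail, c)
			result = append(result, c)
		}
	}

	// Phases 1 and 2: whole sockets, then whole cores, while the remaining
	// need is at least the size of the group.
	keys := []func(cpuInfo) int{
		func(i cpuInfo) int { return i.socket },
		func(i cpuInfo) int { return i.core },
	}
	for _, key := range keys {
		for _, g := range freeGroups(key) {
			if n-len(result) >= len(g) {
				take(g)
			}
		}
	}

	// Phase 3: fill the remainder with single threads.
	rest := []int{}
	for c := range avail {
		rest = append(rest, c)
	}
	sort.Ints(rest)
	take(rest[:n-len(result)])
	sort.Ints(result)
	return result, nil
}

func main() {
	// 2 sockets x 2 cores x 2 hyperthreads: CPU i is on socket i/4, core i/2.
	topo := map[int]cpuInfo{}
	for i := 0; i < 8; i++ {
		topo[i] = cpuInfo{socket: i / 4, core: i / 2}
	}
	free := []int{1, 2, 3, 4, 5, 6, 7}     // CPU 0 is treated as reserved
	fmt.Println(takeSketch(topo, free, 4)) // [4 5 6 7] <nil>
	fmt.Println(takeSketch(topo, free, 3)) // [1 2 3] <nil>
}
```

In the second call no socket is taken because the request is smaller than a socket, core 1 (CPUs 2 and 3) is taken whole, and the last CPU comes from the partially used core 0.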

Discovering CPU topology

The CPU Manager can only work if it can discover the CPU topology of the node. This discovery is done by cAdvisor.

cAdvisor's MachineInfo records the CPU and memory topology in its Topology field. Each Node object in Topology corresponds to one CPU socket.

vendor/github.com/google/cadvisor/info/v1/machine.go

type MachineInfo struct {
	// The number of cores in this machine.
	NumCores int `json:"num_cores"`

	...

	// Machine Topology
	// Describes cpu/memory layout and hierarchy.
	Topology []Node `json:"topology"`

	...
}

type Node struct {
	Id int `json:"node_id"`
	// Per-node memory
	Memory uint64  `json:"memory"`
	Cores  []Core  `json:"cores"`
	Caches []Cache `json:"caches"`
}

cAdvisor builds this information in GetTopology, mainly by parsing /proc/cpuinfo to derive the CPU topology and by reading /sys/devices/system/cpu/cpu to obtain CPU cache information.

vendor/github.com/google/cadvisor/machine/machine.go

func GetTopology(sysFs sysfs.SysFs, cpuinfo string) ([]info.Node, int, error) {
	nodes := []info.Node{}

	...
	return nodes, numCores, nil
}
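As a rough illustration of how such a topology can be flattened into a per-CPU lookup table, consider the Go sketch below. The node and core types are cut-down local copies modelled on the cAdvisor structures (only the fields needed here), and discoverSketch and cpuLocation are hypothetical names; this is a sketch of the idea, not the kubelet's topology.Discover.

```go
package main

import "fmt"

// core and node are simplified local stand-ins for the cAdvisor topology
// types: a node is a CPU socket, a core lists its hyperthread (logical CPU) IDs.
type core struct {
	ID      int
	Threads []int
}

type node struct {
	ID    int
	Cores []core
}

// cpuLocation records where one logical CPU lives.
type cpuLocation struct{ socketID, coreID int }

// discoverSketch walks sockets -> cores -> threads and returns a map from
// logical CPU ID to its location, plus the number of physical cores.
func discoverSketch(nodes []node) (map[int]cpuLocation, int) {
	details := map[int]cpuLocation{}
	numPhysicalCores := 0
	for _, n := range nodes {
		numPhysicalCores += len(n.Cores)
		for _, c := range n.Cores {
			for _, t := range c.Threads {
				details[t] = cpuLocation{socketID: n.ID, coreID: c.ID}
			}
		}
	}
	return details, numPhysicalCores
}

func main() {
	// Assumed example machine: 2 sockets, 2 cores each, 2 hyperthreads per core.
	nodes := []node{
		{ID: 0, Cores: []core{{ID: 0, Threads: []int{0, 4}}, {ID: 1, Threads: []int{1, 5}}}},
		{ID: 1, Cores: []core{{ID: 2, Threads: []int{2, 6}}, {ID: 3, Threads: []int{3, 7}}}},
	}
	details, cores := discoverSketch(nodes)
	fmt.Println(len(details), cores) // 8 4
	fmt.Println(details[5])          // {0 1}
}
```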

[Figure: structure of a typical NUMA CPU topology]

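Creating a container

When a container that meets the static policy conditions described above is created, the kubelet picks the best CPU set for it according to the CPU affinity rules. The CPU Manager workflow during container creation is roughly as follows:

Kuberuntime calls the container runtime to create the container.
Kuberuntime hands the container to the CPU Manager.
The CPU Manager processes the container according to the static policy logic.
The CPU Manager picks the CPUs with the "best" topology from the current shared pool. For a container that does not meet the static policy conditions, it returns the set of all CPUs in the shared pool.
The CPU Manager records the container's CPU assignment in the checkpoint state and removes the newly assigned CPUs from the shared pool.
The CPU Manager then reads the container's CPU assignment back from the state and applies it to the cpuset cgroups through the UpdateContainerResources CRI call. This also happens for containers that do not fall under the static policy.
Kuberuntime calls the container runtime to start the container.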

func (m *manager) AddContainer(p *v1.Pod, c *v1.Container, containerID string) error {
	m.Lock()
	err := m.policy.AddContainer(m.state, p, c, containerID)
	if err != nil {
		glog.Errorf("[cpumanager] AddContainer error: %v", err)
		m.Unlock()
		return err
	}
	cpus := m.state.GetCPUSetOrDefault(containerID)
	m.Unlock()

	if !cpus.IsEmpty() {
		err = m.updateContainerCPUSet(containerID, cpus)
		if err != nil {
			glog.Errorf("[cpumanager] AddContainer error: %v", err)
			return err
		}
	} else {
		glog.V(5).Infof("[cpumanager] update container resources is skipped due to cpu set is empty")
	}

	return nil
}

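Deleting a container

When a container whose CPUs were assigned by the CPU Manager is deleted, the CPU Manager workflow is roughly as follows:

Kuberuntime calls the CPU Manager, which handles the removal according to the logic defined in the static policy.
The CPU Manager returns the CPU set assigned to the container to the shared pool.
Kuberuntime calls the container runtime to remove the container.
The CPU Manager's reconcile loop asynchronously updates the CPU set of the containers that use CPUs from the shared pool.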

func (m *manager) RemoveContainer(containerID string) error {
	m.Lock()
	defer m.Unlock()

	err := m.policy.RemoveContainer(m.state, containerID)
	if err != nil {
		glog.Errorf("[cpumanager] RemoveContainer error: %v", err)
		return err
	}
	return nil
}
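The bookkeeping behind adding and removing containers can be modelled in a few lines. The Go sketch below is an illustrative in-memory model only: poolState and its methods are hypothetical names, there is no locking or checkpointing, and cpusFor merely imitates the "assigned set or default set" lookup that GetCPUSetOrDefault performs in the listings above.

```go
package main

import (
	"fmt"
	"sort"
)

// poolState is a hypothetical model of the CPU Manager state: a shared
// (default) CPU set plus per-container exclusive assignments.
type poolState struct {
	shared      map[int]bool
	assignments map[string][]int
}

func newPoolState(cpus ...int) *poolState {
	s := &poolState{shared: map[int]bool{}, assignments: map[string][]int{}}
	for _, c := range cpus {
		s.shared[c] = true
	}
	return s
}

// assign moves cpus out of the shared pool and records them for the container.
func (s *poolState) assign(containerID string, cpus []int) error {
	for _, c := range cpus {
		if !s.shared[c] {
			return fmt.Errorf("cpu %d is not in the shared pool", c)
		}
	}
	for _, c := range cpus {
		delete(s.shared, c)
	}
	s.assignments[containerID] = cpus
	return nil
}

// release returns the container's CPUs to the shared pool.
func (s *poolState) release(containerID string) {
	for _, c := range s.assignments[containerID] {
		s.shared[c] = true
	}
	delete(s.assignments, containerID)
}

// cpusFor returns the container's exclusive CPUs, or the whole shared pool
// if it has no assignment.
func (s *poolState) cpusFor(containerID string) []int {
	if cpus, ok := s.assignments[containerID]; ok {
		return cpus
	}
	out := []int{}
	for c := range s.shared {
		out = append(out, c)
	}
	sort.Ints(out)
	return out
}

func main() {
	s := newPoolState(0, 1, 2, 3)
	if err := s.assign("static-container", []int{2, 3}); err != nil {
		panic(err)
	}
	fmt.Println(s.cpusFor("static-container"))    // [2 3]
	fmt.Println(s.cpusFor("burstable-container")) // [0 1]
	s.release("static-container")
	fmt.Println(s.cpusFor("burstable-container")) // [0 1 2 3]
}
```

The model shows why shared-pool containers need a later cgroup update: their effective CPU set changes whenever another container is assigned or released.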

Checkpoint

What should you do if the file is corrupted or has been deleted?

Note: CPU Manager doesn’t support offlining and onlining of CPUs at runtime. Also, if the set of online CPUs changes on the node, the node must be drained and CPU manager manually reset by deleting the state file cpu_manager_state in the kubelet root directory.

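The CPU Manager is created as part of creating the Container Manager. Let's look at what happens when the CPU Manager is created; that also tells us what the CPU Manager does when the kubelet restarts.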

// NewManager creates new cpu manager based on provided policy
func NewManager(cpuPolicyName string, reconcilePeriod time.Duration, machineInfo *cadvisorapi.MachineInfo, nodeAllocatableReservation v1.ResourceList, stateFileDirectory string) (Manager, error) {
	var policy Policy

	switch policyName(cpuPolicyName) {

	case PolicyNone:
		policy = NewNonePolicy()

	case PolicyStatic:
		topo, err := topology.Discover(machineInfo)
		if err != nil {
			return nil, err
		}
		glog.Infof("[cpumanager] detected CPU topology: %v", topo)
		reservedCPUs, ok := nodeAllocatableReservation[v1.ResourceCPU]
		if !ok {
			// The static policy cannot initialize without this information.
			return nil, fmt.Errorf("[cpumanager] unable to determine reserved CPU resources for static policy")
		}
		if reservedCPUs.IsZero() {
			// The static policy requires this to be nonzero. Zero CPU reservation
			// would allow the shared pool to be completely exhausted. At that point
			// either we would violate our guarantee of exclusivity or need to evict
			// any pod that has at least one container that requires zero CPUs.
			// See the comments in policy_static.go for more details.
			return nil, fmt.Errorf("[cpumanager] the static policy requires systemreserved.cpu + kubereserved.cpu to be greater than zero")
		}

		// Take the ceiling of the reservation, since fractional CPUs cannot be
		// exclusively allocated.
		reservedCPUsFloat := float64(reservedCPUs.MilliValue()) / 1000
		numReservedCPUs := int(math.Ceil(reservedCPUsFloat))
		policy = NewStaticPolicy(topo, numReservedCPUs)

	default:
		glog.Errorf("[cpumanager] Unknown policy \"%s\", falling back to default policy \"%s\"", cpuPolicyName, PolicyNone)
		policy = NewNonePolicy()
	}

	stateImpl, err := state.NewCheckpointState(stateFileDirectory, cpuManagerStateFileName, policy.Name())
	if err != nil {
		return nil, fmt.Errorf("could not initialize checkpoint manager: %v", err)
	}

	manager := &manager{
		policy:                     policy,
		reconcilePeriod:            reconcilePeriod,
		state:                      stateImpl,
		machineInfo:                machineInfo,
		nodeAllocatableReservation: nodeAllocatableReservation,
	}
	return manager, nil
}

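It calls topology.Discover to wrap cAdvisor's machineInfo.Topology into the CPUTopology that the CPU Manager works with.
It then computes reservedCPUs (KubeReservedCPUs + SystemReservedCPUs + HardEvictionThresholds) and rounds the result up to obtain the final number of reserved CPUs. If reservedCPUs is zero it returns an error, because the static policy requires the combined System Reserved and Kube Reserved CPU to be greater than zero.
It calls NewStaticPolicy to create the static policy. During creation, takeByTopology is called to pick the CPU set for the reserved CPUs using the same selection logic as the static policy, and the result is stored in StaticPolicy.reserved. (Note that the reserved CPU set is not actually written to any cgroup. It stays in the default CPU set and is never assigned to static policy containers, so the default CPU set can never be empty: it always contains at least the CPUs of the reserved CPU set.) When allocateCPUs computes assignableCPUs during AddContainer, these reserved CPUs are excluded.
Next, it calls state.NewCheckpointState, which creates the cpu_manager_state checkpoint file (leaving it intact if it already exists), initializes the memory state, and restores the contents of the checkpoint file into the memory state.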

The content of the cpu_manager_state checkpoint file is the JSON form of the CPUManagerCheckpoint struct. The keys of Entries are container IDs, and each value is the CPU set assigned to that container.

// CPUManagerCheckpoint struct is used to store cpu/pod assignments in a checkpoint
type CPUManagerCheckpoint struct {
	PolicyName    string            `json:"policyName"`
	DefaultCPUSet string            `json:"defaultCpuSet"`
	Entries       map[string]string `json:"entries,omitempty"`
	Checksum      checksum.Checksum `json:"checksum"`
}
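To get a feel for what such a file looks like, the Go sketch below serializes a struct with the same JSON field names. It is an approximation: checkpointSketch is a local type, the checksum is a plain uint64 left at zero rather than a real checksum, and the container ID and CPU set strings are made-up example values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpointSketch copies the JSON layout of CPUManagerCheckpoint. The
// Checksum field is a placeholder uint64; no real checksum is computed here.
type checkpointSketch struct {
	PolicyName    string            `json:"policyName"`
	DefaultCPUSet string            `json:"defaultCpuSet"`
	Entries       map[string]string `json:"entries,omitempty"`
	Checksum      uint64            `json:"checksum"`
}

// encodeCheckpoint renders the checkpoint as JSON.
func encodeCheckpoint(cp checkpointSketch) ([]byte, error) {
	return json.Marshal(cp)
}

// decodeCheckpoint parses JSON back into a checkpoint.
func decodeCheckpoint(data []byte) (checkpointSketch, error) {
	var cp checkpointSketch
	err := json.Unmarshal(data, &cp)
	return cp, err
}

func main() {
	cp := checkpointSketch{
		PolicyName:    "static",
		DefaultCPUSet: "0-1,4-7",                                // hypothetical shared pool
		Entries:       map[string]string{"container-a": "2-3"}, // hypothetical container ID
	}
	data, err := encodeCheckpoint(cp)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))

	restored, err := decodeCheckpoint(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(restored.Entries["container-a"]) // 2-3
}
```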

Next comes starting the CPU Manager.

func (m *manager) Start(activePods ActivePodsFunc, podStatusProvider status.PodStatusProvider, containerRuntime runtimeService) {
	glog.Infof("[cpumanager] starting with %s policy", m.policy.Name())
	glog.Infof("[cpumanager] reconciling every %v", m.reconcilePeriod)

	m.activePods = activePods
	m.podStatusProvider = podStatusProvider
	m.containerRuntime = containerRuntime

	m.policy.Start(m.state)
	if m.policy.Name() == string(PolicyNone) {
		return
	}
	go wait.Until(func() { m.reconcileState() }, m.reconcilePeriod, wait.NeverStop)
}

It starts the static policy;
It starts the reconcile loop.

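What exactly does the reconcile loop do?

The CPU Manager reconcile loop runs at the interval set by --cpu-manager-reconcile-period. Each pass mainly does the following:

Iterate over all containers of all activePods, including InitContainers, and process each container as follows.
Check whether the container ID is present in the memory state assignments maintained by the CPU Manager.
- If it is not in the memory state assignments:
  - If the Pod's Status.Phase is Running and its DeletionTimestamp is nil, call the CPU Manager's AddContainer, which checks the QoS class and CPU request of the container and Pod. If the static policy conditions are met, takeByTopology assigns the "best" CPU set to the container, and the result is written to the memory state and the checkpoint file (cpu_manager_state). Processing then continues with the steps below.
  - Otherwise (the Pod is not Running, or its DeletionTimestamp is set), skip the container; processing of this container ends here. Containers that do not meet the static policy conditions never appear in the memory state assignments; for them the code falls through to GetCPUSetOrDefault, which returns the default (shared) CPU set.
- If the container ID is in the memory state assignments, continue with the steps below.
Then fetch the CPU set for the container ID from the memory state (or the default CPU set if the container has no assignment).
Finally, call updateContainerCPUSet, which applies the CPU set to the cpuset cgroups through the CRI.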

pkg/kubelet/cm/cpumanager/cpu_manager.go:219

func (m *manager) reconcileState() (success []reconciledContainer, failure []reconciledContainer) {
	success = []reconciledContainer{}
	failure = []reconciledContainer{}

	for _, pod := range m.activePods() {
		allContainers := pod.Spec.InitContainers
		allContainers = append(allContainers, pod.Spec.Containers...)
		for _, container := range allContainers {
			status, ok := m.podStatusProvider.GetPodStatus(pod.UID)
			if !ok {
				glog.Warningf("[cpumanager] reconcileState: skipping pod; status not found (pod: %s, container: %s)", pod.Name, container.Name)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, ""})
				break
			}

			containerID, err := findContainerIDByName(&status, container.Name)
			if err != nil {
				glog.Warningf("[cpumanager] reconcileState: skipping container; ID not found in status (pod: %s, container: %s, error: %v)", pod.Name, container.Name, err)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, ""})
				continue
			}

			// Check whether container is present in state, there may be 3 reasons why it's not present:
			// - policy does not want to track the container
			// - kubelet has just been restarted - and there is no previous state file
			// - container has been removed from state by RemoveContainer call (DeletionTimestamp is set)
			if _, ok := m.state.GetCPUSet(containerID); !ok {
				if status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
					glog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
					err := m.AddContainer(pod, &container, containerID)
					if err != nil {
						glog.Errorf("[cpumanager] reconcileState: failed to add container (pod: %s, container: %s, container id: %s, error: %v)", pod.Name, container.Name, containerID, err)
						failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
						continue
					}
				} else {
					// if DeletionTimestamp is set, pod has already been removed from state
					// skip the pod/container since it's not running and will be deleted soon
					continue
				}
			}

			cset := m.state.GetCPUSetOrDefault(containerID)
			if cset.IsEmpty() {
				// NOTE: This should not happen outside of tests.
				glog.Infof("[cpumanager] reconcileState: skipping container; assigned cpuset is empty (pod: %s, container: %s)", pod.Name, container.Name)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
				continue
			}

			glog.V(4).Infof("[cpumanager] reconcileState: updating container (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
			err = m.updateContainerCPUSet(containerID, cset)
			if err != nil {
				glog.Errorf("[cpumanager] reconcileState: failed to update container (pod: %s, container: %s, container id: %s, cpuset: \"%v\", error: %v)", pod.Name, container.Name, containerID, cset, err)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
				continue
			}
			success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
		}
	}
	return success, failure
}
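The branching in the listing above boils down to three inputs per container. The following Go sketch captures just that decision as a pure function; action and reconcileAction are hypothetical names, and error handling and the empty-cpuset check are left out.

```go
package main

import "fmt"

// action describes what one reconcile pass does with a single container.
type action string

const (
	skip          action = "skip"
	addThenUpdate action = "add to state, then update cgroup"
	updateOnly    action = "update cgroup"
)

// reconcileAction mirrors the branching of the reconcile pass:
//   - container already tracked in state     -> push its CPU set to the cgroup
//   - untracked, pod Running and not deleted -> AddContainer first, then push
//   - untracked otherwise                    -> skip
func reconcileAction(inState, podRunning, deletionPending bool) action {
	if inState {
		return updateOnly
	}
	if podRunning && !deletionPending {
		return addThenUpdate
	}
	return skip
}

func main() {
	fmt.Println(reconcileAction(true, true, false))  // update cgroup
	fmt.Println(reconcileAction(false, true, false)) // add to state, then update cgroup
	fmt.Println(reconcileAction(false, true, true))  // skip
	fmt.Println(reconcileAction(false, false, false)) // skip
}
```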

Validate State

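Besides starting a goroutine for the reconcile loop, the CPU Manager also validates the state at start-up:

If the shared (default) CPU set in the memory state is empty, the CPU assignments must be empty too. The shared pool in the memory state is then initialized and written to the checkpoint file (initializing the checkpoint).
As long as we have not deleted the checkpoint file by hand, state.NewCheckpointState, mentioned earlier, restores the memory state from it, so the previously assigned CPU sets and the default CPU set are still there.
If the memory state has already been initialized (restored from the checkpoint), check whether the reserved CPU set for this start-up lies entirely within the default CPU set. If it does not (for example because the kube/system reserved CPUs were increased), return an error: some CPUs in the reserved CPU set have been assigned to containers, which may prevent those containers from starting. The user then has to fix the checkpoint file by hand.
Once the reserved CPU set check passes, check whether the default CPU set and the assigned CPU sets overlap. If they do, the data restored from the checkpoint file into the memory state is wrong, and an error is returned.
Finally, check whether the set of all CPUs in the CPU topology obtained from cAdvisor at this start-up equals the set of all CPUs recorded in the memory state (restored from the checkpoint), that is, default CPU set + assigned CPU sets. If it does not, return an error. This can happen when the CPUs available on the node changed between the last CPU Manager shutdown and this start-up.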

pkg/kubelet/cm/cpumanager/policy_static.go:116

func (p *staticPolicy) validateState(s state.State) error {
	tmpAssignments := s.GetCPUAssignments()
	tmpDefaultCPUset := s.GetDefaultCPUSet()

	// Default cpuset cannot be empty when assignments exist
	if tmpDefaultCPUset.IsEmpty() {
		if len(tmpAssignments) != 0 {
			return fmt.Errorf("default cpuset cannot be empty")
		}
		// state is empty initialize
		allCPUs := p.topology.CPUDetails.CPUs()
		s.SetDefaultCPUSet(allCPUs)
		return nil
	}

	// State has already been initialized from file (is not empty)
	// 1. Check if the reserved cpuset is not part of default cpuset because:
	// - kube/system reserved have changed (increased) - may lead to some containers not being able to start
	// - user tampered with file
	if !p.reserved.Intersection(tmpDefaultCPUset).Equals(p.reserved) {
		return fmt.Errorf("not all reserved cpus: \"%s\" are present in defaultCpuSet: \"%s\"",
			p.reserved.String(), tmpDefaultCPUset.String())
	}

	// 2. Check if state for static policy is consistent
	for cID, cset := range tmpAssignments {
		// None of the cpu in DEFAULT cset should be in s.assignments
		if !tmpDefaultCPUset.Intersection(cset).IsEmpty() {
			return fmt.Errorf("container id: %s cpuset: \"%s\" overlaps with default cpuset \"%s\"",
				cID, cset.String(), tmpDefaultCPUset.String())
		}
	}

	// 3. It's possible that the set of available CPUs has changed since
	// the state was written. This can be due to for example
	// offlining a CPU when kubelet is not running. If this happens,
	// CPU manager will run into trouble when later it tries to
	// assign non-existent CPUs to containers. Validate that the
	// topology that was received during CPU manager startup matches with
	// the set of CPUs stored in the state.
	totalKnownCPUs := tmpDefaultCPUset.Clone()
	for _, cset := range tmpAssignments {
		totalKnownCPUs = totalKnownCPUs.Union(cset)
	}
	if !totalKnownCPUs.Equals(p.topology.CPUDetails.CPUs()) {
		return fmt.Errorf("current set of available CPUs \"%s\" doesn't match with CPUs in state \"%s\"",
			p.topology.CPUDetails.CPUs().String(), totalKnownCPUs.String())
	}

	return nil
}
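The same checks can be tried out on plain integer sets. The Go sketch below is a simplified re-implementation for experimentation, not the kubelet code: set, newSet, and validateSketch are hypothetical names, and the fresh-state case just returns nil instead of initializing the default set.

```go
package main

import (
	"errors"
	"fmt"
)

// set is a minimal integer set used in place of cpuset.CPUSet.
type set map[int]bool

func newSet(cpus ...int) set {
	s := set{}
	for _, c := range cpus {
		s[c] = true
	}
	return s
}

// validateSketch applies the checks described above to a restored state.
func validateSketch(all, reserved, def set, assignments map[string]set) error {
	if len(def) == 0 {
		if len(assignments) != 0 {
			return errors.New("default cpuset cannot be empty")
		}
		return nil // fresh state: the caller would initialize def to all CPUs
	}
	// 1. Every reserved CPU must still be in the default set.
	for c := range reserved {
		if !def[c] {
			return fmt.Errorf("reserved cpu %d is not in the default cpuset", c)
		}
	}
	// 2. No assigned CPU may also be in the default set.
	known := set{}
	for c := range def {
		known[c] = true
	}
	for id, cset := range assignments {
		for c := range cset {
			if def[c] {
				return fmt.Errorf("container %s overlaps with default cpuset on cpu %d", id, c)
			}
			known[c] = true
		}
	}
	// 3. Default + assigned must equal the CPUs discovered at start-up.
	if len(known) != len(all) {
		return errors.New("available CPUs do not match CPUs in state")
	}
	for c := range all {
		if !known[c] {
			return errors.New("available CPUs do not match CPUs in state")
		}
	}
	return nil
}

func main() {
	all := newSet(0, 1, 2, 3)
	assigned := map[string]set{"container-a": newSet(2, 3)}
	fmt.Println(validateSketch(all, newSet(0), newSet(0, 1), assigned)) // <nil>
	// Reserved CPUs grew to include CPU 2, which is already assigned.
	fmt.Println(validateSketch(all, newSet(0, 2), newSet(0, 1), assigned))
}
```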

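Questions to consider

Suppose a CPU in the shared pool is being used by containers of non-Guaranteed Pods and is later assigned by the CPU Manager to a static policy container. What happens to the tasks already running on that CPU? Are they moved to other CPUs in the shared pool immediately?

When a static policy container is added, the CPU Manager not only picks the best CPU set for it but also removes that CPU set from the shared pool's CPU set. So in this situation the tasks already on that CPU keep running until the CPU scheduler next schedules them; by then the cpuset cgroups have taken effect and they can no longer see that CPU.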

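Is a static policy container pinned exclusively to its assigned CPUs for its entire lifetime?

As the workflow analysis above shows, once a static policy container has been assigned CPUs, the assignments in the memory state are pushed to the cpuset cgroups by the reconcile loop that runs every 10s (by default). In the worst case, therefore, the static policy container shares those CPUs with non-static containers for up to 10s.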

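If the CPU Manager checkpoint file is corrupted and no longer matches the actual CPU assignments, how can it be repaired?

From the analysis above we know that the reconcile loop cannot repair this discrepancy by itself. It can be repaired in either of the following ways:

Method 1: regenerate the checkpoint file. Delete the checkpoint file and restart the kubelet. The CPU Manager's reconcile mechanism walks all containers, reassigns CPUs to those that meet the static policy conditions, and applies the result to the cpuset cgroups. Running containers may be moved to a different CPU set, which can cause brief application jitter.

Method 2: drain the node so that its Pods are evicted and rescheduled onto other nodes with healthy checkpoints, then empty or delete the checkpoint file. This approach also has some impact on applications, since the Pods have to be recreated on other nodes.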

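Limitations of the CPU Manager

Given cAdvisor's current CPU topology discovery capability, the CPU Manager cannot yet take into account whether a CPU socket is close to a particular PCI bus when it picks CPUs for a container.

The CPU Manager is not yet compatible with the isolcpus Linux kernel boot parameter. It would need to obtain the CPUs isolated by isolcpus (through cAdvisor or by reading them directly) and exclude them when assigning CPUs to static policy containers.
Dynamic assignment, that is, changing a container's cpuset cgroups while it is running, is not supported yet.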

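Summary

Through this in-depth analysis of the kubelet CPU Manager, we now have a thorough understanding of how it works, including its reconcile loop, the state validation at start-up, the checkpoint mechanism and how to repair it, and the CPU Manager's current limitations.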