kubernetes version: v1.3.0
While analyzing the kubelet startup flow you keep running into various GC components, so this post pulls them out for a closer, standalone look.
kubelet's Garbage Collection consists of two parts:
containerGC: removes exited containers according to the configured container eviction policy
imageManager: manages the lifecycle of all images on the node; it in turn relies on cAdvisor.
The imageManager eviction policy structure is as follows:
type ImageGCPolicy struct {
	// Any usage above this threshold will always trigger garbage collection.
	// This is the highest usage we will allow.
	HighThresholdPercent int

	// Any usage below this threshold will never trigger garbage collection.
	// This is the lowest threshold we will try to garbage collect to.
	LowThresholdPercent int

	// Minimum age at which a image can be garbage collected.
	MinAge time.Duration
}
The defaults for this structure are set in UnsecuredKubeletConfig() in cmd/kubelet/app/server.go.
func UnsecuredKubeletConfig(s *options.KubeletServer) (*KubeletConfig, error) {
	...
	imageGCPolicy := kubelet.ImageGCPolicy{
		MinAge:               s.ImageMinimumGCAge.Duration,
		HighThresholdPercent: int(s.ImageGCHighThresholdPercent),
		LowThresholdPercent:  int(s.ImageGCLowThresholdPercent),
	}
	...
}
The KubeletServer fields used above are initialized in NewKubeletServer() in cmd/kubelet/app/options/options.go:
func NewKubeletServer() *KubeletServer {
	return &KubeletServer{
		...
		ImageMinimumGCAge:           unversioned.Duration{Duration: 2 * time.Minute},
		ImageGCHighThresholdPercent: 90,
		ImageGCLowThresholdPercent:  80,
		...
	}
}
From this initialization we can conclude:
when image disk usage rises above 90%, imageGC keeps being triggered
when image disk usage is below 80%, imageGC is never triggered
imageGC tries to delete the least recently used images first, but an image detected less than 2 minutes ago will not be deleted (a worked example of the threshold math follows below).
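To make the two thresholds concrete, here is a minimal standalone sketch (not kubelet code; capacity and available are made-up numbers) of the same arithmetic that GarbageCollect() performs further below:

package main

import "fmt"

func main() {
	const (
		highThresholdPercent = 90 // same default as ImageGCHighThresholdPercent
		lowThresholdPercent  = 80 // same default as ImageGCLowThresholdPercent
	)
	capacity := int64(100 << 30) // assume a 100 GiB image filesystem
	available := int64(8 << 30)  // assume 8 GiB still free

	// Same formula as realImageManager.GarbageCollect().
	usagePercent := 100 - int(available*100/capacity)
	fmt.Printf("usage: %d%%\n", usagePercent) // 92%, above the 90% high threshold

	if usagePercent >= highThresholdPercent {
		// Free enough bytes to bring usage back down to the 80% low threshold.
		amountToFree := capacity*int64(100-lowThresholdPercent)/100 - available
		fmt.Printf("to free: %d GiB\n", amountToFree>>30) // 20 GiB target - 8 GiB free = 12 GiB
	}
}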
Everything above is just the initialization of the imageManager eviction policy; now for imageManager itself.
File: pkg/kubelet/image_manager.go
The structure is as follows:
type imageManager interface {
	// Applies the garbage collection policy. Errors include being unable to free
	// enough space as per the garbage collection policy.
	GarbageCollect() error

	// Start async garbage collection of images.
	Start() error

	GetImageList() ([]kubecontainer.Image, error)
	// TODO(vmarmol): Have this subsume pulls as well.
}
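Of these methods, GarbageCollect() is dissected at length below. Start() is never shown in this post; judging from the async comment above and the initialized flag on realImageManager, it presumably just refreshes the image usage records on a timer, along these lines (a hedged sketch, not verified against the source):

func (im *realImageManager) Start() error {
	go wait.Until(func() {
		// On the first pass, record images with a zero "detected" time so
		// pre-existing images are immediately eligible for eviction.
		var ts time.Time
		if im.initialized {
			ts = time.Now()
		}
		if err := im.detectImages(ts); err != nil {
			glog.Warningf("[ImageManager] Failed to monitor images: %v", err)
		} else {
			im.initialized = true
		}
	}, 5*time.Minute, wait.NeverStop)
	return nil
}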
As you can see, imageManager is just an interface; the concrete type actually constructed is realImageManager:
type realImageManager struct {
	// Container runtime
	runtime container.Runtime

	// Records of images and their use.
	imageRecords     map[string]*imageRecord
	imageRecordsLock sync.Mutex

	// The image garbage collection policy in use.
	policy ImageGCPolicy

	// cAdvisor instance.
	cadvisor cadvisor.Interface

	// Recorder for Kubernetes events.
	recorder record.EventRecorder

	// Reference to this node.
	nodeRef *api.ObjectReference

	// Track initialization
	initialized bool
}
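The imageRecord entries in that map are what freeSpace() sorts on later. From the fields referenced there (firstDetected, lastUsed, size), the record presumably looks like this:

type imageRecord struct {
	// Time when this image was first detected.
	firstDetected time.Time

	// Time when we last saw this image being used.
	lastUsed time.Time

	// Size of the image in bytes.
	size int64
}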
Its construction takes us back to NewMainKubelet() in pkg/kubelet/kubelet.go:
func NewMainKubelet(
	hostname string,
	nodeName string,
	...
) (*Kubelet, error) {
	...
	// setup containerGC
	containerGC, err := kubecontainer.NewContainerGC(klet.containerRuntime, containerGCPolicy)
	if err != nil {
		return nil, err
	}
	klet.containerGC = containerGC

	// setup imageManager
	imageManager, err := newImageManager(klet.containerRuntime, cadvisorInterface, recorder, nodeRef, imageGCPolicy)
	if err != nil {
		return nil, fmt.Errorf("failed to initialize image manager: %v", err)
	}
	klet.imageManager = imageManager
	...
}
This function initializes both containerGC and imageManager; imageManager comes first here, and containerGC is covered further down.
newImageManager() looks like this:
func newImageManager(runtime container.Runtime, cadvisorInterface cadvisor.Interface, recorder record.EventRecorder, nodeRef *api.ObjectReference, policy ImageGCPolicy) (imageManager, error) {
	// Validate the policy parameters.
	if policy.HighThresholdPercent < 0 || policy.HighThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid HighThresholdPercent %d, must be in range [0-100]", policy.HighThresholdPercent)
	}
	if policy.LowThresholdPercent < 0 || policy.LowThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid LowThresholdPercent %d, must be in range [0-100]", policy.LowThresholdPercent)
	}
	if policy.LowThresholdPercent > policy.HighThresholdPercent {
		return nil, fmt.Errorf("LowThresholdPercent %d can not be higher than HighThresholdPercent %d", policy.LowThresholdPercent, policy.HighThresholdPercent)
	}

	// Build the realImageManager.
	im := &realImageManager{
		runtime:      runtime,
		policy:       policy,
		imageRecords: make(map[string]*imageRecord),
		cadvisor:     cadvisorInterface,
		recorder:     recorder,
		nodeRef:      nodeRef,
		initialized:  false,
	}
	return im, nil
}
From this constructor you can see that imageManager depends on the container runtime, cAdvisor, an EventRecorder, a nodeRef, and the policy.
Some educated guesses:
runtime performs the actual image deletions
cAdvisor reports how much disk the images occupy
EventRecorder emits the concrete eviction events
Policy is the eviction policy itself
And nodeRef? Not obvious yet, but as GarbageCollect() below shows, it is the node object that the GC warning events get recorded against.
With all parameters initialized, the real GC startup flow begins; for that we need CreateAndInitKubelet().
File: cmd/kubelet/app/server.go
Call chain: main -> app.Run -> run -> RunKubelet -> CreateAndInitKubelet
The function:
func CreateAndInitKubelet(kc *KubeletConfig) (k KubeletBootstrap, pc *config.PodConfig, err error) {
	...
	k.StartGarbageCollection()
	return k, pc, nil
}
It invokes StartGarbageCollection(), which is implemented as follows:
func (kl *Kubelet) StartGarbageCollection() {
	go wait.Until(func() {
		if err := kl.containerGC.GarbageCollect(kl.sourcesReady.AllReady()); err != nil {
			glog.Errorf("Container garbage collection failed: %v", err)
		}
	}, ContainerGCPeriod, wait.NeverStop)

	go wait.Until(func() {
		if err := kl.imageManager.GarbageCollect(); err != nil {
			glog.Errorf("Image garbage collection failed: %v", err)
		}
	}, ImageGCPeriod, wait.NeverStop)
}
This starts one goroutine for containerGC and one for imageManager: containerGC is triggered every 1 minute and imageManager every 5 minutes.
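The two periods come from constants in pkg/kubelet/kubelet.go; in v1.3 they are presumably defined along these lines:

const (
	// Period for performing container garbage collection.
	ContainerGCPeriod = time.Minute

	// Period for performing image garbage collection.
	ImageGCPeriod = 5 * time.Minute
)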
GarbageCollect() has to be read against the realImageManager built earlier, so let's step into kl.imageManager.GarbageCollect() and see what it does:
func (im *realImageManager) GarbageCollect() error {
	// Get the disk usage of the images filesystem on this node from cAdvisor.
	fsInfo, err := im.cadvisor.ImagesFsInfo()
	if err != nil {
		return err
	}

	// Capacity and available space.
	capacity := int64(fsInfo.Capacity)
	available := int64(fsInfo.Available)
	if available > capacity {
		glog.Warningf("available %d is larger than capacity %d", available, capacity)
		available = capacity
	}

	// Check valid capacity.
	if capacity == 0 {
		err := fmt.Errorf("invalid capacity %d on device %q at mount point %q", capacity, fsInfo.Device, fsInfo.Mountpoint)
		im.recorder.Eventf(im.nodeRef, api.EventTypeWarning, container.InvalidDiskCapacity, err.Error())
		return err
	}

	// Check whether image disk usage has reached HighThresholdPercent.
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent >= im.policy.HighThresholdPercent {
		// Try to reclaim enough space to get back below LowThresholdPercent.
		amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
		glog.Infof("[ImageManager]: Disk usage on %q (%s) is at %d%% which is over the high threshold (%d%%). Trying to free %d bytes", fsInfo.Device, fsInfo.Mountpoint, usagePercent, im.policy.HighThresholdPercent, amountToFree)

		// The function that actually frees space.
		freed, err := im.freeSpace(amountToFree, time.Now())
		if err != nil {
			return err
		}

		if freed < amountToFree {
			err := fmt.Errorf("failed to garbage collect required amount of images. Wanted to free %d, but freed %d", amountToFree, freed)
			im.recorder.Eventf(im.nodeRef, api.EventTypeWarning, container.FreeDiskSpaceFailed, err.Error())
			return err
		}
	}
	return nil
}
The key call here is im.freeSpace(), which does the actual reclamation.
It takes two arguments: how many bytes this round should free, and the time at which the collection was triggered.
Let's step in and look at the details:
func (im *realImageManager) freeSpace(bytesToFree int64, freeTime time.Time) (int64, error) {
	// Walk all existing images via im.runtime and refresh im.imageRecords, used below.
	err := im.detectImages(freeTime)
	if err != nil {
		return 0, err
	}

	// Lock protecting imageRecords.
	im.imageRecordsLock.Lock()
	defer im.imageRecordsLock.Unlock()

	// Collect all images.
	images := make([]evictionInfo, 0, len(im.imageRecords))
	for image, record := range im.imageRecords {
		images = append(images, evictionInfo{
			id:          image,
			imageRecord: *record,
		})
	}
	sort.Sort(byLastUsedAndDetected(images))

	// The loop below tries to delete images until enough space has been freed.
	var lastErr error
	spaceFreed := int64(0)
	for _, image := range images {
		glog.V(5).Infof("Evaluating image ID %s for possible garbage collection", image.id)

		// Images that are currently in used were given a newer lastUsed.
		if image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {
			glog.V(5).Infof("Image ID %s has lastUsed=%v which is >= freeTime=%v, not eligible for garbage collection", image.id, image.lastUsed, freeTime)
			break
		}

		// Avoid garbage collect the image if the image is not old enough.
		// In such a case, the image may have just been pulled down, and will be used by a container right away.
		// The minimum idle time comes from the GC policy (MinAge).
		if freeTime.Sub(image.firstDetected) < im.policy.MinAge {
			glog.V(5).Infof("Image ID %s has age %v which is less than the policy's minAge of %v, not eligible for garbage collection", image.id, freeTime.Sub(image.firstDetected), im.policy.MinAge)
			continue
		}

		// Ask the runtime (i.e. Docker) to remove the image.
		glog.Infof("[ImageManager]: Removing image %q to free %d bytes", image.id, image.size)
		err := im.runtime.RemoveImage(container.ImageSpec{Image: image.id})
		if err != nil {
			lastErr = err
			continue
		}

		// Drop the deleted image from imageRecords (hence the lock above).
		delete(im.imageRecords, image.id)

		// Account for the freed bytes.
		spaceFreed += image.size

		// Stop once enough space has been freed.
		if spaceFreed >= bytesToFree {
			break
		}
	}

	return spaceFreed, lastErr
}
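The eviction order is decided by the byLastUsedAndDetected sort: least recently used first, with the older detection time as tie-breaker. Its comparator is presumably the obvious one, roughly:

type byLastUsedAndDetected []evictionInfo

func (ev byLastUsedAndDetected) Len() int      { return len(ev) }
func (ev byLastUsedAndDetected) Swap(i, j int) { ev[i], ev[j] = ev[j], ev[i] }
func (ev byLastUsedAndDetected) Less(i, j int) bool {
	// Sort by least recently used, breaking ties by oldest detection time.
	if ev[i].lastUsed.Equal(ev[j].lastUsed) {
		return ev[i].firstDetected.Before(ev[j].firstDetected)
	}
	return ev[i].lastUsed.Before(ev[j].lastUsed)
}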
That is essentially the whole imageManager flow; from here you could dig further into the cAdvisor and Docker runtime implementations.
The containerGC eviction policy structure is as follows:
type ContainerGCPolicy struct {
	// Minimum age at which a container can be garbage collected, zero for no limit.
	MinAge time.Duration

	// Max number of dead containers any single pod (UID, container name) pair is
	// allowed to have, less than zero for no limit.
	MaxPerPodContainer int

	// Max number of total dead containers, less than zero for no limit.
	MaxContainers int
}
This structure is initialized in CreateAndInitKubelet() in cmd/kubelet/app/server.go.
Call chain: main -> app.Run -> run -> RunKubelet -> CreateAndInitKubelet
func CreateAndInitKubelet(kc *KubeletConfig) (k KubeletBootstrap, pc *config.PodConfig, err error) {
	var kubeClient clientset.Interface
	if kc.KubeClient != nil {
		kubeClient = kc.KubeClient
		// TODO: remove this when we've refactored kubelet to only use clientset.
	}

	// Initialize the containerGC eviction policy.
	gcPolicy := kubecontainer.ContainerGCPolicy{
		MinAge:             kc.MinimumGCAge,
		MaxPerPodContainer: kc.MaxPerPodContainerCount,
		MaxContainers:      kc.MaxContainerCount,
	}
	...
}
The actual values come from the kc structure, which is filled in by UnsecuredKubeletConfig() in cmd/kubelet/app/server.go.
Call chain: main -> app.Run -> UnsecuredKubeletConfig
func UnsecuredKubeletConfig(s *options.KubeletServer) (*KubeletConfig, error) {
	...
	MaxContainerCount:       int(s.MaxContainerCount),
	MaxPerPodContainerCount: int(s.MaxPerPodContainerCount),
	MinimumGCAge:            s.MinimumGCAge.Duration,
	...
}
The original values all live in the KubeletConfiguration structure embedded in KubeletServer; the relevant fields are:
type KubeletConfiguration struct {
	...
	// containerGC reclaims exited containers, but a container must have been
	// dead for longer than MinimumGCAge before it can be collected. Default: 1min
	MinimumGCAge unversioned.Duration `json:"minimumGCAge"`

	// Maximum number of dead containers each (pod, container name) pair may
	// keep around. Default: 2
	MaxPerPodContainerCount int32 `json:"maxPerPodContainerCount"`

	// Maximum number of dead containers retained on the node as a whole.
	MaxContainerCount int32 `json:"maxContainerCount"`
}
And these in turn get their defaults in NewKubeletServer() in cmd/kubelet/app/options/options.go:
func NewKubeletServer() *KubeletServer {
	...
	MaxContainerCount:       240,
	MaxPerPodContainerCount: 2,
	MinimumGCAge:            unversioned.Duration{Duration: 1 * time.Minute},
	...
}
From these defaults we can see:
the node retains at most 240 dead containers in total
each (pod, container name) pair retains at most 2 dead containers
a container becomes eligible for containerGC only 1 minute after it has exited
So the basic containerGC policy is clear.
Once the policy structure is initialized, the final step is constructing the containerGC object itself, back in NewMainKubelet() in pkg/kubelet/kubelet.go:
func NewMainKubelet(...) {
	...
	// setup containerGC
	containerGC, err := kubecontainer.NewContainerGC(klet.containerRuntime, containerGCPolicy)
	if err != nil {
		return nil, err
	}
	klet.containerGC = containerGC
	...
}
NewContainerGC() lives in pkg/kubelet/container/container_gc.go; let's see what it does:
func NewContainerGC(runtime Runtime, policy ContainerGCPolicy) (ContainerGC, error) {
	if policy.MinAge < 0 {
		return nil, fmt.Errorf("invalid minimum garbage collection age: %v", policy.MinAge)
	}

	return &realContainerGC{
		runtime: runtime,
		policy:  policy,
	}, nil
}
The function is simple: it wraps the policy into a realContainerGC, and with that the object is complete. Container GC necessarily goes through the runtime's interfaces (to inspect container state, delete containers, and so on), so carrying the actual runtime in the struct is inevitable.
Note the methods this object exposes; they are used below (a sketch of the interface follows).
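Reconstructed from the constructor above and the call in StartGarbageCollection(), the interface and concrete type presumably look like this (a sketch):

// pkg/kubelet/container/container_gc.go
type ContainerGC interface {
	// Garbage collect containers.
	GarbageCollect(allSourcesReady bool) error
}

type realContainerGC struct {
	// Container runtime used to list and remove containers.
	runtime Runtime

	// The GC policy in effect.
	policy ContainerGCPolicy
}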
With all parameters initialized, we enter the real GC startup flow; this was already covered in the imageManager discussion above, so straight to the point.
containerGC is started by the same StartGarbageCollection() shown earlier: its first wait.Until goroutine calls kl.containerGC.GarbageCollect(kl.sourcesReady.AllReady()) every ContainerGCPeriod.
Next, containerGC's GarbageCollect(); to find it we have to go back to the containerGC construction step.
Construction actually returned a realContainerGC, so GarbageCollect() is a method on that type:
func (cgc *realContainerGC) GarbageCollect(allSourcesReady bool) error {
	return cgc.runtime.GarbageCollect(cgc.policy, allSourcesReady)
}
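realContainerGC just delegates to the runtime, so the Runtime interface (pkg/kubelet/container/runtime.go) must carry a matching method; presumably something like:

type Runtime interface {
	// ... many other methods (GetPods, RemoveImage, ...) elided ...

	// GarbageCollect removes dead containers according to the given policy.
	GarbageCollect(gcPolicy ContainerGCPolicy, allSourcesReady bool) error
}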
At this point it is clear that containerGC follows exactly the same pattern as imageManager: one trick, used everywhere.
The runtime we use is Docker, so we need Docker's GarbageCollect() implementation. Runtime initialization was covered in the earlier post <Kubelet源码分析(二) DockerClient>, so let's skip the setup and go straight to the implementation.
Docker's GarbageCollect() is in pkg/kubelet/dockertools/container_gc.go:
func (cgc *containerGC) GarbageCollect(gcPolicy kubecontainer.ContainerGCPolicy, allSourcesReady bool) error {
	// Separate the collectable containers from all containers:
	// evictUnits: identifiable containers that are already dead and were
	//             created more than gcPolicy.MinAge ago
	// unidentifiedContainers: containers that cannot be identified
	evictUnits, unidentifiedContainers, err := cgc.evictableContainers(gcPolicy.MinAge)
	if err != nil {
		return err
	}

	// Remove the unidentified containers first.
	for _, container := range unidentifiedContainers {
		glog.Infof("Removing unidentified dead container %q with ID %q", container.name, container.id)
		err = cgc.client.RemoveContainer(container.id, dockertypes.ContainerRemoveOptions{RemoveVolumes: true})
		if err != nil {
			glog.Warningf("Failed to remove unidentified dead container %q: %v", container.name, err)
		}
	}

	// Once all sources are ready, remove the dead containers of deleted pods.
	if allSourcesReady {
		for key, unit := range evictUnits {
			if cgc.isPodDeleted(key.uid) {
				cgc.removeOldestN(unit, len(unit)) // Remove all.
				delete(evictUnits, key)
			}
		}
	}

	// Check every evict unit and trim each pod's surplus containers.
	if gcPolicy.MaxPerPodContainer >= 0 {
		cgc.enforceMaxContainersPerEvictUnit(evictUnits, gcPolicy.MaxPerPodContainer)
	}

	// Enforce the node-wide maximum:
	// if the node still holds more dead containers than the limit,
	// remove the surplus, oldest first.
	if gcPolicy.MaxContainers >= 0 && evictUnits.NumContainers() > gcPolicy.MaxContainers {
		// How many containers each evict unit may keep at most.
		numContainersPerEvictUnit := gcPolicy.MaxContainers / evictUnits.NumEvictUnits()
		if numContainersPerEvictUnit < 1 {
			numContainersPerEvictUnit = 1
		}
		cgc.enforceMaxContainersPerEvictUnit(evictUnits, numContainersPerEvictUnit)

		// If containers still need to go, remove the oldest ones first.
		numContainers := evictUnits.NumContainers()
		if numContainers > gcPolicy.MaxContainers {
			flattened := make([]containerGCInfo, 0, numContainers)
			for uid := range evictUnits {
				// Gather all containers into one slice.
				flattened = append(flattened, evictUnits[uid]...)
			}
			sort.Sort(byCreated(flattened))

			// Remove the numContainers-gcPolicy.MaxContainers oldest containers.
			cgc.removeOldestN(flattened, numContainers-gcPolicy.MaxContainers)
		}
	}

	// After removing containers, clean up the dangling log symlinks.
	logSymlinks, _ := filepath.Glob(path.Join(cgc.containerLogsDir, fmt.Sprintf("*.%s", LogSuffix)))
	for _, logSymlink := range logSymlinks {
		if _, err = os.Stat(logSymlink); os.IsNotExist(err) {
			err = os.Remove(logSymlink)
			if err != nil {
				glog.Warningf("Failed to remove container log dead symlink %q: %v", logSymlink, err)
			}
		}
	}

	return nil
}
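evictUnits groups dead containers by (pod UID, container name). Reconstructed from the calls above (NumContainers(), NumEvictUnits(), indexing by uid), the helper types are presumably along these lines:

// Key identifying one (pod UID, container name) pair.
type evictUnit struct {
	// UID of the pod.
	uid types.UID

	// Name of the container in the pod.
	name string
}

type containersByEvictUnit map[evictUnit][]containerGCInfo

// NumContainers returns the total number of dead containers across all units.
func (cu containersByEvictUnit) NumContainers() int {
	num := 0
	for key := range cu {
		num += len(cu[key])
	}
	return num
}

// NumEvictUnits returns the number of (pod, container name) pairs tracked.
func (cu containersByEvictUnit) NumEvictUnits() int {
	return len(cu)
}

A worked example of the MaxContainers branch: with MaxContainers=240 and, say, 300 dead containers spread over 250 evict units, 240/250 rounds down to 0 and is bumped to 1, so each unit is first trimmed to 1 container; the 250 survivors still exceed 240, so the 10 oldest are then removed globally.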
The source walkthrough above covered the implementation of imageManager and containerGC, including the GC policy knobs; the defaults can also be changed by passing flags to the kubelet.
image-gc-high-threshold: the disk usage percentage that triggers image garbage collection. Default: 90%
image-gc-low-threshold: the disk usage percentage that image GC tries to reclaim down to; GC is never triggered when usage is already below this value. Default: 80%
minimum-container-ttl-duration: how long a container must have been dead before it may be garbage collected. Default: 1min
maximum-dead-containers-per-container: how many dead instances of each (pod, container name) pair may be retained. Default: 2
maximum-dead-containers: the maximum number of dead containers retained on the kubelet's node. Default: 240
Containers can be garbage collected once they stop, but it is worth retaining some: a container may have exited abnormally, and dead containers keep logs and other useful data that developers need for troubleshooting. maximum-dead-containers-per-container and maximum-dead-containers are exactly the knobs for that, as in the sample invocation below.
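For example, a kubelet might be started with flags like these to keep more dead containers around for debugging (illustrative values only; the image MinAge from earlier corresponds to minimum-image-ttl-duration):

kubelet \
  --image-gc-high-threshold=85 \
  --image-gc-low-threshold=75 \
  --minimum-image-ttl-duration=2m \
  --minimum-container-ttl-duration=2m \
  --maximum-dead-containers-per-container=3 \
  --maximum-dead-containers=300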