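Kubernetes includes experimental support for scheduling AMD and NVIDIA GPUs across nodes. Support for NVIDIA GPUs was added in v1.6 and has since gone through several backwards-incompatible iterations. Support for AMD GPUs was added in v1.9 and is provided through a device plugin.
This page describes how to use GPUs across different Kubernetes versions, along with the current limitations.
From 1.8 onward, the recommended way to consume GPUs is through device plugins.
To enable GPU support before 1.10, the DevicePlugins feature gate has to be explicitly enabled across the system: --feature-gates="DevicePlugins=true". From 1.10 onward, this setting is no longer required.
You also need to install GPU drivers on each node. Both the drivers and the device plugin are provided by the corresponding GPU vendor (AMD, NVIDIA).
When the above conditions are met, Kubernetes exposes nvidia.com/gpu and amd.com/gpu as schedulable resources.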
You can consume these GPUs from your containers by requesting <vendor>.com/gpu, just like you request cpu or memory. However, there are some limitations in how you specify the resource requirements when using GPUs:
- GPUs may only be specified in the limits section, which means:
  - You can specify GPU limits without specifying requests, because Kubernetes will use the limit as the request value by default.
  - You can specify GPU in both limits and requests, but these two values must be equal.
  - You cannot specify GPU requests without specifying limits.
Here's an example:
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-vector-add
          # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU
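The rule that GPU requests, when given, must equal the limits can be illustrated with a variant of the example above. This is a minimal sketch rather than a manifest from the original page: the pod name is a hypothetical placeholder, and the image is reused from the preceding example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add-explicit   # hypothetical name for this sketch
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        requests:
          nvidia.com/gpu: 1   # optional, but if present it must equal the limit
        limits:
          nvidia.com/gpu: 1   # the limit is what actually reserves the GPU
```

Omitting the requests block gives the same result, since Kubernetes uses the limit as the request value by default.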
The official AMD GPU device plugin has the following requirements:
To deploy the AMD device plugin once your cluster is running and the above requirements are satisfied:
    # For Kubernetes v1.9
    kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.9/k8s-ds-amdgpu-dp.yaml
    # For Kubernetes v1.10
    kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.10/k8s-ds-amdgpu-dp.yaml
Report issues with this device plugin to RadeonOpenCompute/k8s-device-plugin.
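With the AMD device plugin deployed, a pod can request an AMD GPU in the same way as the NVIDIA example, using the amd.com/gpu resource name instead. The following is a minimal sketch under stated assumptions: the pod name, container name, and image are hypothetical placeholders and do not come from the plugin's documentation.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: amd-gpu-example            # hypothetical pod name
spec:
  restartPolicy: OnFailure
  containers:
    - name: amd-gpu-example        # hypothetical container name
      image: "example.com/rocm-workload:latest"   # placeholder image, substitute your own
      resources:
        limits:
          amd.com/gpu: 1           # requesting 1 AMD GPU
```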
There are currently two device plugin implementations for NVIDIA GPUs: the official NVIDIA GPU device plugin and the NVIDIA GPU device plugin used by GCE.
The official NVIDIA GPU device plugin has the following requirements:
To deploy the NVIDIA device plugin once your cluster is running and the above requirements are satisfied:
    # For Kubernetes v1.8
    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml
    # For Kubernetes v1.9
    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
Report issues with this device plugin to NVIDIA/k8s-device-plugin.
The NVIDIA GPU device plugin used by GCE doesn’t require using nvidia-docker and should work with any container runtime that is compatible with the Kubernetes Container Runtime Interface (CRI). It’s tested on Container-Optimized OS and has experimental code for Ubuntu from 1.9 onwards.
On your 1.12 cluster, you can use the following commands to install the NVIDIA drivers and device plugin:
    # Install NVIDIA drivers on Container-Optimized OS:
    kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml
    # Install NVIDIA drivers on Ubuntu (experimental):
    kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/ubuntu/daemonset.yaml
    # Install the device plugin:
    kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.12/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
Report issues with this device plugin and installation method to GoogleCloudPlatform/container-engine-accelerators.
Instructions for using NVIDIA GPUs on GKE are available in the GKE documentation.
If different nodes in your cluster have different types of NVIDIA GPUs, then you can use Node Labels and Node Selectors to schedule pods to appropriate nodes.
For example:
    # Label your nodes with the accelerator type they have.
    kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
    kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
Specify the GPU type in the pod spec:
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-vector-add
          # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1
      nodeSelector:
        accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
This ensures that the pod is scheduled to a node that has the GPU type you specified.