Introduction
TensorFlow is the most popular open-source framework for deep learning and machine learning. Originally developed by a Google research team for research on deep neural networks, it has been widely adopted since it was open-sourced in 2015. TensorBoard in particular is a tool that helps data scientists work very effectively.
Jupyter Notebook is a powerful data analysis tool. It supports rapid development and makes machine learning code easy to share, which makes it a favorite of data science teams for running experiments and collaborating, and a good entry point for machine learning beginners.
Developing TensorFlow in Jupyter is therefore the first choice of many data scientists. However, building such an environment from scratch, configuring GPU access, and keeping up with the latest TensorFlow release is both complicated and a waste of a data scientist's time.
On a Kubernetes cluster you can quickly deploy a complete Jupyter Notebook environment for model development. The only problem with this approach is that each notebook gets exclusive use of a GPU, which is quite wasteful: when data scientists experiment in notebooks they usually need little GPU memory, so letting several users share one GPU would lower the cost of model development.
To address this, the Alibaba Cloud Container Service team has released a GPU sharing solution that greatly improves GPU utilization in model development and inference scenarios while still keeping GPU resources isolated.
The exclusive-GPU approach
First, let's review how GPUs used to be scheduled.
Add a new GPU node to the cluster
- Create a Container Service cluster.
- Add a GPU node as a worker.
In this example we use the GPU instance type ecs.gn6i-c4g1.xlarge.
After the node is added it shows up as "cn-zhangjiakou.192.168.3.189":
```
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get node -L cgpu,workload_type
NAME                           STATUS   ROLES    AGE     VERSION            CGPU   WORKLOAD_TYPE
cn-zhangjiakou.192.168.0.138   Ready    master   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112   Ready    master   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113   Ready    <none>   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115   Ready    master   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.189   Ready    <none>   5m52s   v1.16.6-aliyun.1
```
Deploy the application
Deploy the application with `kubectl apply -f gpu_deployment.yaml`. The contents of gpu_deployment.yaml are as follows:
```yaml
---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook-gpu
  labels:
    app: tf-notebook-gpu
spec:
  replicas: 2
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook-gpu
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook-gpu
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8888
        env:
        - name: PASSWORD
          value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook-gpu
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook-gpu
  type: LoadBalancer
```
Because there is only one GPU node while the YAML above asks for two Pods, the scheduling turns out as shown below: the second Pod stays in Pending because there is no GPU left for it. In other words, a single Pod "monopolizes" the node's GPU.
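To confirm this from the scheduler's point of view, the node's GPU allocation can be inspected directly. A quick check (a sketch, assuming the node name added above):

```bash
# How many nvidia.com/gpu devices the node advertises and how many are already requested
kubectl describe node cn-zhangjiakou.192.168.3.189 | grep nvidia.com/gpu
kubectl describe node cn-zhangjiakou.192.168.3.189 | grep -A 8 "Allocated resources"
```

With a single physical GPU already held by the first Running Pod, the second replica has nothing left to bind to.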
```
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get pod
NAME                               READY   STATUS    RESTARTS   AGE
tf-notebook-2-7b4d68d8f7-mb852     1/1     Running   0          15h
tf-notebook-3-86c48d4c7d-flz7m     1/1     Running   0          15h
tf-notebook-7cf4575d78-sxmfl       1/1     Running   0          23h
tf-notebook-gpu-695cb6cf89-dsjmv   1/1     Running   0          6s
tf-notebook-gpu-695cb6cf89-mwm98   0/1     Pending   0          6s
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl describe pod tf-notebook-gpu-695cb6cf89-mwm98
Name:           tf-notebook-gpu-695cb6cf89-mwm98
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=tf-notebook-gpu
                pod-template-hash=695cb6cf89
Annotations:    kubernetes.io/psp: ack.privileged
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/tf-notebook-gpu-695cb6cf89
Containers:
  tf-notebook:
    Image:      tensorflow/tensorflow:1.4.1-gpu-py3
    Port:       8888/TCP
    Host Port:  0/TCP
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
      PASSWORD:  mypassw0rd
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-wpwn8 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-wpwn8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-wpwn8
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 6 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 6 Insufficient nvidia.com/gpu.
```
A real program
Run the following program in Jupyter:
```python
import argparse
import tensorflow as tf

FLAGS = None

def train(fraction=1.0):
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
    # Creates a session with log_device_placement set to True.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    sess = tf.Session(config=config)
    # Runs the op.
    while True:
        sess.run(c)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--total', type=float, default=1000,
                        help='Total GPU memory.')
    parser.add_argument('--allocated', type=float, default=1000,
                        help='Allocated GPU memory.')
    FLAGS, unparsed = parser.parse_known_args()
    # fraction = FLAGS.allocated / FLAGS.total * 0.85
    fraction = round(FLAGS.allocated * 0.7 / FLAGS.total, 1)
    print(fraction)
    # fraction defaults to 0.7, so the program uses at most 70% of the available GPU memory
    train(fraction)
```
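If the same code is saved as a script (the filename gpu_load.py below is only an example), it can also be started from a notebook terminal, with the memory fraction controlled through the two flags:

```bash
# With --allocated equal to --total, fraction = round(1.0 * 0.7, 1) = 0.7, the default behaviour
python gpu_load.py --total=15079 --allocated=15079
```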
The managed Prometheus service shows that, while running, the program uses 70% of the whole machine's GPU memory.
Problems with the exclusive-GPU approach
To sum up, the problem with exclusive-GPU scheduling is that in scenarios such as inference and teaching, where each workload needs only a small slice of the GPU, multiple Pods cannot be packed together to share one GPU.
To solve this we introduce a GPU sharing solution that makes better use of GPU resources and provides denser deployment, higher GPU utilization, and full isolation.
The GPU sharing solution
Environment preparation
Prerequisites
| Configuration | Supported versions |
|---|---|
| Kubernetes | 1.16.06; for dedicated clusters, the master nodes must be in the customer's VPC |
| Helm | 3.0 or later |
| NVIDIA driver | 418.87.01 or later |
| Docker | 19.03.5 |
| Operating system | CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04 |
| Supported GPUs | Tesla P4, Tesla P100, Tesla T4, Tesla V100 (16 GB) |
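Before installing the shared-GPU components it is worth verifying the driver and Docker versions on the GPU nodes; a minimal check, run on the node itself (assuming a standard NVIDIA and Docker installation):

```bash
# NVIDIA driver version -- must be 418.87.01 or later
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Docker engine version -- 19.03.5 expected
docker version --format '{{.Server.Version}}'
```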
Create a cluster
Add GPU nodes
The GPU node type used in this article is ecs.gn6i-c4g1.xlarge.
Set the node as a shared-GPU node by labeling it
- Log on to the Container Service console.
- In the left-side navigation pane, choose Clusters > Nodes.
- On the Nodes page, select the target cluster and click Manage Labels in the upper-right corner.
- On the Manage Labels page, select the nodes in a batch and click Add Label.
- In the dialog that appears, enter the label name and value.
  Note: make sure the name is set to cgpu and the value to true.
- Click OK.
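Alternatively, the same label can be applied from the command line (a sketch, using the node name that appears later in this example):

```bash
# Mark the GPU node as a shared-GPU (cGPU) node
kubectl label node cn-zhangjiakou.192.168.3.184 cgpu=true
```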
Install the cGPU component in the cluster
- Log on to the Container Service console.
- In the left-side navigation pane, choose Marketplace > App Catalog.
- On the App Catalog page, find and click ack-cgpu.
- On the App Catalog - ack-cgpu page, in the Deploy panel on the right, select the target cluster and click Create. You do not need to set a namespace or release name; the defaults are used.
You can run `helm get manifest cgpu -n kube-system | kubectl get -f -` to check whether the cGPU component was installed successfully. Output like the following means the installation succeeded.
```
# helm get manifest cgpu -n kube-system | kubectl get -f -
NAME                                     SECRETS   AGE
serviceaccount/gpushare-device-plugin    1         39s
serviceaccount/gpushare-schd-extender    1         39s

NAME                                                            AGE
clusterrole.rbac.authorization.k8s.io/gpushare-device-plugin    39s
clusterrole.rbac.authorization.k8s.io/gpushare-schd-extender    39s

NAME                                                                   AGE
clusterrolebinding.rbac.authorization.k8s.io/gpushare-device-plugin    39s
clusterrolebinding.rbac.authorization.k8s.io/gpushare-schd-extender    39s

NAME                             TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)           AGE
service/gpushare-schd-extender   NodePort   10.6.13.125   <none>        12345:32766/TCP   39s

NAME                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/cgpu-installer               4         4         4       4            4           cgpu=true       39s
daemonset.apps/device-plugin-evict-ds       4         4         4       4            4           cgpu=true       39s
daemonset.apps/device-plugin-recover-ds     0         0         0       0            0           cgpu=false      39s
daemonset.apps/gpushare-device-plugin-ds    4         4         4       4            4           cgpu=true       39s

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpushare-schd-extender   1/1     1            1           38s

NAME                           COMPLETIONS   DURATION   AGE
job.batch/gpushare-installer   3/1 of 3      3s         38s
```
Install arena to inspect resources
Install arena
On Linux:
```bash
wget http://kubeflow.oss-cn-beijing.aliyuncs.com/arena-installer-0.4.0-829b0e9-linux-amd64.tar.gz
tar -xzvf arena-installer-0.4.0-829b0e9-linux-amd64.tar.gz
sh ./arena-installer/install.sh
```
On macOS:
```bash
wget http://kubeflow.oss-cn-beijing.aliyuncs.com/arena-installer-0.4.0-829b0e9-darwin-amd64.tar.gz
tar -xzvf arena-installer-0.4.0-829b0e9-darwin-amd64.tar.gz
sh ./arena-installer/install.sh
```
Check the resource status
```
jumper(⎈ |zjk-gpu:default)➜ ~ arena top node
NAME                           IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)  GPU(Shareable)
cn-zhangjiakou.192.168.0.138   192.168.0.138  master  ready   0           0               No
cn-zhangjiakou.192.168.1.112   192.168.1.112  master  ready   0           0               No
cn-zhangjiakou.192.168.1.113   192.168.1.113  <none>  ready   0           0               No
cn-zhangjiakou.192.168.3.115   192.168.3.115  master  ready   0           0               No
cn-zhangjiakou.192.168.3.184   192.168.3.184  <none>  ready   1           0               Yes
------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster: 0/1 (0%)

jumper(⎈ |zjk-gpu:default)➜ ~ arena top node -s
NAME                           IPADDRESS      GPU0(Allocated/Total)
cn-zhangjiakou.192.168.3.184   192.168.3.184  0/14
---------------------------------------------------------------------
Allocated/Total GPU Memory In GPUShare Node: 0/14 (GiB) (0%)
```
As shown above, node cn-zhangjiakou.192.168.3.184 has one GPU and is marked GPU(Shareable), i.e. it carries the cgpu=true label, and it exposes 14 GiB of GPU memory as a schedulable resource.
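The same information is visible in the node's extended resources; a quick check with kubectl (a sketch; aliyun.com/gpu-mem is the resource name used by the deployments below):

```bash
# The shared-GPU node now advertises GPU memory in GiB rather than whole GPUs
kubectl describe node cn-zhangjiakou.192.168.3.184 | grep aliyun.com/gpu-mem
```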
Run a TensorFlow GPU experimentation environment
Save the following file as mem_deployment.yaml and deploy the application with `kubectl apply -f mem_deployment.yaml`:
```yaml
---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
        - name: PASSWORD
          value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook
  type: LoadBalancer
```
```
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl apply -f mem_deployment.yaml
deployment.apps/tf-notebook created
service/tf-notebook created
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get svc tf-notebook
NAME          TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
tf-notebook   LoadBalancer   172.21.2.50   39.100.193.19   80:32285/TCP   78m
```
Visit http://${EXTERNAL-IP}/ to reach the notebook.
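The EXTERNAL-IP can be read from the Service once the cloud load balancer has been provisioned (a sketch using the service name above):

```bash
# Print the public address assigned by the load balancer
kubectl get svc tf-notebook -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```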
Notes on the Deployment configuration:
- aliyun.com/gpu-mem: 4 requests 4 GiB of GPU memory for the container; it replaces the nvidia.com/gpu resource, which can only request whole GPUs.
- type: LoadBalancer exposes the internal service through an Alibaba Cloud load balancer.
- The PASSWORD environment variable sets the password for the Jupyter service; change it as needed (the default here is "mypassw0rd").
To verify that this Jupyter instance can use the GPU, run the following program, which lists all devices available to TensorFlow:
```python
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices())
```
The output lists the GPU as device GPU:0.
Create a new terminal from the Jupyter home page and run nvidia-smi. Inside the Pod, the GPU memory limit is shown as 4308 MiB.
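The same limit can also be read non-interactively from the Pod's terminal (a sketch; these are standard nvidia-smi query options):

```bash
# Total and used GPU memory visible to this Pod -- cGPU caps it at the requested 4 GiB, shown here as 4308 MiB
nvidia-smi --query-gpu=memory.total,memory.used --format=csv
```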
Verify GPU sharing
The section above shows that the new resource type aliyun.com/gpu-mem: 4 can be requested normally and that GPU workloads run against it. Now let's look at how the GPU is actually shared.
Check resource usage
First, the current resource usage as shown by `arena top node -s -d`:
```
jumper(⎈ |zjk-gpu:default)➜ ~ arena top node -s -d

NAME:       cn-zhangjiakou.192.168.3.184
IPADDRESS:  192.168.3.184

NAME                             NAMESPACE  GPU0(Allocated)
tf-notebook-2-7b4d68d8f7-wxlff   default    4
tf-notebook-3-86c48d4c7d-lk9h8   default    4
tf-notebook-7cf4575d78-9gxzd     default    4
Allocated :  12 (85%)
Total :      14
--------------------------------------------------------------------------------------------------------------------------------------

Allocated/Total GPU Memory In GPUShare Node: 12/14 (GiB) (85%)
```
As shown above, the node has 14 GiB of GPU memory, enough to schedule three of these Pods (3 × 4 GiB = 12 GiB fits within 14 GiB, while a fourth Pod would not).
Deploy more services and replicas
So that each notebook has its own entry point, we create three Services, each pointing at its own Pod. The YAML files are shown below.
Note: mem_deployment-2.yaml and mem_deployment-3.yaml are almost identical to mem_deployment.yaml; only the names change so that each Service selects a different Pod (they can be generated mechanically, as sketched below).
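A minimal way to generate the two variants, assuming the naming scheme above (note that this also renames the container, which is harmless but differs slightly from the files listed below):

```bash
# Derive the extra manifests by renaming every occurrence of tf-notebook
sed 's/tf-notebook/tf-notebook-2/g' mem_deployment.yaml > mem_deployment-2.yaml
sed 's/tf-notebook/tf-notebook-3/g' mem_deployment.yaml > mem_deployment-3.yaml
```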
mem_deployment-2.yaml
```yaml
---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook-2
  labels:
    app: tf-notebook-2
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook-2
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook-2
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
        - name: PASSWORD
          value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook-2
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook-2
  type: LoadBalancer
```
mem_deployment-3.yaml
```yaml
---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook-3
  labels:
    app: tf-notebook-3
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook-3
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook-3
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
        - name: PASSWORD
          value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook-3
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook-3
  type: LoadBalancer
```
Apply the two YAML files; together with the Pod and Service deployed earlier, the cluster now runs three Pods and three Services.
```
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl apply -f mem_deployment-2.yaml
deployment.apps/tf-notebook-2 created
service/tf-notebook-2 created
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl apply -f mem_deployment-3.yaml
deployment.apps/tf-notebook-3 created
service/tf-notebook-3 created
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get svc
NAME            TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
kubernetes      ClusterIP      172.21.0.1    <none>          443/TCP        11d
tf-notebook     LoadBalancer   172.21.2.50   39.100.193.19   80:32285/TCP   7h48m
tf-notebook-2   LoadBalancer   172.21.1.46   39.99.218.255   80:30659/TCP   8m53s
tf-notebook-3   LoadBalancer   172.21.8.56   39.98.242.180   80:31274/TCP   7s
jumper(⎈ |zjk-gpu:default)➜ ~ kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE   READINESS GATES
tf-notebook-2-7b4d68d8f7-mb852   1/1     Running   0          9m6s    172.20.64.21   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-3-86c48d4c7d-flz7m   1/1     Running   0          20s     172.20.64.22   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-sxmfl     1/1     Running   0          7h49m   172.20.64.14   cn-zhangjiakou.192.168.3.184   <none>           <none>
jumper(⎈ |zjk-gpu:default)➜ ~ arena top node -s
NAME                           IPADDRESS      GPU0(Allocated/Total)
cn-zhangjiakou.192.168.3.184   192.168.3.184  12/14
----------------------------------------------------------------------
Allocated/Total GPU Memory In GPUShare Node: 12/14 (GiB) (85%)
```
Check the final result
As shown above:
- `kubectl get pod -o wide` shows three Pods running on node cn-zhangjiakou.192.168.3.184.
- `arena top node -s` shows that 12 of the 14 GiB of GPU memory on that node are allocated.
- Opening a terminal in each of the services and checking with nvidia-smi shows that every Pod is capped at 4308 MiB.
Run the following command on node cn-zhangjiakou.192.168.3.184 to see resource usage from the node's point of view:
```
[root@iZ8vb4lox93w3mhkqmdrgsZ ~]# nvidia-smi
Wed May 27 12:19:25 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:07.0 Off |                    0 |
| N/A   49C    P0    29W /  70W |   4019MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11563      C   /usr/bin/python3                            4009MiB |
+-----------------------------------------------------------------------------+
```
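To watch per-process GPU memory on the node while the notebooks are running, nvidia-smi's query interface can be used instead of the full table (a sketch; standard nvidia-smi options):

```bash
# PID, process name, and GPU memory of every compute process on the card
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```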
This shows that in cGPU mode many more GPU-consuming Pods can be packed onto the same node, whereas with ordinary scheduling a GPU node can host only one such Pod per GPU.
A real program
Below is a piece of code that keeps running and consuming GPU resources. The fraction parameter is the share of the available GPU memory to allocate and defaults to 0.7. We run this program in the Jupyter notebooks of all three Pods.
```python
import argparse
import tensorflow as tf

FLAGS = None

def train(fraction=1.0):
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
    # Creates a session with log_device_placement set to True.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    sess = tf.Session(config=config)
    # Runs the op.
    while True:
        sess.run(c)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--total', type=float, default=1000,
                        help='Total GPU memory.')
    parser.add_argument('--allocated', type=float, default=1000,
                        help='Allocated GPU memory.')
    FLAGS, unparsed = parser.parse_known_args()
    # fraction = FLAGS.allocated / FLAGS.total * 0.85
    fraction = round(FLAGS.allocated * 0.7 / FLAGS.total, 1)
    print(fraction)
    # fraction defaults to 0.7, so the program uses at most 70% of the available GPU memory
    train(fraction)
```
Then we watch the actual resource usage with the managed Prometheus service.
As the chart above shows, each Pod actually uses about 3.266 GB of GPU memory; in other words, every Pod stays within its 4 GiB limit (with fraction = 0.7 the program tries to allocate roughly 70% of the 4308 MiB visible to the Pod, i.e. about 3 GB).
Summary
To sum up:
- Adding the cgpu=true label to a node turns it into a shared-GPU node.
- Requesting a resource of type aliyun.com/gpu-mem: 4 in a Pod both allocates and limits the GPU memory that single Pod may use, which is what makes GPU sharing possible, and every Pod still gets full GPU capability. One GPU on the node is shared by three Pods, pushing utilization to 300% -- splitting the resource into smaller pieces would raise it even further.
- `arena top node` and `arena top node -s` show how GPU resources are allocated.
- The "GPU APP" dashboard of the managed Prometheus service shows the GPU memory, GPU utilization, temperature, power, and other runtime metrics.
References
Managed Prometheus: https://help.aliyun.com/document_detail/122123.html
cGPU GPU sharing solution: https://help.aliyun.com/document_detail/163994.html
arena: https://github.com/kubeflow/arena