Installing the NVIDIA GPU Driver on GPU Servers

1. Confirm that the server OS is Ubuntu 16.04.2 (this must be done on every machine)
For pre-installation preparation, see the official guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions

for i in xsgpu81 xsgpu82  xsgpu83 xsgpu84 xsgpu85; do qssh root@$i 'cat /etc/issue;uname -r';done
Ubuntu 16.04.2 LTS \n \l
4.4.0-62-generic
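
The loop above only prints the release string for each host, so the comparison is done by eye. A minimal sketch of turning it into a pass/fail check follows; the function name check_release is hypothetical and the expected string is taken from the sample output above.

```shell
# Hypothetical helper: check_release ISSUE_TEXT
# Prints "ok" when the text read from /etc/issue mentions Ubuntu 16.04.2,
# otherwise prints "mismatch" and returns a non-zero status.
check_release() {
    case "$1" in
        *"Ubuntu 16.04.2"*) echo "ok" ;;
        *) echo "mismatch"; return 1 ;;
    esac
}

# Example on the local machine:
#   check_release "$(cat /etc/issue)"
```

To use it across hosts, the same remote loop can collect /etc/issue from each machine and pass the text to the function.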

2. Download and install the NVIDIA driver
If the machine is not clean (GPU-related software was installed on it before), you may need to run service lightdm stop first.

wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.26/NVIDIA-Linux-x86_64-375.26.run
root@xsgpu81:~# sudo sh NVIDIA-Linux-x86_64-375.26.run
Accept
OK
OK
OK

3. Install CUDA

wget http://ogo0b6qe6.bkt.clouddn.com/cuda_8.0.61_375.26_linux.run
chmod +x cuda_8.0.61_375.26_linux.run
sudo sh cuda_8.0.61_375.26_linux.run --silent
echo 'export PATH=/usr/local/cuda-8.0/bin:$PATH' >> /root/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH' >> /root/.bashrc
source /root/.bashrc
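
Running the echo lines above more than once appends duplicate entries to /root/.bashrc. A small sketch of an idempotent append is shown below; append_once is a hypothetical helper name, not part of the CUDA installer.

```shell
# Hypothetical helper: append_once LINE FILE
# Appends LINE to FILE only if an identical line is not already present.
append_once() {
    local line="$1" file="$2"
    touch "$file"
    # -x matches the whole line, -F treats LINE as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example:
#   append_once 'export PATH=/usr/local/cuda-8.0/bin:$PATH' /root/.bashrc
#   append_once 'export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH' /root/.bashrc
```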

4. Copy over the test binary

qscp NVIDIA_CUDA-8.0_Samples/0_Simple/vectorAdd/vectorAdd root@xsgpu81:/root/
root@xsgpu81:~# ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
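
When the sample is run on several machines, the result can be checked automatically instead of reading the output. This is only a sketch; sample_passed is a hypothetical name, and it relies on the "Test PASSED" line shown in the output above.

```shell
# Hypothetical helper: reads program output on stdin and succeeds only if it
# contains the "Test PASSED" line printed by the vectorAdd sample.
sample_passed() {
    grep -q '^Test PASSED$'
}

# Example:
#   ./vectorAdd | sample_passed && echo "GPU OK"
```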

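Manually deploying a mesos-agent node with GPU devices
Deploy mesos-agent and the other base services (boots-docker, consul, logbeat) on the GPU machine following the standard procedure.
Manual procedure:
Stop the mesos-agent service on the GPU machine: supervisorctl stop mesos-agent
Clean up the mesos-agent work_dir:
rm -rf $(cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir)
Go to the mesos-agent configuration directory /home/qboxserver/mesos-agent/current/conf/mesos-agent and update the configuration:
Get the number and model of the GPU devices with nvidia-smi -L; the number of GPUs listed is the device total.
Write the device model into the attributes file: echo "NETWORK:BRIDGE;GPU_MODEL:$MODEL" > attributes
Add the isolation setting: echo "cgroups/devices,gpu/nvidia" > isolation
List the usable GPU device indices: echo "0, 1, …, <device total - 1>" > nvidia_gpu_devices
Add the GPU resource to resources: {"name":"gpus","type":"SCALAR","scalar":{"value":<device total>}}
Go to /home/qboxserver/mesos-agent/current/libexec/mesos and replace the executor:
Keep the original executor: mv mesos-docker-executor mesos-docker-executor.cpp
Download the GPU executor: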

wget http://ogo0b6qe6.bkt.clouddn.com/mesos-docker-executor-2017-11-18
mv mesos-docker-executor-2017-11-18 mesos-docker-executor; chown qboxserver.qboxserver mesos-docker-executor
cp mesos-docker-executor.go mesos-docker-executor
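
The configuration steps above need three values that all come from nvidia-smi -L: the device total, the list of device indices, and the model. A minimal sketch of deriving them is given below. The function names are hypothetical, and taking the fourth whitespace-separated field as the model is an assumption that holds for output such as "GPU 0: Tesla P4 (UUID: ...)".

```shell
# Hypothetical helpers. gpu_count and gpu_model read `nvidia-smi -L` output
# on stdin; gpu_device_list takes the device total as its argument.

# Number of GPU devices: one "GPU N: ..." line per device.
gpu_count() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Model of the first device, assumed to be the 4th field ("P4", "K80").
gpu_model() {
    awk 'NR == 1 { print $4 }'
}

# Comma-separated indices 0..N-1 for N devices, e.g. "0,1,2,3".
gpu_device_list() {
    local n="$1" i=0 list=""
    while [ "$i" -lt "$n" ]; do
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Example:
#   out=$(nvidia-smi -L)
#   n=$(printf '%s\n' "$out" | gpu_count)
#   echo "NETWORK:BRIDGE;GPU_MODEL:$(printf '%s\n' "$out" | gpu_model)" > attributes
#   gpu_device_list "$n" > nvidia_gpu_devices
```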

Install nvidia-docker-plugin
cd /home/qboxserver && mkdir nvidia-docker
cd /home/qboxserver/nvidia-docker
wget http://ogo0b6qe6.bkt.clouddn.com/nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
tar zxf nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
ln -s 2016-11-22-20-45-30 current
./current/bin/start.sh
curl -s http://localhost:3476/v1.0/gpu/info  # show GPU device information
Start the mesos-agent service
Upgrading the GPU driver (trying an apt-get based install)

apt-get purge 'nvidia*'
add-apt-repository ppa:graphics-drivers
apt-get update
apt-get install nvidia-<version>
reboot

Install the matching cadvisor

cd /home/qboxserver/boots-cadvisor/current/bin && \
mv cadvisor cadvisor.bak && \
wget http://ogo0b6qe6.bkt.clouddn.com/cadvisor && \
chmod +x cadvisor && \
chown qboxserver:qboxserver cadvisor && \
./start.sh

Background reading:
http://www.linuxandubuntu.com/home/how-to-install-latest-nvidia-drivers-in-linux
http://mesos.apache.org/documentation/latest/gpu-support/
https://github.com/NVIDIA/nvidia-docker/wiki

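Seven new GPU compute nodes brought online in the xs region
Driver upgrade steps:
Some services hold the GPU and must be stopped before the upgrade:

  1. service lightdm stop (enabled on some machines, not on others)
    Also stop dockerd, nvidia-docker-plugin and boots-cadvisor.
  2. Unload the old kernel modules (dependent modules first)
    modprobe -r nvidia_drm nvidia_uvm nvidia
    If unloading fails, run lsof | grep nvidia to see which process is still using the device, kill it, and retry.
    When lsmod | grep nvidia prints nothing, the old driver is fully unloaded and installation can begin.
  3. wget http://us.download.nvidia.com/tesla/396.44/NVIDIA-Linux-x86_64-396.44.run
    sh NVIDIA-Linux-x86_64-396.44.run --silent

    After it finishes:
    Run nvidia-smi to check that the installation succeeded.
    Reboot the machine.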
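
The "is the old driver fully unloaded" check in step 2 can be scripted. A plain grep for nvidia also matches lines such as drm, whose "used by" column lists nvidia_drm, so the sketch below looks only at the module name column. The function name is hypothetical.

```shell
# Hypothetical helper: reads `lsmod` output on stdin and prints the names of
# loaded modules whose own name starts with "nvidia", one per line.
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}

# Example:
#   if [ -z "$(lsmod | loaded_nvidia_modules)" ]; then
#       echo "old driver unloaded, safe to install"
#   fi
```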

Upgrade example:
1. Check the current version

root@xsgpu9:~# nvidia-smi
 NVIDIA-SMI 375.26

2. Check which modules are in use

root@xsgpu9:~#  lsmod | grep -i nvidia
nvidia_drm             53248  0
nvidia_modeset        790528  1 nvidia_drm
nvidia              11943936  1 nvidia_modeset
drm_kms_helper        143360  2 ast,nvidia_drm
drm                   360448  5 ast,ttm,drm_kms_helper,nvidia_drm

3. Unload the related modules
modprobe -r nvidia_drm nvidia_modeset nvidia

4. Download the new version
root@xsgpu9:~# wget http://us.download.nvidia.com/tesla/396.44/NVIDIA-Linux-x86_64-396.44.run

5. Install the new version
sh NVIDIA-Linux-x86_64-396.44.run --silent

6. Check the new version

nvidia-smi
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
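
To confirm the upgrade on many machines, the version can be pulled out of the nvidia-smi header and compared with the expected value. This is a sketch only; smi_version is a hypothetical name, and it reads the number that follows "NVIDIA-SMI" as in the two outputs shown above.

```shell
# Hypothetical helper: reads `nvidia-smi` output on stdin and prints the
# version number that follows "NVIDIA-SMI" in the header line.
smi_version() {
    sed -n 's/.*NVIDIA-SMI \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Example:
#   [ "$(nvidia-smi | smi_version)" = "396.44" ] && echo "upgrade OK"
```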

On xs311 the NVIDIA driver was installed with apt-get; to remove it:
apt-get --purge remove 'nvidia-*'

Issue log: moving a node from dora internal compute to dora internal compute GPU
root@jjh1569:/var/log# cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
NETWORK:HOST
Changed to:
NETWORK:HOST;GPU_MODEL:QSV
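Then dockerd and mesos-agent were restarted,
and the mesos-agent service failed to start.

The failure was caused by a configuration mismatch: on startup mesos-agent tries to recover its previous state from work_dir, and recovery fails when the new configuration differs from the saved one.
Fix: delete the work directory, /disk1/mesos.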

root@jjh1569:/var/log# cd /home/qboxserver/mesos-agent/current/conf/mesos-agent/
root@jjh1569:/home/qboxserver/mesos-agent/current/conf/mesos-agent# cat work_dir
/disk1/mesos
Then run:
rm -rf /disk1/mesos
root@jjh1569:/var/log# less syslog
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627574  9662 slave.cpp:519] Agent resources: cpus(*):7; mem(*):12288; disk(*):445440; ports(*):[10000-20000]
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627622  9662 slave.cpp:527] Agent attributes: [ NETWORK=HOST, GPU_MODEL=QSV ]
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.627645  9662 slave.cpp:532] Agent hostname: 10.20.78.29
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: I1024 18:55:11.630751  9660 state.cpp:57] Recovering state from '/disk1/mesos/meta'
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: Failed to perform recovery: Incompatible agent info detected.

Oct 24 18:55:11 jjh1569 mesos-agent[9615]: Old agent info:
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   name: "NETWORK"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:     value: "HOST"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }

Oct 24 18:55:11 jjh1569 mesos-agent[9615]: New agent info:
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   name: "NETWORK"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:     value: "HOST"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: attributes {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   name: "GPU_MODEL"
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   type: TEXT
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   text {
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:     value: "QSV" # the extra attribute
Oct 24 18:55:11 jjh1569 mesos-agent[9615]:   }
Oct 24 18:55:11 jjh1569 mesos-agent[9615]: }
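
Because the fix is an rm -rf on whatever path the work_dir file contains, a guard against an empty or unexpected path is worth having. The sketch below is one way to do that; clean_work_dir is a hypothetical helper, not part of Mesos.

```shell
# Hypothetical helper: clean_work_dir WORK_DIR_FILE
# Reads the agent work directory path from WORK_DIR_FILE and removes that
# directory, refusing to act on an empty path, a relative path, or "/".
clean_work_dir() {
    local file="$1" dir
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    dir=$(head -n 1 "$file")
    case "$dir" in
        ""|"/") echo "refusing to remove '$dir'" >&2; return 1 ;;
        /*) ;;
        *) echo "refusing relative path '$dir'" >&2; return 1 ;;
    esac
    rm -rf -- "$dir"
}

# Example:
#   clean_work_dir /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir
```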

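Then update attributes and resources (QSV is a custom GPU type and gpus is the number of GPUs; adjust both to match the machine),
and restart dockerd and mesos-agent again (if startup fails, delete the work_dir /disk1/mesos and restart mesos-agent).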
#!/bin/bash
if grep -q QSV /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
then echo "QSV already exists"
else
sed -i "s/NETWORK:HOST/NETWORK:HOST;GPU_MODEL:QSV/g" /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
fi

# Write /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources (overwrites the file so it holds exactly one JSON array)
cat << EOF > /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources
[
{
"name": "cpus",
"type": "SCALAR",
"scalar": {
"value": 7
}
},
{
"name": "mem",
"type": "SCALAR",
"scalar": {
"value": 14336
}
},
{
"name" : "disk",
"type" : "SCALAR",
"scalar" : { "value" : 20480 }
},
{
"name": "ports",
"type": "RANGES",
"ranges": {
"range": [
{
"begin": 10000,
"end": 20000
}
]
}
},
{
"name": "gpus",
"type": "SCALAR",
"scalar": {
"value": 1
}
},
{
"name": "gpuset",
"type": "SET",
"set": {
"item": ["0"]
}
}
]
EOF

GPU plugin deployment script:

root@xs313:~# cat /tmp/gpu.sh
#!/bin/bash
# usage: deploys the GPU-related configuration on dora GPU machines

supervisorctl stop mesos-agent
supervisorctl stop boots-cadvisor
supervisorctl stop dockerd

# Install the custom cadvisor

cd /home/qboxserver/boots-cadvisor/current/bin
mv cadvisor cadvisor.bak
wget http://ogo0b6qe6.bkt.clouddn.com/cadvisor
chmod +x cadvisor
chown qboxserver:qboxserver cadvisor

# Install the custom mesos-docker-executor

cd /home/qboxserver/mesos-agent/current/libexec/mesos
wget http://ogo0b6qe6.bkt.clouddn.com/mesos-docker-executor-2018-09-10-15-05-00
mv mesos-docker-executor mesos-docker-executor.bak
mv mesos-docker-executor-2018-09-10-15-05-00 mesos-docker-executor
chown qboxserver:qboxserver mesos-docker-executor
chmod +x mesos-docker-executor

# mesos-agent parameters

# Part 1: update attributes

MODEL=$(nvidia-smi -L | cut -d" " -f4 | xargs | cut -d" " -f1)
sed -i "s/NETWORK:HOST/NETWORK:HOST;GPU_MODEL:${MODEL}/g" /home/qboxserver/mesos-agent/current/conf/mesos-agent/attributes
nvidia-smi -L

# Part 2: add isolation
echo "cgroups/devices,gpu/nvidia" > /home/qboxserver/mesos-agent/current/conf/mesos-agent/isolation

# Part 3: add nvidia_gpu_devices
echo "0,1,2,3,4,5,6,7" > /home/qboxserver/mesos-agent/current/conf/mesos-agent/nvidia_gpu_devices

# Part 4: add resources

for i in `seq 2`; do sed -i '$d' /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources ; done
cat << EOF >> /home/qboxserver/mesos-agent/current/conf/mesos-agent/resources
},
{
"name": "gpus",
"type": "SCALAR",
"scalar": {
"value": 8
}
},
{
"name": "gpuset",
"type": "SET",
"set": {
"item": ["0", "1", "2", "3", "4", "5", "6", "7"]
}
}
]
EOF

# Install nvidia-docker-plugin

cd /home/qboxserver && mkdir nvidia-docker && cd nvidia-docker
wget http://ogo0b6qe6.bkt.clouddn.com/nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
tar zxf nvidia-docker-plugin.2016-11-22-20-45-30.tar.gz
ln -s 2016-11-22-20-45-30 current
./current/bin/start.sh

# Finally, bring the node online

rm -rf $(cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir)
supervisorctl start dockerd
supervisorctl start mesos-agent
supervisorctl start boots-cadvisor
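
The script above hardcodes eight GPUs in both nvidia_gpu_devices and the resources fragment. A sketch of generating the gpus and gpuset entries from a device count is shown below, so the same script could serve machines with a different number of cards. gpu_resources_json is a hypothetical helper; the JSON layout copies the fragment used in the script.

```shell
# Hypothetical helper: gpu_resources_json N
# Prints the "gpus" and "gpuset" resource entries for N GPU devices.
gpu_resources_json() {
    local n="$1" i=0 items=""
    # Build the quoted index list: "0", "1", ..., "N-1"
    while [ "$i" -lt "$n" ]; do
        items="${items:+$items, }\"$i\""
        i=$((i + 1))
    done
    cat << EOF
{
"name": "gpus",
"type": "SCALAR",
"scalar": {
"value": $n
}
},
{
"name": "gpuset",
"type": "SET",
"set": {
"item": [$items]
}
}
EOF
}

# Example:
#   gpu_resources_json "$(nvidia-smi -L | grep -c '^GPU ')"
```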

Checking the NVIDIA GPUs
dora currently uses two GPU types, K80 and P4. To check which one a machine has:

nvidia-smi -L
root@xs991:~#  nvidia-smi -L
GPU 0: Tesla P4 (UUID: GPU-50850be7-c49e-4693-e20e-a677d2adeb82)
GPU 1: Tesla P4 (UUID: GPU-22e9fbe2-9170-4548-c301-579b786858b6)
GPU 2: Tesla P4 (UUID: GPU-c8132e0e-c8a4-defc-fea3-01b5c930667e)
GPU 3: Tesla P4 (UUID: GPU-762546f1-0b48-c963-954e-fa74b4f7e76f)
GPU 4: Tesla P4 (UUID: GPU-2fdb3d5e-dd66-1f6d-a814-5265df4fa1f4)
GPU 5: Tesla P4 (UUID: GPU-a4011f72-78c2-ab13-c6b8-3e58e9093773)
GPU 6: Tesla P4 (UUID: GPU-84d2bbd4-c3e0-d7ed-6628-5528878de6ea)
GPU 7: Tesla P4 (UUID: GPU-fa3933c0-3cb3-4e8c-a84a-75342a15cc24)

root@xs313:~# nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-a457c419-bcfd-538b-d993-e443d28dcd24)
GPU 1: Tesla K80 (UUID: GPU-07f9795d-3917-b804-a6c5-621e27c239f8)
GPU 2: Tesla K80 (UUID: GPU-78197899-b007-1e74-29a8-3f27958e7d28)
GPU 3: Tesla K80 (UUID: GPU-d594f478-261b-e139-b87f-cf1d7b076f42)
GPU 4: Tesla K80 (UUID: GPU-8df7cf81-e51a-3a88-a4b8-6075d18a9365)
GPU 5: Tesla K80 (UUID: GPU-c9931f33-32c0-da73-aa8f-6109989b129c)
GPU 6: Tesla K80 (UUID: GPU-0830ceaa-f860-b717-67ac-e4e7fec25a26)
GPU 7: Tesla K80 (UUID: GPU-9b509b1c-a186-cf05-8aa3-4ba73aed1eb1)

Graphics devices are of two kinds: NVIDIA cards and integrated (Intel) graphics.

root@xsgpu81:~# lspci | grep -i nvidia
04:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
05:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
08:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
09:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
84:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
85:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
88:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
89:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
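
Counting the NVIDIA entries in lspci gives the number of cards physically present, which can be compared with the number reported by nvidia-smi -L. A lower count from nvidia-smi suggests the driver has not picked up every card. The sketch below is an illustration; the function name and the restriction to "3D controller" and "VGA compatible controller" entries are assumptions.

```shell
# Hypothetical helper: reads `lspci` output on stdin and prints the number of
# NVIDIA display devices (3D or VGA controllers).
nvidia_pci_count() {
    grep -i 'nvidia' | grep -c -E '3D controller|VGA compatible controller'
}

# Example:
#   pci=$(lspci | nvidia_pci_count)
#   smi=$(nvidia-smi -L | grep -c '^GPU ')
#   [ "$pci" = "$smi" ] || echo "driver sees $smi of $pci GPUs"
```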

qboxserver@jjh1569:~$ lspci | grep -i vga
00:13.0 Non-VGA unclassified device: Intel Corporation Sunrise Point-H Integrated Sensor Hub (rev 31)
07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)