https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/
Deep Learning is very computationally intensive, so you will need a fast CPU with many cores, right? Or is it maybe wasteful to buy a fast CPU? One of the worst things you can do when building a deep learning system is to waste money on hardware that is unnecessary. Here I will guide you step by step through the hardware you will need for a cheap high-performance system.
Over the years, I have built a total of 7 different deep learning workstations and, despite careful research and reasoning, I made my fair share of mistakes in selecting hardware parts. In this guide, I want to share the experience I have gained over the years so that you do not make the same mistakes that I did.
The blog post is ordered by mistake severity. This means the mistakes where people usually waste the most money come first.
This blog post assumes that you will use a GPU for deep learning. If you are building or upgrading your system for deep learning, it is not sensible to leave out the GPU. The GPU is just the heart of deep learning applications – the improvement in processing speed is just too huge to ignore.
I talked at length about GPU choice in my GPU recommendations blog post, and the choice of your GPU is probably the most critical choice for your deep learning system. There are three main mistakes that you can make when choosing a GPU: (1) bad cost/performance, (2) not enough memory, (3) poor cooling.
For good cost/performance, I generally recommend an RTX 2070 or an RTX 2080 Ti. If you use these cards you should use 16-bit models. Otherwise, GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are fair choices and you can use these GPUs with 32-bit (but not 16-bit).
Be careful about the memory requirements when you pick your GPU. RTX cards, which can run in 16-bit, can train models which are twice as big with the same memory compared to GTX cards. As such, RTX cards have a memory advantage, and picking RTX cards and learning how to use 16-bit models effectively will carry you a long way. In general, the requirements for memory are roughly the following:
Another problem to watch out for, especially if you buy multiple RTX cards, is cooling. If you want to stick GPUs into PCIe slots which are next to each other, you should make sure that you get GPUs with a blower-style fan. Otherwise you might run into temperature issues, and your GPUs will be slower (about 30%) and die faster.
Suspect line-up
Can you identify the hardware part which is at fault for bad performance? One of these GPUs? Or maybe it is the fault of the CPU after all?
The main mistake with RAM is to buy RAM with too high a clock rate. The second mistake is to buy too little RAM to have a smooth prototyping experience.
RAM clock rates are marketing stunts where RAM companies lure you into buying “faster” RAM which actually yields little to no performance gains. This is best explained by the “Does RAM speed REALLY matter?” video on RAM from Linus Tech Tips.
Furthermore, it is important to know that RAM speed is pretty much irrelevant for fast CPU RAM->GPU RAM transfers. This is so because (1) if you use pinned memory, your mini-batches will be transferred to the GPU without involvement from the CPU, and (2) if you do not use pinned memory, the performance gain of fast vs. slow RAM is about 0-3% — spend your money elsewhere!
RAM size does not affect deep learning performance. However, it might hinder you from executing your GPU code comfortably (without swapping to disk). You should have enough RAM to comfortably work with your GPU. This means you should have at least the amount of RAM that matches your biggest GPU. For example, if you have a Titan RTX with 24 GB of memory you should have at least 24 GB of RAM. However, if you have more GPUs you do not necessarily need more RAM.
The problem with this “match largest GPU memory in RAM” strategy is that you might still fall short of RAM if you are processing large datasets. The best strategy here is to match your GPU and if you feel that you do not have enough RAM just buy some more.
A different strategy is influenced by psychology: Psychology tells us that concentration is a resource that is depleted over time. RAM is one of the few hardware pieces that allows you to conserve your concentration resource for more difficult programming problems. Rather than spending lots of time on circumnavigating RAM bottlenecks, you can invest your concentration on more pressing matters if you have more RAM. With a lot of RAM you can avoid those bottlenecks, save time and increase productivity on more pressing problems. Especially in Kaggle competitions, I found additional RAM very useful for feature engineering. So if you have the money and do a lot of pre-processing then additional RAM might be a good choice. So with this strategy, you want to have more, cheap RAM now rather than later.
The main mistake that people make is to pay too much attention to the PCIe lanes of a CPU. You should not care much about PCIe lanes. Instead, just look up if your CPU and motherboard combination supports the number of GPUs that you want to run. The second most common mistake is to get a CPU which is too powerful.
People go crazy about PCIe lanes! However, the thing is that they have almost no effect on deep learning performance. If you have a single GPU, PCIe lanes are only needed to transfer data from your CPU RAM to your GPU RAM quickly. An ImageNet batch of 32 images (32x225x225x3) in 32-bit needs 1.1 milliseconds with 16 lanes, 2.3 milliseconds with 8 lanes, and 4.5 milliseconds with 4 lanes. These are theoretical numbers, and in practice you often see PCIe be twice as slow — but this is still lightning fast! PCIe lanes often have a latency in the nanosecond range and thus latency can be ignored.
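As a quick sanity check on these numbers, here is a minimal back-of-the-envelope sketch. It assumes roughly 985 MB/s of usable bandwidth per PCIe 3.0 lane; real systems vary, so treat the output as an estimate rather than a measurement.

```python
# Back-of-the-envelope PCIe transfer time for one ImageNet mini-batch.
# Assumption: ~985 MB/s of usable bandwidth per PCIe 3.0 lane (real systems vary).
BYTES_PER_FLOAT32 = 4
LANE_BANDWIDTH_MB_S = 985

def transfer_time_ms(batch_shape, lanes):
    """Estimate the host-to-GPU copy time in milliseconds for a float32 batch."""
    num_elements = 1
    for dim in batch_shape:
        num_elements *= dim
    size_mb = num_elements * BYTES_PER_FLOAT32 / 1e6
    return size_mb / (lanes * LANE_BANDWIDTH_MB_S) * 1000

batch = (32, 225, 225, 3)  # 32 ImageNet images, as in the text
for lanes in (16, 8, 4):
    print(f"x{lanes}: {transfer_time_ms(batch, lanes):.1f} ms")
# Prints roughly 1.2, 2.5, and 4.9 ms, the same ballpark as the
# 1.1, 2.3, and 4.5 ms quoted above.
```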
Putting this together we have for an ImageNet mini-batch of 32 images and a ResNet-152 the following timing:
Thus going from 4 to 16 PCIe lanes will give you a performance increase of roughly 3.2%. However, if you use PyTorch’s data loader with pinned memory you gain exactly 0% performance. So do not waste your money on PCIe lanes if you are using a single GPU!
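For reference, here is a minimal sketch of what "pinned memory" looks like in PyTorch. The relevant pieces are pin_memory=True on the DataLoader and non_blocking=True on the device copy, which together let the host-to-GPU transfer overlap with GPU compute. The random stand-in dataset and the tiny model are placeholders for illustration only.

```python
# Minimal sketch: hiding host-to-GPU copies behind compute with pinned memory.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 256 random "images" of shape 3x225x225 with random labels.
data = TensorDataset(torch.randn(256, 3, 225, 225), torch.randint(0, 10, (256,)))

loader = DataLoader(
    data,
    batch_size=32,
    num_workers=2,     # CPU worker processes that prepare batches in the background
    pin_memory=True,   # batches land in page-locked (pinned) host RAM
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 225 * 225, 10)).to(device)

for images, labels in loader:
    # With pinned memory, non_blocking=True issues an asynchronous copy that can
    # overlap with the GPU work from the previous iteration.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
```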
When you select CPU PCIe lanes and motherboard PCIe lanes make sure that you select a combination which supports the desired number of GPUs. If you buy a motherboard that supports 2 GPUs, and you want to have 2 GPUs eventually, make sure that you buy a CPU that supports 2 GPUs, but do not necessarily look at PCIe lanes.
Are PCIe lanes important if you train networks on multiple GPUs with data parallelism? I have published a paper on this at ICLR 2016, and I can tell you that if you have 96 GPUs then PCIe lanes are really important. However, if you have 4 or fewer GPUs this does not matter much. If you parallelize across 2-3 GPUs, I would not care at all about PCIe lanes. With 4 GPUs, I would make sure that I can get 8 PCIe lanes per GPU (32 PCIe lanes in total). Since almost nobody runs a system with more than 4 GPUs, here is a rule of thumb: do not spend extra money to get more PCIe lanes per GPU — it does not matter!
To be able to make a wise choice for the CPU we first need to understand the CPU and how it relates to deep learning. What does the CPU do for deep learning? The CPU does little computation when you run your deep nets on a GPU. Mostly it (1) initiates GPU function calls, (2) executes CPU functions.
By far the most useful application for your CPU is data preprocessing. There are two different common data processing strategies which have different CPU needs.
The first strategy is preprocessing while you train:
Loop:
1. Load mini-batch
2. Preprocess mini-batch
3. Train on mini-batch
The second strategy is preprocessing before any training:
1. Preprocess data
2. Loop: load preprocessed mini-batch, train on mini-batch
For the first strategy, a good CPU with many cores can boost performance significantly. For the second strategy, you do not need a very good CPU. For the first strategy, I recommend a minimum of 4 threads per GPU — that is usually two cores per GPU. I have not done hard tests for this, but you should gain about 0-5% additional performance per additional core/GPU.
For the second strategy, I recommend a minimum of 2 threads per GPU — that is usually one core per GPU. You will not see significant gains in performance when you have more cores if you are using the second strategy.
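To make the rule of thumb above explicit, here is a tiny helper that just encodes those numbers; the function name and the 4-GPU example are illustrative additions, not part of the original guide.

```python
def recommended_cpu_cores(num_gpus: int, preprocess_while_training: bool) -> int:
    """Rule of thumb from the text: ~2 physical cores (4 threads) per GPU when
    preprocessing on the fly, ~1 core (2 threads) per GPU when the data was
    fully preprocessed before training."""
    cores_per_gpu = 2 if preprocess_while_training else 1
    return num_gpus * cores_per_gpu

# Example: a 4-GPU box that preprocesses during training -> 8 cores / 16 threads.
print(recommended_cpu_cores(4, preprocess_while_training=True))
```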
When people think about fast CPUs they usually first think about the clock rate. 4GHz is better than 3.5GHz, or is it? This is generally true for comparing processors with the same architecture, e.g. “Ivy Bridge”, but it does not compare well between processors of different architectures. Also, it is not always the best measure of performance.
In the case of deep learning there is very little computation to be done by the CPU: increment a few variables here, evaluate some Boolean expression there, make some function calls on the GPU or within the program – all of these depend on the CPU core clock rate.
While this reasoning seems sensible, there is the fact that the CPU has 100% usage when I run deep learning programs, so what is the issue here? I did some CPU core rate underclocking experiments to find out.
CPU underclocking on MNIST and ImageNet: Performance is measured as the time taken for 200 epochs on MNIST or a quarter epoch on ImageNet with different CPU core clock rates, where the maximum clock rate is taken as the baseline for each CPU. For comparison: upgrading from a GTX 680 to a GTX Titan is about +15% performance; from GTX Titan to GTX 980 another +20% performance; GPU overclocking yields about +5% performance for any GPU.
Note that these experiments were run on dated hardware; however, the results should still hold for modern CPUs/GPUs.
The hard drive is not usually a bottleneck for deep learning. However, if you do stupid things it will hurt you: if you read your data from disk when it is needed (blocking wait), then a 100 MB/s hard drive will cost you about 185 milliseconds for an ImageNet mini-batch of size 32 — ouch! However, if you asynchronously fetch the data before it is used (for example, with torchvision loaders), then you will have loaded the mini-batch in those 185 milliseconds while the compute time for most deep neural networks on ImageNet is about 200 milliseconds. Thus you will not face any performance penalty, since you load the next mini-batch while the current one is still being computed.
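A short, hedged recomputation of these numbers is given below; it assumes float32 images of 225x225x3, as in the PCIe example above, so it lands near, but not exactly on, the 185 ms figure.

```python
# Blocking disk reads vs. asynchronous prefetching, using the article's rough numbers.
batch_bytes = 32 * 225 * 225 * 3 * 4   # ~19.4 MB per float32 mini-batch (assumed format)
hdd_mb_s = 100                          # a slow ~100 MB/s hard drive
compute_ms = 200                        # rough ImageNet forward+backward time per batch

load_ms = batch_bytes / (hdd_mb_s * 1e6) * 1000
print(f"load: {load_ms:.0f} ms, compute: {compute_ms} ms")
# Blocking read: every training step pays ~190 ms of disk latency on top of compute.
# Asynchronous prefetch: the load hides under the ~200 ms of compute, so the
# hard drive stops being a bottleneck.
```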
However, I recommend an SSD for comfort and productivity: Programs start and respond more quickly, and pre-processing with large files is quite a bit faster. If you buy an NVMe SSD you will have an even smoother experience when compared to a regular SSD.
Thus the ideal setup is to have a large and slow hard drive for datasets and an SSD for productivity and comfort.
Generally, you want a PSU that is sufficient to accommodate all your future GPUs. GPUs typically get more energy efficient over time; so while other components will need to be replaced, a PSU should last a long while so a good PSU is a good investment.
You can calculate the required watts by adding up the wattage of your CPU and GPUs with an additional 10% of watts for other components and as a buffer for power spikes. For example, if you have 4 GPUs with 250 watts TDP each and a CPU with 150 watts TDP, then you will need a PSU with a minimum of 4×250 + 150 + 100 = 1250 watts. I would usually add another 10% just to be sure everything works out, which in this case would result in a total of 1375 watts. I would round up in this case and get a 1400 watt PSU.
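The same arithmetic as a small helper, in case you want to plug in your own parts; the TDP values are the example numbers from the text, not recommendations.

```python
def required_psu_watts(gpu_tdps, cpu_tdp, other_components=100, safety_margin=0.10):
    """PSU sizing as described above: GPU + CPU TDPs, a buffer for other components
    and power spikes (the example uses ~100 W), then another ~10% to be safe."""
    base = sum(gpu_tdps) + cpu_tdp + other_components
    return base * (1 + safety_margin)

# Example from the text: 4 x 250 W GPUs and a 150 W CPU.
print(round(required_psu_watts([250] * 4, 150)))  # 1375 -> round up to a ~1400 W PSU
```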
One important part to be aware of is that even if a PSU has the required wattage, it might not have enough PCIe 8-pin or 6-pin connectors. Make sure you have enough connectors on the PSU to support all your GPUs!
Another important thing is to buy a PSU with a high power efficiency rating – especially if you run many GPUs and will run them for a long time.
Running a 4 GPU system on full power (1000-1500 watts) to train a convolutional net for two weeks will amount to 300-500 kWh, which in Germany – with rather high power costs of 20 cents per kWh – will amount to 60-100€ ($66-111). If this price assumes 100% efficiency, then training such a net with an 80% efficient power supply would increase the costs by an additional 18-26€ – ouch! This is much less for a single GPU, but the point still holds – spending a bit more money on an efficient power supply makes good sense.
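To make the efficiency argument concrete, here is the same estimate as a short calculation; the wattage, duration, and electricity price are the example values from the paragraph above.

```python
# Electricity cost of a two-week training run, with and without an efficient PSU.
draw_watts = 1250        # 4-GPU system under load (example value from the text)
hours = 14 * 24          # two weeks
price_per_kwh = 0.20     # EUR per kWh, the German example price

def cost_eur(psu_efficiency):
    # A less efficient PSU pulls more power from the wall for the same load.
    wall_kwh = draw_watts / psu_efficiency * hours / 1000
    return wall_kwh * price_per_kwh

print(f"100% efficient PSU: {cost_eur(1.00):.0f} EUR")  # ~84 EUR
print(f" 80% efficient PSU: {cost_eur(0.80):.0f} EUR")  # ~105 EUR, about 21 EUR extra
```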
Using a couple of GPUs around the clock will significantly increase your carbon footprint and it will overshadow transportation (mainly airplanes) and other factors that contribute to your footprint. If you want to be responsible, please consider going carbon neutral like the NYU Machine Learning for Language Group (ML2) — it is easy to do, cheap, and should be standard for deep learning researchers.
Cooling is important and it can be a significant bottleneck which reduces performance more than poor hardware choices do. You should be fine with a standard heat sink or an all-in-one (AIO) water cooling solution for your CPU, but for your GPU you will need to make special considerations.
Air cooling is safe and solid for a single GPU or if you have multiple GPUs with space between them (2 GPUs in a 3-4 GPU case). However, one of the biggest mistakes can be made when you try to cool 3-4 GPUs and you need to think carefully about your options in this case.
Modern GPUs will increase their speed – and thus power consumption – up to their maximum when they run an algorithm, but as soon as the GPU hits a temperature barrier – often 80 °C – the GPU will decrease the speed so that the temperature threshold is not breached. This enables the best performance while keeping your GPU safe from overheating.
However, typical pre-programmed schedules for fan speeds are badly designed for deep learning programs, so that this temperature threshold is reached within seconds after starting a deep learning program. The result is decreased performance (0-10%), which can be significant for multiple GPUs (10-25%) where the GPUs heat each other up.
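If you want to check whether your own GPUs are hitting this threshold, a small monitoring sketch follows. It assumes the pynvml package (NVIDIA's NVML Python bindings) is installed; nvidia-smi shows the same information on the command line.

```python
# Sketch: watch GPU temperature and SM clock to spot thermal throttling.
# Assumes the pynvml package (NVIDIA's NVML bindings) is installed.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(10):  # sample ten times, once per second
    for i, handle in enumerate(handles):
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        # A temperature pinned around ~80 C together with a dropping SM clock
        # is a sign that the GPU is throttling itself.
        print(f"GPU {i}: {temp} C, SM clock {clock} MHz")
    time.sleep(1)

pynvml.nvmlShutdown()
```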
Since NVIDIA GPUs are first and foremost gaming GPUs, they are optimized for Windows. You can change the fan schedule with a few clicks in Windows, but not so in Linux, and as most deep learning libraries are written for Linux this is a problem.
The only option under Linux is to set a configuration for your Xorg server (Ubuntu) where you set the “coolbits” option. This works very well for a single GPU, but if you have multiple GPUs where some of them are headless, i.e. they have no monitor attached to them, you have to emulate a monitor, which is hard and hacky. I tried it for a long time and had frustrating hours with a live boot CD to recover my graphics settings – I could never get it running properly on headless GPUs.
The most important point of consideration if you run 3-4 GPUs on air cooling is to pay attention to the fan design. The “blower” fan design pushes the air out of the back of the case so that fresh, cooler air is pushed into the GPU. Non-blower fans suck in air in the vicinity of the GPU and cool the GPU. However, if you have multiple GPUs next to each other then there is no cool air around, and GPUs with non-blower fans will heat up more and more until they throttle themselves down to reach cooler temperatures. Avoid non-blower fans in 3-4 GPU setups at all costs.
Another, more costly and craftier option is to use water cooling. I do not recommend water cooling if you have a single GPU or if you have space between your two GPUs (2 GPUs in a 3-4 GPU board). However, water cooling makes sure that even the beefiest GPUs stay cool in a 4 GPU setup, which is not possible when you cool with air. Another advantage of water cooling is that it operates much more silently, which is a big plus if you run multiple GPUs in an area where other people work. Water cooling will cost you about $100 for each GPU and some additional upfront costs (something like $50). Water cooling will also require some additional effort to assemble your computer, but there are many detailed guides on that and it should only require a few more hours of time in total. Maintenance should not be that complicated or effortful.
I bought large towers for my deep learning cluster because they have additional fans for the GPU area, but I found this to be largely irrelevant: about a 2-5 °C decrease, not worth the investment and the bulkiness of the cases. The most important part is really the cooling solution directly on your GPU — do not select an expensive case for its GPU cooling capability. Go cheap here. The case should fit your GPUs, but that's it!
So in the end it is simple: for 1 GPU air cooling is best. For multiple GPUs, you should get blower-style air cooling and accept a tiny performance penalty (10-15%), or you pay extra for water cooling, which is also more difficult to set up correctly but has no performance penalty. Air and water cooling are both reasonable choices in certain situations. I would, however, recommend air cooling for simplicity in general — get a blower-style GPU if you run multiple GPUs. If you want to use water cooling, try to find all-in-one (AIO) water cooling solutions for GPUs.
Your motherboard should have enough PCIe ports to support the number of GPUs you want to run (usually limited to four GPUs, even if you have more PCIe slots); remember that most GPUs have a width of two PCIe slots, so buy a motherboard that has enough space between PCIe slots if you intend to use multiple GPUs. Make sure your motherboard not only has the PCIe slots, but actually supports the GPU setup that you want to run. You can usually find this information if you search for your motherboard of choice on Newegg and look at the PCIe section on the specification page.
When you select a case, you should make sure that it supports full-length GPUs that sit on top of your motherboard. Most cases support full-length GPUs, but you should be suspicious if you buy a small case. Check its dimensions and specifications; you can also try a Google image search of that model and see if you find pictures with GPUs in them.
If you use custom water cooling, make sure your case has enough space for the radiators. This is especially true if you use water cooling for your GPUs. The radiator of each GPU will need some space — make sure your setup actually fits into your case.
I first thought it would be silly to write about monitors also, but they make such a huge difference and are so important that I just have to write about them.
The money I spent on my three 27-inch monitors is probably the best money I have ever spent. Productivity goes up by a lot when using multiple monitors. I feel desperately crippled if I have to work with a single monitor. Do not short-change yourself on this matter. What good is a fast deep learning system if you are not able to operate it in an efficient manner?
Typical monitor layout when I do deep learning: left: papers, Google searches, gmail, stackoverflow; middle: code; right: output windows, R, folders, system monitors, GPU monitors, to-do list, and other small applications.
Many people are scared to build computers. The hardware components are expensive and you do not want to do something wrong. But it is really simple, as components that do not belong together do not fit together. The motherboard manual is often very specific about how to assemble everything, and there are tons of guides and step-by-step videos which guide you through the process if you have no experience.
The great thing about building a computer is that once you have done it, you know everything there is to know about building a computer, because all computers are built in the very same way – so building a computer will become a life skill that you will be able to apply again and again. So there is no reason to hold back!
GPU: RTX 2070 or RTX 2080 Ti. GTX 1070, GTX 1080, GTX 1070 Ti, and GTX 1080 Ti from eBay are good too!
CPU: 1-2 cores per GPU depending on how you preprocess data. > 2GHz; the CPU should support the number of GPUs that you want to run. PCIe lanes do not matter.
RAM:
– Clock rates do not matter — buy the cheapest RAM.
– Buy at least enough CPU RAM to match the RAM of your largest GPU.
– Buy more RAM only when needed.
– More RAM can be useful if you frequently work with large datasets.
Hard drive/SSD:
– Hard drive for data (>= 3TB)
– Use SSD for comfort and preprocessing small datasets.
PSU:
– Add up the watts of GPUs + CPU. Then multiply the total by 110% for the required wattage.
– Get a high efficiency rating if you use multiple GPUs.
– Make sure the PSU has enough PCIe connectors (6+8 pin)
Cooling:
– CPU: get standard CPU cooler or all-in-one (AIO) water cooling solution
– GPU:
– Use air cooling
– Get GPUs with “blower-style” fans if you buy multiple GPUs
– Set coolbits flag in your Xorg config to control fan speeds
Motherboard:
– Get as many PCIe slots as you need for your (future) GPUs (one GPU takes two slots; max 4 GPUs per system)
Monitors:
– An additional monitor might make you more productive than an additional GPU.