Introduction to New AI Chips (3): Tenstorrent - Zhihu

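Today we look at Tenstorrent's chip. Tenstorrent is a fairly new startup without any proper paper, but we can piece together the details of the chip from information gathered in various places.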

https://www.tenstorrent.com/wp-content/uploads/2020/04/Tenstorrent-Scales-AI-Performance.pdf

https://www.youtube.com/watch?v=ME-6uxSoVm0

Tenstorrent's main products are built on these architectures:

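The biggest difference between Tenstorrent and other architectures is the number of MAC cores. Tenstorrent has a full 120 cores, each much smaller than those of the TPU, Hanguang, or Groq chips we looked at before. The architecture looks roughly like this: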

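The purple CPU in this picture is not the kind of CPU in our computers but a very small RISC core. Small cores have one big advantage: conditional computation. The chip's TDP is also lower than that of other players.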

At a peak rate of 368 TOPS, the chip runs on just 65W

Each small core holds roughly a thousand int8 MACs in total (for example 32*32), though fp16 and bf16 are supported as well.

Tenstorrent withheld further details, but to achieve the 3-TOPS rating, the tensor engine likely contains about a thousand 8-bit MAC units
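The quoted figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers given above (368 TOPS, 65 W, 120 cores, about a thousand MACs per core) and assumes the usual convention that one MAC counts as two operations. The clock frequency it prints is derived from those numbers, not something the source states.

```python
# Figures quoted in the text.
PEAK_TOPS = 368            # chip-level peak
TDP_WATTS = 65
NUM_CORES = 120
MACS_PER_CORE = 32 * 32    # "about a thousand" MACs, taken here as a 32*32 array

# Assumption: each MAC counts as two operations (one multiply, one add).
OPS_PER_MAC = 2

tops_per_core = PEAK_TOPS / NUM_CORES
tops_per_watt = PEAK_TOPS / TDP_WATTS

# Clock implied by the per-core rating; derived, not stated in the source.
implied_clock_ghz = tops_per_core * 1e12 / (MACS_PER_CORE * OPS_PER_MAC) / 1e9

print(f"{tops_per_core:.2f} TOPS per core")
print(f"{tops_per_watt:.2f} TOPS per watt")
print(f"{implied_clock_ghz:.2f} GHz implied clock")
```

The per-core result lands at roughly 3 TOPS, which matches the 3-TOPS rating in the quote, so the "thousand MACs per core" estimate is at least self-consistent.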

They also support something similar to row-wise quantization, where a group of numbers shares one scale.

To save memory space, the design implements a block FP format in which groups of 16 values share the same 8-bit exponent. Tensix defines block FP formats with 8-, 4-, or 2- bit mantissas, trading off throughput for precision. Once the core loads values from memory, it expands them to FP16 before any computation.
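To make the block FP idea concrete, here is a minimal sketch of groups of 16 values sharing one exponent with 8-, 4-, or 2-bit mantissas. The function names, the rounding and clamping rules, and the choice to keep the sign outside the mantissa width are all assumptions for illustration; Tenstorrent has not published the exact encoding.

```python
import math

GROUP_SIZE = 16  # from the quoted description: 16 values share one exponent


def encode_group(values, mantissa_bits):
    """Quantize one group to (shared_exponent, signed integer mantissas)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # frexp returns (m, e) with max_abs == m * 2**e and 0.5 <= m < 1,
    # so every value in the group has magnitude below 2**e.
    _, shared_exp = math.frexp(max_abs)
    scale = 2.0 ** (shared_exp - mantissa_bits)
    limit = 2 ** mantissa_bits - 1  # assumption: sign stored separately
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return shared_exp, mantissas


def decode_group(shared_exp, mantissas, mantissa_bits):
    """Expand the group back to floats, as the core does before computing."""
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]


def max_error(values, mantissa_bits):
    exp, mant = encode_group(values, mantissa_bits)
    decoded = decode_group(exp, mant, mantissa_bits)
    return max(abs(a - b) for a, b in zip(values, decoded))


group = [0.75 * math.sin(i) for i in range(GROUP_SIZE)]
errors = {bits: max_error(group, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mantissa: max abs error {err:.4f}")
```

The error grows as the mantissa shrinks, which is the precision side of the throughput-for-precision trade-off the quote describes.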

Once we have these working small cores, we can chain them together.

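Four Synopsys ARC CPUs coordinate the work of the 120 small cores, with 16 GB of DRAM in total and a x16 PCIe interface. Note that the memory is LPDDR, so it certainly cannot match the HBM on the TPU or Volta. There is also a dedicated module for communication between the small cores: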

Although the compute unit can operate only on the local memory, each core can easily access data in other cores using the network-on-a-chip (NoC) interconnect.

The rough logic of the NoC is shown in the figure below.

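So this NoC is similar to the TPU's ICI, but as far as I can tell it focuses on communication inside the chip and should be much smaller. My guess is that it has none of the particularly complex routing logic of the TPU and cannot handle communication between different chips. The NoC also handles compression; judging from the description, it is a varint-like scheme.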

The packet engine implements hardware data compression. It compresses data before transferring it across the NoC. Depending on the number of zeroes in the data, this compression typically shrinks the data by 50–75%, but the percentage can be even greater on sparse data.
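The actual compression format is not disclosed, so the following is only an illustration of how zero-heavy data can shrink by the amounts quoted. It uses a simple bitmask-plus-payload scheme (one bit per byte marking nonzero positions, followed by the nonzero bytes), which is my own stand-in rather than the varint-style encoding guessed at above or whatever the packet engine really does.

```python
def compress(data: bytes) -> bytes:
    """Bitmask of nonzero positions followed by the nonzero bytes."""
    mask = bytearray((len(data) + 7) // 8)
    payload = bytearray()
    for i, b in enumerate(data):
        if b != 0:
            mask[i // 8] |= 1 << (i % 8)
            payload.append(b)
    return bytes(mask) + bytes(payload)


def decompress(blob: bytes, length: int) -> bytes:
    """Inverse of compress; the original length travels out of band."""
    mask_len = (length + 7) // 8
    mask, payload = blob[:mask_len], iter(blob[mask_len:])
    out = bytearray(length)
    for i in range(length):
        if mask[i // 8] & (1 << (i % 8)):
            out[i] = next(payload)
    return bytes(out)


# 64 int8 values, three quarters of them zero (hypothetical sparse activations).
packet = bytes([0, 0, 0, 17] * 16)
blob = compress(packet)
assert decompress(blob, len(packet)) == packet
print(len(packet), "->", len(blob), "bytes")
```

With three quarters of the bytes zero, the 64-byte packet becomes 24 bytes (8 bytes of mask plus 16 nonzero bytes), a reduction of 62.5%, which falls inside the quoted 50–75% range.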

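The topology is a 2D torus, which I introduced earlier in the TPU post, except that here the 2D torus connects small cores, whereas on the TPU it connects cards.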
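For readers who have not seen a 2D torus before, the sketch below shows the wraparound neighbour rule and the resulting hop counts. The 10 x 12 layout is a hypothetical arrangement chosen only because it multiplies out to 120 cores; the source does not give the grid dimensions.

```python
ROWS, COLS = 10, 12  # hypothetical layout; 10 * 12 = 120 cores


def neighbors(r, c):
    """Four neighbours of core (r, c); the edges wrap around."""
    return [((r - 1) % ROWS, c), ((r + 1) % ROWS, c),
            (r, (c - 1) % COLS), (r, (c + 1) % COLS)]


def hops(a, b):
    """Minimum hop count between two cores on the torus."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, ROWS - dr) + min(dc, COLS - dc)


corner = (0, 0)
far_corner = (ROWS - 1, COLS - 1)
print(neighbors(*corner))
print("torus hops:", hops(corner, far_corner))
print("mesh hops: ", (ROWS - 1) + (COLS - 1))
```

Thanks to the wraparound links, opposite corners are only 2 hops apart, against 20 on a plain mesh of the same size, and the worst case anywhere on this grid is 5 + 6 = 11 hops.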

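On the parallelism side, the logic Tenstorrent needs is much more complex than for other chips. The upper bound on parallelism in a many-core design like this is certainly higher than with a few big cores, but it definitely needs more sophisticated support from the compiler. The video gives an example: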

On the software side, either ONNX or PyTorch can serve as the front end.

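Here you can see what I meant about needing sophisticated compiler support. The slides look very polished for now, but I suspect the compiler code will be extremely complex, needing all sorts of different execution plans to optimize different models.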

How Tenstorrent compares with Groq and Titan:

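IPS/Watt looks roughly like this. Hanguang is still the highest, but Tenstorrent does quite well by comparison. Also, as I recall, Hanguang has special optimizations for convolution and is not purely a systolic array.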

As for when it will go on sale, the article says around 2021.

Personal thoughts

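The many-core architecture sounds great the way Tenstorrent tells it, but the earlier TPU paper already discussed the trade-offs between big and small cores. If you are not familiar with it, see my earlier summary: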

陈宇飞: Introduction to New AI Chips (2): TPUv2/v3 (zhuanlan.zhihu.com)

The TPUv3 paper mentions:

Sixteen 64x64 MXUs would have a little higher utilization (38%–52%) but would need more area. The reason is the MXU area is determined either by the logic for the multipliers or by the wires on its perimeter for the inputs, outputs, and control. In our technology, for 128x128 and larger the MXU's area is limited by the multipliers but area for 64x64 and smaller MXUs is limited by the I/O and control wires.
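The wiring argument in that quote can be illustrated with rough numbers. The sketch below compares sixteen 64x64 arrays with four 128x128 arrays; the four-array baseline is my assumption, picked because it gives the same total MAC count, and total perimeter is only a crude proxy for the I/O and control wiring the paper refers to.

```python
def mxu_stats(count, side):
    """Total MACs and total perimeter (in cells) for `count` square arrays."""
    macs = count * side * side
    perimeter = count * 4 * side  # crude proxy for I/O and control wiring
    return macs, perimeter


big_macs, big_perimeter = mxu_stats(4, 128)      # assumed baseline
small_macs, small_perimeter = mxu_stats(16, 64)  # the option in the quote

print("MACs:     ", big_macs, "vs", small_macs)
print("perimeter:", big_perimeter, "vs", small_perimeter)
```

The MAC count is identical (65536 either way) while the total perimeter doubles from 2048 to 4096, which is one way to see why splitting into smaller arrays costs area once the wires dominate.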

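So smaller cores are not always better; it depends on the specific problem. To me, the advantage of very small cores is not necessarily obvious, especially for recommendation systems or image processing at large companies, because when the batch size falls short, different requests can be coalesced together. Of course that assumption may not hold in cars, but it does hold in the cloud.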

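Another worry about small cores is that, since parallel optimization matters so much, it remains to be validated by the market whether they can work well across many kinds of models.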

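The advantages are still clear, though: conditional execution as a feature is definitely a great innovation. Conditional logic is actually not very common in today's models, and I think a big reason is that hardware supports it poorly, so it does not save much. The most common conditional-like logic is still Mixture of Experts, but its granularity is much coarser. The more hardware with conditional support comes out, the more it will help model builders innovate. I look forward to seeing what surprises Tenstorrent brings once it starts selling.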