Placer: the Placement Heuristic Module in TensorFlow

Date: 2019-11-17



Background

[Author: DeepLearningStack, algorithm engineer at Alibaba, open-source TensorFlow contributor]

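Because a single device is limited in compute power and memory, many deep learning models need model sharding or a related strategy. Model sharding splits a model and its computation across different devices. This not only lets a model that does not fit on one device run at all, it can also speed up computation. Among deep learning frameworks, TensorFlow is clearly more flexible than Caffe in this respect, largely thanks to its placement mechanism. Placement is a concept specific to TensorFlow: it refers to which device an op is placed on, so model sharding is really the problem of setting the placement of every op in the model. At the Python level there are two placement-related APIs. They are used widely inside the framework code and can also be called directly by users. User-specified placement is not fully reliable, however, because it often conflicts with what an op can actually support. The Placer is the TensorFlow module that resolves this problem.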

What the Placer Does

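First look at the structure of a NodeDef. Two parts of it relate to placement. One is the device field, which explicitly specifies the device the node should be placed on. The other is a string of the form loc:@xxxx, a placement constraint that implicitly states which nodes this node must share a placement with. More precisely, the node should have the same placement as every node in the group named xxxx. These two pieces of information sometimes contradict each other.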

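The Placer must resolve conflicts between the two, and it also applies rules that avoid, as far as possible, the performance problems caused by poor placement. Once the Placer has processed a node, the node has its final placement, which overwrites the device field in its NodeDef. Put simply, the Placer's job is to compute and fill in the device field of every NodeDef.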
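To make the two placement-related parts of a NodeDef concrete, here is a minimal sketch that models a node as a plain Python dict rather than a real protobuf. The `_class` attribute name and the `loc:@` prefix follow the convention TensorFlow uses to store colocation constraints; the dict layout and the helper `colocation_groups` are assumptions made purely for illustration.

```python
# Hypothetical stand-in for a NodeDef: a plain dict, not the real protobuf.
node_def = {
    "name": "MatMul",
    "op": "MatMul",
    "device": "/device:GPU:0",                  # explicit placement request
    "attr": {"_class": ["loc:@global_step"]},   # implicit colocation constraint
}

COLOCATION_PREFIX = "loc:@"


def colocation_groups(node):
    """Return the colocation group names listed in a node's `_class` attr."""
    groups = []
    for entry in node.get("attr", {}).get("_class", []):
        if entry.startswith(COLOCATION_PREFIX):
            groups.append(entry[len(COLOCATION_PREFIX):])
    return groups


print(node_def["device"])            # the explicit request
print(colocation_groups(node_def))   # the implicit request
```

The Placer has to reconcile these two values whenever they disagree.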

Prerequisites

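When reading the code you will run into terms coined specifically for this problem, along with some classic algorithms. Make sure you understand the items below before reading the Placer module, so you can avoid detours.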

Key Concepts

Placement: a property of each op that explicitly states which device the op should run on.

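Colocation Group: also a placement-related property of each op. In a NodeDef it appears as a string of the form loc:@xxxx. It is a set of nodes, and the algorithm also calls it a constraint. All nodes in the same colocation group are required to have the same placement. This is an implicit way to express placement, and because it can be specified together with an explicit placement, the two can conflict. If they do, an error is reported directly.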

Basic Principles of Placer Decisions

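The Placer analyzes the graph to some extent and, taking the user's requests into account, fine-tunes the placement of each node. The principles behind this fine-tuning can be summarized in four points.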

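1. Honor user requests as far as possible (User Requirement First): each node's placement satisfies the user's request whenever it can.

2. Prefer faster devices (High Performance Device): if the user did not specify a placement for a node, the faster device is assigned first.

3. Keep the program runnable (Runnable): if no implementation of a node exists for the placement the user requested, fall back to another implementation so the program can still run.

4. Consider locality where possible (Near When Possible): when fine-tuning placement, take node adjacency into account and minimize pointless copies.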

Honor User Requests as Far as Possible (User Requirement First)

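User requests come in two forms. One is explicit: the device set on the node. The other is implicit: the loc:@xxxx attribute, that is, the colocation group. The Placer completes and fine-tunes the placement based on both kinds of request and on what is actually possible. The NodeDef screenshot at the start of the original post shows a node whose op type is MatMul, which the user explicitly placed on '/device:GPU:0' and also asked to join the colocation group named global_step. The device field and the loc:@xxxx attribute in a NodeDef are introduced by the two Python-level APIs below. Both are under user control, and some uses are wrapped inside higher-level APIs.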

# Sets the device field (signature only, body omitted)
@tf_export("device")
def device(device_name_or_function):
    ...

# Sets the colocation attribute (signature only, body omitted)
@tf_export("colocate_with")
def colocate_with(op, ignore_existing=False):
    ...

Prefer Faster Devices (High Performance Device)

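If the device field of a node contains no device_type (that is, GPU or CPU), the Placer must decide which kind of device to use. Every device type registered with TensorFlow carries a priority, and a higher-priority device usually has better compute performance. When an op has implementations for several device types, the Placer picks the implementation with the highest priority by setting device_type to the highest-priority device among all available implementations.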
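The selection rule described above can be sketched in a few lines. This is an illustration only: the priority numbers in `DEVICE_PRIORITY` and the function name `pick_device_type` are assumptions, not values or names taken from TensorFlow.

```python
# Assumed priorities for illustration: a larger number is more preferred.
DEVICE_PRIORITY = {"GPU": 2, "CPU": 1}


def pick_device_type(requested_type, supported_types, priority=DEVICE_PRIORITY):
    """Keep the user's device type if given, else pick the top-priority one.

    requested_type: "GPU", "CPU", or None when the user left it unset.
    supported_types: device types for which the op has an implementation.
    """
    if requested_type is not None:
        return requested_type
    if not supported_types:
        raise ValueError("op has no registered implementation")
    return max(supported_types, key=lambda device_type: priority[device_type])


print(pick_device_type(None, ["CPU", "GPU"]))
print(pick_device_type(None, ["CPU"]))
```

An op that only has a CPU implementation therefore lands on the CPU even when a GPU is present.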

Keep the Program Runnable (Runnable)

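This is guaranteed by the Soft Placement mechanism. If a node is explicitly pinned to a particular device but no implementation exists for that device, Soft Placement steps in to keep the program usable. It ignores the device type and, following device priority, picks another available implementation and rewrites the placement. For example, suppose a node's op is SparseToDense and its device_type is set to GPU, but SparseToDense currently has only a CPU implementation in TensorFlow. Soft Placement then rewrites the node's device_type to CPU.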
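A minimal sketch of this fallback, under the same assumptions as before: the priority table and the function name `resolve_device_type` are hypothetical, and real device names carry more than a type. Only the decision logic is modelled.

```python
# Assumed priorities for illustration: a larger number is more preferred.
DEVICE_PRIORITY = {"GPU": 2, "CPU": 1}


def resolve_device_type(requested_type, supported_types, allow_soft_placement):
    """Return the device type an op ends up on.

    If the requested type has an implementation, it is honored. Otherwise the
    request is dropped only when soft placement is enabled, and the
    highest-priority supported type is used instead.
    """
    if requested_type in supported_types:
        return requested_type
    if requested_type is not None and not allow_soft_placement:
        raise ValueError(
            f"no {requested_type} implementation and soft placement is off")
    if not supported_types:
        raise ValueError("op has no registered implementation")
    return max(supported_types, key=lambda device_type: DEVICE_PRIORITY[device_type])


# SparseToDense requested on GPU but only implemented for CPU.
print(resolve_device_type("GPU", ["CPU"], allow_soft_placement=True))
```

With `allow_soft_placement=False` the same call raises an error instead of silently moving the op.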

Consider Locality Where Possible (Near When Possible)

The Placer implements this principle with the following three heuristic rules.

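a. If a node is a GeneratorNode (0 in-degree, 1 out-degree, and a non-reference output type), giving it the same placement as its consumer avoids a pointless cross-device copy. The algorithm calls this heuristic A.

b. If a node is a MetaDataNode (one that operates only on a tensor's metadata, for example Shape, Rank, or Size), giving it the same placement as its producer avoids a pointless cross-device copy. The algorithm calls this heuristic B.

c. If an input of a node is a reference type or a resource type, try to put the node in the same colocation group as that input. The algorithm gives this step no name, so for convenience we call it heuristic C.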
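The first two rules depend on classifying nodes, which can be sketched as two small predicates. The `Node` dataclass and the `METADATA_OPS` set are assumptions for illustration; in particular, "1 out-degree" is modelled here as a single output tensor.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified node record used only in this sketch."""
    name: str
    op: str
    inputs: list = field(default_factory=list)   # names of producer nodes
    num_outputs: int = 1
    output_is_ref: bool = False


# Assumed set of ops that touch only tensor metadata.
METADATA_OPS = {"Shape", "Rank", "Size"}


def is_generator_node(node):
    """No inputs, exactly one output, and that output is not a reference."""
    return (not node.inputs
            and node.num_outputs == 1
            and not node.output_is_ref)


def is_metadata_node(node):
    """True for ops that read only the metadata of their input tensor."""
    return node.op in METADATA_OPS


const = Node("c", "Const")
variable = Node("v", "VariableV2", output_is_ref=True)
shape = Node("s", "Shape", inputs=["c"])
print(is_generator_node(const), is_generator_node(variable), is_metadata_node(shape))
```

A variable is excluded from heuristic A because its output is a reference, which is exactly the case heuristic C handles.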

Overall Flow of the Placer Decision Algorithm

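The overall flow has four steps, which the flowchart in the original post shows at a high level. The last two steps are more involved, and the next section breaks them down further.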

Step-by-Step Walkthrough of the Placer Algorithm with Key Code

Step 1: Build Colocation Groups from User Specifications

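In general, a node for which the user specified no colocation group is put into a group of its own as the only member, and the group is named after the node, so every node in the graph has a colocation group. Logically, merging several groups is a very simple problem. In this setting, though, a group is not just a set of nodes. It also carries attributes, such as the group's possible devices, the set of all devices the group can use. We therefore need a data structure and algorithm that makes it easy to produce the new group and its attributes when two groups merge (a convenient Union) and to look up all attributes of the group a node belongs to (a fast Find). This is where Find-Union, also known as union-find or the disjoint-set forest, fits well. We will not describe how Find-Union works here and only show the basic data structure it uses in the code, Member, which describes the basic information of a group. Before reading the comments in the code below, you need a basic understanding of what the tree structure in Find-Union means.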

  // Represents a node in the disjoint node set forest, and the
  // accumulated constraints on the device used by that node.
  struct Member {
    Member() = default;
    // The id of the node that is the parent of this one, or its own
    // id if it is a root. parent <= 0 indicates that this member is invalid.
    int parent = -1;

    // A proxy for the depth of the tree that is used to prefer
    // connecting smaller trees to larger trees when merging disjoint
    // sets.
    int rank = 0;

    // The intersection of all device types supported by this node,
    // and those of all of its children, in priority order
    // of the preferred device.
    DeviceTypeVector supported_device_types;

    // The merged form of the device requested for this node, with
    // those of all of its children.
    DeviceNameUtils::ParsedName device_name;

    // If this node is a root, stores a list of Devices to which this node
    // and all of its children have been assigned, or nullptr if this
    // has not yet been computed.
    std::vector<Device*> possible_devices;
  };
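For readers who want a refresher, here is a minimal Python sketch of a disjoint-set forest whose roots carry a merged attribute, in the spirit of Member. It is not the TensorFlow implementation: the class names are hypothetical, only `supported_types` is merged (by intersection), and the device name and possible devices are left out.

```python
class Member:
    """One entry in the forest; only a root's attributes are meaningful."""

    def __init__(self, node_id, supported_types):
        self.parent = node_id                  # a root points at itself
        self.rank = 0                          # proxy for tree depth
        self.supported_types = list(supported_types)  # in priority order


class ColocationSets:
    """Union-find over node ids, merging supported device types on union."""

    def __init__(self):
        self.members = {}

    def add(self, node_id, supported_types):
        self.members[node_id] = Member(node_id, supported_types)

    def find(self, node_id):
        # Walk up to the root, then compress the path behind us.
        root = node_id
        while self.members[root].parent != root:
            root = self.members[root].parent
        while self.members[node_id].parent != root:
            next_id = self.members[node_id].parent
            self.members[node_id].parent = root
            node_id = next_id
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        member_a, member_b = self.members[root_a], self.members[root_b]
        common = [t for t in member_a.supported_types
                  if t in member_b.supported_types]
        if not common:
            raise ValueError(f"cannot colocate {a} and {b}: no shared device type")
        # Union by rank: attach the shallower tree under the deeper one.
        if member_a.rank < member_b.rank:
            root_a, root_b = root_b, root_a
            member_a, member_b = member_b, member_a
        member_b.parent = root_a
        if member_a.rank == member_b.rank:
            member_a.rank += 1
        member_a.supported_types = common
        return root_a


sets = ColocationSets()
sets.add("a", ["GPU", "CPU"])
sets.add("b", ["CPU"])
sets.add("c", ["GPU", "CPU"])
sets.union("a", "b")
print(sets.members[sets.find("b")].supported_types)
```

Storing the merged attributes on the root is what makes both operations cheap: a union touches two roots, and a lookup is a single find.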

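The code below is the core of this step. It first creates a ColocationGraph object, a helper class for handling colocation groups that uses Find-Union internally to aggregate groups. After calling InitializeMembers to initialize the basic Find-Union data structures, it calls ColocateAllNodes to aggregate groups according to all colocation information specified by the user.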

  ColocationGraph colocation_graph(
      graph_, devices_,
      options_ == nullptr || options_->config.allow_soft_placement(),
      default_device_);

  TF_RETURN_IF_ERROR(colocation_graph.InitializeMembers());

  // 1. First add all of the nodes. Note that steps (1) and (2)
  // requires two passes over the nodes because the graph (and hence
  // the constraints) may not be acyclic.
  TF_RETURN_IF_ERROR(colocation_graph.ColocateAllNodes());

Step 2: Applying Heuristic C

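This step adjusts the colocation groups. While traversing each node in the graph, the Placer looks at the node's inputs to decide whether the node's group should be merged with the group of the source node. If an input is a ref type or DT_RESOURCE, it attempts the merge. (DT_RESOURCE usually shows up only when ResourceVariable is used. Compared with Variable, ResourceVariable has many new features, which are promoted in TF 2.0. We will not go into its advantages here and only note the op types involved: at the C++ level a Variable is a VariableV2 op, while a ResourceVariable is a VarHandleOp, and the tensor the latter produces is a DT_RESOURCE.) Before merging, the Placer performs the necessary feasibility checks and proactively reports errors where appropriate. For example, besides the edge between this pair of nodes, it also has to consider whether the node's other inputs are ref type or DT_RESOURCE. The code for this part is long but fairly simple, so it is not shown here.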
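Since the original code is not shown, here is a rough sketch of the idea only. The edge representation, the `kind` labels, and the function names are assumptions, and the feasibility checks mentioned above are left out entirely.

```python
def find(parent, x):
    """Return the root of x, halving the path on the way up."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def colocate_by_reference_edges(nodes, edges):
    """Merge the groups of src and dst for every ref or resource edge.

    nodes: iterable of node names.
    edges: iterable of (src, dst, kind), kind in {"value", "ref", "resource"}.
    Returns the resulting groups as a list of sets of node names.
    """
    parent = {name: name for name in nodes}
    for src, dst, kind in edges:
        if kind in ("ref", "resource"):
            parent[find(parent, dst)] = find(parent, src)
    groups = {}
    for name in nodes:
        groups.setdefault(find(parent, name), set()).add(name)
    return list(groups.values())


nodes = ["var", "assign", "const", "add"]
edges = [
    ("var", "assign", "ref"),      # Assign takes the variable by reference
    ("const", "assign", "value"),
    ("var", "add", "value"),       # Add only reads the value
]
groups = colocate_by_reference_edges(nodes, edges)
print(groups)
```

Only the reference edge forces colocation, so the Assign node joins the variable's group while the value consumers stay free to be placed elsewhere.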

Step 3: Applying Heuristic B

From this step on, the Placer actually starts assigning a device to each node. The flowchart in the original post illustrates the step.

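1. If the current node's device field already has a value, the Placer does not assign it again and skips the node.

2. If the current node is a GeneratorNode, it is first pushed into a vector named second_pass.

3. If neither case applies, the node is exactly what this step handles. The Placer first obtains the usable devices from the node's colocation group as candidates (Soft Placement affects what is returned). If the node is a metadata node, it tries to apply heuristic B. Otherwise, or if heuristic B cannot be satisfied, it assigns the highest-priority device in the candidate set.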

The code below shows how a MetaDataNode is handled. This is the code for heuristic B.

    int assigned_device = -1;

    // Heuristic B: If the node only operates on metadata, not data,
    // then it is desirable to place that metadata node with its
    // input.
    if (IsMetadata(node)) {
      // Make sure that the input device type is in the list of supported
      // device types for this node.
      const Node* input = (*node->in_edges().begin())->src();
      // TODO(vrv): if the input is empty, consider postponing this
      // node's assignment to the second pass, so that we handle the
      // case where a metadata node's input comes from a backedge
      // of a loop.
      if (CanAssignToDevice(input->assigned_device_name(), *devices)) {
        assigned_device = input->assigned_device_name_index();
      }
    }

    // Provide the default, if necessary.
    if (assigned_device == -1) {
      assigned_device = graph_->InternDeviceName((*devices)[0]->name());
    }

    AssignAndLog(assigned_device, node);

Step 4: Applying Heuristic A

This step assigns a device to every node in the second_pass vector. The flowchart in the original post illustrates the step.

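Every node in second_pass is a GeneratorNode, so only heuristic A needs to be applied. As in step 3, heuristic A is applied on a best-effort basis: if it cannot be satisfied, the highest-priority device among the candidates is assigned directly. Below is the part of the code that applies heuristic A.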

    int assigned_device = -1;

    // Heuristic A application.
    if (IsGeneratorNode(node)) {
      const Node* output = (*node->out_edges().begin())->dst();
      int output_device_name = output->assigned_device_name_index();

      const bool consumers_on_same_device = std::all_of(
          node->out_edges().begin(), node->out_edges().end(),
          [output_device_name](const Edge* e) {
            return e->dst()->assigned_device_name_index() == output_device_name;
          });

      if (consumers_on_same_device &&
          CanAssignToDevice(output->assigned_device_name(), *devices)) {
        assigned_device = output_device_name;
      }
    }

    // Provide the default, if necessary.
    if (assigned_device == -1) {
      assigned_device = graph_->InternDeviceName((*devices)[0]->name());
    }

    AssignAndLog(assigned_device, node);

At this point, the placement of every node has been assigned and fine-tuned.
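Steps 3 and 4 together form a two-pass assignment, which the following sketch simulates on a toy graph. It is a simplification under stated assumptions: nodes are plain dicts, `candidates` stands in for the per-group device list (highest priority first), any node without inputs is treated as a generator, and `METADATA_OPS` is an assumed set. None of these names come from TensorFlow.

```python
METADATA_OPS = {"Shape", "Rank", "Size"}  # assumed set of metadata-only ops


def place(nodes, candidates):
    """Assign a device to every node; returns {node name: device name}."""
    assigned = {n["name"]: n["device"] for n in nodes if n.get("device")}
    consumers = {n["name"]: [] for n in nodes}
    for n in nodes:
        for src in n["inputs"]:
            consumers[src].append(n["name"])

    second_pass = []
    for n in nodes:                          # first pass (step 3)
        name = n["name"]
        if name in assigned:
            continue                         # already has a device
        if not n["inputs"]:
            second_pass.append(n)            # generator: decide later
            continue
        choice = None
        if n["op"] in METADATA_OPS:          # heuristic B: follow the producer
            producer_device = assigned.get(n["inputs"][0])
            if producer_device in candidates[name]:
                choice = producer_device
        assigned[name] = choice or candidates[name][0]

    for n in second_pass:                    # second pass (step 4)
        name = n["name"]
        consumer_devices = {assigned.get(c) for c in consumers[name]}
        choice = None
        if len(consumer_devices) == 1:       # heuristic A: follow the consumers
            (only,) = consumer_devices
            if only in candidates[name]:
                choice = only
        assigned[name] = choice or candidates[name][0]
    return assigned


GPU, CPU = "/device:GPU:0", "/device:CPU:0"
nodes = [
    {"name": "const", "op": "Const", "inputs": []},
    {"name": "x", "op": "Placeholder", "inputs": [], "device": GPU},
    {"name": "matmul", "op": "MatMul", "inputs": ["x", "const"]},
    {"name": "shape", "op": "Shape", "inputs": ["matmul"]},
]
candidates = {"const": [GPU, CPU], "x": [GPU], "matmul": [CPU], "shape": [GPU, CPU]}
result = place(nodes, candidates)
print(result)
```

In this toy graph both heuristics override the default: `shape` follows its producer `matmul` onto the CPU, and `const` follows its only consumer there too, even though the GPU ranks first in their candidate lists.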

Summary

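A GraphDef that has been processed by the Placer is guaranteed to have no remaining conflicts at the placement level, so the Placer is regarded as the last line of defense against placement conflicts. After the Placer, the GraphDef is passed to the GraphPartitioner module, which splits it into subgraphs according to each node's device and inserts Send, Recv, and any necessary control-flow nodes. As the walkthrough shows, the core of the Placer is applying several heuristic rules to fine-tune placement, but these rules are still fairly simple and do not fully solve the performance problem. Looking for optimization headroom in placement, one thing comes to mind immediately: in distributed mode a crude placement scheme can make job performance very poor, because it introduces communication overhead on top of computation. For the sake of flexibility, TensorFlow pushes the burden of the placement strategy onto users, which is one reason some user-written distributed TensorFlow programs perform so badly. From a framework perspective, TensorFlow ought to relieve users of that programming burden so they can focus entirely on the model and algorithm. Automatically searching for the best placement strategy is very hard, however, because it must account for cluster communication bandwidth and the compute cost of every op, a complex problem tightly coupled to hardware and environment. In addition, deep learning models usually contain thousands of nodes, which makes the search space enormous. For this problem, Google has proposed using reinforcement learning to search for the best model sharding strategy. Interested readers can refer to the ICML paper Device Placement Optimization with Reinforcement Learning.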