Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy

时间 2019-11-30

标签 edge intelligence demand deep learning model inference device synergy 栏目 Intel 繁體版

原文原文链接

边缘智能：按需深度学习模型和设备边缘协同的共同推理

本文为SIGCOMM 2018 Workshop (Mobile Edge Communications, MECOMM)论文。node

笔者翻译了该论文。因为时间仓促，且笔者英文能力有限，错误之处在所不免；欢迎读者批评指正。算法

本文及翻译版本仅用于学习使用。若是有任何不当，请联系笔者删除。promise

本文做者包含3位，En Li, Zhi Zhou, and Xu Chen@School of Data and Computer Science, Sun Yat-sen University 服务器

ABSTRACT (摘要）

As the backbone technology of machine learning, deep neural networks (DNNs) have have quickly ascended to the spotlight. Running DNNs on resource-constrained mobile devices is, however, by no means trivial, since it incurs high performance and energy overhead. While offloading DNNs to the cloud for execution suﬀers unpredictable performance, due to the uncontrolled long wide-area network latency. To address these challenges, in this paper, we propose Edgent, a collaborative and on-demand DNN co-inference framework with device-edge synergy. Edgent pursues two design knobs: (1) DNN partitioning that adaptively partitions DNN computation between device and edge, in order to leverage hybrid computation resources in proximity for real-time DNN inference. (2) DNN right-sizing that accelerates DNN inference through early-exit at a proper intermediate DNN layer to further reduce the computation latency. The prototype implementation and extensive evaluations based on Raspberry Pi demonstrate Edgent’s eﬀectiveness in enabling on-demand low-latency edge intelligence.
网络

做为机器学习的骨干技术，深度神经网络（DNNs）已经迅速成为人们关注的焦点。然而，在资源受限的移动设备上运行DNN毫不是微不足道的，由于它会带来高性能和高能耗开销。因为不受控制的长广域网延迟，将DNN加载到云中以便执行会带来不可预测的性能。为了应对这些挑战，在本文中，咱们提出了Edgent，一种具备设备边缘协同做用的协做和按需DNN协同推理框架。 Edgent追求两个设计目标：（1）DNN划分，自适应地划分设备和边缘之间的DNN计算，以便利用邻近的混合计算资源进行实时DNN推理。（2）DNN正确调整大小，经过在适当的中间DNN层提早退出来加速DNN推理，以进一步减小计算延迟。基于Raspberry Pi的原型实现和普遍评估证实了Edgent在实现按需低延迟边缘智能方面的有效性。架构

1 INTRODUCTION & RELATED WORK （引言和相关工做）

As the backbone technology supporting modern intelligent mobile applications, Deep Neural Networks (DNNs) represent the most commonly adopted machine learning technique and have become increasingly popular. Due to DNNs’s ability to perform highly accurate and reliable inference tasks, they have witnessed successful applications in a broad spectrum of domains from computer vision [14] to speech recognition [12] and natural language processing [16]. However, as DNN-based applications typically require tremendous amount of computation, they cannot be well supported by today’s mobile devices with reasonable latency and energy consumption. app

做为支持现代智能移动应用的骨干技术，深度神经网络（DNN）表明了最经常使用的机器学习技术，而且愈来愈受欢迎。因为DNN可以执行高度准确和可靠的推理任务，他们见证了从计算机视觉[14]到语音识别[12]和天然语言处理[16]等普遍领域的成功应用。可是，因为基于DNN的应用程序一般须要大量的计算，所以当今的移动设备没法很好地支持它们（在合理的延迟和能耗约束下）。框架

In response to the excessive resource demand of DNNs, the traditional wisdom resorts to the powerful cloud datacenter for training and evaluating DNNs. Input data generated from mobile devices is sent to the cloud for processing, and then results are sent back to the mobile devices after the inference. However, with such a cloud-centric approach, large amounts of data (e.g., images and videos) are uploaded to the remote cloud via a long wide-area network data transmission, resulting in high end-to-end latency and energy consumption of the mobile devices. To alleviate the latency and energy bottlenecks of cloud-centric approach, a better solution is to exploiting the emerging edge computing paradigm. Specifcally, by pushing the cloud capabilities from the network core to the network edges (e.g., base stations and WiFi access points) in close proximity to devices, edge computing enables low-latency and energy-efficient DNN inference. less

为了应对DNN的过多资源需求，传统智慧采用强大的云数据中心来训练和评估DNN。从移动设备生成的输入数据被发送到云进行处理，而后在推断以后将结果发送回移动设备。然而，利用这种以云为中心的方法，大量数据（例如，图像和视频）经过长广域网数据传输上传到远程云，致使移动设备上大的端到端延迟和能量消耗。为了缓解以云为中心的方法的延迟和能量瓶颈，更好的解决方案是利用新兴的边缘计算范例。具体地，经过将云的能力从网络核心推送到紧邻设备的网络边缘（例如，基站和WiFi接入点），边缘计算实现低延迟和高效能的DNN推断。dom

While recognizing the benefts of edge-based DNN inference, our empirical study reveals that the performance of edge-based DNN inference is highly sensitive to the available bandwidth between the edge server and the mobile device. Specifcally, as the bandwidth drops from 1Mbps to 50Kbps, the latency of edge-based DNN inference climbs from 0.123s to 2.317s and becomes on par with the latency of local processing on the device. Then, considering the vulnerable and volatile network bandwidth in realistic environments (e.g., due to user mobility and bandwidth contention among various Apps), a natural question is that can we further improve the performance (i.e., latency) of edge-based DNN execution, especially for some mission-critical applications such as VR/AR games and robotics [13].

虽然咱们认识到基于边缘的DNN推理的好处，但咱们的实证研究代表，基于边缘的DNN推理的性能对边缘服务器和移动设备之间的可用带宽高度敏感。具体而言，随着带宽从1Mbps降至50Kbps，基于边缘的DNN推断的延迟从0.123s上升到2.317s，而且与设备上本地处理的延迟至关。而后，考虑到现实环境中易受攻击和易变的网络带宽（例如，因为用户移动性和各类应用之间的带宽争用），一个天然的问题是咱们可否进一步改善基于边缘的DNN执行的性能（即延迟），特别是对于一些关键任务应用，如VR/AR游戏和机器人[13]。

To answer the above question in the positive, in this paper we proposed Edgent, a deep learning model co-inference framework with device-edge synergy. Towards low-latency edge intelligence, Edgent pursues two design knobs. The frst is DNN partitioning, which adaptively partitions DNN computation between mobile devices and the edge server based on the available bandwidth, and thus to take advantage of the processing power of the edge server while reducing data transfer delay. However, worth noting is that the latency after DNN partition is still restrained by the rest part running on the device side. Therefore, Edgent further combines DNN partition with DNN right-sizing which accelerates DNN inference through early-exit at an intermediate DNN layer. Needless to say, early-exit naturally gives rise to the latency-accuracy tradeoﬀ (i.e., early-exit harms the accuracy of the inference). To address this challenge, Edgent jointly optimizes the DNN partitioning and right-sizing in an on-demand manner. That is, for mission-critical applications that typically have a predefned deadline, Edgent maximizes the accuracy without violating the deadline. The prototype implementation and extensive evaluations based on Raspberry Pi demonstrate Edgent’s eﬀectiveness in enabling on-demand low-latency edge intelligence.

为了回答上述问题，咱们在本文中提出了Edgent，一种具备设备边缘协同做用的深度学习模型协同推理框架。对于低延迟边缘智能（做为初始探索，本文咱们只考虑执行延迟问题。在将来工做中，咱们也将考虑能耗问题），Edgent追求两个设计目标。第一个是DNN分区，其基于可用带宽自适应地划分移动设备和边缘服务器之间的DNN计算，从而利用边缘服务器的处理能力，同时减小数据传输延迟。但值得注意的是，DNN分区后的延迟仍然受到设备端运行的其他部分的限制。所以，Edgent进一步将DNN分区与DNN正确大小调整相结合，经过在中间DNN层的早期退出来加速DNN推断。不用说，提早退出天然会产生延迟和准确度之间的均衡（即，提早退出会损害推断的准确性）。为了解决这一挑战，Edgent以按需方式协同优化DNN分区和正确大小调整。也就是说，对于一般具备预约截止时间的关键任务应用程序，Edgent在不违反截止时间的状况下最大化准确性。基于Raspberry Pi的原型实现和普遍评估证实了Edgent在实现按需低延迟边缘智能方面的有效性。

While the topic of edge intelligence has began to garner much attention recently, our study is diﬀerent from and complementary to existing pilot eﬀorts. On one hand, for fast and low power DNN inference at the mobile device side, various approaches as exemplifed by DNN compression and DNN architecture optimization has been proposed [3–5, 7, 9]. Diﬀerent from these works, we take a scale-out approach to unleash the benefts of collaborative edge intelligence between the edge and mobile devices, and thus to mitigate the performance and energy bottlenecks of the end devices. On the other hand, though the idea of DNN partition among cloud and end device is not new [6], realistic measurements show that the DNN partition is not enough to satisfy the stringent timeliness requirements of mission-critical applications. Therefore, we further apply the approach of DNN right-sizing to speed up DNN inference.

虽然边缘智能的话题在最近引发了不少关注，但咱们的研究与现有的工做不一样而且互为补充。一方面，对于移动设备侧的快速和低功率DNN推断，已经提出了DNN压缩和DNN架构优化为例的各类方法[3-5,7,9]。与这些工做不一样，咱们采用横向扩展方法释放边缘和移动设备之间协同边缘智能的好处，从而减轻终端设备的性能和能耗瓶颈。另外一方面，虽然云和终端设备之间DNN划分的想法并不新鲜[6]，但实际测量代表DNN划分不足以知足任务关键型应用的严格的实时性要求。所以，咱们进一步应用DNN正确大小调整的方法来加速DNN推理。

2 BACKGROUND & MOTIVATION （背景 & 动机）

In this section, we frst give a primer on DNN, then analyse the inefciency of edge- or device-based DNN execution, and finally illustrate the benefts of DNN partitioning and rightsizing with device-edge synergy towards low-latency edge intelligence.

在本节中，咱们首先介绍DNN，而后分析基于边缘或设备的DNN执行的效率，最后说明利用设备边缘协同做用的DNN划分和正确大小调整实现低延迟边缘智能的好处。

2.1 A Primer on DNN （DNN基础）

DNN represents the core machine learning technique for a broad spectrum of intelligent applications spanning computer vision, automatic speech recognition and natural language processing. As illustrated in Fig. 1, computer vision applications use DNNs to extract features from an input image and classify the image into one of the pre-defned categories. A typical DNN model is organized in a directed graph which includes a series of inner-connected layers, and within each layer comprising some number of nodes. Each node is a neuron that applies certain operations to its input and generates an output. The input layer of nodes is set by raw data while the output layer determines the category of the data. The process of passing forward from the input layer to the out layer is called model inference. For a typical DNN containing tens of layers and hundreds of nodes per layer, the number of parameters can easily reach the scale of millions. Thus, DNN inference is computational intensive. Note that in this paper we focus on DNN inference, since DNN training is generally delay tolerant and is typically conducted in an oﬀ-line manner using powerful cloud resources.

DNN表明了包括计算机视觉、自动语音识别和天然语言处理在内的普遍智能应用的核心机器学习技术。如图1所示，计算机视觉应用程序使用DNN从输入图像中提取特征并将图像分类为某一预约类别。典型的DNN模型被组织为有向图，该有向图包括一系列内部链接的层，而且在每一个层内包括一些节点。每一个节点都是一个神经元，它将某些操做应用于其输入并生成输出。输入层节点由原始数据设置，而输出层肯定数据的类别。从输入层到外层的前向传递过程称为模型推断。对于包含数十层和每层包含数百个节点的典型DNN，参数的数量能够轻松达到数百万的规模。所以，DNN推断是计算密集型的。注意，在本文中，咱们关注DNN推理，由于DNN训练一般是延迟容忍的，而且一般使用强大的云资源以一种离线的方式进行。

图1：一个用于计算机视觉的4层DNN

2.2 Inefficiency of Device- or Edge-based DNN Inference （基于设备或边缘的DNN推理不高效）

Currently, the status quo of mobile DNN inference is either direct execution on the mobile devices or offloading to the cloud/edge server for execution. Unfortunately, both approaches may suﬀer from poor performance (i.e., end-to-end latency), being hard to well satisfy real-time intelligent mobile applications (e.g., AR/VR mobile gaming and intelligent robots) [2]. As illustration, we take a Raspberry Pi tiny computer and a desktop PC to emulate the mobile device and edge server respectively, running the classical AlexNet [1] DNN model for image recognition over the Cifar-10 dataset [8]. Fig. 2 plots the breakdown of the end-to-end latency of diﬀerent approaches under varying bandwidth between the edge and mobile device. It clearly shows that it takes more than 2s to execute the model on the resource-limited Raspberry Pi. Moreover, the performance of edge-based execution approach is dominated by the input data transmission time (the edge server computation time keeps at ∼10ms) and thus highly sensitive to the available bandwidth. Specifcally, as the available bandwidth jumps from 1Mbps to 50Kbps, the end-to-end latency climbs from 0.123s to 2.317s. Considering the network bandwidth resource scarcity in practice (e.g., due to network resource contention among users and apps) and computing resource limitation on mobile devices, both of the device- and edge-based approaches are challenging to well support many emerging real-time intelligent mobile applications with stringent latency requirement.

目前，移动DNN推断的现状是要么在移动设备上直接执行，要么加载到云/边缘服务器执行。不幸的是，这两种方法均可能经历较差的性能（即端到端延迟），难以很好地知足实时智能移动应用（例如AR/VR移动游戏和智能机器人）[2]。例如，咱们采用Raspberry Pi小型计算机和台式PC分别模拟移动设备和边缘服务器，运行经典的AlexNet [1] DNN模型，经过Cifar-10数据集进行图像识别[8]。图2绘制了边缘和移动设备之间不一样带宽下不一样方法的端到端延迟的细分。它清楚地代表在资源有限的Raspberry Pi上执行模型须要2秒以上。此外，基于边缘的执行方法的性能由输入数据传输时间（边缘服务器计算时间保持在~10ms）决定，所以对可用带宽高度敏感。具体而言，随着可用带宽从1Mbps降至50Kbps，端到端延迟从0.123秒攀升至2.317秒。考虑到实际中网络带宽资源的稀缺性（例如，因为用户和应用之间的网络资源争用）以及移动设备上的计算资源限制，基于设备和边缘的方法都难以很好地支持许多新兴的具备严格延迟要求的实时智能移动应用程序。

图2：AlexNet运行时间。

2.3 Enabling Edge Intelligence with DNN Partitioning and Right-Sizing （使用DNN划分和正确大小调整使能边缘智能）

DNN Partitioning: For a better understanding of the performance bottleneck of DNN execution, we further break the runtime (on Raspberry Pi) and output data size of each layer in Fig. 3. Interestingly, we can see that the runtime and output data size of diﬀerent layers exhibit great heterogeneities, and layers with a long runtime do not necessarily have a large output data size. Then, an intuitive idea is DNN partitioning, i.e., partitioning the DNN into two parts and offloading the computational intensive one to the server at a low transmission overhead, and thus to reduce the end-to-end latency. For illustration, we choose the second local response normalization layer (i.e., lrn 2) in Fig. 3 as the partition point and offload the layers before the partition point to the edge server while running the rest layers on the device. By DNN partitioning between device and edge, we are able to collaborate hybrid computation resources in proximity for low-latency DNN inference.

DNN划分：为了更好地理解DNN执行的性能瓶颈，咱们进一步分解运行时（在Raspberry Pi上）并在图3中给出每层的数据大小。有趣的是，咱们能够看到运行时间和不一样层的输出数据大小表现出很大的异构性，具备长运行时间的层不必定具备大的输出数据大小。而后，直观的想法是DNN划分，即，将DNN分红两部分并以低传输开销将计算密集的一部分卸载到服务器，从而减小端到端等待时间。为了说明，咱们选择图3中的第二局部响应归一化层（即，lrn 2）做为划分点，并将划分点以前的层卸载到边缘服务器，同时在设备上运行其他层。经过DNN在设备和边缘之间进行划分，咱们可以为低延迟DNN推理协同混合计算资源。

图3：树莓Pi设备上AlexNet层的运行时间。

DNN Right-Sizing: While DNN partitioning greatly reduces the latency by bending the computing power of the edge server and mobile device, we should note that the optimal DNN partitioning is still constrained by the run time of layers running on the device. For further reduction of latency, the approach of DNN Right-Sizing can be combined with DNN partitioning. DNN right-sizing promises to accelerate model inference through an early-exit mechanism. That is, by training a DNN model with multiple exit points and each has a diﬀerent size, we can choose a DNN with a small size tailored to the application demand, meanwhile to alleviate the computing burden at the model division, and thus to reduce the total latency. Fig. 4 illustrates a branchy AlexNet with five exit points. Currently, the early-exit mechanism has been supported by the open source framework BranchyNet[15]. Intuitively, DNN right-sizing further reduces the amount of computation required by the DNN inference tasks.

DNN正确大小调整：虽然DNN划分经过下降边缘服务器和移动设备的计算能力大大减小了延迟，但咱们应该注意到最佳DNN划分仍然受到设备上运行的层的运行时间的限制。为了进一步减小延迟，DNN正确大小调整的方法能够与DNN划分相结合。 DNN正确的规模承诺经过早期退出机制加速模型推理。也就是说，经过训练具备多个出口点的DNN模型而且每一个具备不一样的尺寸，咱们能够选择适合应用需求的小尺寸DNN，同时减轻模型部门的计算负担，从而减小总延迟。图4示出了具备五个出口点分支的AlexNet。目前，早期退出机制获得了开源框架BranchyNet的支持[15]。直观地说，DNN正确大小调整进一步减小了DNN推理任务所需的计算量。

图4: DNN正确大小调整中早期退出机制的示例

Problem Description: Obviously, DNN right-sizing incurs the problem of latency-accuracy tradeoﬀ — while early exit reduces the computing time and the device side, it also deteriorates the accuracy of the DNN inference. Considering the fact that some applications (e.g., VR/AR game) have stringent deadline requirement while can tolerate moderate accuracy loss, we hence strike a nice balance between the latency and the accuracy in an on-demand manner. Particularly, given the predefned and stringent latency goal, we maximize the accuracy without violating the deadline requirement. More specifcally, the problem to be addressed in this paper can be summarized as: given a predefned latency requirement, how to jointly optimize the decisions of DNN partitioning and right-sizing, in order to maximize DNN inference accuracy.

问题描述：显然，DNN正确大小调整会产生延迟-准确性的权衡 - 虽然早期退出会缩短计算时间(设备)，但它也会下降DNN推理的准确性。考虑到某些应用程序（例如，VR/AR游戏）具备严格的期限要求而可以容忍适度的准确度损失的事实，所以咱们以按需方式在延迟和准确性之间取得了良好的平衡。特别是，考虑到预约义和严格的延迟目标，咱们在不违反期限要求的状况下最大化准确性。更具体地说，本文要解决的问题可概括为：给定预约的延迟要求，如何联合优化DNN划分和正确大小调整的决策，以最大化DNN推理的准确性。

3 FRAMEWORK （框架）

We now outline the initial design of Edgent, a framework that automatically and intelligently selects the best partition point and exit point of a DNN model to maximize the accuracy while satisfying the requirement on the execution latency.

咱们如今概述Edgent的初始设计，这是一个自动智能地选择DNN模型的最佳划分点和退出点的框架，以在知足执行延迟要求的同时最大化准确性。

3.1 System Overview （系统概述）

Fig. 5 shows the overview of Edgent. Edgent consists of three stages: ofine training stage, online optimization stage and co-inference stage.

图5显示了Edgent的概述。 Edgent由三个阶段组成：训练阶段、在线优化阶段和共同推理阶段。

图5： Edgent概览

At offline training stage, Edgent performs two initializations: (1) profling the mobile device and the edge server to generate regression-based performance prediction models (Sec. 3.2) for diﬀerent types DNN layer (e.g., Convolution, Pooling, etc.). (2) Using Branchynet to train DNN models with various exit points, and thus to enable early-exit. Note that the performance profling is infrastructure-dependent, while the DNN training is application-dependent. Thus, given the sets of infrastructures (i.e., mobile devices and edge servers) and applications, the two initializations only need to be done once in an offline manner.

在离线训练阶段，Edgent执行两次初始化：（1）对移动设备和边缘服务器进行分析，以针对不一样类型的DNN层（例如，卷积，池化等）生成基于回归的性能预测模型（第3.2节）。（2）使用Branchynet训练具备不一样退出点的DNN模型，从而实现提早退出。请注意，性能分析依赖于基础结构，而DNN训练则取决于应用程序。所以，给定基础设施（即，移动设备和边缘服务器）和应用程序，两个初始化仅须要以离线方式进行一次。

At online optimization stage, the DNN optimizer selects the best partition point and early-exit point of DNNs to maximize the accuracy while providing performance guarantee on the end-to-end latency, based on the input: (1) the profiled layer latency prediction models and Branchynet trained DNN models with various sizes. (2) the observed available bandwidth between the mobile device and edge server. (3) The pre-defined latency requirement. The optimization algorithm is detailed in Sec. 3.3.

在线优化阶段，DNN优化器选择DNN的最佳分区点和早期退出点，以最大化准确性，同时根据输入提供端到端延迟的性能保证：（1）分析的层延迟预测模型和各类尺寸的Branchynet训练的DNN模型。（2）移动设备和边缘服务器之间观察到的可用带宽。（3）预先肯定的延迟要求。优化算法详见第3.3节。

At co-inference stage, according to the partition and early-exit plan, the edge server will execute the layer before the partition point and the rest will run on the mobile device.

在协同推理阶段，根据划分和提早退出规划，边缘服务器将执行该层划分点以前的层，且其他部分将在移动设备上运行。

3.2 Layer Latency Prediction （层延迟预测）

When estimating the runtime of a DNN, Edgent models the per-layer latency rather than modeling at the granularity of a whole DNN. This greatly reduces the profling overhead since there are very limited classes of layers. By experiments, we observe that the latency of diﬀerent layers is determined by various independent variables (e.g., input data size, output data size) which are summarized in Table 1. Note that we also observe that the DNN model loading time also has an obvious impact on the overall runtime. Therefore, we further take the DNN model size as a input parameter to predict the model loading time. Based on the above inputs of each layer, we establish a regression model to predict the latency of each layer based on its profles. The fnal regression models of some typical layers are shown in Table 2 (size is in bytes and latency is in ms).

在估计DNN的运行时间时，Edgent会对每层的延迟进行建模，而不是以整个DNN为粒度进行建模。这极大地减小了分析开销，由于存在很是有限的层类别。经过实验，咱们观察到不一样层的延迟由各类独立变量（例如，输入数据大小，输出数据大小）决定，如表1所示。注意，咱们还观察到DNN模型的加载时间对总运行时间也有明显的影响。所以，咱们进一步将DNN模型的大小做为输入参数来预测模型的加载时间。基于每层的上述输入，咱们创建回归模型以基于分析预测每一个层的延迟。表2中显示了一些典型层的最终回归模型（大小以字节为单位，延迟以毫秒为单位）。

3.3 Joint Optimization on DNN Partition and DNN Right-Sizing （DNN划分和DNN正确大小调整的协同优化）

At online optimization stage, the DNN optimizer receives the latency requirement from the mobile device, and then searches for the optimal exit point and partition point of the trained branchynet model. The whole process is given in Algorithm 1. For a branchy model with M exit points, we denote that the i-th exit point has Ni layers. Here a layer index i correspond to a more accurate inference model of larger size. We use the above-mentioned regression models to predict EDj the runtime of the j-th layer when it runs on device and ESj the runtime of the j-th layer when it runs on server. Dp is the output of the p-th layer. Under a specific bandwidth B, with the input data Input, then we calcuate Ai,p the whole runtime when the p-th is the partition point of the selected model of i-th exit point. When p = 1, the model will only run on the device then ESp = 0, Dp-1/B = 0, Input/B = 0, and when p = Ni, the model will only run on the server then EDp = 0, Dp-1/B = 0. In this way, we can find out the best partition point having the smallest latency for the model of i-th exit point. Since the model partition does not affect the inference accuracy, we can then sequentially try the DNN inference models with diﬀerent exit points(i.e., with diﬀerent accuracy), and find the one having the largest size and meanwhile satisfying the latency requirement. Note that since regression models for layer latency prediction are trained beforehand, Algorithm 1 mainly involves linear search operations and can be done very fast (no more than 1ms in our experiments) .

在线优化阶段，DNN优化器从移动设备接收延迟要求，而后搜索训练的branchynet模型的最佳出口点和分区点。整个过程在算法1中给出。对于具备M个出口点的分支模型，咱们表示第i个出口点具备N_i层。这里，更大的层索引i对应于更准确的推断模型。咱们使用上面提到的回归模型来预测第j层在设备上运行时的运行时间ED_j，ES_j是它在服务器上运行时运行时间。 D_p是第p层的输出。在特定带宽B下，使用输入数据Input，咱们计算总运行时间A_i,p=_，其中，p是所选模型的划分点，i表示个出口点。当p = 1时，模型将仅在设备上运行，那么ES_p = 0，D_p-1 / B = 0，Input/ B = 0；当p = N_i时，模型将仅在服务器上运行，那么ED_p = 0 ，D_p-1 / B = 0。经过这种方式，咱们能够找到具备最小延迟的最佳分区点，用于第i个出口点的模型。因为模型划分不影响推理精度，咱们能够依次尝试具备不一样出口点的DNN推理模型（即，具备不一样的精度），并找到具备最大尺寸并同时知足延迟要求的模型。请注意，因为预先训练了层延迟预测的回归模型，所以算法1主要涉及线性搜索操做，而且能够很是快速地完成（在咱们的实验中不超过1ms）。

4 EVALUATION （评估）

We now present our preliminary implementation and evaluation results.

如今，咱们给出初步实现和评估结果。

4.1 Prototype （原型）

We have implemented a simple prototype of Edgent to verify the feasibility and efcacy of our idea. To this end, we take a desktop PC to emulate the edge server, which is equipped with a quad-core Intel processor at 3.4 GHz with 8 GB of RAM, and runs the Ubuntu system. We further use Raspberry Pi 3 tiny computer to act as a mobile device. The Raspberry Pi 3 has a quad-core ARM processor at 1.2 GHz with 1 GB of RAM. The available bandwidth between the edge server and the mobile device is controlled by the WonderShaper [10] tool. As for the deep learning framework, we choose Chainer [11] that can well support branchy DNN structures.

咱们已经实现了Edgent的简单原型系统来验证咱们的想法的可行性和有效性。为此，咱们采用台式机PC模拟边缘服务器，该服务器配备了3.4 GHz的四核英特尔处理器和8 GB的RAM，并运行Ubuntu系统。咱们进一步使用Raspberry Pi 3微型计算机充当移动设备。 Raspberry Pi 3具备1.2 GHz的四核ARM处理器和1 GB的RAM。边缘服务器和移动设备之间的可用带宽由WonderShaper [10]工具控制。至于深度学习框架，咱们选择可以很好地支持分支DNN结构的Chainer [11]。

For the branchynet model, based on the standard AlexNet model, we train a branchy AlexNet for image recognition over the large-scale Cifar-10 dataset [8]. The branchy AlexNet has five exit points as showed in Fig. 4 (Sec. 2), each exit point corresponds to a sub-model of the branchy AlexNet. Note that in Fig. 4, we only draw the convolution layers and the fully-connected layers but ignore other layers for ease of illustration. For the five sub-models, the number of layers they each have is 12, 16, 19, 20 and 22, respectively.

对于branchynet模型，基于标准的AlexNet模型，咱们训练了一个分支的AlexNet，用于大规模Cifar-10数据集的图像识别[8]。如图4（第2节）所示，分支的AlexNet具备5个出口点，每一个出口点对应于分支AlexNet的子模型。请注意，在图4中，咱们仅绘制卷积层和彻底链接的层，为了便于说明而忽略其余层。对于这5个子模型，它们各自具备的层数分别为12,16,19,20和22。

For the regression-based latency prediction models for each layer, the independent variables are shown in the Table. 1, and the obtained regression models are shown in Table 2.

对于每层的基于回归预测模型，独立变量显示在表1中，得到的回归模型如表2所示。

4.2 Results （结果）

We deploy the branchynet model on the edge server and the mobile device to evaluate the performance of Edgent. Specifcally, since both the pre-defned latency requirement and the available bandwidth play vital roles in Edgent’s optimization logic, we evaluate the performance of Edgent under various latency requirements and available bandwidth.

咱们在边缘服务器和移动设备上部署了branchynet模型，以评估Edgent的性能。具体而言，因为预先肯定的延迟要求和可用带宽在Edgent的优化逻辑中起着相当重要的做用，所以咱们在各类延迟要求和可用带宽下评估Edgent的性能。

We first investigate the eﬀect of the bandwidth by fixing the latency requirement at 1000ms and varying the bandwidth 50kbps to 1.5Mbps. Fig. 6(a) shows the best partition point and exit point under diﬀerent bandwidth. While the best partition points may ﬂuctuate, we can see that the best exit point gets higher as the bandwidth increases. Meaning that the higher bandwidth leads to higher accuracy. Fig. 6(b) shows that as the bandwidth increases, the model runtime first drops substantially and then ascends suddenly. However, this is reasonable since the accuracy gets better while the latency is still within the latency requirement when increase the bandwidth from 1.2Mbps to 2Mbps. It also shows that our proposed regression-based latency approach can well estimate the actual DNN model runtime latency. We further fix the bandwidth at 500kbps and vary the latency from 100ms to 1000ms. Fig. 6(c) shows the best partition point and exit point under diﬀerent latency requirements. As expected, the best exit point gets higher as the latency requirement increases, meaning that a larger latency goal gives more room for accuracy improvement.

咱们首先经过使用固定的1000ms的延迟要求并将带宽由50kbps到1.5Mbps变化来研究带宽的影响。图6（a）显示了不一样带宽下的最佳划分点和退出点。虽然最佳划分点可能会波动，但咱们能够看到随着带宽的增长，最佳退出点会变得更高，这意味着更高的带宽会带来更高的准确性。图6（b）显示随着带宽的增长，模型运行时间首先大幅降低，而后忽然上升。可是，这是合理的，由于当将带宽从1.2Mbps增长到2Mbps时，准确性变得更好，而延迟仍然在延迟要求内。它还代表咱们提出的基于回归的延迟方法能够很好地估计实际的DNN模型运行时延迟。咱们进一步将带宽固定在500kbps，并将延迟从100ms改成1000ms。图6（c）显示了不一样延迟要求下的最佳划分点和出口点。正如预期的那样，随着延迟需求的增长，最佳出口点会愈来愈高，这意味着更大的延迟目标能够为更高的准确性提供更多空间。

图6：不一样带宽和延迟要求下的结果。

In Fig. 7, under diﬀerent latency requirements, it shows the model accuracy of diﬀerent inference methods. The accuracy is negative if the inference can not satisfy the latency requirement. The network bandwidth is set to 400kbps. Seen in the Fig. 7, at a very low latency requirement (100ms), all four methods can’t satisfy the requirement. As the latency requirement increases, inference by Edgent works earlier than the other methods that at the 200ms to 300ms requirements, by using a small model with a moderate inference accuracy to meet the requirements. The accuracy of the model selected by Edgent gets higher as the latency requirement relaxes.

在图7中，根据不一样的延迟要求，显示了不一样推理方法的模型精度。若是推断不能知足延迟要求，则准确度为负。网络带宽设置为400kbps。如图7所示，在很是低的延迟要求（100ms）下，全部四种方法都不能知足要求。随着延迟要求的增长，Edgent的推理比其余200ms至300ms要求的方法更早结束，经过使用具备中等推理精度的小模型来知足要求。随着延迟要求的放松，Edgent选择的模型的准确性会提升。

图7：不一样延迟需求下的精度比较。