Birkhoff-von Neumann Crossbar 光交换网络的调度方案

时间 2021-02-17

标签 css node 算法缓存网络架构 app dom electron async 栏目系统网络繁體版

原文原文链接

Birkhoff-von Neumann Crossbar 光交换网络的调度方案

Birkhoff-von Neumann Crossbar 光交换网络的调度方案

Network Architecture and Optical Switching Technology

这里首先对 optical data center network 的目标优点与挑战进行一些概述 [1]，然后介绍这一技术中占据较重要地位的光交换机 (我有时也很随意地直接叫它光开关) 技术css

Demands of Optical Data Center Networks

光数据中心网络 (optical data center networks, optical DCNs) 是被用以符合高容量，低延迟，细粒度的 DCN 需求的一种架构，它有望弥补下面列举的几条传统电子方案的缺陷 [1.1]node

In hierarchical tree-like topology based on electronic switch, the scaling issues of ball grid array (BGA) package causing bandwidth limited by application-specific integrated circuits (ASICs) input/output (I/O) bandwidth

Electronic switch achieves higher bandwidth by stacking ASICs in multitier structure, which leads to larger latency and higher cost;

Electronic circuit-based solution is facing a difficulties when large interconnectivity and fast reconfiguration are required;

Storing and transmitting properties of electronic switch have significant power dissipation, and the optical/electronic/optical (O/E/O) conversions and format-dependent interfaces will perform lower power/cost efficiency.

这就成为了故事的开端，能够发现这些困境大致是电子器件的一些内禀性质 (fermion... ... 好麻烦... ...) 所致使的，有一些甚至是不可避免的算法

另外一方面，从下面的表格就能够看出光子学解决方案的“显著”优点，它几乎覆盖了人们对于将来 DCN 的全部幻想，这样，各类光交换机与一些调度算法或许能在这里拥有本身的一片天空... ...缓存

Requirements	Optical Solution
Capacity <bandwidth for more inter-rack & intra-cluster communication>	<wavelength division multiplex (WDM) to provide power per unit bandwidth>
Latency <switch latency from buffering & routing algorithm & arbitration>	<optical packet/burst switching (OBS, OPS) based on fast optical circuit switching (OCS)>
Interconnectivity <support concurrent flows & bandwidth ultilization & service delivery>	<(high-capacity) optical interconnection technology>
Scalability <scale nodes number for future capacity & cost>	<photonic integration may provide scaling without loss of capacity and power/cost>
Flexibility <manage servise delivery & adapt to changing needs>	<software-defined networking (SDN) is used to facilitate highly flexible provisioning>
Power/Cost Efficiency <reduce cost & scale bandwidth>	<data traffic in optical domain, no O/E/O power consumption>

顺带一提，里面涉及到的 OPS/OBS，光分组交换/光突发交换技术能借由突发流的统计复用实现亚波长的带宽粒度，其中，OPS 网络在相同信道中同时/前后传输包头 (package header) 与数据包 (package, $\it10^2\rm\ ns\sim\it10^2\rm\ ms$)，不须要提早预留链接，也有很好的 bandwidth efficiency，适合于 DC 较小数据量的传输，OBS 则借助于大量的突发 (burst, $\it10\sim10^3\rm\ ms$) 与先于其在 (物理上的) 分离信道中发送的块控制数据头 (block control header, BCH) 实现资源预留，保留 OPS 的 flexibility 以及 efficiency 的同时下降了对光交换机和光缓存的要求，由于通常光交换机的 reconfiguration time 须要远小于包/突发的时长网络

做为 optical DCNs 架构的实例，能够参考 Opsquare: A flat DCN architecture based on flow-controlled optical packet switches [4] 以及 Torus-topology Data Center Network based on Optical Packet/Agile Circuit Switching with Intelligent Flow Management [5] 这两种基于 OPS 的架构，它们在当前 ($\it2020\pm$) 各类架构中拥有相对较优的性能，都有将 electronic buffer (EB) 设置在网络链接部分，从而起到在冲突中重传数据包或等待调度指令的做用。两个网络分别是基于 OPS 与 OCS 的 Torus (同时使用了 Deflection, ODL 和 EB) 以及基于 OPS 的 OPSquare，顺带一提，它们在 scalability, flexibility 以及 power/cost 方面也都有不错的表现架构

Optical Switches

由上所述，optical DCNs 的性能很大程度上受制于光交换机的性能。下面是一些比较经常使用，且有可能 (有些已经具有了商业化的能力甚至已经进入市场) 被用于集成的光交换机，它们的 switching time 基本都控制在 $\it1\rm\ ns\sim\it10\rm\ \mu s$ 的范围内 [1.1]app

Switch Type	Switching Time	Scale	Loss
Mach-Zehnder interferometer (MZI, thermo-optic)	$\sim\it10\rm\ \mu s$	$\it32\times32$	High
micro-ring resonator (MRR, thermo-optic)	$\sim\it10\rm\ \mu s$	$\it8\times8$	Fair
PLZT MZI	$\sim\it10\rm\ ns$	$\it8\times8$	High
InP MZI	$\sim\it2.5\rm\ ns$	$\it8\times8$	High
LiNbO₃ MZI	$\sim\it1\rm\ ns$	$\it32\times32$	High
MZI + electro-absorption modulator (EAM)	$<\it10\rm\ ns$	$\it8\times8$	Fair
tunable wavelength converters (TWCs) + arrayed waveguide grating router (AWGR)	$\sim\it10s\rm\ ns$	$\it10s\times10s$	Fair
tunable lasers (TLs) + AWGR	$\sim\it10s\rm\ ns$	$\it100s\times100s$	Low
semiconductor optical amplifier (SOA) multistage	$<\it10\rm\ ns$	$\it16\times16$	Fair

稍慢一些的光交换机如基于微机电系统 (microelectromechanical system, MEMS) 的光交换机没有列在上面，这些光交换机大致能够 (依据其构造原理) 分为以下几类 [1.2]dom

Microelectromechanical system optical switch (基于控制微镜的运动控制，大致分为利用反射机制改变光的传播方向以及利用干涉衍射机制改变光的相位两类，前者易于进行批量制造)electron

特征数值 $T=50\rm\ ms,IL=3.5\ dB$；async

Beam-steering-based optical switch (经过 2D piezoelectric actuators 将束流引导至对应的 $\varphi,\theta$ 位置，并由 integrated position sensor 探测光束指导反馈对束流的优化，具有软件定义网络 (software-defined networking, SDN) 的特性)

Liquid crystal optical switch (利用电压控制液晶 (liquid crystal, LC) 中分子取向实现极化或折射率的变动，有些设计涉及到了双折射及 Liquid crystal on silicon (LCoS) 的技术，受限于 polarization-dependent loss (PDL) 以及较高的串扰)

Electro-optic optical switch (基于电光效应 (electro-optic effect, EO effect，外加电场改变材料折射率，具有 $\rm ns$ 级的超快响应)，结合相位调制和干涉机制设计，采用 LiNbO₃，PLZT 等材料，较高的插损限制了 scalability)

Semiconductor optical amplifier-based switch (这个原理相似于一种开关器件，有外加电流的状况下，入射光就成功引起受激辐射光放大，没有的话也就没有了出射光，以后经过 broadcast-and-select 架构将其排列为光交换机的阵列，可是批量生产时这种设计须要 $N^2$ 个 SOA)

Thermo-optic optical switch (电流控制电极温度，进而影响材料的折射率，这样就能制造一些耦合器件做为开关使用，可是须要一些热光系数高且热导率低的材料，其优点在于不依赖于波长与偏振)

下面表格对这几类光交换机的性能给出了更加详细的评价 [1.2]

Characteristics	MEMS based	Beam-Steering	Liquid crystal	Electro-optic	SOA based	Thermo-optic
Switching speed	$\it10\sim20\rm\ ms$	$\it25\rm\ ms$	$\it100\rm\ ms$	$\mathbf{\sim\ ns}$	$\mathbf{\sim\ ns}$	$\it5\sim10\rm\ ms$
Insertion loss	Medium	Low	High	High	Low	Low
Power efficiency	Medium	Low	High	High	Low	Low
Scalability	Large ($\it320\times320$)	Large ($\it384\times384$)	Medium ($\it1\times20$)	Medium ($\it32\times32$)	Medium ($\it16\times16$)	Small ($\it8\times8$)
Reliability	Low	Low	High	High	Medium	High
Implementation cost	Medium	Medium	Low	High	High	Low

Challenges of Optical DCN Solution

因为当前缺乏有效的光记忆器件，DC 中重要的冲突解决 (contention resolution) 功能不易实现，做为解决方案，fixed fiber delay line (FDL) 能够在时域暂存冲突包/突发，wavelength conversion 方法可以在频域将冲突转发到其它信道，deflection routing 提供在空域将冲突转发到其它端口的方案，可是它们都不可避免地引入了路由控制 (routing control) 以及包同步的问题。另外一方面，将来 DCN 对于光交换机的高速切换以及一些相应的系统控制机制提出了更高的要求，当前的 optical DCNs 在 scaling 方面也受制于光交换机的端口数目，要想在这方面既避免使用层次结构，又保留不错的缓冲控制机制，或许得期待 photonic integration 的后续发展了

这些问题的解决方案更多来自于硬件方面如光器件的性能，但也并不是不能考虑从一些调度算法 (scheduling algorithm) 的软件方面入手寻找思路，关于这一方面，能够直接转到后面的 scheduling algorithm，那里会有一些介绍性的内容，若是想看一个比较完整的算法，请移步 crossbar switches as example (●'◡'●)

Switching Architectures and Scheduling Algorithms

下面将重点放在了关于高速交换技术的一些基础知识，固然也涉及到了一些相关的调度算法

Basic Knowledge on Packet Switching

设计一个交换网络 (基于转发表 (forwarding table)，注意交换和路由是不一样的，前者是对于特定网络节点而言，然后者关注大规模网络中节点间的信息传递)，须要考虑“输出冲突 (output contention)”、“出入端口链接的创建速度 (最好可以具有 self-routing 的能力)”的问题。在这里，首先咱们须要一些定义来描述冲突，关于这点，有内部阻塞以及输出冲突两种类型，分别称做 blocking 和 output contention，与此同时，一些无阻塞的交换网络也是存在的，如 crossbar switch

在以上基础上，定义描述交换网络性能的两个物理量 throughput $X\propto\mathbb{E}[C_{\rm input}]/\mathbb{E}[\tau_{\rm output}]$ 以及 speedup $S=T_{\rm input}/T_{\rm forwarding}$，更高的 speedup 说明网络具有更强的冲突化解能力，对应着更高的 throughput

交换架构能够分为以下几类：

Packet switches
├── Time division switching (TDS)
│	├── Shared medium
│	└── Shared memory
└── Space division switching (SDS)
	├── Single path
	│	├── Crossbar
	│	├── Fully interconnected
	│	└── Banyan
	└── Multipath
		├── Augmented banyan
		├── Clos
		├── Multiplane
		└── Recirculation

其中，TDS 须要内部的通讯带宽大于全部输入端口汇集的带宽，但可以轻松支持多播/广播，shared memory type 相比于 shared medium type 有更高的内存利用率但须要双倍内存速度；SDS 中，crossbar (更多信息见后面的 Crossbar Switches as Example) 是 fully interconnected type 的拓扑学等价，两者具有差很少的特性 (如 $N^2$ 复杂度和非阻塞结构)；而 banyan-based 交换机 (包含 Delta-Omega-Banyan 这三类同构的拓扑) 中，每条信息需经 $\log N$ 的节点以到达终点；Multipath 相比于 single path 具有了更强的容错能力，其中的 multiplane type，recirculation type 还能增大 throughput，但 recirculation 有时会致使信息的顺序错乱

数据缓冲策略有以下几类：

Buffering strategy
├── Shared-memory queuing
├── Output queuing (OQ)
├── Input queuing
├── Virtual output queuing (VOQ)
└── Combined input and output queuing (CIOQ)
.(Crosspoint queuing? place incoming cells in crosspoing buffer (XB))

其中，shared-memory queuing structure 经过输出端对于内存的共享实现了内存的最大利用率，但面临尺寸的限制；而 OQ 在以上缺点的基础上，又失去了内存利用率的优点 ε=ε=ε=┏(゜ロ゜;)┛，在服务质量 (Quality of Service, QoS) 的控制上有了一些提升；相比于这两个， input queuing 不受限于尺寸，但带来了线段阻塞 (head-of-line blocking, HOL blocking，尽管输出端口空闲，数据却被堵在了输入队列) 的问题，这使得它的 throughput 稍低一些 ($\it58.6\%$)；为解决 HOL blocking，VOQ 中的每一个输入缓冲被分红 $N$ 份，这个多是当下的研究热点 ($\it 2020\pm$)，其调度算法比前几个要复杂；CIOQ 则是另外一种用于解决 HOL 的策略，它经过交换结构的 speedup 换取更高的 throughput

大规模 (多模块) 交换结构分为以下两种类型

Large-scale switch
├── Single-stage switch
│	└── Parallel packet switch (PPS)
└── Multistage switch
	├── Partially connected multistage network
	│	└── Banyan-based switch
	└── Fully connected multistage network
		└── Clos-network switch

其中，PPS 有最大的 throughput，其性能随 speedup 的大小而变化，可是 single-stage 不免没法支持庞大的端口数目；banyan-based network 和 Clos network 做为 multistage 则避免了上述缺陷，banyan network ($N\times N$ network, $\log_bN$ stages, $b\times b$ switching elements) 也支持 self-routing，但会带来内部阻塞，交换机数目的增大将致使其性能骤减；Clos network ($N\times N$ network, $\it3$ stages, $n$ input lines, $\it m\geq2n-1$ middle-stage switch modules, maximum number of crosspoints $N_x=O(N^{\it3/2})$) 尽管内部全连通，但须要相应 (复杂) 的调整才可以使网络具有非阻塞性质，但即使如此也没法避免内部阻塞的发生，解决方案通常只能是粗暴地增长内部链接数目和带宽了... ...

Performance Analysis

了解了一些基本交换结构后，下面利用一点基础的排队论知识给出 input-buffered switches, output-buffered switches 以及 completely shared-buffer switches 这三种交换策略的性能分析 [2]

注意，这里采用的交通源模型是上图所示的几何分布流，因而负载 $\rho$ 就是每一个时间间隔内信源处于激活状态的占比

\[\begin{equation} \begin{split} \rho=\frac{\beta=E[X]}{\beta=E[X]+\alpha=E[Y]}=\frac{\sum_{i=1}^\infty i{\rm Pr}\{X=i\}}{\sum_{i=1}^\infty i{\rm Pr}\{X=i\}+\sum_{j=1}^\infty j{\rm Pr}\{Y=j\}}=\frac{q}{q+p-pq}, \end{split} \end{equation} \]

Input-Buffered Switches

在有 HOL blocking 存在的情形下，老是会有信息传输延迟，于是这种方案没法得到 $\it100\%$ throughput，倘若将信源定为无记忆型 (Markov)，每一个端口就至关于一个 $\it M/D/1$ 排队模型，稳态下应有每段时间内传输的信息单元数 $\bar{F}$ 与目标为 $i$ 端口的，每段时间后仍滞留的信息单元数 $\bar{B}^i$ 知足以下关系

\[\begin{equation} \begin{split} \bar{F}=\sum_{i=1}^N\bar{B}^i, \end{split} \end{equation} \]

又注意到 $\bar{B}^i=\rho_0^2/[2(1-\rho_0)]$，$\rho_0$ 是归一化 throughput，能够由上式获得 $\bar{B}^i$ 的另外一种表示方式

\[\begin{equation} \begin{split} \bar{B}^i=\frac1N\sum_{i=1}^N\bar{B}^i=1-\frac{\bar{F}}{N}=1-\rho_0=\frac{\rho_0^2}{2(1-\rho_0)}, \end{split} \end{equation} \]

由此可得 $\rho_0=(2-\sqrt2)\approx0.586$，这也就是 input-buffered switches 在这种信源下的渐进吞吐量了，换句话说，一旦输入速率大于这个值，系统就会饱和

下面借由基于上述几何分布的 $\it Geom/G/1$ 模型考察这种调度方式的等待时间，可是须要首先接受“到达每一个端口的信息单元遵循独同分布以及拥有相同几率 $p$”以及“每一个单位有相同概率选取任意输出端口为目标”这两个假设，再借助 $\it Geom/G/1$ 模型的结论，就能够知道平均等待时间为 (用到了服务时间 $S$，对于每段时间而言是一个随机变量)

\[\begin{equation} \begin{split} \overline{W}=\frac{p\overline{S(S-1)}}{2(1-p\bar{S})}+\bar{S}-1, \end{split} \end{equation} \]

显然，它的渐进形式仍是蛮可怜的，随着 $N\to\infty$，它也以一个很快的速率趋近于 $\infty$ [2]

Output-Buffered Switches

这里 $N\to\infty$ 将致使信息单元到达输出端口的数目遵循指数分布 $p^ke^{-p}/k!$，其中的 $p$ 即为输入负载，由此能够马上获得所谓信息单元丢失率 (若服务率/吞吐量没法知足输入负载的需求，则在输出端口的信息累积会致使溢出 output buffer 而丢失) 为 $1-\rho_0/p$，举个例子，对于 $\it80\%$ 的负载，须要 buffer 大小大于 $\it28$ 才能保证几乎无丢失

咱们能够运用 little's law 来求得平均等待时间，也即，平均到达数 $\bar{Q}$ 等于平均等待时间 $\overline{W}$ 乘以服务率 $\rho_0$，对于 $N\to\infty,b\to\infty$，有

\[\begin{equation} \begin{split} \overline{W}=\frac{p}{2(1-p)}, \end{split} \end{equation} \]

Completely Shared-Buffered Switches

这里的全部信息单元都储存在同一个 buffer 中，也就是说输入与输出共享 buffer，它的性能显然会稍高于前两种方案，可是彷佛只能靠一些数值模拟的方式获知它的具体的表现，与 output-buffered switches 不一样的是，其丢失率随 buffer 大小变化呈现非线性 (buffer 越大，丢失率以越大的斜率迅速减少)，而前者是呈 (近似) 线性的恒定速率减少的，也就是说这里可使用较小的 buffer 达到 $\it100\%$ throughput

Scheduling Algorithms

刚刚已经屡次提到调度算法的由来了，简单归纳一下，也就是说

首先因为日益增长的，对于高线速，高端口数的大规模交换机的需求，人们再也不知足于 output buffer 的内存高利用率，由于其不具有这种 scalability ；

可是一旦开始使用 input buffer 或 combined-input-output-buffered 策略，就不可避免地须要一些增大 throughput 的方法 (提升交换机速度或增大路径数以解决 HOL blocking + 仲裁机制以解决冲突问题)；

应运而生的就是这些调度算法和路由技术了... ...

一个有不错性能的调度算法应具有以下特性

Efficiency，有高 throughput 同时也能有较低的 delay，对每一个时段选取最多边的二部图匹配；

Fairness，可以避免单个 VOQ 的过分调用；

Stability，每一个 VOQ 的占有数目都为有限值；

Implementation complexity，易于在硬件以较高速度实现

固然这里有不少算法都能符合要求，因而这里就直截了当地介绍一些适用于 input-buffered switches 的调度算法，也就是下面这些，它们不是所有，并且我也不会把每一个都解释清楚，总之先对这几种算法有个大体印象叭...

二部图最大权匹配 (maximum weight matching，MWM, $O(N^3)$)，是我以为最直接的一种方法，也即

\[\begin{equation} \begin{split} \max_{w_{ij}}\sum_{e_{(i,j)\in M}}w_{ij}, \end{split} \end{equation} \]
包含如 LQF (longest queue first, $w_{ij}(t)\leftarrow L_{ij}(t)$)，OCF (oldest cell first, $w_{ij}(t)\leftarrow \tau_{ij_{\rm HOL}}(t)$) 和 LPF (longest port first, $w_{ij}(t)\leftarrow\delta[L_{ij}(t)]\times[R_{i}(t)+C_{j}(t)]$) 的一系列算法

由它产生的有近似最大权 (1-APRX，牺牲了一些稳定性以换取低一些的复杂度) 和最大尺寸匹配 (maximum size matching, MSM, $O(N^{2.5})$)，是当边权为 $\it1$ 时的 MWM

极大匹配 (maximal matching)，对于 speedup $N$，老是能够由最多 $N$ 次迭代求出匹配结果，复杂度通常比较大

在其中已经有了如 parallel iterative matching (PIM), iterative round-robin matching (iRRM), iterative round-robin with SLIP (iSLIP), FIRM, dual round-robin matching (DRRM), pipelined maximal matching 和 exhaustive service dual round-robin matching (EDRRM) 等算法

随机匹配 (randomized matching)，尽管不必定一直选到最好的，但却有很高的稳定性和线性/对数复杂度

包含如 De-randomized, SERENA, HE-$i$SLIP 等算法

帧匹配 (frame-based matching)，为适应光交换的环境，采用多个 cells 合成单帧进行调度以提升效率，性能高开销低

包含如 reducing the reconfiguration frequency (exact covering, minimum switching, double), fixed-size frame-based matching (frame-based MWM, frame-based maximal weight matching, frame-based multiple iteration weight matching) 和 asynchronous variable-size frame-based matching (EDRRM) 等匹配算法

Speedup 下的稳定匹配，可以保证已匹配的端口具备足够高的优先级

包含如 most-urgent-cell-first algorithm (MUCFA), critical cell first (CCF), last-in-highest-priority (LIHP) 和 lowest-output-occupancy-cell-first (LOOFA) 等匹配算法

上面涉及到的符号都是网络科学中的惯常了，应该没啥理解误差了，在知足必定条件的情形下，上述的各类调度算法均可以达到 $\it100\%$ 的 throughput

Crossbar Switches as Example

下面首先介绍一下 input buffered crossbar switches [3] 的基本结构，而后以一种适用于其的调度算法 (这种算法基于双随机矩阵的 Birkhoff-von Neumann 定理，且不须要 framing 以及 internal speedup) 做为例子来熟悉这类交换结构的功能

如上图所示 (是一个 $\it4\times4$ 的 crossbar 的结构示意图，以及它的基本冲突形式，这种结构显然很简单，且易于对其进行模块化与控制)，所谓 crossbar 就是由 $N\times N$ 个全交叉的链接组成的一个阵列 (i.e.，共 $\it2N$ 个 bars，一种 single path 的 SDS)，它们之间的链接在每个时间均可以被一个置换矩阵 (permutation matrix) $P_k$ 描述，从 $i$ 端口到 $j$ 端口的 requested rate 为 $R=(r_{ij})$ 的矩阵元，若保证网络中没有超额订阅 (overbooking)，则须使非负矩阵 $R$ 为双随机的 (doubly stochastic) $$ \begin{equation} \begin{split} \left\{ \begin{split} \sum_{i=1}^Nr_{ij}\leq1,\ \ \forall j,\\ \sum_{j=1}^Nr_{ij}\leq1,\ \ \forall i, \end{split} \right. \end{split} \end{equation} $$ 能够得知，只要可以获得一组正数 (做为权值) $\phi_k$ (固然，知足 $\sum_{k=1}^K\phi_k=1$ 且 $\it K\leq N^2-2N+2$) 与 $P_k$ 的组合，使得如下分解不等式成立，便可直接将每个 $P_k$ 设置为与其对应的 $\phi_k$ 成正比的形式 $$ \begin{equation} \begin{split} R\leq\sum_{k=1}^K\phi_kP_k, \end{split} \end{equation} $$

这种双随机矩阵的分解恰好对应着所谓 Birkhoff-von Neumann 算法，其正规表述见 付録┍1┙，分解步骤能够伪代码的形式表示以下，注意，这里涉及到共两个步骤，首先是寻找更大元素的双随机矩阵 (von Neumann)，然后对这个双随机矩阵进行分解，得到对应的正权与置换矩阵乘积之和的形式 (Birkhoff)，它们的复杂度都在下面标示出来了

$\mathbf{Algorithm}$:$\mathbf{\ Birkhoff}$-$\mathbf{von\ Neumann}$
$\mathbf{Input}$: the double stochastic matrix ${\color{#ff6347}{\boldsymbol{R}}}=[r_{ij}]$
$\mathbf{Output}$: the set/vector of positive numbers and permutation matrices $\{{\color{#c71585}{\boldsymbol{\Phi}}}=[{\color{#c71585}{\boldsymbol{\phi}}}_k],{\color{#c71585}{\boldsymbol{\mathscr{P}}}}=[{\color{#c71585}{\boldsymbol{P}}}_k]\}$

// find doubly stochastic matrix $\tilde{{\color{#ff8c00}{\boldsymbol{R}}}}=[\tilde{r}_{ij}],{\text{s.t.}}\forall i,j,\tilde{r}_{ij}\geq{r}_{ij}$, with total complexity of $O(N^3)$
$\tilde{{\color{#ff8c00}{\boldsymbol{R}}}}\leftarrow{\color{#ff6347}{\boldsymbol{R}}}$;
// with complexity of $O(2N-1)\sim O(N)$
$\color{#3cb371}{\mathbf{REPEAT}}$
| // with complexity of $O(N^2)$
| $[i,j]\leftarrow{\color{#696969}{\mathbf{find}}}(\tilde{{\color{#ff8c00}{\boldsymbol{R}}}},{\color{#696969}{\mathbf{bind}}}(\sum_nr_{in}<1,\sum_mr_{mj}<1))$;
| $\epsilon\leftarrow1-\max[\sum_nr_{in},\sum_mr_{mj}]$;
| $r_{ij}\leftarrow\epsilon+r_{ij}$;
| ${\color{#696969}{\mathbf{update}}}(\tilde{{\color{#ff8c00}{\boldsymbol{R}}}})$;
$\color{#3cb371}{\mathbf{UNTIL}}$ $\sum_{n,m}r_{nm}=N$;

// find set of positive numbers ${\color{#c71585}{\boldsymbol{\phi}}}_k>0$ with permutation matrices ${\color{#c71585}{\boldsymbol{P}}}_k,{\rm\ st.\ }\tilde{{\color{#ff8c00}{\boldsymbol{R}}}}=\sum_k{\color{#c71585}{\boldsymbol{\phi}}}_k{\color{#c71585}{\boldsymbol{P}}}_k$, with total complexity of $O(N^{4.5})$, which further means $\sum_{k=1}^K{\color{#c71585}{\boldsymbol{\phi}}}_k=1$, doubly stochastic matrices are within the convex hull of permutation matrices
$[i_1,i_2,\cdots,i_N]\leftarrow{\color{#696969}{\mathbf{find}}}(\tilde{{\color{#ff8c00}{\boldsymbol{R}}}},{\color{#696969}{\mathbf{bind}}}(\prod_{k=1}^N\tilde{r}_{ki_k}>0))$;
$\phi'\leftarrow\min_{1\leq k\leq N}[\tilde{r}_{ki_k}]$;
// the permutation matrix of $[i_1,i_2,\cdots,i_N]$
$s_{perm}\leftarrow{\color{#696969}{\mathbf{to\_perm}}}([i_1,i_2,\cdots,i_N])$;
$R'\leftarrow\tilde{{\color{#ff8c00}{\boldsymbol{R}}}}-\phi's_{perm}$;
// with complexity of $O(N^2-2N+2)\sim O(N^2)$
$\color{#2e8b57}{\mathbf{FOR}}$ ${\color{#696969}{\mathbf{true}}}$
| $\color{#6b8e23}{\mathbf{IF}}$ $\phi'<1$
| | // with complexity improved to $O(N^{2.5})$
| | $[i_1,i_2,\cdots,i_N]\leftarrow{\color{#696969}{\mathbf{find}}}(R',{\color{#696969}{\mathbf{bind}}}(\prod_{k=1}^N\tilde{r}_{ki_k}>0))$;
| | $R'\leftarrow R'/(1-\phi')$;
| | $\phi'\leftarrow\min_{1\leq k\leq N}[\tilde{r}_{ki_k}]$;
| | $\mathbf{s}_{perm}\leftarrow{\color{#696969}{\mathbf{to\_perm}}}([i_1,i_2,\cdots,i_N])$;
| | $R'\leftarrow R'-\phi's_{perm}$;
| $\color{#6b8e23}{\mathbf{ELSE}}$
| | $\color{#6b8e23}{\mathbf{BREAK}}$;
| $\color{#6b8e23}{\mathbf{END}}$
| $[{\color{#c71585}{\boldsymbol{\Phi}}},{\color{#c71585}{\boldsymbol{\mathscr{P}}}}].{\color{#696969}{\mathbf{emplaced\_back}}}[\phi',R']$;
$\color{#2e8b57}{\mathbf{END}}$

里面的 ${\color{#2e8b57}{\mathbf{FOR}}}$-${\color{#6b8e23}{\mathbf{IF}}}$ loop 也能够直接表示为一个带条件的 ${\color{#2e8b57}{\mathbf{FOR}}}$ loop，只是我本身以为这样写会稍微清楚一些。这样子获得了分解后，就能够进一步在每一轮仲裁中，都以每个分解后的置换矩阵对应的正数的倒数做为此交换的虚完成时间 $F_k^\ell=1/\phi_k$，做为一种方案，这里将全部 $F_k^\ell,k=1,\cdots,K$ 从小到大排序，每次选最小的那个做为这轮的链接方式，若这一轮选了 $F_k^\ell$，那么下一轮的 $F_k^{\ell+1}=F_k^\ell+1/\phi_k$ 就行了，这个复杂度显然能够由 $K\leq N^2-2N+2$ 看出，是 $O(\log N)$ (这种方法算是 packetized generalized processor sharing, PGPS 算法的简化版本)

为了获得一种既考虑 efficiency 又顾及 fairness 的调度方案，上面用到的速率矩阵 $R$ 须要仔细地进行测量，这显然是一个优化问题 (resource sharing problem)，首先记 guaranteed-rate services 为 $R^g=(r_{ij}^g)$，再如下面的 online measurement 方法迭代求得第 $n$ 次测量周期下的 demand rate $r_{ij}^e(n)$

\[\begin{equation} \begin{split} &r_{ij}^e(n+2)=(1-\alpha(n))r_{ij}^e(n+1)+\alpha(n)\bigg[\frac{A_{ij}(nT)-A_{ij}((n-1)T)}T\bigg],\\ &r_{ij}^e(0)=r_{ij}^e(1)=\frac1N, \end{split} \end{equation} \]

里面的 $\alpha(n)$ 算是可调整占比的常数，若是网络流变化很迅速的话，适当调高 $\alpha(n)$ 有利于调度算法快速适应变化，$T$ 为测量周期，$A_{ij}$ 为累计到达量。能够看出，咱们所须要的最优解必须介于 demand rate $r_{ij}^e$ 与 guaranteed-rate $r_{ij}^g$ 之间，把它看成一种约束，这就转化为一个对于效益函数 $U(\cdot)$ 的优化问题

\[\begin{equation} \begin{split} &\max\ \ \ \ \ \ \ &U(r_{ij},i,j=1,\cdots,N)\sim\sum_{i=1}^Nr_{ij}\sim\text{fairness},\\ &\text{s.t.}& \left\{ \begin{split} &r_{ij}^g\leq r_{ij}\leq r_{ij}^e,\ \ &i,j=1,2,\cdots,N,\\ &\sum_{i=1}^Nr_{ij}\leq1,&j=1,2,\cdots,N,\\ &\sum_{j=1}^Nr_{ij}\leq1,&i=1,2,\cdots,N, \end{split} \right. \end{split} \end{equation} \]

效益函数的不一样取值就表明了 efficiency 与 fairness，前者对应一个最大流问题，然后者能够用水漫法 (water filling) 进行求解，这就完成了整个调度算法的流程，据仿真结果显示 [3] 它可以达到 $\it100\%$ 的 throughput，只是平均队列长度要稍微高于 output buffered switch 的模拟值

这种 crossbar 结构也能够经过多层网络的链接而得到 scalability，基本思路就是将每个 crossbar 模块看做一个单元，用网络拓扑将它们链接起来，例如能够用 Banyan 拓扑将 $\it4\times4$ 的 Birkhoff-von Neumann crossbar switches链接成一个二层网络，然而这种结构具备三条关于 no overbooking 的约束 (前面只有两个)，其速率受到限制，为此能够再加一层网络，这至关于使用十二个 $\it4\times4$ 的置换矩阵来表示一个 $\it16\times16$ 的置换矩阵，在这里的调度算法就须要先求解优化问题获得 allocated rate matrix $R$，只不过这个 $R$ 应该是一个 $\it16\times16$ 的矩阵，对它做分解获得 $\it16\times16$ 的 $K$ 个置换矩阵们，再以 Slepian-Duguid 算法将它们分解为 $\it12K$ 个 $\it4\times4$ 的置换矩阵，将它们反过来合成 $\it1\sim12$ crossbar 对应的 $R$ 阵，最后用 Birkhoff-von Neumann 进行分解求得调度权值进行调度便可

[0.0] 面号：[OPTSxNaN]

Too basic to have NN... ...

[1] Optical switching in next generation data centers[M]. Springer, 2017.

[1.1] Calabretta N, Miao W. Optical switching in data centers: Architectures based on optical packet/burst switching[M]//Optical Switching in Next Generation Data Centers. Springer, Cham, 2018: 45-69.

[1.2] Huang Q. Commercial Optical Switches[M]//Optical Switching in Next Generation Data Centers. Springer, Cham, 2018: 203-219.

[2] Chao H J, Liu B. High performance switches and routers[M]. John Wiley & Sons, 2007.

[3] Chang C S, Chen W J, Huang H Y. Birkhoff-von Neumann input buffered crossbar switches[C]//Proceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies (Cat. No. 00CH37064). IEEE, 2000, 3: 1614-1623.

[4] Yan F, Miao W, Raz O, et al. Opsquare: A flat DCN architecture based on flow-controlled optical packet switches[J]. IEEE/OSA Journal of Optical Communications and Networking, 2017, 9(4): 291-303.

[5] Kitayama K I, Huang Y C, Yoshida Y, et al. Torus-topology data center network based on optical packet/agile circuit switching with intelligent flow management[J]. Journal of Lightwave Technology, 2015, 33(5): 1063-1071.

[6] https://math.stackexchange.com/questions/47175/does-birkhoff-von-neumann-imply-any-of-the-fundamental-theorems-in-combinatori

付録┍1┙

这里稍微严格一点写一下先前提到的 Birkhoff-von Neumann 定理，首先是文字表述，也即双随机矩阵都包含于置换矩阵的凸包 (convex hull)，反之亦然，那么如今记一个双随机矩阵 $R=[r_{ij}]$，则定理表述为

\[\begin{equation} \begin{split} R\in\bigcup_{K=1}^{n!}\bigg\{\sum_{k=1}^K\phi_kP_k:\sum_{k=1}^K\phi_k=1,\alpha_k\in\R_+,P_k\in S_n\bigg\}\in{\rm co}(S_n), \end{split} \end{equation} \]

不失广泛性，能够令 $M\leq n^2-n+1$，其证实过程用到 Frobenius‐König 定理 [6] ，思路以下

对于 $\boxplus(R)=|\{(i,j):r_{ij}\neq0\}|$ 进行概括，$\boxplus(R)=n$ 对应着 $R$ 置换矩阵的情形，此时若假设以上命题对 $\boxplus(\cdot)=N$ 成立，考虑 $\boxplus(R)=N>n$，下列条件知足

\[\begin{equation} \begin{split} \forall1\leq i\leq n,\mathcal{I}\in\bigg(\begin{matrix}[n]\\r\end{matrix}\bigg),\mathcal{J}\in\bigg(\begin{matrix}[n]\\s\end{matrix}\bigg),{\ \text{s.t.}\ }r+s=n+1\to R[\mathcal{I},\mathcal{J}]\neq\mathbf{0}_{r\times s}, \end{split} \end{equation} \]

则由 Frobenius‐König 定理，总能取置换 $\tau\in S_n{\ \text{s.t.}\ }R[i,\tau(i)]\neq0$，这里记 $\phi=\min_{1\leq i\leq n}R[i,\tau(i)]\in(0,1)$，则有分解 $R=\phi\tau+(1-\phi)R'$，其中的 $R'=(1-\phi)^{-1}(R-\alpha\tau)$ 仍为双随机矩阵，又有 $\boxplus(R')<\boxplus(R)=N$，由概括假设得证 $\blacksquare$

Switch Type	Switching Time	Scale	Loss
Mach-Zehnder interferometer (MZI, thermo-optic)	\(\sim\it10\rm\ \mu s\)	\(\it32\times32\)	High
micro-ring resonator (MRR, thermo-optic)	\(\sim\it10\rm\ \mu s\)	\(\it8\times8\)	Fair
PLZT MZI	\(\sim\it10\rm\ ns\)	\(\it8\times8\)	High
InP MZI	\(\sim\it2.5\rm\ ns\)	\(\it8\times8\)	High
LiNbO₃ MZI	\(\sim\it1\rm\ ns\)	\(\it32\times32\)	High
MZI + electro-absorption modulator (EAM)	\(<\it10\rm\ ns\)	\(\it8\times8\)	Fair
tunable wavelength converters (TWCs) + arrayed waveguide grating router (AWGR)	\(\sim\it10s\rm\ ns\)	\(\it10s\times10s\)	Fair
tunable lasers (TLs) + AWGR	\(\sim\it10s\rm\ ns\)	\(\it100s\times100s\)	Low
semiconductor optical amplifier (SOA) multistage	\(<\it10\rm\ ns\)	\(\it16\times16\)	Fair

Characteristics	MEMS based	Beam-Steering	Liquid crystal	Electro-optic	SOA based	Thermo-optic
Switching speed	\(\it10\sim20\rm\ ms\)	\(\it25\rm\ ms\)	\(\it100\rm\ ms\)	\(\mathbf{\sim\ ns}\)	\(\mathbf{\sim\ ns}\)	\(\it5\sim10\rm\ ms\)
Insertion loss	Medium	Low	High	High	Low	Low
Power efficiency	Medium	Low	High	High	Low	Low
Scalability	Large (\(\it320\times320\))	Large (\(\it384\times384\))	Medium (\(\it1\times20\))	Medium (\(\it32\times32\))	Medium (\(\it16\times16\))	Small (\(\it8\times8\))
Reliability	Low	Low	High	High	Medium	High
Implementation cost	Medium	Medium	Low	High	High	Low