High-Performance Server Architecture

时间 2019-12-20

标签 high performance server architecture 繁體版

原文原文链接

http://www.cppblog.com/isware/archive/2011/07/19/151390.htmlhtml

http://pl.atyp.us/content/tech/servers.htmlnode

引言

本文将与你分享我多年来在服务器开发方面的一些经验。对于这里所说的服务器，更精确的定义应该是每秒处理大量离散消息或者请求的服务程序，网络服务器更符合这种状况，但并不是全部的网络程序都是严格意义上的服务器。使用“高性能请求处理程序”是一个很糟糕的标题，为了叙述起来简单，下面将简称为“服务器”。python

本文不会涉及到多任务应用程序，在单个程序里同时处理多个任务如今已经很常见。好比你的浏览器可能就在作一些并行处理，可是这类并行程序设计没有多大挑战性。真正的挑战出如今服务器的架构设计对性能产生制约时，如何经过改善架构来提高系统性能。对于在拥有上G内存和G赫兹CPU上运行的浏览器来讲，经过DSL进行多个并发下载任务不会有如此的挑战性。这里，应用的焦点不在于经过吸管小口吮吸，而是如何经过水龙头大口畅饮，这里麻烦是如何解决在硬件性能的制约.(做者的意思应该是怎么经过网络硬件的改善来增大流量)react

一些人可能会对个人某些观点和建议发出置疑，或者自认为有更好的方法, 这是没法避免的。在本文中我不想扮演上帝的角色；这里所谈论的是我本身的一些经验，这些经验对我来讲, 不只在提升服务器性能上有效，并且在下降调试困难度和增长系统的可扩展性上也有做用。可是对某些人的系统可能会有所不一样。若是有其它更适合于你的方法，那实在是很不错. 可是值得注意的是，对本文中所提出的每一条建议的其它一些可替代方案，我通过实验得出的结论都是悲观的。你本身的小聪明在这些实验中或许有更好的表现，可是若是所以怂恿我在这里建议读者这么作，可能会引发无辜读者的反感。你并不想惹怒读者，对吧？ios

本文的其他部分将主要说明影响服务器性能的四大杀手:nginx

1) 数据拷贝（Data Copies）git

2) 环境切换（Context Switches）github

3) 内存分配（Memory allocation）web

4) 锁竞争（Lock contention）算法

在文章结尾部分还会提出其它一些比较重要的因素，可是上面的四点是主要因素。若是服务器在处理大部分请求时可以作到没有数据拷贝，没有环境切换，没有内存分配，没有锁竞争，那么我敢保证你的服务器的性能必定很出色。

数据拷贝（Data Copies）

本节会有点短，由于大多数人在数据拷贝上吸收过教训。几乎每一个人都知道产生数据拷贝是不对的，这点是显而易见的，在你的职业生涯中, 你很早就会见识过它；并且遇到过这个问题，由于10年前就有人开始说这个词。对我来讲确实如此。现今，几乎每一个大学课程和几乎全部how-to文档中都提到了它。甚至在某些商业宣传册中，"零拷贝" 都是个流行用语。

尽管数据拷贝的坏处显而易见，可是仍是会有人忽视它。由于产生数据拷贝的代码经常隐藏很深且带有假装，你知道你所调用的库或驱动的代码会进行数据拷贝吗？答案每每超出想象。猜猜“程序I/O”在计算机上到底指什么？哈希函数是假装的数据拷贝的例子，它有带拷贝的内存访问消耗和更多的计算。曾经指出哈希算法是一种有效的“拷贝+”彷佛可以被避免，但据我所知，有一些很是聪明的人说过要作到这一点是至关困难的。若是想真正去除数据拷贝，无论是由于影响了服务器性能，仍是想在黑客大会上展现"零复制”技术，你必须本身跟踪可能发生数据拷贝的全部地方，而不是轻信宣传。

有一种能够避免数据拷贝的方法是使用buffer的描述符（或者buffer chains的描述符）来取代直接使用buffer指针，每一个buffer描述符应该由如下元素组成：

l 一个指向buffer的指针和整个buffer的长度

l 一个指向buffer中真实数据的指针和真实数据的长度，或者长度的偏移

l 以双向链表的形式提供指向其它buffer的指针

l 一个引用计数

如今，代码能够简单的在相应的描述符上增长引用计数来代替内存中数据的拷贝。这种作法在某些条件下表现的至关好，包括在典型的网络协议栈的操做上，但有些状况下这作法也使人很头大。通常来讲，在buffer chains的开头和结尾增长buffer很容易，对整个buffer增长引用计数，以及对buffer chains的即刻释放也很容易。在chains的中间增长buffer，一块一块的释放buffer，或者对部分buffer增长引用技术则比较困难。而分割，组合chains会让人立马崩溃。

我不建议在任何状况下都使用这种技术，由于当你想在链上搜索你想要的一个块时，就不得不遍历一遍描述符链，这甚至比数据拷贝更糟糕。最适用这种技术地方是在程序中大的数据块上，这些大数据块应该按照上面所说的那样独立的分配描述符，以免发生拷贝，也能避免影响服务器其它部分的工做.(大数据块拷贝很消耗CPU,会影响其它并发线程的运行)。

关于数据拷贝最后要指出的是：在避免数据拷贝时不要走极端。我看到过太多的代码为了不数据拷贝，最后结果反而比拷贝数据更糟糕，好比产生环境切换或者一个大的I/O请求被分解了。数据拷贝是昂贵的，可是在避免它时，是收益递减的(意思是作过头了，效果反而很差)。为了除去最后少许的数据拷贝而改变代码，继而让代码复杂度翻番，不如把时间花在其它方面。

上下文切换（Context Switches）

相对于数据拷贝影响的明显，很是多的人会忽视了上下文切换对性能的影响。在个人经验里，比起数据拷贝，上下文切换是让高负载应用完全完蛋的真正杀手。系统更多的时间都花费在线程切换上，而不是花在真正作有用工做的线程上。使人惊奇的是，（和数据拷贝相比）在同一个水平上，致使上下文切换缘由老是更常见。引发环境切换的第一个缘由每每是活跃线程数比CPU个数多。随着活跃线程数相对于CPU个数的增长，上下文切换的次数也在增长，若是你够幸运，这种增加是线性的，但更常见是指数增加。这个简单的事实解释了为何每一个链接一个线程的多线程设计的可伸缩性更差。对于一个可伸缩性的系统来讲，限制活跃线程数少于或等于CPU个数是更有实际意义的方案。曾经这种方案的一个变种是只使用一个活跃线程，虽然这种方案避免了环境争用，同时也避免了锁，但它不能有效利用多CPU在增长总吞吐量上的价值，所以除非程序无CPU限制（non-CPU-bound），(一般是网络I/O限制 network-I/O-bound)，应该继续使用更实际的方案。

一个有适量线程的程序首先要考虑的事情是规划出如何建立一个线程去管理多链接。这一般意味着前置一个select/poll, 异步I/O，信号或者完成端口，然后台使用一个事件驱动的程序框架。关于哪一种前置API是最好的有不少争论。 Dan Kegel的C10K在这个领域是一篇不错的论文。我的认为，select/poll和信号一般是一种丑陋的方案，所以我更倾向于使用AIO或者完成端口，可是实际上它并不会好太多。也许除了select()，它们都还不错。因此不要花太多精力去探索前置系统最外层内部到底发生了什么。

对于最简单的多线程事件驱动服务器的概念模型, 其内部有一个请求缓存队列，客户端请求被一个或者多个监听线程获取后放到队列里，而后一个或者多个工做线程从队列里面取出请求并处理。从概念上来讲，这是一个很好的模型，有不少用这种方式来实现他们的代码。这会产生什么问题吗？引发环境切换的第二个缘由是把对请求的处理从一个线程转移到另外一个线程。有些人甚至把对请求的回应又切换回最初的线程去作，这真是雪上加霜，由于每个请求至少引发了2次环境切换。把一个请求从监听线程转换到成工做线程，又转换回监听线程的过程当中，使用一种“平滑”的方法来避免环境切换是很是重要的。此时，是否把链接请求分配到多个线程，或者让全部线程依次做为监听线程来服务每一个链接请求，反而不重要了。

即便在未来,也不可能有办法知道在服务器中同一时刻会有多少激活线程.毕竟，每时每刻均可能有请求从任意链接发送过来，一些进行特殊任务的“后台”线程也会在任意时刻被唤醒。那么若是你不知道当前有多少线程是激活的，又怎么可以限制激活线程的数量呢？根据个人经验，最简单同时也是最有效的方法之一是：用一个老式的带计数的信号量，每个线程执行的时候就先持有信号量。若是信号量已经到了最大值，那些处于监听模式的线程被唤醒的时候可能会有一次额外的环境切换,(监听线程被唤醒是由于有链接请求到来, 此时监听线程持有信号量时发现信号量已满,因此即刻休眠), 接着它就会被阻塞在这个信号量上，一旦全部监听模式的线程都这样阻塞住了，那么它们就不会再竞争资源了，直到其中一个线程释放信号量，这样环境切换对系统的影响就能够忽略不计。更主要的是，这种方法使大部分时间处于休眠状态的线程避免在激活线程数中占用一个位置，这种方式比其它的替代方案更优雅。

一旦处理请求的过程被分红两个阶段(监听和工做)，那么更进一步，这些处理过程在未来被分红更多的阶段(更多的线程)就是很天然的事了。最简单的状况是一个完整的请求先完成第一步,而后是第二步(好比回应)。然而实际会更复杂：一个阶段可能产生出两个不一样执行路径，也可能只是简单的生成一个应答(例如返回一个缓存的值)。由此每一个阶段都须要知道下一步该如何作，根据阶段分发函数的返回值有三种可能的作法：

l 请求须要被传递到另一个阶段（返回一个描述符或者指针）

l 请求已经完成（返回ok）

l 请求被阻塞(返回"请求阻塞")。这和前面的状况同样，阻塞到直到别的线程释放资源

应该注意到在这种模式下，对阶段的排队是在一个线程内完成的，而不是经由两个线程中完成。这样避免不断把请求放在下一阶段的队列里，紧接着又从该队列取出这个请求来执行。这种经由不少活动队列和锁的阶段很不必。

这种把一个复杂的任务分解成多个较小的互相协做的部分的方式，看起来很熟悉，这是由于这种作法确实很老了。个人方法，源于CAR在1978年发明的"通讯序列化进程"(Communicating Sequential Processes CSP)，它的基础能够上溯到1963时的Per Brinch Hansen and Matthew Conway--在我出生以前！然而，当Hoare创造出CSP这个术语的时候，“进程”是从抽象的数学角度而言的，并且，这个CSP术语中的进程和操做系统中同名的那个进程并无关系。依我看来，这种在操做系统提供的单个线程以内，实现相似多线程同样协同并发工做的CSP的方法，在可扩展性方面让不少人头疼。

一个实际的例子是，Matt Welsh的SEDA，这个例子代表分段执行的(stage-execution)思想朝着一个比较合理的方向发展。SEDA是一个很好的“server Aarchitecture done right”的例子，值得把它的特性评论一下：

1. SEDA的批处理倾向于强调一个阶段处理多个请求，而个人方式倾向于强调一个请求分红多个阶段处理。

2. 在我看来SEDA的一个重大缺陷是给每一个阶段申请一个独立的在加载响应阶段中线程“后台”重分配的线程池。结果，缘由1和缘由2引发的环境切换仍然不少。

3. 在纯技术的研究项目中，在Java中使用SEDA是有用的，然而在实际应用场合，我以为这种方法不多被选择。

内存分配（Memory Allocator）

申请和释放内存是应用程序中最多见的操做, 所以发明了许多聪明的技巧使得内存的申请效率更高。然而再聪明的方法也不能弥补这种事实:在不少场合中，通常的内存分配方法很是没有效率。因此为了减小向系统申请内存，我有三个建议。

建议一是使用预分配。咱们都知道因为使用静态分配而对程序的功能加上人为限制是一种糟糕的设计。可是仍是有许多其它很不错的预分配方案。一般认为，经过系统一次性分配内存要比分开几回分配要好，即便这样作在程序中浪费了某些内存。若是可以肯定在程序中会有几项内存使用，在程序启动时预分配就是一个合理的选择。即便不能肯定，在开始时为请求句柄预分配可能须要的全部内存也比在每次须要一点的时候才分配要好。经过系统一次性连续分配多项内存还能极大减小错误处理代码。在内存比较紧张时，预分配可能不是一个好的选择，可是除非面对最极端的系统环境，不然预分配都是一个稳赚不赔的选择。

建议二是使用一个内存释放分配的lookaside list(监视列表或者后备列表)。基本的概念是把最近释放的对象放到链表里而不是真的释放它，当不久再次须要该对象时，直接从链表上取下来用，不用经过系统来分配。使用lookaside list的一个额外好处是能够避免复杂对象的初始化和清理.

一般，让lookaside list不受限制的增加，即便在程序空闲时也不释放占用的对象是个糟糕的想法。在避免引入复杂的锁或竞争状况下，不按期的“清扫"非活跃对象是颇有必要的。一个比较稳当的办法是,让lookaside list由两个能够独立锁定的链表组成：一个"新链"和一个"旧链".使用时优先从"新"链分配，而后最后才依靠"旧"链。对象老是被释放的"新"链上。清除线程则按以下规则运行：

1. 锁住两个链

2. 保存旧链的头结点

3. 把前一个新链挂到旧链的前头

4. 解锁

5. 在空闲时经过第二步保存的头结点开始释放旧链的全部对象。

使用了这种方式的系统中，对象只有在真的没用时才会释放，释放至少延时一个清除间隔期(指清除线程的运行间隔)，但同常不会超过两个间隔期。清除线程不会和普通线程发生锁竞争。理论上来讲，一样的方法也能够应用到请求的多个阶段，但目前我尚未发现有这么用的。

使用lookaside lists有一个问题是，保持分配对象须要一个链表指针(链表结点)，这可能会增长内存的使用。可是即便有这种状况，使用它带来的好处也可以远远弥补这些额外内存的花销。

第三条建议与咱们尚未讨论的锁有关系。先抛开它不说。即便使用lookaside list,内存分配时的锁竞争也经常是最大的开销。解决方法是使用线程私有的lookasid list, 这样就能够避免多个线程之间的竞争。更进一步，每一个处理器一个链会更好，但这样只有在非抢先式线程环境下才有用。基于极端考虑，私有lookaside list甚至能够和一个共用的链工做结合起来使用。

锁竞争（Lock Contention）

高效率的锁是很是难规划的, 以致于我把它称做卡律布狄斯和斯库拉(参见附录)。一方面, 锁的简单化(粗粒度锁)会致使并行处理的串行化,于是下降了并发的效率和系统可伸缩性; 另外一方面, 锁的复杂化(细粒度锁)在空间占用上和操做时的时间消耗上均可能产生对性能的侵蚀。偏向于粗粒度锁会有死锁发生，而偏向于细粒度锁则会产生竞争。在这二者之间，有一个狭小的路径通向正确性和高效率，可是路在哪里？

因为锁倾向于对程序逻辑产生束缚，因此若是要在不影响程序正常工做的基础上规划出锁方案基本是不可能的。这也就是人们为何憎恨锁，而且为本身设计的不可扩展的单线程方案找借口了。

几乎咱们每一个系统中锁的设计都始于一个"锁住一切的超级大锁"，并寄但愿于它不会影响性能，当但愿落空时(几乎是必然), 大锁被分红多个小锁，而后咱们继续祷告(性能不会受影响)，接着，是重复上面的整个过程(许多小锁被分红更小的锁), 直到性能达到可接受的程度。一般，上面过程的每次重复都回增长大于20%-50%的复杂性和锁负荷，并减小5%-10%的锁竞争。最终结果是取得了适中的效率，可是实际效率的下降是不可避免的。设计者开始抓狂:"我已经按照书上的指导设计了细粒度锁，为何系统性能仍是很糟糕?"

在个人经验里，上面的方法从基础上来讲就不正确。设想把解决方案当成一座山，优秀的方案表示山顶，糟糕的方案表示山谷。上面始于"超级锁"的解决方案就好像被形形色色的山谷，凹沟，小山头和死胡同挡在了山峰以外的爬山者同样，是一个典型的糟糕登山法；从这样一个地方开始登顶，还不以下山更容易一些。那么登顶正确的方法是什么？

首要的事情是为你程序中的锁造成一张图表，有两个轴：

l 图表的纵轴表示代码。若是你正在应用剔出了分支的阶段架构(指前面说的为请求划分阶段)，你可能已经有这样一张划分图了，就像不少人见过的OSI七层网络协议架构图同样。

l 图表的水平轴表示数据集。在请求的每一个阶段都应该有属于该阶段须要的数据集。

如今，你有了一张网格图，图上每一个单元格表示一个特定阶段须要的特定数据集。下面是应该遵照的最重要的规则：两个请求不该该产生竞争，除非它们在同一个阶段须要一样的数据集。若是你严格遵照这个规则，那么你已经成功了一半。

一旦你定义出了上面那个网格图，在你的系统中的每种类型的锁就均可以被标识出来了。你的下一个目标是确保这些标识出来的锁尽量在两个轴之间均匀的分布, 这部分工做是和具体应用相关的。你得像个钻石切割工同样，根据你对程序的了解，找出请求阶段和数据集之间的天然“纹理线”。有时候它们很容易发现，有时候又很难找出来，此时须要不断回顾来发现它。在程序设计时，把代码分隔成不一样阶段是很复杂的事情，我也没有好的建议，可是对于数据集的定义，有一些建议给你：

l 若是你能对请求按顺序编号，或者能对请求进行哈希，或者能把请求和事物ID关联起来，那么根据这些编号或者ID就能对数据更好的进行分隔。

l 有时，基于数据集的资源最大化利用，把请求动态的分配给数据，相对于依据请求的固有属性来分配会更有优点。就好像现代CPU的多个整数运算单元知道把请求分离同样。

l 肯定每一个阶段指定的数据集是不同的是很是有用的，以便保证一个阶段争夺的数据在另外阶段不会争夺。

若是你在纵向和横向上把“锁空间(这里实际指锁的分布)" 分隔了，而且确保了锁均匀分布在网格上，那么恭喜你得到了一个好方案。如今你处在了一个好的爬山点，打个比喻，你面有了一条通向顶峰的缓坡，但你尚未到山顶。如今是时候对锁竞争进行统计，看看该如何改进了。以不一样的方式分隔阶段和数据集，而后统计锁竞争，直到得到一个满意的分隔。当你作到这个程度的时候，那么无限风景将呈如今你脚下。

其余方面

我已经阐述完了影响性能的四个主要方面。然而还有一些比较重要的方面须要说一说，大所属均可归结于你的平台或系统环境：

l 你的存储子系统在大数据读写和小数据读写，随即读写和顺序读写方面是如何进行？在预读和延迟写入方面作得怎样？

l 你使用的网络协议效率如何？是否能够经过修改参数改善性能？是否有相似于TCP_CORK, MSG_PUSH,Nagle-toggling算法的手段来避免小消息产生？

l 你的系统是否支持Scatter-Gather I/O(例如readv/writev)? 使用这些可以改善性能，也能避免使用缓冲链(见第一节数据拷贝的相关叙述)带来的麻烦。(说明：在dma传输数据的过程当中，要求源物理地址和目标物理地址必须是连续的。但在有的计算机体系中，如IA，连续的存储器地址在物理上不必定是连续的，则dma传输要分红屡次完成。若是传输完一块物理连续的数据后发起一次中断，同时主机进行下一块物理连续的传输，则这种方式即为block dma方式。scatter/gather方式则不一样，它是用一个链表描述物理不连续的存储器，而后把链表首地址告诉dma master。dma master传输完一块物理连续的数据后，就不用再发中断了，而是根据链表传输下一块物理连续的数据，最后发起一次中断。很显然 scatter/gather方式比block dma方式效率高)

l 你的系统的页大小是多少？高速缓存大小是多少？向这些大小边界进行对起是否有用？系统调用和上下文切换花的代价是多少？

l 你是否知道锁原语的饥饿现象？你的事件机制有没有"惊群"问题?你的唤醒/睡眠机制是否有这样糟糕的行为: 当X唤醒了Y, 环境马上切换到了Y,可是X还有没完成的工做?

我在这里考虑的了不少方面，相信你也考虑过。在特定状况下，应用这里提到的某些方面可能没有价值，但能考虑这些因素的影响仍是有用的。若是在系统手册中，你没有找到这些方面的说明，那么就去努力找出答案。写一个测试程序来找出答案；无论怎样，写这样的测试代码都是很好的技巧锻炼。若是你写的代码在多个平台上都运行过，那么把这些相关的代码抽象为一个平台相关的库，未来在某个支持这里提到的某些功能的平台上，你就赢得了先机。

对你的代码,“知其因此然”, 弄明白其中高级的操做, 以及在不一样条件下的花销.这不一样于传统的性能分析, 不是关于具体的实现,而是关乎设计. 低级别的优化永远是蹩脚设计的最后救命稻草.

（map注：下面这段文字原文没有，这是译者对于翻译的理）

[附录：奥德修斯（Odysseus，又译“奥德赛”），神话中伊塔刻岛国王，《伊利亚特》和《奥德赛》两大史诗中的主人公（公元前11世纪到公元前9世纪的希腊史称做“荷马时代”。包括《伊利亚特》和《奥德赛》两部分的《荷马史诗》，是古代世界一部著名的杰做）。奥德修斯曾参加过著名的特洛伊战争，在战争中他以英勇善战、神机妙算而著称，为赢得战争的胜利，他设计制造了著名的“特洛伊木马”（后来在西方成了“为毁灭敌人而送的礼物”的代名词）。特洛伊城毁灭后，他在回国途中又经历了许多风险，荷马的《奥德赛》就是奥德修斯历险的记述。“斯库拉和卡律布狄斯”的故事是其中最惊险、最恐怖的一幕。

相传，斯库拉和卡律布狄斯是古希腊神话中的女妖和魔怪，女妖斯库拉住在乎大利和西西里岛之间海峡中的一个洞穴里，她的对面住着另外一个妖怪卡律布狄斯。它们为害全部过往航海的人。据荷马说，女妖斯库拉长着12只不规则的脚，有6个蛇同样的脖子，每一个脖子上各有一颗可怕的头，张着血盆大口，每张嘴有3 排毒牙，随时准备把猎物咬碎。它们天天在乎大利和西西里岛之间海峡中兴风做浪，航海者在两个妖怪之间经过是异常危险的，它们时刻在等待着穿过西西里海峡的船舶。在海峡中间，卡律布狄斯化成一个大旋涡，波涛汹涌、水花飞溅，天天3次从悬崖上奔涌而出，在退落时将经过此处的船只所有淹没。当奥德修斯的船接近卡律布狄斯大旋涡时，它像火炉上的一锅沸水，波涛滔天，激起漫天雪白的水花。当潮退时，海水混浊，涛声如雷，惊天动地。这时，黑暗泥泞的岩穴一见到底。正当他们惊恐地注视着这一可怕的景象时，正当舵手当心翼翼地驾驶着船只从左绕过旋涡时，忽然海怪斯库拉出如今他们面前，她一口叼住了6个同伴。奥德修斯亲眼看见本身的同伴在妖怪的牙齿中间扭动着双手和双脚，挣扎了一下子，他们便被嚼碎，成了血肉模糊的一团。其他的人侥幸经过了卡律布狄斯大旋涡和海怪斯库拉之间的危险的隘口。后来又历经种种灾难，最后终于回到了故乡——伊塔刻岛。

这个故事在语言学界和翻译界被广为流传。前苏联著名翻译家巴尔胡达罗夫就曾把 “斯库拉和卡律布狄斯”比做翻译中“直译和意译”。他说：“形象地说，译者老是不得不在直译和意译之间迂回应变，犹如在斯库拉和卡律布狄斯之间曲折前行，以求在这海峡两岸之间找到一条狭窄然而却足够深邃的航道，以便达到理想的目的地——最大限度的等值翻译。”

德国著名语言学家洪堡特也说过相似的话：“我确信任何翻译无疑地都是企图解决不可能解决的任务。由于任何一个翻译家都会碰到一个暗礁而遭到失败，他们不是因为十分准确地遵照了原文的形式而破坏了译文语言的特色，就是为了照顾译文语言的特色而损坏了原文。介于二者之间的作法不只难于办到，并且简直是不可能办到。”

历史上长久以来都认为，翻译只能选择两个极端的一种：或者这种——逐字翻译（直译）；或者那种——自由翻译（意译）。就好像翻译中的斯库拉和卡律布狄斯”同样。现在 “斯库拉和卡律布狄斯”已成为表示双重危险——海怪和旋涡的代名词，人们常说“介于斯库拉和卡律布狄斯之间”，这就是说：处于两面受敌的险境，比喻“危机四伏”，用来喻指译者在直译和意译之间反复做出抉择之艰难。]

Introduction

The purpose of this document is to share some ideas that I've developed over the years about how to develop a certain kind of application for which the term "server" is only a weak approximation. More accurately, I'll be writing about a broad class of programs that are designed to handle very large numbers of discrete messages or requests per second. Network servers most commonly fit this definition, but not all programs that do are really servers in any sense of the word. For the sake of simplicity, though, and because "High-Performance Request-Handling Programs" is a really lousy title, we'll just say "server" and be done with it.

I will not be writing about "mildly parallel" applications, even though multitasking within a single program is now commonplace. The browser you're using to read this probably does some things in parallel, but such low levels of parallelism really don't introduce many interesting challenges. The interesting challenges occur when the request-handling infrastructure itself is the limiting factor on overall performance, so that improving the infrastructure actually improves performance. That's not often the case for a browser running on a gigahertz processor with a gigabyte of memory doing six simultaneous downloads over a DSL line. The focus here is not on applications that sip through a straw but on those that drink from a firehose, on the very edge of hardware capabilities where how you do it really does matter.

Some people will inevitably take issue with some of my comments and suggestions, or think they have an even better way. Fine. I'm not trying to be the Voice of God here; these are just methods that I've found to work for me, not only in terms of their effects on performance but also in terms of their effects on the difficulty of debugging or extending code later. Your mileage may vary. If something else works better for you that's great, but be warned that almost everything I suggest here exists as an alternative to something else that I tried once only to be disgusted or horrified by the results. Your pet idea might very well feature prominently in one of these stories, and innocent readers might be bored to death if you encourage me to start telling them. You wouldn't want to hurt them, would you?

The rest of this article is going to be centered around what I'll call the Four Horsemen of Poor Performance:

Data copies
Context switches
Memory allocation
Lock contention

There will also be a catch-all section at the end, but these are the biggest performance-killers. If you can handle most requests without copying data, without a context switch, without going through the memory allocator and without contending for locks, you'll have a server that performs well even if it gets some of the minor parts wrong.

Data Copies

This could be a very short section, for one very simple reason: most people have learned this lesson already. Everybody knows data copies are bad; it's obvious, right? Well, actually, it probably only seems obvious because you learned it very early in your computing career, and that only happened because somebody started putting out the word decades ago. I know that's true for me, but I digress. Nowadays it's covered in every school curriculum and in every informal how-to. Even the marketing types have figured out that "zero copy" is a good buzzword.

Despite the after-the-fact obviousness of copies being bad, though, there still seem to be nuances that people miss. The most important of these is that data copies are often hidden and disguised. Do you really know whether any code you call in drivers or libraries does data copies? It's probably more than you think. Guess what "Programmed I/O" on a PC refers to. An example of a copy that's disguised rather than hidden is a hash function, which has all the memory-access cost of a copy and also involves more computation. Once it's pointed out that hashing is effectively "copying plus" it seems obvious that it should be avoided, but I know at least one group of brilliant people who had to figure it out the hard way. If you really want to get rid of data copies, either because they really are hurting performance or because you want to put "zero-copy operation" on your hacker-conference slides, you'll need to track down a lot of things that really are data copies but don't advertise themselves as such.

The tried and true method for avoiding data copies is to use indirection, and pass buffer descriptors (or chains of buffer descriptors) around instead of mere buffer pointers. Each descriptor typically consists of the following:

A pointer and length for the whole buffer.
A pointer and length, or offset and length, for the part of the buffer that's actually filled.
Forward and back pointers to other buffer descriptors in a list.
A reference count.

Now, instead of copying a piece of data to make sure it stays in memory, code can simply increment a reference count on the appropriate buffer descriptor. This can work extremely well under some conditions, including the way that a typical network protocol stack operates, but it can also become a really big headache. Generally speaking, it's easy to add buffers at the beginning or end of a chain, to add references to whole buffers, and to deallocate a whole chain at once. Adding in the middle, deallocating piece by piece, or referring to partial buffers will each make life increasingly difficult. Trying to split or combine buffers will simply drive you insane.

I don't actually recommend using this approach for everything, though. Why not? Because it gets to be a huge pain when you have to walk through descriptor chains every time you want to look at a header field. There really are worse things than data copies. I find that the best thing to do is to identify the large objects in a program, such as data blocks, make sure those get allocated separately as described above so that they don't need to be copied, and not sweat too much about the other stuff.

This brings me to my last point about data copies: don't go overboard avoiding them. I've seen way too much code that avoids data copies by doing something even worse, like forcing a context switch or breaking up a large I/O request. Data copies are expensive, and when you're looking for places to avoid redundant operations they're one of the first things you should look at, but there is a point of diminishing returns. Combing through code and then making it twice as complicated just to get rid of that last few data copies is usually a waste of time that could be better spent in other ways.

Context Switches

Whereas everyone thinks it's obvious that data copies are bad, I'm often surprised by how many people totally ignore the effect of context switches on performance. In my experience, context switches are actually behind more total "meltdowns" at high load than data copies; the system starts spending more time going from one thread to another than it actually spends within any thread doing useful work. The amazing thing is that, at one level, it's totally obvious what causes excessive context switching. The #1 cause of context switches is having more active threads than you have processors. As the ratio of active threads to processors increases, the number of context switches also increases - linearly if you're lucky, but often exponentially. This very simple fact explains why multi-threaded designs that have one thread per connection scale very poorly. The only realistic alternative for a scalable system is to limit the number of active threads so it's (usually) less than or equal to the number of processors. One popular variant of this approach is to use only one thread, ever; while such an approach does avoid context thrashing, and avoids the need for locking as well, it is also incapable of achieving more than one processor's worth of total throughput and thus remains beneath contempt unless the program will be non-CPU-bound (usually network-I/O-bound) anyway.

The first thing that a "thread-frugal" program has to do is figure out how it's going to make one thread handle multiple connections at once. This usually implies a front end that uses select/poll, asynchronous I/O, signals or completion ports, with an event-driven structure behind that. Many "religious wars" have been fought, and continue to be fought, over which of the various front-end APIs is best. Dan Kegel's C10K paper is a good resource is this area. Personally, I think all flavors of select/poll and signals are ugly hacks, and therefore favor either AIO or completion ports, but it actually doesn't matter that much. They all - except maybe select() - work reasonably well, and don't really do much to address the matter of what happens past the very outermost layer of your program's front end.

The simplest conceptual model of a multi-threaded event-driven server has a queue at its center; requests are read by one or more "listener" threads and put on queues, from which one or more "worker" threads will remove and process them. Conceptually, this is a good model, but all too often people actually implement their code this way. Why is this wrong? Because the #2 cause of context switches is transferring work from one thread to another. Some people even compound the error by requiring that the response to a request be sent by the original thread - guaranteeing not one but two context switches per request. It's very important to use a "symmetric" approach in which a given thread can go from being a listener to a worker to a listener again without ever changing context. Whether this involves partitioning connections between threads or having all threads take turns being listener for the entire set of connections seems to matter a lot less.

Usually, it's not possible to know how many threads will be active even one instant into the future. After all, requests can come in on any connection at any moment, or "background" threads dedicated to various maintenance tasks could pick that moment to wake up. If you don't know how many threads are active, how can you limit how many are active? In my experience, one of the most effective approaches is also one of the simplest: use an old-fashioned counting semaphore which each thread must hold whenever it's doing "real work". If the thread limit has already been reached then each listen-mode thread might incur one extra context switch as it wakes up and then blocks on the semaphore, but once all listen-mode threads have blocked in this way they won't continue contending for resources until one of the existing threads "retires" so the system effect is negligible. More importantly, this method handles maintenance threads - which sleep most of the time and therefore dont' count against the active thread count - more gracefully than most alternatives.

Once the processing of requests has been broken up into two stages (listener and worker) with multiple threads to service the stages, it's natural to break up the processing even further into more than two stages. In its simplest form, processing a request thus becomes a matter of invoking stages successively in one direction, and then in the other (for replies). However, things can get more complicated; a stage might represent a "fork" between two processing paths which involve different stages, or it might generate a reply (e.g. a cached value) itself without invoking further stages. Therefore, each stage needs to be able to specify "what should happen next" for a request. There are three possibilities, represented by return values from the stage's dispatch function:

The request needs to be passed on to another stage (an ID or pointer in the return value).
The request has been completed (a special "request done" return value)
The request was blocked (a special "request blocked" return value). This is equivalent to the previous case, except that the request is not freed and will be continued later from another thread.

Note that, in this model, queuing of requests is done within stages, not between stages. This avoids the common silliness of constantly putting a request on a successor stage's queue, then immediately invoking that successor stage and dequeuing the request again; I call that lots of queue activity - and locking - for nothing.

If this idea of separating a complex task into multiple smaller communicating parts seems familiar, that's because it's actually very old. My approach has its roots in the Communicating Sequential Processes concept elucidated by C.A.R. Hoare in 1978, based in turn on ideas from Per Brinch Hansen and Matthew Conway going back to 1963 - before I was born! However, when Hoare coined the term CSP he meant "process" in the abstract mathematical sense, and a CSP process need bear no relation to the operating-system entities of the same name. In my opinion, the common approach of implementing CSP via thread-like coroutines within a single OS thread gives the user all of the headaches of concurrency with none of the scalability.

A contemporary example of the staged-execution idea evolved in a saner direction is Matt Welsh's SEDA. In fact, SEDA is such a good example of "server architecture done right" that it's worth commenting on some of its specific characteristics (especially where those differ from what I've outlined above).

SEDA's "batching" tends to emphasize processing multiple requests through a stage at once, while my approach tends to emphasize processing a single request through multiple stages at once.
SEDA's one significant flaw, in my opinion, is that it allocates a separate thread pool to each stage with only "background" reallocation of threads between stages in response to load. As a result, the #1 and #2 causes of context switches noted above are still very much present.
In the context of an academic research project, implementing SEDA in Java might make sense. In the real world, though, I think the choice can be characterized as unfortunate.

Memory Allocation

Allocating and freeing memory is one of the most common operations in many applications. Accordingly, many clever tricks have been developed to make general-purpose memory allocators more efficient. However, no amount of cleverness can make up for the fact that the very generality of such allocators inevitably makes them far less efficient than the alternatives in many cases. I therefore have three suggestions for how to avoid the system memory allocator altogether.

Suggestion #1 is simple preallocation. We all know that static allocation is bad when it imposes artificial limits on program functionality, but there are many other forms of preallocation that can be quite beneficial. Usually the reason comes down to the fact that one trip through the system memory allocator is better than several, even when some memory is "wasted" in the process. Thus, if it's possible to assert that no more than N items could ever be in use at once, preallocation at program startup might be a valid choice. Even when that's not the case, preallocating everything that a request handler might need right at the beginning might be preferable to allocating each piece as it's needed; aside from the possibility of allocating multiple items contiguously in one trip through the system allocator, this often greatly simplifies error-recovery code. If memory is very tight then preallocation might not be an option, but in all but the most extreme circumstances it generally turns out to be a net win.

Suggestion #2 is to use lookaside lists for objects that are allocated and freed frequently. The basic idea is to put recently-freed objects onto a list instead of actually freeing them, in the hope that if they're needed again soon they need merely be taken off the list instead of being allocated from system memory. As an additional benefit, transitions to/from a lookaside list can often be implemented to skip complex object initialization/finalization.

It's generally undesirable to have lookaside lists grow without bound, never actually freeing anything even when your program is idle. Therefore, it's usually necessary to have some sort of periodic "sweeper" task to free inactive objects, but it would also be undesirable if the sweeper introduced undue locking complexity or contention. A good compromise is therefore a system in which a lookaside list actually consists of separately locked "old" and "new" lists. Allocation is done preferentially from the new list, then from the old list, and from the system only as a last resort; objects are always freed onto the new list. The sweeper thread operates as follows:

Lock both lists.
Save the head for the old list.
Make the (previously) new list into the old list by assigning list heads.
Unlock.
Free everything on the saved old list at leisure.

Objects in this sort of system are only actually freed when they have not been needed for at least one full sweeper interval, but always less than two. Most importantly, the sweeper does most of its work without holding any locks to contend with regular threads. In theory, the same approach can be generalized to more than two stages, but I have yet to find that useful.

One concern with using lookaside lists is that the list pointers might increase object size. In my experience, most of the objects that I'd use lookaside lists for already contain list pointers anyway, so it's kind of a moot point. Even if the pointers were only needed for the lookaside lists, though, the savings in terms of avoided trips through the system memory allocator (and object initialization) would more than make up for the extra memory.

Suggestion #3 actually has to do with locking, which we haven't discussed yet, but I'll toss it in anyway. Lock contention is often the biggest cost in allocating memory, even when lookaside lists are in use. One solution is to maintain multiple private lookaside lists, such that there's absolutely no possibility of contention for any one list. For example, you could have a separate lookaside list for each thread. One list per processor can be even better, due to cache-warmth considerations, but only works if threads cannot be preempted. The private lookaside lists can even be combined with a shared list if necessary, to create a system with extremely low allocation overhead.

Lock Contention

Efficient locking schemes are notoriously hard to design, because of what I call Scylla and Charybdis after the monsters in the Odyssey. Scylla is locking that's too simplistic and/or coarse-grained, serializing activities that can or should proceed in parallel and thus sacrificing performance and scalability; Charybdis is overly complex or fine-grained locking, with space for locks and time for lock operations again sapping performance. Near Scylla are shoals representing deadlock and livelock conditions; near Charybdis are shoals representing race conditions. In between, there's a narrow channel that represents locking which is both efficient and correct...or is there? Since locking tends to be deeply tied to program logic, it's often impossible to design a good locking scheme without fundamentally changing how the program works. This is why people hate locking, and try to rationalize their use of non-scalable single-threaded approaches.

Almost every locking scheme starts off as "one big lock around everything" and a vague hope that performance won't suck. When that hope is dashed, and it almost always is, the big lock is broken up into smaller ones and the prayer is repeated, and then the whole process is repeated, presumably until performance is adequate. Often, though, each iteration increases complexity and locking overhead by 20-50% in return for a 5-10% decrease in lock contention. With luck, the net result is still a modest increase in performance, but actual decreases are not uncommon. The designer is left scratching his head (I use "his" because I'm a guy myself; get over it). "I made the locks finer grained like all the textbooks said I should," he thinks, "so why did performance get worse?"

In my opinion, things got worse because the aforementioned approach is fundamentally misguided. Imagine the "solution space" as a mountain range, with high points representing good solutions and low points representing bad ones. The problem is that the "one big lock" starting point is almost always separated from the higher peaks by all manner of valleys, saddles, lesser peaks and dead ends. It's a classic hill-climbing problem; trying to get from such a starting point to the higher peaks only by taking small steps and never going downhill almost never works. What's needed is a fundamentally different way of approaching the peaks.

The first thing you have to do is form a mental map of your program's locking. This map has two axes:

The vertical axis represents code. If you're using a staged architecture with non-branching stages, you probably already have a diagram showing these divisions, like the ones everybody uses for OSI-model network protocol stacks.
The horizontal axis represents data. In every stage, each request should be assigned to a data set with its own resources separate from any other set.

You now have a grid, where each cell represents a particular data set in a particular processing stage. What's most important is the following rule: two requests should not be in contention unless they are in the same data set and the same processing stage. If you can manage that, you've already won half the battle.

Once you've defined the grid, every type of locking your program does can be plotted, and your next goal is to ensure that the resulting dots are as evenly distributed along both axes as possible. Unfortunately, this part is very application-specific. You have to think like a diamond-cutter, using your knowledge of what the program does to find the natural "cleavage lines" between stages and data sets. Sometimes they're obvious to start with. Sometimes they're harder to find, but seem more obvious in retrospect. Dividing code into stages is a complicated matter of program design, so there's not much I can offer there, but here are some suggestions for how to define data sets:

If you have some sort of a block number or hash or transaction ID associated with requests, you can rarely do better than to divide that value by the number of data sets.
Sometimes, it's better to assign requests to data sets dynamically, based on which data set has the most resources available rather than some intrinsic property of the request. Think of it like multiple integer units in a modern CPU; those guys know a thing or two about making discrete requests flow through a system.
It's often helpful to make sure that the data-set assignment is different for each stage, so that requests which would contend at one stage are guaranteed not to do so at another stage.

If you've divided your "locking space" both vertically and horizontally, and made sure that lock activity is spread evenly across the resulting cells, you can be pretty sure that your locking is in pretty good shape. There's one more step, though. Do you remember the "small steps" approach I derided a few paragraphs ago? It still has its place, because now you're at a good starting point instead of a terrible one. In metaphorical terms you're probably well up the slope on one of the mountain range's highest peaks, but you're probably not at the top of one. Now is the time to collect contention statistics and see what you need to do to improve, splitting stages and data sets in different ways and then collecting more statistics until you're satisfied. If you do all that, you're sure to have a fine view from the mountaintop.

Other Stuff

As promised, I've covered the four biggest performance problems in server design. There are still some important issues that any particular server will need to address, though. Mostly, these come down to knowing your platform/environment:

How does your storage subsystem perform with larger vs. smaller requests? With sequential vs. random? How well do read-ahead and write-behind work?
How efficient is the network protocol you're using? Are there parameters or flags you can set to make it perform better? Are there facilities like TCP_CORK, MSG_PUSH, or the Nagle-toggling trick that you can use to avoid tiny messages?
Does your system support scatter/gather I/O (e.g. readv/writev)? Using these can improve performance and also take much of the pain out of using buffer chains.
What's your page size? What's your cache-line size? Is it worth it to align stuff on these boundaries? How expensive are system calls or context switches, relative to other things?
Are your reader/writer lock primitives subject to starvation? Of whom? Do your events have "thundering herd" problems? Does your sleep/wakeup have the nasty (but very common) behavior that when X wakes Y a context switch to Y happens immediately even if X still has things to do?

I'm sure I could think of many more questions in this vein. I'm sure you could too. In any particular situation it might not be worthwhile to do anything about any one of these issues, but it's usually worth at least thinking about them. If you don't know the answers - many of which you will not find in the system documentation - find out. Write a test program or micro-benchmark to find the answers empirically; writing such code is a useful skill in and of itself anyway. If you're writing code to run on multiple platforms, many of these questions correlate with points where you should probably be abstracting functionality into per-platform libraries so you can realize a performance gain on that one platform that supports a particular feature.

The "know the answers" theory applies to your own code, too. Figure out what the important high-level operations in your code are, and time them under different conditions. This is not quite the same as traditional profiling; it's about measuringdesign elements, not actual implementations. Low-level optimization is generally the last resort of someone who screwed up the design.

4.2 Server Architectures

We have seen different models for socket I/O--and file I/O, in case of a web server for static content. Now, we are now in need of models merging I/O operations, CPU-bound activities such as request parsing and request handling into general server architectures.

There are traditionally two competitive server architectures--one is based on threads, the other on events. Over time, more sophisticated variants emerged, sometimes combining both approaches. There has been a long controversy, whether threads or events are generally the better fundament for high performance web servers [Ous96,vB03a,Wel01]. After more than a decade, this argument has been now reinforced, thanks to new scalability challenges and the trend towards multi-core CPUs.

Before we evaluate the different approaches, we introduce the general architectures, the corresponding patterns in use and give some real world examples.

Thread-based Server Architectures

The thread-based approach basically associates each incoming connection with a separate thread (resp. process). In this way, synchronous blocking I/O is the natural way of dealing with I/O. It is a common approach that is well supported by many programming languages. It also leads to a straight forward programming model, because all tasks necessary for request handling can be coded sequentially. Moreover, it provides a simple mental abstraction by isolating requests and hiding concurrency. Real concurrency is achieved by employing multiple threads/processes at the same time.

Conceptually, multi-process and multi-threaded architectures share the same principles: each new connection is handled by a dedicated activity.

Multi-Process Architectures

A traditional approach to UNIX-based network servers is the process-per-connection model, using a dedicated process for handling a connection [Ste03]. This model has also been used for the first HTTP server, CERN httpd. Due to the nature of processes, they are isolating different requests promptly, as they do not share memory. Being rather heavyweight structures, the creation of processes is a costly operation and servers often employ a strategy called preforking. When using preforking, the main server process forks several handler processes preemptively on start-up, as shown infigure 4.1. Often, the (thread-safe) socket descriptor is shared among all processes, and each process blocks for a new connection, handles the connection and then waits for the next connection.

Figure 4.1: A multi-process architecture that make use of preforking. On startup, the main server process forks several child processes that will later handle requests. A socket is created and shared between the processes. Each request handler process waits for new connections to handle and thereafter blocks for new connections.

Some multi-process servers also measure the load and spawn additional requests when needed. However, it is important to note that the heavyweight structure of a process limits the maximum of simultaneous connections. The large memory footprint as a result of the connection-process mapping leads to a concurrency/memory trade-off. Especially in case of long-running, partially inactive connections (e.g. long-polling notification requests), the multi-process architecture provides only limited scalability for concurrent requests.

The popular Apache web server provides a robust multi-processing module that is based on process preforking, Apache-MPM prefork. It is still the default multi-processing module for UNIX-based setups of Apache.

Multi-Threaded Architectures

When reasonable threading libraries have become available, new server architectures emerged that replaced heavyweight processes with more leightweight threads. In effect, they employ a thread-per-connection model. Although following the same principles, the multi-threaded approach has several important differences. First of all, multiple threads share the same address space and hence share global variables and state. This makes it possible to implement mutual features for all request handlers, such as a shared cache for cacheable responses inside the web server. Obviously, correct synchronization and coordination is then required. Another difference of the more leightweight structures of threads is their smaller memory footprint. Compared to the full-blown memory size of an entire process, a thread only consumes limited memory (i.e. the thread stack). Also, threads require less resources for creation/termination. We have already seen that the dimensions of a process are a severe problem in case of high concurrency. Threads are generally a more efficient replacement when mapping connections to activities.

Figure 4.2: A multi-threaded architecture that make use of an acceptor thread. The dedicated acceptor blocks for new socket connections, accepts connections and dispatches them to the worker pool and continues. The worker pool provides a set of threads that handle incoming requests. Worker threads are either handling requests or waiting for new requests to process.

In practice, it is a common architecture to place a single dispatcher thread (sometimes also called acceptor thread) in front of a pool of threads for connection handling [Ste03], as shown in figure 4.2. Thread pools are a common way of bounding the maximum number of threads inside the server. The dispatcher blocks on the socket for new connections. Once established, the connection is passed to a queue of incoming connections. Threads from the thread pool take connections from the queue, execute the requests and wait for new connections in the queue. When the queue is also bounded, the maximum number of awaiting connections can be restricted. Additional connections will be rejected. While this strategy limits the concurrency, it provides more predictable latencies and prevents total overload.

Apache-MPM worker is a multi-processing module for the Apache web server that combines processes and threads. The module spawns several processes and each process in turn manages its own pool of threads.

Scalability Considerations for Multi-Threaded Architectures

Multi-threaded servers using a thread-per-connection model are easy to implement and follow a simple strategy. Synchronous, blocking I/O operations can be used as a natural way of expressing I/O access. The operating system overlaps multiple threads via preemptively scheduling. In most cases, at least a blocking I/O operation triggers scheduling and causes a context switch, allowing the next thread to continue. This is a sound model for decent concurrency, and also appropriate when a reasonable amount of CPU-bound operations must be executed. Furthermore, multiple CPU cores can be used directly, as threads and processes are scheduled to all cores available.

Under heavy load, a multi-threaded web server consumes large amounts of memory (due to a single thread stack for each connection), and constant context switching causes considerable losses of CPU time. An indirect penalty thereof is increased chance of CPU cache misses. Reducing the absolute number of threads improves the per-thread performance, but limits the overall scalability in terms of maximum simultaneous connections.

Event-driven Server Architectures

As an alternative to synchronous blocking I/O, the event-driven approach is also common in server architectures. Due to the asynchronous/non-blocking call semantics, other models than the previously outlined thread-per-connection model are needed. A common model is the mapping of a single thread to multiple connections. The thread then handles all occurring events from I/O operations of these connections and requests. As shown in figure 4.3, new events are queued and the thread executes a so-called event loop--dequeuing events from the queue, processing the event, then taking the next event or waiting for new events to be pushed. Thus, the work executed by a thread is very similar to that of a scheduler, multiplexing multiple connections to a single flow of execution.

Figure 4.3: This conceptual model shows the internals of an event-driven architecture. A single-threaded event loop consumes event after event from the queue and sequentially executes associated event handler code. New events are emitted by external sources such as socket or file I/O notifications. Event handlers trigger I/O actions that eventually result in new events later.

Processing an event either requires registered event handler code for specific events, or it is based on the execution of a callback associated to the event in advance. The different states of the connections handled by a thread are organized in appropriate data structures-- either explicitly using finite state machines or implicitly via continuations or closures of callbacks. As a result, the control flow of an application following the event-driven style is somehow inverted. Instead of sequential operations, an event-driven program uses a cascade of asynchronous calls and callbacks that get executed on events. This notion often makes the flow of control less obvious and complicates debugging.

The usage of event-driven server architectures has historically depended on the availability of asynchronous/non-blocking I/O operations on OS level and suitable, high performance event notification interfaces such as epoll and kqueue. Earlier implementations of event-based servers such as the Flash web server by Pai et al [Pai99].

Non-blocking I/O Multiplexing Patterns

Different patterns have emerged for event-based I/O multiplexing, recommending solutions for highly concurrent, high-performance I/O handling. The patterns generally address the problem of network services to handle multiple concurrent requests.

Reactor Pattern

The Reactor pattern [Sch95] targets synchronous, non-blocking I/O handling and relies on an event notification interface. On startup, an application following this pattern registers a set of resources (e.g. a socket) and events (e.g. a new connection) it is interested in. For each resource event the application is interested in, an appropriate event handler must be provided--a callback or hook method. The core component of the Reactor pattern is a synchronous event demultiplexer, that awaits events of resources using a blocking event notification interface. Whenever the synchronous event demultiplexer receives an event (e.g. a new client connection), it notifies a dispatcher and awaits for the next event. The dispatcher processes the event by selecting the associated event handler and triggering the callback/hook execution.

The Reactor pattern thus decouples a general framework for event handling and multiplexing from the application-specific event handlers. The original pattern focuses on a single-threaded execution. This requires the event handlers to adhere to the non-blocking style of operations. Otherwise, a blocking operation can suspend the entire application. Other variants of the Reactor pattern use a thread pool for the event handlers. While this improves performance on multi-core platforms, an additional overhead for coordination and synchronization must be taken into account.

Proactor Pattern

In contrast, the Proactor pattern [Pya97] leverages truly asynchronous, non-blocking I/O operations, as provided by interfaces such as POSIX AIO. As a result, the Proactor can be considered as an entirely asynchronous variant of the Reactor pattern seen before. It incorporates support for completition events instead of blocking event notification interfaces. A proactive initiator represents the main application thread and is responsible for initiating asynchronous I/O operations. When issuing such an operation, it always registers a completition handler and completition dispatcher. The execution of the asynchronous operation is governed by the asynchronous operation processor, an entity that is part of the OS in practice. When the I/O operation has been completed, the completition dispatcher is notified. Next, the completition handler processes the resulting event.

An important property in terms of scalability compared to the Reactor pattern is the better multithreading support. The execution of completition handlers can easily be handed off to a dedicated thread pool.

Scalability Considerations for Event-driven Architectures

Having a single thread running an event loop and waiting for I/O notifications has a different impact on scalability than the thread-based approach outlined before. Not associating connections and threads does dramatically decrease the number of threads of the server--in an extreme case, down to the single event-looping thread plus some OS kernel threads for I/O. We thereby get rid of the overhead of excessive context switching and do not need a thread stack for each connection. This decreases the memory footprint under load and wastes less CPU time to context switching. Ideally, the CPU becomes the only apparent bottleneck of an event-driven network application. Until full saturation of resources is archived, the event loop scales with increasing throughput. Once the load increases beyond maximum saturation, the event queue begins to stack up as the event-processing thread is not able to match up. Under this condition, the event-driven approach still provides a thorough throughput, but latencies of requests increase linearly, due to overload. This might be acceptable for temporary load peaks, but permanent overload degrades performance and renders the service unusable. One countermeasure is a more resource-aware scheduling and decoupling of event processing, as we will see soon when analysing a staged-based approach.

For the moment, we stay with the event-driven architectures and align them with multi-core architectures. While the thread-based model covers both--I/O-based and CPU-based concurrency, the initial event-based architecture solely addresses I/O concurrency. For exploiting multiple CPUs or cores, event-driven servers must be further adapted.

An obvious approach is the instantiation of multiple separate server processes on a single machine. This is often referred to as the N-copy approach for using N instances on a host with N CPUs/cores. In our case a machine would run multiple web server instances and register all instances at the load balancers. A less isolated alternative shares the server socket between all instances, thus requiring some coordination. For instance, an implementation of this approach is available for node.js using the cluster module, which forks multiple instances of an application and shares a single server socket.

The web servers in the architectural model have a specific feature--they are stateless, shared-nothing components. Already using an internal cache for dynamic requests requires several changes in the server architecture. For the moment, the easier concurrency model of having a single-threaded server and sequential execution semantics of callbacks can be accepted as part of the architecture. It is exactly this simple execution model that makes single-threaded applications attractive for developers, as the efforts of coordination and synchronization are diminished and the application code (i.e. callbacks) is guaranteed not to run concurrently. On the other hand, this characteristic intrinsically prevents the utilization of multiple processes inside a single event-driven application. Zeldovich et al. have addresses this issue withlibasync-smp [Zel03], an asynchronous programming library taking advantage of multiple processes and parallel callback execution. The simple sequential programming model is still preserved. The basic idea is the usage of tokens, so-called colors assigned to each callback. Callbacks with different colors can be executed in parallel, while serial execution is guaranteed for callbacks with the same color. Using a default color to all non-labelled callbacks makes this approach backward compatible to programs without any colors.

Let us extend our web server with a cache, using the coloring for additional concurrency. Reading and parsing a new request are sequential operations, but different requests can be handled at the same time. Thus, each request gets a distinct color (e.g. using the socket descriptor), and the parsing operation of different request can actually happen in parallel, as they are labelled differently. After having parsed the request, the server must check if the required content is already cached. Otherwise, it must be requested from the application server. Checking the cache now is a concurrent operation that must be executed sequentially, in order to provide consistency. Hence, the same color label is used for this step for all requests, indicating the scheduler to run all of these operations always serially, and never in parallel. This library also allows the callback to execute partially blocking operations. As long as the operation is not labelled with a shared color, it will not block other callbacks directly. The library is backed by a thread pool and a set of event queues, distinguished by colors. This solution allows to adhere to the traditional event-driven programming style, but introduces real concurrency to a certain extent. However, it requires the developer to label callbacks correctly. Reasoning about the flows of executions in an event-driven program is already difficult sometimes, and the additional effort may complicate this further.

Combined Approaches

The need for scalable architectures and the drawbacks of both general models have led to alternative architectures and libraries incorporating features of both models.

Staged Event-driven Architecture

A formative architecture combining threads and events for scalable servers has been designed by Welsh et al. [Wel01], the so called SEDA. As a basic concept, it divides the server logic into a series of well-defined stages, that are connected by queues, as shown in figure 4.4. Requests are passed from stage to stage during processing. Each stage is backed by a thread or a thread pool, that may be configured dynamically.

Figure 4.4: This illustration shows the concept of SEDA. In this example, there are two stages, each with a queue for incoming events, an event handler backed by thread pool and a controller that monitors resources. The only interaction between stages is the emission of events to the next stage(s) in the pipeline.

The separation favors modularity as the pipeline of stages can be changed and extended easily. Another very important feature of the SEDA design is the resource awareness and explicit control of load. The size of the enqueued items per stage and the workload of the thread pool per stage gives explicit insights on the overall load factor. In case of an overload situation, a server can adjust scheduling parameters or thread pool sizes. Other adaptive strategies include dynamic reconfiguration of the pipeline or deliberate request termination. When resource management, load introspection and adaptivity are decoupled from the application logic of a stage, it is simple to develop well-conditioned services. From a concurrency perspective, SEDA represents a hybrid approach between thread-per-connection multithreading and event-based concurrency. Having a thread (or a thread pool) dequeuing and processing elements resembles an event-driven approach. The usage of multiple stages with independent threads effectively utilizies multiple CPUs or cores and tends to a multi-threaded environment. From a developer's perspective, the implementation of handler code for a certain stage also resembles more traditional thread programming.

The drawbacks of SEDA are the increased latencies due to queue and stage traversal even in case of minimal load. In a later retrospective [Wel10], Welsh also criticized a missing differentiation of module boundaries (stages) and concurrency boundaries (queues and threads). This distribution triggers too many context switches, when a requests passes through multiple stages and queues. A better solution groups multiple stages together with a common thread pool. This decreaes context switches and improves response times. Stages with I/O operations and comparatively long execution times can still be isolated.

The SEDA model has inspired several implementations, including the generic server framework Apache MINA and enterprise service buses such as Mule ESB.

Special-Purpose Libraries

For instance, the Capriccio threading library by von Behren et al. [vB03b] promises scalable threads for servers by tackling the main thread issues. The problem of extensive context switches is addressed by using a non-preemptive scheduling. Threads eithers yield on I/O operations, or on an explicit yield operation. The stack size of each thread is limited based on prior analysis at compile time. This makes it unnecessary to overprovide bounded stack space preemptively. However, unbounded loops and the usage of recursive calls render a complete calculation of stack size apriori impossible. As a workaround, checkpoints are inserted into the code, that determine if a stack overflow is about to happen and allocate new stack chunks in that case. The checkpoints are inserted at compile time and are placed in a manner that there will never be a stack overflow within the code between two checkpoints. Additionally, resource-aware scheduling is applied that prevents thrashing. Therefore, CPU, memory and file descriptors are watched and combined with a static analysis of the resource usage of threads, scheduling is dynamically adapted.

Also, hybrid libraries, combining threads and events, have been developed. Li and Zdancewic [Li07] have implemented a combined model for Haskell, based on concurrency monads. The programming language Scala also provides event-driven and multi-threaded concurrency, that can be combined for server implementations.

	thread-based	event-driven
connection/request state	thread context	state machine/continuation
main I/O model	synchronous/blocking	asynchronous/non-blocking
activity flow	thread-per-connection	events and associated handlers
primary scheduling strategy	preemptive (OS)	cooperative
scheduling component	scheduler (OS)	event loop
calling semantics	blocking	dispatching/awaiting events

Table 4.2: Main differences between thread-based and event-driven server architectures.

Evaluation

Pariag et al. [Par07] have conducted a detailed performance-oriented comparison of thread-based, event-driven and hybrid pipelined servers. The thread-based server (knot) has taken advantage of the aforementioned Capriccio library. The event-driven server (26#26server) has been designed to support socket sharing and multiprocessor support using the N-copy approach. Lastly, the hybrid pipelined server (WatPipe) has been heavily inspired by SEDA, and consists of four stages for serving web requests. Pariag and his team then tested and heavily tuned the three servers. Finally, they benchmarked the servers using different scenarios, including deliberate overload situations. Previous benchmarks have been used to promote either new thread-based or event-driven architectures[Pai99,Wel01,vB03a], often with clear benefits for the new architecture. The extensive benchmark of Pariag et al. revealed that all three architectural models can be used for building highly scalable servers, as long as thorough tuning and (re-)configuration is conducted. The results also showed that event-driven architectures using asynchronous I/O have still a marginal advantage over thread-based architectures.

Event-driven web servers like nginx (e.g. GitHub, WordPress.com), lighttpd (e.g. YouTube, Wikipedia) or Tornado (e.g. Facebook, Quora) are currently very popular and several generic frameworks have emerged that follow this architectural pattern. Such frameworks available for Java include netty and MINA.

Please note that we do not conduct our own benchmarks in this chapter. Nottingham, one of the editors of the HTTP standards, has written an insightful summary, why even handed server benchmarking is extremely hard and costly [Not11]. Hence, we solely focus on the architecture concepts and design principles of web servers and confine our considerations to the prior results of Pariag et al. [Par07].

http://www.nightmare.com/medusa/medusa.html

Medusa: A High-Performance Internet Server Architecture

What is Medusa?

Medusa is an architecture for very-high-performance TCP/IP servers (like HTTP, FTP, and NNTP). Medusa is different from most other servers because it runs as a single process, multiplexing I/O with its various client and server connections within a single process/thread.

Medusa is written in Python, a high-level object-oriented language that is particularly well suited to building powerful, extensible servers. Medusa can be extended and modified at run-time, even by the end-user. User 'scripts' can be used to completely change the behavior of the server, and even add in completely new server types.

How Does it Work?

Most Internet servers are built on a 'forking' model. ('Fork' is a Unix term for starting a new process.) Such servers actually invoke an entire new process for every single client connection. This approach is simple to implement, but does not scale very well to high-load situations. Lots of clients mean a lot of processes, which gobble up large quantities of virtual memory and other system resources. A high-load server thus needs to have a lot of memory. Many popular Internet servers are running with hundreds of megabytes of memory.

The I/O bottleneck.

The vast majority of Internet servers are I/O bound - for any one process, the CPU is sitting idle 99.9% of the time, usually waiting for input from an external device (in the case of an Internet server, it is waiting for input from the network). This problem is exacerbated by the imbalance between server and client bandwidth: most clients are connecting at relatively low bandwidths (28.8 kbits/sec or less, with network delays and inefficiencies it can be far lower). To a typical server CPU, the time between bytes for such a client seems like an eternity! (Consider that a 200 Mhz CPU can perform roughly 50,000 operations for each byte received from such a client).

A simple metaphor for a 'forking' server is that of a supermarket: for every 'customer' being processed [at a cash register], another 'person' must be created to handle each client session. But what if your checkout clerks were so fast they could each individually handle hundreds of customers per second? Since these clerks are almost always waiting for a customer to come through their line, you have a very large staff, sitting around idle 99.9% of the time! Why not replace this staff with a single super-clerk ?

This is exactly how Medusa works!

Why is it Better?

Performance

The most obvious advantage to a single long-running server process is a dramatic improvement in performance. There are two types of overhead involved in the forking model:

Process creation/destruction.
Starting up a new process is an expensive operation on any operating system. Virtual memory must be allocated, libraries must be initialized, and the operating system now has yet another task to keep track of. This start-up cost is so high that it is actually noticeable to people! For example, the first time you pull up a web page with 15 inline images, while you are waiting for the page to load you may have created and destroyed at least 16 processes on the web server.
Virtual Memory
Each process also requires a certain amount of virtual memory space to be allocated on its behalf. Even though most operating systems implement a 'copy-on-write' strategy that makes this much less costly than it could be, the end result is still very wasteful. A 100-user FTP server can still easily require hundreds of megabytes of real memory in order to avoid thrashing (excess paging activity due to lack of real memory).

Medusa eliminates both types of overhead. Running as a single process, there is no per-client creation/destruction overhead. This means each client request is answered very quickly. And virtual memory requirements are lowered dramatically. Memory requirements can even be controlled with more precision in order to gain the highest performance possible for a particular machine configuration.

Persistence

Another major advantage to the single-process model is persistence. Often it is necessary to maintain some sort of state information that is available to each and every client, i.e., a database connection or file pointer. Forking-model servers that need such shared state must arrange some method of getting it - usually via an IPC (inter-process communication) mechanism such as sockets or named pipes. IPC itself adds yet another significant and needless overhead - single-process servers can share such information within a single address space.

Implementing persistence in Medusa is easy - the address space of its process (and thus its open database handles, variables, etc...) is available to each and every client.

Not a Strawman

Alright, at this point many of my readers will say I'm beating up on a strawman. In fact, they will say, such server architectures are already available - namely Microsoft's Internet Information Server. IIS avoids the above-named problems by using threads. Threads are 'lightweight processes' - they represent multiple concurrent execution paths within a single address space. Threads solve many of the problems mentioned above, but also create new ones:

Multi-threaded programs are notoriously difficult to write and debug - especially with servers that want to utilize persistent shared data - great care must be taken when accessing or modifying shared resources. This translates into larger and more complex code.
There is still additional system overhead when using threads. Each thread requires its own stack, and must be managed by a preemptive scheduler.

Threads are required in only a limited number of situations. In many cases where threads seem appropriate, an asynchronous solution can actually be written with less work, and will perform better. Avoiding the use of threads also makes access to shared resources (like database connections) easier to manage, since multi-user locking is not necessary.

Note: In the rare case where threads are actually necessary, Medusa can of course use them, if the host operating system supports them.

Another solution (used by many current HTTP servers on Unix) is to 'pre-spawn' a large number of processes - clients are attached to each server in turn. Although this alleviates the performance problem up to that number of users, it still does not scale well. To reliably and efficiently handle [n] users, [n] processes are still necessary.

Other Advantages

Extensibility
Since Medusa is written in Python, it is easily extensible. No separate compilation is necessary. New facilities can be loaded and unloaded into the server without any recompilation or linking, even while the server is running. [For example, Medusa can be configured to automatically upgrade itself to the latest version every so often].
Security
Many of the most popular security holes (popular, at least, among the mischievous) exploit the fact that servers are usually written in a low-level language. Unless such languages are used with extreme care, weaknesses can be introduced that are very difficult to predict and control. One of the favorite loop-holes is the 'memory buffer overflow', used by the Internet Worm (and many others) to gain unwarranted access to Internet servers.

Such problems are virtually non-existent when working in a high-level language like Python, where for example all access to variables and their components are checked at run-time for valid range operations. Even unforseen errors and operating system bugs can be caught - Python includes a full exception-handling system which promotes the construction of 'highly available' servers. Rather than crashing the entire server, Medusa will often inform the user, log the error, and keep right on running.

Current Features

The currently available version of Medusa includes integrated World Wide Web (HTTP) and file transfer (FTP) servers. This combined server can solve a major performance problem at any high-load site, by replacing two forking servers with a single non-forking, non-threading server. Multiple servers of each type can also be instantiated.
Also included is a secure 'remote-control' capability, called a monitor server. With this server enabled, authorized users can 'log in' to the running server, and control, manipulate, and examine the server while it is running .
Several extensions are available for the HTTP server, and more will become available over time. Each of these extensions can be loaded/unloaded into the server dynamically.
An API is evolving for users to extend not just the HTTP server but Medusa as a whole, mixing in other server types and new capabilities into existing servers. I am actively encouraging other developers to produce (and if they wish, to market) Medusa extensions. The underlying socket library (and thus the core networking technology of Medusa) is very stable, and has been running virtually unchanged since 1995.

Where Can I Get It?

Medusa is available from http://www.nightmare.com/medusa

Feedback, both positive and negative, is much appreciated; please send email to rushing@nightmare.com.