What are the classic papers in distributed systems?

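Thanks for the invite! Happy May Day!
In the Internet era, and especially since the arrival of big data, distributed systems have become a must-have skill for every programmer. Excellent research and papers on distributed systems have been appearing since the 1980s; here I list only the 15 papers from roughly the last 15 years that I consider to have had major impact (15 within 15).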
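1. The Google File System: An epoch-making paper for distributed file systems. Its multi-replica mechanism, separation of control flow from data flow, and append-oriented write model have become close to standard in the field; its 5000+ citations give a sense of how far its influence reaches. Apache Hadoop's well-known HDFS is modeled on GFS;
2. MapReduce: Simplified Data Processing on Large Clusters: Another major work from Google. With just two operations, Map and Reduce, it greatly reduces the complexity of distributed computation, so that any programmer who needs to can write a distributed program. The techniques it uses are well worth studying: simple, but not simplistic! Hadoop also built an open-source MapReduce based on this paper;
3. Bigtable: A Distributed Storage System for Structured Data: Google's distributed table system in the NoSQL space and one of the best examples of the LSM tree in use. It is widely used for web index storage, YouTube data management and other workloads. The corresponding open-source system in the Hadoop ecosystem is HBase (at my previous company I also developed a comparable system called BladeCube, with performance several times better than HBase);
4. The Chubby lock service for loosely-coupled distributed systems: Google's distributed lock service, based on the Paxos protocol. Fewer people probably know this paper than the first three, but almost every backend engineer has worked with its open-source counterpart, ZooKeeper, so its influence is really no less than theirs;
5. Finding a Needle in Haystack: Facebook's Photo Storage: Facebook's online photo storage system, currently one of the best solutions for small-file storage. Facebook now stores more than 300 PB of data in it. A senior schoolmate of mine works on this team and I have heard many interesting stories from him (at my previous company I developed a similar system, pallas, which supports not only replication but also Reed-Solomon/LRC coding, with a fair amount of performance optimization);
6. Windows Azure Storage: a highly available cloud storage service with strong consistency: An overview paper on Windows Azure storage and a very good description of a cloud storage architecture. Its idea of using layering to provide availability and consistency at the same time has given me a lot of inspiration in real work;
7. GraphLab: A New Framework for Parallel Machine Learning: CMU's graph-based distributed machine learning framework. A dedicated commercial company has since been founded around it, and it is very capable at distributed machine learning; its single-machine version, GraphChi, needs only 2~3 minutes for matrix factorization at the scale of millions of dimensions;
8. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing: This is essentially Spark, the most popular in-memory computing model of the last couple of years. RDDs and lineage greatly simplify the distributed computing framework; typically a few lines of Scala can do what used to take over a thousand lines of MapReduce code, and it looks set to replace MapReduce;
9. Scaling Distributed Machine Learning with the Parameter Server: A major work by Baidu's rising star Mu Li. Most companies doing large-scale distributed learning now mainly use a parameter server (ps). It scales well and makes large-scale distributed learning feasible in the big-data era; Google's deep learning models are also trained with a ps. It is currently the most popular distributed learning framework, and Douban's open-source system Paracel is also a ps implementation;
10. Dremel: Interactive Analysis of Web-Scale Datasets: Google's large-scale (near) real-time data analysis system, said to answer analysis requests over 1 PB of data within 3 seconds. Internally it uses a query tree to speed up analysis. Its open-source implementation is Drill, and it has been fairly influential in industry for real-time analytics;
11. Pregel: a system for large-scale graph processing: Google's large-scale graph computing system, for quite a long time the main system for computing Google's PageRank, and a big influence on open source (including GraphLab and GraphChi);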
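12. Spanner: Google's Globally-Distributed Database: Google's database, and the first distributed database that is global in the true sense. It describes many design considerations around consistency; for simplicity it even uses GPS and atomic clocks to bound clock uncertainty to a few milliseconds (the paper reports roughly 1 to 7 ms), which guarantees the time ordering of transactions. It is likewise highly instructive for distributed systems work;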
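13. Dynamo: Amazon’s Highly Available Key-value Store: Amazon's distributed NoSQL database, whose significance to Amazon is comparable to Bigtable's to Google. Unlike Bigtable, Dynamo guarantees the AP of CAP, with C weakly guaranteed through vector clocks; the corresponding open-source system is Cassandra;
14. S4: Distributed Stream Computing Platform: A stream computing system from Yahoo, one of the two most popular stream computing systems at present (the other being Storm), and Yahoo's main computing platform for advertising;
15. Storm @Twitter: Not much needs to be said about this system. It opened a new era of stream computing and is the first choice for stream computing at almost every company; definitely worth attention;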
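Over the past year or two I have focused mainly on machine learning and done less research on distributed systems, so for now I will list just these 15 papers, which cover the main areas of distributed systems. If I think of anything I missed I will add it. Good luck!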
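Item 2 in the list describes MapReduce as reducing a distributed computation to two operations, Map and Reduce. As a rough illustration, here is a minimal single-process sketch in Python of the classic word-count job. The names `map_word_count`, `reduce_word_count` and `run_mapreduce` are made up for this example and are not the API of Google's system or of Hadoop; a real framework would spread the map, shuffle and reduce steps across many machines and handle failures.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def reduce_word_count(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_mapreduce(
    inputs: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Hypothetical local driver: map every input, group by key, then reduce."""
    # Shuffle step: group intermediate values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    # Reduce step: one reducer call per distinct key.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


docs = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = run_mapreduce(docs, map_word_count, reduce_word_count)
print(word_counts)
```

The user only writes the two small functions; everything about grouping and scheduling lives in the driver, which is the part the real systems distribute.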

----------------------------------------------Divider-------------------------------------------------------
Two papers mentioned in the comments are also quite good, so I am adding them here.
1. Large-scale cluster management at Google with Borg;
2. F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business;

Edited on 2015-05-08

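1. Background knowledge
Architecture

Systems and networking
Communication: RPC, RMI, MOM...

Processes and threads:
User mode, kernel mode; lightweight processes; coroutines; Actor...

Distributed-systems problems
Synchronization and mutual exclusion: ensure that conflicting concurrent processes can share resources
Double-Checked Locking, Immutable Value, Future...

Event demultiplexing and dispatching: Reactor, Proactor...

Election: choose one process from a set of processes to perform a special task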

2. Distributed theory
Data structures
B-tree
log-structured merge tree (LSM tree)
Merkle tree
consistent hashing
DHT
vector clock
lock-free data structure
...
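
For the consistent hashing entry in the list above, a small sketch may make the idea concrete: keys and nodes are hashed onto the same ring, and each key belongs to the first node point that follows it. The class name `HashRing`, the helper `stable_hash`, the use of MD5 and the choice of 64 virtual nodes per server are all assumptions made for this illustration, not taken from any particular system.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Map a string to a large integer position on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._ring = []    # sorted list of (position, node)
        self._hashes = []  # positions only, kept in step with _ring for bisect
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `vnodes` points for this node on the ring."""
        for i in range(self._vnodes):
            self._ring.append((stable_hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    def lookup(self, key):
        """Return the node owning the first ring point after the key's position."""
        if not self._ring:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_right(self._hashes, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved")
```

The point of the exercise is the last few lines: when a node joins, the only keys that change owner are the ones the new node takes over, rather than nearly all keys as with `hash(key) % number_of_nodes`.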

CAP, BASE
CAP: Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
BASE: An ACID Alternative

State and ordering

Time, Clocks, and the Ordering of Events in a Distributed System

Virtual Time and Global States of Distributed Systems

Distributed Snapshots: Determining Global States of Distributed Systems

2PC, 3PC, Paxos...
A Brief History of Consensus, 2PC and Transaction Commit
Paxos Made Simple
Paxos Made Practical
Paxos Made Live: An Engineering Perspective
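
Before reading the papers in this group, it may help to see the shape of two-phase commit (2PC) in a few lines. The sketch below is only a toy under strong assumptions: `Participant` and `two_phase_commit` are hypothetical names, everything runs in one process, and there are no timeouts, crashes, or durable logs, which are exactly the hard parts the papers deal with.

```python
from typing import List


class Participant:
    """Hypothetical resource manager that can vote on and then apply a transaction."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        """Phase 1: vote yes (and promise to commit later) or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Coordinator logic: commit only if every participant voted yes."""
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (completion): broadcast the same decision to everyone.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


happy = [Participant("db-1"), Participant("db-2")]
committed = two_phase_commit(happy)

unhappy = [Participant("db-1"), Participant("db-2", can_commit=False)]
aborted = not two_phase_commit(unhappy)
print(committed, aborted)
```

Even this toy shows the key property: a single "no" vote in phase 1 forces every participant to abort, so the transaction is applied everywhere or nowhere.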

Consistency and transactions
Life beyond Distributed Transactions: an Apostate’s Opinion
Impossibility of distributed consensus with one faulty process.
Consensus on Transaction Commit.
Uniform consensus is harder than consensus

3. Distributed systems
Distributed infrastructure
Message queues
RabbitMQ, ZeroMQ...

Distributed lock services and coordination
The Chubby lock service for loosely-coupled distributed systems
ZooKeeper

Cluster monitoring

The ganglia distributed monitoring system: design, implementation, and experience

Chukwa: A large-scale monitoring system

Distributed storage systems
Distributed file systems
The Google file system
Lustre
Ceph
Panasas

Distributed block storage
Sheepdog
Parallax
Petal

Distributed key-value stores
Dynamo: Amazon’s highly available key-value store

Distributed table systems
Amazon DynamoDB
Bigtable: A Distributed Storage System for Structured Data

Distributed databases
Spanner: Google's Globally-Distributed Database

Distributed computing
MapReduce
MapReduce: Simplified Data Processing on Large Clusters

In-memory computing
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Stream computing
S4: Distributed Stream Computing Platform
Twitter Storm

Graph computing
GraphLab: A New Framework for Parallel Machine Learning
Pregel: a system for large-scale graph processing

4. Distributed applications
Images, video, etc.
Finding a Needle in Haystack: Facebook's Photo Storage

Search
Web search for a planet: The Google cluster architecture

IM

Edited on 2015-05-02

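Distributed systems is a very large field with many sub-areas.
Since you are already at the point of reading papers, you presumably have some background.

The University of Illinois course Advanced Distributed Systems lists the important papers for each area (updated Spring 2015), which you can use as a reference (I list only the main papers; you can look up the optional ones yourself)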
https://courses.engr.illinois.edu/cs525/sched.htm

Before, There Were Clouds

Historical reflections: The rise, fall, and resurrection of software as a service, M. Campbell-Kelly, CACM, May 2009.
Above the clouds (see the latest version of the paper on the site), M. Armbrust et al, Berkeley RADLAB, 2009.

Larry Ellison's Rant on Cloud Computing (YouTube video)

You can join the Googlegroups on Cloud Computing

Cloud Computing Continued

MapReduce: Simplified Data Processing on Large Clusters, J. Dean et al, OSDI 2004 (Google)
Grid: a new infrastructure for 21st century science, I. Foster, Physics Today, 2002 (Argonne)

P2P Systems

The Gnutella protocol specification v 0.4

P2P Systems (contd.)

Chord: a scalable peer-to-peer lookup service for Internet applications, I. Stoica et al, SIGCOMM 2001
Pastry: scalable, distributed object location and routing for large-scale peer-to-peer systems, A. Rowstron et al, Middleware 2001.
Kelips, I. Gupta et al, IPTPS 2003

Key-value Stores and NoSQL

Others: MongoDB

Basic Distributed Algorithms Fundamentals and Sensor Networks

Time, clocks and the ordering of events in a distributed system, L. Lamport, Communications ACM 1978
Distributed snapshots: determining global states of distributed systems, Chandy and Lamport, ACM TOCS 1985
Impossibility of distributed consensus with one faulty process, Fischer, Lynch and Paterson, Journal ACM 1985

Paxos and Committing
Please don't review the first paper

(Indy will briefly present this paper) Paxos Made Simple, L. Lamport.

Paxos Quorum Leases: Fast Reads Without Sacrificing Writes, Iulian Moraru, David G. Andersen, Michael Kaminsky, SoCC 2014
Low-latency multi-datacenter databases using replicated commit, H. Mahmoud et al, VLDB 2013.

Cloud Programming

Hive - a warehousing solution over a map-reduce framework A. Thusoo et al, VLDB 2009
Storm (use the wiki or other web resources)

Stream Processing

Adaptive Stream Processing using Dynamic Batch Sizing, Tathagata Das, Yuan Zhong, Ion Stoica, Scott Shenker, SoCC 2014
Stream: The Stanford data stream management system,A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, J. Widom, Technical Report, Stanford University, 2004.

Somewhat Consistent

GentleRain: Cheap and Scalable Causal Consistency with Physical Clocks, Jiaqing Du, Calin Iorgulescu, Amitabha Roy, Willy Zwaenepoel, SoCC 2014
A Self-Configurable Geo-Replicated Cloud Storage System, Masoud Saeida Ardekani, and Douglas B. Terry,OSDI 2014

Litmus Tests

Salt: Combining ACID and BASE in a Distributed Database, Chao Xie, Chunzhi Su, Manos Kapritsos, Yang Wang, Navid Yaghmazadeh, Lorenzo Alvisi, and Prince Mahajan, OSDI 2014
Extracting More Concurrency from Distributed Transactions, Shuai Mu, Yang Cui, Yang Zhang, Wyatt Lloyd, Jinyang Li, OSDI 2014

Adaptivity

Starfish: a self-tuning system for big data analytics, H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, S. Babu, CIDR 2011.
Distributed Autonomous Virtual Resource Management in Datacenters Using Finite-Markov Decision Process, Liuhua Chen, Haiying Shen, Karan Sapra, SoCC 2014

Blowing Hot and Cold: Storage

f4: Facebook’s Warm BLOB Storage System, Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, Sanjeev Kumar, OSDI 2014
Pelican: A Building Block for Exascale Cold Data Storage, Shobana Balakrishnan, Richard Black, Austin Donnelly, Paul England, Adam Glass, Dave Harper, and Sergey Legtchenko, Aaron Ogus, Eric Peterson and Antony Rowstron, OSDI 2014.

Reliability

Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems, T. Do et al, SOCC 2013
Heading Off Correlated Failures through Independence-as-a-Service, Ennan Zhai, Ruichuan Chen, David Isaac Wolinsky, Bryan Ford, OSDI 2014.

A Touch of Sensor Nets

Directed diffusion: A scalable and robust communication paradigm for sensor networks, C. Intanagonwiwat et al, Mobicom 2000
A review of current routing protocols for ad hoc mobile wireless networks, E.M. Royer et al, IEEE Personal Communications 1999

Graph Processing

LFGraph: Simple and Fast Distributed Graph Analytics, I. Hoque, I. Gupta, TRIOS 2013
GraphX: Graph Processing in a Distributed Dataflow Framework, Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, Ion Stoica, OSDI 2014.

Latency is King

Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency, Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, Steven D. Gribble, SoCC 2014
PriorityMeister: Tail Latency QoS for Shared Networked Storage, Timothy Zhu, Alexey Tumanov, Michael A. Kozuch, Mor Harchol-Balter, Gregory R. Ganger, SoCC 2014

There's a P2P App for That

Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility, A. Rowstron et al, SOSP 2001
Ivy: A Read/Write Peer-to-Peer File System, Athicha Muthitacharoen, Robert Morris, Thomer M. Gil, and Benjie Chen, OSDI 2002

Process it In-network

TAG: A Tiny Aggregation service for ad-hoc sensor networks, S. Madden, et al, OSDI 2002
Synopsis diffusion for robust aggregation in sensor networks, S. Nath et al, ACM TOSN, 2008.

How does it Really Behave?

Measurement, modeling, and analysis of a peer-to-peer file-sharing workload, Krishna P. Gummadi et al, SOSP 2003
Understanding availability, R. Bhagwan et al, IPTPS 2003
Measurement and Modeling of a Large-scale Overlay for Multimedia Streaming, L. Vu, I. Gupta, J. Liang, K. Nahrstedt, QShine 2007
An Evaluation of Amazon's Grid Computing Services: EC2, S3 and SQS, Simson Garfinkel, Harvard TechRep., 2007
What do Real-Life Hadoop Workloads Look Like, Cloudera Blog

Low Fees Required - Probabilistic Membership

A gossip-based failure detection service, R. van Renesse et al, Middleware 1998
SWIM: Scalable Weakly-consistent Infection-style process group Membership protocol, A. Das et al, DSN 2002
On scalable and efficient distributed failure detectors, I. Gupta et al, PODC 2001

Cluster Scheduling

The Power of Choice in Data-Aware Cluster Scheduling, Shivaram Venkataraman, Aurojit Panda, Ganesh Ananthanarayanan, Michael J. Franklin, Ion Stoica, OSDI 2014.
Reservation-based scheduling: if you're late don't blame us! Carlo Curino, Djellel Difallah, Chris Douglas, Subru Krishnan, Raghu Ramakrishnan, Sriram Rao, SoCC 2014

Distributed Machine Learning

Project Adam: Building an Efficient and Scalable Deep Learning Training System, Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman, OSDI 2014
Scaling Distributed Machine Learning with the Parameter Server, Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, Bor-Yiing Su, OSDI 2014

Now Emerging

Apache Hadoop YARN: Yet Another Resource Negotiator, V. K. Vavilapalli, A. C Murthy et al, SOCC 2013
C-Hint: An Effective and Reliable Cache Management for RDMA-Accelerated Key-Value Stores, Yandong Wang, Xiaoqiao Meng, Li Zhang, Jian Tan, SoCC 2014

So Much Data!

What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, Anang D. Satria, SoCC 2014
Heterogeneity and dynamicity of clouds at scale: Google trace analysis, C. Reiss et al, SoCC 2012

Spreading the Rumor

Bimodal multicast, K Birman et al, ACM TOCS 1999
Epidemic algorithms for replicated database maintenance, A. Demers et al, PODC 1987.

How do Networks Look?

Exploring complex networks, Steven Strogatz, Nature 2001
Scaling properties of the Internet graph, A. Akella et al, PODC 2003
Mapping the Gnutella network, M. Ripeanu et al, IEEE Computing Journal 2002

Edited on 2015-11-04

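"Classic" means proven by time.
The top-ranked answer lists its author's own picks of classics from the last 15 years, and many of them were published after 2010. Admittedly these are hot topics in distributed systems right now, but I think only a small portion of them (1, 2, 3, 12, 13) can be called classics. I would not say the other papers are poorly written, but in my view they fall somewhat short of classic.

You read the classics to grasp the most fundamental ideas of a field: to know not only what works but also why. Take Chubby: before reading an implementation, shouldn't you first look at what the Paxos algorithm itself is?

In fact, graduate distributed systems courses at the better US universities generally have a reading list, and those lists are more or less the classics.
For example CMU's: 15-712 Syllabus. If you had to pick the 30 most classic distributed systems papers from the 1970s to today, these would roughly be them. You will see that many of the papers there are very old. Why still read them? Because their ideas have been inherited, and these papers help you understand the why. Of course, some of them have also been mentioned by other answerers (for example Leslie Lamport's Paxos).

For those interested in databases, take a look at this: Reading List // 15-799 :: Advanced Topics in Database Systems (Fall 2013). Reynold Xin maintains this one: rxin/db-readings · GitHub

Posted on 2015-08-08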

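I feel that distributed systems does not really have a very clear knowledge map; it is more that people ran into different problems and came up with different solutions. So it is hard to name papers that are truly classic and foundational. As it happens, this semester I took Introduction to Distributed Systems taught by Prof. Chen Kang in our department; each class covered one or two papers, which was very interesting and rewarding. So here I share the papers covered in the course. They may not be exactly what the asker wants and are for reference only.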
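1. GFS. One of Google's "troika"; a distributed file system. Without question this is probably the most classic paper in distributed systems; almost every topic related to distributed systems, storage and big data has to mention it.
2. BigTable. One of Google's troika; a classic distributed key/value store. My personal understanding is that this kind of application is a simplified database. Its implementation resembles the multi-level page tables of an operating system.
3. Dynamo. Dynamo is a distributed key/value store developed by Amazon, but from design to properties it is very different from Bigtable. It is a well-known application of consistent hashing, the technique behind DHTs (distributed hash tables), which lowers the data-migration cost when nodes are added to or removed from the system.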
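4. MapReduce. A distributed computing framework, and also one of Google's troika. It abstracts all distributed operations into two kinds, Map and Reduce, which makes programming very simple: you only need to implement those two interfaces. This is probably the earliest highly influential distributed computing framework, freeing programmers from hand-writing raw MPI programs.
5. Spark. A distributed computing framework. It is also a darling of the big-data era; it and MapReduce are probably the two most widely used computing frameworks. MapReduce goes through disk on every iteration while Spark works in memory, so it can be up to two orders of magnitude faster.
6. Dryad. A distributed computing framework from Microsoft. It was proposed very early, but unfortunately it is less influential than the previous two. Its interface abstracts a distributed computation as a directed acyclic graph; the programmer implements the computation at each vertex and the data transfer along each edge. It is more complex than MapReduce, but also more flexible.
7. Raft. Raft is a consensus protocol proposed in 2014, intended to replace Paxos because the latter is so complex and hard to understand (Lamport would say you are all amateurs). A classic model in distributed systems is the replicated state machine, and Raft is used to keep the replicas of that state machine consistent. The MIT 6.824 course labs (6.824 Home Page: Spring 2016) are essentially built around Raft.
8. Time and vector clocks. It is hard to find a global time in a distributed system, because the clocks of individual machines disagree. Lamport therefore introduced logical clocks, which later work extended into vector clocks, to express the relative order of events in a distributed system.
9. Distributed Snapshot. This is also a very classic distributed problem: when a distributed system takes a snapshot, the machines are not synchronized and some messages are still in flight on the network, so obtaining a correct snapshot is hard. This paper gives a solution that can be proven correct.
10. Concurrency Control & Transaction. Strictly speaking this is not a paper but a book, Concurrency Control and Recovery in Database Systems (by Bernstein, Hadzilacos and Goodman, available from Microsoft Research), yet it has over 5000 citations. It explains from a theoretical standpoint what a transaction is and how to guarantee serializability and recoverability of transactions.
11. Two-phase locking. This is also from the book above. 2PL is a protocol; following it guarantees that transactions are serializable, so that several transactions operating at the same time cannot produce an incorrect result.
12. OCC, optimistic concurrency control. How to process multiple transactions concurrently and correctly without taking locks. Also a classic of the database field.
13. Two-phase commit. In a distributed transaction setting, this protocol ensures that the transactions on multiple machines either all commit or all fail.
14. Byzantine fault tolerance. Raft and Paxos assume that all machines run the correct logic and may merely fail; Byzantine algorithms assume that some machines may be hijacked and deliberately disrupt normal operation. They solve the consensus problem in that environment.
15. Memory Coherence in Shared Virtual Memory Systems. This belongs to the area of distributed consistency, but the application is distributed shared memory: a uniform interface lets all machines see the same memory space, while underneath there is a mapping from virtual memory to physical memory. The main concern is consistency across machines; this paper uses sequential consistency.
16. Lazy release consistency for software distributed shared memory. The same problem as above, distributed shared memory, except that it uses release consistency.
17. Bayou. A system for booking meeting rooms from mobile devices. Using this system as an example, it realizes a very important concept in distributed systems: eventual consistency. Some things we run into in daily life, such as different people seeing a WeChat conversation in different orders, are quite possibly caused by eventual consistency.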
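
To make the vector clock item in the list above more concrete, here is a minimal sketch in Python. The function names (`tick`, `merge`, `happened_before`, `concurrent`) and the node names "A" and "B" are assumptions for this example only; real systems add concerns such as pruning entries for departed nodes.

```python
from typing import Dict

Clock = Dict[str, int]


def tick(clock: Clock, node: str) -> Clock:
    """Local event on `node`: increment that node's own counter."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(local: Clock, received: Clock, node: str) -> Clock:
    """Message receipt: take the element-wise maximum, then count the receive event."""
    merged = {n: max(local.get(n, 0), received.get(n, 0)) for n in set(local) | set(received)}
    return tick(merged, node)


def happened_before(a: Clock, b: Clock) -> bool:
    """True if a <= b on every node and a < b on at least one node."""
    nodes = set(a) | set(b)
    return all(a.get(n, 0) <= b.get(n, 0) for n in nodes) and any(
        a.get(n, 0) < b.get(n, 0) for n in nodes
    )


def concurrent(a: Clock, b: Clock) -> bool:
    """Two distinct events are concurrent when neither happened before the other."""
    differ = any(a.get(n, 0) != b.get(n, 0) for n in set(a) | set(b))
    return differ and not happened_before(a, b) and not happened_before(b, a)


a1 = tick({}, "A")        # event on A, whose clock is then sent to B
b1 = tick({}, "B")        # independent event on B
b2 = merge(b1, a1, "B")   # B receives A's message
print(a1, b1, b2)
```

In this run `a1` happened before `b2` because B learned of it through the message, while `a1` and `b1` are concurrent: neither machine knew about the other's event, and no wall-clock time was needed to say so.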
