MapReduce

function map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  sum = 0
  for each pc in partialCounts:
    sum += pc
  emit (word, sum)

 

The prototypical MapReduce example, shown above, counts the appearances of each word in a set of documents.[14]

 

Here, each document is split into words, and each word is counted by the map function, using the word as the result key. The framework groups together all the pairs with the same key and feeds them to the same call to reduce. Thus, this function just needs to sum all of its input values to find the total appearances of that word.
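The word-count pseudocode can be sketched as a minimal single-process Python program; the shuffle step that the framework normally performs is simulated with a dictionary, and the `documents` input is made up for illustration.

```python
# Single-process sketch of the word-count map/reduce example.
from collections import defaultdict

def map_fn(name, document):
    # Emit (word, 1) for every word in the document.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    # Sum the partial counts for one word.
    return (word, sum(partial_counts))

def map_reduce(documents):
    # "Shuffle": group all emitted pairs by key, as the framework would.
    groups = defaultdict(list)
    for name, document in documents.items():
        for word, count in map_fn(name, document):
            groups[word].append(count)
    return dict(reduce_fn(w, counts) for w, counts in groups.items())

documents = {"d1": "the cat sat", "d2": "the cat ran"}  # made-up inputs
print(map_reduce(documents))  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

In a real framework the grouping dictionary is replaced by a distributed sort across machines; the user supplies only `map_fn` and `reduce_fn`.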

 

 SELECT age, AVG(contacts)
    FROM social.person
GROUP BY age
ORDER BY age

 

function Map is
    input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
    for each social.person record in the K1 batch do
        let Y be the person's age
        let N be the number of contacts the person has
        produce one output record (Y,(N,1))
    repeat
end function

function Reduce is
    input: age (in years) Y
    for each input record (Y,(N,C)) do
        Accumulate in S the sum of N*C
        Accumulate in Cnew the sum of C
    repeat
    let A be S/Cnew
    produce one output record (Y,(A,Cnew))
end function
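The Map and Reduce pseudocode above can be sketched in Python on a tiny made-up dataset; batching into K1 groups of a million records is omitted, and the shuffle is again simulated with a dictionary.

```python
# Single-process sketch of the age/contacts-average pseudocode above.
from collections import defaultdict

def map_fn(people):
    # Emit (age, (contacts, 1)) per person, mirroring "produce (Y,(N,1))".
    for age, contacts in people:
        yield (age, (contacts, 1))

def reduce_fn(age, records):
    # S = sum of N*C, Cnew = sum of C; output (age, (average, Cnew)).
    s = sum(n * c for n, c in records)
    c_new = sum(c for _, c in records)
    return (age, (s / c_new, c_new))

# Made-up data matching the worked example below: six ten-year-olds.
people = [(10, 9), (10, 9), (10, 9), (10, 9), (10, 9), (10, 10)]
groups = defaultdict(list)
for age, rec in map_fn(people):
    groups[age].append(rec)
results = dict(reduce_fn(a, recs) for a, recs in groups.items())
print(results)  # {10: (9.166666666666666, 6)}
```

Carrying the count `Cnew` alongside the average is what lets partial reduce outputs be combined correctly later, as the worked example below shows.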

 

-- map output #1: age, quantity of contacts
10, 9
10, 9
10, 9

 

-- map output #2: age, quantity of contacts
10, 9
10, 9

 

-- map output #3: age, quantity of contacts
10, 10

 

 

-- reduce step #1 (applied to map outputs #1 and #2 only): age, average of contacts, count
10, 9, 5

 

Computing the final average directly from all six map records gives (9*3+9*2+10*1)/(3+2+1) ≈ 9.166.

 

Using the intermediate reduce output (average 9 over 5 records) together with map output #3 gives the same result: (9*5+10*1)/(5+1) ≈ 9.166.
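The arithmetic can be checked directly: the count-weighted combination of the partial reduce output must equal the average over all six records.

```python
# Check that averaging via the intermediate reduce (weighted by its count)
# matches the direct average over all six map records.
direct = (9*3 + 9*2 + 10*1) / (3 + 2 + 1)   # all six map outputs at once
via_partial = (9*5 + 10*1) / (5 + 1)        # reduce #1 gave average 9, count 5
print(direct, via_partial)  # 9.166666666666666 9.166666666666666
```

Note that naively averaging the averages, (9 + 10) / 2 = 9.5, would be wrong; the count must be carried along as a weight.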

 

As a second example, imagine that, for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has, according to age; the SQL query and the Map/Reduce pseudocode above both express this computation.

 

 

Dataflow

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

  • an input reader
  • a Map function
  • a partition function
  • a compare function
  • a Reduce function
  • an output writer
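The six hot spots above can be wired together in a toy in-process engine; every name here is illustrative, and the distributed sort is reduced to an in-memory sort per partition.

```python
# Toy engine wiring the user-definable pieces: reader, map, partition,
# compare (sort key), reduce, writer. All names are illustrative.
from collections import defaultdict

def run(input_reader, map_fn, partition_fn, compare_key, reduce_fn,
        output_writer, num_partitions):
    # Map phase: read input splits and emit key/value pairs.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in input_reader():
        for k, v in map_fn(key, value):
            # The partition function routes each key to one reducer.
            partitions[partition_fn(k, num_partitions)][k].append(v)
    # Sort (the "frozen" distributed sort, here in-process) and reduce.
    out = []
    for part in partitions:
        for k in sorted(part, key=compare_key):
            out.append(reduce_fn(k, part[k]))
    return output_writer(out)

result = run(
    input_reader=lambda: [("doc1", "b a b")],
    map_fn=lambda name, text: [(w, 1) for w in text.split()],
    partition_fn=lambda key, n: hash(key) % n,
    compare_key=lambda key: key,
    reduce_fn=lambda key, values: (key, sum(values)),
    output_writer=lambda pairs: dict(pairs),
    num_partitions=2,
)
print(result)
```

Swapping any one lambda changes the behavior of that hot spot without touching the sort-and-group machinery, which is the point of the design.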

 

 

 

 


Source: https://zh.wikipedia.org/wiki/MapReduce

^ "Our inspiration comes from the ancient map and reduce operations found in Lisp and other functional programming languages." - "MapReduce: Simplified Data Processing on Large Clusters"

MapReduce is a software architecture proposed by Google for the parallel processing of large-scale data sets (greater than 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, along with features borrowed from vector programming languages.[1]

The current software implementation specifies a Map function that maps one set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function that ensures that all of the mapped key-value pairs sharing the same key are grouped together.

 

Map and Reduce

Simply put, a map function performs a specified operation on each element of a conceptual list of independent elements (for example, a list of test scores). Suppose someone discovers that all the students' scores were overestimated by one point; they can define a "subtract one" map function to correct the error. In fact, each element is operated on independently, and the original list is not modified, because a new list is created to hold the new answers. This means that a Map operation is highly parallelizable, which is very useful for applications with high performance requirements and for the field of parallel computing.
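The "subtract one" correction described in the paragraph above is a one-liner: each score is adjusted independently, and a new list is built while the original is left intact.

```python
# Independent per-element correction: subtract one from every score.
scores = [71, 84, 93]                       # made-up test scores
corrected = list(map(lambda s: s - 1, scores))
print(corrected)   # [70, 83, 92]
print(scores)      # original list unchanged: [71, 84, 93]
```

Because no element's result depends on any other element, the per-element work could be farmed out to any number of workers.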

A reduce operation, by contrast, merges the elements of a list appropriately. Continuing the example above: if someone wants to know the class average, they can define a reduce function that halves the list by adding each odd-indexed element to its neighboring even-indexed element, recursing until only one element remains, and then dividing that element by the number of students to obtain the average. Although it is not as parallel as the map function, because a reduce always has a simple answer and the large-scale computations are relatively independent, the reduce function is also quite useful in highly parallel environments.
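The halving reduction just described can be sketched directly: adjacent elements are summed in rounds until one total remains, then divided by the class size. The scores here are made up.

```python
# Pairwise (tree) reduction: sum adjacent elements until one value remains.
def pairwise_sum(xs):
    # Each round is itself parallelizable: the pair additions are independent.
    while len(xs) > 1:
        paired = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:                 # odd leftover element carries over
            paired.append(xs[-1])
        xs = paired
    return xs[0]

scores = [70, 80, 90, 100]              # made-up class scores
average = pairwise_sum(scores) / len(scores)
print(average)  # 85.0
```

A list of n elements needs only about log2(n) rounds, which is why this shape of reduction still benefits from parallel hardware.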

 

Distribution and reliability

MapReduce achieves reliability by distributing large-scale operations on the data set across every node in the network; each node periodically reports back the work it has completed and its status updates. If a node stays silent for longer than a preset interval, the master node (similar to the master server in the Google File System) records that node as dead and sends the data assigned to that node to other nodes. Each operation uses atomic operations on named files to ensure that conflicts between parallel threads do not occur; when files are renamed, the system may also copy them to a name other than the task name (to avoid side effects).

The reduce operation works in much the same way, but because reduce parallelizes less well, the master node tries to schedule reduce operations on the same node as the data to be operated on, or as close to that data as possible; this property suits Google's needs, since they have sufficient bandwidth and their internal network does not have that many machines.

 

 

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.[1][2]

A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

The model is a specialization of the split-apply-combine strategy for data analysis.[3] It is inspired by the map and reduce functions commonly used in functional programming,[4] although their purpose in the MapReduce framework is not the same as in their original forms.[5] The key contributions of the MapReduce framework are not the actual map and reduce functions (which, for example, resemble the 1995 Message Passing Interface standard's[6] reduce[7] and scatter[8] operations), but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine. As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations.[9] The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.[10]

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized. By 2014, Google was no longer using MapReduce as their primary Big Data processing model,[11] and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.[12]
