Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems.
管理网络中跨多台计算机存储的文件系统被称之为分布式文件系统。node
由于是以网络为基础,也就引入了网络编程的复杂性,所以使得分布式文件系统比普通的磁盘文件系统更加复杂。apache
1) The Design of HDFS
a) HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
HDFS被设计成以流数据訪问模式来进行存储超大型文件。执行在商业硬件集群上。
b) HDFS is built around the idea that the most efficient data processing pattern is a writeonce,read-many-times pattern. A dataset is typically generated or copied from source,and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
HDFS创建在一次写入,屡次读取这样一个最高效的数据处理模式的理念之上。数据集一般有数据源生成或者从数据源复制而来,接着在此数据集上进行长时间的数据分析操做。每一次分析都会涉及到一大部分数据,甚至整个数据集,所以读取整个数据集的时间比读取第一条记录的延迟更重要。编程
c) Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors)for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
hadoop不需要昂贵的、高可靠性的硬件,hadoop执行在商业硬件集群上(普通硬件可以从各类供应商来得到),所以整个集群节点发生问题的机会是很是高的。至少是对于大集群而言。
2) HDFS Concepts
a) A disk has a block size, which is the minimum amount of data that it can read or write.Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes.
每一个磁盘都有一个默认的数据块大小,这是磁盘可以读写的最小数据量。单个磁盘文件管理系统构建于处理磁盘块数据之上,它的大小是磁盘块大小的整数倍。磁盘文件系统的块大小通常是几KB,然而磁盘块大小通常是512字节。缓存
b) Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.
跟单个磁盘文件系统不一样的是,HDFS中比磁盘块小的文件不会沾满整个块的潜在存储空间。markdown
c) HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. If the block is large enough, the time it takes to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus, transferring a large file made of multiple blocks operates at the disk transfer rate.
HDFS的块比磁盘块大,这样会最小化搜索代价。假设块足够大,那么从磁盘数据传输的时间就会比搜寻块開始位置的时间长得多,所以,传输由多个块组成的大文件取决于磁盘传输速率
d) Having a block abstraction for a distributed filesystem brings several benefits. The first benefit is the most obvious: a file can be larger than any single disk in the network. Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem. Furthermore, blocks fit well with replication for providing fault tolerance and availability.
对分布式文件系统的块进行抽象可以带来几个优势 。网络
首先一个显而易见的是。一个文件的大小可以大于网络中不论什么一个磁盘的容量。app
其次,用抽象块而不是整个文件可以使得存储子系统获得简化。最后,抽象块很是适合于备份,这样可以提升容错性和有用性。
e) An HDFS cluster has two types of nodes operating in a master−worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace.
在主机-从机模式下工做的HDFS集群有两种类型的节点可以操做:一个namenode(主机上)和若干个datanode(从机上)。namenode管理整个文件系统的命名空间。
f) Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
datanode是文件系统的直接工做点。它们存储和检索数据块(受client或者namenode通知),并且周期性的向namenode报告它们所存储的块列表信息。
g) For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first way is to back up the files that make up the persistent state of the filesystem metadata.
基于这个缘由。确保namenode对故障的弹性机制很是重要,为此,hadoop提供了两种机制。第一种机制是备份那些由文件系统元数据持久状态组成的文件。
h) It is also possible to run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.
还有一种可能的方法是执行一个辅助的namenode,虽然它不会被用做namenode。分布式
它的主要角色是经过可编辑的日志周期性的融合命名空间镜像,以防止可编辑日志过大。ide
i) However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary.
而后,辅助namenode中的状态老是滞后于主节点,所以,在主节点的整个故障事件中。数据丢失差点儿是确定的。在这样的状况下。一般的作法是把存储在NFS上的元数据文件复制到辅助namenode中,并且做为一个新的主节点namenode来执行。
j) Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be explicitly cached in the datanode’s memory, in an off-heap block cache.
一般一个节点会从磁盘中读取块数据,但是对已频繁訪问的文件,其块数据可能会缓存在节点的内存中。一个非堆形式的块缓存。
k) HDFS federation,introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace.
HDFS中的federation,是在2.X系列公布中引进的,它赞成一个集群经过添加namenode节点来扩展。每一个namenode节点管理文件系统命名空间中的一部分。
l) Hadoop 2 remedied this situation by adding support for HDFS high availability (HA). In this implementation, there are a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption. A few architectural changes are needed to allow this to happen:
hadoop 2 经过添加对HA的支持纠正了这样的状况。在这样的实现方式中,将会有2个namenode实现双机热备份。oop
在发生主活动节点故障的时候,备份主节点就可以在不发生明显的中断的状况下接管继续响应client请求的责任。下面这些结构性的变化是赞成发生的:
m) The namenodes must use highly available shared storage to share the edit log. Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk. Clients must be configured to handle namenode failover, using a mechanism that is transparent to users. The secondary namenode’s role is subsumed by the standby, which takes periodic checkpoints of the active namenode’s namespace.
namenode必须使用高有用性的共享存储来实现可编辑日志的共享,由于块映射信息是存储在namenode的内存中,而不是磁盘上,因此datanode必须发送块信息报告至双机热备份的namenode,client必须进行故障切换的操做配置。这个可以经过一个对用户透明的机制来实现。辅助节点的角色经过备份被包括进来,其含有活动主节点命名空间周期性检查点信息。
n) There are two choices for the highly available shared storage: an NFS filer, or a quorum journal manager (QJM). The QJM is a dedicated HDFS implementation, designed for the sole purpose of providing a highly available edit log, and is the recommended choice for most HDFS installations.
对于高有用性共享存储有两种选择:NFS文件。QJM(quorum journal manager)。QJM专一于HDFS的实现。其惟一目的就是提供一个高有用性的可编辑日志。也是大可能是HDFS安装时所推荐的。
o) If the active namenode fails, the standby can take over very quickly (in a few tens of seconds) because it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping.
假设活动namenode发生问题 。备份节点会迅速接管任务(在数秒内)。由于在内存中备份节点有最新的可用状态,包括最新的可编辑日志记录和块映射信息。
p) The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. There are various failover controllers, but the default implementation uses ZooKeeper to ensure that only one namenode is active.Each namenode runs a lightweight failover controller process whose job it is to monitor its namenode for failures and trigger a failover should a namenode fail.
从活动主节点到备份节点的故障切换是由系统中一个新的实体——故障切换控制器来管理的。虽然有多种版本号的故障切换控制器。但是hadoop默认的是ZooKeeper,它也可确保惟独一个namenode是处于活动状态。每一个namenode节点上都执行一个轻量级的故障切换控制器进程。它的任务就是去监控namenode的故障。一旦namenode发生问题,它就会触发故障切换。
q) The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption — a method known as fencing.
HA的实现会不遗余力的去确保以前的活动主节点不会作出不论什么致使故障的有害举动。这种方法就是fencing。
r) The QJM only allows one namenode to write to the edit log at one time; however, it is still possible for the previously active namenode to serve stale read requests to clients, so setting up an SSH fencing command that will kill the namenode’s process is a good idea.
QJM只赞成一个namenode在同一时刻进行可编辑日志的写入操做。然而,对已以前的活动节点来讲,响应来自client的陈旧的读取请求服务是可能的。所以创建一个可以杀死namenode进程的fencing命令式一个好方法。
3) The Command-Line Interface
a) You can type hadoop fs -help to get detailed help on every command.
你可以在每一个命令上使用hadoop fs –help来得到具体的帮助。
b) Let’s copy the file back to the local filesystem and check whether it’s the same:
% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = e7891a2627cf263a079fb0f18256ffb2
MD5 (quangle.copy.txt) = e7891a2627cf263a079fb0f18256ffb2
The MD5 digests are the same, showing that the file survived its trip to HDFS and is back intact.
让咱们来拷贝这个文件到本地,检查它们是不是同样的文件。
MD5是同样的,代表这个文件传输到了HDFS,并且原封不动的回来了。
4) Hadoop Filesystems
a) Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation.The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop, and there are several concrete implementations.
hadoop对于文件系统有一个抽象的概念,HDFS仅是当中一个实现。Java的抽象类org.apache.hadoop.fs.FileSystem定义了client到文件系统之间的接口,并且该抽象类还有几个具体的实现。
b) Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API. hadoop是用Java编写的,所以大多数hadoop文件系统的交互经过Java API来进行的。 c) By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS. The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages to interact with HDFS. Note that the HTTP interface is slower than the native Java client, so should be avoided for very large data transfers if possible. 把文件系统的接口做为一个Java API。会让非Java应用程序訪问HDFS时有些麻烦。经过WebHDFS协议实现的HTTP REST API可以很是easy的让其它语言与HDFS进行交互。注意,HTTP接口比本地Javaclient要慢,所以,有可能的话,应该避免使用该接口进行大文件数据传输。