Chapter 4: Repositories

This is part of an online book called Source Control HOWTO, a best practices guide on source control, version control, and configuration management.

Cars and clocks
In previous chapters I have mentioned the concept of a repository, but I haven't said much further about it.  In this chapter, I want to provide a lot more detail.  Please bear with me as I spend a little time talking about how an SCM tool works "under the hood".  I am doing this because an SCM tool is more like a car than a clock. 
  • An SCM tool is not like a clock.  Clock users have no need to know how a clock works inside.  We just want to know what time it is.  Those who understand the inner workings of a clock cannot tell time any more skillfully than the rest of us.
  • An SCM tool is more like a car.  Lots of people do use cars without knowing how they work.  However, people who really understand cars tend to get better performance out of them.
Rest assured that this book is still a "HOWTO".  My goal here remains to create a practical explanation of how to do source control.  However, I believe that you can use an SCM tool more effectively if you know a little bit about what's happening inside.
Repository = File System * Time
A repository is the official place where you store all your source code.  It keeps track of all your files, as well as the layout of the directories in which they are stored.  It resides on a server where it can be shared by all the members of your team.
But there has to be more.  If the definition in the previous paragraph were the whole story, then an SCM repository would be no more than a network file system.  A repository is much more than that.  A repository contains history.
A file system is two-dimensional:  its space is defined by directories and files.  In contrast, a repository is three-dimensional:  it exists in a continuum defined by directories, files and time.  An SCM repository contains every version of your source code that has ever existed.  The additional dimension creates some rather interesting challenges in the architecture of a repository and the decisions about how it manages data.
How do we store all those old versions of everything?
As a first guess, let's not be terribly clever.  We need to store every version of the source tree.  Why not just keep a complete copy of the entire tree for every change that has happened?
We obviously use Vault as the SCM tool for our own development of Vault.  We began development of Vault in the fall of 2001.  In the summer of 2002, we started "dogfooding".  On October 25th, 2002, we abandoned our repository history and started a fresh repository for the core components of Vault.  Since that day, this tree has been modified 4,686 times.
This repository contains approximately 40 MB of source code.  If we chose to store the entire tree for every change, those 4,686 copies of the source tree would consume approximately 183 GB, without compression.  At today's prices for disk space, this option is worth considering.
However, this particular repository is just not very large.  We have several others as well, but the sum total of all the code we have ever written still doesn't qualify as "large".  Many of our Vault customers have trees which are a lot bigger.
As an example, consider the source tree for OpenOffice.org.  This tree is approximately 634 MB.  Based on their claim of 270 developers and the fact that their repository is almost four years old, I'm going to conservatively estimate that they have made perhaps 20,000 checkins.  So, if we used the dumb approach of storing a full copy of their tree for every change, we would need around 12 TB of disk space.  That's 12 terabytes.
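The two estimates above are simple multiplication, and it can help to see them worked out.  The sketch below uses only the tree sizes and checkin counts quoted in the text; the function name and the choice of binary units (1 GB = 1024 MB) are my own assumptions.

```python
# Back-of-the-envelope cost of keeping a full copy of the tree per checkin.
# Tree sizes and checkin counts are the figures quoted in the text.
# Assumption: binary units, 1 GB = 1024 MB and 1 TB = 1024 GB.

def full_copy_storage_mb(tree_size_mb: float, checkins: int) -> float:
    """Total space in MB if every checkin stores the whole tree, uncompressed."""
    return tree_size_mb * checkins


vault_mb = full_copy_storage_mb(40, 4686)
openoffice_mb = full_copy_storage_mb(634, 20000)

print(f"Vault core tree: {vault_mb / 1024:.0f} GB")
print(f"OpenOffice.org:  {openoffice_mb / 1024 / 1024:.1f} TB")
```

The cost grows with the product of tree size and checkin count, which is why a tree only sixteen times larger, with about four times as many checkins, ends up needing dozens of times more space.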
At this point, the argument that "disk space is cheap" starts to break down.  The disk space for 12 TB of data is cheaper than it has ever been in the history of the planet.  But this is mission critical data.  We have to consider things like performance and backups and RAID and administration.  The cost of storing 12 TB of ultra-important data is more than just the cost of the actual disk platters.
So we actually do have an incentive to store this information a bit more efficiently.  Fortunately, there is an obvious reason why this is going to be easy to do.  We observe that tree N is often not terribly different from tree N-1.  By definition, each version of the tree is derived from its predecessor.  A checkin might be as simple as a one-line fix to a single file.  All of the other files are unchanged, so we don't really need to store another copy of them.
So, we don't want to store the full contents of the tree for every single change.  Instead, we want a way to store a tree represented as a set of changes to another tree.  We call this a "delta".
Delta direction
As we decide to store our repositories using deltas, we must be concerned about performance.  Retrieving a tree which is in a deltified representation requires more effort than retrieving one which is stored in full.  For example, let's suppose that version 1 of the tree is stored in full, but every subsequent revision is represented as a delta from its predecessor.  This means that in order to retrieve version 4,686, we must first retrieve version 1 and then apply 4,685 deltas. Obviously, this approach would mean that retrieving some versions will be faster than others. When using this approach we say that we are using "forward deltas", because each delta expresses the set of changes from one version to the next. 
We observe that not all versions of the tree are equally likely to be retrieved.  For example, version 83 of the Vault tree is not special in any way.  It is likely that we have not retrieved that version in over a year.  I suspect that we will never retrieve it again.  However, we retrieve the latest version of the tree many times per day.  In fact, as a broad generalization, we can say that at any given moment, the most recent version of the tree is probably the most likely one to be needed.
The simplistic use of forward deltas delivers its worst performance for the most common case.  Not good.
Another idea is to use "reverse deltas".  In this approach, we store the most recent tree in full.  Every other tree N is represented as a set of differences from tree N+1.  This approach delivers its best performance for the most common case, but it can still take an awfully long time to retrieve older trees.
Some SCM tools use some sort of a compromise design.  In one approach, instead of storing just one full tree and representing every other tree as a delta, we sprinkle a few more full trees along the way.  For example, suppose that we store a full tree for every 10th version.  This approach uses more disk space, but the SCM server never has to apply more than 9 deltas to retrieve any tree.
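To compare the three storage layouts, we can count how many deltas the server would have to apply to retrieve a given version.  This is only a sketch: the function names are hypothetical, and I am assuming the compromise design uses forward deltas with a full tree stored at versions 1, 11, 21 and so on.

```python
# Number of deltas that must be applied to rebuild a given version of the tree.
# Function names are illustrative; no real SCM tool exposes these.

def deltas_forward(version: int) -> int:
    """Version 1 is stored in full; every later version is a delta from its predecessor."""
    return version - 1


def deltas_reverse(version: int, latest: int) -> int:
    """The latest version is stored in full; every older version is a delta from its successor."""
    return latest - version


def deltas_with_full_copies(version: int, interval: int = 10) -> int:
    """Forward deltas, with a full tree stored at versions 1, 1 + interval, 1 + 2 * interval, ..."""
    return (version - 1) % interval


latest = 4686
print(deltas_forward(latest))            # cost of fetching the newest tree with forward deltas
print(deltas_reverse(latest, latest))    # cost of fetching the newest tree with reverse deltas
print(deltas_reverse(83, latest))        # cost of fetching an old tree with reverse deltas
print(max(deltas_with_full_copies(v) for v in range(1, latest + 1)))  # worst case with full copies
```

The last line confirms the claim in the text: with a full tree every 10th version, no retrieval ever needs more than 9 deltas.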
What is a delta?
I've been throwing around this concept of deltas, but I haven't stopped to describe them.
A tree is a hierarchy of folders and files.  A delta is the difference between two trees.  In theory, those two trees do not need to be related.  However, in practice, the only reason we calculate the difference between them is because one of them is derived from the other.  Some developer started with tree N and made one or more changes, resulting in tree N+1.
We can think of the delta as a set of changes.  In fact, many SCM tools use the term "changeset" for exactly this purpose.  A changeset is merely a list of the changes which express the difference between two trees.
For example, let's suppose that Wilbur starts with tree N and makes the following changes:
  1. He deletes $/top/subfolder/foo.c because it is no longer needed.
  2. He edits $/top/subfolder/Makefile to remove foo.c from the list of file names.
  3. He edits $/top/bar.c to remove all the calls to the functions in foo.c.
  4. He renames $/top/hello.c and gives it the new name hola.c.
  5. He adds a new file called feature_creep.c to $/top/.
  6. He edits $/top/Makefile to add feature_creep.c to the list of file names.
  7. He moves $/top/subfolder/readme.txt into $/top.
At this point, he commits all of these changes to the repository as a single transaction.  When the SCM server stores this delta, it must remember all of these changes.
For changeset item 1 above, the delete of foo.c is easily represented.  We simply remember that foo.c existed in tree N but does not exist in tree N+1.
For changeset item 4, the rename of hello.c is a bit more complex.  To handle renames, we need each object in the repository to have an identifier which never changes, even when the name or location of the item changes.
For changeset item 7, the move of readme.txt is another example of why repositories need IDs for each item.  If we simply remember every item by its path, we cannot remember the occasions when that path changes.
Changeset item 5 is going to be a lot bulkier than some of the other items here.  For this item we need to remember that tree N+1 has a file called feature_creep.c which was never present in tree N.  However, a full representation of this changeset item needs to contain the entire contents of that file.
Changeset items 2, 3 and 6 represent situations where a file which already existed has been modified in some way.  We could handle these items the same way as item 5, by storing the entire contents of the new version of the file.  However, we will be happier if we can do deltas at the file level just as we are doing deltas at the tree level.
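One way to picture a changeset is as a list of small records, each pointing at an item by a permanent ID instead of by its path.  The sketch below is a toy model under that assumption.  The class names, field names and the in-memory dictionary are all hypothetical and do not describe how Vault or any other tool actually stores its data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Item:
    path: str      # current location; may change over time
    content: str   # full file contents


@dataclass(frozen=True)
class Change:
    kind: str                      # "add", "delete", "move" or "edit"
    item_id: int                   # permanent identifier, never reused
    path: Optional[str] = None     # new path for "add" and "move"
    content: Optional[str] = None  # new contents for "add" and "edit"


def apply_changeset(tree: Dict[int, Item], changes: List[Change]) -> Dict[int, Item]:
    """Return tree N+1 given tree N and a changeset; tree N is left untouched."""
    result = dict(tree)
    for change in changes:
        if change.kind == "add":
            result[change.item_id] = Item(change.path, change.content)
        elif change.kind == "delete":
            del result[change.item_id]
        elif change.kind == "move":  # a rename is just a move to a new path
            result[change.item_id] = Item(change.path, result[change.item_id].content)
        elif change.kind == "edit":
            result[change.item_id] = Item(result[change.item_id].path, change.content)
        else:
            raise ValueError(f"unknown change kind: {change.kind}")
    return result


tree_n = {
    1: Item("$/top/subfolder/foo.c", "int foo(void);"),
    2: Item("$/top/hello.c", "int main(void) { return 0; }"),
    3: Item("$/top/subfolder/readme.txt", "Read me."),
}
changeset = [
    Change("delete", 1),                                                # item 1
    Change("move", 2, path="$/top/hola.c"),                             # item 4
    Change("add", 4, path="$/top/feature_creep.c", content="/* new */"),  # item 5
    Change("move", 3, path="$/top/readme.txt"),                         # item 7
]
tree_n1 = apply_changeset(tree_n, changeset)
print(sorted(item.path for item in tree_n1.values()))
```

Because item 2 keeps its ID across the rename, its history stays attached to it even though the path changed.  Note that "add" and "edit" here carry the whole file, which is exactly the bulk that file-level deltas are meant to reduce.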
File deltas
A file delta merely expresses the difference between two files.  Once again, the reason we calculate a file delta is because we believe it will be smaller than the file itself, usually because one of the files is derived from the other.
For text files, a well-known approach to the file delta problem is to compare line-by-line and output a list of lines which have been modified, inserted or deleted.  This is the same kind of result produced by the Unix 'diff' command.  The bad news is that this approach only works for text files.  The good news is that software developers and web developers have a lot of text files.
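A line-oriented delta can be sketched with Python's standard difflib module.  This is not the format RCS, CVS or Perforce actually write to disk; the "copy" and "insert" operations and both function names are my own invention, chosen only to show that the old file plus a small delta is enough to rebuild the new file.

```python
import difflib
from typing import List, Tuple


def make_line_delta(old: List[str], new: List[str]) -> List[Tuple]:
    """Describe `new` as ranges copied from `old` plus literal inserted lines."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))        # reuse old[i1:i2] unchanged
        else:
            # "replace", "insert" and "delete" all reduce to: emit new[j1:j2]
            # (which is empty for a pure deletion).
            delta.append(("insert", new[j1:j2]))
    return delta


def apply_line_delta(old: List[str], delta: List[Tuple]) -> List[str]:
    """Rebuild the new file from the old file and the delta."""
    result: List[str] = []
    for op in delta:
        if op[0] == "copy":
            result.extend(old[op[1]:op[2]])
        else:
            result.extend(op[1])
    return result


old_lines = ["OBJS = foo.o bar.o", "all: $(OBJS)", "\tcc -o app $(OBJS)"]
new_lines = ["OBJS = bar.o feature_creep.o", "all: $(OBJS)", "\tcc -o app $(OBJS)"]

delta = make_line_delta(old_lines, new_lines)
print(delta)
print(apply_line_delta(old_lines, delta) == new_lines)
```

Only the changed line is stored literally; the two unchanged lines are recorded as a range to copy from the old file.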
CVS and Perforce use this approach for repository storage.  Text files are deltified using a line-oriented diff.  Binary files are not deltified at all, although Perforce does reduce the penalty somewhat by compressing them. 
Subversion and Vault are examples of tools which use binary file deltas for repository storage.  Vault uses a file delta algorithm called VCDiff, as described in RFC 3284.  This algorithm is byte-oriented, not line-oriented.  It outputs a list of byte ranges which have been changed.  This means it can handle any kind of file, binary or text.  As an ancillary benefit, the VCDiff algorithm compresses the data at the same time.
Binary deltas are a critical feature for some SCM tool users, especially in situations where the binary files are large.  Consider the case where a user checks out a 10 MB file, changes a few bytes, and checks it back in.  In CVS, the size of the repository will increase by 10 MB.  In Subversion and Vault, the repository will only grow by a small amount.
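To see why a byte-oriented delta stays small when only a few bytes change, here is a deliberately crude sketch.  It is not VCDiff and not what Subversion or Vault implement: it simply trims the bytes the two versions share at the start and at the end and stores whatever is left in the middle.  The function names are hypothetical, and the file is 1 MB instead of 10 MB only to keep the example quick.

```python
from typing import Tuple

ByteDelta = Tuple[int, int, bytes]  # (shared prefix length, shared suffix length, new middle)


def make_byte_delta(old: bytes, new: bytes) -> ByteDelta:
    """Crude binary delta: keep the common prefix and suffix, store only the middle."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    # Never let the suffix overlap the prefix we already matched.
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return prefix, suffix, new[prefix:len(new) - suffix]


def apply_byte_delta(old: bytes, delta: ByteDelta) -> bytes:
    """Rebuild the new file from the old file and the delta."""
    prefix, suffix, middle = delta
    return old[:prefix] + middle + old[len(old) - suffix:]


old_file = bytes(range(256)) * 4096                       # 1 MB of binary data
new_file = old_file[:500_000] + b"\xff\xff\xff" + old_file[500_003:]

delta = make_byte_delta(old_file, new_file)
print(len(new_file), len(delta[2]))
print(apply_byte_delta(old_file, delta) == new_file)
```

A real algorithm such as VCDiff handles changes scattered throughout the file and compresses its output as well, but the storage argument is the same: the repository grows by roughly the size of the change, not the size of the file.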
Deltas and diffs are different
Please note that I make a distinction between the terms "delta" and "diff".
  • A "delta" is the difference between two versions.  If we have one full file and a delta, then we can construct the other full file.  A delta is used primarily because it is smaller than the full file, not because it is useful for a human being to read.  The purpose of a delta is efficiency.  When deltas are done at the level of bytes instead of textual lines, that efficiency becomes available to all kinds of files, not just text files.
  • A "diff" is the human-readable difference between two versions of a text file.  It is usually line-oriented, but really cool visual diff tools can also highlight the specific characters on a line which differ.  The purpose of a diff is to show a developer exactly what has changed between two versions of a file.  Diffs are really useful for text files, because human beings tend to read text files.  Most human beings don't read binary files, and human-readable diffs of binary files are similarly uninteresting.
As mentioned above, some SCM tools use binary deltas for repository storage or to improve performance over slow network lines.  However, those tools also support textual diffs.  Deltas and diffs serve two distinct purposes, both of which are important.  It is merely coincidence that some SCM tools use textual diffs as their repository deltas.
The evolution of source control technology
At this point I should admit that I have presented a somewhat idealized view of the world.  Not all SCM tools work the way I have described.  In fact, I have presented things exactly backwards, discussing tree-wide deltas before file deltas.  That is not the way the history of the world unfolded.
Prehistoric ancestors of modern programmers had to live with extremely primitive tools.  Early version control systems like RCS only handled file deltas.  There was no way for the system to remember folder-level operations like adding, renaming or deleting files.
Over time, the design of SCM tools matured.  CVS is probably the most popular source control tool in the world today.  It was originally developed as a set of wrappers around RCS which essentially provided support for some folder-level operations.  Although CVS still has some important limitations, it was a big step forward.
Today, several modern source control systems are designed around the notion of tree-wide deltas.  By accurately remembering every possible operation which can happen to a repository, these tools provide a truly complete history of a project.
What can be stored in a repository?
Best Practice: Checkin all the canonical stuff, and nothing else
Although you can store anything you want in a repository, that doesn't mean you should. The best practice here is to store everything which is necessary to do a build, and nothing else. I call this "the canonical stuff".
To put this another way, I recommend that you do not store any file which is automatically generated. Checkin your hand-edited source code. Don't checkin EXEs and DLLs. If you use a code generation tool, checkin the input file, not the generated code file. If you generate your product documentation in several different formats, checkin the original format, the one that you manually edit.
If you have two files, one of which is automatically generated from the other, then you just don't need to checkin both of them. You would in effect be managing two expressions of the same thing. If one of them gets out of sync with the other, then you have a problem.
People sometimes ask us what kind of things can be stored in a repository.  In general, the answer is: "Any file".  It is true that I am focusing on tools which are designed for software developers and web developers.  However, those tools don't really care what kind of file you store inside them.  Vault doesn't care.  Perforce, Subversion and CVS don't care.  Any of these tools will gratefully accept any file you want to store.
If you will be storing a lot of binary files, it is helpful to know how your SCM tool handles them.  A tool which uses binary deltas in the repository may be a better choice.
If all of your files are binary, you may want to explore other solutions.  Tools like Vault and Subversion were designed for programmers.  These products contain features designed specifically for use with source code, including diff and automerge.  You can use these systems to store all of your Excel spreadsheets, but they are probably not the best tool for the job.  Consider exploring "document management" systems instead.
How is the repository itself stored?
We need to descend through one more layer of abstraction before we turn our attention back to more practical matters.  So far I have been talking about how things are stored and managed within a repository, but I have not broached the subject of how the repository itself is stored.
A repository must store every version of every file.  It must remember the hierarchy of files and folders for every version of the tree.  It must remember metadata, information about every file and folder.  It must remember checkin comments, explanations provided by the developer for each checkin.  For large trees and trees with very many revisions, this can be a lot of data that needs to be managed efficiently and reliably.  There are several different ways of approaching the problem.
RCS kept one archive file for every file being managed.  If your file was called "foo.c" then the archive file was called "foo.c,v".  Usually these archive files were kept in a subdirectory of the working directory, just one level down.  RCS files were plain text; you could just look at them with any editor.  Inside the file you would find a bunch of metadata and a full copy of the latest version of the file, plus a series of line-oriented file deltas, one for each previous version.  (Please forgive me for speaking of RCS in the past tense.  Despite all the fond memories, that particular phase of my life is over.)
CVS uses a similar design, albeit with a lot more capabilities.  A CVS repository is distinct, completely separate from the working directory, but it still uses ",v" files just like RCS.  The directory structure of a CVS repository contains some additional metadata.
When managing larger and larger source trees, it becomes clear that the storage challenges of a repository are exactly the same as the storage challenges of a database.  For this reason, many SCM tools use an actual database as the backend data store.  Subversion uses Berkeley DB.  Vault uses SQL Server 2000.  The benefit of this approach is enormous, especially for SCM tools which support atomic transactions.  Microsoft has invested lots of time and money to ensure that SQL Server is a safe place to store important information.  Data corruption simply doesn't happen.  All of the ultra-tricky details of transactions are handled by the underlying database.
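The value of letting a database handle transactions is easiest to see with a checkin that fails halfway through.  The sketch below uses SQLite from Python's standard library purely as a stand-in; it is not Berkeley DB or SQL Server, and the table layout and function name are hypothetical, not the schema of any real SCM tool.

```python
import sqlite3

# Stand-in backend: an in-memory SQLite database with a made-up schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, content TEXT NOT NULL)")
conn.execute("INSERT INTO files VALUES (?, ?)", ("$/top/bar.c", "version 1"))
conn.commit()


def commit_changeset(connection: sqlite3.Connection, changes) -> None:
    """Store every (path, content) pair in the changeset, or none of them."""
    with connection:  # commits if the block succeeds, rolls back if it raises
        for path, content in changes:
            connection.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?)", (path, content)
            )


# The second change is invalid (content may not be NULL), so the whole
# checkin is rejected and the first change is rolled back with it.
try:
    commit_changeset(conn, [("$/top/bar.c", "version 2"), ("$/top/broken.c", None)])
except sqlite3.IntegrityError:
    print("checkin rejected")

print(conn.execute("SELECT content FROM files WHERE path = ?", ("$/top/bar.c",)).fetchone()[0])
```

The SCM code never has to undo the half-finished checkin itself; the database guarantees that the repository moves from tree N to tree N+1 in one step or not at all.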
Perforce uses somewhat of a hybrid approach, storing all of the metadata in a database but keeping all of the actual file contents in RCS files.  This approach trades some safety for speed.  Since Perforce manages its own archive files, it has to take responsibility for all the strange things that threaten to corrupt them.  On the other hand, writing a file is a bit faster than writing a blob into a SQL database.  Perforce has the reputation of being one of the fastest SCM tools.
Managing repositories
Best Practice: Use separate repositories for things which are truly separate
Most SCM tools offer the ability to have multiple distinct repositories. Vault can even host multiple repositories on the same Vault server. People often ask us when this capability should be used.
In general, you should store related items in the same repository. Start a separate repository only in situations where the contents of the two are completely unrelated.  In a small ISV, it may be quite logical to have only one repository which contains every project. 
Creating a source control repository is kind of a special event.  It's a little bit like adopting a cat.  People often get a cat without realizing the animal is going to be around for 10-20 years.  Your repository may have similar longevity, or even longer.
Shortly after SourceGear was founded in 1997, we created a SourceSafe repository.  Over seven years later, that repository is still in use, almost every day.  (Along with a whole bunch of legacy projects, it contains the source code for SourceOffSite.  We never migrated that project to Vault because we wanted the SourceOffSite developers to continue eating their own dogfood.)
That repository is well over a gigabyte in size (which is actually rather small, but then SourceGear has never been a very big company).  It contains thousands of files, thousands of checkins, and has been backed up thousands of times.
Treat your repository well and it will serve you well:
  • Obviously you should do regular backups.  That repository contains everything your fussy and expensive programmers have ever created.  Don't risk losing it. 
  • Just for fun, take an hour this week and check your backup to see if it actually works.  It's shocking how many people are doing daily backups that cannot actually be restored when they are needed.
  • Put your repository on a reliable server.  If your repository goes down, your entire team is blocked from doing work.  Disk drives like to fail, so use RAID.  Power supplies like to fail, so get a server with redundant power supplies.  The electrical grid likes to fail, so get a good Uninterruptible Power Supply (UPS).
  • Be conservative in the way your SCM server machine is managed.  Don't put anything on that machine that doesn't need to be there.  Don't feel the need to install every single Service Pack on the day it gets released.  I've been shocked how many times one of our servers went south simply because we installed a service pack or hotfix from Windows Update.  Obviously I want our machines to be kept current with the latest security fixes, but I've been burned too many times not to be cautious.  Install those patches on some other machine before you put them on critical servers.
  • Keep your SCM server inside a firewall.  If you need to allow your developers to access the repository from home, carefully poke a hole, but leave everything else as tight as you can.  Make sure your developers are using some sort of bulk encryption.  Vault uses SSL.  Tools like Perforce, CVS and Subversion can be tunneled through ssh or something similar.
This brief list of tips is hardly a complete guide for administrators.  I am merely trying to describe the level of care and caution which should be used for your SCM repository.
Undo
As I have mentioned, one of the best things about source control is that it contains your entire history.  Every version of everything is stored.  Nothing is ever deleted.
However, sometimes this benefit can be a real pain.  What if I made a mistake and checked in something that should not be checked in?  My history contains something I would rather forget.  I want to pretend that it never happened.  Isn't there some way to really delete from a repository?
In general, the recommended way to fix a problem is to checkin a new version which fixes it.  Try not to worry about the fact that your repository contains a full history of the error.  Your mistakes are a part of your past.  Accept them and move on with your life.
However, most SCM tools do provide one or more ways of dealing with this situation.  First, there is a command I call "rollback".  This command is essentially an "undo" for revisions of a file.  For example, let's say that a certain file is at version 7 and we want to go back to version 6.  In Vault, we select version 6 and choose the Rollback command.
To be fair, I should admit that the rollback command is not always destructive.  In some SCM tools, the rollback feature really does make version 7 disappear forever.  Vault's rollback is non-destructive.  It simply creates a version 8 which is identical to version 6.  The designers of Vault are fanatical purists, or at the very least, one of them is.
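The non-destructive flavor of rollback is easy to model: nothing is removed, and a new version is appended whose contents match the one we want to return to.  The sketch below assumes a file's history is just a list of contents, one entry per version; the function name is hypothetical and this is not Vault's actual implementation.

```python
from typing import List


def rollback(history: List[str], target_version: int) -> List[str]:
    """Non-destructive rollback: append a new version identical to the target.

    history[0] holds version 1, history[1] holds version 2, and so on.
    """
    if not 1 <= target_version <= len(history):
        raise ValueError(f"no such version: {target_version}")
    return history + [history[target_version - 1]]


history = [f"contents of version {n}" for n in range(1, 8)]   # versions 1 through 7
after = rollback(history, 6)

print(len(after))                # the file is now at version 8
print(after[7] == after[5])      # version 8 is identical to version 6
print(after[6])                  # version 7 is still in the history
```

A destructive rollback would instead truncate the list, and version 7 would be gone for good.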
As a concession to those who are less fanatical, Vault does support a way to truly destroy things in a repository.  We call this feature "obliterate".  I believe Subversion and Perforce use the same term.  The obliterate command is the only way to delete something and make it truly gone forever.
Best Practice: Never obliterate anything that was real work
The purist in me wants to recommend that nothing should ever be obliterated. However, my pragmatist side prevails. There are situations where obliterate is not sinful.
However, obliterate should never be used to delete actual work. Don't obliterate a file simply because you discovered it to be a bad idea. Don't obliterate a file simply because you don't need it anymore. Obliterate is for situations where something in the repository should never have been there at all. For example, if you accidentally checkin a gigabyte of MP3s alongside your C++ include files, obliterate is a justifiable choice.
固然,删除应该决不用于删除真正的工做。不要由于你发现它很差就删除一个文件。也不要由于再也不须要就删除。删除是为了一些在库中根本不须要的。例如,若是你意外的签入一个MP3的文件到你的C++文件里面,那删除就是一个正确的选择。
In my original spec for Vault, I had decided that we would not implement any form of destructive delete.  We eventually decided to compromise and implement this command, but I really wanted to discourage its use.  SourceSafe makes it far too easy to rewrite history and pretend that something never happened.  In the Delete dialog box, SourceSafe includes a checkbox called "Destroy Permanently".  This is an atrocious design decision, roughly equivalent to leaving a sledgehammer next to the server machine so that people can bash the hard disks with it every once in a while.  This checkbox is almost irresistible.  It simply begs to be checked, even though it is very rarely the right thing to do.
Vault的原始规则里面,我曾经肯定咱们不会执行任何破坏性的删除。咱们最后决定妥协并使用这个命令,可是我真正的但愿阻止它的使用。SourceSafe使这个命令很简单快速的重写历史和假设什么都没有发生过。在删除对话框,SourceSafe包含了一个成为“永久破坏”的选择框。这是一个很凶悍的设计思想,粗糙的等于拿一个大的锤子让人们能够在硬盘旋转中去敲打服务器。这个选择框是至关有诱惑的。它简单的要求检查,尽管不多有正确的事情来作。
When we first designed the obliterate command for Vault, I wanted its user interface to somehow make the user feel guilty.  I argued that the obliterate dialog box should include a photograph of a 75-year-old Catholic nun scowling and holding a yardstick.
当咱们开始为Vault设计删除命令的时候,我但愿它的用户界面可以使用户莫名其妙的以为不舒服。我辩论说这个删除对话框包含了一个拿着一根绳子的75岁的修女。
The rest of the team agreed that we should discourage people from using this command, but in the end, we settled on a less graphical approach.  In Vault, the obliterate command is available only in the Admin client, not the regular client people use every day.  In effect, we made the obliterate command available, but inconvenient.  People who really need to obliterate can find the command and get it done.  Everyone else has to think twice before they try to rewrite history and pretend something never happened.
其余的团队成员赞成我应该劝阻人民不要使用这个命令,可是到最后,咱们决定采起了一个小的图形方式。在Vault里面,删除命令是仅仅在管理员端可使用的,不是其余的客户端的客户能够天天使用的。咱们还使这个命令可用,却并不方便。真正须要删除的人们能够找这个命令而后执行。其余的人在他们试图重写历史并假装什么事情都没有发生以前须要思考两次。
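The difference between an ordinary delete and an obliterate can be sketched as follows. This is a simplified model under my own assumptions, not how Vault stores anything; the `Repository` class and the `is_admin` flag are hypothetical stand-ins for the idea that obliterate lives only in the Admin client.

```python
class Repository:
    """Toy repository: each path maps to a list of revisions.

    Hypothetical illustration only. A revision of None marks a deletion.
    """

    def __init__(self):
        self._history = {}  # path -> list of revisions

    def checkin(self, path, content):
        """Record new content as the next revision of the file."""
        self._history.setdefault(path, []).append(content)

    def delete(self, path):
        """Ordinary delete: recorded as one more revision, so history survives."""
        self._history[path].append(None)

    def history(self, path):
        """Return a copy of every revision recorded for the path."""
        return list(self._history.get(path, []))

    def obliterate(self, path, is_admin=False):
        """Destructive delete: the file and all of its history are gone."""
        if not is_admin:
            raise PermissionError("obliterate is restricted to administrators")
        del self._history[path]


repo = Repository()

# Real work that is no longer needed: delete it, and the history remains.
repo.checkin("include/util.h", "first draft")
repo.delete("include/util.h")
print(repo.history("include/util.h"))   # ['first draft', None]

# Something that should never have been checked in: obliterate it.
repo.checkin("music/song.mp3", "not source code")
try:
    repo.obliterate("music/song.mp3")    # a regular user is refused
except PermissionError as err:
    print(err)                           # obliterate is restricted to administrators
repo.obliterate("music/song.mp3", is_admin=True)
print(repo.history("music/song.mp3"))   # []
```

The deleted header can still be recovered from its history. The obliterated MP3 cannot, which is exactly why the command is kept out of everyday reach.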
Kimchi again?
再来点韩国泡菜?
Recently when I asked my fifth grade daughter what she had learned in school, she proudly informed me that "everyone in Korea eats kimchi at every meal, every day".  In the world of a ten-year-old, things are simpler.  Rules don't have exceptions.  Generalizations always apply. 
最近我问我五年级的女儿她从学校学到了什么,她骄傲的告诉我“在韩国的人天天、每顿都吃韩国泡菜”。在一个十岁的年纪,事情很是简单。规则没有例外。一般老是被运用。
This is how we learn.  We understand the basic rules first and see the finer points later.  First we learn that memory leaks are impossible in the CLR.  Later, when our app consumes all available RAM, we learn more.
这就是咱们如何来学习。咱们首先了解了基本规则,而后再看重点。首先咱们认识到内存泄漏在语音录音器里面是不可能的。后来,当咱们的程序消耗了全部可用的RAM,咱们就学到了更多。
My habit as I write these chapters is to first present the basics in a "matter of fact" fashion, rarely acknowledging that there are exceptions to my broad generalizations.  I did this during the chapter on checkins, failing to mention the "edit-merge-commit" model until I had thoroughly explored "checkout-edit-checkin".
个人习惯就象我写这些文章同样,首先以一种事实方式呈现基础,个人宽泛的归纳很罕见的获得承认。我在章节签入里面作这些事情,直到我完全的研究了“签出-编辑-签入”以前我都没有说起“编辑-合并-提交”。
In this chapter, I have written everything from the perspective of just one specific architecture.  SCM tools like Vault, Perforce, CVS and Subversion are based on the concept of a centralized server which hosts a single repository.  Each client has a working folder.  All clients contact the same server. 
在这个章节,我只以一个特定结构的见解去描述每件事情。配置管理工具,好比Vault、Perforce、CVS和Subversion都是基于集中只有一个单独的库的服务器的概念。每一个客户端有一个工做目录,全部的客户端同同一台服务器联系。
I confess that not all SCM tools work this way.  Tools like BitKeeper and Arch are based on the concept of distributed repositories.  Instead of one repository, there can be several, or even many.  Things can be retrieved from or committed to any repository at any time.  The repositories are synchronized by migrating changesets from one repository to another.  This results in a merge situation which is not altogether different from merging branches.
我认可不是全部的配置管理工具都是用那种方式工做。好比BitKeeper和Arch都是基于分布式数据库的。一个库能够有好几个,甚至更多。工做可以在任什么时候间从任何库中得到或提交。这个库是经过从一个库移动变动到另外一个库同步的。在一个合并的地方这个结果不是同合并分支差别相同的。
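The idea of synchronizing repositories by migrating changesets can be sketched roughly like this. It is a deliberately naive model and does not reflect how BitKeeper or Arch actually work; the `Repo` class, its `pull` method, and the changeset ids are all hypothetical. It only tracks which changesets each repository knows about and leaves out the merging of file contents that a real tool would have to perform.

```python
class Repo:
    """Toy distributed repository: a named collection of changesets.

    Hypothetical illustration only; merging of file contents is not modeled.
    """

    def __init__(self, name):
        self.name = name
        self.changesets = {}  # changeset id -> description

    def commit(self, changeset_id, description):
        """Record a changeset in this repository only."""
        self.changesets[changeset_id] = description

    def pull(self, other):
        """Copy changesets that `other` has and this repository lacks.

        Returns the sorted ids of the changesets that were migrated.
        """
        missing = [cid for cid in other.changesets if cid not in self.changesets]
        for cid in missing:
            self.changesets[cid] = other.changesets[cid]
        return sorted(missing)


alice = Repo("alice")
bob = Repo("bob")

# Both repositories start from the same changeset, then diverge.
alice.commit("base", "initial import")
bob.commit("base", "initial import")
alice.commit("a1", "fix parser bug")
bob.commit("b1", "add logging")

print(alice.pull(bob))   # ['b1']
print(bob.pull(alice))   # ['a1']
print(sorted(alice.changesets) == sorted(bob.changesets))  # True
```

After both pulls, each repository holds the same set of changesets. In a real tool, the moment `b1` arrives next to `a1` is where the merge situation described above would arise.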
From the perspective of this SCM geek, distributed repositories are an attractive concept.  Admittedly, they are advanced and complex, requiring a bit more of a learning curve on the part of the end user.  But for the power user, this paradigm for source control is very cool.
关于这个配置管理讨厌的见解,分布库是一个吸引人的概念。诚然,他们是高级和复杂的,须要终端用户更多的学习。可是对高级用户,这个例子对版本控制很是酷。
Having no experience in the implementation of these systems, I will not be explaining their behavior in any detail.  Suffice it to say that this approach is similar in some ways, but very different in others.  This series of articles will continue to focus on the more mainstream architecture for source control.
尚未执行这些系统的经验,我将不会解释他们的行为。有力的说明这个方式在某些地方是相同的,可是又同其余的很是不一样。这个系列文章将继续关注主流结构的版本控制工具。
 
Looking ahead
In this chapter, I discussed the details of repositories.  In the next chapter, I'll go back over to the client side and dive into the details of working folders.
这一章节,我论述了关于库的状况。下一章节,我将回头来描述客户端和深刻钻研工做目录