Exploring How Git Works Through Its First Commit

Date: 2020-06-29

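Recently I wanted to understand how Git is implemented. I started with the Git Internals chapter of Pro Git, which gave me a rough picture but left me wanting more. So I cloned the Git source to see how it is actually implemented, but today's code base is far too large to read without a serious investment of effort.

I then tried checking out Git's very first commit. The amount of code drops dramatically, to only about 1,000 lines. After a few small changes the programs compiled. Looking at this initial version, many of its concepts carry over to today's Git, so it is well worth reading.

Later I also came across Jacob Stopak's walkthrough, which annotates the code heavily. It was a pleasant surprise, because analyses like this are rare online, and it was fresh too: published on May 22 of this year. >_<

What Can We Learn from the Code in Git’s Initial Commit?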

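Getting the source and compiling it

Here I use Jacob Stopak's project, which adds extensive comments to the code from Git's first commit and also modifies some of it so that it builds on modern operating systems.

Get the code:

git clone https://bitbucket.org/jacobstopak/baby-git.git

Building should be just a matter of running make, but this still produces build errors. In cache.h, add extern to the following variable declarations:

extern const char *sha1_file_directory;
extern struct cache_entry **active_cache;
extern unsigned int active_nr, active_alloc;

Then run make:

make

If you would rather build from the Git source on GitHub, do the following: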

# Clone the source first
git clone https://github.com/git/git.git
# Find the first commit in the log
git log --reverse
# Check out the first commit
git checkout e83c5163316f89bfbde7d9ab23ca2e25604af290

Running make directly on this source fails, so a few more changes are needed.

First, in the Makefile:

# Change the LIBS line to this
LIBS= -lcrypto -lz

Then, in cache.h:

/* Add this header */
#include <string.h>

As before, add extern to the following variables:

extern const char *sha1_file_directory;
extern struct cache_entry **active_cache;
extern unsigned int active_nr, active_alloc;

After that, make works. If that is too much trouble, you can use my z.diff file and simply apply it:

git apply z.diff

The result is the same.

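Source layout

At the beginning, Git consisted of only 8 .c files and 1 .h file.

Counting both code and comments, that is only 1037 lines:

cat *.c *.h | wc -l
1037

Seven of the .c files correspond to seven of today's git commands:

File	Current Git	Purpose
init-db	git init	Initialize a git repository
update-cache	git add	Add files to the staging area
write-tree	git write-tree	Write the contents of the staging area into the repository as a tree object
commit-tree	git commit	Create a commit object in the repository from a given tree object
read-tree	git read-tree	Show the contents of a tree object in the repository
show-diff	git diff	Show the differences between staged files and the working directory
cat-file	git cat-file	Show the contents of an object stored in the repository

The remaining file, read-cache.c, defines helper functions shared by the programs and a few global variables.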

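Concepts

In the README, Linus Torvalds explains some of the ideas behind Git's implementation.

First, he gives the reasons for choosing the name "git":

A random three-letter combination that does not clash with any other UNIX command
Stupid, simple
When it works: global information tracker
When it does not: goddamn idiotic truckload of sh*t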

random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronounciation of "get" may or may not be relevant.

stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang.

"global information tracker": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.

"goddamn idiotic truckload of sh*t": when it breaks

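It has two core concepts:

object database
current directory cache (the equivalent of today's staging area)

Object Database

The object database is essentially a key-value store built on the file system. A SHA-1 hash of the content is the key, and data of various types is stored under .dircache/objects.

Three object types are introduced here:

blob
tree
commit

A blob object stores the complete contents of a file; it is how git tracks file content. Its structure is: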

blob size\0blobdata(file content)
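
As a minimal sketch of the layout above (not the code from the first commit), the following C function builds the uncompressed blob buffer: an ASCII header "blob <size>", a NUL byte, then the raw file bytes. The name make_blob_buffer is hypothetical, and the zlib compression and SHA-1 hashing that git applies afterwards are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build "blob <size>\0<data>" in a freshly
 * allocated buffer. Stores the total length in *out_len and returns
 * the buffer (the caller frees it), or NULL on allocation failure.
 */
unsigned char *make_blob_buffer(const void *data, size_t size, size_t *out_len)
{
	char header[32];
	int header_len;
	unsigned char *buf;

	assert(out_len != NULL);
	assert(data != NULL || size == 0);

	/* snprintf returns the length without the NUL; +1 keeps the NUL separator. */
	header_len = snprintf(header, sizeof(header), "blob %lu", (unsigned long)size) + 1;

	buf = malloc((size_t)header_len + size);
	if (!buf)
		return NULL;
	memcpy(buf, header, (size_t)header_len);
	if (size)
		memcpy(buf + header_len, data, size);
	*out_len = (size_t)header_len + size;
	return buf;
}
```

For a 7-byte file containing "123456" and a newline, the header is "blob 7" plus the NUL byte (7 bytes), so the whole buffer is 14 bytes long.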

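Besides file content, git also has to track file names, paths, and permission bits. The tree object is introduced to hold this information. A tree only records the SHA-1 of each blob; it does not contain the file contents themselves. (Today a tree object can also nest other tree objects; that did not exist yet at this point.)

The rough structure looks like this (header omitted):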

100644 a.txt (6e666502660a7e810b276afd62523c56b34c1671)
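100644 b.txt (SHA-1 of b)
100644 c.txt (SHA-1 of c)

A commit object corresponds to what we think of as a commit. It records the directory tree at a particular point in time (a tree object), the SHA-1 IDs of its parent commits (zero or more), author and committer information, and the commit message. Through commit objects git tracks the complete development history of the repository.

Again, the rough structure: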

tree 1c93ac491de01f734fedbe70f31c47ca965c93b6
parent 4ae1f8aae02d7178aa4fc52f45f068cd750aa3de
author  <zzk@archlinux> Sat Jun 27 14:41:43 2020
committer  <zzk@archlinux> Sat Jun 27 14:41:43 2020

second commit

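Today's git still uses all of these concepts and implements version control by managing these objects.

Git stores every version of every file in full. Doesn't that take a lot of space? Isn't that brute force?

git compresses every object with zlib, and textual data such as source code compresses very well (code is full of repetition, for example keywords and function calls). Modern git additionally has packfiles: it looks for files with similar names and sizes and stores only the differences between versions, so it can store deltas as well. As for brute force, it is a space-for-time trade-off: switching versions is fast because there is no chain of diffs to apply one by one; the content is simply read out.

How does Git store these objects?

They all live in the .dircache/objects directory and are looked up by the SHA-1 of their content. Under .dircache/objects there are 256 subdirectories (00 to ff), indexed by the first two hex digits of the SHA-1.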
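
To make the directory scheme concrete, here is a small sketch in C that maps a 40-character hex SHA-1 to its file under .dircache/objects. The function name object_path and the OBJECTS_DIR macro are hypothetical names for this illustration, not identifiers from the original source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define OBJECTS_DIR ".dircache/objects"

/*
 * Hypothetical helper: write the on-disk path of an object into out.
 * sha1_hex must be a 40-character lowercase hex string. Returns 0 on
 * success, -1 if the input is malformed or the buffer is too small.
 */
int object_path(const char *sha1_hex, char *out, size_t out_size)
{
	int n;

	assert(sha1_hex != NULL && out != NULL);
	if (strlen(sha1_hex) != 40 || strspn(sha1_hex, "0123456789abcdef") != 40)
		return -1;

	/* The first two hex digits pick the subdirectory, the other 38 name the file. */
	n = snprintf(out, out_size, "%s/%.2s/%s", OBJECTS_DIR, sha1_hex, sha1_hex + 2);
	if (n < 0 || (size_t)n >= out_size)
		return -1;
	return 0;
}
```

Given the ID 6e666502660a7e810b276afd62523c56b34c1671, the function produces .dircache/objects/6e/666502660a7e810b276afd62523c56b34c1671.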

Current directory cache

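The current directory cache is the staging area we use today. It is stored in the .dircache/index file and can be thought of as a temporary tree object. When update-cache runs, it creates a blob object for each given file and adds the corresponding tree information to .dircache/index.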

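Usage

The ideas above may be hard to grasp without hands-on use, so let's practice with the initial version of git.

Using only the existing commands, we will:

restore a file from the staging area (the equivalent of git checkout -- file)
create two commits and switch between them

First, create the repository with init-db:

$ ./init-db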

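This creates the .dircache directory, which plays the same role as today's .git directory.

Create a file and add it to the staging area. Running update-cache generates a blob object in .dircache/objects and adds it to the .dircache/index staging area.

$ echo "123456" > a.txt
$ ./update-cache a.txt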

Use find to see the blob object that was created:

$ find .dircache/objects -type f
.dircache/objects/6e/666502660a7e810b276afd62523c56b34c1671

Take a look at it:

$ cat .dircache/objects/6e/666502660a7e810b276afd62523c56b34c1671
xKOR0g0426156%

The object is zlib-compressed, so nothing is readable until it is decompressed. We can use cat-file to inspect it:

$ ./cat-file 6e666502660a7e810b276afd62523c56b34c1671
temp_git_file_O5J3ZL: blob

cat-file writes a temporary file containing the decompressed contents, where we can see the contents of a.txt as saved in the repository:

$ cat temp_git_file_O5J3ZL
123456

Now save the current staging area as a tree object in .dircache/objects:

$ ./write-tree
433aef473a665a9efe1cf21fbc617fbf833c71b5

On success it prints the object's SHA-1. We can use read-tree to see what the tree object contains:

$ ./read-tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
100644 a.txt (6e666502660a7e810b276afd62523c56b34c1671)

Now we can create our first commit. Type the commit message, then press Ctrl+D to finish:

$ ./commit-tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
Committing initial tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
first commit 
4ae1f8aae02d7178aa4fc52f45f068cd750aa3de

Let's see what the commit object contains:

$ ./cat-file 4ae1f8aae02d7178aa4fc52f45f068cd750aa3de
temp_git_file_pNLmni: commit
$ cat temp_git_file_pNLmni
tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
author  <zzk@archlinux> Sat Jun 27 14:20:55 2020
committer  <zzk@archlinux> Sat Jun 27 14:20:55 2020

first commit

The first commit is done. Now modify a.txt:

$ echo "version2" > a.txt

show-diff shows how the working file differs from the staged version:

$ ./show-diff
a.txt:  6e666502660a7e810b276afd62523c56b34c1671
--- -   2020-06-27 14:27:19.975168003 +0800
+++ a.txt       2020-06-27 14:27:18.048129094 +0800
@@ -1 +1 @@
-123456
+version2

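To get the staged version of the file back, there are two options:

apply the generated diff to the file in reverse
use cat-file to extract the staged content (the show-diff output prints the blob's SHA-1 next to the file name)

Try the diff first: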

$ ./show-diff > a.diff
$ patch --reverse a.txt a.diff
patching file a.txt
$ cat a.txt
123456

The file is back to the staged version. Now change the file again and try cat-file this time:

$ echo "version2" > a.txt
$ ./show-diff
a.txt:  6e666502660a7e810b276afd62523c56b34c1671
--- -   2020-06-27 14:33:53.795263485 +0800
+++ a.txt       2020-06-27 14:33:50.726893274 +0800
@@ -1 +1 @@
-123456
+version2
$ ./cat-file 6e666502660a7e810b276afd62523c56b34c1671
temp_git_file_d4cTZ7: blob
$ mv temp_git_file_d4cTZ7 a.txt
$ cat a.txt
123456

Now prepare the second commit by repeating the earlier steps:

$ echo "verison2" > a.txt
$ ./update-cache a.txt
$ ./write-tree
1c93ac491de01f734fedbe70f31c47ca965c93b6

Note that because this is the second commit, commit-tree needs -p to specify the SHA-1 of the parent commit, that is, the first commit:

$ ./commit-tree 1c93ac491de01f734fedbe70f31c47ca965c93b6 -p 4ae1f8aae02d7178aa4fc52f45f068cd750aa3de
second commit
3d4340867cb1dd857987fc2db9b1b4bc14af7051

Look at this commit:

$ ./cat-file 3d4340867cb1dd857987fc2db9b1b4bc14af7051
temp_git_file_8Cguru: commit
$ cat temp_git_file_8Cguru 
tree 1c93ac491de01f734fedbe70f31c47ca965c93b6
parent 4ae1f8aae02d7178aa4fc52f45f068cd750aa3de
author  <zzk@archlinux> Sat Jun 27 14:41:43 2020
committer  <zzk@archlinux> Sat Jun 27 14:41:43 2020

second commit

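Now we want to switch a.txt back to the previous commit. The parent commit's SHA-1 can be found in the output above.

From the parent commit's SHA-1 we find the SHA-1 of its tree object, and from the tree object we find the SHA-1 of the blob for a.txt. Finally, cat-file recovers the original contents of a.txt.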

$ ./cat-file 4ae1f8aae02d7178aa4fc52f45f068cd750aa3de
temp_git_file_MssRdY: commit
$ cat temp_git_file_MssRdY
tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
author  <zzk@archlinux> Sat Jun 27 14:20:55 2020
committer  <zzk@archlinux> Sat Jun 27 14:20:55 2020

first commit
$ ./read-tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
100644 a.txt (6e666502660a7e810b276afd62523c56b34c1671)
$ ./cat-file 6e666502660a7e810b276afd62523c56b34c1671
temp_git_file_2RlDAI: blob
$ mv temp_git_file_2RlDAI a.txt
$ cat a.txt
123456

a.txt is now restored to its state at the first commit.
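
The commit, tree, blob lookup done by hand above can be sketched in C as two small parsers. Both function names are hypothetical, and the sketch works on the text printed in the session (the decompressed commit text and the read-tree output line), not on the binary object files themselves. It also assumes file names without spaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy the 40-character tree ID from the first
 * line of a commit's text ("tree <sha1>\n...") into out, which must
 * hold at least 41 bytes. Returns 0 on success, -1 on malformed input.
 */
int commit_tree_id(const char *commit_text, char out[41])
{
	assert(commit_text != NULL && out != NULL);

	if (strncmp(commit_text, "tree ", 5) != 0)
		return -1;
	commit_text += 5;
	/* The ID must be exactly 40 hex digits followed by a newline. */
	if (strspn(commit_text, "0123456789abcdef") != 40 || commit_text[40] != '\n')
		return -1;
	memcpy(out, commit_text, 40);
	out[40] = '\0';
	return 0;
}

/*
 * Hypothetical helper: parse one line of read-tree output,
 * "<mode> <name> (<sha1>)", and copy the blob ID into out (41 bytes).
 */
int tree_line_blob_id(const char *line, char out[41])
{
	unsigned int mode;
	char name[256];

	assert(line != NULL && out != NULL);
	if (sscanf(line, "%o %255s (%40[0123456789abcdef])", &mode, name, out) != 3)
		return -1;
	return strlen(out) == 40 ? 0 : -1;
}
```

Feeding the first commit's text to commit_tree_id yields the tree ID, and feeding the read-tree line for a.txt to tree_line_blob_id yields the blob ID that is then passed to cat-file.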

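Summary: the original git was hard to use. It took continual polishing to become as usable as it is today. Many people now get by with just add, commit, and push, without needing to know how git works.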

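Code analysis

This section looks at part of the git source. Since this post is already long, I will not go through much of it. The code is very simple, and Linus writes it cleanly. Reading it yourself will not take long, and Jacob Stopak has commented it so thoroughly that there is little left for me to add. You could almost reconstruct the source in your head from what was covered above. >_<

I recommend Sourcetrail as a tool for reading source code; it makes browsing very convenient.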

init-db.c

int main(int argc, char **argv)
{
	char *sha1_dir = getenv(DB_ENVIRONMENT), *path;
	int len, i, fd;
	
	// Create the .dircache directory
	if (mkdir(".dircache", 0700) < 0) {
		perror("unable to create .dircache");
		exit(1);
	}

	/*
	 * If you want to, you can share the DB area with any number of branches.
	 * That has advantages: you can save space by sharing all the SHA1 objects.
	 * On the other hand, it might just make lookup slower and messier. You
	 * be the judge.
	 */
	sha1_dir = getenv(DB_ENVIRONMENT);
	if (sha1_dir) {
		struct stat st;
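		/*
		 * Note: !stat(...) evaluates to 0 or 1, so this "< 0" comparison
		 * is always false and control always falls through to the default
		 * case below.
		 */
		if (!stat(sha1_dir, &st) < 0 && S_ISDIR(st.st_mode))
			return 1;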
		fprintf(stderr, "DB_ENVIRONMENT set to bad directory %s: ", sha1_dir);
	}

	/*
	 * The default case is to have a DB per managed directory. 
	 */
	sha1_dir = DEFAULT_DB_ENVIRONMENT;
	fprintf(stderr, "defaulting to private storage area\n");
	len = strlen(sha1_dir);
	if (mkdir(sha1_dir, 0700) < 0) {
		if (errno != EEXIST) {
			perror(sha1_dir);
			exit(1);
		}
	}
	path = malloc(len + 40);
	memcpy(path, sha1_dir, len);
	// Create the 256 subdirectories under .dircache/objects
	for (i = 0; i < 256; i++) {
		sprintf(path+len, "/%02x", i);
		if (mkdir(path, 0700) < 0) {
			if (errno != EEXIST) {
				perror(path);
				exit(1);
			}
		}
	}
	return 0;
}

init-db simply creates these directories, which matches what we saw earlier.

Next is update-cache.c, focusing on how the blob is created and added to the .dircache/index file.

Start with main:

int main(int argc, char **argv) {
	int i, newfd, entries;

	// Read the contents of .dircache/index into the global active_cache
	entries = read_cache();
	if (entries < 0) {
		perror("cache corrupted");
		return -1;
	}

	// The lock file keeps multiple update-cache processes from running at once
	newfd = open(".dircache/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600);
	if (newfd < 0) {
		perror("unable to create new cachefile");
		return -1;
	}
	for (i = 1 ; i < argc; i++) {
		char *path = argv[i];
		// Validate the path
		if (!verify_path(path)) {
			fprintf(stderr, "Ignoring path %s\n", argv[i]);
			continue;
		}
		// Add the file to the object database
		if (add_file_to_cache(path)) {
			fprintf(stderr, "Unable to add %s to database\n", path);
			goto out;
		}
	}
	// Write the updated index and move it into place
	if (!write_cache(newfd, active_cache, active_nr) && !rename(".dircache/index.lock", ".dircache/index"))
		return 0;
out:
	unlink(".dircache/index.lock");
}

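From main we can see that blob creation presumably happens in add_file_to_cache, while the index is updated in write_cache. You can trace the rest yourself; here is roughly what they do.

add_file_to_cache reads the file, builds a blob object, compresses it, then computes the SHA-1 and stores the compressed blob at the location given by that SHA-1. (According to the current documentation, git now seems to compute the SHA-1 first and compress afterwards.) The entry is then placed into active_cache using a binary search on the file name. active_cache is the in-memory copy of the .dircache/index staging area: it is a global variable into which the contents of .dircache/index are read at startup. Because the staging area is sorted by file name, checking whether a file is staged is fast: a binary search takes only O(log n).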
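
The lookup in a name-sorted array can be sketched as follows. This is an illustration of the binary search idea only: it uses a plain array of strings instead of cache entries, and the function name find_name and its return convention are assumptions of this sketch rather than the original code.

```c
#include <assert.h>
#include <string.h>

/*
 * Binary search over an array of names sorted with strcmp().
 * Returns the index of name if present; otherwise returns
 * -(insertion position) - 1, so a negative result means "not staged"
 * and also encodes where the new entry belongs.
 */
int find_name(const char *const *names, int count, const char *name)
{
	int first = 0, last = count;

	assert(count >= 0 && name != NULL);
	while (first < last) {
		int mid = first + (last - first) / 2;
		int cmp = strcmp(name, names[mid]);

		if (cmp == 0)
			return mid;
		if (cmp < 0)
			last = mid;
		else
			first = mid + 1;
	}
	return -first - 1;
}
```

With the sorted names a.txt, b.txt, d.txt, looking up b.txt returns 1, while looking up c.txt returns -3, meaning it is absent and would be inserted at position 2.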

write_cache writes active_cache back out to the .dircache/index.lock file, and finally the lock file is renamed to .dircache/index, the staging-area file.
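
The lock-then-rename pattern used by main can be sketched in isolation like this. The function name replace_file_locked and its parameters are hypothetical, and unlike the real code it writes an arbitrary byte buffer instead of the cache entries. It also treats a short write as a failure, which is a simplification.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace target with data by writing to
 * lock_path first and then renaming it over target.
 * O_EXCL makes open() fail if the lock file already exists, so only
 * one writer proceeds at a time. Returns 0 on success, -1 on failure.
 */
int replace_file_locked(const char *target, const char *lock_path,
			const void *data, size_t size)
{
	int fd;
	ssize_t written;

	assert(target != NULL && lock_path != NULL);

	fd = open(lock_path, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return -1;	/* another writer holds the lock, or open failed */

	written = write(fd, data, size);
	if (close(fd) < 0 || written < 0 || (size_t)written != size) {
		unlink(lock_path);	/* drop the partial file and release the lock */
		return -1;
	}
	if (rename(lock_path, target) < 0) {
		unlink(lock_path);
		return -1;
	}
	return 0;
}
```

Readers of the target file see either the old contents or the complete new contents, because the new data only becomes visible under the final name once rename succeeds.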
