Why do ls and du report different sizes for the same file?

Date: 2020-01-02


More than once I have checked a file's size with both ls and du and found that the two disagree. For example:

bl@d3:~/test/sparse_file$ ls -l fs.img
-rw-r--r-- 1 bl bl 1073741824 2012-02-17 05:09 fs.img
bl@d3:~/test/sparse_file$ du -sh fs.img
0       fs.img

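Here ls reports the size of fs.img as 1073741824 bytes (1 GiB), while du reports 0.

I never looked into this properly before, so today I am filling in the gap.

There are two main reasons for the difference: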

Sparse files
ls and du mean different things by "size"

Let's look at sparse files first. A sparse file is a file that contains "holes". For example, the following C program creates a file with a hole:

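#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* O_CREAT requires a third argument: the permission bits. */
    int fd = open("sparse.file", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Seek 1024 bytes past the end of the new file, then write one byte:
       bytes 0..1023 become a hole. */
    if (lseek(fd, 1024, SEEK_CUR) == (off_t)-1 || write(fd, "\0", 1) != 1) {
        perror("lseek/write");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}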

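As the program shows, the way to create a file with a hole is to use lseek to move the file offset past the end of the file and then write; the skipped range becomes the hole. (With a gap as small as 1024 bytes the file system will typically still allocate a whole block, so the saving only becomes visible with larger offsets.)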

A sparse file can also be created from the shell:

$ dd if=/dev/zero of=sparse_file.img bs=1M seek=1024 count=0
0+0 records in
0+0 records out
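The dd invocation above writes no data at all; it only sets the length of the output file. A rough C equivalent, as a minimal sketch, is to create the file and call ftruncate. The function name make_sparse_file is made up for this illustration, and whether the result is actually sparse depends on the file system supporting holes.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) `path` and set its length to `size` bytes without
 * writing any data. Returns 0 on success, -1 on error (errno is set).
 * make_sparse_file is a name invented for this sketch. */
int make_sparse_file(const char *path, off_t size)
{
    assert(path != NULL && size >= 0);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Extending a file with ftruncate adds bytes that read back as zero;
     * on file systems that support holes no blocks are allocated for them. */
    if (ftruncate(fd, size) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Calling make_sparse_file("sparse_file.img", (off_t)1 << 30) should give a file whose ls size is 1 GiB, much like the dd example.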

The advantage of sparse files is as follows (quoted from Wikipedia):

The advantage of sparse files is that storage is only allocated when actually needed: disk space is saved, and large files can be created even if there is insufficient free space on the file system.

In other words, the holes in a sparse file need not occupy any storage.

Now for what the sizes printed by ls and du mean (again quoting Wikipedia):

The du command which prints the occupied space, while ls print the apparent size.

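In other words, ls shows the logical size of the file, while du shows its physical size: the figure du prints is computed from the number of blocks the file occupies on disk. For example: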

bl@d3:~/test/sparse_file$ echo -n 1 > 1B.txt
bl@d3:~/test/sparse_file$ ls -l 1B.txt
-rw-r--r-- 1 bl bl 1 2012-02-19 05:17 1B.txt
bl@d3:~/test/sparse_file$ du -h 1B.txt
4.0K    1B.txt

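Here we first create a file, 1B.txt, that is one byte long. ls reports its size as 1 byte, whereas du reports 4.0K: the file occupies N blocks on disk, and du derives its figure from N and the block size. I write N rather than a concrete number because there are quite a few details hidden behind the scenes, such as the fragment size, which we will discuss another time.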
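Both numbers are available from a single stat call, which makes the distinction easy to check by hand. The following is a minimal sketch, assuming Linux, where st_blocks is counted in 512-byte units; file_sizes and print_sizes are names invented for this illustration and are not taken from ls or du.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Store the logical and the allocated size of `path`, both in bytes.
 * Returns 0 on success, -1 if stat fails.
 * Assumption: st_blocks counts 512-byte units, as it does on Linux. */
int file_sizes(const char *path, long long *apparent, long long *allocated)
{
    assert(path != NULL && apparent != NULL && allocated != NULL);

    struct stat st;
    if (stat(path, &st) != 0)
        return -1;

    *apparent = (long long)st.st_size;          /* the size ls -l prints */
    *allocated = (long long)st.st_blocks * 512; /* the space du accounts for */
    return 0;
}

/* Print both sizes for `path`. Returns 0 on success, -1 on error. */
int print_sizes(const char *path)
{
    long long apparent, allocated;
    if (file_sizes(path, &apparent, &allocated) != 0) {
        perror(path);
        return -1;
    }
    printf("%s: apparent %lld bytes, allocated %lld bytes\n",
           path, apparent, allocated);
    return 0;
}
```

For a sparse file such as fs.img the allocated figure should come out far smaller than the apparent one; for 1B.txt it should be the other way round.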

Of course, all of the above describes the default behavior of ls and du; each provides options that change it. Examples are the -s option of ls (print the allocated size of each file, in blocks) and the --apparent-size option of du (print apparent sizes, rather than disk usage; although the apparent size is usually smaller, it may be larger due to holes in (`sparse') files, internal fragmentation, indirect blocks, and the like).

In addition, when copying a sparse file, cp by default applies some optimizations to speed up the copy. For example:

$ strace cp fs.img fs.img.copy >log 2>&1

stat("fs.img.copy", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("fs.img", {st_mode=S_IFREG|0644, st_size=1073741824, ...}) = 0
stat("fs.img.copy", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
open("fs.img", O_RDONLY)                = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1073741824, ...}) = 0
open("fs.img.copy", O_WRONLY|O_TRUNC)   = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
mmap(NULL, 532480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90df965000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288) = 524288
lseek(4, 524288, SEEK_CUR)              = 524288
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288) = 524288
lseek(4, 524288, SEEK_CUR)              = 1048576
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 524288) = 524288
lseek(4, 524288, SEEK_CUR)              = 1572864

This is related to cp's --sparse option. From the cp man page:

By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well. That is the behavior selected by --sparse=auto. Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes. Use --sparse=never to inhibit creation of sparse files.

Looking at the cp source code, I found that after each read cp checks whether the data it read is all zeros; if so, it only calls lseek and skips the write.
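The read / check / lseek pattern visible in the strace output can be sketched in a few lines of C. This is a simplified illustration of the idea, not cp's actual code: sparse_copy, all_zero and the 64 KiB chunk size are choices made for this example. One detail the sketch has to handle is a file that ends in zeros, since a trailing lseek alone does not extend the output, so the final length is set with ftruncate.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

enum { CHUNK = 64 * 1024 }; /* arbitrary buffer size chosen for this sketch */

/* Return 1 if all n bytes of buf are zero, 0 otherwise. */
static int all_zero(const char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}

/* Copy src to dst, seeking over all-zero chunks instead of writing them.
 * Returns 0 on success, -1 on error. sparse_copy is a hypothetical name. */
int sparse_copy(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    char buf[CHUNK];
    ssize_t n;
    off_t end;
    int rc = -1;

    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (all_zero(buf, (size_t)n)) {
            /* Only zeros: advance the output offset and leave a hole. */
            if (lseek(out, n, SEEK_CUR) == (off_t)-1)
                goto done;
        } else {
            /* write may be partial, so loop until the chunk is written. */
            ssize_t off = 0;
            while (off < n) {
                ssize_t w = write(out, buf + off, (size_t)(n - off));
                if (w <= 0)
                    goto done;
                off += w;
            }
        }
    }
    if (n < 0)
        goto done;

    /* If the input ended in zeros, the last lseek did not extend the
     * output file, so set its final length explicitly. */
    end = lseek(out, 0, SEEK_CUR);
    if (end == (off_t)-1 || ftruncate(out, end) != 0)
        goto done;
    rc = 0;

done:
    close(in);
    if (close(out) != 0)
        rc = -1;
    return rc;
}
```

Unlike this sketch, which skips every all-zero chunk, cp's default --sparse=auto mode first decides heuristically whether the source looks sparse, as the man page excerpt describes.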

Of course, all of this handling of sparse files is transparent to the user.