[hadoop] 15. HDFS: Other Features

Date: 2019-11-05

Introduction

This chapter covers some miscellaneous HDFS features, all of which play a supporting role.

1. Copying data between clusters

Earlier we used scp to copy files between two remote hosts; it can both push and pull files.

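scp -r hello.txt root@h133:~/hello.txt        # push
scp -r root@h134:/user/hello.txt hello.txt    # pull
# Copy between two remote hosts, relayed through the local host (-3 makes the relay explicit);
# useful when ssh is not configured between the two remote hosts.
scp -3 -r root@h133:/user/hello.txt root@h134:/user/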

We can also use the distcp command to copy data recursively between two Hadoop clusters:

bin/hadoop distcp hdfs://h133:9000/user/hello.txt hdfs://h233:9000/user/hello.txt

Our current environment has only one cluster, so we cannot demonstrate this for now.
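Although we cannot run it here, a directory-level copy between the two clusters might look like the sketch below. The `run` helper and the `DRY_RUN` variable are our own illustration, not part of Hadoop; by default the script only prints the command, so it is safe to try without a second cluster. The directory paths are assumptions based on the hosts in the example above.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Assumed source and target directories on the two clusters.
SRC=hdfs://h133:9000/user
DST=hdfs://h233:9000/user

# -update copies only files that are missing or differ at the target;
# -m caps the number of simultaneous map tasks used for the copy.
run hadoop distcp -update -m 4 "$SRC" "$DST"
```

Setting `DRY_RUN=0` would execute the command for real; distcp runs as a MapReduce job, so YARN must be running on the cluster that launches it.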

2. Hadoop archives

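Every file is stored in blocks, and the metadata of each block is held in the NameNode's memory, so Hadoop stores small files very inefficiently: a large number of small files can exhaust most of the NameNode's memory. Note, however, that a small file takes no more disk space than its actual content. For example, a 1 MB file stored with a 128 MB block size uses 1 MB of disk space, not 128 MB.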
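To get a feel for the memory cost, here is a rough back-of-the-envelope estimate. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file, directory or block); that figure is an approximation we bring in for illustration, not a number from this tutorial, and real usage varies by version and configuration.

```shell
#!/bin/sh
# ASSUMPTION: ~150 bytes of NameNode heap per namespace object (rule of thumb).
BYTES_PER_OBJECT=150

# Estimate heap used by N small files, each small enough to fit in one block.
namenode_bytes() {
  files=$1
  # each small file contributes one file object and one block object
  echo $(( files * 2 * BYTES_PER_OBJECT ))
}

namenode_bytes 10000000   # 10 million small files -> 3000000000 bytes
```

Under this assumption, ten million small files cost roughly 3 GB of NameNode heap regardless of how little data they hold, which is the problem that archiving addresses.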

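A Hadoop archive, or HAR file, is a more efficient file archiving tool: it packs files into HDFS blocks, reducing NameNode memory usage while still allowing transparent access to the files.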

An additional benefit: a Hadoop archive can be used as input to MapReduce.

(1) Start the YARN daemons

start-yarn.sh

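(2) Archive the files. Archiving produces a directory named xxx.har that contains the corresponding data files. The xxx.har directory is a single unit and can simply be treated as one archive file.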

hadoop archive -archiveName <NAME>.har -p <parent path> [-r <replication factor>] <src>* <dest>

As practice, let's archive the /user directory.

#### List the files in /user
[root@h135 current]# hadoop fs -ls /user
Found 7 items
-rw-r--r--   3 root supergroup          5 2019-01-03 16:38 /user/444.txt
-rw-r--r--   3 root supergroup         19 2019-01-04 13:24 /user/consisit.txt
-rw-r--r--   3 root supergroup         19 2019-01-03 16:27 /user/h132.txt
-rw-r--r--   3 root supergroup         19 2019-01-03 16:28 /user/h133.txt
-rw-r--r--   3 root supergroup         23 2019-01-05 15:11 /user/h134.txt
-rw-r--r--   3 root supergroup         19 2019-01-03 16:28 /user/h135.txt
drwxr-xr-x   - root supergroup          0 2019-01-03 15:58 /user/zhaoyi

#### Run the archive operation
[root@h135 current]# hadoop archive -archiveName myhar.har -p /user /

#### Check how the directory has changed after archiving
[root@h135 current]# hadoop fs -ls  /
Found 7 items
-rw-r--r--   3 root supergroup         19 2019-01-05 14:57 /h134.txt
drwxr-xr-x   - root supergroup          0 2019-01-06 10:23 /myhar.har
-rw-r--r--   3 root supergroup         23 2019-01-05 19:07 /newslaver.txt
-rw-r--r--   3 root supergroup          4 2019-01-04 15:50 /seen_txid
drwxr-xr-x   - root supergroup          0 2019-01-05 19:34 /system
drwx------   - root supergroup          0 2019-01-06 10:20 /tmp
drwxr-xr-x   - root supergroup          0 2019-01-05 15:11 /user


#### Inspect the contents of the generated har directory
[root@h135 current]# hadoop fs -ls -R /myhar.har
-rw-r--r--   3 root supergroup          0 2019-01-06 10:23 /myhar.har/_SUCCESS
-rw-r--r--   5 root supergroup        699 2019-01-06 10:23 /myhar.har/_index
-rw-r--r--   5 root supergroup         23 2019-01-06 10:23 /myhar.har/_masterindex
-rw-r--r--   3 root supergroup        133 2019-01-06 10:23 /myhar.har/part-0

#### List the original files inside the archive
[root@h135 current]# hadoop fs -ls -R har:///myhar.har
-rw-r--r--   3 root supergroup          5 2019-01-03 16:38 har:///myhar.har/444.txt
-rw-r--r--   3 root supergroup         19 2019-01-04 13:24 har:///myhar.har/consisit.txt
-rw-r--r--   3 root supergroup         19 2019-01-03 16:27 har:///myhar.har/h132.txt
-rw-r--r--   3 root supergroup         19 2019-01-03 16:28 har:///myhar.har/h133.txt
-rw-r--r--   3 root supergroup         23 2019-01-05 15:11 har:///myhar.har/h134.txt
-rw-r--r--   3 root supergroup         19 2019-01-03 16:28 har:///myhar.har/h135.txt
drwxr-xr-x   - root supergroup          0 2019-01-03 15:58 har:///myhar.har/zhaoyi
-rw-r--r--   3 root supergroup         29 2019-01-03 15:58 har:///myhar.har/zhaoyi/a.txt
drwxr-xr-x   - root supergroup          0 2019-01-03 15:14 har:///myhar.har/zhaoyi/input

#### Work with the archive contents (copy one file out)
[root@h135 current]# hadoop fs -cp har:///myhar.har/444.txt /
[root@h135 current]# hadoop fs -ls  /
Found 8 items
-rw-r--r--   3 root supergroup          5 2019-01-06 10:27 /444.txt
-rw-r--r--   3 root supergroup         19 2019-01-05 14:57 /h134.txt
drwxr-xr-x   - root supergroup          0 2019-01-06 10:23 /myhar.har
-rw-r--r--   3 root supergroup         23 2019-01-05 19:07 /newslaver.txt
-rw-r--r--   3 root supergroup          4 2019-01-04 15:50 /seen_txid
drwxr-xr-x   - root supergroup          0 2019-01-05 19:34 /system
drwx------   - root supergroup          0 2019-01-06 10:20 /tmp
drwxr-xr-x   - root supergroup          0 2019-01-05 15:11 /user

The log printed while archiving shows that the work is done by a MapReduce job behind the scenes, which is why the YARN daemons had to be started beforehand.

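After archiving, the har file is created under /, the destination we specified, and the original directory that was archived still exists, so this can be thought of as a safe copy operation. Once archiving is complete, you can decide for yourself whether to delete the original files.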

Suppose we want to "extract" the archive (note that this is not really decompression). We can run the command below, which is simply a copy.

[root@h135 current]# hadoop fs -cp har:///myhar.har/* /

3. Snapshot management

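A snapshot is like a backup of a directory. It does not copy all the files immediately; instead it points to the same files. New file data is produced only when a write occurs.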

Snapshot command syntax:

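(1) hdfs dfsadmin -allowSnapshot <path>                  (enable snapshots on the given directory)
(2) hdfs dfsadmin -disallowSnapshot <path>               (disable snapshots on the given directory; disabled is the default)
(3) hdfs dfs -createSnapshot <path>                      (create a snapshot of the directory)
(4) hdfs dfs -createSnapshot <path> <snapshotName>       (create a snapshot with the given name)
(5) hdfs dfs -renameSnapshot <path> <oldName> <newName>  (rename a snapshot)
(6) hdfs lsSnapshottableDir                              (list all snapshottable directories of the current user)
(7) hdfs snapshotDiff <path> <from> <to>                 (show the differences between two snapshots of a directory)
(8) hdfs dfs -deleteSnapshot <path> <snapshotName>       (delete a snapshot)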

Let's practice using snapshots.

(1) Create a directory named test under / and upload two files into it.

[root@h133 current]# hadoop fs -mkdir /test
[root@h133 ~]# hadoop fs -put a.txt /test
[root@h133 ~]# hadoop fs -put b.txt /test
[root@h133 ~]# hadoop fs -ls -R /
drwxr-xr-x   - root supergroup          0 2019-01-06 10:51 /test
-rw-r--r--   3 root supergroup         29 2019-01-06 10:51 /test/a.txt
-rw-r--r--   3 root supergroup         21 2019-01-06 10:51 /test/b.txt

(2) Create a snapshot of the test directory and name it test-snap

[root@h133 ~]# hdfs dfsadmin -allowSnapshot /test
Allowing snaphot on /test succeeded
[root@h133 ~]# hdfs dfs -createSnapshot /test test-snap
Created snapshot /test/.snapshot/test-snap

Before creating a snapshot, snapshots must be enabled on the directory with -allowSnapshot.

(3) List the snapshottable directories

[root@h133 ~]# hdfs lsSnapshottableDir 
drwxr-xr-x 0 root supergroup 0 2019-01-06 10:53 1 65536 /test

(4) Upload a file to /test, then view the difference from the snapshot

[root@h133 ~]# hadoop fs -put h133.txt /test
[root@h133 ~]# hdfs snapshotDiff /test . .snapshot/test-snap
Difference between current directory and snapshot test-snap under directory /test:
M	.
-	./h133.txt

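The output shows that the directory has been modified (M) and that, compared with the current directory, the snapshot lacks the file h133.txt (-).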

In hdfs snapshotDiff /test . .snapshot/test-snap, the . stands for the current state, as the command help explains:

hdfs snapshotDiff <snapshotDir> <from> <to>:
	Get the difference between two snapshots, 
	or between a snapshot and the current tree of a directory.
	For <from>/<to>, users can use "." to present the current status,
	and use ".snapshot/snapshot_name" to present a snapshot,
	where ".snapshot/" can be omitted

The command can also be used to compare two different snapshots.

(5) Restore files from a snapshot

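If we want to go back to an earlier snapshot version, this version of HDFS provides no dedicated command. Put simply, we can use cp to copy the snapshot files back to the original path, which serves as one way of restoring from a snapshot.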

[root@h133 ~]# hadoop fs -rm -r /test/*
19/01/06 12:51:42 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /test/a.txt
19/01/06 12:51:42 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /test/b.txt
19/01/06 12:51:42 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /test/h133.txt
[root@h133 ~]# hadoop fs -ls /test


[root@h133 ~]# hdfs dfs -cp /test/.snapshot/test-snap/* /test
[root@h133 ~]# hadoop fs -ls /test
Found 2 items
-rw-r--r--   3 root supergroup         29 2019-01-06 12:55 /test/a.txt
-rw-r--r--   3 root supergroup         21 2019-01-06 12:55 /test/b.txt

In the steps above, we first deleted the three files in /test and then restored the files from the snapshot. Note that our rm operation did not delete the .snapshot data.
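To restore a single file rather than the whole directory, the snapshot path can be built from its three parts. The sketch below only prints the copy command; `snapshot_path` and `restore_cmd` are hypothetical helper names of our own, and the `-f` flag is added on the assumption that the target may already exist and should be overwritten.

```shell
#!/bin/sh
# Build the read-only path of a file inside a named snapshot:
#   <dir>/.snapshot/<snapshotName>/<relative path>
snapshot_path() {
  dir=$1; snap=$2; rel=$3
  echo "${dir%/}/.snapshot/${snap}/${rel}"
}

# Print (do not run) the command that would copy one file back into place.
restore_cmd() {
  echo "hdfs dfs -cp -f $(snapshot_path "$1" "$2" "$3") ${1%/}/$3"
}

restore_cmd /test test-snap a.txt
# prints: hdfs dfs -cp -f /test/.snapshot/test-snap/a.txt /test/a.txt
```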

(6) Delete the snapshot

[root@h133 ~]# hdfs dfs -deleteSnapshot /test test-snap

[root@h133 ~]#  hdfs dfs -cp /test/.snapshot/test-snap/* /test
cp: `/test/.snapshot/test-snap/*': No such file or directory

(7) Disable snapshots on the directory

[root@h133 ~]# hdfs lsSnapshottableDir
drwxr-xr-x 0 root supergroup 0 2019-01-06 12:55 0 65536 /test
[root@h133 ~]# hdfs dfsadmin -disallowSnapshot /test
Disallowing snaphot on /test succeeded
[root@h133 ~]# hdfs lsSnapshottableDir

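After disabling snapshots with hdfs dfsadmin -disallowSnapshot /test, remember to run hdfs lsSnapshottableDir to check whether the directory is still listed. Note that all snapshots of a directory must be deleted before snapshots can be disabled on it, which is why the snapshot was deleted first.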

4. Trash

4.1 Configuration overview

HDFS also has a trash mechanism, but it is disabled by default.

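The three main related properties are shown below. Their defaults and descriptions can be found in core-default.xml, and they are overridden in core-site.xml.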

<property>
  <name>fs.trash.interval</name>
  <value>0</value>
  <description>Number of minutes after which the checkpoint
  gets deleted.  If zero, the trash feature is disabled.
  This option may be configured both on the server and the
  client. If trash is disabled server side then the client
  side configuration is checked. If trash is enabled on the
  server side then the value configured on the server is
  used and the client configuration value is ignored.
  </description>
</property>

<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>0</value>
  <description>Number of minutes between trash checkpoints.
  Should be smaller or equal to fs.trash.interval. If zero,
  the value is set to the value of fs.trash.interval.
  Every time the checkpointer runs it creates a new checkpoint 
  out of current and removes checkpoints created more than 
  fs.trash.interval minutes ago.
  </description>
</property>

<!-- Static Web User Filter properties. -->
<property>
  <description>
    The user name to filter as, on static web filters
    while rendering content. An example use is the HDFS
    web UI (user to be used for browsing files).
  </description>
  <name>hadoop.http.staticuser.user</name>
  <value>dr.who</value>
</property>

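From the descriptions we learn that fs.trash.interval defaults to 0, which disables the trash; a non-zero value sets how long, in minutes, deleted files are kept. Note that fs.trash.checkpoint.interval also defaults to 0; it sets the interval, in minutes, between trash checkpoints. It must satisfy fs.trash.checkpoint.interval <= fs.trash.interval.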

We leave it unset here, so its value will equal the retention time we configure (fs.trash.interval).
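The relationship between the two properties can be sketched as a small function. The zero case follows the property description quoted above; falling back to fs.trash.interval when the checkpoint interval is larger is our assumption about how an out-of-range value is handled, so treat that branch as illustrative. The function name is hypothetical.

```shell
#!/bin/sh
# Effective checkpoint interval (minutes) given:
#   $1 = fs.trash.interval, $2 = fs.trash.checkpoint.interval
effective_checkpoint_interval() {
  interval=$1; checkpoint=$2
  if [ "$checkpoint" -eq 0 ] || [ "$checkpoint" -gt "$interval" ]; then
    # 0 means "use fs.trash.interval" (documented);
    # a larger value is assumed to fall back to fs.trash.interval as well
    echo "$interval"
  else
    echo "$checkpoint"
  fi
}

effective_checkpoint_interval 1 0       # prints 1  (the setup used in this chapter)
effective_checkpoint_interval 1440 60   # prints 60
```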

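When checkpointing is enabled, the Current directory is periodically renamed with a timestamp. Files in .Trash are permanently deleted after the user-configurable delay. Files and directories in the trash can be restored simply by moving them to a location outside the .Trash directory.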

The third property is hadoop.http.staticuser.user; we change its value to root (normally this should be the account that owns your Hadoop files).

4.2 Testing

Next, let's test the trash feature.

(1) Enable the trash

Add the following configuration to core-site.xml, setting the retention time of deleted files to 1 minute and the web UI user name to root.

<property>
    <name>fs.trash.interval</name>
    <value>1</value>
</property>
<property>
  <name>hadoop.http.staticuser.user</name>
  <value>root</value>
</property>

Set hadoop.http.staticuser.user to the owner of the HDFS files on your system.

(2) Delete a file

[root@h133 ~]# hadoop fs -rm /test/a.txt
19/01/06 13:32:27 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://h133:8020/test/a.txt' to trash at: hdfs://h133:8020/user/root/.Trash/Current

(3) View the trash

drwx------   - root supergroup          0 2019-01-06 13:32 /user/root
[root@h133 ~]# hadoop fs -ls /user/root
Found 1 items
drwx------   - root supergroup          0 2019-01-06 13:32 /user/root/.Trash

(4) This directory cannot yet be opened through the web UI, so we restart the cluster

[root@h134 current]# stop-yarn.sh

[root@h133 ~]# stop-dfs.sh 

[root@h133 ~]# start-dfs.sh 

[root@h134 current]# start-yarn.sh

Now the directory can be accessed from the web UI.

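(5) Restore data from the trash

The mv command can move files from the trash back to a normal directory. Note, however, that with this configuration a file is permanently removed from the trash about one minute after it is deleted.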

[root@h133 ~]# hadoop fs -rm /test/a.txt
19/01/06 13:45:56 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://h133:8020/test/a.txt' to trash at: hdfs://h133:8020/user/root/.Trash/Current
[root@h133 ~]# hadoop fs -mv /user/root/.Trash/Current/a.txt /test
[root@h133 ~]# hadoop fs -mv /user/root/.Trash/190106134600/test/a.txt /test
[root@h133 ~]# hadoop fs -ls /test
Found 1 items
-rw-r--r--   3 root supergroup         29 2019-01-06 13:45 /test/a.txt
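The transcript above suggests that the trash keeps a deleted file under its full original path, first below Current and, after a checkpoint, below a timestamped directory. A minimal sketch of that layout follows; the `trash_path` helper is hypothetical, and it assumes the user's home directory is /user/<user>, as it is in this setup.

```shell
#!/bin/sh
# Where a deleted file ends up in the trash:
#   /user/<user>/.Trash/<Current | checkpoint timestamp><original absolute path>
trash_path() {
  user=$1; original=$2; checkpoint=${3:-Current}
  echo "/user/${user}/.Trash/${checkpoint}${original}"
}

trash_path root /test/a.txt                # /user/root/.Trash/Current/test/a.txt
trash_path root /test/a.txt 190106134600   # /user/root/.Trash/190106134600/test/a.txt
```

This would explain why the second mv in the transcript includes both the checkpoint directory 190106134600 and the /test component: by then Current had already been renamed to a checkpoint, and the file sits under its original directory structure.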

(6) Empty the trash

hdfs dfs -expunge

Files deleted programmatically do not go through the trash; the program must call moveToTrash() to send them there:

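// conf is an org.apache.hadoop.conf.Configuration, path an org.apache.hadoop.fs.Path
Trash trash = new Trash(conf);   // org.apache.hadoop.fs.Trash
trash.moveToTrash(path);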

4.3 Trash summary

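The trash is disabled by default. For production environments, enabling it is recommended to guard against accidental deletion. Enabling the trash gives you a chance to recover data removed by a user operation or by mistake. It is also important to set suitable values for fs.trash.interval and fs.trash.checkpoint.interval so that trash cleanup behaves the way you expect. For example, if you frequently upload and delete files in HDFS, you may want to set fs.trash.interval to a smaller value; otherwise the checkpoints will take up too much space.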

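When the trash is enabled and some files are deleted, free HDFS capacity does not increase, because the files have not really been deleted. HDFS reclaims the space only once the files are removed from the trash, which happens only after their checkpoint expires.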

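By default the trash applies only to files and directories deleted through the Hadoop shell. Files or directories deleted programmatically through other interfaces (such as WebHDFS or the Java API) are not moved to the trash even when it is enabled, unless the program explicitly calls the trash API.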

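Sometimes you may want to bypass the trash for a particular deletion, so that the files or directories are deleted directly instead of being placed in the trash. In that case, run the rm command with the -skipTrash option.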

This concludes our study of HDFS. Next we will study the most interesting and most central module, MapReduce, which bears directly on application development.