HBase Data Backup && Disaster Recovery Solutions
Tags (space separated): HBase
1. Distcp
When backing up by distcp-copying the HDFS files, the table being backed up must be disabled so that no data is written to it during the copy; for an HBase cluster serving online traffic this approach is therefore not usable. Once the now-static directory has been distcp'ed to another HDFS filesystem, all of the data can be restored by starting a new HBase cluster directly on top of it.
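A minimal sketch of this flow, assuming the default hbase.rootdir of /hbase and placeholder NameNode addresses src-nn and dst-nn (both hypothetical): disable the table so its files stop changing, distcp the HBase root directory, then point a new HBase cluster's hbase.rootdir at the copied directory and start it.
hbase shell> disable 'table_name'
hadoop distcp hdfs://src-nn:8020/hbase hdfs://dst-nn:8020/hbase_backup
hbase shell> enable 'table_name'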
2. CopyTable
The table must first be created on the destination cluster before running the command. CopyTable supports a time range, a row range, renaming the table, renaming column families, and optionally copying deleted data, for example:
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=dstClusterZK:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
1) Same cluster, different table name
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=tableCopy srcTable
2) Copy a table across clusters
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase srcTable
Note that cross-cluster CopyTable works in push mode, i.e. the command must be run from the source cluster.
CopyTable help output:
$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>

Options:
 rs.class     hbase.regionserver.class of the peer cluster, specify if different from current cluster
 rs.impl      hbase.regionserver.impl of the peer cluster
 startrow     the start row
 stoprow      the stop row
 starttime    beginning of the time range (unixtime in millis); without endtime means from starttime to forever
 endtime      end of the time range. Ignored if no starttime specified.
 versions     number of cell versions to copy
 new.name     new table's name
 peer.adr     Address of the peer cluster given in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 families     comma-separated list of families to copy
              To copy from cf1 to cf2, give sourceCfName:destCfName.
              To keep the same name, just give "cfName"
 all.cells    also copy delete markers and deleted cells

Args:
 tablename    Name of the table to copy

Examples:
 To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable

For performance consider the following general options:
 It is recommended that you set the following to >=100. A higher value uses more memory but decreases the round trip time to the server and may increase performance.
   -Dhbase.client.scanner.caching=100
 The following should always be set to false, to prevent writing data twice, which may produce inaccurate results.
   -Dmapred.map.tasks.speculative.execution=false
Some examples:
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1478448000000 --endtime=1478591994506 --peer.adr=VECS00001,VECS00002,VECS00003:2181:/hbase --families=txjl --new.name=hy_membercontacts_bk hy_membercontacts
# Back up by time range
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1478448000000 --endtime=1478591994506 --new.name=hy_membercontacts_bk hy_membercontacts
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1477929600000 --endtime=1478591994506 --new.name=hy_linkman_tmp hy_linkman
# Back up a whole table
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=hy_mobileblacklist_bk_before_del hy_mobileblacklist
# Related: scan by time range
scan 'hy_linkman', {COLUMNS => 'lxr:sguid', TIMERANGE => [1478966400000, 1479052799000]}
scan 'hy_mobileblacklist', {COLUMNS => 'mobhmd:sguid', TIMERANGE => [1468719824000, 1468809824000]}
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=hy_mobileblacklist_bk_before_del_20161228 hy_mobileblacklist
3. Export/Import (via MapReduce)
## Export
Run the export command. Parameters can be customized with -D properties; the example below restricts the table name, column family, start/stop row keys, and the HDFS export directory:
hbase org.apache.hadoop.hbase.mapreduce.Export -D hbase.mapreduce.scan.column.family=cf -D hbase.mapreduce.scan.row.start=0000001 -D hbase.mapreduce.scan.row.stop=1000000 table_name /tmp/hbase_export
### Optional -D properties
Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> [<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]

  Note: -D properties will be applied to the conf used.
  For example:
   -D mapred.output.compress=true
   -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
   -D mapred.output.compression.type=BLOCK
  Additionally, the following SCAN properties can be specified to control/limit what is exported..
   -D hbase.mapreduce.scan.column.family=<familyName>
   -D hbase.mapreduce.include.deleted.rows=true
For performance consider the following properties:
   -Dhbase.client.scanner.caching=100
   -Dmapred.map.tasks.speculative.execution=false
   -Dmapred.reduce.tasks.speculative.execution=false
For tables with very wide rows consider setting the batch size as below:
   -Dhbase.export.scanner.batch=10
## Import
Run the import command.
The table must exist before importing: create 'table_name','cf'
### Run the import command
hbase org.apache.hadoop.hbase.mapreduce.Import table_name hdfs://flashhadoop/tmp/hbase_export/
### Optional -D properties
Usage: Import [options] <tablename> <inputdir>
By default Import will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
  -Dimport.bulk.output=/path/for/output
To apply a generic org.apache.hadoop.hbase.filter.Filter to the input, use
  -Dimport.filter.class=<name of filter class>
  -Dimport.filter.args=<comma separated list of args for filter
 NOTE: The filter will be applied BEFORE doing key renames via the HBASE_IMPORTER_RENAME_CFS property. Further, filters will only use the Filter#filterRowKey(byte[] buffer, int offset, int length) method to identify whether the current row needs to be ignored completely for processing and Filter#filterKeyValue(KeyValue) method to determine if the KeyValue should be added; Filter.ReturnCode#INCLUDE and #INCLUDE_AND_NEXT_COL will be considered as including the KeyValue.
For performance consider the following options:
  -Dmapred.map.tasks.speculative.execution=false
  -Dmapred.reduce.tasks.speculative.execution=false
  -Dimport.wal.durability=<Used while writing data to hbase. Allowed values are the supported durability values like SKIP_WAL/ASYNC_WAL/SYNC_WAL/...>
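As a sketch of the bulk-load variant described above (the HDFS staging path is a placeholder), Import can write HFiles to a staging directory and LoadIncrementalHFiles can then load them into the table:
hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/tmp/hbase_import_hfiles table_name hdfs://flashhadoop/tmp/hbase_export/
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hbase_import_hfiles table_name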
4. Snapshot
A snapshot is an image of an HBase table.
Snapshot support must be enabled on the HBase cluster beforehand:
<property>
  <name>hbase.snapshot.enabled</name>
  <value>true</value>
</property>
In the hbase shell, the snapshot, list_snapshots, restore_snapshot, clone_snapshot and delete_snapshot commands are used to create a snapshot, list existing snapshots, restore a table from a snapshot, create a new table from a snapshot, and delete a snapshot.
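For example (table and snapshot names are placeholders; restore_snapshot requires the table to be disabled first):
hbase shell> snapshot 'table_name', 'table_name_snapshot'
hbase shell> list_snapshots
hbase shell> clone_snapshot 'table_name_snapshot', 'table_name_new'
hbase shell> disable 'table_name'
hbase shell> restore_snapshot 'table_name_snapshot'
hbase shell> enable 'table_name'
hbase shell> delete_snapshot 'table_name_snapshot'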
After a snapshot has been created, the ExportSnapshot tool can export it to another cluster for data backup or migration. ExportSnapshot is used as follows (it also works in push mode, i.e. it is run from the current cluster towards the destination cluster):
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot table_name_snapshot -copy-to hdfs://flashhadoop_2/hbase -mappers 2
After this command runs, the table_name_snapshot directory is copied into the /hbase/.hbase-snapshot directory on flashhadoop_2's HDFS. On the flashhadoop_2 HBase cluster, list_snapshots will then show a snapshot named table_name_snapshot. With clone_snapshot that snapshot can be copied into a new table without creating the table in advance, and the new table's region count and other metadata exactly match the snapshot. Alternatively, you can first create a table identical to the original and then restore it via restore_snapshot, but this leaves one extra region, and that region will be invalid.
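For example, in the hbase shell on flashhadoop_2 (following the snapshot name used above):
hbase shell> list_snapshots
hbase shell> clone_snapshot 'table_name_snapshot', 'table_name'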
After using a snapshot to copy a cluster's data to the new cluster, the application turns on dual writes; the Export tool can then be used to bring the data written between the snapshot and the start of dual writes into the new cluster, completing the migration. To make sure no data is lost, the time range passed to Export can be widened a little.
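A sketch of that catch-up step, with placeholder timestamps that bracket the window between the snapshot and the start of dual writes (widened at both ends); Export's positional arguments after the output directory are <versions> <starttime> <endtime>, and the Import is then run on the new cluster against the exported directory:
hbase org.apache.hadoop.hbase.mapreduce.Export table_name hdfs://flashhadoop_2/tmp/hbase_export_delta 1 1478440000000 1478600000000
hbase org.apache.hadoop.hbase.mapreduce.Import table_name hdfs://flashhadoop_2/tmp/hbase_export_delta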
5. Replication
The replication mechanism can run HBase clusters in master-slave mode, or even master-master mode, i.e. with both sides replicating to each other. The steps are as follows:
1. If the master and slave HBase clusters share one ZooKeeper ensemble, zookeeper.znode.parent cannot be the default hbase on both; they can be set to, say, hbase-master and hbase-slave. In short, their znode names in ZooKeeper must not collide.
2. Add the following properties to hbase-site.xml on both the master and the slave cluster. (In fact, for plain master-slave mode it is enough to set hbase.replication to true on the slave cluster; the other properties can be left out.)
<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>
<property>
  <name>replication.source.nb.capacity</name>
  <value>25000</value>
  <description>Maximum number of entries the master cluster ships to the slave cluster per batch; default 25000, adjust to the cluster size.</description>
</property>
<property>
  <name>replication.source.size.capacity</name>
  <value>67108864</value>
  <description>Maximum size of each batch of entries the master cluster ships to the slave cluster; default 64 MB.</description>
</property>
<property>
  <name>replication.source.ratio</name>
  <value>1</value>
  <description>Fraction of the slave cluster's region servers the master cluster uses; default 0.1 (0.15 in 1.x.x). Set to 1 to make full use of the slave cluster's region servers.</description>
</property>
<property>
  <name>replication.sleep.before.failover</name>
  <value>2000</value>
  <description>How long the master cluster waits after a region server dies before failing over its replication queues; default 2 seconds. The actual sleep is sleepBeforeFailover + (long) (new Random().nextFloat() * sleepBeforeFailover).</description>
</property>
<property>
  <name>replication.executor.workers</name>
  <value>1</value>
  <description>Number of threads doing replication; default 1, increase it if the write volume is high.</description>
</property>
3. Restart the master and slave clusters. (For freshly built clusters no restart is needed; just start them.)
4. Then, in the hbase shell on the master and slave clusters respectively:
add_peer 'ID' 'CLUSTER_KEY'
The ID must be a short integer. To compose the CLUSTER_KEY, use the following template:
  hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent
This will show you the help to setup the replication stream between both clusters. If both clusters use the same Zookeeper cluster, you have to use a different zookeeper.znode.parent since they can't write in the same folder.
1. Add replication of data tables from the primary HBase cluster to the disaster recovery (DR) HBase cluster:
add_peer '1', "VECS00840,VECS00841,VECS00842,VECS00843,VECS00844:2181:/hbase"
2. Add replication of data tables from the DR HBase cluster back to the primary HBase cluster:
add_peer '2', "VECS00994,VECS00995,VECS00996,VECS00997,VECS00998:2181:/hbase"
3. Then create tables with completely identical structure and attributes on both the primary and the standby cluster. (Note: they must be completely identical.)
Create the table on both the master and slave clusters:
hbase shell> create 't_warehouse_track', {NAME => 'cf', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
4. In the hbase shell on the primary cluster:
enable_table_replication 't_warehouse_track'
5. In the hbase shell on the DR cluster:
disable 'your_table'
alter 'your_table', {NAME => 'family_name', REPLICATION_SCOPE => '1'}
enable 'your_table'
The 1 in REPLICATION_SCOPE => '1' has nothing to do with the peer 'ID' set with add_peer above; this value can only be 0 or 1, indicating whether replication is disabled or enabled for the column family.
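To sanity-check the replication link (the column and row key below are placeholders): check the replication status, put a test row on the primary, then read it back on the DR cluster.
On the primary cluster:
hbase shell> status 'replication'
hbase shell> put 't_warehouse_track', 'test_row_1', 'cf:col1', 'v1'
On the DR cluster:
hbase shell> get 't_warehouse_track', 'test_row_1'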