Notes on HBase RowKey Design

Date: 2019-11-12


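Before getting into RowKey design, let me answer a question you may have when configuring HBase: does HBase need a separately managed ZooKeeper? If you are only deploying HBase, I suggest not running a separate ZooKeeper and simply using the ZooKeeper instance that HBase manages itself. If you also plan to deploy other applications, such as Spark, you can deploy a standalone ZooKeeper cluster. With that out of the way, let's talk about RowKey design.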

First, the Underlying HBase Architecture

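For newcomers, RowKey design is unfamiliar territory. It looks simple but is actually quite complex. RowKey design mainly affects two things: the dimensions available for analysis and query performance. Why split it this way? Let's look back at the HBase system architecture diagram. [Figure: HBase system architecture]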

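This design looks fine at first glance, but it hides many pitfalls. If the CompanyCode field is nearly constant while the TimeStamp varies widely, the rows are stored contiguously in a single Region. With a large data volume, that one Region ends up holding most of the data, and applications that access it hit RPC timeouts or even fail outright.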

A Sound RowKey Design Approach

For the reasons above, we need to consider two factors, single-point concentration (hotspotting) and data querying, and design the RowKey to address both.

First, hotspotting. It typically has the following causes:

• The leading characters of the RowKey are too fixed

• The cluster has too few nodes

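The number of cluster nodes is limited by our hardware resources, so we set it aside and focus on RowKey design. Since the problem is that the leading characters are too concentrated, we can prepend a random-looking prefix (a salt) to the RowKey. The following salt calculation is adapted from the book HBase Essentials:

int saltNumber = Math.floorMod(Long.hashCode(timestamp), <number of region servers>);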

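This way, when inserting data we can deliberately scatter the rows from a given time period across the Regions of all RegionServers and take full advantage of the distributed system. This not only speeds up reads and writes but also solves the data concentration problem.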
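
To make the salting idea concrete, here is a minimal sketch in plain Java. The class name SaltExample, the bucket count of 3 and the sample timestamp are assumptions made for illustration; in a real cluster the bucket count would follow the number of region servers, as described above.

```java
import java.util.Arrays;

final class SaltExample {
    // Hypothetical bucket count; the text uses the number of region servers.
    static final int NUM_BUCKETS = 3;

    /** Derives a stable salt in [0, NUM_BUCKETS) from a timestamp. */
    static int saltFor(long timestamp) {
        // Long.hashCode can be negative; floorMod keeps the salt non-negative.
        return Math.floorMod(Long.hashCode(timestamp), NUM_BUCKETS);
    }

    public static void main(String[] args) {
        long base = 1573516800000L; // arbitrary sample timestamp in milliseconds
        int[] counts = new int[NUM_BUCKETS];
        // Count how many of nine consecutive timestamps fall into each bucket.
        for (int i = 0; i < 9; i++) {
            counts[saltFor(base + i)]++;
        }
        System.out.println(Arrays.toString(counts));
    }
}
```

Because the salt is computed from the timestamp rather than drawn at random, a reader that knows the timestamp can recompute the prefix. A scan over a time range, however, has to cover every salt value.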

The Improved RowKey Design

Based on the discussion above, we can settle on the following RowKey design:

Random characters (2 chars) + timestamp (14 chars) + CompanyCode (4 chars)
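
The layout above could be assembled roughly as follows. This is only a sketch under stated assumptions: the 14-character time part is taken to be formatted as yyyyMMddHHmmss, the salt is a zero-padded two-digit bucket number derived from the timestamp, and RowKeyBuilder, NUM_BUCKETS and the sample company code are hypothetical names and values that do not come from the original project.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class RowKeyBuilder {
    // Hypothetical bucket count; it must stay below 100 to fit in two digits.
    static final int NUM_BUCKETS = 16;

    // Assumed format of the 14-character time part.
    private static final DateTimeFormatter TS_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Builds salt (2) + timestamp (14) + company code (4) = 20 characters. */
    static String build(LocalDateTime time, String companyCode) {
        if (companyCode == null || companyCode.length() != 4) {
            throw new IllegalArgumentException("companyCode must be exactly 4 characters");
        }
        String ts = time.format(TS_FORMAT);
        // Derive the salt from the numeric timestamp so it can be recomputed later.
        int salt = Math.floorMod(Long.hashCode(Long.parseLong(ts)), NUM_BUCKETS);
        return String.format("%02d%s%s", salt, ts, companyCode);
    }

    public static void main(String[] args) {
        System.out.println(build(LocalDateTime.of(2019, 11, 12, 8, 30, 0), "C001"));
    }
}
```

Keeping every part at a fixed width means the key can be split back into its fields by position, and keys within one salt bucket still sort by time.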

In my actual tests comparing the two designs, the MR job took 1 hour with the former and only 5 minutes with the latter.

Writing Query Code Sensibly

Once the data is stored, retrieving part of it requires a Scan query. Below is some of the code I used in practice, for reference only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;

// ConstantValues, RowKeyGenerator, KPIExtractor, KPIMapper and KPIReducer are
// project classes that are not shown in this article.
public class HBaseTableDriver extends Configured implements Tool {

    @Override
    public int run(String[] arg0) throws Exception {
        // arg0[4] (calid) is read below, so exactly five arguments are required.
        if (arg0.length != 5)
            throw new IllegalArgumentException("The input argument need:start && stop && farmid && turbineNum && calid");
        if (arg0[0].length() != 8 || arg0[1].length() != 8)
            throw new IllegalArgumentException("The date format should be yyyyMMdd");

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", ConstantValues.QUOREM);
        conf.set("hbase.zookeeper.property.clientPort", ConstantValues.CLIENT_PORT);

        // Pass the query parameters to the map/reduce tasks through the Configuration.
        conf.set("start", arg0[0]);
        conf.set("stop", arg0[1]);
        conf.set("farmid", arg0[2]);
        conf.set("turbineNum", arg0[3]);
        conf.set("calid", arg0[4]);

        // Scan boundaries: salt prefix + date + time + farm id + turbine number.
        String startRow = "0" + arg0[0] + " 000000" + arg0[2] + "001";
        String stopRow = "2" + arg0[1] + " 235959" + arg0[2] + RowKeyGenerator.addZero(Integer.parseInt(arg0[3]));

        String targetKpiTableName = "kpi2";

        Job job = Job.getInstance(conf, "KPIExtractor");
        job.setJarByClass(KPIExtractor.class);
        job.setNumReduceTasks(6);

        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("f"), Bytes.toBytes("v"));

        // One salt digit, the year of the start or stop date, then 17 more digits.
        String regEx = "^\\d{1}(?:" + arg0[0].substring(0, 4) + "|" + arg0[1].substring(0, 4) + ")\\d{17}";

        // Each calid selects the RowKey suffix (or pair of suffixes) to read.
        switch (arg0[4]) {
        case "1":
            regEx = regEx + "(?:823|834)$";
            startRow = startRow + "823";
            stopRow = stopRow + "834";
            break;
        case "2":
            regEx = regEx + "211$";
            startRow = startRow + "211";
            stopRow = stopRow + "211";
            break;
        case "3":
            regEx = regEx + "544$";
            startRow = startRow + "544";
            stopRow = stopRow + "544";
            break;
        case "4":
            regEx = regEx + "208$";
            startRow = startRow + "208";
            stopRow = stopRow + "208";
            break;
        case "5":
            regEx = regEx + "(?:739|823)$";
            startRow = startRow + "739";
            stopRow = stopRow + "823";
            break;
        case "6":
            regEx = regEx + "(?:211|823)$";
            startRow = startRow + "211";
            stopRow = stopRow + "823";
            break;
        case "7":
            regEx = regEx + "708$";
            startRow = startRow + "708";
            stopRow = stopRow + "708";
            break;
        case "8":
            regEx = regEx + "822$";
            startRow = startRow + "822";
            stopRow = stopRow + "822";
            break;
        case "9":
            regEx = regEx + "211$";
            startRow = startRow + "211";
            stopRow = stopRow + "211";
            break;
        default:
            throw new IllegalArgumentException("UnKnown Argument calid:" + arg0[4] + ",it should be between 1~9");
        }

        scan.setStartRow(Bytes.toBytes(startRow));
        scan.setStopRow(Bytes.toBytes(stopRow));
        // Keep only the rows whose RowKey matches the regular expression.
        scan.setFilter(new RowFilter(CompareOp.EQUAL, new RegexStringComparator(regEx)));

        TableMapReduceUtil.initTableMapperJob("hellowrold", scan, KPIMapper.class, ImmutableBytesWritable.class, ImmutableBytesWritable.class, job);
        TableMapReduceUtil.initTableReducerJob(targetKpiTableName, KPIReducer.class, job);

        // Report job failure through the exit code instead of always returning 0.
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

Notes:

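• The code above mainly uses RowFilter to filter on the RowKey. While researching this, I also saw advice against using ColumnFilter on large data volumes because its performance is very poor.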

• Parameter values can be passed to the Map/Reduce tasks through the Configuration.
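
The RowKey regular expression used with RowFilter can be tried out without a cluster. The sketch below rebuilds the same pattern with plain java.util.regex. It assumes that RegexStringComparator applies find-style matching, which is why the pattern carries the ^ and $ anchors. The class RowKeyRegexDemo and the sample keys are hypothetical; the keys are 25-digit strings shaped only to fit the pattern.

```java
import java.util.regex.Pattern;

final class RowKeyRegexDemo {
    /** Rebuilds the pattern: salt digit, start or stop year, 17 digits, suffix. */
    static Pattern buildPattern(String startDate, String stopDate, String suffixRegex) {
        String regEx = "^\\d{1}(?:" + startDate.substring(0, 4) + "|"
                + stopDate.substring(0, 4) + ")\\d{17}" + suffixRegex + "$";
        return Pattern.compile(regEx);
    }

    /** Mirrors the assumed find-style check of the row filter. */
    static boolean accepts(Pattern pattern, String rowKey) {
        return pattern.matcher(rowKey).find();
    }

    public static void main(String[] args) {
        Pattern p = buildPattern("20190101", "20190131", "(?:823|834)");
        // Hypothetical keys: salt + yyyyMMdd + HHmmss + 4 digits + 3 digits + suffix.
        String wanted = "1" + "20190115" + "120000" + "0001" + "007" + "823";
        String otherSuffix = "1" + "20190115" + "120000" + "0001" + "007" + "211";
        String otherYear = "1" + "20180115" + "120000" + "0001" + "007" + "823";
        System.out.println(accepts(p, wanted));      // true
        System.out.println(accepts(p, otherSuffix)); // false
        System.out.println(accepts(p, otherYear));   // false
    }
}
```

A filter like this is still evaluated for every row between the start row and the stop row, so tight scan boundaries matter as much as the pattern itself.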