Resolving a Permanent RIT During an HBase Region Merge

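What should you do when a Region gets stuck permanently in RIT (Region-In-Transition) during a merge? I ran into exactly this in production: while merging Regions in bulk, a Region stayed in the MERGING_NEW state indefinitely. This does not affect the cluster's ability to serve requests, but if a node in the cluster restarts, the Regions on that RegionServer may not be rebalanced. While any Region is in RIT, HBase does not run Region load balancing, and running the balancer command manually has no effect either.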
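
The balancer behavior described above can be pictured with a minimal Java sketch. It is a simplified model and not HBase source code: the class BalancerGuardSketch and its methods are hypothetical names, and the only thing modeled is that a balance request is refused while any Region is in transition.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model; not HBase source code.
class BalancerGuardSketch {
    // Region name -> transition state, standing in for the Master's in-memory RIT map.
    private final Map<String, String> regionsInTransition = new HashMap<>();

    void markInTransition(String region, String state) {
        regionsInTransition.put(region, state);
    }

    // Returns true only when a balancing round would actually run.
    boolean tryBalance() {
        if (!regionsInTransition.isEmpty()) {
            // Models the behavior described in the text: any RIT blocks balancing.
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        BalancerGuardSketch master = new BalancerGuardSketch();
        System.out.println(master.tryBalance()); // prints "true"
        master.markInTransition("merged-region", "MERGING_NEW");
        System.out.println(master.tryBalance()); // prints "false"
    }
}
```

In this model a single stuck MERGING_NEW entry is enough to make every later balance request a no-op, which is why a permanent RIT matters even though reads and writes keep working.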

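If this RIT is left unresolved and more HBase nodes restart one after another, Region distribution across the cluster becomes severely unbalanced. That is a serious problem and can badly degrade cluster performance. A search of the HBase JIRA shows that this permanent MERGING_NEW RIT is caused by the bug tracked in HBASE-17682, and the fix is to apply that patch. The root cause is that the HBase source does not check for the MERGING_NEW state in this branch logic, so the Region falls straight through to the else branch. The source code is as follows: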

for (RegionState state : regionsInTransition.values()) {
  HRegionInfo hri = state.getRegion();
  if (assignedRegions.contains(hri)) {
    // Region is open on this region server, but in transition.
    // This region must be moving away from this server, or splitting/merging.
    // SSH will handle it, either skip assigning, or re-assign.
    LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn);
  } else if (sn.equals(state.getServerName())) {
    // Region is in transition on this region server, and this
    // region is not open on this server. So the region must be
    // moving to this server from another one (i.e. opening or
    // pending open on this server, was open on another one.
    // Offline state is also kind of pending open if the region is in
    // transition. The region could be in failed_close state too if we have
    // tried several times to open it while this region server is not reachable)
    if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) {
      LOG.info("Found region in " + state +
          " to be reassigned by ServerCrashProcedure for " + sn);
      rits.add(hri);
    } else if (state.isSplittingNew()) {
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else {
      LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state);
    }
  }
}

The code after the fix:

for (RegionState state : regionsInTransition.values()) {
  HRegionInfo hri = state.getRegion();
  if (assignedRegions.contains(hri)) {
    // Region is open on this region server, but in transition.
    // This region must be moving away from this server, or splitting/merging.
    // SSH will handle it, either skip assigning, or re-assign.
    LOG.info("Transitioning " + state + " will be handled by ServerCrashProcedure for " + sn);
  } else if (sn.equals(state.getServerName())) {
    // Region is in transition on this region server, and this
    // region is not open on this server. So the region must be
    // moving to this server from another one (i.e. opening or
    // pending open on this server, was open on another one.
    // Offline state is also kind of pending open if the region is in
    // transition. The region could be in failed_close state too if we have
    // tried several times to open it while this region server is not reachable)
    if (state.isPendingOpenOrOpening() || state.isFailedClose() || state.isOffline()) {
      LOG.info("Found region in " + state +
          " to be reassigned by ServerCrashProcedure for " + sn);
      rits.add(hri);
    } else if (state.isSplittingNew()) {
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else if (isOneOfStates(state, State.SPLITTING_NEW, State.MERGING_NEW)) {
      // The fix: MERGING_NEW regions are now also cleaned up if they have no meta entry.
      regionsToCleanIfNoMetaEntry.add(state.getRegion());
    } else {
      LOG.warn("THIS SHOULD NOT HAPPEN: unexpected " + state);
    }
  }
}

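To make the difference between the two versions easier to see, here is a minimal, self-contained Java sketch of just the inner if/else chain. It is a simplified model and not HBase source code: the class OfflineBranchSketch, the two enums, and the decide method are hypothetical names, and the patched flag simply switches between the behavior before and after the fix shown above.

```java
// Hypothetical, simplified model of the inner branch; not HBase source code.
class OfflineBranchSketch {
    enum State { OPENING, PENDING_OPEN, FAILED_CLOSE, OFFLINE, SPLITTING_NEW, MERGING_NEW, MERGING }

    enum Action { REASSIGN, CLEAN_IF_NO_META_ENTRY, UNEXPECTED }

    // Decides what happens to a region that is in transition on the offline server.
    static Action decide(State state, boolean patched) {
        switch (state) {
            case OPENING:
            case PENDING_OPEN:
            case FAILED_CLOSE:
            case OFFLINE:
                return Action.REASSIGN;
            case SPLITTING_NEW:
                return Action.CLEAN_IF_NO_META_ENTRY;
            case MERGING_NEW:
                // Before the fix this state is not checked and hits the final else.
                return patched ? Action.CLEAN_IF_NO_META_ENTRY : Action.UNEXPECTED;
            default:
                return Action.UNEXPECTED;
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(State.MERGING_NEW, false)); // prints "UNEXPECTED"
        System.out.println(decide(State.MERGING_NEW, true));  // prints "CLEAN_IF_NO_META_ENTRY"
    }
}
```

In the unpatched path the MERGING_NEW region is only logged and never cleaned up, which matches the permanent RIT seen in production.
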
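There is a catch, however. The JIRA issue only says that the bug needs to be fixed by applying the patch. In a real production environment, you cannot stop the cluster for a long time and disrupt application reads and writes just to deal with this RIT. So is there a temporary workaround that clears the current permanent MERGING_NEW RIT first, leaving the HBase version upgrade for later?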

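There is. Analyzing the merge flow shows that when HBase merges Regions, it first creates the new Region in an initial MERGING_NEW state. The overall Region merge flow is as follows:

[Figure: Region merge flow chart]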

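As the flow chart shows, MERGING_NEW is an initial state that lives in the active Master's memory, and the backup Master has no MERGING_NEW state for the new Region in its memory. A Master failover (switching the active and backup Masters) can therefore clear this permanent RIT temporarily. Because HBase is a highly available cluster, the failover is transparent to user applications. So a permanent RIT in the MERGING_NEW state can be handled temporarily by failing over the HBase Master. Afterwards, fix the bug properly by applying the patch and upgrading HBase.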
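
The reasoning behind the failover workaround can be sketched in a few lines of Java. This is a toy model and not HBase source code: the class MasterFailoverSketch, the nested Master class, and the idea that a Master fills its in-memory view from a plain set of durably recorded Regions are all assumptions made for illustration. The only point shown is that a state held solely in the active Master's memory does not exist on a Master that builds its view afresh.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model; not HBase source code.
class MasterFailoverSketch {
    static class Master {
        // Region name -> state, standing in for the Master's in-memory region states.
        final Map<String, String> regionStates = new HashMap<>();

        // Assumption for this sketch: a Master builds its view from durable records only.
        void loadFromMeta(Set<String> meta) {
            regionStates.clear();
            for (String region : meta) {
                regionStates.put(region, "OPEN");
            }
        }

        boolean hasMergingNew() {
            return regionStates.containsValue("MERGING_NEW");
        }
    }

    public static void main(String[] args) {
        Set<String> meta = new TreeSet<>(Arrays.asList("regionA", "regionB"));

        Master active = new Master();
        active.loadFromMeta(meta);
        // The merge creates the new region only in the active Master's memory.
        active.regionStates.put("regionAB", "MERGING_NEW");

        Master backup = new Master();
        backup.loadFromMeta(meta); // the new active Master after a failover

        System.out.println(active.hasMergingNew()); // prints "true"
        System.out.println(backup.hasMergingNew()); // prints "false"
    }
}
```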
