Implementing HBase Secondary Indexes with Solr on CDH

Date: 2019-11-11


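1. Why use Solr for secondary indexes
2. Real-time query solution
3. Deployment procedure
3.1 Installing HBase and Solr
3.2 Enabling HBase replication
3.3 Creating the corresponding SolrCloud collection
3.4 Creating the Lily HBase Indexer configuration
3.5 Creating the Morphline configuration file
3.6 Registering the Lily HBase Indexer configuration with the Lily HBase Indexer Service
3.7 Synchronizing data
3.8 Batch-synchronizing the index
3.9 Configuring multiple indexers
4. Creating, updating, and deleting data
4.1 Insert
4.2 Update
4.3 Delete
4.4 Summary
5. Additional commands
6. FAQ
6.1 Creating the indexer fails because it already exists
6.2 Creating the indexer fails
6.3 Batch indexing with the bundled indexer tool fails: morphlines.conf not found
6.4 Batch indexing with the bundled indexer tool fails: solrconfig.xml not found
6.5 Batch indexing with the bundled indexer tool fails: Java heap space
6.6 HBaseIndexer exits shortly after starting
6.7 Data synchronized by HBaseIndexer is inconsistent with Solr
6.8 After setting read-row="never" to fix 6.7, some fields are lost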

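1. Why use Solr for secondary indexes

In HBase, a table's rows are sorted lexicographically by RowKey, and regions are sharded at split points on the RowKey. The global, distributed index this produces is one of the main reasons for HBase's success.

However, retrieving data by RowKey alone no longer meets every need, and querying has become HBase's bottleneck. People want to retrieve data as quickly as with SQL, but HBase was designed for storing very large tables. Such queries usually mean running a full-table MapReduce job through systems like Hive or Pig, which wastes compute resources and whose high latency makes applications sluggish. HBase secondary indexing solutions emerged to address this.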

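Solr

Solr is a standalone enterprise search server and the open-source enterprise search platform of the Apache Lucene project. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich-document (e.g. Word, PDF) handling. Solr is highly scalable and provides distributed search and index replication. Solr 4 added NoSQL support and SolrCloud, a ZooKeeper-based distributed extension; for details on SolrCloud, see "SolrCloud distributed deployment". Solr is written in Java and built on Lucene. It extends Lucene with a richer query language, efficient and flexible caching, and vertical search; it is configurable, extensible, tuned for query performance, and ships with a full-featured administration interface.

Solr can also highlight search results, improves availability through index replication, provides a powerful data schema for defining fields, types, and text analysis, and offers a web-based administration interface.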

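Key-Value Store Indexer

This component is critical: it is the intermediary that builds Solr indexes from HBase.

In CDH 5.3.2, the Key-Value Store Indexer uses the Lily HBase NRT Indexer service.

Lily HBase Indexer is a flexible, scalable, fault-tolerant, transactional, near-real-time distributed service for indexing HBase column data. It is part of the Lily system developed by NGDATA and is open source. Lily HBase Indexer stores the HBase index data in SolrCloud. When HBase performs a write, update, or delete, the Indexer uses HBase replication to turn the operation into a series of events, which keeps the index data written to Solr consistent with HBase. The Indexer also supports user-defined extraction and transformation rules for indexing HBase column data. Solr search results include the user-defined columnfamily:qualifier fields, so applications can go straight to the HBase column data. Indexing and searching do not affect HBase's stability or write throughput, because they are completely separate from HBase and asynchronous. In CDH 5, Lily HBase Indexer depends on the HBase, SolrCloud, and ZooKeeper services.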

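2. Real-time query solution

HBase --> Key-Value Store Indexer --> Solr --> real-time query and display in the web front end

1. HBase provides mass data storage.

2. Solr provides index building and querying.

3. The Key-Value Store Indexer automates index building (from HBase to Solr).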

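3. Deployment procedure

3.1 Installing HBase and Solr

The following service instances are required:

An HBase instance

A Key-Value Store Indexer instance (installed under /opt/cloudera/parcels/CDH/lib/hbase-solr)

A Solr instance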

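3.2 Enabling HBase replication

Installing the Key-Value Store Indexer turns on HBase replication by default.

Next, enable replication on the HBase tables themselves.
For a new table, use:

create 'table', {NAME => 'cf', REPLICATION_SCOPE => 1}
# 1 enables replication for the column family, 0 disables it; the default is 0

For an existing table, use:

disable 'table'
alter 'table', {NAME => 'cf', REPLICATION_SCOPE => 1}
enable 'table'

For testing, I create a new table named HBase_Indexer_Test:

create 'HBase_Indexer_Test', {NAME => 'cf1', REPLICATION_SCOPE => 1}

and insert two rows: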

put 'HBase_Indexer_Test','001','cf1:name','xiaoming'
put 'HBase_Indexer_Test','002','cf1:name','xiaohua'

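3.3 Creating the corresponding SolrCloud collection

Run the following on a machine where Solr is installed.
The path and the name used here can be whatever you like.

# Generate the instance directory (configuration files):
solrctl instancedir --generate $HOME/hbase-indexer/bqjr

This creates the hbase-indexer/bqjr folder under the home directory, containing a conf folder. Edit the schema.xml file in it
and add a new field:

<field name="HBase_Indexer_Test_cf1_name" type="string" indexed="true" stored="true"/>

The name attribute deserves explanation: it corresponds to the outputField property in the morphlines.conf file edited later, so it can be regarded as the HBase value to be indexed. We therefore recommend building it from the table name, column family, and qualifier. The mapping is:

HBase	Solr
name	HBase_Indexer_Test_cf1_name

Then edit solrconfig.xml to enable hard commits (this costs some performance).

# Create the instance directory and upload the configuration files to ZooKeeper:
solrctl instancedir --create bqjr $HOME/hbase-indexer/bqjr
# Once uploaded, other nodes can download the configuration from ZooKeeper. Next, create the collection:
solrctl collection --create bqjr

To spread the data across nodes for storage and retrieval, create multiple shards with the following command:

solrctl collection --create bqjr -s 7 -r 3 -m 21

Here -s sets the number of shards to 7, -r sets the number of replicas to 3, and -m sets the maximum number of shards per node (21 = 7 * 3).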
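As a rough check on these numbers: SolrCloud only creates a collection when the cluster can host every shard replica, that is, when shards * replicas <= max shards per node * number of live Solr nodes. The helper below is a minimal sketch of that arithmetic. The function name and the node counts are illustrative assumptions, not part of solrctl.

```python
import math


def min_max_shards_per_node(num_shards: int, replicas: int, live_nodes: int) -> int:
    """Smallest -m value that lets SolrCloud place every shard replica.

    Assumes the placement rule
    num_shards * replicas <= max_shards_per_node * live_nodes.
    """
    if min(num_shards, replicas, live_nodes) < 1:
        raise ValueError("all arguments must be positive")
    return math.ceil(num_shards * replicas / live_nodes)


if __name__ == "__main__":
    # -s 7 -r 3 from the command above, on hypothetical cluster sizes.
    for nodes in (1, 3, 7):
        print(nodes, "node(s) ->", min_max_shards_per_node(7, 3, nodes))
```

Seen this way, -m 21 is the most permissive choice: it would allow even a single node to hold all 21 cores, so a smaller value is usually enough on a multi-node cluster.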

3.4 Creating the Lily HBase Indexer configuration

In the $HOME/hbase-indexer/bqjr directory created above, create a file named morphline-hbase-mapper.xml with the following content:

<?xml version="1.0"?>
<indexer table="HBase_Indexer_Test" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper" read-row="never">
  <param name="morphlineFile" value="morphlines.conf"/>
  <param name="morphlineId" value="bqjrMap"/>
</indexer>

Where:
table="HBase_Indexer_Test" on the indexer element names the HBase table HBase_Indexer_Test.
The morphlineId value (bqjrMap) corresponds to the id of a morphline in morphlines.conf.
For read-row="never", see 6.7 (Data synchronized by HBaseIndexer is inconsistent with Solr).

3.5 Creating the Morphline configuration file

In Cloudera Manager, open the configuration page of the Key-Value Store Indexer. It contains a Morphlines file, which we edit.
Each collection has its own morphline-hbase-mapper.xml.

SOLR_LOCATOR : {
  # Name of solr collection
  collection : bqjr
  # ZooKeeper ensemble
  zkHost : "$ZK_HOST"
}
# Note: SOLR_LOCATOR can name only one collection. Configuring several collections is covered in 3.9.
morphlines : [
  {
    id : bqjrMap
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "cf1:name"
              outputField : "HBase_Indexer_Test_cf1_name"
              type : string
              source : value
            }
          ]
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

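Where:

id: the identifier of this morphline within the morphlines file.

importCommands: the packages from which commands are imported.

extractHBaseCells: reads HBase column data and writes it into a SolrInputDocument object. The command must contain zero or more mappings objects.

mappings: specifies the field mappings for HBase column qualifiers.

inputColumn: the HBase column to write to Solr. The value consists of the column family and the qualifier, separated by ':'. The qualifier may use the wildcard '*': for example, data:* reads every column in the data family, and data:my* reads the columns in the data family whose qualifier starts with my.

outputField: the name of the field that the morphline record is written to. It must match a field name in Solr's schema.xml, otherwise the write will not be correct.

type: the data type of the HBase value. HBase stores everything as byte[], while Solr indexes content as text, so the byte[] must be converted to the actual data type; the type parameter does this. Supported types are byte, int, long, string, boolean, float, double, short, and bigdecimal. You can also supply a custom type by implementing the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.

source: which part of the HBase KeyValue is used as index input, either 'value' or 'qualifier'. With value, the cell value is indexed; with qualifier, the column qualifier is indexed.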
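To make the inputColumn and type semantics concrete, here is a small Python sketch. It is not the Indexer's implementation: both function names are hypothetical, the wildcard handling only models the prefix form described above, and the decoding assumes the values were written with HBase's Bytes utility, which encodes int and long as big-endian.

```python
import struct


def matches_input_column(input_column: str, family: str, qualifier: str) -> bool:
    """Return True if family:qualifier is selected by an inputColumn pattern."""
    pattern_family, _, pattern_qualifier = input_column.partition(":")
    if pattern_family != family:
        return False
    if pattern_qualifier.endswith("*"):
        # Prefix wildcard: "data:my*" selects qualifiers starting with "my".
        return qualifier.startswith(pattern_qualifier[:-1])
    return pattern_qualifier == qualifier


def decode_cell(raw: bytes, type_name: str):
    """Decode a cell value for a few of the supported type names."""
    if type_name == "string":
        return raw.decode("utf-8")
    if type_name == "int":
        return struct.unpack(">i", raw)[0]  # 4-byte big-endian
    if type_name == "long":
        return struct.unpack(">q", raw)[0]  # 8-byte big-endian
    raise ValueError(f"type not covered by this sketch: {type_name}")


if __name__ == "__main__":
    print(matches_input_column("data:my*", "data", "myfield"))
    print(decode_cell(b"xiaoming", "string"))
    print(decode_cell(b"\x00\x00\x00\x0c", "int"))
```

Values written with put from the HBase shell are stored as string bytes, which is why the mappings in this article all use type : string.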

3.6 Registering the Lily HBase Indexer configuration with the Lily HBase Indexer Service

Once the Lily HBase Indexer configuration XML file is ready, register it with the Lily HBase Indexer Service. This is done by uploading the configuration XML file to ZooKeeper for a given SolrCloud collection:

hbase-indexer add-indexer \
--name bqjrIndexer \
--indexer-conf $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--connection-param solr.zk=bqbps1.bqjr.cn:2181,bqbpm1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr \
--connection-param solr.collection=bqjr \
--zookeeper bqbps1.bqjr.cn:2181,bqbpm1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181

Run hbase-indexer list-indexers to confirm that the indexer was added successfully.

3.7 Synchronizing data

Insert two more rows in the HBase shell:

put 'HBase_Indexer_Test','003','cf1:name','xiaofang'
put 'HBase_Indexer_Test','004','cf1:name','xiaogang'

In the Solr query page, enter HBase_Indexer_Test_cf1_name:xiaogang in the q field to see the corresponding HBase rowkey.

Entering *:* returns all documents.
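The same two queries can be issued against Solr's select handler over HTTP. The sketch below only builds the request URLs. The base URL is a hypothetical placeholder, the function name is made up for illustration, and the collection name bqjr is the one created in 3.3.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical Solr base URL; replace it with a real Solr host in the cluster.
SOLR_BASE_URL = "http://localhost:8983/solr"


def solr_select_url(collection, query, rows=10, base_url=SOLR_BASE_URL):
    """Build a URL for the collection's /select handler with a URL-encoded q."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/{collection}/select?{params}"


if __name__ == "__main__":
    for q in ("HBase_Indexer_Test_cf1_name:xiaogang", "*:*"):
        url = solr_select_url("bqjr", q)
        # Decode the query string again to show what Solr receives as q.
        print(url, "->", parse_qs(urlsplit(url).query)["q"][0])
```

Fetching such a URL returns a JSON response whose documents carry the HBase rowkey in the unique key field, which is what the application then uses to read the full row from HBase.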

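3.8 Batch-synchronizing the index

A closer look at 3.7 reveals a problem: only the rows inserted after the indexer was registered are indexed. What about the data that already existed in HBase? It can be indexed in batch with the bundled MapReduce indexer tool.

A morphlines.conf file must be present in the directory the command is run from. To locate it, run:
find / -name morphlines.conf 2>/dev/null

We normally pick the one under the newest process directory.
Either run the command from the directory containing
/opt/cm-5.7.0/run/cloudera-scm-agent/process/1386-ks_indexer-HBASE_INDEXER/morphlines.conf
or add
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1501-ks_indexer-HBASE_INDEXER/morphlines.conf
Then run the following command: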

hadoop --config /etc/hadoop/conf \
jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-1.5-cdh5.7.0-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
--hbase-indexer-file $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--zk-host bqbpm1.bqjr.cn:2181,bqbps1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr \
--collection bqjr \
--go-live

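The job failed, reporting that solrconfig.xml could not be found. This puzzled me for a long time; adding --reducers 0 finally made it work (see 6.4 for the root cause).

The modified command: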

hadoop --config /etc/hadoop/conf \
jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
--hbase-indexer-file $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1501-ks_indexer-HBASE_INDEXER/morphlines.conf \
--zk-host bqbpm2.bqjr.cn:2181/solr \
--collection bqjr \
--reducers 0 \
--go-live
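Picking the newest process directory by eye gets error-prone once many of them accumulate. The following is a minimal sketch, assuming that the numeric prefix of each Cloudera Manager process directory (1386, 1501, 1629 in this article) grows with each newer process. That ordering is an assumption, and the function name is hypothetical.

```python
import re

# Matches paths such as .../process/1501-ks_indexer-HBASE_INDEXER/morphlines.conf
_PROCESS_CONF = re.compile(r"/process/(\d+)-ks_indexer-HBASE_INDEXER/morphlines\.conf$")


def newest_process_conf(paths):
    """Return the morphlines.conf path with the highest numeric process prefix."""
    best_number, best_path = -1, None
    for path in paths:
        match = _PROCESS_CONF.search(path)
        if match and int(match.group(1)) > best_number:
            best_number, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    base = "/opt/cm-5.7.0/run/cloudera-scm-agent/process"
    candidates = [
        f"{base}/1386-ks_indexer-HBASE_INDEXER/morphlines.conf",
        f"{base}/1501-ks_indexer-HBASE_INDEXER/morphlines.conf",
    ]
    print(newest_process_conf(candidates))
```

The prefixes are compared as integers rather than as text, so a directory numbered 999 is not mistaken for being newer than 1386.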

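3.9 Configuring multiple indexers

Each HBase table gets its own Solr collection, and each collection has a Lily HBase Indexer configuration file (morphline-hbase-mapper.xml) and a morphline definition in morphlines.conf. The morphlines.conf file is managed in the CDH Key-Value Store Indexer console, and its morphlines are distinguished by id.
CDH, however, does not let us configure more than one morphlines.conf file, so how is each indexer associated with its collection?
Recall that the collection is specified when the indexer is added, e.g. --connection-param solr.collection=bqjr.
The morphlines.conf can therefore be written like this, with no collection in SOLR_LOCATOR: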

SOLR_LOCATOR : {
  # ZooKeeper ensemble
  zkHost : "$ZK_HOST"
}
morphlines : [
  {
    id : XDGL_ACCT_FEE_Map
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "cf1:ETL_IN_DT"
              outputField : "XDGL_ACCT_FEE_cf1_ETL_IN_DT"
              type : string
              source : value
            }
          ]
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  },
  {
    id : XDGL_ACCT_PAYMENT_LOG_Map
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "cf1:ETL_IN_DT"
              outputField : "XDGL_ACCT_PAYMENT_LOG_cf1_ETL_IN_DT"
              type : string
              source : value
            }
          ]
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
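With many tables, writing these near-identical morphline entries by hand invites copy-paste mistakes. The sketch below generates the same structure from a table-to-columns mapping. It assumes the naming convention used in this article (id = table + "_Map", outputField = table_family_qualifier) and string-typed values; the function and template names are hypothetical.

```python
MAPPING_TEMPLATE = """\
            {{
              inputColumn : "{family}:{qualifier}"
              outputField : "{table}_{family}_{qualifier}"
              type : string
              source : value
            }}"""

MORPHLINE_TEMPLATE = """\
  {{
    id : {table}_Map
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {{
        extractHBaseCells {{
          mappings : [
{mappings}
          ]
        }}
      }}
      {{ logDebug {{ format : "output record: {{}}", args : ["@{{}}"] }} }}
    ]
  }}"""


def render_morphlines_conf(tables):
    """Render a morphlines.conf with one morphline per table.

    tables maps a table name to a list of (column family, qualifier) pairs.
    """
    morphlines = []
    for table, columns in tables.items():
        mappings = "\n".join(
            MAPPING_TEMPLATE.format(table=table, family=family, qualifier=qualifier)
            for family, qualifier in columns
        )
        morphlines.append(MORPHLINE_TEMPLATE.format(table=table, mappings=mappings))
    locator = 'SOLR_LOCATOR : {\n  zkHost : "$ZK_HOST"\n}\n'
    return locator + "morphlines : [\n" + ",\n".join(morphlines) + "\n]\n"


if __name__ == "__main__":
    print(render_morphlines_conf({
        "XDGL_ACCT_FEE": [("cf1", "ETL_IN_DT")],
        "XDGL_ACCT_PAYMENT_LOG": [("cf1", "ETL_IN_DT")],
    }))
```

The generated text still has to be pasted into the Morphlines file in the Key-Value Store Indexer configuration page, and each indexer's morphlineId must match one of the generated ids.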

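4. Creating, updating, and deleting data

4.1 Insert

put 'HBase_Indexer_Test','005','cf1:name','bob'

A new document for bob appears in the Solr index.

4.2 Update

put 'HBase_Indexer_Test','005','cf1:name','Ash'

We change bob to Ash; a few seconds later, Solr is updated as well.

4.3 Delete

deleteall 'HBase_Indexer_Test','005'

We delete row 005, and the corresponding document is removed from Solr too.

4.4 Summary

Indexes synchronized to Solr through Lily HBase Indexer follow inserts, updates, and deletes automatically, with no manual intervention.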

5. Additional commands

# solrctl
solrctl instancedir --list
solrctl collection --list
# Update a collection's configuration
solrctl instancedir --update User $HOME/hbase-indexer/User
solrctl collection --reload User
# Delete the instancedir
solrctl instancedir --delete User
# Delete the collection
solrctl collection --delete User
# Delete all documents in the collection
solrctl collection --deletedocs User
# Delete the User configuration directory
rm -rf $HOME/hbase-indexer/User
# hbase-indexer
# After modifying morphline-hbase-mapper.xml, update the indexer
hbase-indexer update-indexer -n userIndexer
# Delete the indexer
hbase-indexer delete-indexer -n userIndexer
# List indexers
hbase-indexer list-indexers

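6. FAQ

6.1 Creating the indexer fails because it already exists

After running hbase-indexer add-indexer, it turns out that an indexer with that name already exists.

Delete the old indexer with hbase-indexer delete-indexer --name $IndexerName, then add it again.

6.2 Creating the indexer fails

Use hbase-indexer list-indexers to check whether the indexer was created.

If the output indicates that the indexer was not created, the creation failed. In my case the cause was that I had specified only one ZooKeeper node.
Incorrect example: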

hbase-indexer add-indexer \
--name bqjrIndexer \
--indexer-conf $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--connection-param solr.zk=bqbpm2.bqjr.cn:2181/solr \
--connection-param solr.collection=bqjr \
--zookeeper bqbpm2.bqjr.cn:2181

Correct example:

hbase-indexer add-indexer \
--name bqjrIndexer \
--indexer-conf $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--connection-param solr.zk=bqbps1.bqjr.cn:2181,bqbpm1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr \
--connection-param solr.collection=bqjr \
--zookeeper bqbps1.bqjr.cn:2181,bqbpm1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181

Run hbase-indexer list-indexers again; this time the indexer is created successfully.

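6.3 Batch indexing with the bundled indexer tool fails: morphlines.conf not found

The command must specify both the morphlines.conf path and the morphline-hbase-mapper.xml path. To locate morphlines.conf, run:
find / -name morphlines.conf 2>/dev/null

We normally pick the one under the newest process directory, and either copy it or pass it as an option.
Either run the command from the directory containing
/opt/cm-5.7.0/run/cloudera-scm-agent/process/1386-ks_indexer-HBASE_INDEXER/morphlines.conf
or add
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1501-ks_indexer-HBASE_INDEXER/morphlines.conf
Then run the following command: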

hadoop --config /etc/hadoop/conf \
jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-1.5-cdh5.7.0-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
--hbase-indexer-file $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1629-ks_indexer-HBASE_INDEXER/morphlines.conf \
--zk-host bqbpm1.bqjr.cn:2181,bqbps1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr \
--collection bqjr \
--go-live

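6.4 Batch indexing with the bundled indexer tool fails: solrconfig.xml not found

The job reported that solrconfig.xml could not be found. This puzzled me for a long time; adding --reducers 0 finally made it work: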

hadoop --config /etc/hadoop/conf \
jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
--hbase-indexer-file $HOME/hbase-indexer/bqjr/morphline-hbase-mapper.xml \
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1501-ks_indexer-HBASE_INDEXER/morphlines.conf \
--zk-host bqbpm2.bqjr.cn:2181/solr \
--collection bqjr \
--reducers 0 \
--go-live

But why did this happen? We had made a mistake: when running add-indexer, two of the nodes in the ZooKeeper list were missing their port:

hbase-indexer add-indexer \
--name XDGL_WITHHOLD_KFT_INFO \
--indexer-conf $HOME/hbase-indexer/XDGL_WITHHOLD_KFT_INFO/morphline-hbase-mapper.xml \
--connection-param solr.zk=bqbpm2.bqjr.cn:2181/solr \
--connection-param solr.collection=XDGL_WITHHOLD_KFT_INFO \
--zookeeper bqbps1.bqjr.cn,bqbpm1.bqjr.cn,bqbpm2.bqjr.cn:2181

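So it is no surprise that solrconfig.xml could not be found via the other ZooKeeper nodes. After correcting the ZooKeeper list, the job ran fine again: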

hadoop --config /etc/hadoop/conf \
jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-1.5-cdh5.7.0-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
--hbase-indexer-file $HOME/hbase-indexer/XDGL_ACCT_FEE/morphline-hbase-mapper.xml \
--morphline-file /opt/cm-5.7.0/run/cloudera-scm-agent/process/1629-ks_indexer-HBASE_INDEXER/morphlines.conf \
--zk-host bqbpm1.bqjr.cn:2181,bqbps1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr \
--collection XDGL_ACCT_FEE \
--go-live
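Since the failure above was traced to hosts written without a port, a small pre-flight check on the ZooKeeper connect string can flag the problem before add-indexer is run. This is only a sketch with a hypothetical function name; it checks the string's shape and does not contact ZooKeeper.

```python
def hosts_missing_port(connect_string: str) -> list:
    """Return the hosts in a ZooKeeper connect string that lack an explicit port."""
    # Drop an optional chroot suffix such as /solr before splitting the hosts.
    hosts_part = connect_string.split("/", 1)[0]
    missing = []
    for host in hosts_part.split(","):
        host = host.strip()
        _, separator, port = host.rpartition(":")
        if not separator or not port.isdigit():
            missing.append(host)
    return missing


if __name__ == "__main__":
    bad = "bqbps1.bqjr.cn,bqbpm1.bqjr.cn,bqbpm2.bqjr.cn:2181"
    good = "bqbps1.bqjr.cn:2181,bqbpm1.bqjr.cn:2181,bqbpm2.bqjr.cn:2181/solr"
    print(hosts_missing_port(bad))
    print(hosts_missing_port(good))
```

The same check applies to both the --zookeeper value and the solr.zk connection parameter, which additionally carries the /solr chroot.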

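6.5 Batch indexing with the bundled indexer tool fails: Java heap space

If the launch parameters include
-D 'mapred.child.java.opts=-Xmx500m', remove it or raise the value, e.g. -D 'mapred.child.java.opts=-Xmx3806m'. The MapReduce runtime parameters are normally already configured for the cluster, so there is no need to set them again here.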

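6.6 HBaseIndexer exits shortly after starting

There are many possible causes. One is the mismatched mapper/morphline configuration mentioned earlier; another is running out of memory.

The error log shows which one applies.
If it is an out-of-memory problem, increase the memory allocated to the indexer.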

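6.7 Data synchronized by HBaseIndexer is inconsistent with Solr

The first cause: our own Spark synchronization job was running at the same time as HBaseIndexer while the data was being updated continuously. The batch job cleared the collection before inserting, which deleted the documents HBaseIndexer had already written.

The second cause is described in the article "Resolving Solr/HBase data inconsistency caused by HBase Indexer": HBase writes the WAL and the actual data asynchronously, so the indexer can fail to read the row back from HBase. The fix is to add read-row="never".

For details, see: http://stackoverflow.com/questions/37267899/hbase-indexer-solr-numfound-different-from-hbase-table-rows-size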

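6.8 After setting read-row="never" to fix 6.7, some fields are lost

With read-row="never", the indexer no longer reads the row back from HBase and works only from the WAL. When only some fields of a row are modified, HBaseIndexer submits only those fields. For example,
suppose HBase has two fields, name and age: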

put 'HBase_Indexer_Test','001','cf1:name','xiaoming'
put 'HBase_Indexer_Test','002','cf1:name','xiaohua'

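At this point, each Solr document contains the name field.

Then run:

put 'HBase_Indexer_Test','001','cf1:age','12'

Afterwards, the Solr document for row 001 shows only the age field.

This shows that in this mode the indexer takes its data only from the WAL and overwrites the existing Solr document with it.

There are two solutions. One is to modify the HBaseIndexer code to use atomic updates to Solr. The other is to change the Solr configuration so that the data for one ID can hold multiple versions, as in HBase.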
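The difference between the overwrite behaviour and the first solution can be illustrated with a toy in-memory index. This is a simulation, not HBaseIndexer or Solr code: the function names are hypothetical, the unique key is assumed to be id, and the age field (HBase_Indexer_Test_cf1_age) is assumed to exist in schema.xml. The last helper builds the JSON body that Solr's update handler accepts for an atomic update with the "set" modifier.

```python
import json


def full_replace(index: dict, doc: dict) -> None:
    """Plain add: the new document replaces any existing one with the same id."""
    index[doc["id"]] = dict(doc)


def atomic_set(index: dict, doc_id: str, fields: dict) -> None:
    """Atomic-update 'set' semantics: only the named fields change."""
    index.setdefault(doc_id, {"id": doc_id}).update(fields)


def atomic_update_payload(doc_id: str, fields: dict) -> str:
    """JSON body for a Solr atomic update using the {"set": value} modifier."""
    doc = {"id": doc_id}
    doc.update({name: {"set": value} for name, value in fields.items()})
    return json.dumps([doc])


if __name__ == "__main__":
    overwritten, merged = {}, {}
    name_doc = {"id": "001", "HBase_Indexer_Test_cf1_name": "xiaoming"}
    age_fields = {"HBase_Indexer_Test_cf1_age": "12"}

    full_replace(overwritten, name_doc)
    full_replace(overwritten, {"id": "001", **age_fields})  # name is lost

    full_replace(merged, name_doc)
    atomic_set(merged, "001", age_fields)  # name is kept

    print(overwritten["001"])
    print(merged["001"])
    print(atomic_update_payload("001", age_fields))
```

On the Solr side, atomic updates generally need the update log enabled and the affected fields stored, so the schema and solrconfig.xml would have to be checked before relying on this approach.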