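Data synchronization between Elasticsearch and MongoDB in a distributed cluster environment

Date: 2019-11-25

Tags: elasticsearch, mongodb, distributed cluster environment, data synchronization. Category: log analysis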

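1. What is ElasticSearch

ElasticSearch is an open-source, distributed, RESTful search engine built on Lucene. It serves as an add-on component (a searchable repository) for applications that already have a database and a web front end. ElasticSearch supplies the search algorithms and the supporting infrastructure: users only need to load their application data into the ElasticSearch data store and can then interact with it through RESTful URLs. Its architecture differs markedly from that of earlier search engines because it is built to scale horizontally. Unlike Solr, it was designed from the start as a distributed platform, which fits well with the rise of cloud and big-data technology. ElasticSearch is built on top of the stable open-source search engine Lucene, and it works with schema-less JSON documents.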

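Key features of ElasticSearch

RESTful style

Every introduction to ElasticSearch mentions that it is a RESTful search engine. So what is RESTful? REST (Representational State Transfer) is a way of designing and developing networked applications that reduces development complexity and improves scalability. REST has a set of design concepts and rules, and any application built according to them is said to be RESTful. In a REST-style architecture, every request operates on an object at a specific address given by a URL. For example, if /schools/ represents a collection of schools, then /schools/1 represents the school whose id is 1, and so on. This design gives users a simple, convenient way to work: they can talk to ElasticSearch's RESTful API with tools such as curl and avoid the trouble of managing XML configuration files. The following briefly shows how to perform CRUD (create, read, update, delete) operations on ElasticSearch with curl.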

- Indexing a document

To index a JSON object, send a PUT request to the REST API with a URL made up of the index name, the type name, and the ID:

http://localhost:9200/<index>/<type>/[<id>]

For example:

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972
}'

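- Retrieving an indexed document by ID

Send a GET request to the document's address in an existing index, i.e. http://localhost:9200/<index>/<type>/<id>

For example: curl -XGET "http://localhost:9200/movies/movie/1" -d''

When no request body is needed, the trailing -d'' can be omitted.

- Deleting a document

Delete a single document from the index by its ID. The URL is the same as for indexing and retrieval.

For example: curl -XDELETE "http://localhost:9200/movies/movie/1" -d''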
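
The three curl calls above share one URL pattern and differ only in the HTTP method and body. A minimal Python sketch of the same requests follows; it only prepares the requests and does not send them. The helper names `doc_url` and `build_request` are made up for this illustration, and `BASE_URL` assumes a local node on the default port.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default port


def doc_url(index, doc_type, doc_id):
    """Build the /<index>/<type>/<id> address of a single document."""
    return f"{BASE_URL}/{index}/{doc_type}/{doc_id}"


def build_request(method, index, doc_type, doc_id, body=None):
    """Prepare (but do not send) the HTTP request for one CRUD operation."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(doc_url(index, doc_type, doc_id), data=data, method=method)


movie = {"title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972}
put_req = build_request("PUT", "movies", "movie", 1, movie)
get_req = build_request("GET", "movies", "movie", 1)
delete_req = build_request("DELETE", "movies", "movie", 1)

for req in (put_req, get_req, delete_req):
    print(req.get_method(), req.full_url)
```

Once a node is running, each prepared request could be sent with `urllib.request.urlopen(req)`.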

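ElasticSearch uses the concept of a Gateway, which makes full backups simpler.

Because ElasticSearch is designed for distributed environments, persisting the index information of all nodes is a real problem. Besides index data, cluster state, mappings, and transaction logs also need to be persisted. This information becomes critical when a node fails or the cluster restarts. ElasticSearch has a dedicated gateway module responsible for persisting this metadata. (Does Solr manage this part through ZooKeeper?)

ElasticSearch supports faceting (faceted search) and percolating

A facet is one of the dimensions of an item. A book, for example, has a subject, an author, a publication year, and so on. Faceted search narrows and filters search results step by step using these attributes. This is already implemented in Lucene, so Solr supports faceted search as well. Percolation, on the other hand, is one of the highlights of ElasticSearch's design. The percolator lets you do the reverse of the usual sequence of uploading documents, building an index, and running queries. With the Percolate API you register many queries on an index, then send a percolate request containing a document, and get back the registered queries that match that document. As a simple example, suppose we want to catch every tweet that contains the word "elasticsearch": we register a query on the index, percolate each incoming tweet against the registered queries, and learn which queries each tweet matches. A simple example follows: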

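First, create an index:

curl -XPUT localhost:9200/test

Next, register a percolator query named kuku against the test index.

--- This step did not succeed in my local test and I have not yet found the cause. One possibility is the version: the _percolator endpoint below is the 0.90-era API, and Elasticsearch 1.0 changed how percolator queries are registered. ---

curl -XPUT localhost:9200/_percolator/test/kuku -d '{
"query": {
"term": {
"field1": "value1"
}
}
}'

Now percolate a document to see which queries match it:

curl -XGET localhost:9200/test/type/_percolate -d '{
"doc": {
"field1": "value1"
}
}'

The response looks like this:

{"ok": true, "matches": ["kuku"]}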

--end--
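
The percolate flow above (store queries first, then match incoming documents against them) can be modelled in a few lines of plain Python. This is only a toy in-memory sketch of the idea, not how ElasticSearch implements it; the class `TinyPercolator` and the second query name `other` are invented for illustration, and only exact term matches are supported.

```python
class TinyPercolator:
    """Toy in-memory model of percolation: queries are stored, documents are matched."""

    def __init__(self):
        self._queries = {}  # query name -> (field, required term)

    def register(self, name, field, term):
        """Store a term query under a name (like registering "kuku" above)."""
        self._queries[name] = (field, term)

    def percolate(self, doc):
        """Return the names of all registered queries that match this document."""
        matches = []
        for name, (field, term) in self._queries.items():
            if doc.get(field) == term:
                matches.append(name)
        return sorted(matches)


percolator = TinyPercolator()
percolator.register("kuku", "field1", "value1")
percolator.register("other", "field1", "something-else")

print(percolator.percolate({"field1": "value1"}))    # ['kuku']
print(percolator.percolate({"field1": "no-match"}))  # []
```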

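Distributed features of ElasticSearch

Unlike Solr, ElasticSearch was designed for distributed environments from the start, so it has many features that make distributed deployments easy. An index can be split into multiple shards, each shard can have multiple replicas, and each node can hold one or more shards; load balancing and routing to shards and replicas happen automatically. ElasticSearch is also self-contained: it does not need a servlet container such as Tomcat. An ElasticSearch cluster is self-discovering and self-managing (via the built-in Zen discovery module) and is very simple to configure: nodes only need the same cluster.name in config/elasticsearch.yml.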
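
To make the shard routing idea concrete, here is a small sketch of how documents can be spread over primary shards by hashing a routing value (by default the document id) and taking it modulo the number of primary shards. `zlib.crc32` is only a stand-in hash chosen for this illustration; ElasticSearch uses its own hash function internally, so the shard numbers printed here will not match a real cluster.

```python
import zlib


def pick_shard(doc_id, number_of_primary_shards):
    """Map a document id to a primary shard number.

    zlib.crc32 is a stand-in hash for this sketch, not the hash ElasticSearch uses.
    """
    return zlib.crc32(str(doc_id).encode("utf-8")) % number_of_primary_shards


def distribute(doc_ids, number_of_primary_shards):
    """Group document ids by the shard they would be routed to."""
    shards = {n: [] for n in range(number_of_primary_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, number_of_primary_shards)].append(doc_id)
    return shards


layout = distribute(range(1, 21), 5)
for shard, ids in layout.items():
    print(shard, ids)
```

Because the shard is derived from the id alone, any node can work out where a document lives without a lookup table, which is also why the number of primary shards cannot be changed freely after an index is created.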

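Support for multiple data sources

ElasticSearch has a pluggable module called river that imports data from an external data source into Elasticsearch and indexes it. A river is a singleton within the cluster: it is automatically allocated to one node, and when that node goes down the river is automatically reallocated to another node. Currently supported data sources include Wikipedia, MongoDB, CouchDB, RabbitMQ, RSS, Sofa, JDBC, FileSystem, Dropbox, and others. Rivers follow a defined specification, so you can develop a plugin suited to your own application data by following it.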

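2. How does Elasticsearch connect to a data source?

ElasticSearch connects to each data source through a river. For sources such as MongoDB, the connection is usually provided by a third-party plugin: open-source contributors have written river plugins that connect to various data management systems and message queues and index their data. This article looks at combining MongoDB with ES, using the river developed by richardwilly98.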

https://github.com/richardwilly98/elasticsearch-river-mongodb

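3. Setting up the MongoDB cluster environment

See: http://blog.csdn.net/huwei2003/article/details/40453159

4. How to create a river from Elasticsearch to a truly distributed MongoDB cluster and index the data

1. First, download and unzip Elasticsearch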

unzip elasticsearch-0.90.5.zip

2. Download and unzip elasticsearch-servicewrapper-master.zip

unzip elasticsearch-servicewrapper-master.zip

cd elasticsearch-servicewrapper-master

mv service /root/gy/elasticsearch-0.90.5/bin

3. Start Elasticsearch

sh elasticsearch start

4. Download the river plugin

./plugin --install com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/1.7.1

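Note that the river version must match the MongoDB and ElasticSearch versions. If they do not match, the river will not be able to index all of the data in MongoDB into ES.

The compatibility rules are listed at the link below.

This test used ES 1.1.2 + MongoDB 2.4.6.

https://github.com/richardwilly98/elasticsearch-river-mongodb

5. Create the river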

curl -XPUT "http://localhost:9200/_river/mongodb/_meta" -d'
{
    "type":"mongodb",
    "mongodb":{
        "servers":[{"host":"192.168.225.131","port":37017}],
        "db":"dbname",
        "collection":"collectionname",
        "gridfs":false,
        "options":{
            "include_fields":["_id","VERSION","ACCESSION","file"]
        }
    },
    "index":{
        "name":"indexname",
        "type":"meta"
    }
}'

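Note: in the index section, name is the index name and must be lowercase; the type value (meta here) is the collection name.
Because this test ran against a MongoDB sharded cluster, the river connects through the mongos router, which lets it index all of the data in the Mongo cluster.

gridfs and options are optional.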
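
The river definition above is plain JSON, so it can also be assembled in code before being sent with PUT to /_river/<river name>/_meta. The sketch below builds only the required parts (servers, db, collection, index name and type) and rejects index names that are not lowercase, following the note above. The function name `build_river_meta` is an assumption made for this example and is not part of the river plugin.

```python
import json


def build_river_meta(servers, db, collection, index_name, index_type):
    """Build the body for PUT /_river/<river name>/_meta as used in this article.

    servers is a list of (host, port) pairs, e.g. the mongos routers of a sharded cluster.
    """
    if index_name != index_name.lower():
        raise ValueError(f"index name must be lowercase: {index_name!r}")
    return {
        "type": "mongodb",
        "mongodb": {
            "servers": [{"host": host, "port": port} for host, port in servers],
            "db": db,
            "collection": collection,
        },
        "index": {"name": index_name, "type": index_type},
    }


meta = build_river_meta(
    [("192.168.225.131", 37017)], "dbname", "collectionname", "indexname", "meta"
)
print(json.dumps(meta, indent=4))
```

Generating the body with `json.dumps` avoids the quoting mistakes that are easy to make when typing JSON inside a shell single-quoted string.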

# Create a river with curl (this also creates the resume index)
curl -XPUT "localhost:9200/_river/tbJobResume/_meta" -d '
{
"type": "mongodb",
"mongodb": {
"host": "192.168.225.131",
"port": "37017",
"db": "MongoModelJobResume",
"collection": "tbJobResume"
},
"index": {
"name": "resume",
"type": "tbJobResume"} }'

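Notes: in _river/tbJobResume, tbJobResume is the river name; I used the collection name, and each river should have a different name. Do not drop the two single quotes around the body after -d.
type is mongodb because the data source is a MongoDB database.

mongodb: host, port, db (database name), and collection are self-explanatory.

index: name is the name of the index to create; lowercase is recommended (and is probably required).

index: type is the collection name, i.e. the collection this index corresponds to.

Verify:
curl "http://localhost:9200/_river/tbJobResume/_meta"
This creates the resume index, and any data already in MongoDB is synchronized into it.

Important: if the tbJobResume collection has a field holding geographic coordinates, it must be mapped as geo_point. Set the mapping before creating the river and its index, as follows: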

curl -XPUT 'http://localhost:9200/resume' -d '
{
"mappings": {
"tbJobResume": {
"properties": {
"Location": {
"type": "geo_point"
}
}
}
}
}'

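Once the mapping is set, create the river so that the index is populated.

--- Below is another index, created the same way ---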
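
As a rough illustration of the geo_point mapping, the sketch below builds the same mapping body in Python and converts a coordinate pair into the {"lat": ..., "lon": ...} object form that geo_point fields accept. It assumes, purely for the example, that the Location field is stored in MongoDB as a [longitude, latitude] pair; the article does not say how the field is actually stored, and both function names are hypothetical.

```python
def to_geo_point(coordinates):
    """Convert a [longitude, latitude] pair into the {"lat": ..., "lon": ...} object form."""
    lon, lat = coordinates
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {coordinates!r}")
    return {"lat": lat, "lon": lon}


def geo_mapping(doc_type, field):
    """Build the mapping body that declares one field as geo_point."""
    return {"mappings": {doc_type: {"properties": {field: {"type": "geo_point"}}}}}


print(geo_mapping("tbJobResume", "Location"))
print(to_geo_point([116.40, 39.90]))
```

The order of the two numbers is the usual source of mistakes: the array form is longitude first, while the "lat,lon" string form is latitude first.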
curl -XPUT "localhost:9200/_river/tbJobPosition/_meta" -d '
{
"type": "mongodb",
"mongodb": {
"host": "192.168.225.131",
"port": "37017",
"db": "MongoModelJob",
"collection": "tbJobPosition"
},
"index": {
"name": "position",
"type": "tbJobPosition"} }'

curl "http://localhost:9200/_river/tbJobPosition/_meta"
---------------
# Index data directly with curl PUT

curl -XPUT "http://localhost:9200/customer/tbCustomer/1" -d'
{
"_id": 1,
"Name": "Francis Ford Coppola 1",
"Sex":1
}'
This creates the customer index and puts one document into it; tbCustomer is the type.

curl -XPUT 'http://192.168.225.131:9200/dept/employee/32' -d '{ "empname": "emp32"}'
curl -XPUT 'http://192.168.225.131:9200/dept/employee/31' -d '{ "empname": "emp31"}'

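These commands likewise create the dept index and put one document each into it; employee is the type.

The standard template for creating a river and indexing is as follows: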

$ curl -XPUT "localhost:9200/_river/${es.river.name}/_meta" -d '
{
"type": "mongodb",
"mongodb": {
"servers":
[
{ "host": ${mongo.instance1.host}, "port": ${mongo.instance1.port} },
{ "host": ${mongo.instance2.host}, "port": ${mongo.instance2.port} }
],
"options": {
"secondary_read_preference" : true,
"drop_collection": ${mongo.drop.collection},
"exclude_fields": ${mongo.exclude.fields},
"include_fields": ${mongo.include.fields},
"include_collection": ${mongo.include.collection},
"import_all_collections": ${mongo.import.all.collections},
"initial_timestamp": {
"script_type": ${mongo.initial.timestamp.script.type},
"script": ${mongo.initial.timestamp.script}
},
"skip_initial_import" : ${mongo.skip.initial.import},
"store_statistics" : ${mongo.store.statistics}
},
"credentials":
[
{ "db": "local", "user": ${mongo.local.user}, "password": ${mongo.local.password} },
{ "db": "admin", "user": ${mongo.db.user}, "password": ${mongo.db.password} }
],
"db": ${mongo.db.name},
"collection": ${mongo.collection.name},
"gridfs": ${mongo.is.gridfs.collection},
"filter": ${mongo.filter}
},
"index": {
"name": ${es.index.name},
"throttle_size": ${es.throttle.size},
"bulk_size": ${es.bulk.size},
"type": ${es.type.name},
"bulk": {
"actions": ${es.bulk.actions},
"size": ${es.bulk.size},
"concurrent_requests": ${es.bulk.concurrent.requests},
"flush_interval": ${es.bulk.flush.interval}
}
}
}'

--template end--
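
The ${...} placeholders in the template have to be replaced with real values before the body is valid JSON. One way to do that is sketched below: a small regular-expression substitution that inserts each value JSON-encoded, so strings get quotes and numbers and booleans do not. The `render` helper and the shortened template fragment are illustrative assumptions; the values are taken from the tbJobResume example earlier in this article.

```python
import json
import re

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def render(template, values):
    """Replace each ${dotted.name} with the JSON encoding of its value."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {key}")
        return json.dumps(values[key])
    return PLACEHOLDER.sub(substitute, template)


# Shortened fragment of the template above, for illustration only
fragment = """
{
  "type": "mongodb",
  "mongodb": {
    "servers": [{ "host": ${mongo.instance1.host}, "port": ${mongo.instance1.port} }],
    "db": ${mongo.db.name},
    "collection": ${mongo.collection.name}
  },
  "index": { "name": ${es.index.name}, "type": ${es.type.name} }
}
"""

values = {
    "mongo.instance1.host": "192.168.225.131",
    "mongo.instance1.port": 37017,
    "mongo.db.name": "MongoModelJobResume",
    "mongo.collection.name": "tbJobResume",
    "es.index.name": "resume",
    "es.type.name": "tbJobResume",
}

# json.loads fails loudly if the rendered text is not valid JSON
body = json.loads(render(fragment, values))
print(body["mongodb"]["servers"])
```

A custom pattern is used here because the placeholder names contain dots, which the default pattern of Python's `string.Template` does not accept.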

--url--

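Git repository for this plugin: https://github.com/laigood/elasticsearch-river-mongodb

6. Test example

Connected to the Mongo cluster; the meta collection holds 22394792 documents.

Check the document count in ES

Finally, I set up ElasticSearch on master1, master2, and master3. The three ES nodes rebalanced successfully, and the total document count was still 22394792.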
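
The check described here (the ES document count equals the MongoDB count) is easy to automate. The sketch below parses the JSON body returned by a count request and compares it with the expected number. The sample response is a hypothetical body shaped like a reply from the _count endpoint, not output captured from the test cluster, and the function names are made up for this example.

```python
import json


def parse_count(response_text):
    """Extract the "count" field from the JSON body of a _count request."""
    return int(json.loads(response_text)["count"])


def counts_match(mongo_count, es_response_text):
    """Return True when ES reports exactly as many documents as MongoDB holds."""
    return parse_count(es_response_text) == mongo_count


# Hypothetical response body, shaped like the reply to GET /indexname/_count
sample = '{"count": 22394792, "_shards": {"total": 5, "successful": 5, "failed": 0}}'
print(counts_match(22394792, sample))  # True
```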
