Elasticsearch集群40亿级优化

目前架构:node

n台filebeat客户端来将每台应用上的日志传到kafka,3台kafka作集群用于日志队列,四台ES作集群,前两台存放近两天热数据日志,后两台存放两天前的历史日志,数据保存一个月,目前总数据量44亿,大小为6T。logstash与kibana与ES在一台机器上,kibana域名指向后端三个kibana作轮询。nginx


出现性能问题:后端

一、集群中只有第一台负载很高,其余节点负载一直都很低,偶尔同为hot数据节点的第二台负载也会稍微有点升高。架构

二、队列常常堵塞,kafka中uat,pet,prd三个环境的topic同在一个默认的logstash消费组。只要其中一个环境的列队积压,其余环境的队列就没法消费了。ide

三、Kibana登录后首页打开,须要至少半分钟,日志查询也很慢,至少几分钟才会出结果。
性能

四、有时候ES常因负载高而脱离集群,致使集群节点数据从新分配,集群状态颜色为RED,同时kibana页面打开时显示Red报错。kibana页面间断没法打开的状况约持续一两周。spa




目前ELK中发现有些索引查询有点慢,因而打开ES索引查询日志来记录慢查询,进而对慢查询日志进行分析,定位问题。慢日志内容以下:日志

[2017-08-28T11:21:02,377][WARN ][index.search.slowlog.query] [node-3] [logstash-nginx-2017.08.01][4] took[15s], took_millis[15029], types[], stats[], search
_type[QUERY_THEN_FETCH], total_shards[140], source[{"size":0,"query":{"bool":{"filter":[{"match_none":{"boost":1.0}},{"query_string":{"query":"NOT status:200  OR  NOT
status:304","fields":[],"use_dis_max":true,"tie_breaker":0.0,"default_operator":"or","auto_generate_phrase_queries":false,"max_determined_states":10000,"enable_position
_increment":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"analyze_wildcard":true,"escape":false,"split_on_whitespace":true,
"boost":1.0}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"aggregations":{"3":{"terms":{"field":"status","size":5,"min_doc_count":0,"shard_min_doc_
count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_term":"asc"}]},"aggregations":{"2":{"date_histogram":{"field":"@timestamp","format":"epoch_mill
is","interval":"20m","offset":0,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0,"extended_bounds":{"min":"1503886846372","max":"1503890446372"}}}}}}}],
[2017-08-28T11:21:02,377][WARN ][index.search.slowlog.query] [node-3] [logstash-nginx-2017.08.01][2] took[15.7s], took_millis[15787], types[], stats[], sear
ch_type[QUERY_THEN_FETCH], total_shards[140], source[{"size":0,"query":{"bool":{"filter":[{"match_none":{"boost":1.0}},{"query_string":{"query":"NOT status:200  OR  NOT
  status:304","fields":[],"use_dis_max":true,"tie_breaker":0.0,"default_operator":"or","auto_generate_phrase_queries":false,"max_determined_states":10000,"enable_positi
on_increment":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"analyze_wildcard":true,"escape":false,"split_on_whitespace":tru
e,"boost":1.0}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"aggregations":{"3":{"terms":{"field":"status","size":5,"min_doc_count":0,"shard_min_do
c_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_term":"asc"}]},"aggregations":{"2":{"date_histogram":{"field":"@timestamp","format":"epoch_mi
llis","interval":"20m","offset":0,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0,"extended_bounds":{"min":"1503886846372","max":"1503890446372"}}}}}}}],

下面进行分析:orm

待续
索引

相关文章
相关标签/搜索