Configuring Filebeat Kafka output (multiple topics): notes on the pitfalls

Date: 2020-01-29


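Background

Business background

Collect data from logs and centralize it in an offline data warehouse for business teams to use.

Technical background

This is an offline log-centralization scenario, with data loaded into the warehouse once per day. The daily log volume (i.e. the data Filebeat needs to collect) is ?, the volume received by Kafka is ?, and the throughput of the program reading from Kafka is ?.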

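Approach

Requirement analysis

The requirement is clear: build a log-centralization (collection) pipeline. It only has to support analysis at a daily frequency, so the pipeline can be either real-time or offline.

Preliminary research

The common industry solutions for log centralization are real-time pipelines such as ELK or Flume + Kafka. Given how lightweight Filebeat is and the specific business scenario, I chose to test filebeat -> kafka.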

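Implementation

Installing, configuring and starting Filebeat

Before installing, I read Getting Started, How Filebeat works and the Kafka output page in the official reference, to get a rough idea of how Filebeat works, how to use it, and whether Filebeat output to Kafka is feasible for my team's scenario.
Following the Getting Started steps, I downloaded the package and configured filebeat.yml. Because different logs have to go to different Kafka topics, I also searched online for "output to multiple kafka topic" configurations. The resulting configuration file was as follows (it is wrong):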

# First attempt: the topic name is carried in a custom field under "fields"
filebeat.prospectors:
- input_type: log
  paths:
    - /log/1.log*
  include_lines: ['\[LOG\].+?_MATCH']
  fields:
    topic: topic_1
- input_type: log
  paths:
    - /log/2.log.*
  include_lines: ['\[LOG\].+?_MATCH']
  fields:
    topic: topic_2
output.kafka:
  hosts: ["broker1:9092", "broker2:9092", "broker3:9092"]
  # format string that reads the custom field of each event
  topic: '%{[fields.topic]}'
  required_acks: 1

After starting Filebeat with ./filebeat, I ran into some problems.

Problems encountered

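Q1: Running the command that starts Filebeat prints no log output.
A: Add the -e option (it makes Filebeat log to stderr and disables syslog/file output), or configure the logging options in filebeat.yml.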
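As a rough illustration of the second option, a logging section in filebeat.yml could look like the sketch below. The level, directory and retention count are assumptions chosen for the example, not values from the original setup.

```yaml
# Minimal logging sketch for filebeat.yml (values are illustrative assumptions)
logging.level: info          # one of: error, warning, info, debug
logging.to_files: true       # write Filebeat's own log to files instead of relying on -e
logging.files:
  path: /var/log/filebeat    # assumed directory; must be writable by the Filebeat user
  name: filebeat             # base name of the log file
  keepfiles: 7               # number of rotated files to keep
```

With this in place Filebeat's own messages go to the configured directory, while -e remains handy for interactive debugging.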
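Q2: Judging from Filebeat's console output, the configured log files were sometimes registered with a harvester, collected into the spooler and sent to Kafka as expected, and sometimes nothing happened at all.
A: At first I suspected the registry file, in which Filebeat records the state of the files registered with a prospector, so I dug through the reference and found two options (clean_xx). They had no effect [perhaps I configured them incorrectly]. I also tried renaming the file with mv, but Filebeat still could not reliably read both configured input paths (only one of them was read).
I then reread the prospectors part of the official reference. It explains fields as "adds a fields field to the final output document". After thinking hard about this option, I concluded: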

It has nothing to do with deciding where the output goes.
The %{[fields.topic]} lookup may not actually resolve to a value [to be verified].

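I searched online again for "output to multiple kafka topic" and found the document_type option. It appears in the official reference, which describes it as "The event type to use for published lines read by harvesters." (in Filebeat, an event is the new content that a harvester collects from a log file and sends to the spooler; see https://www.elastic.co/guide/... , second paragraph, third line), so its value should be usable in the output. The prospector section of the official reference also has an example of document_type. Based on the reference plus this blog post, the modified configuration file is as follows: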

# Second attempt: the topic name is carried in the event type
filebeat.prospectors:
- input_type: log
  paths:
    - /log/1.log*
  include_lines: ['\[LOG\].+?_MATCH']
  document_type: topic1
- input_type: log
  paths:
    - /log/2.log.*
  include_lines: ['\[LOG\].+?_MATCH']
  document_type: topic2
output.kafka:
  hosts: ["broker1:9092", "broker2:9092", "broker3:9092"]
  # document_type is published as the event's "type" field
  topic: '%{[type]}'
  required_acks: 1

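After deleting the registry file in the data directory under the Filebeat installation directory and running ./filebeat -e -c filebeat.yml again, collection worked normally.

For why the registry file has to be deleted, see How Filebeat works. In my case the reason is that I kept testing with the same file and could not be bothered to rename it. Two more prospector options (clean_xx) relate to the state recorded in the registry file, but since I have not yet worked out how to configure them correctly, I simply deleted the registry file and let Filebeat regenerate it.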
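For reference, here is a sketch of how the registry clean-up options are usually written. It assumes the two clean_xx options meant above are clean_inactive and clean_removed; the durations are illustrative only. The reference requires clean_inactive to be greater than ignore_older + scan_frequency.

```yaml
# Sketch only: registry clean-up on one prospector (durations are assumptions)
filebeat.prospectors:
- input_type: log
  paths:
    - /log/1.log*
  scan_frequency: 10s     # how often the paths are checked for new files
  ignore_older: 48h       # skip files not modified within this period
  clean_inactive: 72h     # drop registry state after this much inactivity;
                          # must be greater than ignore_older + scan_frequency
  clean_removed: true     # drop registry state for files no longer on disk
```

Neither option resets the stored offset of a file that is still on disk and unchanged. That may explain why they seemed to have no effect in this test, where deleting the registry remains the simpler way to force a re-read.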

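Open issues:

If the %{[fields.topic]} lookup does resolve to a value, then the first configuration should work as well.
It seems a Kafka consumer must already be running for the collected topics to be output correctly. If the Kafka consumer is started only after Filebeat's log shows that collection has finished, the consumer consumes nothing. I am not sure whether this is what caused problem 2.
In production, or in any other setting where the registry cannot simply be deleted, how should data be replayed when a downstream consumer of Filebeat fails?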
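On the first open issue: the Kafka output page of the Filebeat reference uses a format string on a custom field under fields as its own example of a dynamic topic, so the fields-based approach is expected to resolve. The sketch below shows just the relevant parts, with the field name log_topic chosen for illustration. It has not been verified against this setup, and the intermittent behaviour seen earlier may have come from the registry rather than from fields.

```yaml
# Sketch: per-prospector topic via a custom field (log_topic is an illustrative name)
filebeat.prospectors:
- input_type: log
  paths:
    - /log/1.log*
  fields:
    log_topic: topic_1      # stored in the event as fields.log_topic
output.kafka:
  hosts: ["broker1:9092", "broker2:9092", "broker3:9092"]
  topic: '%{[fields.log_topic]}'   # resolved per event from the custom field
  required_acks: 1
```

This variant also matters on newer versions: document_type was deprecated in the 5.x series and is, as far as I know, no longer available from Filebeat 6.0 on, which leaves a custom field as the way to carry the topic name.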

References

Configuring Filebeat Kafka output: https://www.cnblogs.com/linke...
Filebeat official reference: https://www.elastic.co/guide/...