Flume 1.8 TaildirSource

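Using Flume (1): a getting-started demo

Using Flume (2): collecting remote log data into a MySQL database

Using Flume (3): streaming live log4j logs through Flume into a MySQL database

Using Flume (4): real-time collection from multiple files with TaildirSource

This article addresses the two potential TaildirSource problems raised in "Using Flume (4): real-time collection from multiple files with TaildirSource".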

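1. Thinking about the problems
(1) log4j log files roll according to the configured rules: when *.log is full, the current file is renamed to *.log.1 and logging starts again in a fresh *.log. Flume then treats *.log.1 as a new file and reads it again from the beginning, which produces duplicate events.

(2) When a log file monitored by Flume is moved or deleted, Flume keeps monitoring it and does not release the resource right away. It does release it automatically after a certain time; according to the official documentation, the default is 120000 ms.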

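2. How to handle them
I deliberately do not call this section "solutions". Other articles describe these two behaviors as bugs; I do not think either of them is one. Flume is a top-level Apache project, so a real bug of this kind would have a corresponding pull request on its hosting site and would be fixed quickly, at the latest in Flume 1.8. I went through the list of bug fixes and new features in Flume 1.8 and found no entry that addresses (1) or (2).

In the official 1.8 release notes, only one item concerns taildir, and it fixes the case where Flume closes a file while that file is still being updated.

Official site: http://flume.apache.org/releases/1.8.0.html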

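(1) Flume re-reads a renamed file as a new file because of the regular expression: the new file name still matches the pattern. So, first, the renamed file is still monitored by Flume; second, Flume decides whether it is looking at the same file from the inode together with the absolute path, and from whether a tracked entry exists (is not null) together with the absolute path. You can see this in the source: download the source, put it into a Maven project (make sure the package paths line up), and find the taildir source package.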
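To make the "same inode, different path" situation concrete, here is a minimal sketch that reads a file's inode number before and after a rename. It uses the standard `Files.getAttribute(path, "unix:ino")` call, which is only available on Unix-like systems; the class name `InodeDemo`, the helper names, and the temporary file names are made up for illustration and are not part of Flume.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class InodeDemo {
    // Reads the inode number through the "unix" attribute view (Unix-like systems only).
    static long inodeOf(Path p) {
        try {
            return ((Number) Files.getAttribute(p, "unix:ino")).longValue();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates ac.log in a temp directory, renames it to ac.log.1 the way a
    // rolling appender would, and reports whether the inode stayed the same.
    static boolean renameKeepsInode() {
        try {
            Path dir = Files.createTempDirectory("taildir-demo");
            Path log = Files.createFile(dir.resolve("ac.log"));
            long before = inodeOf(log);
            Path rolled = Files.move(log, dir.resolve("ac.log.1"));
            return before == inodeOf(rolled);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("inode unchanged by rename: " + renameKeepsInode());
    }
}
```

On a typical Linux file system a rename inside one directory keeps the inode, so the path is the only thing that changes from the tracker's point of view.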

First, look at a test run:

Duplicates do appear. Now look at the source, in the flume-taildir-source project:

The updateTailFiles method of the ReliableTaildirEventReader class:
  public List<Long> updateTailFiles(boolean skipToEnd) throws IOException {
    updateTime = System.currentTimeMillis();
    List<Long> updatedInodes = Lists.newArrayList();
 
    for (TaildirMatcher taildir : taildirCache) {
      Map<String, String> headers = headerTable.row(taildir.getFileGroup());
 
      for (File f : taildir.getMatchingFiles()) {
        long inode = getInode(f);
        TailFile tf = tailFiles.get(inode);
        if (tf == null || !tf.getPath().equals(f.getAbsolutePath())) {
          long startPos = skipToEnd ? f.length() : 0;
          tf = openFile(f, headers, inode, startPos);
        } else {
          boolean updated = tf.getLastUpdated() < f.lastModified();
          if (updated) {
            if (tf.getRaf() == null) {
              tf = openFile(f, headers, inode, tf.getPos());
            }
            if (f.length() < tf.getPos()) {
              logger.info("Pos " + tf.getPos() + " is larger than file size! "
                  + "Restarting from pos 0, file: " + tf.getPath() + ", inode: " + inode);
              tf.updatePos(tf.getPath(), inode, 0);
            }
          }
          tf.setNeedTail(updated);
        }
        tailFiles.put(inode, tf);
        updatedInodes.add(inode);
      }
    }
    return updatedInodes;
  }
The key part:
 for (File f : taildir.getMatchingFiles()) {
        long inode = getInode(f);
        TailFile tf = tailFiles.get(inode);
        if (tf == null || !tf.getPath().equals(f.getAbsolutePath())) {
          long startPos = skipToEnd ? f.length() : 0;
          tf = openFile(f, headers, inode, startPos);
        } 
The updatePos method of the TailFile class:
  public boolean updatePos(String path, long inode, long pos) throws IOException {
    if (this.inode == inode && this.path.equals(path)) {
      setPos(pos);
      updateFilePos(pos);
      logger.info("Updated position, file: " + path + ", inode: " + inode + ", pos: " + pos);
      return true;
    }
    return false;
  }
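The trouble this causes: when a renamed file still matches the regular expression, Flume keeps monitoring it, and even though the inode is the same, the different file name makes Flume treat it as a new file.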
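The effect of that condition can be reproduced with a small stand-alone model. This is a simplified sketch, not Flume's code: `TailIdentityModel`, `observe`, and the `comparePath` flag are hypothetical names, the map stands in for the `tailFiles` map, and `skipToEnd` is assumed to be false so that a "new" file is read from position 0.

```java
import java.util.HashMap;
import java.util.Map;

class TailIdentityModel {
    // inode -> last known absolute path; a stand-in for the tailFiles map.
    private final Map<Long, String> tracked = new HashMap<>();
    // true  models the check "tf == null || path differs"
    // false models the inode-only check "tf == null"
    private final boolean comparePath;

    TailIdentityModel(boolean comparePath) {
        this.comparePath = comparePath;
    }

    // Returns true when the file would be opened again from position 0.
    boolean observe(long inode, String absolutePath) {
        String known = tracked.get(inode);
        boolean treatAsNew = known == null
                || (comparePath && !known.equals(absolutePath));
        tracked.put(inode, absolutePath);
        return treatAsNew;
    }

    public static void main(String[] args) {
        TailIdentityModel withPath = new TailIdentityModel(true);
        withPath.observe(42L, "/logs/ac.log");
        System.out.println("inode+path, after rename: re-read = "
                + withPath.observe(42L, "/logs/ac.log.1"));

        TailIdentityModel inodeOnly = new TailIdentityModel(false);
        inodeOnly.observe(42L, "/logs/ac.log");
        System.out.println("inode only, after rename: re-read = "
                + inodeOnly.observe(42L, "/logs/ac.log.1"));
    }
}
```

One caveat when relying on the inode alone: file systems may reuse an inode number after the original file has been deleted, so the stricter check does protect against a different kind of mix-up.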

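In practice this is an inconvenience that developers bring on themselves: the renamed files can be excluded entirely through the file-name regular expression used for monitoring.

With a regular expression such as 【.*.log.*】, renaming a file from .ac.log to .ac.log.1 naturally leads to the duplicate-read problem.

With the regular expression 【.*.log】, a file renamed from .ac.log to .ac.log.1 is no longer monitored by Flume, so there is no duplicate read.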
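The two patterns can be checked quickly with `java.util.regex`. The sketch below assumes that the pattern has to match the whole file name, which is what `Pattern.matches` does; the class name `FilenamePatternDemo` and the sample file names are only illustrative. It also shows a side effect of the unescaped dot: `.` matches any character, so 【.*.log】 accepts a name like `catalog`, while `.*\.log` (written `".*\\.log"` in a Java string) does not.

```java
import java.util.regex.Pattern;

class FilenamePatternDemo {
    // Full-match test: the pattern must cover the entire file name.
    static boolean matches(String regex, String fileName) {
        return Pattern.matches(regex, fileName);
    }

    public static void main(String[] args) {
        String[] patterns = {".*.log.*", ".*.log", ".*\\.log"};
        String[] names = {"ac.log", "ac.log.1", "catalog"};
        for (String p : patterns) {
            for (String n : names) {
                System.out.println(p + " vs " + n + " -> " + matches(p, n));
            }
        }
    }
}
```

If rolled files must be excluded, an escaped and anchored pattern of this kind is usually the safer choice than relying on the looser form.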

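That is my reasoning about this problem and about why the Flume team has not changed the behavior.

Of course, if patterns like 【.*.log.*】 turn out to be truly necessary in production, the Flume team will presumably decide whether to change this in the project based on how much demand the issue gets on GitHub.

If you really do need such a pattern to monitor the files in a directory in production, avoiding duplicate reads requires modifying the Flume 1.7 source: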

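Handling problem (1)
1. Modify ReliableTaildirEventReader
Modify the updateTailFiles method of the ReliableTaildirEventReader class.

Remove tf.getPath().equals(f.getAbsolutePath()). It is enough to check whether the file is already tracked (tf is not null); the file name does not need to be compared, because log4j renames the file when it rolls the log.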

 if (tf == null || !tf.getPath().equals(f.getAbsolutePath())) {
 
Change it to:
 if (tf == null) {
2. Modify TailFile
Modify the updatePos method of the TailFile class.

The inode already identifies a file uniquely, so path is not needed as an additional condition.

    if (this.inode == inode && this.path.equals(path)) {
Change it to:
    if (this.inode == inode) {
3. Package the modified code as a custom source jar
You can package just the taildirSource component and then replace that component's jar.

You can now test the change.

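Handling problem (2)
Problem (2) is that when a monitored file no longer exists, Flume does not release the resource.

This is not really a problem either. The resource is in fact released, but only after a waiting period.

See the taildirSource section of the official Flume 1.7 documentation:

It shows that if no new line is appended to the file within the default 120000 ms, the file is closed; when a new line is appended, it is reopened automatically.

In other words, the supposedly unreleased resource is closed automatically after the default 120000 ms, that is, 2 minutes.

To avoid tying up resources for that long, you can lower this value. But why is the official default so large (comparable timeouts are usually measured in seconds, while this one is in minutes)? You should not lower it arbitrarily: the system resources wasted by frequently opening and closing files deserve even more consideration.

Unless you have measured performance carefully, sticking with the default is fine.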
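The trade-off can be illustrated with a rough model. The setting in question is the `idleTimeout` property of the Taildir Source in the Flume user guide. The sketch below is not Flume's implementation: it assumes the file handle is closed as soon as the gap between two appends exceeds the timeout and is reopened on the next append, and it ignores the fact that the real idle check runs periodically. `IdleTimeoutModel` and `countReopens` are hypothetical names.

```java
class IdleTimeoutModel {
    static final long DEFAULT_IDLE_TIMEOUT_MS = 120_000L;

    // Counts how often a file would be closed and reopened, given the
    // timestamps (in ms, ascending) at which new lines are appended.
    static int countReopens(long[] appendTimesMs, long idleTimeoutMs) {
        int reopens = 0;
        for (int i = 1; i < appendTimesMs.length; i++) {
            long gap = appendTimesMs[i] - appendTimesMs[i - 1];
            if (gap > idleTimeoutMs) {
                reopens++; // closed during the gap, reopened by this append
            }
        }
        return reopens;
    }

    public static void main(String[] args) {
        // A log that receives one line every 30 seconds.
        long[] appends = {0L, 30_000L, 60_000L, 90_000L};
        System.out.println("timeout 120 s: reopens = "
                + countReopens(appends, DEFAULT_IDLE_TIMEOUT_MS)); // 0
        System.out.println("timeout 10 s:  reopens = "
                + countReopens(appends, 10_000L)); // 3
    }
}
```

Under these assumptions, a timeout shorter than the typical gap between log lines turns every append into a close-and-reopen cycle, which is the cost the default value avoids.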

https://www.cloudera.com/documentation/kafka/latest/topics/kafka_flume.html

Using Apache Kafka with Apache Flume

In CDH 5.2 and higher, Apache Flume contains an Apache Kafka source and sink. Use these to stream data from Kafka to Hadoop or from any Flume source to Kafka.

In CDH 5.7 and higher, the Flume connector to Kafka only works with Kafka 2.0 and higher.

Important: Do not configure a Kafka source to send data to a Kafka sink. If you do, the Kafka source sets the topic in the event header, overriding the sink configuration and creating an infinite loop, sending messages back and forth between the source and sink. If you need to use both a source and a sink, use an interceptor to modify the event header and set a different topic.

For information on configuring Kafka to securely communicate with Flume, see Configuring Flume Security with Kafka.

 

Kafka Source

Use the Kafka source to stream data in Kafka topics to Hadoop. The Kafka source can be combined with any Flume sink, making it easy to write Kafka data to HDFS, HBase, and Solr.

The following Flume configuration example uses a Kafka source to send data to an HDFS sink:

tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.kafka.bootstrap.servers = kafka-broker01.example.com:9092
tier1.sources.source1.kafka.topics = weblogs
tier1.sources.source1.kafka.consumer.group.id = flume
tier1.sources.source1.channels = channel1
tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = timestamp
tier1.sources.source1.kafka.consumer.timeout.ms = 100

tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1

For higher throughput, configure multiple Kafka sources to read from the same topic. If you configure all the sources with the same kafka.consumer.group.id, and the topic contains multiple partitions, each source reads data from a different set of partitions, improving the ingest rate.

For the list of Kafka Source properties, see Kafka Source Properties.

For the full list of Kafka consumer properties, see the Kafka documentation.

Tuning Notes

The Kafka source overrides two Kafka consumer parameters:

  1. auto.commit.enable is set to false by the source, and every batch is committed. For improved performance, set this to true using the kafka.auto.commit.enable setting. This can lead to data loss if the source goes down before committing.
  2. consumer.timeout.ms is set to 10, so when Flume polls Kafka for new data, it waits no more than 10 ms for the data to be available. Setting this to a higher value can reduce CPU utilization due to less frequent polling, but introduces latency in writing batches to the channel.

 

Kafka Sink

Use the Kafka sink to send data to Kafka from a Flume source. You can use the Kafka sink in addition to Flume sinks such as HBase or HDFS.

The following Flume configuration example uses a Kafka sink with an exec source:

tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = exec
tier1.sources.source1.command = /usr/bin/vmstat 1
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000

tier1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
tier1.sinks.sink1.topic = sink1
tier1.sinks.sink1.brokerList = kafka01.example.com:9092,kafka02.example.com:9092
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.batchSize = 20

For the list of Kafka Sink properties, see Kafka Sink Properties.

For the full list of Kafka producer properties, see the Kafka documentation.

The Kafka sink uses the topic and key properties from the FlumeEvent headers to determine where to send events in Kafka. If the header contains the topic property, that event is sent to the designated topic, overriding the configured topic. If the header contains the key property, that key is used to partition events within the topic. Events with the same key are sent to the same partition. If the key parameter is not specified, events are distributed randomly to partitions. Use these properties to control the topics and partitions to which events are sent through the Flume source or interceptor.
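The routing rule described above can be sketched in a few lines. This is only an illustration, not the Kafka sink's code: the header names `topic` and `key` come from the paragraph above, while `SinkRoutingModel`, `resolveTopic`, and `partitionFor` are hypothetical, and the hash-modulo partitioning is a stand-in assumption rather than Kafka's actual partitioner.

```java
import java.util.Collections;
import java.util.Map;

class SinkRoutingModel {
    // A "topic" header on the event overrides the topic configured on the sink.
    static String resolveTopic(Map<String, String> headers, String configuredTopic) {
        String fromHeader = headers.get("topic");
        return fromHeader != null ? fromHeader : configuredTopic;
    }

    // Illustrative only: equal keys always map to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        Map<String, String> withTopic = Collections.singletonMap("topic", "weblogs");
        Map<String, String> noTopic = Collections.emptyMap();
        System.out.println(resolveTopic(withTopic, "sink1"));
        System.out.println(resolveTopic(noTopic, "sink1"));
        System.out.println(partitionFor("host-a", 4) == partitionFor("host-a", 4));
    }
}
```

The header override in `resolveTopic` is also the mechanism behind the Kafka-source-to-Kafka-sink loop mentioned earlier: the source writes the topic into the header, and the sink then follows the header instead of its own configuration.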

 

Kafka Channel

CDH 5.3 and higher includes a Kafka channel to Flume in addition to the existing memory and file channels. You can use the Kafka channel:

  • To write to Hadoop directly from Kafka without using a source.
  • To write to Kafka directly from Flume sources without additional buffering.
  • As a reliable and highly available channel for any source/sink combination.

The following Flume configuration uses a Kafka channel with an exec source and hdfs sink:

tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = exec
tier1.sources.source1.command = /usr/bin/vmstat 1
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.zookeeperConnect = zk01.example.com:2181
tier1.channels.channel1.parseAsFlumeEvent = false
tier1.channels.channel1.kafka.topic = channel2
tier1.channels.channel1.kafka.consumer.group.id = channel2-grp
tier1.channels.channel1.kafka.consumer.auto.offset.reset = earliest
tier1.channels.channel1.kafka.bootstrap.servers = kafka02.example.com:9092,kafka03.example.com:9092
tier1.channels.channel1.transactionCapacity = 1000
tier1.channels.channel1.kafka.consumer.max.partition.fetch.bytes=2097152

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /tmp/kafka/channel
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1

For the list of Kafka Channel properties, see Kafka Channel Properties.

For the full list of Kafka producer properties, see the Kafka documentation.
