Flume做为Hadoop生态系统中的一员,能够说是功能最大的数据收集系统,Flume的模型也比较简单,经过agent不断级连,来打通数据源与最终目的地(通常为HDFS)。下图结构说明了Flume中的数据流。html
我今天要说的是Channel部分,具体来讲是MemoryChannel的分析,其余概念像source、sink你们能够去官方文档查看。java
注意:git
本文章中的Flume源码为1.6.0版本。
Event是Flume中对数据的抽象,分为两部分:header与body,和http中的header与body很相似。github
Flume中按Event为单位操做数据,不一样的source、sink在必要时会自动在原始数据与Event之间作转化。数据库
Channel充当了Source与Sink之间的缓冲区。Channel的引入,使得source与sink之间的藕合度下降,source只管像Channel发数据,sink只需从Channel取数据。
此外,有了Channel,不可贵出下面结论:apache
source与sink能够为N对N的关系api
source发数据的速度能够大于sink取数据的速度(在Channel不满的状况下)架构
Channel采用了Transaction
(事务)机制来保证数据的完整性,这里的事务和数据库中的事务概念相似,但并非彻底一致,其语义能够参考下面这个图:oracle
source端经过commit操做像Channel放置数据,sink端经过commit操做从Channel取数据。
那么事务是如何保证数据的完整性的呢?看下面有两个agent的状况:app
数据流程:
source 1
产生Event,经过“put”、“commit”操做将Event放到Channel 1
中
sink 1
经过“take”操做从Channel 1
中取出Event,并把它发送到Source 2
中
source 2
经过“put”、“commit”操做将Event放到Channel 2
中
source 2
向sink 1
发送成功信号,sink 1
“commit”步骤2中的“take”操做(其实就是删除Channel 1
中的Event)
说明:
在任什么时候刻,Event至少在一个Channel中是完整有效的
Flume中提供的Channel实现主要有三个:
Memory Channel,event保存在Java Heap中。若是容许数据小量丢失,推荐使用
File Channel,event保存在本地文件中,可靠性高,但吞吐量低于Memory Channel
JDBC Channel,event保存在关系数据中,通常不推荐使用
不一样的Channel主要在于Event存放的位置不一样,今天我着重讲一下比较简单的Memory Channel的源码。
首先看一下MemoryChannel中比较重要的成员变量:
// lock to guard queue, mainly needed to keep it locked down during resizes // it should never be held through a blocking operation private Object queueLock = new Object(); //queue为Memory Channel中存放Event的地方,这里用了LinkedBlockingDeque来实现 @GuardedBy(value = "queueLock") private LinkedBlockingDeque<Event> queue; //下面的两个信号量用来作同步操做,queueRemaining表示queue中的剩余空间,queueStored表示queue中的使用空间 // invariant that tracks the amount of space remaining in the queue(with all uncommitted takeLists deducted) // we maintain the remaining permits = queue.remaining - takeList.size() // this allows local threads waiting for space in the queue to commit without denying access to the // shared lock to threads that would make more space on the queue private Semaphore queueRemaining; // used to make "reservations" to grab data from the queue. // by using this we can block for a while to get data without locking all other threads out // like we would if we tried to use a blocking call on queue private Semaphore queueStored; //下面几个变量为配置文件中Memory Channel的配置项 // 一个事务中Event的最大数目 private volatile Integer transCapacity; // 向queue中添加、移除Event的等待时间 private volatile int keepAlive; // queue中,全部Event所能占用的最大空间 private volatile int byteCapacity; private volatile int lastByteCapacity; // queue中,全部Event的header所能占用的最大空间占byteCapacity的比例 private volatile int byteCapacityBufferPercentage; // 用于标示byteCapacity中剩余空间的信号量 private Semaphore bytesRemaining; // 用于记录Memory Channel的一些指标,后面能够经过配置监控来观察Flume的运行状况 private ChannelCounter channelCounter;
而后重点说下MemoryChannel里面的MemoryTransaction,它是Transaction类的子类,从其文档来看,一个Transaction的使用模式都是相似的:
Channel ch = ... Transaction tx = ch.getTransaction(); try { tx.begin(); ... // ch.put(event) or ch.take() ... tx.commit(); } catch (ChannelException ex) { tx.rollback(); ... } finally { tx.close(); }
能够看到一个Transaction主要有、put
、take
、commit
、rollback
这四个方法,咱们在实现其子类时,主要也是实现着四个方法。
Flume官方为了方便开发者实现本身的Transaction,定义了BasicTransactionSemantics,这时开发者只须要继承这个辅助类,而且实现其相应的、doPut
、doTake
、doCommit
、doRollback
方法便可,MemoryChannel
就是继承了这个辅助类。
private class MemoryTransaction extends BasicTransactionSemantics { //和MemoryChannel同样,内部使用LinkedBlockingDeque来保存没有commit的Event private LinkedBlockingDeque<Event> takeList; private LinkedBlockingDeque<Event> putList; private final ChannelCounter channelCounter; //下面两个变量用来表示put的Event的大小、take的Event的大小 private int putByteCounter = 0; private int takeByteCounter = 0; public MemoryTransaction(int transCapacity, ChannelCounter counter) { //用transCapacity来初始化put、take的队列 putList = new LinkedBlockingDeque<Event>(transCapacity); takeList = new LinkedBlockingDeque<Event>(transCapacity); channelCounter = counter; } @Override protected void doPut(Event event) throws InterruptedException { //doPut操做,先判断putList中是否还有剩余空间,有则把Event插入到该队列中,同时更新putByteCounter //没有剩余空间的话,直接报ChannelException channelCounter.incrementEventPutAttemptCount(); int eventByteSize = (int)Math.ceil(estimateEventSize(event)/byteCapacitySlotSize); if (!putList.offer(event)) { throw new ChannelException( "Put queue for MemoryTransaction of capacity " + putList.size() + " full, consider committing more frequently, " + "increasing capacity or increasing thread count"); } putByteCounter += eventByteSize; } @Override protected Event doTake() throws InterruptedException { //doTake操做,首先判断takeList中是否还有剩余空间 channelCounter.incrementEventTakeAttemptCount(); if(takeList.remainingCapacity() == 0) { throw new ChannelException("Take list for MemoryTransaction, capacity " + takeList.size() + " full, consider committing more frequently, " + "increasing capacity, or increasing thread count"); } //而后判断,该MemoryChannel中的queue中是否还有空间,这里经过信号量来判断 if(!queueStored.tryAcquire(keepAlive, TimeUnit.SECONDS)) { return null; } Event event; //从MemoryChannel中的queue中取出一个event synchronized(queueLock) { event = queue.poll(); } Preconditions.checkNotNull(event, "Queue.poll returned NULL despite semaphore " + "signalling existence of entry"); //放到takeList中,而后更新takeByteCounter变量 takeList.put(event); int eventByteSize = (int)Math.ceil(estimateEventSize(event)/byteCapacitySlotSize); takeByteCounter += eventByteSize; return event; } @Override protected void doCommit() throws InterruptedException { //该对应一个事务的提交 //首先判断putList与takeList的相对大小 int remainingChange = takeList.size() - putList.size(); //若是takeList小,说明向该MemoryChannel放的数据比取的数据要多,因此须要判断该MemoryChannel是否有空间来放 if(remainingChange < 0) { // 1. 首先经过信号量来判断是否还有剩余空间 if(!bytesRemaining.tryAcquire(putByteCounter, keepAlive, TimeUnit.SECONDS)) { throw new ChannelException("Cannot commit transaction. Byte capacity " + "allocated to store event body " + byteCapacity * byteCapacitySlotSize + "reached. Please increase heap space/byte capacity allocated to " + "the channel as the sinks may not be keeping up with the sources"); } // 2. 而后判断,在给定的keepAlive时间内,可否获取到充足的queue空间 if(!queueRemaining.tryAcquire(-remainingChange, keepAlive, TimeUnit.SECONDS)) { bytesRemaining.release(putByteCounter); throw new ChannelFullException("Space for commit to queue couldn't be acquired." + " Sinks are likely not keeping up with sources, or the buffer size is too tight"); } } int puts = putList.size(); int takes = takeList.size(); //若是上面的两个判断都过了,那么把putList中的Event放到该MemoryChannel中的queue中。 synchronized(queueLock) { if(puts > 0 ) { while(!putList.isEmpty()) { if(!queue.offer(putList.removeFirst())) { throw new RuntimeException("Queue add failed, this shouldn't be able to happen"); } } } //清空本次事务中用到的putList与takeList,释放资源 putList.clear(); takeList.clear(); } //更新控制queue大小的信号量bytesRemaining,由于把takeList清空了,因此直接把takeByteCounter加到bytesRemaining中。 bytesRemaining.release(takeByteCounter); takeByteCounter = 0; putByteCounter = 0; //由于把putList中的Event放到了MemoryChannel中的queue,因此把puts加到queueStored中去。 queueStored.release(puts); //若是takeList比putList大,说明该MemoryChannel中queue的数量应该是减小了,因此把(takeList-putList)的差值加到信号量queueRemaining if(remainingChange > 0) { queueRemaining.release(remainingChange); } if (puts > 0) { channelCounter.addToEventPutSuccessCount(puts); } if (takes > 0) { channelCounter.addToEventTakeSuccessCount(takes); } channelCounter.setChannelSize(queue.size()); } @Override protected void doRollback() { //当一个事务失败时,会进行回滚,即调用本方法 //首先把takeList中的Event放回到MemoryChannel中的queue中。 int takes = takeList.size(); synchronized(queueLock) { Preconditions.checkState(queue.remainingCapacity() >= takeList.size(), "Not enough space in memory channel " + "queue to rollback takes. This should never happen, please report"); while(!takeList.isEmpty()) { queue.addFirst(takeList.removeLast()); } //而后清空putList putList.clear(); } //由于清空了putList,因此须要把putList所占用的空间大小添加到bytesRemaining中 bytesRemaining.release(putByteCounter); putByteCounter = 0; takeByteCounter = 0; //由于把takeList中的Event回退到queue中去了,因此须要把takeList的大小添加到queueStored中 queueStored.release(takes); channelCounter.setChannelSize(queue.size()); } }
MemoryChannel
的逻辑相对简单,主要是经过MemoryTransaction
中的putList
、takeList
与MemoryChannel中的queue
打交道,这里的queue
至关于持久化层,只不过放到了内存中,若是是FileChannel
的话,会把这个queue
放到本地文件中。下面表示了Event在一个使用了MemoryChannel的agent中数据流向:
source ---> putList ---> queue ---> takeList ---> sink
还须要注意的一点是,这里的事务能够嵌套使用,以下图:
当有两个agent级连时,sink的事务中包含了一个source的事务,这也应证了前面所说的:
在任什么时候刻,Event至少在一个Channel中是完整有效的