Summary
This section covers: ZooKeeper's persistence framework; the structure of the FileTxnLog transaction log; the FileTxnLog source; LogFormatter, which deserializes the transaction log; and an analysis of a transaction-log demo.
Overall persistence framework
The persistence classes live under the package org.apache.zookeeper.server.persistence and are organized as follows (original figure omitted):
TxnLog: interface for reading and writing the transactional log.
FileTxnLog: implements TxnLog, adding the API for accessing the transactional log.
SnapShot: interface for persistence-layer snapshots.
FileSnap: implements SnapShot; stores, serializes, deserializes, and accesses snapshots.
FileTxnSnapLog: wraps a TxnLog and a SnapShot.
Util: utility class providing the APIs persistence needs.
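A rough sketch of how these pieces fit together (simplified and illustrative, not the real source):

// FileTxnSnapLog composes one TxnLog and one SnapShot implementation.
public class FileTxnSnapLog {
    private TxnLog txnLog;    // in practice a FileTxnLog over dataLogDir
    private SnapShot snapLog; // in practice a FileSnap over dataDir
    // On restart, restore() deserializes the latest valid snapshot, then
    // replays the transaction log entries newer than the snapshot's zxid.
}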
Two kinds of files
ZooKeeper stores two main kinds of files:
snapshot (an in-memory snapshot) and log (the transaction log, similar to MySQL's binlog: every operation that modifies data is recorded in it).
For a definition of a transactional log, see the refer section below. In short, the ZooKeeper transaction log records every transactional operation (creating a node, deleting a node, and so on) as one entry, so that data can be recovered if ZooKeeper fails.
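For example, a data directory typically looks like this (the file names here are made up; the hex suffix of each file is a zxid, as explained later):

version-2/
  snapshot.100000000   <- in-memory snapshot
  log.100000001        <- transaction log, named after the first zxid it contains
  log.200000001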
The transaction log is described next.
Transaction log
During normal operation, for every update ZooKeeper makes sure the transaction log for that update has been written to disk before it returns "update succeeded" to the client; only then does the update take effect.
The TxnLog interface
public interface TxnLog {
    /**
     * roll the current
     * log being appended to
     * @throws IOException
     */
    // rolls from the current log file over to the next one -- NOT a rollback
    void rollLog() throws IOException;

    /**
     * Append a request to the transaction log
     * @param hdr the transaction header
     * @param r the transaction itself
     * returns true iff something appended, otw false
     * @throws IOException
     */
    // append one request to the transaction log
    boolean append(TxnHeader hdr, Record r) throws IOException;

    /**
     * Start reading the transaction logs
     * from a given zxid
     * @param zxid
     * @return returns an iterator to read the
     * next transaction in the logs.
     * @throws IOException
     */
    // read the transaction log starting from the given zxid
    TxnIterator read(long zxid) throws IOException;

    /**
     * the last zxid of the logged transactions.
     * @return the last zxid of the logged transactions.
     * @throws IOException
     */
    // latest zxid among the logged transactions
    long getLastLoggedZxid() throws IOException;

    /**
     * truncate the log to get in sync with the
     * leader.
     * @param zxid the zxid to truncate at.
     * @throws IOException
     */
    // drop everything in the log after the given zxid
    boolean truncate(long zxid) throws IOException;

    /**
     * the dbid for this transaction log.
     * @return the dbid for this transaction log.
     * @throws IOException
     */
    // get the database id
    long getDbId() throws IOException;

    /**
     * commmit the trasaction and make sure
     * they are persisted
     * @throws IOException
     */
    // commit the transactions and make sure they are persisted
    void commit() throws IOException;

    /**
     * close the transactions logs
     */
    // close the transaction log
    void close() throws IOException;

    /**
     * an iterating interface for reading
     * transaction logs.
     */
    // iterator interface for reading transaction logs
    public interface TxnIterator {
        /**
         * return the transaction header.
         * @return return the transaction header.
         */
        // get the transaction header
        TxnHeader getHeader();

        /**
         * return the transaction record.
         * @return return the transaction record.
         */
        // get the transaction record
        Record getTxn();

        /**
         * go to the next transaction record.
         * @throws IOException
         */
        // move to the next transaction
        boolean next() throws IOException;

        /**
         * close files and release the
         * resources
         * @throws IOException
         */
        // close files and release resources
        void close() throws IOException;
    }
}
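A minimal usage sketch of this interface, assuming a FileTxnLog over a scratch directory (the TxnHeader and SetDataTxn constructors shown are the jute-generated ones; details can vary slightly across ZooKeeper versions):

import java.io.File;
import org.apache.zookeeper.ZooDefs.OpCode;
import org.apache.zookeeper.server.persistence.FileTxnLog;
import org.apache.zookeeper.server.persistence.TxnLog;
import org.apache.zookeeper.txn.SetDataTxn;
import org.apache.zookeeper.txn.TxnHeader;

public class TxnLogDemo {
    public static void main(String[] args) throws Exception {
        File logDir = new File("/tmp/txnlog-demo");  // hypothetical directory
        logDir.mkdirs();
        TxnLog txnLog = new FileTxnLog(logDir);

        // Append one transaction: the header carries (sessionId, cxid, zxid, time, type).
        TxnHeader hdr = new TxnHeader(0x1L, 1, 1L, System.currentTimeMillis(), OpCode.setData);
        txnLog.append(hdr, new SetDataTxn("/demo", "hello".getBytes(), 1));
        txnLog.commit();  // flush + fsync; see commit() below

        // Iterate over the logged transactions from zxid 1 onwards.
        TxnLog.TxnIterator it = txnLog.read(1L);
        while (it.getHeader() != null) {
            System.out.println("zxid 0x" + Long.toHexString(it.getHeader().getZxid()));
            if (!it.next()) break;
        }
        it.close();
        txnLog.close();
    }
}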
Implementation class: FileTxnLog
File format
/**
 * The format of a Transactional log is as follows:
 * <blockquote><pre>
 * LogFile:
 *     FileHeader TxnList ZeroPad
 *
 * FileHeader: {
 *     magic 4bytes (ZKLG)
 *     version 4bytes
 *     dbid 8bytes
 * }
 *
 * TxnList:
 *     Txn || Txn TxnList
 *
 * Txn:
 *     checksum Txnlen TxnHeader Record 0x42
 *
 * checksum: 8bytes Adler32 is currently used
 *   calculated across payload -- Txnlen, TxnHeader, Record and 0x42
 *
 * Txnlen:
 *     len 4bytes
 *
 * TxnHeader: {
 *     sessionid 8bytes
 *     cxid 4bytes
 *     zxid 8bytes
 *     time 8bytes
 *     type 4bytes
 * }
 *
 * Record:
 *     See Jute definition file for details on the various record types
 *
 * ZeroPad:
 *     0 padded to EOF (filled during preallocation stage)
 * </pre></blockquote>
 */
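So the first 16 bytes of every log file are the FileHeader. As a sanity check, the header can be read back with plain java.io, since jute's binary archive writes ints and longs big-endian just like DataOutputStream (a sketch, not part of ZooKeeper):

import java.io.DataInputStream;
import java.io.FileInputStream;

public class ReadFileHeader {
    public static void main(String[] args) throws Exception {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            int magic = in.readInt();    // 4 bytes; "ZKLG" == 0x5a4b4c47
            int version = in.readInt();  // 4 bytes; currently 2
            long dbid = in.readLong();   // 8 bytes
            System.out.printf("magic=%08x version=%d dbid=%d%n", magic, version, dbid);
        }
    }
}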
Key methods
append
// append one entry to the transaction log
public synchronized boolean append(TxnHeader hdr, Record txn)
    throws IOException
{
    if (hdr != null) { // transaction header is not null
        if (hdr.getZxid() <= lastZxidSeen) {
            LOG.warn("Current zxid " + hdr.getZxid()
                    + " is <= " + lastZxidSeen + " for "
                    + hdr.getType());
        }
        if (logStream == null) { // no log stream yet
            if (LOG.isInfoEnabled()) {
                LOG.info("Creating new log file: log." + Long.toHexString(hdr.getZxid()));
            }
            // create a new log file, named after the first zxid it will contain
            logFileWrite = new File(logDir, ("log." + Long.toHexString(hdr.getZxid())));
            fos = new FileOutputStream(logFileWrite);
            logStream = new BufferedOutputStream(fos);
            oa = BinaryOutputArchive.getArchive(logStream);
            // build the file header from TXNLOG_MAGIC, VERSION and dbId
            FileHeader fhdr = new FileHeader(TXNLOG_MAGIC, VERSION, dbId);
            fhdr.serialize(oa, "fileheader"); // serialize the header
            // Make sure that the magic number is written before padding.
            logStream.flush();
            currentSize = fos.getChannel().position();
            streamsToFlush.add(fos);
        }
        padFile(fos); // when less than 4KB remains, preallocate another 64MB
        byte[] buf = Util.marshallTxnEntry(hdr, txn);
        if (buf == null || buf.length == 0) {
            throw new IOException("Faulty serialization for header " + "and txn");
        }
        Checksum crc = makeChecksumAlgorithm(); // create the checksum algorithm
        crc.update(buf, 0, buf.length);
        oa.writeLong(crc.getValue(), "txnEntryCRC"); // write the checksum value as a long
        Util.writeTxnBytes(oa, buf); // write the serialized txn entry, terminated by 0x42 ('B')
        return true;
    }
    return false;
}
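The padFile call above is where preallocation happens. A simplified sketch of the logic inside Util.padLogFile (the real method's signature and constant handling differ slightly): whenever less than 4KB of preallocated space remains, the file is extended by another 64MB by writing a byte at the new end, which zero-fills the gap.

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

class PadSketch {
    static final long PRE_ALLOC_SIZE = 65536 * 1024; // 64MB, tunable via zookeeper.preAllocSize

    // Simplified sketch of Util.padLogFile's logic (not the exact source):
    static long padLogFile(FileOutputStream f, long currentSize) throws IOException {
        long position = f.getChannel().position();
        if (position + 4096 >= currentSize) {   // less than 4KB of headroom left
            currentSize += PRE_ALLOC_SIZE;      // grow by another 64MB
            // Writing a single byte at the new end extends the file, zero-filled.
            f.getChannel().write(ByteBuffer.wrap(new byte[1]), currentSize - 1);
        }
        return currentSize;                     // caller stores this back
    }
}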
getLogFiles
// find the log file with the largest zxid <= snapshotZxid, plus all later log files
public static File[] getLogFiles(File[] logDirList, long snapshotZxid) {
    // extract the zxid from each file suffix and sort ascending by zxid
    List<File> files = Util.sortDataDir(logDirList, "log", true);
    long logZxid = 0;
    // Find the log file that starts before or at the same time as the
    // zxid of the snapshot
    for (File f : files) {
        long fzxid = Util.getZxidFromName(f.getName(), "log");
        if (fzxid > snapshotZxid) {
            continue;
        }
        // the files
        // are sorted with zxid's
        if (fzxid > logZxid) {
            logZxid = fzxid;
        }
    }
    List<File> v = new ArrayList<File>(5);
    for (File f : files) {
        long fzxid = Util.getZxidFromName(f.getName(), "log");
        if (fzxid < logZxid) {
            continue;
        }
        v.add(f);
    }
    return v.toArray(new File[0]);
}
getLastLoggedZxid
// get the last zxid recorded in the log
public long getLastLoggedZxid() {
    File[] files = getLogFiles(logDir.listFiles(), 0);
    // the file with the largest starting zxid is the last one
    long maxLog = files.length > 0 ?
            Util.getZxidFromName(files[files.length - 1].getName(), "log") : -1;
    // if a log file is more recent we must scan it to find
    // the highest zxid
    long zxid = maxLog;
    TxnIterator itr = null;
    try {
        FileTxnLog txn = new FileTxnLog(logDir);
        itr = txn.read(maxLog);
        while (true) {
            if (!itr.next())
                break;
            TxnHeader hdr = itr.getHeader(); // walk the file to its last transaction entry
            zxid = hdr.getZxid();            // and take that entry's zxid
        }
    } catch (IOException e) {
        LOG.warn("Unexpected exception", e);
    } finally {
        close(itr);
    }
    return zxid;
}
commit
// commit the transaction log to disk
public synchronized void commit() throws IOException {
    if (logStream != null) {
        logStream.flush(); // push the buffered stream down to the OS
    }
    for (FileOutputStream log : streamsToFlush) {
        log.flush(); // flush the stream
        if (forceSync) {
            long startSyncNS = System.nanoTime();
            log.getChannel().force(false); // fsync: force the data onto disk
            long syncElapsedMS =
                TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startSyncNS);
            if (syncElapsedMS > fsyncWarningThresholdMS) {
                LOG.warn("fsync-ing the write ahead log in "
                        + Thread.currentThread().getName()
                        + " took " + syncElapsedMS
                        + "ms which will adversely effect operation latency. "
                        + "See the ZooKeeper troubleshooting guide");
            }
        }
    }
    while (streamsToFlush.size() > 1) {
        streamsToFlush.removeFirst().close(); // remove the stream and close it
    }
}
truncate
// drop all transaction log entries with zxid greater than the given zxid
public boolean truncate(long zxid) throws IOException {
    FileTxnIterator itr = null;
    try {
        itr = new FileTxnIterator(this.logDir, zxid); // position an iterator at the given zxid
        PositionInputStream input = itr.inputStream;
        if (input == null) {
            throw new IOException("No log files found to truncate! This could " +
                    "happen if you still have snapshots from an old setup or " +
                    "log files were deleted accidentally or dataLogDir was changed in zoo.cfg.");
        }
        long pos = input.getPosition();
        // now, truncate at the current position
        RandomAccessFile raf = new RandomAccessFile(itr.logFile, "rw");
        raf.setLength(pos); // cut off the rest of this file (the larger zxids)
        raf.close();
        while (itr.goToNextLog()) {
            if (!itr.logFile.delete()) { // delete all subsequent log files
                LOG.warn("Unable to truncate {}", itr.logFile);
            }
        }
    } finally {
        close(itr);
    }
    return true;
}
rollLog
Be sure to read the comment on this one: it does not roll back the log, it rolls over from the current log file to the next one.
/**
 * rollover the current log file to a new one.
 * @throws IOException
 */
public synchronized void rollLog() throws IOException {
    if (logStream != null) {
        this.logStream.flush();
        this.logStream = null;
        oa = null;
    }
}
Visualizing the transaction log: LogFormatter
This is best read alongside org.apache.zookeeper.server.persistence.FileTxnLog#append. Pass the path of a transaction log file as the only argument.
public static void main(String[] args) throws Exception {
    if (args.length != 1) {
        System.err.println("USAGE: LogFormatter log_file");
        System.exit(2);
    }
    FileInputStream fis = new FileInputStream(args[0]);
    BinaryInputArchive logStream = BinaryInputArchive.getArchive(fis);
    FileHeader fhdr = new FileHeader();
    fhdr.deserialize(logStream, "fileheader"); // deserialize the header and validate it
    if (fhdr.getMagic() != FileTxnLog.TXNLOG_MAGIC) {
        System.err.println("Invalid magic number for " + args[0]);
        System.exit(2);
    }
    System.out.println("ZooKeeper Transactional Log File with dbid "
            + fhdr.getDbid() + " txnlog format version "
            + fhdr.getVersion());

    int count = 0;
    while (true) {
        long crcValue;
        byte[] bytes;
        try {
            crcValue = logStream.readLong("crcvalue"); // read the stored checksum
            bytes = logStream.readBuffer("txnEntry");
        } catch (EOFException e) {
            System.out.println("EOF reached after " + count + " txns.");
            return;
        }
        if (bytes.length == 0) {
            // Since we preallocate, we define EOF to be an
            // empty transaction
            System.out.println("EOF reached after " + count + " txns.");
            return;
        }
        Checksum crc = new Adler32();
        crc.update(bytes, 0, bytes.length);
        if (crcValue != crc.getValue()) { // compare the recomputed checksum with the stored one
            throw new IOException("CRC doesn't match " + crcValue + " vs " + crc.getValue());
        }
        TxnHeader hdr = new TxnHeader();
        Record txn = SerializeUtils.deserializeTxn(bytes, hdr); // deserialize the transaction
        System.out.println(DateFormat.getDateTimeInstance(DateFormat.SHORT,
                DateFormat.LONG).format(new Date(hdr.getTime()))
                + " session 0x" + Long.toHexString(hdr.getClientId())
                + " cxid 0x" + Long.toHexString(hdr.getCxid())
                + " zxid 0x" + Long.toHexString(hdr.getZxid())
                + " " + TraceFormatter.op2String(hdr.getType()) + " " + txn);
        if (logStream.readByte("EOR") != 'B') {
            LOG.error("Last transaction was partial.");
            throw new EOFException("Last transaction was partial.");
        }
        count++;
    }
}
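To run it, put the ZooKeeper jar and its dependencies on the classpath and pass a log file; the paths and version below are illustrative:

java -cp zookeeper-3.4.x.jar:lib/* org.apache.zookeeper.server.LogFormatter version-2/log.100000001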
Sample output
Running LogFormatter against the demo from http://www.jianshu.com/p/d1f8b9d6ad57 (with the transaction log directory emptied beforehand) produces:
ZooKeeper Transactional Log File with dbid 0 txnlog format version 2
17-5-24 下午04时15分41秒 session 0x15c398687180000 cxid 0x0 zxid 0x1 createSession 20000
17-5-24 下午04时15分41秒 session 0x15c398687180000 cxid 0x2 zxid 0x2 create '/test1,#7a6e6f646531,v{s{31,s{'world,'anyone}}},T,1
17-5-24 下午04时15分41秒 session 0x15c398687180000 cxid 0x3 zxid 0x3 create '/test2,#7a6e6f646532,v{s{31,s{'world,'anyone}}},T,2
17-5-24 下午04时15分41秒 session 0x15c398687180000 cxid 0x4 zxid 0x4 create '/test3,#7a6e6f646533,v{s{31,s{'world,'anyone}}},T,3
17-5-24 下午04时15分43秒 session 0x15c398687180000 cxid 0x9 zxid 0x5 setData '/test2,#7a4e6f64653232,1
17-5-24 下午04时15分43秒 session 0x15c398687180000 cxid 0xb zxid 0x6 delete '/test2
17-5-24 下午04时15分43秒 session 0x15c398687180000 cxid 0xc zxid 0x7 delete '/test1
17-5-24 下午04时16分04秒 session 0x15c398687180000 cxid 0x0 zxid 0x8 closeSession null
EOF reached after 8 txns.
Read together with FileTxnLog#append, this output is easy to follow. Note that data payloads are printed as hex; a decoding sketch follows.
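For instance, #7a6e6f646531 is just the bytes of "znode1" in hex. A tiny throwaway decoder (a hypothetical helper, not part of ZooKeeper):

class HexDecode {
    // Decode the '#'-prefixed hex payload printed by LogFormatter.
    static String decode(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < hex.length(); i += 2) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 2), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("7a6e6f646531")); // prints "znode1"
    }
}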
Gripes
Mismatched tags
When serializing, org.apache.zookeeper.server.persistence.FileTxnLog#append writes
oa.writeLong(crc.getValue(), "txnEntryCRC"); // write the checksum value as a long
while when deserializing, org.apache.zookeeper.server.LogFormatter#main reads
crcValue = logStream.readLong("crcvalue");
The two tags don't match — though since the binary archive ignores the tag, this has no effect at runtime!
FileTxnLog#getLogFiles is inefficient
The files are already sorted by zxid in ascending order, so a single pass should suffice — see the sketch below.
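A possible single-pass rewrite (a sketch within the same package, relying on the sorted-ascending order): remember the index of the last file whose zxid is <= snapshotZxid and return the tail of the list from there.

// Single-pass alternative (sketch): files is sorted ascending by zxid.
public static File[] getLogFiles(File[] logDirList, long snapshotZxid) {
    List<File> files = Util.sortDataDir(logDirList, "log", true);
    int firstNeeded = 0; // index of the log file covering snapshotZxid
    for (int i = 0; i < files.size(); i++) {
        long fzxid = Util.getZxidFromName(files.get(i).getName(), "log");
        if (fzxid > snapshotZxid) {
            break;       // sorted: everything after this is also > snapshotZxid
        }
        firstNeeded = i; // last file starting at or before snapshotZxid
    }
    List<File> v = files.subList(firstNeeded, files.size());
    return v.toArray(new File[0]);
}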
Thoughts
File suffixes are derived from the zxid
logFileWrite = new File(logDir, ("log." + Long.toHexString(hdr.getZxid())));
This makes it easy to go between files and zxids — see for example how getLastLoggedZxid uses it, and the parsing sketch below.
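For illustration, recovering the zxid from a file name is just hex-parsing the suffix; a simplified equivalent of Util.getZxidFromName (a sketch):

// Simplified equivalent of Util.getZxidFromName (sketch):
static long zxidFromName(String name, String prefix) {
    // e.g. name = "log.100000001", prefix = "log" -> 0x100000001
    String[] parts = name.split("\\.");
    if (parts.length == 2 && parts[0].equals(prefix)) {
        return Long.parseLong(parts[1], 16);
    }
    return -1; // no zxid found
}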
What rollLog means
The function takes no arguments. Note again: it rolls over from the current log file to the next one (e.g. when the current log has grown too large); it does not roll back records in the log — a real rollback could hardly work without being told which zxid to roll back to.
For comparison: rollLog sets logStream to null, so the next append creates a new file logFileWrite and a new stream logStream.
Both commit and rollLog call flush — what's the difference?
This comes down to FileChannel and NIO. The call chain that writes into the FileChannel is:
org.apache.zookeeper.server.persistence.FileTxnLog#append
→ org.apache.zookeeper.server.persistence.FileTxnLog#padFile
→ org.apache.zookeeper.server.persistence.Util#padLogFile
→ java.nio.channels.FileChannel#write(java.nio.ByteBuffer, long)
so writes go through FileChannel's write method.
commit additionally calls log.getChannel().force(false), i.e. java.nio.channels.FileChannel#force.
Resources such as https://java-nio.avenwu.net/java-nio-filechannel.html explain that
force writes all data not yet on disk out to disk. For performance, the operating system keeps written data in buffers, so there is no guarantee that data passed to write has actually reached the disk — unless force is called manually. force takes a boolean argument indicating whether file metadata should be forced out as well.
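The distinction is easy to demonstrate (a sketch; the durability effect of force(false) is of course not observable from inside the program):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;

public class FlushVsForce {
    public static void main(String[] args) throws Exception {
        FileOutputStream fos = new FileOutputStream("/tmp/flush-demo"); // hypothetical path
        BufferedOutputStream bos = new BufferedOutputStream(fos);

        bos.write("some txn bytes".getBytes());
        bos.flush();                   // user-space buffer -> OS page cache; survives a JVM
                                       // crash, but not necessarily a power failure
        fos.getChannel().force(false); // page cache -> disk (fsync); false = skip metadata
        bos.close();
    }
}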
In other words, only commit truly forces the data onto disk; rollLog does not.
When does the transaction log call truncate to drop part of the log?
In a cluster, when a learner syncs with the leader and the leader tells the learner it needs to roll back before syncing. The caller is Learner#syncWithLeader, which section 40 will cover.
Open questions
What does the flush call in rollLog accomplish?
The difference between commit and rollLog was covered above. rollLog calls flush — what is the net effect? The data has not been forced to disk (otherwise commit would not be needed afterwards), and no FileChannel method is invoked. Presumably the flush only moves the bytes from the BufferedOutputStream's user-space buffer into the OS, where they sit in the page cache.
refer
http://www.cnblogs.com/leesf456/p/6279956.html
如何查看事务日志 FileTxnLog (How to view the transaction log: FileTxnLog)
什么是事务性日志 (What a transactional log is)
ZooKeeper运维之数据文件和事务日志 (ZooKeeper operations: data files and transaction logs)