一次鞭辟入里的 Log4j2 异步日志输出阻塞问题的定位

「本文已参与好文召集令活动,点击查看:后端、大前端双赛道投稿,2万元奖池等你挑战!html

一次鞭辟入里的 Log4j2 日志输出阻塞问题的定位

问题现象

线上某个应用的某个实例忽然出现某些次请求服务响应极慢的状况,有几回请求超过 60s 才返回,而且经过日志发现,服务线程并无作什么很重的操做。这种状况断断续续持续了半小时左右。前端

核心问题定位

因为半小时后,服务实例恢复正常,而且出现请求极慢的次数并不太多,没有能及时打印出现问题的时候线程堆栈采集相关信息。可是,咱们有本身的法宝,JFR(关于 JFR,请参考个人另外一系列JFR全解)。java

JFR 很是适合用来过后复盘定位问题,而且配置得当的话,性能损耗极小,而且不像 APM 系统同样须要额外的采集以及须要统一的服务进程进行整合展示。咱们随用随取便可,可是 JFR 面向的是单进程的问题定位,找到问题进程,定位跨进程业务问题链路追踪仍是须要 APM 系统的。git

咱们经过以下命令采集对应时间段的 JFR:github

jcmd 进程pid JFR.dump begin=2021-07-04T17:30:00.000Z end=2021-07-04T18:30:00.000Z
复制代码

咱们选择的时间跨度比出问题的时间跨度大一些,这样保证底层问题出现的时候也被采集写入这个 dump 文件。web

对于历史某些请求响应慢,我通常按照以下流程去看:数据库

  1. 是否有 STW(Stop-the-world,参考个人另外一篇文章:JVM相关 - SafePoint 与 Stop The World 全解):
  2. 是否有 GC 致使的长时间 STW
  3. 是否有其余缘由致使进程全部线程进入 safepoint 致使 STW
  4. 是否 IO 花了太长时间,例如调用其余微服务,访问各类存储(硬盘,数据库,缓存等等)
  5. 是否在某些锁上面阻塞太长时间?
  6. 是否 CPU 占用太高,哪些线程致使的?

接下来咱们来详细分析,首先是 GC,从下图看,并无能严重影响线程 STW 很长时间的: imageapache

而后查看 safepoint 相关,发现也没有什么影响:编程

image

接着,咱们查看 IO 相关的事件,也并无发现业务相关的阻塞读:后端

image

image

而后,咱们查看是否在某些锁上面阻塞太长时间,终于发现异常:

image

经过堆栈,咱们发现是** log4j2 打印日志卡住了**。

问题分析

1. Java Monitor Blocked 的 Monitor Address 对于同一个对象也是会改变的

首先,对于 Java Monitor Blocked 这个事件,咱们能够经过 Monitor Address 来判断是不是同一个锁。例如这里咱们发现这些线程都是阻塞在 FFFF 4C09 3188 这个地址的对象,这个地址是基于对象在 Java 堆中的相对地址得出的。

可是对于同一个对象,这个地址并不会是不变的,在每次 GC 扫描整理到这个对象对象的时候,若是有回收内存,通常这个地址就会改变。因为咱们使用的是 G1,因此并非每次 GC 都会扫描到这个对象,可是若是地址发生改变,必定是由于发生了 GC

2. log4j2 异步日志原理简介

在咱们的应用配置中,采用的日志框架是 log4j2,采用的配置是异步日志配置。这里简单说一下 Log4j2 异步日志的原理:Log4j2 异步日志基于高性能数据结构 Disruptor,Disruptor 是一个环形 buffer,作了不少性能优化(具体原理能够参考个人另外一系列:高并发数据结构(disruptor)),Log4j2 对此的应用以下所示:

image

简单来讲,多线程经过 log4j2 的门面类 org.apache.logging.log4j.Logger 进行日志输出,被封装成为一个 org.apache.logging.log4j.core.LogEvent,放入到 Disruptor 的环形 buffer 中。在消费端有一个单线程消费这些 LogEvent 写入对应的 Appender,咱们这里只有一个 Appender,其配置是:

<RollingFile name="file" append="true"
             filePattern="./app.log-%d{yyyy.MM.dd.HH}">
    <PatternLayout pattern="${logFormat}"/>
    <Policies>
        <TimeBasedTriggeringPolicy interval="1" modulate="true"/>
    </Policies>
    <DirectWriteRolloverStrategy maxFiles="72"/>
</RollingFile>
复制代码

异步日志 logger 配置为(配置 includeLocation 为 false,避免每次打印日志须要获取调用堆栈的性能损耗):

<Asyncroot level="info" includeLocation="false">
    <appender-ref ref="file"/>
</Asyncroot>
复制代码

log4j component 额外配置为:

log4j2.component.properties:

# 当没有日志的时候,消费线程经过 Block 等待日志事件到来,这样 CPU 占用最少
AsyncLoggerConfig.WaitStrategy=Block
复制代码

3. log4j2 disruptor 的 RingBuffer 的大小

既然是一个环形 Buffer,它的容量是有限的,咱们这里没有修改它的大小,走的是默认配置,查看其源码:

AsyncLoggerConfigDisruptor.java

public synchronized void start() {
    //省略其余代码
    ringBufferSize = DisruptorUtil.calculateRingBufferSize("AsyncLoggerConfig.RingBufferSize");
    disruptor = new Disruptor<>(factory, ringBufferSize, threadFactory, ProducerType.MULTI, waitStrategy);
}
复制代码

DisruptorUtil.java

private static final int RINGBUFFER_MIN_SIZE = 128;
private static final int RINGBUFFER_DEFAULT_SIZE = 256 * 1024;
private static final int RINGBUFFER_NO_GC_DEFAULT_SIZE = 4 * 1024;

static int calculateRingBufferSize(final String propertyName) {
    //是否启用了 ThreadLocal,若是是则为 4 kB,不是则为 256 kB
    int ringBufferSize = Constants.ENABLE_THREADLOCALS ? RINGBUFFER_NO_GC_DEFAULT_SIZE : RINGBUFFER_DEFAULT_SIZE;
    //读取系统变量,以及 log4j2.component.properties 文件获取 propertyName(这里是 AsyncLoggerConfig.RingBufferSize) 这个配置
    final String userPreferredRBSize = PropertiesUtil.getProperties().getStringProperty(propertyName,
            String.valueOf(ringBufferSize));
    try {
        int size = Integer.parseInt(userPreferredRBSize);
        //若是小于 128 字节则按照 128 字节设置
        if (size < RINGBUFFER_MIN_SIZE) {
            size = RINGBUFFER_MIN_SIZE;
            LOGGER.warn("Invalid RingBufferSize {}, using minimum size {}.", userPreferredRBSize,
                    RINGBUFFER_MIN_SIZE);
        }
        ringBufferSize = size;
    } catch (final Exception ex) {
        LOGGER.warn("Invalid RingBufferSize {}, using default size {}.", userPreferredRBSize, ringBufferSize);
    }
    //取最近的 2 的 n 次方,由于对于 2 的 n 次方取余等于对于 2^n-1 取与运算,这样更快
    return Integers.ceilingNextPowerOfTwo(ringBufferSize);
}
复制代码

若是启用了 ThreadLocal 这种方式生成 LogEvent,每次不新生成的 LogEvent 用以前的,用 ThreadLocal 存储的,这样避免了建立新的 LogEvent。可是考虑下面这种状况:

logger.info("{}", someObj);
复制代码

这样会形成强引用,致使若是线程没有新的日志,这个 someObj 一直不能回收。因此针对 Web 应用,log4j2 默认是不启用 ThreadLocal 的 方式生成 LogEvent

Constants.java

public static final boolean IS_WEB_APP = PropertiesUtil.getProperties().getBooleanProperty(
            "log4j2.is.webapp", isClassAvailable("javax.servlet.Servlet"));
public static final boolean ENABLE_THREADLOCALS = !IS_WEB_APP && PropertiesUtil.getProperties().getBooleanProperty(
            "log4j2.enable.threadlocals", true);
复制代码

由此,能够看出,咱们的 RingBuffer 的大小为 256 kB

4. RingBuffer 满了 log4j2 会发生什么?

当 RingBuffer 满了,若是在 log4j2.component.properties 配置了 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=false,则会 Wait(实际上是 park) 在 Disruptor 的 produce 方法上,等待消费出下一个能够生产的环形 buffer 槽;默认这个配置为 true,即全部生产日志的线程尝试获取全局中的同一个锁(private final Object queueFullEnqueueLock = new Object();):

DisruptorUtil.java

static final boolean ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL = PropertiesUtil.getProperties()
        .getBooleanProperty("AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull", true);
复制代码
private boolean synchronizeEnqueueWhenQueueFull() {
    return DisruptorUtil.ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL
            // Background thread must never block
            && backgroundThreadId != Thread.currentThread().getId();
}

private final Object queueFullEnqueueLock = new Object();

private void enqueue(final LogEvent logEvent, final AsyncLoggerConfig asyncLoggerConfig) {
    //若是 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=true,默认就是 true
    if (synchronizeEnqueueWhenQueueFull()) {
        synchronized (queueFullEnqueueLock) {
            disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
        }
    } else {
        //若是 AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=false
        disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
    }
}
复制代码

默认配置的时候,异常堆栈和咱们 JFR 中看到的同样,举个例子:

"Thread-0" #27 [13136] prio=5 os_prio=0 cpu=0.00ms elapsed=141.08s tid=0x0000022d6f2fbcc0 nid=0x3350 waiting for monitor entry  [0x000000399bcfe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(AsyncLoggerConfigDisruptor.java:375)
	- waiting to lock <merged>(a java.lang.Object)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueueEvent(AsyncLoggerConfigDisruptor.java:330)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logInBackgroundThread(AsyncLoggerConfig.java:159)
	at org.apache.logging.log4j.core.async.EventRoute$1.logMessage(EventRoute.java:46)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.handleQueueFull(AsyncLoggerConfig.java:149)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncDelegate(AsyncLoggerConfig.java:136)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:116)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
	at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
	at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
	at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2144)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2003)
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1975)
	at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1312)
	省略业务方法堆栈
复制代码

配置为 false 的时候,堆栈是这个样子的:

"Thread-0" #27 [18152] prio=5 os_prio=0 cpu=0.00ms elapsed=5.68s tid=0x000002c1fa120e00 nid=0x46e8 runnable  [0x000000eda8efe000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17-loom/Native Method)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17-loom/LockSupport.java:410)
	at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
	at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
	at com.lmax.disruptor.RingBuffer.publishEvent(RingBuffer.java:524)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(AsyncLoggerConfigDisruptor.java:379)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueueEvent(AsyncLoggerConfigDisruptor.java:330)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logInBackgroundThread(AsyncLoggerConfig.java:159)
	at org.apache.logging.log4j.core.async.EventRoute$1.logMessage(EventRoute.java:46)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.handleQueueFull(AsyncLoggerConfig.java:149)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncDelegate(AsyncLoggerConfig.java:136)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:116)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
	at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
	at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
	at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2144)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2003)
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1975)
	at org.apache.logging.log4j.spi.AbstractLogger.info(AbstractLogger.java:1312)
复制代码

5. 为什么会满,咱们的消费线程当时在干啥?

咱们来看下当时的消费线程是否有异常,因为当时硬盘 io 看系统监控并无异常因此这个线程很大多是 Runnable 的,不断的在写入日志。同时,咱们知道,写入文件 java 底层基于 native 调用,因此咱们查看 JFR 的 native 方法采集。使用事件查看器中的 Method Profiling Sample Native,右键点击,并选择“使用所选类型事件建立新页”:

image

在建立出来的新页中,按照线程分组查看,查看 Log4j2 的 disruptor 消费线程,能够得出下图: image

能够看出:在出问题的时间段,采集到的写入日志文件的事件数量明显增多不少,而且,经过堆栈能够看到:

"Log4j2-TF-2-AsyncLoggerConfig-1" #26 [23680] daemon prio=5 os_prio=0 cpu=406.25ms elapsed=2.89s tid=0x0000022d6f3d4080 nid=0x5c80 runnable  [0x000000399bbfe000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileOutputStream.writeBytes(java.base@17-loom/Native Method)
	at java.io.FileOutputStream.write(java.base@17-loom/FileOutputStream.java:365)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:261)
	- eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
	at org.apache.logging.log4j.core.appender.FileManager.writeToDestination(FileManager.java:272)
	- eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
	at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.writeToDestination(RollingFileManager.java:236)
	- eliminated <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:293)
	- locked <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.flush(OutputStreamManager.java:302)
	- locked <0x000000070ee0af40> (a org.apache.logging.log4j.core.appender.rolling.RollingFileManager)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:199)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:190)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:181)
	at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:312)
	at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
	at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:543)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.callAppenders(AsyncLoggerConfig.java:127)
	at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:502)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:485)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:121)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncLoggerConfigsOnCurrentThread(AsyncLoggerConfig.java:169)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:111)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:97)
	at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
	at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
	at java.lang.Thread.run(java.base@17-loom/Thread.java:1521)
复制代码

log4j2 调用了不少次 flush, JFR 显示采集到的事件,每次都调用了 flush。每次调用 flush,都会形成文件真正写的 native 调用。而消费慢的缘由,其实就是这种 native 调用太多,系统写入不过来了。

问题解决

咱们能够经过如下四个方向解决这个问题:

  1. 减小日志输出,精简日志,这是比较基础的方式,也是比较简单的方式,可是身为一个技术人,咱们不能知足于此
  2. 增长硬盘 io,这个也是比较基础简单的解决方式
  3. 咱们是否能够减小这个 flush 呢?答案是能够的,咱们能够配置 Appender 的 immediateFlush 为 false
  4. 增长监控,针对堆栈包含 org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue 的 java monitor block 事件进行监控,若是发现时间过长或者数量不少的事件则报警或者重建进程

1. 配置 Appender 的 immediateFlush 为 false

咱们能够配置 Appender 的 immediateFlush 为 false,例如:

<RollingFile name="file" append="true"
             filePattern="./app.log-%d{yyyy.MM.dd.HH}"
             immediateFlush="false">
    <PatternLayout pattern="${logFormat}"/>
    <Policies>
        <TimeBasedTriggeringPolicy interval="1" modulate="true"/>
    </Policies>
    <DirectWriteRolloverStrategy maxFiles="72"/>
</RollingFile>
复制代码

这里的原理对应源码:

AbstractOutputStreamAppender.java

protected void directEncodeEvent(final LogEvent event) {
    getLayout().encode(event, manager);
    //若是配置了 immdiateFlush (默认为 true)或者当前事件是 EndOfBatch
    if (this.immediateFlush || event.isEndOfBatch()) {
        manager.flush();
    }
}
复制代码

那么对于 Log4j2 Disruptor 异步日志来讲,何时 LogEventEndOfBatch 呢?是在消费到的 index 等于生产发布到的最大 index 的时候,这也是比较符合性能设计考虑,即在没有消费完的时候,尽量地不 flush,消费完当前全部的时候再去 flush:

BatchEventProcessor.java

private void processEvents()
{
    T event = null;
    long nextSequence = sequence.get() + 1L;

    while (true)
    {
        try
        {
            final long availableSequence = sequenceBarrier.waitFor(nextSequence);
            if (batchStartAware != null)
            {
                batchStartAware.onBatchStart(availableSequence - nextSequence + 1);
            }

            while (nextSequence <= availableSequence)
            {
                event = dataProvider.get(nextSequence);
                //这里 nextSequence == availableSequence 就是 EndOfBatch
                eventHandler.onEvent(event, nextSequence, nextSequence == availableSequence);
                nextSequence++;
            }

            sequence.set(availableSequence);
        }
        catch (final TimeoutException e)
        {
            notifyTimeout(sequence.get());
        }
        catch (final AlertException ex)
        {
            if (running.get() != RUNNING)
            {
                break;
            }
        }
        catch (final Throwable ex)
        {
            exceptionHandler.handleEventException(ex, nextSequence, event);
            sequence.set(nextSequence);
            nextSequence++;
        }
    }
}

复制代码

2. 增长基于 JFR 事件监控

这个须要 Java 14 以上的版本

Configuration config = Configuration.getConfiguration("default");
//设置监控的锁 block 时间超过多少就会采集
config.getSettings().put("jdk.JavaMonitorEnter#threshold", "1s");
try (var es = new RecordingStream(config)) {
    es.onEvent("jdk.JavaMonitorEnter", recordedEvent -> {
        //若是堆栈包含咱们关注的,则报警
        if (recordedEvent.getStackTrace().getFrames().stream().anyMatch(recordedFrame -> recordedFrame.toString().contains("org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue")))  {
            System.out.println("Alarm: " + recordedEvent);
        }
    });
    es.start();
}
复制代码

微信搜索“个人编程喵”关注公众号,每日一刷,轻松提高技术,斩获各类offer

相关文章
相关标签/搜索