Notes on Resolving a Production OOM (Configuration Center)

Date: 2019-12-05


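Background: As Best Diamond was rolled out more widely and matured, more and more internal projects adopted it for unified configuration management. Once it met expectations in each team's test environment, those projects gradually moved it into production.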

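Incident: During a production release of one of the company's core internal systems, the Best Diamond client logged that it had failed to connect to the server, so the project's configuration files could not be pulled in time. (After pulling configuration from the server, the configuration center client saves it on the local host; when a pull fails because of a network or other problem, the client automatically reads that local copy. The project therefore still started normally, but this was pure luck: had any value been changed on the Best Diamond server, the project would have started with a stale value and the consequences could have been severe.)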
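The fallback described in the incident can be sketched roughly as follows. This is only an illustration of the idea, not Best Diamond's client code: the class name CachedConfigLoader, the Supplier-based fetch, and the single cache file are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

/** Hypothetical loader: prefer the server, fall back to the last local copy. */
final class CachedConfigLoader {

    private CachedConfigLoader() {}

    static String load(Supplier<String> remoteFetch, Path localCopy) {
        try {
            // Normal path: pull from the server and refresh the local copy.
            String fresh = remoteFetch.get();
            Files.write(localCopy, fresh.getBytes(StandardCharsets.UTF_8));
            return fresh;
        } catch (RuntimeException fetchFailure) {
            // Server unreachable: serve whatever was saved last time.
            // This value may be stale if the server-side config has changed.
            try {
                return new String(Files.readAllBytes(localCopy), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sketch makes the risk visible: when the fetch fails, the caller silently gets the previous value, which is only correct as long as nothing changed on the server in the meantime.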

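Investigation:

In the log platform, the Best Diamond server logs showed that one server instance had logged: OutOfMemoryError: GC overhead limit exceeded. On seeing this error we were already fairly sure that a memory leak was the cause.

Temporary fix: because this was production, we restarted the server immediately so that other users' production systems could keep working. (We restarted without first dumping the heap, which was a serious mistake and made the later investigation harder.)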

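Cause analysis:

1. Why had this never happened in the test environment, where the number of client connections is far larger than in production? What differs between the test and production environments?

2. Why had production run for more than a year without this problem, only to hit it now?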

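Further investigation: we reviewed the commit history and the production release logs, and pulled the code for the version running online (note: several people maintain this code). As mentioned above, the heap at the time of the failure had not been dumped to a file (a lesson for the future), which made the analysis harder, but we could not simply ignore the problem or wait for the next crash. Since we suspected a memory leak, some code path had to be failing to clean up, so we dumped the heap of another production server running the same version with this command: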

jmap -dump:live,format=b,file=dump.hprof 24971    # 24971 is the PID of the server process
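A similar live-object dump can also be triggered from inside the JVM, which may help when nobody is around to run jmap before a restart. The sketch below assumes a HotSpot-based JVM (the com.sun.management.HotSpotDiagnosticMXBean is not available elsewhere); the class name HeapDumper and the temp-file location are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: writes a heap dump of the current JVM to disk. */
final class HeapDumper {

    private HeapDumper() {}

    /** Dumps only reachable ("live") objects, like jmap's "live" option. */
    static Path dumpLiveHeap(Path target) {
        try {
            // dumpHeap fails if the file already exists, so clear it first.
            Files.deleteIfExists(target);
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(target.toAbsolutePath().toString(), true);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience variant that picks a fresh .hprof file in a temp directory. */
    static Path dumpToTempFile() {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            return dumpLiveHeap(dir.resolve("dump.hprof"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Heap dump written to " + dumpToTempFile());
    }
}
```

For the OOM case specifically, starting the server with the HotSpot options -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=<dir> makes the JVM write the dump by itself at the moment of failure, so the evidence survives a restart.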

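We then analyzed the dump file with jvisualvm:

The analysis showed that String, Date, and ClientInfo each had more than a million instances (char[] also appears because String is implemented on top of char[]; judging by the instance counts, we ruled out the program using char[] directly), and together they occupied 60% of the JVM's memory. We therefore suspected that ClientInfo instances were leaking and looked at the code: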

package com.best.diamond.model.netty;

import java.util.Date;

public class ClientInfo {

    private String address;

    private Date connectTime;

    public ClientInfo(String address, Date connectTime) {
        this.address = address;
        this.connectTime = connectTime;
    }

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    public Date getConnectTime() {
        return connectTime;
    }

    public void setConnectTime(Date connectTime) {
        this.connectTime = connectTime;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result
                + ((address == null) ? 0 : address.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        ClientInfo other = (ClientInfo) obj;
        if (address == null) {
            if (other.address != null)
                return false;
        }
        else if (!address.equals(other.address))
            return false;
        return true;
    }

}

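Sure enough, ClientInfo has exactly two fields, a String and a Date, which matches the instance counts, so we concluded that ClientInfo was the culprit behind the memory leak.

Reading further, ClientInfo is used in many places. After checking and ruling them out one by one, we narrowed it down to DiamondServerHandler, our own implementation of a Netty ChannelHandler. The main parts of that class are as follows: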

@Sharable
public class DiamondServerHandler extends SimpleChannelInboundHandler<String> {

    private final static String HEARTBEAT = "heartbeat";

    private final static String DIAMOND = "bestdiamond=";

    private final static Logger logger = LoggerFactory.getLogger(DiamondServerHandler.class);

    private final static Charset CHARSET = Charset.forName("UTF-8");

    public static ConcurrentHashMap<ClientKey, List<ClientInfo>> clients = new ConcurrentHashMap<>();

    private ConcurrentHashMap<String /*client address*/, ChannelHandlerContext> channels = new ConcurrentHashMap<String, ChannelHandlerContext>();

    @Autowired
    private ProjectConfigService projectConfigService;

    @Autowired
    private ProjectModuleService projectModuleService;

            modules = new ArrayList<>(Arrays.asList(StringUtils.split(facet.getModules(), Constants.COMMA)));
        }
        if (CollectionUtils.isEmpty(modules)) {
            throw new ServiceException(StatusCode.ILLEGAL_ARGUMENT, "Module has not been maintained yet.");
        }
        ClientKey key = new ClientKey();
        key.setProjCode(facet.getProjCode());
        key.setProfile(facet.getProfile());
        key.setModules(modules);
        List<ClientInfo> addrs = clients.get(key);
        if (addrs == null) {
            addrs = new ArrayList<>();
        }
        String clientAddress = ctx.channel().remoteAddress().toString().substring(1);
        ClientInfo clientInfo = new ClientInfo(clientAddress, new Date());
        addrs.add(clientInfo);
        clients.put(key, addrs);
        channels.put(clientAddress, ctx);
        return facet;
    }

    @Override
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        ctx.close();
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        super.channelInactive(ctx);
        String address = ctx.channel().remoteAddress().toString();
        channels.remove(address);
        String watchConfigs = removeClientConnectionInfo(ctx);
        removeConfigWatchNode(watchConfigs, address);
        for (List<ClientInfo> infoList : clients.values()) {
            for (ClientInfo client : infoList) {
                if (address.equals(client.getAddress())) {
                    infoList.remove(client);
                    break;
                }
            }
        }
        logger.info(ctx.channel().remoteAddress() + " disconnected.");
    }

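The code shows that ClientInfo is where the server stores some information about each client: an instance is created when a client connects to the server and is supposed to be released when the connection drops. Careful readers will have noticed that the address variable, which acts as the key for both storing and deleting a ClientInfo, is handled inconsistently. When storing: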

String clientAddress = ctx.channel().remoteAddress().toString().substring(1);

When deleting:

String address = ctx.channel().remoteAddress().toString();

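The stored address has its first character stripped by substring(1) while the address used for deletion does not, so address.equals(client.getAddress()) never matches and the ClientInfo of a disconnected client is never removed. That is the memory leak. (Note: DiamondServerHandler carries Netty's @Sharable annotation and is shared by all channels, so the reference to it never goes away and it is never garbage collected.)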
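The effect of the mismatched key can be reproduced in isolation. The following is a minimal sketch, not the project's code: it assumes the raw address string starts with a "/" (which is what substring(1) in the handler appears to strip), and the class name AddressKeyDemo and the sample addresses are invented for the demonstration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical demo of the store/remove key mismatch. */
final class AddressKeyDemo {

    private AddressKeyDemo() {}

    /** Single place that derives the map key, so both paths always agree. */
    static String toKey(String remoteAddress) {
        return remoteAddress.startsWith("/") ? remoteAddress.substring(1) : remoteAddress;
    }

    /**
     * Simulates clients that each connect and disconnect once, and
     * returns how many entries are left behind afterwards.
     */
    static int entriesLeft(int connections, boolean normalizeOnRemove) {
        Map<String, Long> registry = new ConcurrentHashMap<>();
        for (int i = 0; i < connections; i++) {
            // Assumed shape of the raw address string: leading slash, ip:port.
            String remote = "/10.0.0.1:" + (40000 + i);
            registry.put(toKey(remote), System.currentTimeMillis());      // connect
            registry.remove(normalizeOnRemove ? toKey(remote) : remote);  // disconnect
        }
        return registry.size();
    }

    public static void main(String[] args) {
        System.out.println("raw key on remove leaves:        " + entriesLeft(1000, false));
        System.out.println("normalized key on remove leaves: " + entriesLeft(1000, true));
    }
}
```

With the raw key every entry survives its disconnect; with the normalized key nothing is left. Deriving the key in one helper that both the connect and disconnect paths call makes this class of bug much harder to reintroduce.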

The adjusted code:

@Sharable
public class DiamondServerHandler extends SimpleChannelInboundHandler<String> {

    private final static String HEARTBEAT = "heartbeat";

    private final static String DIAMOND = "bestdiamond=";

    private final static Logger logger = LoggerFactory.getLogger(DiamondServerHandler.class);

    private final static Charset CHARSET = Charset.forName("UTF-8");

    private static Object locker = new Object();

    public static ConcurrentHashMap<ClientKey, List<ClientInfo>> clients = new ConcurrentHashMap<>();

    private ConcurrentHashMap<String /*client address*/, ChannelHandlerContext> channels = new ConcurrentHashMap<String, ChannelHandlerContext>();

    @Autowired
    private ProjectConfigService projectConfigService;

    @Autowired
    private ProjectModuleService projectModuleService;

            modules = new ArrayList<>(Arrays.asList(StringUtils.split(facet.getModules(), Constants.COMMA)));
        }
        if (CollectionUtils.isEmpty(modules)) {
            throw new ServiceException(StatusCode.ILLEGAL_ARGUMENT, "Module has not been maintained yet.");
        }
        ClientKey key = new ClientKey();
        key.setProjCode(facet.getProjCode());
        key.setProfile(facet.getProfile());
        key.setModules(modules);
        List<ClientInfo> addrs = clients.get(key);
        synchronized (locker) {
            if (null == addrs) {
                addrs = new ArrayList<>();
            }
        }
        String clientAddress = ctx.channel().remoteAddress().toString().substring(1);
        ClientInfo clientInfo = new ClientInfo(clientAddress, new Date());
        addrs.add(clientInfo);
        clients.put(key, addrs);
        channels.put(clientAddress, ctx);
        return facet;
    }

    @Override
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        ctx.close();
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        super.channelInactive(ctx);
        String address = ctx.channel().remoteAddress().toString().substring(1);
        channels.remove(address);
        String watchConfigs = removeClientConnectionInfo(ctx);
        removeConfigWatchNode(watchConfigs, address);
        for (List<ClientInfo> infoList : clients.values()) {
            for (ClientInfo client : infoList) {
                if (address.equals(client.getAddress())) {
                    infoList.remove(client);
                    break;
                }
            }
        }
        logger.info(ctx.channel().remoteAddress() + " disconnected.");
    }

After this change we released and tested, and the problem was solved.
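One further observation on the adjusted handler: the list is still fetched with clients.get(key), created when missing, and written back with clients.put(key, addrs), and the lists themselves are plain ArrayLists shared by every channel. The synchronized block only guards the creation of a local list. Below is a sketch of one possible tightening. It is not the author's fix; the ClientRegistry class and the use of plain String keys and addresses (instead of ClientKey and ClientInfo) are simplifications made for the example.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical stand-in for the handler's static "clients" map. */
final class ClientRegistry {

    // Group key (a String here for brevity) -> addresses of connected clients.
    private final ConcurrentHashMap<String, List<String>> clients = new ConcurrentHashMap<>();

    /** Called when a client connects. */
    void register(String groupKey, String address) {
        // computeIfAbsent creates the list atomically, so two channels that
        // connect at the same time cannot overwrite each other's list.
        clients.computeIfAbsent(groupKey, k -> new CopyOnWriteArrayList<>()).add(address);
    }

    /** Called when a client disconnects; returns true if an entry was removed. */
    boolean unregister(String address) {
        boolean removed = false;
        for (List<String> group : clients.values()) {
            // removeIf on CopyOnWriteArrayList is safe under concurrent readers.
            removed |= group.removeIf(address::equals);
        }
        return removed;
    }

    /** Total number of tracked connections across all groups. */
    int size() {
        int total = 0;
        for (List<String> group : clients.values()) {
            total += group.size();
        }
        return total;
    }
}
```

CopyOnWriteArrayList copies the backing array on every write, which is a reasonable trade-off only if connects and disconnects are rare compared with reads; the return value of unregister also gives a cheap way to log a warning when a disconnect finds nothing to remove.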

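But that was not the end of it. Even with this leak, more than a million instances still seemed like too many. After further investigation we confirmed that a legacy setup was responsible: early on we had used the company's HA for server-side high availability, and the HA proxy would drop each connection every minute and then reconnect. The ops team told us this is one of HA's mechanisms.

The pitfall here: we chose Netty precisely to keep long-lived connections between server and client, and this HA behavior works against that. We later adjusted the architecture so that client connections are no longer proxied through the front-end HA, which is also a more reasonable design, and that finally solved the problem.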
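To get a feel for how quickly a once-per-minute reconnect can add up, here is a back-of-the-envelope calculation. The reconnect rate comes from the text above; the number of proxied connections (10) is purely a hypothetical input, since the article does not state it, and the class name LeakEstimate is invented.

```java
/** Hypothetical estimate of how fast leaked entries accumulate. */
final class LeakEstimate {

    static final int MINUTES_PER_DAY = 24 * 60;

    private LeakEstimate() {}

    /** Each reconnect leaves one stale entry behind. */
    static long leakedPerDay(int proxiedConnections, int reconnectsPerMinute) {
        return (long) proxiedConnections * reconnectsPerMinute * MINUTES_PER_DAY;
    }

    /** Days needed to accumulate the given number of stale entries. */
    static double daysToReach(long targetEntries, int proxiedConnections, int reconnectsPerMinute) {
        return (double) targetEntries / leakedPerDay(proxiedConnections, reconnectsPerMinute);
    }

    public static void main(String[] args) {
        // One proxied connection, one reconnect per minute.
        System.out.println("leaked per day, 1 connection: " + leakedPerDay(1, 1));
        // Assumed 10 proxied connections; the real number is not given.
        System.out.println("days to reach 1,000,000 with 10 connections: "
                + daysToReach(1_000_000L, 10, 1));
    }
}
```

A single proxied connection leaks 1,440 entries per day at that rate, so under the assumed 10 connections a million entries would take roughly 69 days, well within the time the service had been running.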

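Summary:

1. When an OOM occurs, the cause is in the code in the vast majority of cases. The first thing to do is dump a snapshot of the JVM's heap at that moment, so the problem can be located.

2. Using tools to analyze the heap dump makes troubleshooting more efficient. jvisualvm is really only the most basic tool here; we later used other visualization tools to resolve other crashes (to be written up later).

3. Once the problem has been located and fixed, you still need to find the colleague who committed the code and make them aware of it.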