一个作BI
数据展现的服务在一个晚上重启了5次,因为是经过k8s容器编排,服务挂了之后会自动重启,因此服务还能继续提供服务。java
第一时间先上日志系统查看错误日志,发现以下报错:react
java.lang.OutOfMemoryError ERROR java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.start(CloseableHttpAsyncClientBase.java:83) at org.elasticsearch.client.RestClientBuilder.build(RestClientBuilder.java:190) at com.xiaohongshu.fls.jbds.handler.JbdsImpl.get_realtime_trend_data_from_es(JbdsImpl.java:637) at com.xiaohongshu.fls.jbds.handler.JbdsImpl.get_seller_trend(JbdsImpl.java:316) at com.xiaohongshu.fls.jbds.handler.JbdsImpl$$FastClassBySpringCGLIB$$bd2466f7.invoke(<generated>) at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:738) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157) at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:85) at com.xiaohongshu.fls.jbds.aspect.RpcMethodExceptionAspect.requestControllerLog(RpcMethodExceptionAspect.java:33) at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:629) at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:618) at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179) at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179) at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:673) at com.xiaohongshu.fls.jbds.handler.JbdsImpl$$EnhancerBySpringCGLIB$$d374b954.get_seller_trend(<generated>) at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.xiaohongshu.infra.rpc.core.ThriftServerBaseProxy.call(ThriftServerBaseProxy.java:119) at com.xiaohongshu.infra.rpc.core.ThriftServerCGlibProxy.intercept(ThriftServerCGlibProxy.java:27) at com.xiaohongshu.fls.jbds.handler.JbdsImpl$$EnhancerByCGLIB$$19473b8f.get_seller_trend(<generated>) at com.xiaohongshu.fls.rpc.jbds.JBusinessDataService$Processor$get_seller_trend.getResult(JBusinessDataService.java:1450) at com.xiaohongshu.fls.rpc.jbds.JBusinessDataService$Processor$get_seller_trend.getResult(JBusinessDataService.java:1435) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) at org.apache.thrift.server.Invocation.run(Invocation.java:18) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
发现是OOM
的错误,而且有unable to create new native thread
的错误信息,笔者的第一直觉是建立了大量线程从而致使堆外内存空间不足。spring
随即去CAT
监控系统上去查看线程的活跃状况,以下图:apache
发现服务的活跃线程数达到了1W左右,这显然有问题。服务自己是一个Thrift Server
,开起的worker
线程数为200,再加上一些其余IO
线程,总的线程数不会超过300(服务内本身没有显示建立线程或者使用线程池)。segmentfault
笔者登陆到线上的Docker实例上,经过jmap -histo:live pid
命令,查看JVM中存活的对象,输出以下内容:安全
num #instances #bytes class name ---------------------------------------------- 1: 19303 656995672 [B 2: 116670 13942144 [C 3: 15645 5298648 [I 4: 10098 3796848 java.lang.Thread 5: 75817 3639216 java.util.HashMap 6: 113889 2733336 java.lang.String 7: 27521 2421848 java.lang.reflect.Method 8: 68662 2197184 java.util.concurrent.ConcurrentHashMap$Node 9: 10041 1767032 [J 10: 372 1525568 [Ljava.nio.ByteBuffer; 11: 36807 1472280 java.util.LinkedHashMap$Entry 12: 16504 1352488 [Ljava.util.HashMap$Node; 13: 40687 1301984 java.lang.ThreadLocal$ThreadLocalMap$Entry 14: 11495 1277344 java.lang.Class 15: 29903 1196120 java.util.WeakHashMap$Entry 16: 70474 1127584 java.lang.Object 17: 15346 918592 [Ljava.lang.Object; 18: 51556 824896 java.util.HashSet 19: 25528 816896 java.util.HashMap$Node 20: 9989 812176 [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry; 21: 9661 792032 [Ljava.util.WeakHashMap$Entry; 22: 9504 760320 org.apache.http.impl.nio.reactor.BaseIOReactor 23: 30551 733224 java.util.concurrent.ConcurrentLinkedQueue$Node 24: 9914 713808 sun.nio.ch.EPollArrayWrapper 25: 9914 713808 sun.nio.ch.EPollSelectorImpl 26: 29682 712368 java.util.concurrent.ConcurrentLinkedQueue 27: 3525 694672 [Ljava.util.concurrent.ConcurrentHashMap$Node; 28: 15664 501248 java.lang.ref.WeakReference 29: 8364 468384 java.util.LinkedHashMap 30: 9656 463488 java.util.WeakHashMap 31: 20064 436536 [Ljava.lang.Class; 32: 9267 370680 java.security.AccessControlContext 33: 4856 349632 java.lang.reflect.Field 34: 10278 328896 java.lang.ref.ReferenceQueue 35: 9914 317248 sun.nio.ch.AllocatedNativeObject 36: 4955 317120 java.util.concurrent.ConcurrentHashMap 37: 7597 303880 java.lang.ref.SoftReference 38: 9245 293568 [Ljava.security.ProtectionDomain; 39: 11484 275616 java.util.ArrayList 40: 3330 239760 org.springframework.core.annotation.AnnotationAttributes 41: 9989 239736 java.lang.ThreadLocal$ThreadLocalMap 42: 9951 238824 java.util.Collections$SynchronizedSet 43: 9922 238128 java.util.BitSet 44: 9504 228096 org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker 45: 5460 207944 [Ljava.lang.String; 46: 2459 196720 java.lang.reflect.Constructor 47: 12103 193648 java.util.HashMap$KeySet 48: 8003 192072 java.beans.MethodRef 49: 3657 175536 org.aspectj.weaver.reflect.ShadowMatchImpl 50: 10339 165424 java.util.concurrent.atomic.AtomicBoolean 51: 10281 164496 java.lang.ref.ReferenceQueue$Lock 52: 10203 163248 java.util.Collections$UnmodifiableSet 53: 2913 163128 java.beans.MethodDescriptor 54: 2547 161536 [Ljava.lang.reflect.Method; 55: 9914 158624 java.nio.channels.spi.AbstractSelector$1 56: 9913 158608 sun.nio.ch.Util$3 57: 5481 131544 org.springframework.core.MethodClassKey 58: 3707 118624 java.util.LinkedList 59: 3637 116384 org.aspectj.weaver.patterns.ExposedState 60: 2047 114632 java.lang.Class$ReflectionData 61: 1155 110880 org.springframework.beans.GenericTypeAwarePropertyDescriptor 62: 399 109984 [Ljava.lang.Thread; 63: 4387 105288 sun.reflect.generics.tree.SimpleClassTypeSignature 64: 1388 99936 java.beans.PropertyDescriptor 65: 4087 90560 [Ljava.lang.reflect.Type; 66: 5361 85776 org.springframework.core.annotation.AnnotationUtils$DefaultValueHolder 67: 1296 82944 io.netty.buffer.PoolSubpage 68: 4387 82520 [Lsun.reflect.generics.tree.TypeArgument; 69: 674 75488 org.springframework.boot.loader.jar.JarEntry 70: 1848 73920 java.util.TreeMap$Entry 71: 3061 73464 sun.reflect.annotation.AnnotationInvocationHandler 72: 2737 65688 java.util.LinkedList$Node 73: 3989 63824 sun.reflect.generics.tree.ClassTypeSignature 74: 858 61776 org.apache.ibatis.mapping.ResultMapping 75: 832 53248 org.springframework.core.MethodParameter 76: 78 51168 io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue 77: 2120 50880 java.util.Collections$UnmodifiableRandomAccessList 78: 3165 50640 java.util.LinkedHashMap$LinkedKeySet 79: 2419 49728 [Lsun.reflect.generics.tree.FieldTypeSignature;
标红的这个类org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker
很是值得怀疑,有9504个实例,从类名看是一个实现了Reactor
模式的http
客户端,了解Reactor
模式的同窗都知道,其包含了一个accept
线程处理链接事件,N
个IO bounding
的线程处理read
、write
事件,和M
个worker
线程处理业务逻辑。多线程
接下来须要查找哪里使用了这个http客户端,在build.gradle的文件里看到依赖了ES
的http client
app
group: 'org.elasticsearch.client', name: 'elasticsearch-rest-client', version: '6.3.2'
经过全局搜索,发现了以下的代码:dom
private String get_realtime_data_from_es (String seller_id) throws IOException{ Date current_date = new Date( ); SimpleDateFormat sdf = new SimpleDateFormat ("yyyyMMdd"); String date=sdf.format(current_date); RestClient restClient = RestClient.builder( new HttpHost("xxxxxxxxx", 9200, "http"), new HttpHost("xxxxxxxxx", 9201, "http")).build(); org.elasticsearch.client.Response response = restClient.performRequest("GET", "/seller/"+date+"/"+seller_id); restClient.close(); return EntityUtils.toString(response.getEntity()); }
笔者发现这里调用ES
的RestClient
的很是可疑,竟然是每次调用建立一个对象,而不是使用单例模式。elasticsearch
使用过RestTemplate
或者HttpClient
的同窗都知道,Http
客户端通常都是经过单例的方式注入到Spring
容器中,客户端都是线程安全的,多线程状况下不会出现串包的状况。(具体线程安全的实现机制,可参考笔者以前的博客:《Http请求链接池-HttpClient的AbstractConnPool源码分析》,地址:https://segmentfault.com/a/11...)
直接进到RestClientBuilder
类的build()
方法中一探究竟,代码以下:
public RestClient build() { if (failureListener == null) { failureListener = new RestClient.FailureListener(); } CloseableHttpAsyncClient httpClient = AccessController.doPrivileged(new PrivilegedAction<CloseableHttpAsyncClient>() { @Override public CloseableHttpAsyncClient run() { return createHttpClient(); } }); RestClient restClient = new RestClient(httpClient, maxRetryTimeout, defaultHeaders, hosts, pathPrefix, failureListener); httpClient.start(); return restClient; }
build
方法是实际建立RestClient
的地方,设置了超时时间、host等参数。再点进start
方法,他是抽象类CloseableHttpAsyncClient
的一个抽象方法,具体实如今实现类CloseableHttpAsyncClientBase
中,以下:
@Override public void start() { if (this.status.compareAndSet(Status.INACTIVE, Status.ACTIVE)) { if (this.reactorThread != null) { this.reactorThread.start(); } } }
其调用了成员变量reactorThread
的start
方法,而成员变量reactorThread
是一个Thread类,它在构造方法中初始化,以下:
public CloseableHttpAsyncClientBase( final NHttpClientConnectionManager connmgr, final ThreadFactory threadFactory, final NHttpClientEventHandler handler) { super(); this.connmgr = connmgr; if (threadFactory != null && handler != null) { this.reactorThread = threadFactory.newThread(new Runnable() { @Override public void run() { try { final IOEventDispatch ioEventDispatch = new InternalIODispatch(handler); connmgr.execute(ioEventDispatch); } catch (final Exception ex) { log.error("I/O reactor terminated abnormally", ex); } finally { status.set(Status.STOPPED); } } }); } else { this.reactorThread = null; } this.status = new AtomicReference<Status>(Status.INACTIVE); }
可见,每建立一个CloseableHttpAsyncClient
对象,就会建立一个reactorThread
线程,而connmgr.execute(ioEventDispatch)
是一个永久for循环执行的方法,因此线程的run方法不会主动退出,即,reactorThread
线程不会销毁。
找到bug所在后,结局方案很简单,只需将ES
的RestClient
经过单例的方式注入到服务中,便可解决堆外内存泄漏的问题。
查找bug时,须要从多个方面去定位,经过尽量多的现场信息去定量分析,监控、线上机器、源码的查看都须要,深刻下去,会有很多收获。