线上内存溢出问题排查案例

1.遇到问题?java

如今问题反馈加固服务使用一段时间以后,前置向平台拉取策略失败,排查加固日志服务看着是正常运行,有些error日志只是socket连接断开的错误,操做管理平台页面发现页面功能没法使用,重启服务后恢复正常,由此判断服务出现宕机假死现象。mysql

2.处理思路?sql

由于根据日志已经没法肯定具体形成服务宕机的具体缘由,对于目前遇到的状况猜想服务中发生了死锁或是内存溢出,因而在问题发生时抓取了服务中心异常线程的堆栈信息和内存快照分析问题。数据库

3.实例说明?apache

a.查看Java进程 jps -lvm服务器

[root@xdja ~]# jps -lvm网络

25501 org.apache.catalina.startup.Bootstrap start -=/usr/local/oldTomcat-topsec/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djdk.tls.ephemeralDHKeySize=2048 -Djava.endorsed.dirs=/usr/local/oldTomcat-topsec/endorsed -Dcatalina.base=/usr/local/oldTomcat-topsec -Dcatalina.home=/usr/local/oldTomcat-topsec -Djava.io.tmpdir=/usr/local/oldTomcat-topsec/tempjvm

注意:socket

现场服务器不少都是使用OpenJdk不支持以下命令,能够经过下载解压jdk到bin目录下执行命令便可。工具

b.查看Java进程服务下各个线程的cpu使用状况 top -Hp pid

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

25394 root 20 0 12.393g 964220 14248 S 0.0 3.0 0:00.00 java

25395 root 20 0 12.393g 964220 14248 S 0.0 3.0 0:00.46 java

25396 root 20 0 12.393g 964220 14248 S 0.0 3.0 0:00.08 java

25397 root 20 0 12.393g 964220 14248 S 0.0 3.0 0:00.09 java

25398 root 20 0 12.393g 964220 14248 S 0.0 3.0 0:00.08 java

 

c.找到cup占用高的线程pid转为16进制的新的pid(好比是:0x7ab9)

printf "%x\n" pid =16进制的新的pid

d.查看到该线程的堆栈信息 jstack 服务进程id |grep 0x7ab9 -A 100>aa.txt将指定异常线程信息输出到文件中

或者jstack pid>aa.txt输出整个项目进程信息

2020-05-20 09:39:13 Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.71-b01 mixed mode):

"Thread-1467" prio=10 tid=0x00007f022c003000 nid=0x7b3e runnable [0x00007f0286475000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read1(BufferedInputStream.java:275) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) - locked <0x00000007ff931fa0> (a java.io.BufferedInputStream) at com.xdja.topsec.socket.server.BSThread.run(BSThread.java:46)

"Attach Listener" daemon prio=10 tid=0x00007f024c001000 nid=0x6c4e waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"topsecTaskExecutor-10" prio=10 tid=0x00007f0188003000 nid=0x56cf waiting on condition [0x00007f0284359000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

  • parking to wait for <0x00000006c13ddd18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)

     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)

    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1085) 

    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807) 

    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

    "topsecTaskExecutor-9" prio=10 tid=0x00007f01f0137800 nid=0x56c7 waiting on condition [0x00007f0284657000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

  • parking to wait for <0x00000006c13ddd18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)

    •  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) 

      at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1079) 

      at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)

       at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) 

      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

       at java.lang.Thread.run(Thread.java:745)

"topsecTaskExecutor-8" prio=10 tid=0x00007f0188001800 nid=0x56be waiting on condition [0x00007f0284758000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method)

- parking to wait for  <0x00000006c13ddd18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)

at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)

at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)

at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1085)

at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)

at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

 

"Abandoned connection cleanup thread" daemon prio=10 tid=0x00007f029c3d5000 nid=0x563e in Object.wait() [0x00007f0286576000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method)

waiting on <0x00000006c17e04b0> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)

  • locked <0x00000006c17e04b0> (a java.lang.ref.ReferenceQueue$Lock) at com.mysql.jdbc.AbandonedConnectionCleanupThread.run(AbandonedConnectionCleanupThread.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

"Finalizer" daemon prio=10 tid=0x00007f029c077800 nid=0x5629 in Object.wait() [0x00007f0295320000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method)

waiting on <0x00000006c0010e80> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)

  • locked <0x00000006c0010e80> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

注释:

1.死锁,Deadlock(重点关注)

2.执行中,Runnable

3.等待资源,Waiting on condition(重点关注)

4.等待获取监视器,Waiting on monitor entry(重点关注)

5.暂停,Suspended

6.对象等待中,Object.wait() 或 TIMED_WAITING

7.阻塞,Blocked(重点关注)

8.中止,Parked

 

含义以下所示:

Deadlock:死锁线程,通常指多个线程调用间,进入相互资源占用,致使一直等待没法释放的状况。

Runnable:通常指该线程正在执行状态中,该线程占用了资源,正在处理某个请求,有可能正在传递SQL到数据库执行,有可能在对某个文件操做,有可能进行数据类型等转换。

Waiting on condition:等待资源,或等待某个条件的发生。具体缘由需结合 stacktrace来分析。 一种状况是网络很是忙,几乎消耗了全部的带宽,仍然有大量数据等待网络读写; 另外一种状况也多是网络空闲,但因为路由等问题,致使包没法正常的到达。
若是堆栈信息明确是应用代码,则证实该线程正在等待资源。通常是大量读取某资源,且该资源采用了资源锁的状况下,线程进入等待状态,等待资源的读取。 又或者,正在等待其余线程的执行等。

若是发现有大量的线程都在处在 Wait on condition,从线程 stack看,正等待网络读写,这多是一个网络瓶颈的征兆。由于网络阻塞致使线程没法执行。 另一种出现 Wait on condition的常见状况是该线程在 sleep,
等待 sleep的时间到了时候,将被唤醒。 Blocked:线程阻塞,是指当前线程执行过程当中,所须要的资源长时间等待却一直未能获取到,被容器的线程管理器标识为阻塞状态,能够理解为等待资源超时的线程。

Waiting for monitor entry 和 in Object.wait():Monitor 是 Java中用以实现线程之间的互斥与协做的主要手段,它能够当作是对象或者 Class的锁。每个对象都有,也仅有一个 monitor。从下图1中能够看出,每一个 Monitor在某个时刻,
只能被一个线程拥有,该线程就是 “Active Thread”,而其它线程都是 “Waiting Thread”,分别在两个队列 “ Entry Set”和 “Wait Set”里面等候。在 “Entry Set”中等待的线程状态是 “Waiting for monitor entry”,
而在 “Wait Set”中等待的线程状态是 “in Object.wait()”。

"Abandoned connection cleanup thread" daemon prio=10 tid=0x00007f029c3d5000 nid=0x563e in Object.wait() [0x00007f0286576000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method)

  • 1)waiting on <0x00000006c17e04b0> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)

    • locked <0x00000006c17e04b0> (a java.lang.ref.ReferenceQueue$Lock) at com.mysql.jdbc.AbandonedConnectionCleanupThread.run(AbandonedConnectionCleanupThread.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

    • 1)线程状态是Blocked,阻塞状态。说明线程等待资源超时!

    2)“waiting to lock <0x00000000acf4d0c0>”指,线程在等待给这个 0x00000000acf4d0c0 地址上锁3)在

    3)“waiting for monitor entry”说明此线程经过 synchronized(obj) {……} 申请进入了临界区,从而进入了下图1中的“Entry Set”队列,但该 obj 对应的 monitor 被其余线程拥有,因此本线程在 Entry Set 队列中等待。

    4)第一行里,"RMI TCP Connection(267865)-172.16.5.25"是 Thread Name 。tid指Java Thread id。nid指native线程的id。prio是线程优先级。[0x00007fd4f8684000]是线程栈起始地址。

    综合示范二:Waiting on condition 和 TIMED_WAITING

实例以下:

Thread-6" prio=10 tid=0x00007f029cc66800 nid=0x5634 waiting on condition [0x00007f02873e3000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at com.xdja.topsec.socket.handler.SysCsHandler.sendAllUrlOfAllSn(SysCsHandler.java:40) at com.xdja.topsec.socket.server.SysCsThread.run(SysCsThread.java:24)1)

“TIMED_WAITING (parking)”中的 timed_waiting 指等待状态,但这里指定了时间,自身项目中多处线程用到了sleep。

 

“TIMED_WAITING (on object monitor)”,对于本例而言,是由于本线程调用了 java.lang.Object.wait(long timeout) 而进入等待状态。

“Wait Set”中等待的线程状态就是“in Object.wait() ”。 当线程得到了 Monitor,进入了临界区以后,若是发现线程继续运行的条件没有知足,它则调用对象(通常就是被 synchronized 的对象)的 wait() 方法,放弃了 Monitor,进入 “Wait Set”队列。只有当别的线程在该对象上调用了 notify() 或者 notifyAll() ,“ Wait Set”队列中线程才获得机会去竞争,可是只有一个线程得到对象的 Monitor,恢复到运行态。

对于以上信息排查没有发现服务中出现死锁现象因而转向内存使用排查。

 

5.查看内存占用高的线程快照信息

获取JVM参数配置及当前使用状况信息:jmap -heap pid

using thread-local object allocation. Parallel GC with 8 thread(s)

 

Heap Configuration:

MinHeapFreeRatio = 0

MaxHeapFreeRatio = 100

MaxHeapSize = 5368709120 (5120.0MB)

NewSize = 1310720 (1.25MB)

MaxNewSize = 17592186044415 MB

OldSize = 5439488 (5.1875MB)

NewRatio = 2

SurvivorRatio = 8

PermSize = 134217728 (128.0MB)

MaxPermSize = 268435456 (256.0MB)

G1HeapRegionSize = 0 (0.0MB)

 

Heap Usage:

PS Young Generation

Eden Space:

capacity = 1783627776 (1701.0MB)

used = 1783627776 (1701.0MB)

free = 0 (0.0MB) 100.0% used

 

From Space:

capacity = 3145728 (3.0MB)

used = 0 (0.0MB)

free = 3145728 (3.0MB) 0.0%used

 

To Space:

capacity = 3145728 (3.0MB)

used = 0 (0.0MB)

free = 3145728 (3.0MB) 0.0% used

 

PS Old Generation

capacity = 3578789888 (3413.0MB)

used = 3578654856 (3412.871223449707MB)

free = 135032 (0.12877655029296875MB) 99.99622688103449%used

 

PS Perm Generation

capacity = 134217728 (128.0MB)

used = 52435368 (50.006263732910156MB)

free = 81782360 (77.99373626708984MB)

39.06739354133606% used

27553 interned Strings occupying 3230968 bytes.

由此能够看出Eden已经内存使用100%这样就会把内存较大的对象放入Old Generation

当old区使用到峰值时就致使了Fgc会致使整个业务没法正常使用。此时能够经过调整yong区的

参数通常是占用1/3jvm内存,这样能够减小fgc的次数和时间提升系统稳定性。

 

获取内存中全部实例及大小:jmap -histo pid

num     #instances     #bytes class name

1: 6092078 5181006784 [B

2: 6057175 145372200 java.util.concurrent.ConcurrentLinkedQueue$Node

3: 91395 13623944 <constMethodKlass>

4: 129628 11938344 [C

5: 91395 11710064 <methodKlass>

6: 8119 9829080 <constantPoolKlass>

7: 8119 5940624 <instanceKlassKlass>

8: 6713 5421824 <constantPoolCacheKlass>

9: 125684 3016416 java.lang.String

10: 4207 2198744 <methodDataKlass>

11: 44205 2121840 org.jfaster.mango.parser.Token
12: 20820 1665600 java.lang.reflect.Method
13: 32642 1305680 java.util.LinkedHashMap$Entry
14: 11385 990832 [Ljava.util.HashMap$Entry;
15: 17313 969528 org.jfaster.mango.parser.ASTBlank
16: 17308 969248 org.jfaster.mango.parser.ASTText
17: 29001 928032 java.util.HashMap$Entry
18: 8726 850248 java.lang.Class
19: 26414 845248 java.util.concurrent.ConcurrentHashMap$HashEntry
20: 11436 738272 [S
21: 22050 705600 org.jfaster.mango.stat.atomic.LongAdder
 
能够根据内存中的实例对象和占用的内存信息做为参考来优化本身的业务代码
获取内存快照信息:jmap -dump:format=b,file=tmp.hprof pid
a.获取到内存快照文件后能够经过Java自带的jvisualvm.exe导入tmp.hprof文件查看相关信息
推荐使用IBM HeapAnalyzer工具排查
启动IBM HeapAnalyzer:java -Xmx2g -jar .\ha457.jar
-Xmx2g为启动时分配内存参数
HeapAnalyzer启动后,经过菜单File->Open打开生成的dump文件tmp.hprof。
dump文件加载后,经过菜单Analysis->Tree View展示出来。
经过dump信息分析内存泄漏的对象。主要思路是堆内存占用比例,比例越大说明堆内存消耗越多。
这个例子很简单,经过以下图很快就可定位出内存泄漏的地方。

由此能够看出java/util/concurrent/ConcurrentLinkedQueue$Node
ConcurrentLinkedQueue对象占用了5个多G的内存已经占用了99%的Java内存这样就会致使
发生Fgc或者内存溢出致使整个系统假死。
6.解决方案
优化业务代码减小内存开销
使用Redis中间件替换队列