nm ConcurrentModificationException crash的问题

  最近线上的的nm 有crash的问题,查看错误日志:
java

2014-06-19 00:01:22,308 FATAL
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Error: Shutting downjava.util.
ConcurrentModificationException
        at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:761)
        at java.util.LinkedList$ListItr.next(LinkedList.java:696)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource.toString(LocalizedResource.java:120)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:656)
2014-06-19 00:01:22,308 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
2014-06-19 00:03:40,685 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://bipcluster/tmp/hive-hdfs/hive_2014-06-19_00-05-51_049_5891972191087895437/-mr-10004/a1495555-b0dc-4356-8b68-1c881012e123, 1403107405580, FILE, null }
2014-06-19 00:03:40,685 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
java.util.concurrent.RejectedExecutionException
        at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
        at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:152)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:618)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:514)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:456)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:128)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
        at java.lang.Thread.run(Thread.java:662)
2014-06-19 00:03:40,685 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye.

是在作resource  localize时多线程的并发更新问题致使nm异常退出
这是一个bug,bug id:
https://issues.apache.org/jira/browse/YARN-573
bug描述:
node

Shared data structures in Public Localizer and Private Localizer are not Thread safe.
PublicLocalizer
1) pending accessed by addResource (part of event handling) and run method (as a part of PublicLocalizer.run() ).
PrivateLocalizer (LocalizerRunner?)
1) pending accessed by addResource (part of event handling) and findNextResource (i.remove()). 
Also update method should be fixed. It too is sharing pending list.

控制resource localize的有两个线程
PublicLocalizer 和 LocalizerRunner,一个用来控制public文件的下载,一个用来控制private文件的下载,二者都会操做pending,fix的方法就是增长同步,这个bug已经在cdh5.2.0的yarn中fix了。
关于触发java.util.ConcurrentModificationException的异常能够参考:
apache

http://examples.javacodegeeks.com/java-basics/exceptions/java-util-concurrentmodificationexception-how-to-handle-concurrent-modification-exception/
相关文章
相关标签/搜索