问题描述
最近通知应用在近三个月内出现过2次DNS缓存的问题,第一次在重启以后一直没有出现过问题,因此也没有去重视,可是最近又出现过一次,看来颇有必要完全排查一次;具体的错误日志以下:java
2018-03-16 18:53:59,501 ERROR [DefaultMessageListenerContainer-1] (com.bill99.asap.service.CryptoClient.seal(CryptoClient.java:34))- null java.lang.NullPointerException at java.net.InetAddress$Cache.put(InetAddress.java:779) ~[?:1.7.0_79] at java.net.InetAddress.cacheAddresses(InetAddress.java:858) ~[?:1.7.0_79] at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1334) ~[?:1.7.0_79] at java.net.InetAddress.getAllByName0(InetAddress.java:1248) ~[?:1.7.0_79] at java.net.InetAddress.getAllByName(InetAddress.java:1164) ~[?:1.7.0_79] at java.net.InetAddress.getAllByName(InetAddress.java:1098) ~[?:1.7.0_79] at java.net.InetAddress.getByName(InetAddress.java:1048) ~[?:1.7.0_79] at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220) ~[?:1.7.0_79] at sun.net.NetworkClient.doConnect(NetworkClient.java:180) ~[?:1.7.0_79] at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) ~[?:1.7.0_79] at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) ~[?:1.7.0_79] at sun.net.www.http.HttpClient.<init>(HttpClient.java:211) ~[?:1.7.0_79] at sun.net.www.http.HttpClient.New(HttpClient.java:308) ~[?:1.7.0_79] at sun.net.www.http.HttpClient.New(HttpClient.java:326) ~[?:1.7.0_79] at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:997) ~[?:1.7.0_79] at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:933) ~[?:1.7.0_79] at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:851) ~[?:1.7.0_79] at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1092) ~[?:1.7.0_79] at org.springframework.ws.transport.http.HttpUrlConnection.getRequestOutputStream(HttpUrlConnection.java:81) ~[spring-ws-core.jar:1.5.6] at org.springframework.ws.transport.AbstractSenderConnection$RequestTransportOutputStream.createOutputStream(AbstractSenderConnection.java:101) ~[spring-ws-core.jar:1.5.6] at org.springframework.ws.transport.TransportOutputStream.getOutputStream(TransportOutputStream.java:41) ~[spring-ws-core.jar:1.5.6] at org.springframework.ws.transport.TransportOutputStream.write(TransportOutputStream.java:60) ~[spring-ws-core.jar:1.5.6]
具体表现就是出现此异常以后连续的出现大量此异常,同时系统节点不可用;git
问题分析
1.既然InetAddress$Cache.put报空指针,那就具体看一下源代码:github
if (policy != InetAddressCachePolicy.FOREVER) { // As we iterate in insertion order we can // terminate when a non-expired entry is found. LinkedList<String> expired = new LinkedList<>(); long now = System.currentTimeMillis(); for (String key : ) { CacheEntry entry = cache.get(key); if (entry.expiration >= 0 && entry.expiration < now) { expired.add(key); } else { break; } } for (String key : expired) { cache.remove(key); } }
报空指针的的地方就是entry.expiration,也就是说从cache取出来的entry为null,能够查看cache写入的地方:spring
CacheEntry entry = new CacheEntry(addresses, expiration); cache.put(host, entry);
每次都是new一个CacheEntry而后再put到cache中,不会写入null进去;此时猜想是多线程引起的问题,cache.keySet()在遍历的时候同时也进行了remove操做,致使cache.get(key)到一个空值,查看源代码能够发现一共有两次对cache进行remove的地方,分别是put方法和get方法,put方法代码如上,每次在遍历的时候检测是否过时,而后统一进行remove操做;还有一处就是get方法,代码以下:缓存
public CacheEntry get(String host) { int policy = getPolicy(); if (policy == InetAddressCachePolicy.NEVER) { return null; } CacheEntry entry = cache.get(host); // check if entry has expired if (entry != null && policy != InetAddressCachePolicy.FOREVER) { if (entry.expiration >= 0 && entry.expiration < System.currentTimeMillis()) { cache.remove(host); entry = null; } } return entry; }
相似put方法也是每次在get的时候进行有效期检测,而后进行remove操做;
因此若是出现多线程问题大概就是:1.同时调用put,get方法,2.多个线程都调用put方法;继续查看源码调用put和get的地方,一共有三处分别是:安全
private static void cacheInitIfNeeded() { assert Thread.holdsLock(addressCache); if (addressCacheInit) { return; } unknown_array = new InetAddress[1]; unknown_array[0] = impl.anyLocalAddress(); addressCache.put(impl.anyLocalAddress().getHostName(), unknown_array); addressCacheInit = true; } /* * Cache the given hostname and addresses. */ private static void cacheAddresses(String hostname, InetAddress[] addresses, boolean success) { hostname = hostname.toLowerCase(); synchronized (addressCache) { cacheInitIfNeeded(); if (success) { addressCache.put(hostname, addresses); } else { negativeCache.put(hostname, addresses); } } } /* * Lookup hostname in cache (positive & negative cache). If * found return addresses, null if not found. */ private static InetAddress[] getCachedAddresses(String hostname) { hostname = hostname.toLowerCase(); // search both positive & negative caches synchronized (addressCache) { cacheInitIfNeeded(); CacheEntry entry = addressCache.get(hostname); if (entry == null) { entry = negativeCache.get(hostname); } if (entry != null) { return entry.addresses; } } // not found return null; }
cacheInitIfNeeded只在cacheAddresses和getCachedAddresses方法中被调用,用来检测cache是否已经被初始化了;而另外两个方法都加了对象锁addressCache,因此不会多线程问题;多线程
2.猜想外部直接调用了addressCache,没有使用内部提供的方法
查看源码能够发现addressCache自己是私有属性,也不存在对外的访问方法dom
private static Cache addressCache = new Cache(Cache.Type.Positive);
那业务代码中应该也不能直接使用,除非使用反射的方式,随手搜了一下全局代码查看关键字”addressCache”,搜到了相似以下代码:ide
static{ Class clazz = java.net.InetAddress.class; final Field cacheField = clazz.getDeclaredField("addressCache"); cacheField.setAccessible(true); final Object o = cacheField.get(clazz); Class clazz2 = o.getClass(); final Field cacheMapField = clazz2.getDeclaredField("cache"); cacheMapField.setAccessible(true); final Map cacheMap = (Map)cacheMapField.get(o); }
经过反射的方式获取了addressCache对象,而后又获取了cache对象(cache是一个LinkedHashMap),同时提供了一个相似以下的方法:this
public class TEst { public static void main(String[] args) throws IOException, InterruptedException { final LinkedHashMap<Integer, HH> map = new LinkedHashMap<>(); new Thread(new Runnable() { @Override public void run() { for (int i = 0; i < 2000; i++) { map.put(new Random().nextInt(1000), new HH(new Random(100).nextInt())); } } }).start(); for (int i = 0; i < 100; i++) { new Thread(new Runnable() { @Override public void run() { for (int i = 0; i < 500; i++) { map.remove(new Random().nextInt(1000)); } } }).start(); } Thread.sleep(2000); System.out.println("size=" + map.keySet().size() + "," + map.keySet()); for (Integer s : map.keySet()) { System.out.println(map.get(s)); } } } class HH { private int k; public HH(int k) { this.k = k; } public int getK() { return k; } public void setK(int k) { this.k = k; } }
模拟单线程put操做,业务端会有多条线程同时remove操做,执行看输出结果(能够执行屡次看结果):
size=0,[121, 517, 208] null null null
能够发现会出现猜想的状况,HashMap中的size属性自己不是线程安全的,因此多线程的状况下有可能出现0,这样致使get方法获取都为null,固然HashMap还有不少其余的多线程问题,由于HashMap也不是为多线程准备的,至此大概了解了缘由。
问题解决
给反射获取的cache对象加上和cacheAddresses方法一样的锁,或者直接不在业务代码中处理cache对象;能够借鉴一下阿里在github开源的操做dns缓存的项目:https://github.com/alibaba/ja...
总结本次排查问题花了一些时间在排查是否是jdk提供的类是否是有bug,这实际上是有些浪费时间的;还有就是在排查问题中不要放过任何一种可能每每问题就发生在那些理所固然的地方。