In production, a worker thread terminated abnormally. The log first showed several SocketTimeoutException entries, then suddenly a ClassCastException:
    redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: Read timed out
    ...
    java.lang.ClassCastException: [B cannot be cast to java.lang.Long
        at redis.clients.jedis.Connection.getIntegerReply(Connection.java:208)
        at redis.clients.jedis.Jedis.sismember(Jedis.java:1307)
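By simulating a network failure locally, I eventually reproduced the production exception. Further analysis (forming hypotheses, then verifying them) revealed the cause. See the following sample code: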
    JedisPool pool = ...;
    Jedis jedis = pool.getResource();
    String value = jedis.get("foo");
    System.out.println("Make SocketTimeoutException");
    System.in.read(); // pause here while the SocketTimeoutException condition is set up
    try {
        value = jedis.get("foo");
        System.out.println(value);
    } catch (JedisConnectionException e) {
        e.printStackTrace();
    }
    System.out.println("Recover from SocketTimeoutException");
    System.in.read(); // pause here while the network is restored
    Thread.sleep(5000); // sleep a little longer so the network recovers completely
    boolean isMember = jedis.sismember("urls", "baidu.com");
And the log output:
    bar
    Make SocketTimeoutException
    redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: Read timed out
    Recover from SocketTimeoutException
        at redis.clients.util.RedisInputStream.ensureFill(RedisInputStream.java:210)
        at redis.clients.util.RedisInputStream.readByte(RedisInputStream.java:47)
        at redis.clients.jedis.Protocol.process(Protocol.java:131)
        at redis.clients.jedis.Protocol.read(Protocol.java:196)
        at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:283)
        at redis.clients.jedis.Connection.getBinaryBulkReply(Connection.java:202)
        at redis.clients.jedis.Connection.getBulkReply(Connection.java:191)
        at redis.clients.jedis.Jedis.get(Jedis.java:101)
        at com.tcl.recipevideohunter.JedisTest.main(JedisTest.java:23)
    Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.net.SocketInputStream.read(SocketInputStream.java:108)
        at redis.clients.util.RedisInputStream.ensureFill(RedisInputStream.java:204)
        ... 8 more
    Exception in thread "main" java.lang.ClassCastException: [B cannot be cast to java.lang.Long
        at redis.clients.jedis.Connection.getIntegerReply(Connection.java:208)
        at redis.clients.jedis.Jedis.sismember(Jedis.java:1307)
        at com.tcl.recipevideohunter.JedisTest.main(JedisTest.java:32)
Analysis:
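When get("foo") ran the second time, the network timed out and the get foo command was not actually delivered. By the time sismember ran, the network had recovered, and because the same jedis instance was used, the earlier get foo command (still sitting in the output buffer) was sent along with it. The first reply to come back on that connection therefore belongs to get foo, not to sismember.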
The order of execution was as follows:
    127.0.0.1:9379> get foo
    "bar"
    127.0.0.1:9379> sismember urls baidu.com
    (integer) 1
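So the final sismember in the sample code above received the reply to get foo, a string (the byte array shown as [B in the stack trace), whereas sismember expects a Long. Hence the ClassCastException.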
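The mismatch between commands and replies can be reproduced without Redis at all. The following is a minimal sketch, not Jedis code: FakeConnection, DesyncDemo, and their methods are hypothetical names invented for illustration, and the "server" is just an in-memory queue that hands back replies in the order the commands were issued.

```java
import java.net.SocketTimeoutException;
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical stand-in for one pooled connection: replies are read strictly in send order. */
class FakeConnection {
    private final Queue<Object> unreadReplies = new ArrayDeque<>();
    private boolean networkDown = false;

    void setNetworkDown(boolean down) { networkDown = down; }

    /** Assumption: the server answers eventually, and the reply waits until someone reads it. */
    private void send(Object serverReply) { unreadReplies.add(serverReply); }

    /** Reads the OLDEST unread reply, whichever command it actually belongs to. */
    private Object read() throws SocketTimeoutException {
        if (networkDown) {
            throw new SocketTimeoutException("Read timed out");
        }
        return unreadReplies.remove();
    }

    String get(String key) throws SocketTimeoutException {
        send("bar".getBytes());           // bulk reply: a byte[], printed as "[B" by the JVM
        return new String((byte[]) read());
    }

    boolean sismember(String set, String member) throws SocketTimeoutException {
        send(Long.valueOf(1L));           // integer reply
        return ((Long) read()) == 1L;     // the cast fails if a stale byte[] is first in line
    }
}

class DesyncDemo {
    /** Returns the simple name of the exception thrown by sismember after a timed-out get. */
    static String run() {
        FakeConnection conn = new FakeConnection();
        conn.setNetworkDown(true);
        try {
            conn.get("foo");
        } catch (SocketTimeoutException expected) {
            // The reply to "get foo" is still unread on the connection.
        }
        conn.setNetworkDown(false);
        try {
            conn.sismember("urls", "baidu.com");
            return "none";
        } catch (ClassCastException | SocketTimeoutException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "ClassCastException"
    }
}
```

The point of the sketch is that nothing ties a reply to the command that caused it except its position in the stream, so one unread reply shifts every later read by one.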
Why did this happen in production? Because the Redis logic there looked roughly like this:
    while (true) {
        Jedis jedis = null;
        try {
            jedis = pool.getResource();
            // some redis operation here.
        } catch (Exception e) {
            logger.error(e);
        } finally {
            pool.returnResource(jedis);
        }
    }
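When the network fails, pool.returnResource(jedis) still succeeds, so the instance goes back into the pool (jedis is not null at that point). Once the network recovers, and because this is a multithreaded environment, some other thread later obtains the same Jedis instance from pool.getResource(). If the return type of that thread's Jedis operation does not match the return type of the first operation that failed on this instance during the outage (for example, one is get and the other is sismember), a ClassCastException is thrown.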
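And that is the lucky case. If both operations return the same type (for example lpop("queue_order_pay_failed") and lpop("queue_order_pay_success")), no exception is thrown and the caller silently receives the other command's data. I dare not imagine the consequences.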
For example, insert a get("nonexist-key") before the sismember in the sample code above (the key does not exist in Redis, so the call should return null):
    value = jedis.get("nonexist-key");
    System.out.println(value);
    boolean isMember = jedis.sismember("urls", "baidu.com");
    System.out.println(isMember);
The actual log output is:
    bar
    Exception in thread "main" java.lang.NullPointerException
        at redis.clients.jedis.Jedis.sismember(Jedis.java:1307)
        at com.tcl.recipevideohunter.JedisTest.main(JedisTest.java:37)
Analysis:
get("nonexist-key") received the reply to the earlier get("foo"), and sismember received the reply to get("nonexist-key"). That reply is null, so this time a NullPointerException is thrown.
Solution: do not call returnResource unconditionally. A more robust, reliable, and elegant way to handle it is shown below:
    while (true) {
        Jedis jedis = null;
        boolean broken = false;
        try {
            jedis = jedisPool.getResource();
            return jedisAction.action(jedis); // template method
        } catch (JedisException e) {
            broken = handleJedisException(e);
            throw e;
        } finally {
            closeResource(jedis, broken);
        }
    }

    /**
     * Handle jedisException, write log and return whether the connection is broken.
     */
    protected boolean handleJedisException(JedisException jedisException) {
        if (jedisException instanceof JedisConnectionException) {
            logger.error("Redis connection " + jedisPool.getAddress() + " lost.", jedisException);
        } else if (jedisException instanceof JedisDataException) {
            if ((jedisException.getMessage() != null) && (jedisException.getMessage().indexOf("READONLY") != -1)) {
                logger.error("Redis connection " + jedisPool.getAddress() + " are read-only slave.", jedisException);
            } else {
                // dataException, isBroken=false
                return false;
            }
        } else {
            logger.error("Jedis exception happen.", jedisException);
        }
        return true;
    }

    /**
     * Return the jedis connection to the pool, calling a different return method depending on the conectionBroken status.
     */
    protected void closeResource(Jedis jedis, boolean conectionBroken) {
        try {
            if (conectionBroken) {
                jedisPool.returnBrokenResource(jedis);
            } else {
                jedisPool.returnResource(jedis);
            }
        } catch (Exception e) {
            logger.error("return back jedis failed, will force close the jedis.", e);
            JedisUtils.destroyJedis(jedis);
        }
    }
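The effect of the two return paths can be seen in a small, self-contained sketch. This is not the Jedis pool: ToyConn, ToyPool, ToyTemplate, and ToyTemplateDemo are hypothetical names, and the sketch makes the simplifying assumption that every runtime failure marks the connection as broken, whereas the code above distinguishes data errors from connection errors.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Function;

/** Hypothetical pooled connection; the id only lets the demo tell instances apart. */
class ToyConn {
    final int id;
    ToyConn(int id) { this.id = id; }
}

/** Hypothetical pool with the two return paths described in the text. */
class ToyPool {
    private final Deque<ToyConn> idle = new ArrayDeque<>();
    private int nextId = 1;

    ToyConn getResource() {
        return idle.isEmpty() ? new ToyConn(nextId++) : idle.pop();
    }
    void returnResource(ToyConn c) { idle.push(c); }          // healthy: may be reused
    void returnBrokenResource(ToyConn c) { /* discarded: never handed out again */ }
}

class ToyTemplate {
    private final ToyPool pool;
    ToyTemplate(ToyPool pool) { this.pool = pool; }

    /** Borrow, run the action, and release exactly once through the matching path. */
    <T> T execute(Function<ToyConn, T> action) {
        ToyConn conn = pool.getResource();
        boolean broken = false;
        try {
            return action.apply(conn);
        } catch (RuntimeException e) {
            broken = true;    // simplification: treat every failure as a broken connection
            throw e;
        } finally {
            if (broken) {
                pool.returnBrokenResource(conn);
            } else {
                pool.returnResource(conn);
            }
        }
    }
}

class ToyTemplateDemo {
    /** Returns the connection ids seen by three consecutive calls; the second call fails. */
    static int[] run() {
        ToyTemplate template = new ToyTemplate(new ToyPool());
        Integer first = template.execute(c -> c.id);           // new connection, returned healthy
        int[] failedId = new int[1];
        try {
            template.execute(c -> {                            // same connection, then it "breaks"
                failedId[0] = c.id;
                throw new IllegalStateException("simulated timeout");
            });
        } catch (IllegalStateException expected) {
            // The connection went back through returnBrokenResource.
        }
        Integer third = template.execute(c -> c.id);           // a fresh connection
        return new int[] { first, failedId[0], third };
    }

    public static void main(String[] args) {
        int[] ids = run();
        System.out.println(ids[0] + " " + ids[1] + " " + ids[2]); // prints "1 1 2"
    }
}
```

The third call never sees the connection that failed, so no stale reply can reach it. As far as I know, later Jedis releases also let a Jedis instance be used in try-with-resources, with close() handing the connection back to its pool, but that should be checked against the version actually in use.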
Addendum:
Simulating a Redis network timeout locally on Ubuntu:
sudo iptables -A INPUT -p tcp --dport 6379 -j DROP
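Restoring the network (note that -F flushes every rule in the filter table, not just the one added above):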
sudo iptables -F
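Where changing firewall rules is not an option, the read timeout on its own can also be reproduced in-process. The sketch below is a different technique from the iptables approach and uses only the JDK: a local server socket accepts the connection into its backlog but never replies. The class name ReadTimeoutDemo and the 200 ms timeout are arbitrary choices for illustration, and the sketch covers only the timeout, not the later recovery.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutDemo {
    /** Returns true if reading from a silent server ends in SocketTimeoutException. */
    static boolean timesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port. The connection is completed by the
        // kernel and queued in the backlog; this code never accepts it or writes a reply.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            client.setSoTimeout(timeoutMillis);               // read timeout in milliseconds
            client.getOutputStream().write("GET foo\r\n".getBytes());
            client.getOutputStream().flush();
            InputStream in = client.getInputStream();
            try {
                in.read();                                    // blocks: no reply will arrive
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(timesOut(200)); // prints "true"
    }
}
```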
Addendum:
If the Jedis logic looks like the following:
    Jedis jedis = null;
    try {
        jedis = jedisSentinelPool.getResource();
        return jedis.get(key);
    } catch (JedisConnectionException e) {
        jedisSentinelPool.returnBrokenResource(jedis);
        logger.error("", e);
        throw e;
    } catch (Exception e) {
        logger.error("", e);
        throw e;
    } finally {
        jedisSentinelPool.returnResource(jedis);
    }
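Once a JedisConnectionException occurs (for example on a network failure), returnBrokenResource runs first and the jedis instance is destroyed. Control then enters finally, where returnResource runs on the same instance and fails with: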
    redis.clients.jedis.exceptions.JedisException: Could not return the resource to the pool
        at redis.clients.util.Pool.returnResourceObject(Pool.java:65)
        at redis.clients.jedis.JedisSentinelPool.returnResource(JedisSentinelPool.java:221)
Temporary workaround:
    jedisSentinelPool.returnBrokenResource(jedis);
    jedis = null; // returnResource in finally now does no actual work
This is not recommended, though. The more rigorous way to release the resource is the one described earlier.