Kafka 0.8.1
On December 22, 2014, the application suddenly started reporting errors:
[2014/12/22 11:52:32.738]java.net.SocketException: Too many open files
[2014/12/22 11:52:32.738] at java.net.Socket.createImpl(Socket.java:447)
[2014/12/22 11:52:32.738] at java.net.Socket.connect(Socket.java:577)
[2014/12/22 11:52:32.738] at java.net.Socket.connect(Socket.java:528)
[2014/12/22 11:52:32.738] at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
[2014/12/22 11:52:32.738] at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
[2014/12/22 11:52:32.738] at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
[2014/12/22 11:52:32.738] at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
[2014/12/22 11:52:32.738] at sun.net.www.http.HttpClient.New(HttpClient.java:308)
[2014/12/22 11:52:32.738] at sun.net.www.http.HttpClient.New(HttpClient.java:326)
[2014/12/22 11:52:32.738] at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
[2014/12/22 11:52:32.738] at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
[2014/12/22 11:52:32.738] at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
[2014/12/22 11:52:32.738] at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
[2014/12/22 11:52:32.738] at com.rbc.http.client.RbcHttpClient.callCptService(RbcHttpClient.java:50)
[2014/12/22 11:52:32.738] at com.rbc.http.client.RbcHttpClient.callCptService(RbcHttpClient.java:144)
[2014/12/22 11:52:32.738] at com.rbc.http.bean.BaseBean.call(BaseBean.java:54)
[2014/12/22 11:52:32.739] at app.package.APPClass.send2Center(APPClass.java:852)
[2014/12/22 11:52:32.739] at sun.reflect.GeneratedMethodAccessor4312.invoke(Unknown Source)
[2014/12/22 11:52:32.739] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2014/12/22 11:52:32.739] at java.lang.reflect.Method.invoke(Method.java:606)
[2014/12/22 11:52:32.739] at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)
[2014/12/22 11:52:32.739] at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:198)
[2014/12/22 11:52:32.739] at com.sun.proxy.$Proxy157.send2Center(Unknown Source)
[2014/12/22 11:52:32.739] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
[2014/12/22 11:52:32.739] at java.util.concurrent.FutureTask.run(FutureTask.java:262)
[2014/12/22 11:52:32.739] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[2014/12/22 11:52:32.739] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[2014/12/22 11:52:32.739] at java.lang.Thread.run(Thread.java:745)
We traced it to a switch that had been turned on that morning: the application began shipping its access logs to the Kafka messaging platform, but the application servers had no host mapping configured for the Kafka brokers, which triggered the problem.
To reproduce, configure the Kafka client's broker-list to a nonexistent hostname and keep sending messages, then run:
lsof -p $pid | wc -l
You will see the Java process's file descriptor count grow steadily, with a large number of socket entries among them, until it finally exceeds the limit. The limit can be checked with:
ulimit -a|grep 'open files'
Once the limit is exceeded, the process keeps throwing:
java.net.SocketException: Too many open files
The problem is thus reproduced.
However, if the broker-list is configured with a nonexistent IP address instead of a hostname, the problem cannot be reproduced.
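The leak can also be demonstrated without Kafka. Below is a minimal Scala sketch (assumptions: Linux, JDK 7, and no-such-host.invalid being an unresolvable hostname) that mimics what the 0.8.1 client turns out to do, as analyzed later in this post: open a SocketChannel, fail to connect to an unresolvable hostname, and never close the channel.

import java.net.InetSocketAddress
import java.nio.channels.{SocketChannel, UnresolvedAddressException}

object FdLeakRepro extends App {
  while (true) {
    val channel = SocketChannel.open()  // allocates a new file descriptor
    try {
      channel.connect(new InetSocketAddress("no-such-host.invalid", 9092))
    } catch {
      // mimic the bug: swallow the error without calling channel.close(),
      // leaking the descriptor allocated above
      case _: UnresolvedAddressException =>
    }
    Thread.sleep(10)  // watch `lsof -p $pid | wc -l` climb; adding channel.close() here stops the leak
  }
}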
Generally, a file descriptor is an index for an entry in a kernel-resident array data structure containing the details of open files. In POSIX this data structure is called a file descriptor table, and each process has its own file descriptor table. The process passes the file descriptor to the kernel through a system call, and the kernel will access the file on behalf of the process. The process itself cannot read or write the file descriptor table directly.
On Linux, the set of file descriptors open in a process can be accessed under the path /proc/PID/fd/, where PID is the process identifier.
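As a quick illustration, a process can enumerate its own descriptors through that path. A minimal Scala sketch (Linux only; each entry under /proc/self/fd is a symlink to the underlying file, socket, or pipe):

import java.io.File
import java.nio.file.Files

object ListOwnFds extends App {
  // print each open descriptor and what it points to
  for (fd <- new File("/proc/self/fd").listFiles())
    println(fd.getName + " -> " + Files.readSymbolicLink(fd.toPath))
}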
ulimit -a|grep 'open files'
view the per-process file descriptor limit
cat /proc/sys/fs/file-max
view the system-wide file descriptor limit
lsof -p $pid
list the file descriptors a process currently has open
ulimit -n $num
temporarily change the file descriptor limit (current shell only)
To keep Kafka trouble from affecting the application, we use Kafka's async producer: send() appends the message to an in-memory queue and returns immediately, while a separate thread continuously drains the queue and sends the messages to Kafka.
When the Kafka brokers are unreachable and messages cannot be sent, configuring queue.enqueue.timeout.ms=0 makes the producer simply drop messages once the in-memory queue is full, rather than block:
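A sketch of this producer setup against the 0.8.1 Scala client (the broker address and topic name are made up for illustration):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

object AsyncProducerExample extends App {
  val props = new Properties()
  props.put("metadata.broker.list", "kafka-host-1:9092")  // hypothetical broker-list
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  props.put("producer.type", "async")                 // enqueue to the in-memory queue, return immediately
  props.put("queue.buffering.max.messages", "10000")  // in-memory queue capacity
  props.put("queue.enqueue.timeout.ms", "0")          // 0 = drop the message when the queue is full, never block

  val producer = new Producer[String, String](new ProducerConfig(props))
  producer.send(new KeyedMessage[String, String]("access-log", "hello"))  // returns at once
  producer.close()
}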
Before sending messages, the Kafka client first connects to one of the brokers configured in broker-list to fetch topic metadata:
ERROR [2014-12-22 14:05:04,687] (Logging.scala:103) - Producer connection to $kafka_server_1:9096 unsuccessful
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:127)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:613)
at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
at kafka.producer.SyncProducer.connect(SyncProducer.scala:141)
at kafka.producer.SyncProducer.getOrMakeConnection(SyncProducer.scala:156)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:68)
at kafka.producer.SyncProducer.send(SyncProducer.scala:112)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:53)
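For context, the metadata fetch in the 0.8.1 client roughly does the following (an abridged reconstruction of kafka.client.ClientUtils.fetchTopicMetadata, not the verbatim source): it shuffles the configured brokers and tries them one by one over a short-lived SyncProducer.

// abridged reconstruction; not standalone code
val shuffledBrokers = Random.shuffle(brokers)
var fetchSucceeded = false
var i = 0
while (i < shuffledBrokers.size && !fetchSucceeded) {
  // a SyncProducer is created per attempt; its BlockingChannel opens the socket
  val producer = ProducerPool.createSyncProducer(producerConfig, shuffledBrokers(i))
  try {
    topicMetadataResponse = producer.send(topicMetadataRequest)
    fetchSucceeded = true
  } catch {
    case e: Throwable => // remember the failure and try the next broker
  } finally {
    i = i + 1
    producer.close()
  }
}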
During this step UnresolvedAddressException is thrown; the relevant part of SyncProducer is as follows:
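(The snippet below is abridged and reconstructed from the 0.8.1 source rather than copied verbatim; see SyncProducer.scala for the full code.)

// kafka.producer.SyncProducer, abridged
private def connect(): BlockingChannel = {
  if (!blockingChannel.isConnected && !shutdown) {
    try {
      blockingChannel.connect()
      info("Connected to " + config.host + ":" + config.port + " for producing")
    } catch {
      case e: Exception =>
        disconnect()
        throw e
    }
  }
  blockingChannel
}

private def disconnect() {
  try {
    // note the guard: disconnect is only propagated to the channel
    // when the channel believes it is connected
    if (blockingChannel.isConnected) {
      info("Disconnecting from " + config.host + ":" + config.port)
      blockingChannel.disconnect()
    }
  } catch {
    case e: Exception => error("Error on disconnect: ", e)
  }
}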
The relevant part of BlockingChannel:
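(Likewise abridged and reconstructed from the 0.8.1 source.)

// kafka.network.BlockingChannel, abridged
def connect() = lock synchronized {
  if (!connected) {
    channel = SocketChannel.open()  // a new file descriptor is allocated here
    channel.configureBlocking(true)
    channel.socket.setSoTimeout(readTimeoutMs)
    channel.socket.setKeepAlive(true)
    channel.socket.setTcpNoDelay(true)
    channel.connect(new InetSocketAddress(host, port))  // throws UnresolvedAddressException for
                                                        // a bad hostname; 'connected' stays false
    writeChannel = channel
    readChannel = Channels.newChannel(channel.socket().getInputStream)
    connected = true
  }
}

def disconnect() = lock synchronized {
  // can clean up even a half-open channel where 'connected' never became true
  if (connected || channel != null) {
    swallow(channel.close())
    swallow(channel.socket.close())
    if (readChannel != null) swallow(readChannel.close())
    channel = null
    readChannel = null
    writeChannel = null
    connected = false
  }
}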
The cause of the problem: in BlockingChannel.connect, the channel.connect line throws the exception, and at that point connected is still false; but SyncProducer.disconnect only calls BlockingChannel.disconnect when BlockingChannel's connected is true. As a result, BlockingChannel.connect keeps being called while BlockingChannel.disconnect never is, and connect calls SocketChannel.open each time:
SocketChannel.open() -> SelectorProvider.provider().openSocketChannel() -> new SocketChannelImpl(...) -> Net.socket(...), which allocates a new socket file descriptor
So every call to SocketChannel.open creates a file descriptor, and the descriptor count grows without bound. Why does the problem appear only with a hostname and not with an IP? The answer lies in the JDK's SocketChannelImpl.connect. When a hostname is configured, Net.checkAddress throws UnresolvedAddressException at the top of the method, before the try block is entered, so the method exits immediately with no cleanup. When an IP is configured, the connection-timeout exception is thrown from Net.connect inside the try block, so the subsequent finally and catch code does run, including readerCleanup:
readerCleanup() -> kill() -> the underlying file descriptor is closed
So even though disconnect is never called, the file descriptor is still closed, which is why configuring an IP does not lead to the too-many-open-files problem.
The fix modifies kafka.network.BlockingChannel as follows:
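(A sketch of the shape of the change described below; the patch actually committed in the JIRA may differ in detail.)

// kafka.network.BlockingChannel.connect with the fix applied, abridged
def connect() = lock synchronized {
  if (!connected) {
    try {
      channel = SocketChannel.open()
      channel.configureBlocking(true)
      channel.socket.setSoTimeout(readTimeoutMs)
      channel.socket.setKeepAlive(true)
      channel.socket.setTcpNoDelay(true)
      channel.connect(new InetSocketAddress(host, port))
      writeChannel = channel
      readChannel = Channels.newChannel(channel.socket().getInputStream)
      connected = true
    } catch {
      case e: UnresolvedAddressException =>
        disconnect()  // close the half-open channel so its descriptor is released
        throw e
    }
  }
}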
With the try-catch added and disconnect called when UnresolvedAddressException occurs, the problem is solved.
This is the issue I filed on the Kafka JIRA after finding the problem: https://issues.apache.org/jira/browse/KAFKA-1832