A Brief Analysis of Oracle's Dead Connection Detection

Date: 2019-11-16

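In complex application environments we often run into tricky and interesting problems: network failures (packet loss, a dropped wireless link), client program failures (for example an application crash), an operating system blue screen, or a client machine losing power, hanging, or rebooting. In all of these cases the database connection is broken without having been closed properly and without the transaction having been committed. When that happens, is your uncommitted transaction rolled back in the database? If so, what triggers the rollback, and how long does it take before the rollback is triggered (as opposed to how long the rollback itself takes)? What if the session was only running a query? Does Oracle provide a mechanism to deal with these situations? If you can already answer all of these questions, you can skip the rest of this article. Before getting to the theory, let's build some test cases and see what happens; experiments make the abstract theory easier to understand.

Let's first test what happens when a database session exits normally. From a client I connect to the database with SQL*Plus, run an UPDATE statement without committing, and then exit (note: in the experiment, the exit is issued only after some information has been queried on the server side), as shown below: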

 
  SQL> select * from v$mystat where rownum=1;

         SID STATISTIC#      VALUE
  ---------- ---------- ----------
         196          0          0

  SQL> select sid,serial# from v$session where sid=196;

         SID    SERIAL#
  ---------- ----------
         196          9

  SQL> update scott.dept set loc='CHICAGO' where deptno=40;

  1 row updated.

  SQL> exit    -- run only after the server-side queries have been made

On the server side, we look at some information about session (196, 9):

 
  SQL> set linesize 1200
  SQL> select sid, seconds_in_wait, event from v$session_wait where sid=196;

         SID SECONDS_IN_WAIT EVENT
  ---------- --------------- -------------------------------------------------
         196              33 SQL*Net message from client

  SQL> SELECT B.USERNAME
    2         ,B.SID
    3         ,B.SERIAL#
    4         ,LOGON_TIME
    5         ,A.OBJECT_ID
    6         ,A.LOCKED_MODE
    7  FROM   V$LOCKED_OBJECT A,
    8         V$SESSION B
    9  WHERE  A.SESSION_ID = B.SID
   10  ORDER  BY B.LOGON_TIME;

  USERNAME                 SID    SERIAL# LOGON_TIM  OBJECT_ID LOCKED_MODE
  ----------------- ---------- ---------- --------- ---------- -----------
  TEST                     196          9 01-DEC-16      73199           3

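From the output above, session 196 holds a lock on table SCOTT.DEPT (object ID 73199) in row exclusive (RX, Row-X) mode. We then run exit on the client without committing the UPDATE and check the session again on the server. As shown below, after a normal exit the session is gone and its lock is released immediately. Note that SQL*Plus EXIT commits pending changes by default, so the change is rolled back only if you leave with EXIT ROLLBACK; either way, the session's resources are reclaimed at once.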

 
  SQL> select sid, seconds_in_wait, event from v$session_wait where sid=196;

  no rows selected

  SQL> SELECT B.USERNAME
    2         ,B.SID
    3         ,B.SERIAL#
    4         ,LOGON_TIME
    5         ,A.OBJECT_ID
    6  FROM   V$LOCKED_OBJECT A,
    7         V$SESSION B
    8  WHERE  A.SESSION_ID = B.SID
    9  ORDER  BY B.LOGON_TIME;

  no rows selected

  SQL>

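Next, let's construct a network failure case (this needs more than one machine, or a virtual machine). First, from a virtual machine, we connect to the server with SQL*Plus as user test. The server's sqlnet.ora must not set SQLNET.EXPIRE_TIME, so that DCD is not enabled; DCD is covered later. We then run an UPDATE statement without committing: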

 
  SQL> show user;
  USER is "TEST"

  SQL> select * from v$mystat where rownum =1;

         SID STATISTIC#      VALUE
  ---------- ---------- ----------
         914          0          1

  SQL> select sid,serial# from v$session where sid=914;

         SID    SERIAL#
  ---------- ----------
         914       3944

  SQL> update scott.emp set sal=8000 where empno=7369;

  1 row updated.

  SQL>

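Then we cut the virtual machine off from the network to simulate a network failure (by running service network stop on the client machine) and, on the server, use SQL*Plus to look at session (914, 3944):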

 
  SQL> select sid, seconds_in_wait, event from v$session_wait where sid=914;

         SID SECONDS_IN_WAIT EVENT
  ---------- --------------- ----------------------------------------------------------------
         914              93 SQL*Net message from client

  SQL> SELECT B.USERNAME
    2         ,B.SID
    3         ,B.SERIAL#
    4         ,LOGON_TIME
    5         ,A.OBJECT_ID
    6  FROM   V$LOCKED_OBJECT A,
    7         V$SESSION B
    8  WHERE  A.SESSION_ID = B.SID
    9  ORDER  BY B.LOGON_TIME;

  USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID
  ------------------------------ ---------- ---------- --------- ----------
  TEST                                  914       3944 01-DEC-16     782460

  SQL>

Re-running the statements above over time shows that session 914 stays INACTIVE, keeps holding its row exclusive (RX) lock on the table, and that seconds_in_wait keeps growing:

 
  SQL> select sid, seconds_in_wait, event from v$session_wait where sid=914;

         SID SECONDS_IN_WAIT EVENT
  ---------- --------------- -----------------------------------------------------
         914            4928 SQL*Net message from client

  SQL> SELECT B.USERNAME
    2         ,B.SID
    3         ,B.SERIAL#
    4         ,LOGON_TIME
    5         ,A.OBJECT_ID
    6  FROM   V$LOCKED_OBJECT A,
    7         V$SESSION B
    8  WHERE  A.SESSION_ID = B.SID
    9  ORDER  BY B.LOGON_TIME;

  USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID
  ------------------------------ ---------- ---------- --------- ----------
  TEST                                  914       3944 01-DEC-16     782460

  SQL> select sid, seconds_in_wait, event from v$session_wait where sid=914;

         SID SECONDS_IN_WAIT EVENT
  ---------- --------------- ------------------------------------------------
         914            5853 SQL*Net message from client

  SQL> SELECT B.USERNAME
    2         ,B.SID
    3         ,B.SERIAL#
    4         ,LOGON_TIME
    5         ,A.OBJECT_ID
    6  FROM   V$LOCKED_OBJECT A,
    7         V$SESSION B
    8  WHERE  A.SESSION_ID = B.SID
    9  ORDER  BY B.LOGON_TIME;

  USERNAME                              SID    SERIAL# LOGON_TIM  OBJECT_ID
  ------------------------------ ---------- ---------- --------- ----------
  TEST                                  914       3944 01-DEC-16     782460

  SQL>

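The session stays like this until PMON finally reclaims its resources. Over repeated tests, this consistently happened a little more than 7860 seconds after the network was cut.

Is this a fixed value? Does it follow a pattern? To measure how long it takes before PMON reclaims the process, releases its resources, and rolls back the transaction, I ran the following script on the server (adjust the sid and serial# values to match your own session):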

 
  -- Empty tables with the right column layout for the samples.
  CREATE TABLE TEST.SESSION_WAIT_RECORD
  AS
  SELECT sid,
         seconds_in_wait,
         event,
         sysdate AS curr_datetime
  FROM   v$session_wait
  WHERE  1 = 0;

  CREATE TABLE TEST.LOCK_OBJECT_RECORD
  AS
  SELECT B.username,
         B.sid,
         B.serial#,
         B.logon_time,
         A.object_id,
         sysdate AS curr_datetime
  FROM   v$locked_object A,
         v$session B
  WHERE  A.session_id = B.sid
         AND 1 = 0;

  -- Sample the monitored session (sid 916, serial# 415 in this run) every
  -- 10 seconds until it disappears from v$session_wait.
  DECLARE
      v_index NUMBER := 1;
  BEGIN
      WHILE v_index != 0 LOOP
          -- Record the current wait event and how long the session has waited.
          INSERT INTO TEST.SESSION_WAIT_RECORD
          SELECT sid,
                 seconds_in_wait,
                 event,
                 sysdate
          FROM   v$session_wait
          WHERE  sid = 916;

          -- Record the locks the session still holds.
          INSERT INTO TEST.LOCK_OBJECT_RECORD
          SELECT B.username,
                 B.sid,
                 B.serial#,
                 B.logon_time,
                 A.object_id,
                 sysdate
          FROM   v$locked_object A,
                 v$session B
          WHERE  A.session_id = B.sid
                 AND A.session_id = 916
                 AND B.serial# = 415
          ORDER  BY B.logon_time;

          COMMIT;

          dbms_lock.Sleep(10);

          -- v_index drops to 0 once the session is gone, which ends the loop.
          SELECT Count(*)
          INTO   v_index
          FROM   v$session_wait
          WHERE  sid = 916;
      END LOOP;
  END;
  /

1: Are the results consistent across repeated tests? Is there a pattern?

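Judging from my handful of tests (certainly not a large sample, nor one covering every scenario), the results, taken from the SESSION_WAIT_RECORD table, were essentially all between 7872 and 7876. Since the script above sleeps for 10 seconds between samples, we can infer that the database cleans up a broken session after a fixed amount of time.

Test 1: (result image not preserved)

Test 2: (result image not preserved)

Test 3: with the script's sleep interval shortened to 2 seconds, to reduce the error introduced by a long sleep, the result was 7876.

The results look slightly inconsistent, but the differences are measurement error, caused by the sleep in the script (10 seconds in tests 1 and 2, 2 seconds in test 3) and other minor factors. The pattern is that this value is tied to the Linux TCP keepalive settings. Let's first look at the TCP keepalive concept: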

The keepalive concept is very simple: when you set up a TCP connection, you associate a set of timers. Some of these timers deal with the keepalive procedure. When the keepalive timer reaches zero, you send your peer a keepalive probe packet with no data in it and the ACK flag turned on. You can do this because of the TCP/IP specifications, as a sort of duplicate ACK, and the remote endpoint will have no arguments, as TCP is a stream-oriented protocol. On the other hand, you will receive a reply from the remote host (which doesn't need to support keepalive at all, just TCP/IP), with no data and the ACK set.

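As the name suggests, TCP keepalive is used to keep TCP connections alive, and it applies only to TCP connections. The system maintains a timer for you; when it expires, a probe packet carrying no data is sent to the remote peer, and the peer sends back an acknowledgment, which tells you the channel is still healthy. Three parameters control TCP keepalive: tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes.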

  [root@mylnx01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_time
  7200
  [root@mylnx01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
  75
  [root@mylnx01uat ~]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
  9
  [root@mylnx01uat ~]#

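/proc/sys/net/ipv4/tcp_keepalive_time: with keepalive enabled, how long a connection must be idle before TCP starts sending keepalive probes. The default is 7200 seconds (2 hours).

/proc/sys/net/ipv4/tcp_keepalive_intvl: the interval between successive keepalive probes when a probe goes unacknowledged. The default is 75 seconds.

/proc/sys/net/ipv4/tcp_keepalive_probes: how many keepalive probes are sent without a reply before the connection is considered dead. The default is 9.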

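So, when DCD is not enabled in Oracle, how do the system and the database decide that a connection is broken and needs to be closed? The time works out as follows: the connection first sits idle for 7200 seconds, then a probe is sent every 75 seconds; once 9 probes have gone unanswered (7200 + 75 * 9 = 7875), the connection is judged dead and can be closed. The value is therefore fixed, 7875 seconds here, although it can differ between operating systems because it depends on the three tcp_keepalive parameters above. After 7875 seconds, PMON reclaims everything associated with the session (rolling back the transaction and releasing locks, latches, and memory). This is very close to what I measured; what we sampled was the wait time, and the test script sleeps between samples, so the collected figures are slightly off.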
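The arithmetic above can be reproduced on any Linux host. The following is a minimal shell sketch, not part of the original tests: the function names keepalive_timeout and read_keepalive_param are made up for illustration, and the fallback values are simply the defaults quoted above.

```shell
#!/bin/sh
# Seconds before the kernel declares an idle, unreachable peer dead:
# idle time before the first probe + (probe interval * number of probes).
keepalive_timeout() {
    echo $(( $1 + $2 * $3 ))
}

# Read one keepalive setting from /proc, falling back to the given default
# when the file is not readable (for example on a non-Linux host).
read_keepalive_param() {
    if [ -r "/proc/sys/net/ipv4/$1" ]; then
        cat "/proc/sys/net/ipv4/$1"
    else
        echo "$2"
    fi
}

idle=$(read_keepalive_param tcp_keepalive_time 7200)
intvl=$(read_keepalive_param tcp_keepalive_intvl 75)
probes=$(read_keepalive_param tcp_keepalive_probes 9)

echo "dead peer detected after $(keepalive_timeout "$idle" "$intvl" "$probes") seconds"
```

On a host that still uses the defaults (7200, 75, 9), the computed value is 7875 seconds, the figure derived above.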

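2: What about a query? What happens then?

For a query the result is the same; if you are interested, try it yourself.

3: Does this depend on dedicated server versus shared server mode?

Testing shows that dedicated server and shared server connections behave the same. The timing depends only on the Linux kernel parameters tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes.

This raises a problem: a session that holds a row exclusive (RX) lock for as long as 7875 seconds can easily cause performance problems, which is unacceptable on an important system. That brings us back to the main topic: how does Oracle deal with this? It ought to have a mechanism for it, and it does: DCD (Dead Connection Detection), described below.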

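The Dead Connection Detection Concept

DCD stands for Dead Connection Detection. It is used to detect sessions whose client has died but that were never disconnected. It works as follows:

When a new database connection is established, SQL*Net reads the SQLNET.EXPIRE_TIME setting from its parameter file (if it is set) and initializes DCD on the server side. DCD creates a timer for the connection; when the timer exceeds the interval specified by SQLNET.EXPIRE_TIME, the server sends a probe packet to the client. The probe is essentially an empty SQL*Net packet that carries no useful data and only generates traffic on the underlying protocol. If the client connection is still healthy, the client simply discards the probe and the Oracle server resets the connection's timer. If the client has exited abnormally, then when the probe is passed from the client's IP layer up to its TCP layer, the original connection is found to no longer exist and the TCP layer returns an error. When the Oracle server receives that error, it knows the connection is no longer usable, and SQL*Net signals the operating system to release the connection's resources.

For Oracle's own description of Dead Connection Detection, see the note "Dead Connection Detection (DCD) Explained (Doc ID 151972.1)". An excerpt follows: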

DEAD CONNECTION DETECTION
=========================

OVERVIEW
--------

Dead Connection Detection (DCD) is a feature of SQL*Net 2.1 and later, including Oracle Net8 and Oracle NET. DCD detects when a partner in a SQL*Net V2 client/server or server/server connection has terminated unexpectedly, and flags the dead session so PMON can release the resources associated with it.

DCD is intended primarily for environments in which clients power down their systems without disconnecting from their Oracle sessions, a problem characteristic of networks with PC clients.

DCD is initiated on the server when a connection is established. At this time SQL*Net reads the SQL*Net parameter files and sets a timer to generate an alarm. The timer interval is set by providing a non-zero value in minutes for the SQLNET.EXPIRE_TIME parameter in the sqlnet.ora file.

When the timer expires, SQL*Net on the server sends a "probe" packet to the client. (In the case of a database link, the destination of the link constitutes the server side of the connection.) The probe is essentially an empty SQL*Net packet and does not represent any form of SQL*Net level data, but it creates data traffic on the underlying protocol.

If the client end of the connection is still active, the probe is discarded, and the timer mechanism is reset. If the client has terminated abnormally, the server will receive an error from the send call issued for the probe, and SQL*Net on the server will signal the operating system to release the connection's resources.

On Unix servers, the sqlnet.ora file must be in either $TNS_ADMIN or $ORACLE_HOME/network/admin. Neither /etc nor /var/opt/oracle alone is valid.

It should be also be noted that in SQL*Net 2.1.x, an active orphan process (one processing a query, for example) will not be killed until the query completes. In SQL*Net 2.2, orphaned resources will be released regardless of activity.

This is a server feature only. The client may be running any supported SQL*Net V2 release.

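How to Enable DCD

Enabling DCD (Dead Connection Detection) is very simple: just set SQLNET.EXPIRE_TIME in sqlnet.ora on the server side. The client must, of course, support SQL*Net V2 or later. For how to check and confirm that DCD is enabled, Oracle has a detailed note: Note.395505.1 How to Check if Dead Connection Detection (DCD) is Enabled in 9i and 10g. I won't go into it here.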
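As a quick first check of the setting described above, the sketch below looks for SQLNET.EXPIRE_TIME in a sqlnet.ora file. It is only an illustration: the function name dcd_expire_time is hypothetical, the parser assumes the simple one-line NAME = value form, and the file is looked up in $TNS_ADMIN or $ORACLE_HOME/network/admin, the locations named in the Oracle note quoted earlier. Finding the parameter shows only that it is configured, not that DCD is actually working; Note.395505.1 covers the real verification.

```shell
#!/bin/sh
# Print the SQLNET.EXPIRE_TIME value (in minutes) found in the given
# sqlnet.ora file, or nothing if the parameter is absent or commented out.
# The name is matched case-insensitively; the last occurrence wins.
dcd_expire_time() {
    awk -F= '
        toupper($1) ~ /^[ \t]*SQLNET\.EXPIRE_TIME[ \t]*$/ {
            value = $2
            gsub(/[ \t\r]/, "", value)
        }
        END { if (value != "") print value }
    ' "$1"
}

# Locate sqlnet.ora: $TNS_ADMIN first, then $ORACLE_HOME/network/admin.
if [ -n "${TNS_ADMIN:-}" ]; then
    sqlnet_file="$TNS_ADMIN/sqlnet.ora"
else
    sqlnet_file="${ORACLE_HOME:-.}/network/admin/sqlnet.ora"
fi

if [ -r "$sqlnet_file" ]; then
    minutes=$(dcd_expire_time "$sqlnet_file")
    if [ -n "$minutes" ]; then
        echo "SQLNET.EXPIRE_TIME is set to $minutes minute(s) in $sqlnet_file"
    else
        echo "SQLNET.EXPIRE_TIME is not set in $sqlnet_file"
    fi
else
    echo "cannot read $sqlnet_file"
fi
```

Because the probes originate on the server, this check is meant to be run against the server's sqlnet.ora, not the client's.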

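Problems and Anomalies with DCD

DCD still has quite a few bugs on some versions and platforms; a search on Oracle Metalink turns up many of them. In addition, when I tested with SQLNET.EXPIRE_TIME=5, the dead connections were cleaned up not after 5 minutes but after more than 20 minutes.

I searched through a lot of material and still could not fully get to the bottom of this. All I know is that it has to do with TCP/IP's retransmission timeout mechanism; networking is my weak spot, and after several fruitless attempts I had to give up. In any case, the database does not reclaim a dead connection at exactly the time specified by SQLNET.EXPIRE_TIME (the Oracle note quoted below, for example, explicitly says "not at the exact time of the DCD value"). The value can also be affected by firewalls; see the article on firewalls, DCD and TCP keep alive listed in the references.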

To answer common questions about Dead Connection Detection (DCD).

Common Questions about Dead Connection Detection
------------------------------------------------

Q: What is Dead Connection Detection?
A: Dead Connection Detection (DCD) allows SQL*Net/Net8 to identify connections that have been left hanging by the abnormal termination of a client. This feature minimizes the waste of resources by connections that are no longer valid. It also automatically forces a rollback of uncommitted transactions and locks held by the user of the broken connection.

Q: How does Dead Connection Work?
A: On a connection with DCD enabled, a small probe packet is sent from server to client at a user defined interval (usually several minutes). If the connection is invalid (usually due to the client process or machine being unreachable), the session is "flagged" as dead, and PMon cleans up that session when next doing housekeeping (not at the exact time of the DCD value). The DCD mechanism does NOT terminate any sessions (idle or active). It merely marks the "dead" session for deletion by PMon.

Q: How do you set the Dead Connection Detection feature?
A: DCD is enabled on the server side of the connection by defining a parameter in the sqlnet.ora file in $ORACLE_HOME/network/admin called SQLNET.EXPIRE_TIME. This parameter determines the time period between successive probe packets across a connection between client and server.

SQLNET.EXPIRE_TIME= <# of minutes>

The sqlnet.expire_time parameter is defined in minutes and can have any value between 1 and an infinite number. If it is not defined in the sqlnet.ora file, DCD is not used. A time of 10 minutes is probably optimum for most applications. DCD probe packets originate on the server side, so it must be enabled on the server side. If you define sqlnet.expire_time on the client side, it will be ignored.

Q: Will this work with the Oracle Multi-Threaded Server?
A: DCD will work and is very useful with Multi-Threaded Server (MTS) configurations. MTS alone does not solve the problem, as a client that is powered down when connected to a MTS will also leave a defunct connection within the MTS (at least until the underlying protocol detects the loss of the client, at which time it will inform MTS, which will then free the resources). The resources used per client with MTS are less than those used by dedicated server, however, so the net gain per connection within MTS is less than that with dedicated server. Having said that, DCD has a distinct advantage within MTS configurations - as each server process is managing multiple clients simultaneously, the DBA has no option of killing a single process as a result of the termination of a single client. DCD therefore increases database uptime by allowing resources to be managed more effectively.

Q: Can I use DCD on all of my connections over all protocols?
A: You can use DCD over all protocols except for APPC/LU6.2, which prevents DCD from working due to its half-duplex nature. It also does not work over bequeathed connections. You should carefully consider whether to use DCD before you use it, however, as it creates additional processing work on the server and can also increase network traffic. Furthermore, some protocols already implement a form of DCD already, so it may not necessarily be needed on all protocols.

Q: Are there any differences if I am using DCD on connections that go through the Oracle Multi-Protocol Interchange (MPI) or Connection Manager (CMAN)?
A: No. DCD works through MPI and CMAN in the same way as direct client/server. If your connection spans across half-duplex and full-duplex protocols (for example APPC/LU6.2 and TCP/IP), DCD will be disabled by the server.
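One possible reading of the 20-plus minutes observed earlier with SQLNET.EXPIRE_TIME=5, offered as an assumption and not as something the tests in this article established: once the probe has been sent to an unreachable client, the server's TCP stack keeps retransmitting it with exponential backoff before giving up and reporting an error. The sketch below estimates that give-up time under assumed Linux defaults (tcp_retries2 = 15, a 200 ms minimum and a 120 s maximum retransmission timeout). The function name and the constants are illustrative, and real kernels treat this figure as a lower bound.

```shell
#!/bin/sh
# Estimate, in milliseconds, how long TCP keeps retransmitting unacknowledged
# data before giving up: the timeout starts at 200 ms, doubles after every
# attempt, and is capped at 120 s. $1 is the number of retransmissions
# (tcp_retries2); the initial transmission is counted as well.
retransmit_giveup_ms() {
    retries=$1
    rto=200
    rto_max=120000
    total=0
    attempt=0
    while [ "$attempt" -le "$retries" ]; do
        total=$(( total + rto ))
        rto=$(( rto * 2 ))
        if [ "$rto" -gt "$rto_max" ]; then
            rto=$rto_max
        fi
        attempt=$(( attempt + 1 ))
    done
    echo "$total"
}

expire_minutes=5    # SQLNET.EXPIRE_TIME used in the test above
giveup_s=$(( $(retransmit_giveup_ms 15) / 1000 ))
total_s=$(( expire_minutes * 60 + giveup_s ))

echo "retransmission phase: about ${giveup_s} seconds"
echo "probe interval plus retransmission: about ${total_s} seconds"
```

Under these assumptions the retransmission phase comes to roughly 924 seconds and the total to roughly 1224 seconds, a little over 20 minutes, which is in line with the observation; other factors, such as when PMON next does its housekeeping, may also contribute.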

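Pros and Cons of DCD

The benefits of DCD have largely been covered above, but it also has drawbacks. For example, it performs poorly on Windows (bug#303578), and on SCO Unix it triggers a bug that consumes a large amount of CPU. DCD is also far more resource-intensive than the equivalent protocol-level mechanisms, so relying on it to clean up dead processes adds load to the system. Exiting applications cleanly should always come first. As the original English puts it: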

DCD is much more resource-intensive than similar mechanisms at the protocol level, so if you depend on DCD to clean up all dead processes, that will put an undue load on the server. Clearly it is advantageous to exit applications cleanly in the first place.

References:

http://www.laoxiong.net/firewall-dcd-and-tcp-keep-alive.html

A discussion of Dead Connection Detection, Resource Limits, V$SESSION, V$PROCESS and OS processes (Note.601605.1)

How to Check if Dead Connection Detection (DCD) is Enabled in 9i and 10g (Note.395505.1)

Connections on Windows Platform Timout after 2 Hours, Why ? (Doc ID 1073461.1)

Concurrent Manager Functionality Not Working And PCP Failover Takes Long Inspite of Enabling DCD With Database Server (Doc ID 438921.1)

Common Questions About Dead Connection Detection (DCD) (Doc ID 1018160.6)

Dead Connection Detection (DCD) Explained (Doc ID 151972.1)