Troubleshooting notes: VIP and SCAN IP alive on both nodes at the same time

Customer environment:
Two Linux 6.5 hosts in a VMware vSphere 5.1 virtualized environment, running a two-node Oracle 11.2.0.3 RAC cluster.

Customer issue report:
The customer reported that the cluster database was accessible from both nodes. Strangely, though, a check on node 1 showed XXX2.vip as FAILED OVER, while a check on node 2 showed ****1.vip as FAILED OVER.

Troubleshooting approach:
(1) Check the cluster status and see whether it matches the customer's description;
(2) Check the OS log (messages), the clusterware alert log, the cssd log, the database alert log and other logs for any abnormal errors (see the command sketch after this list);
(3) Check the underlying infrastructure: how it is deployed, the network communication between the nodes, and how the disks are carved up and used;
(4) With the customer's approval, restart the cluster, check the logs for anomalies, and observe whether the cluster comes back to a normal state after the restart;
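
A minimal sketch of the commands behind steps (1) and (2), assuming the Grid home /u01/app/11.2.0/grid and the hostname yyzc01 that appear in the logs below; adjust for your own environment:

    # as the grid user, on each node
    crsctl stat res -t      # status of all cluster resources
    crsctl check crs        # health of the CRS/CSS/EVM daemons

    # as root / grid, on each node
    tail -200 /var/log/messages
    tail -200 /u01/app/11.2.0/grid/log/yyzc01/alertyyzc01.log    # clusterware alert log
    tail -200 /u01/app/11.2.0/grid/log/yyzc01/cssd/ocssd.log     # cssd log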

Fault handling:
Logged on to the servers and checked, as shown below.
Node 1:
$ crsctl stat res -t
[screenshot: crsctl stat res -t output on node 1]
Node 2:
[screenshot: crsctl stat res -t output on node 2]

The cluster status check confirmed exactly what the customer had described. Could it be a network problem?
I pinged every IP address; there was no packet loss and no errors at all.
[screenshot: ping results for all cluster IP addresses]
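
For reference, a minimal sketch of that connectivity check; the -priv/-vip host names and the SCAN name below are illustrative placeholders, not the customer's actual names:

    # from each node, ping the public, private, VIP and SCAN addresses
    ping -c 3 yyzc01      && ping -c 3 yyzc02
    ping -c 3 yyzc01-priv && ping -c 3 yyzc02-priv
    ping -c 3 yyzc01-vip  && ping -c 3 yyzc02-vip
    ping -c 3 <scan-name>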

Checked the OS logs; neither node had any abnormal output.
[screenshot: /var/log/messages on both nodes, nothing unusual]

In the clusterware alert log, the listener status check had failed and node 2 had been "manually shut down". Yet node 2 was clearly up and running when I checked it. What could have caused that?

    2019-03-05 10:41:16.062: [ohasd(9155)]CRS-2112:The OLR service started on node yyzc01.
    2019-03-05 10:41:16.077: [ohasd(9155)]CRS-1301:Oracle High Availability Service started on node yyzc01.
    2019-03-05 10:41:16.077: [ohasd(9155)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
    2019-03-05 10:41:19.200: [gpnpd(9292)]CRS-2328:GPNPD started on node yyzc01.
    2019-03-05 10:41:21.679: [cssd(9362)]CRS-1713:CSSD daemon is started in clustered mode
    2019-03-05 10:41:23.482: [ohasd(9155)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
    2019-03-05 10:41:41.460: [cssd(9362)]CRS-1707:Lease acquisition for node yyzc01 number 1 completed
    2019-03-05 10:41:42.843: [cssd(9362)]CRS-1605:CSSD voting file is online: /dev/asm-diskg; details in /u01/app/11.2.0/grid/log/yyzc01/cssd/ocssd.log.
    2019-03-05 10:41:42.847: [cssd(9362)]CRS-1605:CSSD voting file is online: /dev/asm-diskf; details in /u01/app/11.2.0/grid/log/yyzc01/cssd/ocssd.log.
    2019-03-05 10:41:42.854: [cssd(9362)]CRS-1605:CSSD voting file is online: /dev/asm-diske; details in /u01/app/11.2.0/grid/log/yyzc01/cssd/ocssd.log.
    2019-03-05 10:41:51.981: [cssd(9362)]CRS-1601:CSSD Reconfiguration complete. Active nodes are yyzc01 .
    2019-03-05 10:41:53.963: [ctssd(9517)]CRS-2407:The new Cluster Time Synchronization Service reference node is host yyzc01.
    2019-03-05 10:41:53.965: [ctssd(9517)]CRS-2401:The Cluster Time Synchronization Service started on host yyzc01.
    2019-03-05 10:41:55.703: [ohasd(9155)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
    2019-03-05 10:42:18.031: [crsd(9705)]CRS-1012:The OCR service started on node yyzc01.
    2019-03-05 10:42:18.802: [evmd(9537)]CRS-1401:EVMD started on node yyzc01.
    2019-03-05 10:42:19.971: [crsd(9705)]CRS-1201:CRSD started on node yyzc01.
    2019-03-05 10:42:21.273: [/u01/app/11.2.0/grid/bin/oraagent.bin(9814)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/yyzc01/agent/crsd/oraagent_grid/oraagent_grid.log"
    2019-03-05 10:42:21.273: [/u01/app/11.2.0/grid/bin/oraagent.bin(9814)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/yyzc01/agent/crsd/oraagent_grid/oraagent_grid.log"
    2019-03-05 10:42:21.287: [/u01/app/11.2.0/grid/bin/oraagent.bin(9814)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/yyzc01/agent/crsd/oraagent_grid/oraagent_grid.log"
    2019-03-05 10:42:22.793: [crsd(9705)]CRS-2772:Server 'yyzc01' has been assigned to pool 'Generic'.
    2019-03-05 10:42:22.793: [crsd(9705)]CRS-2772:Server 'yyzc01' has been assigned to pool 'ora.MOEUUMDB'.
    2019-03-05 10:42:22.794: [crsd(9705)]CRS-2772:Server 'yyzc01' has been assigned to pool 'ora.MOEUIADB'.
    2019-03-05 10:42:23.026: [client(9948)]CRS-4743:File /u01/app/11.2.0/grid/oc4j/j2ee/home/OC4J_DBWLM_config/system-jazn-data.xml was updated from OCR(Size: 13384(New), 13397(Old) bytes)
    2019-03-05 10:43:01.702: [cssd(9362)]CRS-1625:Node yyzc02, number 2, was manually shut down

At this point I was still puzzled. The alert log showed no obvious errors, the cluster database started normally, and the status looked more or less fine, so what was actually causing this?

Look at it the other way around. This is a two-node cluster; under normal circumstances, if either node is shut down, its VIP fails over to the surviving node and keeps serving clients. That is the high availability Oracle clusterware provides. Here, however, each node considered itself healthy and the other one faulty, so each simply carried on alone. But that should not be possible either: in such a situation the voting disks arbitrate, the node that reaches the voting disks first survives and evicts the other node from the cluster. That is Oracle's split-brain resolution mechanism, so only one node should be left running, yet here both nodes had started up normally. In other words, the environment had effectively split into two independent single-node clusters. Checking the cssd log (CSSD is the service process that manages the cluster configuration and node membership), one line stood out: the latest reconfiguration information for node 2 showed only a single active node, which confirmed my suspicion once again.
[screenshot: ocssd.log showing only a single active node after reconfiguration]
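
A quick way to confirm this kind of split is to ask each node which members it believes the cluster has; a minimal sketch, using the node names yyzc01/yyzc02 from the logs above:

    # as the grid user, run on node 1 and on node 2 separately
    olsnodes -n -s
    # healthy cluster: both runs list yyzc01 and yyzc02 as Active
    # in this case: each node listed only itself as Active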

By this point the shared disks were the obvious suspect. Shared disk properties of the node 2 VM:
[screenshot: disk properties of the node 2 virtual machine]

Shared disk properties of the node 1 VM:
[screenshot: disk properties of the node 1 virtual machine]
Only now did it become clear that it really was a shared disk problem. A note on virtualization: a shared disk must be attached to both VMs at the same SCSI address with the sharing attribute enabled. Under normal circumstances it should look like this:
[screenshot: correct shared disk configuration in vSphere]
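
For illustration only (not the customer's actual configuration, and worth verifying against the VMware documentation for your vSphere version), a properly shared ASM disk typically appears in both VMs' .vmx files with a matching SCSI address and sharing enabled, roughly like this:

    # dedicated SCSI controller for the shared ASM disks (same controller number on both VMs)
    scsi1.present = "TRUE"
    # the same disk file at the same scsi1:0 address in BOTH VMs
    scsi1:0.present = "TRUE"
    scsi1:0.fileName = "/vmfs/volumes/shared_ds/asm_disk1.vmdk"
    # VMDK-based shared disks use the multi-writer flag so both VMs can open the disk;
    # RDM-based setups rely on SCSI bus sharing on the controller instead
    scsi1:0.sharing = "multi-writer"
    scsi1:0.mode = "independent-persistent"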

With that, the root cause was essentially clear. The customer had previously carried out a storage migration and did not take the shared disks into account during the move. After the migration VMware assigned new SCSI addresses to the disks, so they were no longer shared between the two VMs; each node ended up with its own private copy of the disks, which it could access and start from normally. That is exactly why both nodes stayed alive on their own.
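
A quick OS-level way to double-check whether both nodes really see the same backend LUN is to compare the SCSI identifiers of the ASM devices on each node; a minimal sketch for RHEL/OEL 6, using the voting-disk device names from the alert log above:

    # run on both nodes and compare the output line by line;
    # identical IDs mean the same shared LUN, different IDs mean separate (non-shared) disks
    /sbin/scsi_id -g -u -d /dev/asm-diske
    /sbin/scsi_id -g -u -d /dev/asm-diskf
    /sbin/scsi_id -g -u -d /dev/asm-diskg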