Oracle Restart是11gR2中推出的重要高可用(High Availability)特性。在Single Instance状况下,Clusterware造成一个可用性维护框架,Oracle组件服务都是在这个维护管理框架上进行管理。css
Oracle Restart从职责上负责两方面的功能,一个是Oracle各个服务组件的自动启动。鉴于组件间复杂的依赖关系,使用Restart自动的进行启动顺序调节是比较好的一种策略。另外一个功能是高可用支持,若是某一个组件意外被终止运行,好比异常中断,Oracle Restart是能够按期的检查“治下”组件的生存状况,一旦检查出问题就会进行自动的启动。node
目前单实例Oracle使用Oracle Restart支持的组件内容有:监听器Listener、Oracle实例和数据库、ASM实例、ASM磁盘组、数据库服务Service和ONS(Oracle Notification Service)。linux
本篇记录笔者遇到的一个故障场景,不甚复杂,和行业大牛们大做不敢相比。权当思路记录,留待须要的朋友不时之需。sql
一、问题故障出现数据库
在一台11gR2的Oracle上,笔者部署了单实例ASM实例和磁盘组结构,而且在上面部署了Single Instance Oracle。因为是测试使用,笔者在上面进行过一些测试和实验,今天启动服务器以后,发现问题。服务器
grid@SimpleLinux simplelinux]$ uptimeoracle
13:58:13 up 2:24, 1 user, load average: 0.03, 0.02, 0.00app
[grid@SimpleLinux simplelinux]$ ps -ef | grep pmon框架
grid 3212 1 0 11:35 ? 00:00:01 asm_pmon_+ASMdom
grid 27724 27685 0 13:58 pts/0 00:00:00 grep pmon
根据标准的Oracle Restart配置,ASM实例、ASM磁盘组和数据库实例都是在Restart管理范围,应该是随着服务器启动而自动启动。可是从实际状况看,ASM实例已经自动启动,数据库实例没有启动。
同RAC结构同样,Restart也是借助服务器启动过程当中,以ohasd为首的高可用守护进程进行步步启动动做。
这种状况下,查看日志信息是最好的选择,看看那个环节出现问题。
[grid@SimpleLinux simplelinux]$ pwd
/u01/app/grid/product/11.2.0/grid/log/simplelinux
[grid@SimpleLinux simplelinux]$ ls -l | grep alert
-rw-rw---- 1 grid oinstall 14494 Oct 17 11:35 alertsimplelinux.log
对grid和clusterware的日志,都是保留在$ORACLE_HOME/log下的目录从中。Alert.log是主日志,也是检查的起始点。一般是里面发现的问题,进行进一步的分析动做。
[ohasd(2744)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2013-10-17 11:35:34.373
[cssd(3130)]CRS-1601:CSSD Reconfiguration complete. Active nodes are simplelinux .
2013-10-17 11:35:50.094
[/u01/app/grid/product/11.2.0/grid/bin/oraagent.bin(3072)]CRS-5010:Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
2013-10-17 11:35:55.645
[/u01/app/grid/product/11.2.0/grid/bin/oraagent.bin(3072)]CRS-5010:Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
2013-10-17 11:35:55.806
[ohasd(2744)]CRS-2807:Resource 'ora.ora11g.db' failed to start automatically.
咱们定位到了问题片断,从上面标红的内容看。Clusterware在启动dismon服务以后,试图启动数据库,也就是ora.ora11g.db。在访问一个参数文件(注意是pfile)过程当中,发现问题。
进一步检查指出的oraagent_grid.log日志,也没有过多的信息提示。
2013-10-17 11:35:50.049: [ora.ora11g.db][3013430160] {0:0:2} [start] sclsnInstAgent::sUpdateOratab file updated with dbName ora11g value /u01/app/oracle/product/11.2.0/db_1:N
2013-10-17 11:35:50.049: [ora.ora11g.db][3013430160] {0:0:2} [start] sclsnInstAgent::sUpdateOratab CSS unlock
2013-10-17 11:35:50.090: [ora.ora11g.db][3013430160] {0:0:2} [start] (:CLSN00014:)Failed to open file /u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora
2013-10-17 11:35:50.091: [ AGENT][3013430160] {0:0:2} UserErrorException: Locale is
2013-10-17 11:35:50.091: [ora.ora11g.db][3013430160] {0:0:2} [start] clsnUtils::error Exception type=2 string=
CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
从信息上看,是对pfile没有可以打开。
二、一次不成功的尝试
从日志信息上,看到是不可以打开文本参数控制文件。初步猜想是文件权限缘由,下面进行检查。
[grid@SimpleLinux oraagent_grid]$ cd /u01/app/oracle/product/11.2.0/db_1/dbs/
[grid@SimpleLinux dbs]$ ls -l
total 20
-rw-rw---- 1 oracle asmadmin 1544 Sep 12 12:58 hc_ora11g.dat
-rw-r--r-- 1 oracle oinstall 2851 May 15 2009 init.ora
-rw-r----- 1 oracle oinstall 887 Sep 29 09:31 initora11g.ora
-rw-r----- 1 oracle asmadmin 24 Sep 12 12:58 lkORA11G
-rw-r----- 1 oracle oinstall 1536 Sep 12 13:11 orapwora11g
[grid@SimpleLinux dbs]$ id oracle
uid=500(oracle) gid=500(oinstall) groups=500(oinstall),501(dba),502(oper),602(asmdba)
[grid@SimpleLinux dbs]$ id grid
uid=501(grid) gid=500(oinstall) groups=500(oinstall),501(dba),600(asmadmin),601(asmoper),602(asmdba)
权限内容是oracle用户读写、组用户读。从权限上看,grid和oracle读取和修改的问题不算特别严重。可是仍是进行测试尝试。
[oracle@SimpleLinux dbs]$ chmod 770 initora11g.ora
[oracle@SimpleLinux dbs]$ ls -l
total 20
-rw-rw---- 1 oracle asmadmin 1544 Sep 12 12:58 hc_ora11g.dat
-rw-r--r-- 1 oracle oinstall 2851 May 15 2009 init.ora
-rwxrwx--- 1 oracle oinstall 887 Sep 29 09:31 initora11g.ora
-rw-r----- 1 oracle asmadmin 24 Sep 12 12:58 lkORA11G
-rw-r----- 1 oracle oinstall 1536 Sep 12 13:11 orapwora11g
尝试启动数据库。
[grid@SimpleLinux ~]$ srvctl start database -d ora11g
PRCR-1079 : Failed to start resource ora.ora11g.db
CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
CRS-5017: The resource action "ora.ora11g.db start" encountered the following error:
CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
. For details refer to "(:CLSN00107:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.ora11g.db' on 'simplelinux' failed
启动失败。那么,试着使用传统sqlplus命令行方式启动是否可行?
[oracle@SimpleLinux ~]$ sqlplus /nolog
SQL*Plus: Release 11.2.0.3.0 Production on Thu Oct 17 14:17:11 2013
Copyright (c) 1982, 2011, Oracle. All rights reserved.
SQL> conn / as sysdba
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 263639040 bytes
Fixed Size 1344312 bytes
Variable Size 134221000 bytes
Database Buffers 125829120 bytes
Redo Buffers 2244608 bytes
Database mounted.
Database opened.
SQL> quit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - Production
With the Partitioning, Automatic Storage Management, OLAP, Data Mining
and Real Application Testing options
[oracle@SimpleLinux ~]$ ps -ef | grep pmon
grid 3212 1 0 11:35 ? 00:00:02 asm_pmon_+ASM
oracle 27979 1 0 14:17 ? 00:00:00 ora_pmon_ora11g
oracle 28106 27921 0 14:17 pts/0 00:00:00 grep pmon
[oracle@SimpleLinux ~]$ srvctl status database -d ora11g
Database is running.
启动成功,使用sqlplus命令行能够启动,可是Oracle Restart启动就会失败。那么问题在哪儿?
三、Spfile vs. Pfile
从直观上看,Oracle Restart启动的时候是但愿访问到参数文件pfile。从直观的感受上,好像被替代很长时间的pfile为何会被说起。利用已经启动的数据库实例,看一下当前使用的是什么参数文件。
SQL> show parameter spfile
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
spfile string
SQL>
当前启动是利用pfile启动的,刚刚咱们对$ORACLE_HOME/dbs的检索也没有看到spfile文件。Oracle启动过程当中,是默认先根据环境变量“拼凑”的路径查找spfile,以后才是pfile。系统spfile参数为空,说明当前使用的是pfile。
可是,对应到Oracle Restart里面的启动信息,彷佛有些差异。
[grid@SimpleLinux ~]$ srvctl config database -d ora11g
Database unique name: ora11g
Database name: ora11g
Oracle home: /u01/app/oracle/product/11.2.0/db_1
Oracle user: oracle
Spfile: +DATA/ora11g/spfileora11g.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Database instance: ora11g
Disk Groups: DATA,RECO
Services:
明显出现不一样。这个时候,笔者想起以前进行过实验,在ASM环境下进行spfile和pfile的生成操做。怀疑是这个过程当中,存在Restart和实例信息的不匹配。
想出了第二种修复策略。
SQL> create spfile from pfile;
File created.
SQL> startup force
ORACLE instance started.
Total System Global Area 263639040 bytes
Fixed Size 1344312 bytes
Variable Size 134221000 bytes
Database Buffers 125829120 bytes
Redo Buffers 2244608 bytes
Database mounted.
Database opened.
SQL> show parameter spfile
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
spfile string /u01/app/oracle/product/11.2.0
/db_1/dbs/spfileora11g.ora
设置恢复现有的spfile做为启动参数文件。试图让Restart和实例信息一致。
[oracle@SimpleLinux ~]$ srvctl modify database -d ora11g -p /u01/app/oracle/product/11.2.0/db_1/dbs/spfileora11g.ora
[oracle@SimpleLinux ~]$ srvctl config database -d ora11g
Database unique name: ora11g
Database name: ora11g
Oracle home: /u01/app/oracle/product/11.2.0/db_1
Oracle user: oracle
Spfile: /u01/app/oracle/product/11.2.0/db_1/dbs/spfileora11g.ora
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Database instance: ora11g
Disk Groups: DATA,RECO
Services:
实验启动,故障依然。
[oracle@SimpleLinux tmp]$ srvctl start database -d ora11g
PRCR-1079 : Failed to start resource ora.ora11g.db
CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
CRS-5017: The resource action "ora.ora11g.db start" encountered the following error:
CRS-5010: Update of configuration file "/u01/app/oracle/product/11.2.0/db_1/dbs/initora11g.ora" failed: details at "(:CLSN00014:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log"
. For details refer to "(:CLSN00107:)" in "/u01/app/grid/product/11.2.0/grid/log/simplelinux/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.ora11g.db' on 'simplelinux' failed
第二次修复尝试以失败了结,Oracle Restart依然寻找那个pfile。可是笔者得到了方向,就是系统问题在于Restart中对数据库启动参数文件的不一致。
四、问题解决
Oracle Restart是一个很复杂的体系,在没有经验和资料的状况下,笔者也不能证实说是Oracle Bug之类的。
一种思路能够进行尝试。对于Oracle Restart,各类组件都是在上面可插拔的。根据须要,咱们能够进行动态的配置注册过程。从以前的状况看,数据库自己是没有问题的,应该就是配置过程当中的故障。那么,modify配置是有问题的。可不能够将database ora11g剔除出Restart体系,以后再添加过来。
Srvctl的add和remove命令能够帮助咱们实现功能。并且在add过程当中,只有-o参数是强制的,输入ORACLE_HOME目录。
[oracle@SimpleLinux dbs]$ srvctl remove database -d ora11g
Remove the database ora11g? (y/[n]) y
[oracle@SimpleLinux dbs]$ srvctl add database -d ora11g -o /u01/app/oracle/product/11.2.0/db_1
[oracle@SimpleLinux dbs]$ srvctl config database -d ora11g
Database unique name: ora11g
Database name:
Oracle home: /u01/app/oracle/product/11.2.0/db_1
Oracle user: oracle
Spfile:
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Database instance: ora11g
Disk Groups:
Services:
Spfile为空。试着从新启动。
[oracle@SimpleLinux dbs]$ srvctl start database -d ora11g
[oracle@SimpleLinux dbs]$ ps -ef | grep pmon
grid 3215 1 0 14:47 ? 00:00:00 asm_pmon_+ASM
oracle 5265 1 0 15:22 ? 00:00:00 ora_pmon_ora11g
oracle 5386 3578 0 15:22 pts/0 00:00:00 grep pmon
[oracle@SimpleLinux dbs]$ srvctl config database -d ora11g
Database unique name: ora11g
Database name:
Oracle home: /u01/app/oracle/product/11.2.0/db_1
Oracle user: oracle
Spfile:
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Database instance: ora11g
Disk Groups: DATA,RECO
Services:
启动成功!最后尝试看看reboot系统时,可否自动启动。
--从新启动系统
[root@SimpleLinux simplelinux]# ps -ef | grep pmon
grid 3213 1 0 15:27 ? 00:00:00 asm_pmon_+ASM
oracle 3270 1 0 15:27 ? 00:00:00 ora_pmon_ora11g
root 3336 3042 0 15:27 pts/0 00:00:00 grep pmon
[grid@SimpleLinux ~]$ lsnrctl status
LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 17-OCT-2013 15:32:07
Copyright (c) 1991, 2011, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC1521)))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date 17-OCT-2013 15:27:06
Uptime 0 days 0 hr. 5 min. 0 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /u01/app/grid/product/11.2.0/grid/network/admin/listener.ora
Listener Log File /u01/app/grid/diag/tnslsnr/SimpleLinux/listener/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=SimpleLinux.localdomain)(PORT=1521)))
Services Summary...
Service "+ASM" has 1 instance(s).
Instance "+ASM", status READY, has 1 handler(s) for this service...
Service "ora11g" has 1 instance(s).
Instance "ora11g", status READY, has 1 handler(s) for this service...
Service "ora11gXDB" has 1 instance(s).
Instance "ora11g", status READY, has 1 handler(s) for this service...
The command completed successfully
SQL> show parameter spfile
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
spfile string /u01/app/oracle/product/11.2.0/db_1/dbs/spfileora11g.ora
问题解决。
五、结论和反思
从直观的感受看,这应该是Restart和原有命令协调的一个故障。原有create pfile以后,Restart彷佛不可以支持pfile的启动了。另外,在修复过程当中,咱们始终看到不能对spfile修改参数生效,也是一个疑惑点。
可以确定的是,在添加数据库ora11g的时候,没有明确指定启动spfile的位置,那么应该是进入了自动检索目录spfile-pfile的过程。因此系统获得修复。