I sank deep into thought...


Filename=alert_p4moddb1.log

2018-08-16T14:56:00.272280+08:00

WARNING: Read Failed. group:2 disk:4
AU:1053 offset:2605056 size:16384

path:/dev/rdisk/MES1_p4_moddb_redo02

incarnation:0xf7e12348 synchronous
result:'I/O error'

subsys:System krq:0x9ffffffffd1c0608
bufp:0x9ffffffffd007000 osderr1:0x69c0 osderr2:0x0

IO elapsed time: 0 usec Time waited on
I/O: 0 usec

WARNING: failed to read mirror side 1
of virtual extent 7 logical extent 0 of file 260 in group
[2.3551108175] from disk MES1_REDO02 allocation unit 1053 reason
error; if possible, will try another mirror side

NOTE: successfully read mirror side 2
of virtual extent 7 logical extent 1 of file 260 in group
[2.3551108175] from disk RPT_REDO01 allocation unit 1052
--> An I/O error was detected, but the mirrored data was read successfully

……

2018-08-16T14:56:13.489201+08:00
--> A large number of I/O errors

Errors in file
/oracle/diag/rdbms/p4moddb/p4moddb1/trace/p4moddb1_lgwr_13680.trc:

ORA-15080: synchronous I/O operation
failed to write block 1383 of disk 4 in disk group
DG_REDO_MOD

ORA-27063: number of bytes read/written
is incorrect

HPUX-ia64 Error: 11: Resource
temporarily unavailable

Additional information:
4294967295

Additional information: 1024

WARNING: failed to write mirror side 1
of virtual extent 0 logical extent 0 of file 257 in group 2 on disk 4
allocation unit 277

2018-08-16T14:56:31.050369+08:00

……

ERROR: cannot read disk header of disk
MES1_REDO02 (4:4158726984)

2018-08-16T14:56:34.418045+08:00

NOTE: ospid 13682 initiating cluster
wide offline of disk 5 in group 2

2018-08-16T14:56:34.418576+08:00

NOTE: process _rms0_p4moddb1 (13666)
initiating offline of disk 4.4158726984 (MES1_REDO02) with mask 0x7e
in group 2 (DG_REDO_MOD) with client assisting

2018-08-16T14:56:34.446660+08:00

……

NOTE: disk 4 (MES1_REDO02) in group 2
(DG_REDO_MOD) is offline for reads

2018-08-16T14:56:49.513068+08:00

CKPT (ospid: 13682) waits for event
'control file parallel write' for 78 secs.

Errors in file
/oracle/diag/rdbms/p4moddb/p4moddb1/trace/p4moddb1_lmhb_13668.trc
(incident=3529892):

ORA-29771: process MARK (OSID 13751)
blocks CKPT (OSID 13682) for more than 70 seconds

Incident details in:
/oracle/diag/rdbms/p4moddb/p4moddb1/incident/incdir_3529892/p4moddb1_lmhb_13668_i3529892.trc

2018-08-16T14:56:51.744467+08:00

MARK (ospid: 13751) is blocking CKPT
(ospid: 13682) in a wait

LMHB (ospid: 13668) kills MARK (ospid:
13751). --> LMHB detected a core background process blocked for more than 70s and began trying to kill it

Please check LMHB trace file for more
detail.

2018-08-16T14:56:54.755778+08:00

ERROR: failed to mark AU as stale
[4294967279]

WARNING: failed to mark stale some of
AUs corresponding to the mirrors of virtual extent 0 of file 260 in
group 2

WARNING: group 2 file 260 vxn 0 block 3
write I/O failed

2018-08-16T14:56:54.756850+08:00

Errors in file
/oracle/diag/rdbms/p4moddb/p4moddb1/trace/p4moddb1_ckpt_13682.trc:

ORA-00206: error in writing (block 3,
# blocks 1) of control file

ORA-00202: control file:
'+DG_REDO_MOD/p4moddb.ctl01'

ORA-15081: failed to submit an I/O
operation to a disk

ORA-15081: failed to submit an I/O operation to a
disk

2018-08-16T14:56:54.757193+08:00

Errors in file
/oracle/diag/rdbms/p4moddb/p4moddb1/trace/p4moddb1_ckpt_13682.trc:

ORA-00221: error on write to control
file

ORA-00206: error in writing (block 3,
# blocks 1) of control file

ORA-00202: control file:
'+DG_REDO_MOD/p4moddb.ctl01'

ORA-15081: failed to submit an I/O
operation to a disk

ORA-15081: failed to submit an I/O
operation to a disk

Errors in file
/oracle/diag/rdbms/p4moddb/p4moddb1/trace/p4moddb1_ckpt_13682.trc
(incident=3529948):

ORA-221 [] [] [] [] [] []
[] [] [] [] [] []

Incident details in:
/oracle/diag/rdbms/p4moddb/p4moddb1/incident/incdir_3529948/p4moddb1_ckpt_13682_i3529948.trc

2018-08-16T14:56:56.478631+08:00

USER (ospid: 13682): terminating the instance due to error 221.
--> The instance was restarted because of control file I/O errors!

2018-08-16T14:56:56.502831+08:00

System state dump requested by
(instance=1, osid=3891780 (CKPT)), summary=[abnormal instance
termination].

System State dumped to trace file
/oracle/diag/rdbms/p4moddb/p4moddb1/trace/p4moddb1_diag_13633_20180816145656.trc

Left to my own devices, I searched the HP multipathing documentation myself and finally found two suspicious related parameters:

At this point the problem was readily solved!

4) The fourth round of testing

2018-08-01T16:57:45.214182+08:00

Instance terminated by USER, pid =
10158

2018-08-01T16:57:36.704927+08:00

Errors in file
/oracle/diag/rdbms/node1/node11/trace/node11_ckpt_10158.trc:

ORA-00206: error in writing (block 3,
# blocks 1) of control file

ORA-00202: control file:
'+DG_DATA_FAB/NODE1/CONTROLFILE/current.265.981318275'

ORA-15081: failed to submit an I/O
operation to a disk

ORA-15081: failed to submit an I/O
operation to a disk

ORA-15064: communication failure with
ASM instance

2018-08-01T16:57:36.705340+08:00

Errors in file
/oracle/diag/rdbms/node1/node11/trace/node11_ckpt_10158.trc:

ORA-00221: error on write to control
file

ORA-00206: error in writing (block 3,
# blocks 1) of control file

ORA-00202: control file:
'+DG_DATA_FAB/NODE1/CONTROLFILE/current.265.981318275'

ORA-15081: failed to submit an I/O
operation to a disk

ORA-15081: failed to submit an I/O
operation to a disk

ORA-15064: communication failure with
ASM instance

Combining the patterns identified above with the test data, I made multiple attempts and thought things through carefully:

By recording the time of each storage disconnection, each log error, and each ASM disk-offline event, then repeatedly comparing and analyzing them, I found the following pattern for every instance restart:
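The ISO-8601 timestamps in the alert-log excerpts quoted above can be diffed directly to build such a timeline; a minimal Python sketch (the event labels and the choice of events are mine, for illustration):

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    """Parse an alert-log timestamp such as 2018-08-16T14:56:00.272280+08:00."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")

# Timestamps taken from the alert-log excerpts quoted in this article
events = [
    ("first read failure",  "2018-08-16T14:56:00.272280+08:00"),
    ("disk header error",   "2018-08-16T14:56:31.050369+08:00"),
    ("offline initiated",   "2018-08-16T14:56:34.418045+08:00"),
    ("instance terminated", "2018-08-16T14:56:56.478631+08:00"),
]

t0 = parse_ts(events[0][1])
for name, ts in events:
    # Seconds elapsed since the first I/O error
    print(f"+{(parse_ts(ts) - t0).total_seconds():6.1f}s  {name}")
```

Running this over each test round makes the recurring intervals (roughly 30s, 100s, 70s) stand out immediately.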

Since the control file was where the problem showed up, could the large number of DATA disks be making the offline detection take so long?


The Oracle documentation shows that only ASM failure groups can provide high availability here, because every control file must remain online; otherwise the instance is terminated immediately!

https://docs.oracle.com/cd/B28359_01/server.111/b28310/control002.htm#ADMIN11284

Multiplex Control Files on Different
Disks

Every Oracle Database should have at
least two control files, each stored on a different physical disk. If
a control file is damaged due to a disk failure, the associated
instance must be shut down. Once the disk drive is repaired, the
damaged control file can be restored using the intact copy of the
control file from the other disk and the instance can be restarted. In
this case, no media recovery is required.

The behavior of multiplexed control
files is this:

The database writes to all filenames
listed for the initialization parameter CONTROL_FILES in the database
initialization parameter file.

The database reads only the first file
listed in the CONTROL_FILES parameter during database
operation.

If any of the control files become
unavailable during database operation, the instance becomes inoperable
and should be aborted.

Note:

Oracle strongly recommends that your
database has a minimum of two control files and that they are located
on separate physical disks.
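In practice that multiplexing is simply the CONTROL_FILES initialization parameter listing one copy per location; a sketch using hypothetical diskgroup and file names (not the ones from this case):

```sql
-- Hypothetical example: one control file copy per diskgroup.
-- The diskgroup and file names below are illustrative only.
ALTER SYSTEM SET control_files =
  '+DG_DATA_A/p4moddb/ctl01',
  '+DG_DATA_B/p4moddb/ctl02'
  SCOPE=SPFILE SID='*';
-- An instance restart is required for the new list to take effect.
```

As the quoted documentation notes, the database writes to every listed copy but reads only the first, and losing any copy makes the instance inoperable.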

Since theory and practice diverged, there had to be a reason. I began a new round of exploration and analysis, leaving no suspicious point unexamined:

But this reintroduced a single point of failure for the control file. Was there really no way to solve this problem?

2018-08-01T16:57:41.712885+08:00

NOTE: ASM client node11:node1:node1-rac disconnected
unexpectedly

The transient_secs description specifies a quiet period after a LUN goes OFFLINE and before I/Os are failed; the default is 120s. Adding this quiet period to the timeline explains how long each of the ASM instance's offline operations took. If the multipath layer does not respond to I/O requests at all during this quiet period, the ASM instance cannot tell whether an I/O succeeded, which caused the I/O submission errors on the core database files (controlfile/redolog) and the blocking timeouts of the core background processes.

06 Summary

In the actual high-availability test, after pulling one storage array, the following was observed:

Sometimes the test succeeded, but sometimes errors during **REDO LOG** reads and writes caused the DB to restart!

He has R&D and operations experience with a marketing and billing system serving millions of users, led operations of a power-marketing business system with tens of millions of users across 11 provinces, and designed and delivered more than ten new-energy SaaS systems on the Alibaba Cloud platform. He has served as development engineer, project manager, technical manager, project director, operations lead, and cloud platform architect.

Original title: Every Cloud Has a Silver Lining: Analysis of an ASM Storage High-Availability Failure

Oracle ASM, short for Automatic Storage Management, is a feature introduced in Oracle 10g. It is a volume manager provided by Oracle to replace the LVM supplied by the operating system, and it supports both single-instance and RAC multi-instance configurations.

The main reasons this problem could be solved:

Restarting the single-storage-failure simulation, I found that the control file still became unreadable and unwritable, and the instance still restarted!

Create two independent DGs pointing to two different storage arrays, and multiplex the two members of each REDO GROUP across the two DGs.

scsimgr save_attr -N /escsi/esdisk -a
path_fail_secs=30 -a transient_secs=10
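If I recall the HP-UX 11i v3 tool correctly (please verify against man scsimgr, as this is an assumption I cannot test here), the saved values can be displayed afterwards with the get_attr verb of the same tool:

```shell
# Assumed HP-UX command: show the current values of the tuned esdisk attributes
scsimgr get_attr -N /escsi/esdisk -a esd_secs -a path_fail_secs -a transient_secs
```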

Author | Jiang Jinsong, Oracle technical expert in the expert support department of 云和恩墨 (Enmotech); Oracle OCP, MySQL OCP, and RHCE certified. He has long served mobile-carrier customers and specializes in Oracle performance tuning, fault diagnosis, and special recovery. 23 years of IT experience as a senior database and systems integration expert.

By lowering the host server's path_fail_secs and transient_secs:

1) The first round of testing

In short, esd_secs is the I/O timeout for disk block devices and path_fail_secs is the I/O timeout for a LUN path; once these timeouts expire, the multipath software marks the corresponding device or path offline. Tests run directly against the host multipath devices confirmed this behavior.
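The observed delays line up roughly with simple sums of these attributes; a back-of-the-envelope model (my own simplification, not taken from any HP document):

```python
def earliest_asm_offline_secs(path_fail_secs: int, transient_secs: int) -> int:
    """Rough model: the lunpath is declared offline after path_fail_secs,
    then the multipath layer silently holds I/Os for transient_secs before
    failing them, so ASM cannot complete the disk offline until both have
    elapsed."""
    return path_fail_secs + transient_secs

# HP-UX esdisk defaults vs. the values chosen in this case
print(earliest_asm_offline_secs(120, 120))  # defaults: a long window of unanswered I/O
print(earliest_asm_offline_secs(30, 10))    # tuned: well under the 70s hang threshold
```

Under the defaults the window of unanswered I/O far exceeds the 70s at which LMHB declares a core process hung, which is consistent with the instance terminations seen above.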

03 Re-examining the Facts

Repeated tests showed that once the ASM instance detects the I/O errors, it quickly offlines all failed disks and I/O rapidly returns to normal, confirming the conclusions above!

1. Allocate one 10GB LUN from each storage array to the server;

2. Create one DG on each LUN, and multiplex the controlfile across these two DGs.

The failover test succeeded: the ASM instance dismounts the failed DG and the database is completely unaffected!

ASM log:

The Oracle CKPT process, blocked by control file I/O errors, proactively restarted the instance; in every test the Terminate instance step began after the 70s timeout.

01 Problem Summary

3. transient_secs

Seconds to wait after a LUN has gone OFFLINE before failing I/Os; default 120s

This case study walks through the analysis and resolution of a problem at a customer site where ASM storage disaster tolerance failed to meet its goals, as a way to explore together how to tackle unexpected problems.

With no other option, I went straight to the host, ran man scsimgr to read through the parameters, and found another suspicious one:

05 A Breakthrough

(1) Starting from the 70s timeout warnings reported by the DB, I tried to find a hidden database parameter to raise the hang-detection threshold for background processes; repeated attempts showed this is a hard limit with no suitable parameter to control it;

(2) All the problems occur in the window between the database detecting I/O failures and offlining the disks, and the I/O error times match the multipath failure times. The expected behavior is that the database notices the multipath failure, quickly offlines the failed disks, and uses the mirror side! In reality the database kept retrying I/O against the failed devices until the instance eventually restarted abnormally. So either the database's decision logic was wrong, or the information returned by the storage during the failure was wrong?

(3) Since a manual disk offline completes very quickly with no delay, I focused my suspicion on an abnormal response from the storage side.

ASM greatly simplifies the Oracle DBA's work: it manages disk groups automatically and provides data redundancy and optimization. ASM offers rich management and disaster-tolerance features; properly configured, it can deliver efficient database-level storage disaster tolerance.

1. The CRS cluster is unaffected; ocr/votedisk fail over automatically;

2. After I/O errors on the DB Controlfile/Redolog block core processes such as LGWR/CKPT for a long time, Oracle proactively restarts the DB instance (one or both instances), after which the database returns to normal;

3. Database data remains intact; after the failed storage comes back online, it resynchronizes automatically.

Original author: Jiang Jinsong.

Suspecting a storage-side problem, I asked the HP engineer whether there were any relevant multipath parameters. He replied that there is a multipath timeout, default 120s, and that it cannot be adjusted.

Open questions:

Repeated tests and analysis kept pointing the suspicion at multipathing, but combing through the official manuals yielded nothing new. Was some other control mechanism causing this unexpected behavior?

1. The OS reports multipath failure and inaccessible disks 60s after the storage is disconnected, so why does the ASM instance keep the disks locked in the background, and what exactly is it doing?

2. So far the problems all occur about 100s after the I/O errors are detected. Can ASM be made to offline the asm disks quickly once the OS reports the disks inaccessible, instead of repeatedly reporting I/O errors?

3. If I manually offline the disks backing the failed storage, ASM reacts immediately and updates the DG state, yet during the failure the log keeps issuing offline requests; why can't they complete quickly?

(1) After the storage is shut off, the database starts reporting I/O errors after about 60s;

(2) The OS shows all multipaths cleanly closed, but diskinfo hangs on the disks; testing confirmed they were being locked by the Oracle ASM instance;

(3) After the first I/O error, the database needs about 100s to complete the disk offline and DG PST refresh, after which diskinfo at the OS level correctly finds the asm disks inaccessible. With the asm instance shut down, the OS reports multipath failure in 30s and diskinfo reports the asm disk devices inaccessible in 60s!

(4) With only one DB instance open, disk offline completes about 130s after the multipath is disconnected and the instance keeps running; with both DB instance nodes open, after about 70s one database's LMHB reports CKPT/LGWR blocking-timeout errors, with the blocked time continually increasing, accompanied by controlfile parallel write blocking and I/O errors; sometimes one or both instances are forcibly restarted, and sometimes everything stays normal!

(5) If the disks are manually offlined, the database completes the operation quickly and is completely unaffected!

1. ASM failure groups handle database files without any problem and achieve failover;

2. When a Normal-redundancy DG offlines disks holding the ControlFile/RedoLogfile, it blocks abnormally long and the DB instance proactively restarts; after the restart it runs normally, and data integrity is unaffected!

Since the ASM DGs use raw LUN devices, I reduced path_fail_secs from 120s to 30s. After the change, the time for the database to detect I/O errors did indeed shrink from about 60s to about 30s. But after the I/O errors, the ASM instance did not offline the disks directly; it kept retrying and reporting I/O errors, and only succeeded in offlining the failed disks after about 130s. The failure still appeared irregularly, and the problem remained unsolved!

ALERT LOG:

Attempt to multiplex the controlfile:

So this multiplexing approach does nothing for controlfile high availability!

1. The OS reports multipath failure and inaccessible disks 30s after the storage is disconnected; why does ASM keep the disks locked in the background, and what exactly is it doing?

The multipath layer enters its quiet waiting period, with I/O locked and unresponsive, so the ASM instance keeps retrying until the quiet period times out (120s)

2. The problems all occur about 100s after the I/O errors are detected. Can the ASM instance be made to offline the asm disks quickly once the OS reports the disks inaccessible, instead of repeatedly reporting I/O errors?

This cannot be controlled at the database level, because the multipath layer returns no failure information

3. Why is the failure behavior unstable?

This is most likely caused by differing I/O behavior at failure time:

1) If no core controlfile/redolog writes occur during the failure, or no write is torn mid-I/O, then once the multipath quiet period times out the system recovers on its own;

2) Otherwise, with I/O locked during the quiet period, Oracle cannot verify I/O consistency, and the timeout restarts the instance!

3) Why things return to normal after the restart is easy to explain: an instance restart takes more than 120s, so by then the quiet period has passed and the disks have been successfully offlined; naturally there is no longer a problem.

4) Does changing the default multipath parameters have other adverse effects?

The default values of these multipath parameters exist to tolerate brief storage I/O jitter and avoid system anomalies. But in our project scenario, the Oracle database's I/O-integrity requirements turned that tolerance into a failure. For a dual-storage system, once one storage's I/O fails, failing over quickly to the other storage is the best choice!

1. esd_secs

This attribute determines the timeout of I/O operations to block devices. By default esdisk sets esd_secs to 30 seconds for all devices bound to it.

2. path_fail_secs

Timeout in seconds before declaring a
lunpath offline when the device no longer responds to I/O requests
sent through this I/O path. The default value is 120 seconds.

I tried moving the controlfile to the REDO DG, which has fewer disks, but it still failed at the controlfile:

Unable to solve the problem after a full week, I adopted the following temporary workaround:


04 First Light

Background:

Suspecting that the ASM instance's disk offline was too slow, I hoped to raise the CKPT blocking-time threshold to solve the problem, but found no corresponding parameter.

3) The third round of testing

1. Storage cables fully pulled: 16:56:05

2. Instance went down at 16:57:37-16:57:39

1. Oracle 12.2 RAC + ASM Normal Redundancy, with the database on a dual-storage redundant architecture to avoid service interruption and data loss from a single storage failure;

2. The ASM DiskGroup is designed with two Failgroups (FG): one FG's disks are all on storage #1, and the other FG's disks are all on storage #2;

3. Expectation: any storage failure or power loss leaves the database instances unaffected with no data loss, and data resynchronizes automatically once the failed storage comes back online.

Although the problem is solved, some questions deserve reflection and answers:


2018-08-16T14:56:04.957199+08:00
--> I/O errors detected at almost the same time as the DB

WARNING: Write Failed. group:1 disk:1
AU:1 offset:4190208 size:4096

path:/dev/rdisk/MES1_p4_moddb_ocrvote

incarnation:0xf0996d00 asynchronous
result:'I/O error'

subsys:System krq:0x9ffffffffd285710
bufp:0x9ffffffffd299000 osderr1:0x69b7 osderr2:0x0

IO elapsed time: 0 usec Time waited on
I/O: 0 usec

……

2018-08-16T14:56:34.400892+08:00

WARNING: Disk 4 (MES1_REDO02) in group
2 mode 0x7f is now being offlined

……

ORA-27072: File I/O error

HPUX-ia64 Error: 11: Resource
temporarily unavailable

Additional information: 4

Additional information: 8188

Additional information:
4294967295

GMON checking disk modes for group 2 at
2460 for pid 49, osid 13802

2018-08-16T14:56:35.056049+08:00

NOTE: checking PST for grp 2
done.

2018-08-16T14:56:35.056382+08:00

Errors in file
/grid/crs_base/diag/asm/+asm/+ASM1/trace/+ASM1_gmon_18741.trc:

ORA-27072: File I/O error

HPUX-ia64 Error: 11: Resource
temporarily unavailable

Additional information: 4

Additional information: 8188

Additional information:
4294967295

ORA-27072: File I/O error

HPUX-ia64 Error: 11: Resource
temporarily unavailable

Additional information: 4

Additional information: 8188

Additional information:
4294967295

ORA-27072: File I/O error

HPUX-ia64 Error: 11: Resource
temporarily unavailable

Additional information: 4

Additional information: 8188

Additional information:
4294967295

2018-08-16T14:56:35.056592+08:00

GMON checking disk modes for group 2 at
2461 for pid 19, osid 18728

NOTE: initiating PST update: grp 2
(DG_REDO_MOD), dsk = 4/0xf0996e2a, mask = 0x6a, op = clear
mandatory

GMON checking disk modes for group 1 at
2462 for pid 57, osid 24043

2018-08-16T14:56:35.057277+08:00

NOTE: checking PST for grp 2
done.

NOTE: initiating PST update: grp 2
(DG_REDO_MOD), dsk = 4/0xf0996e2a, mask = 0x6a, op = clear
mandatory

2018-08-16T14:56:35.059431+08:00

NOTE: checking PST for grp 1
done.

2018-08-16T14:56:35.059697+08:00

NOTE: sending set offline flag message
(3413962691) to 1 disk(s) in group 1

2018-08-16T14:56:35.059796+08:00

GMON checking disk modes for group 4 at
2463 for pid 59, osid 24045

2018-08-16T14:56:36.205496+08:00

NOTE: Attempting voting file refresh on
diskgroup DG_CRS_MOD

WARNING: Read Failed. group:1 disk:1
AU:0 offset:0 size:4096

path:/dev/rdisk/MES1_p4_moddb_ocrvote

incarnation:0xf0996d00 asynchronous
result:'I/O error'

subsys:System krq:0x9ffffffffd4a4068
bufp:0x9ffffffffcf91000 osderr1:0x534b4950 osderr2:0x0

IO elapsed time: 0 usec Time waited on
I/O: 0 usec

NOTE: Refresh completed on diskgroup
DG_CRS_MOD. Found 3 voting file(s).

NOTE: Voting file relocation is
required in diskgroup DG_CRS_MOD

NOTE: Attempting voting file relocation
on diskgroup DG_CRS_MOD

WARNING: Read Failed. group:1 disk:1
AU:0 offset:0 size:4096

path:/dev/rdisk/MES1_p4_moddb_ocrvote

incarnation:0xf0996d00 asynchronous
result:'I/O error'

subsys:System krq:0x9ffffffffd4a4068
bufp:0x9ffffffffcf84000 osderr1:0x534b4950 osderr2:0x0

IO elapsed time: 0 usec Time waited on
I/O: 0 usec

NOTE: Successful voting file relocation
on diskgroup DG_CRS_MOD

2018-08-16T14:56:52.061661+08:00

System state dump requested by (instance=1, osid=3906536),
summary=[SYSTEMSTATE_GLOBAL: global system state dump request
(kjdgdss_g)].

System State dumped to trace file
/grid/crs_base/diag/asm/+asm/+ASM1/trace/+ASM1_diag_18708_20180816145652.trc

2018-08-16T14:56:55.250804+08:00

WARNING: Read Failed. group:2 disk:4 AU:0 offset:0 size:4096

2018-08-16T14:56:55.250889+08:00

WARNING: Read Failed. group:2 disk:4 AU:7 offset:245760 size:4096

2018-08-16T14:56:55.251279+08:00

path:/dev/rdisk/MES1_p4_moddb_redo02

2018-08-16T14:56:55.251590+08:00

incarnation:0xf0996e2a synchronous result:'I/O error'

subsys:System krq:0x9ffffffffd4b2f80 bufp:0x9ffffffffd407000
osderr1:0x69b7 osderr2:0x0

2018-08-16T14:56:55.251753+08:00

path:/dev/rdisk/MES1_p4_moddb_redo02

IO elapsed time: 0 usec Time waited on I/O: 0 usec

2018-08-16T14:56:55.252112+08:00

incarnation:0xf0996e2a synchronous result:'I/O error'

subsys:System krq:0x9ffffffffd225ec0 bufp:0x9ffffffffcc0f000
osderr1:0x69c0 osderr2:0x0

IO elapsed time: 0 usec Time waited on I/O: 0 usec

WARNING: cache failed reading from group=2(DG_REDO_MOD) fn=8 blk=60
count=1 from disk=4 (MES1_REDO02) mirror=0 kfkist=0x20 status=0x07
osderr=0x69c0 file=kfc.c line=12986

WARNING: cache succeeded reading from group=2(DG_REDO_MOD) fn=8
blk=60 count=1 from disk=3 (MES2_REDO01) mirror=1 kfkist=0x20
status=0x01 osderr=0x0 file=kfc.c line=13031

2018-08-16T14:56:55.274598+08:00

SUCCESS: extent 0 of file 8 group 2 - all online mirror sides found
readable, no repair required
--> All online mirror sides are intact

DB:

2) The second round of testing

After many repeated tests, the problem appeared randomly, so I strongly suspected an Oracle bug. On MOS I found a similar one: "Bug 23179662 - ASM B-slave Process Blocking Fatal background Process like LGWR producing ORA-29771 (Doc ID 23179662.8)", but MOS states that this bug was already fixed in the 20180417 PSU, and its workaround is simply to restart the instance.

Preface

1. Firm determination: early testing had shifted attention to an Oracle bug. After MOS yielded nothing, a workaround sidestepped the issue but did not fully solve controlfile high availability. The principle of being accountable to the customer forced me into a do-or-die effort to analyze the patterns carefully, which produced a clear direction for locating the problem;

2. Pooling resources and communicating: after several days without a solution, I sought help from company experts and an Oracle SR; although they did not ultimately solve the problem, they ruled out some analysis directions and contributed ideas. I also talked with the customer's lead and learned this was not an isolated case: their similar systems on AIX/HPUX/LINUX and Veritas file systems, on both 11g and 12c RAC, had never solved it. This made me realize an Oracle bug was unlikely, and that the underlying support platform was more likely the key!

3. "Sincerity can move mountains": on the road of technical exploration you cannot detour around difficulties! With no HP engineer support and no effective reply from the Oracle SR, careful in-depth analysis and dozens of tireless repeated tests brought about the final solution;

4. Throughout the process I made full use of online resources, analyzing and integrating what I found: the HP multipathing documentation gave me the first ideas and confidence, and the man pages then led me to the key to the solution!

5. The case forced me to review the overall workflows and principles of RAC and ASM; only with that foundation can one find patterns amid complex symptoms and logs and quickly reach a solution.

Based on the test process above, the following was observed:

Store the controlfile on a single RPT storage array, avoiding the blocking caused by controlfile synchronization.

(1) Move the controlfile to a third storage array;

(2) Use Oracle's multiplexing feature to store the two redo members on different storage arrays.
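Step (2) above relies on Oracle's standard redo multiplexing; a hypothetical sketch (the group number and diskgroup names are illustrative, not those of this system):

```sql
-- Add one OMF member of redo group 1 in each diskgroup
-- (repeat for every redo group; names are illustrative)
ALTER DATABASE ADD LOGFILE MEMBER '+DG_REDO_A' TO GROUP 1;
ALTER DATABASE ADD LOGFILE MEMBER '+DG_REDO_B' TO GROUP 1;
```

With one member per storage array, losing either array still leaves a readable member of every group.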

Moving the controlfile and redo logs back into the Normal Diskgroup, the tests showed the database instances behaving in several ways: staying normal, node 1 restarting, or both nodes restarting; the failure pattern was irregular!

System state dump file:

—– Beginning of Customized Incident
Dump(s) —–

Process CKPT (ospid: 4693) is waiting
for event 'control file sequential read'.

Process O009 (ospid: 5080) is the
blocker of the wait chain.

===[ Wait Chain ]===

CKPT (ospid: 4693) waits for event
'control file sequential read'.

LGWR (ospid: 4691) waits for event 'KSV
master wait'.

O009 (ospid: 5080) waits for event 'ASM
file metadata operation'.

node1_lgwr_4691.trc

—– END DDE Actions Dump (total 0
csec) —–

ORA-15080: synchronous I/O operation
failed to write block 1031 of disk 4 in disk group
DG_REDO_MOD

ORA-27063: number of bytes read/written
is incorrect

HPUX-ia64 Error: 11: Resource
temporarily unavailable

Additional information:
4294967295

Additional information: 1024

NOTE: process _lgwr_node1 (4691)
initiating offline of disk 4.4042263303 (DG_REDO_MOD_0004) with
mask 0x7e in group 3 (DG_REDO_MO

D) with client assisting

02 Test Process

alert_+ASM1.log:

Through repeated tests, I carefully catalogued the timestamps of the key events; an example follows:

相关文章