Top 5 Grid Infrastructure Startup Issues
In this Document
Purpose
Scope
Details
Issue #1: CRS-4639: Could not contact Oracle High Availability Services, ohasd.bin not running or ohasd.bin is
running but no init.ohasd or other processes
Issue #2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon, ocssd.bin is not
running
Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running
Issue #4: Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running
Issue #5: ASM instance does not start, ora.asm is OFFLINE
References
APPLIES TO:
PURPOSE
The purpose of this note is to provide a summary of the top 5 issues that may prevent the successful startup of the Grid
Infrastructure (GI) stack.
SCOPE
DETAILS
Issue #1: CRS-4639: Could not contact Oracle High Availability Services, ohasd.bin not running or ohasd.bin
is running but no init.ohasd or other processes
Symptoms:
OHASD starting
Timed out waiting for init.ohasd script to start; posting an alert
6. ohasd.bin keeps restarting; ohasd.log reports:
2014-08-31 15:00:25.132: [ CRSSEC][733177600]{0:0:2} Exception: PrimaryGroupEntry constructor failed to validate
group name with error: 0 groupId: 0x7f8df8022450 acl_string: pgrp:spec:r-x
2014-08-31 15:00:25.132: [ CRSSEC][733177600]{0:0:2} Exception: ACL entry creation failed for: pgrp:spec:r-x
2014-08-31 15:00:25.132: [ INIT][733177600]{0:0:2} Dump State Starting ...
7. Only ohasd.bin is running, but nothing is written to ohasd.log. The OS /var/log/messages shows:
2015-07-12 racnode1 logger: autorun file for ohasd is missing
Possible Causes:
1. For OL5/RHEL5 and earlier (and other platforms), the file '/etc/inittab' does not contain a line similar to the following
(platform dependent):
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
For OL6/RHEL6 and later, upstart is not configured properly (see the verification sketch after this list).
2. Runlevel 3 has not been reached because some rc3 script is hanging
3. The init process (pid 1) did not spawn the process defined in /etc/inittab (h1), or a bad entry before init.ohasd such as
xx:wait:<process> blocked the start of init.ohasd
4. CRS autostart is disabled (see the verification sketch after this list)
5. The Oracle Local Registry ($GRID_HOME/cdata/<node>.olr) is missing or corrupted (check as root user via
"ocrdump -local /tmp/olr.log"; /tmp/olr.log should contain information about all GI daemon processes, so compare
with a working cluster to verify)
6. The root user was previously in group "spec", but group "spec" has since been removed; the old group for the root user is still
recorded in the OLR, which can be verified in the OLR dump
7. HOSTNAME was null when init.ohasd started, especially after a node reboot
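As a quick verification of causes #1 and #4 above, the following can be run as root (a minimal sketch; the upstart file name shown for OL6/RHEL6 is the usual one but may vary by release):
# grep init.ohasd /etc/inittab (OL5/RHEL5 and earlier: the respawn entry should be present)
# cat /etc/init/oracle-ohasd.conf (OL6/RHEL6: the upstart job for init.ohasd)
# ps -ef | grep init.ohasd | grep -v grep (confirm the init.ohasd run script is active)
# crsctl config crs (reports whether CRS autostart is enabled; "crsctl enable crs" as root re-enables it)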
Solutions:
If an OLR backup does not exist for any reason, a deconfig followed by rerunning root.sh is required to
recreate the OLR, as root user:
# <GRID_HOME>/crs/install/rootcrs.pl -deconfig -force
# <GRID_HOME>/root.sh
6. Reinitializing/recreating the OLR is required, using the same commands as for recreating the OLR
above; afterwards verify it as shown in the sketch after this list.
7. Restart the init.ohasd process, or add "sleep 30" in init.ohasd to allow the hostname to be populated
correctly before Clusterware starts; refer to Note 1427234.1.
8. If the above does not help, check the OS messages for the ohasd.bin logger message and manually execute
the crswrapexece.pl command mentioned in the OS message, with LD_LIBRARY_PATH set to <GRID_HOME>/lib,
to continue debugging.
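After recreating the OLR per solutions 5/6 above, its integrity can be verified as root with the standard local-registry utilities (a minimal sketch; the dump file name is arbitrary):
# ocrcheck -local (reports OLR location, integrity and used/available space)
# ocrdump -local /tmp/olr_after.log (dump the recreated OLR and compare it with a dump from a working node)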
Issue #2: CRS-4530: Communications failure contacting Cluster Synchronization Services daemon,
ocssd.bin is not running
Symptoms:
2012-01-27 13:42:58.796: [ CSSD][19]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB,
DHB has rcfg 223132864, wrtcnt, 1112, LATS 783238209,
lastSeqNo 1111, uniqueness 1327692232, timestamp 1327693378/787089065
5. For clusters with 3 or more nodes, 2 nodes form the cluster fine but the 3rd node fails after joining; ocssd.log shows:
2012-04-08 12:04:33.153: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with
uniqueness value 1333911873
......
2012-04-08 12:14:31.994: [ CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
2012-04-08 12:14:31.994: [ CSSD][5]###################################
2012-04-08 12:14:31.994: [ CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
2012-04-08 12:14:31.994: [ CSSD][5]###################################
2012-04-08 12:14:31.994: [ CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is
terminating abnormally
7. alert<node>.log shows:
2014-02-05 06:16:56.815
[cssd(3361)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in
/u01/app/11.2.0/grid/log/bdprod2/cssd/ocssd.log
...
2014-02-05 06:27:01.707
[ohasd(2252)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'bdprod2'.
2014-02-05 06:27:02.075
[ohasd(2252)]CRS-2771:Maximum restart attempts reached for resource 'ora.cssd'; will not restart.
Possible Causes:
Solutions:
1. Restore voting disk access by checking storage access, disk permissions, etc.
If the disk is not accessible at the OS level, engage the system administrator to restore the
disk access.
If the voting disk is missing from the OCR ASM diskgroup, start CRS in exclusive mode and
recreate the voting disk (see the verification sketch after this list):
# crsctl start crs -excl
# crsctl replace votedisk <+OCRVOTE diskgroup>
2. Refer to Document 1212703.1 for the multicast test and fix. For 11.2.0.3 PSU5/PSU6/PSU7 or
12.1.0.1, either enable multicast for the private network or apply patch 16547309 or the latest PSU;
refer to Document 1564555.1.
3. Consult the network administrator to restore private network access or disable the firewall
for the private network (for Linux, check "service iptables status" and "service ip6tables status").
4. Kill the gpnpd.bin process on the surviving node; refer to Document 10105195.8.
Once the above issues are resolved, restart the Grid Infrastructure stack.
If ping/traceroute work fine for the private network but a failed 11.2.0.1 to 11.2.0.2
upgrade has occurred, check
Bug 13416559 for the workaround.
5. Limit the number of ASM disks scanned by supplying a more specific asm_diskstring; refer to bug
13583387.
For Solaris 11.2.0.3 only, apply patch 13250497; see Note 1451367.1.
6. Refer to the solution and workaround in Note 1479380.1
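After recreating the voting disk in exclusive mode per solution 1 above, a minimal verification and restart sequence as root user would be:
# crsctl query css votedisk (list the voting disks and confirm they are ONLINE)
# crsctl stop crs -f (leave exclusive mode)
# crsctl start crs (restart the Grid Infrastructure stack normally on this node)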
Issue #3: CRS-4535: Cannot communicate with Cluster Ready Services, crsd.bin is not running
Symptoms:
Possible Causes:
Solutions:
1. Check the solution for Issue #2; ensure ocssd.bin is running and ora.cssd is ONLINE.
2. For 11.2.0.2+, ensure that the resource ora.cluster_interconnect.haip is ONLINE; refer to
Document 1383737.1 for ASM startup issues related to HAIP.
Check whether the GRID_HOME/bin/oracle binary is linked with the RAC option (Document 284785.1).
3. Ensure the OCR disk is available and accessible; "ocrcheck", as shown in the sketch after this list,
reports OCR integrity. If the OCR is lost for any reason, refer to Document 1062983.1 on how to restore the OCR.
4. Restore the network configuration to match the interface defined in
$GRID_HOME/gpnp/<node>/profiles/peer/profile.xml; refer to Document 283684.1 for private
network modification.
5. Touch the <host>.pid file under $GRID_HOME/crs/init.
For 11.2.0.1, the file is owned by the <grid> user.
For 11.2.0.2, the file is owned by the root user.
6. Use the ocrconfig -repair command to fix the ocr.loc content; for example, as root user:
# ocrconfig -repair -add +OCR2 (to add an entry)
# ocrconfig -repair -delete +OCR2 (to remove an entry)
ohasd.bin needs to be up and running in order for the above command to run.
Once the above issues are resolved, either restart the GI stack or start crsd.bin via:
# crsctl start res ora.crsd -init
7. Engage the network administrator to enable jumbo frames at the switch layer if they are enabled at the
network interface. If jumbo frames are not required, change the MTU to 1500 for the private network on all
nodes, then restart the GI stack on all nodes.
8. On AIX 6.1 TL08 SP01 and AIX 7.1 TL02 SP01, apply the AIX patch per Document 1528452.1 (AIX 6.1
TL8 or 7.1 TL2: 11gR2 GI Second Node Fails to Join the Cluster as CRSD and EVMD are in
INTERMEDIATE State).
9. Increase udp_sendspace to the recommended value; refer to Document 1280234.1.
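To confirm whether crsd.bin and the OCR are healthy after applying any of the fixes above, the following quick checks can be run as root (a minimal sketch using standard clusterware utilities):
# crsctl stat res -t -init (ora.crsd should report ONLINE)
# ocrcheck (reports OCR location, integrity and available space)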
Issue #4: Agent or mdnsd.bin, gpnpd.bin, gipcd.bin not running
Symptoms:
Possible Causes:
Solutions:
1. Either compare the permission/ownership with the GRID_HOME of a good node and make corrections
accordingly, or as root user:
# cd <GRID_HOME>/crs/install
# ./rootcrs.pl -unlock
# ./rootcrs.pl -patch
This will stop the clusterware stack, set permission/ownership to root for the required files, and restart the
clusterware stack.
2. If the corresponding <node>.pid does not exist, touch the file with the correct ownership and
permissions; otherwise correct the <node>.pid ownership/permissions as required, then restart the
clusterware stack (see the sketch after the list below).
Here is the list of <node>.pid files under <GRID_HOME>, owned by root:root, permission 644:
./ologgerd/init/<node>.pid
./osysmond/init/<node>.pid
./ctss/init/<node>.pid
./ohasd/init/<node>.pid
./crs/init/<node>.pid
Owned by <grid>:oinstall, permission 644:
./mdns/init/<node>.pid
./evm/init/<node>.pid
./gipc/init/<node>.pid
./gpnp/init/<node>.pid
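A minimal sketch of recreating one of the root-owned pid files listed above (the crs pid file is used as an example; substitute the file, owner and group that apply):
# touch <GRID_HOME>/crs/init/<node>.pid
# chown root:root <GRID_HOME>/crs/init/<node>.pid
# chmod 644 <GRID_HOME>/crs/init/<node>.pid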
Issue #5: ASM instance does not start, ora.asm is OFFLINE
Symptoms:
Possible Causes:
Solutions:
1. Create a temporary pfile to start the ASM instance, then recreate the spfile; see Document 1095214.1
for more details and the sketch after this list.
2. Refer to Document 1077094.1 to correct the ASM discovery string.
3. Refer to Document 1050164.1 to fix ASMlib configuration.
4. Refer to Document 1383737.1 for the solution. For more information about HAIP, refer to
Document 1210883.1.
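As a rough illustration of solution 1 (a sketch only; the pfile location, the disk string, and the +DATA diskgroup name are placeholders that must match your environment):
Example temporary pfile /tmp/asm_pfile.ora:
instance_type='asm'
asm_diskstring='/dev/mapper/*'
asm_power_limit=1
Then, as the grid user connected to the ASM instance via "sqlplus / as sysasm":
SQL> startup pfile='/tmp/asm_pfile.ora';
SQL> create spfile='+DATA' from pfile='/tmp/asm_pfile.ora';
Restart ora.asm through clusterware afterwards so the instance picks up the recreated spfile.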
For further debugging of GI startup issues, refer to Document 1050908.1, Troubleshoot Grid Infrastructure Startup
Issues.
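For a quick overall health check while working through any of the issues above, the following standard commands (run as root or the grid user) summarize the state of the local and clusterwide stack:
# crsctl check crs (status of Oracle High Availability Services, CRS, CSS and EVM on the local node)
# crsctl check cluster -all (clusterwide daemon status)
# crsctl stat res -t -init (state of lower-stack resources such as ora.cssd, ora.crsd and ora.asm)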