Troubleshooting Oracle Exadata Database Service on Cloud@Customer Systems
These topics cover some common issues you might run into and how to address them.
- Patching Failures on Oracle Exadata Database Service on Cloud@Customer Systems
- Obtaining Further Assistance
- VM Operating System Update Hangs During Database Connection Drain
- Adding a VM to a VM Cluster Fails
- Nodelist is not Updated for Data Guard-Enabled Databases
- CPU Offline Scaling Fails
- Standby Database Fails to Restart After Switchover in Oracle Database 11g Oracle Data Guard Setup
- Using Custom SCAN Listener Port With Data Guard On Disaster Recovery Network Causes Data Guard Association Verification Failures
- PDB Creation Fails After Moving Database to a New DB Home (23ai)
Patching Failures on Oracle Exadata Database Service on Cloud@Customer Systems
Patching operations can fail for various reasons. Typically, an operation fails because a database node is down, there is insufficient space on the file system, or the virtual machine cannot access the object store.
- Determining the Problem: In the Console, you can identify a failed patching operation by viewing the patch history of an Oracle Exadata Database Service on Cloud@Customer system or an individual database.
- Troubleshooting and Diagnosis: Diagnose the most common issues that can occur during the patching process of any of the Oracle Exadata Database Service on Cloud@Customer components.
Determining the Problem
In the Console, you can identify a failed patching operation by viewing the patch history of an Oracle Exadata Database Service on Cloud@Customer system or an individual database.
A patch that was not successfully applied displays a status of Failed and includes a brief description of the error that caused the failure. If the error message does not contain enough information to point you to a solution, you can use the database CLI and log files to gather more data. Then, refer to the applicable section in this topic for a solution.
Troubleshooting and Diagnosis
Diagnose the most common issues that can occur during the patching process of any of the Oracle Exadata Database Service on Cloud@Customer components.
- Database Server VM Issues: One or more of the following conditions on the database server VM can cause patching operations to fail.
- Oracle Grid Infrastructure Issues: One or more of the following conditions on Oracle Grid Infrastructure can cause patching operations to fail.
- Oracle Database Issues: An improper database state can lead to patching failures.
Database Server VM Issues
One or more of the following conditions on the database server VM can cause patching operations to fail.
Database Server VM Connectivity Problems
Cloud tooling relies on the proper networking and connectivity configuration between the virtual machines of a given VM cluster. If the configuration is not set properly, this can cause failures in all operations that require cross-node processing, for example, being unable to download the files required to apply a given patch.
In this case, you can perform the following actions:
- Verify that your DNS configuration is correct so that the relevant virtual machine addresses are resolvable within the VM cluster, as sketched after this list.
- Refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.
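As a quick first check, the following is a minimal sketch of verifying name resolution and basic reachability from one VM of the cluster; the host names node1 and node2 are placeholders for your actual VM cluster node names:

# Confirm that each cluster node name resolves through the configured DNS
nslookup node1
nslookup node2
# Confirm basic network reachability between the nodes
ping -c 3 node2

If any node name fails to resolve, correct the DNS configuration before retrying the patching operation.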
Oracle Grid Infrastructure Issues
One or more of the following conditions on Oracle Grid Infrastructure can cause patching operations to fail.
Oracle Grid Infrastructure is Down
Oracle Clusterware enables servers to communicate with each other so that they can function as a collective unit. The cluster software must be up and running on the VM cluster for patching operations to complete. Occasionally you might need to restart Oracle Clusterware to resolve a patching failure.
To check the status of Oracle Clusterware, run the following command:
./crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
If Oracle Clusterware is down, restart it on all nodes, and then verify that it is back online:
crsctl start cluster -all
crsctl check cluster
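If the cluster reports itself online but patching still fails, reviewing the state of the individual Clusterware resources can help narrow down the problem. The following is a minimal sketch; the Grid home path /u01/app/19.0.0.0/grid is only an example and should be adjusted to your environment:

# Check the Clusterware daemons on the local node
/u01/app/19.0.0.0/grid/bin/crsctl check crs
# Show the status of all cluster resources on all nodes in tabular form
/u01/app/19.0.0.0/grid/bin/crsctl stat res -t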
Oracle Database Issues
An improper database state can lead to patching failures.
Oracle Database is Down
The database must be active and running on all the active nodes so that patching operations can complete successfully across the cluster.
To check the status of a database, run the following command:
srvctl status database -d db_unique_name -verbose
The system returns a message that includes the database instance status. The instance status must be Open for the patching operation to succeed.
If the database is not running, start it:
srvctl start database -d db_unique_name -o open
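For example, a check-and-start sequence might look like the following sketch; the database name exadb and instance name exadb1 are placeholders, and the exact output wording can vary by release:

srvctl status database -d exadb -verbose
# Output similar to:
#   Instance exadb1 is running on node node1. Instance status: Open.
# If the database is down, start it, or start a single instance:
srvctl start database -d exadb -o open
srvctl start instance -d exadb -i exadb1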
Obtaining Further Assistance
If you were unable to resolve the problem using the information in this topic, follow the procedures below to collect relevant database and diagnostic information. After you have collected this information, contact Oracle Support.
- Collecting Cloud Tooling Logs: Use the relevant log files that could assist Oracle Support in further investigation and resolution of a given issue.
- Collecting Oracle Diagnostics
Collecting Cloud Tooling Logs
Use the relevant log files that could assist Oracle Support in further investigation and resolution of a given issue.
DBAASCLI Logs
The dbaascli.log file is located in the /var/opt/oracle/log/dbaascli directory.
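For example, the following is a minimal sketch of packaging the dbaascli logs for attachment to an Oracle Support service request; the archive name is arbitrary and the command assumes the default log location shown above:

# Run as root on the affected VM
tar -czf /tmp/dbaascli_logs_$(hostname)_$(date +%Y%m%d).tar.gz /var/opt/oracle/log/dbaascli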
VM Operating System Update Hangs During Database Connection Drain
Description: This is an intermittent issue. During virtual machine operating system update with 19c Grid Infrastructure and running databases, dbnodeupdate.sh waits for RHPhelper to drain the connections, which will not progress because of a known bug "DBNODEUPDATE.SH HANGS IN RHPHELPER TO DRAIN SESSIONS AND SHUTDOWN INSTANCE".
- VM operating system update hangs in rhphelper:
  - Hangs the automation.
  - Some or none of the database connections will have drained, and some or all of the database instances will remain running.
- VM operating system update does not drain database connections because rhphelper crashed:
  - Does not hang the automation.
  - Some or none of the database connection draining completes.
The /var/log/cellos/dbnodeupdate.trc file will show this as the last line:
(ACTION:) Executing RHPhelper to drain sessions and shutdown instances. (trace:/u01/app/grid/crsdata/scaqak04dv0201/rhp//executeRHPDrain.150721125206.trc)
- Upgrade the Grid Infrastructure version to 19.11 or above, OR disable rhphelper before updating and enable it back after updating.
  To disable rhphelper before the update is started:
  /u01/app/19.0.0.0/grid/srvm/admin/rhphelper /u01/app/19.0.0.0/grid 19.10.0.0.0 -setDrainAttributes ENABLE=false
  To enable rhphelper after the update is completed:
  /u01/app/19.0.0.0/grid/srvm/admin/rhphelper /u01/app/19.0.0.0/grid oracle-home-current-version -setDrainAttributes ENABLE=true
  If you disable rhphelper, then there will be no database connection draining before database services and instances are shut down on a node ahead of the operating system update.
- If you did not disable rhphelper and the update is not progressing because the draining of services is taking a long time, then do the following:
  - Inspect the /var/log/cellos/dbnodeupdate.trc trace file, which contains a paragraph similar to the following:
    (ACTION:) Executing RHPhelper to drain sessions and shutdown instances. (trace: /u01/app/grid/crsdata/<nodename>/rhp//executeRHPDrain.150721125206.trc)
  - Open the /var/log/cellos/dbnodeupdate.trc trace file.
    If rhphelper failed, then the trace file contains the following message:
    "Failed execution of RHPhelper"
    If rhphelper is hung, then the trace file contains the following message:
    (ACTION:) Executing RHPhelper to drain sessions and shutdown instances.
  - Identify the rhphelper processes running at the operating system level and kill them. Two processes will have the string "rhphelper" in the name: a Bash shell and the underlying Java program, which is really rhphelper executing. rhphelper runs as root, so it must be killed as root (sudo from opc). For example:
    [opc@<HOST> ~] pgrep -lf rhphelper
    191032 rhphelper
    191038 java
    [opc@<HOST> ~] sudo kill -KILL 191032 191038
  - Verify that the dbnodeupdate.trc file moves forward again and that the Grid Infrastructure stack on the node is shut down (a verification sketch follows the note below).
For more information about RHPhelper, see Using RHPhelper to Minimize Downtime During Planned Maintenance on Exadata (Doc ID 2385790.1).
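The following is a minimal verification sketch for the last step above; the Grid home path is an example and should be adjusted to your environment:

# Confirm that dbnodeupdate is progressing again by checking the end of the trace file
tail -n 20 /var/log/cellos/dbnodeupdate.trc
# Confirm that the Grid Infrastructure stack on this node is down
/u01/app/19.0.0.0/grid/bin/crsctl check crs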
Adding a VM to a VM Cluster Fails
Description: Adding a VM to a VM cluster fails with the following error:
[FATAL] [INS-32156] Installer has detected that there are non-readable files in oracle home.
CAUSE: Following files are non-readable, due to insufficient permission oracle.ahf/data/scaqak03dv0104/diag/tfa/tfactl/user_root/tfa_client.trc
ACTION: Ensure the above files are readable by grid.
Cause: The installer detected a non-readable trace file, oracle.ahf/data/scaqak03dv0104/diag/tfa/tfactl/user_root/tfa_client.trc, created by Autonomous Health Framework (AHF) in the Oracle home, which causes adding a VM to the cluster to fail. AHF, running as root, created the trace file with root ownership, which the grid user is not able to read.
Action: Ensure that the AHF files are readable by the grid user before you add VMs to a VM cluster. To fix the permission issue, run the following commands as root on all the existing VM cluster VMs:
chown grid:oinstall /u01/app/19.0.0.0/grid/srvm/admin/logging.properties
chown -R grid:oinstall /u01/app/19.0.0.0/grid/oracle.ahf*
chown -R grid:oinstall /u01/app/grid/oracle.ahf*
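To confirm that the ownership change took effect before retrying the add-VM operation, the following sketch lists any AHF files still not owned by the grid user; the paths match the chown commands above, and the find command should return no output:

find /u01/app/19.0.0.0/grid/oracle.ahf* /u01/app/grid/oracle.ahf* ! -user grid 2>/dev/null
ls -l /u01/app/19.0.0.0/grid/srvm/admin/logging.properties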
Nodelist is not Updated for Data Guard-Enabled Databases
Description: Adding a VM to a VM cluster completes successfully; however, for Data Guard-enabled databases, the new VM is not added to the nodelist in the /var/opt/oracle/creg/<db>.ini file.
Cause: Data Guard-enabled databases are not extended to the newly added VM, and therefore the <db>.ini file is not updated, because the database instance is not configured on the new VM.
Action: To add an instance to the primary and standby databases and to the new VMs (non-Data Guard), and to remove an instance from a Data Guard environment, see My Oracle Support note 2811352.1.
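To confirm whether the new VM is present in the nodelist for a given database, you can inspect the creg file directly. The following is a minimal sketch, assuming a database named mydb; the exact parameter name that holds the node list can vary by tooling version:

# Run as root on an existing VM of the cluster
grep -i node /var/opt/oracle/creg/mydb.ini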
CPU Offline Scaling Fails
Description: CPU offline scaling fails with the following error:
** CPU Scale Update **
An error occurred during module execution. Please refer to the log file for more information.
Cause: After provisioning a VM cluster, the /var/opt/oracle/cprops/cprops.ini file, which is automatically generated by Database as a Service (DBaaS), is not updated with the common_dcs_agent_bindHost and common_dcs_agent_port parameters, and this causes CPU offline scaling to fail.
Action: As the root user, manually add the following entries to the /var/opt/oracle/cprops/cprops.ini file:
common_dcs_agent_bindHost=<IP_Address>
common_dcs_agent_port=7070
The common_dcs_agent_port value is always 7070.
To identify the IP address, run the following command:
netstat -tunlp | grep 7070
For example:
netstat -tunlp | grep 7070
tcp 0 0 <IP address 1>:7070 0.0.0.0:* LISTEN 42092/java
tcp 0 0 <IP address 2>:7070 0.0.0.0:* LISTEN 42092/java
You can specify either of the two IP addresses, <IP address 1> or <IP address 2>, for the common_dcs_agent_bindHost parameter.
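For example, one way to add the two entries as the root user is sketched below; the bind host value is a placeholder taken from the netstat output above, and backing up the file first is a precaution rather than a documented requirement:

# Back up cprops.ini, then append the agent parameters
cp /var/opt/oracle/cprops/cprops.ini /var/opt/oracle/cprops/cprops.ini.bak
cat >> /var/opt/oracle/cprops/cprops.ini <<EOF
common_dcs_agent_bindHost=<IP address 1>
common_dcs_agent_port=7070
EOF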
Standby Database Fails to Restart After Switchover in Oracle Database 11g Oracle Data Guard Setup
Description: After performing the switchover, the new standby (old primary) database remains shut down and fails to restart.
Action: After performing switchover, do the following:
- Restart the standby database using the srvctl start database -db <standby dbname> command.
- Reload the listener as the grid user on all primary and standby cluster nodes.
  - To reload the listener using high availability, download and apply patch 25075940 to the Grid home, and then run lsnrctl reload -with_ha.
  - To reload the listener, run lsnrctl reload.
After reloading the listener, verify that the <dbname>_DGMGRL services are loaded into the listener using the lsnrctl status command.
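For example, the following sketch, run as the grid user on each node, filters the listener status for the Data Guard broker services; the exact service names depend on your database names:

lsnrctl status | grep -i DGMGRL
# Expect entries such as <dbname>_DGMGRL to appear among the registered services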
To download patch 25075940
- Log in to My Oracle Support.
- Click Patches & Updates.
- Select Bug Number from the Number/Name or Bug Number (Simple) drop-down list.
- Enter the bug number 34741066, and then click Search.
- From the search results, click the name of the latest patch.
You will be redirected to the Patch 34741066: LSNRCTL RELOAD -WITH_HA FAILED TO READ THE STATIC ENTRY IN LISTENER.ORA page.
- Click Download.
Using Custom SCAN Listener Port With Data Guard On Disaster Recovery Network Causes Data Guard Association Verification Failures
Description: If the SCAN listener ports for the client network and the disaster recovery network (DR network) are different, then Data Guard (DG) configuration fails during the verification phase of creating a Data Guard association.
Action: Use the same SCAN listener ports (or the default port) on all networks. To fix the listener port after the cluster has been configured, run the <GI home>/bin/srvctl modify listener -listener listener_name -endpoints endpoints command. For more information, see srvctl modify listener in the Oracle Real Application Clusters Administration and Deployment Guide.
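For example, assuming a Grid home of /u01/app/19.0.0.0/grid, a listener named LISTENER, and a target port of 1521 (all placeholders for your environment), the command and a follow-up verification might look like this:

/u01/app/19.0.0.0/grid/bin/srvctl modify listener -listener LISTENER -endpoints "TCP:1521"
# Verify the new endpoint configuration
/u01/app/19.0.0.0/grid/bin/srvctl config listener -listener LISTENER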
PDB Creation Fails After Moving Database to a New DB Home (23ai)
Description: After moving a database to a new 23ai DB home, PDB creation fails with an error similar to the following:
[FATAL] [DBAAS-60022] Command '/u02/app/oracle/product/23.0.0.0/dbhome_3/bin/srvctl 'start' 'service' '-db' 'db23ano' '-service' 'db23ano_PDBJULY242.paas.oracle.com'' has failed on nodes [localnode].
Action: If the Grid Infrastructure version is 23.4.0.24.05, upgrade to version 23.5.0.24.07 to resolve this issue.
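To confirm the active Grid Infrastructure version before and after the upgrade, the following is a minimal sketch, run as the grid user:

# Reports the active cluster version
crsctl query crs activeversion
# Reports the patch level of the local Grid home (available in recent releases)
crsctl query crs releasepatch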