Oracle Data Guard Concepts and Administration Release 2 (9.2), Part Number A96653-02
This appendix provides help with troubleshooting a standby database.
If you encounter a problem during standby database preparation, it will probably be one of the following:
If the STANDBY_ARCHIVE_DEST initialization parameter is not defined as a valid directory name on the standby site, the Oracle database server will not be able to determine the directory in which to store the archived redo logs. Check the DESTINATION and ERROR columns in the V$ARCHIVE_DEST view. For example, enter:
SQL> SELECT DESTINATION, ERROR FROM V$ARCHIVE_DEST;
Make sure the destination is valid.
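If the destination is missing or invalid, you might define it in the standby initialization parameter file with an entry similar to the following sketch (the directory name is hypothetical):
STANDBY_ARCHIVE_DEST=/oracle/standby/arc_dest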
If the standby site is not receiving the logs, the first thing you should do is obtain information about the archiving status of the primary database by querying the V$ARCHIVE_DEST view. Check especially for error messages. For example, enter the following query:
SQL> SELECT DEST_ID "ID",
  2> STATUS "DB_status",
  3> DESTINATION "Archive_dest",
  4> ERROR "Error"
  5> FROM V$ARCHIVE_DEST;

ID DB_status Archive_dest                   Error
-- --------- ------------------------------ ------------------------------------
 1 VALID     /vobs/oracle/work/arc_dest/arc
 2 ERROR     standby1                       ORA-16012: Archivelog standby database identifier mismatch
 3 INACTIVE
 4 INACTIVE
 5 INACTIVE

5 rows selected.
If the output of the query does not help you, check the following list of possible issues. If any of the following conditions exist, the primary database will fail to archive to the standby site:
- The service name for the standby instance is not configured correctly in the tnsnames.ora file at the primary site.
- The LOG_ARCHIVE_DEST_n parameter of the primary initialization parameter file is incorrect.
- The LOG_ARCHIVE_DEST_STATE_n parameter specifying the state of the standby archiving destination has the value DEFER (see the example following this list).
- The listener.ora file has not been configured correctly at the standby site.
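For example, if a destination was deferred, you can re-enable it dynamically on the primary database; the destination number shown here is hypothetical:
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2=ENABLE;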
If any of the following conditions exist, you cannot mount the physical standby database:
- The standby instance was not started in NOMOUNT mode. You must first start the instance and then mount the database.
- The standby control file was not created with the ALTER DATABASE CREATE STANDBY CONTROLFILE ... statement or with RMAN. You cannot use the following types of control file backups:
If you specify REOPEN for an OPTIONAL destination, it is possible for the Oracle database server to reuse online redo logs even if there is an error. If you specify REOPEN for a MANDATORY destination, the log transport services component stalls the primary database when it cannot successfully archive redo logs.
The REOPEN attribute is required when you use the MAX_FAILURE attribute. Example A-1 shows how to set a retry time of 5 seconds and limit retries to 3 times.
LOG_ARCHIVE_DEST_1='LOCATION=/arc_dest REOPEN=5 MAX_FAILURE=3'
Using the ALTERNATE attribute of the LOG_ARCHIVE_DEST_n parameter, you can specify alternate archive destinations. An alternate archive destination can be used when the archiving of an online redo log to a standby site fails. If archiving fails and the NOREOPEN attribute has been specified, or the MAX_FAILURE attribute threshold has been exceeded, log transport services will attempt to archive redo logs to the alternate destination on the next archiving operation.
Use the NOALTERNATE attribute to prevent the original archive destination from automatically changing to an alternate archive destination when the original archive destination fails.
Example A-2 shows how to set the initialization parameter file so that a single, mandatory, local destination will automatically fail over to a different destination if any error occurs.
LOG_ARCHIVE_DEST_1='LOCATION=/disk1 MANDATORY ALTERNATE=LOG_ARCHIVE_DEST_2'
LOG_ARCHIVE_DEST_STATE_1=ENABLE
LOG_ARCHIVE_DEST_2='LOCATION=/disk2 MANDATORY'
LOG_ARCHIVE_DEST_STATE_2=ALTERNATE
If the LOG_ARCHIVE_DEST_1 destination fails, the archiving process will automatically switch to the LOG_ARCHIVE_DEST_2 destination at the next log switch on the primary database.
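To confirm which destination is currently being used, you can query the V$ARCHIVE_DEST view again; this is only a sketch, and the destination numbers depend on your configuration:
SQL> SELECT DEST_ID, STATUS, DESTINATION FROM V$ARCHIVE_DEST WHERE DEST_ID IN (1, 2);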
An important skip tool is DBMS_LOGSTDBY.SKIP_ERROR. Depending on how important a table is, you might want to do one of the following:
See Also:
Oracle9i Supplied PL/SQL Packages and Types Reference for more information about using the DBMS_LOGSTDBY package
Taking one of these actions prevents the SQL apply operations from stopping. Later, you can query the DBA_LOGSTDBY_EVENTS view to find and correct any problems that exist.
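For example, a query similar to the following sketch lists recent events in the order they occurred; the columns you choose to select can vary:
SQL> SELECT EVENT_TIME, STATUS, EVENT
  2> FROM DBA_LOGSTDBY_EVENTS
  3> ORDER BY EVENT_TIME;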
If you encounter a problem switching over from a primary database to a standby database, it will probably be one of the following:
ALTER DATABASE COMMIT TO SWITCHOVER failed with ORA-01093 error "Alter database close only permitted with no sessions connected."
This error occurs because the COMMIT TO SWITCHOVER statement implicitly closes the database and, if any other user sessions are connected to the database, the close fails.
Action: Make sure all user sessions are disconnected from the database. You can query the V$SESSION fixed view to see which sessions are still connected. For example:
SQL> SELECT SID, PROCESS, PROGRAM FROM V$SESSION;

       SID PROCESS   PROGRAM
---------- --------- ------------------------------------------------
         1 26900     oracle@dbuser-sun (PMON)
         2 26902     oracle@dbuser-sun (DBW0)
         3 26904     oracle@dbuser-sun (LGWR)
         4 26906     oracle@dbuser-sun (CKPT)
         5 26908     oracle@dbuser-sun (SMON)
         6 26910     oracle@dbuser-sun (RECO)
         7 26912     oracle@dbuser-sun (ARC0)
         8 26897     sqlplus@dbuser-sun (TNS V1-V3)
        11 26917     sqlplus@dbuser-sun (TNS V1-V3)

9 rows selected.
In the previous example, the first seven sessions are all server background processes. Among the two SQL*Plus sessions, one is the current SQL*Plus session issuing the query, and the other is an extra session that should be disconnected before the switchover operation.
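If the extra session cannot be closed gracefully, one option is to terminate it from another session. The following sketch assumes you also retrieve the SERIAL# value; the SID and SERIAL# values shown are hypothetical:
SQL> SELECT SID, SERIAL#, PROGRAM FROM V$SESSION WHERE SID = 11;
SQL> ALTER SYSTEM KILL SESSION '11,47';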
In most cases, following the steps described in Section 7.2.1 will result in a successful switchover operation. However, if the switchover operation is initially unsuccessful, you might still be able to use one of the following recovery options to complete the switchover operation successfully.
If the switchover operation does not complete successfully, you can query the SEQUENCE# column in the V$ARCHIVED_LOG view to see if the last archived log was archived and applied on the old physical standby database. If the last log was not archived to the old physical standby database, you can manually copy the archived log from the old primary database to the old physical standby database and register it with the SQL ALTER DATABASE REGISTER LOGFILE filespec statement. If you then start up the managed recovery process, the archived log will be applied automatically.
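For example, after copying the missing archived log to the old physical standby site, you might register it with a statement similar to the following; the file name is hypothetical:
SQL> ALTER DATABASE REGISTER LOGFILE '/oracle/standby/arc_dest/arc_1_250.arc';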
Query the SWITCHOVER_STATUS column in the V$DATABASE view. The TO PRIMARY value in the SWITCHOVER_STATUS column verifies that switchover to the primary role is now possible.
SQL> SELECT SWITCHOVER_STATUS FROM V$DATABASE;

SWITCHOVER_STATUS
-----------------
TO PRIMARY

1 row selected
See Also:
Chapter 14 for information about other valid values for the SWITCHOVER_STATUS column of the V$DATABASE view
To continue with the switchover operation, return to Section 7.2.1 Step 5, and try again to switch the target physical standby database to the primary role.
In situations where an error has occurred and it is not possible to continue with the switchover operation, it might still be possible to revert the new physical standby back to the primary role by using the following steps:
See Also:
Section 3.2.3 for information about creating a standby control file.
If this procedure is successful and archive gap management is enabled, the FAL processes will start and re-archive any missing archived redo logs to the physical standby database. Force a log switch on the primary database and examine the alert logs on both the primary database and physical standby database to ensure that the archived redo log sequence numbers are correct.
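For example, to force a log switch on the primary database:
SQL> ALTER SYSTEM SWITCH LOGFILE;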
See Also:
Section 6.4 for information about archive gap management and Section 6.7 for information about locating the trace files.
At this point, the Data Guard configuration has been rolled back to its initial state, and you can try the switchover operation again (after correcting any problems that might have led to the initial unsuccessful switchover operation).
Suppose the standby database and the primary database reside on the same site. After both the ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY and the ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY statements are successfully executed, shut down and restart the physical standby database and the primary database. However, the startup of the second database fails with ORA-01102 error "cannot mount database in EXCLUSIVE mode."
This could happen during the switchover if you forget to set the LOCK_NAME_SPACE parameter in the initialization parameter file that is used by the standby database (that is, the original primary database). If the LOCK_NAME_SPACE parameter of the standby database is not set, the standby and the primary databases both use the same mount lock and cause the ORA-01102 error during the startup of the second database.
Action: Add LOCK_NAME_SPACE=unique_lock_name to the initialization parameter file used by the physical standby database and shut down and restart both the standby and the primary databases.
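For example, the initialization parameter file used by the physical standby database might contain an entry similar to the following; the lock name is hypothetical and only needs to differ from the name used by the primary database:
LOCK_NAME_SPACE=standby1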
The archived redo logs are not applied to the standby database after the switchover.
This might happen because some environment or initialization parameters have not been properly set after the switchover.
Action:
1. Check the tnsnames.ora file at the primary site and the listener.ora file at the standby site. There should be entries for a listener at the standby site and a corresponding tnsname at the primary site.
2. Verify that the LOG_ARCHIVE_DEST_n initialization parameter has been set to properly archive logs from the primary site to the standby site. For example, query the V$ARCHIVE_DEST fixed view at the primary site as follows:
SQL> SELECT DEST_ID, STATUS, DESTINATION FROM V$ARCHIVE_DEST;
If you do not see an entry corresponding to the standby site, you need to set the LOG_ARCHIVE_DEST_n and LOG_ARCHIVE_DEST_STATE_n initialization parameters.
3. Set the STANDBY_ARCHIVE_DEST and LOG_ARCHIVE_FORMAT initialization parameters correctly at the standby site so that the archived redo logs are applied to the desired location.
4. Set the DB_FILE_NAME_CONVERT and LOG_FILE_NAME_CONVERT initialization parameters at the standby site. Set the STANDBY_FILE_MANAGEMENT initialization parameter to AUTO if you want the standby site to automatically add new datafiles that are created at the primary site. A parameter file sketch illustrating steps 2 through 4 follows this list.
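The following initialization parameter fragments sketch the kinds of settings described in steps 2 through 4; the service name, directory paths, and file name format are hypothetical:
# Primary site: archive redo logs to the standby service
LOG_ARCHIVE_DEST_2='SERVICE=standby1'
LOG_ARCHIVE_DEST_STATE_2=ENABLE

# Standby site: where incoming archived redo logs are stored and how they are named
STANDBY_ARCHIVE_DEST=/oracle/standby/arc_dest
LOG_ARCHIVE_FORMAT='arc%t_%s.arc'

# Standby site: file name conversion and automatic datafile management
DB_FILE_NAME_CONVERT=('/oracle/primary/data','/oracle/standby/data')
LOG_FILE_NAME_CONVERT=('/oracle/primary/redo','/oracle/standby/redo')
STANDBY_FILE_MANAGEMENT=AUTO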
If you do not include the WITH SESSION SHUTDOWN clause as a part of the ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY statement, active SQL sessions might prevent a switchover from being processed. Active SQL sessions can include other Oracle processes.
When sessions are active, an attempt to switch over fails with the following error message:
SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY;
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY
*
ORA-01093: ALTER DATABASE CLOSE only permitted with no sessions connected
Action: Query the V$SESSION view to determine which processes are causing the error. For example:
SQL> SELECT SID, PROCESS, PROGRAM FROM V$SESSION
  2> WHERE TYPE = 'USER'
  3> AND SID <> (SELECT DISTINCT SID FROM V$MYSTAT);

       SID PROCESS  PROGRAM
---------- -------- ------------------------------------------------
         7 3537     oracle@nhclone2 (CJQ0)
        10
        14
        16
        19
        21

6 rows selected.
In the previous example, the JOB_QUEUE_PROCESSES parameter corresponds to the CJQ0 process entry. Because the job queue process is a user process, it is counted as a SQL session that prevents switchover from taking place. The entries with no process or program information are threads started by the job queue controller.
Verify that the JOB_QUEUE_PROCESSES parameter is set using the following SQL statement:
SQL> SHOW PARAMETER JOB_QUEUE_PROCESSES;

NAME                           TYPE    VALUE
------------------------------ ------- --------------------
job_queue_processes            integer 5
Then, set the parameter to 0. For example:
SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES=0;
Statement processed.
Because JOB_QUEUE_PROCESSES is a dynamic parameter, you can change the value and have the change take effect immediately without having to restart the instance. You can now retry the switchover procedure.
Do not modify the parameter in your initialization parameter file. After the switchover completes and you shut down and restart the instance, the parameter will be reset to its original value. This applies to both the primary and physical standby databases.
Table A-1 summarizes the common processes that prevent switchover and what corrective action you need to take.
Log apply services cannot apply unsupported DML statements, DDL statements, and Oracle supplied packages to a logical standby database in SQL apply mode.
When an unsupported statement or package is encountered, SQL apply operations stop. You can take the actions described in Table A-2 to correct the situation and start applying SQL statements to the logical standby database again.
See Also:
Chapter 14 for information about querying the DBA_LOGSTDBY_EVENTS view
The process of archiving redo logs involves reading a buffer from the redo log and writing it to the archive log location. When the destination is remote, the buffer is written to the archive log location over the network using Oracle Net services.
The default archive log buffer size is 1 megabyte. The default transfer buffer size for Oracle Net is 2 kilobytes. Therefore, the archive log buffer is divided into units of approximately 2 kilobytes for transmission. These units could get further divided depending on the maximum transmission unit (MTU) of the underlying network interface.
The Oracle Net parameter that controls the transport size is session data unit (SDU). This parameter can be adjusted to reduce the number of network packets that are transmitted. This parameter allows a range of 512 bytes to 32 kilobytes.
For optimal performance, set the Oracle Net SDU parameter to 32 kilobytes for the associated SERVICE destination parameter.
The following example shows a database initialization parameter file segment that defines a remote destination netserv:
LOG_ARCHIVE_DEST_3='SERVICE=netserv'
SERVICE_NAMES=srvc
The following example shows the definition of that service name in the tnsnames.ora file:
netserv=(DESCRIPTION=(SDU=32768)(ADDRESS=(PROTOCOL=tcp)(HOST=host)(PORT=1521))
  (CONNECT_DATA=(SERVICE_NAME=srvc)(ORACLE_HOME=/oracle)))
The following example shows the definition in the listener.ora file:
LISTENER=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=host)(PORT=1521))))
SID_LIST_LISTENER=(SID_LIST=(SID_DESC=(SDU=32768)(SID_NAME=sid)
  (GLOBALDBNAME=srvc)(ORACLE_HOME=/oracle)))
If you archive to a remote site using high-latency/high-bandwidth connections, you can improve performance by increasing the TCP send and receive window sizes. Use caution, however, because this might adversely affect networked applications that do not exhibit the same characteristics as archiving. This method consumes a large amount of system resources.
You can also use cascaded standby databases to off-load network processing from the primary database to a standby database. See Appendix D for more information.
For any given Oracle Data Guard network connection, there are two processes communicating with each other. When the network connection is unexpectedly broken, these processes react in very different ways. This section discusses what actually occurs when a network connection is broken and how it affects the Data Guard environment and configuration. The discussion applies to both physical and logical standby databases.
Data Guard uses a peer-to-peer connection protocol, whereby a primary database process, either the Log Writer (LGWR) or the Archiver (ARCH), establishes a network connection to the standby database. As a result of the network connection request, the listener on the standby site creates a separate process on the standby database called the Remote File Server (RFS) process. The RFS process receives network messages from the primary database; it reads from the network and sends an acknowledgement message back to the primary database when it has finished processing the request.
During normal Data Guard operations, when redo data is transmitted from the primary to the standby, network messages are initiated from the primary database (the network "client") and always acknowledged by the standby database (the network "server"). In this case, the LGWR and ARCH processes are the network clients, and the RFS process is the network server.
Consider the simple scenario where the network between the primary and standby systems is disconnected. This results in what is known as a dead connection. A dead connection indicates that there is no physical connection, but the connection appears to still be there to the processes on each system.
When the LGWR process attempts to send a new message to the RFS process over the dead connection, the LGWR process receives an error from Oracle Net, after a TCP timeout, indicating that the connection has been broken. In this way, the LGWR is able to establish that network connectivity has been lost and take corrective action. The Data Guard attributes [NO]MAX_FAILURE, [NO]REOPEN, and [NO]NET_TIMEOUT, which are options for the LOG_ARCHIVE_DEST_n parameter, provide LGWR with the desired flexibility to control the timeout intervals and number of retries associated with a network connection that is not responding.
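For example, a log writer (LGWR) destination might combine these attributes as follows; the service name and values are hypothetical:
LOG_ARCHIVE_DEST_2='SERVICE=standby1 LGWR SYNC NET_TIMEOUT=30 REOPEN=60 MAX_FAILURE=4'
With these hypothetical values, LGWR waits up to 30 seconds for a network operation to complete, waits 60 seconds before reattempting a failed destination, and stops retrying after 4 consecutive failures.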
In contrast to the LGWR process, the RFS process on the standby database is always synchronously waiting for a new message to arrive from the primary database. The RFS process that is doing the network read operation is blocked until some data arrives to its reading buffer, or until the underlying network software determines the dead connection is no longer valid.
Oracle Net periodically sends a network probe to verify that a client/server connection is still active. This ensures that connections are not left open indefinitely due to an abnormal client termination. If the probe finds a dead connection or a connection that is no longer in use, it returns an error that causes the RFS process to exit.
You can use the Oracle Net parameter SQLNET.EXPIRE_TIME to specify the time interval, expressed in minutes, at which to send a probe to verify that the network session is active. Setting this parameter to a small value allows for more timely detection of dead connections. Connections that do not respond to this probe signal are disconnected. This parameter should be set up for the standby database, as well as the primary database, to prepare it for future switchover scenarios.
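For example, the following sqlnet.ora entry causes a probe to be sent once a minute; the value is only an illustration:
SQLNET.EXPIRE_TIME=1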
Limitations on using the dead connection detection feature are:
Once the RFS process receives notification of the dead network connection, it terminates itself. However, until the RFS process terminates, it retains lock information on the archived redo log on the standby site (or on the standby redo log) whose redo data was being received from the primary database. During this interval, no new RFS process can receive redo data from the primary database for the same archived redo log (or standby redo log).
The dead network connection detection timer expiration value can also be controlled using the TCP/IP keepalive parameter, which specifies the number of seconds to wait before verifying that the network connection is valid. Note that the TCP/IP keepalive parameter defaults to two hours on most systems, which means that in the default case the RFS process will wait two hours before timing out on a dead network connection.
Therefore, Oracle Corporation recommends reducing both values; for example, set the TCP/IP keepalive parameter to 60 seconds and the Oracle Net SQLNET.EXPIRE_TIME parameter to 1 minute. These are reasonable values for most systems, and setting the parameters to small values does not significantly impact production systems.
Once the network problem is resolved, and the primary database processes are again able to establish network connections to the standby database, a new RFS process will automatically be spawned on the standby database for each new network connection. These new RFS processes will resume the reception of redo data from the primary database.