Oracle® Database High Availability Architecture and Best Practices
10g Release 1 (10.1)

Part Number B10726-02

7 Oracle Configuration Best Practices

This chapter describes Oracle configuration best practices. It includes the following sections:

Configuration Best Practices for the Database
Configuration Best Practices for Real Application Clusters
Configuration Best Practices for Data Guard
Configuration Best Practices for MAA
Recommendations for Backup and Recovery

7.1 Configuration Best Practices for the Database

The practices that are recommended in this section affect the performance, availability, and MTTR of your system. These practices apply to the single-instance database, RAC-only, Data Guard-only, and Maximum Availability architectures described in Chapter 4, "High Availability Architectures". The recommendations in this section are identical for the primary and standby databases when Oracle Data Guard is used. Some of these practices may reduce performance, but they are necessary to reduce or avoid outages. The minimal performance impact is outweighed by the reduced risk of corruption or the performance improvement for recovery.

This section includes the following recommendations:


Use Two Control Files
Set CONTROL_FILE_RECORD_KEEP_TIME Large Enough
Configure the Size of Redo Log Files and Groups Appropriately
Multiplex Online Redo Log Files
Enable ARCHIVELOG Mode
Enable Block Checksums
Enable Database Block Checking
Log Checkpoints to the Alert Log
Use Fast-Start Checkpointing to Control Instance Recovery Time
Capture Performance Statistics About Timing
Use Automatic Undo Management
Use Locally Managed Tablespaces
Use Automatic Segment Space Management
Use Temporary Tablespaces and Specify a Default Temporary Tablespace
Use Resumable Space Allocation
Use a Flash Recovery Area
Enable Flashback Database
Set Up and Follow Security Best Practices
Use the Database Resource Manager
Use a Server Parameter File

7.1.1 Use Two Control Files

Maintain two copies of the control file. If a single control file is damaged, then any Oracle instance fails when it attempts to access the damaged or missing control file. If another copy of the current control file is available, then an instance can be easily restarted after copying the good control file to the location of the bad control file. Database recovery is unnecessary.
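For example, a minimal server parameter file sketch that places the two copies on separate disks (the paths and database name are illustrative):

*.CONTROL_FILES='/u01/oradata/SALES/control01.ctl','/u02/oradata/SALES/control02.ctl'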

7.1.2 Set CONTROL_FILE_RECORD_KEEP_TIME Large Enough

Set the CONTROL_FILE_RECORD_KEEP_TIME initialization parameter to a value that enables all on-disk backup information to be retained in the control file, and allocate approximately 200 MB for each control file. CONTROL_FILE_RECORD_KEEP_TIME specifies the minimum number of days that a record is kept in the control file before it becomes a candidate for reuse. Set CONTROL_FILE_RECORD_KEEP_TIME to a value slightly longer than the age of the oldest backup file that you intend to keep on disk, as determined by the size of the flash recovery area. For example, if the flash recovery area is sized to maintain two full backups that are taken every 7 days, as well as daily incremental backups and archived redo log files, then set CONTROL_FILE_RECORD_KEEP_TIME to a value such as 21 or 30. Records older than this can be reused; however, the backup metadata is still available in the RMAN recovery catalog.
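For example, to retain 21 days of backup metadata in the control file (an illustrative value, assuming a server parameter file is in use):

ALTER SYSTEM SET CONTROL_FILE_RECORD_KEEP_TIME=21 SCOPE=BOTH;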

7.1.3 Configure the Size of Redo Log Files and Groups Appropriately

All online redo log files should be the same size and configured to switch approximately once an hour during normal activity. They should switch no more frequently than every 20 minutes during peak activity.

There should be a minimum of four online log groups to prevent LGWR from waiting for a group to be available following a log switch. A group may be unavailable because a checkpoint has not yet completed or the group has not yet been archived.
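To gauge switch frequency, a query such as the following illustrative one counts log switches per hour from V$LOG_HISTORY:

SELECT TO_CHAR(FIRST_TIME, 'YYYY-MM-DD HH24') AS HOUR, COUNT(*) AS SWITCHES
  FROM V$LOG_HISTORY
 GROUP BY TO_CHAR(FIRST_TIME, 'YYYY-MM-DD HH24')
 ORDER BY 1;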

7.1.4 Multiplex Online Redo Log Files

Use Oracle log multiplexing to create multiple redo log members in each redo group. This protects against a failure involving the redo log, such as a disk corruption that exists on both sides of the disk mirror for one of the members, or a user error that accidentally removes a member. If at least one redo log member is available, then the instance can continue to function.
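For example, to add a second member to an existing group (the group number and path are illustrative):

ALTER DATABASE ADD LOGFILE MEMBER '/u02/oradata/SALES/redo01b.log' TO GROUP 1;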

7.1.5 Enable ARCHIVELOG Mode

ARCHIVELOG mode enables the database to be backed up while it is online and is necessary to recover the database to a point in time later than what has already been restored.

Architectures that include Oracle Data Guard require that the production database run in ARCHIVELOG mode before a standby database is instantiated. ARCHIVELOG mode is required to maintain a standby database.


See Also:

Oracle Database Administrator's Guide for more information about using automatic archiving

7.1.6 Enable Block Checksums

By default, Oracle always tests the data blocks that it reads from disk. Enabling data and log block checksums by setting DB_BLOCK_CHECKSUM to TRUE enables Oracle to detect other types of corruption caused by underlying disks, storage systems, or I/O systems. Before a data block is written to disk, a checksum is computed and stored in the block. When the block is subsequently read from disk, the checksum is computed again and compared with the stored checksum. Any difference is treated as a media error and an ORA-1578 error is signaled. Block checksums are always maintained for the SYSTEM tablespace.
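For example, assuming a server parameter file is in use:

ALTER SYSTEM SET DB_BLOCK_CHECKSUM=TRUE SCOPE=BOTH;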

In addition to enabling data block checksums, Oracle also calculates a checksum for every redo log block before writing it to the current log. Redo record corruptions are found as soon as the log is archived. Without this option, a corruption in a redo log can go unnoticed until the log is applied to a standby database or until a backup is restored and rolled forward through the log containing the corrupt log block.

RMAN also calculates checksums when taking backups to ensure that all blocks being backed up are validated.

Turning on this feature typically has minimal overhead. Measure the performance impact with your workload on a test system and ensure that the performance impact is acceptable before introducing this feature on an active database.

7.1.7 Enable Database Block Checking

Enable database block checking by setting DB_BLOCK_CHECKING to TRUE. When block checking is enabled, whenever a block is modified, Oracle verifies that the block is self-consistent. If it is inconsistent, then the block is marked corrupt, an ORA-1578 error is returned, and a trace file is created containing details of the problem. Without block checking enabled, corruptions can go undetected until the block is accessed again. Block checking for the SYSTEM tablespace is always enabled.
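For example, assuming a server parameter file is in use:

ALTER SYSTEM SET DB_BLOCK_CHECKING=TRUE SCOPE=BOTH;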

Block checking can often prevent memory and data corruption. Turning on this feature typically causes an additional 1 percent to 10 percent overhead, depending on workload. Measure the performance impact on a test system using your workload and ensure that it is acceptable before introducing this feature on an active database.

To ensure that blocks are not corrupted externally to Oracle, use a validation tool such as the DBVERIFY utility, the ANALYZE TABLE ... VALIDATE STRUCTURE statement, or the RMAN BACKUP ... VALIDATE command.

7.1.8 Log Checkpoints to the Alert Log

Checkpoint activity should be logged to the alert log by setting LOG_CHECKPOINTS_TO_ALERT to TRUE. Monitor checkpoint activity to ensure that a current checkpoint completes before the next checkpoint starts.
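For example:

ALTER SYSTEM SET LOG_CHECKPOINTS_TO_ALERT=TRUE SCOPE=BOTH;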

7.1.9 Use Fast-Start Checkpointing to Control Instance Recovery Time

Fast-start checkpointing refers to the periodic writes by the database writer (DBWn) processes. DBWn processes write changed data blocks from the Oracle buffer cache to disk and advance the thread checkpoint. With fast-start checkpointing, the checkpoint continually advances so that recovery time from instance or node failure occurs predictably.

Oracle Database 10g supports automatic checkpoint tuning, which takes advantage of periods of low I/O usage to advance checkpoints and thereby improve availability. Automatic checkpoint tuning is in effect if the FAST_START_MTTR_TARGET database initialization parameter is unset. Observe the following recommendations to take advantage of automatic checkpoint tuning:

  • If it is necessary to control the time to recover from an instance or node failure, then set FAST_START_MTTR_TARGET to the desired MTTR in seconds.

  • If targeting a specific MTTR is unnecessary, then leave FAST_START_MTTR_TARGET unset to enable automatic checkpoint tuning.

  • Fast-start checkpointing can be disabled by setting FAST_START_MTTR_TARGET=0. Disable fast-start checkpointing only when system I/O capacity is insufficient with fast-start checkpointing enabled and achieving a target MTTR is not important.

Enabling fast-start checkpointing increases the average number of writes per transaction that DBWn issues for a given workload (when compared with disabling fast-start checkpointing). However, if the system is not already near or at its maximum I/O capacity, then fast-start checkpointing has a negligible impact on performance. The percentage of additional DBWn writes with very aggressive fast-start checkpointing depends on many factors, including the workload, I/O speed and capacity, CPU speed and capacity, and the performance of previous recoveries.

Monitor the V$MTTR_TARGET_ADVICE view for advisory information and an estimate of the number of additional I/O operations that would occur under different FAST_START_MTTR_TARGET values. You should also test various FAST_START_MTTR_TARGET settings (such as 0, 1, 90, 180, 270, 3600, and unset) under load to determine the runtime impact (for example, the amount of increased DBWn write activity) on a particular system and the instance recovery time achieved with that setting.
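For example, the following illustrative statements set a 90-second MTTR target and then examine the advisory view:

ALTER SYSTEM SET FAST_START_MTTR_TARGET=90 SCOPE=BOTH;

SELECT MTTR_TARGET_FOR_ESTIMATE, ESTD_TOTAL_WRITES
  FROM V$MTTR_TARGET_ADVICE;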

If FAST_START_MTTR_TARGET is set to a low value, then fast-start checkpointing is more aggressive, and the average number of writes per transaction that DBWn issues is higher in order to keep the thread checkpoint sufficiently advanced to meet the requested MTTR. Conversely, if FAST_START_MTTR_TARGET is set to a high value, or if automatic checkpoint tuning is in effect (that is, FAST_START_MTTR_TARGET is unset), then fast-start checkpointing is less aggressive, and the average number of writes per transaction that DBWn issues is lower.

Fast-start checkpointing can be explicitly disabled by setting FAST_START_MTTR_TARGET=0. Disabling fast-start checkpointing leads to the fewest average number of writes per transaction for DBWn for a specific workload and configuration, but also results in the highest MTTR.

When you enable fast-start checkpointing, remove or disable (set to 0) the following initialization parameters: LOG_CHECKPOINT_INTERVAL, LOG_CHECKPOINT_TIMEOUT, FAST_START_IO_TARGET.

7.1.10 Capture Performance Statistics About Timing

Set the TIMED_STATISTICS initialization parameter to TRUE to capture Oracle event timing data. This parameter is set to TRUE by default if the STATISTICS_LEVEL database parameter is set to its default value of TYPICAL. Effective data collection and analysis are essential for identifying and correcting system performance problems. Oracle provides several tools that allow a performance engineer to gather information about instance and database performance. Setting TIMED_STATISTICS to TRUE is essential for using these tools effectively.


7.1.11 Use Automatic Undo Management

With automatic undo management, the Oracle server effectively and efficiently manages undo space, leading to lower administrative complexity and cost. When Oracle internally manages undo segments, undo block and consistent read contention are eliminated because the size and number of undo segments are automatically adjusted to meet the current workload requirement.

To use automatic undo management, set the following parameters:

  • UNDO_MANAGEMENT = AUTO

  • UNDO_RETENTION is the desired time in seconds to retain undo data. It must be the same on all instances.

  • UNDO_TABLESPACE should specify a unique undo tablespace for each instance.
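For example, a minimal server parameter file sketch for a two-instance RAC database (instance and tablespace names are illustrative):

*.UNDO_MANAGEMENT=AUTO
*.UNDO_RETENTION=900
SALES1.UNDO_TABLESPACE='UNDOTBS1'
SALES2.UNDO_TABLESPACE='UNDOTBS2'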

Advanced object recovery features, such as Flashback Query, Flashback Version Query, Flashback Transaction Query, and Flashback Table, require automatic undo management. These features depend on the UNDO_RETENTION setting. Retention is specified in units of seconds. By default, Oracle automatically tunes undo retention by collecting database use statistics and estimating undo capacity needs. You can affect this automatic tuning by setting the UNDO_RETENTION initialization parameter. The default value of UNDO_RETENTION is 900. You do not need to set this parameter if you want Oracle to tune undo retention. The UNDO_RETENTION value can be changed dynamically at any time by using the ALTER SYSTEM statement.

Setting UNDO_RETENTION does not guarantee that undo data will be retained for the specified period of time. If undo data is needed for transactions, then the UNDO_RETENTION period is reduced so that transactions can receive the necessary undo data.

You can guarantee that unexpired undo data is not overwritten even if it means that future operations that need to generate undo data will fail. This is done by specifying the RETENTION GUARANTEE clause for the undo tablespace when it is created by either the CREATE DATABASE or CREATE UNDO TABLESPACE statement. Alternatively, you can later specify this clause in an ALTER TABLESPACE statement.
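For example, either of the following illustrative statements enables the guarantee:

CREATE UNDO TABLESPACE undotbs1
  DATAFILE '/u01/oradata/SALES/undotbs1_01.dbf' SIZE 500M
  RETENTION GUARANTEE;

ALTER TABLESPACE undotbs1 RETENTION GUARANTEE;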

With the retention guarantee option, unexpired undo data is preserved even if that causes DML activity to fail for lack of undo space. (DDL statements are still allowed.) If the tablespace is configured with less space than the transaction throughput requires, then the following four events occur in this sequence:

  1. If you have an autoextensible file, then it will automatically grow to accommodate the retained undo data.

  2. A warning alert is issued at 85 percent full.

  3. A critical alert is issued at 97 percent full.

  4. Transactions receive an out-of-space error.


    See Also:

    Oracle Database Administrator's Guide for more information about the UNDO_RETENTION setting and the size of the undo tablespace

7.1.12 Use Locally Managed Tablespaces

Locally managed tablespaces perform better than dictionary-managed tablespaces, are easier to manage, and eliminate space fragmentation concerns. Locally managed tablespaces use bitmaps stored in the datafile headers and, unlike dictionary managed tablespaces, do not contend for centrally managed resources for space allocations and de-allocations.

7.1.13 Use Automatic Segment Space Management

Automatic segment space management simplifies space administration tasks, thus reducing the chance of human error. An added benefit is the elimination of performance tuning related to space management. It facilitates management of free space within objects such as tables or indexes, improves space utilization, and provides significantly better performance and scalability with simplified administration. The automatic segment space management feature is available only with permanent locally managed tablespaces.

7.1.14 Use Temporary Tablespaces and Specify a Default Temporary Tablespace

Temporary tablespaces improve the concurrency of multiple sort operations, reduce sort operation overhead, and avoid data dictionary space management operations altogether. This is a more efficient way of handling temporary segments, from the perspective of both system resource usage and database performance.

A default temporary tablespace should be specified for the entire database to prevent accidental exclusion of the temporary tablespace clause. This can be done at database creation time by using the DEFAULT TEMPORARY TABLESPACE clause of the CREATE DATABASE statement or after database creation by the ALTER DATABASE statement. Using the default temporary tablespace ensures that all disk sorting occurs in a temporary tablespace and that other tablespaces are not mistakenly used for sorting.
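For example, the following illustrative statements create a temporary tablespace and make it the database default:

CREATE TEMPORARY TABLESPACE temp
  TEMPFILE '/u01/oradata/SALES/temp01.dbf' SIZE 1000M;

ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp;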

7.1.15 Use Resumable Space Allocation

Resumable space allocation provides a way to suspend and later resume database operations if there are space allocation failures. The affected operation is suspended instead of the database returning an error. No processes need to be restarted. When the space problem is resolved, the suspended operation is automatically resumed.

Set the RESUMABLE_TIMEOUT initialization parameter to the number of seconds for which an operation can remain suspended before an error is returned.
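For example, to allow suspended operations to wait up to one hour for the space condition to be corrected:

ALTER SYSTEM SET RESUMABLE_TIMEOUT=3600 SCOPE=BOTH;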

7.1.16 Use a Flash Recovery Area

The flash recovery area is an Oracle-managed directory, file system, or automatic storage management disk group that provides a centralized disk location for backup and recovery files. The flash recovery area is defined by setting the following database initialization parameters:

  • DB_RECOVERY_FILE_DEST: Default location for the flash recovery area

  • DB_RECOVERY_FILE_DEST_SIZE: Specifies (in bytes) the hard limit on the total space to be used by target database recovery files created in the recovery area location

The bigger the flash recovery area, the more beneficial it becomes. The minimum recommended disk limit is the sum of the database size, the size of incremental backups, the size of all archived redo logs that have not been copied to tape, and the size of the flashback logs.
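For example (the location and size shown are illustrative):

*.DB_RECOVERY_FILE_DEST='/recoveryarea'
*.DB_RECOVERY_FILE_DEST_SIZE=10G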


See Also:

Oracle Database Backup and Recovery Advanced User's Guide and Oracle Database Backup and Recovery Basics for detailed information about sizing the flash recovery area and setting the retention period

7.1.17 Enable Flashback Database

Flashback Database is a revolutionary recovery feature that operates on only the changed data, thereby making the time to correct an error equal to the time to cause the error without recovery time being a function of the database size. You can flash back a database from both RMAN and SQL*Plus with a single command instead of a complex procedure. Flashback Database is similar to conventional point-in-time recovery in its effects, enabling you to return a database to its state at a time in the recent past. However, Flashback Database is much faster than point-in-time recovery, because it does not require restoring datafiles from backup or extensive application of redo data.

To enable Flashback Database, set up a flash recovery area and set a flashback retention target (the DB_FLASHBACK_RETENTION_TARGET initialization parameter) to specify how many minutes into the past you want to be able to restore your database using Flashback Database. Then execute the ALTER DATABASE FLASHBACK ON statement. Note that the flashback retention target is a target, not an absolute guarantee that Flashback Database will be available. If your flash recovery area is not large enough to hold required files such as archived redo logs and other backups, then flashback logs may be deleted to make room in the flash recovery area for these required files. To determine how far you can flash back at any one time, query the V$FLASHBACK_DATABASE_LOG view. If you have a standby database, then set DB_FLASHBACK_RETENTION_TARGET to the same value for both the primary and standby databases.
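For example, the following illustrative sequence sets a 24-hour retention target (1440 minutes), enables Flashback Database with the database mounted, and then checks how far back the database can currently be flashed back:

STARTUP MOUNT;
ALTER SYSTEM SET DB_FLASHBACK_RETENTION_TARGET=1440 SCOPE=BOTH;
ALTER DATABASE FLASHBACK ON;
ALTER DATABASE OPEN;

SELECT OLDEST_FLASHBACK_TIME FROM V$FLASHBACK_DATABASE_LOG;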

7.1.18 Set Up and Follow Security Best Practices

The biggest threat to corporate data comes from employees and contractors with internal access to networks and facilities. Corporate data is one of a company's most valuable assets, and it can be at grave risk if placed on a system or database that does not have proper security measures in place. A well-defined security policy can help protect your systems from unwanted access and protect sensitive corporate information from sabotage. Proper data protection reduces the chance of outages due to security breaches.

In addition to the "High Availability" section in Chapter 9, "Oracle Security Products and Features", the Oracle Security Overview manual is a high-level guide to technical security solutions for the data security challenge. Consult Part II, "Technical Solutions to Security Risks" of the Oracle Security Overview for an overview of techniques for implementing security best practices. For a much more detailed view of security policies, checklists, guidelines, and features, see the Oracle Database Security Guide.

7.1.19 Use the Database Resource Manager

The Database Resource Manager gives database administrators more control over resource management decisions, so that resource allocation can be aligned with the business objectives of an enterprise. The Database Resource Manager provides the ability to prioritize work within the Oracle system. Availability of the database encompasses both its functionality and performance. If the database is available but users are not getting the level of performance they need, then availability and service level objectives are not being met. Application performance, to a large extent, is affected by how resources are distributed among the applications that access the database. The main goal of the Database Resource Manager is to give the Oracle database server more control over resource management decisions, thus circumventing problems resulting from inefficient operating system management and operating system resource managers.

7.1.20 Use a Server Parameter File

The server parameter file (SPFILE) enables a single, central parameter file to hold all of the database initialization parameters associated with all of the instances associated with a database. This provides a simple, persistent, and robust environment for managing database parameters.

An SPFILE is required when using the Data Guard Broker.
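For example, to migrate from a traditional text parameter file to an SPFILE (the PFILE path is illustrative):

CREATE SPFILE FROM PFILE='/u01/app/oracle/admin/SALES/pfile/init.ora';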

7.2 Configuration Best Practices for Real Application Clusters

The practices that are recommended in this section affect the performance, availability, and MTTR of your system. These practices build on the single instance database configuration best practices. The practices are identical for the primary and standby databases if they are used with Data Guard in the MAA architecture. Some of these practices may reduce performance levels, but they are necessary to reduce or avoid outages. The minimal performance impact is outweighed by the reduced risk of corruption or the performance improvement for recovery.

The rest of this section includes the following topics:

Register All Instances with Remote Listeners
Do Not Set CLUSTER_INTERCONNECTS Unless Required for Scalability

7.2.1 Register All Instances with Remote Listeners

The listeners should be cross-registered by using the REMOTE_LISTENER parameter so that all listeners know about all services and the instances on which the services are running. The listeners should use server-side load balancing, which can base connection distribution on the session count of each instance. The listeners must be listening on the virtual IP addresses and on the cluster alias, when it is available. The listeners must not listen on the hostname, because listening on the hostname results in disconnected sessions when virtual IPs are automatically relocated back to their owning nodes.
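For example, a minimal sketch for a two-node cluster, assuming a net service name LISTENERS_SALES that resolves to the virtual IP listener addresses of both nodes:

*.REMOTE_LISTENER='LISTENERS_SALES'

LISTENERS_SALES=
 (ADDRESS_LIST=
   (ADDRESS=(PROTOCOL=tcp)(HOST=sales1-vip)(PORT=1521))
   (ADDRESS=(PROTOCOL=tcp)(HOST=sales2-vip)(PORT=1521)))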

7.2.2 Do Not Set CLUSTER_INTERCONNECTS Unless Required for Scalability

The CLUSTER_INTERCONNECTS initialization parameter should be set only if there is more than one cluster interconnect and the default cluster interconnect does not meet the throughput requirements of the RAC database. When CLUSTER_INTERCONNECTS is set to more than one network address, Oracle load-balances across the interfaces. However, Oracle does not provide automatic failover across these interfaces, so all of them must be available for a properly functioning database environment.

7.3 Configuration Best Practices for Data Guard

These practices build on the recommendations for configuring the single-instance database. The proper configuration of Oracle Data Guard Redo Apply and SQL Apply is essential to ensuring that all standby databases work properly and perform their roles within service levels after switchovers and failovers. Most Data Guard configuration settings can be made through Oracle Enterprise Manager. For more advanced, less frequently used Data Guard configuration parameters, use the Data Guard Broker command-line interface or SQL*Plus.

Data Guard enables you to use either a physical standby database or a logical standby database or both, depending on the business requirements. A physical standby database provides a physically identical copy of the primary database, with on-disk database structures that are identical to the primary database on a block-for-block basis. The database schema, including indexes, is the same. A physical standby database is kept synchronized with the primary database by applying the redo data received from the primary database.

A logical standby database contains the same logical information as the production database, although the physical organization and structure of the data can be different. It is kept synchronized with the primary database by transforming the data in the redo log files received from the primary database into SQL statements and then executing the SQL statements on the standby database. A logical standby database can be used for other business purposes in addition to disaster recovery requirements.

Table 7-1 can help you determine which type of standby database to use.

Table 7-1 Determining the Standby Database Type

Question 1: Do you have any datatypes that are not supported by the logical standby database? Run the following query:

SELECT DISTINCT OWNER, TABLE_NAME FROM DBA_LOGSTDBY_UNSUPPORTED
  ORDER BY OWNER, TABLE_NAME;

Rows returned - Use a physical standby database, or investigate changing to supported datatypes.

No rows returned - Go to the next question.

Question 2: Do you need to have the standby database open for read-only or read/write access?

See Also: "Oracle9i Data Guard: SQL Apply Best Practices" at http://otn.oracle.com/deploy/availability/htdocs/maa.htm

Yes - Evaluate a logical standby database.

No - Use a physical standby database.

Table 7-2 shows which configuration recommendations apply to both logical and physical standby databases, which apply only to logical standby databases, and which apply only to physical standby databases.

7.3.1 Use a Simple, Robust Archiving Strategy and Configuration

This archiving strategy is based on the following assumptions:

  • Every instance uses the flash recovery area.

  • The production instances archive remotely to only one apply instance.

Table 7-3 describes the recommendations for a robust archiving strategy.

Table 7-3 Archiving Recommendations

Recommendation: Archiving must be started on the primary database.
Description: Maintaining a standby database requires archiving to be enabled and started on the primary database.

SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE ARCHIVELOG;
SQL> ALTER DATABASE OPEN;

Recommendation: Remote archiving must be enabled.
Description: Set REMOTE_ARCHIVE_ENABLE=TRUE.

Recommendation: Use a consistent log format (LOG_ARCHIVE_FORMAT).
Description: LOG_ARCHIVE_FORMAT should include the thread, sequence, and resetlogs ID attributes, and should be consistent across all instances. %S instructs the format to pad the sequence number with leading zeros. If the flash recovery area is used, then this format is ignored. For example: LOG_ARCHIVE_FORMAT=arch_%t_%S_%r.arc

Recommendation: Perform local archiving first by using the archiver process (ARCH).
Description: Using ARCH reduces the work for LGWR. The default setting for LOG_ARCHIVE_LOCAL_FIRST is TRUE, which means that after the redo log has been completely and successfully archived to at least one local destination, it is then transmitted to the remote destination. Using the flash recovery area implies that LOG_ARCHIVE_DEST_10 is used for local archiving.

Recommendation: Archive remotely to only one standby instance and node for each standby RAC database.
Description: All production instances archive to one standby destination, using the same net service name. Oracle Net Services connect-time failover is used if you want to automatically switch to the secondary standby host when the primary standby instance has an outage.

Recommendation: The standby archive destination should use the flash recovery area.
Description: For simplicity, the standby archive destination (STANDBY_ARCHIVE_DEST) should use the flash recovery area, which is the same as the directory for local archiving. Because standby redo logs (SRLs) are present, the standby ARCH process writes to the local archive destination.

Recommendation: The logical standby archive destination cannot use the flash recovery area.
Description: For a logical standby database, STANDBY_ARCHIVE_DEST cannot use the flash recovery area. Set STANDBY_ARCHIVE_DEST to an explicit archive directory.

Recommendation: Specify role-based destinations with the VALID_FOR attribute.
Description: The VALID_FOR attribute enables you to configure destination attributes for both the primary and the standby database roles in one server parameter file (SPFILE), so that the Data Guard configuration operates properly after a role transition. This simplifies switchovers and failovers by removing the need to enable and disable role-specific parameter files after a role transition.

See Also: Appendix B, "Database SPFILE and Oracle Net Configuration File Samples"

The following example illustrates the recommended initialization parameters for a primary database communicating with a physical standby database. There are two instances, SALES1 and SALES2, running in maximum protection mode.

*.DB_RECOVERY_FILE_DEST=/recoveryarea
*.LOG_ARCHIVE_DEST_1='SERVICE=SALES LGWR SYNC AFFIRM
    REOPEN=15 MAX_FAILURE=10 VALID_FOR=(ONLINE_LOGFILES,ALL_ROLES)'
*.LOG_ARCHIVE_DEST_STATE_1='ENABLE'
*.STANDBY_ARCHIVE_DEST='USE_DB_RECOVERY_FILE_DEST'

Note the following observations for this example:

  • Oracle recommends that you never use the LGWR SYNC=NOPARALLEL option for the LOG_ARCHIVE_DEST_n initialization parameter for the maximum availability or maximum protection modes of Oracle Data Guard. Always use the SYNC=PARALLEL default. Fault detection after a standby instance fails occurs within the time specified by the NET_TIMEOUT option of the LOG_ARCHIVE_DEST_n initialization parameter. Further, Oracle recommends that NET_TIMEOUT be set to 30 seconds for most configurations.

  • The REOPEN=15 MAX_FAILURE=10 setting denotes that if there is a connection failure, then the connection is reopened after 15 seconds and is retried up to 10 times.

  • The VALID_FOR clause is used to designate the role for a destination. When the database is in a physical standby role, remote destination LOG_ARCHIVE_DEST_1 is not used because a physical standby database does not use online log files.

The flash recovery area must be accessible to any node within the cluster and use a shared file system technology such as automatic storage management (ASM), a cluster file system, a global file system, or high availability network file system (HA NFS). Alternatively, you can use a file system that can be manually mounted on any node within the cluster very quickly. This is necessary for recovery because all archived redo log files must be accessible on all nodes.

On the standby database nodes, recovery from a different node is required when Node 1 fails and cannot be restarted. In that case, any of the existing standby instances residing on a different node can initiate managed recovery. In the worst case, when the standby archived redo log files are inaccessible, the new managed recovery process (MRP) or logical standby process (LSP) on the different node uses the FAL (fetch archive log) server to retrieve the archived redo log files directly from the production nodes.

When configuring hardware vendor shared file system technology, verify the performance and availability implications. Investigate the following issues before adopting this strategy:

  • Is the shared file system accessible by any node regardless of the number of node failures?

  • What is the performance impact when implementing a shared file system?

  • Is there any impact on the interconnect traffic?

7.3.2 Use Multiplexed Standby Redo Logs and Configure Size Appropriately

Standby redo logs (SRLs) should be used on both sites. Use Oracle log multiplexing to create multiple standby redo log members in each standby redo group. This protects against a failure involving the redo log, such as disk corruption that exists on both sides of the disk mirror for one of the members or a user error that accidentally removed a member.

Use this formula to determine the number of SRLs:

# of SRLs = sum of all production online log groups per thread + number of threads

For example, if a primary database has two instances (threads) and each thread has four online log groups, then there should be ten SRLs. Having one more standby log group for each thread than the number of the online redo log groups for the production database reduces the likelihood that the LGWR for the production instance is blocked because an SRL cannot be allocated on the standby.

The following are additional guidelines for creating SRLs:

  • Create the same number of SRLs for both production and standby databases.

  • All of the online redo logs and SRLs for both the production and the standby databases should be the same size.

  • The SRLs should exist on both production and standby databases.

  • In a RAC environment, the SRLs must be on a shared disk.

  • In a RAC environment, assign the SRL to a thread when the SRL is created. For example:

ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 10
       '/dev/vx/rdsk/ha10-dg/DGFUN_stbyredo10_01.log' SIZE 50M REUSE;

The remote file server (RFS) process for the standby database writes only to an SRL whose size is identical to the size of an online redo log for the production database. If it cannot find an appropriately sized SRL, then RFS creates an archived redo log file directly instead and logs the following message in the alert log:

No standby redo log files of size <#> blocks available.

7.3.3 Enable FORCE LOGGING Mode

When the production database is in FORCE LOGGING mode, all database changes are logged except for those in temporary tablespaces and temporary segments. FORCE LOGGING mode ensures that the standby database remains consistent with the production database. If this is not possible because you require the load performance of NOLOGGING operations, then you must ensure that the corresponding standby datafiles are subsequently synchronized. After completing the NOLOGGING operations, a production backup of the affected datafiles must replace the corresponding standby datafiles. Before the file transfer, the physical standby database must stop recovery, and the logical standby database must temporarily take the affected tablespaces offline.

You can enable force logging immediately by issuing an ALTER DATABASE FORCE LOGGING statement. If you specify FORCE LOGGING, then Oracle waits for all ongoing unlogged operations to finish.
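For example, to enable force logging and confirm the setting:

ALTER DATABASE FORCE LOGGING;

SELECT FORCE_LOGGING FROM V$DATABASE;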

7.3.4 Use Real Time Apply

Using real time apply enables the log apply services to apply redo data (physical standby database) or SQL (logical standby database) as it is received without waiting for the current standby redo log file to be archived. This results in faster switchover and failover times because the standby redo log files are applied to the standby database before failover or switchover begins.

For a physical standby database, use the following SQL statement:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE;

For a logical standby database, use the following SQL statement:

ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;

7.3.5 Configure the Database and Listener for Dynamic Service Registration

The setting for the SERVICE attribute of the LOG_ARCHIVE_DEST_2 initialization parameter and the settings for the FAL_SERVER and FAL_CLIENT initialization parameters depend on a proper Oracle Net configuration. For the Oracle Data Guard transport service and the gap resolution feature to work, the SPFILE, listener.ora, tnsnames.ora, and sqlnet.ora files must be consistent.

The remote archive destination FAL_CLIENT and FAL_SERVER parameters require an Oracle Net service. This service is represented as a net service name entry in the local tnsnames.ora file. Notice that the FAL_SERVER and FAL_CLIENT reference the same Oracle network service name. This is possible because the FAL_SERVER service is defined in the standby tnsnames.ora file, whereas the FAL_CLIENT service is defined in the primary tnsnames.ora file. This works only when you use the Oracle Network Service local naming method. If you are not using the local naming method, then you must have different service names. Furthermore, Oracle recommends using dynamic service registration instead of a static SID list in the listener configuration. To ensure that service registration works properly, the server parameter file should contain the following parameters:

  • SERVICE_NAMES for the database service name

  • INSTANCE_NAME for the instance name

  • LOCAL_LISTENER to specify a nondefault listener address

PMON dynamically registers a database service with the listener. PMON attempts to resolve LOCAL_LISTENER using some naming method. In the case described here, PMON finds the corresponding name in the local tnsnames.ora file.

For example:

SALES1.INSTANCE_NAME='SALES1'
SALES2.INSTANCE_NAME='SALES2'
*.LOG_ARCHIVE_DEST_2='SERVICE=SALES LGWR AFFIRM REOPEN=15 MAX_FAILURE=10'
*.LOCAL_LISTENER='SALES_lsnr'
*.SERVICE_NAMES='SALES'    # required for service registration
*.FAL_SERVER='SALES'
*.FAL_CLIENT='SALES'

The listener.ora file should be identical for each primary and secondary host except for HOST settings. Because service registration is used, there is no need for statically configured information.

The local tnsnames.ora file should contain the net service names and the local listener name translation. To use the same service name on each node, you must use a locally managed tnsnames.ora file for the production and standby databases. On the primary cluster, the tnsnames.ora entry, SERVICE_NAME, should equal the setting of the SERVICE_NAMES SPFILE parameter. If the listener is started after the instance, then service registration does not happen immediately. In this case, issue the ALTER SYSTEM REGISTER statement on the database to instruct the PMON background process to register the instance with the listeners immediately.

7.3.6 Tune the Network in a WAN Environment

Reducing the number of round trips across the network is essential for optimizing the transportation of redo log data to a standby site. With Oracle Net Services it is possible to control data transfer by adjusting the size of the Oracle Net setting for the session data unit (SDU). In a WAN environment, setting the SDU to 32K can improve performance. The SDU parameter designates the size of an Oracle Net buffer before it delivers each buffer to the TCP/IP network layer for transmission across the network. Oracle Net sends the data in the buffer either when requested or when it is full. Oracle internal testing of Oracle Data Guard on a WAN has demonstrated that the maximum setting of 32K (32768) performs best on a WAN. The primary performance gain from setting the SDU results from the reduced number of calls needed to package the data.

In addition to setting the SDU parameter, network throughput can often be substantially improved by using the SQLNET.SEND_BUF_SIZE and SQLNET.RECV_BUF_SIZE Oracle Net parameters to increase the size of the network TCP send and receive I/O buffers.
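For example, an illustrative WAN configuration sets the SDU in the connect descriptor and the TCP buffer sizes in sqlnet.ora (the buffer values shown are illustrative; size them to the bandwidth-delay product of the network link):

sales.us.acme.com=
 (DESCRIPTION=
  (SDU=32768)
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=tcp)(HOST=sales1-server)(PORT=1521)))
  (CONNECT_DATA=
    (SERVICE_NAME=sales.us.acme.com)))

SQLNET.SEND_BUF_SIZE=65536
SQLNET.RECV_BUF_SIZE=65536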

7.3.7 Determine the Data Protection Mode

In some situations, a business cannot afford to lose data at any cost. In other situations, the availability of the database may be more important than protecting data. Some applications require maximum database performance and can tolerate a potential loss of data if a disaster occurs.

Choose one of the following protection modes:

  • Maximum protection mode guarantees that no data loss will occur if the primary database fails. To ensure that data loss cannot occur, the primary database shuts down if a fault prevents it from writing its redo stream to at least one remote standby redo log.

  • Maximum availability mode provides the highest level of data protection that is possible without compromising the availability of the primary database.

  • Maximum performance mode (the default mode) provides the highest level of data protection that is possible without affecting the performance of the primary database. This is accomplished by allowing a transaction to commit as soon as the redo data needed to recover that transaction is written to the local online redo log. The redo data stream of the primary database is also written to at least one standby database, but that redo stream is written asynchronously with respect to the commitment of the transactions that create the redo data. When network links with sufficient bandwidth are used, this mode provides a level of data protection that approaches that of maximum availability mode with minimal impact on primary database performance.

This section includes the following topics:

Determining the Protection Mode
Changing the Data Protection Mode

7.3.7.1 Determining the Protection Mode

To determine the correct data protection mode for your application, ask the questions in Table 7-4.

Table 7-4 Determining the Appropriate Protection Mode

Question: Is data loss acceptable if the primary site fails?
Yes: Use any protection mode.
No: Use maximum protection or maximum availability mode.

Question: How much data loss is tolerated if a site is lost?
None: Use maximum protection or maximum availability mode.
Some: Use maximum performance mode with ASYNC=blocks. The value for the number of blocks determines the maximum amount of redo data that can be lost if a site fails.

Question: Is potential data loss between the production and the standby databases tolerated when a standby host or network connection is temporarily unavailable?
Yes: Use maximum performance or maximum availability mode.
No: Use maximum protection mode.

Question: How far away should the disaster recovery site be from the primary site?
The distance between sites and the network infrastructure between the sites determine network latency. In general, latency increases with distance. Determine the minimum distance between sites that provides outage isolation with minimal network latency. Assess what data centers are available for your company, or assess Oracle outsourcing services.

Question: What is the current or proposed network bandwidth and latency between sites?
Bandwidth must be greater than the maximum redo generation rate. A guideline for two-way communication is for bandwidth to be 50 percent of the stated line capacity, but you must also consider the network usage of other applications. Using maximum performance mode with asynchronous log transport or the archiver mitigates the effect on performance.

7.3.7.2 Changing the Data Protection Mode

The default data protection mode is maximum performance mode. After a failover to the standby database, the protection mode automatically changes to maximum performance mode. Switchover operations do not change the protection mode.

To change the data protection mode from maximum performance to maximum availability or maximum protection, perform the following steps:

  1. Change the appropriate initialization parameters. For maximum protection and maximum availability modes, Oracle requires one functional remote or standby archive destination using the LGWR SYNC option and a valid net service name during startup. For maximum performance mode, use the LGWR ASYNC option with a valid net service name.

  2. Shut down the primary database. In a RAC environment, shut down all instances except one.

  3. Restart the remaining instance and mount it in exclusive mode:

    SHUTDOWN IMMEDIATE;
    STARTUP MOUNT EXCLUSIVE;
    
    
  4. Change the data protection mode explicitly to the desired mode:

    ALTER DATABASE SET STANDBY TO MAXIMIZE [AVAILABILITY | PROTECTION];
    
    
  5. Restart all instances.

To change the protection mode from maximum protection to maximum performance or maximum availability, use a statement similar to the following:

ALTER DATABASE SET STANDBY TO MAXIMIZE [PERFORMANCE | AVAILABILITY];

7.3.8 Conduct a Performance Assessment with the Proposed Network Configuration

Oracle recommends that you conduct a performance assessment with your proposed network configuration and current (or anticipated) peak redo rate. The network impact between the production and standby databases and the impact on the primary database throughput need to be understood. Because the network between the production and standby databases is essential for the two databases to remain synchronized, the infrastructure must have the following characteristics:

  • Sufficient bandwidth to accommodate the maximum redo generation rate

  • Minimal latency to reduce the performance impact on the production database

  • Multiple network paths for network redundancy

The required bandwidth of a dedicated network connection is determined by the maximum redo rate of the production database. You also need to account for actual network efficiency. Depending on the data protection mode, there are other recommended practices and performance considerations. Maximum protection mode and maximum availability mode require LGWR SYNC transport. Maximum performance mode uses the ASYNC transport option or the archiver (ARCn) instead of LGWR to transfer the redo. These recommendations were derived from an Oracle internal performance study that measured the impact of network latency on primary database throughput for each Oracle Data Guard transport option: ARCH, LGWR ASYNC, and LGWR SYNC.

The network infrastructure between the primary and secondary sites must be able to accommodate the redo traffic because the production database redo data is updating the physical standby database. If your maximum redo traffic at peak load is 8 MB/second, then your network infrastructure must have sufficient bandwidth to handle this load. Furthermore, network latency affects overall throughput and response time for OLTP and batch operations.

When you compare maximum protection mode or maximum availability mode with LGWR SYNC operations with maximum performance mode with LGWR ASYNC operations, measure whether performance or throughput will be degraded due to the incurred latency. You should also check whether the new throughput and response time are within your application performance requirements. Distance and the network configuration directly influence latency, while high latency may slow down your potential transaction throughput and increase response time. The network configuration, number of repeaters, the overhead of protocol conversions, and the number of routers also affect the overall network latency and transaction response time.

7.3.9 Use a LAN or MAN for Maximum Availability or Maximum Protection Modes

Maximum availability mode and maximum protection mode require the Oracle Data Guard transport service to use the LGWR SYNC transport option. Network latency is an additional overhead for each LGWR SYNC I/O operation. Figure 7-1 shows that LGWR SYNC writes both locally to the online redo log and remotely through the network to the RFS process, which writes to the standby redo logs.

Figure 7-1 LGWR SYNC Operation


The following formulas emphasize that the remote write is always slower than the local write and is the limiting factor when LGWR synchronous writes are occurring.

Local write = local write I/O time
Remote write = network round trip time (RTT) + local write I/O time (on standby machine)

Using an example in which the network round trip time (RTT) is 20 milliseconds and LGWR synchronous write is configured, every transaction commit time increases by 20 milliseconds. This overhead impacts response time and may affect primary database throughput. Because of the additional overhead incurred by the RTT, a local area network (LAN) or a metropolitan area network (MAN) with an RTT less than or equal to 10 milliseconds should be used for applications that cannot tolerate a change in performance or response time. Whether to use a LAN or MAN depends on the results of the performance assessment.

7.3.10 Use ARCH for the Greatest Performance Throughput

The ARCH attribute of the LOG_ARCHIVE_DEST_n initialization parameter provides the greatest performance throughput but the greatest data loss potential. ARCH does not affect primary performance when latency increases as long as the redo logs are configured correctly as described in "Configure the Size of Redo Log Files and Groups Appropriately". This is recommended for maximum performance data protection mode and is the default.

The effects of latency on primary throughput are detailed in the following white paper.


See Also:

"Oracle9i Data Guard: Primary Site and Network Configuration Best Practices" at http://otn.oracle.com/deploy/availability/htdocs/maa.htm for details about ARCH performance and latency

7.3.11 Use the ASYNC Attribute to Control Data Loss

Using LGWR ASYNC instead of the archiver in maximum performance mode reduces the amount of data loss. However, ARCH overrides LGWR ASYNC when the ASYNC network buffer does not empty in a timely manner. For best results, use an ASYNC network buffer size of at least 10 MB.

Using larger buffer sizes also increases the chance of avoiding ORA-16198 timeout messages that result from a buffer full condition in a WAN. However, if the "LGWR wait on full LNS buffer" database wait event is among the top three database wait events, use ARCH instead.

If the network buffer becomes full and remains full for 1 second, the transport times out and converts to ARCH transport. This condition indicates that the network to the standby destination cannot keep up with the redo generation rate on the primary database. This is indicated in the alert log by the following message:

ORA-16198: LGWR timed out on Network Server 1 due to buffer full condition.

This message indicates that the standby destination configured with the LGWR ASYNC attributes encountered an async buffer full condition. Log transport services automatically stop using the network server process, LNSn, to transmit the redo data and convert to using the archiver process, ARCn, until a log switch occurs. At the next log switch, redo transmission reverts to using the LGWR ASYNC transport. This change occurs automatically. Using the largest asynchronous network buffer, 50MB, reduces the chance of the transport converting to ARCH. If this error occurs for every log or for the majority of logs, then the transport should be modified to use the archiver process permanently.

Figure 7-2 shows the architecture when the standby protection mode is set to maximum performance with LGWR ASYNC configuration.

Figure 7-2 LGWR ASYNC Transport Service


7.3.12 Evaluate SSH Port Forwarding with Compression

Evaluate SSH port forwarding with compression for maximum performance mode over a high-latency WAN (RTT greater than 100 milliseconds). Coupled with LGWR ASYNC and the maximum network buffer size, SSH with compression reduces the chance of receiving an "async buffer full" timeout and also reduces network traffic.


7.3.13 Set LOG_ARCHIVE_LOCAL_FIRST to TRUE

Setting LOG_ARCHIVE_LOCAL_FIRST to TRUE enables the archiver processes to archive the local online redo log files on the primary database before transmitting the redo data to remote standby destinations. This is especially useful when the network to the standby databases is slow.

This is the default setting for LOG_ARCHIVE_LOCAL_FIRST.

7.3.14 Provide Secure Transmission of Redo Data

Because a lack of security can directly affect availability, Data Guard provides a secure environment and prevents tampering with redo data as it is being transferred to the standby database. To enable secure transmission of redo data, set up every database in the Data Guard configuration to use a password file, and set the password for the SYS user identically on every system. The following is a summary of steps needed for each database in the Data Guard configuration:

  1. Create a password file for each database in the Data Guard configuration.

  2. Set the REMOTE_LOGIN_PASSWORDFILE=[EXCLUSIVE | SHARED] initialization parameter on each instance.
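For example, an illustrative sequence for one instance (the file name follows the orapw<SID> convention; supply your own SYS password). Because REMOTE_LOGIN_PASSWORDFILE is a static parameter, the change takes effect at the next instance restart:

orapwd file=$ORACLE_HOME/dbs/orapwSALES1 password=<sys_password>

ALTER SYSTEM SET REMOTE_LOGIN_PASSWORDFILE=EXCLUSIVE SCOPE=SPFILE;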

After you have performed these steps to set up security on every database in the Data Guard configuration, Data Guard transmits redo data only after the appropriate authentication checks using SYS credentials are successful. This authentication can be performed even if Oracle Advanced Security is not installed and provides some level of security when shipping redo data. To further protect redo data (for example, to encrypt redo data or to compute an integrity checksum value for redo traffic over the network to disallow redo tampering on the network), Oracle recommends that you install and use Oracle Advanced Security.


7.3.15 Set DB_UNIQUE_NAME

Specify a unique name for the standby database. The name does not change even if the primary and standby databases reverse roles. The DB_UNIQUE_NAME parameter defaults to the value of the DB_NAME parameter.

7.3.16 Set LOG_ARCHIVE_CONFIG Correctly

Specify the DG_CONFIG attribute of the LOG_ARCHIVE_CONFIG initialization parameter so that it lists the DB_UNIQUE_NAME for the primary database and each standby database in the Data Guard configuration. By default, this parameter enables the primary database to send redo data to remote destinations and enables standby databases to receive redo data. The DG_CONFIG attribute must be set to enable the dynamic addition of a standby database to a Data Guard configuration that has a RAC primary database running in either maximum protection or maximum availability mode.
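For example, an illustrative sketch for a primary database named SALES_PRI with one standby named SALES_STBY (each database sets its own DB_UNIQUE_NAME, while the DG_CONFIG list is the same on both):

*.DB_UNIQUE_NAME='SALES_PRI'
*.LOG_ARCHIVE_CONFIG='DG_CONFIG=(SALES_PRI,SALES_STBY)'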

7.3.17 Recommendations for the Physical Standby Database Only

The following recommendation applies only to the physical standby database:

7.3.17.1 Tune Media Recovery Performance

To use Oracle Data Guard with a physical standby database or to use any media recovery operation effectively, you need to tune your database recovery.


See Also:

"Oracle9i Media Recovery Best Practices" at http://otn.oracle.com/deploy/availability/htdocs/maa.htm

7.3.18 Recommendations for the Logical Standby Database Only

The following recommendations apply only to the logical standby database:

7.3.18.1 Use Supplemental Logging and Primary Key Constraints

Use supplemental logging and primary key constraints on all production tables.

If your application ensures that the rows in a table are unique, then you can create a disabled primary key RELY constraint on the table. This avoids the overhead of maintaining a primary key on the primary database. To create a disabled RELY constraint on a primary database table, use the ALTER TABLE statement with a RELY DISABLE clause.

To improve the performance of SQL Apply, add an index to the columns that uniquely identify the row on the logical standby database. Failure to do this results in full table scans.
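For example, assuming an illustrative table SALES.ORDERS whose ORDER_ID values are known by the application to be unique:

ALTER TABLE sales.orders
  ADD CONSTRAINT orders_pk PRIMARY KEY (order_id) RELY DISABLE;

-- On the logical standby, index the uniquely identifying column.
-- (The database guard must be bypassed to create objects there.)
CREATE INDEX orders_pk_ix ON sales.orders (order_id);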

7.3.18.2 Set the MAX_SERVERS Initialization Parameter

If the logical standby database is being used to remove reporting or decision support operations from the primary database, then you should probably reserve some of the parallel query slaves for such operations. Because the SQL Apply process by default uses all the parallel query slaves, setting the MAX_SERVERS initialization parameter enables a specified number of parallel query slaves to be reserved.

Table 7-5 shows examples of MAX_SERVERS values.

Table 7-5 Examples of MAX_SERVERS Values

PARALLEL_MAX_SERVERS   MAX_SERVERS   Reserved for Parallel Query   Reserved for SQL Apply
24                     Unset         0                             24
24                     24            0                             24
48                     24            24                            24

It is recommended that MAX_SERVERS initially be set to the larger of the following two values: 9, or 3 + (3 x the number of CPUs).
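Logical standby parameters such as MAX_SERVERS are set with the DBMS_LOGSTDBY package; for example (an illustrative value, with SQL Apply stopped while the change is made):

ALTER DATABASE STOP LOGICAL STANDBY APPLY;
EXECUTE DBMS_LOGSTDBY.APPLY_SET('MAX_SERVERS', 24);
ALTER DATABASE START LOGICAL STANDBY APPLY;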

7.3.18.3 Increase the PARALLEL_MAX_SERVERS Initialization Parameter

Increase the PARALLEL_MAX_SERVERS initialization parameter by the larger of 9 or 3 plus 3 times the number of CPUs on both the primary and standby instances:

PARALLEL_MAX_SERVERS=current value + max(9, 3 +(3 x CPU))

The PARALLEL_MAX_SERVERS initialization parameter specifies the maximum number of parallel query processes that can be created on the database instance. With the exception of the coordinator process, all of the processes that constitute the SQL Apply engine are created from the pool of parallel query processes. By default, the SQL Apply engine uses all of the parallel query processes available on the database instance. This behavior can be overridden by using logical standby parameters such as MAX_SERVERS.

It is recommended that PARALLEL_MAX_SERVERS be increased by the value of MAX_SERVERS.

7.3.18.4 Set the TRANSACTION_CONSISTENCY Initialization Parameter

The logical standby database supports the following methods of data application:

  • For a reporting or decision support system, use FULL or READ_ONLY transaction consistency.

  • For a disaster recovery solution or when the SQL Apply engine needs to catch up, set TRANSACTION_CONSISTENCY to NONE.

If the logical standby database will be used for reporting or decision support operations, then:

  • If the standby database has multiple instances (RAC), then choose FULL.

  • If the standby database has only one instance (no RAC), then choose READ_ONLY.
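
Whichever mode is appropriate, the setting is typically applied through the DBMS_LOGSTDBY package while SQL Apply is stopped. A minimal sketch for a single-instance reporting standby:

SQL> ALTER DATABASE STOP LOGICAL STANDBY APPLY;
SQL> EXECUTE DBMS_LOGSTDBY.APPLY_SET('TRANSACTION_CONSISTENCY', 'READ_ONLY');
SQL> ALTER DATABASE START LOGICAL STANDBY APPLY;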

7.3.18.5 Skip SQL Apply for Unnecessary Objects

Database objects that do not need to be replicated to the standby database should be skipped by using the DBMS_LOGSTDBY.SKIP procedure. Skipping such objects reduces the processing of the SQL Apply engine. Consider this recommendation in a decision support environment.
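
For example, a minimal sketch that excludes an illustrative SCRATCH schema from SQL Apply (SQL Apply must be stopped before the procedure is called):

SQL> ALTER DATABASE STOP LOGICAL STANDBY APPLY;
SQL> EXECUTE DBMS_LOGSTDBY.SKIP('DML', 'SCRATCH', '%');
SQL> EXECUTE DBMS_LOGSTDBY.SKIP('SCHEMA_DDL', 'SCRATCH', '%');
SQL> ALTER DATABASE START LOGICAL STANDBY APPLY;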

7.4 Configuration Best Practices for MAA

This section recommends configuration practices in addition to the ones that are discussed for the single-instance database, RAC, and Data Guard. These practices are recommended when MAA is employed (RAC and Data Guard are used on both sites).

This section includes the following topics:

Configure Multiple Standby Instances
Configure Connect-Time Failover for Network Service Descriptors

7.4.1 Configure Multiple Standby Instances

In an MAA environment, the standby database uses RAC, and multiple standby instances are associated with the same standby database. Having multiple standby instances is not the same as having multiple standby databases. Only one instance can have the managed recovery process (MRP) or the logical standby apply process (LSP). The standby instance with the MRP or LSP is called the primary standby instance. All other standby instances are called secondary standby instances.

Having multiple standby instances for the same database on the cluster provides the following benefits:

  • They enable transparent connection failover to a secondary standby instance if connectivity to the primary standby instance fails. In this scenario, the MRP or LSP session is automatically restarted by the Data Guard Broker. If the Broker is not being used, then these processes must be restarted manually on the new primary standby instance.

  • They provide a scheduled maintenance solution whenever the primary standby instance and host need to be shut down for maintenance. The secondary standby instance can take over and receive logs through the Oracle Net service, because connect-time failover occurs.

7.4.2 Configure Connect-Time Failover for Network Service Descriptors

Data Guard connect-time failover occurs when a connection request is forwarded to another listener if the connection fails. Connect-time failover is enabled by service registration, because the listener knows which available Oracle instance provides the requested service.

The following is an Oracle Net connection descriptor in the tnsnames.ora file:

sales.us.acme.com=
 (DESCRIPTION=
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=tcp)(HOST=sales1-server)(PORT=1521))
    (ADDRESS=(PROTOCOL=tcp)(HOST=sales2-server)(PORT=1521)))
  (CONNECT_DATA=
    (SERVICE_NAME=sales.us.acme.com)))

Note that the SALES net service name contains multiple addresses (two, because it is a two-node cluster); the same approach applies to both the production and standby clusters. The second address enables connect-time failover if the first connection attempt fails. This works for all protection modes.

To add a network protocol address to an existing net service name or database service, use either Oracle Enterprise Manager or Oracle Net Manager.

7.5 Recommendations for Backup and Recovery

While it is prudent that every database have a good backup, recovery using a backup is not always the fastest solution. Other available Oracle technologies, such as RAC, Data Guard, and flashback technology, often provide faster means of recovering from an outage than restoring from backups.

A good backup and recovery strategy is still vital to the overall high availability solution and ensures that specific outages are recovered from in an acceptable amount of time. The following topics are included in this section:


Use Recovery Manager to Back Up Database Files
Understand When to Use Backups
Use an RMAN Recovery Catalog
Use the Autobackup Feature for the Control File and SPFILE
Use Incrementally Updated Backups to Reduce Restoration Time
Enable Change Tracking to Reduce Backup Time
Create Database Backups on Disk in the Flash Recovery Area
Create Tape Backups from the Flash Recovery Area
Determine Retention Policy and Backup Frequency
Configure the Size of the Flash Recovery Area Properly
In a Data Guard Environment, Back Up to the Flash Recovery Area on All Sites
During Backups, Use the Target Database Control File as the RMAN Repository
Regularly Check Database Files for Corruption
Periodically Test Recovery Procedures
Back Up the OCR to Tape or Offsite

7.5.1 Use Recovery Manager to Back Up Database Files

Recovery Manager (RMAN) uses server sessions to perform backup and recovery operations and stores metadata about backups in a repository. RMAN offers many advantages over typical user-managed backup methods, such as the ability to do online database backups without placing tablespaces in backup mode; support for incremental backups; data block integrity checks during backup and restore operations; and the ability to test backups and restores without actually performing the operation. RMAN automates backup and recovery, whereas the user-managed method requires you to keep track of all database files and backups. For example, instead of requiring you to locate backups for each datafile, copy them to the correct place using operating system commands, and choose which logs to apply, RMAN manages these tasks automatically. Some Oracle recovery capabilities, such as block media recovery, are available only when you use RMAN.
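
For example, a whole database backup, including archived redo log files, can be as simple as the following sketch:

$ rman TARGET /
RMAN> BACKUP DATABASE PLUS ARCHIVELOG;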

7.5.2 Understand When to Use Backups

Most production database unscheduled outages are either handled automatically by various database components or are resolved by using another technology to restore a backup. For example, some outages are handled best by using Flashback Database or the standby database. However, there are situations that require using database backups, including the following:

7.5.2.1 Perform Regular Backups

Perform regular backups.

7.5.2.2 Initial Data Guard Environment Set-Up

During initial set-up of a standby database, a backup of the production database is required at the secondary site to create the initial standby database.

7.5.2.3 Recovering from Data Failures Using File or Block Media Recovery

When a data failure, which includes block corruption and media failure, occurs in an environment that does not include Data Guard, the only method of recovery is using an existing backup. Even with Data Guard, the most efficient means of recovering from data failure may be restoring and recovering the affected object from an existing backup.

7.5.2.4 Double Failure Resolution

A double failure scenario affects the availability of both the production and standby databases. The only resolution is to re-create the production database from an available backup and then re-create the standby database. An example of a double failure scenario is a site outage at the secondary site, which eliminates fault tolerance, followed by a media failure on the production database. Some multiple failures, or more appropriately disasters (such as a primary site outage followed by a secondary site outage), may require the use of backups that exist in an offsite location. Therefore, develop and follow a process to deliver and maintain backup tapes at an offsite location so that service can be restored in the most dire of circumstances.

7.5.2.5 Long-Term Backups

Some businesses require the ability to maintain long-term backups that may be needed years into the future. By using RMAN with the KEEP option, it is possible to retain backups that are exempt from the retention policy and never expire, providing the capability to restore and recover the database to any desired point in time. It is important that a recovery catalog be used for the RMAN repository so that backup metadata is not lost due to lack of space, which may occur when using the target database control file for the RMAN repository.
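
For example, a sketch of a long-term backup that never expires (the tag name is illustrative, and a recovery catalog connection is required for KEEP FOREVER):

RMAN> BACKUP DATABASE TAG 'fy2004_yearend' KEEP FOREVER LOGS;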

7.5.3 Use an RMAN Recovery Catalog

RMAN automatically manages the backup metadata in the control file of the database that is being backed up. To protect and keep backup metadata for long periods of time, the RMAN repository, usually referred to as a recovery catalog, is created in a separate database. There are many advantages of using a recovery catalog, including the ability to store backup information long-term, the ability to store metadata for multiple databases, and the ability to restore an available backup on to another system. In addition, if you are using only the target database control file to house the repository, the control file, with its limited maximum size, may not be large enough to hold all desired backup metadata. If the control file is too small to hold additional backup metadata, then existing backup information is overwritten, making it difficult to restore and recover using those backups.
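
A minimal sketch of creating the catalog and registering a target database, assuming an illustrative catalog database catdb with catalog owner rman:

$ rman TARGET / CATALOG rman/<password>@catdb
RMAN> CREATE CATALOG;
RMAN> REGISTER DATABASE;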

7.5.4 Use the Autobackup Feature for the Control File and SPFILE

RMAN can be configured to automatically back up the control file and server parameter file (SPFILE) whenever the database structure metadata in the control file changes and whenever a backup record is added. The autobackup enables RMAN to recover the database even if the current control file, catalog, and SPFILE are lost. The RMAN autobackup feature is enabled with the CONFIGURE CONTROLFILE AUTOBACKUP ON statement.

7.5.5 Use Incrementally Updated Backups to Reduce Restoration Time

Oracle's incrementally updated backups feature enables you to create an image copy of a datafile, then regularly create incremental backups of the database and apply them to that image copy. The image copy is updated with all changes up through the SCN at which the incremental backup was taken. RMAN can use the resulting updated datafile in media recovery just as it would use a full image copy taken at that SCN, without the overhead of performing a full image copy of the database every day. A backup strategy based on incrementally updated backups can help minimize MTTR for media recovery.
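
The strategy can be implemented with a short script that is run daily; a sketch follows, using an illustrative tag name. On the first run, the commands create the image copy; on each later run, they apply the previous level 1 incremental backup to the copy and then take a new one:

RMAN> RECOVER COPY OF DATABASE WITH TAG 'incr_upd';
RMAN> BACKUP INCREMENTAL LEVEL 1 FOR RECOVER OF COPY WITH TAG 'incr_upd' DATABASE;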

7.5.6 Enable Change Tracking to Reduce Backup Time

Oracle's change tracking feature for incremental backups improves incremental backup performance by recording changed blocks in each datafile in a change tracking file. If change tracking is enabled, then RMAN uses the change tracking file to identify changed blocks for incremental backup, thus avoiding the need to scan every block in the datafile.
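
For example, the following sketch enables change tracking (the file location is illustrative):

SQL> ALTER DATABASE ENABLE BLOCK CHANGE TRACKING USING FILE '/u01/oradata/PROD/change_tracking.f';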

7.5.7 Create Database Backups on Disk in the Flash Recovery Area

Using automatic disk-based backup and recovery, you can create a flash recovery area, which automates management of backup-related files. Choose a location on disk and an upper bound for storage space and set a retention policy that governs how long backup files are needed for recovery. Oracle manages the storage used for backup, archived redo logs, and other recovery-related files for your database within that space. Files no longer needed are eligible for deletion when RMAN needs to reclaim space for new files.
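
A minimal sketch, assuming illustrative values for the quota and location (the size must be set before the destination):

SQL> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST_SIZE=100G;
SQL> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='/u02/flash_recovery_area';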

7.5.8 Create Tape Backups from the Flash Recovery Area

Use the BACKUP RECOVERY FILE DESTINATION RMAN command to move disk backups created in the flash recovery area to tape. Tape backups are used for offsite and long-term storage and are used to handle certain outage scenarios.

7.5.9 Determine Retention Policy and Backup Frequency

The backup retention policy is the rule set regarding which backups must be retained (on disk or other backup media) to meet recovery and other requirements. It may be safe to delete a specific backup because it is old enough to be superseded by more recent backups or because it has been stored on tape. You may also need to retain a specific backup on disk for other reasons such as archival requirements. A backup that is no longer needed to satisfy the backup retention policy is said to be obsolete.

Backup retention policy can be based on redundancy or a recovery window. In a redundancy-based retention policy, you specify a number n such that you always keep at least n distinct backups of each file in your database. In a recovery window-based retention policy, you specify a time interval in the past (for example, one week or one month) and keep all backups required to let you perform point-in-time recovery to any point during that window.
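
For example, one of the following RMAN configurations (the values are illustrative):

RMAN> CONFIGURE RETENTION POLICY TO REDUNDANCY 2;
RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;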

Frequent backups are essential for any recovery scheme. Base the frequency of backups on the rate or frequency of database changes such as:

  • Addition and deletion of tables

  • Insertions and deletions of rows in existing tables

  • Updates to data within tables

The more frequently your database is updated, the more often you should perform database backups. If database updates are relatively infrequent, then you can make whole database backups infrequently and supplement them with incremental backups, which will be relatively small because few blocks have changed.

7.5.10 Configure the Size of the Flash Recovery Area Properly

Configuring the size of the flash recovery area properly enables fast recovery from user error with Flashback Database and fast recovery from data failure with file or block media recovery from disk. The appropriate size of the flash recovery area depends on the retention policy, the backup frequency, the size of the database, and the rate and number of changes to the database. Specific formulas for determining the proper size of the flash recovery area for different backup scenarios are provided in Oracle Database Backup and Recovery Basics.

7.5.11 In a Data Guard Environment, Back Up to the Flash Recovery Area on All Sites

Take backups at the primary and secondary sites. The advantages of this practice are as follows:

  • It significantly reduces MTTR in certain double outage scenarios.

  • It avoids introducing new backup procedures upon a switchover or failover.

  • RMAN file and block media recovery is a recovery option for data failure outages at both primary and secondary sites.

Consider a scenario in which backups are done only at the secondary site, and suppose there is a site outage at the secondary site with an estimated time to recover of three days. The primary site is now completely vulnerable both to an outage that is typically resolved by a failover and to any outage that could be resolved by a local backup (such as a data failure outage resolved by block media recovery). In this scenario, a production database outage can be resolved only by physically shipping the offsite tape backups that were taken at the standby site. If primary site backups were available, then restoring locally would be an option in place of the failover that cannot be done. Data might be lost, but having primary site backups significantly shortens the MTTR.

Another undesirable approach is to begin taking primary site backups only after a secondary site outage occurs. Avoid this approach because it introduces new processes and procedures at a time when the environment is already under duress and the impact of a staff mistake is magnified. A crisis is also not the time to discover that backups cannot be taken at the primary site.

In addition, primary site disk backups are necessary to ensure a reasonable MTTR when using RMAN file or block media recovery. Without a local on-disk backup, a backup taken at the standby site must be restored to the primary site, significantly lengthening the MTTR for this type of outage.

7.5.12 During Backups, Use the Target Database Control File as the RMAN Repository

During backups, use the target database control file as the RMAN repository and resynchronize afterward with the RMAN RESYNC CATALOG command.

When creating backups to disk or tape, use the target database control file as the RMAN repository so that the ability to back up or the success of the backup does not depend on the availability of the RMAN catalog in the manageability database. This is accomplished by running RMAN with the NOCATALOG option. After the backup is complete, the new backup information stored in the target database control file can be resynchronized with the recovery catalog using the RESYNC CATALOG command.
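
A sketch of the sequence, assuming an illustrative catalog connect string:

$ rman TARGET / NOCATALOG
RMAN> BACKUP DATABASE;
RMAN> EXIT;

$ rman TARGET / CATALOG rman/<password>@catdb
RMAN> RESYNC CATALOG;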

7.5.13 Regularly Check Database Files for Corruption

Using the BACKUP VALIDATE RMAN command, database files should be checked regularly for block corruptions that have not yet been reported by a user session or by normal backup operations. RMAN scans the specified files and verifies their contents, checking for physical and logical errors, but does not actually perform the backup or recovery operation. Oracle records the address of the corrupt block and the type of corruption in the control file. Access these records through the V$DATABASE_BLOCK_CORRUPTION view, which can be used by RMAN block media recovery.

If block change tracking is enabled, then do not use the INCREMENTAL LEVEL option with BACKUP VALIDATE, so that all data blocks are read and verified.

To detect all types of corruption that are possible to detect, follow these guidelines (a sample command follows this list):

  • Do not specify the MAXCORRUPT option

  • Do not specify the NOCHECKSUM option

  • Do specify the CHECK LOGICAL option
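
For example, a validation run that follows these guidelines, with the results checked afterward:

RMAN> BACKUP VALIDATE CHECK LOGICAL DATABASE ARCHIVELOG ALL;
SQL> SELECT * FROM V$DATABASE_BLOCK_CORRUPTION;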

7.5.14 Periodically Test Recovery Procedures

Complete, successful, and tested backups are fundamental to the success of any recovery. Create test plans for the different outage types. Start with the most common outage types and progress to the least probable. Having backup procedures in place does not ensure that the backups are successful; they must be rehearsed. Monitor the backup procedure for errors, and validate backups by testing your recovery procedures periodically. Also, validate the ability to do backups and restores by using the RMAN BACKUP VALIDATE and RESTORE ... VALIDATE commands.

7.5.15 Back Up the OCR to Tape or Offsite

The Oracle Cluster Registry (OCR) contains cluster and database configuration information for RAC and Cluster Ready Services (CRS), such as the cluster database node list, CRS application resource profiles, and Event Manager (EVM) authorizations. Using the ocrconfig tool, there are two methods of copying OCR content and using the content for recovery. The first method uses automatically generated physical OCR file copies. The second method uses manually created logical OCR export files. The backup files created with ocrconfig should be backed up as part of the operating system backup using standard operating system or third-party tools.
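
For example, a sketch of both methods (the export path is illustrative):

$ ocrconfig -showbackup
$ ocrconfig -export /backup/ocr/ocr_export.dmp

Include the listed physical copies and the export file in the regular operating system backup.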

7.6 Recommendations for Fast Application Failover

In any high availability architecture, client and mid-tier applications can be redirected to available services within a Real Application Cluster and with some customization to a Data Guard or replicated database. This redirection can usually be transparent and can be used to reduce or eliminate both planned and unplanned downtime to the client or mid-tier application.

Services are prerequisites for fast, transparent application failover. When you create services in RAC, you can assign the services to instances for preferred (normal) and available (recovery) processing. When an instance to which you have assigned a service becomes unavailable, RAC can reconnect users connected to that instance to an available instance without service interruptions. Clients and mid-tier applications make connection requests by specifying a service using a global name. The connection information must be aware of all potential production instances or databases that can publish that service. Services enable you to model and deploy both planned and unplanned operations for any type of high availability or disaster recovery scenario.

To respond to changes in the cluster database, RAC's Cluster Ready Services (CRS), event system, and service callouts can be used to notify clients and mid-tier applications automatically. Event notifications can be configured to initiate recovery processing after failures to eliminate network timeouts and to provide end-to-end control over essential resources. These rapid notifications are sent automatically from RAC to JDBC clients through JDBC fast connection failover. In addition, RAC provides a robust callout and event system that enables the user to customize specialized callouts to respond to database UP and DOWN events. Use these callouts to notify middle-tier applications to interrupt existing problematic connections and redirect new connections to available resources.

For disaster recovery, the new production database can also be configured to publish the production service while stopping the services on the old production database. Again, callouts are required to notify the mid-tier applications.

To configure for fast application failover, follow these recommendations for middle-tier applications and clients:

Configure Connection Descriptors for All Possible Production Instances
Use RAC Availability Notifications and Events
Use Transparent Application Failover If RAC Notification Is Not Feasible

Follow this recommendation for all databases:

Configure Services

Follow these recommendations for RAC:

Configure CRS for High Availability
Configure Service Callouts to Notify Middle-Tier Applications and Clients

Follow these recommendations for Data Guard, replicated, or distributed environments:

Publish Standby or Nonproduction Services
Publish Production Services

7.6.1 Configure Connection Descriptors for All Possible Production Instances

Clients and mid-tier applications make connection requests by specifying a service using a global name. The connection information must be aware of all potential production instances or databases that are capable of publishing that service. Furthermore, these connection descriptors should ideally be stored in an LDAP or Oracle Name server for ease of maintenance and administration.

This section includes three sample Oracle Net connection descriptors. Use the PROD_RAC connection descriptor in Example 7-1 when there is no standby database available or if DNS site failover is deployed.

The PROD_RAC_DG connection descriptor in Example 7-2 has an address list that contains all production RAC instances and the Data Guard standby instances. This example can be used in the case of a production database outage when the hardware cluster is still available. It helps you avoid TCP/IP timeouts.

When the entire hardware cluster fails, the connection needs to be manually adjusted to point to the standby database using the connection descriptor provided in Example 7-3.

For disaster recovery, client-side DNS failover or site failover is recommended over listing both production instances and standby database instances.

Example 7-1 Connection Descriptor: No Standby Database Available or DNS Site Failover is Deployed

PROD_RAC=
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE=on)  
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE1)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE2)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE3)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE4)(PORT = 1520)))
    (CONNECT_DATA =
      (SERVICE_NAME = MAA_PROD)))

Example 7-2 Connection Descriptor: Production Database Outage When Hardware Cluster is Available

PROD_RAC_DG=
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE=on)  
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE1)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE2)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE3)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE4)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = DG_INSTANCE1)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = DG_INSTANCE2)(PORT = 1520)))
    (CONNECT_DATA =
      (SERVICE_NAME = MAA_PROD)))

Example 7-3 Connection Descriptor: Hardware Cluster Fails

PROD_DR=
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE=on)  
      (ADDRESS = (PROTOCOL = TCP)(HOST = DG_INSTANCE1)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = DG_INSTANCE2)(PORT = 1520)))
    (CONNECT_DATA =
      (SERVICE_NAME = MAA_PROD)))

7.6.2 Use RAC Availability Notifications and Events

Ideally, the middle-tier applications and clients should use the automatic RAC availability notifications and events. Applications that use Oracle Database 10g JDBC fast connection failover subscribe to these events automatically. Other applications may need to configure these service callouts and modify the applications to react to them.

7.6.3 Use Transparent Application Failover If RAC Notification Is Not Feasible

If you cannot use RAC notification or if RAC is not deployed, then use Transparent Application Failover (TAF). When Oracle Call Interface (OCI) client applications are used, TAF can be configured to transparently fail over connections between an application server and a database server.

OCI client applications can take advantage of automatic reconnection after failover and of callback functions that help to automate state recovery. They can also replay interrupted SELECT statements. The Oracle JDBC and ODBC drivers also support automatic database reconnection and replay of interrupted SELECT statements without the need for any additional application coding.

The TAF configuration is specified in the connect string that clients use to connect to the database.

The following sample TAF connection descriptor is used to describe how TAF works and how each component is used.

PROD=
  (DESCRIPTION =
    (FAILOVER=on)
    (ADDRESS_LIST =
      (LOAD_BALANCE=on)  
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE1)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE2)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE3)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE4)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = DG_INSTANCE1)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = DG_INSTANCE2)(PORT = 1520)))
    (CONNECT_DATA =
      (SERVICE_NAME = MAA_PROD)
      (FAILOVER_MODE =
        (BACKUP=PROD_BACKUP)(TYPE=SESSION)(METHOD=BASIC)
        (RETRIES=12)(DELAY=5))))

PROD_BACKUP=
  (DESCRIPTION =    
    (FAILOVER=on)
    (ADDRESS_LIST =
      (LOAD_BALANCE=on)   
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE1)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE2)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE3)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = RAC_INSTANCE4)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = DG_INSTANCE1)(PORT = 1520))
      (ADDRESS = (PROTOCOL = TCP)(HOST = DG_INSTANCE2)(PORT = 1520)))
    (CONNECT_DATA =
      (SERVICE_NAME = MAA_PROD)
      (FAILOVER_MODE =
        (BACKUP=PROD)(TYPE=SESSION)(METHOD=BASIC)
        (RETRIES=12)(DELAY=5))))

7.6.3.1 New Connections

New connections use the address list and connect to the first available listener that has the service (MAA_PROD) registered. This is true for both instance and listener failure. If there is an attempt to connect to a failed node, then a TCP/IP timeout occurs. Retries and delay have no effect on new connections because new connections attempt the address list only once.

7.6.3.2 Existing Connections

Existing connections use the backup connection descriptor and wait the number of seconds specified by DELAY between each iteration. After attempting all addresses in the backup connection descriptor, the client waits the number of seconds specified by DELAY before attempting the address list again. The client retries the address list up to the number of times specified by RETRIES. If the service is not available anywhere after RETRIES times DELAY seconds, then the client receives an ORA-3113 error. The maximum switchover or failover time should be less than RETRIES x DELAY if you want automatic client failover to a disaster recovery site.
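
For example, with the sample descriptor above, RETRIES=12 and DELAY=5 give existing connections approximately 12 x 5 = 60 seconds of retry time, so any switchover or failover that completes in under a minute appears to clients as an automatic, if delayed, reconnection.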

7.6.3.3 LOAD_BALANCE Parameter in the Connection Descriptor

LOAD_BALANCE sets the client-side load balancing. When it is set to ON, the client randomly selects an address from the address list. If a listener has multiple instances registered with it that offer the same service, then the listener can balance client requests between the instances based on the load on the instances at that time.

7.6.3.4 FAILOVER Parameter in the Connection Descriptor

Set FAILOVER to ON. The client then fails over through the address list if one or more of the services, instances, listeners, or nodes on the list is down or not available.

7.6.3.5 SERVICE_NAME Parameter in the Connection Descriptor

The service name is published by the database to the listener.

7.6.3.6 RETRIES Parameter in the Connection Descriptor

This parameter determines how many times an existing connection retries the address list in the BACKUP connection descriptor after a failover. This parameter has no effect on new connections, because new clients go through the address list only once.

7.6.3.7 DELAY Parameter in the Connection Descriptor

This parameter determines the number of seconds the client waits between each retry. After going through the address list, the client waits for the number of seconds specified by DELAY before retrying. There is no delay between individual addresses in the address list. The delay applies only after the whole list has been traversed.

With RAC, TCP/IP timeouts due to an unavailable node in a cluster are avoided because RAC manages a Virtual Internet Protocol (VIP) and cluster alias. However, TCP/IP timeouts cannot be avoided when the entire cluster or non-RAC host is not available. To avoid this TCP/IP timeout, the customer should do one of the following:

  • Create a special event and callout to detect and react to such an event

  • Adjust TCP/IP parameters to reduce overall timeout impact

The customized callout should interrupt existing connections and redirect new connections with a new connection descriptor that does not contain the unavailable nodes or clusters.

Adjusting TCP/IP parameters may have other application and system impact, so always use caution. However, the following TCP/IP parameters were modified on a Solaris platform to reduce overall TCP/IP timeout intervals in Oracle testing:

  • tcp_ip_abort_cinterval

  • tcp_ip_abort_interval

  • tcp_keepalive_interval

Check your operating system platform documentation for similar parameters.

7.6.4 Configure Services

Within RAC, use the Database Configuration Assistant (DBCA), Server Control (SRVCTL), or the DBMS_SERVICE PL/SQL package to create services. Then use the DBCA or Enterprise Manager to administer them. If this is a non-RAC environment, then set the SERVICE_NAMES database initialization parameter.
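
For example, a sketch using SRVCTL, assuming an illustrative database PROD with preferred instances prod1 and prod2 and available instance prod3:

$ srvctl add service -d PROD -s MAA_PROD -r prod1,prod2 -a prod3
$ srvctl start service -d PROD -s MAA_PROD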

7.6.5 Configure CRS for High Availability

CRS supports services and the workload management framework that maintains continuous availability of the services. CRS also supports the other RAC resources such as the database, the database cluster aliases, and the resources that are local to every node that supports RAC.

Node resources include the virtual internet protocol (VIP) address for the node, the Global Services Daemon, the Enterprise Manager Agent, and the Oracle Net listeners. These resources are automatically started when CRS starts with the node and CRS automatically restarts them if they fail.

7.6.6 Configure Service Callouts to Notify Middle-Tier Applications and Clients

Configure service callouts to notify middle-tier applications and clients about UP, DOWN, and NOT_RESTARTING events. RAC automatically notifies JDBC clients through JDBC fast connection failover without any adjustments. In the rare case that the entire RAC cluster fails, a separate notification and callout is required to notify the middle-tier applications to connect to a disaster recovery or secondary database.

If the middle-tier application or clients are not JDBC clients, then you must use RAC's event management and service callout facilities to configure a customized callout. The callout needs to notify the middle-tier application to do the following:

  • Interrupt existing connections to problematic or unavailable nodes or instances

  • Redirect new connections to available production instances or nodes

Interrupting existing connections helps avoid long TCP/IP timeout delays. Redirecting new connections to available production instances or nodes may require passing a new connection descriptor that does not include any inaccessible hosts so that TCP/IP timeouts can be avoided.
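
The following is a minimal shell sketch of such a callout. The callout directory ($ORA_CRS_HOME/racg/usrco) and the name=value argument format should be verified for your release, and the notify_midtier program is hypothetical; it stands in for whatever site-specific mechanism alerts the middle tier:

#!/bin/sh
# Hypothetical callout script placed in $ORA_CRS_HOME/racg/usrco/
# CRS invokes every executable in this directory with name=value
# pairs describing the event, for example:
#   ... service=MAA_PROD database=PROD instance=PROD1 host=node1 status=down
case "$*" in
  *status=down*|*status=not_restarting*)
    # Tell the middle tier to interrupt stuck connections and
    # redirect new connections away from the unavailable resource.
    /usr/local/bin/notify_midtier "$@" &
    ;;
esac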

7.6.7 Publish Standby or Nonproduction Services

When RAC and Enterprise Manager are integrated, standby or nonproduction services can be published automatically. If the standby database is not managed by Enterprise Manager or is not part of a RAC environment, then you can manually alter the SERVICE_NAMES database initialization parameter to specify a nonproduction service. For example:

SQL> ALTER SYSTEM SET SERVICE_NAMES='STANDBY';

7.6.8 Publish Production Services

When RAC and Enterprise Manager are integrated, production services can be published automatically. If the new production database is not managed by Enterprise Manager or is not part of a RAC environment, then you can manually alter the SERVICE_NAMES database initialization parameter to specify the production services. For example:

SQL> ALTER SYSTEM SET SERVICE_NAMES='PROD_SVC1, PROD_SVC2, PROD_SVC3';

PROD_SVC1 can be SALES or HR, for example.