4 High-Availability Architectures and Solutions

The Maximum Availability Architecture (MAA) is Oracle's best practices blueprint. It is based on proven Oracle high-availability technologies and recommendations. The goal of the MAA is to remove the complexity in designing the optimal high-availability architecture by providing configuration recommendations and tuning tips to get the most out of your architecture and Oracle features.

This chapter describes the various high-availability architectures in an Oracle environment and helps you to choose the correct architecture for your organization.

It includes the following sections:

Oracle Database High-Availability Architectures
Choosing the Correct High-Availability Architecture
Integrating Application Server High Availability
Integrating High Availability for All Applications

4.1 Oracle Database High-Availability Architectures

The following sections provide an overview of the Oracle Database high-availability architectures:

Oracle Database
Oracle Database with Oracle Clusterware (Cold Cluster Failover)
Oracle Database with Oracle Real Application Clusters
Oracle Database with Oracle RAC on Extended Distance Clusters
Oracle Database with Data Guard
Oracle Database with Oracle Clusterware and Data Guard
Oracle Database with Oracle RAC and Data Guard
Oracle Database with Streams

All of these architectures must leverage the MAA best practices.

The comparison of the different architectures highlighting their benefits and considerations is described in Choosing the Correct High-Availability Architecture.

Once you have chosen an architecture, you can them implement it using the operational and configuration best practices described in the MAA white papers and the Oracle Database High Availability Best Practices. These best practices are required to maximize the full benefits of each architecture. See Chapter 5, "MAA and High Availability Best Practices" for more information about the best practices documentation.

4.1.1 Oracle Database

Oracle Database is a single-instance, noncluster database. Although this architecture does not have the node or database redundancy, there are numerous high-availability features that can be used in this architecture and any subsequent database architectures. These features make the standalone database on a single computer attractive and available for certain failures and planned maintenance activities.

Oracle recommends that you leverage the following Oracle features for this architecture. This is the base foundation for subsequent high-availability architectures.

Fast-Start Fault Recovery bounds and optimizes instance and database recovery times.
Automatic Storage Management tolerates storage failures and optimizes storage performance and usage.
Oracle Flashback Technology optimizes logical failure repair. Oracle recommends that you use automatic undo management with sufficient space to attain your desired undo retention guarantee, enable Flashback Database and allocate sufficient space and I/O bandwidth in the flash recovery area.
Recovery Manager optimizes local repair of data failures. Oracle recommends that you create and store the local backups in the flash recovery area.
Flash Recovery Area manages local recovery related files.
Online Reorganization and Redefinition allows for dynamic data changes.
Oracle Security Features prevent unauthorized access and changes.
Hardware Assisted Resilient Data (HARD) Initiative detects and prevents data corruptions and stray or misdirected writes (that result in a lost write to the intended location).
Data Recovery Advisor provides intelligent advise and repair of different data failures
Data Block Corruption Prevention and Detection Parameters detects and prevents some corruptions and lost writes.
Dynamic Resource Provisioning allows for dynamic system changes.
Online Patching allows for dynamic database patches of typical diagnostic patches
Oracle Secure Backup provides a centralized tape backup management solution.

Figure 4-1 shows a basic, single-node Oracle Database that includes an ASM instance.^Foot 1 This architecture takes advantage of several high-availability features, including Flashback Database, Online Redefinition, Recovery Manager, and Oracle Secure Backup.

Figure 4-1 Single-Node, Nonclustered Oracle Database with an ASM Instance

Description of "Figure 4-1 Single-Node, Nonclustered Oracle Database with an ASM Instance"

4.1.2 Oracle Database with Oracle Clusterware (Cold Cluster Failover)

Oracle Clusterware is software that manages the availability of user applications and Oracle databases. The servers on which you want to run Oracle Clusterware must be running the same operating system.

Many high-availability architectures today use clusters alone to provide some rudimentary node redundancy and automatic node failover. However, when you use Oracle Clusterware, there is no need or advantage to using third-party clusterware.

Oracle Clusterware provides a number of benefits over third-party clusterware:

Oracle Clusterware enables you to use an entire software solution from Oracle, avoiding the cost and complexity of maintaining additional cluster software.

By reducing the number of combinations of software that you need to coordinate and support, you can increase the manageability and availability of your system software.
Oracle Clusterware provides seamless integration with, and migration to, Oracle RAC and Oracle Data Guard.

Section 4.1.7 describes how you can achieve the highest level of availability with Oracle RAC and Oracle Data Guard
Oracle Clusterware includes all of the features required for cluster management, including node membership, group services, global resource management, and high-availability functions such as managing third-party applications, event management, and Oracle notification services that enable Oracle clients to reconnect to the new primary database after a failure.
Oracle Clusterware uses a private network and a voting disk to detect and resolve split brain^Foot 2 scenarios.

With Oracle Clusterware you can provide a cold failover cluster to protect an Oracle instance from a system or server failure. The basic function of a cold failover cluster is to monitor a database instance running on a server, and if a failure is detected, to restart the instance on a spare server in the cluster. Network addresses are failed over to the backup node. Clients on the network experience a period of lockout while the failover takes place and are then served by the other database instance once the instance has started. Also, you can use the Oracle Clusterware ability to relocate applications and application resources (using the CRS_RELOCATE command) as a way to move the workload to another node so you can perform planned system maintenance on the production server.

The cold cluster failover solution with Oracle Clusterware provides these additional advantages over a basic database architecture:

Automatic recovery of node and instance failures in minutes
Automatic notification and reconnection of Oracle integrated clients^Foot 3
Ability to customize the failure detection mechanism.

For example, you can use your favorite application query in the database check action. Providing application-specific failure detection means Oracle Clusterware can fail over not only during the obvious cases such as when the instance is down, but also in the cases when, for example, an application query is not meeting a particular service level.
High availability functionality to manage third-party applications
Rolling release upgrades of Oracle Clusterware

The operation of an Oracle Clusterware cold failover cluster is depicted in Figure 4-2 and Figure 4-3. These figures show how you can use the Oracle Clusterware framework to make both the Oracle database and your custom applications highly available.

Figure 4-2 shows a configuration that uses Oracle Clusterware to extend the basic Oracle Database architecture and provide cold cluster failover. In the figure, the configuration is operating in normal mode in which Node 1 is the active instance connected to the Oracle Database that is servicing applications and users. Node 2 is connected to Node 1 and to the Oracle Database, but it is currently standby mode.

Figure 4-2 Oracle Database with Oracle Clusterware (Before Cold Cluster Failover)

Description of "Figure 4-2 Oracle Database with Oracle Clusterware (Before Cold Cluster Failover)"

Figure 4-3 the Oracle Clusterware configuration after a cold cluster failover has occurred. In the figure, Node 2 is now the active instance connected to the Oracle Database and servicing applications and users. Node 1 is connected to Node 2 and to the Oracle Database but Node 1 is currently idle, in standby mode.

To provide this transparent failover capability, Oracle Clusterware requires a virtual IP address for each node in the cluster. With Oracle Clusterware you also define an application virtual IP address so users can access the application independently of the node in the cluster where the application is running. You can define multiple application VIPs, with generally one application VIP defined for each application running. The application VIP is tied to the application by making it dependent on the application resource defined by Cluster Ready Services (CRS).

Figure 4-3 Oracle Database with Oracle Clusterware (After Cold Cluster Failover)

Description of "Figure 4-3 Oracle Database with Oracle Clusterware (After Cold Cluster Failover)"

Note:

Neither Oracle Enterprise Manager or Oracle Universal Installer (OUI) provide configuration support for Oracle Clusterware. To configure an Oracle Clusterware environment, follow the step-by-step instructions in your platform-specific Oracle Clusterware installation guide.

4.1.3 Oracle Database with Oracle Real Application Clusters

An architecture that combines the Oracle Database with Real Application Clusters (Oracle RAC) is inherently a highly available system. Unlike a traditional monolithic database server that is expensive and is not flexible to changing capacity and resource demands, Oracle RAC combines the processing power of multiple interconnected computers to provide system redundancy, scalability, and high availability.

The clusters that are typical of Oracle RAC environments can provide continuous service for both planned and unplanned outages. Oracle RAC builds higher levels of availability on top of the standard Oracle features. All single instance high-availability features, such as the Flashback technologies and online reorganization, also apply to Oracle RAC. Applications scale in an Oracle RAC environment to meet increasing data processing demands without changing the application code. In addition, allowing maintenance operations to occur on a subset of components in the cluster while the application continues to run on the rest of the cluster can reduce planned downtime.

Oracle RAC exploits the redundancy that is provided by clustering to deliver availability with n - 1 node failures in an n-node cluster. Unlike the cold cluster model where one node is completely idle, all instances and nodes can be active to scale your application.

The Oracle Database with Oracle RAC architecture provides the following benefits over a traditional monolithic database server and the cold cluster failover model:

Scalability across database instances
Flexibility to increase processing capacity using commodity hardware without downtime or changes to the application
Ability to tolerate and quickly recover from computer and instance failures (measured in seconds)
Rolling upgrades for system and hardware changes
Rolling patch upgrades for some interim patches
Fast, automatic, and intelligent connection and service relocation and failover
Load balancing advisory and runtime connection load balancing
Comprehensive manageability integrating database and cluster features

Figure 4-4 shows the Oracle Database with Oracle RAC architecture.

Figure 4-4 Oracle Database with Oracle RAC Architecture

Description of "Figure 4-4 Oracle Database with Oracle RAC Architecture"

4.1.4 Oracle Database with Oracle RAC on Extended Distance Clusters

The Oracle Database with Oracle RAC architecture is designed primarily as a scalability and availability solution that resides in a single data center. It is possible, under certain circumstances, to build and deploy an Oracle RAC system where the nodes in the cluster are separated by greater distances. This architecture is referred to as an extended distance cluster.

An Oracle RAC extended distance cluster is an architecture that provides extremely fast recovery from a site failure and allows for all nodes, at all sites, to actively process transactions as part of single database cluster. For example, if a customer has a corporate campus the extended Oracle RAC configuration could consist of individual Oracle RAC nodes being located in separate buildings. Oracle RAC on an extended distance cluster provides greater high availability than a local Oracle RAC cluster, but it may not fit the full disaster recovery requirements of your organization.

When the two data centers are located relatively close, extended distance clusters can provide great protection for some disasters, but not all. You should do an analysis to determine if both sites are likely to be affected by the same disaster. For example, if the extended cluster configuration is set up properly, it can provide protection against disasters such as a local power outage, an airplane crash, or server room flooding. However, they cannot protect against comprehensive disasters such as earthquakes, hurricanes, and regional floods that affect a greater area. (For disaster recovery, use the architecture described in Section 4.1.7, "Oracle Database with Oracle RAC and Data Guard - MAA".)

The advantages to using Oracle RAC on extended distance clusters include:

Ability to fully use all system resources without jeopardizing the overall failover times for instance and node failures
Extremely rapid recovery if one site should fail
All of the Oracle RAC benefits listed in Section 4.1.3

Note:

While this architecture can be effective and it has been successfully implemented, you should implement it only in the environments (distance, latency, and degree of protection) recommended in this discussion.

When configuring the extended cluster architecture, Oracle recommends that you:

Use ASM normal or high redundancy so that a storage array failure does not affect the application and database availability.

Beginning with Oracle Database Release 11g, ASM includes a preferred read capability that ensures that a read I/O accesses the local storage instead of unnecessarily reading from a remote failure group. When you configure ASM failure groups extended distance clusters, you can specify that a particular node read from a failure group extent that is closest to the node, even if it is a secondary extent. This is especially useful in extended distance clusters where remote nodes have asymmetric access with respect to performance, thus leading to better usage and lower network loading.

See Also:
Oracle Database Storage Administrator's Guide for information about configuring preferred read failure groups with the ASM_PREFERRED_READ_FAILURE_GROUPS initialization parameter.
Add a third voting disk to a third site

Because most extended distance cluster have only two storage systems (one at each site). During normal processing, each node writes and reads a disk heartbeat at regular intervals, but if the heartbeat cannot complete the node exits, generally causing a node restart. Thus, the site that houses the majority of the voting disks is a potential single point of failure for the entire cluster. For availability reasons, you should add a third site that can act as the arbitrator in case either one of the sites fail or a communication failure occurs between the sites.

To build an Oracle RAC database on an extended distance cluster environment, you must:

Configure one set of nodes at Site A.
Configure another set of nodes at Site B.
Use a fast, redundant dedicated connection between the nodes (or buildings) for Oracle RAC cross-instance communication.

You can optionally configure Dense Wavelength Division Multiplexing (referred to as DWDM, or Dark Fiber) to allow communication to occur between the sites without using repeaters and to allow greater distances between the sites. However, the disadvantage is that Dark Fiber can be prohibitively expensive.
Use server-based or array-based mirroring to host all of the data on both sites and keep it synchronously mirrored. Oracle recommends server-based mirroring using ASM to internal mirror across the two storage arrays. Implementing mirroring with ASM provides and active/active storage environment in which system write I/Os are propagated to both sets of disks, making them appear as a single set of disks independent of location.

The ASM volume manager provides flexible server-based mirroring redundancy options. You can choose to use external redundancy to defer the mirroring protection function to the hardware RAID storage subsystem. The ASM normal and high-redundancy options allow two-way and three-way mirroring, respectively.
Configure a third site for a voting disk^Foot 4

Figure 4-5 shows an Oracle RAC extended distance cluster for a configuration that has multiple active instances on six nodes at two different locations: three nodes at Site A and three at Site B. The public and private interconnects, and the Storage Area Network (SAN) are all on separate dedicated channels, with each one configured redundantly. For availability reasons, the Oracle Database is a single database that is mirrored at both of the sites. Also, to prevent a full cluster outage if either site fails, the configuration includes a third voting disk on an inexpensive, low-end standard Network File System (NFS) mounted device.

Figure 4-5 Oracle RAC On an Extended Distance Cluster

Description of "Figure 4-5 Oracle RAC On an Extended Distance Cluster"

Finally, consider the following when implementing this architecture:

High internode and interstorage latency can have a major effect on performance and throughput. Performance testing is mandatory to assess the impact of latency. In general, distances of 50 km or less are recommended.
Network, storage, and management costs will increase
Write performance incurs the overhead of network latency
Because this is a single database without Oracle Data Guard, there is no protection from data corruption or data failures
A third site is recommended for high availability to be leveraged as another location for the voting disk (quorum disk) and as an arbitrator in case of connection issues between the nodes

4.1.5 Oracle Database with Data Guard

Oracle Data Guard is a high availability and disaster-recovery solution that provides very fast automatic failover (referred to as fast-start failover) in the case of database failures, node failures, corruption, and media failures. Furthermore, the standby databases can be used for read-only access and subsequently for reader farms, for reporting purposes, and for testing and development purposes.

While traditional solutions (such as backup and recovery from tape, storage based remote mirroring, and database log shipping) can deliver some level of high availability, Data Guard provides the most comprehensive high availability and disaster recovery solution for Oracle databases.

Data Guard provides a number of advantages over traditional solutions, including the following:

Fast, automatic or automated failover for data corruptions, lost writes, and database and site failures
Protection against data corruptions and lost writes on the primary database
Reduced downtime with Data Guard rolling upgrade capabilities
Ability to offload primary database activities, such as backups, queries or reporting without sacrificing RTO and RPO
Site failures do not require instance restart, storage remastering, or application reconnections
Transparent to applications
Effective network utilization

In addition, for data resident in Oracle databases, Oracle Data Guard, with its built in zero data loss capability, is more efficient, less expensive and better optimized for data protection and disaster recovery than traditional remote mirroring solutions. Oracle Data Guard provides a compelling set of technical and business reasons that justify its adoption as the disaster recovery and data protection technology of choice, over traditional remote mirroring solutions.The following list summarizes the advantages of using Oracle Data Guard compared to using remote mirroring solutions:

Better Network Efficiency—With Oracle Data Guard, only the redo data needs to be sent to the remote site. However, if a remote mirroring solution is used for data protection, typically you must mirror the database files, the online redo logs, the archived redo logs and the control file. If the flash recovery area is on the source volume that is remotely mirrored, then you must also remotely mirror the flashback logs. This means that compared to Data Guard, a remote mirroring solution sends each change many more times to the remote site.
Better Performance—Data Guard only transmits writes to the redo logs of the primary database, whereas remote mirroring solutions must transmit these writes and every write I/O to data files, additional members of online log file groups, archived redo log files, and control files. Data Guard is designed so that it does not affect the Oracle DBWR process that writes to data files, because anything that slows down DBWR impacts database performance. However, remote mirroring solutions do impact DBWR performance because they subject all DBWR writes to network and disk I/O induced delays inherent to synchronous, zero-data-loss configurations. Compared to mirroring, Data Guard provides better performance and is more efficient, Data Guard always verifies the state of the standby database and validates the data before applying redo, and Data Guard enables you to use the standby database for updates while it continues to protect the primary database.
Better suited for WANs—Remote mirroring solutions based on storage systems often have a distance limitation due to the underlying communication technology (Fibre Channel, ESCON) used by the storage systems. In a typical example, the maximum distance between these two boxes connected in a point-to-point fashion and running synchronously can be only 10 km. Using specialized devices this distance can be extended to 66 km. However, when the standby data center is more than 66 km apart, you have to use a series of repeaters and converters from third-party vendors. These devices convert ESCON/Fibre Channel to the appropriate IP, ATM or SONET networks.
Better resilience and data protection—Oracle Data Guard ensures much better data protection and data resilience than remote mirroring solutions, because corruptions introduced on the production database probably can be mirrored by remote mirroring solutions to the standby site, but corruptions are eliminated by Data Guard. For example, if a stray write occurs to a disk, or there is a corruption in the file system, or the Host Bus Adaptor corrupts a block as it is written to disk, then a remote mirroring solution may propagate this corruption to the DR site. Because Data Guard only propagates the redo data in the logs, and the log file consistency is checked before it is applied, all such external corruptions are eliminated by Data Guard.
Higher Flexibility—Data Guard is implemented on top of pure commodity hardware. It only requires a standard TCP/IP-based network link between the two computers. There is no fancy or expensive hardware required. It also allows the storage to be laid out in a different fashion from the primary. For example, you can put the files on different disks, volumes, file systems, and so on.
Better Functionality—Data Guard, with its full suite of data protection features (Redo Apply for physical standby databases and SQL Apply for logical standby databases, multiple protection modes, push-button automated switchover and failover capabilities, automatic gap detection and resolution, GUI-driven management and monitoring framework, cascaded redo log destinations), is a much more comprehensive and effective solution optimized for data protection and disaster recovery than remote mirroring solutions.
Higher ROI—Businesses have to ensure that they are getting as much value as possible from their IT investments, and no IT infrastructure is sitting idle. Data Guard is designed to allow businesses get something useful out of their expensive investment in a disaster-recovery site. Typically, this is not possible with remote mirroring solutions.

The recommended high availability and disaster-recovery architectures that leverage Oracle Data Guard are described in the following sections:

Overview of Single Standby Database Architectures
Overview of Multiple Standby Database Architectures

4.1.5.1 Overview of Single Standby Database Architecture s

A single standby database architecture consists of the following key traits and recommendations:

Primary database resides in Site A.
Standby database resides in Site B. If zero data loss is required with minimum performance impact on the primary database, the best practice is to locate the secondary site within 200 miles from the primary database. Note, however, that the synchronous redo transport does not impose any physical distance limitation.
Fast-start failover is recommended to provide automatic failover without user intervention and bounded recovery time. If the primary database uses the asynchronous redo transport, configure your maximum data loss tolerance or the Data Guard broker's FastStartFailoverLagLimit property to meet your business requirements. The Observer (thin client watchdog) resides in the application tier and monitors the availability of the primary database. The observer is described in more detail in Oracle Data Guard Broker.
Use a physical standby database if read-only access is sufficient.
Evaluate logical standby databases if additional indexes are required for reporting purposes and if your application only uses data types supported by logical standby database and SQL Apply.

Figure 4-6 shows the relationships between the primary database, target standby database, and the observer before, during, and after a fast-start failover occurs.

Figure 4-6 Relationship of Primary and Standby Databases and the Observer During Fast-Start Failover

Description of "Figure 4-6 Relationship of Primary and Standby Databases and the Observer During Fast-Start Failover"

The following list describes examples of Data Guard configurations using single standby databases:

A national energy company uses a standby database located in a separate facility 10 miles away from its primary data center. Outages or data loss that could impact customer service and safety are avoided by using Data Guard synchronous transport and automatic failover (fast-start failover).
An infrastructure services provider to the telecommunication industry utilizes a single standby database located over 400 miles away from the primary configured for synchronous redo transport, enabling zero data loss failover for maximum data protection and high availability.
A telecommunications provider uses asynchronous redo transport to synchronize a primary database on the west cost of the United Sates, with a standby database on the east coast, over 2,200 miles away. This enables the provider to use existing data centers that are geographically isolated, offering a unique level of high availability.
A global manufacturing company used Data Guard to replace storage-based remote mirroring and maintain a standby database at its recovery site 50 miles away from the primary site. Data Guard provides more comprehensive data protection and its more efficient network utilization means there is plenty of headroom to grow without incurring the additional expense of upgrading their network.

4.1.5.2 Overview of Multiple Standby Database Architectures

This architecture is identical to the single-standby database architecture that was described in Section 4.1.5.1, except that there are multiple standby databases in the same Data Guard configuration. The following list describes some implementations for a multiple standby database architecture:

Continuous and transparent disaster or high-availability protection in case of an outage at the primary database or the targeted standby database
Reader farms or look up databases
Reporting databases
Regional reporting or reader databases for better response time
Synchronous transport transmits to a more local standby database, and asynchronous transport transmits to a more remote standby database to provide optimum levels of performance and data protection
Testing and development clones using snapshot standby databases
Rolling upgrades

Note that it is possible to convert a physical standby database to a logical standby database or to a snapshot standby database, or you can create additional logical standby databases or snapshot standby databases:

Transient logical standby databases can be used to minimize downtime for database upgrades. Using transient logical standby databases is helpful in Data Guard architectures where there are no logical standby databases.

In a multiple standby database environment, you can create a transient logical standby database temporarily (for planned maintenance) and then convert it back to the physical standby database role. For example, you can use transient logical standby databases to minimize downtime for database upgrades, when required. There is no need to create a separate logical standby database to perform upgrades. The high-level steps for rolling upgrades with a transient logical standby database are as follows:
1. Start performing a rolling database upgrade with the physical standby database.
2. Temporarily convert the physical standby database to a logical standby database to perform the upgrade. (Note that data type restrictions are limited for the short window of time required to perform an upgrade.)
3. Revert the logical standby database back to the physical standby database role.
See Also:
Oracle Data Guard Concepts and Administration or the Oracle Database High Availability Best Practices for step-by-step instructions about performing a rolling upgrade with a transient logical standby database
Snapshot standby databases can be used as a clone or a test database to test new functionality and new releases. The snapshot standby database continues to receive and queue redo data so data protection and RPO are not sacrificed.

Snapshot standby databases diverge from the primary database over time because redo data from the primary database is not applied when it is received. Redo Apply does not apply the redo data until you convert the snapshot standby database back into a physical standby database, and all local updates that were made to the snapshot standby database are discarded. Although the local updates to the snapshot standby database cause additional divergence, the data in the primary database is fully protected by means of the redo logs that are located at the standby site.

Figure 4-7 shows the production database at the primary site and multiple standby databases at secondary sites. Also, see Figure 2-7, "Standby Database Reader Farms" for another example of a multiple standby database environment.

Figure 4-7 Oracle Database with Data Guard Architecture on Primary and Multiple Standby Sites

Description of "Figure 4-7 Oracle Database with Data Guard Architecture on Primary and Multiple Standby Sites"

4.1.6 Oracle Database with Oracle Clusterware and Data Guard

If your business does not require the scalability and additional high availability benefits provided by Oracle RAC, but you still need all the benefits of Oracle Data Guard and cold cluster failover, then this architecture is a good compromise. With Oracle 11g, Oracle Clusterware cold cluster failover combined with Oracle Data Guard makes a tightly integrated solution in which failover to the secondary node in the cold cluster failover is transparent and does not require you to reconfigure the Data Guard environment or perform additional steps.

Figure 4-8 shows an Oracle Clusterware and Oracle Data Guard architecture that consists of a primary and a secondary site. Both the primary and secondary sites contain Oracle application servers, two database instances, and an Oracle Database.

Figure 4-8 Oracle Clusterware (Cold Cluster Failover) and Oracle Data Guard

Description of "Figure 4-8 Oracle Clusterware (Cold Cluster Failover) and Oracle Data Guard"

In Figure 4-8:

The application servers on the secondary site are connected to the WAN traffic manager by a dotted line to indicate that they are not actively processing client requests at this time. The application server on the secondary site can be active and processing client requests such as queries if the standby database is a physical standby database with real-time query enabled, or if it is a logical standby database.
Oracle Data Guard transmits redo data from the primary database to the secondary site to keep the databases synchronized.
Oracle Clusterware manages the availability of both the user applications and Oracle databases.
Oracle Clusterware provides tolerance of node failures, while Data Guard provides additional protection against data corruptions, lost writes, and database and site failures. (See Oracle Database with Data Guard for a complete description.)
Although Cold Cluster Failover is not shown in Figure 4-8, you can configure it by adding a passive node on the secondary site.

4.1.7 Oracle Database with Oracle RAC and Data Guard

You can achieve the highest level of availability when using Oracle RAC and Oracle Data Guard without application changes. These Oracle features provide the most comprehensive architecture for reducing downtime for scheduled outages and preventing, detecting, and recovering from unscheduled outages. This architecture combines the benefits of both Oracle RAC and Data Guard and it is the recommended architecture for Maximum Availability Architecture (MAA).

To protect against site failures, the MAA recommends Oracle RAC and Data Guard reside on separate systems (clusters) and data centers. Figure 4-9 shows the recommended MAA configuration, with Oracle Database, Oracle RAC, and Data Guard. Configuring symmetric sites is recommended to ensure that each site can accommodate the performance and scalability requirements of the application after any role transition. Furthermore, operational practices across role transitions is simplified when the sites are symmetric.

Figure 4-9 Oracle Database with Oracle RAC and Data Guard - MAA

Description of "Figure 4-9 Oracle Database with Oracle RAC and Data Guard - MAA"

4.1.8 Oracle Database with Streams

Like Oracle Data Guard in SQL Apply mode, Oracle Streams can capture database changes, propagate them to destinations, and apply the changes at these destinations. Streams is optimized for replicating data. Streams can capture changes at a source database, and the captured changes can be propagated asynchronously to replica databases. A logical copy configured and maintained using Streams is called a replica, not a logical standby database, because it provides many capabilities that are beyond the scope of the normal definition of a standby database.

You might choose to use Streams to configure and maintain a logical copy of your production database. Although using Streams might require additional work, it offers increased flexibility that might be required to meet specific business requirements.

Oracle Database with Streams provides granularity and control over what is replicated and how it is replicated. It supports bidirectional replication, data transformations, subsetting, custom apply functions, and heterogeneous platforms. It also gives users complete control over the routing of change records from the primary database to a replica database. The capture of data changes can be performed at the primary database or downstream at a replica database. This enables users to build hub and spoke network configurations that can support hundreds of replica databases.

Consider using Oracle Database with Streams if one or more of the following conditions are true:

Updates are required on both sites or databases, and the changes need to be propagated bidirectionally
Site configurations are on heterogeneous platforms
Different character sets are required between the primary database and its replicas
Fine control of information and data sharing are required
More investment and expertise to build and maintain an integrated high-availability solution is available

Figure 4-10 shows a sample Oracle Database using Streams to replicate data for a schema among three Oracle databases. DML and DDL changes made to tables in the hr schema are captured at all databases in the environment and propagated to each of the other databases in the environment.

4.2 Choosing the Correct High-Availability Architecture

This section summarizes the advantages of the different high-availability architectures and provides guidelines for you to choose the correct high-availability architecture for your business.

Chapter 3, "Determining Your High Availability Requirements" describes how the high-availability requirements for the business plus its allotted budget determine the appropriate architecture. The key factors include:

Recovery time objective (RTO) and recovery point objective (RPO) for unplanned outages and planned maintenance
Manageability Overhead (MO)
Total Cost of Ownership (TCO) and Return On Investment (ROI)

For example, Table 4-1 provides some insight into the probability of different outages during unplanned and planned activities. The data is derived from actual user experiences and from Oracle service requests.

Table 4-1 Frequency of Outages

Activity	Outage
Media or disk failures	High
Application patches	High
Application failures	High
Logical or user failures that manipulate logical data (DMLs and DDLs)	High
Data corruptions (hardware or software induced)	Medium
Computer failures	Medium
Database patches	Medium
Hardware patches and upgrades	Low
Operating system patches and upgrades	Low
Database or application upgrades	Low
Database failures	Low
Platform migrations	Very low
Site failures	Very low

Table 4-2 recommends architectures based on your business requirements for RTO, RPO, MO, scalability, and other factors.

Table 4-2 High-Availability Architecture Recommendations

Consider Using ..	Business or Application Impact ...
Oracle Database with Oracle Clusterware (Cold Cluster Failover)	Maximum RTO for instance or node failure is in minutes MO is low ROI is low Rolling upgrade and patch capabilities for Oracle Clusterware with zero database downtime.
Oracle Database with Oracle Real Application Clusters	Maximum RTO for instance or node failure is zero for the database^Foot 1 MO is medium ROI is high Rolling upgrade for system, clusterware, operating system and some Oracle interim patches Database scalability beyond one instance or node
Oracle Database with Oracle RAC on Extended Distance Clusters	All of the business benefits of Oracle Real Application Clusters MO is high^Foot 2 ROI is medium Additional protection from data center failure with special considerations that are documented in "Oracle Database with Oracle RAC on Extended Distance Clusters"
Oracle Database with Data Guard	Maximum RTO for instance or node failure is in seconds to minutes Maximum RTO for data corruptions, database, or site failures is in seconds to minutes MO is low ROI is high Rolling upgrade for system, clusterware, database, and operating system Offload read-only, reporting, testing and backup activities to the standby database
Oracle Database with Oracle Clusterware and Data Guard	All of the business benefits of Oracle Clusterware (Cold Cluster Failover) and Oracle Data Guard MO is low ROI is medium
Oracle Database with Oracle RAC and Data Guard	All of the business benefits of Oracle RAC and Oracle Data Guard MO is medium ROI is high
Oracle Database with Streams	Maximum RTO for instance or node failure is in seconds to minutes Maximum RTO for data corruption, cluster, database, or site failures is in seconds to minutes MO is high^Footref 2 ROI is high Rolling upgrade for system, clusterware, operating system, database and application Replication between multiple updatable databases or support for heterogeneous platforms Flexible propagation and management of data, transactions, and events With Oracle RAC integration, database scalability is possible

^Footnote 1Database is still available, but a portion of the application connected to the failed system is temporarily affected.

^Footnote 2Architectures for which the MO is "High" might require additional time and expertise to build and maintain, but offer increased flexibility and capabilities required to meet specific business requirements.

Table 4-3 identifies the additional capabilities provided by the architectures that build on the Oracle Database and attempts to label each architecture with its greatest strengths.

Table 4-3 Additional Capabilities of High Level Oracle High-Availability Architectures

Oracle High-Availability Architecture	Key Characteristics and Additional Capabilities
Oracle Database (Base Architecture) The foundation for all high-availability architectures	Fast-Start Fault Recovery bounds and optimizes instance and database recovery times to minutes. Automatic Storage Management tolerates storage failures and optimizes storage performance and utilization. Oracle Flashback Technology optimizes logical failure repair. Recovery Manager optimizes local repair of data failures using local backups. Oracle Secure Backup provides a centralized tape backup management solution. Flash Recovery Area manages local recovery related files automatically. Online Reorganization and Redefinition allows for dynamic data changes. Oracle Security Features prevents unauthorized access and changes. Hardware Assisted Resilient Data (HARD) Initiative detects and prevents data corruptions and stray or misdirected writes (that result in a lost write to the intended location). Data Block Corruption Prevention and Detection Parameters detects and prevents some corruptions and lost writes. Dynamic Resource Provisioning allows for dynamic system changes. Online Patching allows for dynamic database patches of typically diagnostic patches. Data Recovery Advisor diagnoses persistent (on disk) data failures, presents appropriate repair options, and runs repair operations at your request. Support is for single-instance databases only.
Oracle Database with Oracle Clusterware (Cold Cluster Failover)	All of the benefits of Oracle Database Automatic and fast failover for computer failure Minimum rolling upgrade capabilities for system, clusterware, and operating system^Footref 1
Oracle Database with Oracle Real Application Clusters High availability, scalability, and foundation of server database grids	All of the benefits of Oracle Database Scalability beyond a single system Automatic recovery of failed nodes and instances Fast application notification (FAN) with integrated Oracle client failover Rolling upgrade for system, clusterware, operating system and some Oracle interim patches^Foot 1
Oracle Database with Oracle RAC on Extended Distance Clusters Database Grid with site failure protection	All of the benefits of Oracle Database Protection from site failure
Oracle Database with Data Guard Simplest high availability, data protection, and disaster-recovery solution	All of the benefits of Oracle Database Automatic and fast failover for computer failure, storage failure, data corruption, for configured ORA- errors or conditions and database failures Protection from site failure Rolling upgrade for system, clusterware, database, and operating system^Foot 2 Offload backups to the standby database Offload read and reporting workload to the standby database Only comprehensive lost write protection
Oracle Database with Oracle Clusterware and Data Guard Simple high availability solution with added data and disaster recovery protection.	The sum of benefits of Oracle Clusterware with Data Guard
Oracle Database with Oracle RAC and Data Guard Best high-availability, data protection and disaster-recovery solution with scalability built in	The sum of benefits of Oracle RAC with Data Guard
Oracle Database with Streams ^Foot 3 Bidirectional replication and information management	Replica database (or databases) are available for read/write use Provides heterogeneous platform support Fast failover for computer failure and storage failure Protection from site failure Minimizes downtime for computer or site maintenance and database and application upgrades

^Footnote 1Rolling upgrades with Oracle Clusterware and Oracle RAC incur zero downtime.

^Footnote 2Rolling upgrades with Oracle Data Guard incur minimal downtime.

^Footnote 3The initial investment to build a robust solution is well worth the long-term flexibility and capabilities that Streams delivers to meet specific business requirements.

Table 4-4 shows the recovery time including detection and client failover time of an integrated Oracle client, whenever relevant. You should adopt the MAA best practices to achieve the optimal recovery time and configuration. Oracle High Availability Best Practice recommendations can be found in the Oracle Database High Availability Best Practices and in the white papers that can be downloaded from:

http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm

Table 4-4 Attainable Recovery Times for Unplanned Outages

Outage Type	Oracle Database	Cold Cluster	Oracle RAC and RAC on Extended Distance Clusters	Data Guard	Oracle RAC and Data Guard	Streams
Computer failure	Minutes to hours^Foot 1	Minutes	No downtime^Foot 2	Seconds to a minute	No downtime^Footref 2	No downtime^Footref 2
Storage failure	No downtime^Foot 3	No downtime^Footref 3	No downtime³	No downtime³	No downtime³	No downtime³
Human error	< 30 minutes^Foot 4	< 30 minutes^Footref 4	< 30 minutes⁴	< 30 minutes⁴	< 30 minutes⁴	< 30 minutes⁴
Data corruption	Potentially hours^Foot 5	Potentially hours^Footref 5	Potentially hours^Footref 5	Seconds to a minute	Seconds to a minute	Seconds to a minute
Site failure	Hours to days	Hours to days	No downtime^Footref 2 if the outage affects one building Hours to days if the outage affects building	Seconds to a minute^Foot 6	Seconds to a minute^Footref 6	No downtime^Foot 7

^Footnote 1Recovery time consists largely of the time it takes to restore the failed system

^Footnote 2Database is still available, but a portion of the application connected to the failed system is temporarily affected.

^Footnote 3Storage failures are prevented by using ASM with mirroring and its automatic rebalance capability.

^Footnote 4Recovery time for human errors depend primarily on detection time. If it takes seconds to detect a malicious DML or DLL transaction, it typically only requires seconds to flashback the appropriate transactions. Longer detection time usually leads to longer recovery time required to repair the appropriate transactions. An exception is undropping a table, which is literally instantaneous regardless of detection time.

^Footnote 5Recovery time depends on the age of the backup used for recovery and the number of log changes scanned to make the corrupt data consistent with the database.

^Footnote 6Recovery time indicated applies to database and existing connection failover. Network connection changes and other site-specific failover activities may lengthen overall recovery time.

^Footnote 7The portion of any application connected to the failed system is temporarily affected. You can configure the failed application connections to fail over to the replica.

Table 4-5 compares the attainable recovery times of each Oracle high-availability architecture for all types of planned downtime.

Table 4-5 Attainable Recovery Times for Planned Outages

System Change or Data Change	Outage Type	Oracle Database	Oracle RAC	Data Guard	MAA	Streams
System change - Dynamic Resource Provisioning	--	No downtime	No downtime	No downtime	No downtime	No downtime
System change - Rolling Upgrade	System level upgrade	Minutes to hours	No downtime	Seconds to five minutes	No downtime	No downtime
System change - Rolling Upgrade	Cluster or site wide upgrade	Minutes to hours	Minutes to hours	Seconds to five minutes	Seconds to five minutes	No downtime^Foot 1
System change - Rolling Upgrade	Storage migration	No downtime^Foot 2	No downtime²	No downtime²	No downtime²	No downtime²
System change - Rolling Upgrade	Database one-off patch	Minutes to an hour	No downtime^Foot 3	Seconds to five minutes	No downtime³	No downtime
System change - Rolling Upgrade	Database patch set and version upgrade	Minutes to hours	Minutes to hours	Seconds to five minutes	Seconds to five minutes	No downtime¹
System change - Rolling Upgrade	Platform migration	Minutes to hours	Minutes to hours	Minutes to hours	Minutes to hours	No downtime¹
Data change	Online Reorganization and Redefinition	No downtime	No downtime	No downtime^Foot 4	No downtime⁴	No downtime⁴

^Footnote 1Applications (or a portion of an application) connected to the system that is being maintained may be temporarily affected.

^Footnote 2ASM automatically rebalances stored data when disks are added or removed while the database remains online. For storage migration, you are required to leverage both storage arrays by ASM temporarily.

^Footnote 3For qualified one-off patches only

^Footnote 4Tables can be reorganized online using the DBMS_REDEFINITION package. However, the online changes are not supported by SQL Apply or data capture, and therefore the effects of this subprogram are not visible on the logical standby database or replica database. For more information, see Oracle Data Guard Concepts and Administration or Oracle Streams Replication Administrator's Guide.

4.3 Integrating Application Server High Availability

The Oracle Application Server provides flexible and automated high availability solutions for Oracle Application Server to ensure that applications that you deploy on Oracle Application Server meet the required availability to achieve your business goals. The solutions introduced in this book are described in detail in the Oracle Application Server High Availability Guide.

This section contains the following topics:

Oracle Application Server High Availability Architectures
Redundant Architectures
High Availability Services in Oracle Application Server

4.3.1 Oracle Application Server High Availability Architectures

Oracle Application Server provides high availability and disaster recovery solutions for maximum protection against any kind of failure with flexible installation, deployment, and security options. The redundancy of Oracle Application Server local high availability and disaster recovery originates from its redundant high availability architectures.

At a high level, Oracle Application Server local high availability architectures include several active-active and active-passive architectures for the OracleAS middle-tier and the OracleAS Infrastructure. Although both types of solutions provide high availability, active-active solutions generally offer higher scalability and faster failover, although, they tend to be more expensive as well. With either the active-active or the active-passive category, multiple solutions exist that differ in ease of installation, cost, scalability, and security.

Building on top of the local high-availability solutions is the Oracle Application Server disaster recovery solution, Oracle Application Server Guard. This unique solution combines the proven Oracle Data Guard technology in the Oracle Database with advanced disaster recovery technologies in the application realm to create a comprehensive disaster recovery solution for the entire application system. This solution requires homogenous production and standby sites, but other Oracle Application Server instances can be installed in either site as long as they do not interfere with the instances in the disaster recovery setup. Configurations and data must be synchronized regularly between the two sites to maintain homogeneity.

4.3.2 Redundant Architectures

Oracle Application Server provides redundancy by offering support for multiple instances supporting the same workload. These redundant configurations provide increased availability either through a distributed workload, through a failover setup, or both.

From the entry point to an Oracle Application Server system (content cache) to the back end layer (data sources), all the tiers that are crossed by a request can be configured in a redundant manner with Oracle Application Server. The configuration can be an active-active configuration using OracleAS Cluster or an active-passive configuration using OracleAS Cold Failover Cluster.

4.3.3 High Availability Services in Oracle Application Server

Oracle Application Server provides different features and topologies to support high availability across the its stack. This includes solutions that extend across both the OracleAS middle-tier and the OracleAS Infrastructure tier.

The Oracle Application Server High Availability Guide describes the following high availability services in Oracle Application Server in detail:

Process death detection and automatic restart
Configuration management
State replication
Server load balancing and failover
Backup and recovery
Disaster recovery

4.4 Integrating High Availability for All Applications

A highly available and resilient application requires that every component of the application must be highly available or tolerate failures and changes. For example, a highly available application must analyze every component that affects the application including the network topology, application server, application flow and design, systems, and the database configuration and architecture. This book has focused primarily on the database high availability solutions.

See the high availability solutions and recommendations for Oracle Application Server, Enterprise Manager and Applications on the MAA Web site at:

http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm

Footnote Legend

Footnote 1: Single-instance databases can use clustered ASM (Storage GRID) or nonclustered ASM.
Footnote 2: Network splits, commonly referred to as split brains, occur when nodes on each side of the cluster cannot see the nodes on the other side of the cluster.
Footnote 3: Oracle Clusterware sends the service events and FAN-integrated clients automatically react to those events.
Footnote 4: If you have an extended distance cluster and do not configure a third site, you must make one of the sites the primary site and make the other site a secondary site. Then, if the primary site fails, you must manually restart the secondary site. See the white paper about using standard NFS to support a third voting disk on a stretch cluster configuration that is available on the Oracle Real Application Clusters Web site at http://www.oracle.com/technology/products/database/clustering/index.html