1.8 Matching Failures to Backup and Recovery Techniques

In planning your database backup and recovery strategy, you must try to anticipate the errors that will arise, and put in place the backups needed to recover from them.While there are several types of problem that can halt the normal operation of an Oracle database or affect database I/O operations, only two typically require DBA intervention and media recovery: media failure, and user errors. Instance failures, network failures, failures of Oracle database background processes and failure of a statement to execute due to, for instance, exhaustion of some resource such as space in a datafile may require DBA intervention, and might even crash a database instance, but will not generally cause data loss or the need to recover from backup.

This section contains these topics:

Responding to Media Failure
Responding to User Error

1.8.1 Responding to Media Failure

Database operation after a media failure of online redo log files or control files depends on whether the online redo log or control file is protected by multiplexing, as recommended. When an online redo log or control file is multiplexed, multiple copies of the file are maintained on the system. Multiplexed files should be stored on separate disks.

If a media failure damages a disk containing one copy of a multiplexed online redo log, then the database can usually continue to operate without significant interruption. Damage to a nonmultiplexed online redo log causes database operation to halt and may cause permanent loss of data.

Damage to any control file, whether it is multiplexed or not, halts database operation when the database attempts to read or write to the damaged control file (which happens frequently, for example at every checkpoint and log switch).

Media failures are either read errors or write errors. In a read error, the instance cannot read a datafile and an operating system error is returned to the application, along with an error indicating that the file cannot be found, cannot be opened, or cannot be read. The database continues to run, but the error is returned each time an unsuccessful read occurs. At the next checkpoint, a write error will occur when the database attempts to write the file header as part of the standard checkpoint process.

The effect of a datafile write error depends upon which tablespace the datafile is in.

If the instance cannot write to a datafile in the SYSTEM tablespace, an undo tablespace (if the database is in automatic undo management mode, which is the preferred choice in Release 10g), or a datafile in a tablespace containing active rollback segments (if in manual undo management mode), then the database issues an error and shuts down the instance. All files in the SYSTEM tablespace and all datafiles containing rollback segments must be online in order for the database to operate properly.

If the instance cannot write to a datafile other than those in the preceding list, then the result depends on whether the database is running in ARCHIVELOG mode or not.

In ARCHIVELOG mode, the database records an error in the database writer trace file and takes the affected datafile offline. (All other datafiles in the tablespace containing this datafile remain online.) You can then rectify the underlying problem and restore and recover the affected tablespace.

In NOARCHIVELOG mode, the database writer background process DBWR fails, and the instance fails, the cause of the problem determines the required response. If the problem is temporary (for example, a disk controller was powered off), then crash recovery usually can be performed using the online redo log files. In such situations, the instance can be restarted without resorting to media recovery. However, if a datafile is damaged, then you must restore a consistent backup of the entire database.

1.8.2 Responding to User Error

Typically, a user error like dropping a table or deleting rows from a table requires one of the following responses:

Re-importing the dropped object, if a suitable export file exists or if the object is still available at a standby database
Performing tablespace point-in-time recovery (TSPITR) of one or more tablespaces
Re-entering the lost data manually, if a record of them exists
Returning the database to a past state using database point-in-time recovery
Using one of the flashback features of the Oracle database to recover from logical corruption by returning affected objects to a past state

The recovery options available to you will be a function of your backup strategy. For example, if you are running in NOARCHIVELOG mode then you have limited point-in-time recovery options.