3 Monitoring Using Oracle Grid Control

This chapter provides best practices for using Oracle Grid Control to monitor and maintain a highly available environment across all tiers of the application stack.

Overview of Monitoring and Detection for High Availability
Using Oracle Grid Control for System Monitoring
Managing the High-Availability Environment with Oracle Grid Control

3.1 Overview of Monitoring and Detection for High Availability

Continuous monitoring of the system, network, database operations, application, and other system components ensures early detection of problems. Early detection improves the user's system experience because problems can be resolved faster. In addition, monitoring captures system metrics to indicate trends in system performance growth and recurring problems. This information can facilitate prevention, enforce security policies, and manage job processing. For the database server, a sound monitoring system must measure availability, detect events that can cause the database server to become unavailable, and provide immediate notification to responsible parties for critical failures.

The monitoring system itself must be highly available and adhere to the same operational best practices and availability practices as the resources it monitors. Failure of the monitoring system leaves all monitored systems unable to capture diagnostic data or alert the administrator of problems.

Oracle Grid Control provides the management and monitoring capabilities with many different notification options. This chapter provides best practices for using Oracle Grid Control to monitor and maintain a highly available environment across all tiers of the application stack. Recommendations are available for methods of monitoring the environment's availability and performance, and for using the tools in response to changes in the environment.

3.2 Using Oracle Grid Control for System Monitoring

This section provides an overview of the concepts and facilities available in Oracle Grid Control.

A major benefit of Oracle Grid Control is its ability to manage components across the entire application stack from the host operating system to a user or packaged application. Oracle Grid Control treats each of the layers in the application as a target. Targets—such as databases, application servers, and hardware—can then be viewed along with other targets of the same type, or can be grouped together by application type. All targets can also be reviewed in a single view. Each target type has a default generated home page that displays a summary of relevant details for a specific target. Different types of targets can be grouped together by function, that is, as resources that support the same application.

Every target is monitored by an Oracle Management Agent. Every Management Agent runs on a machine and is responsible for a set of targets. The targets can be on a machine that is different from the machine that the Management Agent is on. For example, a Management Agent can monitor a storage array that cannot host an agent natively. When a Management Agent is installed on a host, the host is automatically discovered along with other targets that are on the machine.

The Oracle Grid Control home page shown in Figure 3-1 provides a picture of the availability of all of the discovered targets.

Figure 3-1 Oracle Grid Control Home Page

The Oracle Grid Control home page shows the following major kinds of information:

A snapshot of the current availability of all targets. The pie chart associated with availability gives the administrator an immediate indication of any target that is unavailable (Down) or has lost communication with the console (Unknown).
An overview of how many alerts (for events) and problems (for jobs) are known in the entire monitored system. You can display detailed information by clicking the links, or by navigating to Alerts from the upper right portion of any Oracle Grid Control page.
A target shortcut intended for administrators who have to perform a task for a specific target.
An overview of what is actually discovered in the system. This list can be shown at the hardware level and the Oracle level.
A set of useful links to other Oracle online resources.

Alerts are generated by a combination of factors and are defined on specific metrics. A metric is a data point sampled by a Management Agent and sent to the Oracle Management Repository. It could be the availability of a component through a simple heartbeat test, or an evaluation of a specific performance measurement such as "disk busy" or percentage of processes waiting for a specific wait event.

There are four states that can be checked for any metric: error, warning, critical, and clear. The administrator must make policy decisions such as:

What objects should be monitored (databases, nodes, listeners, or other services)?
What instrumentation should be sampled (such as availability, CPU percent busy)?
How frequently should the event be sampled?
What should be done when the metric exceeds a predefined threshold?

All of these decisions are predicated on the business needs of the system. For example, all components might be monitored for availability, but some systems might be monitored only during business hours. Systems with specific performance problems can have additional performance tracing enabled to debug a problem.
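
The policy decisions above can be written down as data. The following is a minimal Python sketch, not part of Oracle Grid Control: the `MetricPolicy` class, the `classify` function, and the target name `orders_db` are hypothetical names chosen for illustration. It records what is monitored, what is sampled, how often, and which thresholds apply, and then maps a sampled value to a severity. The error state, which signals that the metric itself could not be evaluated, is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPolicy:
    """One monitoring decision: what to sample, how often, and when to alert."""
    target: str               # what is monitored, e.g. a database or listener
    metric: str               # what is sampled, e.g. "CPU percent busy"
    interval_seconds: int     # how frequently the metric is sampled
    warning: float            # value at or above which a warning is raised
    critical: float           # value at or above which a critical alert is raised


def classify(policy: MetricPolicy, value: float) -> str:
    """Map a sampled value to a severity; 'clear' means no threshold is exceeded."""
    if value >= policy.critical:
        return "critical"
    if value >= policy.warning:
        return "warning"
    return "clear"


# Hypothetical policy: sample CPU every 5 minutes, warn at 75, critical at 90.
cpu_policy = MetricPolicy("orders_db", "CPU percent busy", 300, warning=75.0, critical=90.0)
for sample in (40.0, 80.0, 95.0):
    print(sample, classify(cpu_policy, sample))
```

Keeping the thresholds in one record per metric makes it easier to review them against the business needs of each system.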

The rest of this section includes the following topics:

Set Up Default Notification Rules for Each System
Use Database Target Views to Monitor Health, Availability, and Performance
Use Event Notifications to React to Metric Changes
Use Events to Monitor Data Guard System Availability

See Also:
Oracle Enterprise Manager Concepts for more information about monitoring and using metrics in Oracle Grid Control

3.2.1 Set Up Default Notification Rules for Each System

Notification Rules are defined sets of alerts on metrics that are automatically applied to a target when it is discovered by Oracle Grid Control. For example, an administrator can create a rule that monitors the availability of database targets and generates an e-mail message if a database fails. After that rule is generated, it is applied to all existing databases and any database created in the future. Access these rules by navigating to Preferences and then choosing Rules.

The rules monitor problems that require immediate attention, such as those that can affect service availability, and Oracle or application errors. Service availability can be affected by an outage in any layer of the application stack: node, database, listener, and critical application data. A service availability failure, such as the inability to connect to the database, or the inability to access data critical to the functionality of the application, must be identified, reported, and reacted to quickly. Potential service outages such as a full archive log directory also must be addressed correctly to avoid a system outage.

Oracle Grid Control provides a series of default rules that provide a strong framework for monitoring availability. A default rule is provided for each of the preinstalled target types that come with Oracle Grid Control. These rules can be modified to conform to the policies of each individual site, and new rules can be created for site-specific targets or applications. The rules can also be set to notify users during specific time periods to create an automated coverage policy.

Consider the following recommendations:

Use the rules modification wizard to modify each rule for high-value components in the target architecture so that it meets the availability requirements. For the database rule, set the events in Table 3-1, Table 3-2, and Table 3-3 for each target. The frequency of the monitoring is determined by the service level agreement (SLA) for each component.
Use Beacon functionality to track the performance of individual applications. A Beacon can be set to perform a user transaction representative of normal application work. Enterprise Manager can then break down the response time of that transaction into its component pieces for analysis. In addition, an alert can be triggered if the execution time of that transaction exceeds a predefined limit.
See Also:
- Oracle Enterprise Manager Concepts for conceptual information about Beacons
- Oracle Enterprise Manager Advanced Configuration for information about configuring service tests and Beacons
Add Notification Methods and use them in each Notification Rule. By default, the easiest method for alerting an administrator to a potential problem is to send e-mail. Supplement this notification method by adding a callout to an SNMP trap or operating system script that sends an alert by some method other than e-mail. This avoids the problem that might occur if a component of the e-mail system has failed. Set additional Notification Methods by using the Set-up link at the top of any Oracle Grid Control page.
Modify Notification Rules to notify the administrator when there are errors in computing target availability. This might generate a false positive reading on the availability of the component, but it ensures the highest level of notification to system administrators.
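
The recommendation to supplement e-mail with a second notification method can be illustrated with a short Python sketch. It does not use any Oracle Grid Control interface. The function names `notify_all`, `send_email`, and `run_pager_script` are hypothetical, and the two channel functions are stand-ins for a real mail client and a real operating system script or SNMP trap callout. The sketch shows only the design idea: every channel is attempted, and the failure of one does not prevent delivery through the others.

```python
from typing import Callable, Dict, List


def notify_all(message: str, channels: Dict[str, Callable[[str], None]]) -> List[str]:
    """Try every channel; return the names of the channels that delivered."""
    delivered = []
    for name, send in channels.items():
        try:
            send(message)
            delivered.append(name)
        except Exception as exc:  # one failed channel must not block the others
            print(f"channel {name!r} failed: {exc}")
    return delivered


def send_email(message: str) -> None:
    # Stand-in for an e-mail channel whose mail server is down.
    raise ConnectionError("mail server unreachable")


paged: List[str] = []


def run_pager_script(message: str) -> None:
    # Stand-in for an operating system script or SNMP trap callout.
    paged.append(message)


result = notify_all(
    "Database orders_db is down",
    {"email": send_email, "pager": run_pager_script},
)
print(result)
```

In this run the simulated e-mail channel fails, but the alert still reaches the administrator through the second channel.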

Figure 3-2 shows the Notification Rule property page for choosing availability states with Down, Agent Unreachable, Agent Unreachable Resolved, and Metric Error Detected chosen.

Figure 3-2 Setting Notification Rules for Availability

In addition, modify the metrics monitored by the database rule to report the metrics shown in Table 3-1, Table 3-2, and Table 3-3. This ensures that these metrics are captured for all database targets and that trend data will be available for future analysis. All events described in Table 3-1, Table 3-2, and Table 3-3 can be accessed from the Database Homepage by choosing All Metrics > Expand All.

Space management conditions that have the potential to cause a service outage should be monitored using the events shown in Table 3-1.

Table 3-1 Recommendations for Monitoring Space

Metric	Recommendation
Tablespace Space Used (%)	Set this metric to monitor root file systems for any critical hardware server. This metric enables the administrator to choose the threshold percentages that Oracle Grid Control tests against, as well as the number of samples that must occur in error before a message is generated to the administrator. The recommended default settings are 70 percent for a warning and 90 percent for an error, but these values should be adjusted depending on system usage. This metric can be customized to monitor only specific tablespaces. This metric and similar events can be set in the Tablespace Full metric group.
Archiver Hung Alert Log Error	Set this metric to monitor the alert log for ORA-00257 errors, which indicate a full archive log directory. This metric can be set in the Alert Log Error Status metric group.
Dump Area Used (%)	Set this metric to monitor the dump directory destinations. Dump space must be available so that the maximum amount of diagnostic information is saved the first time an error occurs. The recommended default settings are 70 percent for a warning and 90 percent for an error, but these should be adjusted depending on system usage. This metric can be set in the Dump Area metric group.

From the Alert Log Metric group, set Oracle Grid Control to monitor the alert log for errors as shown in Table 3-2.

Table 3-2 Recommendations for Monitoring the Alert Log

Metric	Recommendation
Alert	Set this metric to send an alert when an ORA-6XX, ORA-01578 (database corruption), or ORA-00060 (deadlock detected) error occurs. If any other error is recorded, then a warning message is generated.
Data Block Corruption	Set this metric to monitor the alert log for ORA-01157 and ORA-27048 errors. They signal a corruption in an Oracle Database datafile.

Monitor the system to ensure that the processing capacity is not exceeded. The warning and critical levels for these events should be modified based on the usage pattern of the system. Set the events from the Database Limits metric group using the recommendations in Table 3-3.

Table 3-3 Recommendations for Monitoring Processing Capacity

Metric	Recommendation
Process limit	Set thresholds for this metric to warn if the number of current processes approaches the value of the `PROCESSES` initialization parameter.
Session limit	Set thresholds for this metric to warn if the instance is approaching the maximum number of concurrent connections allowed by the database.

Figure 3-3 shows the Notification Rule property page for choosing metrics. The user has chosen Critical and Warning as the severity states for notification. The list of Available Metrics is shown in the left list box. The metrics that have been selected for notification are shown in the right list box.

Figure 3-3 Setting Notification Rules for Metrics

See Also:

Oracle Database 2 Day DBA for information about setting up notification rules and metric thresholds
Enterprise Manager Framework, Host, and Third-Party Metric Reference Manual for information on available metrics

3.2.2 Use Database Target Views to Monitor Health, Availability, and Performance

The Database Targets page in Figure 3-4 shows an overview of system performance, space utilization, and the configuration of important availability components like archived redo log status, flashback log status, and estimated instance recovery time. Alerts are displayed immediately. Each of the alert values can be configured from links on this page.

Figure 3-4 Overview of System Performance

Many of the metrics from Oracle Grid Control pertain to performance. A system without adequate performance is not an HA system, regardless of the status of any individual component. While performance problems seldom cause a major system outage, they can still cause an outage to a subset of customers. Outages of this type are commonly referred to as application service brownouts. The primary cause of brownouts is the intermittent or partial failure of one or more infrastructure components. IT managers must be aware of how the infrastructure components are performing (their response time, latency, and availability), and how they are affecting the quality of application service delivered to the end user.

A performance baseline, derived from normal operations that meet the SLA, should determine what constitutes a performance metric alert. Baseline data should be collected from the first day that an application is in production and should include the following:

Application statistics (transaction volumes, response time, Web service times)
Database statistics (transaction rate, redo rate, hit ratios, top 5 wait events, top 5 SQL transactions)
Operating system statistics (CPU, memory, I/O, network)

You can use Oracle Grid Control to capture a snapshot of database performance as a baseline. Oracle Grid Control compares these values against system performance and displays the result on the database Target page. It can also send alerts if the values deviate too far from the established baseline.
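
The idea of comparing current values against a baseline can be sketched in a few lines of Python. This is an illustration of the concept only and does not reflect how Oracle Grid Control computes or stores baselines. The function names, the statistic names, the sample values, and the 25 percent tolerance are all assumptions made for the example.

```python
from statistics import mean
from typing import Dict, List


def build_baseline(samples: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each statistic over samples taken during normal operation."""
    return {name: mean(values) for name, values in samples.items()}


def deviations(baseline: Dict[str, float], current: Dict[str, float],
               tolerance: float = 0.25) -> Dict[str, float]:
    """Return statistics whose relative change from baseline exceeds tolerance."""
    flagged = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # nothing to compare, or relative change is undefined
        change = (current[name] - base) / base
        if abs(change) > tolerance:
            flagged[name] = round(change, 2)
    return flagged


# Hypothetical samples collected while the application met its SLA.
normal = {
    "transactions_per_sec": [100.0, 110.0, 90.0],
    "redo_kb_per_sec": [500.0, 520.0, 480.0],
}
baseline = build_baseline(normal)
now = {"transactions_per_sec": 105.0, "redo_kb_per_sec": 900.0}
print(deviations(baseline, now))
```

Here the transaction rate is within tolerance, while the redo rate is 80 percent above its baseline and would be flagged for investigation.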

Set the database notification rule to capture the metrics listed in Table 3-4 for all database targets. Analysis of these metrics can then be done using one tool, and historical data will be available.

Table 3-4 Recommended Notification Rules for Metrics

Metric	Recommendation
Disk I/O per Second	This is a database-level metric that monitors I/O operations done by the database. It sends an alert when the number of operations exceeds a user-defined threshold. Use this metric with operating system-level events that are also available with Oracle Grid Control. Set this metric based on the total I/O throughput available to the system, the number of I/O channels available, network bandwidth (in a SAN environment), the effects of the disk cache if you are using a storage array device, and the maximum I/O rate and number of spindles available to the database.
% CPU Busy	Set this metric to warn at 75 percent and to show a critical alert between 85 percent and 90 percent. This usage might be normal at peak periods, but it might also be an indication of a runaway process or of a potential resource shortage.
% Wait Time	Excessive wait time indicates that a bottleneck for one or more resources is occurring. Set this metric based on the system wait time when the application is performing as expected.
Network Bytes per Second	This metric reports network traffic that Oracle generates. It can indicate a potential network bottleneck. Set this metric based on actual usage during peak periods.
Total Parses per Second	This metric measures SQL performance. It can indicate an application change or change in usage that has created a shortage of resources. Set it based on peak periods.

See Also:

Oracle Database Performance Tuning Guide for more information about performance monitoring
Oracle Database 2 Day DBA for more information on monitoring and tuning using Enterprise Manager

3.2.3 Use Event Notifications to React to Metric Changes

There are many operating system events that can be used to supplement a suggested metric. Such operating system events are not required for each host and instance. All metrics defined here can be set individually by instance or database using the Manage Metrics link at the bottom of the navigation bar on the object target page. The values that trigger a warning or critical alert can be changed here, and an operating system script can be activated to respond to a metric threshold, in addition to the standard alert being sent to Oracle Grid Control.

3.2.4 Use Events to Monitor Data Guard System Availability

Set Oracle Grid Control metrics to monitor the availability of logical and physical Data Guard configurations. If a Data Guard environment is registered with the Data Guard Manager extension of Oracle Grid Control, then set the events shown in Table 3-5.

Table 3-5 Recommendations for Setting Data Guard Events

Metric	Recommendation
Data Guard Status	Set this metric to be notified of system problems in a Data Guard configuration.
Data Not Applied	Set this metric to be notified when the gap (measured in minutes) between the last archived redo log received and the last log applied on the standby database exceeds a user-defined threshold. This information can be used to warn the administrator if the recovery time for a standby instance will exceed the defined outage recovery service level. Set this metric based on the specifications for log application for the standby database.
Data Not Received	Set this metric to be notified if there is an extended delay in moving archived redo logs from the production database to the standby database. This metric occurs when the difference between the number of archived redo logs on the production database and the number of archived redo logs shipped to the standby site exceeds a user-defined threshold. The threshold should be based on the amount of time it takes to transport an archived redo log across the network. Set the sample time for the metric to be approximately the redo transport time, and set the number of occurrences to be 2 or greater to avoid false positives. Recommended starting values for the warning and critical thresholds are `1` and `2`.
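
The effect of requiring two or more occurrences before an alert is raised can be shown with a small Python sketch. It is a generic illustration and not the Oracle Grid Control implementation. The function name `alert_points` and the sample data are hypothetical.

```python
from typing import Iterable, List


def alert_points(samples: Iterable[float], threshold: float, occurrences: int = 2) -> List[int]:
    """Return the sample indexes at which an alert would first be raised.

    An alert is raised only once the value has exceeded the threshold in
    `occurrences` consecutive samples, which filters out one-off spikes.
    """
    raised = []
    streak = 0
    for index, value in enumerate(samples):
        if value > threshold:
            streak += 1
            if streak == occurrences:
                raised.append(index)
        else:
            streak = 0
    return raised


# Hypothetical count of archived redo logs not yet received by the standby.
log_gap = [0, 2, 0, 2, 3, 3, 0]
print(alert_points(log_gap, threshold=1, occurrences=2))
```

The isolated spike in the second sample is ignored, and the alert is raised only when the gap stays above the threshold for two samples in a row.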

3.3 Managing the High-Availability Environment with Oracle Grid Control

Use Oracle Grid Control as a proactive part of administering any system as well as for problem notification and analysis. This section includes the following recommendations:

Check Oracle Grid Control Policy Violations
Use Oracle Grid Control to Manage Oracle Patches and Maintain System Baselines
Use Oracle Grid Control to Manage Data Guard Targets

See Also:
Oracle Enterprise Manager Administrator's Guide

3.3.1 Check Oracle Grid Control Policy Violations

Oracle Grid Control comes with a preinstalled set of policies and best-practice recommendations for all databases. These policies are checked by default, and the number of violations is displayed on the Targets page shown in Figure 3-4. To see a list of all violations, select Policy Violations from the Targets page.

See Also:

Oracle Enterprise Manager Policy Reference Manual for definitions of existing policies

3.3.2 Use Oracle Grid Control to Manage Oracle Patches and Maintain System Baselines

You can use Oracle Grid Control to download and manage patches from https://metalink.oracle.com for any monitored system in the application environment. A job can be set up to routinely check for patches that are relevant to the user environment. Those patches can be downloaded and stored directly in the Management Repository. Patches can be staged from the Management Repository to multiple systems and applied during maintenance windows.

You can examine patch levels for one machine and compare them between machines in either a one-to-one or one-to-many relationship. In this case, a machine can be identified as a baseline and used to demonstrate maintenance requirements in other machines. This can be done for operating system patches as well as database patches.
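
The one-to-many comparison against a baseline machine amounts to a set difference. The following Python sketch illustrates that idea under stated assumptions: the function name `missing_patches`, the host names, and the patch identifiers are all hypothetical, and the sketch does not read data from Oracle Grid Control.

```python
from typing import Dict, Set


def missing_patches(baseline: Set[str], hosts: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each host, list baseline patches that the host does not have."""
    return {
        host: baseline - installed
        for host, installed in hosts.items()
        if baseline - installed
    }


# Hypothetical patch inventory: the baseline machine and two other machines.
baseline_host = {"patch-A", "patch-B", "patch-C"}
fleet = {
    "db01": {"patch-A", "patch-B", "patch-C"},
    "db02": {"patch-A"},
}
print(missing_patches(baseline_host, fleet))
```

Hosts that already match the baseline are omitted from the result, so the output is a list of maintenance requirements.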

3.3.3 Use Oracle Grid Control to Manage Data Guard Targets

Oracle Grid Control can be used to set up logical and physical standby databases for any database target. It also provides the ability to manage switchover and failover of database targets other than the database that contains the Management Repository.

Oracle Grid Control can also be used to monitor the health of a Data Guard configuration at a glance. From any database target page, navigate to the Data Guard status section by using the link in the High Availability section. The page shows the active standby databases for the primary target, the amount of log data waiting for shipment and receipt by the standby database, and the data protection mode. You can also modify the data protection mode from this page.

This page contains a link to the Verify function, which checks the Data Guard environment and redo transport services to display warnings and errors. The Verify function is not automatic and must be run manually.

See Also:

Oracle Data Guard Broker for use case scenarios