Best Practices for Your Alarms

Read about best practices for alarms.

Create a Set of Alarms for Each Metric

For each metric emitted by your resources, create a set of alarms that covers the following resource behaviors:

  • At risk. The resource is at risk of becoming inoperable, as indicated by metric values.
  • Nonoptimal. The resource is performing at nonoptimal levels, as indicated by metric values.
  • Resource is up or down. The resource is either not reachable or not operating.

The following examples use the CpuUtilization metric emitted in the oci_computeagent metric namespace. This metric measures the CPU utilization of a compute instance, reflecting the activity level of any services and applications running on the instance. CpuUtilization is a key performance metric for a cloud service because it indicates how heavily the instance's CPU is used, and it can be used to investigate performance issues. To learn more about CPU usage, see https://en.wikipedia.org/wiki/CPU_time.
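
Before choosing thresholds, it can help to inspect recent metric values. The following sketch uses the OCI Python SDK to query CpuUtilization data; the compartment OCID is a placeholder, and the dimension name used for printing is an assumption about this namespace's metric streams.

    import oci

    # Sketch: inspect recent CpuUtilization values before choosing thresholds.
    # Assumes a configured ~/.oci/config; the compartment OCID is a placeholder.
    config = oci.config.from_file()
    monitoring = oci.monitoring.MonitoringClient(config)

    details = oci.monitoring.models.SummarizeMetricsDataDetails(
        namespace="oci_computeagent",
        query="CpuUtilization[1m].mean()",  # mean CPU utilization at 1-minute resolution
    )
    response = monitoring.summarize_metrics_data(
        compartment_id="ocid1.compartment.oc1..example",  # placeholder
        summarize_metrics_data_details=details,
    )
    for series in response.data:
        values = [point.value for point in series.aggregated_datapoints]
        # resourceDisplayName is an assumed dimension for this namespace
        print(series.dimensions.get("resourceDisplayName"), max(values, default=None))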

At-Risk Example

A typical at-risk threshold for the CpuUtilization metric is any value greater than 80 percent. A compute instance breaching this threshold is at risk of becoming inoperable. Often the cause of this behavior is one or more applications consuming a high percentage of the CPU.

In this example, you decide to notify the operations team immediately, setting the severity of the alarm as "Critical" because repair is required to bring the instances back to optimal operational levels. You configure alarm notifications to the responsible team by both PagerDuty and email, requesting an investigation and appropriate fixes before the instances go into an inoperable state. You set repeat notifications every minute. When someone responds to the alarm notifications, you temporarily stop notifications using the best practice of suppressing the alarm . When metrics return to optimal values, you remove the suppression.

Nonoptimal Example

A typical nonoptimal threshold for the CpuUtilization metric is from 60 to 80 percent. When the metric values for a compute instance fall within this range, the instance is operating above its optimal range.

In this example, you decide to notify the appropriate individual or team that an application or process is consuming more CPU than usual. You configure a threshold alarm to notify the appropriate contacts, setting the severity of the alarm as "Warning" because no immediate action is required to investigate and reduce the CPU usage. You set notifications to email only, directed to the appropriate developer or team, with repeat notifications every 24 hours to reduce email notification noise.
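
A sketch of the corresponding alarm, reusing the monitoring client and placeholder OCIDs from the previous sketch; the bounded range is written with the && operator in the alarm query:

    # Sketch: a Warning alarm for the 60-80 percent range, notifying by email
    # once per day. Reuses the monitoring client and placeholder OCIDs from the
    # previous sketch; the range is expressed with MQL's && operator.
    nonoptimal_alarm = monitoring.create_alarm(
        oci.monitoring.models.CreateAlarmDetails(
            display_name="cpu-nonoptimal",
            compartment_id="ocid1.compartment.oc1..example",
            metric_compartment_id="ocid1.compartment.oc1..example",
            namespace="oci_computeagent",
            query="CpuUtilization[1m].mean() > 60 && CpuUtilization[1m].mean() < 80",
            severity="WARNING",
            destinations=["ocid1.onstopic.oc1..example"],  # email-only topic, placeholder
            repeat_notification_duration="PT24H",  # repeat at most once per day
            is_enabled=True,
        )
    )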

Resource is Up or Down Example

A typical indicator of resource availability is a five-minute absence of the CpuUtilization metric. A compute instance breaching this threshold is either not reachable or not operating. The resource might have stopped responding, or it might have become unavailable because of connectivity issues.

In this example, you decide to notify the operations team immediately, setting the severity of your absence alarm as "Critical" because repair is required to bring the instances online. You configure alarm notifications to the responsible team by both PagerDuty and email, requesting an investigation and migration of the workloads to another available resource. You set repeat notifications every minute. When someone responds to the alarm notifications, you temporarily stop notifications using the best practice of suppressing the alarm. When the CpuUtilization metric is again detected from the resource, you remove the suppression.
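
A sketch of this absence alarm, again reusing the client and placeholders from the earlier sketches; here the five-minute absence is modeled with MQL's absent() function plus a five-minute pending duration, which is one way to express this behavior:

    # Sketch: a Critical absence alarm. absent() fires when the metric stops
    # arriving, and the pending duration requires the condition to persist for
    # five minutes before the alarm fires. Reuses the client and placeholder
    # OCIDs from the earlier sketches.
    absence_alarm = monitoring.create_alarm(
        oci.monitoring.models.CreateAlarmDetails(
            display_name="cpu-metric-absent",
            compartment_id="ocid1.compartment.oc1..example",
            metric_compartment_id="ocid1.compartment.oc1..example",
            namespace="oci_computeagent",
            query="CpuUtilization[1m].absent()",
            severity="CRITICAL",
            destinations=["ocid1.onstopic.oc1..example"],
            pending_duration="PT5M",              # five-minute absence before firing
            repeat_notification_duration="PT1M",  # re-notify every minute
            is_enabled=True,
        )
    )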

Select the Correct Alarm Interval for Your Metric

Select an alarm interval based on the frequency at which the metric is emitted. For example, a metric emitted every five minutes requires an alarm interval of five minutes or greater. Most metrics are emitted every minute, which means that most metrics support any alarm interval. To determine the valid alarm intervals for a specific metric, check the relevant service's metric reference.
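
In the alarm query itself, the interval is the bracketed window after the metric name. For example, using the query style from the earlier sketches:

    # The alarm interval is the bracketed window after the metric name (MQL).
    query_1m = "CpuUtilization[1m].mean() > 80"  # for metrics emitted every minute
    query_5m = "CpuUtilization[5m].mean() > 80"  # for metrics emitted every five minutes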

Suppress Alarms During Investigations

When a team member responds to an alarm, suppress notifications while the team investigates or mitigates the issue. Temporarily stopping notifications helps avoid distractions during the investigation and mitigation. Remove the suppression when the issue is resolved. For instructions, see Suppressing a Single Alarm and Suppressing Multiple Alarms.
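
A sketch of suppressing and later unsuppressing an alarm with the OCI Python SDK; the alarm OCID and the two-hour window are placeholders:

    # Sketch: suppress an alarm's notifications for a fixed window during an
    # investigation, then remove the suppression once the issue is resolved.
    # The alarm OCID and the two-hour window are placeholders.
    from datetime import datetime, timedelta, timezone

    import oci

    config = oci.config.from_file()
    monitoring = oci.monitoring.MonitoringClient(config)

    alarm_id = "ocid1.alarm.oc1..example"  # placeholder
    now = datetime.now(timezone.utc)

    monitoring.update_alarm(
        alarm_id=alarm_id,
        update_alarm_details=oci.monitoring.models.UpdateAlarmDetails(
            suppression=oci.monitoring.models.Suppression(
                time_suppress_from=now,
                time_suppress_until=now + timedelta(hours=2),
                description="Ops investigating high CPU",
            )
        ),
    )

    # Later, when the issue is resolved:
    monitoring.remove_alarm_suppression(alarm_id=alarm_id)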

Routinely Tune Your Alarms

On a regular basis, such as weekly, review your alarms to ensure optimal configuration. Calibrate each alarm's threshold, severity, and notification details, including method, frequency, and targeted audience.

Optimal alarm configuration addresses the following factors:

  • Criticality of the resource.
  • Appropriate resource behavior. Assess behavior individually and within the context of the service ecosystem. Review metric value fluctuations over a specific period of time, and then adjust thresholds as needed.
  • Acceptable notification noise. Assess the notification method (for example, email or PagerDuty), the appropriate recipients, and the frequency of repeated notifications.

The following table shows an example alarm calibration.

Threshold (CPU %)   Severity  Notification Method   Frequency       Targeted Audience
>80%                Critical  PagerDuty + Email     Every 1 minute  Compute, Ops, and Customer Communications
>60% and <80%       Warning   Email                 Once per day    Compute + Ops

For instructions, see Updating an Alarm.
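
As a sketch, a routine tuning pass might adjust an alarm's query and repeat frequency with the monitoring client from the earlier sketches; the OCID and new values are placeholders:

    # Sketch: tune an existing alarm's threshold and repeat frequency during a
    # routine review, reusing the monitoring client from the earlier sketches.
    # The alarm OCID and new values are placeholders; fields omitted here are
    # assumed to keep their current values.
    monitoring.update_alarm(
        alarm_id="ocid1.alarm.oc1..example",  # placeholder
        update_alarm_details=oci.monitoring.models.UpdateAlarmDetails(
            query="CpuUtilization[1m].mean() > 85",  # threshold adjusted after review
            repeat_notification_duration="PT5M",     # reduced notification noise
        ),
    )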