Managing Cluster Metrics

You can monitor the health, capacity, and performance of your Big Data Service resources by using metrics, alarms, and notifications.

Required IAM Policy

To monitor resources, you must be given the required type of access in a policy written by an administrator, whether you're using the Cloud Console or the REST API with an SDK, CLI, or other tool. The policy must give you access to the monitoring services and the resources being monitored. If you perform an action and get a message that you don't have permission or are unauthorized, confirm with your administrator the type of access you've been granted and which compartment you should work in. For information on user authorizations for monitoring and notifications, see the Authentication and Authorization section for the following services: Monitoring and Notifications.

Available Metrics: oci_big_data_service

Two types of metrics are available for Big Data Service.

Cluster metrics

Cluster metrics enable you to obtain a cluster-level report and monitor the different distributed key performance indicators.

Node metrics

Node metrics enable you to obtain node-level reports and monitor the status of individual nodes of the cluster.

Big Data Service emits metrics when the VM isn't healthy. For example, one metric is emitted when the VM is down, and no metrics are emitted when the VM is up or is in the STOPPED state.

Note

Big Data Service doesn't expose DenseIO-related maintenance events through metrics if the compute action is either DISABLE or TERMINATE.

Big Data Service metrics include the following dimensions:

  • resourceId

    The Oracle Cloud ID (OCID) of the Big Data Service cluster (for cluster metrics).

    The Oracle Cloud ID (OCID) of the Big Data Service node (for node metrics).

  • resourceType

    BigDataCluster (for cluster metrics)

    BigDataClusterNode (for node metrics)

  • resourceDisplayName

    This field serves as a unique identifier for each metric entity. Its value is the node name, which you can find on the Cluster details page.

MaintenanceStatus-specific dimensions

  • maintenanceDueTime

    The scheduled start time of the 24-hour maintenance window.

  • computeMaintenanceAction

    The action that Oracle Cloud Infrastructure performs on an instance during scheduled maintenance.

    • REBOOT: The instance is migrated from the physical host that needs maintenance to a healthy host. If live migration isn't possible, then the instance is reboot migrated.
    • REBUILD_IN_PLACE: The instance is stopped, rebuilt on the same physical hardware, and restarted. A downtime of several hours occurs during the maintenance process.
  • recommendedAction

    The action that you can take before the scheduled maintenance event, so that you can control how and when your applications experience downtime.
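
The dimensions above can be used to narrow a Monitoring query to one cluster or one node. The following is a minimal sketch, not an official client: it assumes the Monitoring Query Language (MQL) form `Metric[interval]{dimension = "value"}.statistic()`. The helper names `build_query` and `fetch_datapoints` are hypothetical, and `fetch_datapoints` additionally assumes that the OCI Python SDK is installed and that a default configuration file is present.

```python
NAMESPACE = "oci_big_data_service"  # metric namespace named in this section


def build_query(metric, interval="1m", statistic="mean", **dimensions):
    """Build an MQL expression filtered by the given dimensions.

    build_query is a hypothetical helper, not part of any Oracle library.
    """
    # Sort the dimensions so that the generated string is deterministic.
    filters = ", ".join(
        f'{key} = "{value}"' for key, value in sorted(dimensions.items())
    )
    selector = f"{{{filters}}}" if filters else ""
    return f"{metric}[{interval}]{selector}.{statistic}()"


def fetch_datapoints(compartment_id, query, start_time, end_time):
    """Run the query through the Monitoring service (hypothetical wrapper).

    Assumes the OCI Python SDK and a default config file are available;
    start_time and end_time are datetime objects.
    """
    import oci  # imported lazily so build_query works without the SDK

    client = oci.monitoring.MonitoringClient(oci.config.from_file())
    details = oci.monitoring.models.SummarizeMetricsDataDetails(
        namespace=NAMESPACE,
        query=query,
        start_time=start_time,
        end_time=end_time,
    )
    response = client.summarize_metrics_data(compartment_id, details)
    # Flatten every returned series into (timestamp, value) pairs.
    return [
        (point.timestamp, point.value)
        for series in response.data
        for point in series.aggregated_datapoints
    ]


# Example: mean CPU utilization across all node-level metric streams.
node_cpu_query = build_query("CpuUtilization", resourceType="BigDataClusterNode")
```

Adding `resourceDisplayName="<node name>"` to the call narrows the same query to a single node. The node name is whatever appears on the Cluster details page.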

The metrics listed in the following table are automatically available for any cluster that you create. You don't need to enable monitoring on the resource to get these metrics.

Metric | Metric Display Name | Unit | Description | Resource Type
HdfsSpaceUsed | HDFS Space Used | Bytes | Total HDFS space used on the cluster | Cluster
HdfsSpaceFree | HDFS Space Free | Bytes | Total free HDFS space on the cluster | Cluster
YarnJobsCompleted | Yarn Jobs Completed | Jobs/Min | Number of YARN jobs completed on this cluster | Cluster
SparkJobsCompleted | Spark Jobs Completed | Jobs/Min | Number of Spark jobs completed on this cluster | Cluster
ServiceCertificateExpiryTime | Service Certificate Expiry Time | Days | Number of days left before a particular service certificate in the cluster expires | Cluster
CpuUtilization | CPU Utilization | Percentage | Percentage of CPU used | Node
DiskUtilization | Disk Utilization | Bytes | Disk space used | Node
MemoryUtilization | Memory Utilization | Bytes | Total memory used | Node
NetworkBytesIn | Network Bytes In | Bytes/Min | Network bytes in per minute | Node
NetworkBytesOut | Network Bytes Out | Bytes/Min | Network bytes out per minute | Node
CertificateExpiryTime | Certificate Expiration Time | Days | Days until certificate expiration | Node
MaintenanceStatus | Maintenance Status | Count | A value of 0 indicates that the node has no scheduled maintenance reboot. A value of 1 indicates that the node has a scheduled maintenance reboot. | Node
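
As an illustration of how the values in the table above might be interpreted, the sketch below turns raw datapoints into simple health signals. All function names are hypothetical. It assumes that total HDFS capacity equals HdfsSpaceUsed plus HdfsSpaceFree, which this section doesn't state explicitly, and the 30-day certificate threshold is an arbitrary example value rather than a service default.

```python
def hdfs_used_percent(used_bytes, free_bytes):
    """Percentage of HDFS capacity in use.

    Assumption: capacity = HdfsSpaceUsed + HdfsSpaceFree.
    """
    capacity = used_bytes + free_bytes
    if capacity == 0:
        return 0.0  # no capacity reported; avoid dividing by zero
    return 100.0 * used_bytes / capacity


def has_scheduled_maintenance(maintenance_status):
    """MaintenanceStatus is 1 when a maintenance reboot is scheduled, else 0."""
    return maintenance_status == 1


def certificate_expiring(days_left, threshold_days=30):
    """True when CertificateExpiryTime is at or below the chosen threshold.

    threshold_days is a hypothetical example value.
    """
    return days_left <= threshold_days


# Example with made-up datapoints: 750 GiB used, 250 GiB free.
GIB = 1024 ** 3
example_percent = hdfs_used_percent(750 * GIB, 250 * GIB)
```

The same comparisons could be expressed as alarm conditions in the Monitoring service instead of client-side code. The functions here only show the arithmetic.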