Managing Cluster Metrics

You can monitor the health, capacity, and performance of your Big Data Service resources by using metrics, alarms, and notifications.

Required IAM Policy

To monitor resources, you must have the required type of access in a policy written by an administrator, whether you're using the Console or the REST API with an SDK, CLI, or other tool. The policy must give you access to the monitoring services and the resources being monitored. If you perform an action and get a message that you don't have permission or are unauthorized, confirm with your administrator the type of access you've been granted and which compartment to work in. For information on user authorizations for monitoring and notifications, see the Authentication and Authorization section for the following services: Monitoring and Notifications.

Available Metrics: oci_big_data_service

Two types of metrics are available for Big Data Service.

Cluster metrics

Cluster metrics enable you to obtain cluster-level reports and monitor key performance indicators across the distributed cluster.

Node metrics

Node metrics enable you to obtain node-level reports and monitor the status of individual nodes in the cluster.

Big Data Service emits metrics when the VM isn't healthy. For example, one metric is emitted when the VM is down, and no metrics are emitted when the VM is up or in the STOPPED state.

Note

Big Data Service doesn't expose DenseIO-related maintenance events through metrics if the compute action is either DISABLE or TERMINATE.

Resource Principal Metrics

Metrics for Resource Principal Session Tokens (RPST) help proactively monitor token lifecycle, validity, and refresh status.

Metric Dimensions


Dimension	Description
resourceId	OCID of the Big Data Service node or cluster, depending on the metric.
clusterOcid	OCID of the Big Data Service cluster.
clusterName	Name of the Big Data Service cluster.
resourceType	`BigDataClusterNode` (for node metrics) or `BigDataCluster` (for cluster level).
resourceDisplayName	Node name, available in the cluster details UI.

Big Data Service metrics include the following dimensions:

resourceId
The Oracle Cloud ID (OCID) of the Big Data Service cluster (for cluster metrics).

The Oracle Cloud ID (OCID) of the Big Data Service node (for node metrics).
resourceType
BigDataCluster (for cluster metrics)

BigDataClusterNode (for node metrics)
resourceDisplayName
This field serves as a unique identifier for each metric entity. The field is the node name that can be found from the Cluster details page.
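As a rough illustration of how these dimensions can narrow a metric query, the following Python sketch assembles a Monitoring Query Language (MQL) string of the form `metric[interval]{dimension = "value"}.statistic()`. The helper name `build_mql` and the node name `example-node-0` are hypothetical and used only for illustration. The resulting string is the kind of query you would supply to the Monitoring service together with the `oci_big_data_service` namespace.

```python
def build_mql(metric, interval="1m", statistic="mean", **dimensions):
    """Build an MQL query string (hypothetical helper, not part of any SDK).

    Dimension filters are rendered as: {name = "value", ...}
    """
    filters = ", ".join(
        f'{name} = "{value}"' for name, value in dimensions.items()
    )
    # Omit the braces entirely when no dimension filter is given.
    selector = f"{{{filters}}}" if filters else ""
    return f"{metric}[{interval}]{selector}.{statistic}()"


# Node-level query: CPU utilization for one node (example node name is made up).
node_cpu = build_mql(
    "CpuUtilization",
    resourceType="BigDataClusterNode",
    resourceDisplayName="example-node-0",
)

# Cluster-level query: HDFS space used, no dimension filter.
cluster_hdfs = build_mql("HdfsSpaceUsed", interval="5m", statistic="max")

print(node_cpu)
print(cluster_hdfs)
```

Filtering on `resourceType` keeps node metrics and cluster metrics apart, and `resourceDisplayName` narrows a node metric to a single node.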

MaintenanceStatus specific dimensions

maintenanceDueTime
The scheduled start time of the 24-hour maintenance window.
computeMaintenanceAction
The action that Oracle Cloud Infrastructure performs on an instance during a scheduled maintenance.
- REBOOT: The instance is migrated from the physical host that needs maintenance to a healthy host. If live migration isn't possible, then the instance is reboot migrated.
- REBUILD_IN_PLACE: The instance is stopped, rebuilt on the same physical hardware, and restarted. A downtime of several hours occurs during the maintenance process.
recommendedAction
The action that you can take before the scheduled maintenance event to control how and when your applications experience downtime.
- REBOOT: You can reboot the cluster node. See Restarting a Cluster's Node.
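The maintenance dimensions above could be summarized for an operator with a small helper such as the following Python sketch. It assumes that the dimensions of a `MaintenanceStatus` data point have already been retrieved as a dictionary and that `maintenanceDueTime` is an ISO 8601 timestamp. Both are assumptions made for illustration, not documented behavior, and the function name `summarize_maintenance` and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def summarize_maintenance(dimensions, now=None):
    """Summarize MaintenanceStatus dimensions (hypothetical helper).

    Assumes maintenanceDueTime is an ISO 8601 string; adjust the parsing
    if the actual format differs.
    """
    # Accept a trailing "Z" as UTC for Python versions before 3.11.
    due = datetime.fromisoformat(
        dimensions["maintenanceDueTime"].replace("Z", "+00:00")
    )
    now = now or datetime.now(timezone.utc)
    hours_left = (due - now).total_seconds() / 3600
    return {
        "node": dimensions.get("resourceDisplayName"),
        "action": dimensions.get("computeMaintenanceAction"),
        "recommended": dimensions.get("recommendedAction"),
        "hours_until_window": round(hours_left, 1),
    }


# Made-up sample dimensions for illustration only.
sample = {
    "resourceDisplayName": "example-node-0",
    "maintenanceDueTime": "2030-01-15T08:00:00Z",
    "computeMaintenanceAction": "REBOOT",
    "recommendedAction": "REBOOT",
}
fixed_now = datetime(2030, 1, 14, 8, 0, tzinfo=timezone.utc)
summary = summarize_maintenance(sample, now=fixed_now)
print(summary)
```

Passing `now` explicitly keeps the calculation repeatable. In normal use you would omit it so that the current UTC time is used.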

The metrics listed in the following table are automatically available for any cluster that you create. You don't need to enable monitoring on the resource to get these metrics.


Metric	Metric Display Name	Unit	Description	Resource Type
`HdfsSpaceUsed`	HDFS Space Used	Bytes	Total HDFS space used on the cluster	Cluster
`HdfsSpaceFree`	HDFS Space Free	Bytes	Total free HDFS space on the cluster	Cluster
`YarnJobsCompleted`	Yarn Jobs Completed	Jobs/Min	Number of YARN jobs completed on this cluster	Cluster
`SparkJobsCompleted`	Spark Jobs Completed	Jobs/Min	Number of Spark jobs completed on this cluster	Cluster
`ServiceCertificateExpiryTime`	Service Certificate Expiry Time	Days	Number of days left for a particular service certificate to expire in the cluster	Cluster
`CpuUtilization`	CPU Utilization	Percentage	CPU Percentage used	Node
`DiskUtilization`	Disk Utilization	Bytes	Disk space used	Node
`MemoryUtilization`	Memory Utilization	Bytes	Total memory used	Node
`NetworkBytesIn`	Network Bytes In	Bytes/Min	Network bytes in per minute	Node
`NetworkBytesOut`	Network Bytes Out	Bytes/Min	Network bytes out per minute	Node
`CertificateExpiryTime`	Certificate Expiration Time	Days	Days until certificate expiration	Node
`MaintenanceStatus`	Maintenance Status	Count	A value of 0 indicates that the node has no scheduled maintenance reboot. A value of 1 indicates that the node has a scheduled maintenance reboot.	Node
`ResourcePrincipalTokenExpiryTimeExceeding80PercentThreshold`	Token Expiry Alert	Boolean	Indicates if the RPST token has exceeded 80% of its lifespan.	BigDataClusterNode
`ResourcePrincipalSessionTokenStatus`	RPST Status	Count	0: Token healthy. 1: Token expired. 2: Token missing.	BigDataClusterNode
`ResourcePrincipalTokenRefreshedInLast30Mins`	Token Refresh Status	Boolean	Indicates whether the RPST token was refreshed in the last 30 minutes at cluster level.	BigDataCluster
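The numeric values in the table can be translated into readable states before they are shown on a dashboard or in a notification. The following Python sketch maps `ResourcePrincipalSessionTokenStatus` and `MaintenanceStatus` values to the meanings given in the table. The function names and the "unknown" fallback are hypothetical additions for illustration.

```python
# Value meanings taken from the metrics table above.
RPST_STATUS = {0: "healthy", 1: "expired", 2: "missing"}
MAINTENANCE_STATUS = {
    0: "no scheduled maintenance reboot",
    1: "scheduled maintenance reboot",
}


def describe_rpst_status(value):
    """Return a label for a ResourcePrincipalSessionTokenStatus value."""
    # "unknown" is a fallback chosen for this sketch, not a documented state.
    return RPST_STATUS.get(int(value), "unknown")


def describe_maintenance_status(value):
    """Return a label for a MaintenanceStatus value."""
    return MAINTENANCE_STATUS.get(int(value), "unknown")


def rpst_needs_attention(value):
    """True when the token is expired, missing, or in an unrecognized state."""
    return describe_rpst_status(value) != "healthy"


for raw in (0, 1, 2):
    print(raw, describe_rpst_status(raw), rpst_needs_attention(raw))
print(describe_maintenance_status(1))
```

A check such as `rpst_needs_attention` treats any non-zero status as a problem, which matches the table: only 0 denotes a healthy token.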
