Managing Cluster Metrics
You can monitor the health, capacity, and performance of your Big Data Service resources by using metrics, alarms, and notifications.
Required IAM Policy
To monitor resources, you must be given the required type of access in a policy written by an administrator, whether you're using the Cloud Console or the REST API with an SDK, CLI, or other tool. The policy must give you access to the monitoring services and the resources being monitored. If you perform an action and get a message that you don't have permission or are unauthorized, confirm with your administrator the type of access you've been granted and which compartment you should work in. For information on user authorizations for monitoring and notifications, see the Authentication and Authorization section for the following services: Monitoring and Notifications.
Available Metrics: oci_big_data_service
Two types of metrics are available for Big Data Service.
- Cluster metrics
Cluster metrics enable you to obtain a cluster-level report and monitor the different distributed key performance indicators.
- Node metrics
Node metrics enable you to obtain node-level reports and monitor the status of individual nodes of the cluster.
Big Data Service emits metrics when the VM isn't healthy. For example, one metric is emitted when the VM is down, and no metrics are emitted when the VM is up or in the STOPPED state.
Note
Big Data Service doesn't expose DenseIO-related maintenance events through metrics if the compute action is either DISABLE or TERMINATE.
Big Data Service metrics include the following dimensions:
- resourceId
The Oracle Cloud ID (OCID) of the Big Data Service cluster (for cluster metrics).
The Oracle Cloud ID (OCID) of the Big Data Service node (for node metrics).
- resourceType
BigDataCluster (for cluster metrics)
BigDataClusterNode (for node metrics)
- resourceDisplayName
This field serves as a unique identifier for each metric entity. The field is the node name that can be found from the Cluster details page.
- maintenanceDueTime
The scheduled start time of the 24-hour maintenance window.
- computeMaintenanceAction
The action that Oracle Cloud Infrastructure performs on an instance during a scheduled maintenance.
REBOOT: The instance is migrated from the physical host that needs maintenance to a healthy host. If live migration isn't possible, then the instance is reboot migrated.
REBUILD_IN_PLACE: The instance is stopped, rebuilt on the same physical hardware, and restarted. A downtime of several hours occurs during the maintenance process.
- recommendedAction
The action that you can take before the scheduled maintenance event, so that you can control how and when your applications experience downtime.
REBOOT: You can reboot a cluster node. See Restarting a Cluster's Node.
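The dimensions above are typically used to narrow a metric query to one cluster or node. The following is a minimal Python sketch that assembles a query string in the general shape of the Monitoring Query Language (metric, interval, dimension filter, statistic). The helper name `build_mql_query` and the placeholder OCID are hypothetical, and the exact query syntax should be confirmed against the Monitoring documentation. The namespace `oci_big_data_service` is supplied separately when the query is submitted.

```python
def build_mql_query(metric, interval="1m", statistic="mean", dimensions=None):
    """Assemble a query string of the form metric[interval]{dim = "value"}.statistic().

    Hypothetical helper for illustration; it only formats text and does not
    call any Oracle Cloud Infrastructure API.
    """
    selector = ""
    if dimensions:
        # Each dimension becomes name = "value"; pairs are comma-separated.
        pairs = ", ".join(f'{name} = "{value}"' for name, value in dimensions.items())
        selector = "{" + pairs + "}"
    return f"{metric}[{interval}]{selector}.{statistic}()"


# Node-level CPU usage, filtered by resource type and a placeholder node OCID.
node_query = build_mql_query(
    "CpuUtilization",
    interval="5m",
    statistic="max",
    dimensions={
        "resourceType": "BigDataClusterNode",
        "resourceId": "<node_ocid>",  # placeholder, not a real OCID
    },
)
print(node_query)
# CpuUtilization[5m]{resourceType = "BigDataClusterNode", resourceId = "<node_ocid>"}.max()
```

Building the string in one place keeps dimension names consistent across dashboards and alarms that query the same metrics.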
The metrics listed in the following table are automatically available for any cluster that you create. You don't need to enable monitoring on the resource to get these metrics.
Metric | Metric Display Name | Unit | Description | Resource Type |
---|---|---|---|---|
HdfsSpaceUsed | HDFS Space Used | Bytes | Total HDFS space used on the cluster | Cluster |
HdfsSpaceFree | HDFS Space Free | Bytes | Total free HDFS space on the cluster | Cluster |
YarnJobsCompleted | Yarn Jobs Completed | Jobs/Min | Number of YARN jobs completed on this cluster | Cluster |
SparkJobsCompleted | Spark Jobs Completed | Jobs/Min | Number of Spark jobs completed on this cluster | Cluster |
ServiceCertificateExpiryTime | Service Certificate Expiry Time | Days | Number of days left before a particular service certificate in the cluster expires | Cluster |
CpuUtilization | CPU Utilization | Percentage | Percentage of CPU used | Node |
DiskUtilization | Disk Utilization | Bytes | Disk space used | Node |
MemoryUtilization | Memory Utilization | Bytes | Total memory used | Node |
NetworkBytesIn | Network Bytes In | Bytes/Min | Network bytes in per minute | Node |
NetworkBytesOut | Network Bytes Out | Bytes/Min | Network bytes out per minute | Node |
CertificateExpiryTime | Certificate Expiration Time | Days | Days until certificate expiration | Node |
MaintenanceStatus | Maintenance Status | Count | A value of 0 indicates that the node has no scheduled maintenance reboot. A value of 1 indicates that the node has a scheduled maintenance reboot. | Node |
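Because HdfsSpaceUsed and HdfsSpaceFree are both reported in bytes, they can be combined into a single utilization figure. The sketch below assumes that used plus free space approximates total HDFS capacity, which the table doesn't state explicitly. The function name and sample values are hypothetical.

```python
def hdfs_used_percent(space_used_bytes, space_free_bytes):
    """Return HDFS utilization as a percentage of (used + free).

    Assumes used + free approximates total capacity; that relationship is an
    assumption for this sketch, not a documented guarantee.
    """
    total = space_used_bytes + space_free_bytes
    if total <= 0:
        raise ValueError("used + free must be greater than zero")
    return 100.0 * space_used_bytes / total


# Sample datapoints (made-up values): 750 bytes used, 250 bytes free.
print(hdfs_used_percent(750, 250))  # 75.0
```

A percentage is usually easier to use as an alarm threshold than raw byte counts, because it stays meaningful when the cluster is resized.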