High Performance Computing
High Performance Computing (HPC) performs complex calculations and processes data faster than traditional Compute. HPC uses bare metal servers, ultralow latency cluster networking, high-performance storage options, and parallel file systems. This infrastructure enables parallel processing for compute-intensive workloads such as artificial intelligence, deep learning, data analysis, scientific simulations, and any other highly intensive workload.
Getting Started with High Performance Computing
You can create a single-node HPC instance with the standard instance creation workflow. If you want to use multiple HPC instances in a RDMA network group, you can create them through a Cluster Networks with Instance Pools or Compute Clusters.
Using RDMA Cluster Networks
Remote Direct Memory Access (RDMA) cluster networks are groups of high performance computing (HPC), GPU, or optimized instances that are connected with a high-bandwidth, ultra low-latency network. Each node in the cluster is a bare metal machine located in close physical proximity to the other nodes. A remote direct memory access (RDMA) network between nodes provides latency as low as single-digit microseconds, comparable to on-premises HPC clusters.
Cluster networks are designed for highly demanding parallel computing workloads. For example:
- Computational fluid dynamics simulations for automotive or aerospace modeling
- Financial modeling and risk analysis
- Biomedical simulations
- Trajectory analysis and design for space exploration
- Artificial intelligence and big data workloads
Oracle Cloud Infrastructure offers two types of cluster networks. In both cases, the networks are groups of bare metal instances that are connected with an ultra low latency network.
- Cluster networks with instance pools let you use instance pools to manage groups of identical instances in the RDMA network group. If you want predictable capacity for a specific number of identical instances that are managed as a group, use cluster networks with instance pools.
- Compute clusters let you manage instances in the cluster individually. When you create a compute cluster, you create an empty RDMA network group. After the group is created, you can add instances to the group, or delete instances from the group. If you want to manage instances in the RDMA network independently of each other or use different types of instances in the network group, use compute clusters.
Oracle Cloud Agent Plugins for HPC
Oracle Cloud Infrastructure offers a cloud agent plugin specific for HPC bare metal instances to simplify the configuration and authentication of HPC networks, and provide specialized monitoring for high performance computing.
The HPC plugin is available for HPC in all commercial regions.
Shape | Supported Images | Default Setting |
---|---|---|
BM.GPU.A10.4 | Ubuntu 20.04+, OL7, OL8, CentOS 7+ | Recommended on OCA 1.37.0 or above |
BM.GPU.A100 | Ubuntu 20.04+, OL7, OL8, CentOS 7+ | Recommended on OCA 1.37.0 or above |
BM.GPU.H100.8 | Ubuntu 20.04+, OL7, OL8 | Enabled on OCA 1.37.0 or above |
BM.GPU4.8 | Ubuntu 20.04+, OL7, OL8, CentOS 7+ | Recommended on OCA 1.37.0 or above |
BM.HPC2.36 | Ubuntu 20.04+, OL7, OL8, CentOS 7+ | Recommended on OCA 1.37.0 or above |
BM.Optimized3.36 | Ubuntu 20.04+, OL7, OL8 | Enabled on OCA 1.37.0 or above |
- Auto Configuration
- Applies recommended network adapter settings on GPU shapes
- Applies recommended Mellanox Connect-X settings on GPU shapes
- Assigns IP addresses to RDMA network interfaces based on the primary VCN
- RDMA Authentication/Configuration
- Configures RDMA network interfaces with recommended QoS and MTU
- Configures and maintains required RDMA network authentication
- GPU & RDMA Monitoring
- Emits additional RDMA and GPU performance metrics
To enable the HPC plugin on an existing bare metal instance, you must create or migrate the existing instance to Oracle Cloud Agent 1.35.0 or above. See Oracle Cloud Agent for more information.
Enabling GPU and RDMA Metrics
When you install the Oracle Cloud Agent and enable the HPC monitoring plugin, the GPU and RDMA metrics are automatically enabled. OCI sends the metrics to the customer namespace and bills them against the tenancy.
To determine if these metrics will result in additional charges, see metering pricing.
For a detailed list of HPC metrics, see Compute Instance Metrics.