NVIDIA GPU Plugin

When you enable the NVIDIA GPU Plugin cluster add-on, you can pass the following key/value pairs as arguments.

To ensure that workloads running on NVIDIA GPU worker nodes are not interrupted unexpectedly, we recommend choosing the version of the NVIDIA GPU Plugin add-on to deploy yourself, rather than specifying that you want Oracle to update the add-on automatically.

Configuration Arguments Common to all Cluster Add-ons
Key (API and CLI) Key's Display Name (Console) Description Required/Optional Default Value Example Value
affinity affinity

A group of affinity scheduling rules.

JSON format in plain text or Base64 encoded.

Optional null null
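Although the default is null, a non-null affinity value follows the standard Kubernetes affinity structure. As an illustrative sketch (the label key and shape value below are assumptions for the example, not defaults of the add-on), the following requires add-on pods to be scheduled onto nodes of a particular instance type:

```json
{
  "nodeAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "node.kubernetes.io/instance-type",
              "operator": "In",
              "values": ["VM.GPU.A10.1"]
            }
          ]
        }
      ]
    }
  }
}
```

The same JSON can also be supplied Base64 encoded.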
nodeSelectors node selectors

You can use node selectors and node labels to control the worker nodes on which add-on pods run.

For a pod to run on a node, the pod's node selector must have the same key/value as the node's label.

Set nodeSelectors to a key/value pair that matches both the pod's node selector, and the worker node's label.

JSON format in plain text or Base64 encoded.

Optional null {"foo":"bar", "foo2": "bar2"}

The pod will only run on nodes that have both the foo=bar and foo2=bar2 labels.

numOfReplicas numOfReplicas The number of replicas of the add-on deployment.

(For CoreDNS, use nodesPerReplica instead.)

Required 1

Creates one replica of the add-on deployment per cluster.

2

Creates two replicas of the add-on deployment per cluster.

rollingUpdate rollingUpdate

Controls the desired behavior of rolling update by maxSurge and maxUnavailable.

JSON format in plain text or Base64 encoded.

Optional null null
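Although the default is null, a non-null rollingUpdate value uses the standard Kubernetes rolling update fields. For example, a hypothetical value that allows one extra pod to be created during an update while keeping all existing pods available:

```json
{"maxSurge": 1, "maxUnavailable": 0}
```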
tolerations tolerations

You can use taints and tolerations to control the worker nodes on which add-on pods run.

For a pod to run on a node that has a taint, the pod must have a corresponding toleration.

Set tolerations to a key/value pair that matches both the pod's toleration, and the worker node's taint.

JSON format in plain text or Base64 encoded.

Optional null [{"key":"tolerationKeyFoo", "value":"tolerationValBar", "effect":"NoSchedule", "operator":"Equal"}]

Only pods that have this toleration can run on worker nodes that have the tolerationKeyFoo=tolerationValBar:NoSchedule taint.

topologySpreadConstraints topologySpreadConstraints

How to spread matching pods among the given topology.

JSON format in plain text or Base64 encoded.

Optional null null
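Although the default is null, a non-null topologySpreadConstraints value uses the standard Kubernetes topology spread constraint fields. A minimal sketch (the k8s-app label selector here is an assumption for illustration) that spreads matching pods evenly across availability zones:

```json
[
  {
    "maxSkew": 1,
    "topologyKey": "topology.kubernetes.io/zone",
    "whenUnsatisfiable": "ScheduleAnyway",
    "labelSelector": {
      "matchLabels": {"k8s-app": "nvidia-gpu-device-plugin"}
    }
  }
]
```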
Configuration Arguments Specific to this Cluster Add-on
Key (API and CLI) Key's Display Name (Console) Description Required/Optional Default Value Example Value
deviceIdStrategy Device ID Strategy

Which strategy to use for passing device IDs to the underlying runtime.

One of:

  • uuid
  • index
Optional uuid
deviceListStrategy Device List Strategy

Which strategy to use for passing the device list to the underlying runtime.

Supported values:

  • envvar
  • volume-mounts
  • cdi-annotations
  • cdi-cri

Multiple values can be specified in a comma-separated list (for example, envvar,cdi-annotations).

Optional envvar
driverRoot Driver Root The root path for the NVIDIA driver installation. Optional /
failOnInitError FailOnInitError

Whether to fail the plugin if an error is encountered during initialization.

When set to false, an initialization error causes the plugin to block indefinitely instead of failing.

Optional true
migStrategy MIG Strategy

Which strategy to use for exposing MIG (Multi-Instance GPU) devices on GPUs that support it.

One of:

  • none
  • single
  • mixed
Optional none
nvidia-gpu-device-plugin.ContainerResources nvidia-gpu-device-plugin container resources

You can specify the resource quantities that the add-on containers request, and set resource usage limits that the add-on containers cannot exceed.

JSON format in plain text or Base64 encoded.

Optional null {"limits": {"cpu": "500m", "memory": "200Mi" }, "requests": {"cpu": "100m", "memory": "100Mi"}}

Create add-on containers that request 100 millicores of CPU and 100 mebibytes of memory. Limit add-on containers to 500 millicores of CPU and 200 mebibytes of memory.

passDeviceSpecs Pass Device Specs Whether to pass the paths and desired device node permissions for any NVIDIA devices being allocated to the container. Optional false
useConfigFile Use Config File from ConfigMap

Whether to use a configuration file, derived from a ConfigMap, to configure the NVIDIA Device Plugin for Kubernetes.

If set to true, you must create a ConfigMap named nvidia-device-plugin-config in the cluster, and specify values for the configuration arguments in it. See Example.

The ConfigMap is referenced by the nvidia-gpu-device-plugin daemonset.

Optional false

Example of nvidia-device-plugin-config ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata: 
  name: nvidia-device-plugin-config 
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    flags:
      migStrategy: "none"
      failOnInitError: true
      nvidiaDriverRoot: "/"
      plugin:
        passDeviceSpecs: false
        deviceListStrategy: envvar
        deviceIDStrategy: uuid