NVIDIA GPU Plugin

When you enable the NVIDIA GPU Plugin cluster add-on, you can pass the following key/value pairs as arguments.

To ensure that workloads running on NVIDIA GPU worker nodes are not interrupted unexpectedly, we recommend choosing the version of the NVIDIA GPU Plugin add-on to deploy yourself, rather than specifying that you want Oracle to update the add-on automatically.

Configuration Arguments Common to all Cluster Add-ons
Key (API and CLI) Key's Display Name (Console) Description Required/Optional Default Value Example Value
affinity affinity

A group of affinity scheduling rules.

JSON format in plain text or Base64 encoded.

Optional null null
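Although the default is null, a non-null affinity value follows the standard Kubernetes affinity structure. As an illustrative sketch (the label key and shape value below are assumptions for the example, not defaults of the add-on), the following requires add-on pods to be scheduled onto nodes of a particular instance type:

```json
{
  "nodeAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "node.kubernetes.io/instance-type",
              "operator": "In",
              "values": ["VM.GPU.A10.1"]
            }
          ]
        }
      ]
    }
  }
}
```

The same JSON can also be supplied Base64 encoded.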
nodeSelectors node selectors

You can use node selectors and node labels to control the worker nodes on which add-on pods run.

For a pod to run on a node, the pod's node selector must have the same key/value as the node's label.

Set nodeSelectors to a key/value pair that matches both the pod's node selector, and the worker node's label.

JSON format in plain text or Base64 encoded.

Optional null {"foo":"bar", "foo2": "bar2"}

The pod will only run on nodes that have both the foo=bar and foo2=bar2 labels.

numOfReplicas numOfReplicas The number of replicas of the add-on deployment.

(For CoreDNS, use nodesPerReplica instead.)

Required 1

Creates one replica of the add-on deployment per cluster.

2

Creates two replicas of the add-on deployment per cluster.

rollingUpdate rollingUpdate

Controls the desired behavior of rolling update by maxSurge and maxUnavailable.

JSON format in plain text or Base64 encoded.

Optional null null
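Although the default is null, a non-null rollingUpdate value uses the standard Kubernetes rolling update fields. For example, a hypothetical value that allows one extra pod to be created during an update while keeping all existing pods available:

```json
{"maxSurge": 1, "maxUnavailable": 0}
```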
tolerations tolerations

You can use taints and tolerations to control the worker nodes on which add-on pods run.

For a pod to run on a node that has a taint, the pod must have a corresponding toleration.

Set tolerations to a key/value pair that matches both the pod's toleration, and the worker node's taint.

JSON format in plain text or Base64 encoded.

Optional null [{"key":"tolerationKeyFoo", "value":"tolerationValBar", "effect":"NoSchedule", "operator":"Equal"}]

Only pods that have this toleration can run on worker nodes that have the tolerationKeyFoo=tolerationValBar:NoSchedule taint.

topologySpreadConstraints topologySpreadConstraints

How to spread matching pods among the given topology.

JSON format in plain text or Base64 encoded.

Optional null null
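Although the default is null, a non-null topologySpreadConstraints value uses the standard Kubernetes topology spread constraint fields. A minimal sketch (the k8s-app label selector here is an assumption for illustration) that spreads matching pods evenly across availability zones:

```json
[
  {
    "maxSkew": 1,
    "topologyKey": "topology.kubernetes.io/zone",
    "whenUnsatisfiable": "ScheduleAnyway",
    "labelSelector": {
      "matchLabels": {"k8s-app": "nvidia-gpu-device-plugin"}
    }
  }
]
```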
Configuration Arguments Specific to this Cluster Add-on
Key (API and CLI) Key's Display Name (Console) Description Required/Optional Default Value Example Value
deviceIdStrategy Device ID Strategy

Which strategy to use for passing device IDs to the underlying runtime.

One of:

  • uuid
  • index
Optional uuid
deviceListStrategy Device List Strategy

Which strategy to use for passing the device list to the underlying runtime.

Supported values:

  • envvar
  • volume-mounts
  • cdi-annotations
  • cdi-cri

Multiple values can be specified in a comma-separated list (for example, envvar,cdi-annotations).

Optional envvar
driverRoot Driver Root The root path for the NVIDIA driver installation. Optional /
failOnInitError FailOnInitError

Whether to fail the plugin if an error is encountered during initialization.

When set to false, an initialization error causes the plugin to block indefinitely instead of failing.

Optional true
migStrategy MIG Strategy

Which strategy to use for exposing MIG (Multi-Instance GPU) devices on GPUs that support it.

One of:

  • none
  • single
  • mixed
Optional none
nvidia-gpu-device-plugin.ContainerResources nvidia-gpu-device-plugin container resources

You can specify the resource quantities that the add-on containers request, and set resource usage limits that the add-on containers cannot exceed.

JSON format in plain text or Base64 encoded.

Optional null {"limits": {"cpu": "500m", "memory": "200Mi" }, "requests": {"cpu": "100m", "memory": "100Mi"}}

Create add-on containers that request 100 millicores of CPU and 100 mebibytes of memory. Limit add-on containers to 500 millicores of CPU and 200 mebibytes of memory.

passDeviceSpecs Pass Device Specs Whether to pass the paths and desired device node permissions for any NVIDIA devices being allocated to the container. Optional false
useConfigFile Use Config File from ConfigMap

Whether to use a configuration file, derived from a ConfigMap, to configure the NVIDIA Device Plugin for Kubernetes.

If set to true, you must create a ConfigMap named nvidia-device-plugin-config in the cluster, and specify values for the configuration arguments in it. See Example.

The ConfigMap is referenced by the nvidia-gpu-device-plugin daemonset.

Optional false

Example of nvidia-device-plugin-config ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata: 
  name: nvidia-device-plugin-config 
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    flags:
      migStrategy: "none"
      failOnInitError: true
      nvidiaDriverRoot: "/"
      plugin:
        passDeviceSpecs: false
        deviceListStrategy: envvar
        deviceIDStrategy: uuid