The InstaSlice operator requires NVIDIA GPU drivers to be installed on OpenShift nodes with NVIDIA GPUs. It also requires Multi-Instance GPU (MIG) to be enabled on a node's GPUs without any MIG partitions defined. The recommended way to accomplish both on OpenShift is via the NVIDIA GPU Operator: the operator installs the drivers, and its MIG manager handles everything needed to set the correct MIG mode.
- Install the NVIDIA GPU Operator for OpenShift (not for Kubernetes).
- Create a cluster policy with the following changes:
  - Disable the device plugin because the task of managing and allocating GPU resources (MIG partitions) will be performed by InstaSlice.

        devicePlugin:
          enabled: false
  - Disable the CUDA validator so that it does not try to find a GPU to run on. No GPU is available until InstaSlice creates a MIG partition, and that partition is allocated to a customer workload, not to the validator.

        validator:
          <...>
          cuda:
            env:
              - name: WITH_WORKLOAD
                value: 'false'
          <...>
  - Configure the MIG manager so that it can be used to enable MIG on the GPUs, but does not interfere with InstaSlice:

        migManager:
          config:
            default: ""
            name: default-mig-parted-config
          enabled: true
          env:
            - name: WITH_REBOOT
              value: 'true'
            - name: MIG_PARTED_MODE_CHANGE_ONLY
              value: 'true'

    ⚠️ Warning: Setting `MIG_PARTED_MODE_CHANGE_ONLY=true` prevents the MIG manager from trying to delete MIG partitions managed by InstaSlice in some corner cases (e.g. when a MIG manager pod restarts). However, it also means that you must clean up any existing MIG partitions before enabling InstaSlice.

  - Change the MIG strategy to `mixed`:

        mig:
          strategy: mixed
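Taken together, the four changes above could be applied to an existing cluster policy with a single merge patch. The following is a minimal sketch rather than a definitive procedure: the function name `patch_cluster_policy` is hypothetical, and the default policy name `gpu-cluster-policy` is an assumption, so substitute the name of your own ClusterPolicy. Note that a merge patch replaces the `env` lists wholesale, so any other entries already present in those lists would be dropped.

```shell
# Hypothetical helper: apply the InstaSlice-related settings to a ClusterPolicy.
# $1 = ClusterPolicy name (assumed default: gpu-cluster-policy)
patch_cluster_policy() {
  policy="${1:-gpu-cluster-policy}"
  # JSON merge patch mirroring the YAML fragments shown in the steps above.
  patch='{
  "spec": {
    "devicePlugin": {"enabled": false},
    "validator": {"cuda": {"env": [{"name": "WITH_WORKLOAD", "value": "false"}]}},
    "migManager": {
      "enabled": true,
      "config": {"default": "", "name": "default-mig-parted-config"},
      "env": [
        {"name": "WITH_REBOOT", "value": "true"},
        {"name": "MIG_PARTED_MODE_CHANGE_ONLY", "value": "true"}
      ]
    },
    "mig": {"strategy": "mixed"}
  }
}'
  # The fully qualified resource name avoids clashes with other ClusterPolicy CRDs.
  oc patch clusterpolicies.nvidia.com "$policy" --type merge -p "$patch"
}
```

For example, `patch_cluster_policy gpu-cluster-policy` patches the policy named `gpu-cluster-policy`.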
- Wait for the NVIDIA GPU Operator pods to run successfully.

      # oc get pod -n nvidia-gpu-operator
      NAME                                                  READY   STATUS    RESTARTS   AGE
      gpu-feature-discovery-7pz2r                           1/1     Running   0          6m47s
      gpu-operator-9588668b5-l5vbr                          1/1     Running   0          9m50s
      nvidia-container-toolkit-daemonset-tzdkb              1/1     Running   0          6m48s
      nvidia-dcgm-8mzps                                     1/1     Running   0          6m48s
      nvidia-dcgm-exporter-z5lj9                            1/1     Running   0          6m48s
      nvidia-driver-daemonset-417.94.202409121747-0-xvdpr   2/2     Running   0          7m32s
      nvidia-mig-manager-ww2cf                              1/1     Running   0          2m22s
      nvidia-node-status-exporter-w28lj                     1/1     Running   0          7m25s
      nvidia-operator-validator-bv4zc                       1/1     Running   0          6m48s
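Rather than scanning the listing by eye, the check can be scripted. This sketch prints the names of pods whose status is neither `Running` nor `Completed`, so an empty result means the operator has settled. The function name `not_running_pods` is hypothetical, and the sketch assumes the default column layout of `oc get pod` shown above, with the status in the third column.

```shell
# Hypothetical helper: list pods that are not yet Running (or Completed).
# $1 = namespace (defaults to the namespace used in this guide)
not_running_pods() {
  ns="${1:-nvidia-gpu-operator}"
  # --no-headers drops the NAME/READY/STATUS header so awk sees only pod rows.
  oc get pod -n "$ns" --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}
```

Running `not_running_pods` repeatedly until it prints nothing is one simple way to wait for the rollout.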
- Apply the `all-enabled` profile to enable MIG on the GPU nodes:

      oc label node $node nvidia.com/mig.config=all-enabled --overwrite
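When several nodes need the label, the command above can be wrapped in a loop. This is a sketch under stated assumptions: the function name `label_mig_nodes` is hypothetical, and the default selector `nvidia.com/gpu.present=true` assumes that GPU nodes carry that label. Pass a different selector if your cluster identifies GPU nodes another way, or if only some GPU nodes support MIG.

```shell
# Hypothetical helper: apply the all-enabled MIG profile to every matching node.
# $1 = node label selector (assumed default: nvidia.com/gpu.present=true)
label_mig_nodes() {
  selector="${1:-nvidia.com/gpu.present=true}"
  # "-o name" yields entries such as node/<name>, which "oc label" accepts directly.
  for node in $(oc get node -l "$selector" -o name); do
    oc label "$node" nvidia.com/mig.config=all-enabled --overwrite
  done
}
```

Calling `label_mig_nodes` with no argument uses the assumed default selector.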
- Verify that MIG has been enabled on the labeled nodes. You can use the following command to query the MIG mode of a node:

      oc exec -ti $(oc get pod -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=$node -o name) -n nvidia-gpu-operator -- nvidia-smi --query-gpu mig.mode.current,mig.mode.pending --format=csv,noheader

  The expected output, one line per GPU, is:

      Enabled, Enabled
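The verification can also be turned into a pass/fail check. The sketch below runs the same query non-interactively and succeeds only if every GPU on the node reports `Enabled, Enabled`. The function name `check_mig_enabled` is hypothetical, and the sketch assumes exactly one driver pod per node.

```shell
# Hypothetical helper: succeed only if all GPUs on a node have MIG enabled.
# $1 = node name, $2 = namespace (defaults to nvidia-gpu-operator)
check_mig_enabled() {
  node="$1"
  ns="${2:-nvidia-gpu-operator}"
  # Find the driver pod scheduled on the given node.
  pod=$(oc get pod -n "$ns" -l app.kubernetes.io/component=nvidia-driver \
    --field-selector "spec.nodeName=$node" -o name)
  [ -n "$pod" ] || { echo "no driver pod found on $node" >&2; return 1; }
  # One output line per GPU: "<current mode>, <pending mode>".
  modes=$(oc exec "$pod" -n "$ns" -- nvidia-smi \
    --query-gpu=mig.mode.current,mig.mode.pending --format=csv,noheader)
  printf '%s\n' "$modes"
  [ -n "$modes" ] || return 1
  # grep -v -x selects lines that are NOT exactly "Enabled, Enabled".
  if printf '%s\n' "$modes" | grep -qvx 'Enabled, Enabled'; then
    return 1
  fi
  return 0
}
```

For example, `check_mig_enabled "$node" && echo "MIG ready on $node"` prints the confirmation only when every GPU on the node is in MIG mode.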