
Handle Node Reboots #336

Open
empovit opened this issue Dec 16, 2024 · 6 comments

@empovit (Contributor) commented Dec 16, 2024

On NVIDIA GPUs, MIG partitions are not persisted across node reboots.

If a node crashes and InstaSlice does not get a chance to gracefully de-allocate and delete the MIG partitions, it ends up listing "dangling" allocations and trying to assign them to the restarted pods according to the now-outdated InstaSlice object. In that case the workloads fail to resume.
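
To make the failure mode concrete, here is a minimal Go sketch. The types and helper names (Allocation, findDanglingAllocations) are illustrative placeholders, not InstaSlice's real API; it only shows how a restarted daemonset could notice that recorded allocations point at MIG devices that no longer exist on the GPU:

```go
package main

import "fmt"

// Allocation loosely mirrors an allocation entry recorded in the InstaSlice
// object; the field names are illustrative, not the operator's actual API.
type Allocation struct {
	PodUID  string
	MigUUID string // UUID of the MIG device assigned before the reboot
}

// findDanglingAllocations returns every recorded allocation whose MIG UUID is
// no longer present on the node, e.g. after a reboot wiped the partitions.
// liveMigUUIDs would come from enumerating the GPU (NVML / nvidia-smi) at startup.
func findDanglingAllocations(recorded []Allocation, liveMigUUIDs map[string]bool) []Allocation {
	var dangling []Allocation
	for _, a := range recorded {
		if !liveMigUUIDs[a.MigUUID] {
			dangling = append(dangling, a)
		}
	}
	return dangling
}

func main() {
	recorded := []Allocation{
		{PodUID: "pod-a", MigUUID: "MIG-1111"},
		{PodUID: "pod-b", MigUUID: "MIG-2222"},
	}
	// After a reboot the GPU reports no MIG devices at all.
	live := map[string]bool{}

	for _, a := range findDanglingAllocations(recorded, live) {
		fmt.Printf("dangling allocation: pod %s -> missing MIG device %s\n", a.PodUID, a.MigUUID)
	}
}
```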

@rphillips (Contributor)

Can the daemonset remove all the MIG partitions on boot?

@empovit (Contributor, Author) commented Dec 22, 2024

@rphillips if you mean the MIG partitions on the GPU, that is not needed because they don't survive reboots.

What I meant was deleting all the existing allocation entries in InstaSlice (precisely because the corresponding MIG partitions no longer exist), and updating all previously running MIG workloads with newly assigned partitions so that the pods can resume running. I think that should be the controller's job.

Sorry if the issue description is confusing. Feel free to suggest a better wording.
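
A minimal sketch of the controller-side cleanup described above, assuming illustrative types rather than the operator's real allocation map: every pre-reboot entry is dropped (the MIG partitions backing them are gone) and its pod is queued for a fresh partition assignment.

```go
package main

import "fmt"

// Illustrative stand-in for an InstaSlice allocation entry; not the real API.
type Allocation struct {
	PodUID  string
	MigUUID string
}

// reconcileAfterReboot drops every stale allocation entry for the rebooted node
// and returns the pods that need a new MIG partition before they can resume.
func reconcileAfterReboot(allocations map[string]Allocation) (cleaned map[string]Allocation, podsToReassign []string) {
	cleaned = map[string]Allocation{}
	for _, a := range allocations {
		podsToReassign = append(podsToReassign, a.PodUID)
	}
	return cleaned, podsToReassign
}

func main() {
	stale := map[string]Allocation{
		"alloc-1": {PodUID: "pod-a", MigUUID: "MIG-1111"},
		"alloc-2": {PodUID: "pod-b", MigUUID: "MIG-2222"},
	}
	cleaned, pods := reconcileAfterReboot(stale)
	fmt.Printf("remaining allocations: %d, pods needing new slices: %v\n", len(cleaned), pods)
}
```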

@harche self-assigned this Jan 6, 2025
@sairameshv (Member)

/assign

@sairameshv (Member)

The other day I discussed this issue with @asm582, and the following is the draft plan we came up with to resolve it:

  • The controller watches for Node events (i.e. the node shutdown).
  • It updates the InstaSlice status object's Processed field to false.
  • When the daemonset comes up after the reboot, it checks the Processed field and re-creates the slices for all pods that are in the Ungated state.

#421 may fix this issue. However, the code changes still need to be tested on kind/OCP clusters in both emulated and non-emulated modes. I would also add a few more unit tests and e2e tests to cover such scenarios. A rough sketch of the daemonset-side flow follows at the end of this comment.

The PR is in draft state; please feel free to review the approach.
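
The sketch below illustrates the daemonset-side recovery flow outlined above. The InstasliceStatus type, the Processed field layout, and the createSlice helper are hypothetical placeholders, not the actual InstaSlice code:

```go
package main

import "fmt"

// Illustrative stand-ins for the InstaSlice status and its allocations;
// the real CRD fields may differ.
type Allocation struct {
	PodUID string
	State  string // e.g. "ungated", "created"
}

type InstasliceStatus struct {
	Processed   bool
	Allocations []Allocation
}

// createSlice is a placeholder for the daemonset's real MIG-creation logic
// (NVML / nvidia-smi calls); here it only logs.
func createSlice(podUID string) error {
	fmt.Printf("re-creating MIG slice for pod %s\n", podUID)
	return nil
}

// onDaemonsetStartup models the flow from the comment above: if the controller
// marked the object as unprocessed (node rebooted), re-create slices for pods
// in the Ungated state (and, per the follow-up comment, possibly Created too),
// then mark the object as processed again.
func onDaemonsetStartup(status *InstasliceStatus) error {
	if status.Processed {
		return nil // nothing to recover
	}
	for _, a := range status.Allocations {
		if a.State == "ungated" || a.State == "created" {
			if err := createSlice(a.PodUID); err != nil {
				return err
			}
		}
	}
	status.Processed = true
	return nil
}

func main() {
	st := &InstasliceStatus{
		Processed: false, // controller set this on the Node shutdown event
		Allocations: []Allocation{
			{PodUID: "pod-a", State: "ungated"},
			{PodUID: "pod-b", State: "created"},
		},
	}
	if err := onDaemonsetStartup(st); err != nil {
		fmt.Println("recovery failed:", err)
	}
}
```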

@asm582 (Contributor) commented Jan 31, 2025

Thanks @sairameshv. I think this design is in line with what the GPU Operator currently does when a node reboots. We should create slices for pods in both the Created and Ungated states, thoughts?

@rphillips (Contributor)

/retitle Handle Node Reboots

@openshift-ci bot changed the title from "Handle unexpected node reboots" to "Handle Node Reboots" on Feb 7, 2025