
Handle Node Reboots #336

Open
empovit opened this issue Dec 16, 2024 · 6 comments

@empovit (Contributor) commented Dec 16, 2024

On NVIDIA GPUs, MIG partitions are not persisted across node reboots.

If a node crashes and InstaSlice does not get a chance to gracefully de-allocate and delete the MIG partitions, it ends up listing "dangling" allocations and trying to assign them to the restarted pods according to the now-outdated InstaSlice object. In that case the workloads fail to resume.
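
To make the failure mode concrete, here is a minimal Go sketch. The types and helper names (Allocation, findDanglingAllocations) are illustrative placeholders, not InstaSlice's real API; it only shows how a restarted daemonset could notice that recorded allocations point at MIG devices that no longer exist on the GPU:

```go
package main

import "fmt"

// Allocation loosely mirrors an allocation entry recorded in the InstaSlice
// object; the field names are illustrative, not the operator's actual API.
type Allocation struct {
	PodUID  string
	MigUUID string // UUID of the MIG device assigned before the reboot
}

// findDanglingAllocations returns every recorded allocation whose MIG UUID is
// no longer present on the node, e.g. after a reboot wiped the partitions.
// liveMigUUIDs would come from enumerating the GPU (NVML / nvidia-smi) at startup.
func findDanglingAllocations(recorded []Allocation, liveMigUUIDs map[string]bool) []Allocation {
	var dangling []Allocation
	for _, a := range recorded {
		if !liveMigUUIDs[a.MigUUID] {
			dangling = append(dangling, a)
		}
	}
	return dangling
}

func main() {
	recorded := []Allocation{
		{PodUID: "pod-a", MigUUID: "MIG-1111"},
		{PodUID: "pod-b", MigUUID: "MIG-2222"},
	}
	// After a reboot the GPU reports no MIG devices at all.
	live := map[string]bool{}

	for _, a := range findDanglingAllocations(recorded, live) {
		fmt.Printf("dangling allocation: pod %s -> missing MIG device %s\n", a.PodUID, a.MigUUID)
	}
}
```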

@rphillips (Contributor)

Can the daemonset remove all the MIG partitions on boot?

@empovit (Contributor, Author) commented Dec 22, 2024

@rphillips if you mean the MIG partitions on the GPU, that is not needed because they don't survive reboots.

What I meant was deleting all the existing allocation entries in InstaSlice (precisely because the corresponding MIG partitions no longer exist), and updating all previously running MIG workloads with newly assigned partitions so that the pods can resume running. I think that should be the controller's job.

Sorry if the issue description is confusing. Feel free to suggest a better wording.
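
A minimal sketch of the controller-side cleanup described above, assuming illustrative types rather than the operator's real allocation map: every pre-reboot entry is dropped (the MIG partitions backing them are gone) and its pod is queued for a fresh partition assignment.

```go
package main

import "fmt"

// Illustrative stand-in for an InstaSlice allocation entry; not the real API.
type Allocation struct {
	PodUID  string
	MigUUID string
}

// reconcileAfterReboot drops every stale allocation entry for the rebooted node
// and returns the pods that need a new MIG partition before they can resume.
func reconcileAfterReboot(allocations map[string]Allocation) (cleaned map[string]Allocation, podsToReassign []string) {
	cleaned = map[string]Allocation{}
	for _, a := range allocations {
		podsToReassign = append(podsToReassign, a.PodUID)
	}
	return cleaned, podsToReassign
}

func main() {
	stale := map[string]Allocation{
		"alloc-1": {PodUID: "pod-a", MigUUID: "MIG-1111"},
		"alloc-2": {PodUID: "pod-b", MigUUID: "MIG-2222"},
	}
	cleaned, pods := reconcileAfterReboot(stale)
	fmt.Printf("remaining allocations: %d, pods needing new slices: %v\n", len(cleaned), pods)
}
```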

@harche self-assigned this Jan 6, 2025
@sairameshv (Member)

/assign

@sairameshv (Member)

The other day I discussed this issue with @asm582, and the following is the draft plan we came up with to resolve it:

  • The controller watches for Node events (i.e. the node shutdown).
  • It updates the InstaSlice status object's Processed field to false.
  • When the daemonset comes up after the reboot, it checks the Processed field and re-creates the slices for all pods that are in the Ungated state.

#421 may fix this issue. However, the code changes still need to be tested on kind/OCP clusters in both emulated and non-emulated modes. I would also add a few more unit tests and e2e tests to cover such scenarios. A rough sketch of the daemonset-side flow follows at the end of this comment.

The PR is in draft state; please feel free to review the approach.
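
The sketch below illustrates the daemonset-side recovery flow outlined above. The InstasliceStatus type, the Processed field layout, and the createSlice helper are hypothetical placeholders, not the actual InstaSlice code:

```go
package main

import "fmt"

// Illustrative stand-ins for the InstaSlice status and its allocations;
// the real CRD fields may differ.
type Allocation struct {
	PodUID string
	State  string // e.g. "ungated", "created"
}

type InstasliceStatus struct {
	Processed   bool
	Allocations []Allocation
}

// createSlice is a placeholder for the daemonset's real MIG-creation logic
// (NVML / nvidia-smi calls); here it only logs.
func createSlice(podUID string) error {
	fmt.Printf("re-creating MIG slice for pod %s\n", podUID)
	return nil
}

// onDaemonsetStartup models the flow from the comment above: if the controller
// marked the object as unprocessed (node rebooted), re-create slices for pods
// in the Ungated state (and, per the follow-up comment, possibly Created too),
// then mark the object as processed again.
func onDaemonsetStartup(status *InstasliceStatus) error {
	if status.Processed {
		return nil // nothing to recover
	}
	for _, a := range status.Allocations {
		if a.State == "ungated" || a.State == "created" {
			if err := createSlice(a.PodUID); err != nil {
				return err
			}
		}
	}
	status.Processed = true
	return nil
}

func main() {
	st := &InstasliceStatus{
		Processed: false, // controller set this on the Node shutdown event
		Allocations: []Allocation{
			{PodUID: "pod-a", State: "ungated"},
			{PodUID: "pod-b", State: "created"},
		},
	}
	if err := onDaemonsetStartup(st); err != nil {
		fmt.Println("recovery failed:", err)
	}
}
```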

@asm582 (Contributor) commented Jan 31, 2025

Thanks @sairameshv. I think this design is in line with what the GPU Operator currently does when a node reboots. We should create slices for pods in both the Created and Ungated states, thoughts?

@rphillips (Contributor)

/retitle Handle Node Reboots

@openshift-ci bot changed the title from "Handle unexpected node reboots" to "Handle Node Reboots" on Feb 7, 2025