Handle Node Reboots #336
Comments
Can the daemonset remove all the MIG partitions on boot?
@rphillips if you mean MIG partitions on the GPU, that's not needed, as they don't survive reboots. What I meant was deleting all the existing allocation entries in InstaSlice (precisely because the respective MIG partitions no longer exist), and updating all previously running MIG workloads to assign them new partitions so that the pods can resume running. I think that should be the controller's job. Sorry if the issue description is confusing; feel free to suggest better wording.
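For illustration, here is a rough Go sketch of that cleanup step, assuming a much-simplified view of the data: the `Allocation` and `InstasliceSpec` types, their fields, and `clearDanglingAllocations` are hypothetical stand-ins, not the actual InstaSlice CRD or controller code.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the InstaSlice CRD types;
// the real field names and states in the project may differ.
type Allocation struct {
	PodUID   string
	NodeName string
	MIGUUID  string
	State    string // e.g. "created", "ungated", "deleted"
}

type InstasliceSpec struct {
	Allocations map[string]Allocation // keyed by pod UID
}

// clearDanglingAllocations drops allocation entries that reference a node
// that has rebooted, i.e. whose MIG partitions no longer exist, and returns
// the removed entries so the controller can assign new partitions to those
// workloads.
func clearDanglingAllocations(spec *InstasliceSpec, rebootedNodes map[string]bool) []Allocation {
	var dangling []Allocation
	for uid, alloc := range spec.Allocations {
		if rebootedNodes[alloc.NodeName] {
			dangling = append(dangling, alloc)
			delete(spec.Allocations, uid)
		}
	}
	return dangling
}

func main() {
	spec := &InstasliceSpec{Allocations: map[string]Allocation{
		"pod-a": {PodUID: "pod-a", NodeName: "node-1", MIGUUID: "mig-1", State: "created"},
		"pod-b": {PodUID: "pod-b", NodeName: "node-2", MIGUUID: "mig-2", State: "ungated"},
	}}
	// Suppose node-1 rebooted: its MIG partitions are gone, so its
	// allocations are stale and must be recreated by the controller.
	dangling := clearDanglingAllocations(spec, map[string]bool{"node-1": true})
	fmt.Printf("removed %d dangling allocation(s), %d remain\n", len(dangling), len(spec.Allocations))
}
```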
/assign
The other day, I discussed this issue with @asm582, and the following is the draft we came up with to resolve it.
#421 may fix this issue. The PR is in a draft state; please feel free to review the approach.
Thanks @sairameshv, I think this design is in line with what the GPU operator does when the node reboots, for now. We should create slices for both the created and ungated states, thoughts?
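To make that recreation step concrete, a minimal sketch under the same caveat: `Allocation`, the state names, and `slicesToRecreate` are simplified, hypothetical stand-ins rather than the project's real types or states.

```go
package main

import "fmt"

// Simplified, hypothetical allocation record; the real InstaSlice CRD
// fields and state names may differ.
type Allocation struct {
	PodUID  string
	MIGUUID string
	State   string // e.g. "created", "ungated", "deleted"
}

// slicesToRecreate picks the allocations whose MIG slices should be rebuilt
// after a reboot: those in either the "created" or the "ungated" state,
// per the discussion above.
func slicesToRecreate(allocs []Allocation) []Allocation {
	var out []Allocation
	for _, a := range allocs {
		if a.State == "created" || a.State == "ungated" {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	allocs := []Allocation{
		{PodUID: "pod-a", MIGUUID: "mig-1", State: "created"},
		{PodUID: "pod-b", MIGUUID: "mig-2", State: "ungated"},
		{PodUID: "pod-c", MIGUUID: "mig-3", State: "deleted"},
	}
	fmt.Println(len(slicesToRecreate(allocs))) // 2
}
```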
/retitle Handle Node Reboots
On NVIDIA GPUs, MIG partitions are not persisted between node reboots.
If a node crashes and InstaSlice does not have a chance to gracefully deallocate and delete MIG partitions, it ends up listing "dangling" allocations and trying to assign them to restarted pods according to the outdated InstaSlice object. In that case, the workloads fail to resume running.
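For illustration only, one way to detect that a node rebooted without a graceful cleanup is to compare the boot ID Kubernetes reports in `node.Status.NodeInfo.BootID` (it changes on every reboot) against a value recorded earlier. The sketch below assumes a hypothetical `lastSeenBootID` that the operator would persist somewhere, e.g. in the InstaSlice object's status; this is not existing InstaSlice code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// nodeRebooted reports whether a node's boot ID changed since it was last
// recorded. Kubernetes surfaces the machine boot ID in
// node.Status.NodeInfo.BootID; a mismatch means the node rebooted, so any
// MIG partitions created before the reboot no longer exist and the
// corresponding allocations are stale.
func nodeRebooted(node *corev1.Node, lastSeenBootID string) bool {
	return lastSeenBootID != "" && node.Status.NodeInfo.BootID != lastSeenBootID
}

func main() {
	node := &corev1.Node{}
	node.Status.NodeInfo.BootID = "b2f1-example-boot-id"
	// true: the boot ID changed, so this node's allocations need cleanup.
	fmt.Println(nodeRebooted(node, "a9e0-previous-boot-id"))
}
```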