OCPNODE-2882: Handle unexpected node reboots #421

sairameshv · 2025-01-31T16:54:49Z

Potentially fixes #336

- make test and make lint pass locally.
- make test-e2e-kind-emulated passes with some tweaks to the config files.

/hold

until:
- a few more e2e test scenarios are added
- the change is thoroughly tested on a real GPU, covering all the scenarios

Signed-off-by: Sai Ramesh Vanka <[email protected]>

openshift-ci · 2025-01-31T16:55:15Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2025-01-31T16:56:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sairameshv

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [sairameshv]
~~internal/controller/daemonset/OWNERS~~ [sairameshv]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

harche · 2025-01-31T17:02:29Z

@sairameshv not sure if I got this correctly, but reading this comment #336 (comment) I get the impression that injecting code into the daemonset to delete all allocations during the daemonset bootstrap would fix the issue. Is that right?

harche · 2025-01-31T17:06:23Z

by that I meant, what happens if you fetch the corresponding instaslice object and delete all allocations somewhere here, https://github.com/openshift/instaslice-operator/blob/main/internal/controller/daemonset/instaslice_daemonset.go#L98 ?

sairameshv · 2025-01-31T17:16:11Z

> by that I meant, what happens if you fetch the corresponding instaslice object and delete all allocations somewhere here, https://github.com/openshift/instaslice-operator/blob/main/internal/controller/daemonset/instaslice_daemonset.go#L98 ?

If we delete all the previous allocations when a daemonset comes up after a reboot, can we gate the pods again and create the slices? Any thoughts @asm582?

asm582 · 2025-01-31T17:35:01Z

> > by that I meant, what happens if you fetch the corresponding instaslice object and delete all allocations somewhere here, https://github.com/openshift/instaslice-operator/blob/main/internal/controller/daemonset/instaslice_daemonset.go#L98 ?
>
> If we delete all the previous allocations when a daemonset comes up after a reboot, can we gate the pods again and create the slices? Any thoughts @asm582?

Good point, all partitions will be deleted by the hardware after a reboot. I think if an operator or deployment is used to spawn pods, then we will always have new ungated pods. We need to come up with a solution for plain pods.
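
The idea discussed above (drop every recorded allocation when the daemonset starts after a reboot) could be sketched roughly as follows. This is only an illustration: the `Instaslice` and `Allocation` types and the `clearAllocations` helper are hypothetical stand-ins, not the operator's real API. The helper returns the names of the pods whose allocations were dropped, on the assumption that a caller would need that list to decide what to do with plain pods that no controller recreates.

```go
package main

import (
	"fmt"
	"sort"
)

// Allocation is a hypothetical record of a GPU slice handed to a pod.
type Allocation struct {
	PodName string
	Profile string
}

// Instaslice is a hypothetical, trimmed-down stand-in for the per-node object.
// Allocations is keyed by pod UID.
type Instaslice struct {
	Allocations map[string]Allocation
}

// clearAllocations drops every recorded allocation and returns the sorted
// names of the pods that owned one, so the caller can handle them
// (for example, plain pods that will not be recreated).
func clearAllocations(obj *Instaslice) []string {
	pods := make([]string, 0, len(obj.Allocations))
	for _, a := range obj.Allocations {
		pods = append(pods, a.PodName)
	}
	sort.Strings(pods)
	obj.Allocations = map[string]Allocation{}
	return pods
}

func main() {
	obj := &Instaslice{Allocations: map[string]Allocation{
		"uid-2": {PodName: "pod-b", Profile: "1g.5gb"},
		"uid-1": {PodName: "pod-a", Profile: "2g.10gb"},
	}}
	affected := clearAllocations(obj)
	fmt.Println(affected)             // pods whose slices no longer exist
	fmt.Println(len(obj.Allocations)) // no allocations remain
}
```

In a real controller the cleared object would still have to be written back to the API server, and the open question from this comment (what to do with plain pods) is left to the caller.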

openshift-ci-robot · 2025-02-03T12:52:55Z

@sairameshv: This pull request references OCPNODE-2882 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Potentially
Fixes #336

make test, make lint, pass locally
make test-e2e-kind-emulated passes with some tweaks to the config files.

/hold

until a few more e2e test scenarios are added

thoroughly tested on the real GPU covering all the scenarios

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

rphillips · 2025-02-03T17:53:27Z

internal/controller/daemonset/instaslice_daemonset.go

+		return ctrl.Result{RequeueAfter: controller.Requeue1sDelay}, nil
+	}
+	for _, cond := range node.Status.Conditions {
+		if cond.Type == v1.NodeReady && cond.Status == v1.ConditionFalse {


A node can go NotReady and not be rebooted... We might have to look at the bootID changing at startup.

sairameshv · 2025-02-04

I know the bootID can be fetched from the Node status. How can we determine whether it has changed? Do we have to persist boot IDs somewhere?

rphillips · 2025-02-04

The status on the Instaslice object might need to persist it.
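
A minimal sketch of the boot-ID comparison suggested in this review thread, assuming the last observed boot ID is persisted in the Instaslice status. In Kubernetes the current value is exposed as `node.Status.NodeInfo.BootID`; the `InstasliceStatus` type, its `LastSeenBootID` field, and the `rebootDetected` helper below are hypothetical names used only for illustration.

```go
package main

import "fmt"

// InstasliceStatus is a hypothetical status block that persists the boot ID
// last observed for the node (the field name is an assumption).
type InstasliceStatus struct {
	LastSeenBootID string
}

// rebootDetected compares the node's current boot ID with the persisted one.
// It returns true only when a previously recorded ID differs from the current
// one, and it updates the persisted value whenever a new ID is observed.
func rebootDetected(status *InstasliceStatus, currentBootID string) bool {
	if currentBootID == "" {
		return false // boot ID not reported yet; nothing to compare
	}
	if status.LastSeenBootID == "" {
		status.LastSeenBootID = currentBootID // first observation, just record it
		return false
	}
	if status.LastSeenBootID == currentBootID {
		return false // same boot, e.g. the node only went NotReady
	}
	status.LastSeenBootID = currentBootID
	return true
}

func main() {
	status := &InstasliceStatus{}
	fmt.Println(rebootDetected(status, "boot-a")) // false: first observation
	fmt.Println(rebootDetected(status, "boot-a")) // false: NotReady flap, same boot
	fmt.Println(rebootDetected(status, "boot-b")) // true: boot ID changed
}
```

Unlike the `NodeReady` condition check in the diff above, this only fires when the boot ID actually changes. Treating the first observation as "no reboot" is a design choice of the sketch; a real implementation might prefer to reconcile allocations in that case as well.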

sairameshv · 2025-02-14T18:56:57Z

Closing this PR in favor of #433

Handle unexpected node reboots

d21d9f4

Signed-off-by: Sai Ramesh Vanka <[email protected]>

openshift-ci bot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and do-not-merge/hold (indicates that a PR should not merge because someone has issued a /hold command) labels Jan 31, 2025

openshift-ci bot added the size/S (denotes a PR that changes 10-29 lines, ignoring generated files) label Jan 31, 2025

openshift-ci bot added the approved (indicates a PR has been approved by an approver from all required OWNERS files) label Jan 31, 2025

sairameshv mentioned this pull request Jan 31, 2025

Handle Node Reboots #336

Open

sairameshv changed the title ~~Handle unexpected node reboots~~ OCPNODE-2882: Handle unexpected node reboots Feb 3, 2025

openshift-ci-robot added the jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) label Feb 3, 2025

rphillips reviewed Feb 3, 2025


sairameshv mentioned this pull request Feb 6, 2025

API refactor to improve resource representation and state transitions #418

Merged

sairameshv closed this Feb 14, 2025

sairameshv deleted the node_reboot branch February 14, 2025 18:57
