Health check the OVS process and restart if it dies #16742
Conversation
Force-pushed from 2f8801b to 24aa622
Similar in spirit to #16740, where the node process is responsible for checking its dependencies rather than systemd, which means we can tolerate running in pods more easily and reduce required node configuration.
pkg/network/node/node.go
Outdated
return fmt.Errorf("detected network plugin mismatch between OpenShift node(%q) and master(%q)", pluginName, clusterNetwork.PluginName) | ||
} else { | ||
// Do not return error in this case | ||
glog.Warningf(`either there is network plugin mismatch between OpenShift node(%q) and master or OpenShift master is running an older version where we did not persist plugin name`, pluginName) |
They'd have to be running a 3.2 or earlier master for ClusterNetwork.PluginName to be unset. There's no way we'd support a 3.7 node against a 3.2 master even during an upgrade, right? So we could just drop the inner if here now.
Yeah, dead. Will remove.
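For illustration, a minimal sketch of what the check could look like once the legacy branch is dropped, assuming every supported master persists ClusterNetwork.PluginName; validatePluginName is a hypothetical helper for this sketch, not the actual code in node.go:

package node

import "fmt"

// validatePluginName illustrates the check with the legacy "unset PluginName"
// branch removed: any mismatch between node and master is simply an error.
func validatePluginName(nodePlugin, masterPlugin string) error {
	if nodePlugin != masterPlugin {
		return fmt.Errorf("detected network plugin mismatch between OpenShift node(%q) and master(%q)", nodePlugin, masterPlugin)
	}
	return nil
}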
@@ -13,9 +13,10 @@ import (
 	"sync"
 	"time"

-	log "github.com/golang/glog"
+	"github.com/golang/glog"
This makes it so that this commit doesn't compile... would be better to put the log -> glog commit first and move this change there.
Cool. A few comments. (Oops, already submitted some as individual comments)
	defer c.Close()

	err = c.WaitForDisconnect()
	utilruntime.HandleError(fmt.Errorf("SDN healthcheck disconnected from OVS server: %v", err))
I don't know anything about the OVS raw protocol, but if it eventually times out idle connections then this might result in spurious errors in the logs.
It's possible to configure OVS to do timeouts, but at least out of the box on our deployed systems it does not. I also have the 5s disconnect. I think if in practice we see this error showing up, we would increase the timeout on connections significantly and still be OK. It's effectively a dead man's switch (and it works really well as one).
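To make the dead-man-switch idea concrete, here is a minimal standalone sketch using only the Go standard library; the real code goes through the OVS client's WaitForDisconnect, and the function name, the 5-second dial timeout, and the fatal-exit handling below are assumptions:

package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

// waitForOVSDisconnect dials the OVSDB unix socket and blocks until the
// connection drops, which is what happens when ovsdb-server exits.
func waitForOVSDisconnect(network, addr string) error {
	c, err := net.DialTimeout(network, addr, 5*time.Second)
	if err != nil {
		return fmt.Errorf("SDN healthcheck could not reach OVS server: %v", err)
	}
	defer c.Close()

	// Any data the server does send is discarded; Read only returns an error
	// once the connection is closed or reset, so this acts as a dead man's switch.
	buf := make([]byte, 4096)
	for {
		if _, err := c.Read(buf); err != nil {
			return fmt.Errorf("SDN healthcheck disconnected from OVS server: %v", err)
		}
	}
}

func main() {
	// Treat a disconnect as fatal so that systemd or the pod restart policy
	// brings the whole node process back up.
	log.Fatal(waitForOVSDisconnect("unix", "/var/run/openvswitch/db.sock"))
}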
pkg/network/node/sdn_controller.go
Outdated
// TODO: make it possible to safely reestablish node configuration after restart
// If OVS goes down and fails the health check, restart the entire process
healthFn := func() bool { return plugin.alreadySetUp(gwCIDR, clusterNetworkCIDRs) }
runOVSHealthCheck("unix", "/var/run/openvswitch/db.sock", healthFn)
Maybe make "unix"
and "/var/run/openvswitch/db.sock"
be constants in healthcheck.go
k
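For example, the constants could live in healthcheck.go along these lines; the names here are made up, while the values are the ones currently passed at the call site:

package node

// Possible constants for the OVSDB dial parameters, as suggested above.
// The names are assumptions; the values match today's call site.
const (
	ovsDialDefaultNetwork = "unix"
	ovsDialDefaultAddress = "/var/run/openvswitch/db.sock"
)

The call site would then read runOVSHealthCheck(ovsDialDefaultNetwork, ovsDialDefaultAddress, healthFn).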
Prevents races when the all-in-one is used with multi-tenant SDN
Force-pushed from 24aa622 to 189e581
updated
A periodic background process watches for when OVS is reset to the default state and causes the entire process to restart. This avoids the need to order the SDN process with OVS, and makes it easier to run the process in a pod. In the future it should be possible to avoid restarting the process to perform this check.
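As a rough illustration of that flow (not the actual implementation), a background goroutine could poll a health function and exit the process when the SDN setup is found to be gone, leaving the restart to systemd or the pod's restart policy; the names and the 30-second interval below are assumptions:

package main

import (
	"log"
	"os"
	"time"
)

// runPeriodicSetupCheck polls healthFn in the background and exits the process
// if the check fails, e.g. because OVS was restarted and reset to its default
// state. Restarting is left to the supervisor, which re-runs the SDN setup.
func runPeriodicSetupCheck(healthFn func() bool, interval time.Duration) {
	go func() {
		for range time.Tick(interval) {
			if !healthFn() {
				log.Println("SDN healthcheck detected that OVS setup was lost, restarting")
				os.Exit(1)
			}
		}
	}()
}

func main() {
	// Hypothetical stand-in for plugin.alreadySetUp(gwCIDR, clusterNetworkCIDRs).
	alreadySetUp := func() bool { return true }
	runPeriodicSetupCheck(alreadySetUp, 30*time.Second)
	select {} // the real node process keeps running its other loops here
}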
Force-pushed from 189e581 to 572d44b
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, smarterclayton

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:
Automatic merge from submit-queue (batch tested with PRs 16737, 16638, 16742, 16765, 16711).
Reorganize the existing setup code to perform a periodic background check on the state of the OVS database. If the SDN setup is lost, force the node/network processes to restart. Use the JSON-RPC endpoint to perform a few simple status checks and detect failure quickly. This reuses our existing health check code, which does not appear to be a performance issue when checked periodically.
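For a sense of what such a status check can look like, here is a standalone sketch that dials the OVSDB socket and issues a JSON-RPC echo request (defined in RFC 7047); the actual PR reuses the existing OVS health check code, and the function name and 5-second deadline below are assumptions:

package main

import (
	"encoding/json"
	"fmt"
	"net"
	"time"
)

// probeOVSDB dials the OVSDB unix socket, sends a JSON-RPC "echo" request, and
// treats any dial, write, read, or protocol error as an unhealthy server.
func probeOVSDB(network, addr string) error {
	c, err := net.DialTimeout(network, addr, 5*time.Second)
	if err != nil {
		return fmt.Errorf("could not connect to OVS server: %v", err)
	}
	defer c.Close()
	_ = c.SetDeadline(time.Now().Add(5 * time.Second))

	req := map[string]interface{}{"method": "echo", "params": []string{"healthcheck"}, "id": "healthcheck"}
	if err := json.NewEncoder(c).Encode(req); err != nil {
		return fmt.Errorf("failed to send echo request: %v", err)
	}

	var resp struct {
		Result interface{} `json:"result"`
		Error  interface{} `json:"error"`
		ID     interface{} `json:"id"`
	}
	if err := json.NewDecoder(c).Decode(&resp); err != nil {
		return fmt.Errorf("failed to read echo response: %v", err)
	}
	if resp.Error != nil {
		return fmt.Errorf("OVS server returned error: %v", resp.Error)
	}
	return nil
}

func main() {
	if err := probeOVSDB("unix", "/var/run/openvswitch/db.sock"); err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	fmt.Println("healthy")
}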
Tested: node waiting for OVS to start; letting the node start and then stopping OVS, which the node detects immediately.
Fixes #16630
@openshift/sig-networking