Health check the OVS process and restart if it dies #16742
Conversation
Force-pushed from 2f8801b to 24aa622
Similar in spirit to #16740, where the node process is responsible for checking its dependencies rather than systemd, which means we can tolerate running in pods more easily and reduce required node configuration.
pkg/network/node/node.go
Outdated
return fmt.Errorf("detected network plugin mismatch between OpenShift node(%q) and master(%q)", pluginName, clusterNetwork.PluginName) | ||
} else { | ||
// Do not return error in this case | ||
glog.Warningf(`either there is network plugin mismatch between OpenShift node(%q) and master or OpenShift master is running an older version where we did not persist plugin name`, pluginName) |
They'd have to be running a 3.2 or earlier master for ClusterNetwork.PluginName to be unset. There's no way we'd support a 3.7 node against a 3.2 master even during an upgrade, right? So we could just drop the inner if here now.
Yeah, dead. Will remove.
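For illustration, a minimal sketch of what the check could look like once the legacy branch is dropped, assuming every supported master persists ClusterNetwork.PluginName; validatePluginName is a hypothetical helper for this sketch, not the actual code in node.go:

package node

import "fmt"

// validatePluginName illustrates the check with the legacy "unset PluginName"
// branch removed: any mismatch between node and master is simply an error.
func validatePluginName(nodePlugin, masterPlugin string) error {
	if nodePlugin != masterPlugin {
		return fmt.Errorf("detected network plugin mismatch between OpenShift node(%q) and master(%q)", nodePlugin, masterPlugin)
	}
	return nil
}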
@@ -13,9 +13,10 @@ import (
 	"sync"
 	"time"

-	log "github.com/golang/glog"
+	"github.com/golang/glog"
This makes it so that this commit doesn't compile... would be better to put the log -> glog commit first and move this change there.
Cool. A few comments. (Oops, already submitted some as individual comments)
	defer c.Close()

	err = c.WaitForDisconnect()
	utilruntime.HandleError(fmt.Errorf("SDN healthcheck disconnected from OVS server: %v", err))
I don't know anything about the OVS raw protocol, but if it eventually times out idle connections then this might result in spurious errors in the logs.
It's possible to configure OVS to do timeouts, but at least out of the box on our deployed systems it does not. I also have the 5s disconnect. I think if in practice we see this error showing up, we would increase the timeout on connections significantly and still be OK. It's effectively a dead man's switch (and it works really well as one).
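To make the dead-man-switch idea concrete, here is a minimal standalone sketch using only the Go standard library; the real code goes through the OVS client's WaitForDisconnect, and the function name, the 5-second dial timeout, and the fatal-exit handling below are assumptions:

package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

// waitForOVSDisconnect dials the OVSDB unix socket and blocks until the
// connection drops, which is what happens when ovsdb-server exits.
func waitForOVSDisconnect(network, addr string) error {
	c, err := net.DialTimeout(network, addr, 5*time.Second)
	if err != nil {
		return fmt.Errorf("SDN healthcheck could not reach OVS server: %v", err)
	}
	defer c.Close()

	// Any data the server does send is discarded; Read only returns an error
	// once the connection is closed or reset, so this acts as a dead man's switch.
	buf := make([]byte, 4096)
	for {
		if _, err := c.Read(buf); err != nil {
			return fmt.Errorf("SDN healthcheck disconnected from OVS server: %v", err)
		}
	}
}

func main() {
	// Treat a disconnect as fatal so that systemd or the pod restart policy
	// brings the whole node process back up.
	log.Fatal(waitForOVSDisconnect("unix", "/var/run/openvswitch/db.sock"))
}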
pkg/network/node/sdn_controller.go
Outdated
// TODO: make it possible to safely reestablish node configuration after restart
// If OVS goes down and fails the health check, restart the entire process
healthFn := func() bool { return plugin.alreadySetUp(gwCIDR, clusterNetworkCIDRs) }
runOVSHealthCheck("unix", "/var/run/openvswitch/db.sock", healthFn)
Maybe make "unix"
and "/var/run/openvswitch/db.sock"
be constants in healthcheck.go
k
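For example, the constants could live in healthcheck.go along these lines; the names here are made up, while the values are the ones currently passed at the call site:

package node

// Possible constants for the OVSDB dial parameters, as suggested above.
// The names are assumptions; the values match today's call site.
const (
	ovsDialDefaultNetwork = "unix"
	ovsDialDefaultAddress = "/var/run/openvswitch/db.sock"
)

The call site would then read runOVSHealthCheck(ovsDialDefaultNetwork, ovsDialDefaultAddress, healthFn).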
Prevents races when the all-in-one is used with multi-tenant SDN
Force-pushed from 24aa622 to 189e581
updated
A periodic background process watches for when OVS is reset to the default state and causes the entire process to restart. This avoids the need to order the SDN process with OVS, and makes it easier to run the process in a pod. In the future it should be possible to avoid restarting the process to perform this check.
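As a rough illustration of that flow (not the actual implementation), a background goroutine could poll a health function and exit the process when the SDN setup is found to be gone, leaving the restart to systemd or the pod's restart policy; the names and the 30-second interval below are assumptions:

package main

import (
	"log"
	"os"
	"time"
)

// runPeriodicSetupCheck polls healthFn in the background and exits the process
// if the check fails, e.g. because OVS was restarted and reset to its default
// state. Restarting is left to the supervisor, which re-runs the SDN setup.
func runPeriodicSetupCheck(healthFn func() bool, interval time.Duration) {
	go func() {
		for range time.Tick(interval) {
			if !healthFn() {
				log.Println("SDN healthcheck detected that OVS setup was lost, restarting")
				os.Exit(1)
			}
		}
	}()
}

func main() {
	// Hypothetical stand-in for plugin.alreadySetUp(gwCIDR, clusterNetworkCIDRs).
	alreadySetUp := func() bool { return true }
	runPeriodicSetupCheck(alreadySetUp, 30*time.Second)
	select {} // the real node process keeps running its other loops here
}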
Force-pushed from 189e581 to 572d44b
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, smarterclayton

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:
Automatic merge from submit-queue (batch tested with PRs 16737, 16638, 16742, 16765, 16711).
Reorganize the existing setup code to perform a periodic background check on the state of the OVS database. If the SDN setup is lost, force the node/network processes to restart. Use the JSON-RPC endpoint to perform a few simple status checks and detect failure quickly. This reuses our existing health check code, which does not appear to be a performance issue when checked periodically.
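For a sense of what such a status check can look like, here is a standalone sketch that dials the OVSDB socket and issues a JSON-RPC echo request (defined in RFC 7047); the actual PR reuses the existing OVS health check code, and the function name and 5-second deadline below are assumptions:

package main

import (
	"encoding/json"
	"fmt"
	"net"
	"time"
)

// probeOVSDB dials the OVSDB unix socket, sends a JSON-RPC "echo" request, and
// treats any dial, write, read, or protocol error as an unhealthy server.
func probeOVSDB(network, addr string) error {
	c, err := net.DialTimeout(network, addr, 5*time.Second)
	if err != nil {
		return fmt.Errorf("could not connect to OVS server: %v", err)
	}
	defer c.Close()
	_ = c.SetDeadline(time.Now().Add(5 * time.Second))

	req := map[string]interface{}{"method": "echo", "params": []string{"healthcheck"}, "id": "healthcheck"}
	if err := json.NewEncoder(c).Encode(req); err != nil {
		return fmt.Errorf("failed to send echo request: %v", err)
	}

	var resp struct {
		Result interface{} `json:"result"`
		Error  interface{} `json:"error"`
		ID     interface{} `json:"id"`
	}
	if err := json.NewDecoder(c).Decode(&resp); err != nil {
		return fmt.Errorf("failed to read echo response: %v", err)
	}
	if resp.Error != nil {
		return fmt.Errorf("OVS server returned error: %v", resp.Error)
	}
	return nil
}

func main() {
	if err := probeOVSDB("unix", "/var/run/openvswitch/db.sock"); err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	fmt.Println("healthy")
}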
Tested: node waiting for OVS to start; letting the node start and then stopping OVS, which the node detects immediately.
Fixes #16630
@openshift/sig-networking