
sdn: miscellaneous fixes after the CNI merge #11613

Merged: 5 commits into openshift:master on Nov 2, 2016

Conversation

@dcbw (Contributor) commented Oct 27, 2016

The podManager must be started (so it can process requests) before we try to call Update on pods at startup if networking has changed.

Fixes: cf69a41
Fixes bug 1388856
Fixes bug 1389717
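
For context, a minimal sketch of the startup ordering this PR enforces. podManager.Start, cniserver.CNIServerSocketPath, and UpdatePod appear in the diff and discussion below; the networkChanged flag and runningPods loop are illustrative, not the exact origin code:

    // In OsdnNode.Start(): bring the pod manager up first so its request
    // loop is running, then replay updates for pods whose networking changed.
    if err := node.podManager.Start(cniserver.CNIServerSocketPath); err != nil {
        return err
    }
    if networkChanged {
        for _, pod := range runningPods {
            // each update is queued through the now-running pod manager
            if err := node.UpdatePod(pod); err != nil {
                return err
            }
        }
    }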

@dcbw (Contributor, Author) commented Oct 27, 2016

@eparis @openshift/networking @knobunc

@danwinship (Contributor):

LGTM. Did you mean to link to a bug rather than a commit id above?

@dcbw (Contributor, Author) commented Oct 27, 2016

@danwinship meant the commit ID, though I could do the github issue too if you like.

@pravisankar:

LGTM

@eparis (Member) commented Oct 27, 2016

[test]

@eparis (Member) commented Oct 27, 2016

[testextended][extended:networking]

@@ -207,6 +207,9 @@ func (node *OsdnNode) Start() error {
 	if err != nil {
 		return err
 	}
+	if err := node.podManager.Start(cniserver.CNIServerSocketPath); err != nil {
Contributor:

Is the listener always ready once this returns?

@dcbw (Contributor, Author):

Yeah, processCNIRequests() will be running in a goroutine at this point, handling subsequent requests from UpdatePod and kubelet.

But there's one more part to this bug that I'll push as a second commit and get another round of LGTM. @DirectXMan12 found the issues yesterday when running a combined master-and-node process; this PR should fix them.
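
A hedged sketch of the guarantee being described. processCNIRequests and the socket path are named in this PR; NewCNIServer and handleCNIRequest are guesses at the surrounding wiring, not confirmed code:

    // Start spins up the request-processing goroutine before the CNI
    // server begins accepting connections, so both UpdatePod calls and
    // kubelet CNI requests submitted afterward get handled.
    func (m *podManager) Start(socketPath string) error {
        go m.processCNIRequests()
        m.cniServer = cniserver.NewCNIServer(socketPath)
        return m.cniServer.Start(m.handleCNIRequest)
    }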

Contributor:

is the listener established synchronously before the goroutine is spawned?

Contributor:

yup, found l, err := net.Listen("unix", s.path) in CNIServer#Start before the Serve goroutine is kicked off
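
In other words, a simplified sketch of CNIServer.Start as observed above; the HTTP handler wiring is elided:

    // The unix socket is bound synchronously, so the listener already
    // exists when Start() returns; only the accept loop is a goroutine.
    func (s *CNIServer) Start() error {
        l, err := net.Listen("unix", s.path)
        if err != nil {
            return fmt.Errorf("failed to listen on CNI socket: %v", err)
        }
        go s.Serve(l) // error handling is added later in this PR
        return nil
    }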

Contributor:

if go s.Serve(l) returns an error, nothing will log/handle/restart/exit... if it panics, the process will exit. Do we need to put handling around the Serve call inside the goroutine?

@dcbw (Contributor, Author):

@liggitt in this specific case, the CNI listener/server isn't actually in the hot path, since the Update request that this PR fixes doesn't go through the listener. We just need to make sure that processCNIRequests() is running.

@dcbw (Contributor, Author) commented Oct 27, 2016

@danwinship @pravisankar @knobunc PTAL, thanks; one more commit to fix @DirectXMan12's issue from yesterday.

@dcbw (Contributor, Author) commented Oct 27, 2016

@DirectXMan12 can you test these 2 commits with your setup?

Commit: "We need the kubelet network Host object before we can update pods, so wait for it."
@dcbw force-pushed the sdn-start-pod-manager-earlier branch from 15567df to 75252d7 on October 27, 2016 14:26
@knobunc (Contributor) left a review:

LGTM

@dcbw (Contributor, Author) commented Oct 27, 2016

> if go s.Serve(l) returns an error, nothing will log/handle/restart/exit...

Good point, will look at it.

@danwinship (Contributor):

Looks right. plugin.go:Init() could return an error without closing kubeletInitReady, but the node will eventually hit a glog.Fatal() in that case, so it doesn't matter.
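
A hedged sketch of the handshake being described; kubeletInitReady and plugin.go:Init() are named above, while the type and parameter names are illustrative:

    // kubeletInitReady is closed once the kubelet hands the plugin its
    // network Host in Init(); pod updates block until that happens.
    var kubeletInitReady = make(chan struct{})

    func (plugin *cniPlugin) Init(host network.Host) error {
        plugin.host = host
        close(kubeletInitReady) // signal that the Host is usable
        return nil
    }

    // elsewhere, before updating pods at startup:
    <-kubeletInitReady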

@dcbw (Contributor, Author) commented Oct 28, 2016

> if go s.Serve(l) returns an error, nothing will log/handle/restart/exit...

Good point, will look at it.

@liggitt Handled; @danwinship and I looked at it yesterday and concluded that running Serve() forever was OK since we're using Unix domain sockets and thus most of the reasons Serve()/Accept() will return aren't really valid.

@dcbw changed the title from "sdn: start pod manager before trying to update if VNIDs have changed" to "sdn: miscellaneous fixes after the CNI merge" on Oct 28, 2016
-	go s.Serve(l)
+	go utilwait.Forever(func() {
+		if err := s.Serve(l); err != nil {
+			glog.Warningf("CNI server Serve() failed: %v", err)
Contributor:

based on other sdn code, I think the preferred behavior is to call (kubernetes/pkg/util/)runtime.HandleError()

@dcbw (Contributor, Author):

Fixed.
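
The plausible shape of the fix, combining the diff above with the suggested handler (the exact call site may differ):

    // keep serving forever; report Serve()/Accept() errors through the
    // shared error handler instead of logging a warning and giving up
    go utilwait.Forever(func() {
        if err := s.Serve(l); err != nil {
            utilruntime.HandleError(fmt.Errorf("CNI server Serve() failed: %v", err))
        }
    }, 0)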

dcbw added 2 commits October 31, 2016 09:15

First commit:
The only reason Serve() will return an error is from Accept(),
and most of the reasons Accept() (and accept(2)) will return errors
are relevant for network sockets, not for unix domain sockets,
which is what the SDN code uses. So if an error does happen,
like broken connections or something, just keep serving.

Second commit:
Due to a misguided attempt to harmonize addresses and routes
checking in alreadySetUp(). Turns out addresses can simply be
checked for equality since they are returned from GetAddresses()
as plain CIDRs, but routes need the extra " " in the check because
the entire '/sbin/ip route' line is returned.

Fixes: openshift#11082
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1388856
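
A sketch of the distinction the second commit message draws; the function shape and variable names are illustrative, not the actual alreadySetUp() code:

    // Addresses come back from GetAddresses() as plain CIDRs, so exact
    // equality works; routes come back as whole '/sbin/ip route' output
    // lines, so match the CIDR plus a trailing space within the line.
    func alreadySetUp(localSubnetCIDR string, addrs, routes []string) bool {
        foundAddr, foundRoute := false, false
        for _, addr := range addrs {
            if addr == localSubnetCIDR {
                foundAddr = true
            }
        }
        for _, route := range routes {
            if strings.Contains(route, localSubnetCIDR+" ") {
                foundRoute = true
            }
        }
        return foundAddr && foundRoute
    }
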
@dcbw force-pushed the sdn-start-pod-manager-earlier branch from deaf7b0 to 162c0be on October 31, 2016 14:15
@dcbw (Contributor, Author) commented Oct 31, 2016

Extended networking test failure seems like a dind or docker issue:

Oct 31 11:16:25.304: INFO: At 2016-10-31 11:15:44 -0400 EDT - event for flow-checkl35gu: {kubelet nettest-node-1} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "flow-check" with RunContainerError: "runContainer: Error response from daemon: Cannot start container ec598cdedcc673560b770debe1b1d705e3fc0751182dcc82b0ba5c740ec846f1: Path /var/run/openvswitch/br0.mgmt is mounted on /run but it is not a shared or slave mount."

@knobunc (Contributor) commented Nov 1, 2016

@danw @pravisankar PTAL

@dcbw (Contributor, Author) commented Nov 1, 2016

re-[testextended][extended:networking]

@dcbw (Contributor, Author) commented Nov 1, 2016

re-[testextended][extended:networking] flake is #11707

@marun (Contributor) commented Nov 1, 2016

Consider adding mount --make-shared /run after https://github.com/openshift/origin/blob/master/images/dind/dind-setup.sh#L44 and see if the test passes.

@dcbw (Contributor, Author) commented Nov 1, 2016

re-[testextended][extended:networking] flake is #11707

@openshift-bot (Contributor):

Evaluated for origin testextended up to a311f01

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/testextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/720/) (Base Commit: 8ecb3f5) (Extended Tests: networking)

@knobunc (Contributor) commented Nov 2, 2016

[merge]

@openshift-bot (Contributor) commented Nov 2, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11027/) (Image: devenv-rhel7_5305)

@openshift-bot (Contributor):

Evaluated for origin merge up to a311f01

@dcbw (Contributor, Author) commented Nov 2, 2016

Extended net tests failed with:

SetupNetworkError: "Failed to setup network for pod "service-wget_e2e-tests-net-services2-sr3qi(90a873e7-a08c-11e6-adad-0242d009400c)" using network plugins "cni": CNI request failed with status 400: 'Failed to ensure that nat chain OUTPUT jumps to KUBE-HOSTPORTS: error checking rule: exit status 2: iptables v1.4.21: Couldn't load target `KUBE-HOSTPORTS':No such file or directory\n\nTry `iptables -h' or 'iptables --help' for more information.\n\n'; Skipping pod"

but that's not related to this PR, that's coming from pkg/kubelet/network/hostport/hostport.go:

    glog.V(4).Info("Ensuring kubelet hostport chains")
    // Ensure kubeHostportChain
    if _, err := h.iptables.EnsureChain(utiliptables.TableNAT, kubeHostportsChain); err != nil {
        return fmt.Errorf("Failed to ensure that %s chain %s exists: %v", utiliptables.TableNAT, kubeHostportsChain, err)
    }

@danwinship (Contributor):

> but that's not related to this PR, that's coming from pkg/kubelet/network/hostport/hostport.go:

actually, it's coming from the EnsureRule() call a few lines down from that EnsureChain() call... but it looks like it's failing because the chain that it just called EnsureChain() on doesn't exist...

Is it possible for SyncHostports() to get called from multiple threads at once?
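
For illustration, a hedged sketch of the window being asked about; it follows the kubelet utiliptables interface, with the surrounding logic simplified:

    // If two goroutines run this concurrently, A's EnsureChain can succeed,
    // another writer can then flush or replace the nat table, and A's
    // EnsureRule ends up referencing a chain that no longer exists, which
    // produces exactly the "Couldn't load target KUBE-HOSTPORTS" error above.
    if _, err := h.iptables.EnsureChain(utiliptables.TableNAT, kubeHostportsChain); err != nil {
        return err
    }
    // ...window in which the chain can disappear...
    args := []string{"-m", "addrtype", "--dst-type", "LOCAL", "-j", string(kubeHostportsChain)}
    if _, err := h.iptables.EnsureRule(utiliptables.Append, utiliptables.TableNAT, utiliptables.ChainOutput, args...); err != nil {
        return err
    }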

@dcbw (Contributor, Author) commented Nov 2, 2016

> actually, it's coming from the EnsureRule() call a few lines down from that EnsureChain() call... but it looks like it's failing because the chain that it just called EnsureChain() on doesn't exist...
>
> Is it possible for SyncHostports() to get called from multiple threads at once?

All the calls should be synchronized through the podManager since they are done from setup/teardown like the rest of the pod network operations. They should also be serialized by the iptables commands and the locking that the Go iptables code uses, I think?

@danwinship (Contributor):

[test] again so it will hopefully pass before the merge queue reaches it

@openshift-bot (Contributor):

Evaluated for origin test up to a311f01

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11027/) (Base Commit: cf6ed4b)

@openshift-bot merged commit 605a036 into openshift:master on Nov 2, 2016