
sdn: miscellaneous fixes after the CNI merge #11613

Merged: 5 commits into openshift:master on Nov 2, 2016

Conversation

@dcbw (Contributor) commented Oct 27, 2016

The podManager must be started (so it can process requests) before we try to call Update on pods at startup if networking has changed.

Fixes: cf69a41
Fixes bug 1388856
Fixes bug 1389717
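
For context, a minimal sketch of the startup ordering this PR enforces. podManager.Start, cniserver.CNIServerSocketPath, and UpdatePod appear in the diff and discussion below; the networkChanged flag and runningPods loop are illustrative, not the exact origin code:

    // In OsdnNode.Start(): bring the pod manager up first so its request
    // loop is running, then replay updates for pods whose networking changed.
    if err := node.podManager.Start(cniserver.CNIServerSocketPath); err != nil {
        return err
    }
    if networkChanged {
        for _, pod := range runningPods {
            // each update is queued through the now-running pod manager
            if err := node.UpdatePod(pod); err != nil {
                return err
            }
        }
    }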

@dcbw (Contributor, Author) commented Oct 27, 2016

@eparis @openshift/networking @knobunc

@danwinship (Contributor):

LGTM. Did you mean to link to a bug rather than a commit id above?

@dcbw (Contributor, Author) commented Oct 27, 2016

@danwinship meant the commit ID, though I could do the github issue too if you like.

@pravisankar:

LGTM

@eparis (Member) commented Oct 27, 2016

[test]

@eparis (Member) commented Oct 27, 2016

[testextended][extended:networking]

@@ -207,6 +207,9 @@ func (node *OsdnNode) Start() error {
 	if err != nil {
 		return err
 	}
+	if err := node.podManager.Start(cniserver.CNIServerSocketPath); err != nil {
Contributor:

Is the listener always ready once this returns?

@dcbw (Contributor, Author):

Yeah, processCNIRequests() will be running in a goroutine at this point, handling subsequent requests from UpdatePod and kubelet.

But there's one more part to this bug that I'll push as a second commit and get another round of LGTM. @DirectXMan12 found the issues yesterday when running a combined master-and-node process; this PR should fix them.
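
A hedged sketch of the guarantee being described. processCNIRequests and the socket path are named in this PR; NewCNIServer and handleCNIRequest are guesses at the surrounding wiring, not confirmed code:

    // Start spins up the request-processing goroutine before the CNI
    // server begins accepting connections, so both UpdatePod calls and
    // kubelet CNI requests submitted afterward get handled.
    func (m *podManager) Start(socketPath string) error {
        go m.processCNIRequests()
        m.cniServer = cniserver.NewCNIServer(socketPath)
        return m.cniServer.Start(m.handleCNIRequest)
    }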

Contributor:

is the listener established synchronously before the goroutine is spawned?

Contributor:

yup, found l, err := net.Listen("unix", s.path) in CNIServer#Start before the Serve goroutine is kicked off
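
In other words, a simplified sketch of CNIServer.Start as observed above; the HTTP handler wiring is elided:

    // The unix socket is bound synchronously, so the listener already
    // exists when Start() returns; only the accept loop is a goroutine.
    func (s *CNIServer) Start() error {
        l, err := net.Listen("unix", s.path)
        if err != nil {
            return fmt.Errorf("failed to listen on CNI socket: %v", err)
        }
        go s.Serve(l) // error handling is added later in this PR
        return nil
    }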

Contributor:

if go s.Serve(l) returns an error, nothing will log/handle/restart/exit... if it panics, the process will exit. Do we need to put handling around the Serve call inside the goroutine?

@dcbw (Contributor, Author):

@liggitt in this specific case, the CNI listener/server isn't actually in the hot path, since the Update request that this PR fixes doesn't go through the listener. We just need to make sure that processCNIRequests() is running.

@dcbw (Contributor, Author) commented Oct 27, 2016

@danwinship @pravisankar @knobunc PTAL, thanks; one more commit to fix @DirectXMan12's issue from yesterday.

@dcbw (Contributor, Author) commented Oct 27, 2016

@DirectXMan12 can you test these 2 commits with your setup?

Commit: "We need the kubelet network Host object before we can update pods, so wait for it."
@dcbw force-pushed the sdn-start-pod-manager-earlier branch from 15567df to 75252d7 on October 27, 2016 14:26
@knobunc (Contributor) left a review:

LGTM

@dcbw (Contributor, Author) commented Oct 27, 2016

> if go s.Serve(l) returns an error, nothing will log/handle/restart/exit...

Good point, will look at it.

@danwinship (Contributor):

Looks right. plugin.go:Init() could return an error without closing kubeletInitReady, but the node will eventually hit a glog.Fatal() in that case, so it doesn't matter.
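
A hedged sketch of the handshake being described; kubeletInitReady and plugin.go:Init() are named above, while the type and parameter names are illustrative:

    // kubeletInitReady is closed once the kubelet hands the plugin its
    // network Host in Init(); pod updates block until that happens.
    var kubeletInitReady = make(chan struct{})

    func (plugin *cniPlugin) Init(host network.Host) error {
        plugin.host = host
        close(kubeletInitReady) // signal that the Host is usable
        return nil
    }

    // elsewhere, before updating pods at startup:
    <-kubeletInitReady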

@dcbw (Contributor, Author) commented Oct 28, 2016

> if go s.Serve(l) returns an error, nothing will log/handle/restart/exit...

Good point, will look at it.

@liggitt Handled; @danwinship and I looked at it yesterday and concluded that running Serve() forever was OK since we're using Unix domain sockets and thus most of the reasons Serve()/Accept() will return aren't really valid.

@dcbw changed the title from "sdn: start pod manager before trying to update if VNIDs have changed" to "sdn: miscellaneous fixes after the CNI merge" on Oct 28, 2016
-	go s.Serve(l)
+	go utilwait.Forever(func() {
+		if err := s.Serve(l); err != nil {
+			glog.Warningf("CNI server Serve() failed: %v", err)
Contributor:

based on other sdn code, I think the preferred behavior is to call (kubernetes/pkg/util/)runtime.HandleError()

@dcbw (Contributor, Author):

Fixed.
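
The plausible shape of the fix, combining the diff above with the suggested handler (the exact call site may differ):

    // keep serving forever; report Serve()/Accept() errors through the
    // shared error handler instead of logging a warning and giving up
    go utilwait.Forever(func() {
        if err := s.Serve(l); err != nil {
            utilruntime.HandleError(fmt.Errorf("CNI server Serve() failed: %v", err))
        }
    }, 0)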

dcbw added 2 commits October 31, 2016 09:15

First commit:
The only reason Serve() will return an error is from Accept(),
and most of the reasons Accept() (and accept(2)) will return errors
are relevant for network sockets, not for unix domain sockets,
which is what the SDN code uses. So if an error does happen,
like broken connections or something, just keep serving.

Second commit:
Due to a misguided attempt to harmonize addresses and routes
checking in alreadySetUp(). Turns out addresses can simply be
checked for equality since they are returned from GetAddresses()
as plain CIDRs, but routes need the extra " " in the check because
the entire '/sbin/ip route' line is returned.

Fixes: openshift#11082
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1388856
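
A sketch of the distinction the second commit message draws; the function shape and variable names are illustrative, not the actual alreadySetUp() code:

    // Addresses come back from GetAddresses() as plain CIDRs, so exact
    // equality works; routes come back as whole '/sbin/ip route' output
    // lines, so match the CIDR plus a trailing space within the line.
    func alreadySetUp(localSubnetCIDR string, addrs, routes []string) bool {
        foundAddr, foundRoute := false, false
        for _, addr := range addrs {
            if addr == localSubnetCIDR {
                foundAddr = true
            }
        }
        for _, route := range routes {
            if strings.Contains(route, localSubnetCIDR+" ") {
                foundRoute = true
            }
        }
        return foundAddr && foundRoute
    }
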
@dcbw force-pushed the sdn-start-pod-manager-earlier branch from deaf7b0 to 162c0be on October 31, 2016 14:15
@dcbw (Contributor, Author) commented Oct 31, 2016

Extended networking test failure seems like a dind or docker issue:

Oct 31 11:16:25.304: INFO: At 2016-10-31 11:15:44 -0400 EDT - event for flow-checkl35gu: {kubelet nettest-node-1} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "flow-check" with RunContainerError: "runContainer: Error response from daemon: Cannot start container ec598cdedcc673560b770debe1b1d705e3fc0751182dcc82b0ba5c740ec846f1: Path /var/run/openvswitch/br0.mgmt is mounted on /run but it is not a shared or slave mount."

@knobunc (Contributor) commented Nov 1, 2016

@danw @pravisankar PTAL

@dcbw (Contributor, Author) commented Nov 1, 2016

re-[testextended][extended:networking]

@dcbw (Contributor, Author) commented Nov 1, 2016

re-[testextended][extended:networking] flake is #11707

@marun (Contributor) commented Nov 1, 2016

Consider adding mount --make-shared /run after https://github.com/openshift/origin/blob/master/images/dind/dind-setup.sh#L44 and see if the test passes.

@dcbw (Contributor, Author) commented Nov 1, 2016

re-[testextended][extended:networking] flake is #11707

@openshift-bot (Contributor):

Evaluated for origin testextended up to a311f01

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/testextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/720/) (Base Commit: 8ecb3f5) (Extended Tests: networking)

@knobunc (Contributor) commented Nov 2, 2016

[merge]

@openshift-bot (Contributor) commented Nov 2, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11027/) (Image: devenv-rhel7_5305)

@openshift-bot (Contributor):

Evaluated for origin merge up to a311f01

@dcbw (Contributor, Author) commented Nov 2, 2016

Extended net tests failed with:

SetupNetworkError: "Failed to setup network for pod "service-wget_e2e-tests-net-services2-sr3qi(90a873e7-a08c-11e6-adad-0242d009400c)" using network plugins "cni": CNI request failed with status 400: 'Failed to ensure that nat chain OUTPUT jumps to KUBE-HOSTPORTS: error checking rule: exit status 2: iptables v1.4.21: Couldn't load target `KUBE-HOSTPORTS':No such file or directory\n\nTry `iptables -h' or 'iptables --help' for more information.\n\n'; Skipping pod"

but that's not related to this PR, that's coming from pkg/kubelet/network/hostport/hostport.go:

    glog.V(4).Info("Ensuring kubelet hostport chains")
    // Ensure kubeHostportChain
    if _, err := h.iptables.EnsureChain(utiliptables.TableNAT, kubeHostportsChain); err != nil {
        return fmt.Errorf("Failed to ensure that %s chain %s exists: %v", utiliptables.TableNAT, kubeHostportsChain, err)
    }

@danwinship (Contributor):

> but that's not related to this PR, that's coming from pkg/kubelet/network/hostport/hostport.go:

actually, it's coming from the EnsureRule() call a few lines down from that EnsureChain() call... but it looks like it's failing because the chain that it just called EnsureChain() on doesn't exist...

Is it possible for SyncHostports() to get called from multiple threads at once?
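
For illustration, a hedged sketch of the window being asked about; it follows the kubelet utiliptables interface, with the surrounding logic simplified:

    // If two goroutines run this concurrently, A's EnsureChain can succeed,
    // another writer can then flush or replace the nat table, and A's
    // EnsureRule ends up referencing a chain that no longer exists, which
    // produces exactly the "Couldn't load target KUBE-HOSTPORTS" error above.
    if _, err := h.iptables.EnsureChain(utiliptables.TableNAT, kubeHostportsChain); err != nil {
        return err
    }
    // ...window in which the chain can disappear...
    args := []string{"-m", "addrtype", "--dst-type", "LOCAL", "-j", string(kubeHostportsChain)}
    if _, err := h.iptables.EnsureRule(utiliptables.Append, utiliptables.TableNAT, utiliptables.ChainOutput, args...); err != nil {
        return err
    }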

@dcbw (Contributor, Author) commented Nov 2, 2016

> actually, it's coming from the EnsureRule() call a few lines down from that EnsureChain() call... but it looks like it's failing because the chain that it just called EnsureChain() on doesn't exist...
>
> Is it possible for SyncHostports() to get called from multiple threads at once?

All the calls should be synchronized through the podManager since they are done from setup/teardown like the rest of the pod network operations. They should also be serialized by the iptables commands and the locking that the Go iptables code uses, I think?

@danwinship (Contributor):

[test] again so it will hopefully pass before the merge queue reaches it

@openshift-bot (Contributor):

Evaluated for origin test up to a311f01

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11027/) (Base Commit: cf6ed4b)

@openshift-bot merged commit 605a036 into openshift:master on Nov 2, 2016