-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sdn: miscellaneous fixes after the CNI merge #11613
sdn: miscellaneous fixes after the CNI merge #11613
Conversation
LGTM. Did you mean to link to a bug rather than a commit id above? |
@danwinship meant the commit ID, though I could do the github issue too if you like. |
LGTM |
[test] |
[testextended][extended:networking] |
@@ -207,6 +207,9 @@ func (node *OsdnNode) Start() error { | |||
if err != nil { | |||
return err | |||
} | |||
if err := node.podManager.Start(cniserver.CNIServerSocketPath); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the listener always ready once this returns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, processCNIRequests() will be running in a goroutine at this point, handling subsequent requests from UpdatePod and kubelet.
But there's one more part to this bug that I'll push as a second commit and get another round of LGTM. @DirectXMan12 found the issues yesterday when running a combined process master and node that this PR should fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the listener established synchronously before the goroutine is spawned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, found l, err := net.Listen("unix", s.path)
in CNIServer#Start before the Serve goroutine is kicked off
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if go s.Serve(l)
returns an error, nothing will log/handle/restart/exit... if it panics, the process will exit. Do we need to put handling around the Serve call inside the goroutine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liggitt in this specific case, the CNI listener/server isn't actually in the hot path, since the Update request that this PR fixes doesn't go through the listener. We just need to make sure that processCNIRequests() is running.
@danwinship @pravisankar @knobunc PTAL thanks; one more commit to fix @DirectXMan12 issue from yesterday. |
@DirectXMan12 can you test these 2 commits with your setup? |
We need the kubelet network Host object before we can update pods, so wait for it.
15567df
to
75252d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Good point, will look at it. |
Looks right. plugin.go:Init() could return an error without closing |
@liggitt Handled; @danwinship and I looked at it yesterday and concluded that running Serve() forever was OK since we're using Unix domain sockets and thus most of the reasons Serve()/Accept() will return aren't really valid. |
go s.Serve(l) | ||
go utilwait.Forever(func() { | ||
if err := s.Serve(l); err != nil { | ||
glog.Warningf("CNI server Serve() failed: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on other sdn code, I think the preferred behavior is to call (kubernetes/pkg/util/)runtime.HandleError()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
The only reason Serve() will return an error is from Accept(), and most of the reasons Accept() (and accept(2)) will return errors are relevant for network sockets, not for unix domain sockets which is what the SDN code uses. So if an error does happen, like broken connections or something, just keep serving.
Due to a misguided attempt to harmonize addresses and routes checking in alreadySetUp(). Turns out addresses can simply be checked for equality since they are returned from GetAddresses() as plain CIDRs, but routes need the extra " " in the check because the entire '/sbin/ip route' line is returned. Fixes: openshift#11082 Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1388856
deaf7b0
to
162c0be
Compare
Extended networking test failure seems like a dind or docker issue:
|
@danw @pravisankar PTAL |
re-[testextended][extended:networking] |
re-[testextended][extended:networking] flake is #11707 |
Consider adding |
re-[testextended][extended:networking] flake is #11707 |
Evaluated for origin testextended up to a311f01 |
continuous-integration/openshift-jenkins/testextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/720/) (Base Commit: 8ecb3f5) (Extended Tests: networking) |
[merge] |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11027/) (Image: devenv-rhel7_5305) |
Evaluated for origin merge up to a311f01 |
Extended net tests failed with: SetupNetworkError: "Failed to setup network for pod "service-wget_e2e-tests-net-services2-sr3qi(90a873e7-a08c-11e6-adad-0242d009400c)" using network plugins "cni": CNI request failed with status 400: 'Failed to ensure that nat chain OUTPUT jumps to KUBE-HOSTPORTS: error checking rule: exit status 2: iptables v1.4.21: Couldn't load target but that's not related to this PR, that's coming from pkg/kubelet/network/hostport/hostport.go:
|
actually, it's coming from the EnsureRule() call a few lines down from that EnsureChain() call... but it looks like it's failing because the chain that it just called EnsureChain() on doesn't exist... Is it possible for SyncHostports() to get called from multiple threads at once? |
All the calls should be synchronized through the podManager since they are done from the setup/teardown liek the rest of the pod network operations. They should also be synchronized by the iptables commands and the locking that the go iptables stuff uses, I think? |
[test] again so it will hopefully pass before the merge queue reaches it |
Evaluated for origin test up to a311f01 |
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11027/) (Base Commit: cf6ed4b) |
The podManager must be started (so it can process requests) before we try to call Update on pods at startup if networking has changed.
Fixes: cf69a41
Fixes bug 1388856
Fixes bug 1389717