
spurious errors on pod termination #18414

Closed
danwinship opened this issue Feb 2, 2018 · 7 comments
Labels: kind/bug, priority/P2, sig/networking

danwinship commented Feb 2, 2018

There's some sort of race/bad coordination in kubelet when pods are terminated, such that it keeps trying to do status checks even after it has started terminating the pod, often resulting in:

Feb 01 18:20:44 ci-prtest-5a37c28-15429-ig-n-qzmw origin-node[10940]: I0201 18:20:44.622016   10940 kubelet.go:1852] SyncLoop (DELETE, "api"): "dns-test-8a31e45e-077c-11e8-9376-0e8c3420a542_e2e-tests-dns-g9vhj(8a3a5941-077c-11e8-b172-42010a8e0005)"                                                                   
Feb 01 18:20:44 ci-prtest-5a37c28-15429-ig-n-qzmw origin-node[10940]: I0201 18:20:44.622162   10940 kubelet.go:1846] SyncLoop (REMOVE, "api"): "dns-test-8a31e45e-077c-11e8-9376-0e8c3420a542_e2e-tests-dns-g9vhj(8a3a5941-077c-11e8-b172-42010a8e0005)"                                                                   
Feb 01 18:20:44 ci-prtest-5a37c28-15429-ig-n-qzmw origin-node[10940]: W0201 18:20:44.688273   10940 docker_sandbox.go:340] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "dns-test-8a31e45e-077c-11e8-9376-0e8c3420a542_e2e-tests-dns-g9vhj": Unexpected command output nsenter: cannot open /proc/49301/ns/net: No such file or directory
Feb 01 18:20:44 ci-prtest-5a37c28-15429-ig-n-qzmw origin-node[10940]: with error: exit status 1                                                                                                                                                                                                                           

While this doesn't seem to cause any actual problems, the log messages are spammy, and make it look like something has gone wrong.

Clayton says "Can you open a bug for that and assign it to Seth? I think it happens on every pod termination now."

@pravisankar
This looks like an existing bug: https://bugzilla.redhat.com/show_bug.cgi?id=1434950
cc: @dcbw


dcbw commented Feb 2, 2018

Yeah, and it shouldn't be happening. Any way to get better node logs here? I tried to reproduce yesterday but could not.

@danwinship

"Better node logs" as compared to what?

Note that you can click "View job in Jenkins" at the top of a test run and then "S3 artifacts" from there to get to the full origin-node.service logs. E.g., all 4 nodes on https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_conformance_gce/2380/s3/ are full of "failed to read pod IP" errors.


dcbw commented Feb 2, 2018

Ah I was looking at the extended networking test logs which don't appear to have the node journals.

But also, we apparently run with --loglevel=4 now, not 5, so we lose a lot of useful log messages.


dcbw commented Feb 3, 2018

So this is the kubelet being stupid: it will run StopPodSandbox() multiple times in parallel for the same sandbox, and dockershim's StopPodSandbox() itself calls GetPodSandboxStatus(), which checks the IP; that's where the error comes from. There's a race: the first StopPodSandbox() is still in TearDownPod() (so the networkReady suppression code hasn't been triggered yet) when the second StopPodSandbox() asks for the pod IP. Before that request can run nsenter, TearDownPod() completes and the first StopPodSandbox() stops the sandbox container and destroys the netns. The second StopPodSandbox()'s GetPodSandboxStatus() then finally gets around to running nsenter and fails.
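
To make the interleaving concrete, here's a minimal, self-contained Go sketch of it. This is not the dockershim code: the sandbox struct, readPodIP(), and the channels that force the ordering are all invented for illustration; it only reproduces the sequence described above so the failing nsenter-style lookup happens every time.

```go
// Illustrative only -- NOT dockershim code. All names are invented.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type sandbox struct {
	mu           sync.Mutex
	networkReady bool // flipped to false only after CNI teardown returns
	netnsExists  bool // stands in for /proc/<pid>/ns/net
}

// readPodIP stands in for the status hook that runs nsenter in the pod netns.
func (s *sandbox) readPodIP() (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.netnsExists {
		return "", errors.New("nsenter: cannot open /proc/<pid>/ns/net: No such file or directory")
	}
	return "10.128.0.5", nil
}

func main() {
	s := &sandbox{networkReady: true, netnsExists: true}

	teardownStarted := make(chan struct{}) // first stop has entered TearDownPod
	ipCheckQueued := make(chan struct{})   // second stop has decided to read the IP
	netnsDestroyed := make(chan struct{})  // first stop has destroyed the netns
	var wg sync.WaitGroup
	wg.Add(2)

	// First StopPodSandbox (e.g. triggered by SyncLoop(PLEG) on ContainerDied).
	go func() {
		defer wg.Done()
		close(teardownStarted) // CNI TearDownPod is now in progress
		<-ipCheckQueued        // ...and while it runs, the second stop queues an IP check
		s.mu.Lock()
		s.networkReady = false // teardown finished
		s.netnsExists = false  // sandbox container stopped, netns destroyed
		s.mu.Unlock()
		close(netnsDestroyed)
	}()

	// Second StopPodSandbox (e.g. triggered by SyncLoop(REMOVE)).
	go func() {
		defer wg.Done()
		<-teardownStarted
		s.mu.Lock()
		ready := s.networkReady // still true: the first stop hasn't finished teardown
		s.mu.Unlock()
		close(ipCheckQueued)
		<-netnsDestroyed // models the delay before nsenter actually runs
		if ready {
			if _, err := s.readPodIP(); err != nil {
				fmt.Println("failed to read pod IP from plugin/docker:", err)
			}
		}
	}()

	wg.Wait()
}
```

Running it prints the same kind of message the kubelet logs, because by the time the second stop gets to its IP lookup the netns is already gone.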


dcbw commented Feb 3, 2018

upstream fix: kubernetes/kubernetes#59301

@danwinship

> Ah I was looking at the extended networking test logs which don't appear to have the node journals.

They do; it's just not obvious where: artifacts/scripts/networking/logs/{subnet,multitenant,networkpolicy}/{nettest-master,nettest-node-1,nettest-node-2}/systemd.log.gz

sjenning assigned dcbw and unassigned sjenning Feb 5, 2018
0xmichalis added kind/bug and sig/networking labels Feb 6, 2018
openshift-merge-robot added a commit that referenced this issue Mar 21, 2018
Automatic merge from submit-queue.

UPSTREAM: 59301: dockershim: don't check pod IP in StopPodSandbox

We're about to tear the container down, there's no point.  It also suppresses
an annoying error message due to kubelet stupidity that causes multiple
parallel calls to StopPodSandbox for the same sandbox.

docker_sandbox.go:355] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "docker-registry-1-deploy_default": Unexpected command output nsenter: cannot open /proc/22646/ns/net: No such file or directory

1) A first StopPodSandbox() request triggered by SyncLoop(PLEG) for
a ContainerDied event calls into TearDownPod() and thus the network
plugin.  Until this completes, networkReady=true for the
sandbox.

2) A second StopPodSandbox() request triggered by SyncLoop(REMOVE)
calls PodSandboxStatus() and calls into the network plugin to read
the IP address because networkReady=true

3) The first request exits the network plugin, sets networkReady=false,
and calls StopContainer() on the sandbox.  This destroys the network
namespace.

4) The second request finally gets around to running nsenter but
the network namespace is already destroyed.  It returns an error
which is logged by getIP().

Fixes: #18414

@danwinship
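
Purely to illustrate the shape of that change, and not the actual upstream patch (the networkPlugin interface, sandboxState, stopPodSandbox(), and fakePlugin below are all invented names), here is a minimal Go sketch: the stop path never asks the network plugin for a live pod IP, because the sandbox is about to go away anyway.

```go
// Illustrative only -- NOT the actual kubernetes/kubernetes#59301 patch.
// It just shows the shape of the change: the stop path reuses whatever IP it
// already has instead of shelling into a namespace that may already be gone.
package main

import "fmt"

// networkPlugin is a stand-in for the CNI plugin wrapper.
type networkPlugin interface {
	GetPodIP(sandboxID string) (string, error) // the "status hook" (nsenter)
	TearDownPod(sandboxID, podIP string) error
}

type sandboxState struct {
	networkReady bool
	cachedIP     string // e.g. recorded at sandbox setup time
}

// stopPodSandbox sketches the fixed flow: tolerate a missing IP rather than
// calling GetPodIP on a sandbox that is being torn down.
func stopPodSandbox(plugin networkPlugin, id string, state *sandboxState) error {
	ip := state.cachedIP // do NOT call plugin.GetPodIP here
	if state.networkReady {
		if err := plugin.TearDownPod(id, ip); err != nil {
			return fmt.Errorf("teardown of sandbox %q failed: %w", id, err)
		}
		state.networkReady = false
	}
	return nil
}

// fakePlugin's GetPodIP always fails, the way nsenter does once the netns is
// gone; with the flow above it is simply never called during stop.
type fakePlugin struct{}

func (fakePlugin) GetPodIP(string) (string, error) {
	return "", fmt.Errorf("nsenter: cannot open /proc/<pid>/ns/net: No such file or directory")
}
func (fakePlugin) TearDownPod(string, string) error { return nil }

func main() {
	state := &sandboxState{networkReady: true, cachedIP: "10.128.0.5"}
	if err := stopPodSandbox(fakePlugin{}, "dns-test-sandbox", state); err != nil {
		fmt.Println("stop failed:", err)
		return
	}
	fmt.Println("sandbox stopped without querying the plugin for the pod IP")
}
```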
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this issue Mar 23, 2018
Origin-commit: 73e9c8fa185d0f7bb9a0798185f78598cf4bb42b