
spurious errors on pod termination #18414

Closed
danwinship opened this issue Feb 2, 2018 · 7 comments
Labels: kind/bug, priority/P2, sig/networking

danwinship commented Feb 2, 2018

There's some sort of race/bad coordination in kubelet when pods are terminated, such that it keeps trying to do status checks even after it has started terminating the pod, often resulting in:

Feb 01 18:20:44 ci-prtest-5a37c28-15429-ig-n-qzmw origin-node[10940]: I0201 18:20:44.622016   10940 kubelet.go:1852] SyncLoop (DELETE, "api"): "dns-test-8a31e45e-077c-11e8-9376-0e8c3420a542_e2e-tests-dns-g9vhj(8a3a5941-077c-11e8-b172-42010a8e0005)"                                                                   
Feb 01 18:20:44 ci-prtest-5a37c28-15429-ig-n-qzmw origin-node[10940]: I0201 18:20:44.622162   10940 kubelet.go:1846] SyncLoop (REMOVE, "api"): "dns-test-8a31e45e-077c-11e8-9376-0e8c3420a542_e2e-tests-dns-g9vhj(8a3a5941-077c-11e8-b172-42010a8e0005)"                                                                   
Feb 01 18:20:44 ci-prtest-5a37c28-15429-ig-n-qzmw origin-node[10940]: W0201 18:20:44.688273   10940 docker_sandbox.go:340] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "dns-test-8a31e45e-077c-11e8-9376-0e8c3420a542_e2e-tests-dns-g9vhj": Unexpected command output nsenter: cannot open /proc/49301/ns/net: No such file or directory
Feb 01 18:20:44 ci-prtest-5a37c28-15429-ig-n-qzmw origin-node[10940]: with error: exit status 1                                                                                                                                                                                                                           

While this doesn't seem to cause any actual problems, the log messages are spammy, and make it look like something has gone wrong.

Clayton says "Can you open a bug for that and assign it to Seth? I think it happens on every pod termination now."

@pravisankar
This looks like an existing bug: https://bugzilla.redhat.com/show_bug.cgi?id=1434950
cc: @dcbw


dcbw commented Feb 2, 2018

Yeah, and it shouldn't be happening. Any way to get better node logs here? I tried to reproduce yesterday but could not.

@danwinship

"Better node logs" as compared to what?

Note that you can click "View job in Jenkins" at the top of a test run and then "S3 artifacts" from there to get to the full origin-node.service logs. E.g., all 4 nodes on https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_conformance_gce/2380/s3/ are full of "failed to read pod IP" errors.


dcbw commented Feb 2, 2018

Ah I was looking at the extended networking test logs which don't appear to have the node journals.

But also, we apparently run with --loglevel=4 now, not 5, so we lose a lot of useful log messages.


dcbw commented Feb 3, 2018

So this is the kubelet being stupid: it will run StopPodSandbox() multiple times in parallel for the same sandbox, and dockershim's StopPodSandbox() itself calls GetPodSandboxStatus(), which checks the IP; that's where the error comes from. There's a race: the first StopPodSandbox() is still in TearDownPod() (so the networkReady suppression code hasn't been triggered yet) when the second StopPodSandbox() asks for the pod IP. Before that request can run nsenter, TearDownPod() completes and the first StopPodSandbox() stops the sandbox container and destroys the netns. The second StopPodSandbox()'s GetPodSandboxStatus() then finally gets around to running nsenter and fails.
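
To make the interleaving concrete, here's a minimal, self-contained Go sketch of it. This is not the dockershim code: the sandbox struct, readPodIP(), and the channels that force the ordering are all invented for illustration; it only reproduces the sequence described above so the failing nsenter-style lookup happens every time.

```go
// Illustrative only -- NOT dockershim code. All names are invented.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type sandbox struct {
	mu           sync.Mutex
	networkReady bool // flipped to false only after CNI teardown returns
	netnsExists  bool // stands in for /proc/<pid>/ns/net
}

// readPodIP stands in for the status hook that runs nsenter in the pod netns.
func (s *sandbox) readPodIP() (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.netnsExists {
		return "", errors.New("nsenter: cannot open /proc/<pid>/ns/net: No such file or directory")
	}
	return "10.128.0.5", nil
}

func main() {
	s := &sandbox{networkReady: true, netnsExists: true}

	teardownStarted := make(chan struct{}) // first stop has entered TearDownPod
	ipCheckQueued := make(chan struct{})   // second stop has decided to read the IP
	netnsDestroyed := make(chan struct{})  // first stop has destroyed the netns
	var wg sync.WaitGroup
	wg.Add(2)

	// First StopPodSandbox (e.g. triggered by SyncLoop(PLEG) on ContainerDied).
	go func() {
		defer wg.Done()
		close(teardownStarted) // CNI TearDownPod is now in progress
		<-ipCheckQueued        // ...and while it runs, the second stop queues an IP check
		s.mu.Lock()
		s.networkReady = false // teardown finished
		s.netnsExists = false  // sandbox container stopped, netns destroyed
		s.mu.Unlock()
		close(netnsDestroyed)
	}()

	// Second StopPodSandbox (e.g. triggered by SyncLoop(REMOVE)).
	go func() {
		defer wg.Done()
		<-teardownStarted
		s.mu.Lock()
		ready := s.networkReady // still true: the first stop hasn't finished teardown
		s.mu.Unlock()
		close(ipCheckQueued)
		<-netnsDestroyed // models the delay before nsenter actually runs
		if ready {
			if _, err := s.readPodIP(); err != nil {
				fmt.Println("failed to read pod IP from plugin/docker:", err)
			}
		}
	}()

	wg.Wait()
}
```

Running it prints the same kind of message the kubelet logs, because by the time the second stop gets to its IP lookup the netns is already gone.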


dcbw commented Feb 3, 2018

upstream fix: kubernetes/kubernetes#59301

@danwinship

> Ah I was looking at the extended networking test logs which don't appear to have the node journals.

They do; it's just not obvious where: artifacts/scripts/networking/logs/{subnet,multitenant,networkpolicy}/{nettest-master,nettest-node-1,nettest-node-2}/systemd.log.gz

sjenning assigned dcbw and unassigned sjenning Feb 5, 2018
0xmichalis added kind/bug and sig/networking labels Feb 6, 2018
openshift-merge-robot added a commit that referenced this issue Mar 21, 2018
Automatic merge from submit-queue.

UPSTREAM: 59301: dockershim: don't check pod IP in StopPodSandbox

We're about to tear the container down, there's no point.  It also suppresses
an annoying error message due to kubelet stupidity that causes multiple
parallel calls to StopPodSandbox for the same sandbox.

docker_sandbox.go:355] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "docker-registry-1-deploy_default": Unexpected command output nsenter: cannot open /proc/22646/ns/net: No such file or directory

1) A first StopPodSandbox() request triggered by SyncLoop(PLEG) for
a ContainerDied event calls into TearDownPod() and thus the network
plugin.  Until this completes, networkReady=true for the
sandbox.

2) A second StopPodSandbox() request triggered by SyncLoop(REMOVE)
calls PodSandboxStatus() and calls into the network plugin to read
the IP address because networkReady=true

3) The first request exits the network plugin, sets networkReady=false,
and calls StopContainer() on the sandbox.  This destroys the network
namespace.

4) The second request finally gets around to running nsenter but
the network namespace is already destroyed.  It returns an error
which is logged by getIP().

Fixes: #18414

@danwinship
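
Purely to illustrate the shape of that change, and not the actual upstream patch (the networkPlugin interface, sandboxState, stopPodSandbox(), and fakePlugin below are all invented names), here is a minimal Go sketch: the stop path never asks the network plugin for a live pod IP, because the sandbox is about to go away anyway.

```go
// Illustrative only -- NOT the actual kubernetes/kubernetes#59301 patch.
// It just shows the shape of the change: the stop path reuses whatever IP it
// already has instead of shelling into a namespace that may already be gone.
package main

import "fmt"

// networkPlugin is a stand-in for the CNI plugin wrapper.
type networkPlugin interface {
	GetPodIP(sandboxID string) (string, error) // the "status hook" (nsenter)
	TearDownPod(sandboxID, podIP string) error
}

type sandboxState struct {
	networkReady bool
	cachedIP     string // e.g. recorded at sandbox setup time
}

// stopPodSandbox sketches the fixed flow: tolerate a missing IP rather than
// calling GetPodIP on a sandbox that is being torn down.
func stopPodSandbox(plugin networkPlugin, id string, state *sandboxState) error {
	ip := state.cachedIP // do NOT call plugin.GetPodIP here
	if state.networkReady {
		if err := plugin.TearDownPod(id, ip); err != nil {
			return fmt.Errorf("teardown of sandbox %q failed: %w", id, err)
		}
		state.networkReady = false
	}
	return nil
}

// fakePlugin's GetPodIP always fails, the way nsenter does once the netns is
// gone; with the flow above it is simply never called during stop.
type fakePlugin struct{}

func (fakePlugin) GetPodIP(string) (string, error) {
	return "", fmt.Errorf("nsenter: cannot open /proc/<pid>/ns/net: No such file or directory")
}
func (fakePlugin) TearDownPod(string, string) error { return nil }

func main() {
	state := &sandboxState{networkReady: true, cachedIP: "10.128.0.5"}
	if err := stopPodSandbox(fakePlugin{}, "dns-test-sandbox", state); err != nil {
		fmt.Println("stop failed:", err)
		return
	}
	fmt.Println("sandbox stopped without querying the plugin for the pod IP")
}
```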
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this issue Mar 23, 2018
Origin-commit: 73e9c8fa185d0f7bb9a0798185f78598cf4bb42b