spurious errors on pod termination #18414
Comments
This looks like an existing bug: https://bugzilla.redhat.com/show_bug.cgi?id=1434950
Yeah, and it shouldn't be happening. Any way to get better node logs here? I tried to reproduce yesterday but could not.
"Better node logs" as compared to what? Note that you can click "View job in Jenkins" at the top of a test run and then "S3 artifacts" from there to get to the full origin-node.service logs. Eg, all 4 nodes on https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_conformance_gce/2380/s3/ are full of "failed to read pod IP" errors. |
Ah, I was looking at the extended networking test logs, which don't appear to have the node journals. But also, we apparently run with --loglevel=4 now, not 5, so we lose a lot of useful log messages.
So this is Kubelet being stupid. It will run StopPodSandbox() multiple times in parallel for the same sandbox. And dockershim's StopPodSandbox() itself calls GetPodSandboxStatus(), which checks the IP, and that's where the error is coming from. There's a race: the first StopPodSandbox() is still in TearDownPod() (and thus the networkReady suppression code isn't triggered yet) when the second StopPodSandbox() asks for the pod IP. Before that can run nsenter, TearDownPod() completes and the first StopPodSandbox() stops the sandbox container and destroys the netns. Then the second StopPodSandbox()'s GetPodSandboxStatus() finally gets around to running nsenter and fails.
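To make that sequence easier to follow, here is a minimal, self-contained Go sketch of the race. It is not the real kubelet/dockershim code: `sandbox`, `statusCheck`, and `tearDownAndStop` are invented stand-ins, and the sleeps just widen the race window so the failure shows up reliably.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sandbox is a toy stand-in for a dockershim pod sandbox.
type sandbox struct {
	mu           sync.Mutex
	networkReady bool // stays true until TearDownPod has finished
	netnsExists  bool // true until the sandbox container is stopped
}

// statusCheck models GetPodSandboxStatus: it only asks the network plugin for
// the IP while networkReady is still true, and the nsenter exec takes a moment.
func (s *sandbox) statusCheck() {
	s.mu.Lock()
	ready := s.networkReady
	s.mu.Unlock()
	if !ready {
		return // suppression works: networking already torn down, don't ask for the IP
	}
	time.Sleep(40 * time.Millisecond) // "nsenter" is still starting up...
	s.mu.Lock()
	gone := !s.netnsExists
	s.mu.Unlock()
	if gone {
		fmt.Println(`failed to read pod IP from plugin/docker: nsenter: cannot open /proc/<pid>/ns/net: No such file or directory`)
	} else {
		fmt.Println("pod IP read successfully")
	}
}

// tearDownAndStop models the first StopPodSandbox request: TearDownPod runs
// while networkReady is still true, then the flag is cleared and the sandbox
// container (and with it the netns) is destroyed.
func (s *sandbox) tearDownAndStop() {
	time.Sleep(30 * time.Millisecond) // TearDownPod in progress
	s.mu.Lock()
	s.networkReady = false // suppression flag set only after TearDownPod returns
	s.netnsExists = false  // StopContainer destroys the network namespace
	s.mu.Unlock()
}

func main() {
	s := &sandbox{networkReady: true, netnsExists: true}
	var wg sync.WaitGroup
	wg.Add(2)

	// First StopPodSandbox, from SyncLoop(PLEG) for the ContainerDied event.
	go func() {
		defer wg.Done()
		s.tearDownAndStop()
	}()

	// Second StopPodSandbox, from SyncLoop(REMOVE); it reads the status (and
	// thus wants the IP) while the first request is still inside TearDownPod.
	go func() {
		defer wg.Done()
		time.Sleep(10 * time.Millisecond)
		s.statusCheck()
	}()

	wg.Wait()
}
```

Run it and the second request prints the same kind of nsenter failure seen in the node logs above: it decided to read the IP before the first request cleared networkReady, but only got around to running nsenter after the netns was already gone.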
upstream fix: kubernetes/kubernetes#59301 |
They do; it's just not obvious where:
Automatic merge from submit-queue.

UPSTREAM: 59301: dockershim: don't check pod IP in StopPodSandbox

We're about to tear the container down, there's no point. It also suppresses an annoying error message due to kubelet stupidity that causes multiple parallel calls to StopPodSandbox for the same sandbox.

    docker_sandbox.go:355] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "docker-registry-1-deploy_default": Unexpected command output nsenter: cannot open /proc/22646/ns/net: No such file or directory

1) A first StopPodSandbox() request triggered by SyncLoop(PLEG) for a ContainerDied event calls into TearDownPod() and thus the network plugin. Until this completes, networkReady=true for the sandbox.
2) A second StopPodSandbox() request triggered by SyncLoop(REMOVE) calls PodSandboxStatus() and calls into the network plugin to read the IP address because networkReady=true.
3) The first request exits the network plugin, sets networkReady=false, and calls StopContainer() on the sandbox. This destroys the network namespace.
4) The second request finally gets around to running nsenter, but the network namespace is already destroyed. It returns an error which is logged by getIP().

Fixes: #18414

@danwinship
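For context, here is a rough Go sketch of the idea behind that change, not the actual dockershim patch: the status lookup used on the teardown path simply skips the network-plugin IP query, since the sandbox is about to be destroyed anyway. All names here (`shim`, `sandboxStatus`, `includeIP`, etc.) are illustrative, not the real upstream identifiers.

```go
package main

import "fmt"

// podSandboxStatus is a toy stand-in for the CRI sandbox status.
type podSandboxStatus struct {
	IP string
}

// shim is a toy stand-in for the docker shim; networkReady mirrors the
// per-sandbox suppression flag described in the commit message.
type shim struct {
	networkReady map[string]bool
}

// sandboxStatus builds the status; the IP is only fetched from the network
// plugin when the caller actually wants it and networking is still up.
func (s *shim) sandboxStatus(id string, includeIP bool) podSandboxStatus {
	st := podSandboxStatus{}
	if includeIP && s.networkReady[id] {
		st.IP = s.getIPFromPlugin(id) // the nsenter-backed lookup that was failing
	}
	return st
}

// stopPodSandbox no longer asks for the IP: the sandbox is about to be torn
// down, so the value would be thrown away anyway, and skipping the lookup
// avoids the spurious "failed to read pod IP" error when a parallel teardown
// destroys the netns first.
func (s *shim) stopPodSandbox(id string) {
	_ = s.sandboxStatus(id, false) // includeIP=false on the teardown path
	s.tearDownPod(id)
	s.networkReady[id] = false
	s.stopContainer(id)
}

func (s *shim) getIPFromPlugin(id string) string { return "10.128.0.5" }
func (s *shim) tearDownPod(id string)            { fmt.Println("tearing down network for", id) }
func (s *shim) stopContainer(id string)          { fmt.Println("stopping sandbox container for", id) }

func main() {
	s := &shim{networkReady: map[string]bool{"sandbox-1": true}}
	s.stopPodSandbox("sandbox-1")
}
```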
There's some sort of race/bad coordination in kubelet when pods are terminated, such that it keeps trying to do status checks even after it has started terminating the pod, often resulting in the "failed to read pod IP from plugin/docker" errors quoted above.
While this doesn't seem to cause any actual problems, the log messages are spammy, and make it look like something has gone wrong.
Clayton says "Can you open a bug for that and assign it to Seth? I think it happens on every pod termination now."