Fix bug in status manager TerminatePod #41436
Conversation
/cc @yujuhong
pkg/kubelet/status/status_manager.go
Outdated
@@ -135,6 +135,8 @@ func (m *manager) Start() {
	go wait.Forever(func() {
		select {
		case syncRequest := <-m.podStatusChannel:
			glog.V(10).Infof("Status Manager: syncing pod: %v, with status: (%v, %v) from podStatusChannel",
V(10) seems a little bit too much; V(5) should be fine, I think.
ack
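For context on the V(10)-vs-V(5) suggestion: glog.V(level) gates a message on the process's -v verbosity flag, so a V(10) line is emitted only at very high verbosity. A standalone sketch of that gating idea using only the standard library (the verbosity variable and vlog helper are hypothetical stand-ins for illustration, not glog's API; vlog also returns a bool so the behavior is easy to check):

```go
package main

import "fmt"

// verbosity stands in for glog's -v flag value; it is a hypothetical local.
var verbosity = 5

// vlog mimics the shape of glog.V(level).Infof: the message is emitted only
// when the configured verbosity is at or above the requested level. The bool
// return (did we log?) is added here purely for illustration.
func vlog(level int, format string, args ...interface{}) bool {
	if verbosity < level {
		return false
	}
	fmt.Printf(format+"\n", args...)
	return true
}

func main() {
	vlog(5, "Status Manager: syncing pod: %q", "pod-uid-1")        // printed at -v=5
	vlog(10, "Status Manager: very chatty detail for pod %q", "x") // suppressed at -v=5
}
```

With verbosity fixed at 5, only the V(5)-style message appears, which is the reviewer's point: V(10) output is effectively invisible in normal debugging runs.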
pkg/kubelet/status/status_manager.go
Outdated
@@ -326,6 +328,8 @@ func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUp

	select {
	case m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}:
		glog.V(10).Infof("Status Manager: adding pod: %v, with status: (%v, %v) to podStatusChannel",
ditto.
We usually use %q for strings.
ack
pkg/kubelet/status/status_manager.go
Outdated
@@ -395,6 +399,7 @@ func (m *manager) syncBatch() {
	}()

	for _, update := range updatedStatuses {
		glog.V(10).Infof("Status Manager: syncPod in syncbatch. pod UID: %v", update.podUID)
ditto.
We usually use %q for strings.
ack
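The updateStatusInternal hunk earlier performs the channel send inside a select, Go's idiom for a non-blocking send: when the buffered channel is full, a default case bails out instead of blocking the caller, and the periodic syncBatch sweep picks the update up later. A minimal standalone sketch of that pattern, with statusSyncRequest as a simplified hypothetical stand-in for the kubelet's podStatusSyncRequest:

```go
package main

import "fmt"

// statusSyncRequest is a simplified stand-in for the kubelet's podStatusSyncRequest.
type statusSyncRequest struct {
	podUID string
}

// enqueue attempts a non-blocking send: it succeeds when the buffer has room
// and falls through to default (returning false) when the channel is full.
func enqueue(ch chan statusSyncRequest, req statusSyncRequest) bool {
	select {
	case ch <- req:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan statusSyncRequest, 1) // buffered, like m.podStatusChannel

	fmt.Println(enqueue(ch, statusSyncRequest{podUID: "uid-1"})) // true: buffer has room
	fmt.Println(enqueue(ch, statusSyncRequest{podUID: "uid-2"})) // false: buffer full, no receiver
}
```

The design choice is that status updates are best-effort through the channel; dropping a send is safe because the batch sync path reconciles state anyway.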
status := expectPodStatus(t, syncer, testPod)
for i := range status.ContainerStatuses {
	if status.ContainerStatuses[i].State.Terminated == nil {
		t.Errorf("expected containers to be terminated")
Use assert.
ack
}
for i := range status.InitContainerStatuses {
	if status.InitContainerStatuses[i].State.Terminated == nil {
		t.Errorf("expected init containers to be terminated")
ditto.
ack
		t.Errorf("expected init containers to be terminated")
	}
}
if status.Phase != v1.PodFailed {
ditto.
ack
@@ -583,6 +583,29 @@ func TestStaticPod(t *testing.T) {
	})
}

func TestTerminatePod(t *testing.T) {
	syncer := newTestManager(&fake.Clientset{})
	testPod := getTestPod()
Could you set a fake status here with a Running phase and a running container? And add a comment about why we are adding this test.
done
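The fixture the reviewer asked for (a Running pod with a running container, so the assertions cannot pass vacuously against an empty status) can be sketched with simplified local types; these are hypothetical stand-ins, not the Kubernetes v1 API structs the real test uses:

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Kubernetes v1 API types.
type containerState struct {
	Running    bool
	Terminated bool
}

type containerStatus struct {
	Name  string
	State containerState
}

type podStatus struct {
	Phase             string
	ContainerStatuses []containerStatus
}

// terminate mimics what the test expects TerminatePod to achieve: every
// container ends up Terminated and the pod phase becomes Failed.
func terminate(s podStatus) podStatus {
	s.Phase = "Failed"
	for i := range s.ContainerStatuses {
		s.ContainerStatuses[i].State = containerState{Terminated: true}
	}
	return s
}

func main() {
	// Seed with a Running pod and a running container, so the test would
	// catch TerminatePod silently keeping the stale pre-termination status.
	before := podStatus{
		Phase: "Running",
		ContainerStatuses: []containerStatus{
			{Name: "containerA", State: containerState{Running: true}},
		},
	}
	after := terminate(before)
	fmt.Printf("phase=%q terminated=%v\n", after.Phase, after.ContainerStatuses[0].State.Terminated)
}
```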
/lgtm
/approved
Good catch @dashpole
Can someone add the approved label? Not sure why @vishh's comment didn't trigger it...
/lgtm Thanks for the fix!
[APPROVALNOTIFIER] This PR is APPROVED. The following people have approved this PR: dashpole, yujuhong. Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing
Automatic merge from submit-queue
Does this need backporting to 1.5? Violating the phase order is a very large problem; it wasn't clear if this was introduced in 1.6 or earlier.
Git history says this was introduced by #22155, which was merged before the 1.2 release. However, since a terminated pod is almost always deleted immediately, I have not been able to replicate the issues that #40239 encountered. Since #40239 delays deletion of pods in most cases, this bug manifested as a long delay in deletion with very high frequency. @smarterclayton I would recommend cherrypicking this to 1.5. Thanks for bringing that up.
The v1.4 branch is also getting new patch releases. We should cherrypick into that as well.
Can someone add the cherrypick candidate label?
Removing label
Automatic merge from submit-queue (batch tested with PRs 41466, 41456, 41550, 41238, 41416): Delay Deletion of a Pod until volumes are cleaned up. #41436 fixed the bug that caused #41095 and #40239 to have to be reverted. Now that the bug is fixed, this shouldn't cause problems. @vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc
Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.
In TerminatePod, we previously passed pod.Status to updateStatusInternal. This is a bug, since that is the original status we were given: not only does it skip updates made to container statuses, but in some cases it reverted the pod's status to an earlier version, because the status passed in was stale.
This was the case in #40239 and #41095. As shown in #40239, the pod's status was set to running after it was set to failed, occasionally causing very long delays in pod deletion since we have to wait for this to be corrected.
This PR fixes the bug, adds some helpful debugging statements, and adds a unit test for TerminatePod (which, for some reason, didn't exist before).
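The bug described above can be reduced to a small sketch using hypothetical simplified types (not the real kubelet structs): starting TerminatePod from the stale pod.Status snapshot overwrites a newer cached status, while starting from the manager's own cache preserves it.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real kubelet types.
type podStatus struct {
	Phase string
}

type pod struct {
	UID    string
	Status podStatus // snapshot taken when the pod object was fetched; may be stale
}

type manager struct {
	podStatuses map[string]podStatus // latest status computed per pod UID
}

// terminatePodBuggy mirrors the pre-fix shape: it starts from the (possibly
// stale) status carried on the pod object, silently reverting newer updates.
func (m *manager) terminatePodBuggy(p *pod) {
	m.podStatuses[p.UID] = p.Status
}

// terminatePodFixed starts from the manager's cached status, preserving
// updates (e.g. Phase "Failed") made since the pod object was fetched.
func (m *manager) terminatePodFixed(p *pod) {
	status, ok := m.podStatuses[p.UID]
	if !ok {
		status = p.Status
	}
	m.podStatuses[p.UID] = status
}

func main() {
	m := &manager{podStatuses: map[string]podStatus{}}
	p := &pod{UID: "uid-1", Status: podStatus{Phase: "Running"}} // stale snapshot

	// The manager has since recorded a newer status for this pod.
	m.podStatuses[p.UID] = podStatus{Phase: "Failed"}
	m.terminatePodBuggy(p)
	fmt.Println("buggy:", m.podStatuses[p.UID].Phase) // reverted to Running

	m.podStatuses[p.UID] = podStatus{Phase: "Failed"}
	m.terminatePodFixed(p)
	fmt.Println("fixed:", m.podStatuses[p.UID].Phase) // stays Failed
}
```

This is exactly the phase-order violation reported in #40239: a pod the manager had already marked failed appearing to run again.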
@kubernetes/sig-node-bugs @vishh @Random-Liu