Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in status manager TerminatePod #41436

Merged
merged 1 commit into from
Feb 15, 2017

Conversation

dashpole
Copy link
Contributor

In TerminatePod, we previously pass pod.Status to updateStatusInternal. This is a bug, since it is the original status that we are given. Not only does it skip updates made to container statuses, but in some cases it reverted the pod's status to an earlier version, since it was being passed a stale status initially.

This was the case in #40239 and #41095. As shown in #40239, the pod's status is set to running after it is set to failed, occasionally causing very long delays in pod deletion since we have to wait for this to be corrected.

This PR fixes the bug, adds some helpful debugging statements, and adds a unit test for TerminatePod (which for some reason didnt exist before?).

@kubernetes/sig-node-bugs @vish @Random-Liu

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 14, 2017
@k8s-reviewable
Copy link

This change is Reviewable

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Feb 14, 2017
@Random-Liu
Copy link
Member

/cc @yujuhong

@Random-Liu Random-Liu added release-note-none Denotes a PR that doesn't merit a release note. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed release-note-label-needed labels Feb 14, 2017
@Random-Liu Random-Liu self-assigned this Feb 14, 2017
@@ -135,6 +135,8 @@ func (m *manager) Start() {
go wait.Forever(func() {
select {
case syncRequest := <-m.podStatusChannel:
glog.V(10).Infof("Status Manager: syncing pod: %v, with status: (%v, %v) from podStatusChannel",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V(10) seems a little bit too much, V(5) should be fine, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@@ -326,6 +328,8 @@ func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUp

select {
case m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}:
glog.V(10).Infof("Status Manager: adding pod: %v, with status: (%v, %v) to podStatusChannel",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually use %q for string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@@ -395,6 +399,7 @@ func (m *manager) syncBatch() {
}()

for _, update := range updatedStatuses {
glog.V(10).Infof("Status Manager: syncPod in syncbatch. pod UID: %v", update.podUID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually use %q for string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

status := expectPodStatus(t, syncer, testPod)
for i := range status.ContainerStatuses {
if status.ContainerStatuses[i].State.Terminated == nil {
t.Errorf("expected containers to be terminated")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

}
for i := range status.InitContainerStatuses {
if status.InitContainerStatuses[i].State.Terminated == nil {
t.Errorf("expected init containers to be terminated")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

t.Errorf("expected init containers to be terminated")
}
}
if status.Phase != v1.PodFailed {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@@ -583,6 +583,29 @@ func TestStaticPod(t *testing.T) {
})
}

func TestTerminatePod(t *testing.T) {
syncer := newTestManager(&fake.Clientset{})
testPod := getTestPod()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you set a fake status here with Running phase, and running container? And add some comment about why we add this test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Random-Liu
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2017
@vishh
Copy link
Contributor

vishh commented Feb 14, 2017

/approved

@vishh
Copy link
Contributor

vishh commented Feb 14, 2017

Good catch @dashpole

@dashpole
Copy link
Contributor Author

can someone add the approved label? Not sure why @vishh 's comment didnt trigger it...

@yujuhong
Copy link
Contributor

/lgtm
/approve

Thanks for the fix!

@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

The following people have approved this PR: dashpole, yujuhong

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2017
@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit a57967f into kubernetes:master Feb 15, 2017
@dashpole dashpole deleted the status_bug branch February 15, 2017 05:03
@smarterclayton
Copy link
Contributor

Does this need backporting to 1.5? Violating the phase order is a very large problem, wasn't clear if this was introduced in 1.6 or earlier.

@dashpole
Copy link
Contributor Author

Git history says this was introduced by #22155, which was merged before the 1.2 release. However, since a terminated pod is almost always deleted immediately, I have not been able to replicate the issues that #40239 encountered. Since #40239 delays deletion of pods in most cases, this bug manifested as a long delay in deletion with very high frequency.
@smarterclayton I would recommend cherrypicking this to 1.5. Thanks for bringing that up.

@vishh
Copy link
Contributor

vishh commented Feb 15, 2017 via email

@dashpole
Copy link
Contributor Author

can someone add the cherrypick candidate label?

@k8s-cherrypick-bot
Copy link

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@yujuhong yujuhong added this to the v1.5 milestone Feb 16, 2017
k8s-github-robot pushed a commit that referenced this pull request Feb 16, 2017
Automatic merge from submit-queue (batch tested with PRs 41466, 41456, 41550, 41238, 41416)

Delay Deletion of a Pod until volumes are cleaned up

#41436 fixed the bug that caused #41095 and #40239 to have to be reverted.  Now that the bug is fixed, this shouldn't cause problems.

 @vishh @derekwaynecarr @sjenning @jingxu97 @kubernetes/sig-storage-misc
@k8s-cherrypick-bot
Copy link

Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants