Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing build reason/message getting wiped out race condition #12873

Merged
merged 1 commit into from
Feb 17, 2017
Merged

Fixing build reason/message getting wiped out race condition #12873

merged 1 commit into from
Feb 17, 2017

Conversation

coreydaley
Copy link
Member

@coreydaley coreydaley commented Feb 8, 2017

@coreydaley
Copy link
Member Author

coreydaley commented Feb 8, 2017

[test]

@coreydaley
Copy link
Member Author

@bparees ptal

@coreydaley coreydaley requested a review from bparees February 8, 2017 21:15
@bparees
Copy link
Contributor

bparees commented Feb 8, 2017

[testextended][extended:core(builds)]

@@ -367,9 +368,9 @@ func (bc *BuildPodController) HandlePod(pod *kapi.Pod) error {
build.Status.StartTimestamp = &now
}
if err := bc.BuildUpdater.Update(build.Namespace, build); err != nil {
return fmt.Errorf("failed to update build %s/%s: %v", build.Namespace, build.Name, err)
return fmt.Errorf("Failed to update build %s/%s: %v", build.Namespace, build.Name, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep this lowercase, it's how we do error message formatting.

@@ -465,6 +466,16 @@ func (bc *BuildDeleteController) HandleBuildDeletion(build *buildapi.Build) erro
return nil
}

// Decides whether or not to reset the currentReason and currentMessage based
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

godoc ("// resetReasonAndMessage decides whether.....")

// Update the build object when it:
// - Progresses to the next phase
// - The reason has changed
if (!hasBuildPodNameAnnotation(build) || previousPhase != currentPhase || previousReason != currentReason) && !buildutil.IsBuildComplete(build) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

|| previousMessage != currentMessage (yes i know, the Reason check probably covers this, but one day it might not)

@bparees
Copy link
Contributor

bparees commented Feb 8, 2017

i'm still unconvinced this solves whatever race we're hitting.

the controller already only updated the build if the phase had changed, which means it would only reset the message/reason when the phase changed. With these changes, the messages will still be reset under those conditions.

The primary race condition I can think of is the following sequence:

  1. build pod gets created
  2. build pod moves to pending. pod controller sets message/reason to empty and build phase to pending.
  3. build pod starts running
  4. build encounters a failure, sets the build failure reason
  5. pod controller sees the "pod is running" event, resets the reason and updates the build object (because the phase changed from pending to running), wiping out the failure reason in the process.

(Ideally steps 4+5 would occur in reverse order: pod controller sees pod is running, updates the build to phase running, then the failure occurs and updates the build failure reason, but since the events are asynchronous there's no guarantee of which order they will occur in)

this PR doesn't seem to address that scenario.

@bparees bparees self-assigned this Feb 8, 2017
@PI-Victor
Copy link
Contributor

you can always run the extended tests locally
test/extended/core.sh --ginkgo.focus="update failure"
if you run the tests locally a couple of times, one of them is bound to fail.

What i've discovered is that the build object update that is done here fails to update every time. Not sure if it's strongly connected with the overall failure or not.

I0209 02:47:56.353187    7378 controller.go:136] Failed to record changes to build extended-test-update-buildstatus-jh2q1-l22zb/statusfail-fetchsourcedocker-1: Operation cannot be fulfilled on builds "statusfail-fetchsourcedocker-1": the object has been modified; please apply your changes to the latest version and try again

@PI-Victor
Copy link
Contributor

PI-Victor commented Feb 9, 2017

i've done that by tailing the extended tests server instance logs that are stored here:

total 4856
-rw-r--r-- 1 codeflavor users    5494 Feb  9 02:49 container-docker-registry-1-0x4k8-registry.log
-rw-r--r-- 1 codeflavor users     591 Feb  9 02:49 container-router-2-9v7qq-router.log
-rw-r--r-- 1 codeflavor users   80580 Feb  9 02:49 container-statusfail-postcommithook-1-build-sti-build.log
-rw-r--r-- 1 codeflavor users 4876483 Feb  9 02:49 openshift.log
codeflavor:logs$ pwd
/tmp/openshift/test-extended/core/logs

@smarterclayton
Copy link
Contributor

Flake was openshift/origin-gce#12

@openshift-bot openshift-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 13, 2017
@coreydaley coreydaley changed the title (WIP) Fixing build reason getting wiped out race condition Fixing build reason/message getting wiped out race condition Feb 15, 2017
@coreydaley
Copy link
Member Author

@openshift/devex ptal

func convertS2IFailureType(reason s2iapi.StepFailureReason, message s2iapi.StepFailureMessage) (api.StatusReason, string) {
return api.StatusReason(reason), fmt.Sprintf("%s", message)
func convertS2IFailureType(reason s2iapi.StepFailureReason, message s2iapi.StepFailureMessage) (api.BuildPhase, api.StatusReason, string) {
return api.BuildPhaseFailed, api.StatusReason(reason), fmt.Sprintf("%s", message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless we really expect different s2i failure types to result in different build phases, I don't think you should do this... it's basically a function w/ a hardcoded return value which doesn't make much sense. Just set the phase to failed explicitly in the places where you are calling this function. I know it's an extra line of code but it seems like a better separation of concerns.

build.Status.Reason = buildapi.StatusReasonMissingPushSecret
build.Status.Message = buildapi.StatusMessageMissingPushSecret
glog.V(4).Infof("Setting reason for pending build to %q due to missing secret %s/%s", build.Status.Reason, build.Namespace, secret.Name)
if build.Status.Phase != buildapi.BuildPhaseFailed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any change to deal w/ the problem you were having where the build object was stale and didn't reflect a "failed" state... what happened w/ that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested that and the build object always matched the cache correctly, so I didn't think it was worth introducing that much new code for no reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then why were you hitting an issue with it yesterday?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the retryBuildStatusUpdate (https://github.com/openshift/origin/blob/master/pkg/build/builder/common.go#L183) was not setting the build phase. So when it encountered an issue updating and had to retry, it was not overwriting the build phase with it's failed phase, so it wasn't hitting my new if statement and skipping the switch/case statement.

ah, ok. so it was just never getting set properly into etcd in the first place.

@coreydaley
Copy link
Member Author

Because the retryBuildStatusUpdate (https://github.com/openshift/origin/blob/master/pkg/build/builder/common.go#L183) was not setting the build phase. So when it encountered an issue updating and had to retry, it was not overwriting the build phase with it's failed phase, so it wasn't hitting my new if statement and skipping the switch/case statement.

@openshift-bot
Copy link
Contributor

Evaluated for origin testextended up to 2df71df

@openshift-bot
Copy link
Contributor

Evaluated for origin test up to 2df71df

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test Running (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/240/) (Base Commit: c8e60fe)

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test SUCCESS

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/testextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/1106/) (Base Commit: c8e60fe) (Extended Tests: core(builds))

@bparees
Copy link
Contributor

bparees commented Feb 15, 2017

only failure is a known/unrelated one. let's give this a shot!
[merge]

@jim-minter
Copy link
Contributor

@coreydaley is this expected to resolve issue 12341 out of interest?

@bparees
Copy link
Contributor

bparees commented Feb 16, 2017

@coreydaley is this expected to resolve issue 12341 out of interest?

@jim-minter yes

/not-corey

@bparees
Copy link
Contributor

bparees commented Feb 16, 2017

i've updated the summary to include a fix ref to that flake issue.

@bparees
Copy link
Contributor

bparees commented Feb 16, 2017

@stevekuznetsov ummm:

14:52:52 No match for argument: docker-v1.10-migrator
14:52:52 No match for argument: glibc-all-langpacks
14:52:52 No match for argument: selinux-policy-minimum
14:52:52 Error: No packages marked for removal.
14:52:53 The command '/bin/sh -c dnf -y update && dnf -y install docker glibc-langpack-en iptables openssh-clients openssh-server && dnf -y remove docker-v1.10-migrator glibc-all-langpacks selinux-policy-minimum && dnf clean all' returned a non-zero code: 1

[merge]

@stevekuznetsov
Copy link
Contributor

@marun that's an errant DIND interaction with dnf ...

@bparees
Copy link
Contributor

bparees commented Feb 17, 2017

[merge]

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to 2df71df

@openshift-bot
Copy link
Contributor

openshift-bot commented Feb 17, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/298/) (Base Commit: 774a283) (Image: devenv-rhel7_5927)

@openshift-bot openshift-bot merged commit 15e6cf4 into openshift:master Feb 17, 2017
@PI-Victor
Copy link
Contributor

@coreydaley great job!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

build status update extended tests flaking
7 participants