Fixing build reason/message getting wiped out race condition #12873

coreydaley · 2017-02-08T18:11:25Z

https://bugzilla.redhat.com/show_bug.cgi?id=1419810
https://bugzilla.redhat.com/show_bug.cgi?id=1415946

fixes #12341

coreydaley · 2017-02-08T18:20:19Z

[test]

coreydaley · 2017-02-08T21:14:58Z

@bparees ptal

bparees · 2017-02-08T22:23:12Z

[testextended][extended:core(builds)]

bparees · 2017-02-08T22:29:44Z

pkg/build/controller/controller.go

@@ -367,9 +368,9 @@ func (bc *BuildPodController) HandlePod(pod *kapi.Pod) error {
 			build.Status.StartTimestamp = &now
 		}
 		if err := bc.BuildUpdater.Update(build.Namespace, build); err != nil {
-			return fmt.Errorf("failed to update build %s/%s: %v", build.Namespace, build.Name, err)
+			return fmt.Errorf("Failed to update build %s/%s: %v", build.Namespace, build.Name, err)


keep this lowercase, it's how we do error message formatting.

bparees · 2017-02-08T22:30:17Z

pkg/build/controller/controller.go

@@ -465,6 +466,16 @@ func (bc *BuildDeleteController) HandleBuildDeletion(build *buildapi.Build) erro
 	return nil
 }

+// Decides whether or not to reset the currentReason and currentMessage based


godoc ("// resetReasonAndMessage decides whether.....")

bparees · 2017-02-08T22:33:29Z

pkg/build/controller/controller.go

+	// Update the build object when it:
+	// - Progresses to the next phase
+	// - The reason has changed
+	if (!hasBuildPodNameAnnotation(build) || previousPhase != currentPhase || previousReason != currentReason) && !buildutil.IsBuildComplete(build) {


|| previousMessage != currentMessage (yes i know, the Reason check probably covers this, but one day it might not)

bparees · 2017-02-08T23:06:49Z

i'm still unconvinced this solves whatever race we're hitting.

the controller already only updated the build if the phase had changed, which means it would only reset the message/reason when the phase changed. With these changes, the messages will still be reset under those conditions.

The primary race condition I can think of is the following sequence:

build pod gets created
build pod moves to pending. pod controller sets message/reason to empty and build phase to pending.
build pod starts running
build encounters a failure, sets the build failure reason
pod controller sees the "pod is running" event, resets the reason and updates the build object (because the phase changed from pending to running), wiping out the failure reason in the process.

(Ideally steps 4+5 would occur in reverse order: pod controller sees pod is running, updates the build to phase running, then the failure occurs and updates the build failure reason, but since the events are asynchronous there's no guarantee of which order they will occur in)

this PR doesn't seem to address that scenario.

PI-Victor · 2017-02-09T10:20:38Z

you can always run the extended tests locally
test/extended/core.sh --ginkgo.focus="update failure"
if you run the tests locally a couple of times, one of them is bound to fail.

What i've discovered is that the build object update that is done here fails to update every time. Not sure if it's strongly connected with the overall failure or not.

I0209 02:47:56.353187    7378 controller.go:136] Failed to record changes to build extended-test-update-buildstatus-jh2q1-l22zb/statusfail-fetchsourcedocker-1: Operation cannot be fulfilled on builds "statusfail-fetchsourcedocker-1": the object has been modified; please apply your changes to the latest version and try again

PI-Victor · 2017-02-09T12:03:54Z

i've done that by tailing the extended tests server instance logs that are stored here:

total 4856
-rw-r--r-- 1 codeflavor users    5494 Feb  9 02:49 container-docker-registry-1-0x4k8-registry.log
-rw-r--r-- 1 codeflavor users     591 Feb  9 02:49 container-router-2-9v7qq-router.log
-rw-r--r-- 1 codeflavor users   80580 Feb  9 02:49 container-statusfail-postcommithook-1-build-sti-build.log
-rw-r--r-- 1 codeflavor users 4876483 Feb  9 02:49 openshift.log
codeflavor:logs$ pwd
/tmp/openshift/test-extended/core/logs

smarterclayton · 2017-02-09T18:31:53Z

Flake was openshift/origin-gce#12

coreydaley · 2017-02-15T19:26:00Z

@openshift/devex ptal

bparees · 2017-02-15T20:00:30Z

pkg/build/builder/sti.go

-func convertS2IFailureType(reason s2iapi.StepFailureReason, message s2iapi.StepFailureMessage) (api.StatusReason, string) {
-	return api.StatusReason(reason), fmt.Sprintf("%s", message)
+func convertS2IFailureType(reason s2iapi.StepFailureReason, message s2iapi.StepFailureMessage) (api.BuildPhase, api.StatusReason, string) {
+	return api.BuildPhaseFailed, api.StatusReason(reason), fmt.Sprintf("%s", message)


unless we really expect different s2i failure types to result in different build phases, I don't think you should do this... it's basically a function w/ a hardcoded return value which doesn't make much sense. Just set the phase to failed explicitly in the places where you are calling this function. I know it's an extra line of code but it seems like a better separation of concerns.

bparees · 2017-02-15T20:02:00Z

pkg/build/controller/controller.go

-				build.Status.Reason = buildapi.StatusReasonMissingPushSecret
-				build.Status.Message = buildapi.StatusMessageMissingPushSecret
-				glog.V(4).Infof("Setting reason for pending build to %q due to missing secret %s/%s", build.Status.Reason, build.Namespace, secret.Name)
+	if build.Status.Phase != buildapi.BuildPhaseFailed {


I don't see any change to deal w/ the problem you were having where the build object was stale and didn't reflect a "failed" state... what happened w/ that?

I tested that and the build object always matched the cache correctly, so I didn't think it was worth introducing that much new code for no reason.

then why were you hitting an issue with it yesterday?

Because the retryBuildStatusUpdate (https://github.com/openshift/origin/blob/master/pkg/build/builder/common.go#L183) was not setting the build phase. So when it encountered an issue updating and had to retry, it was not overwriting the build phase with it's failed phase, so it wasn't hitting my new if statement and skipping the switch/case statement.

ah, ok. so it was just never getting set properly into etcd in the first place.

coreydaley · 2017-02-15T20:23:00Z

Because the retryBuildStatusUpdate (https://github.com/openshift/origin/blob/master/pkg/build/builder/common.go#L183) was not setting the build phase. So when it encountered an issue updating and had to retry, it was not overwriting the build phase with it's failed phase, so it wasn't hitting my new if statement and skipping the switch/case statement.

https://bugzilla.redhat.com/show_bug.cgi?id=1419810 https://bugzilla.redhat.com/show_bug.cgi?id=1415946

openshift-bot · 2017-02-15T20:37:34Z

Evaluated for origin testextended up to 2df71df

openshift-bot · 2017-02-15T20:37:41Z

Evaluated for origin test up to 2df71df

openshift-bot · 2017-02-15T20:37:51Z

continuous-integration/openshift-jenkins/test Running (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/240/) (Base Commit: c8e60fe)

openshift-bot · 2017-02-15T22:07:12Z

continuous-integration/openshift-jenkins/test SUCCESS

openshift-bot · 2017-02-15T22:29:29Z

continuous-integration/openshift-jenkins/testextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/1106/) (Base Commit: c8e60fe) (Extended Tests: core(builds))

bparees · 2017-02-15T22:35:19Z

only failure is a known/unrelated one. let's give this a shot!
[merge]

jim-minter · 2017-02-16T09:42:19Z

@coreydaley is this expected to resolve issue 12341 out of interest?

bparees · 2017-02-16T13:53:31Z

@coreydaley is this expected to resolve issue 12341 out of interest?

@jim-minter yes

/not-corey

bparees · 2017-02-16T13:53:58Z

i've updated the summary to include a fix ref to that flake issue.

bparees · 2017-02-16T20:51:35Z

@stevekuznetsov ummm:

14:52:52 No match for argument: docker-v1.10-migrator
14:52:52 No match for argument: glibc-all-langpacks
14:52:52 No match for argument: selinux-policy-minimum
14:52:52 Error: No packages marked for removal.
14:52:53 The command '/bin/sh -c dnf -y update && dnf -y install docker glibc-langpack-en iptables openssh-clients openssh-server && dnf -y remove docker-v1.10-migrator glibc-all-langpacks selinux-policy-minimum && dnf clean all' returned a non-zero code: 1

[merge]

bparees · 2017-02-16T21:55:06Z

(failed job was https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_networking_future/273/consoleFull#812502775898c58db7602c31c0eab717)

stevekuznetsov · 2017-02-16T21:55:39Z

@marun that's an errant DIND interaction with dnf ...

bparees · 2017-02-17T15:16:09Z

[merge]

openshift-bot · 2017-02-17T15:21:22Z

Evaluated for origin merge up to 2df71df

openshift-bot · 2017-02-17T16:13:36Z

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/298/) (Base Commit: 774a283) (Image: devenv-rhel7_5927)

PI-Victor · 2017-02-17T18:08:32Z

@coreydaley great job!

coreydaley requested a review from bparees February 8, 2017 21:15

bparees reviewed Feb 8, 2017

View reviewed changes

bparees self-assigned this Feb 8, 2017

openshift-bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 13, 2017

coreydaley changed the title ~~(WIP) Fixing build reason getting wiped out race condition~~ Fixing build reason/message getting wiped out race condition Feb 15, 2017

bparees requested changes Feb 15, 2017

View reviewed changes

(WIP) Fixing build reason getting wiped out by race condition

2df71df

https://bugzilla.redhat.com/show_bug.cgi?id=1419810 https://bugzilla.redhat.com/show_bug.cgi?id=1415946

openshift-bot merged commit 15e6cf4 into openshift:master Feb 17, 2017

bparees mentioned this pull request Mar 6, 2017

oc get builds reports wrong build duration #13254

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing build reason/message getting wiped out race condition #12873

Fixing build reason/message getting wiped out race condition #12873

coreydaley commented Feb 8, 2017 •

edited by bparees

Loading

coreydaley commented Feb 8, 2017 •

edited

Loading

coreydaley commented Feb 8, 2017

bparees commented Feb 8, 2017

bparees Feb 8, 2017

bparees Feb 8, 2017

bparees Feb 8, 2017

bparees commented Feb 8, 2017

PI-Victor commented Feb 9, 2017

PI-Victor commented Feb 9, 2017 •

edited

Loading

smarterclayton commented Feb 9, 2017

coreydaley commented Feb 15, 2017

bparees Feb 15, 2017

bparees Feb 15, 2017

coreydaley Feb 15, 2017

bparees Feb 15, 2017

bparees Feb 15, 2017

coreydaley commented Feb 15, 2017

openshift-bot commented Feb 15, 2017

openshift-bot commented Feb 15, 2017

openshift-bot commented Feb 15, 2017

openshift-bot commented Feb 15, 2017

openshift-bot commented Feb 15, 2017

bparees commented Feb 15, 2017

jim-minter commented Feb 16, 2017

bparees commented Feb 16, 2017

bparees commented Feb 16, 2017

bparees commented Feb 16, 2017

bparees commented Feb 16, 2017

stevekuznetsov commented Feb 16, 2017

bparees commented Feb 17, 2017

openshift-bot commented Feb 17, 2017

openshift-bot commented Feb 17, 2017 •

edited

Loading

PI-Victor commented Feb 17, 2017

Fixing build reason/message getting wiped out race condition #12873

Fixing build reason/message getting wiped out race condition #12873

Conversation

coreydaley commented Feb 8, 2017 • edited by bparees Loading

coreydaley commented Feb 8, 2017 • edited Loading

coreydaley commented Feb 8, 2017

bparees commented Feb 8, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bparees commented Feb 8, 2017

PI-Victor commented Feb 9, 2017

PI-Victor commented Feb 9, 2017 • edited Loading

smarterclayton commented Feb 9, 2017

coreydaley commented Feb 15, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coreydaley commented Feb 15, 2017

openshift-bot commented Feb 15, 2017

openshift-bot commented Feb 15, 2017

openshift-bot commented Feb 15, 2017

openshift-bot commented Feb 15, 2017

openshift-bot commented Feb 15, 2017

bparees commented Feb 15, 2017

jim-minter commented Feb 16, 2017

bparees commented Feb 16, 2017

bparees commented Feb 16, 2017

bparees commented Feb 16, 2017

bparees commented Feb 16, 2017

stevekuznetsov commented Feb 16, 2017

bparees commented Feb 17, 2017

openshift-bot commented Feb 17, 2017

openshift-bot commented Feb 17, 2017 • edited Loading

PI-Victor commented Feb 17, 2017

coreydaley commented Feb 8, 2017 •

edited by bparees

Loading

coreydaley commented Feb 8, 2017 •

edited

Loading

PI-Victor commented Feb 9, 2017 •

edited

Loading

openshift-bot commented Feb 17, 2017 •

edited

Loading