Fixing build reason/message getting wiped out race condition #12873
Conversation
[test]
@bparees ptal
[testextended][extended:core(builds)]
pkg/build/controller/controller.go (outdated diff)
```diff
@@ -367,9 +368,9 @@ func (bc *BuildPodController) HandlePod(pod *kapi.Pod) error {
 		build.Status.StartTimestamp = &now
 	}
 	if err := bc.BuildUpdater.Update(build.Namespace, build); err != nil {
-		return fmt.Errorf("failed to update build %s/%s: %v", build.Namespace, build.Name, err)
+		return fmt.Errorf("Failed to update build %s/%s: %v", build.Namespace, build.Name, err)
```
keep this lowercase, it's how we do error message formatting.
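For context, the convention being referenced (general Go style, which this comment confirms the repo follows) is that error strings start lowercase and omit trailing punctuation, because they are usually wrapped into larger messages by callers. A minimal illustration:

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	// Lowercase, no trailing punctuation: the string composes cleanly
	// when a caller wraps it with more context.
	cause := errors.New("connection timed out")
	err := fmt.Errorf("failed to update build %s/%s: %v", "myns", "mybuild", cause)
	fmt.Println("build controller:", err)
	// Output: build controller: failed to update build myns/mybuild: connection timed out
}
```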
pkg/build/controller/controller.go (outdated diff)
```diff
@@ -465,6 +466,16 @@ func (bc *BuildDeleteController) HandleBuildDeletion(build *buildapi.Build) error {
 	return nil
 }
 
+// Decides whether or not to reset the currentReason and currentMessage based
```
godoc ("// resetReasonAndMessage decides whether.....")
pkg/build/controller/controller.go (outdated diff)
```go
	// Update the build object when it:
	// - Progresses to the next phase
	// - The reason has changed
	if (!hasBuildPodNameAnnotation(build) || previousPhase != currentPhase || previousReason != currentReason) && !buildutil.IsBuildComplete(build) {
```
|| previousMessage != currentMessage (yes i know, the Reason check probably covers this, but one day it might not)
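Spelled out, the amended condition might read as follows (a sketch based on the hunk above; `previousMessage` and `currentMessage` are assumed to be in scope alongside the other locals):

```go
	if (!hasBuildPodNameAnnotation(build) ||
		previousPhase != currentPhase ||
		previousReason != currentReason ||
		previousMessage != currentMessage) && // the suggested extra check
		!buildutil.IsBuildComplete(build) {
		// ... update the build ...
	}
```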
i'm still unconvinced this solves whatever race we're hitting. the controller already only updated the build if the phase had changed, which means it would only reset the message/reason when the phase changed. With these changes, the messages will still be reset under those conditions. The primary race condition I can think of is the following sequence:
(Ideally steps 4+5 would occur in reverse order: pod controller sees pod is running, updates the build to phase running, then the failure occurs and updates the build failure reason, but since the events are asynchronous there's no guarantee of which order they will occur in) this PR doesn't seem to address that scenario.
you can always run the extended tests locally

What i've discovered is that the build object update that is done here fails to update every time. Not sure if it's strongly connected with the overall failure or not.
i've done that by tailing the extended tests server instance logs that are stored here:

```
total 4856
-rw-r--r-- 1 codeflavor users    5494 Feb  9 02:49 container-docker-registry-1-0x4k8-registry.log
-rw-r--r-- 1 codeflavor users     591 Feb  9 02:49 container-router-2-9v7qq-router.log
-rw-r--r-- 1 codeflavor users   80580 Feb  9 02:49 container-statusfail-postcommithook-1-build-sti-build.log
-rw-r--r-- 1 codeflavor users 4876483 Feb  9 02:49 openshift.log
codeflavor:logs$ pwd
/tmp/openshift/test-extended/core/logs
```
Flake was openshift/origin-gce#12
@openshift/devex ptal
pkg/build/builder/sti.go (outdated diff)
```diff
-func convertS2IFailureType(reason s2iapi.StepFailureReason, message s2iapi.StepFailureMessage) (api.StatusReason, string) {
-	return api.StatusReason(reason), fmt.Sprintf("%s", message)
+func convertS2IFailureType(reason s2iapi.StepFailureReason, message s2iapi.StepFailureMessage) (api.BuildPhase, api.StatusReason, string) {
+	return api.BuildPhaseFailed, api.StatusReason(reason), fmt.Sprintf("%s", message)
```
unless we really expect different s2i failure types to result in different build phases, I don't think you should do this... it's basically a function w/ a hardcoded return value which doesn't make much sense. Just set the phase to failed explicitly in the places where you are calling this function. I know it's an extra line of code but it seems like a better separation of concerns.
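A sketch of that separation of concerns, assuming a caller roughly like the one below (`failure` and its fields are hypothetical; `convertS2IFailureType`, `s2iapi`, and `api.BuildPhaseFailed` come from the diff above):

```go
// convertS2IFailureType keeps mapping the s2i failure to a reason and
// message only, without hardcoding a phase into the return value.
func convertS2IFailureType(reason s2iapi.StepFailureReason, message s2iapi.StepFailureMessage) (api.StatusReason, string) {
	return api.StatusReason(reason), fmt.Sprintf("%s", message)
}

// At the call site, set the phase explicitly instead (failure is a
// hypothetical s2i step-failure value used for illustration):
build.Status.Phase = api.BuildPhaseFailed
build.Status.Reason, build.Status.Message = convertS2IFailureType(failure.Reason, failure.Message)
```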
```go
	build.Status.Reason = buildapi.StatusReasonMissingPushSecret
	build.Status.Message = buildapi.StatusMessageMissingPushSecret
	glog.V(4).Infof("Setting reason for pending build to %q due to missing secret %s/%s", build.Status.Reason, build.Namespace, secret.Name)
	if build.Status.Phase != buildapi.BuildPhaseFailed {
```
I don't see any change to deal w/ the problem you were having where the build object was stale and didn't reflect a "failed" state... what happened w/ that?
I tested that and the build object always matched the cache correctly, so I didn't think it was worth introducing that much new code for no reason.
then why were you hitting an issue with it yesterday?
Because retryBuildStatusUpdate (https://github.com/openshift/origin/blob/master/pkg/build/builder/common.go#L183) was not setting the build phase. So when it encountered an issue updating and had to retry, it was not overwriting the build phase with its failed phase, which meant the build never matched my new if statement and the switch/case statement was never skipped.
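For illustration, a minimal sketch of the retry behavior being described, using hypothetical stand-in types and client calls (the real code is retryBuildStatusUpdate in pkg/build/builder/common.go):

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the origin build types.
type BuildPhase string

type BuildStatus struct {
	Phase   BuildPhase
	Reason  string
	Message string
}

type Build struct {
	Name   string
	Status BuildStatus
}

// fetchBuild and updateBuild are hypothetical client calls standing in for
// the build client; updateBuild is assumed to fail on a stale object.
var fetchBuild func(name string) (*Build, error)
var updateBuild func(b *Build) error

// retryBuildStatusUpdate re-fetches the build on every attempt and re-applies
// the full desired status, including the phase. The bug described above was
// that the phase was not re-applied on retry, so after a conflict the build
// could be persisted without its failed phase, bypassing the new if check
// that guards the switch/case.
func retryBuildStatusUpdate(name string, desired BuildStatus, attempts int) error {
	for i := 0; i < attempts; i++ {
		build, err := fetchBuild(name)
		if err != nil {
			return err
		}
		build.Status.Phase = desired.Phase // the fix: carry the phase through each retry
		build.Status.Reason = desired.Reason
		build.Status.Message = desired.Message
		if err := updateBuild(build); err == nil {
			return nil
		}
	}
	return fmt.Errorf("failed to update build %s after %d attempts", name, attempts)
}

func main() {}
```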
ah, ok. so it was just never getting set properly into etcd in the first place.
Evaluated for origin testextended up to 2df71df
Evaluated for origin test up to 2df71df
continuous-integration/openshift-jenkins/test Running (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/240/) (Base Commit: c8e60fe)
continuous-integration/openshift-jenkins/test SUCCESS
continuous-integration/openshift-jenkins/testextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/1106/) (Base Commit: c8e60fe) (Extended Tests: core(builds))
only failure is a known/unrelated one. let's give this a shot!
@coreydaley out of interest, is this expected to resolve issue 12341?
@jim-minter yes /not-corey
i've updated the summary to include a fix ref to that flake issue.
@stevekuznetsov ummm:
[merge]
@marun that's an errant DIND interaction with
[merge]
Evaluated for origin merge up to 2df71df
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/298/) (Base Commit: 774a283) (Image: devenv-rhel7_5927)
@coreydaley great job!
https://bugzilla.redhat.com/show_bug.cgi?id=1419810
https://bugzilla.redhat.com/show_bug.cgi?id=1415946
fixes #12341