-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
apps: deployment config stuck in the new state should respect timeoutSeconds #17000
Conversation
need unit test... |
// In case we fail to created the deployer pod within strategy | ||
// timeoutSeconds do not leave the deployment config in New state forever | ||
// but timeout. | ||
if deployutil.ConfigExceededTimeoutSeconds(config, latestRC) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mfojtik I think you need to cancel that deployment (RC) as well .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i will rather transition it to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the DC part, cancellation of latest deployment must be part of the Handle() because I need client access.
@tnozicka @Kargakis @smarterclayton PTAL Basically I use |
@@ -113,6 +113,7 @@ const ( | |||
DeploymentCancelledNewerDeploymentExists = "newer deployment was found running" | |||
DeploymentFailedUnrelatedDeploymentExists = "unrelated pod with the same name as this deployment is already running" | |||
DeploymentFailedDeployerPodNoLongerExists = "deployer pod no longer exists" | |||
DeploymentFailedUnableToCreateDeployerPod = "unable to create deployer pod" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks nice in oc status
:
dc/test deploys docker.io/library/centos:7
deployment #4 failed 44 minutes ago: unable to create deployer pod
deployment #3 deployed 2 hours ago - 1 pod
deployment #2 deployed 2 hours ago
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer this code in DeploymentController instead of DeploymentConfigController. Logically this is about watching RC and transitioning it to FailedPhase. DeploymentController is the state machine that updates deployment phases and doing it from outside will only force conflicts.
if !deployutil.IsNewDeployment(deployment) { | ||
return nil | ||
} | ||
return retry.RetryOnConflict(retry.DefaultBackoff, func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if the deployment transitions to succeeded?
you have declared than you want to act only on state new but now you ignore it with retry...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think a check if the deployment is still new is ok, honestly I don't think that will ever happen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mfojtik 409?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by getting 409 none of your previous assertions is no longer valid. and transitioning from Succeeded
to Failed
might reveal unexpected side effects
pkg/apps/util/util.go
Outdated
@@ -782,6 +788,23 @@ func DeploymentsForCleanup(configuration *deployapi.DeploymentConfig, deployment | |||
return relevantDeployments | |||
} | |||
|
|||
func ConfigExceededTimeoutSeconds(config *deployapi.DeploymentConfig, latestRC *v1.ReplicationController) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DeploymentExceededTimeoutSeconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, I was lazy to change it, now you beat me to it :)
pkg/apps/util/util.go
Outdated
var timeoutSeconds int64 | ||
if params := config.Spec.Strategy.RollingParams; params != nil { | ||
timeoutSeconds = deployapi.DefaultRollingTimeoutSeconds | ||
if params.TimeoutSeconds != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the wrong property check the description here
origin/pkg/apps/apis/apps/types.go
Lines 272 to 273 in fb1679e
// TimeoutSeconds is the time to wait for updates before giving up. If the | |
// value is nil, a default will be used. |
Seems like activeDeadlineSeconds is what we use for timeout
origin/pkg/apps/apis/apps/types.go
Lines 219 to 221 in fb1679e
// ActiveDeadlineSeconds is the duration in seconds that the deployer pods for this deployment | |
// config may be active on a node before the system actively tries to terminate them. | |
ActiveDeadlineSeconds *int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also general timeout shouldn't be dependent on strategy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want // TimeoutSeconds is the time to wait for updates before giving up. If the
// value is nil, a default will be used.
Because literally this thing is about not getting any progress since we are not able to create deployer pod. ActiveDeadlineSeconds sets the duration for the deployer pod, but in this case there is no deployer pod.
pkg/apps/util/util.go
Outdated
timeoutSeconds = *params.TimeoutSeconds | ||
} | ||
} | ||
return int64(time.Since(latestRC.CreationTimestamp.Time)*time.Second) > timeoutSeconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to check how we do rollbacks because in that case creationTimestamp might be an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the RC has to be in "new" state for this to trigger... it is only in new when it is new :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we do rollbacks differently from upstream anyway (we always create a new RC).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mfojtik this is what I had in mind. The only question remaining for me is where we take that timeout
@@ -128,6 +128,7 @@ func (c *DeploymentConfigController) Handle(config *deployapi.DeploymentConfig) | |||
return err | |||
} | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover?
/retest |
@mfojtik @smarterclayton @Kargakis I feel like we are missing proper deployment timeout in DC API object and closest to it is DC.Spec.Strategy.ActiveDeadlineSeconds. Timeout should be property of DC.Spec.Strategy (or DC.Spec) not part of a particular strategy. And CustomDeploymentStrategyParams is missing the timeout @mfojtik wants to use. I think it would be better to reshape activeDeadlineSeconds as that's common for all of them and also used as a timeout for deployer pod. |
flake: #17024 /retest |
/retest |
if deployutil.RolloutExceededTimeoutSeconds(config, deployment) { | ||
nextStatus = deployapi.DeploymentStatusFailed | ||
updatedAnnotations[deployapi.DeploymentStatusReasonAnnotation] = deployapi.DeploymentFailedUnableToCreateDeployerPod | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mfojtik this would be a good place to log
pkg/apps/util/util.go
Outdated
} | ||
} | ||
// For "custom" strategy use the default for recreate strategy. | ||
if timeoutSeconds == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will work. If one of the strategies will have TimeoutSeconds=0 this will overwrite it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should probably switch on strategy type
f06e179
to
24c9a78
Compare
pkg/apps/util/util.go
Outdated
// (like quota, etc...). In that case deployer controller use this function to | ||
// measure if the created deployment (RC) exceeded the timeout. | ||
func RolloutExceededTimeoutSeconds(config *deployapi.DeploymentConfig, latestRC *v1.ReplicationController) bool { | ||
return int64(time.Since(latestRC.CreationTimestamp.Time).Seconds()) > GetTimeoutSecondsForStrategy(config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if none of the strategy matches or if timeout is set explicitly set to 0 it will always fail the deployment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tnozicka 1) can never happen because we guard in validation?, 2) isn't that expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usually 0 means ignore the timeout as it wouldn't make sense otherwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timeoutSeconds is optional an if not specified then it defaults to 600s. Setting explicitly to 0 means that I want 0 timeout because i'm crazy ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setting 0 mean I want no timeout because if I don't specify it, it defaults to 600
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, you beat me to it.
@@ -586,3 +586,22 @@ func TestRemoveCondition(t *testing.T) { | |||
} | |||
} | |||
} | |||
|
|||
func TestRolloutExceededTimeoutSeconds(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see a test table containing setting different strategies and not specified timeouts mixed with creationTimestamps values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
// deployer pod (quota, etc..) we should respect the timeoutSeconds in the | ||
// config strategy and transition the rollout to failed instead of waiting for | ||
// the deployment pod forever. | ||
config, err := deployutil.DecodeDeploymentConfig(deployment, c.codec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
long term, we want to get rid of this... we should copy the timeoutSecond into annotation so we don't have to decode here... this also apply to other fields we check i guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkg/apps/util/util.go
Outdated
func RolloutExceededTimeoutSeconds(config *deployapi.DeploymentConfig, latestRC *v1.ReplicationController) bool { | ||
timeoutSeconds := GetTimeoutSecondsForStrategy(config) | ||
// If user set the timeoutSeconds to 0, we assume there should be no timeout. | ||
if timeoutSeconds == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, defensive would be to check timeoutSeconds <= 0
nextStatus = deployapi.DeploymentStatusFailed | ||
updatedAnnotations[deployapi.DeploymentStatusReasonAnnotation] = deployapi.DeploymentFailedUnableToCreateDeployerPod | ||
c.emitDeploymentEvent(deployment, v1.EventTypeWarning, "RolloutTimeout", fmt.Sprintf("Rollout for %q failed to create deployer pod (timeoutSeconds: %ds)", deployutil.LabelForDeploymentV1(deployment), deployutil.GetTimeoutSecondsForStrategy(config))) | ||
glog.V(4).Infof("Failing deployment %s/%s as we timeout out while waiting for the deployer pod to be created", deployment.Namespace, deployment.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/timeout out/timed out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
: reached timeout while
pkg/apps/util/util_test.go
Outdated
now := time.Now() | ||
tests := []struct { | ||
name string | ||
config func(int64) *deployapi.DeploymentConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/func(int64) *deployapi.DeploymentConfig/*deployapi.DeploymentConfig/
pkg/apps/util/util_test.go
Outdated
tests := []struct { | ||
name string | ||
config func(int64) *deployapi.DeploymentConfig | ||
timeoutSeconds int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed
pkg/apps/util/util_test.go
Outdated
name string | ||
config func(int64) *deployapi.DeploymentConfig | ||
timeoutSeconds int64 | ||
deploymentCreationTime time.Time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed
config: func(timeoutSeconds int64) *deployapi.DeploymentConfig { | ||
config := deploytest.OkDeploymentConfig(1) | ||
config.Spec.Strategy.RecreateParams.TimeoutSeconds = &timeoutSeconds | ||
return config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fill in timeout and creationTime
pkg/apps/util/util_test.go
Outdated
config := deploytest.OkDeploymentConfig(1) | ||
config.Spec.Strategy.RecreateParams.TimeoutSeconds = &timeoutSeconds | ||
return config | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call the lambda immediately == s/}/}()/
ab2eeea
to
38cc230
Compare
config: func(timeoutSeconds int64) *deployapi.DeploymentConfig { | ||
config := deploytest.OkDeploymentConfig(1) | ||
config.Spec.Strategy = deploytest.OkRollingStrategy() | ||
config.Spec.Strategy.RollingParams.TimeoutSeconds = &timeoutSeconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would assign int64ptr(10)
directly here without the whole taking is as argument thing
pkg/apps/util/util_test.go
Outdated
if tc.expectTimeout && !gotTimeout { | ||
t.Errorf("[%s]: expected timeout, but got no timeout", tc.name) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mfojtik what if !tc.expectTimeout && gotTimeout
?
config := deploytest.OkDeploymentConfig(1) | ||
config.Spec.Strategy.RecreateParams.TimeoutSeconds = nil | ||
return config | ||
}(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't take it as you have to fix it because I told you, you don't have to but check how the argument is pointless here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷♂️
pkg/apps/util/util.go
Outdated
OwnerReferences: []metav1.OwnerReference{*controllerRef}, | ||
Labels: controllerLabels, | ||
OwnerReferences: []metav1.OwnerReference{*controllerRef}, | ||
CreationTimestamp: defaultCreationTimestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mfojtik why? we will now send objects to API server with CreationTimestamp set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh, I didn't notice this is not just for test...
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mfojtik, tnozicka The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
Automatic merge from submit-queue (batch tested with PRs 17020, 17026, 17000, 17010). |
There may be strategies in the future that don't use a pod, so there
might be no timeout. Probably needs to be discussed around long term
behavior of DC
|
Fixes: #16962
With this patch the deployment config controller will set the deployment as failed (timeout) after it reaches timeoutSeconds and the status of the deployment is 'new'. This generally happens when the deployment is not able to create the deployer pod (quota). We should not wait infinitely to have the quota.