deploy: retry scaling when the admission caches are not fully synced #13279
Conversation
[test]
	}
	return false, scaleErr
})
if err != nil {
Pass the admission error along.
if err == wait.ErrWaitTimeout {
	err = fmt.Errorf("%v: %v", err, scaleErr)
}
pkg/deploy/cmd/test/support.go (outdated)
func (t *FakeLaggedScaler) Scale(namespace, name string, newSize uint, preconditions *kubectl.ScalePrecondition, retry, wait *kubectl.RetryParams) error {
	if t.RetryCount != 2 {
		t.RetryCount += 1
		return errors.NewForbidden(unversioned.GroupResource{Resource: "ReplicationController"}, name, fmt.Errorf("%s: not yet ready to handle request", name))
Add a comment that you are faking a real admission error.
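For illustration only, a sketch of how the fake could carry such a comment (import paths assume the vendored packages used elsewhere in this PR; the struct fields are guessed from the quoted hunk):

```go
package test

import (
	"fmt"

	"k8s.io/kubernetes/pkg/api/errors"
	"k8s.io/kubernetes/pkg/api/unversioned"
	"k8s.io/kubernetes/pkg/kubectl"
)

// FakeLaggedScaler simulates a scale request being rejected by the namespace
// lifecycle admission plugin while its caches are still warming up: the first
// two calls fail, the third succeeds.
type FakeLaggedScaler struct {
	RetryCount int
}

func (t *FakeLaggedScaler) Scale(namespace, name string, newSize uint, preconditions *kubectl.ScalePrecondition, retry, wait *kubectl.RetryParams) error {
	if t.RetryCount != 2 {
		t.RetryCount += 1
		// Fake the real admission error returned when the lifecycle plugin
		// caches are not fully synced yet.
		return errors.NewForbidden(unversioned.GroupResource{Resource: "ReplicationController"}, name, fmt.Errorf("%s: not yet ready to handle request", name))
	}
	return nil
}
```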
pkg/deploy/cmd/test/support.go (outdated)
}

func (t *FakeLaggedScaler) ScaleSimple(namespace, name string, preconditions *kubectl.ScalePrecondition, newSize uint) (string, error) {
	return "", fmt.Errorf("unexpected call to ScaleSimple")
Why isn't this expected?
It is just copy&paste from the FakeScaler, I will nuke this.
This is implemented only for Recreate and initial Rolling updates, right?
@Kargakis that is correct
Force-pushed from b83842e to f00d197 (Compare)
@Kargakis comments addressed, thanks!
[test]
I guess this is a problem that occurs during cluster upgrades when the API server cache needs to be resynced? Won't this be a problem for non-initial Rolling deployments?
[test]
[test]
Evaluated for origin test up to f00d197
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/445/) (Base Commit: 6e70ad5)
@Kargakis we need to get this in, as this is a blocker bug :-/
	if int32(replicas) == deployment.Spec.Replicas && int32(replicas) == deployment.Status.Replicas {
		return deployment, nil
	}
	if err := s.scaler.Scale(deployment.Namespace, deployment.Name, uint(replicas), &kubectl.ScalePrecondition{Size: -1, ResourceVersion: ""}, retry, wait); err != nil {
	var scaleErr error
	err := wait.PollImmediate(1*time.Second, 30*time.Second, func() (bool, error) {
TODO that the 30s should be proportionate to TimeoutSeconds found in the strategy.
sure
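A possible shape for that follow-up, sketched under the assumption that the strategy params expose `TimeoutSeconds *int64` as in the deployment API; the helper name is made up for illustration:

```go
package cmd // hypothetical placement, for illustration only

import "time"

// defaultScaleRetryTimeout is the fallback when the strategy does not set
// TimeoutSeconds.
const defaultScaleRetryTimeout = 30 * time.Second

// scaleRetryTimeout derives how long to keep retrying admission-rejected
// scale calls from the strategy's TimeoutSeconds, defaulting to 30s.
func scaleRetryTimeout(timeoutSeconds *int64) time.Duration {
	if timeoutSeconds == nil || *timeoutSeconds <= 0 {
		return defaultScaleRetryTimeout
	}
	return time.Duration(*timeoutSeconds) * time.Second
}
```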
This should work for installs so I am fine with merging. We should also probably handle it for upgrades for Rolling deployments but that will probably require changes in the rolling updater. [merge]
Evaluated for origin merge up to f00d197
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/445/) (Base Commit: 475ae9a) (Image: devenv-rhel7_6108)
// This error is returned when the lifecycle admission plugin cache is not fully
// synchronized. In that case the scaling should be retried.
//
// FIXME: The error returned from admission should not be forbidden but a come-back-later error.
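Until admission returns a retryable error instead of a forbidden one, the caller has to recognize this particular error itself; a hedged sketch of such a check (the helper name and the exact message match are assumptions, not the PR's code):

```go
package cmd // hypothetical placement, for illustration only

import (
	"strings"

	kerrors "k8s.io/kubernetes/pkg/api/errors"
)

// isLaggedAdmissionError reports whether err looks like the forbidden error
// the namespace lifecycle admission plugin returns while its caches are not
// fully synced, i.e. an error worth retrying instead of failing the
// deployment immediately.
func isLaggedAdmissionError(err error) bool {
	return err != nil &&
		kerrors.IsForbidden(err) &&
		strings.Contains(err.Error(), "not yet ready to handle request")
}
```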
Is there an issue to fix this?
We had the same problem in our environment and found that one of the controllers was having auth issues with the master. The controller was continuously trying to connect to the master and failing. After stopping that controller and the master, everything was fine.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1427992
This happens when the error at https://github.com/mfojtik/origin/blob/b83842eff0f6989f9735f483d4348ecc0d710ac7/vendor/k8s.io/kubernetes/plugin/pkg/admission/namespace/lifecycle/admission.go#L112 occurs. That leads to an immediate deployment failure without a retry.
This pull should keep retrying for 30 seconds (can be higher maybe? @Kargakis) with the assumption that the caches will get warmed up eventually and the scaling will succeed.
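Putting the pieces together, a rough sketch of the retry behaviour described above (function and variable names are illustrative, not the PR's exact code; import paths assume the vendored Kubernetes packages of that era):

```go
package cmd // hypothetical placement, for illustration only

import (
	"fmt"
	"time"

	kerrors "k8s.io/kubernetes/pkg/api/errors"
	"k8s.io/kubernetes/pkg/kubectl"
	"k8s.io/kubernetes/pkg/util/wait"
)

// scaleWithRetry keeps calling the scaler for up to 30 seconds, retrying when
// the request is rejected with the forbidden error the lifecycle admission
// plugin returns while its caches are still syncing.
func scaleWithRetry(scaler kubectl.Scaler, namespace, name string, replicas uint) error {
	var scaleErr error
	err := wait.PollImmediate(1*time.Second, 30*time.Second, func() (bool, error) {
		scaleErr = scaler.Scale(namespace, name, replicas,
			&kubectl.ScalePrecondition{Size: -1, ResourceVersion: ""}, nil, nil)
		if scaleErr == nil {
			return true, nil
		}
		if kerrors.IsForbidden(scaleErr) {
			// The admission caches are likely not synced yet; keep retrying.
			return false, nil
		}
		// Any other error is treated as fatal and stops the poll.
		return false, scaleErr
	})
	if err == wait.ErrWaitTimeout {
		// Surface the last admission error alongside the timeout.
		err = fmt.Errorf("%v: %v", err, scaleErr)
	}
	return err
}
```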