
deploy: retry scaling when the admission caches are not fully synced #13279

Merged 1 commit into openshift:master on Mar 30, 2017

Conversation

mfojtik (Contributor) commented Mar 7, 2017

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1427992

This happens when the error at https://github.com/mfojtik/origin/blob/b83842eff0f6989f9735f483d4348ecc0d710ac7/vendor/k8s.io/kubernetes/plugin/pkg/admission/namespace/lifecycle/admission.go#L112 occurs, i.e. when the namespace lifecycle admission plugin rejects the scale request because its caches are not fully synced yet. That leads to an immediate deployment failure without any retry.

This pull keeps retrying the scaling for 30 seconds (can be higher maybe? @Kargakis), on the assumption that the caches will eventually get warmed up and the scaling will succeed.
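
For context, here is a minimal sketch of the retry this pull adds around the scale call (not the actual diff; the package, helper name, and vendored import paths are assumptions, while the 1s/30s values and the forbidden-error handling follow the description above):

package deploy

import (
	"fmt"
	"time"

	kerrors "k8s.io/kubernetes/pkg/api/errors"
	"k8s.io/kubernetes/pkg/kubectl"
	"k8s.io/kubernetes/pkg/util/wait"
)

// scaleWithRetries keeps calling Scale while the admission plugin caches warm up.
// Sketch only; not the exact change in this pull request.
func scaleWithRetries(scaler kubectl.Scaler, namespace, name string, replicas uint, retry, waitParams *kubectl.RetryParams) error {
	var scaleErr error
	err := wait.PollImmediate(1*time.Second, 30*time.Second, func() (bool, error) {
		scaleErr = scaler.Scale(namespace, name, replicas, &kubectl.ScalePrecondition{Size: -1, ResourceVersion: ""}, retry, waitParams)
		if scaleErr == nil {
			return true, nil
		}
		// The namespace lifecycle admission plugin answers with a forbidden error
		// while its caches are still syncing; treat that as retryable.
		if kerrors.IsForbidden(scaleErr) {
			return false, nil
		}
		return false, scaleErr
	})
	// Surface the underlying admission error when the poll gives up.
	if err == wait.ErrWaitTimeout {
		err = fmt.Errorf("%v: %v", err, scaleErr)
	}
	return err
}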

mfojtik (Contributor, Author) commented Mar 7, 2017

[test]

mfojtik requested a review from 0xmichalis on March 7, 2017 11:50
}
return false, scaleErr
})
if err != nil {

Review comment (Contributor):

Pass the admission error along.

if err == wait.ErrWaitTimeout {
   err = fmt.Errorf("%v: %v", err, scaleErr)
}

func (t *FakeLaggedScaler) Scale(namespace, name string, newSize uint, preconditions *kubectl.ScalePrecondition, retry, wait *kubectl.RetryParams) error {
	if t.RetryCount != 2 {
		t.RetryCount += 1
		return errors.NewForbidden(unversioned.GroupResource{Resource: "ReplicationController"}, name, fmt.Errorf("%s: not yet ready to handle request", name))

Review comment (Contributor):

Add a comment that you are faking a real admission error.

}

func (t *FakeLaggedScaler) ScaleSimple(namespace, name string, preconditions *kubectl.ScalePrecondition, newSize uint) (string, error) {
	return "", fmt.Errorf("unexpected call to ScaleSimple")

Review comment (Contributor):

Why isn't this expected?

Reply (Contributor Author):

It is just copy&paste from the FakeScaler; I will nuke this.

0xmichalis (Contributor) commented:

This is implemented only for Recreate and initial Rolling updates, right?

0xmichalis self-assigned this on Mar 7, 2017
mfojtik (Contributor, Author) commented Mar 8, 2017

@Kargakis that is correct

mfojtik force-pushed the retry-scale-in-deploy branch from b83842e to f00d197 on March 8, 2017 10:57
mfojtik (Contributor, Author) commented Mar 8, 2017

@Kargakis comments addressed, thanks!

mfojtik (Contributor, Author) commented Mar 9, 2017

[test]

0xmichalis (Contributor) commented:

> @Kargakis that is correct

I guess this is a problem that occurs during cluster upgrades when the api server cache needs to be resynced? Won't this be a problem for non-initial Rolling deployments?

mfojtik (Contributor, Author) commented Mar 14, 2017

[test]

mfojtik (Contributor, Author) commented Mar 24, 2017

[test]

openshift-bot (Contributor) commented:

Evaluated for origin test up to f00d197

openshift-bot (Contributor) commented:

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/445/) (Base Commit: 6e70ad5)

mfojtik (Contributor, Author) commented Mar 30, 2017

@Kargakis we need to get this in, as this is a blocker bug :-/

if int32(replicas) == deployment.Spec.Replicas && int32(replicas) == deployment.Status.Replicas {
return deployment, nil
}
if err := s.scaler.Scale(deployment.Namespace, deployment.Name, uint(replicas), &kubectl.ScalePrecondition{Size: -1, ResourceVersion: ""}, retry, wait); err != nil {
var scaleErr error
err := wait.PollImmediate(1*time.Second, 30*time.Second, func() (bool, error) {

Review comment (Contributor):

Add a TODO that the 30s should be proportionate to the TimeoutSeconds found in the strategy.

Reply (Contributor Author):

sure
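
One possible shape for that TODO, sketched below (the helper name, package, and wiring are assumptions, not the merged code; TimeoutSeconds *int64 mirrors the deployment strategy params):

package deploy

import "time"

// scaleRetryTimeout picks the poll window from the strategy's TimeoutSeconds,
// falling back to the fixed 30s used in this pull request. Sketch only.
func scaleRetryTimeout(timeoutSeconds *int64) time.Duration {
	if timeoutSeconds != nil && *timeoutSeconds > 0 {
		return time.Duration(*timeoutSeconds) * time.Second
	}
	return 30 * time.Second
}

The hardcoded 30*time.Second in the PollImmediate call above could then become scaleRetryTimeout(params.TimeoutSeconds), where params is assumed to be the strategy's parameters.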


0xmichalis (Contributor) commented:

This should work for installs, so I am fine with merging. We should probably also handle it for upgrades of Rolling deployments, but that will probably require changes in the rolling updater.

[merge]

openshift-bot (Contributor) commented:

Evaluated for origin merge up to f00d197

openshift-bot (Contributor) commented Mar 30, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/445/) (Base Commit: 475ae9a) (Image: devenv-rhel7_6108)

openshift-bot merged commit f081ac4 into openshift:master on Mar 30, 2017
// This error is returned when the lifecycle admission plugin cache is not fully
// synchronized. In that case the scaling should be retried.
//
// FIXME: The error returned from admission should not be forbidden but come-back-later error.
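
For illustration, a hypothetical helper matching the comment above (the name and exact check are assumptions; the message string is the one faked by the test scaler earlier in this conversation, and the import path reflects the vendored tree of that era):

package deploy

import (
	"strings"

	kerrors "k8s.io/kubernetes/pkg/api/errors"
)

// isCacheSyncError reports whether err looks like the forbidden error the
// namespace lifecycle admission plugin returns while its caches are still
// syncing, i.e. the case in which the scaling should be retried. Sketch only.
func isCacheSyncError(err error) bool {
	return err != nil && kerrors.IsForbidden(err) &&
		strings.Contains(err.Error(), "not yet ready to handle request")
}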

Review comment (Contributor):

Is there an issue to fix this?

jagadeeshops commented May 8, 2017:

We had the same problem in our environment and found that one of the controllers was having auth issues with the master. The controller was continuously trying to connect to the master and failing. After stopping that controller and the master, everything was fine.

mfojtik deleted the retry-scale-in-deploy branch on September 5, 2018 21:08