deploy: retry scaling when the admission caches are not fully synced #13279
Conversation
[test]
	}
	return false, scaleErr
})
if err != nil {
Pass the admission error along.
if err == wait.ErrWaitTimeout {
	err = fmt.Errorf("%v: %v", err, scaleErr)
}
pkg/deploy/cmd/test/support.go (outdated)
func (t *FakeLaggedScaler) Scale(namespace, name string, newSize uint, preconditions *kubectl.ScalePrecondition, retry, wait *kubectl.RetryParams) error {
	if t.RetryCount != 2 {
		t.RetryCount += 1
		return errors.NewForbidden(unversioned.GroupResource{Resource: "ReplicationController"}, name, fmt.Errorf("%s: not yet ready to handle request", name))
Add a comment that you are faking a real admission error.
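For illustration only, a sketch of how the fake could carry such a comment (import paths assume the vendored packages used elsewhere in this PR; the struct fields are guessed from the quoted hunk):

```go
package test

import (
	"fmt"

	"k8s.io/kubernetes/pkg/api/errors"
	"k8s.io/kubernetes/pkg/api/unversioned"
	"k8s.io/kubernetes/pkg/kubectl"
)

// FakeLaggedScaler simulates a scale request being rejected by the namespace
// lifecycle admission plugin while its caches are still warming up: the first
// two calls fail, the third succeeds.
type FakeLaggedScaler struct {
	RetryCount int
}

func (t *FakeLaggedScaler) Scale(namespace, name string, newSize uint, preconditions *kubectl.ScalePrecondition, retry, wait *kubectl.RetryParams) error {
	if t.RetryCount != 2 {
		t.RetryCount += 1
		// Fake the real admission error returned when the lifecycle plugin
		// caches are not fully synced yet.
		return errors.NewForbidden(unversioned.GroupResource{Resource: "ReplicationController"}, name, fmt.Errorf("%s: not yet ready to handle request", name))
	}
	return nil
}
```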
pkg/deploy/cmd/test/support.go (outdated)
}

func (t *FakeLaggedScaler) ScaleSimple(namespace, name string, preconditions *kubectl.ScalePrecondition, newSize uint) (string, error) {
	return "", fmt.Errorf("unexpected call to ScaleSimple")
Why isn't this expected?
It is just copy&paste from the FakeScaler, I will nuke this.
This is implemented only for Recreate and initial Rolling updates, right?
@Kargakis that is correct
Force-pushed from b83842e to f00d197 (Compare)
@Kargakis comments addressed, thanks!
[test]
I guess this is a problem that occurs during cluster upgrades when the API server cache needs to be resynced? Won't this be a problem for non-initial Rolling deployments?
[test]
[test]
Evaluated for origin test up to f00d197
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/445/) (Base Commit: 6e70ad5)
@Kargakis we need to get this in, as this is a blocker bug :-/
	if int32(replicas) == deployment.Spec.Replicas && int32(replicas) == deployment.Status.Replicas {
		return deployment, nil
	}
	if err := s.scaler.Scale(deployment.Namespace, deployment.Name, uint(replicas), &kubectl.ScalePrecondition{Size: -1, ResourceVersion: ""}, retry, wait); err != nil {
	var scaleErr error
	err := wait.PollImmediate(1*time.Second, 30*time.Second, func() (bool, error) {
TODO that the 30s should be proportionate to TimeoutSeconds found in the strategy.
sure
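A possible shape for that follow-up, sketched under the assumption that the strategy params expose `TimeoutSeconds *int64` as in the deployment API; the helper name is made up for illustration:

```go
package cmd // hypothetical placement, for illustration only

import "time"

// defaultScaleRetryTimeout is the fallback when the strategy does not set
// TimeoutSeconds.
const defaultScaleRetryTimeout = 30 * time.Second

// scaleRetryTimeout derives how long to keep retrying admission-rejected
// scale calls from the strategy's TimeoutSeconds, defaulting to 30s.
func scaleRetryTimeout(timeoutSeconds *int64) time.Duration {
	if timeoutSeconds == nil || *timeoutSeconds <= 0 {
		return defaultScaleRetryTimeout
	}
	return time.Duration(*timeoutSeconds) * time.Second
}
```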
This should work for installs so I am fine with merging. We should also probably handle it for upgrades for Rolling deployments but that will probably require changes in the rolling updater. [merge]
Evaluated for origin merge up to f00d197
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/445/) (Base Commit: 475ae9a) (Image: devenv-rhel7_6108)
// This error is returned when the lifecycle admission plugin cache is not fully
// synchronized. In that case the scaling should be retried.
//
// FIXME: The error returned from admission should not be forbidden but a come-back-later error.
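Until admission returns a retryable error instead of a forbidden one, the caller has to recognize this particular error itself; a hedged sketch of such a check (the helper name and the exact message match are assumptions, not the PR's code):

```go
package cmd // hypothetical placement, for illustration only

import (
	"strings"

	kerrors "k8s.io/kubernetes/pkg/api/errors"
)

// isLaggedAdmissionError reports whether err looks like the forbidden error
// the namespace lifecycle admission plugin returns while its caches are not
// fully synced, i.e. an error worth retrying instead of failing the
// deployment immediately.
func isLaggedAdmissionError(err error) bool {
	return err != nil &&
		kerrors.IsForbidden(err) &&
		strings.Contains(err.Error(), "not yet ready to handle request")
}
```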
Is there an issue to fix this?
We had the same problem in our environment and found that one of the controllers was having auth issues with the master. The controller was continuously trying to connect to the master and failing. After stopping that controller and the master, everything was fine.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1427992
This happens when the error at https://github.com/mfojtik/origin/blob/b83842eff0f6989f9735f483d4348ecc0d710ac7/vendor/k8s.io/kubernetes/plugin/pkg/admission/namespace/lifecycle/admission.go#L112 occurs. That leads to an immediate deployment failure without a retry.
This pull should keep retrying for 30 seconds (can be higher maybe? @Kargakis) with the assumption that the caches will get warmed up eventually and the scaling will succeed.
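Putting the pieces together, a rough sketch of the retry behaviour described above (function and variable names are illustrative, not the PR's exact code; import paths assume the vendored Kubernetes packages of that era):

```go
package cmd // hypothetical placement, for illustration only

import (
	"fmt"
	"time"

	kerrors "k8s.io/kubernetes/pkg/api/errors"
	"k8s.io/kubernetes/pkg/kubectl"
	"k8s.io/kubernetes/pkg/util/wait"
)

// scaleWithRetry keeps calling the scaler for up to 30 seconds, retrying when
// the request is rejected with the forbidden error the lifecycle admission
// plugin returns while its caches are still syncing.
func scaleWithRetry(scaler kubectl.Scaler, namespace, name string, replicas uint) error {
	var scaleErr error
	err := wait.PollImmediate(1*time.Second, 30*time.Second, func() (bool, error) {
		scaleErr = scaler.Scale(namespace, name, replicas,
			&kubectl.ScalePrecondition{Size: -1, ResourceVersion: ""}, nil, nil)
		if scaleErr == nil {
			return true, nil
		}
		if kerrors.IsForbidden(scaleErr) {
			// The admission caches are likely not synced yet; keep retrying.
			return false, nil
		}
		// Any other error is treated as fatal and stops the poll.
		return false, scaleErr
	})
	if err == wait.ErrWaitTimeout {
		// Surface the last admission error alongside the timeout.
		err = fmt.Errorf("%v: %v", err, scaleErr)
	}
	return err
}
```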