Make A/B deployment proportional to service weight #15309

Merged · 1 commit · Aug 2, 2017

Conversation

pecameron (Contributor):

Distribute requests among the route's services based on service weight.
The portion of the total weight that each service has is distributed
evenly among the service's endpoints.

bug 1470350
https://bugzilla.redhat.com/show_bug.cgi?id=1470350
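
To make the intent concrete (the numbers below are invented for illustration, not taken from the PR): a service's weight is split evenly across its endpoints, so a route with serviceA at weight 20 behind 2 endpoints and serviceB at weight 10 behind 5 endpoints gives each serviceA endpoint a weight of 10 and each serviceB endpoint a weight of 2, preserving the 2:1 split. A minimal Go sketch of that arithmetic:

package main

import "fmt"

func main() {
	// Illustrative route: two services with invented weights and endpoint counts.
	services := []struct {
		name      string
		weight    int32 // service weight from the route spec
		endpoints int32 // number of endpoints backing the service
	}{
		{"serviceA", 20, 2},
		{"serviceB", 10, 5},
	}
	for _, s := range services {
		perEndpoint := s.weight / s.endpoints // even share per endpoint
		fmt.Printf("%s: %d per endpoint\n", s.name, perEndpoint)
	}
}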

pecameron (Author):

@knobunc @rajatchopra PTAL

knobunc (Contributor) left a comment:

Generally on the right track. Let's use getServiceUnits to make it a hair simpler.

@@ -24586,10 +24586,6 @@
"$ref": "v1.StageInfo"
},
"description": "stages contains details about each stage that occurs during the build including start time, duration (in milliseconds), and the steps that occured within each stage."
},
"logSnippet": {
Contributor:

This should not have gone away.

pecameron (Author):

Fixed. (accidentally deleted)

pecameron force-pushed the bz1470350 branch 2 times, most recently from 16ac7bd to 6f9c3d8 on July 18, 2017 16:48
@@ -812,6 +812,20 @@ func (r *templateRouter) RemoveRoute(route *routeapi.Route) {
r.stateChanged = true
}

// numberOfEndpoints returns the number of endpoints
Contributor:

numberOfEndpoints returns the number of endpoints for the given service id

r.lock.Lock()
defer r.lock.Unlock()

var eps int32
Contributor:

eps := 0

(0 is of type int, which is at least 32 bits)

Or if you must:

var eps int32 = 0

var eps int32
eps = 0
svc, ok := r.findMatchingServiceUnit(id)
if ok && len(svc.EndpointTable) > int(eps) {
Contributor:

no need to cast if you use := 0

eps = 0
svc, ok := r.findMatchingServiceUnit(id)
if ok && len(svc.EndpointTable) > int(eps) {
eps = int32(len(svc.EndpointTable))
Contributor:

and no need to cast here either...
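
Taken together, the suggestions in this thread would leave the helper reading roughly like this (a sketch of how the reviewed code might look after the changes, not necessarily the exact code that was merged; findMatchingServiceUnit, EndpointTable, and the locking are the names shown in the diff hunks above):

// numberOfEndpoints returns the number of endpoints for the given service id.
func (r *templateRouter) numberOfEndpoints(id string) int32 {
	r.lock.Lock()
	defer r.lock.Unlock()

	eps := 0 // plain int; no casts needed in the comparison below
	if svc, ok := r.findMatchingServiceUnit(id); ok && len(svc.EndpointTable) > eps {
		eps = len(svc.EndpointTable)
	}
	return int32(eps) // convert once, at the boundary
}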

pecameron (Author):

@knobunc Is anything more needed? PTAL

// serviceUnits[key] is the weight for each endpoint in the service
// the sum of the weights of the endpoints is the scaled service weight

var numEp int32
Contributor:

numEp := r.numberOfEndpoints(key)
if numEp < 1 {
numEp = 1
}

pecameron (Author):

@knobunc @rajatchopra Added a test case. PTAL

knobunc (Contributor) left a comment:

LGTM

knobunc (Contributor) commented Jul 19, 2017:

@openshift/networking PTAL

knobunc (Contributor) commented Jul 19, 2017:

[test][testextended][extended: networking]

}
// scale the weights to near the maximum (256)
// This improves precision when scaling for the endpoints
var scaleWeight int32 = 256 / maxWeight
rajatchopra (Contributor):

This normalization is a good idea, but it will not cover the case where maxWeight is > 128 and one of the services has a weight that is too low compared to its number of endpoints.
e.g. I want A/B testing at 1.5%, i.e. 1.5% of traffic should be served by serviceA and the rest by serviceB, so I put a weight of 3 on svcA and 200 on svcB.
All is good until svcA has 20 endpoints, at which point we will give its endpoints a weight of '0'.
Ideally, we should give 3 endpoints a weight of '1' and the rest become '0' (unfortunately).

I have another algorithm in mind (not perfect either). Need discussion.

//Foreach svc:

equitableWeight = trunc(svcWeight/numEp)
for i = 0; i < len(endpoints); i++ {
    endpoints[i].weight = equitableWeight
}
for i=0; i < (svcWeight - equitableWeight); i++ {
    endpoints[i].weight += 1
}
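
One way to read that sketch in Go (my interpretation, with invented names, not the PR's code; the second loop bound is taken to be the remainder left after the even split, so the weights still sum to svcWeight):

package main

import "fmt"

// endpoint stands in for the router's per-endpoint record (illustrative only).
type endpoint struct{ weight int32 }

// distribute splits svcWeight across the endpoints: every endpoint gets the
// truncated even share, and the remainder (svcWeight % numEp) is handed out
// one unit at a time to the first few endpoints.
func distribute(svcWeight int32, endpoints []endpoint) {
	numEp := int32(len(endpoints))
	if numEp == 0 {
		return
	}
	equitableWeight := svcWeight / numEp
	remainder := svcWeight % numEp
	for i := range endpoints {
		endpoints[i].weight = equitableWeight
		if int32(i) < remainder {
			endpoints[i].weight++
		}
	}
}

func main() {
	eps := make([]endpoint, 20)
	distribute(3, eps) // the 1.5% example: weight 3 spread over 20 endpoints
	fmt.Println(eps)   // the first 3 endpoints get weight 1, the rest get 0
}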

pecameron (Author) commented Jul 20, 2017:

@rajatchopra Thanks for taking a look. The idea of normalizing was to bring the largest weight up to a 3-digit number; in that sense 128 is OK, it just gives a little more headroom.
For your 1.5% example, the weight, if not 0, should have a minimum of 1. Thanks for spotting this; I'll fix it.
With a minimum of 1, the number of pods can be reduced if the overall percentage is still too high. The 20 pods could be reduced to maybe 3.
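
A minimal sketch of that minimum-of-1 rule, assuming the per-endpoint weight is derived by dividing the scaled service weight by the endpoint count (the names here are illustrative, not the PR's):

// weightPerEndpoint sketches the proposed fix: an endpoint of a service with a
// non-zero weight is never dropped to 0.
func weightPerEndpoint(scaledServiceWeight, numEndpoints int32) int32 {
	if numEndpoints < 1 {
		numEndpoints = 1
	}
	w := scaledServiceWeight / numEndpoints
	if w == 0 && scaledServiceWeight > 0 {
		w = 1 // keep a live backend reachable even when it has many endpoints
	}
	return w
}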

Contributor:

@rajatchopra I do not like setting the weight to 0 for a live backend. We need to document clearly that a backend can receive more traffic than its proportion would allow if there are too many backends.

pecameron (Author):

[Extended Tests: networking]
@rajatchopra @knobunc PTAL

knobunc (Contributor) commented Jul 20, 2017:

Failed with:

Some command failed in a script: hack/update-generated-openapi.sh:34: genopenapi --logtostderr --output-base="${GOPATH}/src" --input-dirs "${INPUT_DIRS}" --output-package "${ORIGIN_PREFIX}pkg/openapi" "$@" exited with status 255.

openshift-bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Jul 21, 2017
knobunc (Contributor) commented Jul 21, 2017:

@openshift/networking PTAL

@@ -66,7 +66,7 @@ func ValidateRoute(route *routeapi.Route) field.ErrorList {

backendPath := specPath.Child("alternateBackends")
if len(route.Spec.AlternateBackends) > 3 {
result = append(result, field.Required(backendPath, "cannot specify more than 3 additional backends"))
result = append(result, field.Required(backendPath, "cannot specify more than 3 alternate backends"))
Contributor:

Tiniest nit: Extra space after "alternate"

openshift-bot removed the needs-rebase label on Jul 21, 2017
pecameron (Author):

[test]

knobunc (Contributor) commented Jul 24, 2017:

@pecameron testing failed with:

F0721 17:27:36.647074   21052 openapi.go:24] Error: Failed executing generator: some packages had errors:
errors in package "github.com/openshift/origin/pkg/openapi":
output for "openapi/zz_generated.openapi.go" differs; first existing/expected diff: 
  "or more backends the route points to. Weights on each backend can define the balance of traffic sent"
  "to four backends (services) the route points to. Requests are distributed among the backends dependi"


openshift-merge-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Jul 24, 2017
openshift-bot (Contributor):

Evaluated for origin test up to 073c4b8

openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/3457/) (Base Commit: 49df2d6) (PR Branch Commit: 073c4b8)

openshift-merge-robot removed the size/L label on Jul 25, 2017
openshift-merge-robot added the size/L label and removed the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Jul 26, 2017
pecameron added a commit to pecameron/openshift-docs that referenced this pull request on Jul 26, 2017:
A route can front up to 4 services that handle the requests.
The load balancing strategy governs which endpoint gets each request.
When roundrobin is chosen, the portion of the requests that each
service handles is governed by the weight assigned to the service.
Each endpoint in the service gets a fraction of the service's requests.

bug 1470350
https://bugzilla.redhat.com/show_bug.cgi?id=1470350

Code change is in origin PR 15309
openshift/origin#15309
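
As the docs describe, a service's share of the traffic is its weight divided by the sum of all the weights on the route; a quick check with invented numbers (not from the PR):

package main

import "fmt"

func main() {
	// Invented weights for a route with one main and one alternate backend.
	backends := []struct {
		name   string
		weight int32
	}{
		{"serviceA", 3},
		{"serviceB", 197},
	}
	var total int32
	for _, b := range backends {
		total += b.weight
	}
	for _, b := range backends {
		// serviceA receives 1.5% of requests, serviceB 98.5%
		fmt.Printf("%s receives %.1f%% of requests\n", b.name, 100*float64(b.weight)/float64(total))
	}
}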
pecameron (Author):

/retest

openshift-merge-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jul 28, 2017
pecameron (Author):

/retest

openshift-merge-robot removed the approved label on Jul 28, 2017
pecameron (Author):

/retest

pecameron (Author):

@knobunc PTAL

openshift-ci-robot commented Jul 31, 2017:

@pecameron: The following test failed, say /retest to rerun them all:

Test name: ci/openshift-jenkins/check · Commit: 0d2add8 · Rerun command: /test test_pull_request_origin_check


pecameron (Author):

/retest

pecameron (Author):

@rajatchopra PTAL I need a /lgtm to move this forward. Thanks.

rajatchopra (Contributor) left a comment:

/lgtm

Let's do a follow-on PR about the algorithm though. I would have liked that, in the case where the number of endpoints overruns the maximum manageable with weight '1', we ignore the rest of the endpoints rather than just 'warn' that the A/B weightage will not be honoured.

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Aug 1, 2017
rajatchopra (Contributor):

/approve addresses issue #11881

eparis (Member) commented Aug 1, 2017:

/approve no-issue

openshift-merge-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eparis, pecameron, rajatchopra

Associated issue requirement bypassed by: eparis


openshift-merge-robot added the approved label on Aug 1, 2017
eparis (Member) commented Aug 1, 2017:

/retest

openshift-merge-robot (Contributor):

Automatic merge from submit-queue (batch tested with PRs 15533, 15414, 15584, 15529, 15309)

Labels: approved · lgtm · size/L
9 participants