Make A/B deployment proportional to service weight #15309

Merged · 1 commit · Aug 2, 2017

Conversation

pecameron (Contributor):

Distribute requests among the route's services based on service weight.
The portion of the total weight that each service has is distributed
evenly among the service's endpoints.

bug 1470350
https://bugzilla.redhat.com/show_bug.cgi?id=1470350
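
To make the intent concrete (the numbers below are invented for illustration, not taken from the PR): a service's weight is split evenly across its endpoints, so a route with serviceA at weight 20 behind 2 endpoints and serviceB at weight 10 behind 5 endpoints gives each serviceA endpoint a weight of 10 and each serviceB endpoint a weight of 2, preserving the 2:1 split. A minimal Go sketch of that arithmetic:

package main

import "fmt"

func main() {
	// Illustrative route: two services with invented weights and endpoint counts.
	services := []struct {
		name      string
		weight    int32 // service weight from the route spec
		endpoints int32 // number of endpoints backing the service
	}{
		{"serviceA", 20, 2},
		{"serviceB", 10, 5},
	}
	for _, s := range services {
		perEndpoint := s.weight / s.endpoints // even share per endpoint
		fmt.Printf("%s: %d per endpoint\n", s.name, perEndpoint)
	}
}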

pecameron (Author):

@knobunc @rajatchopra PTAL

knobunc (Contributor) left a comment:

Generally on the right track. Let's use getServiceUnits to make it a hair simpler.

@@ -24586,10 +24586,6 @@
"$ref": "v1.StageInfo"
},
"description": "stages contains details about each stage that occurs during the build including start time, duration (in milliseconds), and the steps that occured within each stage."
},
"logSnippet": {
Contributor:

This should not have gone away.

pecameron (Author):

Fixed. (accidentally deleted)

pecameron force-pushed the bz1470350 branch 2 times, most recently from 16ac7bd to 6f9c3d8 on July 18, 2017 16:48
@@ -812,6 +812,20 @@ func (r *templateRouter) RemoveRoute(route *routeapi.Route) {
r.stateChanged = true
}

// numberOfEndpoints returns the number of endpoints
Contributor:

numberOfEndpoints returns the number of endpoints for the given service id

r.lock.Lock()
defer r.lock.Unlock()

var eps int32
Contributor:

eps := 0

(0 is of type int, which is at least 32 bits)

Or if you must:

var eps int32 = 0

var eps int32
eps = 0
svc, ok := r.findMatchingServiceUnit(id)
if ok && len(svc.EndpointTable) > int(eps) {
Contributor:

no need to cast if you use := 0

eps = 0
svc, ok := r.findMatchingServiceUnit(id)
if ok && len(svc.EndpointTable) > int(eps) {
eps = int32(len(svc.EndpointTable))
Contributor:

and no need to cast here either...
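
Taken together, the suggestions in this thread would leave the helper reading roughly like this (a sketch of how the reviewed code might look after the changes, not necessarily the exact code that was merged; findMatchingServiceUnit, EndpointTable, and the locking are the names shown in the diff hunks above):

// numberOfEndpoints returns the number of endpoints for the given service id.
func (r *templateRouter) numberOfEndpoints(id string) int32 {
	r.lock.Lock()
	defer r.lock.Unlock()

	eps := 0 // plain int; no casts needed in the comparison below
	if svc, ok := r.findMatchingServiceUnit(id); ok && len(svc.EndpointTable) > eps {
		eps = len(svc.EndpointTable)
	}
	return int32(eps) // convert once, at the boundary
}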

pecameron (Author):

@knobunc Is anything more needed? PTAL

// serviceUnits[key] is the weight for each endpoint in the service
// the sum of the weights of the endpoints is the scaled service weight

var numEp int32
Contributor:

numEp := r.numberOfEndpoints(key)
if numEp < 1 {
numEp = 1
}

pecameron (Author):

@knobunc @rajatchopra Added a test case. PTAL

knobunc (Contributor) left a comment:

LGTM

knobunc (Contributor) commented Jul 19, 2017:

@openshift/networking PTAL

knobunc (Contributor) commented Jul 19, 2017:

[test][testextended][extended: networking]

}
// scale the weights to near the maximum (256)
// This improves precision when scaling for the endpoints
var scaleWeight int32 = 256 / maxWeight
rajatchopra (Contributor):

This normalization is a good idea, but it will not cover the case where maxWeight is > 128 and one of the services has a weight that is too low compared to its number of endpoints.
e.g. I want A/B testing at 1.5%, i.e. 1.5% of traffic should be served by serviceA and the rest by serviceB, so I put a weight of 3 on svcA and 200 on svcB.
All is good until svcA has 20 endpoints, at which point we will give its endpoints a weight of '0'.
Ideally, we should give 3 endpoints a weight of '1' and the rest become '0' (unfortunately).

I have another algorithm in mind (not perfect either). Need discussion.

//Foreach svc:

equitableWeight = trunc(svcWeight/numEp)
for i = 0; i < len(endpoints); i++ {
    endpoints[i].weight = equitableWeight
}
for i=0; i < (svcWeight - equitableWeight); i++ {
    endpoints[i].weight += 1
}
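
One way to read that sketch in Go (my interpretation, with invented names, not the PR's code; the second loop bound is taken to be the remainder left after the even split, so the weights still sum to svcWeight):

package main

import "fmt"

// endpoint stands in for the router's per-endpoint record (illustrative only).
type endpoint struct{ weight int32 }

// distribute splits svcWeight across the endpoints: every endpoint gets the
// truncated even share, and the remainder (svcWeight % numEp) is handed out
// one unit at a time to the first few endpoints.
func distribute(svcWeight int32, endpoints []endpoint) {
	numEp := int32(len(endpoints))
	if numEp == 0 {
		return
	}
	equitableWeight := svcWeight / numEp
	remainder := svcWeight % numEp
	for i := range endpoints {
		endpoints[i].weight = equitableWeight
		if int32(i) < remainder {
			endpoints[i].weight++
		}
	}
}

func main() {
	eps := make([]endpoint, 20)
	distribute(3, eps) // the 1.5% example: weight 3 spread over 20 endpoints
	fmt.Println(eps)   // the first 3 endpoints get weight 1, the rest get 0
}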

pecameron (Author) commented Jul 20, 2017:

@rajatchopra Thanks for taking a look. The idea of normalizing was to bring the largest weight up to a 3-digit number; in that sense 128 is OK, it just gives a little more headroom.
For your 1.5% example, the weight, if not 0, should have a minimum of 1. Thanks for spotting this; I'll fix it.
With a minimum of 1, the number of pods can be reduced if the overall percentage is still too high. The 20 pods could be reduced to maybe 3.
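
A minimal sketch of that minimum-of-1 rule, assuming the per-endpoint weight is derived by dividing the scaled service weight by the endpoint count (the names here are illustrative, not the PR's):

// weightPerEndpoint sketches the proposed fix: an endpoint of a service with a
// non-zero weight is never dropped to 0.
func weightPerEndpoint(scaledServiceWeight, numEndpoints int32) int32 {
	if numEndpoints < 1 {
		numEndpoints = 1
	}
	w := scaledServiceWeight / numEndpoints
	if w == 0 && scaledServiceWeight > 0 {
		w = 1 // keep a live backend reachable even when it has many endpoints
	}
	return w
}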

Contributor:

@rajatchopra I do not like setting the weight to 0 for a live backend. We need to document clearly that a backend can receive more traffic than its proportion would allow if there are too many backends.

pecameron (Author):

[Extended Tests: networking]
@rajatchopra @knobunc PTAL

knobunc (Contributor) commented Jul 20, 2017:

Failed with:

Some command failed in a script: hack/update-generated-openapi.sh:34: genopenapi --logtostderr --output-base="${GOPATH}/src" --input-dirs "${INPUT_DIRS}" --output-package "${ORIGIN_PREFIX}pkg/openapi" "$@" exited with status 255.

openshift-bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Jul 21, 2017
knobunc (Contributor) commented Jul 21, 2017:

@openshift/networking PTAL

@@ -66,7 +66,7 @@ func ValidateRoute(route *routeapi.Route) field.ErrorList {

backendPath := specPath.Child("alternateBackends")
if len(route.Spec.AlternateBackends) > 3 {
result = append(result, field.Required(backendPath, "cannot specify more than 3 additional backends"))
result = append(result, field.Required(backendPath, "cannot specify more than 3 alternate backends"))
Contributor:

Tiniest nit: Extra space after "alternate"

openshift-bot removed the needs-rebase label on Jul 21, 2017
pecameron (Author):

[test]

knobunc (Contributor) commented Jul 24, 2017:

@pecameron testing failed with:

F0721 17:27:36.647074   21052 openapi.go:24] Error: Failed executing generator: some packages had errors:
errors in package "github.com/openshift/origin/pkg/openapi":
output for "openapi/zz_generated.openapi.go" differs; first existing/expected diff: 
  "or more backends the route points to. Weights on each backend can define the balance of traffic sent"
  "to four backends (services) the route points to. Requests are distributed among the backends dependi"


openshift-merge-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Jul 24, 2017
openshift-bot (Contributor):

Evaluated for origin test up to 073c4b8

openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/3457/) (Base Commit: 49df2d6) (PR Branch Commit: 073c4b8)

openshift-merge-robot removed the size/L label on Jul 25, 2017
openshift-merge-robot added the size/L label and removed the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Jul 26, 2017
pecameron added a commit to pecameron/openshift-docs that referenced this pull request on Jul 26, 2017:
A route can front up to 4 services that handle the requests.
The load balancing strategy governs which endpoint gets each request.
When roundrobin is chosen, the portion of the requests that each
service handles is governed by the weight assigned to the service.
Each endpoint in the service gets a fraction of the service's requests.

bug 1470350
https://bugzilla.redhat.com/show_bug.cgi?id=1470350

Code change is in origin PR 15309
openshift/origin#15309
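
As the docs describe, a service's share of the traffic is its weight divided by the sum of all the weights on the route; a quick check with invented numbers (not from the PR):

package main

import "fmt"

func main() {
	// Invented weights for a route with one main and one alternate backend.
	backends := []struct {
		name   string
		weight int32
	}{
		{"serviceA", 3},
		{"serviceB", 197},
	}
	var total int32
	for _, b := range backends {
		total += b.weight
	}
	for _, b := range backends {
		// serviceA receives 1.5% of requests, serviceB 98.5%
		fmt.Printf("%s receives %.1f%% of requests\n", b.name, 100*float64(b.weight)/float64(total))
	}
}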
pecameron (Author):

/retest

openshift-merge-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jul 28, 2017
pecameron (Author):

/retest

openshift-merge-robot removed the approved label on Jul 28, 2017
pecameron (Author):

/retest

pecameron (Author):

@knobunc PTAL

openshift-ci-robot commented Jul 31, 2017:

@pecameron: The following test failed, say /retest to rerun them all:

Test name: ci/openshift-jenkins/check · Commit: 0d2add8 · Rerun command: /test test_pull_request_origin_check


pecameron (Author):

/retest

pecameron (Author):

@rajatchopra PTAL I need a /lgtm to move this forward. Thanks.

rajatchopra (Contributor) left a comment:

/lgtm

Let's do a follow-on PR about the algorithm though. I would have liked that, in the case where the number of endpoints overruns the maximum manageable with weight '1', we ignore the rest of the endpoints rather than just 'warn' that the A/B weightage will not be honoured.

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Aug 1, 2017
rajatchopra (Contributor):

/approve addresses issue #11881

eparis (Member) commented Aug 1, 2017:

/approve no-issue

openshift-merge-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eparis, pecameron, rajatchopra

Associated issue requirement bypassed by: eparis


openshift-merge-robot added the approved label on Aug 1, 2017
eparis (Member) commented Aug 1, 2017:

/retest

openshift-merge-robot (Contributor):

Automatic merge from submit-queue (batch tested with PRs 15533, 15414, 15584, 15529, 15309)

Labels: approved · lgtm · size/L
9 participants