stop scheduler from advancing through empty buckets without accept from ratelimiter #18604
Conversation
@bparees do you think this needs backporting?
unrelated to your changes but since it hurt my brain, isn't that numerically identical to leaving out the extra len(s.buckets) term?
so this does fix the code to behave as the doc implies it was intended to ("If the bucket is empty, we wait for the rate limiter before returning.") but since i haven't studied this like you have, why is it bad to skip through the empty buckets? i'm not clear how the buckets, rate limiter, and desired import schedule interact, so it's not obvious to me why treating empty buckets as "an item to process" is critical to it working properly.
pkg/image/controller/scheduler.go (Outdated)
	// the range over the map acts as "if non-empty": take a single
	// item, requeue it in the previous bucket so it comes around
	// again after a full pass, and return it
	for k, v := range last {
		delete(last, k)
		s.buckets[s.at(-1)][k] = v
		return k, v, false
	}
	// only reached when the current bucket is empty: advance to the
	// next bucket
	s.position = s.at(1)
in the interest of (some) clarity, i think i'd be inclined to move this into the "if len(last) == 0" block above and return from within that if-check. The fact that the for-loop above is really an "if" statement makes it a bit non-obvious what the code flow is here.
I tried it and don't believe it ends up clearer. The compiler ends up insisting on a return statement at the end of the function which is actually unreachable. Or you need to declare k and v outside the range and break. I've added comments.
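For context, a rough standalone sketch of the restructure under discussion and why it fights the compiler; everything here except the lines visible in the diff is an assumption about the surrounding code:

package main

import "fmt"

// Minimal stand-in for the scheduler, just enough to show the shape.
type scheduler struct {
	buckets  []map[string]interface{}
	position int
}

func (s *scheduler) at(inc int) int {
	return (s.position + inc + len(s.buckets)) % len(s.buckets)
}

func (s *scheduler) next() (string, interface{}, bool) {
	last := s.buckets[s.position]
	if len(last) == 0 {
		s.position = s.at(1)
		return "", nil, true
	}
	// len(last) > 0 guarantees at least one iteration, but the
	// compiler cannot prove a range over a map runs its body, so the
	// return inside the loop does not terminate the function...
	for k, v := range last {
		delete(last, k)
		s.buckets[s.at(-1)][k] = v
		return k, v, false
	}
	// ...which forces an unreachable return (or panic) here.
	panic("unreachable")
}

func main() {
	s := &scheduler{buckets: []map[string]interface{}{
		{"a": 1}, {}, {}, {},
	}}
	fmt.Println(s.next())
}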
It's not numerically identical in the case that inc is negative: in this case the return value could be negative and therefore invalid. I suppose the intended contract here is that abs(inc) <= len(s.buckets).
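To spell out the wraparound arithmetic, a standalone sketch; the helper body is an assumption paraphrased from this discussion:

package main

import "fmt"

// at computes a bucket index inc steps from position in a ring of n
// buckets. Adding n before the modulo keeps the result non-negative
// for negative inc, because Go's % truncates toward zero: (-1) % 4
// is -1, not 3. That is only safe while abs(inc) <= n.
func at(position, inc, n int) int {
	return (position + inc + n) % n
}

func main() {
	fmt.Println(at(0, -1, 4)) // 3: a valid index into 4 buckets
	fmt.Println((0 - 1) % 4)  // -1: invalid as a bucket index
}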
sigh, yes of course.
Say the scheduler has 4 buckets and the minimum refresh interval is 15 minutes. I think the idea was that the ratelimiter emits 4 tokens in 15 minutes so that the scheduler gets through all its buckets once in the period. If the scheduler immediately skips to the next bucket when done without waiting on the ratelimiter, it goes too fast (https://bugzilla.redhat.com/show_bug.cgi?id=1515058). Worse, if no scheduler bucket is empty, the ratelimiter is never called and the scheduler hot loops (a second rate limiter based on MaxScheduledImageImportsPerMinute meant that this wasn't super obvious). That's https://bugzilla.redhat.com/show_bug.cgi?id=1543446.
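To make that arithmetic concrete, a rough standalone sketch of the intended pacing, using golang.org/x/time/rate for illustration rather than the controller's actual limiter:

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	const buckets = 4
	const interval = 15 * time.Minute
	// 4 tokens per 15 minutes: one bucket visit every 3m45s, so one
	// full pass over all buckets takes the whole refresh interval.
	limiter := rate.NewLimiter(rate.Every(interval/buckets), 1)

	for i := 0; i < 2*buckets; i++ {
		// The fix: block here on every visit, even when the bucket
		// is empty. Skipping the wait for empty buckets lets the
		// scheduler finish a pass faster than the interval.
		_ = limiter.Wait(context.Background())
		fmt.Println("visiting bucket", i%buckets)
	}
}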
At the moment I'm looking for backportable correctness fixes because I think this may be appropriate for backport. Subsequently I'd like to refactor this as I think there are several corner-case race conditions and the whole thing is very opaque.
every bucket will eventually be empty since it's moving items out of the bucket as it goes, right? but thanks for your other explanation, that makes sense (that the ratelimiter config and the number of buckets are configured together; I didn't want to dig into the code enough to see if that was the case).
As for backporting, this doesn't feel like it rises to the level of needing to be backported (especially if we can get it into 3.9) but that would be up to CEE to negotiate w/ the customer who hit the issue. I'd ask on the bug.
Commits updated: 4917c6b to aeddc60
That's not relevant - because Add() spreads items across all buckets, any OCP instance with more scheduled imagestreams than buckets will hot loop. It's in https://bugzilla.redhat.com/show_bug.cgi?id=1543446 and the unit test I've added recreates it. The net effect is that the backend docker registries are queried constantly at the rate of MaxScheduledImageImportsPerMinute.
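A toy illustration of that failure mode; the round-robin spread below is a simplification of whatever distribution Add actually uses:

package main

import "fmt"

func main() {
	const buckets = 4
	const imagestreams = 10 // more scheduled imagestreams than buckets

	counts := make([]int, buckets)
	for i := 0; i < imagestreams; i++ {
		counts[i%buckets]++ // Add spreads items across all buckets
	}
	// Prints [3 3 2 2]: no bucket is ever empty, so a limiter that
	// is only consulted on empty buckets never fires, and the
	// scheduler spins through the buckets as fast as it can.
	fmt.Println(counts)
}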
one nit and lgtm
	return x
}

type wallClock struct{}
does not appear to be used
Commits updated: 519b94b to 1733eed
wallClock comment added
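Presumably the added comment documents the usual test seam; a sketch of what that pattern typically looks like, where everything except wallClock itself is an assumed name for illustration:

package main

import (
	"fmt"
	"time"
)

// nowGetter abstracts the clock so tests can substitute a fake
// returning controlled times. (Interface name assumed.)
type nowGetter interface {
	Now() time.Time
}

// wallClock reads the real system clock. It can look unused at first
// glance because its only job is to be swapped out by tests.
type wallClock struct{}

func (wallClock) Now() time.Time { return time.Now() }

func main() {
	var c nowGetter = wallClock{}
	fmt.Println(c.Now())
}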
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bparees, jim-minter. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing /approve in a comment.
/test all [submit-queue is verifying that this PR is safe to merge]
/retest Please review the full test history for this PR and help us cut down flakes.
Automatic merge from submit-queue (batch tested with PRs 18505, 18617, 18604).
I believe this fixes https://bugzilla.redhat.com/show_bug.cgi?id=1515058 and substantially helps https://bugzilla.redhat.com/show_bug.cgi?id=1543446