stop scheduler from advancing through empty buckets without accept from ratelimiter #18604
Conversation
@bparees do you think this needs backporting?
unrelated to your changes but since it hurt my brain, isn't that numerically identical to leaving out the extra len(s.buckets) term?
so this does fix the code to behave as the doc implies it was intended to ("If the bucket is empty, we wait for the rate limiter before returning.") but since i haven't studied this like you have, why is it bad to skip through the empty buckets? i'm not clear how the buckets, rate limiter, and desired import schedule interact, so it's not obvious to me why treating empty buckets as "an item to process" is critical to it working properly.
pkg/image/controller/scheduler.go (Outdated)
	// the range over the map acts as "if non-empty": take a single
	// item, requeue it in the previous bucket so it comes around
	// again after a full pass, and return it
	for k, v := range last {
		delete(last, k)
		s.buckets[s.at(-1)][k] = v
		return k, v, false
	}
	// only reached when the current bucket is empty: advance to the
	// next bucket
	s.position = s.at(1)
in the interest of (some) clarity, i think i'd be inclined to move this into the "if len(last) == 0" block above and return from within that if-check. The fact that the for-loop above is really an "if" statement makes it a bit non-obvious what the code flow is here.
I tried it and don't believe it ends up clearer. The compiler ends up insisting on a return statement at the end of the function which is actually unreachable. Or you need to declare k and v outside the range and break. I've added comments.
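For context, a rough standalone sketch of the restructure under discussion and why it fights the compiler; everything here except the lines visible in the diff is an assumption about the surrounding code:

package main

import "fmt"

// Minimal stand-in for the scheduler, just enough to show the shape.
type scheduler struct {
	buckets  []map[string]interface{}
	position int
}

func (s *scheduler) at(inc int) int {
	return (s.position + inc + len(s.buckets)) % len(s.buckets)
}

func (s *scheduler) next() (string, interface{}, bool) {
	last := s.buckets[s.position]
	if len(last) == 0 {
		s.position = s.at(1)
		return "", nil, true
	}
	// len(last) > 0 guarantees at least one iteration, but the
	// compiler cannot prove a range over a map runs its body, so the
	// return inside the loop does not terminate the function...
	for k, v := range last {
		delete(last, k)
		s.buckets[s.at(-1)][k] = v
		return k, v, false
	}
	// ...which forces an unreachable return (or panic) here.
	panic("unreachable")
}

func main() {
	s := &scheduler{buckets: []map[string]interface{}{
		{"a": 1}, {}, {}, {},
	}}
	fmt.Println(s.next())
}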
It's not numerically identical in the case that inc is negative: in this case the return value could be negative and therefore invalid. I suppose the intended contract here is that abs(inc) <= len(s.buckets).
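To spell out the wraparound arithmetic, a standalone sketch; the helper body is an assumption paraphrased from this discussion:

package main

import "fmt"

// at computes a bucket index inc steps from position in a ring of n
// buckets. Adding n before the modulo keeps the result non-negative
// for negative inc, because Go's % truncates toward zero: (-1) % 4
// is -1, not 3. That is only safe while abs(inc) <= n.
func at(position, inc, n int) int {
	return (position + inc + n) % n
}

func main() {
	fmt.Println(at(0, -1, 4)) // 3: a valid index into 4 buckets
	fmt.Println((0 - 1) % 4)  // -1: invalid as a bucket index
}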
sigh, yes of course.
Say the scheduler has 4 buckets and the minimum refresh interval is 15 minutes. I think the idea was that the ratelimiter emits 4 tokens in 15 minutes so that the scheduler gets through all its buckets once in the period. If the scheduler immediately skips to the next bucket when done without waiting on the ratelimiter, it goes too fast (https://bugzilla.redhat.com/show_bug.cgi?id=1515058). Worse, if no scheduler bucket is empty, the ratelimiter is never called and the scheduler hot loops (a second rate limiter based on MaxScheduledImageImportsPerMinute meant that this wasn't super obvious). That's https://bugzilla.redhat.com/show_bug.cgi?id=1543446.
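To make that arithmetic concrete, a rough standalone sketch of the intended pacing, using golang.org/x/time/rate for illustration rather than the controller's actual limiter:

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	const buckets = 4
	const interval = 15 * time.Minute
	// 4 tokens per 15 minutes: one bucket visit every 3m45s, so one
	// full pass over all buckets takes the whole refresh interval.
	limiter := rate.NewLimiter(rate.Every(interval/buckets), 1)

	for i := 0; i < 2*buckets; i++ {
		// The fix: block here on every visit, even when the bucket
		// is empty. Skipping the wait for empty buckets lets the
		// scheduler finish a pass faster than the interval.
		_ = limiter.Wait(context.Background())
		fmt.Println("visiting bucket", i%buckets)
	}
}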
At the moment I'm looking for backportable correctness fixes because I think this may be appropriate for backport. Subsequently I'd like to refactor this as I think there are several corner-case race conditions and the whole thing is very opaque.
every bucket will eventually be empty since it's moving items out of the bucket as it goes, right? but thanks for your other explanation, that makes sense (that the ratelimiter config and the number of buckets are configured together; I didn't want to dig into the code enough to see if that was the case).
As for backporting, this doesn't feel like it rises to the level of needing to be backported (especially if we can get it into 3.9) but that would be up to CEE to negotiate w/ the customer who hit the issue. I'd ask on the bug.
Commits updated: 4917c6b to aeddc60
That's not relevant - because Add() spreads items across all buckets, any OCP instance with more scheduled imagestreams than buckets will hot loop. It's in https://bugzilla.redhat.com/show_bug.cgi?id=1543446 and the unit test I've added recreates it. The net effect is that the backend docker registries are queried constantly at the rate of MaxScheduledImageImportsPerMinute.
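A toy illustration of that failure mode; the round-robin spread below is a simplification of whatever distribution Add actually uses:

package main

import "fmt"

func main() {
	const buckets = 4
	const imagestreams = 10 // more scheduled imagestreams than buckets

	counts := make([]int, buckets)
	for i := 0; i < imagestreams; i++ {
		counts[i%buckets]++ // Add spreads items across all buckets
	}
	// Prints [3 3 2 2]: no bucket is ever empty, so a limiter that
	// is only consulted on empty buckets never fires, and the
	// scheduler spins through the buckets as fast as it can.
	fmt.Println(counts)
}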
one nit and lgtm
	return x
}

type wallClock struct{}
does not appear to be used
Commits updated: 519b94b to 1733eed
wallClock comment added
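Presumably the added comment documents the usual test seam; a sketch of what that pattern typically looks like, where everything except wallClock itself is an assumed name for illustration:

package main

import (
	"fmt"
	"time"
)

// nowGetter abstracts the clock so tests can substitute a fake
// returning controlled times. (Interface name assumed.)
type nowGetter interface {
	Now() time.Time
}

// wallClock reads the real system clock. It can look unused at first
// glance because its only job is to be swapped out by tests.
type wallClock struct{}

func (wallClock) Now() time.Time { return time.Now() }

func main() {
	var c nowGetter = wallClock{}
	fmt.Println(c.Now())
}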
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bparees, jim-minter. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing /approve in a comment.
/test all [submit-queue is verifying that this PR is safe to merge]
/retest Please review the full test history for this PR and help us cut down flakes.
Automatic merge from submit-queue (batch tested with PRs 18505, 18617, 18604).
I believe this fixes https://bugzilla.redhat.com/show_bug.cgi?id=1515058 and substantially helps https://bugzilla.redhat.com/show_bug.cgi?id=1543446