Change the router reload suppression so that it doesn't block updates #17049
Added new environment variables to help with debugging:
- OPENSHIFT_LOG_LEVEL: Sets the debug level to the given value (defaults to 4)
- OPENSHIFT_GET_ALL_DOCKER_LOGS: A boolean that enables dumping of all container logs if any container failed (rather than just giving the logs from the failure)
/test |
2c3828a to 3893117
This changes the locking so that a reload no longer holds the router object's lock for the duration of the reload. Updates that happen while the router is reloading can be processed immediately and batched up, then included when the next reload occurs. Before this, if a reload ran longer than the reload interval, only one event would be processed before triggering a new reload, which would then lock out other event processing. This caused the router to not make any meaningful progress consuming events.
A new module to do the rate limiting has been added.
The module has a top half and a bottom half. The top half simply calls the bottom half with a flag indicating the user has made a change. The flag tells the bottom half to register the desire to reload (so we can do it under a single lock acquisition).
The bottom half is in charge of determining whether it can reload immediately or has to wait. If it must wait, it works out the earliest time it can reload and schedules a callback to itself for that time.
If it determines it can reload, it runs the reload code immediately. When the reload is complete, it calls itself again to make sure no other reload request came in while the reload was running.
Whenever the bottom half calls itself, it does so without the flag indicating the user made a change.
Fixes bug 1471899 -- https://bugzilla.redhat.com/show_bug.cgi?id=1471899
3893117 to dac5ce6
This scares me a lot coming in so late. Can you confirm with QA that they have time to run all their router tests on this, at scale? I'm starting to lean toward pushing this out of 3.7, even though we know it would be a huge win for us...
Changes are very easy to understand, thanks Ben! In router |
After discussing with @knobunc and @rajatchopra |
/lgtm
/approve
Non blocking comments posted above.
if untilNextCallback > 0 {
	// We want to reload... but can't yet because some window is not satisfied
	if csrl.callbackTimer == nil {
		csrl.callbackTimer = time.AfterFunc(untilNextCallback, func() { csrl.changeWorker(false) })
At some point we should use a for loop instead of a recursive call, just to avoid the remote possibility of constant changes causing a stack overflow.
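The loop suggested above could be sketched roughly as follows. This is a minimal illustration only: it leaves out the rate-limiting window and the timer entirely, and `batcher`, `Notify`, `pending`, and `running` are hypothetical names that do not come from the PR. It shows a handler being re-run in a `for` loop for as long as new changes arrived during the previous run, rather than the worker calling itself.

```go
package main

import (
	"fmt"
	"sync"
)

// batcher is a hypothetical type (not from the PR) that re-runs handler
// in a loop while changes keep arriving, instead of calling itself.
type batcher struct {
	mu      sync.Mutex
	pending bool   // a change arrived that no handler run has covered yet
	running bool   // some caller is currently inside the loop below
	handler func() // the slow operation; always called without mu held
}

// Notify records a change. If no other caller is already running the
// handler, this caller runs it until no changes are left pending.
func (b *batcher) Notify() {
	b.mu.Lock()
	b.pending = true
	if b.running {
		// The caller that is already looping will see pending and run again.
		b.mu.Unlock()
		return
	}
	b.running = true
	for b.pending {
		b.pending = false
		b.mu.Unlock()
		b.handler() // lock released so new changes can be registered meanwhile
		b.mu.Lock()
	}
	b.running = false
	b.mu.Unlock()
}

func main() {
	runs := 0
	var b *batcher
	b = &batcher{handler: func() {
		runs++
		if runs == 1 {
			b.Notify() // simulate a change that arrives during the first run
		}
	}}
	b.Notify()
	fmt.Println("handler runs:", runs)
}
```

A real version would still need the interval check and the timer from the code under review; the sketch covers only the "run again if something came in meanwhile" part.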
	return csrl.handlerFunc()
}
if err := runHandler(); err != nil {
Nit: The func variable has the same name as the function it resides in. Maybe choose a different name here. It spun me around a bit because I thought we were making a recursive call :).
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: knobunc, rajatchopra
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
/retest
Please review the full test history for this PR and help us cut down flakes.
Automatic merge from submit-queue.
Change the router reload suppression so that it doesn't block updates
This changes the locking so that a reload no longer holds the router object's lock for the duration of the reload. Updates that happen while the router is reloading can be processed immediately and batched up, then included when the next reload occurs. Before this, if a reload ran longer than the reload interval, only one event would be processed before triggering a new reload, which would then lock out other event processing. This caused the router to not make any meaningful progress consuming events.
A new module to do the rate limiting has been added.
The module has a top half and a bottom half. The top half simply calls the bottom half with a flag indicating the user has made a change. The flag tells the bottom half to register the desire to reload (so we can do it under a single lock acquisition).
The bottom half is in charge of determining if it can immediately reload or if it has to wait. If it must wait, then it works out the earliest time it can reload and schedules a callback to itself for that time.
If it determines it can reload, then it runs the reload code immediately. When the reload is complete, it calls itself again to make sure there was no other pending reload that had come in while the reload was running.
Whenever the bottom half calls itself, it does it without the flag indicating the user made a change.
Fixes bug 1471899 -- https://bugzilla.redhat.com/show_bug.cgi?id=1471899
@openshift/networking PTAL