Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the router reload supression, and add some more controls #16564

Closed

Conversation

knobunc
Copy link
Contributor

@knobunc knobunc commented Sep 26, 2017

The main purpose is to change the locking so that a reload doesn't
hold a lock of the router object for the duration of the reload so that updates that happen while the router is reloaded can be processed immediately and batched up and included when the next reload occurs. Before this, if a reload went longer than the reload interval, one event would be processed, and then trigger a new reload. Which would then lock out other event processing. So the router would not make any meaningful progress consuming events, and spend a lot of time and CPU reloading continually.

The router controls are:

  • Max Reload Frequency (start-to-start of a reload)
  • Gaps between reloads (end-to-start of a reload)
  • Bunch events by waiting a little before reload (wait N ms between
    the time the event comes in and when the reload starts)

The commit routine is now changed to have a top and bottom half. The
top half checks to see if a reload is scheduled, and if it is, it
returns immediately. If there's no reload scheduled then it calls the
bottom half and returns.

The bottom half is in charge of determining if it can immediately
reload or if it has to wait. If it must wait, then it works out the
earliest time it can reload and schedules a callback to itself for
that time.

If it determines it can reload, then it runs the reload code
immediately. When the reload is complete, it calls itself again to
make sure there was no other pending reload that had come in while the
reload was running.

@knobunc knobunc self-assigned this Sep 26, 2017
@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 26, 2017
@openshift-merge-robot openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2017
@imcsk8
Copy link
Contributor

imcsk8 commented Sep 26, 2017

Code looks ok. The flow is actually a more clear this way.
Are there any suggested ways to test this?

@knobunc
Copy link
Contributor Author

knobunc commented Sep 27, 2017

@imcsk8 Thanks for the review. I plan to add some tests soon.

@knobunc knobunc force-pushed the feature/change-router-locking branch 2 times, most recently from 70a6561 to d09631e Compare September 27, 2017 20:35
@openshift-merge-robot openshift-merge-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 28, 2017
@knobunc
Copy link
Contributor Author

knobunc commented Sep 28, 2017

@openshift/networking @openshift/sig-networking FYI

@knobunc knobunc changed the title [WIP] Changed the router reload timing controls Change the router reload supression, and add some more controls Sep 28, 2017
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 28, 2017
@knobunc knobunc force-pushed the feature/change-router-locking branch 7 times, most recently from 902bd6f to 929af8f Compare September 28, 2017 17:01
@lihongan
Copy link
Contributor

Can we use the controls' default value (RELOAD_INTERVAL=5s, RELOAD_GAP=100ms, RELOAD_EVENT_WAIT=10ms) to verify the bug https://bugzilla.redhat.com/show_bug.cgi?id=1471899 ? If not can you give the recommended value for that scenario of thousands routes?

And my understanding for this commit is: the max reload duration will be limited to 4890ms (5s-100ms-10ms), right ?

@knobunc
Copy link
Contributor Author

knobunc commented Sep 29, 2017

@lihongan those values would work. You could drop RELOAD_INTERVAL to 1s and RELOAD_GAP and RELOAD_EVENT_WAIT to 0ms if you wanted. That might make it easier to reproduce and validate.

The duration is the longest of the times. So if a reload is desired, then it can reload if (and only if):

  • The last attempt to reload was more than RELOAD_INTERVAL ago
  • The last reload finished more than RELOAD_GAP ago
  • The current time is more than RELOAD_EVENT_WAIT than the first message that triggered the reload (since we expect more than one message to come in in a bunch related to a route change)

So, the longest possible reload time is really unknowable. If the time taken to actually perform the haproxy reload is longer than the RELOAD_INTERVAL, it will be the actual reload time + RELOAD_GAP. Otherwise, if the actual reload time is less than RELOAD_INTERVAL - RELOAD_GAP, then it will reload at RELOAD_INTERVAL.

RELOAD_EVENT_WAIT will likely be ignored on a busy system since if you have events coming in all the time that would trigger a reload, then a reload will be requested long before RELOAD_INTERVAL and RELOAD_GAP will allow it. On the other hand, on a system with few route or endpoint changes, then before, the reload would happen immediately as soon as the first event was delivered, and the second event would have to wait for RELOAD_INTERVAL before it took effect, even though it was probably a route or endpoint change related to the first event, and the first event may not be useful without it. So, we now allow the router to hesitate to gather more events before actually triggering the reload.

@knobunc knobunc force-pushed the feature/change-router-locking branch 4 times, most recently from a6ee50f to 9f2f273 Compare September 29, 2017 15:16
@knobunc knobunc force-pushed the feature/change-router-locking branch from 9f2f273 to 2cb02bb Compare September 29, 2017 15:46
value = time.Duration(defaultReloadEventWait * time.Millisecond)
}
return value
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda seems like these could be one function that takes the env var name and the default interval as arguments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

glog.V(4).Infof("Router state synchronized for the first time")
r.synced = true
r.stateChanged = true
if hasSynced {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused... if hasSynced = true that implies the state is fully synchronized, but then below r.synced can be false? maybe there's a better name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user tells us we have synced, but our internal state doesn't yet reflect that... update the internal state and emit a message

// We need to track the earliest commit time so we can do burst supression.
now := time.Now()
r.commitReqTime = &now
r.lock.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if you just defer r.lock.Unlock() at the top, and then in this block do defer r.commitWorker() so the locking is a bit simpler? The second defer should run after the unlock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't do it that way because the unlock would be called after the commitWorker because defers are called in reverse order from definition.

lastReloadEnd time.Time
// commitReqTime is nil if no commit is needed, otherwise it is the time the commit was requested
commitReqTime *time.Time
// commitRunning indicates whether there is the commitFunc is actively running
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit "commitRunning indicates whether the commitFunc is actively running"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@knobunc knobunc force-pushed the feature/change-router-locking branch 2 times, most recently from 0ed02ab to 5b50019 Compare September 29, 2017 17:19
Copy link
Contributor

@DirectXMan12 DirectXMan12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of nits, otherwise good

@@ -37,6 +37,12 @@ import (
// defaultReloadInterval is how often to do reloads in seconds.
const defaultReloadInterval = 5

// defaultReloadGap is the minimum gap between a reload end and start in milliseconds.
const defaultReloadGap = 100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should just be time.Durations too...

commitRunning bool
// commitTimer is the timer we use to make callbacks to the delayed commits
commitTimer *time.Timer
// The minimum time between the starts of reloads
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

godoc: reloadInterval is the minimum...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto the next few fields

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

untilNextCallback = getNextReload(timeUntilNextAction(now, r.lastReloadEnd, r.reloadGap))
untilNextCallback = getNextReload(timeUntilNextAction(now, *r.commitReqTime, r.reloadEventWait))

if untilNextCallback > 0*time.Second {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need * time.Second here, I believe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if r.commitTimer == nil {
r.commitTimer = time.AfterFunc(untilNextCallback, r.commitWorker)
} else {
r.commitTimer.Reset(untilNextCallback)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should note the reason for the reset here as well ("we reset to give a chance to grab extra changes before a reload, but the maxReloadInterval above prevents us from resetting forever").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified in a comment.

The main purpose is to change the locking so that a reload doesn't
hold a lock of the router object for the duration of the reload so
that updates that happen while the router is reloaded can be processed
immediately and batched up and included when the next reload
occurs. Before this, if a reload went longer than the reload interval,
one event would be processed, and then trigger a new reload. Which
would then lock out other event processing. So the router would not
make any meaningful progress consuming events, and spend a lot of time
and CPU reloading continually.

The router controls are:
 - $RELOAD_INTERVAL - Max Reload Frequency (start-to-start of a reload), default 5s
 - $RELOAD_GAP - Gaps between reloads (end-to-start of a reload), default 100ms
 - $RELOAD_EVENT_WAIT - Bunch events by waiting a little before reload (wait N ms between
   the time the event comes in and when the reload starts), default 10ms

The commit routine is now changed to have a top and bottom half.  The
top half checks to see if a reload is scheduled, and if it is, it
returns immediately.  If there's no reload scheduled then it calls the
bottom half and returns.

The bottom half is in charge of determining if it can immediately
reload or if it has to wait.  If it must wait, then it works out the
earliest time it can reload and schedules a callback to itself for
that time.

If it determines it can reload, then it runs the reload code
immediately.  When the reload is complete, it calls itself again to
make sure there was no other pending reload that had come in while the
reload was running.
These are the autogenerated shell completion files that correspond to
the new router reload command line parameters.
Added new environment variables to help with debugging:
 - OPENSHIFT_LOG_LEVEL: Defaults to 4, but sets the debug level to the given value
 - OPENSHIFT_GET_ALL_DOCKER_LOGS: A boolean that enables dumping of all container logs if any container failed (rather than just giving the logs from the failure)
The reload constants were previously integers that were multipled by a
duration constant later.  This is confusing since the units are
different, and the environment variables are duration strings.

This patch also changes the way the constants are used and moves them
inline in the parameter definitions with a helper function to
translate to durations.
@knobunc knobunc force-pushed the feature/change-router-locking branch from 5b50019 to 4099687 Compare September 29, 2017 17:49
@knobunc
Copy link
Contributor Author

knobunc commented Sep 29, 2017

/approve

@openshift-merge-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: knobunc
We suggest the following additional approver: smarterclayton

Assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@knobunc
Copy link
Contributor Author

knobunc commented Sep 29, 2017

/assign @smarterclayton

@@ -103,7 +103,9 @@ func (o *TemplateRouter) Bind(flag *pflag.FlagSet) {
flag.StringVar(&o.DefaultDestinationCAPath, "default-destination-ca-path", util.Env("DEFAULT_DESTINATION_CA_PATH", "/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt"), "A path to a PEM file containing the default CA bundle to use with re-encrypt routes. This CA should sign for certificates in the Kubernetes DNS space (service.namespace.svc).")
flag.StringVar(&o.TemplateFile, "template", util.Env("TEMPLATE_FILE", ""), "The path to the template file to use")
flag.StringVar(&o.ReloadScript, "reload", util.Env("RELOAD_SCRIPT", ""), "The path to the reload script to use")
flag.DurationVar(&o.ReloadInterval, "interval", reloadInterval(), "Controls how often router reloads are invoked. Mutiple router reload requests are coalesced for the duration of this interval since the last reload time.")
flag.DurationVar(&o.ReloadInterval, "interval", getDurationEnv("RELOAD_INTERVAL", 5*time.Second), "Controls how often router reloads are invoked. Mutiple router reload requests are coalesced for the duration of this interval since the time the last reload started.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typos

@@ -103,7 +103,9 @@ func (o *TemplateRouter) Bind(flag *pflag.FlagSet) {
flag.StringVar(&o.DefaultDestinationCAPath, "default-destination-ca-path", util.Env("DEFAULT_DESTINATION_CA_PATH", "/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt"), "A path to a PEM file containing the default CA bundle to use with re-encrypt routes. This CA should sign for certificates in the Kubernetes DNS space (service.namespace.svc).")
flag.StringVar(&o.TemplateFile, "template", util.Env("TEMPLATE_FILE", ""), "The path to the template file to use")
flag.StringVar(&o.ReloadScript, "reload", util.Env("RELOAD_SCRIPT", ""), "The path to the reload script to use")
flag.DurationVar(&o.ReloadInterval, "interval", reloadInterval(), "Controls how often router reloads are invoked. Mutiple router reload requests are coalesced for the duration of this interval since the last reload time.")
flag.DurationVar(&o.ReloadInterval, "interval", getDurationEnv("RELOAD_INTERVAL", 5*time.Second), "Controls how often router reloads are invoked. Mutiple router reload requests are coalesced for the duration of this interval since the time the last reload started.")
flag.DurationVar(&o.ReloadGap, "reload-gap", getDurationEnv("RELOAD_GAP", 100*time.Millisecond), "Controls how often router reloads are invoked. Mutiple router reload requests are coalesced for the duration of this interval since the time the last reload ended.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How often are we really going to tune these? Adding lots of extra tunables comes with a cost - do these really need to be changeable or can they just be proportional to the overall duration?

Every tunable we add is more choices that someone can get wrong. Also, I couldn't understand how the flags are used from the descriptions, which is a problem.

@@ -205,6 +207,14 @@ func (o *TemplateRouterOptions) Complete() error {
return fmt.Errorf("invalid reload interval: %v - must be a positive duration", nsecs)
}

if o.ReloadGap < 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in validate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the other checks in here be in validate too? Or is there some pattern that I don't grok fully.

lock sync.Mutex
// Track the start and end of router reloads
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow. This is a LOT of variables. This should be abstracted into some sort of separate object that is passed in because the rest of the template router has nothing to do with it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please hide this complexity so that it doesn't get entangled with other places.

// Note: If this is changed FakeCommit() in fake.go should also be updated
func (r *templateRouter) Commit() {
func (r *templateRouter) realCommit(hasSynced bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow.

So this is incredibly complicated and really needs to be abstracted into something smaller and more well tested on its own. The original point of the "commit wrapper" was to have a function that you invoke that does the right thing, and then the rest of it is ignored. This has now become entangled with the template router.

Please separate all of this logic out into three pieces - the signal (from the rest of the template router to reload), the rate limiter / tracker (the bit that says when an actual commit should go), and the underlying commit action that controls how fast it happens.

I.e.:

  1. router controller signals (changes happened)
  2. a rate limiter / commit controller decides whether to pass it on (this code)
  3. the underlying commit.

There really cannot be a connection between them, and the code here is too complicated for me to be confident it works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt can you help suggest here? i think of this as controller model - several caches:

informer cache -> feeding secondary cache -> controller loop that reads secondary cache and needs to sync up with secondary cache

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have worked out a possible solution.

keyFunc := func(_ interface{}) (string, error) {
return "templaterouter", nil
}
// Do an immediate commit so that we can reply to health checks before the state syncs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this make the router reply that it is healthy prior to having synced? is that correct? couldn't that cause traffic to be directed to it prior to actually having all the routes set up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already cover that behavior in the router... the haproxy template is passed a synced flag. So it doesn't bind 80/443 until the sync happens, but it does bind the status port for health checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems really odd. It reports healthy before it can serve traffic?

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 17, 2017
@pravisankar
Copy link

pravisankar commented Oct 18, 2017

                  t1                    t2
   StartReload ----------> EndReload -----------> StartNextReload

t1 (time between start and end of reload) is variable as it depends on the number of routes/endpoints and also on cpu load on the router machine. So as routes/endpoints increase or decrease, what ever value we initially set may not be accurate/desired.
t2 (time between end and next start of reload) can be tuned irrespective of the number of routes or cpu load and this allows coalescing bunch of updates on router.
RELOAD_EVENT_WAIT may be unnecessary as RELOAD_GAP implicitly includes this case and also if we have a stream of events continuously then we may not reload for a long time.

I think instead of 3 knobs, just RELOAD_GAP knob will solve most of the reload problems and the current implementation of doing router reload and not blocking other in-memory router updates during this time will help a lot.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 20, 2017
@knobunc
Copy link
Contributor Author

knobunc commented Oct 26, 2017

Superseded by #17049

@knobunc knobunc closed this Oct 26, 2017
@knobunc knobunc deleted the feature/change-router-locking branch June 7, 2018 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/routing size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.