Replaced event queue based watching resources in router with shared informers #16315
[test] |
@openshift/networking @knobunc @rajatchopra PTAL |
/test integration issue #16312 |
/lgtm Thanks @pravisankar. @smarterclayton does this look sane to you? Obviously we still need to look into some of the test case failures (especially the one around the router reload). |
I don't like the live calls. Let's do this the proper way and use the cache correctly. Wait for sync, then flip a bool and force a refresh on the cache. Also, don't use the resource name switch, just embed five informer inits. Strong typing is better |
Note that I'm really glad the event queue is gone, just want to get the last extra mile to make this "normal". Live calls are bad because they won't take advantage of API chunking when we turn that on in 3.8 |
@deads2k re: how to safely have "only do this after sync" with the informer
I think using an index on hostname and then doing a sort is the right thing
to do, and if the "ready boolean" is unset simply exit the loop and come
back around. Or we can just delay the initial sync step of writing out the
config until sync is safe.
I do think we don't want to write back to the route API until we've fully
synced, so we probably are going to have to:
1. complete the full sync and populate the index (writing nothing to route
api for the hostname overlap)
2. wait for that
3. trigger a refresh that will then do the exact same work over again but
will trigger hostname writes to route api
4. let the first write happen
On Tue, Sep 19, 2017 at 11:53 AM, Ben Bennett wrote:
/lgtm Thanks @pravisankar. @smarterclayton does this look sane to you?
Obviously we still need to look into some of the test case failures
(especially the one around the router reload).
|
In general, "do this after sync" is expressed by filling a work queue while the cache is priming, but not starting any workers. After all caches have synced, you can synchronously do some work (fill a secondary cache perhaps?) and then start a single worker. Doing it like that would let you trigger based off of a shared informer, do some work before you consume and process any update, and process resources in order. If you must process individual watch notifications (not resources), then you could fill your own queue (or super deep channel). |
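A minimal sketch of that pattern, using only the Go standard library so it can run on its own. The `event`, `workQueue`, and `runAfterSync` names are hypothetical stand-ins, not client-go types, and the single worker here drains the queue once rather than looping forever as a real worker would.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a hypothetical stand-in for an informer notification.
type event struct {
	kind string // "add", "update", or "delete"
	key  string
}

// workQueue is a minimal unbounded FIFO guarded by a mutex.
type workQueue struct {
	mu    sync.Mutex
	items []event
}

// add appends an event; handlers call this while the cache is priming.
func (q *workQueue) add(e event) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, e)
}

// drain returns everything queued so far, in arrival order, and empties the queue.
func (q *workQueue) drain() []event {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := q.items
	q.items = nil
	return out
}

// runAfterSync blocks until synced is closed, runs the synchronous pre-work,
// and only then lets a single worker consume the queued events in order.
func runAfterSync(q *workQueue, synced <-chan struct{}, preWork func(), handle func(event)) {
	<-synced
	preWork()
	for _, e := range q.drain() {
		handle(e)
	}
}

func main() {
	q := &workQueue{}
	synced := make(chan struct{})

	// While the cache primes, handlers only enqueue; no worker is running yet.
	q.add(event{"add", "ns/route-a"})
	q.add(event{"add", "ns/route-b"})
	close(synced) // all caches report synced

	var log []string
	runAfterSync(q, synced,
		func() { log = append(log, "prework") },
		func(e event) { log = append(log, e.kind+" "+e.key) })
	fmt.Println(log)
}
```

Because nothing is consumed before the pre-work runs, the worker never observes an update against a half-filled secondary cache.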
If you need additional feedback this week to get this closed out please
don't hesitate to ask. I'd like this in 3.7.
…On Sat, Sep 23, 2017 at 4:09 AM, OpenShift Merge Robot < ***@***.***> wrote:
@pravisankar <https://github.com/pravisankar> PR needs rebase
|
ff000e8
to
3dc6842
Compare
[test] |
@smarterclayton @knobunc @rajatchopra Updated, PTAL |
routes, err := lw.client.Routes(lw.namespace).List(opts)
if err != nil {
return nil, err
rc.FirstSyncDone = func() bool {
Once HasSynced has returned true, this is unnecessary.
Maybe we should talk in person, I still think this is much more complicated than it needs to be.
1. Set "syncing" boolean true (never commit while this is true)
2. have your config loops update internal structs
3. wait for all informers synced
4. set syncing to false
5. trigger a refresh in informers
3-5 should be able to be done in a single method.
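One way to read that proposal, as a standalone sketch. The `router` type and its methods are hypothetical and are not the router plugin's actual code; `waitForSync` and `refresh` are placeholders for whatever the informers provide. The last three steps are folded into a single method, and the boolean is read and written under a lock.

```go
package main

import (
	"fmt"
	"sync"
)

// router is a hypothetical stand-in for the router plugin state.
type router struct {
	mu      sync.Mutex
	syncing bool              // true until the initial sync has finished
	hosts   map[string]string // internal structs updated by the config loops
	commits int               // number of times a commit actually ran
}

func newRouter() *router {
	return &router{syncing: true, hosts: map[string]string{}}
}

// handleRoute updates internal state, then attempts a commit.
func (r *router) handleRoute(name, host string) {
	r.mu.Lock()
	r.hosts[name] = host
	r.mu.Unlock()
	r.commit()
}

// commit is a no-op while the initial sync is still in progress.
func (r *router) commit() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.syncing {
		return
	}
	r.commits++
}

// finishInitialSync waits for the informers, clears the flag, and triggers a refresh.
func (r *router) finishInitialSync(waitForSync, refresh func()) {
	waitForSync()
	r.mu.Lock()
	r.syncing = false
	r.mu.Unlock()
	refresh()
}

func main() {
	r := newRouter()
	r.handleRoute("a", "a.example.com")
	r.handleRoute("b", "b.example.com")
	fmt.Println("commits during sync:", r.commits)

	r.finishInitialSync(
		func() {},             // placeholder: informers are already synced here
		func() { r.commit() }, // placeholder refresh: a single commit
	)
	fmt.Println("commits after sync:", r.commits)
}
```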
So the issue here is that you have two caches:
- in the informer
- caches in the route plugin
The second cache is filled by the first cache. The code that fills the second cache is not done under the lock that maintains the first cache, and therefore while order of events can be guaranteed, there is no guarantee that the second cache is up to date with the first cache.
You need to determine when the second cache has observed all of the events from the first cache once it has synced once.
I think you can address this by starting the informers, waiting for synced (on all of them), and then listing each cache with informer.GetStore().List() and sending those items to the route controller as adds. Once all of them are done, call commit. Then register the route controller as a listener on the informer, and every update from then on is safe.
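The sequence described above can be sketched without any Kubernetes dependency. Everything here (`store`, `informer`, `controller`, `firstSync`, `notify`) is a simplified, hypothetical model that only borrows the method names `GetStore`, `List`, and `AddEventHandler`; it assumes the informers have already been started and have synced, and it does not model events that arrive between listing and handler registration.

```go
package main

import "fmt"

// store is a hypothetical, already-synced informer cache.
type store struct{ items []string }

// List returns a copy of the cached items.
func (s *store) List() []string { return append([]string(nil), s.items...) }

// informer is a simplified model: a store plus registered handlers.
type informer struct {
	store    *store
	handlers []func(kind, item string)
}

func (i *informer) GetStore() *store { return i.store }

func (i *informer) AddEventHandler(h func(kind, item string)) {
	i.handlers = append(i.handlers, h)
}

// notify simulates a later watch event reaching the registered handlers.
func (i *informer) notify(kind, item string) {
	if kind == "add" {
		i.store.items = append(i.store.items, item)
	}
	for _, h := range i.handlers {
		h(kind, item)
	}
}

// controller stands in for the route controller and its second cache.
type controller struct {
	seen    []string
	commits int
}

func (c *controller) handle(kind, item string) { c.seen = append(c.seen, kind+":"+item) }
func (c *controller) commit()                  { c.commits++ }

// firstSync replays every cached item as an add, commits once, and only then
// registers the controller as a listener on each informer.
func firstSync(informers []*informer, c *controller) {
	for _, inf := range informers {
		for _, item := range inf.GetStore().List() {
			c.handle("add", item)
		}
	}
	c.commit()
	for _, inf := range informers {
		inf.AddEventHandler(c.handle)
	}
}

func main() {
	routes := &informer{store: &store{items: []string{"route-a", "route-b"}}}
	endpoints := &informer{store: &store{items: []string{"ep-a"}}}
	c := &controller{}

	firstSync([]*informer{routes, endpoints}, c)
	routes.notify("update", "route-a") // delivered through the registered handler

	fmt.Println(c.seen, c.commits)
}
```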
3dc6842
to
ba301d2
Compare
/test extended_conformance_gce |
Regarding the question today, if an informer index of routes by host is
used then the index is up to date once synced is complete (index updates
are synchronous). So we can replace our internal map with the index, and
if we set the Boolean to true after sync and then resync, any handlers are
guaranteed to observe the other routes with the same host.
Alternatively, we can simply avoid registering our handlers until sync is
true, and then register and resync (may have to double check the ordering
guarantees around adding a handler).
On Sep 26, 2017, at 9:06 PM, Ravi Sankar Penta wrote:
/test extended_conformance_gce
|
// event handlers have the same view of sync state.
c.endpointsListConsumed = c.EndpointsListConsumed()
c.commit()
c.updateConsumedCount(endpoints)
Counting isn't enough to tell you whether you've got everything. If an endpoint is deleted while you're doing the sync you will never reach your target number. You can't synchronize with your handlers like this unfortunately.
ba301d2
to
d86352a
Compare
@smarterclayton @knobunc Updated, PTAL |
}
time.Sleep(50 * time.Millisecond)
}
c.StartInformers(utilwait.NeverStop)
This doesn't seem to belong here, but in the factory. Why should this be a concern of the route controller?
Registering functions into another type is a code smell. Just have a higher level method that calls public methods on the controller from the factory
glog.Fatalf("Failed to sync router informer cache: %v", err)
}
c.processExistingItems()
c.firstSyncDone = true
The only part of this method that really belongs here is setting this Boolean (which needs to be under a lock).
c.HandleEndpoints(watch.Added, item.(*kapi.Endpoints))
}

for _, item := range c.InformerCacheList(&routeapi.Route{}) {
This abstraction is unnecessary. Do this from the factory and just call the store directly.
d86352a
to
168938a
Compare
@smarterclayton @knobunc rearranged the code as suggested, please review |
ed94e3d
to
f5ddc5e
Compare
field fields.Selector
namespace string
func (f *RouterControllerFactory) initCallbacks(rc *routercontroller.RouterController) {
rc.HasSyncedInformers = func() bool {
Do you still need this?
When re-sync is in progress, we want to reduce the number of reloads. We could fully rely on router coalescing and can get rid of this informer synced check. @knobunc @rajatchopra what do you think?
Tested with 1000 routes to check whether router coalescing is sufficient or informer synced check is necessary to reduce reloads. Router coalescing without sync check worked fine, so removed this unnecessary check.
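To illustrate why coalescing alone can keep the reload count low, here is a deterministic sketch. The `coalescer` type is hypothetical and is not the router's rate limiter: `commit` only marks work as pending, and `tick` stands in for whatever timer or rate-limited loop eventually performs the reload.

```go
package main

import (
	"fmt"
	"sync"
)

// coalescer collapses any number of commit requests into one reload per tick.
type coalescer struct {
	mu      sync.Mutex
	pending bool   // true if at least one commit arrived since the last reload
	reload  func() // the expensive operation being coalesced
}

// commit records that a reload is wanted; it never reloads directly.
func (c *coalescer) commit() {
	c.mu.Lock()
	c.pending = true
	c.mu.Unlock()
}

// tick runs at most one reload covering all commits since the previous tick.
// In a real process this would be driven by a timer or rate limiter.
func (c *coalescer) tick() {
	c.mu.Lock()
	run := c.pending
	c.pending = false
	c.mu.Unlock()
	if run {
		c.reload()
	}
}

func main() {
	reloads := 0
	c := &coalescer{reload: func() { reloads++ }}

	// A burst of 1000 route changes, as in a resync.
	for i := 0; i < 1000; i++ {
		c.commit()
	}
	c.tick() // one reload for the whole burst
	c.tick() // nothing pending, so no reload

	fmt.Println("reloads:", reloads)
}
```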
}
if lw.field != nil {
field = lw.field.String()
func (f *RouterControllerFactory) processExistingItems(rc *routercontroller.RouterController) {
Please add godoc on this function describing how it is used (and the reason it exists).
f5ddc5e
to
fba2402
Compare
…nformers
- Custom shared informer is used to leverage namespace, label and field filtering. (Auto generated shared informer does not allow this)
- Listing resources by shared informers doesn't order by resource version/creation time. So custom lister for routes is used to order the route list by creation time and this will allow oldest route to be processed before new route to claim the host name.
- Synchronization with the informer queue and cache is a bit difficult as the cache could have newer changes than what was pushed on to the queue. Luckily we only care about the first sync to avoid 503 status code for routes.
- Handling first sync:
  * Informers are started with no registered event handlers
  * Wait for all informers to be synced
  * Block router reload
  * Get list of items from informers store and process manually
  * Perform router reload
  * Register router event handlers
  This guarantees first router sync is performed after processing all existing items.
- Subsequent router syncs rely on informer syncing state and use rate limiter to coalesce changes.
fba2402
to
74560f7
Compare
@smarterclayton @knobunc @rajatchopra can you please take another look? |
Looks phenomenal, thanks for cleaning up. Very easy to read now. /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: knobunc, smarterclayton The full list of commands accepted by this bot can be found here.
/retest |
Was preapproved due to importance to getting this fixed |
Automatic merge from submit-queue. |
Automatic merge from submit-queue.
Sharded router based on namespace labels should notice routes immediately
- Currently, a sharded router based on namespace labels could take 2 resync intervals (10 to 15 mins) to notice new routes, which may not be acceptable to some customers. This change allows routes to work immediately, just like the non-sharded router behavior.
- Watching the project resource may not guarantee the order of the events, so there is no behavior change to the sharded router based on project labels.
Trello card: https://trello.com/c/Q0puUQOT
Rebased on top of #16315
Custom shared informer is used to leverage namespace, label and field filtering.
(Auto generated shared informer does not allow this)
Listing resources by shared informers doesn't order by resource version/creation time.
So custom lister for routes is used to order the route list by creation time and this
will allow oldest route to be processed before new route to claim the host name.
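Synchronization with the informer queue and cache is a bit difficult as the cache could
have newer changes than what was pushed on to the queue. Luckily we only care about the
first sync to avoid 503 status code for routes.
Handling first sync:
- Informers are started with no registered event handlers
- Wait for all informers to be synced
- Block router reload
- Get list of items from informers store and process manually
- Perform router reload
- Register router event handlers
This guarantees first router sync is performed after processing all existing items.
Subsequent router syncs rely on informer syncing state and use rate limiter to coalesce changes.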
Deleted eventQueue, no longer used
Trello card: https://trello.com/c/y6SFvOA7