The unique_host router filter can lose routes #19175
@openshift/networking I'm still iterating on this, but I found three separate issues: one that was new this release, one that was a holdover from previous releases related to closely spaced updates to routes, and one that was a bug that has existed since wildcard support was added. Still trying to tease apart the remaining issues.
Ok, this is ready for review. The last few commits are cleaning up the tests.
Three related issues:
Right now one of the e2e stress tests will flake a bit more because backoff isn’t long enough.
@openshift/networking PTAL
lgtm
minor comments.
	return strings.Trim(s, "\"'")
})
if err != nil {
	return
Log the error here? I know it can produce a lot of messages, so log at a numerically higher verbosity level; otherwise the error is swallowed here.
// hasExistingMatch returns true if a route in exists has the same path.
func hasExistingMatch(exists []*routeapi.Route, route *routeapi.Route) bool {
	for _, existing := range exists {
		if existing.Spec.Path == route.Spec.Path {
Felt odd to initially see only a path check here and not a host name. Until I saw the caller code using this for a specific host. Maybe a comment here ... by itself, this function reads odd.
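A minimal sketch of how the doc comment could spell out that per-host assumption. The `Route` type here is a hypothetical, flattened stand-in for `routeapi.Route`, and the body past the path comparison is assumed, since the quoted diff is truncated:

```go
package main

// Route is a hypothetical stand-in for routeapi.Route, reduced to the
// fields this sketch needs.
type Route struct {
	Name string
	Host string
	Path string
}

// hasExistingMatch reports whether any route in exists has the same path as
// route. It deliberately does not compare hosts: the caller is assumed to
// pass only routes that are already grouped under a single host.
func hasExistingMatch(exists []*Route, route *Route) bool {
	for _, existing := range exists {
		if existing.Path == route.Path {
			return true
		}
	}
	return false
}
```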
Trying out a minor fix in the last commit that keeps the work queue from getting out of order: instead of requeueing with a delay and grabbing the next queue item immediately, sleep inside the work loop. This can delay shutdown of the queue until the lease expires, but it keeps queued items in roughly insertion order and prevents followers from simply skipping to the next item and competing with the leader. The followers get locked out of trying work, so they fall further and further behind, increasing the chance of the write lease being held.
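The ordering argument can be sketched roughly as follows. This is not the writerlease code; `processInOrder`, `sleepFor`, and the injected `wait` and `work` callbacks are hypothetical names chosen for illustration:

```go
package main

import "time"

// sleepFor returns a wait function that blocks for d. Injecting the wait
// (an assumption of this sketch) lets the loop be exercised without
// actually sleeping.
func sleepFor(d time.Duration) func() {
	return func() { time.Sleep(d) }
}

// processInOrder handles items in insertion order. When work reports that an
// item must be retried, the loop waits and retries the same item instead of
// moving it to the back of the queue, so later items never overtake earlier
// ones. The trade-off is that the loop cannot exit while it is waiting.
func processInOrder(items []string, wait func(), work func(item string) (retry bool)) []string {
	var done []string
	for _, item := range items {
		for work(item) {
			wait()
		}
		done = append(done, item)
	}
	return done
}
```

A caller would pass something like `sleepFor(backoff)` as `wait`; a requeue-with-delay design would instead let the second item run while the first is still backing off.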
Everything looks good to me too... I just don't want to say the magic word because your last commit still says WIP.
Found a bug last night where I dropped route modifications incorrectly in unique_host. Testing.
The unique_host filter grew over time to add additional functionality and scenarios. As part of that growth the guarantees for how claiming routes evicted and restored other routes that shared the same host or path were weakened and eventually broken. Two routes that conflict by host and path created at the same time where one is deleted will result in a 50% chance that the host remains claimed by the deleted route. Refactor unique_host to use a separate data structure - a host index - to track the routes that are assigned to each host. Split the logic for deciding which routes for a given host to use into side-effect free functions that can be individually tested. In the core unique_host code guarantee that when routes cover other routes, deletions are propagated down to lower level plugins and no route is left behind. Add much stronger tests on the core data structure to ensure the logic of unique_host is not broken.
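As a rough illustration of the host index idea, the sketch below keeps every route that claims a host and derives the active set with a side-effect free function. All names (`Route`, `HostIndex`, `activeRoutes`) are hypothetical, and the oldest-wins rule with a UID tie-break is a simplifying assumption; the real filter also handles namespaces, wildcards, and change notification to downstream plugins:

```go
package main

import "sort"

// Route is a hypothetical, reduced route: only the fields the index needs.
type Route struct {
	UID     string
	Host    string
	Path    string
	Created int64 // creation time; in this sketch the older route wins
}

// activeRoutes is side-effect free: given every route claiming one host, it
// returns the winner for each path (oldest first, ties broken by UID).
func activeRoutes(routes []*Route) []*Route {
	sorted := append([]*Route(nil), routes...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Created != sorted[j].Created {
			return sorted[i].Created < sorted[j].Created
		}
		return sorted[i].UID < sorted[j].UID
	})
	seen := map[string]bool{}
	var out []*Route
	for _, r := range sorted {
		if !seen[r.Path] {
			seen[r.Path] = true
			out = append(out, r)
		}
	}
	return out
}

// HostIndex tracks all routes per host, including ones currently covered.
type HostIndex struct {
	byHost map[string][]*Route
}

func NewHostIndex() *HostIndex {
	return &HostIndex{byHost: map[string][]*Route{}}
}

// Add records r, replacing any earlier copy with the same UID on that host.
// The sketch assumes a route's host does not change between calls.
func (h *HostIndex) Add(r *Route) {
	h.Remove(r)
	h.byHost[r.Host] = append(h.byHost[r.Host], r)
}

// Remove drops r from its host so that a covered route can become active.
func (h *HostIndex) Remove(r *Route) {
	var kept []*Route
	for _, existing := range h.byHost[r.Host] {
		if existing.UID != r.UID {
			kept = append(kept, existing)
		}
	}
	if len(kept) == 0 {
		delete(h.byHost, r.Host)
		return
	}
	h.byHost[r.Host] = kept
}

// Active returns the routes that currently win the given host.
func (h *HostIndex) Active(host string) []*Route {
	return activeRoutes(h.byHost[host])
}
```

Because the covered route stays in the index, deleting the winner cannot leave the host claimed by a route that no longer exists: the active set is always recomputed from what remains.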
The router unique host test was embedding the lost update that we observed as "correct" behavior. Clean up the test to not reuse objects and to set UIDs, and give each sequent a correct name. An error condition on the extended validation test was being masked by a unique_host wrapper - remove that wrapper and make the test correct.
Prevents lots of crap from being output when running in a namespace.
The writerlease is a work queue, but we were exiting immediately on conflicts. This is not our normal pattern, which is to build a work queue and then resync from the latest cache state. Change how status.go queues up work so that we perform our retry inside the lease function. Should ensure that the correct output is eventually written.
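The retry-inside-the-lease pattern might look roughly like the sketch below. `Status`, `errConflict`, and `updateWithRetry` are hypothetical names, not the identifiers used in status.go; the point is only that each attempt re-reads the latest cached state instead of giving up on the first conflict:

```go
package main

import "errors"

// errConflict is a hypothetical sentinel for an optimistic-concurrency
// conflict reported by the API server.
var errConflict = errors.New("conflict")

// Status is a hypothetical, reduced view of a route's status.
type Status struct {
	ResourceVersion int
	Admitted        bool
}

// updateWithRetry tries a write up to attempts times. On a conflict it reads
// the latest state again and retries; any other result (success or a
// non-conflict error) is returned immediately.
func updateWithRetry(attempts int, latest func() Status, write func(Status) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		s := latest() // resync from the cache on every attempt
		s.Admitted = true
		if err = write(s); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}
```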
Because we are now using the informer cache, having the plugins mutate the passed in object is incorrect (the cache doesn't have that modification). Instead, mutate the cache from the very beginning so that we always have the router's preferred spec.host set.
This prevents double deletion logic and other ugly mismatches between upstream deletion and our deletion.
Also use a consistent prefix so we get good debug output
Ok, I found and tracked down the bug that was resulting in the previous failures. When I added the host_index I needed to simplify the behavior on receiving a route modification that doesn't change path or host: if the route has the same resource version as the one we have cached, it's a no-op; if the resource version has changed and the route is currently active, we need to tell the unique_host plugin that the route is activated. unique_host doesn't need to know the details, just whether to pass the update down. With that, this PR should be correct and all known flakes addressed (I added tests for the no-op update and the update of an active route to hostindex_test). Ready for final review.
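That decision can be sketched as a small pure function. `Route` and `shouldPassDown` are hypothetical names, and the sketch covers only the case described above, where host and path are unchanged:

```go
package main

// Route is a hypothetical, reduced route carrying only identity and version.
type Route struct {
	UID             string
	ResourceVersion string
}

// shouldPassDown decides whether an update that changes neither host nor
// path must be forwarded to the next plugin. An unchanged resource version
// is a no-op; a changed version is forwarded only when the route is
// currently active for its host.
func shouldPassDown(cached, updated *Route, active bool) bool {
	if cached.ResourceVersion == updated.ResourceVersion {
		return false
	}
	return active
}
```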
/lgtm
Thanks for this Clayton... it's impressive.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: knobunc, smarterclayton
The unique_host filter grew over time to add additional functionality and
scenarios. As part of that growth the guarantees for how claiming routes
evicted and restored other routes that shared the same host or path were
weakened and eventually broken. Two routes that conflict by host and path
created at the same time where one is deleted will result in a 50% chance
that the host remains claimed by the deleted route.
Refactor unique_host to use a separate data structure - a host index - to
track the routes that are assigned to each host. Split the logic for
deciding which routes for a given host to use into side-effect free
functions that can be individually tested. In the core unique_host code
guarantee that when routes cover other routes that deletions are propagated
down to lower level plugins and that no route is left behind.
Add much stronger tests on the core data structure to ensure the logic of
unique_host is not broken.
Identified while adding #18658. Only the last two commits are unique.