
The unique_host router filter can lose routes #19175

Merged

Conversation

@smarterclayton commented Apr 2, 2018

The unique_host filter grew over time to add additional functionality and
scenarios. As part of that growth, the guarantees for how claiming routes
evicted and restored other routes that shared the same host or path were
weakened and eventually broken. If two routes that conflict by host and path
are created at the same time and one of them is then deleted, there is a 50%
chance that the host remains claimed by the deleted route.

Refactor unique_host to use a separate data structure - a host index - to
track the routes that are assigned to each host. Split the logic for
deciding which routes to use for a given host into side-effect-free
functions that can be individually tested. In the core unique_host code,
guarantee that when routes cover other routes, deletions are propagated
down to lower-level plugins and no route is left behind.

Add much stronger tests on the core data structure to ensure the logic of
unique_host is not broken.

Identified while adding #18658. Only the last two commits are unique.
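The host-index idea described above can be sketched roughly as follows (all names here are hypothetical simplifications, not the PR's actual hostindex API). The key property is that the index remembers every route claiming a host, so removing the active claimant restores a previously displaced route instead of losing it:

```go
package main

import (
	"fmt"
	"sort"
)

// Route is a simplified stand-in for routeapi.Route (hypothetical).
type Route struct {
	Name string
	Host string
	Age  int // lower = older; the oldest route wins a contested host
}

// hostIndex tracks every route claiming each host, not just the winner,
// so removing the active route can restore a previously displaced one.
type hostIndex struct {
	routesByHost map[string][]*Route
}

func newHostIndex() *hostIndex {
	return &hostIndex{routesByHost: make(map[string][]*Route)}
}

func (i *hostIndex) Add(r *Route) {
	i.routesByHost[r.Host] = append(i.routesByHost[r.Host], r)
}

func (i *hostIndex) Remove(name, host string) {
	routes := i.routesByHost[host]
	for n, r := range routes {
		if r.Name == name {
			i.routesByHost[host] = append(routes[:n:n], routes[n+1:]...)
			return
		}
	}
}

// Active decides which route currently owns a host. It is a side-effect
// free function over the index contents, so it can be tested in isolation.
func (i *hostIndex) Active(host string) *Route {
	routes := append([]*Route(nil), i.routesByHost[host]...)
	if len(routes) == 0 {
		return nil
	}
	sort.Slice(routes, func(a, b int) bool { return routes[a].Age < routes[b].Age })
	return routes[0]
}

func main() {
	idx := newHostIndex()
	idx.Add(&Route{Name: "old", Host: "www.example.test", Age: 1})
	idx.Add(&Route{Name: "new", Host: "www.example.test", Age: 2})
	fmt.Println(idx.Active("www.example.test").Name) // old
	idx.Remove("old", "www.example.test")
	fmt.Println(idx.Active("www.example.test").Name) // new: restored, not lost
}
```

In this sketch, deleting the older route hands the host to the surviving one; the 50% failure in the old filter came from not tracking the displaced route at all.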

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2018
@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 2, 2018
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from 38b7cf3 to 0ef491c Compare April 2, 2018 05:27
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from 0ef491c to 5ef5849 Compare April 2, 2018 20:37
@smarterclayton

/test unit

@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from 5ef5849 to a1697ac Compare April 3, 2018 02:57
@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 3, 2018
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from a1697ac to 22bb5f2 Compare April 3, 2018 06:03
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 3, 2018
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch 3 times, most recently from 99ef47b to 1360dd3 Compare April 5, 2018 00:08
@smarterclayton

/retest

4 similar comments
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch 2 times, most recently from 7ddfad2 to 1a6c7db Compare April 8, 2018 21:36
@smarterclayton

/retest

@smarterclayton

@openshift/networking: still iterating on this, but I found three separate issues: one that was new this release, one that was a holdover from previous releases related to closely spaced updates to routes, and one bug that has existed since wildcard support was added.

Still trying to tease apart remaining issues.

@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch 2 times, most recently from 32a6f16 to 6e083a6 Compare April 17, 2018 05:10
@smarterclayton

/retest

@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from 6e083a6 to ef95391 Compare April 17, 2018 20:30
@smarterclayton

Ok, this is ready for review. The last few commits are cleaning up the tests.

@smarterclayton

Three related issues:

  1. unique_host wasn't tracking when removed routes uncovered a previously claimed route.
  2. The new write lease code was subtly wrong for conflicts: we gave up rather than retrying. To retry, we need to read from the cache and keep track of the retries, which requires tracking the retry in the write lease correctly.
  3. To use the cache, we need to extract the route update logic so it happens before the cache, so we aren't modifying objects from the cache.

Right now one of the e2e stress tests will flake a bit more because backoff isn’t long enough.
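The retry-on-conflict pattern from item 2 can be sketched like this (hypothetical helpers, not the actual writerlease code): instead of giving up when a write conflicts, re-read the latest state from the cache and try again, up to a bounded number of attempts.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict")

// updateWithRetry sketches the "retry inside the lease" idea described
// above (hypothetical API): on a write conflict, re-read the freshest
// state from the cache and retry, rather than exiting immediately.
func updateWithRetry(readCache func() (string, error), write func(string) error, maxRetries int) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		state, readErr := readCache() // always start from the latest cached copy
		if readErr != nil {
			return readErr
		}
		if err = write(state); err == nil || !errors.Is(err, errConflict) {
			return err
		}
		// conflict: another writer won; loop and read the cache again
	}
	return err
}

func main() {
	version := 1
	readCache := func() (string, error) { return fmt.Sprintf("rv%d", version), nil }
	failures := 2
	write := func(state string) error {
		if failures > 0 {
			failures--
			version++ // someone else bumped the resource version
			return errConflict
		}
		return nil
	}
	fmt.Println(updateWithRetry(readCache, write, 5)) // <nil>: succeeds after two conflicts
}
```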

@knobunc knobunc requested a review from rajatchopra April 18, 2018 13:33
@knobunc commented Apr 18, 2018

@openshift/networking PTAL

@smarterclayton

/test gcp

@ramr left a comment


lgtm
minor comments.

	return strings.Trim(s, "\"'")
})
if err != nil {
	return
Log the error here? I know it can be a lot of messages, so log at, say, a numerically higher level (otherwise the error is eaten up here).

// hasExistingMatch returns true if any route in exists has the same path as route.
func hasExistingMatch(exists []*routeapi.Route, route *routeapi.Route) bool {
	for _, existing := range exists {
		if existing.Spec.Path == route.Spec.Path {
			return true
		}
	}
	return false
}

It felt odd at first to see only a path check here and not a host name, until I saw the caller using this for a specific host. Maybe add a comment; by itself, this function reads oddly.

@smarterclayton

Trying out a minor fix in the last commit that keeps the work queue from getting out of order: instead of requeueing with a delay and grabbing the next queue item immediately, sleep inside the work loop. This can delay shutdown of the queue until the lease expires, but it keeps queued items in roughly insertion order and prevents followers from simply skipping to the next item and competing with the leader. The followers get locked out of trying work, so they fall further and further behind, increasing the chance that the write lease stays held.

@knobunc commented Apr 19, 2018

Everything looks good to me too... I just don't want to say the magic word because your last commit still says WIP.

@smarterclayton commented Apr 19, 2018 via email

@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from 4f9079a to cc79897 Compare April 20, 2018 15:45
@smarterclayton

Found a bug last night where I dropped route modifications incorrectly in unique_host. Testing now.

@smarterclayton

/test gcp

1 similar comment

Commit messages from the final push:

The router unique host test was embedding the lost update that we observed as "correct" behavior. Clean up the test to not reuse objects and to set UIDs, and give each sequent a correct name.

An error condition on the extended validation test was being masked by a unique_host wrapper; remove that wrapper and make the test correct. Prevents lots of crap from being output when running in a namespace.

The writerlease is a work queue, but we were exiting immediately on conflicts. This is not our normal pattern, which is to build a work queue and then resync from the latest cache state. Change how status.go queues up work so that we perform our retry inside the lease function. Should ensure that the correct output is eventually written.

Because we are now using the informer cache, having the plugins mutate the passed-in object is incorrect (the cache doesn't have that modification). Instead, mutate the cache from the very beginning so that we always have the router's preferred spec.host set. This prevents double deletion logic and other ugly mismatches between upstream deletion and our deletion. Also use a consistent prefix so we get good debug output.
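The "don't mutate objects from the informer cache" rule above can be sketched like this (simplified hypothetical types; the real routeapi.Route has a generated DeepCopy). Cached objects are shared pointers, so a plugin must copy before setting defaults such as the preferred spec.host:

```go
package main

import "fmt"

// Route is a simplified stand-in for routeapi.Route (hypothetical).
type Route struct {
	Spec struct{ Host string }
}

// DeepCopy returns an independent copy (generated for real API types).
func (r *Route) DeepCopy() *Route {
	out := *r
	return &out
}

// withPreferredHost applies the rule from the commit message above:
// never write the router's preferred host through the cached pointer;
// copy first, then mutate the copy.
func withPreferredHost(cached *Route, defaultHost string) *Route {
	route := cached.DeepCopy() // never modify the cached object in place
	if route.Spec.Host == "" {
		route.Spec.Host = defaultHost
	}
	return route
}

func main() {
	cached := &Route{}
	updated := withPreferredHost(cached, "app.router.example.test")
	fmt.Println(updated.Spec.Host)        // the copy carries the default
	fmt.Println(cached.Spec.Host == "")   // true: the cache is untouched
}
```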
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from cc79897 to 624bf21 Compare April 23, 2018 05:19
@smarterclayton

Ok, I found and tracked down the bug that was causing the previous failures. When I added the host_index I needed to simplify the handling of a route modification that doesn't change path or host: if the route has the same resource version as the one we have cached, it's a no-op; if the resource version changed and the route is currently active, we need to tell the unique_host plugin that the route is activated. unique_host doesn't need to know the details, just whether to pass the update down.

With that, this PR should be correct and all known flakes addressed (I added tests for the no-op update and the update of an active route to hostindex_test). Ready for final review.
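That modification-handling rule is small enough to state as a pure function (hypothetical names, not the PR's actual code):

```go
package main

import "fmt"

// change classifies what unique_host should do with a route modification
// that does not alter host or path (hypothetical simplification).
type change int

const (
	noop change = iota
	activated
)

// classifyUpdate implements the rule described above: an update with the
// same resource version as the cached copy is a no-op; a changed resource
// version on the currently active route must be passed down as an
// activation so lower-level plugins see the update.
func classifyUpdate(cachedRV, newRV string, active bool) change {
	if newRV == cachedRV {
		return noop // identical resource version: nothing to propagate
	}
	if active {
		return activated // changed version on the active route: pass it down
	}
	return noop // inactive routes do not propagate updates
}

func main() {
	fmt.Println(classifyUpdate("42", "42", true) == noop)      // true
	fmt.Println(classifyUpdate("42", "43", true) == activated) // true
}
```

Keeping the decision side-effect free is what makes the two new hostindex_test cases mentioned above straightforward to write.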

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 23, 2018
@knobunc left a comment

/lgtm

Thanks for this Clayton... it's impressive.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 1a712f2 into openshift:master Apr 23, 2018
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. component/routing lgtm Indicates that a PR is ready to be merged. sig/networking size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
6 participants