
The unique_host router filter can lose routes #19175

Merged

Conversation

@smarterclayton commented Apr 2, 2018

The unique_host filter grew over time to add additional functionality and
scenarios. As part of that growth, the guarantees for how claiming routes
evicted and restored other routes that shared the same host or path were
weakened and eventually broken. If two routes that conflict by host and path
are created at the same time and one of them is then deleted, there is a 50%
chance that the host remains claimed by the deleted route.

Refactor unique_host to use a separate data structure - a host index - to
track the routes that are assigned to each host. Split the logic for
deciding which routes to use for a given host into side-effect-free
functions that can be individually tested. In the core unique_host code,
guarantee that when routes cover other routes, deletions are propagated
down to lower-level plugins and no route is left behind.

Add much stronger tests on the core data structure to ensure the logic of
unique_host is not broken.

Identified while adding #18658. Only the last two commits are unique.
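The host-index idea described above can be sketched roughly as follows (all names here are hypothetical simplifications, not the PR's actual hostindex API). The key property is that the index remembers every route claiming a host, so removing the active claimant restores a previously displaced route instead of losing it:

```go
package main

import (
	"fmt"
	"sort"
)

// Route is a simplified stand-in for routeapi.Route (hypothetical).
type Route struct {
	Name string
	Host string
	Age  int // lower = older; the oldest route wins a contested host
}

// hostIndex tracks every route claiming each host, not just the winner,
// so removing the active route can restore a previously displaced one.
type hostIndex struct {
	routesByHost map[string][]*Route
}

func newHostIndex() *hostIndex {
	return &hostIndex{routesByHost: make(map[string][]*Route)}
}

func (i *hostIndex) Add(r *Route) {
	i.routesByHost[r.Host] = append(i.routesByHost[r.Host], r)
}

func (i *hostIndex) Remove(name, host string) {
	routes := i.routesByHost[host]
	for n, r := range routes {
		if r.Name == name {
			i.routesByHost[host] = append(routes[:n:n], routes[n+1:]...)
			return
		}
	}
}

// Active decides which route currently owns a host. It is a side-effect
// free function over the index contents, so it can be tested in isolation.
func (i *hostIndex) Active(host string) *Route {
	routes := append([]*Route(nil), i.routesByHost[host]...)
	if len(routes) == 0 {
		return nil
	}
	sort.Slice(routes, func(a, b int) bool { return routes[a].Age < routes[b].Age })
	return routes[0]
}

func main() {
	idx := newHostIndex()
	idx.Add(&Route{Name: "old", Host: "www.example.test", Age: 1})
	idx.Add(&Route{Name: "new", Host: "www.example.test", Age: 2})
	fmt.Println(idx.Active("www.example.test").Name) // old
	idx.Remove("old", "www.example.test")
	fmt.Println(idx.Active("www.example.test").Name) // new: restored, not lost
}
```

In this sketch, deleting the older route hands the host to the surviving one; the 50% failure in the old filter came from not tracking the displaced route at all.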

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 2, 2018
@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 2, 2018
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from 38b7cf3 to 0ef491c Compare April 2, 2018 05:27
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from 0ef491c to 5ef5849 Compare April 2, 2018 20:37
@smarterclayton

/test unit

@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from 5ef5849 to a1697ac Compare April 3, 2018 02:57
@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 3, 2018
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from a1697ac to 22bb5f2 Compare April 3, 2018 06:03
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 3, 2018
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch 3 times, most recently from 99ef47b to 1360dd3 Compare April 5, 2018 00:08
@smarterclayton

/retest

4 similar comments
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch 2 times, most recently from 7ddfad2 to 1a6c7db Compare April 8, 2018 21:36
@smarterclayton

/retest

@smarterclayton

@openshift/networking: still iterating on this, but I found three separate issues: one that was new this release, one that was a holdover from previous releases related to closely spaced updates to routes, and one bug that has existed since wildcard support was added.

Still trying to tease apart remaining issues.

@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch 2 times, most recently from 32a6f16 to 6e083a6 Compare April 17, 2018 05:10
@smarterclayton

/retest

@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from 6e083a6 to ef95391 Compare April 17, 2018 20:30
@smarterclayton

Ok, this is ready for review. The last few commits are cleaning up the tests.

@smarterclayton

Three related issues:

  1. unique_host wasn't tracking when removed routes uncovered a previously claimed route.
  2. The new write lease code was subtly wrong for conflicts: we gave up rather than retrying. To retry, we need to read from the cache and keep track of the retries, which requires tracking the retry in the write lease correctly.
  3. To use the cache, we need to extract the route update logic so it happens before the cache, so we aren't modifying objects from the cache.

Right now one of the e2e stress tests will flake a bit more because backoff isn’t long enough.
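The retry-on-conflict pattern from item 2 can be sketched like this (hypothetical helpers, not the actual writerlease code): instead of giving up when a write conflicts, re-read the latest state from the cache and try again, up to a bounded number of attempts.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict")

// updateWithRetry sketches the "retry inside the lease" idea described
// above (hypothetical API): on a write conflict, re-read the freshest
// state from the cache and retry, rather than exiting immediately.
func updateWithRetry(readCache func() (string, error), write func(string) error, maxRetries int) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		state, readErr := readCache() // always start from the latest cached copy
		if readErr != nil {
			return readErr
		}
		if err = write(state); err == nil || !errors.Is(err, errConflict) {
			return err
		}
		// conflict: another writer won; loop and read the cache again
	}
	return err
}

func main() {
	version := 1
	readCache := func() (string, error) { return fmt.Sprintf("rv%d", version), nil }
	failures := 2
	write := func(state string) error {
		if failures > 0 {
			failures--
			version++ // someone else bumped the resource version
			return errConflict
		}
		return nil
	}
	fmt.Println(updateWithRetry(readCache, write, 5)) // <nil>: succeeds after two conflicts
}
```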

@knobunc knobunc requested a review from rajatchopra April 18, 2018 13:33
@knobunc commented Apr 18, 2018

@openshift/networking PTAL

@smarterclayton

/test gcp

@ramr left a comment


lgtm
minor comments.

	return strings.Trim(s, "\"'")
})
if err != nil {
	return
Log the error here? I know it can be a lot of messages, so log at, say, a numerically higher level (otherwise the error is eaten up here).

// hasExistingMatch returns true if any route in exists has the same path as route.
func hasExistingMatch(exists []*routeapi.Route, route *routeapi.Route) bool {
	for _, existing := range exists {
		if existing.Spec.Path == route.Spec.Path {
			return true
		}
	}
	return false
}

It felt odd at first to see only a path check here and not a host name, until I saw the caller using this for a specific host. Maybe add a comment; by itself, this function reads oddly.

@smarterclayton

Trying out a minor fix in the last commit that keeps the work queue from getting out of order: instead of requeueing with a delay and grabbing the next queue item immediately, sleep inside the work loop. This can delay shutdown of the queue until the lease expires, but it keeps queued items in roughly insertion order and prevents followers from simply skipping to the next item and competing with the leader. The followers get locked out of trying work, so they fall further and further behind, increasing the chance that the write lease stays held.

@knobunc commented Apr 19, 2018

Everything looks good to me too... I just don't want to say the magic word because your last commit still says WIP.

@smarterclayton commented Apr 19, 2018 via email

@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from 4f9079a to cc79897 Compare April 20, 2018 15:45
@smarterclayton

Found a bug last night where I dropped route modifications incorrectly in unique_host. Testing now.

@smarterclayton

/test gcp

1 similar comment

Commit messages from the final push:

The router unique host test was embedding the lost update that we observed as "correct" behavior. Clean up the test to not reuse objects and to set UIDs, and give each sequent a correct name.

An error condition on the extended validation test was being masked by a unique_host wrapper; remove that wrapper and make the test correct. Prevents lots of crap from being output when running in a namespace.

The writerlease is a work queue, but we were exiting immediately on conflicts. This is not our normal pattern, which is to build a work queue and then resync from the latest cache state. Change how status.go queues up work so that we perform our retry inside the lease function. Should ensure that the correct output is eventually written.

Because we are now using the informer cache, having the plugins mutate the passed-in object is incorrect (the cache doesn't have that modification). Instead, mutate the cache from the very beginning so that we always have the router's preferred spec.host set. This prevents double deletion logic and other ugly mismatches between upstream deletion and our deletion. Also use a consistent prefix so we get good debug output.
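The "don't mutate objects from the informer cache" rule above can be sketched like this (simplified hypothetical types; the real routeapi.Route has a generated DeepCopy). Cached objects are shared pointers, so a plugin must copy before setting defaults such as the preferred spec.host:

```go
package main

import "fmt"

// Route is a simplified stand-in for routeapi.Route (hypothetical).
type Route struct {
	Spec struct{ Host string }
}

// DeepCopy returns an independent copy (generated for real API types).
func (r *Route) DeepCopy() *Route {
	out := *r
	return &out
}

// withPreferredHost applies the rule from the commit message above:
// never write the router's preferred host through the cached pointer;
// copy first, then mutate the copy.
func withPreferredHost(cached *Route, defaultHost string) *Route {
	route := cached.DeepCopy() // never modify the cached object in place
	if route.Spec.Host == "" {
		route.Spec.Host = defaultHost
	}
	return route
}

func main() {
	cached := &Route{}
	updated := withPreferredHost(cached, "app.router.example.test")
	fmt.Println(updated.Spec.Host)        // the copy carries the default
	fmt.Println(cached.Spec.Host == "")   // true: the cache is untouched
}
```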
@smarterclayton smarterclayton force-pushed the fix_lost_unique_hosts branch from cc79897 to 624bf21 Compare April 23, 2018 05:19
@smarterclayton

Ok, I found and tracked down the bug that was causing the previous failures. When I added the host_index I needed to simplify the handling of a route modification that doesn't change path or host: if the route has the same resource version as the one we have cached, it's a no-op; if the resource version changed and the route is currently active, we need to tell the unique_host plugin that the route is activated. unique_host doesn't need to know the details, just whether to pass the update down.

With that, this PR should be correct and all known flakes addressed (I added tests for the no-op update and the update of an active route to hostindex_test). Ready for final review.
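That modification-handling rule is small enough to state as a pure function (hypothetical names, not the PR's actual code):

```go
package main

import "fmt"

// change classifies what unique_host should do with a route modification
// that does not alter host or path (hypothetical simplification).
type change int

const (
	noop change = iota
	activated
)

// classifyUpdate implements the rule described above: an update with the
// same resource version as the cached copy is a no-op; a changed resource
// version on the currently active route must be passed down as an
// activation so lower-level plugins see the update.
func classifyUpdate(cachedRV, newRV string, active bool) change {
	if newRV == cachedRV {
		return noop // identical resource version: nothing to propagate
	}
	if active {
		return activated // changed version on the active route: pass it down
	}
	return noop // inactive routes do not propagate updates
}

func main() {
	fmt.Println(classifyUpdate("42", "42", true) == noop)      // true
	fmt.Println(classifyUpdate("42", "43", true) == activated) // true
}
```

Keeping the decision side-effect free is what makes the two new hostindex_test cases mentioned above straightforward to write.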

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 23, 2018
@knobunc left a comment

/lgtm

Thanks for this Clayton... it's impressive.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 1a712f2 into openshift:master Apr 23, 2018
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. component/routing lgtm Indicates that a PR is ready to be merged. sig/networking size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
6 participants