Perform real backoff when contending for writes from the router #18686

Merged: 4 commits into openshift:master from the router_push branch on Mar 17, 2018

Conversation

smarterclayton (Contributor) commented Feb 20, 2018

The current route backoff mechanism works by tracking the last touch
time from other routers, but this is prone to failure and will not scale
to very large sets of routers competing to update status.

Instead, treat the ability to write status as a lease renewal, and have
failure to write status as a cue to backoff. Each new write further
increases the lease confidence up to an interval. Treat observed writes
from other processes as a signal that the lease holder is maintaining
their lease.

This should allow route status updates to be scale-free.

Still needs an e2e test.

@openshift/sig-networking @ramr

openshift-ci-robot added the sig/networking and approved labels on Feb 20, 2018
openshift-ci-robot added the size/XL (500-999 lines, ignoring generated files) label on Feb 20, 2018
smarterclayton force-pushed the router_push branch 2 times, most recently from 4fdc6cc to b1f1f97, on February 22, 2018 at 07:08
glog.V(4).Infof("[%s] Lease owner or electing, running %s", l.name, key)
}

isLeader, retry := fn()
Reviewer comment (Contributor):

Not sure I grok this part. So we call the work queue function (in this case to update status) - https://github.com/smarterclayton/origin/blob/b1f1f97f0eee4ae5325a05731288ece94e864f38/pkg/router/controller/status.go#L252 - and its response says it's the leader? Aren't we already the leader here if we got to this part?

smarterclayton (Contributor, Author) replied:

This is a lease that is driven by work renewal. If a client observes that no work has been done within the lease window, it can compete to acquire the lease by doing work.

If we succeed (return true for the work), then we have "acquired the lease" by virtue of doing the work. The route object status itself is acting as the lease.
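As a rough illustration of that idea (not this PR's code; the type and function names below are invented), a lease that is acquired simply by doing the work could be sketched in Go like this:

```go
package lease

import "time"

// workLease illustrates a lease that is renewed by successfully performing
// work (here, writing route status) rather than by a separate election.
// All names in this sketch are hypothetical, not taken from this PR.
type workLease struct {
	window  time.Duration // how long a successful write is treated as holding the lease
	expires time.Time     // when the current lease window runs out
}

var nowFn = time.Now // indirection so tests can control the clock

// tryWork calls fn only when no work has been observed inside the lease
// window; a successful piece of work "acquires" the lease for another window.
func (l *workLease) tryWork(fn func() (worked bool, retry bool)) {
	if nowFn().Before(l.expires) {
		// Someone (possibly us) did work recently; do not compete yet.
		return
	}
	if worked, _ := fn(); worked {
		// Writing the status succeeded, so we now hold the lease.
		l.expires = nowFn().Add(l.window)
	}
}
```

The key point is that there is no separate election call: a successful status write is itself the renewal.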

l.tick++
}
l.expires = nowFn().Add(l.nextBackoff())
}
Reviewer comment (Contributor):

I am going to have to look at this again ... where is the actual election done here? Are we setting this (aka doing the election) based on the status update for the route going through? That level of indirection might make it tough to understand this without some comments here.

smarterclayton (Contributor, Author) replied:

If you are able to write to the object (because you have a delta to the current state) you take ownership of the lease. If you get a contention, or if you observe another client write to the object, you go into random exponential backoff until one of your writes succeeds (resetting you to zero backoff) or you hit the max backoff (the lease interval).

If you're in follower mode, every time you observe another client doing real work you extend the lease.

For three router processes, the first route to be admitted would create a status attempt from each router (all in election). Whichever one won would consider itself the leader; the others would go into backoff. If the lease interval expires without the followers observing any work getting done, they'll start trying to do work. If an entire backoff interval goes by and there is no work, the lease is released and they'll start competing again. 1m is completely unscientific but is roughly equivalent to our current mechanism.
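A hedged sketch of the backoff side of this, again with invented names rather than the types added here: a successful write resets backoff to zero, while a conflict or an observed write from another client grows a randomized exponential backoff capped at the lease interval.

```go
package backoff

import (
	"math/rand"
	"time"
)

// state tracks whether we are leading (zero backoff) or following
// (randomized exponential backoff). The names are illustrative only.
type state struct {
	base    time.Duration // first backoff step after a conflict
	max     time.Duration // cap, roughly the lease interval
	current time.Duration // current backoff; 0 means our last write succeeded
}

// succeeded is called after our own status write went through: we hold the
// lease, so reset to zero backoff.
func (b *state) succeeded() { b.current = 0 }

// contended is called when our write conflicted or we observed another
// client writing status; it returns how long to wait before trying again.
func (b *state) contended() time.Duration {
	if b.current == 0 {
		b.current = b.base
	} else {
		b.current *= 2
	}
	if b.current > b.max {
		b.current = b.max
	}
	// Jitter so competing routers do not retry in lockstep.
	jitter := time.Duration(rand.Int63n(int64(b.current)/2 + 1))
	return b.current + jitter
}
```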

smarterclayton (Contributor, Author):

Note the goal is to prevent contended writes (instead of N replicas all attempting to write every status, we want to ensure that write load doesn't increase even as N increases).

smarterclayton (Contributor, Author):

/retest

smarterclayton (Contributor, Author):

/hold

As I was testing the various scenarios (conflicting config routers, multiple different router names, parallel) I realized that this causes a big slowdown when different routers are exposing the same route.

I'm going to turn this into a general cleanup of the existing code so that the logic is more obvious and the code is shared between reject and admit (today we don't update the route ingress field when a route is rejected, because we don't set canonical hostname or wildcard policy), and then come back to this at a later point. I'll also add better tests, because it's clear that we need a harness that makes this easy to understand.

openshift-ci-robot added the do-not-merge/hold label on Mar 4, 2018
openshift-ci-robot added the size/XXL (1000+ lines, ignoring generated files) label and removed the size/XL label on Mar 4, 2018
openshift-merge-robot added the vendor-update label on Mar 5, 2018
knobunc self-assigned this on Mar 5, 2018
knobunc self-requested a review on March 5, 2018 at 20:00
smarterclayton (Contributor, Author) commented Mar 6, 2018

Ok, this is all over but the screaming.

The previous update conflict algorithm was:

  1. If we know we stored the correct value and then receive a new, incorrect value, we know we're conflicting and we should stop attempting to write.
  2. In any other case, go ahead and attempt to write.

I've clarified that flow and added some bells and whistles:

  1. Instead of using an LRU with size 1024, we simply remember all routes and then have a periodic flush function
  2. In the periodic flush function, if we've detected contention, write a glog Warning to the logs so the customer knows what to do, or at least so we can debug it
  3. If we detect enough contention, just stop (until the next expiration window). I chose 10 contentions arbitrarily, and 1/10th the resync interval (3m is the default), but there shouldn't be any way for a human to trigger this accidentally (they'd have to manually update status), and so we detect contention much faster than before (before we would do one write for every route, now we do up to 10 writes). A rough sketch of this tracker appears after this list.
  4. Remove some of the unclear magic values we placed into the cache (timestamp zero vs. real timestamp)
  5. Stop depending on timestamps - those were a hack to work around the LRU. Now that we have a flush function, we don't really need to check timestamps at all. We could also go to an epoch model in the future if we wanted to.
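As referenced in item 3, here is a rough sketch of what such a contention tracker could look like; the names (tracker, observe, shouldWrite, flush) and the exact bookkeeping are illustrative assumptions, not the interface this PR introduces:

```go
package contention

import "github.com/golang/glog"

// tracker is an illustrative stand-in for the contention tracker described
// above: it remembers the status we believe we wrote last per route, counts
// writes observed from other processes, and is flushed on a timer instead of
// aging entries out of an LRU.
type tracker struct {
	lastWritten    map[string]string // route key -> status value we wrote last
	contentions    int               // conflicting writes seen since the last flush
	maxContentions int               // e.g. 10: stop attempting writes past this
}

func newTracker() *tracker {
	return &tracker{lastWritten: map[string]string{}, maxContentions: 10}
}

// recordWrite remembers the value we just wrote for a route.
func (t *tracker) recordWrite(key, value string) {
	t.lastWritten[key] = value
}

// observe notes the value currently on the route; if it differs from what we
// last wrote, another process is also writing status for this router name.
func (t *tracker) observe(key, value string) {
	if prev, ok := t.lastWritten[key]; ok && prev != value {
		t.contentions++
	}
}

// shouldWrite reports whether we are still allowed to attempt status writes
// in the current window.
func (t *tracker) shouldWrite() bool {
	return t.contentions <= t.maxContentions
}

// flush is called periodically (e.g. 1/10th of the resync interval): warn if
// contention was detected, then drop all state wholesale.
func (t *tracker) flush() {
	if t.contentions > 0 {
		glog.Warningf("detected %d conflicting route status writes; another process may be updating status for the same router name", t.contentions)
	}
	t.lastWritten = map[string]string{}
	t.contentions = 0
}
```

Because state is dropped wholesale on a timer, no per-entry timestamps or LRU aging are needed.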

Other cleanup

  1. I went through the existing status update code and made it consistent between admits and rejects - there is now exactly one code path
  2. During rejections we were not resetting wildcard policy or canonical hostname - that is now fixed
  3. I split the deeply confusing status conflict recording into its own data structure - there are much better comments in place and it should be easier to understand what is going on in the code
  4. Fixed up logging messages to be clearer about what is happening

The original impetus for this PR is still valid, but I'll come back to it in the future (deferring writes so that if we run a router at scale 10, we don't do 10 writes for every status) once this gets cleaned up. It will be much easier to do with this in place.

I still need to add the e2e test scenarios:

  1. create multiple routers attempting to update status on the same router name to different values, verify we perform no more than N routers * 10 writes
  2. create multiple routers attempting to update status on different router names, verify that they are all created and updated
  3. run a scaled-out router and verify that all routers accept writes, even with conflicts

Some initial review would be helpful. With this in place, future changes to this code are at least sane, and I'm more confident the code can be extended to support backoff.

Take the previous direct map access and place it behind a contention
tracker interface with much better comments. Add a better heuristic for
detecting mass conflicts (instead of processing all N routes before
giving up, stop much earlier). Remove the LRU behavior and use a simple
flushed cache. Unify the code for admission and rejection and fix a bug
where wildcard policy and canonical hostname weren't written in status.
openshift-merge-robot removed the vendor-update label on Mar 6, 2018
smarterclayton (Contributor, Author):

I had a breakthrough on the write leasing. I needed to extend the lease only when getting a "modified" route event where the current router's ingress status was the most recent (had the most recent Admitted condition last transition time). Also, I needed to enforce a minimum time for follower steps so that the last leader always gets an edge (this prevents conflicts after we have a long quiet period).
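For illustration, the "is our ingress the most recent admission" check could look roughly like the following. This sketch assumes the routev1 types from github.com/openshift/api/route/v1 (the PR itself works against origin's internal route API), and the helper names are hypothetical:

```go
package status

import (
	routev1 "github.com/openshift/api/route/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// admittedTime returns the LastTransitionTime of the Admitted condition on a
// single ingress entry, or nil if the condition is missing.
func admittedTime(ingress *routev1.RouteIngress) *metav1.Time {
	for i := range ingress.Conditions {
		if ingress.Conditions[i].Type == routev1.RouteAdmitted {
			return ingress.Conditions[i].LastTransitionTime
		}
	}
	return nil
}

// ownsMostRecentAdmission reports whether routerName's ingress entry carries
// the most recent Admitted transition time on the route; only then would an
// observed "modified" event be treated as extending our lease.
func ownsMostRecentAdmission(route *routev1.Route, routerName string) bool {
	var ours, newest *metav1.Time
	for i := range route.Status.Ingress {
		t := admittedTime(&route.Status.Ingress[i])
		if t == nil {
			continue
		}
		if newest == nil || t.After(newest.Time) {
			newest = t
		}
		if route.Status.Ingress[i].RouterName == routerName {
			ours = t
		}
	}
	return ours != nil && newest != nil && !ours.Before(newest)
}
```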

Barring tests, I think this addresses the two issues I originally wanted to solve:

  1. allow the router to be run with more than a few replicas (currently we have M*N writes, this gets us to O(N) writes)
  2. ensure that conflict detection is solid so that a rollout of a lot of replicas (>5) doesn't blow up the router

I need to get tests in but this is ready for eyeballs. Local testing confirmed it has the desired behavior.

smarterclayton (Contributor, Author):

/retest

Each router process now uses a rough "write leasing" scheme to avoid
sending conflicting writes. When a router starts, it tries to write
status for its ingress. If it succeeds in writing status, it considers
itself to hold the lease, and if it fails it considers itself a follower
and goes into exponential backoff. A single leader quickly emerges, and
all other routers observe the writes and consider the leader to be
extending her lease. In this fashion a large number of routers can use
the route status itself as a coordination mechanism and avoid generating
large numbers of meaningless writes.
smarterclayton (Contributor, Author):

/retest

smarterclayton (Contributor, Author):

/retest

smarterclayton removed the do-not-merge/hold label on Mar 10, 2018
smarterclayton (Contributor, Author):

/test install

smarterclayton (Contributor, Author):

I want to add one more stress test (testing random changes to routes over time), but I think this PR is ready for review as-is. The GCP test fails because the image doesn't have the latest code (we don't build images on GCP); I'll disable the test before merging and re-enable it after.

smarterclayton (Contributor, Author):

/test gcp

7 similar comments

knobunc (Contributor) left a review:

Clever code. But I think it matches the description, and the description seems to solve the problem.

/lgtm

openshift-ci-robot added the lgtm label on Mar 15, 2018
openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

smarterclayton (Contributor, Author):

/retest

smarterclayton (Contributor, Author):

/retest

ssh flake

smarterclayton (Contributor, Author):

/retest

openshift-merge-robot (Contributor):

/test all [submit-queue is verifying that this PR is safe to merge]

openshift-merge-robot (Contributor):

Automatic merge from submit-queue (batch tested with PRs 18686, 18998).

openshift-merge-robot merged commit c6d8a92 into openshift:master on Mar 17, 2018
openshift-ci-robot:

@smarterclayton: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/gcp | 73905e4 | link | /test gcp |
| ci/openshift-jenkins/extended_conformance_install | 73905e4 | link | /test extended_conformance_install |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Labels: approved, lgtm, sig/networking, size/XXL