Perform real backoff when contending for writes from the router #18686
Conversation
Force-pushed from 4fdc6cc to b1f1f97
glog.V(4).Infof("[%s] Lease owner or electing, running %s", l.name, key)
}

isLeader, retry := fn()
Not sure I grok this part. So we call the work queue function (in this case to update status) - https://github.com/smarterclayton/origin/blob/b1f1f97f0eee4ae5325a05731288ece94e864f38/pkg/router/controller/status.go#L252 - and its response says it's the leader? Aren't we already the leader here if we got to this part?
This is a lease that is driven by work renewal. If a client observes that no work has been done within the lease window, it can compete to acquire the lease by doing work.
If we succeed (return true for the work), then we have "acquired the lease" by virtue of doing the work. The route object status itself is acting as the lease.
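A minimal sketch of that pattern, assuming invented names (workFn, writeLease, the one-minute window) rather than the PR's actual types; the point is that the (isLeader, retry) result of the work function is what decides lease ownership:

package main

import (
	"fmt"
	"time"
)

// workFn attempts the real work (here, a status write). It reports whether the
// write succeeded (making the caller the lease holder) and whether to retry.
type workFn func() (isLeader bool, retry bool)

// writeLease is an illustrative stand-in for the work-renewed lease: whoever
// last succeeded at the work holds the lease, and the route status is the lease.
type writeLease struct {
	leader  bool
	expires time.Time
}

// Try runs fn only if we hold the lease or the lease window has lapsed;
// succeeding at the work is what acquires (or renews) the lease.
func (l *writeLease) Try(fn workFn) {
	if !l.leader && time.Now().Before(l.expires) {
		return // another process did work recently; stay a follower
	}
	isLeader, retry := fn()
	switch {
	case isLeader:
		l.leader = true
		l.expires = time.Now().Add(time.Minute) // doing the work renews the lease
	case retry:
		l.leader = false // contended; a real implementation would back off here
	default:
		l.leader = false
	}
}

func main() {
	l := &writeLease{}
	l.Try(func() (bool, bool) {
		fmt.Println("status written; we hold the lease")
		return true, false
	})
}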
l.tick++
}
l.expires = nowFn().Add(l.nextBackoff())
}
I am going to have to look at this again ... where is the actual election done here? Are we setting this (aka doing the election) based on the status update for the route going through? That level of indirection might make it tough to understand this without some comments here.
If you are able to write to the object (because you have a delta to the current state) you take ownership of the lease. If you get a contention, or if you observe another client write to the object, you go into random exponential backoff until one of your writes succeeds (resetting you to zero backoff) or you hit the max backoff (the lease interval).
If you're in follower mode, every time you observe another client doing real work you extend the lease.
For three router processes, the first route to be admitted would create a status attempt from each router (all in election). Whichever won would consider itself the leader - the others would go into backoff. If the lease interval expires without the followers observing any work getting done, they'll start trying to do work. If an entire backoff interval goes by and there is no work, the lease is released and they'll start competing again. 1m is completely unscientific but is roughly equivalent to our current mechanism.
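A rough sketch of the random exponential backoff being described (the 1m cap, the tick counter, and next are modeled on the snippets quoted above; none of this is the PR's actual code):

package main

import (
	"fmt"
	"math/rand"
	"time"
)

const leaseInterval = time.Minute // max backoff; roughly the current mechanism

// backoff counts contentions seen since our last successful write.
type backoff struct {
	tick int
}

// next returns a randomized exponential delay, capped at the lease interval.
// The jitter keeps competing routers from retrying in lockstep.
func (b *backoff) next() time.Duration {
	d := time.Duration(1<<uint(b.tick)) * time.Second
	if d <= 0 || d > leaseInterval {
		d = leaseInterval
	}
	return d/2 + time.Duration(rand.Int63n(int64(d/2)))
}

// contended is called when our write conflicts or we observe another writer.
func (b *backoff) contended() { b.tick++ }

// succeeded is called when one of our writes goes through; backoff resets to zero.
func (b *backoff) succeeded() { b.tick = 0 }

func main() {
	var b backoff
	for i := 0; i < 5; i++ {
		b.contended()
		fmt.Println(b.next())
	}
	b.succeeded()
	fmt.Println("after reset:", b.next())
}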
Force-pushed from b1f1f97 to 054078b
Note the goal is to prevent contended writes (instead of N replicas all attempting to write every status, we want to ensure that even as N increases write load doesn't increase back to the router).
/retest
/hold
As I was testing the various scenarios (conflicting config routers, multiple different router names, parallel), I realized that this causes a big slowdown when different routers are exposing the same route. I'm going to make this a general cleanup of the existing code so that the logic is more obvious and the code is shared between reject and admit (we don't update the route ingress field when a route is rejected because we don't set canonical hostname or wildcard policy), and then come back to this at a later point. I'll also add better tests, because it's clear that we need a harness that makes this easy to understand.
Force-pushed from 054078b to 4ae5470
Force-pushed from 4ae5470 to 3d59c70
Force-pushed from 3d59c70 to 4ad44ed
Ok, this is all over but the screaming. The actual update conflict algorithm previously was:
I've clarified that flow and added some bells and whistles:
Other cleanup
The actual original impetus for this PR is still valid, but I'll come back to that in the future (deferring writes so that if we have a router scale 10, we don't do 10 writes for every status) once this gets cleaned up. It will be much easier to do with this in place. I still need to add the e2e test scenarios:
Some initial review will be helpful. Future changes to this code are at least sane and I'm more confident the code can be extended to backoff.
Force-pushed from 4ad44ed to 60614d4
Take the previous direct map access and place it behind a contention tracker interface with much better comments. Add a better heuristic for detecting mass conflicts (instead of processing all N routes before giving up, stop much earlier). Remove the LRU behavior and use a simple flushed cache. Unify the code for admission and rejection and fix a bug where wildcard policy and canonical hostname weren't written in status.
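As a rough illustration of what a contention tracker interface like that could look like (the names and methods here are invented for illustration and are not the PR's actual API):

package main

import "fmt"

// ContentionTracker hides the previous direct map access behind methods that
// record our own writes and the writes we observe from other processes.
type ContentionTracker interface {
	// IsContended reports whether another writer touched the route (by key)
	// since our last successful write.
	IsContended(key string) bool
	// RecordOurWrite notes that our own status write for key succeeded.
	RecordOurWrite(key string)
	// RecordOtherWrite notes that we observed another process write key.
	RecordOtherWrite(key string)
	// Flush clears all tracked state (a simple flushed cache rather than an LRU).
	Flush()
}

// mapTracker is a trivial in-memory implementation of the sketch above.
type mapTracker struct {
	contended map[string]bool
}

func newMapTracker() *mapTracker { return &mapTracker{contended: map[string]bool{}} }

func (t *mapTracker) IsContended(key string) bool { return t.contended[key] }
func (t *mapTracker) RecordOurWrite(key string)   { t.contended[key] = false }
func (t *mapTracker) RecordOtherWrite(key string) { t.contended[key] = true }
func (t *mapTracker) Flush()                      { t.contended = map[string]bool{} }

func main() {
	var tracker ContentionTracker = newMapTracker()
	tracker.RecordOurWrite("ns/my-route")
	tracker.RecordOtherWrite("ns/my-route")
	fmt.Println(tracker.IsContended("ns/my-route")) // true: someone else wrote after us
}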
Force-pushed from 60614d4 to 6ff4a9f
I had a breakthrough on the write leasing. I needed to extend the lease only when getting a "modified" route event where the current router's ingress status was the most recent (had the most recent Admitted condition last transition time). Also, I needed to enforce a minimum time for follower steps so that the last leader would always get an edge (prevents conflicts after we have a long quiet period). Barring tests, I think this addresses the two issues I originally wanted to solve:
I need to get tests in but this is ready for eyeballs. Local testing confirmed it has the desired behavior.
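The "most recent Admitted condition" check described above might look roughly like this, using local stand-in types rather than the route API's RouteIngress/RouteIngressCondition (names are illustrative, not the PR's code):

package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the route ingress status written by each router.
type ingressCondition struct {
	Type               string // e.g. "Admitted"
	LastTransitionTime time.Time
}

type routeIngress struct {
	RouterName string
	Conditions []ingressCondition
}

// ownsMostRecentAdmitted reports whether routerName wrote the ingress whose
// Admitted condition transitioned most recently. Only then should a "modified"
// route event be treated as the leader extending its lease.
func ownsMostRecentAdmitted(ingresses []routeIngress, routerName string) bool {
	var latest time.Time
	var latestRouter string
	for _, ing := range ingresses {
		for _, cond := range ing.Conditions {
			if cond.Type != "Admitted" {
				continue
			}
			if cond.LastTransitionTime.After(latest) {
				latest = cond.LastTransitionTime
				latestRouter = ing.RouterName
			}
		}
	}
	return latestRouter == routerName
}

func main() {
	now := time.Now()
	ingresses := []routeIngress{
		{RouterName: "router-a", Conditions: []ingressCondition{{Type: "Admitted", LastTransitionTime: now.Add(-time.Minute)}}},
		{RouterName: "router-b", Conditions: []ingressCondition{{Type: "Admitted", LastTransitionTime: now}}},
	}
	fmt.Println(ownsMostRecentAdmitted(ingresses, "router-b")) // true
}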
/retest
Each router process now uses a rough "write leasing" scheme to avoid sending conflicting writes. When a router starts, it tries to write status for its ingress. If it succeeds in writing status, it considers itself to hold the lease, and if it fails it considers itself a follower and goes into exponential backoff. A single leader quickly emerges, and all other routers observe the writes and consider the leader to be extending her lease. In this fashion a large number of routers can use the route status itself as a coordination mechanism and avoid generating large numbers of meaningless writes.
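Put differently, each router can be thought of as running a small state machine like the following (a sketch under the assumptions above; the names are invented and this is not the PR's actual implementation):

package main

import (
	"fmt"
	"time"
)

type leaseState int

const (
	follower leaseState = iota
	leader
)

const leaseInterval = time.Minute

// writeLeaser sketches the per-router state: successful writes make (or keep)
// us the leader; conflicts push us into follower mode, and observing the
// leader's writes extends the lease from the follower's point of view.
type writeLeaser struct {
	state   leaseState
	expires time.Time
}

func (w *writeLeaser) onWriteSucceeded() {
	w.state = leader
	w.expires = time.Now().Add(leaseInterval)
}

func (w *writeLeaser) onWriteConflicted() {
	w.state = follower
	w.expires = time.Now().Add(leaseInterval) // back off; the leader is elsewhere
}

func (w *writeLeaser) onObservedOtherWrite() {
	if w.state == follower {
		w.expires = time.Now().Add(leaseInterval) // the leader is extending its lease
	}
}

// shouldAttemptWrite is true for the leader, or for a follower whose lease
// window lapsed without observing any work.
func (w *writeLeaser) shouldAttemptWrite() bool {
	return w.state == leader || time.Now().After(w.expires)
}

func main() {
	w := &writeLeaser{}
	fmt.Println(w.shouldAttemptWrite()) // true: nothing observed yet, so compete
	w.onWriteConflicted()
	fmt.Println(w.shouldAttemptWrite()) // false: another router won; we follow
}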
Force-pushed from 6ff4a9f to 9cfd001
/retest
Force-pushed from 82c6b10 to 162e42d
/retest
Force-pushed from 162e42d to 9705730
Force-pushed from 9705730 to 73905e4
/test install
I want to add one more stress test (testing random changes to routes over time) but I think this PR is ready for review as is. The GCP test fails because the image doesn't have the latest code (we don't build images on GCP), so I'll disable the test before merging and re-enable it after.
/test gcp
7 similar comments
Clever code. But I think it matches the description, and the description seems to solve the problem.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: knobunc, smarterclayton
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
/retest ssh flake
/retest
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue (batch tested with PRs 18686, 18998).
@smarterclayton: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The current route backoff mechanism works by tracking the last touch
time from other routes, but this is prone to failure and will not scale
to very large sets of routers competing for updating status.
Instead, treat the ability to write status as a lease renewal, and have
failure to write status as a cue to backoff. Each new write further
increases the lease confidence up to an interval. Treat observed writes
from other processes as a signal that the lease holder is maintaining
their lease.
This should allow route status updates to be scale-free.
Needs an e2e test still
@openshift/sig-networking @ramr