-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make router conflict detection work even during initial informer sync #19706
Make router conflict detection work even during initial informer sync #19706
Conversation
3bad2cc
to
df50150
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
df50150
to
f29c587
Compare
f29c587
to
1ef10b8
Compare
/test gcp |
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Each router should cap at 10 conflicts. Use more routes to get more potentially conflicting writes.
1ef10b8
to
a4e322b
Compare
The reason the test was flaking is that 3 routers must burn through their entire initial sync before they start detecting conflicts (a router was only detecting conflicts if they themselves set the value). I added a commit that also detects when someone else updates the route, which should cause contention to be detected much earlier for N > 2 routers. For N == 2, we probably need a different informer watcher. I think in 3.11 I'm going to move conflict detection out of the status loop and into its own informer watcher, which will juts track this and signal to the core loop. |
5dc4777
to
84b5154
Compare
Scratch that, I'm just going to change it now. Having two routers contending on a long sync might be common and we might as well do it right. Testing... |
84b5154
to
4668a91
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: knobunc, smarterclayton The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Because the initial sync of the router may take a long time and we need to detect conflicts caused early in that sync, separate out contention detection into a separate goroutine fed by the informer which looks for repeated changes to a given route's status. The status plugin resets the state to candidate when it makes a write.
4668a91
to
aa28fa9
Compare
Updated, ready for re-review. |
/retest |
Three runs without a failure - I think I've gotten this locked down. |
/retest |
/test gcp |
1 similar comment
/test gcp |
No failures in the last 5 runs. |
Going to mark for merge, please comment and I'll address anything after. |
/retest Please review the full test history for this PR and help us cut down flakes. |
The conflict detection mechanism in the router did not correctly handle sync (which on large clusters could be a huge number of writes). The informer delivers events in order - all SYNC, then updates. Two routers starting at the beginning of the sync could conflict without detecting it until they see the first update (so O(2*routes)). Instead of detecting conflicts in the router plugin, move detection to a side scanner that looks at the informer directly and only processes updates (so we ignore the initial sync and then rapidly start detecting updates). This causes conflicting routers to converge quickly.
In addition, the stress test is not tuned particularly well for passing. We create 3 routers with conflicting status and update 10 routes. The hard stop for conflict detection is at 10 conflicts every 12 seconds. The test is intended to verify that we hard stop when conflicting, and that a subsequent edit of a route converges at O(routers).
This commit changes the number of conflicts a router tolerates before blocking all updates. Since conflict detection is reliable (we see a route we changed change to a different status value) the new value 5 is just as safe as the old value of 10. I've also increased the number of routes - that should more clearly identify the upper bounds of conflict writes (which should be <15 + initial writes detected).