Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make router conflict detection work even during initial informer sync #19706

Merged
merged 2 commits into from
May 18, 2018

Conversation

smarterclayton
Copy link
Contributor

@smarterclayton smarterclayton commented May 14, 2018

The conflict detection mechanism in the router did not correctly handle sync (which on large clusters could be a huge number of writes). The informer delivers events in order - all SYNC, then updates. Two routers starting at the beginning of the sync could conflict without detecting it until they see the first update (so O(2*routes)). Instead of detecting conflicts in the router plugin, move detection to a side scanner that looks at the informer directly and only processes updates (so we ignore the initial sync and then rapidly start detecting updates). This causes conflicting routers to converge quickly.

In addition, the stress test is not tuned particularly well for passing. We create 3 routers with conflicting status and update 10 routes. The hard stop for conflict detection is at 10 conflicts every 12 seconds. The test is intended to verify that we hard stop when conflicting, and that a subsequent edit of a route converges at O(routers).

This commit changes the number of conflicts a router tolerates before blocking all updates. Since conflict detection is reliable (we see a route we changed change to a different status value) the new value 5 is just as safe as the old value of 10. I've also increased the number of routes - that should more clearly identify the upper bounds of conflict writes (which should be <15 + initial writes detected).

@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 14, 2018
Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 14, 2018
@smarterclayton smarterclayton removed the lgtm Indicates that a PR is ready to be merged. label May 14, 2018
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 14, 2018
@smarterclayton smarterclayton changed the title Add route uid to debug logs for router Tune to prevent haproxy stress test flakes May 14, 2018
@smarterclayton
Copy link
Contributor Author

/test gcp

@knobunc
Copy link
Contributor

knobunc commented May 15, 2018

/lgtm
Thanks Clayton

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 15, 2018
Copy link

@pravisankar pravisankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@smarterclayton smarterclayton removed the lgtm Indicates that a PR is ready to be merged. label May 15, 2018
Each router should cap at 10 conflicts. Use more routes to get more
potentially conflicting writes.
@smarterclayton
Copy link
Contributor Author

The reason the test was flaking is that 3 routers must burn through their entire initial sync before they start detecting conflicts (a router was only detecting conflicts if they themselves set the value). I added a commit that also detects when someone else updates the route, which should cause contention to be detected much earlier for N > 2 routers. For N == 2, we probably need a different informer watcher.

I think in 3.11 I'm going to move conflict detection out of the status loop and into its own informer watcher, which will juts track this and signal to the core loop.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 15, 2018
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 16, 2018
@smarterclayton
Copy link
Contributor Author

Scratch that, I'm just going to change it now. Having two routers contending on a long sync might be common and we might as well do it right. Testing...

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Because the initial sync of the router may take a long time and we need
to detect conflicts caused early in that sync, separate out contention
detection into a separate goroutine fed by the informer which looks for
repeated changes to a given route's status. The status plugin resets the
state to candidate when it makes a write.
@smarterclayton
Copy link
Contributor Author

Updated, ready for re-review.

@smarterclayton smarterclayton changed the title Tune to prevent haproxy stress test flakes Make router conflict detection work even during initial informer sync May 16, 2018
@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton
Copy link
Contributor Author

Three runs without a failure - I think I've gotten this locked down.

@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton
Copy link
Contributor Author

/test gcp

1 similar comment
@smarterclayton
Copy link
Contributor Author

/test gcp

@smarterclayton
Copy link
Contributor Author

No failures in the last 5 runs.

@smarterclayton
Copy link
Contributor Author

Going to mark for merge, please comment and I'll address anything after.

@smarterclayton smarterclayton added the lgtm Indicates that a PR is ready to be merged. label May 17, 2018
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 2270d0a into openshift:master May 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants