Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Router: Changed default resource resync interval from 10mins to 30mins #17012

Merged

Conversation

pravisankar
Copy link

@pravisankar pravisankar commented Oct 23, 2017

  • Resyncs are mainly intended for robustness. Mainly to handle the case
    where the resource handler failed to process the item and we hope this
    will be fixed if we process the item again after sometime(resync interval).
    Yes, this may fix some transient errors but if we resync frequently then
    there could be big penalities.

  • Currently router watches these resources: routes, endpoints, nodes,
    namespaces, ingresses and secrets. When we have many routes
    (like several thousand in online case), processing these items takes
    long time, router reload itself takes few seconds (not milliseconds).
    Due to short resync interval there will be constant churn of reprocessing
    of all the items for all these resources.

  • Earlier we needed shorter resync interval because sharded router was depending
    on this interval but with Sharded router based on namespace labels should notice routes immediately #16039 that limitation is removed.

10 mins seems aggressive for some rare transient errors, changed defaults
to 30 mins. Admin can edit router deployment config if they need custom resync interval.

Fixed project sync interval in router

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 23, 2017
@pravisankar
Copy link
Author

@knobunc @eparis @rajatchopra PTAL

@pravisankar pravisankar added component/networking kind/bug Categorizes issue or PR as related to a bug. sig/networking labels Oct 23, 2017
Ravi Sankar Penta added 2 commits October 23, 2017 16:55
Rationale:

- Resyncs are mainly intended for robustness. Mainly to handle the case
where the resource handler failed to process the item and we hope this
will be fixed if we process the item again after sometime(resync interval).
Yes, this may fix some transient errors but if we resync frequently then
there could be big penalities.

- Currently router watches these resources: routes, endpoints, nodes,
namespaces, ingresses and secrets. When we have many routes
(like several thousand in online case), processing these items takes
long time, router reload itself takes few seconds (not milliseconds).
Due to short resync interval there will be constant churn of reprocessing
of all the items for all these resources.

- Earlier we needed shorter resync interval because sharded router was depending
on this interval but with openshift#16039
that limitation is removed.

10 mins seems aggressive for some rare transient errors, changed defaults
to 30 mins. Admin can edit router deployment config if they need custom resync interval.
ResyncInterval is tunable param and it can go as low as 1 second.
@php-coder
Copy link
Contributor

/unassign

@eparis
Copy link
Member

eparis commented Oct 24, 2017

/hold
Is there a bz or other specific problem this is trying to address? Admittedly we should never need the resync, but 30 minutes seems like an aweful long time...

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2017
@pravisankar
Copy link
Author

@eparis
We use 30 mins resync interval in openshift SDN. I'm assuming transient problems are rare and in most cases when it fails first time, it is highly likely that it will also fail next time. Any recommendations?
There is no open bug for this but the second commit fixes the issue where project sync interval could be negative if tunable resync interval is very low.

@imcsk8
Copy link
Contributor

imcsk8 commented Oct 24, 2017

LGTM

@eparis
Copy link
Member

eparis commented Oct 24, 2017

/lgtm
but lets leave this held until master opens for 3.8.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 24, 2017
@openshift-merge-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eparis, pravisankar

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 24, 2017
@0xmichalis 0xmichalis added this to the 3.8.0 milestone Nov 2, 2017
@pravisankar pravisankar added sig/networking and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. labels Nov 14, 2017
@pravisankar
Copy link
Author

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot
Copy link
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue (batch tested with PRs 17012, 17243).

@openshift-merge-robot openshift-merge-robot merged commit 366bc48 into openshift:master Nov 18, 2017
@openshift-ci-robot
Copy link

@pravisankar: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/openshift-jenkins/extended_conformance_install_update 5e20571 link /test extended_conformance_install_update
ci/openshift-jenkins/extended_conformance_install 5e20571 link /test extended_conformance_install
ci/openshift-jenkins/extended_conformance_gce 5e20571 link /test extended_conformance_gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. component/routing lgtm Indicates that a PR is ready to be merged. sig/networking size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants