Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve router metrics across restarts #18245

Merged

Conversation

smarterclayton
Copy link
Contributor

Router metrics aren't preserved across HAProxy reloads. To counter that, capture the state of certain counter style metrics (NOT rate style metrics) immediately before a router reload, then add them to subsequent reports. Some metrics will be lost (any recorded between the capture and the actual restart time) but this will make metrics accurate to within a few percent, even when routers are restarting frequently.

This increases the cost of the reload slightly, but metrics are arguably worth it.

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 23, 2018
@smarterclayton
Copy link
Contributor Author

@openshift/networking for 3.10

@smarterclayton
Copy link
Contributor Author

Ok, the GCP test failed correctly (because we don't rebuild the image) when i ran without router exclusions. I'm going to drop router exclusions in openshift-eng/aos-cd-jobs#1023 and then this should report that GCP fails. When we merge if we don't have image builds for GCP yet we'll need to disable the test on GCP and then bring it back (or have the test autodetect the image is too old)

@smarterclayton
Copy link
Contributor Author

/test gcp
/retest

@knobunc
Copy link
Contributor

knobunc commented Feb 14, 2018

The code looks okay. BUT we still won't get the stats for older router processes that are handling long-running connections. So do we want to merge this now, or wait for dynamic updates to the router when it will not be needed?

@knobunc
Copy link
Contributor

knobunc commented Feb 14, 2018

@ramr @Miciah @rajatchopra any thoughts?

@ramr
Copy link
Contributor

ramr commented Feb 14, 2018

@knobunc my tuppence (2c) would be to merge it in if it looks good to you (caveat: I haven't looked at these bits as yet).

As re: the dynamic updates to the router (associated trello card: https://trello.com/c/W21FD0v0/585-spike-investigate-dynamic-changes-to-the-router-routerscale-ram), am still looking at it but there are definitely issues I see us having to get to there .. can't dynamic add/remove backends, handle enabling dynamic servers at scale, etc ... so it makes sense to merge this for the interim.

@smarterclayton
Copy link
Contributor Author

I think we want to be really sure that we've got the use case for router metrics solid, independent of dynamic updates. So even if the data is missing some stuff (bytes from restarted processes) we'll get a lot more comfortable having the metrics available. And at worst, when we have dynamic metrics this is easy to disable (since you have to explicitly request it).

@smarterclayton
Copy link
Contributor Author

Discussed with Ram, even with the dynamic config solution we'll still need reloads occasionally, in which case we'll need this. We talked about how metrics might change (the stat gatherer will need to consult the config manager to find the mapping between the backend slot pools and a real runtime) - i'll set aside time to make that change in concert with his changes.

@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton
Copy link
Contributor Author

Rebased and disabled test (after the next image is built I'll re-enable it)

openshift-merge-robot added a commit that referenced this pull request Mar 1, 2018
Automatic merge from submit-queue (batch tested with PRs 17420, 18254).

Prometheus should scrape the router by default

Builds on top of #18245 and will scrape the installed router by default. We ensure that the router by default will be using a serving cert to serve metrics, then add new roles, bindings, and a prometheus-scraper service account that has permission to scrape it.

For 3.10
@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton
Copy link
Contributor Author

/retest

3 similar comments
@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton
Copy link
Contributor Author

Any other comments? We'll still need this with dynamic reconfig, and Ram and I at least were fairly certainly we still had a path even if dynamic reconfigured is complex.

@knobunc
Copy link
Contributor

knobunc commented Mar 2, 2018

/lgtm

Thanks @smarterclayton

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 2, 2018
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot
Copy link
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit 014cc9b into openshift:master Mar 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. sig/networking size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants