Preserve router metrics across restarts #18245
@openshift/networking for 3.10 |
OK, the GCP test failed correctly (because we don't rebuild the image) when I ran it without router exclusions. I'm going to drop router exclusions in openshift-eng/aos-cd-jobs#1023, and then this should report that GCP fails. If we don't have image builds for GCP by the time we merge, we'll need to disable the test on GCP and bring it back later (or have the test autodetect that the image is too old). |
/test gcp |
The code looks okay. BUT we still won't get the stats for older router processes that are handling long-running connections. So do we want to merge this now, or wait for dynamic updates to the router, at which point this will not be needed? |
@ramr @Miciah @rajatchopra any thoughts? |
@knobunc my tuppence: regarding the dynamic updates to the router (associated Trello card: https://trello.com/c/W21FD0v0/585-spike-investigate-dynamic-changes-to-the-router-routerscale-ram), I am still looking at it, but there are definitely issues I see us having on the way there: we can't dynamically add/remove backends, handle enabling dynamic servers at scale, etc. So it makes sense to merge this for the interim. |
I think we want to be really sure that we've got the use case for router metrics solid, independent of dynamic updates. So even if the data is missing some stuff (bytes from restarted processes) we'll get a lot more comfortable having the metrics available. And at worst, when we have dynamic metrics this is easy to disable (since you have to explicitly request it). |
Discussed with Ram: even with the dynamic config solution we'll still need reloads occasionally, in which case we'll need this. We talked about how metrics might change (the stat gatherer will need to consult the config manager to find the mapping between the backend slot pools and a real runtime). I'll set aside time to make that change in concert with his changes. |
/retest |
b109e31 to f52f36c
Rebased and disabled test (after the next image is built I'll re-enable it) |
Automatic merge from submit-queue (batch tested with PRs 17420, 18254). Prometheus should scrape the router by default: builds on top of #18245 and will scrape the installed router by default. We ensure that the router will, by default, use a serving cert to serve metrics, then add new roles, bindings, and a prometheus-scraper service account that has permission to scrape it. For 3.10.
f52f36c to 7adc3ec
/retest |
7adc3ec to 8f0119b
/retest |
Any other comments? We'll still need this with dynamic reconfig, and Ram and I at least were fairly certain we still had a path even if dynamic reconfiguration is complex. |
/lgtm Thanks @smarterclayton |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: knobunc, smarterclayton The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue. |
Router metrics aren't preserved across HAProxy reloads. To counter that, capture the state of certain counter-style metrics (NOT rate-style metrics) immediately before a router reload, then add them to subsequent reports. Some metrics will be lost (any recorded between the capture and the actual restart time), but this will make metrics accurate to within a few percent, even when routers are restarting frequently.
This increases the cost of the reload slightly, but metrics are arguably worth it.
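The capture-then-add idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Counters` and `Preserver` types, their methods, and the `bytes_in` key are all hypothetical names chosen for the example. It only handles counter-style values; rate-style metrics are deliberately left out, and the sketch is not safe for concurrent use.

```go
package main

import "fmt"

// Counters maps a metric key to a monotonically increasing value as reported
// by the currently running proxy process. (Hypothetical type for illustration.)
type Counters map[string]float64

// Preserver remembers counter totals from processes that were replaced by a
// reload, so reported totals keep increasing instead of resetting to zero.
type Preserver struct {
	base Counters // sum of the snapshots captured before each reload
}

// NewPreserver returns a Preserver with an empty base.
func NewPreserver() *Preserver {
	return &Preserver{base: Counters{}}
}

// CaptureBeforeReload folds the last values observed from the outgoing
// process into the base. Anything the old process records after this call
// is lost, which is the small inaccuracy the description mentions.
func (p *Preserver) CaptureBeforeReload(current Counters) {
	for k, v := range current {
		p.base[k] += v
	}
}

// Report returns the preserved base plus the values from the current process.
func (p *Preserver) Report(current Counters) Counters {
	out := make(Counters, len(p.base)+len(current))
	for k, v := range p.base {
		out[k] = v
	}
	for k, v := range current {
		out[k] += v
	}
	return out
}

func main() {
	p := NewPreserver()

	// The old process had counted 1000 bytes when the reload was requested.
	p.CaptureBeforeReload(Counters{"bytes_in": 1000})

	// The new process starts from zero and has counted 250 bytes so far.
	total := p.Report(Counters{"bytes_in": 250})
	fmt.Println(total["bytes_in"]) // prints 1250
}
```

Each reload adds one more snapshot to the base, so the extra work per reload is one pass over the captured counters, which matches the "slightly increased cost" trade-off noted above.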