Preserve router metrics across restarts #18245

smarterclayton · 2018-01-23T19:09:09Z

Router metrics aren't preserved across HAProxy reloads. To counter that, capture the state of certain counter-style metrics (NOT rate-style metrics) immediately before a router reload, then add them to subsequent reports. Some metrics will be lost (any recorded between the capture and the actual restart time), but this will make metrics accurate to within a few percent, even when routers are restarting frequently.

This increases the cost of the reload slightly, but metrics are arguably worth it.

smarterclayton · 2018-01-23T19:09:21Z

@openshift/networking for 3.10

smarterclayton · 2018-01-24T20:18:35Z

OK, the GCP test failed correctly (because we don't rebuild the image) when I ran without router exclusions. I'm going to drop router exclusions in openshift-eng/aos-cd-jobs#1023, and then this should report that GCP fails. If we don't have image builds for GCP by the time we merge, we'll need to disable the test on GCP and bring it back later (or have the test autodetect that the image is too old).

smarterclayton · 2018-01-25T15:56:50Z

/test gcp
/retest

knobunc · 2018-02-14T15:14:50Z

The code looks okay. BUT we still won't get the stats for older router processes that are handling long-running connections. So do we want to merge this now, or wait for dynamic updates to the router when it will not be needed?

knobunc · 2018-02-14T15:15:08Z

@ramr @Miciah @rajatchopra any thoughts?

ramr · 2018-02-14T22:37:39Z

@knobunc my tuppence (2c) would be to merge it in if it looks good to you (caveat: I haven't looked at these bits as yet).

As for the dynamic updates to the router (associated Trello card: https://trello.com/c/W21FD0v0/585-spike-investigate-dynamic-changes-to-the-router-routerscale-ram), I'm still looking at it, but there are definitely issues I see us having to solve to get there: can't dynamically add/remove backends, handle enabling dynamic servers at scale, etc. So it makes sense to merge this for the interim.

smarterclayton · 2018-02-19T04:31:25Z

I think we want to be really sure that we've got the use case for router metrics solid, independent of dynamic updates. So even if the data is missing some stuff (bytes from restarted processes) we'll get a lot more comfortable having the metrics available. And at worst, when we have dynamic metrics this is easy to disable (since you have to explicitly request it).

smarterclayton · 2018-03-01T04:14:27Z

Discussed with Ram: even with the dynamic config solution we'll still need reloads occasionally, in which case we'll need this. We talked about how metrics might change (the stat gatherer will need to consult the config manager to find the mapping between the backend slot pools and a real runtime). I'll set aside time to make that change in concert with his changes.

smarterclayton · 2018-03-01T04:15:11Z

/retest

smarterclayton · 2018-03-01T06:06:55Z

Rebased and disabled test (after the next image is built I'll re-enable it)

Automatic merge from submit-queue (batch tested with PRs 17420, 18254).

Prometheus should scrape the router by default

Builds on top of #18245 and will scrape the installed router by default. We ensure that the router by default will be using a serving cert to serve metrics, then add new roles, bindings, and a prometheus-scraper service account that has permission to scrape it.

For 3.10

smarterclayton · 2018-03-01T15:29:04Z

/retest

smarterclayton · 2018-03-01T20:01:37Z

/retest

smarterclayton · 2018-03-01T21:49:00Z

/retest

smarterclayton · 2018-03-02T03:51:08Z

/retest

smarterclayton · 2018-03-02T05:34:54Z

/retest

smarterclayton · 2018-03-02T16:40:41Z

Any other comments? We'll still need this with dynamic reconfig, and Ram and I at least were fairly certain we still have a path even if dynamic reconfiguration is complex.

knobunc · 2018-03-02T18:35:58Z

/lgtm

Thanks @smarterclayton

openshift-ci-robot · 2018-03-02T18:36:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/cmd/infra/router/OWNERS~~ [knobunc,smarterclayton]
~~pkg/router/OWNERS~~ [knobunc,smarterclayton]
~~test/extended/OWNERS~~ [knobunc,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-robot · 2018-03-03T12:59:46Z

/test all [submit-queue is verifying that this PR is safe to merge]

openshift-merge-robot · 2018-03-03T14:27:32Z

Automatic merge from submit-queue.

openshift-ci-robot requested review from jim-minter, knobunc and pecameron January 23, 2018 19:09

openshift-ci-robot added the approved and size/L labels Jan 23, 2018

smarterclayton added the sig/networking label Jan 23, 2018

jim-minter removed their request for review January 23, 2018 21:48

smarterclayton mentioned this pull request Jan 23, 2018

Prometheus should scrape the router by default #18254

Merged

gabemontero mentioned this pull request Jan 26, 2018

adjust newapp/newbuild error messages (arg classification vs. actual … #18272

Merged

knobunc self-assigned this Feb 14, 2018

smarterclayton force-pushed the preserve_metrics branch from b109e31 to f52f36c March 1, 2018 06:06

smarterclayton force-pushed the preserve_metrics branch from f52f36c to 7adc3ec March 1, 2018 15:14

Preserve router metrics across restarts

8f0119b

smarterclayton force-pushed the preserve_metrics branch from 7adc3ec to 8f0119b March 1, 2018 16:34

openshift-ci-robot added the lgtm label Mar 2, 2018

openshift-merge-robot merged commit 014cc9b into openshift:master Mar 3, 2018
