-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Instrument the HAProxy router with metrics that contain route info #13337
Conversation
6358d63
to
dfb3b2c
Compare
Performance testing on this: @ 1000 routes / 1000 endpoints (below serverThreshold) - prometheus returns 2M of data, takes about 1-2% of CPU sustained to keep handling metrics every 2s (which is far more rapid than normal use). Memory usage is around 90M, no spiking. @ 1001 routes / 1001 endpoints (above serverThreshold) - prometheus drops down to 500k of data, takes less CPU. I expect this can handle up to 10k routes without tuning being needed, beyond that we may want to reduce the check interval or special case routes with only one endpoint. |
@knobunc this gets the router up to roughly the bar the controllers and api server are at. Feedback on parameters, behavior, etc appreciated. When you have someone to do review this should be mostly functional. With this in place we could consider grabbing router metrics on a per namespace basis in a 3.7 timeframe and using custom autoscaling for edge traffic. My primary concern is observability and getting the router to be something you can look at from a central monitoring platform. |
94b837f
to
9f2fbcf
Compare
[test] |
@stevekuznetsov https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/150/ should have failed GCE (images not built) but it didn't. It looks like it might because extended.test is not being built |
@smarterclayton from that log...
|
|
I think this is something in between RPMs being created and RPMs being installed. |
Maybe the wrong RPM location is being chosen? |
Publish step:
Install step uses the same repo:
Running the tests doesn't re-build anything, must be using the
|
Testing with debug log in https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/180/ |
Looks kosher?
|
9f2fbcf
to
f85850e
Compare
Aaand I'm an idiot. There's an explicit skip if the router doesn't have
the metrics endpoint, and it's firing. Images fix will fix the remaining
things.
On Mar 13, 2017, at 6:29 PM, OpenShift Bot <[email protected]> wrote:
continuous-integration/openshift-jenkins/test Running (
https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/174/)
(Base Commit: 209f100
<209f100>
)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#13337 (comment)>,
or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABG_p4_K5mRA1t7GpJHpLgZ7bTOhtMLZks5rlcNQgaJpZM4MZB-H>
.
|
Images are fixed, re[test] |
I'm going to alter the haproxy server name one more time to put the service name in there, so we can filter on it. |
What about when there are multiple services involved? AB testing case. |
There's one in this output - weighted is the same as in the other AB (and
the test checks that).
On Mar 16, 2017, at 12:55 AM, Rajat Chopra <[email protected]> wrote:
What about when there are multiple services involved? AB testing case.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#13337 (comment)>,
or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABG_pxbrFn1w0XNn06tyrs700sYtmuAEks5rmMC_gaJpZM4MZB-H>
.
|
4777e27
to
4e56f44
Compare
It's possible a future change could result in a collision between underscores and dashes, while ':' will never be part of the name.
Turn backend names into structured data: * be_http:NAMESPACE:NAME -> {backend="http",namespace="NAMESPACE",route="NAME"} * be_secure:NAMESPACE:NAME -> {backend="https",namespace="NAMESPACE",route="NAME"} * be_edge_http:NAMESPACE:NAME -> {backend="https-edge",namespace="NAMESPACE",route="NAME"} * be_tcp:NAMESPACE:NAME -> {backend="tcp",namespace="NAMESPACE",route="NAME"} * `*` -> {backend="other/*"} Allows per route / namespace aggregation of metrics. Include service information in endpoints and parse it out Services without target refs continue as they are.
Will help us understand how restarts happen in production
Reduces namespace cleanup times for some tests
Tests the metrics endpoint, including metrics transformation, healthz, and ACL checks.
96ba73d
to
1b0f7b4
Compare
Evaluated for origin test up to 1b0f7b4 |
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/526/) (Base Commit: 50bbe53) |
Rebased and squashed, [merge]
|
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/526/) (Base Commit: 50bbe53) (Image: devenv-rhel7_6112) |
Evaluated for origin merge up to 1b0f7b4 |
After this PR got merged we started to see following errors in the extended tests that we are running at the end of cluster install_upgrade job |
Check that the router has been created with metrics type haproxy (env var),
and that it has the right image (newest). If the first is true but the
second isn't, the test would fail
On Apr 4, 2017, at 12:10 AM, Jakub Hadvig <[email protected]> wrote:
After this PR got merged we started to see following errors in the extended
tests that we are running at the end of cluster install_upgrade job
https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_conformance_install_update/150/testReport/
any ideas why this is happening ?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#13337 (comment)>,
or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABG_p_ZjaOqrg-A1hPbMKK6n643ktNFEks5rsW5sgaJpZM4MZB-H>
.
|
The env var is missing in the running router container
and yeas we are using the newest image |
@smarterclayton the flag is missing in the upgrade job. We should be starting the router in the upgrade job with it. |
The test should be skipped if that env var is unset. Can you verify that?
On Apr 4, 2017, at 3:18 PM, Jakub Hadvig <[email protected]> wrote:
@smarterclayton <https://github.com/smarterclayton> the flag is missing in
the upgrade job. We should be starting the router in the upgrade job with
it.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#13337 (comment)>,
or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABG_pzl3MrGEhSBp0XAQ7td5uwrrv2UPks5rskM1gaJpZM4MZB-H>
.
|
No, its not skipped, it fails |
Can you debug why? :)
It works correctly on GCE (no new image) - my thought is that upgrade is
not running the correct logic and so the guard in place is failing.
On Apr 4, 2017, at 7:12 PM, Jakub Hadvig <[email protected]> wrote:
The test should be skipped if that env var is unset. Can you verify that?
No, its not skipped, it fails
https://ci.openshift.redhat.com/jenkins/view/All/job/test_branch_origin_extended_conformance_install_update/155/consoleFull#117457015656cbb9a5e4b02b88ae8c2f77
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#13337 (comment)>,
or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABG_pyN1n2dpqaL7i3hSjlnXXoITjlBwks5rsnnlgaJpZM4MZB-H>
.
|
Actually, easy answer - not skipped. The before each on the e2e test
metrics.go needs to skip if the haproxy metrics env var is not set to
haproxy, and skip healthz / profiling if listen address env var is not
set
|
Isn't it on the developer landing a feature into Origin to get the correct logic into the upgrade job so that their new feature goes in cleanly? |
I'm on PTO - this is a new test suite, so you can either fix it,
disable it in your runs, or wait until I get back. Or you can fix the
upgrade to set the right settings (which is the real bug). Up to you.
|
When we move to 1.6+, we can use https://github.com/haproxy/haproxy/blob/master/examples/seamless_reload.txt to save metrics prior to restart. |
Instrument the template router with a new optional flag
--metrics-type
and--listen-addr
, which follow the conventions of the other infrastructure components and register a health check (not used currently, since it doesn't check HAProxy), the pprof endpoints, and a prometheus metrics endpoint. Pprof and metrics are protected by the stats endpoint basic auth.oadm router
now enables a default listen address of0.0.0.0:1935
.On the template router, start with the prometheus haproxy_exporter to extract metrics from the running HAProxy process. Since large amounts of metrics are returned and reported, slim the list of metrics down to a smaller subset, with a focus on those that can provide meaningful input at large scales. Add a server threshold to stop reporting server metrics (and instead only backend metrics), when the number of servers in haproxy exceeds that, defaults to 1000 endpoints. Add a rate limiter on the number of times that the HAProxy stats endpoint can be called that is inversely proportional to the number total stats entries (usually dominated by backends + servers) to reduce total work - defaults to 5s for every 1000 servers or backends. Also, right before reload, capture the latest stats and report them on the next call (since we don't know how to merge metrics).
Add additional metrics on the router itself to report reload times and write config.
To properly expose metrics to the user, the following changes were made to the router template:
_
as a separator in backend/server names to:
, which is impossible to get from a route or ingress (previously ingress could result in_
being added, which becomes ambiguous)IdHash
toID
(endpoint IP + port number), which was not a required security step (cookie values are loaded separately).With these in place, a caller can get a wide set of metrics from the router, which can be used for custom autoscaling, monitoring, and problem detecting:
(limited example, most backends removed).