Instrument the HAProxy router with metrics that contain route info #13337

smarterclayton · 2017-03-10T06:44:04Z

Instrument the template router with two new optional flags, --metrics-type and --listen-addr, which follow the conventions of the other infrastructure components and register a health check (not used currently, since it doesn't check HAProxy), the pprof endpoints, and a Prometheus metrics endpoint. Pprof and metrics are protected by the stats endpoint basic auth. oadm router now enables a default listen address of 0.0.0.0:1935.

On the template router, start with the Prometheus haproxy_exporter to extract metrics from the running HAProxy process. Since a large number of metrics is returned and reported, slim the list down to a smaller subset, with a focus on those that can provide meaningful input at large scale. Add a server threshold: when the number of servers in HAProxy exceeds it (defaults to 1000 endpoints), stop reporting server metrics and report only backend metrics. Add a rate limiter on the number of times the HAProxy stats endpoint can be called that is inversely proportional to the total number of stats entries (usually dominated by backends + servers) to reduce total work; it defaults to 5s for every 1000 servers or backends. Also, right before reload, capture the latest stats and report them on the next call (since we don't know how to merge metrics).

Add additional metrics on the router itself to report reload times and config write times.

To properly expose metrics to the user, the following changes were made to the router template:

Switch from _ as a separator in backend/server names to :, which is impossible to get from a route or ingress (previously ingress could result in _ being added, which becomes ambiguous).
Switch server names from the IdHash to ID (endpoint IP + port number); the hash was not a required security step (cookie values are loaded separately).

With these in place, a caller can get a wide set of metrics from the router, which can be used for custom autoscaling, monitoring, and problem detection:

(limited example, most backends removed).

# HELP haproxy_backend_connections_total Total number of connections.
# TYPE haproxy_backend_connections_total gauge
haproxy_backend_connections_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute"} 20
haproxy_backend_connections_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="zeroweightroute"} 0
# HELP haproxy_backend_current_queue Current number of queued requests not assigned to any server.
# TYPE haproxy_backend_current_queue gauge
haproxy_backend_current_queue{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute"} 0
haproxy_backend_current_queue{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="zeroweightroute"} 0
# HELP haproxy_backend_up Current health status of the backend (1 = UP, 0 = DOWN).
# TYPE haproxy_backend_up gauge
haproxy_backend_up{backend="http",namespace="default",route="backend4"} 1
haproxy_backend_up{backend="http",namespace="default",route="docker-registry"} 1
haproxy_backend_up{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="zeroweightroute"} 1
# HELP haproxy_exporter_csv_parse_failures Number of errors while parsing CSV.
# TYPE haproxy_exporter_csv_parse_failures counter
haproxy_exporter_csv_parse_failures 0
# HELP haproxy_exporter_scrape_interval The time in seconds before another scrape is allowed, proportional to size of data.
# TYPE haproxy_exporter_scrape_interval gauge
haproxy_exporter_scrape_interval 5
# HELP haproxy_exporter_server_threshold Number of servers tracked and the current threshold value.
# TYPE haproxy_exporter_server_threshold counter
haproxy_exporter_server_threshold{type="current"} 13
haproxy_exporter_server_threshold{type="limit"} 500
# HELP haproxy_exporter_total_scrapes Current total HAProxy scrapes.
# TYPE haproxy_exporter_total_scrapes counter
haproxy_exporter_total_scrapes 27
# HELP haproxy_frontend_bytes_in_total Current total of incoming bytes.
# TYPE haproxy_frontend_bytes_in_total gauge
haproxy_frontend_bytes_in_total{frontend="fe_no_sni"} 0
haproxy_frontend_bytes_in_total{frontend="fe_sni"} 0
haproxy_frontend_bytes_in_total{frontend="public"} 1680
haproxy_frontend_bytes_in_total{frontend="public_ssl"} 0
haproxy_frontend_bytes_in_total{frontend="stats"} 207
# HELP haproxy_frontend_bytes_out_total Current total of outgoing bytes.
# TYPE haproxy_frontend_bytes_out_total gauge
haproxy_frontend_bytes_out_total{frontend="fe_no_sni"} 0
haproxy_frontend_bytes_out_total{frontend="fe_sni"} 0
haproxy_frontend_bytes_out_total{frontend="public"} 2680
haproxy_frontend_bytes_out_total{frontend="public_ssl"} 0
haproxy_frontend_bytes_out_total{frontend="stats"} 479
# HELP haproxy_frontend_connections_total Total number of connections.
# TYPE haproxy_frontend_connections_total gauge
haproxy_frontend_connections_total{frontend="fe_no_sni"} 0
haproxy_frontend_connections_total{frontend="fe_sni"} 0
haproxy_frontend_connections_total{frontend="public"} 20
haproxy_frontend_connections_total{frontend="public_ssl"} 0
haproxy_frontend_connections_total{frontend="stats"} 3
# HELP haproxy_frontend_current_session_rate Current number of sessions per second over last elapsed second.
# TYPE haproxy_frontend_current_session_rate gauge
haproxy_frontend_current_session_rate{frontend="fe_no_sni"} 0
haproxy_frontend_current_session_rate{frontend="fe_sni"} 0
haproxy_frontend_current_session_rate{frontend="public"} 0
haproxy_frontend_current_session_rate{frontend="public_ssl"} 0
haproxy_frontend_current_session_rate{frontend="stats"} 0
# HELP haproxy_frontend_current_sessions Current number of active sessions.
# TYPE haproxy_frontend_current_sessions gauge
haproxy_frontend_current_sessions{frontend="fe_no_sni"} 0
haproxy_frontend_current_sessions{frontend="fe_sni"} 0
haproxy_frontend_current_sessions{frontend="public"} 0
haproxy_frontend_current_sessions{frontend="public_ssl"} 0
haproxy_frontend_current_sessions{frontend="stats"} 0
# HELP haproxy_frontend_http_responses_total Total of HTTP responses.
# TYPE haproxy_frontend_http_responses_total gauge
haproxy_frontend_http_responses_total{code="2xx",frontend="fe_no_sni"} 0
haproxy_frontend_http_responses_total{code="2xx",frontend="fe_sni"} 0
haproxy_frontend_http_responses_total{code="2xx",frontend="public"} 20
haproxy_frontend_http_responses_total{code="2xx",frontend="stats"} 2
haproxy_frontend_http_responses_total{code="5xx",frontend="fe_no_sni"} 0
haproxy_frontend_http_responses_total{code="5xx",frontend="fe_sni"} 0
haproxy_frontend_http_responses_total{code="5xx",frontend="public"} 0
haproxy_frontend_http_responses_total{code="5xx",frontend="stats"} 0
# HELP haproxy_frontend_max_session_rate Maximum observed number of sessions per second.
# TYPE haproxy_frontend_max_session_rate gauge
haproxy_frontend_max_session_rate{frontend="fe_no_sni"} 0
haproxy_frontend_max_session_rate{frontend="fe_sni"} 0
haproxy_frontend_max_session_rate{frontend="public"} 10
haproxy_frontend_max_session_rate{frontend="public_ssl"} 0
haproxy_frontend_max_session_rate{frontend="stats"} 2
# HELP haproxy_frontend_max_sessions Maximum observed number of active sessions.
# TYPE haproxy_frontend_max_sessions gauge
haproxy_frontend_max_sessions{frontend="fe_no_sni"} 0
haproxy_frontend_max_sessions{frontend="fe_sni"} 0
haproxy_frontend_max_sessions{frontend="public"} 1
haproxy_frontend_max_sessions{frontend="public_ssl"} 0
haproxy_frontend_max_sessions{frontend="stats"} 1
# HELP haproxy_process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE haproxy_process_cpu_seconds_total counter
haproxy_process_cpu_seconds_total 0.01
# HELP haproxy_process_max_fds Maximum number of open file descriptors.
# TYPE haproxy_process_max_fds gauge
haproxy_process_max_fds 40046
# HELP haproxy_process_resident_memory_bytes Resident memory size in bytes.
# TYPE haproxy_process_resident_memory_bytes gauge
haproxy_process_resident_memory_bytes 6.836224e+06
# HELP haproxy_process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE haproxy_process_start_time_seconds gauge
haproxy_process_start_time_seconds 1.48929703048e+09
# HELP haproxy_process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE haproxy_process_virtual_memory_bytes gauge
haproxy_process_virtual_memory_bytes 5.4120448e+07
# HELP haproxy_server_bytes_in_total Current total of incoming bytes.
# TYPE haproxy_server_bytes_in_total gauge
haproxy_server_bytes_in_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 840
# HELP haproxy_server_bytes_out_total Current total of outgoing bytes.
# TYPE haproxy_server_bytes_out_total gauge
haproxy_server_bytes_out_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 1340
# HELP haproxy_server_check_failures_total Total number of failed health checks.
# TYPE haproxy_server_check_failures_total gauge
haproxy_server_check_failures_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_connection_errors_total Total of connection errors.
# TYPE haproxy_server_connection_errors_total gauge
haproxy_server_connection_errors_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_connections_total Total number of connections.
# TYPE haproxy_server_connections_total gauge
haproxy_server_connections_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 10
# HELP haproxy_server_current_queue Current number of queued requests assigned to this server.
# TYPE haproxy_server_current_queue gauge
haproxy_server_current_queue{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_current_session_rate Current number of sessions per second over last elapsed second.
# TYPE haproxy_server_current_session_rate gauge
haproxy_server_current_session_rate{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_current_sessions Current number of active sessions.
# TYPE haproxy_server_current_sessions gauge
haproxy_server_current_sessions{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_downtime_seconds_total Total downtime in seconds.
# TYPE haproxy_server_downtime_seconds_total gauge
haproxy_server_downtime_seconds_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_http_responses_total Total of HTTP responses.
# TYPE haproxy_server_http_responses_total gauge
haproxy_server_http_responses_total{backend="http",code="2xx",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 10
haproxy_server_http_responses_total{backend="http",code="5xx",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_max_session_rate Maximum observed number of sessions per second.
# TYPE haproxy_server_max_session_rate gauge
haproxy_server_max_session_rate{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 5
# HELP haproxy_server_max_sessions Maximum observed number of active sessions.
# TYPE haproxy_server_max_sessions gauge
haproxy_server_max_sessions{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 1
# HELP haproxy_server_response_errors_total Total of response errors.
# TYPE haproxy_server_response_errors_total gauge
haproxy_server_response_errors_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_up Current health status of the server (1 = UP, 0 = DOWN).
# TYPE haproxy_server_up gauge
haproxy_server_up{backend="http",namespace="default",route="backend1",server="172.17.0.3:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend1-1",server="172.17.0.3:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend1-2",server="172.17.0.3:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend1-2",server="172.17.0.4:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend1-3",server="172.17.0.3:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend2",server="172.17.0.4:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend3",server="172.17.0.4:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend4",server="172.17.0.4:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="docker-registry",server="172.17.0.5:5000"} 1
haproxy_server_up{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.10:8080"} 1
haproxy_server_up{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 1
haproxy_server_up{backend="other/be_no_sni",namespace="",route="",server="fe_no_sni"} 1
haproxy_server_up{backend="other/be_sni",namespace="",route="",server="fe_sni"} 1
# HELP haproxy_up Was the last scrape of haproxy successful.
# TYPE haproxy_up gauge
haproxy_up 1
# HELP openshift_build_info A metric with a constant '1' value labeled by major, minor, git commit & git version from which OpenShift was built.
# TYPE openshift_build_info gauge
openshift_build_info{gitCommit="39ecc8d",gitVersion="v1.5.0-alpha.3+39ecc8d-355-dirty",major="1",minor="5+"} 1
# HELP template_router_reload_seconds Measures the time spent reloading the router in seconds.
# TYPE template_router_reload_seconds summary
template_router_reload_seconds{quantile="0.5"} 0.067299316
template_router_reload_seconds{quantile="0.9"} 0.118891744
template_router_reload_seconds{quantile="0.99"} 0.128374662
template_router_reload_seconds_sum 1.69002939
template_router_reload_seconds_count 21
# HELP template_router_write_config_seconds Measures the time spent writing out the router configuration to disk in seconds.
# TYPE template_router_write_config_seconds summary
template_router_write_config_seconds{quantile="0.5"} 0.004694066
template_router_write_config_seconds{quantile="0.9"} 0.012401996
template_router_write_config_seconds{quantile="0.99"} 0.017210267
template_router_write_config_seconds_sum 0.15450527399999994
template_router_write_config_seconds_count 21

smarterclayton · 2017-03-12T06:09:40Z

Performance testing on this:

@ 1000 routes / 1000 endpoints (below serverThreshold) - prometheus returns 2M of data, takes about 1-2% of CPU sustained to keep handling metrics every 2s (which is far more rapid than normal use). Memory usage is around 90M, no spiking.

@ 1001 routes / 1001 endpoints (above serverThreshold) - prometheus drops down to 500k of data, takes less CPU.

I expect this can handle up to 10k routes without tuning being needed; beyond that we may want to reduce the check interval or special case routes with only one endpoint.

smarterclayton · 2017-03-12T06:11:35Z

@knobunc this gets the router up to roughly the bar the controllers and api server are at. Feedback on parameters, behavior, etc. appreciated. When you have someone to do the review, this should be mostly functional. With this in place we could consider grabbing router metrics on a per namespace basis in a 3.7 timeframe and using custom autoscaling for edge traffic. My primary concern is observability and getting the router to be something you can look at from a central monitoring platform.

smarterclayton · 2017-03-13T04:57:19Z

[test]

smarterclayton · 2017-03-13T14:32:06Z

@stevekuznetsov https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/150/ should have failed GCE (images not built) but it didn't. It looks like it might be because extended.test is not being built.

stevekuznetsov · 2017-03-13T15:03:18Z

@smarterclayton from that log...

00:15:58.268 Successfully built: ... /openshifttmp/openshift/tito/x86_64/origin-tests-1.5.0-999.alpha.3+9f2fbcf.361.x86_64.rpm ...

stevekuznetsov · 2017-03-13T15:05:04Z

oct correctly merged your commit 9f2fbcf into master...

00:03:46.832 TASK [remote-sync : synchronize the repository with the remote server] *********
00:03:46.832 task path: /var/lib/jenkins/origin-ci-tool/7a8f8a31a03ddd44278142ff40021abeb31efe7b/lib/python2.7/site-packages/oct/ansible/oct/playbooks/sync/roles/remote-sync/tasks/main.yml:36
00:03:48.521 changed: [172.18.5.223] => {
00:03:48.521     "after": "9f2fbcfe425a14843dcfa5252f707596e0f23a56", 
00:03:48.521     "before": "18b1328a85d32551f434f05676436dbc64c4dc87", 
00:03:48.521     "changed": true, 
00:03:48.521     "remote_url_changed": false, 
00:03:48.521     "warnings": []
00:03:48.521 }
00:03:48.529 
00:03:48.529 TASK [remote-sync : check out the desired post-merge state, if requested] ******
00:03:48.529 task path: /var/lib/jenkins/origin-ci-tool/7a8f8a31a03ddd44278142ff40021abeb31efe7b/lib/python2.7/site-packages/oct/ansible/oct/playbooks/sync/roles/remote-sync/tasks/main.yml:45
00:03:48.991 changed: [172.18.5.223] => {
00:03:48.991     "changed": true, 
00:03:48.991     "cmd": "/usr/bin/git checkout master", 
00:03:48.991     "delta": "0:00:00.116039", 
00:03:48.991     "end": "2017-03-13 01:17:24.180675", 
00:03:48.991     "rc": 0, 
00:03:48.991     "start": "2017-03-13 01:17:24.064636", 
00:03:48.991     "stderr": [
00:03:48.991         "Switched to branch 'master'"
00:03:48.991     ], 
00:03:48.991     "stdout": [], 
00:03:48.991     "warnings": []
00:03:48.991 }
00:03:48.997 
00:03:48.997 TASK [remote-sync : merge the resulting state into another branch, if requested] ***
00:03:48.997 task path: /var/lib/jenkins/origin-ci-tool/7a8f8a31a03ddd44278142ff40021abeb31efe7b/lib/python2.7/site-packages/oct/ansible/oct/playbooks/sync/roles/remote-sync/tasks/main.yml:51
00:03:49.450 changed: [172.18.5.223] => {
00:03:49.450     "changed": true, 
00:03:49.450     "cmd": "/usr/bin/git merge pull-13337", 
00:03:49.451     "delta": "0:00:00.119124", 
00:03:49.451     "end": "2017-03-13 01:17:24.640143", 
00:03:49.451     "rc": 0, 
00:03:49.451     "start": "2017-03-13 01:17:24.521019", 
00:03:49.451     "stderr": [], 
00:03:49.451     "stdout": [
00:03:49.451         "Updating 18b1328..9f2fbcf", 
00:03:49.451         "Fast-forward", 
00:03:49.451         " contrib/completions/bash/openshift                 |   6 +", 
00:03:49.451         " contrib/completions/zsh/openshift                  |   6 +", 
00:03:49.451         " images/router/haproxy/conf/haproxy-config.template |  66 +--", 
00:03:49.451         " pkg/cmd/admin/router/router.go                     |  10 +", 
00:03:49.451         " pkg/cmd/infra/router/router.go                     |   3 +", 
00:03:49.451         " pkg/cmd/infra/router/template.go                   |  63 +++", 
00:03:49.451         " pkg/router/controller/ingress.go                   |   2 +-", 
00:03:49.451         " pkg/router/metrics/haproxy/haproxy.go              | 625 +++++++++++++++++++++", 
00:03:49.451         " pkg/router/metrics/metrics.go                      |  46 ++", 
00:03:49.451         " pkg/router/template/plugin.go                      |   2 +", 
00:03:49.451         " pkg/router/template/router.go                      |  51 +-", 
00:03:49.451         " test/extended/router/metrics.go                    | 245 ++++++++", 
00:03:49.452         " test/extended/testdata/router-metrics.yaml         | 105 ++++", 
00:03:49.452         " test/extended/testdata/scoped-router.yaml          |   3 +", 
00:03:49.452         " test/extended/testdata/weighted-router.yaml        |   3 +", 
00:03:49.452         " .../k8s.io/kubernetes/test/e2e/framework/util.go   |   2 +", 
00:03:49.452         " 16 files changed, 1195 insertions(+), 43 deletions(-)", 
00:03:49.452         " create mode 100644 pkg/router/metrics/haproxy/haproxy.go", 
00:03:49.452         " create mode 100644 pkg/router/metrics/metrics.go", 
00:03:49.452         " create mode 100644 test/extended/router/metrics.go", 
00:03:49.452         " create mode 100644 test/extended/testdata/router-metrics.yaml"
00:03:49.452     ], 
00:03:49.452     "warnings": []
00:03:49.452 }

smarterclayton · 2017-03-13T15:59:56Z

I think this is something in between RPMs being created and RPMs being installed.

stevekuznetsov · 2017-03-13T17:10:50Z

Maybe the wrong RPM location is being chosen?

stevekuznetsov · 2017-03-13T18:16:22Z

Publish step:

00:16:46.992 + gsutil -m cp -r artifacts/rpms gs://origin-ci-test/pr-logs/13337/test_pull_request_origin_extended_conformance_gce/150/artifacts/rpms
00:16:47.776 Copying file://artifacts/rpms/origin-tests-1.5.0-999.alpha.3+9f2fbcf.361.x86_64.rpm [Content-Type=application/x-rpm]...

Install step uses the same repo:

00:17:09.732 + ../../bin/local.sh ansible-playbook -e provision_gce_docker_storage_driver=devicemapper -e openshift_test_repo=https://storage.googleapis.com/origin-ci-test/pr-logs/13337/test_pull_request_origin_extended_conformance_gce/150/artifacts/rpms playbooks/launch.yaml

Running the tests doesn't re-build anything, must be using the extended.test from the RPM build...

00:33:54.873 + make test-extended SUITE=conformance
00:33:55.093 test/extended/conformance.sh 
00:34:29.311 [INFO] Running parallel tests N=25

smarterclayton · 2017-03-13T20:41:42Z

Testing with debug log in https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/180/

stevekuznetsov · 2017-03-13T21:25:03Z

Looks kosher?

00:34:05.549 + find _output/local/bin -ls
00:34:05.550 59187400    0 drwxr-sr-x   3 origin   origin-git       18 Mar 13 16:48 _output/local/bin
00:34:05.550 67921122    0 drwxr-sr-x   3 origin   origin-git       18 Mar 13 16:48 _output/local/bin/linux
00:34:05.550 76366864    4 drwxr-sr-x   2 origin   origin-git     4096 Mar 13 16:56 _output/local/bin/linux/amd64
00:34:05.550 76366865 63968 -rwxr-xr-x   1 origin   origin-git 65502210 Mar 13 16:49 _output/local/bin/linux/amd64/dockerregistry
00:34:05.550 76366866 117516 -rwxr-xr-x   1 origin   origin-git 120332984 Mar 13 16:54 _output/local/bin/linux/amd64/extended.test
00:34:05.550 76366867 86688 -rwxr-xr-x   1 origin   origin-git 88767680 Mar 13 16:56 _output/local/bin/linux/amd64/gendocs
00:34:05.550 76366868 258992 -rwxr-xr-x   1 origin   origin-git 265206432 Mar 13 16:56 _output/local/bin/linux/amd64/genman
00:34:05.550 76366869 53184 -rwxr-xr-x   1 origin   origin-git 54459696 Mar 13 16:49 _output/local/bin/linux/amd64/gitserver
00:34:05.551 76366870 5504 -rwxr-xr-x   1 origin   origin-git  5635062 Mar 13 16:49 _output/local/bin/linux/amd64/hello-openshift
00:34:05.551 76366871 2904 -rwxr-xr-x   1 origin   origin-git  2969783 Mar 13 16:49 _output/local/bin/linux/amd64/host-local
00:34:05.551 76366872    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kube-apiserver -> openshift
00:34:05.551 76366873    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kube-controller-manager -> openshift
00:34:05.551 76366874    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kube-proxy -> openshift
00:34:05.551 76366875    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kube-scheduler -> openshift
00:34:05.551 76366876    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kubectl -> openshift
00:34:05.551 76366877    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kubelet -> openshift
00:34:05.551 76366878    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kubernetes -> openshift
00:34:05.552 76366879 2796 -rwxr-xr-x   1 origin   origin-git  2859526 Mar 13 16:49 _output/local/bin/linux/amd64/loopback
00:34:05.552 76426528    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/oadm -> openshift
00:34:05.552 76426529 90608 -rwxr-xr-x   1 origin   origin-git 92780520 Mar 13 16:52 _output/local/bin/linux/amd64/oc
00:34:05.552 76426530 260644 -rwxr-xr-x   1 origin   origin-git 266898016 Mar 13 16:53 _output/local/bin/linux/amd64/openshift
00:34:05.552 76426531    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/openshift-deploy -> openshift
00:34:05.552 76426532    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/openshift-docker-build -> openshift
00:34:05.552 76426533    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/openshift-recycle -> openshift
00:34:05.552 76426534    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/openshift-router -> openshift
00:34:05.552 76426535    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/openshift-sti-build -> openshift
00:34:05.553 76426536    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/origin -> openshift
00:34:05.553 76426537    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/osadm -> openshift
00:34:05.553 76426538    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/osc -> openshift
00:34:05.553 76426539 1116 -rwxr-xr-x   1 origin   origin-git  1138998 Mar 13 16:48 _output/local/bin/linux/amd64/pod
00:34:05.553 76426540 5600 -rwxr-xr-x   1 origin   origin-git  5732301 Mar 13 16:49 _output/local/bin/linux/amd64/sdn-cni-plugin
00:34:05.553 + git status
00:34:05.620 # On branch master
00:34:05.620 # Your branch is ahead of 'origin/master' by 10 commits.
00:34:05.620 #   (use "git push" to publish your local commits)
00:34:05.620 #
00:34:05.620 nothing to commit, working directory clean

smarterclayton · 2017-03-13T22:47:15Z

Aaand I'm an idiot. There's an explicit skip if the router doesn't have the metrics endpoint, and it's firing. Images fix will fix the remaining things.

stevekuznetsov · 2017-03-14T13:38:17Z

Images are fixed, re[test]

smarterclayton · 2017-03-14T19:08:10Z

@ramr @rajatchopra

smarterclayton · 2017-03-16T04:33:28Z

I'm going to alter the haproxy server name one more time to put the service name in there, so we can filter on it.

rajatchopra · 2017-03-16T04:55:25Z

What about when there are multiple services involved? AB testing case.

smarterclayton · 2017-03-16T05:08:46Z

There's one in this output - weighted is the same as in the other AB (and the test checks that).

It's possible a future change could result in a collision between underscores and dashes, while ':' will never be part of the name.

Turn backend names into structured data:

* be_http:NAMESPACE:NAME -> {backend="http",namespace="NAMESPACE",route="NAME"}
* be_secure:NAMESPACE:NAME -> {backend="https",namespace="NAMESPACE",route="NAME"}
* be_edge_http:NAMESPACE:NAME -> {backend="https-edge",namespace="NAMESPACE",route="NAME"}
* be_tcp:NAMESPACE:NAME -> {backend="tcp",namespace="NAMESPACE",route="NAME"}
* `*` -> {backend="other/*"}

Allows per route / namespace aggregation of metrics.

Include service information in endpoints and parse it out. Services without target refs continue as they are.

Will help us understand how restarts happen in production

Reduces namespace cleanup times for some tests

Tests the metrics endpoint, including metrics transformation, healthz, and ACL checks.

openshift-bot · 2017-04-01T23:53:14Z

Evaluated for origin test up to 1b0f7b4

openshift-bot · 2017-04-02T01:08:13Z

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/526/) (Base Commit: 50bbe53)

smarterclayton · 2017-04-02T08:41:21Z

Rebased and squashed, [merge]

openshift-bot · 2017-04-02T08:45:14Z

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/526/) (Base Commit: 50bbe53) (Image: devenv-rhel7_6112)

openshift-bot · 2017-04-02T08:45:14Z

Evaluated for origin merge up to 1b0f7b4

jhadvig · 2017-04-03T22:10:50Z

After this PR got merged we started to see the following errors in the extended tests that we run at the end of the cluster install_upgrade job:
https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_conformance_install_update/150/testReport/
@smarterclayton any ideas why this is happening?

smarterclayton · 2017-04-04T09:00:02Z

Check that the router has been created with metrics type haproxy (env var), and that it has the right image (newest). If the first is true but the second isn't, the test would fail.

jhadvig · 2017-04-04T13:00:28Z

Check that the router has been created with metrics type haproxy (env var)

The env var is missing in the running router container

and that it has the right image (newest)

And yes, we are using the newest image.
Here is the inspection of the router container http://pastebin.test.redhat.com/471601
The "ROUTER_METRICS_TYPE=haproxy" is missing

jhadvig · 2017-04-04T13:18:43Z

@smarterclayton the flag is missing in the upgrade job. We should be starting the router in the upgrade job with it.

smarterclayton · 2017-04-04T17:08:36Z

The test should be skipped if that env var is unset. Can you verify that?

jhadvig · 2017-04-04T17:12:02Z

The test should be skipped if that env var is unset. Can you verify that?

No, it's not skipped, it fails
https://ci.openshift.redhat.com/jenkins/view/All/job/test_branch_origin_extended_conformance_install_update/155/consoleFull#117457015656cbb9a5e4b02b88ae8c2f77

smarterclayton · 2017-04-04T17:26:16Z

Can you debug why? :) It works correctly on GCE (no new image) - my thought is that upgrade is not running the correct logic and so the guard in place is failing.

smarterclayton · 2017-04-04T17:28:26Z

Actually, easy answer - it's not skipped. The BeforeEach in the e2e test metrics.go needs to skip if the router metrics type env var (ROUTER_METRICS_TYPE) is not set to haproxy, and skip the healthz / profiling checks if the listen address env var is not set.
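A minimal sketch of that guard as a standalone Go function, not the actual change to metrics.go. `skipDecision` is a hypothetical helper, and `ROUTER_LISTEN_ADDR` is an assumed name for the listen address variable; only `ROUTER_METRICS_TYPE` is named in this thread.

```go
package main

import "fmt"

// skipDecision reports which router checks should be skipped, given the
// environment of the running router container.
// ROUTER_METRICS_TYPE is named in this thread; ROUTER_LISTEN_ADDR is an
// assumed name for the listen address variable.
func skipDecision(env map[string]string) (skipMetrics, skipHealthz bool) {
	// Metrics checks only make sense when the router reports HAProxy metrics.
	skipMetrics = env["ROUTER_METRICS_TYPE"] != "haproxy"
	// healthz and pprof are only served when a listen address is configured.
	skipHealthz = env["ROUTER_LISTEN_ADDR"] == ""
	return skipMetrics, skipHealthz
}

func main() {
	// A router started without either variable, as in the upgrade job.
	upgraded := map[string]string{}
	skipMetrics, skipHealthz := skipDecision(upgraded)
	fmt.Println(skipMetrics, skipHealthz) // prints "true true"
}
```

In a Ginkgo suite the two results would presumably drive a `Skip` call inside `BeforeEach`, so the tests are reported as skipped rather than failed.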

stevekuznetsov · 2017-04-04T17:29:20Z

my thought is that upgrade is not running the correct logic and so the guard in place is failing.

Isn't it on the developer landing a feature into Origin to get the correct logic into the upgrade job so that their new feature goes in cleanly?

smarterclayton · 2017-04-04T17:39:47Z

I'm on PTO - this is a new test suite, so you can either fix it, disable it in your runs, or wait until I get back. Or you can fix the upgrade to set the right settings (which is the real bug). Up to you.

smarterclayton · 2017-04-04T18:05:09Z

Here you go #13625


smarterclayton · 2017-04-17T03:01:10Z

When we move to 1.6+, we can use https://github.com/haproxy/haproxy/blob/master/examples/seamless_reload.txt to save metrics prior to restart.

smarterclayton force-pushed the router_metrics branch 4 times, most recently from 6358d63 to dfb3b2c Compare March 12, 2017 05:51

smarterclayton changed the title ~~WIP - Expose prometheus metrics for the router by default~~ Instrument the HAProxy router with metrics that contain route info Mar 12, 2017

smarterclayton added the component/networking label Mar 12, 2017

smarterclayton added this to the 1.6.0 milestone Mar 12, 2017

smarterclayton force-pushed the router_metrics branch 2 times, most recently from 94b837f to 9f2fbcf Compare March 13, 2017 04:30

smarterclayton force-pushed the router_metrics branch from 9f2fbcf to f85850e Compare March 13, 2017 22:25

smarterclayton force-pushed the router_metrics branch 3 times, most recently from 4777e27 to 4e56f44 Compare March 20, 2017 02:35

smarterclayton added 8 commits April 2, 2017 01:47

Use ':' as a name separate in the router

3e5c668

It's possible a future change could result in a collision between underscores and dashes, while ':' will never be part of the name.

Track reload and config write times in the router

7a415aa

Will help us understand how restarts happen in production

UPSTREAM: 42959: Delete host exec pods faster

98fcab6

Reduces namespace cleanup times for some tests

Add a test suite that verifies the router metrics

7ee65a8

Tests the metrics endpoint, including metrics transformation, healthz, and ACL checks.

generated: completions

ff133d4

Print version of tests in extended

3facf8e

Don't eat the newline at the end of the config

1b0f7b4

smarterclayton force-pushed the router_metrics branch from 96ba73d to 1b0f7b4 Compare April 1, 2017 23:47

openshift-bot removed the needs-rebase label Apr 2, 2017

openshift-bot merged commit b0710d4 into openshift:master Apr 2, 2017

sdodson mentioned this pull request Apr 4, 2017

Need to add ROUTER_METRICS_TYPE during router upgrades openshift/openshift-ansible#3845

Closed

knobunc mentioned this pull request Jul 17, 2018

Clearly Specify Router Metrics file format returned openshift/openshift-docs#10549

Merged
