Instrument the HAProxy router with metrics that contain route info #13337

Merged (9 commits) on Apr 2, 2017

Conversation

smarterclayton
Contributor

@smarterclayton commented Mar 10, 2017

Instrument the template router with new optional flags --metrics-type and --listen-addr, which follow the conventions of the other infrastructure components and register a health check (currently unused, since it doesn't check HAProxy), the pprof endpoints, and a Prometheus metrics endpoint. pprof and metrics are protected by the stats endpoint's basic auth. oadm router now enables a default listen address of 0.0.0.0:1935.

On the template router, start with the Prometheus haproxy_exporter to extract metrics from the running HAProxy process. Since large numbers of metrics are returned and reported, slim the list down to a smaller subset, focusing on those that provide meaningful input at large scales. Add a server threshold that stops reporting per-server metrics (reporting only backend metrics instead) once the number of servers in HAProxy exceeds it; defaults to 1000 endpoints. Add a rate limiter on how often the HAProxy stats endpoint can be called, inversely proportional to the total number of stats entries (usually dominated by backends + servers), to reduce total work; defaults to 5s for every 1000 servers or backends. Also, right before reload, capture the latest stats and report them on the next call (since we don't know how to merge metrics).

Add additional metrics on the router itself to report reload times and write config.

To properly expose metrics to the user, the following changes were made to the router template:

  1. Switch from _ as a separator in backend/server names to :, which can never appear in a route or ingress name (previously an ingress could introduce _, making names ambiguous).
  2. Switch server names from the IdHash to the ID (endpoint IP + port number); the hash was not a required security measure (cookie values are loaded separately).

With these in place, a caller can get a wide set of metrics from the router, which can be used for custom autoscaling, monitoring, and problem detection:

(Limited example; most backends removed.)

# HELP haproxy_backend_connections_total Total number of connections.
# TYPE haproxy_backend_connections_total gauge
haproxy_backend_connections_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute"} 20
haproxy_backend_connections_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="zeroweightroute"} 0
# HELP haproxy_backend_current_queue Current number of queued requests not assigned to any server.
# TYPE haproxy_backend_current_queue gauge
haproxy_backend_current_queue{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute"} 0
haproxy_backend_current_queue{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="zeroweightroute"} 0
# HELP haproxy_backend_up Current health status of the backend (1 = UP, 0 = DOWN).
# TYPE haproxy_backend_up gauge
haproxy_backend_up{backend="http",namespace="default",route="backend4"} 1
haproxy_backend_up{backend="http",namespace="default",route="docker-registry"} 1
haproxy_backend_up{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="zeroweightroute"} 1
# HELP haproxy_exporter_csv_parse_failures Number of errors while parsing CSV.
# TYPE haproxy_exporter_csv_parse_failures counter
haproxy_exporter_csv_parse_failures 0
# HELP haproxy_exporter_scrape_interval The time in seconds before another scrape is allowed, proportional to size of data.
# TYPE haproxy_exporter_scrape_interval gauge
haproxy_exporter_scrape_interval 5
# HELP haproxy_exporter_server_threshold Number of servers tracked and the current threshold value.
# TYPE haproxy_exporter_server_threshold counter
haproxy_exporter_server_threshold{type="current"} 13
haproxy_exporter_server_threshold{type="limit"} 500
# HELP haproxy_exporter_total_scrapes Current total HAProxy scrapes.
# TYPE haproxy_exporter_total_scrapes counter
haproxy_exporter_total_scrapes 27
# HELP haproxy_frontend_bytes_in_total Current total of incoming bytes.
# TYPE haproxy_frontend_bytes_in_total gauge
haproxy_frontend_bytes_in_total{frontend="fe_no_sni"} 0
haproxy_frontend_bytes_in_total{frontend="fe_sni"} 0
haproxy_frontend_bytes_in_total{frontend="public"} 1680
haproxy_frontend_bytes_in_total{frontend="public_ssl"} 0
haproxy_frontend_bytes_in_total{frontend="stats"} 207
# HELP haproxy_frontend_bytes_out_total Current total of outgoing bytes.
# TYPE haproxy_frontend_bytes_out_total gauge
haproxy_frontend_bytes_out_total{frontend="fe_no_sni"} 0
haproxy_frontend_bytes_out_total{frontend="fe_sni"} 0
haproxy_frontend_bytes_out_total{frontend="public"} 2680
haproxy_frontend_bytes_out_total{frontend="public_ssl"} 0
haproxy_frontend_bytes_out_total{frontend="stats"} 479
# HELP haproxy_frontend_connections_total Total number of connections.
# TYPE haproxy_frontend_connections_total gauge
haproxy_frontend_connections_total{frontend="fe_no_sni"} 0
haproxy_frontend_connections_total{frontend="fe_sni"} 0
haproxy_frontend_connections_total{frontend="public"} 20
haproxy_frontend_connections_total{frontend="public_ssl"} 0
haproxy_frontend_connections_total{frontend="stats"} 3
# HELP haproxy_frontend_current_session_rate Current number of sessions per second over last elapsed second.
# TYPE haproxy_frontend_current_session_rate gauge
haproxy_frontend_current_session_rate{frontend="fe_no_sni"} 0
haproxy_frontend_current_session_rate{frontend="fe_sni"} 0
haproxy_frontend_current_session_rate{frontend="public"} 0
haproxy_frontend_current_session_rate{frontend="public_ssl"} 0
haproxy_frontend_current_session_rate{frontend="stats"} 0
# HELP haproxy_frontend_current_sessions Current number of active sessions.
# TYPE haproxy_frontend_current_sessions gauge
haproxy_frontend_current_sessions{frontend="fe_no_sni"} 0
haproxy_frontend_current_sessions{frontend="fe_sni"} 0
haproxy_frontend_current_sessions{frontend="public"} 0
haproxy_frontend_current_sessions{frontend="public_ssl"} 0
haproxy_frontend_current_sessions{frontend="stats"} 0
# HELP haproxy_frontend_http_responses_total Total of HTTP responses.
# TYPE haproxy_frontend_http_responses_total gauge
haproxy_frontend_http_responses_total{code="2xx",frontend="fe_no_sni"} 0
haproxy_frontend_http_responses_total{code="2xx",frontend="fe_sni"} 0
haproxy_frontend_http_responses_total{code="2xx",frontend="public"} 20
haproxy_frontend_http_responses_total{code="2xx",frontend="stats"} 2
haproxy_frontend_http_responses_total{code="5xx",frontend="fe_no_sni"} 0
haproxy_frontend_http_responses_total{code="5xx",frontend="fe_sni"} 0
haproxy_frontend_http_responses_total{code="5xx",frontend="public"} 0
haproxy_frontend_http_responses_total{code="5xx",frontend="stats"} 0
# HELP haproxy_frontend_max_session_rate Maximum observed number of sessions per second.
# TYPE haproxy_frontend_max_session_rate gauge
haproxy_frontend_max_session_rate{frontend="fe_no_sni"} 0
haproxy_frontend_max_session_rate{frontend="fe_sni"} 0
haproxy_frontend_max_session_rate{frontend="public"} 10
haproxy_frontend_max_session_rate{frontend="public_ssl"} 0
haproxy_frontend_max_session_rate{frontend="stats"} 2
# HELP haproxy_frontend_max_sessions Maximum observed number of active sessions.
# TYPE haproxy_frontend_max_sessions gauge
haproxy_frontend_max_sessions{frontend="fe_no_sni"} 0
haproxy_frontend_max_sessions{frontend="fe_sni"} 0
haproxy_frontend_max_sessions{frontend="public"} 1
haproxy_frontend_max_sessions{frontend="public_ssl"} 0
haproxy_frontend_max_sessions{frontend="stats"} 1
# HELP haproxy_process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE haproxy_process_cpu_seconds_total counter
haproxy_process_cpu_seconds_total 0.01
# HELP haproxy_process_max_fds Maximum number of open file descriptors.
# TYPE haproxy_process_max_fds gauge
haproxy_process_max_fds 40046
# HELP haproxy_process_resident_memory_bytes Resident memory size in bytes.
# TYPE haproxy_process_resident_memory_bytes gauge
haproxy_process_resident_memory_bytes 6.836224e+06
# HELP haproxy_process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE haproxy_process_start_time_seconds gauge
haproxy_process_start_time_seconds 1.48929703048e+09
# HELP haproxy_process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE haproxy_process_virtual_memory_bytes gauge
haproxy_process_virtual_memory_bytes 5.4120448e+07
# HELP haproxy_server_bytes_in_total Current total of incoming bytes.
# TYPE haproxy_server_bytes_in_total gauge
haproxy_server_bytes_in_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 840
# HELP haproxy_server_bytes_out_total Current total of outgoing bytes.
# TYPE haproxy_server_bytes_out_total gauge
haproxy_server_bytes_out_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 1340
# HELP haproxy_server_check_failures_total Total number of failed health checks.
# TYPE haproxy_server_check_failures_total gauge
haproxy_server_check_failures_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_connection_errors_total Total of connection errors.
# TYPE haproxy_server_connection_errors_total gauge
haproxy_server_connection_errors_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_connections_total Total number of connections.
# TYPE haproxy_server_connections_total gauge
haproxy_server_connections_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 10
# HELP haproxy_server_current_queue Current number of queued requests assigned to this server.
# TYPE haproxy_server_current_queue gauge
haproxy_server_current_queue{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_current_session_rate Current number of sessions per second over last elapsed second.
# TYPE haproxy_server_current_session_rate gauge
haproxy_server_current_session_rate{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_current_sessions Current number of active sessions.
# TYPE haproxy_server_current_sessions gauge
haproxy_server_current_sessions{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_downtime_seconds_total Total downtime in seconds.
# TYPE haproxy_server_downtime_seconds_total gauge
haproxy_server_downtime_seconds_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_http_responses_total Total of HTTP responses.
# TYPE haproxy_server_http_responses_total gauge
haproxy_server_http_responses_total{backend="http",code="2xx",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 10
haproxy_server_http_responses_total{backend="http",code="5xx",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_max_session_rate Maximum observed number of sessions per second.
# TYPE haproxy_server_max_session_rate gauge
haproxy_server_max_session_rate{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 5
# HELP haproxy_server_max_sessions Maximum observed number of active sessions.
# TYPE haproxy_server_max_sessions gauge
haproxy_server_max_sessions{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 1
# HELP haproxy_server_response_errors_total Total of response errors.
# TYPE haproxy_server_response_errors_total gauge
haproxy_server_response_errors_total{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 0
# HELP haproxy_server_up Current health status of the server (1 = UP, 0 = DOWN).
# TYPE haproxy_server_up gauge
haproxy_server_up{backend="http",namespace="default",route="backend1",server="172.17.0.3:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend1-1",server="172.17.0.3:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend1-2",server="172.17.0.3:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend1-2",server="172.17.0.4:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend1-3",server="172.17.0.3:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend2",server="172.17.0.4:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend3",server="172.17.0.4:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="backend4",server="172.17.0.4:8080"} 1
haproxy_server_up{backend="http",namespace="default",route="docker-registry",server="172.17.0.5:5000"} 1
haproxy_server_up{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.10:8080"} 1
haproxy_server_up{backend="http",namespace="extended-test-router-metrics-2l1f7-jk7c0",route="weightedroute",server="172.17.0.11:8080"} 1
haproxy_server_up{backend="other/be_no_sni",namespace="",route="",server="fe_no_sni"} 1
haproxy_server_up{backend="other/be_sni",namespace="",route="",server="fe_sni"} 1
# HELP haproxy_up Was the last scrape of haproxy successful.
# TYPE haproxy_up gauge
haproxy_up 1
# HELP openshift_build_info A metric with a constant '1' value labeled by major, minor, git commit & git version from which OpenShift was built.
# TYPE openshift_build_info gauge
openshift_build_info{gitCommit="39ecc8d",gitVersion="v1.5.0-alpha.3+39ecc8d-355-dirty",major="1",minor="5+"} 1
# HELP template_router_reload_seconds Measures the time spent reloading the router in seconds.
# TYPE template_router_reload_seconds summary
template_router_reload_seconds{quantile="0.5"} 0.067299316
template_router_reload_seconds{quantile="0.9"} 0.118891744
template_router_reload_seconds{quantile="0.99"} 0.128374662
template_router_reload_seconds_sum 1.69002939
template_router_reload_seconds_count 21
# HELP template_router_write_config_seconds Measures the time spent writing out the router configuration to disk in seconds.
# TYPE template_router_write_config_seconds summary
template_router_write_config_seconds{quantile="0.5"} 0.004694066
template_router_write_config_seconds{quantile="0.9"} 0.012401996
template_router_write_config_seconds{quantile="0.99"} 0.017210267
template_router_write_config_seconds_sum 0.15450527399999994
template_router_write_config_seconds_count 21
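Because every backend and server sample above carries namespace and route labels, per-namespace aggregation is straightforward. A minimal Go sketch (function name hypothetical; it ignores label-value escaping and skips HELP/TYPE lines, so real consumers should use a Prometheus client library or PromQL):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sumByNamespace totals the samples of one metric from Prometheus
// text-exposition lines, grouped by their namespace label. Lines for
// other metrics, comment lines, and samples without a namespace label
// are skipped.
func sumByNamespace(lines []string, metric string) map[string]float64 {
	re := regexp.MustCompile(`namespace="([^"]*)"`)
	totals := map[string]float64{}
	for _, line := range lines {
		if !strings.HasPrefix(line, metric+"{") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		value, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		m := re.FindStringSubmatch(fields[0])
		if m == nil {
			continue
		}
		totals[m[1]] += value
	}
	return totals
}

func main() {
	lines := []string{
		`haproxy_backend_connections_total{backend="http",namespace="ns1",route="a"} 20`,
		`haproxy_backend_connections_total{backend="http",namespace="ns1",route="b"} 5`,
		`haproxy_backend_connections_total{backend="http",namespace="ns2",route="c"} 0`,
	}
	totals := sumByNamespace(lines, "haproxy_backend_connections_total")
	fmt.Println(totals["ns1"]) // 25
}
```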

@smarterclayton force-pushed the router_metrics branch 4 times, most recently from 6358d63 to dfb3b2c, on March 12, 2017 05:51
@smarterclayton changed the title from "WIP - Expose prometheus metrics for the router by default" to "Instrument the HAProxy router with metrics that contain route info" on Mar 12, 2017
@smarterclayton
Contributor Author

Performance testing on this:

At 1000 routes / 1000 endpoints (below serverThreshold), Prometheus returns 2M of data and takes about 1-2% of CPU sustained to keep handling metrics every 2s (far more rapid than normal use). Memory usage is around 90M, with no spiking.

At 1001 routes / 1001 endpoints (above serverThreshold), Prometheus drops down to 500k of data and takes less CPU.

I expect this can handle up to 10k routes without tuning; beyond that we may want to reduce the check interval or special-case routes with only one endpoint.

@smarterclayton
Contributor Author

@knobunc this gets the router up to roughly the bar the controllers and API server are at. Feedback on parameters, behavior, etc. is appreciated. Once you have someone available to review, this should be mostly functional. With this in place we could consider grabbing router metrics on a per-namespace basis in the 3.7 timeframe and using custom autoscaling for edge traffic. My primary concern is observability: making the router something you can watch from a central monitoring platform.

@smarterclayton added this to the 1.6.0 milestone on Mar 12, 2017
@smarterclayton force-pushed the router_metrics branch 2 times, most recently from 94b837f to 9f2fbcf, on March 13, 2017 04:30
@smarterclayton
Contributor Author

[test]

@smarterclayton
Contributor Author

@stevekuznetsov https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/150/ should have failed GCE (images not built) but it didn't. It looks like it might be because extended.test is not being built.

@stevekuznetsov
Contributor

@smarterclayton from that log...

00:15:58.268 Successfully built: ... /openshifttmp/openshift/tito/x86_64/origin-tests-1.5.0-999.alpha.3+9f2fbcf.361.x86_64.rpm ...

@stevekuznetsov
Contributor

oct correctly merged your commit 9f2fbcf into master...

00:03:46.832 TASK [remote-sync : synchronize the repository with the remote server] *********
00:03:46.832 task path: /var/lib/jenkins/origin-ci-tool/7a8f8a31a03ddd44278142ff40021abeb31efe7b/lib/python2.7/site-packages/oct/ansible/oct/playbooks/sync/roles/remote-sync/tasks/main.yml:36
00:03:48.521 changed: [172.18.5.223] => {
00:03:48.521     "after": "9f2fbcfe425a14843dcfa5252f707596e0f23a56", 
00:03:48.521     "before": "18b1328a85d32551f434f05676436dbc64c4dc87", 
00:03:48.521     "changed": true, 
00:03:48.521     "remote_url_changed": false, 
00:03:48.521     "warnings": []
00:03:48.521 }
00:03:48.529 
00:03:48.529 TASK [remote-sync : check out the desired post-merge state, if requested] ******
00:03:48.529 task path: /var/lib/jenkins/origin-ci-tool/7a8f8a31a03ddd44278142ff40021abeb31efe7b/lib/python2.7/site-packages/oct/ansible/oct/playbooks/sync/roles/remote-sync/tasks/main.yml:45
00:03:48.991 changed: [172.18.5.223] => {
00:03:48.991     "changed": true, 
00:03:48.991     "cmd": "/usr/bin/git checkout master", 
00:03:48.991     "delta": "0:00:00.116039", 
00:03:48.991     "end": "2017-03-13 01:17:24.180675", 
00:03:48.991     "rc": 0, 
00:03:48.991     "start": "2017-03-13 01:17:24.064636", 
00:03:48.991     "stderr": [
00:03:48.991         "Switched to branch 'master'"
00:03:48.991     ], 
00:03:48.991     "stdout": [], 
00:03:48.991     "warnings": []
00:03:48.991 }
00:03:48.997 
00:03:48.997 TASK [remote-sync : merge the resulting state into another branch, if requested] ***
00:03:48.997 task path: /var/lib/jenkins/origin-ci-tool/7a8f8a31a03ddd44278142ff40021abeb31efe7b/lib/python2.7/site-packages/oct/ansible/oct/playbooks/sync/roles/remote-sync/tasks/main.yml:51
00:03:49.450 changed: [172.18.5.223] => {
00:03:49.450     "changed": true, 
00:03:49.450     "cmd": "/usr/bin/git merge pull-13337", 
00:03:49.451     "delta": "0:00:00.119124", 
00:03:49.451     "end": "2017-03-13 01:17:24.640143", 
00:03:49.451     "rc": 0, 
00:03:49.451     "start": "2017-03-13 01:17:24.521019", 
00:03:49.451     "stderr": [], 
00:03:49.451     "stdout": [
00:03:49.451         "Updating 18b1328..9f2fbcf", 
00:03:49.451         "Fast-forward", 
00:03:49.451         " contrib/completions/bash/openshift                 |   6 +", 
00:03:49.451         " contrib/completions/zsh/openshift                  |   6 +", 
00:03:49.451         " images/router/haproxy/conf/haproxy-config.template |  66 +--", 
00:03:49.451         " pkg/cmd/admin/router/router.go                     |  10 +", 
00:03:49.451         " pkg/cmd/infra/router/router.go                     |   3 +", 
00:03:49.451         " pkg/cmd/infra/router/template.go                   |  63 +++", 
00:03:49.451         " pkg/router/controller/ingress.go                   |   2 +-", 
00:03:49.451         " pkg/router/metrics/haproxy/haproxy.go              | 625 +++++++++++++++++++++", 
00:03:49.451         " pkg/router/metrics/metrics.go                      |  46 ++", 
00:03:49.451         " pkg/router/template/plugin.go                      |   2 +", 
00:03:49.451         " pkg/router/template/router.go                      |  51 +-", 
00:03:49.451         " test/extended/router/metrics.go                    | 245 ++++++++", 
00:03:49.452         " test/extended/testdata/router-metrics.yaml         | 105 ++++", 
00:03:49.452         " test/extended/testdata/scoped-router.yaml          |   3 +", 
00:03:49.452         " test/extended/testdata/weighted-router.yaml        |   3 +", 
00:03:49.452         " .../k8s.io/kubernetes/test/e2e/framework/util.go   |   2 +", 
00:03:49.452         " 16 files changed, 1195 insertions(+), 43 deletions(-)", 
00:03:49.452         " create mode 100644 pkg/router/metrics/haproxy/haproxy.go", 
00:03:49.452         " create mode 100644 pkg/router/metrics/metrics.go", 
00:03:49.452         " create mode 100644 test/extended/router/metrics.go", 
00:03:49.452         " create mode 100644 test/extended/testdata/router-metrics.yaml"
00:03:49.452     ], 
00:03:49.452     "warnings": []
00:03:49.452 }

@smarterclayton
Contributor Author

I think this is something in between RPMs being created and RPMs being installed.

@stevekuznetsov
Contributor

Maybe the wrong RPM location is being chosen?

@stevekuznetsov
Contributor

Publish step:

00:16:46.992 + gsutil -m cp -r artifacts/rpms gs://origin-ci-test/pr-logs/13337/test_pull_request_origin_extended_conformance_gce/150/artifacts/rpms
00:16:47.776 Copying file://artifacts/rpms/origin-tests-1.5.0-999.alpha.3+9f2fbcf.361.x86_64.rpm [Content-Type=application/x-rpm]...

Install step uses the same repo:

00:17:09.732 + ../../bin/local.sh ansible-playbook -e provision_gce_docker_storage_driver=devicemapper -e openshift_test_repo=https://storage.googleapis.com/origin-ci-test/pr-logs/13337/test_pull_request_origin_extended_conformance_gce/150/artifacts/rpms playbooks/launch.yaml

Running the tests doesn't rebuild anything, so it must be using the extended.test from the RPM build...

00:33:54.873 + make test-extended SUITE=conformance
00:33:55.093 test/extended/conformance.sh 
00:34:29.311 [INFO] Running parallel tests N=25

@smarterclayton
Contributor Author

@stevekuznetsov
Contributor

Looks kosher?

00:34:05.549 + find _output/local/bin -ls
00:34:05.550 59187400    0 drwxr-sr-x   3 origin   origin-git       18 Mar 13 16:48 _output/local/bin
00:34:05.550 67921122    0 drwxr-sr-x   3 origin   origin-git       18 Mar 13 16:48 _output/local/bin/linux
00:34:05.550 76366864    4 drwxr-sr-x   2 origin   origin-git     4096 Mar 13 16:56 _output/local/bin/linux/amd64
00:34:05.550 76366865 63968 -rwxr-xr-x   1 origin   origin-git 65502210 Mar 13 16:49 _output/local/bin/linux/amd64/dockerregistry
00:34:05.550 76366866 117516 -rwxr-xr-x   1 origin   origin-git 120332984 Mar 13 16:54 _output/local/bin/linux/amd64/extended.test
00:34:05.550 76366867 86688 -rwxr-xr-x   1 origin   origin-git 88767680 Mar 13 16:56 _output/local/bin/linux/amd64/gendocs
00:34:05.550 76366868 258992 -rwxr-xr-x   1 origin   origin-git 265206432 Mar 13 16:56 _output/local/bin/linux/amd64/genman
00:34:05.550 76366869 53184 -rwxr-xr-x   1 origin   origin-git 54459696 Mar 13 16:49 _output/local/bin/linux/amd64/gitserver
00:34:05.551 76366870 5504 -rwxr-xr-x   1 origin   origin-git  5635062 Mar 13 16:49 _output/local/bin/linux/amd64/hello-openshift
00:34:05.551 76366871 2904 -rwxr-xr-x   1 origin   origin-git  2969783 Mar 13 16:49 _output/local/bin/linux/amd64/host-local
00:34:05.551 76366872    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kube-apiserver -> openshift
00:34:05.551 76366873    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kube-controller-manager -> openshift
00:34:05.551 76366874    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kube-proxy -> openshift
00:34:05.551 76366875    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kube-scheduler -> openshift
00:34:05.551 76366876    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kubectl -> openshift
00:34:05.551 76366877    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kubelet -> openshift
00:34:05.551 76366878    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/kubernetes -> openshift
00:34:05.552 76366879 2796 -rwxr-xr-x   1 origin   origin-git  2859526 Mar 13 16:49 _output/local/bin/linux/amd64/loopback
00:34:05.552 76426528    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/oadm -> openshift
00:34:05.552 76426529 90608 -rwxr-xr-x   1 origin   origin-git 92780520 Mar 13 16:52 _output/local/bin/linux/amd64/oc
00:34:05.552 76426530 260644 -rwxr-xr-x   1 origin   origin-git 266898016 Mar 13 16:53 _output/local/bin/linux/amd64/openshift
00:34:05.552 76426531    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/openshift-deploy -> openshift
00:34:05.552 76426532    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/openshift-docker-build -> openshift
00:34:05.552 76426533    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/openshift-recycle -> openshift
00:34:05.552 76426534    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/openshift-router -> openshift
00:34:05.552 76426535    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/openshift-sti-build -> openshift
00:34:05.553 76426536    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/origin -> openshift
00:34:05.553 76426537    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/osadm -> openshift
00:34:05.553 76426538    0 lrwxrwxrwx   1 origin   origin-git        9 Mar 13 16:56 _output/local/bin/linux/amd64/osc -> openshift
00:34:05.553 76426539 1116 -rwxr-xr-x   1 origin   origin-git  1138998 Mar 13 16:48 _output/local/bin/linux/amd64/pod
00:34:05.553 76426540 5600 -rwxr-xr-x   1 origin   origin-git  5732301 Mar 13 16:49 _output/local/bin/linux/amd64/sdn-cni-plugin
00:34:05.553 + git status
00:34:05.620 # On branch master
00:34:05.620 # Your branch is ahead of 'origin/master' by 10 commits.
00:34:05.620 #   (use "git push" to publish your local commits)
00:34:05.620 #
00:34:05.620 nothing to commit, working directory clean

@smarterclayton
Contributor Author

smarterclayton commented Mar 13, 2017 via email

@stevekuznetsov
Contributor

Images are fixed, re[test]

@smarterclayton
Contributor Author

@ramr @rajatchopra

@smarterclayton
Contributor Author

I'm going to alter the haproxy server name one more time to put the service name in there, so we can filter on it.

@rajatchopra
Contributor

What about when there are multiple services involved, as in the A/B testing case?

@smarterclayton
Contributor Author

smarterclayton commented Mar 16, 2017 via email

@smarterclayton force-pushed the router_metrics branch 3 times, most recently from 4777e27 to 4e56f44, on March 20, 2017 02:35
It's possible a future change could result in a collision between
underscores and dashes, while ':' will never be part of the name.
Turn backend names into structured data:

* be_http:NAMESPACE:NAME -> {backend="http",namespace="NAMESPACE",route="NAME"}
* be_secure:NAMESPACE:NAME -> {backend="https",namespace="NAMESPACE",route="NAME"}
* be_edge_http:NAMESPACE:NAME -> {backend="https-edge",namespace="NAMESPACE",route="NAME"}
* be_tcp:NAMESPACE:NAME -> {backend="tcp",namespace="NAMESPACE",route="NAME"}
* `*` -> {backend="other/*"}

Allows per route / namespace aggregation of metrics.
Include service information in endpoints and parse it out
Services without target refs continue as they are.
Will help us understand how restarts happen in production
Reduces namespace cleanup times for some tests
Tests the metrics endpoint, including metrics transformation, healthz,
and ACL checks.
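Under the naming scheme in the first commit message above, splitting a backend name on ':' recovers the Prometheus labels. A rough Go sketch of that mapping (function name and structure are illustrative, not the actual exporter code):

```go
package main

import (
	"fmt"
	"strings"
)

// parseBackend converts a router-template backend name such as
// "be_edge_http:NAMESPACE:NAME" into Prometheus label values, following
// the prefix-to-label table in the commit message above. Names that do
// not match any known prefix fall through to the "other/*" catch-all.
func parseBackend(name string) map[string]string {
	prefixes := map[string]string{
		"be_http":      "http",
		"be_secure":    "https",
		"be_edge_http": "https-edge",
		"be_tcp":       "tcp",
	}
	for prefix, backend := range prefixes {
		if strings.HasPrefix(name, prefix+":") {
			parts := strings.SplitN(name, ":", 3)
			if len(parts) == 3 {
				return map[string]string{
					"backend":   backend,
					"namespace": parts[1],
					"route":     parts[2],
				}
			}
		}
	}
	// Anything else (e.g. internal SNI plumbing) is reported as "other/*".
	return map[string]string{"backend": "other/" + name}
}

func main() {
	fmt.Println(parseBackend("be_edge_http:myns:myroute")["backend"]) // https-edge
}
```

Because ':' cannot appear in a route or ingress name, the split is unambiguous, which was the point of moving away from '_' as the separator.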
@openshift-bot
Contributor

Evaluated for origin test up to 1b0f7b4

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/526/) (Base Commit: 50bbe53)

@openshift-bot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 2, 2017
@smarterclayton
Contributor Author

smarterclayton commented Apr 2, 2017 via email

@openshift-bot
Contributor

openshift-bot commented Apr 2, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/526/) (Base Commit: 50bbe53) (Image: devenv-rhel7_6112)

@openshift-bot
Contributor

Evaluated for origin merge up to 1b0f7b4

@openshift-bot merged commit b0710d4 into openshift:master on Apr 2, 2017
@jhadvig
Member

jhadvig commented Apr 3, 2017

After this PR was merged we started to see the following errors in the extended tests that we run at the end of the cluster install_upgrade job:
https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_conformance_install_update/150/testReport/
@smarterclayton any ideas why this is happening?

@smarterclayton
Contributor Author

smarterclayton commented Apr 4, 2017 via email

@jhadvig
Member

jhadvig commented Apr 4, 2017

> Check that the router has been created with metrics type haproxy (env var)

The env var is missing in the running router container.

> and that it has the right image (newest)

And yes, we are using the newest image.
Here is the inspection of the router container: http://pastebin.test.redhat.com/471601
The "ROUTER_METRICS_TYPE=haproxy" is missing.

@jhadvig
Member

jhadvig commented Apr 4, 2017

@smarterclayton the flag is missing in the upgrade job. We should be starting the router in the upgrade job with it.

@smarterclayton
Contributor Author

smarterclayton commented Apr 4, 2017 via email

@jhadvig
Member

jhadvig commented Apr 4, 2017

> The test should be skipped if that env var is unset. Can you verify that?

No, it's not skipped; it fails:
https://ci.openshift.redhat.com/jenkins/view/All/job/test_branch_origin_extended_conformance_install_update/155/consoleFull#117457015656cbb9a5e4b02b88ae8c2f77

@smarterclayton
Contributor Author

smarterclayton commented Apr 4, 2017 via email

@smarterclayton
Contributor Author

smarterclayton commented Apr 4, 2017 via email

@stevekuznetsov
Contributor

> my thought is that upgrade is not running the correct logic and so the guard in place is failing.

Isn't it on the developer landing a feature into Origin to get the correct logic into the upgrade job so that their new feature goes in cleanly?

@smarterclayton
Contributor Author

smarterclayton commented Apr 4, 2017 via email

@smarterclayton
Contributor Author

smarterclayton commented Apr 4, 2017 via email

@smarterclayton
Contributor Author

When we move to 1.6+, we can use https://github.com/haproxy/haproxy/blob/master/examples/seamless_reload.txt to save metrics prior to restart.
