Allow Prometheus to get metrics from the router #19318

simonpasquier · 2018-04-11T15:53:04Z

Fix for #17685. Without this PR, the router can't validate the Prometheus token because it lacks the following permission:

- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
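For context, the `create` verb on `tokenreviews` is what allows a component to ask the API server whether a bearer token is valid. Below is a minimal Python sketch of such a request body, not the router's actual code. The helper name `token_review_body` and the placeholder token are hypothetical, and the `authentication.k8s.io/v1` API version is an assumption about what the cluster serves.

```python
import json


def token_review_body(token: str) -> dict:
    """Build a TokenReview object; submitting it requires `create` on tokenreviews."""
    return {
        "apiVersion": "authentication.k8s.io/v1",  # assumption: v1 is served
        "kind": "TokenReview",
        "spec": {"token": token},  # the bearer token presented by the scraper
    }


if __name__ == "__main__":
    # Placeholder token for illustration only.
    print(json.dumps(token_review_body("<bearer-token>"), indent=2))
```

Without the permission above, a request carrying this body would be rejected as forbidden, so the token could not be validated.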

It also removes the prometheus.io/... and prometheus.openshift.io/... annotations on the router's service since they are unused and create targets that Prometheus can't scrape.

I'm not sure what happens when several routers are deployed. Most likely, only the first one associated with the router service will be scraped, not the others.

simonpasquier · 2018-04-11T15:53:29Z

@smarterclayton FYI

smarterclayton · 2018-04-12T03:37:30Z

All routers should be scraped, the instance and namespace are enough to disambiguate.

We should have a role already for an endpoint that can do SAR, it should be a system role from upstream

smarterclayton · 2018-04-12T03:41:06Z

Since this worked before, I wonder what changed. Was working ootb when I deployed before the rebase. @liggitt anything maybe get lost with roles?

liggitt · 2018-04-12T03:50:42Z

> @liggitt anything maybe get lost with roles?

Don't think so… we have unit tests that make visible any changes in bootstrap permissions, and nothing related to this showed up

simonpasquier · 2018-04-12T09:19:30Z

Does it make sense to extend the Prometheus e2e tests to check that router's metrics are collected then?

> All routers should be scraped, the instance and namespace are enough to disambiguate.

If you're talking about multiple replicas of the default router service, I agree. My concern is more about what happens when another router is deployed (e.g. oc adm router router-foo ...), since the current Prometheus scrape configuration only looks for endpoints associated with the service named router.

origin/examples/prometheus/prometheus.yaml

Lines 717 to 725 in 5d07751

    kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - default
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: router;1936-tcp

Since I'm not familiar with routers, it might be a non-concern.
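To illustrate the concern, here is a rough Python sketch of how a `keep` relabel rule behaves: Prometheus joins the source label values with `;` (the default separator) and requires the regex to match the whole joined string. The function name `is_kept` is hypothetical, and it is an assumption that a second router would get a service called `router-foo` with a port named `1936-tcp`.

```python
import re

# Same pattern as the `regex` field in the scrape configuration.
KEEP_REGEX = re.compile(r"router;1936-tcp")


def is_kept(service_name: str, port_name: str, separator: str = ";") -> bool:
    """Return True if a target with these label values survives the keep rule."""
    joined = separator.join([service_name, port_name])
    # Prometheus anchors relabel regexes at both ends; fullmatch mimics that.
    return KEEP_REGEX.fullmatch(joined) is not None


if __name__ == "__main__":
    for service in ("router", "router-foo"):
        print(service, is_kept(service, "1936-tcp"))
```

Under these assumptions, endpoints of `router` are kept while endpoints of `router-foo` are dropped, because the anchored match fails on the longer service name.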

> We should have a role already for an endpoint that can do SAR, it should be a system role from upstream

Do you mean system:auth-delegator?

simonpasquier · 2018-04-12T09:40:22Z

> Since this worked before, I wonder what changed. Was working ootb when I deployed before the rebase.

In my tests, I'm using oc cluster up, and I see that OpenShift Ansible adds the cluster-reader role to the default:router SA. Could that explain why you didn't see any problem?

simonpasquier · 2018-04-17T15:20:43Z

@smarterclayton @liggitt I'd appreciate your guidance regarding this PR. Is it going in the right direction? Or did I miss something?

smarterclayton · 2018-04-18T14:38:08Z

Why the heck is cluster-reader being given to the router (it's probably due to an old hack, @openshift/networking)?

> Does it make sense to extend the Prometheus e2e tests to check that router's metrics are collected then?

We already have an e2e test that verifies the production of them but not the scrape. I think yes, but it doesn't have to block this.

> system:auth-delegator

Yes.

> Multiple routers

Ignore for now.

simonpasquier · 2018-04-19T09:42:18Z

@smarterclayton I've updated the PR to bind the system:auth-delegator cluster role to the default:router SA. Is it what you had in mind?
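As an illustration of what such a binding looks like as a Kubernetes object, here is a minimal Python sketch. It is not the code from this PR. The helper `auth_delegator_binding` and the binding name it generates are hypothetical; the role name and the default:router service account come from the discussion above.

```python
import json


def auth_delegator_binding(namespace: str = "default",
                           service_account: str = "router") -> dict:
    """Build a ClusterRoleBinding granting system:auth-delegator to a service account."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        # Hypothetical binding name, chosen only for this example.
        "metadata": {"name": f"{service_account}-auth-delegator"},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": "system:auth-delegator",
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(auth_delegator_binding(), indent=2))
```

Reusing the upstream role this way avoids adding review-related rules to system:router directly.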

simonpasquier · 2018-04-20T15:03:21Z

/retest

simonpasquier · 2018-04-23T15:10:35Z

/retest

simonpasquier · 2018-04-26T14:40:44Z

/retest

simonpasquier · 2018-05-04T15:12:54Z

IIUC the router's service account is assigned the cluster-reader role by the Ansible installer to support router shards (for reference, BZ and openshift/openshift-ansible#3650). This is mentioned in the documentation too (see here).

The cluster-reader role is a superset of the system:auth-delegator role, which explains why it works when deploying with Ansible.

Now when I assign the system:auth-delegator cluster role to the router's SA (as in 5d7f483) instead of granting the create permission on subjectaccessreviews.authorization.k8s.io, it works on my local machine (using oc cluster up) but fails on the extended_conformance_install tests.

May  2 14:38:55.727: INFO: Running 'oc describe --config=/tmp/cluster-admin.kubeconfig --namespace=e2e-test-router-reencrypt-2gcf9 pod/router-1-lncfm'
May  2 14:38:56.223: INFO: Error running &{/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/oc [oc describe --config=/tmp/cluster-admin.kubeconfig --namespace=e2e-test-router-reencrypt-2gcf9 pod/router-1-lncfm] []   Error from server (NotFound): pods "router-1-lncfm" not found
 Error from server (NotFound): pods "router-1-lncfm" not found
 [] <nil> 0xc421143b60 exit status 1 <nil> <nil> true [0xc420d58808 0xc420d58830 0xc420d58830] [0xc420d58808 0xc420d58830] [0xc420d58810 0xc420d58828] [0x9439b0 0x943ab0] 0xc421f41560 <nil>}:
Error from server (NotFound): pods "router-1-lncfm" not found
May  2 14:38:56.223: INFO: Error retrieving description for pod "router-1-lncfm": exit status 1

The router's logs indicate an authorization problem, but I fail to understand how switching from a single create permission on subjectaccessreviews.authorization.k8s.io to system:auth-delegator triggers the error.

E0502 14:33:05.250935       1 status.go:158] Unable to write router status - please ensure you reconcile your system policy or grant this router access to update route status: routes.route.openshift.io "registry-console" is forbidden: User "system:serviceaccount:default:router" cannot update routes.route.openshift.io/status in the namespace "default": User "system:serviceaccount:default:router" cannot update

simonpasquier · 2018-05-07T15:32:08Z

/retest

flake #19058

simonpasquier · 2018-05-14T14:26:22Z

@smarterclayton can you please take a look at this? for reference #19318 (comment)

smarterclayton · 2018-05-14T14:39:18Z

/lgtm

Sorry for the delay

simonpasquier · 2018-05-15T15:34:10Z

@smarterclayton anything else required to move the PR forward?

smarterclayton · 2018-05-15T16:06:05Z

/approve

openshift-ci-robot · 2018-05-15T16:06:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: simonpasquier, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/cmd/OWNERS~~ [smarterclayton]
~~pkg/oc/admin/router/OWNERS~~ [smarterclayton]
~~test/testdata/OWNERS~~ [smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

simonpasquier · 2018-05-16T11:20:27Z

/retest

simonpasquier · 2018-05-16T12:47:30Z

/retest

flake #19058

simonpasquier · 2018-05-16T14:00:42Z

/test gcp

simonpasquier · 2018-05-17T08:41:07Z

/retest

openshift-bot · 2018-05-17T13:32:56Z

/retest

Please review the full test history for this PR and help us cut down flakes.

simonpasquier · 2018-05-17T15:30:50Z

/retest

Klaas- · 2018-10-11T07:34:54Z

I think this is incomplete. On OpenShift Enterprise 3.10, Prometheus cannot reach the router metrics because the request is blocked by firewalld. Routers do not open port 1936 on their host systems.

simonpasquier · 2018-10-11T07:47:49Z

@Klaas- the firewall issue is tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1552235

Klaas- · 2018-10-11T07:59:35Z

@simonpasquier thanks, I'll follow that bz :)

openshift-ci-robot requested review from knobunc and smarterclayton April 11, 2018 15:53

openshift-ci-robot added the size/XS label Apr 11, 2018

simonpasquier force-pushed the fix-17685 branch from 02328c9 to 5ad9ba5 Compare April 12, 2018 09:29

simonpasquier force-pushed the fix-17685 branch from 5ad9ba5 to 6a90871 Compare April 13, 2018 15:37

openshift-ci-robot added size/S and removed size/XS labels Apr 13, 2018

openshift-ci-robot added size/M and removed size/S labels Apr 19, 2018

ptescher mentioned this pull request Apr 20, 2018

Prometheus - kubernetes-service-endpoints down openshift/openshift-ansible#7986

Closed

simonpasquier force-pushed the fix-17685 branch from 61d68da to cf7804e Compare April 27, 2018 13:44

simonpasquier added 3 commits May 2, 2018 11:33

policy: fix system:router permissions

9736fcf

router: remove useless annotations

3231b87

router: assign system:auth-delegator role

5d7f483

Instead of adding more rules to the system:router role, this change reuses the existing system:auth-delegator role.

simonpasquier force-pushed the fix-17685 branch from cf7804e to 5d7f483 Compare May 2, 2018 13:42

Revert "router: assign system:auth-delegator role"

f4e6656

This reverts commit 5d7f483.

openshift-ci-robot added size/S and removed size/M labels May 4, 2018

openshift-ci-robot assigned smarterclayton May 14, 2018

openshift-ci-robot added the lgtm label May 14, 2018

openshift-ci-robot added the approved label May 15, 2018

openshift-merge-robot merged commit 9bf3637 into openshift:master May 17, 2018

simonpasquier mentioned this pull request May 18, 2018

Prometheus example can't get metrics from the router instance #17685

Closed

simonpasquier deleted the fix-17685 branch May 18, 2018 09:29

simonpasquier mentioned this pull request May 24, 2018

Update Prometheus to scrape the router metrics openshift/openshift-ansible#8512

Merged
