Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Prometheus to get metrics from the router #19318

Merged
merged 4 commits into from
May 17, 2018

Conversation

simonpasquier
Copy link
Contributor

Fix for #17685. Without this PR, the router can't validate the Prometheus token because it lacks the following permission:

- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create

It also removes the prometheus.io/... and prometheus.openshift.io/... annotations on the router's service since they are unused and create targets that Prometheus can't scrape.

I'm not clear on what happens when several routers are deployed. Most likely only the first one associated to the router service will be scraped but not the others.

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 11, 2018
@simonpasquier
Copy link
Contributor Author

@smarterclayton FYI

@smarterclayton
Copy link
Contributor

All routers should be scraped, the instance and namespace are enough to disambiguate.

We should have a role already for an endpoint that can do SAR, it should be a system role from upstream

@smarterclayton
Copy link
Contributor

Since this worked before, I wonder what changed. Was working ootb when I deployed before the rebase. @liggitt anything maybe get lost with roles?

@liggitt
Copy link
Contributor

liggitt commented Apr 12, 2018

@liggitt anything maybe get lost with roles?

Don't think so… we have unit tests that make visible any changes in bootstrap permissions, and nothing related to this showed up

@simonpasquier
Copy link
Contributor Author

Does it make sense to extend the Prometheus e2e tests to check that router's metrics are collected then?

All routers should be scraped, the instance and namespace are enough to disambiguate.

If you're talking about multiple replicas of the default router service, I agree. My concern is more about what happens when another router is deployed (eg oc adm router router-foo ...) since the current Prometheus scrape configuration looks only for endpoints associated to the service named router

kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- default
relabel_configs:
- source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: router;1936-tcp

Since I'm not familiar with routers, it might be a non-concern.

We should have a role already for an endpoint that can do SAR, it should be a system role from upstream

Do you mean system:auth-delegator?

@simonpasquier
Copy link
Contributor Author

Since this worked before, I wonder what changed. Was working ootb when I deployed before the rebase.

In my tests, I'm using oc cluster up and I see that OpenShift Ansible is adding the cluster-reader role to the default:router SA. Can it be the explanation why you didn't see any problem?

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 13, 2018
@simonpasquier
Copy link
Contributor Author

@smarterclayton @liggitt I'd appreciate your guidance regarding this PR. Is it going in the right direction? Or did I miss something?

@smarterclayton
Copy link
Contributor

Why the heck is cluster-reader being given to the router (Its probably due to an old hack @openshift/networking)?

Does it make sense to extend the Prometheus e2e tests to check that router's metrics are collected then?

We already have an e2e test that verifies the production of them but not the scrape. I think yes, but it doesn't have to block this.

system:auth-delegator

Yes.

Multiple routers

Ignore for now.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 19, 2018
@simonpasquier
Copy link
Contributor Author

@smarterclayton I've updated the PR to bind the system:auth-delegator cluster role to the default:router SA. Is it what you had in mind?

@simonpasquier
Copy link
Contributor Author

/retest

@simonpasquier
Copy link
Contributor Author

/retest

1 similar comment
@simonpasquier
Copy link
Contributor Author

/retest

Instead of adding more rules to the system:router role, this change
reuses the existing system:auth-delegator role.
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 4, 2018
@simonpasquier
Copy link
Contributor Author

IIUC the router's service account is assigned the cluster-reader role by the Ansible installer to support router shards (for reference, BZ and openshift/openshift-ansible#3650). This is mentioned in the documentation too (see here).

The cluster-reader role is a super-set of the system:auth-delegator role which explains why it works when deploying with Ansible.

Now when I assign the system:auth-delegator cluster role to the router's SA instead of the create permission on subjectaccessreviews.authorization.k8s.io (as in 5d7f483), it works on my local machine (using oc cluster up) but it fails on the extended_conformance_install tests.

May  2 14:38:55.727: INFO: Running 'oc describe --config=/tmp/cluster-admin.kubeconfig --namespace=e2e-test-router-reencrypt-2gcf9 pod/router-1-lncfm'
May  2 14:38:56.223: INFO: Error running &{/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/oc [oc describe --config=/tmp/cluster-admin.kubeconfig --namespace=e2e-test-router-reencrypt-2gcf9 pod/router-1-lncfm] []   Error from server (NotFound): pods "router-1-lncfm" not found
 Error from server (NotFound): pods "router-1-lncfm" not found
 [] <nil> 0xc421143b60 exit status 1 <nil> <nil> true [0xc420d58808 0xc420d58830 0xc420d58830] [0xc420d58808 0xc420d58830] [0xc420d58810 0xc420d58828] [0x9439b0 0x943ab0] 0xc421f41560 <nil>}:
Error from server (NotFound): pods "router-1-lncfm" not found
May  2 14:38:56.223: INFO: Error retrieving description for pod "router-1-lncfm": exit status 1

The router's logs tell that it is an authorization problem but I fail to understand how switching from a single create permission on subjectaccessreviews.authorization.k8s.io to system:auth-delegator triggers the error.

E0502 14:33:05.250935       1 status.go:158] Unable to write router status - please ensure you reconcile your system policy or grant this router access to update route status: routes.route.openshift.io "registry-console" is forbidden: User "system:serviceaccount:default:router" cannot update routes.route.openshift.io/status in the namespace "default": User "system:serviceaccount:default:router" cannot update 

@simonpasquier
Copy link
Contributor Author

/retest

flake #19058

@simonpasquier
Copy link
Contributor Author

@smarterclayton can you please take a look at this? for reference #19318 (comment)

@smarterclayton
Copy link
Contributor

/lgtm

Sorry for the delay

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 14, 2018
@simonpasquier
Copy link
Contributor Author

@smarterclayton anything else required to move the PR forward?

@smarterclayton
Copy link
Contributor

/approve

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: simonpasquier, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2018
@simonpasquier
Copy link
Contributor Author

/retest

@simonpasquier
Copy link
Contributor Author

/retest

flake #19058

@simonpasquier
Copy link
Contributor Author

/test gcp

@simonpasquier
Copy link
Contributor Author

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@simonpasquier
Copy link
Contributor Author

/retest

@Klaas-
Copy link

Klaas- commented Oct 11, 2018

I think this is incomplete. On openshift enterprise 3.10 prometheus can not reach the router metrics because it's blocked by firewalld. Routers are not opening the port 1936 on their host systems.

@simonpasquier
Copy link
Contributor Author

@Klaas- the firewall issue is tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1552235

@Klaas-
Copy link

Klaas- commented Oct 11, 2018

@simonpasquier thanks, I'll follow that bz :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants