Clean Prometheus example #17992

Merged

Conversation

mjudeikis
Contributor

Main changes:

  1. Unified secret, config, and data volume names with a common suffix so it is easier to read and track which is which.
S: prometheus-tls		-> prometheus-tls-secret
...
C: prometheus                   -> prometheus-config
C: prometheus-alertmanage       -> prometheus-alertmanager-config
D: emptyDir		        -> prometheus-data
...
  2. Added a clear SMTP example with multiple receivers, one for the alert buffer and one for SMTP (see the sketch below).
  3. Changed the order of elements so it is easier to read.
  4. Added a proxy for alertmanager as a sixth container.

If/when this gets merged, I will update the Ansible deployment too to reflect this change.

I did this for a client deployment as part of their monitoring solution evaluation.
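
For reference, a minimal sketch of the kind of alertmanager.yml layout described in item 2. The SMTP host, credentials, addresses, and the route block are illustrative placeholders, not values from the template; only the receiver names and the webhook URL come from the example itself:

global:
  # placeholder SMTP settings for mail alerts
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'changeme'

route:
  # everything goes to the alert buffer by default
  receiver: alert-buffer-wh
  routes:
  # assumed routing rule: critical alerts also go out by mail
  - match:
      severity: critical
    receiver: mail

receivers:
- name: alert-buffer-wh
  webhook_configs:
  - url: http://localhost:9099/topics/alerts
- name: mail
  email_configs:
  - to: 'ops@example.com'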

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 4, 2018
@mjudeikis
Contributor Author

I tested this in my lab and can see alerts being generated in alertmanager and pushed to both receivers.

@mjudeikis
Contributor Author

@zgalor @smarterclayton Can you please review this? It's just cosmetic name changes and adding a proxy for the alertmanager, so it can be used by "Enterprise" clients as it is now.

@zgalor (Contributor) left a comment

LGTM

@mjudeikis
Contributor Author

/retest

namespace: "${NAMESPACE}"
data:
  alertmanager.yml: |
    global:

    #smtp mail configuration for mail alerts
Contributor

This won't work, so please don't put it in the example. The alert manager should work out of the box, even if it does nothing.
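
Something as small as this would satisfy that, a sketch of a do-nothing default config (the receiver name is arbitrary): a route pointing at a receiver with no notifier configs is valid and simply drops alerts.

route:
  receiver: default
receivers:
- name: default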

Contributor

btw we need to set this up in api.ci

Contributor Author

@Kargakis point me where, and I can do this.

receivers:
- name: alert-buffer-wh
  webhook_configs:
  - url: http://localhost:9099/topics/alerts
- name: mail
Contributor

Remove, this won't work out of the box.

Contributor Author

Hmm, it did work when I tested it. I can retest and confirm; I might have done something wrong somewhere.

Contributor Author

Oh, got you, sorry. Will remove.


restartPolicy: Always
volumes:
#prometheus mounts
Contributor

Please remove these, use newlines to separate the sections if necessary.

secretName: alerts-proxy
- name: alerts-tls
secretName: prometheus-alertmanager-tls
- name: prometheus-alertmanager-proxy-secret
Contributor

Remove the prometheus- prefix on these; prometheus-alertmanager-proxy is redundant, alertmanager-proxy is sufficient.

@smarterclayton
Contributor

You need to run hack/update-generated-bindata.sh
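
For reference, roughly (assuming a standard origin checkout under GOPATH):

cd "$GOPATH/src/github.com/openshift/origin"
hack/update-generated-bindata.sh
git status   # the regenerated bindata file shows up as modified and gets committed with the example change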

@mjudeikis mjudeikis force-pushed the clean-prometheus-example branch from 796272a to 83b0ea9 Compare January 8, 2018 21:30
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 8, 2018
@mjudeikis mjudeikis force-pushed the clean-prometheus-example branch from 83b0ea9 to c1563d1 Compare January 8, 2018 21:33
Add proxy for alertmanager
@mjudeikis mjudeikis force-pushed the clean-prometheus-example branch from c1563d1 to 45eb81a Compare January 8, 2018 21:34
@mjudeikis
Contributor Author

mjudeikis commented Jan 8, 2018

@smarterclayton done.

oc new-project myproject
oc create -f /data/go/src/github.com/openshift/origin/examples/prometheus/prometheus.yaml 
oc new-app prometheus -p NAMESPACE=myproject
oc get pods
NAME           READY     STATUS    RESTARTS   AGE
prometheus-0   6/6       Running   0          50s
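
As an extra check that the new sixth container (the alertmanager proxy) came up, listing the container names from the pod spec also works; this is plain jsonpath, nothing specific to this template:

oc get pod prometheus-0 -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'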

Now only the Prometheus resources carry the prometheus prefix; alerts and alertmanager stay as they are.

Should we start looking into splitting this out to use an alertmanager mesh too?

@smarterclayton
Contributor

This looks pretty good; moderate concern about how this would break when someone tries to upgrade. Agree it's a cleaner setup.

@mjudeikis
Contributor Author

@smarterclayton As far as I know we don't have an upgrade pattern yet? Plus, as a technology preview we are not actively deploying this to customers, so I don't think there is a big need for a seamless upgrade path (considering we are still missing HA).
That was the reason I wanted to get this in.

I think the next pending changes should be (in priority order):

  1. Add alertmanager mesh configuration and split Prometheus + Alertmanager into two StatefulSet pods so we have HA there. (I don't know if anybody is working on this, but I would like to take the Trello card if one exists.)
  2. Update the Ansible playbook to use the new template (if this gets merged I can pick this up too).
  3. Leave the alerts buffer for the CM-Ops team to sort out (?).
  4. Alert reloader (a mailing list thread from @zgalor is going around).

So changes will need to be made either way, and I think the faster we can do them, the less impact we will have on future deployments.

@smarterclayton
Contributor

split Prometheus + Alertmanager into two StatefulSet pods so we have HA there.

There is an HA design doc Simon is proposing; we are debating whether we actually want to split right now. I think I'm OK with updating this, it'll just be more work in the short term to fix the credentials in Online.

@smarterclayton
Contributor

@openshift/sig-networking testing panic

[Area:Networking] multicast when using a plugin that does not isolate namespaces by default 
  should block multicast traffic [Suite:openshift/conformance/parallel]
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/networking/multicast.go:25

[BeforeEach] [Top Level]
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:53
[BeforeEach] when using a plugin that does not isolate namespaces by default
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/networking/util.go:369
[BeforeEach] when using a plugin that does not isolate namespaces by default
  /tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:134
STEP: Creating a kubernetes client
Jan  8 22:44:29.407: INFO: >>> kubeConfig: /tmp/cluster-admin.kubeconfig
STEP: Building a namespace api object
Jan  8 22:44:29.763: INFO: configPath is now "/tmp/extended-test-multicast-4tz9l-dt9jn-user.kubeconfig"
Jan  8 22:44:29.763: INFO: The user is now "extended-test-multicast-4tz9l-dt9jn-user"
Jan  8 22:44:29.763: INFO: Creating project "extended-test-multicast-4tz9l-dt9jn"
Jan  8 22:44:29.861: INFO: Waiting on permissions in project "extended-test-multicast-4tz9l-dt9jn" ...
STEP: Waiting for a default service account to be provisioned in namespace
[It] should block multicast traffic [Suite:openshift/conformance/parallel]
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/networking/multicast.go:25
[AfterEach] when using a plugin that does not isolate namespaces by default
  /tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:135
STEP: Collecting events from namespace "extended-test-multicast-4tz9l-dt9jn".
STEP: Found 0 events.
Jan  8 22:44:30.012: INFO: POD                                                                 NODE  PHASE    GRACE  CONDITIONS
Jan  8 22:44:30.012: INFO: docker-registry-1-deploy                                                  Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:15:55 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: registry-console-1-deploy                                                 Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:15:59 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: router-1-deploy                                                           Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:15:31 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: pod-configmaps-d9f8b744-f4c4-11e7-9c40-0e51c2933872                       Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:39:56 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: pod-configmaps-d412345a-f4c4-11e7-894f-0e51c2933872                       Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:39:46 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: pod-configmaps-e0f40cec-f4c4-11e7-be26-0e51c2933872                       Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:40:08 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: client-containers-de4ea5f9-f4c4-11e7-94f4-0e51c2933872                    Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:40:03 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: metadata-volume-cd970e6e-f4c4-11e7-97af-0e51c2933872                      Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:39:35 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: privileged-pod                                                            Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:39:40 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: pod-e3300e5b-f4c4-11e7-bb19-0e51c2933872                                  Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:40:12 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: pod-cc041286-f4c4-11e7-9433-0e51c2933872                                  Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:39:33 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: sysctl-01a26392-f4c5-11e7-96b1-0e51c2933872                               Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:41:03 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: optimized-build                                                           Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:18:57 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: test-docker-1-build                                                       Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:18:47 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: test-sti-1-build                                                          Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:23:56 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: myphp-1-build                                                             Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:34:13 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: myphp-1-build                                                             Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:18:53 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: sample-build-1-build                                                      Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:34:41 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: minreadytest-1-deploy                                                     Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:39:50 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: deployment-simple-1-deploy                                                Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:24:30 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: docker-build-1-build                                                      Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:40:46 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: ruby-sample-build-ts-1-build                                              Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:24:00 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: ruby-sample-build-td-1-build                                              Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:30:16 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: a234567890123456789012345678901234567890123456789012345678-1-build        Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:24:11 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: execpodnbk4m                                                              Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:39:38 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: s2i-build-quota-1-build                                                   Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:40:10 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: cakephp-mysql-example-1-build                                             Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:39:50 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: mysql-1-deploy                                                            Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:39:50 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: prometheus-0                                                              Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:29:14 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: bootstrap-autoapprover-0                                                  Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2018-01-08 22:17:20 +0000 UTC Unschedulable no nodes available to schedule pods}]
Jan  8 22:44:30.012: INFO: 
STEP: Dumping a list of prepulled images on each node...
Jan  8 22:44:30.047: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
STEP: Destroying namespace "extended-test-multicast-4tz9l-dt9jn" for this suite.
Jan  8 22:44:36.115: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Jan  8 22:44:37.584: INFO: namespace: extended-test-multicast-4tz9l-dt9jn, resource: bindings, ignored listing per whitelist
Jan  8 22:44:37.584: INFO: namespace extended-test-multicast-4tz9l-dt9jn deletion completed in 7.521496845s


•! Panic [8.177 seconds]
[Area:Networking] multicast
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/networking/multicast.go:20
  when using a plugin that does not isolate namespaces by default
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/networking/util.go:368
    should block multicast traffic [Suite:openshift/conformance/parallel] [It]
    /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/networking/multicast.go:25

    Test Panicked
    runtime error: index out of range
    /usr/local/go/src/runtime/panic.go:491

    Full Stack Trace
    	/usr/local/go/src/runtime/panic.go:491 +0x283
    github.com/openshift/origin/test/extended/networking.testMulticast(0xc421667560, 0xc420ea7960, 0x0, 0x0)
    	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/networking/multicast.go:97 +0x8d9
    github.com/openshift/origin/test/extended/networking.glob..func2.1.1()
    	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/networking/multicast.go:26 +0x37
    github.com/openshift/origin/vendor/github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0xc4218bdbc0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:160 +0x40c
    github.com/openshift/origin/test/extended.TestExtended(0xc42076c870)
    	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/extended_test.go:34 +0x40
    testing.tRunner(0xc42076c870, 0x4221fe8)

@danwinship
Contributor

danwinship commented Jan 11, 2018

@smarterclayton it looks like there were 0 schedulable nodes at that point, which also caused a bunch of other failures, but that particular test only checks for

if len(nodes.Items) == 1 {
	e2e.Skipf("Only one node is available in this environment")
}

and assumes that if there isn't 1 node, there must be more than 1. We can fix that, but anyway the real problem here is that there were no functioning nodes at that point for some reason.
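
A possible tightening along those lines (just a sketch, not necessarily the fix that lands) would be to skip whenever fewer than two schedulable nodes are found:

if len(nodes.Items) < 2 {
	e2e.Skipf("Need at least two schedulable nodes to test multicast, found %d", len(nodes.Items))
}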

@smarterclayton
Contributor

I don't think you have to check for that; that is indeed a totally broken cluster. Hrm. Bootstrapping could have regressed, or we could have a flake.

@smarterclayton
Contributor

@aweiteka can you look over it?

@mjudeikis
Contributor Author

Anything I can help with here? I got in touch with Simon, so I will start looking at his HA design and what we need to change. What's the plan for Prometheus GA? 3.9?

@smarterclayton
Contributor

/retest

@smarterclayton
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 14, 2018
@mjudeikis
Contributor Author

Thanks @smarterclayton. Will change the Ansible now too.

@aweiteka
Contributor

/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aweiteka, mjudeikis, smarterclayton

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@mjudeikis
Contributor Author

/retest

@openshift-merge-robot
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-merge-robot
Contributor

Automatic merge from submit-queue (batch tested with PRs 17992, 18091, 18118).

@openshift-merge-robot openshift-merge-robot merged commit ad95320 into openshift:master Jan 16, 2018
@openshift-ci-robot

@mjudeikis: The following test failed, say /retest to rerun them all:

Test name: ci/openshift-jenkins/extended_conformance_install
Commit: 45eb81a
Rerun command: /test extended_conformance_install

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
