apps: add prometheus metrics for rollout #14796
Conversation
@Kargakis @tnozicka @smarterclayton open for suggestions.
Force-pushed from 04fa119 to 6ce74a3
@smarterclayton the fatal error count on deployer pods indicates that deployer pods went missing or something terrible happened to them.
prometheus.CounterOpts{
    Subsystem: DeployerControllerSubsystem,
    Name:      "failure_count",
    Help:      "Counter that counts total number of deployer pod errors per error type",
Does it make more sense to have a ratio as opposed to an absolute number? Or keep the number of errors that occurred in the last hour, for example?
@Kargakis I don't think Prometheus has that built in; you can query it via the Prometheus console.
Generally you would count errors separately from the total.
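For illustration, counting totals and errors as two raw series and leaving the ratio and time window to the query side could look like this (a minimal sketch; the subsystem and metric names here are made up, not the ones in this PR):

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // Total number of deployer pod syncs, successful or not.
    deployerSyncsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Subsystem: "deployer_controller",
        Name:      "syncs_total",
        Help:      "Total number of deployer pod syncs.",
    })
    // Errors only, partitioned by error type; ratios and windows are
    // computed at query time from these two raw series.
    deployerErrorsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Subsystem: "deployer_controller",
        Name:      "errors_total",
        Help:      "Total number of deployer pod errors, partitioned by error type.",
    }, []string{"type"})
)

func init() {
    prometheus.MustRegister(deployerSyncsTotal, deployerErrorsTotal)
}

The hourly error ratio then falls out of rate() over the two series in the Prometheus console, without baking a window into the metric itself.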
prometheus.GaugeOpts{
    Subsystem: DeploymentConfigControllerSubsystem,
    Name:      "set_condition",
    Help:      "Gauge measuring changes to deployment conditions per condition, reason and the status",
This metric is pretty vague. I would be interested in how often a deployment flips between True/False for Available, how many deployments are unavailable, and how many deployments have failed over a period of time.
@Kargakis you can query unavailable deployments via the Prometheus console (just query the condition counter and specify a time range).
Yeah, these metrics should be as raw as possible.
btw, I think we should always set a reason for the condition; empty reasons look bad in metrics.
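For illustration, one way to keep empty reasons out of the metrics, a label-side fallback rather than changing the controller to always set a reason (a sketch; the helper name and the "Unknown" fallback are assumptions):

// Default an empty condition reason before using it as a metric label, so
// empty label values do not show up in the recorded series. Sketch only.
func reasonLabel(reason string) string {
    if reason == "" {
        return "Unknown"
    }
    return reason
}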
const DeploymentConfigControllerSubsystem = "deploymentconfig_controller"

var (
    deploymentConditionCounter = prometheus.NewGaugeVec(
@smarterclayton not sure if this should be a gauge or a counter (I decrease the number when we remove the condition)
Gauge. Counters must be monotonically increasing.
deploymentConditionCounter = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Subsystem: DeploymentConfigControllerSubsystem,
        Name:      "set_condition",
Is this consistent with the naming of other Prometheus metrics?
@smarterclayton it looks OK with the subsystem set properly; it should probably be plural (set_conditions).
(upstream uses plurals for counters)
pkg/deploy/util/util.go (outdated)
status.Conditions = filterOutCondition(status.Conditions, condType)
// Decrease counters for conditions we are going to remove
hrm.
fiddling with counters like this is dangerous - it's too easy to get them wrong. I would recommend instead simply counting the number of conditions.
Or doing this at a higher level. Since this is observational from the cache, you might just want to add a ResourceEventHandler on the cache and count those.
Or have remove_condition counters as well; I think that would be cleaner.
I just found out why ResourceEventHandler is a much better solution ;-) thx.
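For reference, the ResourceEventHandler approach sketched at a high level; the function and the re-count callback here are assumptions for illustration, not the final code:

package metrics

import "k8s.io/client-go/tools/cache"

// Observe condition changes from the shared informer cache instead of
// mutating gauges inside the controller's update path. Sketch only.
func registerConditionMetrics(informer cache.SharedIndexInformer, recount func(obj interface{})) {
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        // Count conditions when an object shows up in the cache.
        AddFunc: recount,
        // Re-count on every update rather than incrementing/decrementing by hand.
        UpdateFunc: func(old, cur interface{}) { recount(cur) },
        // Deletions would decrement or trigger a re-count; omitted in this sketch.
        DeleteFunc: func(obj interface{}) {},
    })
}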
Force-pushed from 27f7b40 to f96564c
Force-pushed from f96564c to d458bf8
[test]
Force-pushed from 28804b6 to 4d9a804
for _, n := range new {
    found := false
    for _, o := range old {
        if found = n == o; found {
This hurts... Can you simply do
if n == o {
    found = true
    break
}
sure, fixed.
Just a nit, LGTM in general. @smarterclayton do we treat metrics as API?
Force-pushed from 4d9a804 to fdbd630
@Kargakis @smarterclayton added a deployer_controller_failed_rollouts metric that counts all failed rollouts ;-) so if something goes wrong suddenly during an upgrade, we will know.
Any objections to merging this now so we can have these for the next upgrade? We can follow up with kube-state-metrics in 3.7.
}

func updateFailedRolloutsMetrics(oldRC, newRC *v1.ReplicationController) {
    if oldRC != nil && deployutil.IsFailedDeployment(oldRC) {
@Kargakis is this ok?
(this is a counter so it will just go up, which is fine)
Yes, it's fine.
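For context, the transition-only bump being agreed on, spelled out as a sketch; the failedRolloutsTotal counter, its namespace label, and the early return are assumptions, while the signature and deployutil.IsFailedDeployment come from the diff above and reuse its imports:

// Bump a monotonic counter only on the transition into the failed state,
// so re-syncs of an already-failed RC are not double counted. Sketch only.
func updateFailedRolloutsMetrics(oldRC, newRC *v1.ReplicationController) {
    if oldRC != nil && deployutil.IsFailedDeployment(oldRC) {
        // Already counted when it first failed.
        return
    }
    if deployutil.IsFailedDeployment(newRC) {
        failedRolloutsTotal.WithLabelValues(newRC.Namespace).Inc()
    }
}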
flake: #14689 [test]
for _, n := range new {
    found := false
    for _, o := range old {
        if n == o {
This will fail in the case of a Progressing condition where the only difference between the new and the old condition is the lastTransitionTime, right?
@Kargakis I guess that is fine? It is not a new condition, just a timestamp update.
If there is a timestamp update, found ends up being false, so you increment the metric here and decrement it below. Also, we don't really use == for comparing structs. I would prefer this to be more explicit about the fields we care about.
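Something along these lines is what "explicit about the fields we care about" means here, as a drop-in for the inner loop above (a sketch; Type, Status and Reason are assumed to be the fields that matter):

// Compare only the fields the metric cares about, so a bare
// lastTransitionTime refresh does not register as a new condition.
if n.Type == o.Type && n.Status == o.Status && n.Reason == o.Reason {
    found = true
    break
}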
gotcha, i will update this, thanks
Force-pushed from b3cdbc6 to 1db906c
@Kargakis comparison fixed, I guess this is ready to go.
[severity:blocker]
Force-pushed from 3084879 to 4e9f491
We shouldn't merge this for 3.6 because we aren't gathering it yet.
@smarterclayton updated, it follows the way the build metrics are done (lister, simple counters)...
}

generation := d.Status.ObservedGeneration
ch <- prometheus.MustNewConstMetric(activeRolloutCountDesc, prometheus.GaugeValue, float64(generation), []string{
@smarterclayton I'm not sure if generation is the right value here; the builds are using the start timestamp (unix).
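For comparison, following the build metrics would mean emitting the rollout start time instead of the generation (a sketch; the label values passed here, namespace and name, are assumptions about activeRolloutCountDesc):

// Emit the rollout start time (unix seconds) as the gauge value, the way
// the build metrics do, rather than the observed generation. Sketch only.
started := float64(d.CreationTimestamp.Unix())
ch <- prometheus.MustNewConstMetric(activeRolloutCountDesc, prometheus.GaugeValue, started, d.Namespace, d.Name)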
var available, failed, cancelled float64

for _, d := range result {
    if util.IsTerminatedDeployment(d) {
result is DCs but these helpers are for RCs?
@Kargakis yeah, brainfart... it is all fixed :)
Force-pushed from e8c06b5 to 1f771da
// Collect implements the prometheus.Collector interface.
func (c *appsCollector) Collect(ch chan<- prometheus.Metric) {
    result, err := c.lister.List(labels.Everything())
Why do we need the extra list here? We already do one inside the main sync loop of the controller. Can't you use those results?
I guess the Prometheus collector schedules the list. If I pass the main controller loop's rcList there might be a race, or I would have to read-lock it? Not sure if it is worth it, since this list is served from the cache IMHO.
Discussed IRL, this is fine.
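For the record, the shape of the cache-backed collector under discussion: it lists from the informer-backed lister at scrape time, with no extra API round trip. A sketch only; the desc names and the completed/failed split are illustrative, while IsTerminatedDeployment and IsFailedDeployment appear elsewhere in this diff and are assumed to be imported under the util alias here:

// List RCs from the informer-backed lister at scrape time and emit
// aggregate gauges for terminated rollouts. Sketch only.
func (c *appsCollector) Collect(ch chan<- prometheus.Metric) {
    result, err := c.lister.List(labels.Everything())
    if err != nil {
        // Listing from the cache should not fail; skip this scrape if it does.
        return
    }
    var completed, failed float64
    for _, d := range result {
        if !util.IsTerminatedDeployment(d) {
            continue
        }
        if util.IsFailedDeployment(d) {
            failed++
        } else {
            completed++
        }
    }
    ch <- prometheus.MustNewConstMetric(completedRolloutsDesc, prometheus.GaugeValue, completed)
    ch <- prometheus.MustNewConstMetric(failedRolloutsDesc, prometheus.GaugeValue, failed)
}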
Force-pushed from 96cad7b to 383fe38
}

// TODO: possible time screw?
durationSeconds := time.Now().Unix() - d.CreationTimestamp.Unix()
Isn't the time the deployer pod started running more precise than the creation of the RC for when the new rollout started?
@Kargakis I think there is some time skew here, but the deployer pod might already be gone and I would need to get that pod, which would slow down the collection?
Force-pushed from 383fe38 to a6be016
@smarterclayton tests are passing now, PTAL
I might use #16347 for started/completed after that lands; that should not block this.
@smarterclayton bump
}

// Record duration in seconds for active rollouts
// TODO: possible time screw?
time "skew". Time "screw" would probably be very different.
classic @mfojtik comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mfojtik, smarterclayton. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files.
You can indicate your approval by writing /approve in a comment.
Automatic merge from submit-queue (batch tested with PRs 14784, 16418, 16406, 16431, 14796)
Add basic prometheus metrics for deployment config rollouts. The implementation should be consistent with build metrics: #15440