sdn: make pod operation metrics more useful and collectable #17250

dcbw · 2017-11-09T19:27:44Z

The pod operation error metrics were in the wrong place to capture the
overall pod setup/teardown operation. Move them to capture everything.

Next, the labels of the Latency metric meant that every observation was
a unique metric and no statistics could be determined from them in
aggregate. Change that (and pod errors) to follow the Kubelet dockershim
DockerOperations[Latency|Errors] metric pattern with a label for the
operation instead of the sandbox.

Fixes: #17494

@danwinship @eparis @openshift/networking @knobunc

danwinship · 2017-11-09T20:14:08Z

LGTM, but while double-checking the err vs result.Err discrepancy between the two pod.go cases, I noticed that there's a bug there a few lines before your patch:

			result.Response, err = json.Marshal(ipamResult)
			if result.Err == nil {

should be if err == nil on the second line. (result.Err is always nil at this point, but then, the json.Marshal() call shouldn't ever fail anyway, so probably this bug has never actually mattered...)
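The corrected check can be sketched in isolation as follows. This is not the actual pod.go code: `podResult` and `marshalResult` are hypothetical stand-ins, and only the `json.Marshal` call and the test on `err` mirror the snippet quoted above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podResult is a hypothetical stand-in for the result struct used in pod.go.
type podResult struct {
	Response []byte
	Err      error
}

// marshalResult encodes v and records any marshal failure on the result.
// It checks the error returned by json.Marshal rather than result.Err,
// which is still nil at this point.
func marshalResult(v interface{}) *podResult {
	result := &podResult{}
	var err error
	result.Response, err = json.Marshal(v)
	if err != nil {
		result.Err = fmt.Errorf("failed to marshal pod operation result: %v", err)
	}
	return result
}

func main() {
	ok := marshalResult(map[string]string{"ip": "10.128.0.5"})
	fmt.Println(string(ok.Response), ok.Err)

	// Channels cannot be encoded as JSON, so this forces the error path.
	bad := marshalResult(make(chan int))
	fmt.Println(bad.Err)
}
```

Testing `result.Err` instead would always take the success branch, because nothing has assigned to it yet. That only matters if the marshal call actually fails.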

imcsk8 · 2017-11-10T02:14:41Z

LGTM

dcbw · 2017-11-22T21:02:03Z

@DirectXMan12 might be interesting to you too; the problem we had with NaN was the breakdown of metrics by labels; see the first comment for more info.

dcbw · 2017-11-22T21:03:00Z

> should be if err == nil on the second line. (result.Err is always nil at this point, but then, the json.Marshal() call shouldn't ever fail anyway, so probably this bug has never actually mattered...)

@danwinship fixed in a separate commit here

knobunc · 2017-11-22T21:24:45Z

LGTM... holding for 3.9

dcbw · 2017-11-28T17:48:39Z

/test cmd

dcbw · 2017-11-28T18:25:49Z

/test cmd issue #16317

dcbw · 2017-11-29T17:22:33Z

/test unit
/test end_to_end

knobunc · 2017-12-13T20:40:02Z

/lgtm

openshift-ci-robot · 2017-12-13T20:40:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcbw, knobunc

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

~~pkg/network/OWNERS~~ [dcbw,knobunc]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

openshift-merge-robot · 2017-12-14T06:27:06Z

Automatic merge from submit-queue.

akram · 2018-08-28T08:03:41Z

@smarterclayton It is supposed to be fixed here, can you remove the drop for these metrics in https://github.com/openshift/origin/blob/master/examples/prometheus/prometheus.yaml#L556

smarterclayton · 2018-08-28T14:00:20Z

We don’t use that anymore for real installs. Changes need to be made to cluster monitoring operator.

openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 9, 2017

openshift-merge-robot assigned soltysh and rajatchopra Nov 9, 2017

openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2017

soltysh removed their assignment Nov 21, 2017

dcbw force-pushed the make-pod-metrics-more-useful branch from e68908c to cca6de3 Compare November 22, 2017 21:00

openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 22, 2017

dcbw changed the title ~~sdn: make PodSetup/TeardownErrors metric more useful~~ sdn: make pod operation metrics more useful and collectable Nov 22, 2017

dcbw force-pushed the make-pod-metrics-more-useful branch 2 times, most recently from 7fafabf to 9f031d3 Compare November 27, 2017 17:01

dcbw added 2 commits November 28, 2017 13:36

sdn: handle error from JSON marshal at pod setup

caf5f73

dcbw force-pushed the make-pod-metrics-more-useful branch from 9f031d3 to caf5f73 Compare November 28, 2017 19:37

dcbw mentioned this pull request Dec 1, 2017

openshift-sdn metrics are ~1/3 of all node metrics series #17494

Closed

openshift-ci-robot assigned knobunc Dec 13, 2017

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 13, 2017

openshift deleted a comment from openshift-merge-robot Dec 13, 2017

openshift-merge-robot merged commit 3fba38d into openshift:master Dec 14, 2017
