sdn: make pod operation metrics more useful and collectable #17250

Merged

Conversation

dcbw
Contributor

@dcbw dcbw commented Nov 9, 2017

The pod operation error metrics were in the wrong place to capture the
overall pod setup/teardown operation. Move them to capture everything.
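One way to count errors for the whole operation rather than a single step is a deferred check on the named return value. The sketch below is only an illustration under stated assumptions: `runPodOperation`, `errorCounts`, and `errStep` are hypothetical names, and a plain map stands in for the real error metric.

```go
package main

import (
	"errors"
	"fmt"
)

// errorCounts is a hypothetical stand-in for an error counter keyed by
// operation name; it is not the metric type used in the PR.
var errorCounts = map[string]int{}

// errStep is a sample error returned by a failing step.
var errStep = errors.New("step failed")

// runPodOperation runs every step of an operation and counts a failure once,
// no matter which step produced it.
func runPodOperation(op string, steps ...func() error) (err error) {
	// The deferred function reads the final value of the named return, so it
	// covers all steps below instead of only one of them.
	defer func() {
		if err != nil {
			errorCounts[op]++
		}
	}()
	for _, step := range steps {
		if err = step(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ok := func() error { return nil }
	fail := func() error { return errStep }

	_ = runPodOperation("setup", ok, fail) // fails in a late step
	_ = runPodOperation("setup", ok, ok)   // succeeds
	_ = runPodOperation("teardown", fail)  // fails in the first step

	fmt.Println(errorCounts["setup"], errorCounts["teardown"])
}
```

Placing the check at the outermost level means a failure in a late step is counted the same way as one in the first step.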

Next, the Latency metric was labeled by sandbox, so every observation
became its own unique metric and no aggregate statistics could be
computed from them. Change it (and the pod error metrics) to follow the
Kubelet dockershim DockerOperations[Latency|Errors] metric pattern, which
uses a label for the operation instead of the sandbox.
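The effect of the label choice can be shown with a small stdlib-only sketch. It assumes a simplified `series` type (a map from label value to observations) in place of the real metric type, and the sandbox IDs are made up; it is not the Prometheus client API.

```go
package main

import "fmt"

// series maps a label value to the observations recorded under it. It is a
// simplified stand-in for a labeled latency metric.
type series map[string][]float64

// observe records one latency under the given label value.
func (s series) observe(label string, seconds float64) {
	s[label] = append(s[label], seconds)
}

// mean returns the average of the observations under one label value.
func (s series) mean(label string) float64 {
	obs := s[label]
	if len(obs) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range obs {
		sum += v
	}
	return sum / float64(len(obs))
}

// recordBoth records the same latencies twice: once labeled by a made-up
// sandbox ID and once labeled by operation name.
func recordBoth(latencies []float64) (bySandbox, byOperation series) {
	bySandbox, byOperation = series{}, series{}
	for i, l := range latencies {
		bySandbox.observe(fmt.Sprintf("sandbox-%d", i), l)
		byOperation.observe("setup", l)
	}
	return bySandbox, byOperation
}

func main() {
	bySandbox, byOperation := recordBoth([]float64{0.5, 1.0, 1.5})
	// Per-sandbox labels give one series per observation; the operation
	// label gives a single series that can be averaged.
	fmt.Println(len(bySandbox), len(byOperation), byOperation.mean("setup"))
}
```

With the sandbox label, each series holds exactly one observation, so nothing can be averaged across pods. With the operation label, all setup latencies land in one series.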

Fixes: #17494

@danwinship @eparis @openshift/networking @knobunc

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 9, 2017
@openshift-merge-robot openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2017
@danwinship
Contributor

LGTM, but while double-checking the err vs result.Err discrepancy between the two pod.go cases, I noticed that there's a bug there a few lines before your patch:

			result.Response, err = json.Marshal(ipamResult)
			if result.Err == nil {

should be if err == nil on the second line. (result.Err is always nil at this point, but then, the json.Marshal() call shouldn't ever fail anyway, so probably this bug has never actually mattered...)
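A self-contained sketch of the difference between the two checks, assuming a hypothetical `podResult` struct with `Response` and `Err` fields. The real type and the surrounding logic in pod.go are not reproduced here, and a channel value is used only because `json.Marshal` is known to reject it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podResult is a hypothetical stand-in for the result struct in pod.go.
type podResult struct {
	Response []byte
	Err      error
}

// marshalUnchecked mirrors the quoted snippet: it tests result.Err, which
// json.Marshal never sets, so a marshal failure takes the success path.
func marshalUnchecked(v interface{}) podResult {
	var result podResult
	var err error
	result.Response, err = json.Marshal(v)
	if result.Err == nil { // always true at this point
		return result
	}
	result.Err = err
	return result
}

// marshalChecked applies the suggested fix and tests the error that
// json.Marshal actually returned.
func marshalChecked(v interface{}) podResult {
	var result podResult
	var err error
	result.Response, err = json.Marshal(v)
	if err == nil {
		return result
	}
	result.Err = err
	return result
}

func main() {
	bad := make(chan int) // channels cannot be marshaled to JSON
	fmt.Println(marshalUnchecked(bad).Err == nil) // failure goes unnoticed
	fmt.Println(marshalChecked(bad).Err != nil)   // failure is reported
}
```

For values that always marshal cleanly the two versions behave the same, which matches the observation that the bug has probably never mattered in practice.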

@imcsk8
Contributor

imcsk8 commented Nov 10, 2017

LGTM

@soltysh soltysh removed their assignment Nov 21, 2017
@dcbw dcbw force-pushed the make-pod-metrics-more-useful branch from e68908c to cca6de3 Compare November 22, 2017 21:00
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 22, 2017
@dcbw dcbw changed the title sdn: make PodSetup/TeardownErrors metric more useful sdn: make pod operation metrics more useful and collectable Nov 22, 2017
@dcbw
Contributor Author

dcbw commented Nov 22, 2017

@DirectXMan12 this might be interesting to you too. The NaN problem we had came from how the metrics were broken down by labels; see the first comment for more info.

@dcbw
Contributor Author

dcbw commented Nov 22, 2017

> should be if err == nil on the second line. (result.Err is always nil at this point, but then, the json.Marshal() call shouldn't ever fail anyway, so probably this bug has never actually mattered...)

@danwinship fixed in a separate commit here

@knobunc
Contributor

knobunc commented Nov 22, 2017

LGTM... holding for 3.9

@dcbw dcbw force-pushed the make-pod-metrics-more-useful branch 2 times, most recently from 7fafabf to 9f031d3 Compare November 27, 2017 17:01
@dcbw
Contributor Author

dcbw commented Nov 28, 2017

/test cmd

@dcbw
Contributor Author

dcbw commented Nov 28, 2017

/test cmd issue #16317

dcbw added 2 commits November 28, 2017 13:36
@dcbw dcbw force-pushed the make-pod-metrics-more-useful branch from 9f031d3 to caf5f73 Compare November 28, 2017 19:37
@dcbw
Contributor Author

dcbw commented Nov 29, 2017

/test unit
/test end_to_end

@knobunc
Contributor

knobunc commented Dec 13, 2017

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 13, 2017
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcbw, knobunc


@openshift-merge-robot
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit 3fba38d into openshift:master Dec 14, 2017
@akram
Contributor

akram commented Aug 28, 2018

@smarterclayton This is supposed to be fixed here. Can you remove the drop for these metrics in https://github.com/openshift/origin/blob/master/examples/prometheus/prometheus.yaml#L556?

@smarterclayton
Contributor

smarterclayton commented Aug 28, 2018 via email
