add metrics for TemplateInstance controller #16455
Conversation
@smarterclayton @bparees there are 2 commits here: the first does things approximately like the existing build metrics; however, I think the approach in the second commit is better for both sets of metrics. Example output for discussion:
Notes:
flake #16414
Quick note: on the use of namespace/name/etc. labels with the active build metric, those stemmed from @smarterclayton's desire to have a "constant metric" where the value was the actual start time in unix time. That approach was derived from a similar metric that cadvisor had. My original approach to those active builds was a histogram similar to what I see in this PR @jim-minter for active/waiting template instances, but through iterating with @smarterclayton this was removed, at least for now. We did talk about revisiting/returning some of the histogram based metrics later on. Although not an exact apples-to-apples comparison, it is conceivable that your active/waiting template instances metric could follow a similar "evolutionary path".
1+2) what @gabemontero said. Essentially there was a desire to have a data point representation of each active build. I agree from a cardinality perspective that seems problematic, but that's the direction we got and should probably stick with for templateinstances
i'm also a little confused by the histogram behavior in your example... you've got one data point with a duration of 74s and it seems to be represented in all the buckets except "60s". The only conclusion I can reach is that prometheus counts it in the bucket as long as the value is "less than" the bucket value? Seems strange. (would have expected it to put it in exactly one bucket, the one with the smallest value that's larger than the datapoint's value, i.e. 300 in this case)
Please follow prometheus conventions regarding labels (i.e. no upper case,
match with other resources people have created in terms of general name
ordering).
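The convention being asked for can be checked mechanically. A minimal sketch (plain Go, stdlib only; `followsConvention` is an illustrative helper, not part of any library) that validates the Prometheus metric-name charset plus the lowercase convention requested in this review (lowercase is a convention, not a hard Prometheus requirement):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// validName is the Prometheus metric-name charset from the data model.
var validName = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

// followsConvention additionally enforces lowercase snake_case, the
// convention being requested here for metric names.
func followsConvention(name string) bool {
	return validName.MatchString(name) && name == strings.ToLower(name)
}

func main() {
	fmt.Println(followsConvention("TemplateInstanceController_TemplateInstances_total")) // false
	fmt.Println(followsConvention("openshift_template_instance_total"))                  // true
}
```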
…On Wed, Sep 20, 2017 at 11:20 PM, Ben Parees wrote:

1. well templateinstances don't even have phases, so no argument there. Similarly builds don't have conditions, which is why build metrics are reported by phase, not condition. For objects that have conditions, reporting how many objects are in each terminal condition, as well as how many objects exist in terminal conditions in total, seems reasonable.
pkg/template/controller/metrics.go (outdated)

func newTemplateInstancesTotal() *prometheus.GaugeVec {
	return prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "TemplateInstanceController_TemplateInstances_total",
I would suggest openshift_template_instance_total
if we rename the controller, I'm supposing that templateinstance_controller_templateinstances_total
might work, for example.
pkg/template/controller/metrics.go (outdated)

func newTemplateInstancesWaiting() prometheus.Histogram {
	return prometheus.NewHistogram(
		prometheus.HistogramOpts{
			Name: "TemplateInstanceController_TemplateInstances_active_waiting_time_seconds",
Whether this remains a histogram or becomes a constant metric per the PR discussion thread most likely would NOT affect the name.
So still offering name suggestions prior to reaching consensus on that point seems OK :-)
I think openshift_template_active_wait_time_seconds
would be good.
templateInstancesTotal.WithLabelValues("", "").Inc()

for _, cond := range templateInstance.Status.Conditions {
	templateInstancesTotal.WithLabelValues(string(cond.Type), string(cond.Status)).Inc()
Heads up, during the build metrics review, @smarterclayton was big on following the code stylings of kube_state_metrics for actually registering the metrics.
See addCountGauge
and addTimeGauge
in pkg/build/metrics/prometheus/metrics.go
I'm reluctant to follow this coding approach because it duplicates logic that sits in the prometheus client library. My approach uses the client library logic to populate all the histogram buckets before sending them off. Although it matters less than in the Histogram case, func (bc *buildCollector) Collect is basically reimplementing the prometheus client Gauge.Inc() when I don't believe it needs to.
@bparees prometheus stores cumulative histograms: each bucket counts every observation less than or equal to its upper bound, so a single 74s observation is counted in every bucket whose bound is at least 74s and is absent only from the smaller ones.
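What "cumulative" means here can be seen in a minimal sketch (plain Go, no client library; the bucket bounds are illustrative, not the PR's actual buckets):

```go
package main

import "fmt"

// cumulativeBuckets mimics how Prometheus fills histogram buckets: each
// bucket counts every observation less than or equal to its "le" bound,
// so one observation lands in its bucket AND every larger bucket.
func cumulativeBuckets(bounds, observations []float64) []int {
	counts := make([]int, len(bounds))
	for _, v := range observations {
		for i, le := range bounds {
			if v <= le {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	bounds := []float64{30, 60, 300, 600, 1800} // "le" upper bounds, seconds
	fmt.Println(cumulativeBuckets(bounds, []float64{74}))
	// [0 0 1 1 1]: a single 74s observation appears in every bucket whose
	// bound is >= 74 (300, 600, 1800) and is missing only from 30s and 60s.
}
```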
@smarterclayton I named these TemplateInstanceController_* because automatic controller metrics already exist under this name (that's what the controller name is defined as). Should I rename the controller? To
Although TemplateInstanceController isn't the only one: APIServiceRegistrationController, AvailableConditionController, DiscoveryController
Those are wrong, and must also be fixed.
openshift_template_instance_* is acceptable
force-pushed 61798ff to 86e0679 (updated)
force-pushed 86e0679 to b3451c8
/retest
@smarterclayton @gabemontero @bparees ptal - looking for sign-off on this so that it can merge today
Stylistically, I like what @jim-minter has done here. As I previously noted, to some degree it breaks from some patterns that were imposed from our previous build work; but as generally acknowledged, we are very much in an iteration/evaluation cycle. No reason that can't apply to the way the metrics are coded, in addition to the specifics of the metrics.

As to the specifics of the metrics, certainly an apples-to-apples comparison between templates and builds is not viable. Aside from their intrinsic differences, templates are not something the ops team has been monitoring with zabbix, and existing ops metrics have in part driven the path of build metrics.

With that preamble, similar to the questions @smarterclayton posed on build metrics, I'd be curious @jim-minter how you might envision using the template metrics in online to gauge the health of that component. Perhaps you can update the readme with some example queries; even if we don't explicitly proclaim it yet, those queries might be of a flavor of something we might run in online to get a sense of the health of the component. For example, I'm assuming the status label will have some indication of success/failure/problems. Or template instance activation: how long we expect those to be running. Those kinds of elaborations would help me better review the precise contents of the metrics.

Of course if all that has been discussed previously and I've missed it, just point me to that or summarize, whatever is easier. thanks
talked with @jim-minter a bit about the merits of the empty string label mechanism for showing the total count.. for now we agreed to leave it, but may want to introduce an explicit total count metric in the future since the empty string label feels a bit hacky.

Agree with @gabemontero that contributions to the readme for some sample queries that use these metrics would be good. That can be done as a follow up though, so i'm going to lgtm this as it currently stands; I think it covers the fundamental metrics we'd be interested in seeing for template instance usage.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees, jim-minter. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS files. You can indicate your approval by writing
For what it is worth I'm good with this.
Don't use empty string labels. We have summarization in prometheus for that.
https://prometheus.io/docs/practices/naming/#metric-names last paragraph alludes to it, but a prometheus sum would double count items that have multiple conditions.
Then split the metrics out so they don't, as KSM has done.
When in doubt look at how KSM does it.
Automatic merge from submit-queue (batch tested with PRs 16293, 16455)
Please fix the issue I mentioned.
a separate metric for each condition? (and another separate metric for "total"?)
You don't need total because you can sum the metrics. But yes, a separate
one. I recommend looking at KSM. Or you can look at us-west-1 prometheus
to see what it outputs.
we can't sum the metrics, you'd be double counting items which have two or more conditions associated with them.
Sorry, I wasn't clear. I wanted one series "_total" per condition type,
with different labels for the readiness state. When I said "like KSM" i
meant "they do one metric per condition" but not "one series per template
instance"
and when @smarterclayton and I talked further, we would not do one metric per condition either. We'd do one series per condition (one metric per condition means dynamically defining new metrics if new conditions are introduced). So: one metric that's just a constant total (summed in the collector), and one metric that's instanceByCondition with {condition,value} labels, the value being the count of instances that have that condition/value combination.
Yeah, we don't need two metrics because you can sum by condition if necessary.
i feel like we're going in circles. we can't sum by condition because templateinstances can potentially have multiple conditions, so summing by condition is going to double count things. so we still need two metrics.
sum by (condition) YOUR_METRIC
is not going to double count things
if i specify an explicit condition, sure... it's also not going to give me the total in the system.
Which is a separate metric for total. I was referring to separate metrics
per condition, which are unnecessary. Originally I was talking about one
metric for total and one metric per condition.
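The double-counting concern that drove this exchange can be made concrete with a small sketch (plain Go; the condition names echo TemplateInstance's Ready/InstantiateFailure but the instance data is hypothetical): an instance carrying two conditions contributes to two per-condition series, so summing those series over-counts, which is why a separate total metric was kept.

```go
package main

import "fmt"

// countByCondition tallies instances per condition type and the sum of
// those tallies. An instance with two conditions is counted twice in the
// sum, which is the double-counting problem discussed above.
func countByCondition(instances [][]string) (map[string]int, int) {
	byCondition := map[string]int{}
	for _, conds := range instances {
		for _, c := range conds {
			byCondition[c]++
		}
	}
	sum := 0
	for _, n := range byCondition {
		sum += n
	}
	return byCondition, sum
}

func main() {
	instances := [][]string{
		{"Ready"},
		{"Ready"},
		{"Ready", "InstantiateFailure"}, // one instance, two conditions
	}
	_, summed := countByCondition(instances)
	fmt.Println(len(instances)) // 3: the true instance total
	fmt.Println(summed)         // 4: summing the by-condition series over-counts
}
```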
Automatic merge from submit-queue. Separate openshift_template_instance_status_condition_total and openshift_template_instance_total metrics; follow-up from #16455. @smarterclayton @bparees ptal, @gabemontero fyi
https://trello.com/c/wDxqVOqy/1181-5-prometheus-metrics-for-template-broker-techdebt