prometheus alerts for openshift build subsystem #16495
Conversation
examples/prometheus/prometheus.yaml
Outdated
@@ -30,6 +30,15 @@ parameters:
  name: SESSION_SECRET
  generate: expression
  from: "[a-zA-Z0-9]{43}"
- description: The threshold for the active build alert
Do not add parameters here. If you can't pick good defaults we shouldn't add the alert.
Basically parameterization is the devil. No parameters. It's either a good rule, or it shouldn't be here (add it commented out).
It seems to me the duration and volume of running builds that determine whether you have a problem are going to be determined in part by the size of your cluster (and possibly also derived from steady-state observations). The fact that there aren't absolute values for it doesn't make it a bad rule.
If your suggestion is that such rules should be commented out rather than enabled out of the box, OK, but I would expect that's going to be the case for almost all our rules. It's hard to imagine a metric whose healthy/unhealthy condition isn't going to be determined in part by cluster characteristics.
Maybe what I need is a better understanding of the intent of this file?
(and is there some other file where we should be putting in rules that we want for our actual online prod cluster?)
for _, sample := range metrics {
	// first see if a metric has all the label names and label values we are looking for
	foundCorrectLabels := true
If we don't have helpers to do this you should create them. This is pretty ugly code.
Isn't there a prometheus client library?
Re: a Prometheus client library: at least for my interpretation of that, there is an alpha one for remote access ... see https://github.com/prometheus/client_golang#client-for-the-prometheus-http-api and https://github.com/prometheus/client_golang/blob/master/api/prometheus/v1/api.go
But the version of Prometheus vendored into origin does not have the above. There is "client_golang" stuff in the vendored Prometheus, but it is non-HTTP / non-remote, unless I missed something.
Otherwise, sure, I can work up some helpers ... I might do that in a non-Trello-card PR depending on what happens with this work.
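For reference, a minimal sketch of the kind of helper being discussed, assuming the scraped metrics are decoded into github.com/prometheus/common/model samples; the helper names are hypothetical and not code from this PR:

```go
package metrics

import "github.com/prometheus/common/model"

// sampleHasLabels reports whether a sample carries every expected
// label name/value pair.
func sampleHasLabels(sample *model.Sample, expected map[string]string) bool {
	for name, value := range expected {
		if sample.Metric[model.LabelName(name)] != model.LabelValue(value) {
			return false
		}
	}
	return true
}

// findSample returns the first sample whose labels all match, or nil if none do.
func findSample(samples model.Vector, expected map[string]string) *model.Sample {
	for _, s := range samples {
		if sampleHasLabels(s, expected) {
			return s
		}
	}
	return nil
}
```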
examples/prometheus/prometheus.yaml
Outdated
severity: "MEDIUM" | ||
message: "{{$labels.instance}} indicates at least ${PROMETHEUS_ACTIVE_BUILD_ALERT_THRESHOLD} OpenShift builds are taking more than ${PROMETHEUS_ACTIVE_BUILD_ALERT_DURATION} seconds to complete" | ||
- alert: BuildFailureRate | ||
expr: (count(openshift_build_terminal_phase_total{phase="failed"}) / count(openshift_build_terminal_phase_total{phase="complete"})) > ${PROMETHEUS_FAILED_BUILD_RATIO} |
How would this alert get cleared? If openshift online has 9000 failed builds, do I have to wait for 9000 successful builds for the alert to go away?
Pending all the other discussion points in the PR, I almost did not include this alert to begin with (it was suggested to me) ... I'm inclined to simply remove it, but will wait a bit to see what other discussion arises before doing so.
As the suggester of the alert: I don't know what the Prometheus workflow is like, or how the alerting behaves ... can you have an alert that's ignored once tripped until it resets below the alert level?
Pruning old builds would be one way for it to be reset. Or we can punt it and do an alert based on "failed builds within the last X minutes" instead, but that's going to be another alert that needs to be tuned for your cluster/expectations.
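A rough sketch of what that time-windowed variant might look like, in the same rule format as the template under review; the 30m window and the threshold of 10 are placeholder guesses, not values proposed in this PR:

```yaml
# Hypothetical alternative: alert on recent failures rather than the all-time ratio.
- alert: RecentBuildFailures
  expr: sum(increase(openshift_build_terminal_phase_total{phase="failed"}[30m])) > 10
  annotations:
    severity: "MEDIUM"
    message: "More than 10 OpenShift builds have failed in the last 30 minutes"
```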
examples/prometheus/prometheus.yaml
Outdated
  expr: up{job="kubernetes-nodes"} == 0
  annotations:
    miqTarget: "ContainerNode"
    severity: "HIGH"
    message: "{{$labels.instance}} is down"
- alert: HangingBuild
It's not hanging, it's slow.
examples/prometheus/prometheus.yaml
Outdated
  expr: count(openshift_build_running_phase_start_time_seconds{} < time() - ${PROMETHEUS_ACTIVE_BUILD_ALERT_DURATION}) > ${PROMETHEUS_ACTIVE_BUILD_ALERT_THRESHOLD}
  annotations:
    severity: "MEDIUM"
    message: "{{$labels.instance}} indicates at least ${PROMETHEUS_ACTIVE_BUILD_ALERT_THRESHOLD} OpenShift builds are taking more than ${PROMETHEUS_ACTIVE_BUILD_ALERT_DURATION} seconds to complete"
This is a clearer alarm. However, I'm not sure I want to check the other two in until you prove that you want these alerts on. I.e., when we upgrade to 3.7.0 I'll turn your proposed alerts on and send them to you. Then you can decide whether they work or not.
I actually would probably say we're not ready to merge any of these; instead you should give me the alerts to turn on in production, which I will hook up to your email.
Based on what I've learned as part of taking on this card (see the referenced Google doc), I'm inclined not to do any of these alerts, either in this example template or by manually submitting them to you (if not submitting to you is even an option), and to wait until we get more practical experience with the build subsystem in production. But the card was assigned, and initiating a PR to start discussion seemed the best thing to do.
On the values of the conditions / evil template parameters: for the current Zabbix equivalent that ops is employing today, https://github.com/openshift/openshift-tools/blob/prod/scripts/monitoring/cron-send-stuck-builds.py#L60-L93, the "age" is not hard coded, but a variable passed in.
That said, I've re-initiated the email exchange I started with them during the build metrics work to see if default values emerge (perhaps the caller of the referenced function hardcodes the value).
Let's actually base it on real data. Suggest your top 3 alerts based on recent bugs, and I'll turn them on after the 3.7.0 upgrade.
10-4 re: alerts <=> recent bugs and #16495 (comment) ... @bparees and I will start conferring on all that.
For completeness, an update on what ops is currently doing with their Zabbix stuff:
- they are currently concerned with builds stuck in NEW|PENDING ... not slow RUNNING builds
- I elaborated a bit that that is really a pod scheduling issue, and might be captured by any new k8s or platform mgmt team metrics that get injected/rebased into openshift/origin
- but we could add it to our build-specific stuff easily enough if needed
- their build alerts are at the lowest level: a warning on the Zabbix board, but nobody gets paged
examples/prometheus/prometheus.yaml
Outdated
- description: The allowable active build duration for the active build alert
  name: PROMETHEUS_ACTIVE_BUILD_ALERT_DURATION
  value: "360"
- description: The allowable ration of failed to complete builds for the failed build alert
ratio
So I think you guys got the hard one (being the first responders). Let's turn this around a bit. The point of metrics is to enable you to debug and assess the health of your component in the product. The point of alerts is to unambiguously indicate a serious problem is occurring that an engineer should react to. The process for this is best laid out as
I'd expect you guys to be in the gathering phase (between 1-4). I'd recommend you create a new section the readme and place your recommended snippets in there with some guesses and text that says what to do with them. I'll accumulate them into one of the prod clusters and then you can observe them. We won't ship the alert until it's something you're confident we could page you on at 1am on a Saturday night and you'd be happy because it means you just saved the world. |
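As an illustration of that suggestion, one such README entry might look roughly like the following, reusing the query style already in examples/prometheus/README.md; the threshold and the wording are guesses to be tuned against production data, not content from this PR:

```
## Candidate alerts (not enabled by default)

> count(openshift_build_running_phase_start_time_seconds{} < time() - 3600) > 5

Fires when more than five builds have been running for over an hour. Both numbers
are initial guesses; observe this on a prod cluster before wiring it to paging.
```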
Re: parameters: every time you add a parameter to a template, god kills a kitten. Parameters are best used judiciously and sparingly, and making them tunable makes it the end user's problem before they are ready. Most users should be forking our config if they don't like our settings. Parameters that are proportional to cluster size can be alerted on proportionally to cluster size.
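For instance, a cluster-size-proportional expression could replace the template parameter entirely. This is only a sketch built from metrics already shown in this PR; the one-slow-build-per-node ratio and the one-hour cutoff are assumptions, not values proposed here:

```
# Allow roughly one slow (> 1 hour) build per node before firing.
count(openshift_build_running_phase_start_time_seconds{} < time() - 3600)
  > count(up{job="kubernetes-nodes"})
```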
Well, that was the metric that included the failure reason + start time date, which was rejected as not conforming to the Prometheus philosophy. Failed builds alone is not a very interesting metric for us, since there are lots of perfectly normal reasons for builds to fail. @gabemontero is going to circle back w/ a hybrid proposal between Prometheus metrics and live query-based state analysis. Alternatively, rather than adding failure reason + timestamp labels to our "completed builds" metric, what if we introduced additional metrics like FailedFetchSourceCount? (The collector would do the work of summing those things when iterating the build cache.)
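To make the two shapes concrete, a hedged sketch of how the collector's metric descriptors could look under each approach; the variable names and the Option A metric name are illustrative, not the code that ultimately shipped:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Option A: a dedicated metric per failure mode, e.g. the suggested
	// FailedFetchSourceCount idea (hypothetical metric name).
	failedFetchSourceDesc = prometheus.NewDesc(
		"openshift_build_failed_fetch_source_total",
		"Count of builds that failed while fetching source.",
		nil, nil,
	)

	// Option B: a single failed-build metric carrying a "reason" label,
	// which the collector fills in while iterating the build cache.
	failedBuildDesc = prometheus.NewDesc(
		"openshift_build_failed_phase_total",
		"Count of failed builds, partitioned by failure reason.",
		[]string{"reason"}, nil,
	)
)
```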
e07d075 to 838a344
OK, I've just pushed updates in response to the various comments. Specifics:
838a344 to 2126108
Let's hang onto it for now and see how it performs in the real world.
Some nits, but I'm in agreement w/ the general changes.
examples/prometheus/README.md
Outdated
> openshift_build_failed_phase_total{}

Returns the latest totals for failed builds.
how is this different from the count() query above?
If the count() query above is doing a sum of all the collected data points ("yesterday there were a total of 100 failed builds in the system", "today there were a total of 101", "count=201"), that doesn't seem too useful, since it's going to double (or triple or quadruple...) count a lot of builds.
This query will return a multi-row table which includes the count by reason; I'll push an update that clarifies that, as well as the aggregation that occurs with count (though that might need to be sum ... I'll sort that out as well).
Got it, thanks. (I think you explained that to me before in my cube; sorry I needed the refresher.)
np / yw
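To illustrate the distinction discussed above (the reason values and the numbers in the comments are made up for the example):

```
# The raw query returns one row per failure reason, e.g.
#   openshift_build_failed_phase_total{reason="FetchSourceFailed"}    3
#   openshift_build_failed_phase_total{reason="PostCommitHookFailed"} 1
openshift_build_failed_phase_total{}

# count() would report the number of series (2 here); sum() adds the values (4),
# which is what "total failed builds" usually means.
sum(openshift_build_failed_phase_total)
```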
	activeBuildCount          = "running_phase_start_time_seconds"
	activeBuildCountQuery     = buildSubsystem + separator + activeBuildCount
	newPendingBuildCount      = "new_pending_phase_creation_time_seconds"
	newPendingBuildCountQuery = buildSubsystem + separator + newPendingBuildCount
)

var (
	// decided not to have a separate counter for failed builds, which have reasons,
comment is no longer accurate
sure isn't :-) ... will remove
	}
	for reason, count := range reasons {
		addCountGauge(ch, failedBuildCountDesc, reason, float64(count))
	}
	addCountGauge(ch, terminalBuildCountDesc, errorPhase, float64(error))
these should probably be counted under failedBuilds with a reason of "Build pod encountered an error" or something.
(errored builds are closer to "failed" builds than they are to "completed+canceled" builds).
Yeah, I went back and forth on this, considering each point you just mentioned vs. the fact that, in the runtime (following a guru search), the StatusReason field in the build is never set for an errored build.
Bottom line, I'm not convinced ... unless there is more than one reason for builds in the Error state, it is only wasteful to add a reason label for it ... we shouldn't use Prometheus labels to clarify like this.
I'm saying we treat "Error" as the reason. Treat Failed and Error builds the same; when applying the reason label for the Error builds, just manufacture a reason.
In other words, I want the "failedBuilds" metric to include the count of Error builds, because otherwise they may be ignored under "terminated" builds as not interesting.
I see ... OK, we should make the reason string consistent in look/feel with the others ... I'll go with BuildPodError; there are already BuildPodDeleted and BuildPodExists.
sounds good.
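A minimal sketch of what folding Error-phase builds into the failed-build reasons could look like; the helper name and exact import path are assumptions for illustration, not the code in this PR:

```go
package metrics

import buildapi "github.com/openshift/origin/pkg/build/apis/build"

// buildPodError is a manufactured reason for Error-phase builds, named to be
// consistent with existing reasons such as BuildPodDeleted and BuildPodExists.
const buildPodError = "BuildPodError"

// tallyTerminalBuild records a terminal build in the per-reason map, treating
// Failed and Error builds the same so errored builds are not lost under a
// generic "terminal" count.
func tallyTerminalBuild(b *buildapi.Build, reasons map[string]int) {
	switch b.Status.Phase {
	case buildapi.BuildPhaseFailed:
		reasons[string(b.Status.Reason)]++
	case buildapi.BuildPhaseError:
		reasons[buildPodError]++
	}
}
```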
		case buildapi.BuildPhaseFailed:
			failed++
			// currently only failed builds have reasons
			reasons[string(b.Status.Reason)] = 1
		case buildapi.BuildPhaseError:
			error++
you may live to regret this variable name choice.
ah ... a golang thing perhaps ... good catch / I'll change
		e, cc, cp, r := bc.collectBuild(ch, b)
		for key, value := range r {
			count := reasons[key]
			count = count + value
count:=reasons[value]+value ?
or better reasons[key]=reasons[key]+value?
yep ... iterated a bit on this piece and didn't finish clean up .. thx
The cmd test run looks to have hit flake #16468, though that was presumably fixed / closed by @bparees; then another cmd test flaked as well, both of course unrelated to this change.
2126108 to dda3ba5
Responses to the Sept 26 comments from @bparees pushed.
/lgtm but will give @smarterclayton a last opportunity to raise concerns before we merge this tomorrow. (@gabemontero you can remove the WIP label)
Though it just compiled for me locally, so it's not obvious what is up:
looking ...
a rebase?
yep ... 31fe387#diff-5e0dc194b78051f414a42e52f4caf7dd merged yesterday, changed things up
aa4eeb8 to c209a95
c209a95 to 456dce5
Rebase after the @jim-minter PR merge pushed. At least for the cmd test failures that were available ... they again appear to be random flakes ... I was able to execute the same tests locally.
/lgtm
cmd test failures same as last time:
and
But I can run
Found it:
Will need to push another update. Not sure yet why it happens on CI and not locally, but given the nil ref / panic, it does not matter.
Must be some sort of golang / version / compile diff thingy.
456dce5 to cf023ef
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bparees, gabemontero. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files.
The cmd tests are still running, but the authenticate.sh tests that failed before have passed.
unrelated cmd flake
/test cmd
/test cmd
/test extended_networking_minimal
/retest
Please review the full test history for this PR and help us cut down flakes.
Automatic merge from submit-queue.
https://trello.com/c/RskNHpfh/1334-5-prometheus-alerts-for-build-metrics
A WIP initial pass at alerts for the openshift build subsystem
@openshift/devex @smarterclayton @zgalor @moolitayer @mfojtik ptal, defer if bandwidth dictates, and/or pull in others as you each deem fit
Disclaimers:
- I'm still debating the pros/cons of these alerts with https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.2efurbugauf in mind
- still debating the template parameters / defaults for the various thresholds ... I still have a to-do to revisit potential default values with ops contacts based on their existing zabbix monitoring
- still debating the severity as well
- based on the activity in Update test alert metadata #16026 I did not include the miqTarget annotation. I also removed the space in the existing alert name based on how I interpreted various naming conventions.
- and other than the query on the alerts URI, the extended test changes stemmed from flakiness experienced during testing that was unrelated to the addition of the alerts.
thanks