apps: add prometheus metrics for rollout #14796
Conversation
@Kargakis @tnozicka @smarterclayton open for suggestions.
Force-pushed from 04fa119 to 6ce74a3
@smarterclayton the fatal error count on deployer pods indicates that deployer pods went missing or something terrible happened to them.
prometheus.CounterOpts{
    Subsystem: DeployerControllerSubsystem,
    Name:      "failure_count",
    Help:      "Counter that counts total number of deployer pod errors per error type",
Does it make more sense to have a ratio as opposed to an absolute number? Or keep the number of errors that occurred in the last hour, for example?
@Kargakis I don't think Prometheus has that built in; you can query it via the Prometheus console.
Generally you would count errors separately from the total.
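For illustration, counting totals and errors as two raw series and leaving the ratio and time window to the query side could look like this (a minimal sketch; the subsystem and metric names here are made up, not the ones in this PR):

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // Total number of deployer pod syncs, successful or not.
    deployerSyncsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Subsystem: "deployer_controller",
        Name:      "syncs_total",
        Help:      "Total number of deployer pod syncs.",
    })
    // Errors only, partitioned by error type; ratios and windows are
    // computed at query time from these two raw series.
    deployerErrorsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Subsystem: "deployer_controller",
        Name:      "errors_total",
        Help:      "Total number of deployer pod errors, partitioned by error type.",
    }, []string{"type"})
)

func init() {
    prometheus.MustRegister(deployerSyncsTotal, deployerErrorsTotal)
}

The hourly error ratio then falls out of rate() over the two series in the Prometheus console, without baking a window into the metric itself.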
prometheus.GaugeOpts{
    Subsystem: DeploymentConfigControllerSubsystem,
    Name:      "set_condition",
    Help:      "Gauge measuring changes to deployment conditions per condition, reason and the status",
This metric is pretty vague. I would be interested in how often a deployment flips between True/False for Available, how many deployments are unavailable, and how many deployments have failed over a period of time.
@Kargakis you can query unavailable deployments via the Prometheus console (just query the condition counter and specify a time range).
Yeah, these metrics should be as raw as possible.
btw, I think we should always set a reason for the condition; empty reasons look bad in metrics.
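For illustration, one way to keep empty reasons out of the metrics, a label-side fallback rather than changing the controller to always set a reason (a sketch; the helper name and the "Unknown" fallback are assumptions):

// Default an empty condition reason before using it as a metric label, so
// empty label values do not show up in the recorded series. Sketch only.
func reasonLabel(reason string) string {
    if reason == "" {
        return "Unknown"
    }
    return reason
}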
const DeploymentConfigControllerSubsystem = "deploymentconfig_controller"

var (
    deploymentConditionCounter = prometheus.NewGaugeVec(
@smarterclayton not sure if this should be a gauge or a counter (I decrease the number when we remove the condition)
Gauge. Counters must be monotonically increasing.
deploymentConditionCounter = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Subsystem: DeploymentConfigControllerSubsystem,
        Name:      "set_condition",
Is this consistent with the naming of other Prometheus metrics?
@smarterclayton it looks OK with the subsystem set properly; it should probably be plural (set_conditions).
(upstream uses plurals for counters)
pkg/deploy/util/util.go (outdated)
status.Conditions = filterOutCondition(status.Conditions, condType)
// Decrease counters for conditions we are going to remove
hrm.
fiddling with counters like this is dangerous - it's too easy to get them wrong. I would recommend instead simply counting the number of conditions.
Or doing this at a higher level. Since this is observational from the cache, you might just want to add a ResourceEventHandler on the cache and count those.
Or have remove_condition counters as well; I think that would be cleaner.
I just found out why ResourceEventHandler is a much better solution ;-) thx.
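For reference, the ResourceEventHandler approach sketched at a high level; the function and the re-count callback here are assumptions for illustration, not the final code:

package metrics

import "k8s.io/client-go/tools/cache"

// Observe condition changes from the shared informer cache instead of
// mutating gauges inside the controller's update path. Sketch only.
func registerConditionMetrics(informer cache.SharedIndexInformer, recount func(obj interface{})) {
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        // Count conditions when an object shows up in the cache.
        AddFunc: recount,
        // Re-count on every update rather than incrementing/decrementing by hand.
        UpdateFunc: func(old, cur interface{}) { recount(cur) },
        // Deletions would decrement or trigger a re-count; omitted in this sketch.
        DeleteFunc: func(obj interface{}) {},
    })
}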
Force-pushed from 27f7b40 to f96564c
Force-pushed from f96564c to d458bf8
[test]
Force-pushed from 28804b6 to 4d9a804
for _, n := range new {
    found := false
    for _, o := range old {
        if found = n == o; found {
This hurts... Can you simply do
if n == o {
    found = true
    break
}
sure, fixed.
Just a nit, LGTM in general. @smarterclayton do we treat metrics as API?
Force-pushed from 4d9a804 to fdbd630
@Kargakis @smarterclayton added a deployer_controller_failed_rollouts metric that counts all failed rollouts ;-) so if something goes wrong suddenly during an upgrade, we will know.
Any objections to merging this now so we can have these for the next upgrade? We can follow up with kube-state-metrics in 3.7.
}

func updateFailedRolloutsMetrics(oldRC, newRC *v1.ReplicationController) {
    if oldRC != nil && deployutil.IsFailedDeployment(oldRC) {
@Kargakis is this ok?
(this is a counter so it will just go up, which is fine)
Yes, it's fine.
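For context, the transition-only bump being agreed on, spelled out as a sketch; the failedRolloutsTotal counter, its namespace label, and the early return are assumptions, while the signature and deployutil.IsFailedDeployment come from the diff above and reuse its imports:

// Bump a monotonic counter only on the transition into the failed state,
// so re-syncs of an already-failed RC are not double counted. Sketch only.
func updateFailedRolloutsMetrics(oldRC, newRC *v1.ReplicationController) {
    if oldRC != nil && deployutil.IsFailedDeployment(oldRC) {
        // Already counted when it first failed.
        return
    }
    if deployutil.IsFailedDeployment(newRC) {
        failedRolloutsTotal.WithLabelValues(newRC.Namespace).Inc()
    }
}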
flake: #14689 [test]
for _, n := range new {
    found := false
    for _, o := range old {
        if n == o {
This will fail in the case of a Progressing condition where the only difference between the new and the old condition is the lastTransitionTime, right?
@Kargakis I guess that is fine? It is not a new condition, just a timestamp update.
If there is a timestamp update, found ends up being false, so you increment the metric here and decrement it below. Also, we don't really use == for comparing structs. I would prefer this to be more explicit about the fields we care about.
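Something along these lines is what "explicit about the fields we care about" means here, as a drop-in for the inner loop above (a sketch; Type, Status and Reason are assumed to be the fields that matter):

// Compare only the fields the metric cares about, so a bare
// lastTransitionTime refresh does not register as a new condition.
if n.Type == o.Type && n.Status == o.Status && n.Reason == o.Reason {
    found = true
    break
}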
gotcha, i will update this, thanks
Force-pushed from b3cdbc6 to 1db906c
@Kargakis comparison fixed, I guess this is ready to go.
[severity:blocker]
Force-pushed from 3084879 to 4e9f491
We shouldn't merge this for 3.6 because we aren't gathering it yet.
@smarterclayton updated, it follows the way the build metrics are done (lister, simple counters)...
}

generation := d.Status.ObservedGeneration
ch <- prometheus.MustNewConstMetric(activeRolloutCountDesc, prometheus.GaugeValue, float64(generation), []string{
@smarterclayton I'm not sure if generation is the right value here; the builds are using the start timestamp (unix).
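For comparison, following the build metrics would mean emitting the rollout start time instead of the generation (a sketch; the label values passed here, namespace and name, are assumptions about activeRolloutCountDesc):

// Emit the rollout start time (unix seconds) as the gauge value, the way
// the build metrics do, rather than the observed generation. Sketch only.
started := float64(d.CreationTimestamp.Unix())
ch <- prometheus.MustNewConstMetric(activeRolloutCountDesc, prometheus.GaugeValue, started, d.Namespace, d.Name)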
var available, failed, cancelled float64

for _, d := range result {
    if util.IsTerminatedDeployment(d) {
result is DCs but these helpers are for RCs?
@Kargakis yeah, brainfart... it is all fixed :)
Force-pushed from e8c06b5 to 1f771da
// Collect implements the prometheus.Collector interface.
func (c *appsCollector) Collect(ch chan<- prometheus.Metric) {
    result, err := c.lister.List(labels.Everything())
Why do we need the extra list here? We already do one inside the main sync loop of the controller. Can't you use those results?
I guess the Prometheus collector schedules the list. If I pass the main controller loop's rcList there might be a race, or I would have to read-lock it? Not sure if it is worth it, since this list is served from the cache IMHO.
Discussed IRL, this is fine.
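For the record, the shape of the cache-backed collector under discussion: it lists from the informer-backed lister at scrape time, with no extra API round trip. A sketch only; the desc names and the completed/failed split are illustrative, while IsTerminatedDeployment and IsFailedDeployment appear elsewhere in this diff and are assumed to be imported under the util alias here:

// List RCs from the informer-backed lister at scrape time and emit
// aggregate gauges for terminated rollouts. Sketch only.
func (c *appsCollector) Collect(ch chan<- prometheus.Metric) {
    result, err := c.lister.List(labels.Everything())
    if err != nil {
        // Listing from the cache should not fail; skip this scrape if it does.
        return
    }
    var completed, failed float64
    for _, d := range result {
        if !util.IsTerminatedDeployment(d) {
            continue
        }
        if util.IsFailedDeployment(d) {
            failed++
        } else {
            completed++
        }
    }
    ch <- prometheus.MustNewConstMetric(completedRolloutsDesc, prometheus.GaugeValue, completed)
    ch <- prometheus.MustNewConstMetric(failedRolloutsDesc, prometheus.GaugeValue, failed)
}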
Force-pushed from 96cad7b to 383fe38
}

// TODO: possible time screw?
durationSeconds := time.Now().Unix() - d.CreationTimestamp.Unix()
Isn't the time the deployer pod started running more precise than the creation of the RC for when the new rollout started?
@Kargakis I think there is some time skew here, but the deployer pod might already be gone and I would need to get that pod, which would slow down the collection?
Force-pushed from 383fe38 to a6be016
@smarterclayton tests are passing now, PTAL
I might use #16347 for started/completed after that lands; that should not block this.
@smarterclayton bump
}

// Record duration in seconds for active rollouts
// TODO: possible time screw?
time "skew". Time "screw" would probably be very different.
classic @mfojtik comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mfojtik, smarterclayton. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files.
You can indicate your approval by writing /approve in a comment.
Automatic merge from submit-queue (batch tested with PRs 14784, 16418, 16406, 16431, 14796)
Add basic prometheus metrics for deployment config rollouts. The implementation should be consistent with build metrics: #15440