Machine ID: Add Prometheus metrics for loop tasks #52410

Open. 1 commit to merge into base branch master.

@timothyb89 timothyb89 commented Feb 22, 2025

This adds a number of Prometheus metrics to help track success, failure, and timing for loop iterations. The loop helper is used across tbot services, so these metrics cover identity and output renewals, among other tasks. It also includes the Teleport build collector.

New metrics include:

  • tbot_task_iteration_duration_seconds: histogram of iteration time, including all retries
  • tbot_task_iterations_successful: histogram of the number of retries needed before an iteration succeeded
  • tbot_task_iterations_failed: count of failures by task
  • tbot_task_iterations: simple counter of iterations attempted per task, regardless of outcome and not counting retries

This also renames the misspelled service_heatbeat.go.

changelog: Machine ID: Added new Prometheus metrics to track success and failure of renewal loops

@timothyb89

Sample of new metrics:

# HELP tbot_task_iteration_duration_seconds Time between beginning and ultimate end of one task iteration regardless of outcome, including all retries
# TYPE tbot_task_iteration_duration_seconds histogram
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.1"} 0
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.17500000000000002"} 0
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.30625"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.5359375000000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="0.9378906250000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="1.6413085937500003"} 1
tbot_task_iteration_duration_seconds_bucket{name="bot-identity-renewal",le="+Inf"} 1
tbot_task_iteration_duration_seconds_sum{name="bot-identity-renewal"} 0.230272
tbot_task_iteration_duration_seconds_count{name="bot-identity-renewal"} 1
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.1"} 0
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.17500000000000002"} 0
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.30625"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.5359375000000001"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="0.9378906250000001"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="1.6413085937500003"} 2
tbot_task_iteration_duration_seconds_bucket{name="output-renewal",le="+Inf"} 2
tbot_task_iteration_duration_seconds_sum{name="output-renewal"} 0.407756875
tbot_task_iteration_duration_seconds_count{name="output-renewal"} 2
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.1"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.17500000000000002"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.30625"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.5359375000000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="0.9378906250000001"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="1.6413085937500003"} 1
tbot_task_iteration_duration_seconds_bucket{name="submit-heartbeat",le="+Inf"} 1
tbot_task_iteration_duration_seconds_sum{name="submit-heartbeat"} 0.031338792
tbot_task_iteration_duration_seconds_count{name="submit-heartbeat"} 1
# HELP tbot_task_iterations Number of task iteration attempts, not counting retries
# TYPE tbot_task_iterations counter
tbot_task_iterations{name="bot-identity-renewal"} 1
tbot_task_iterations{name="output-renewal"} 2
tbot_task_iterations{name="submit-heartbeat"} 1
# HELP tbot_task_iterations_successful Histogram of task iterations that ultimately succeeded, bucketed by number of retries before success
# TYPE tbot_task_iterations_successful histogram
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="0"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="1"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="2"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="3"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="4"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="5"} 1
tbot_task_iterations_successful_bucket{name="bot-identity-renewal",le="+Inf"} 1
tbot_task_iterations_successful_sum{name="bot-identity-renewal"} 0
tbot_task_iterations_successful_count{name="bot-identity-renewal"} 1
tbot_task_iterations_successful_bucket{name="output-renewal",le="0"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="1"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="2"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="3"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="4"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="5"} 2
tbot_task_iterations_successful_bucket{name="output-renewal",le="+Inf"} 2
tbot_task_iterations_successful_sum{name="output-renewal"} 0
tbot_task_iterations_successful_count{name="output-renewal"} 2
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="0"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="1"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="2"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="3"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="4"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="5"} 1
tbot_task_iterations_successful_bucket{name="submit-heartbeat",le="+Inf"} 1
tbot_task_iterations_successful_sum{name="submit-heartbeat"} 0
tbot_task_iterations_successful_count{name="submit-heartbeat"} 1
# HELP teleport_build_info Provides build information of Teleport including gitref (git describe --long --tags), Go version, and Teleport version. The value of this gauge will always be 1.
# TYPE teleport_build_info gauge
teleport_build_info{gitref="api/v17.0.0-dev.gusr.1-2795-g75cc82e38e",goversion="go1.24.0",version="18.0.0-dev"} 1
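
The duration bucket boundaries in the sample are consistent with an exponential series starting at 0.1 with a factor of 1.75 and six buckets. Those parameters are inferred from the sample output, not taken from the PR's code, and `exponentialBuckets` below is a hypothetical standard-library helper written only to check the arithmetic.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, beginning at start,
// where each bound is factor times the previous one.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Assumed parameters: start=0.1, factor=1.75, count=6.
	for _, bound := range exponentialBuckets(0.1, 1.75, 6) {
		fmt.Println(bound)
	}
}
```

The largest finite bound is about 1.64 seconds, so any iteration slower than that is only visible in the `+Inf` bucket.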

One issue here is that all of our output services use output-renewal as their task name, so their metrics will be grouped together. Do we want to make that more specific? I'd suggest either appending something more specific to the name (e.g. output-renewal/application) or adding a subtype field and a matching Prometheus label.
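
For comparison, a rough sketch of what each option would produce as a series identifier. The helper names are hypothetical, and the label name `subtype` is only the one floated above, not something the PR defines.

```go
package main

import "fmt"

// seriesWithSuffix folds the output type into the existing name label.
func seriesWithSuffix(metric, task, subtype string) string {
	return fmt.Sprintf(`%s{name="%s/%s"}`, metric, task, subtype)
}

// seriesWithLabel keeps the name label unchanged and adds a separate subtype label.
func seriesWithLabel(metric, task, subtype string) string {
	return fmt.Sprintf(`%s{name="%s",subtype="%s"}`, metric, task, subtype)
}

func main() {
	// prints tbot_task_iterations{name="output-renewal/application"}
	fmt.Println(seriesWithSuffix("tbot_task_iterations", "output-renewal", "application"))
	// prints tbot_task_iterations{name="output-renewal",subtype="application"}
	fmt.Println(seriesWithLabel("tbot_task_iterations", "output-renewal", "application"))
}
```

A separate label keeps name="output-renewal" usable for aggregating across all outputs while still allowing a per-subtype breakdown. The suffix approach needs a regex match on name to get the same aggregate.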
