Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add instaslice custom metrics #353

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mamy-CS
Copy link
Contributor

@mamy-CS mamy-CS commented Jan 3, 2025

Addresses - #53
Enable Kubernetes metrics from controllers
And add Custom metrics

  • instaslice_gpu_slices_total : Total number of GPU slices allocated per node

  • instaslice_current_deployed_pod_total: Pods that are deployed currently on slices

  • instaslice_pending_gpu_slice_requests : Number of pending GPU slice requests

  • instaslice_current_gpu_compatible_profiles: Profiles compatible with remaining GPU slices

  • instaslice_total_processed_gpu_slices : Number of total processed GPU slices

  • Unit tests added for metrics reconcile funcs and calls in instaslice _controller and prometheus_manager

Updated: Successful run of make test, make lint, make test-e2e, and make test-e2e-kind-emulated

@openshift-ci openshift-ci bot requested review from asm582 and harche January 3, 2025 16:46
Copy link

openshift-ci bot commented Jan 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mamy-CS

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 3, 2025
@mamy-CS mamy-CS marked this pull request as draft January 3, 2025 16:46
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 3, 2025
@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 11, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2025
@mamy-CS mamy-CS marked this pull request as ready for review February 6, 2025 17:14
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 6, 2025
@openshift-ci openshift-ci bot requested review from empovit and sairameshv February 6, 2025 17:16
@asm582
Copy link
Contributor

asm582 commented Feb 18, 2025

@mamy-CS the API has changed. We should move to the new API

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 18, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 19, 2025
@asm582
Copy link
Contributor

asm582 commented Feb 19, 2025

I would recommend to squash commits, also please report if e2e pass in both modes

@mamy-CS
Copy link
Contributor Author

mamy-CS commented Feb 20, 2025

pr is rebased and ready

@kannon92
Copy link
Contributor

Not sure if openshift is in scope for this yet.

https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/

There are some requirements for having this work with openshift monitoring. We also do not want to disable TLS as our default.

As of right now this seems to work with a custom deployed promethesus but I don't think it would work with Openshift.

@asm582
Copy link
Contributor

asm582 commented Feb 20, 2025

Not sure if openshift is in scope for this yet.

https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/

There are some requirements for having this work with openshift monitoring. We also do not want to disable TLS as our default.

As of right now this seems to work with a custom deployed promethesus but I don't think it would work with Openshift.

@kannon92 I think we should make this work on Kubernetes (maybe KinD) and then modify as per OpenShift platform requirements if you agree

@mamy-CS
Copy link
Contributor Author

mamy-CS commented Feb 20, 2025

his work on Kubernetes (maybe KinD) and then modify as per OpenShift

Yes, this custom metrics work is currently on KinD, we can modify it to OpenShift Req as @asm582 mentioned.

@mamy-CS mamy-CS force-pushed the add-metrics branch 3 times, most recently from a503cd4 to dd48022 Compare February 20, 2025 21:07
Signed-off-by: MohammedAbdi <[email protected]>

update metrics

Signed-off-by: MohammedAbdi <[email protected]>

nit

Signed-off-by: MohammedAbdi <[email protected]>

update metrics

Signed-off-by: MohammedAbdi <[email protected]>

update

Signed-off-by: MohammedAbdi <[email protected]>

update deployed pod total and total processed slices metrics

Signed-off-by: MohammedAbdi <[email protected]>

updateMetricsAllSlotsFree

Signed-off-by: MohammedAbdi <[email protected]>

nits

Signed-off-by: MohammedAbdi <[email protected]>

update promethues

Signed-off-by: MohammedAbdi <[email protected]>

update deployed pod total metrics call

Signed-off-by: MohammedAbdi <[email protected]>

remove fake capacity file

Signed-off-by: MohammedAbdi <[email protected]>

update profile map extraction automation

Signed-off-by: MohammedAbdi <[email protected]>

update

Signed-off-by: MohammedAbdi <[email protected]>

Track total fit across all GPUs correctly

Signed-off-by: MohammedAbdi <[email protected]>

add unit tests

Signed-off-by: MohammedAbdi <[email protected]>

update metrics url

Signed-off-by: MohammedAbdi <[email protected]>

nit

Signed-off-by: MohammedAbdi <[email protected]>

nit

Signed-off-by: MohammedAbdi <[email protected]>

adjust unit tests

Signed-off-by: MohammedAbdi <[email protected]>

nit

Signed-off-by: MohammedAbdi <[email protected]>

update

Signed-off-by: MohammedAbdi <[email protected]>

update manifests

Signed-off-by: MohammedAbdi <[email protected]>

update test file

Signed-off-by: MohammedAbdi <[email protected]>

update compatible profiles

Signed-off-by: MohammedAbdi <[email protected]>

nit

Signed-off-by: MohammedAbdi <[email protected]>

address reviews

Signed-off-by: MohammedAbdi <[email protected]>

address comments

Signed-off-by: MohammedAbdi <[email protected]>

nit

Signed-off-by: MohammedAbdi <[email protected]>
Copy link

openshift-ci bot commented Feb 20, 2025

@mamy-CS: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants