Document reliance on ConfigMaps and RBAC implications #185

Open
tardieu opened this issue Oct 18, 2024 · 3 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


tardieu commented Oct 18, 2024

InstaSlice injects the NVIDIA_VISIBLE_DEVICES environment variable into pod specs to assign containers to specific MIG slices. However, pod specs are immutable once created, and pod creation happens before any decision about MIG slice allocation can be made. To work around this limitation, the InstaSlice webhook injects a reference to a ConfigMap into the pod spec. This indirection makes it possible to populate the ConfigMap later with the id of the chosen MIG slice. Once the ConfigMap is ready, InstaSlice ungates the pod.
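To make the indirection concrete, here is a minimal sketch of the mutation described above, operating on a pod manifest held as a plain Python dict. The gate name, the ConfigMap key, and the use of `env` with `valueFrom.configMapKeyRef` (rather than `envFrom`) are assumptions made for illustration; this issue does not state which identifiers or which reference style InstaSlice actually uses.

```python
import copy

# Hypothetical placeholders, not InstaSlice's actual identifiers.
GATE_NAME = "example.com/instaslice-gate"
CONFIGMAP_KEY = "NVIDIA_VISIBLE_DEVICES"


def mutate_pod(pod, configmap_name):
    """Return a copy of the pod with a scheduling gate and a ConfigMap-backed env var."""
    mutated = copy.deepcopy(pod)
    spec = mutated["spec"]
    # Gate the pod so it is not scheduled until an allocation exists.
    spec.setdefault("schedulingGates", []).append({"name": GATE_NAME})
    for container in spec["containers"]:
        # The value is resolved from the ConfigMap when the container starts,
        # not when the pod is created.
        container.setdefault("env", []).append({
            "name": "NVIDIA_VISIBLE_DEVICES",
            "valueFrom": {
                "configMapKeyRef": {"name": configmap_name, "key": CONFIGMAP_KEY}
            },
        })
    return mutated


def ungate(pod):
    """Return a copy of the pod with the (hypothetical) gate removed."""
    ungated = copy.deepcopy(pod)
    spec = ungated["spec"]
    spec["schedulingGates"] = [
        gate for gate in spec.get("schedulingGates", []) if gate["name"] != GATE_NAME
    ]
    return ungated
```

The point of the sketch is that the pod spec only ever names the ConfigMap. The slice id itself never appears in the pod spec, which is why the content of the ConfigMap at container start time matters.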

Because the MIG slice id is read from the ConfigMap only after a delay, a user or process with permission to create, update, patch, or delete ConfigMaps in the pod's namespace could alter the content of the ConfigMap before the container runtime configures and starts the container. The container could then gain access to unintended slices and possibly interfere with other pods.
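As a rough aid for auditing this exposure, the sketch below inspects the `rules` list of a Role or ClusterRole (as plain dicts) and reports which write verbs it grants on ConfigMaps. The function name is hypothetical. The verb set includes `deletecollection` in addition to the four verbs named above, and `resourceNames` restrictions are deliberately ignored, so the result errs on the side of over-reporting.

```python
WRITE_VERBS = {"create", "update", "patch", "delete", "deletecollection"}


def configmap_write_verbs(rules):
    """Return the set of write verbs on ConfigMaps granted by RBAC policy rules."""
    granted = set()
    for rule in rules:
        groups = set(rule.get("apiGroups", []))
        resources = set(rule.get("resources", []))
        # ConfigMaps live in the core API group, which RBAC writes as "".
        if not ({"", "*"} & groups):
            continue
        if not ({"configmaps", "*"} & resources):
            continue
        verbs = set(rule.get("verbs", []))
        # A "*" verb grants everything, including every write verb.
        granted |= WRITE_VERBS if "*" in verbs else (verbs & WRITE_VERBS)
    return granted
```

Any non-empty result for a subject bound in the pod's namespace means that subject could tamper with the slice assignment in the window between ConfigMap creation and container start.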

For the intended use case, i.e., backend clusters with no or limited user access, this is acceptable. However, we should document InstaSlice's dependency on unaltered ConfigMaps in general. It should be noted that typical GPU clusters with standard deployments of the NVIDIA GPU operator are susceptible to similar abuse as they permit accessing GPUs without having to request an nvidia.com/gpu resource in the first place. See NVIDIA/k8s-device-plugin#61 for details.

Alternative approaches could be considered to remove the dependency on a ConfigMap (with other drawbacks):

  • InstaSlice could target (i.e., intercept) deployments or jobs instead of pods.
  • InstaSlice could delete and recreate the pod once an allocation decision has been made.

tardieu commented Oct 18, 2024

We should make the ConfigMap immutable, thereby preventing accidental updates. This, however, does not prevent anyone from deleting and recreating the ConfigMap.
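For reference, a minimal sketch of what such a ConfigMap could look like, built as a plain Python dict. The top-level `immutable` field is a standard ConfigMap field in Kubernetes. The function name, the data key, and the slice id value in the usage line are illustrative assumptions, not InstaSlice's actual layout.

```python
import json


def slice_configmap(name, namespace, slice_id):
    """Build an immutable ConfigMap manifest carrying a MIG slice id."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # With immutable set, the API server rejects changes to the data.
        # Deleting the object and creating a new one under the same name
        # is still possible.
        "immutable": True,
        # Hypothetical key; the issue does not state the key InstaSlice uses.
        "data": {"NVIDIA_VISIBLE_DEVICES": slice_id},
    }


if __name__ == "__main__":
    # "MIG-example" is a placeholder, not a real slice id.
    print(json.dumps(slice_configmap("my-pod-slice", "default", "MIG-example"), indent=2))
```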

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label Jan 17, 2025
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 16, 2025
2 participants