Requesting zero GPUs allocates all GPUs #61
Comments
Currently, the device plugin doesn't have the ability to inject env vars into pods. However, you can implement this feature with a mutating admission webhook: just write a small web server that mutates the env var to the user's definition. I think it's not that difficult. (Actually, I did the same thing in our cluster.)
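For anyone wondering what the wiring for such a webhook looks like, below is a minimal sketch of a MutatingWebhookConfiguration that routes pod creation through a small in-cluster mutation server. The service name, namespace, and path are placeholders, not details from @everpeace's setup.

```yaml
# Sketch only: names, namespace, and path are hypothetical.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: gpu-env-mutator
webhooks:
  - name: gpu-env-mutator.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore        # don't block pod creation if the webhook is down
    clientConfig:
      service:
        name: gpu-env-mutator    # the small web server that performs the mutation
        namespace: kube-system
        path: /mutate
      caBundle: ""               # base64-encoded CA cert for the webhook's TLS certificate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```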
Does that mean that if I have two containers both requesting nvidia.com/gpu: 0, they could share the GPU?
@yukuo78 Basically yes, this is equivalent to the node-selector trick to share GPUs described in the linked thread; check the follow-ups there for more information.
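For context, a rough sketch of that trick under assumed names (the node label gpu: "true" and the public CUDA base image are illustrative, not taken from the linked thread): pods pin themselves to a GPU node with a nodeSelector and rely on the NVIDIA_VISIBLE_DEVICES env var instead of a nvidia.com/gpu resource request, so several such pods can see the same physical device.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  nodeSelector:
    gpu: "true"              # assumed label marking nodes that have a GPU
  containers:
    - name: app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["sleep", "infinity"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "0"         # expose GPU 0 without requesting nvidia.com/gpu,
                             # so multiple pods like this one share the same device
```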
@dhague It is a prerequisite that both nvidia.com/gpu: 0 and the NVIDIA_VISIBLE_DEVICES env var are set together, isn't it?
@everpeace Could you share your custom admission webhook code, especially the part that mutates NVIDIA_VISIBLE_DEVICES?
I've tested it. If we add …, the nvidia-container-runtime may take the last one to decide which devices to mount, and then it will …
@Cas-pian What you meant was two pods set up respectively with NVIDIA_VISIBLE_DEVICES=none and nvidia.com/gpu: 1 on the same node, wasn't it?
@Davidrjx No, I just found a bug in nvidia-container-runtime-hook (nvidia-container-toolkit). Step 1: I use a CUDA image that has the env NVIDIA_VISIBLE_DEVICES=all to start a pod (without setting resources.requests for GPU); then all GPUs get mounted into the container. This makes the k8s-device-plugin useless and breaks the environment of pods that do use resources.requests for GPU. Step 2: in order to fix the problem in step 1, I add …
@Cas-pian Oh, now I understand what you mean.
I wrote a Kubernetes Mutating Admission Webhook called gpu-admission-webhook to handle this case. It sets NVIDIA_VISIBLE_DEVICES to "none" if you do not request a GPU. It also deletes environment variables that would cause issues or bypass this constraint.
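To illustrate the shape of the mutation (this is not the actual gpu-admission-webhook code), the JSONPatch such a webhook returns for a container with no GPU request might look roughly like the following, shown in YAML form. A real implementation would append to or merge with any existing env list rather than overwrite it.

```yaml
# Hypothetical patch for the first container of a pod that requests no nvidia.com/gpu.
- op: add
  path: /spec/containers/0/env
  value:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "none"
```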
After reading the documentation about …, from the doc: …
I've tried to set: …

My idea is to have multiple pods on the same node sharing a single GPU, but it looks like in that case the app in the container does not utilise the GPU at all. What am I missing?
This is no longer an issue if you have the following lines in your config file:

accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true

and you deploy the nvidia-device-plugin with the values:

compatWithCPUManager: true
deviceListStrategy: volume-mounts
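As a quick sanity check after applying those settings, a pod along these lines (name and image are just examples) should come up without any /dev/nvidia* device nodes, confirming that requesting zero GPUs no longer exposes the node's GPUs:

```yaml
# Example verification pod: requests zero GPUs and looks for NVIDIA device nodes.
apiVersion: v1
kind: Pod
metadata:
  name: zero-gpu-check
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # base image ships NVIDIA_VISIBLE_DEVICES=all
      command: ["sh", "-c", "ls /dev/nvidia* 2>/dev/null || echo 'no GPUs visible'"]
      resources:
        limits:
          nvidia.com/gpu: 0
```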
@ktarplee thanks for the clue!
This needs to be set on the host, not inside a container. Here's a link to the details: https://docs.google.com/document/d/1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8/edit
@orkenstein The config file mentioned is installed on every host along with the NVIDIA Container Toolkit / NVIDIA Docker.
@orkenstein Does that mean that you're not using the NVIDIA device plugin to allow GPU usage on GCloud, but using … instead? (Could you provide a link to the …?)
@elezar The drivers get installed like this: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers
@orkenstein GKE does not (currently) use the NVIDIA device plugin or the NVIDIA container toolkit, which means that the suggestion by @ktarplee is not applicable to you.
Ah, okay. What should I do then?
This is unfortunately not something that I can help with. You could try posting your request at https://github.com/GoogleCloudPlatform/container-engine-accelerators/issues (which contains the device plugin used on GKE systems).
Thanks for this solution. However, I'm deploying https://github.com/NVIDIA/gpu-operator to my k3s cluster with a docker backend, using gpu-operator to install the container runtime. Is it possible to inject this configuration into the helm deployment?
@sjdrc Currently it's not possible to set these parameters through the gpu-operator helm deployment, as the toolkit container doesn't support configuring these yet. We will look into adding this support. Meanwhile, these need to be added manually to …
Thanks for your prompt reply. So just to clarify, I should configure …? I do not have that file, but I do have …
Hey, I'm still having issues getting this working.
Adding a bit more information about my setup process (from clean): …
I'm still running into issues. Steps to reproduce: …

Result: …
So it looks like your … Can you show your entire file for …?
Hi there, this is the full file contents: …
After editing, does anything need to be done to reload the file? Running …
So basically it seems that after a reboot I need to add those lines back and re-deploy the gpu-operator helm chart, and then everything works fine. If I reboot without doing that, I run into the above error.
Hi all, so it's already possible to set …?
I have found a solution to this very old issue, after a customer of the managed k8s product I lead complained about the same thing and I read through these years of posts. After digging into the code for a while to see how the config files are read and parsed, I saw these flags: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/toolkit/toolkit.go#L159-L170, which were added in this commit two years ago: NVIDIA/nvidia-container-toolkit@90518e0. So I threw this into the gpu-operator helm values: …
Everything came up green, and I tested pods with no GPU resources specified and with a limit of 0. Without a resource specified or with a limit of 0, the NVIDIA tooling and GPUs are indeed not mounted into the pod. It seems like everything is working as expected, so I wanted to share for anyone else who stumbles on this thread.
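The values block itself did not survive in this thread, but based on the toolkit-container flags referenced above, a configuration of roughly this shape is what is being described. The env var names and value keys below are assumptions; verify them against your gpu-operator and nvidia-container-toolkit versions.

```yaml
# Assumed gpu-operator Helm values; double-check the env var names for your versions.
toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: DEVICE_LIST_STRATEGY   # the device plugin's equivalent of deviceListStrategy
      value: volume-mounts
    - name: PASS_DEVICE_SPECS      # the device plugin's equivalent of compatWithCPUManager
      value: "true"
```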
Here is the document explaining how those are to be used: … We will be adding these instructions to our official docs soon (which is obviously long overdue).
The README.md states: …

I discovered a workaround for this, which is to set the environment variable NVIDIA_VISIBLE_DEVICES to none in the container spec. With a resource request for nvidia.com/gpu: 0, this environment variable should be set automatically.
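For readers landing here, a minimal sketch of that workaround as a pod spec might look like the following (pod name and image are illustrative):

```yaml
# Requests zero GPUs and explicitly sets NVIDIA_VISIBLE_DEVICES=none so the
# NVIDIA container runtime does not expose every GPU on the node.
apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-pod
spec:
  containers:
    - name: app
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "none"
      resources:
        limits:
          nvidia.com/gpu: 0
```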