Requesting zero GPUs allocates all GPUs #61

Closed
dhague opened this issue Jul 10, 2018 · 37 comments

@dhague

dhague commented Jul 10, 2018

The README.md states:

WARNING: if you don't request GPUs when using the device plugin with NVIDIA images all the GPUs on the machine will be exposed inside your container.

I discovered a workaround for this, which is to set the environment variable NVIDIA_VISIBLE_DEVICES to none in the container spec.

With a resource request for nvidia.com/gpu: 0 this environment variable should be set automatically.
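
For illustration, the workaround might look roughly like this in a pod spec (a sketch; the pod name, container name, and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: cpu-only-pod                # placeholder name
spec:
  containers:
    - name: app                     # placeholder name
      image: nvidia/cuda:11.0-base  # example CUDA-based image
      env:
        - name: NVIDIA_VISIBLE_DEVICES  # hide the node's GPUs from this container
          value: "none"
      # note: no nvidia.com/gpu resource request or limit is set here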

@everpeace
Contributor

With a resource request for nvidia.com/gpu: 0 this environment variable should be set automatically.

Currently, the device plugin doesn't have the ability to inject env vars into pods.

However, you can implement this feature with a mutating Admission Webhook: just write a small web server that injects the env var into user pod definitions. I think it's not that difficult. (I actually did the same thing in our cluster.)
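
As an illustration of what such a webhook ends up doing, the mutation boils down to a JSON patch along these lines (shown in YAML for readability; this is a sketch of the approach, not the actual webhook code mentioned above):

# Append NVIDIA_VISIBLE_DEVICES=none to the first container's env.
# This assumes the container already has an env list; if not, the webhook
# would first have to add the list itself.
- op: add
  path: /spec/containers/0/env/-
  value:
    name: NVIDIA_VISIBLE_DEVICES
    value: "none"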

@yukuo78

yukuo78 commented Oct 13, 2018

Does it mean that if I have two containers, both requesting nvidia.com/gpu: 0, they could share the GPU?

@thomasjungblut

@yukuo78 basically yes, this is equivalent to the node-selector trick for sharing GPUs, as described in:
kubernetes/kubernetes#52757 (comment)

Check the follow-ups in that thread for more information.

@Davidrjx

@dhague both nvidia.com/gpu: 0 and the NVIDIA_VISIBLE_DEVICES env var need to be set together as a prerequisite, don't they?
Recently, when only nvidia.com/gpu: 0 is set, the related pod scheduled on a GPU node can fail with "OutOfnvidia.com/gpu"; its status looks like:

status:
  message: 'Pod Node didn''t have enough resource: nvidia.com/gpu, requested: 0, used:
    1, capacity: 0'
  phase: Failed
  reason: OutOfnvidia.com/gpu
  startTime: "2019-05-09T03:05:49Z"

@aebischers

@everpeace could you share your custom Admission Webhook code, especially the part that mutates NVIDIA_VISIBLE_DEVICES?

@Cas-pian

Cas-pian commented Nov 2, 2019

I've tested this. If we add NVIDIA_VISIBLE_DEVICES=none to pod.spec.containers[*].env, then for a pod that requests 1 GPU via Kubernetes container resource requests, the environment list seen when nvidia-container-runtime is executed will be (the order is important):

NVIDIA_VISIBLE_DEVICES=GPU-xxx-xxx-xxx-xxx-xxx
NVIDIA_VISIBLE_DEVICES=none

The nvidia-container-runtime may take the last one to decide which devices to mount, which results in no devices being available in the container, which is not what was expected. This behavior depends on the version of nvidia-container-runtime-hook (recently renamed to nvidia-container-toolkit) you use; please refer to this.

@Davidrjx

Davidrjx commented Nov 5, 2019

@Cas-pian you mean two pods set up respectively with NVIDIA_VISIBLE_DEVICES=none and nvidia.com/gpu: 1 on the same node, is that right?

@Cas-pian

Cas-pian commented Nov 6, 2019

@Davidrjx no, I just found a bug in nvidia-container-runtime-hook (nvidia-container-toolkit): multiple NVIDIA_VISIBLE_DEVICES envs are not handled, which makes GPUs not get mounted as expected.

Step 1: I used a CUDA image that has the env NVIDIA_VISIBLE_DEVICES=all to start a pod (without setting resources.requests for a GPU), and all GPUs were mounted into the container. This makes k8s-device-plugin useless and breaks the environment of pods that do use resources.requests for GPUs.

Step 2: To fix the problem from step 1, I added NVIDIA_VISIBLE_DEVICES=none to pod.spec.containers[*].env to override the default value of NVIDIA_VISIBLE_DEVICES in the image, but then no GPU was mounted into the pod even when I did use resources.requests to request a GPU.

In the end, I think it's not a good design to use the same mechanism (the NVIDIA_VISIBLE_DEVICES env) for both single-node GPU allocation and cluster GPU allocation, because CUDA images are made for single-node usage; it would be better to use different mechanisms (e.g. different envs). @flx42

@Davidrjx

Davidrjx commented Nov 6, 2019

@Cas-pian oh, now I understand what you mean.

@ktarplee

I wrote a Kubernetes Mutating Admission Webhook called gpu-admission-webhook to handle this case. It sets NVIDIA_VISIBLE_DEVICES to "none" if you do not request a GPU. It also deletes environment variables that would cause issues or bypass this constraint.

@XciD

XciD commented Mar 25, 2021

After reading the documentation about NVIDIA_VISIBLE_DEVICES, I advise you to set void instead of none.

From the doc:

nvidia-container-runtime will have the same behavior as runc (i.e. neither GPUs nor capabilities are exposed)
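
In a container spec that would look like the following (same shape as the none workaround above, just with void):

env:
  - name: NVIDIA_VISIBLE_DEVICES   # expose neither GPUs nor driver capabilities
    value: "void"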

@orkenstein

I've tried to set:

        resources:
          limits:
            nvidia.com/gpu: 0

My idea is to have multiple pods on the same node sharing a single GPU. But it looks like in that case the app in the container does not utilise the GPU at all. What am I missing?

@ktarplee

This is no longer an issue if you have the following lines in your /etc/nvidia-container-runtime/config.toml

accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true 

And you deploy the nvidia-device-plugin with the values

compatWithCPUManager: true
deviceListStrategy: volume-mounts

@orkenstein

@ktarplee thanks for the clue!
Regarding /etc/nvidia-container-runtime/config.toml: I have a container built on top of tensorflow/tensorflow:1.14.0-gpu-py3 but see no config.toml there. Where should it be edited?

@klueska
Contributor

klueska commented Jun 28, 2021

Needs to be set on the host, not inside a container.

Here’s a link to the details: https://docs.google.com/document/d/1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8/edit

@elezar
Member

elezar commented Jun 28, 2021

@orkenstein the config file mentioned is installed on every host along with the NVIDIA Container Toolkit / NVIDIA Docker.

@orkenstein

Thanks @klueska @elezar
I'm not sure how to do that on GCloud. Should I tweak nvidia-installer somehow?

@elezar
Member

elezar commented Jun 29, 2021

@orkenstein does that mean that you're not using the NVIDIA Device Plugin to enable GPU usage on GCloud, but something else instead?

(Could you provide a link to the nvidia-installer you mention?)

@orkenstein

@elezar
Member

elezar commented Jun 29, 2021

@orkenstein GKE does not (currently) use the NVIDIA device plugin nor the NVIDIA container toolkit. Which means that the suggestion by @ktarplee is not applicable to you.

@orkenstein

@orkenstein GKE does not (currently) use the NVIDIA device plugin nor the NVIDIA container toolkit. Which means that the suggestion by @ktarplee is not applicable to you.

Ah, okay. What should I do then?

@elezar
Member

elezar commented Jun 30, 2021

This is unfortunately not something that I can help with. You could try posting your request here: https://github.com/GoogleCloudPlatform/container-engine-accelerators/issues (which contains the device plugin used on GKE systems).

@sjdrc

sjdrc commented Nov 23, 2021

This is no longer an issue if you have the following lines in your /etc/nvidia-container-runtime/config.toml

accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true 

And you deploy the nvidia-device-plugin with the values

compatWithCPUManager: true
deviceListStrategy: volume-mounts

Thanks for this solution. However, I'm deploying https://github.com/NVIDIA/gpu-operator to my k3s cluster with a docker backend, using gpu-operator to install the container runtime. Is it possible to inject this configuration into the helm deployment?

@shivamerla
Contributor

@sjdrc Currently it's not possible to set these parameters through the gpu-operator Helm deployment, as the toolkit container doesn't support configuring them yet. We will look into adding this support. Meanwhile, they need to be added manually to the /usr/local/nvidia/toolkit/.config/config.toml file, but the device-plugin settings can be configured through the --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY --set devicePlugin.env[0].value="volume-mounts" parameters during operator install. The compatWithCPUManager setting is already the default in the gpu-operator deployment.
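
For reference, those --set flags correspond to this fragment of the chart values (a sketch; the env var name and value are exactly the ones given above):

devicePlugin:
  env:
    - name: DEVICE_LIST_STRATEGY
      value: volume-mounts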

@sjdrc

sjdrc commented Nov 24, 2021

Thanks for your prompt reply.

So just to clarify, I should configure /usr/local/nvidia/toolkit/.config/config.toml on the host, and by setting volume-mounts, the device plugin will use the host configuration?

I do not have that file, but I do have /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml

@sjdrc

sjdrc commented Dec 3, 2021

Hey, I'm still having issues getting this working.

  1. Should the config changes go into /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml? This file is present on my host, but /usr/local/nvidia/toolkit/.config/config.toml is not.
  2. Which section of the config file do these changes go in? I have a top-level section, [nvidia-container-cli], and [nvidia-container-runtime].
  3. How can I make these persist? Every time I restart k3s the file content gets reverted.

@sjdrc

sjdrc commented Dec 3, 2021

Adding a bit more information about my setup process (from clean)

@shivamerla
Contributor

Hey, I'm still having issues getting this working.

  1. Should the config changes go into /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml? This file is present on my host, but /usr/local/nvidia/toolkit/.config/config.toml is not.

Sorry, /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml is the right location of this file.

  2. Which section of the config file do these changes go in? I have a top-level section, [nvidia-container-cli], and [nvidia-container-runtime].

You need to add those lines as global params.

disable-require = false
accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true 

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]

  3. How can I make these persist? Every time I restart k3s the file content gets reverted.

I think this was because they were not added as global params.

@sjdrc

sjdrc commented Dec 6, 2021

I'm still running into issues.

Steps to reproduce

  1. Install Ubuntu Server 20.04
  2. Install Docker:
curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker
  3. Blacklist nouveau:
cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nvidia-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
  4. Disable AppArmor:
    sudo apt remove --assume-yes --purge apparmor
  5. Install k3s with the --docker flag
  6. helm install --version 1.9.0 --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY --set devicePlugin.env[0].value="volume-mounts"
  7. Add globally to /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml:
accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true
  8. Reboot

Result

nvidia-device-plugin-validator is giving an error and refusing to start:

Error: failed to start container "plugin-validation": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: device error: /var/run/nvidia-container-devices: unknown device: unknown

@klueska
Contributor

klueska commented Dec 9, 2021

So it looks like your k8s-device-plugin settings are working (i.e. the envvars set in your helm chart), but your toolkit configs are not. This is verified by the fact that the "device" the toolkit is seeing is an unknown device with the name /var/run/nvidia-container-devices (which is what the plugin will set NVIDIA_VISIBLE_DEVICES to if it is listing the devices as volume mounts instead of via this envvar).

Can you show your entire file for /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml?

@sjdrc

sjdrc commented Jan 28, 2022

Hi there,

This is the full file contents:

disable-require = false
accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]

After editing, does anything need to be done to reload the file? Running systemctl restart k3s will reset the file, removing the two added lines.

@sjdrc

sjdrc commented Jan 28, 2022

So basically it seems that after a reboot I need to add those lines back and re-deploy the gpu-operator helm chart, and then everything works fine. If I reboot without doing that, I run into the above error.

@elgalu

elgalu commented Apr 4, 2023

Hi all, it has been possible to set volume-mounts for DEVICE_LIST_STRATEGY via the gpu-operator for about two years now (https://github.com/NVIDIA/gpu-operator/blob/f38dc96ac4e74b6f0926b5c497a87878265cf689/deployments/gpu-operator/values.yaml#L220-L221), or am I missing something?

@ozen

ozen commented Nov 22, 2023

@elgalu The issue with the gpu-operator doesn't seem to be the device plugin's DEVICE_LIST_STRATEGY env var, but the toolkit's config file at /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml. See the steps followed by @sjdrc above.

@Apsu

Apsu commented Oct 9, 2024

I have found a solution to this very old issue, after a customer of the managed k8s product I lead complained about the same thing, and I read through these years of posts.

After digging into the code for a while to see how the config files are read and parsed, I saw these options https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/toolkit/toolkit.go#L159-L170, which were added two years ago in this commit: NVIDIA/nvidia-container-toolkit@90518e0

So I threw this into the gpu-operator helm values:

toolkit:
  env:
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
      value: "false"
    - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
      value: "true"
devicePlugin:
  env:
    - name: DEVICE_LIST_STRATEGY
      value: volume-mounts

Everything came up green, and /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml has the right flag values in it.

I tested pods with no gpu resources specified, with a limit of nvidia.com/gpu: 0, multiple with 1 GPU on the same node, and with 1 GPU and privileged: true in the securityContext.

Without a resource specified, or with a limit of 0, the NVIDIA tooling and GPUs are indeed not mounted into the pod: nvidia-smi disappears, and python -c 'import torch; num_of_gpus = torch.cuda.device_count(); print(num_of_gpus);' prints 0 as well. The 1-GPU pods on the same node get different GPUs (nvidia-smi shows unique bus IDs), and the privileged 1-GPU pod can only see the one it was allocated, also with a unique bus ID.

Seems like everything is working as expected, so I wanted to share it for anyone else who stumbles on this thread.
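
For anyone who wants to reproduce the zero-GPU check, a minimal test pod along these lines (a sketch; the name and image are illustrative) should see no GPUs once the settings above are in place:

apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-test                  # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04  # any CUDA image will do
      command: ["sh", "-c", "nvidia-smi || echo 'no GPUs visible'"]
      resources:
        limits:
          nvidia.com/gpu: 0          # explicitly request zero GPUs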

@klueska
Contributor

klueska commented Oct 9, 2024

Here is the document explaining how those are to be used:
https://docs.google.com/document/d/1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8/edit

We will be adding these instructions to our official docs soon (which is obviously long overdue).
