Running Containerd with Nvidia GPU support
Motivation
Whether you’re operating the container runtime directly or using it through a workload manager, such as Kubernetes, containerd is a great choice. It’s faster than Docker and allows running separate runtimes for trusted and untrusted workloads.
Prerequisites
This guide assumes you’ve got hardware (or a VM) with a CUDA-enabled Nvidia graphics card and that you’re running Ubuntu 18.04 Bionic Beaver.
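Before installing anything, it’s worth a quick check that the card is actually visible to the operating system; lspci ships with the pciutils package on Ubuntu.
lspci | grep -i nvidia
If nothing is listed, sort out the hardware or VM passthrough before continuing.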
Installing and Configuring
Install Nvidia drivers
If you’ve not already, install the official Nvidia drivers. The following command installs the stable driver; replace autoinstall with install if you’d prefer a newer version of the driver.
sudo ubuntu-drivers autoinstall
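If you’d rather pick a specific driver, ubuntu-drivers can list the candidates first; the package name below is only an example, substitute whichever version it recommends for your card.
ubuntu-drivers devices
sudo apt install nvidia-driver-390 -y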
If you’re mixing GPU vendors, usually for power efficiency reasons, you may need to set the Nvidia GPU as the default.
sudo prime-select nvidia
Reboot the node.
sudo reboot
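Once the node is back up, a quick sanity check is to run nvidia-smi on the host; it should report the driver version and your GPU.
nvidia-smi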
If you get command not found when executing the ubuntu-drivers command, you’ll need to install it.
sudo apt install ubuntu-drivers-common -y
Install Containerd
At the time of writing, containerd is not available in universe, so we’ll need to add the PPA.
sudo apt-add-repository ppa:mwhudson/devirt
Then install containerd itself.
sudo apt install containerd -y
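Before moving on, you can confirm the daemon is running and that the bundled ctr client can talk to it.
sudo systemctl status containerd
sudo ctr version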
Install Nvidia container runtime
The nvidia-container-runtime is a patched version of runc that adds a custom pre-start hook, which enables GPU support from within the container.
Up-to-date instructions should be available in the upstream nvidia-container-runtime documentation. At the time of writing, these are the steps.
Add the repository.
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$(. /etc/os-release;echo $ID$VERSION_ID)/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
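The subshell in that URL simply expands to your distribution ID and version; if you want to see what it resolves to, run it on its own.
(. /etc/os-release; echo $ID$VERSION_ID)
On Bionic this should print ubuntu18.04.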
Update the cache.
sudo apt update
If you need Kubernetes support, install the full runtime package.
sudo apt install nvidia-container-runtime -y
If you don’t need Kubernetes support, you only need to install the hook package.
sudo apt install nvidia-container-runtime-hook -y
Containerd handles the injection of the pre-start hook, but this is not yet handled by Kubernetes (1.14 at the time of writing).
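Whichever package you installed, nvidia-container-cli (pulled in as a dependency) gives a quick way to confirm the library stack can see the GPU before containerd gets involved.
sudo nvidia-container-cli info
This should print the driver version and the detected GPU; if it fails here, the runtime hook won’t fare any better.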
Configure Containerd to use Nvidia container runtime
First off, if /etc/containerd doesn’t exist, create it.
sudo mkdir /etc/containerd
If you’re operating containerd via a consumer that supports the gpus option, such as the included CLI, ctr, you can stick to the default configuration. This can be generated with the following command.
containerd config default | sudo tee /etc/containerd/config.toml
However, due to this bug, if you want to operate via Kubernetes, you’ll need to change the default runtime from runc to nvidia-container-runtime before starting the kubelet. Be aware that this will break the gpus option’s usage with containerd as well as with consumers such as ctr.
sudo sed -i 's/runtime = "runc"/runtime = "nvidia-container-runtime"/g' /etc/containerd/config.toml
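The exact layout of config.toml varies between containerd versions and the sed above is a blunt instrument, so it’s worth eyeballing the result.
grep 'runtime = ' /etc/containerd/config.toml
The line should now read runtime = "nvidia-container-runtime".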
Then, restart the containerd service.
sudo systemctl restart containerd
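If the service fails to come back up, a malformed config.toml is the usual culprit; the last few log lines will normally say exactly what containerd objected to.
sudo journalctl -u containerd --no-pager -n 20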
Testing with GPU Workloads
Assuming you’ve used the default runtime, runc, we can test that the GPU is accessible from within a container. If you’ve used the nvidia-container-runtime, you need to omit --gpus 0 from the ctr run command below.
Pulling Nvidia container image
First, we need to pull the image from Docker Hub.
sudo ctr images pull docker.io/nvidia/cuda:9.0-base
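The pull can take a little while. Once it finishes, the image should show up in containerd’s image list.
sudo ctr images ls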
Running a container with GPU support
Then, we can run nvidia-smi from within a container.
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi
This should display something like the following.
Wed Apr  3 16:08:22 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   56C    P0    75W / 149W |      0MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
If you see something like…
ctr: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: stat failed: /dev/nvidia-modeset: no such file or directory\\\\n\\\"\"": unknown
…the /dev/nvidia-modeset device node hasn’t been created yet. The following command should fix it.
sudo nvidia-container-cli -k -d /dev/tty info
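That invocation (with -k, i.e. --load-kmods) should load the kernel modules and create the missing device nodes as a side effect. You can confirm they exist, then re-run the ctr command above.
ls -l /dev/nvidia*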