Running Containerd with Nvidia GPU support

PUBLISHED ON 29 MAR 2019 — DEVELOPMENT

Motivation

Whether you’re operating the container runtime directly or using it through a workload manager such as Kubernetes, containerd is a great choice. It’s faster than Docker and allows running separate runtimes for trusted and untrusted workloads.

Prerequisites

This guide assumes you’ve got hardware (or a VM) with a CUDA-enabled Nvidia graphics card and that you’re running Ubuntu 18.04 Bionic Beaver.

Installing and Configuring

Install Nvidia drivers

If you haven’t already, install the official Nvidia drivers. The command below installs the stable driver; replace autoinstall with install if you’d prefer a newer version of the driver.

sudo ubuntu-drivers autoinstall

If you’re mixing GPU vendors, usually for power efficiency reasons, you may need to set the Nvidia GPU as the default.

sudo prime-select nvidia

Reboot the node.

sudo reboot

If the ubuntu-drivers command fails with command not found, install the package that provides it, then run the driver installation again.

sudo apt install ubuntu-drivers-common -y
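
To avoid hitting command not found halfway through, the check can be done up front. The following is a minimal sketch; need_package is a hypothetical helper name, not part of any Ubuntu tooling, and it only reports what to install rather than installing anything.

```shell
#!/bin/sh
# Sketch: report whether a command is on PATH before relying on it.
# need_package is a hypothetical helper, not part of any Ubuntu tooling.
need_package() {
    # $1 = command to look for, $2 = apt package assumed to provide it
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 is already installed"
    else
        echo "$1 not found; install it with: sudo apt install $2 -y"
        return 1
    fi
}
```

For this guide it would be called as need_package ubuntu-drivers ubuntu-drivers-common before running the driver installation.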

Install Containerd

At the time of writing, containerd is not available in universe, so we’ll need to add a PPA that provides it.

sudo apt-add-repository ppa:mwhudson/devirt

Then install containerd itself.

sudo apt install containerd -y

Install Nvidia container runtime

The nvidia-container-runtime is a patched version of runc that adds a custom pre-start hook, which enables GPU support from within the container.

Up-to-date instructions should be available here. At the time of writing, these are the steps.

Add the repository.

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -

curl -s -L https://nvidia.github.io/nvidia-container-runtime/$(. /etc/os-release;echo $ID$VERSION_ID)/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
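
The $(. /etc/os-release;echo $ID$VERSION_ID) part of the second command builds the distribution string that selects the right repository list. The sketch below isolates that step so it can be inspected before fetching anything. distribution_id is a hypothetical helper name, and its optional path argument is an addition for experimenting with a copy of the file.

```shell
#!/bin/sh
# Sketch: derive the distribution string used in the repository URL.
# distribution_id is a hypothetical helper; the optional argument exists
# only so the function can be tried against a copy of the file.
distribution_id() {
    os_release="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables don't leak out.
    (
        . "$os_release"
        echo "${ID}${VERSION_ID}"
    )
}
```

On Ubuntu 18.04, where the file sets ID=ubuntu and VERSION_ID="18.04", calling distribution_id prints ubuntu18.04.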

Update the cache.

sudo apt update

If you need Kubernetes support, install the full runtime package.

sudo apt install nvidia-container-runtime -y

If you don’t need Kubernetes support, you only need to install the hook package.

sudo apt install nvidia-container-runtime-hook -y

Containerd handles the injection of the pre-start hook, but this is not yet handled by Kubernetes (1.14 at the time of writing).

Configure Containerd to use Nvidia container runtime

First off, if /etc/containerd doesn’t exist, create it.

sudo mkdir -p /etc/containerd

If you’re operating containerd via a consumer that supports the gpus option, such as the included CLI, ctr, you can stick to the default configuration. This can be generated with the following command.

containerd config default | sudo tee /etc/containerd/config.toml

However, due to this bug, if you want to operate via Kubernetes, you’ll need to change the default runtime from runc to nvidia-container-runtime before starting the kubelet. Be aware that this breaks the gpus option, both in containerd itself and in consumers such as ctr. The command below edits the configuration file generated in the previous step.

sudo sed -i 's/runtime = "runc"/runtime = "nvidia-container-runtime"/g' /etc/containerd/config.toml
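
Because the substitution silently does nothing when the pattern isn’t found, it can be worth keeping a backup and confirming the result. The following is a sketch of the same edit under those assumptions; set_nvidia_runtime is a hypothetical helper name, and it writes from a backup copy instead of using sed -i.

```shell
#!/bin/sh
# Sketch: switch the runtime in a containerd config and confirm the change.
# set_nvidia_runtime is a hypothetical helper; run it with sufficient
# privileges when pointing it at /etc/containerd/config.toml.
set_nvidia_runtime() {
    config="$1"
    # Keep a backup copy next to the original before rewriting it.
    cp "$config" "$config.bak" || return 1
    # Same substitution as the sed command above, reading from the backup.
    sed 's/runtime = "runc"/runtime = "nvidia-container-runtime"/g' \
        "$config.bak" > "$config" || return 1
    # Succeeds only if the new runtime name is now present in the file.
    grep -q 'runtime = "nvidia-container-runtime"' "$config"
}
```

A non-zero exit status means the file does not contain the new runtime name, in which case the original is still available as config.toml.bak.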

Then, restart the containerd service.

sudo systemctl restart containerd

Testing with GPU Workloads

Assuming you’ve kept the default runtime, runc, we can test that the GPU is accessible from within a container. If you’ve switched to the nvidia-container-runtime, you need to omit --gpus 0 from the ctr run command below.
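
The rule above can be sketched as a small helper that prints the flag only when the configuration has not been switched to the Nvidia runtime. gpu_flag is a hypothetical name, and treating a missing or unreadable config file as the default runc case is an assumption of this sketch.

```shell
#!/bin/sh
# Sketch: print the GPU flag for ctr only when the config still uses runc.
# gpu_flag is a hypothetical helper; the optional argument defaults to the
# config path used in this guide.
gpu_flag() {
    config="${1:-/etc/containerd/config.toml}"
    if grep -q 'runtime = "nvidia-container-runtime"' "$config" 2>/dev/null; then
        # The Nvidia runtime is the default, so --gpus must be omitted.
        echo ""
    else
        # Default runc case (also assumed when the file can't be read).
        echo "--gpus 0"
    fi
}
```

The output is meant to be spliced unquoted into the run command, for example sudo ctr run --rm $(gpu_flag) followed by the image and command arguments used later in this section.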

Pulling Nvidia container image

First, we need to pull the image from Docker Hub.

sudo ctr images pull docker.io/nvidia/cuda:9.0-base

Running a container with GPU support

Then, we can run nvidia-smi from within a container.

sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi

This should display something like the following.

Wed Apr  3 16:08:22 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   56C    P0    75W / 149W |      0MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

If you see something like…

ctr: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: stat failed: /dev/nvidia-modeset: no such file or directory\\\\n\\\"\"": unknown

…the /dev/nvidia-modeset device node hasn’t been created yet. The following command should fix it.

sudo nvidia-container-cli -k -d /dev/tty info
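
To see whether the device nodes are in place before or after running that command, a quick check along these lines may help. This is only a sketch: missing_nvidia_devices is a hypothetical helper, the list of node names is an assumption based on the error above and common driver setups, and the optional directory argument exists only for experimenting.

```shell
#!/bin/sh
# Sketch: list expected Nvidia device nodes that are absent.
# missing_nvidia_devices is a hypothetical helper; the node list is an
# assumption and the optional argument (default /dev) is only for testing.
missing_nvidia_devices() {
    dev_dir="${1:-/dev}"
    for node in nvidiactl nvidia-uvm nvidia-modeset; do
        # Print the full path of every node that does not exist.
        [ -e "$dev_dir/$node" ] || echo "$dev_dir/$node"
    done
}
```

No output means every node in the list exists; any path printed is a node that still needs to be created.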