Using NVIDIA GPUs on Flatcar

    Installation

    Flatcar Container Linux supports installing and customizing NVIDIA drivers for Tesla GPUs, both in VMs and on bare metal. Note that with release 3637.0.0, NVIDIA driver support was extended from AWS and Azure to all platforms; on older versions it is restricted to AWS and Azure only.

    Currently, there are two ways of installing NVIDIA drivers. The first is the built-in nvidia.service, which automatically compiles the drivers from source on first boot. The second is the official NVIDIA drivers sysext, which contains prebuilt drivers and therefore speeds up provisioning. When using Secure Boot, only the prebuilt sysext will work, as its kernel modules are signed.

    We recommend using the prebuilt sysexts; nvidia.service is kept for backwards compatibility.

    nvidia.service method

    During the initial boot, nvidia.service detects the GPU hardware and triggers the driver installation within a dedicated Flatcar developer container. The version of the installed NVIDIA driver can be found in the /usr/share/flatcar/nvidia-metadata file, assuming a vanilla installation where the version hasn't been customized (see below).
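
    For example, to check which driver version your Flatcar release ships by default:

    cat /usr/share/flatcar/nvidia-metadata
    # prints e.g. NVIDIA_DRIVER_VERSION=570.181 (the exact version varies per release)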

    Since the installation is triggered after boot, it takes around 5-10 minutes overall. To check the installation status, use the following command:

    # journalctl -u nvidia -f
    

    To customize the NVIDIA driver version, override the value in the /etc/flatcar/nvidia-metadata file by setting the desired version in the NVIDIA_DRIVER_VERSION variable. Make sure that the chosen driver version is compatible with the GPU hardware present in the instance; for example, older GPUs need the 460 driver series because the latest drivers dropped support for them.

    echo "NVIDIA_DRIVER_VERSION=460.106.00" | sudo tee /etc/flatcar/nvidia-metadata
    sudo systemctl restart nvidia
    
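    After the service restart has rebuilt the driver, you can confirm that the requested version is actually loaded:

    # query the driver version reported by the loaded kernel module
    nvidia-smi --query-gpu=driver_version --format=csv,noheader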

    Butane Config

    variant: flatcar
    version: 1.0.0
    storage:
      files:
        - path: /etc/flatcar/nvidia-metadata
          mode: 0644
          contents:
            inline: |
              NVIDIA_DRIVER_VERSION=460.106.00
    
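    Like any Butane config, this needs to be transpiled into an Ignition config before it is passed to the instance as user data:

    butane < config.yaml > config.json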

    Prebuilt sysext method

    Flatcar provides an official NVIDIA drivers sysext, built with every Flatcar release. Because the kernel modules are built together with the kernel, they are signed with the ephemeral kernel module signing key, which is required for Secure Boot support. During provisioning, the NVIDIA drivers sysext is downloaded and activated. nvidia.service automatically detects that an NVIDIA sysext has already been loaded and skips downloading and building the drivers from source (the version specified in NVIDIA_DRIVER_VERSION is therefore ignored).
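
    With the sysext merged, you can inspect the signature metadata on the module to verify it was signed at build time; a quick check, assuming the nvidia module is available for the running kernel:

    # show signer and signature fields of the nvidia kernel module
    modinfo nvidia | grep -i sig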

    The drivers come in two flavours for the amd64 architecture: open and non-open.

    You can find the latest nvidia-runtime releases at https://extensions.flatcar.org.

    To activate the NVIDIA sysext:

    ---
    # config.yaml
    # butane < config.yaml > config.json
    variant: flatcar
    version: 1.0.0
    
    storage:
      files:
        - path: /etc/flatcar/enabled-sysext.conf
          contents:
            inline: |
              nvidia-drivers-570-open
        - path: /opt/extensions/nvidia-runtime/nvidia-runtime-v1.17.9-x86-64.raw
          mode: 0644
          contents:
            source: https://extensions.flatcar.org/extensions/nvidia-runtime-v1.17.9-x86-64.raw
      links:
        - target: /opt/extensions/nvidia-runtime/nvidia-runtime-v1.17.9-x86-64.raw
          path: /etc/extensions/nvidia-runtime.raw
          hard: false
    
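    After provisioning, you can confirm that the extension images were merged by listing the active system extensions:

    # nvidia-runtime (and the nvidia-drivers flavour from enabled-sysext.conf)
    # should show up as merged extensions
    systemd-sysext status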

    Testing

    Once the installation is complete (either via nvidia.service or sysext), you will have access to various NVIDIA commands. To verify the installation, run the command:

    nvidia-smi
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 570.181                Driver Version: 570.181        CUDA Version: 12.8     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA L40S                    Off |   00000000:05:00.0 Off |                    0 |
    | N/A   31C    P0             63W /  350W |       0MiB /  46068MiB |      4%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
    

    Verify the container workload works

    sudo ctr images pull nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    sudo ctr run --rm --gpus 0 \
        nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 \
        vectoradd
    

    The output of the container should look like this:

    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
    
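    If you use Docker instead of containerd's ctr, an equivalent smoke test looks like the following, assuming the NVIDIA container toolkit is configured as a Docker runtime:

    docker run --rm --gpus all \
        nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0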

    Kubernetes usage

    For Kubernetes usage, the driver and toolkit deployments must be disabled when installing the NVIDIA GPU Operator, as both are already provided by the host:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm install --wait --generate-name \
        -n gpu-operator --create-namespace \
        nvidia/gpu-operator \
        --set driver.enabled=false \
        --set toolkit.enabled=false
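
    Once the operator pods are running, you can schedule a test pod that requests a GPU; a minimal sketch, reusing the vectoradd image from above:

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vectoradd
    spec:
      restartPolicy: OnFailure
      containers:
        - name: vectoradd
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
          resources:
            limits:
              nvidia.com/gpu: 1
    EOF
    # once the pod has completed, its log should end with "Test PASSED"
    kubectl logs cuda-vectoradd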