使用 Linux vfio 将 Nvidia GPU 透传给 QEMU 虚拟机

如果想在 V2EX 获得更好的推广效果，欢迎了解 PRO 会员机制：
https://www.v2ex.com/pro/about

如果你经常使用铜币置顶主题，持有 V2EX Solana Token 会在每日签到时获得额外铜币：
https://www.v2ex.com/solana

This topic created in 1669 days ago, the information mentioned may be changed or developed.

Linux 上虚拟机 GPU 透传需要使用 vfio 的方式。主要是因为在 vfio 方式下对虚拟设备的权限和 DMA 隔离上做的更好。但是这么做也有个缺点，这个物理设备在主机和其他虚拟机都不能使用了。

qemu 直接使用物理设备本身命令行是很简单的，关键在于事先在主机上对系统、内核和物理设备的一些配置。

单纯从 qemu 的命令行来看，其实和普通虚拟机启动就差了最后那个 -device 的选项。这个选项也比较容易理解，就是把主机上的设备 0000:00:01.0 传给了虚拟机使用。

$ qemu-system-x86_64 -m 4096 -smp 4 --enable-kvm \
  -drive file=~/guest/fedora.img \
  -device vfio-pci,host=0000:00:01.0

系统及硬件准备

BIOS 中打开 IOMMU

设备直通在 x86 平台上需要打开 iommu 功能。这是 Intel 虚拟技术 VT-d(Virtualization Technology for Device IO) 中的一个部分。有时候这部分的功能没有被打开。

打开的方式在 BIOS 设置中 Security->Virtualization->VT-d 这个位置。当然不同的 BIOS 位置可能会略有不同。记得在使用直通设备前要将这个选项打开。

内核配置勾选 IOMMU

INTEL_IOMMU
│ Location: │
│ -> Device Drivers │
│ (2) -> IOMMU Hardware Support (IOMMU_SUPPORT [=y])

内核启动参数 enable IOMMU

BIOS 中打开，内核编译选项勾选还不够。还需要在引导程序中添加上内核启动参数

# 对应编辑 /etc/default/grub, 设置 GRUB_CMDLINE_LINUX=
$ cat /etc/default/grub
...
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 rdblacklist=nouveau nouveau.modeset=0"
...
 
# 重新生成 grub 引导配置文件
$ grub2-mkconfig -o /boot/grub2/grub.cfg
 
# 将 vfio 相关 module 设置为开机 load
$ cat /etc/modules-load.d/vfio.conf
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

Setting up IOMMU Kernel parameters

找到 nvidia GPU BusID

record PCI addresses and hardware IDs of the GPU

$ lspci -k | grep -i nvidia -A 3                   
41:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
        Subsystem: Device 1b4c:11bf
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau
41:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
        Subsystem: Device 1b4c:11bf
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
# pci address => 41:00.0,41:00.1
# device id => 1b4c:11bf
 
# 这里找到了两张 nvidia 卡，它们的 device id 都是 1b4c:11bf, 一张是 Audio device
# 这样是不能 passthrough 进去的，因为:
# vfio-pci use your vendor and device id pair to identify which device they need to bind to at boot,
# if you have two GPUs sharing such an ID pair you will not be able to get your passthough driver to bind with just one of them
# 使用下面的脚本解决这种情况：
 
$ cat /usr/bin/vfio-pci-override.sh
#!/bin/sh
 
for i in $(find /sys/devices/pci* -name boot_vga); do
    if [ $(cat "$i") -eq 0 ]; then
        GPU="${i%/boot_vga}"
        AUDIO="$(echo "$GPU" | sed -e "s/0$/1/")"
        echo "vfio-pci" > "$GPU/driver_override"
        if [ -d "$AUDIO" ]; then
            echo "vfio-pci" > "$AUDIO/driver_override"
        fi
    fi
done
 
modprobe -i vfio-pci
 
# 把脚本传入 /etc/modprobe.d/vfio.conf
$ cat /etc/modprobe.d/vfio.conf
install vfio-pci /usr/bin/vfio-pci-override.sh
options vfio-pci ids=10de:1c82 disable_vga=1

使用 vfio 管理 GPU

# /etc/modprobe.d/vfio.conf, ids 为 lspci 找到的 hardware id, 多个设备的话用','分割
$ cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:134d disable_vga=1
 
# 禁用 NVIDIA nouveau 开源驱动, /etc/modprobe.d/blacklist.conf
$ cat /etc/modprobe.d/blacklist.conf
blacklist nouveau
 
# kvm 模块配置, /etc/modprobe.d/kvm.conf
$ cat /etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1

重启系统，启动完成后查看当前的 nvidia GPU 是否被 vfio-pci 模块使用, 确认 IOMMU 功能确实打开。

$ dmesg | grep -e DMAR -e IOMMU | grep enabled
 
# 如果能搜索到
DMAR: IOMMU enabled
# 表示上述配置成功。
 
# 查看 GPU 是否被 vfio-pci 使用
# 另外注意检查看看 41:00.1 Audio device 是否也被 vfio-pci 使用
$ lspci -k | grep -i -e nvidia -A 3
41:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
    Subsystem: Device 1b4c:11bf
    Kernel driver in use: vfio-pci # GTX 1050 Ti GPU 被 vfio-pci 使用
    Kernel modules: nouveau
41:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
    Subsystem: Device 1b4c:11bf
    Kernel driver in use: vfio-pci # 发现 Audio device 也被 vfio-pci 使用了
    Kernel modules: snd_hda_intel
...

# list GPU IOMMU group
$ find /sys/kernel/iommu_groups/ -type l | grep 41:00
/sys/kernel/iommu_groups/27/devices/0000:41:00.0
/sys/kernel/iommu_groups/27/devices/0000:41:00.1
 
# 找到 IOMMU Group 管理的 PCI 设备
#!/bin/bash
shopt -s nullglob
for d in /sys/kernel/iommu_groups/*/devices/*; do
  n=${d#*/iommu_groups/*}; n=${n%%/*}
  printf 'IOMMU Group %s ' "$n"
  lspci -nns "${d##*/}"
done

使用 qemu 透传 nvidia GPU

准备好 centos7 镜像，然后在虚拟机里面安装 nvidia 官方闭源驱动和 cuda SDK

# 我从服务器上拷贝过来的是 vmdk 的镜像，先把它转换成 qcow2 的格式
$ /usr/local/qemu-2.9.0/bin/qemu-img convert -f vmdk -O qcow2 centos-7.3.1611-20180104.vmdk centos-7.3.1611-20180104.qcow2
 
# 使用 qemu 启动，注意-cpu 需要 kvm=off 参数
# kvm=off will hide the kvm hypervisor signature, this is required for NVIDIA cards
# since its driver will refuse to work on an hypervisor and result in Code 43 on windows
$ cat startvm.sh
#!/bin/sh
/usr/local/qemu-2.9.0/bin/qemu-system-x86_64 -enable-kvm \
-m 4096 -cpu host,kvm=off -smp 4,sockets=1,cores=4,threads=1 \
-drive file=./centos-7.3.1611-20180104.qcow2 \
-device vfio-pci,host=41:00.0,multifunction=on,addr=0x16 \
-device vfio-pci,host=41:00.1 \
-net nic,model=e1000 -net user,hostfwd=tcp::5022-:22 \
-vnc :1
 
# 这台虚拟机开了 vnc 和 ssh 端口转发，可以使用 vnc 或者 ssh 访问
 
# 从 host 进入虚拟机
$ ssh 127.0.0.1 -p 5022
 
# 查看虚拟机透传进来的显卡
$ lspci -k | grep -i nvidia -A 3
00:04.0 Audio device: NVIDIA Corporation Device 0fb9 (rev a1)
    Subsystem: Device 1b4c:11bf
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
00:16.0 VGA compatible controller: NVIDIA Corporation GP107 (rev a1)
    Subsystem: Device 1b4c:11bf
    Kernel modules: nouveau

安装 nvidia 驱动和 Cuda

nvidia 驱动需要从官方下载，如果先安装 cuda 的话会一同安装 nvidia 驱动。接下来采用虚拟机先安装驱动再安装 cuda 的步骤。

参考： installing-nvidia-drivers-centos-7 NVIDIA CUDA GETTINGS STARTED GUIDE FOR LINUX

安装 nvidia 驱动

下载地址： http://www.nvidia.com/object/unix.html

# update 后如果更新内核，需要重启
$ yum -y update
 
# 安装 gcc 、make 、glibc 等工具和库
$ yum -y groupinstall "Development Tools"
$ yum -y install kernel-devel
 
# Download the latest NVIDIA driver for unix.
$ wget http://us.download.nvidia.com/XFree86/Linux-x86_64/390.42/NVIDIA-Linux-x86_64-390.42.run
$ yum -y install epel-release
$ yum -y install dkms
 
# Edit /etc/default/grub. Append the following  to “GRUB_CMDLINE_LINUX”
rd.driver.blacklist=nouveau nouveau.modeset=0
 
# Generate a new grub configuration to include the above changes.
$ grub2-mkconfig -o /boot/grub2/grub.cfg
 
# Edit/create /etc/modprobe.d/blacklist.conf and append:
blacklist nouveau
 
# Backup your old initramfs and then build a new one
$ mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
$ dracut /boot/initramfs-$(uname -r).img $(uname -r)
 
# 重启 again
 
# Run the NVIDIA driver installer and enter yes to all options.
$ sh NVIDIA-Linux-x86_64-*.run
 
# 装好后再一次重启，lspci -k 看下 gpu 使用的驱动是否是 nvidia
$ lspci -k | grep -i nvidia -A 3
00:04.0 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
00:16.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
    Kernel driver in use: nvidia # 发现已经使用 nvidia 驱动
    Kernel modules: nouveau, nvidia_drm, nvidia
 
# 执行 nvidia-smi 看下输出和温度
$ nvidia-smi
Thu Mar 15 01:31:09 2018      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.42                 Driver Version: 390.42                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:00:16.0 Off |                  N/A |
| 40%   32C    P0    N/A / 100W |      0MiB /  4040MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
 
$ nvidia-smi -q -d TEMPERATURE
 
==============NVSMI LOG==============
 
Timestamp                           : Thu Mar 15 01:32:42 2018
Driver Version                      : 390.42
 
Attached GPUs                       : 1
GPU 00000000:00:16.0
    Temperature
        GPU Current Temp            : 32 C
        GPU Shutdown Temp           : 102 C
        GPU Slowdown Temp           : 99 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A

安装 cuda

下载地址： https://developer.nvidia.com/cuda-downloads 这里选择 runfile ，以后为了方便也可以选择 rpm(network)的方式，会自动帮我们安装 nvidia 驱动

$ wget https://developer.nvidia.com/compute/cuda/9.1/Prod/local_installers/cuda_9.1.85_387.26_linux
 
# Say no to installing the NVIDIA driver.
# The standalone driver you already installed is typically newer than what is packaged with CUDA.
# Use the default option for all other choices.
$ sh cuda_*.run
 
# 添加 CUDA 相关的环境变量
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
 
# make samples
$ cd ~/NVIDIA_CUDA-9.1_Samples; make -j 4
$ cd bin/x86_64/linux/release
$ ./deviceQuery # 查询 gpu 信息
./deviceQuery Starting...
 
 CUDA Device Query (Runtime API) version (CUDART static linking)
 
Detected 1 CUDA Capable device(s)
 
Device 0: "GeForce GTX 1050 Ti"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 4040 MBytes (4236312576 bytes)
  ( 6) Multiprocessors, (128) CUDA Cores/MP:     768 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             3504 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
...
 
$ ./bandwidtTest # 使用 cuda 测试 gpu bandwidth
Running on...
 
 Device 0: GeForce GTX 1050 Ti
 Quick Mode
 
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            9719.0
 
 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            9215.8
 
 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            95525.1
 
Result = PASS
 
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

作者: 云联壹云李泽玺

GitHub: https://github.com/yunionio/cloudpods

开源地址： https://www.cloudpods.org/

Cloudpods 是一个开源的 Golang 实现的云原生的多云和混合云融合平台。Cloudpods 不仅可以管理本地的虚拟机和物理机资源，还可以管理其他公有云和私有云平台的资源。

No Comments Yet

device nVIDIA GPU vfio-pci