Booting QEMU from SPDK vhost-user-blk-pci with NVMe-oF: A Comprehensive Guide

At revWhiteShadow, we understand the intricate challenges of optimizing virtualized storage. A critical aspect of modern high-performance computing involves leveraging technologies like the Storage Performance Development Kit (SPDK) to deliver exceptional I/O capabilities. When integrating SPDK’s vhost-user-blk-pci with QEMU, especially when the underlying storage is accessed via NVMe-over-Fabrics (NVMe-oF), achieving seamless boot functionality can present unique hurdles. This article walks through booting a QEMU virtual machine directly from an SPDK vhost-user-blk-pci device whose storage backend is served over NVMe-oF. We dissect the configuration steps, troubleshoot common pitfalls, and provide actionable insights so your QEMU instances can leverage this combination for accelerated boot operations.

Understanding the Architecture: SPDK, vhost-user-blk-pci, NVMe-oF, and QEMU

Before we dive into the practical implementation, it’s essential to grasp the roles of each component involved. SPDK provides a suite of user-space libraries and drivers designed for high-performance storage applications. Vhost is a mechanism for offloading virtio device emulation to the host, allowing guest VMs to communicate directly with the host’s kernel or user-space drivers. The vhost-user-blk-pci interface, specifically, is an SPDK-provided user-space implementation of a block device that QEMU can connect to via a socket. This allows SPDK to manage the underlying storage and present it as a virtual PCI block device to the guest.

NVMe-over-Fabrics (NVMe-oF) is a protocol that extends the NVMe command set over a network, enabling efficient access to NVMe SSDs located on a remote fabric. This allows for disaggregated storage, where storage arrays can be accessed by compute nodes over high-speed networks like RoCE (RDMA over Converged Ethernet) or iWARP. SPDK’s NVMe-oF driver plays a crucial role in the host system, allowing it to discover and interact with these remote NVMe targets.

QEMU is a widely used hardware emulator and virtualizer. It acts as the hypervisor, creating and managing virtual machines. When booting a VM, QEMU requires a bootable device to load the operating system. Typically, this is a virtual disk image (e.g., raw, qcow2) presented as a virtual IDE, SCSI, or NVMe controller. In our scenario, we aim to present the SPDK-managed NVMe-oF storage as a bootable block device via the vhost-user-blk-pci interface.

The core challenge arises because, while QEMU can successfully connect to and use a vhost-user-blk-pci device for I/O once the guest has booted, the guest does not boot from it out of the box in this configuration. This usually comes down to how the device is enumerated during the firmware stage of boot and what the guest’s bootloader expects to find.

Setting the Stage: Prerequisites and Initial Setup

Successful implementation hinges on having a properly configured environment. This includes having QEMU and SPDK installed and functional. For this guide, we’ll assume you have the following in place:

  • QEMU: A recent version; this guide assumes QEMU 7.2.15, the version from the original report.
  • SPDK: The SPDK library, compiled with NVMe-oF support; the version used here is SPDK v25.01-pre (git sha1 8d960f1d8).
  • NVMe-oF Target: A functioning NVMe-oF target providing access to a bootable disk image.
  • Hugepages: Configured on the host system, as SPDK relies on hugepages for its memory management (a quick setup example follows this list).
  • Networking: Proper network configuration for NVMe-oF transport (e.g., RoCE, iWARP) between the host and the NVMe-oF target.
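
As a quick reference for the hugepages prerequisite above, the sketch below reserves 2 MB hugepages by hand; adjust the count to your memory budget, or simply run SPDK’s scripts/setup.sh, which performs equivalent steps.

# Reserve 1024 x 2 MB hugepages (2 GB) on the host and make sure the hugetlbfs
# mount point referenced later by QEMU (/dev/hugepages) exists.
echo 1024 > /proc/sys/vm/nr_hugepages
mkdir -p /dev/hugepages
mountpoint -q /dev/hugepages || mount -t hugetlbfs nodev /dev/hugepages
grep Huge /proc/meminfo   # verify HugePages_Total / HugePages_Free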

Step-by-Step Configuration for Booting

Let’s meticulously outline the sequence of commands and configurations required to achieve a bootable vhost-user-blk-pci device backed by NVMe-oF.

1. Initiating the SPDK vhost Daemon

The vhost process is the user-space endpoint that SPDK uses to present storage to QEMU via the vhost-user protocol.

bin/vhost -S /var/tmp -s 1024 -m 0x3 -A 0000:82:00.1
  • -S /var/tmp: Specifies the directory where the vhost-user socket will be created. QEMU will connect to this path.
  • -s 1024: Reserves 1024 MB of hugepage-backed memory for the SPDK/DPDK environment.
  • -m 0x3: Defines the CPU mask for vhost. Here, it’s set to cores 0 and 1. This is crucial for performance as it dedicates specific cores to vhost operations.
  • -A 0000:82:00.1: SPDK’s PCI allow-list option (--pci-allowed), which restricts the local PCI devices the application may claim. It matters for directly attached NVMe controllers, not for NVMe-oF targets, which are attached later through the bdev_nvme_attach_controller RPC. In a purely NVMe-oF-backed setup the flag can usually be omitted, as in the trimmed-down invocation below.
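
Since the NVMe-oF backend is attached later via RPC, a trimmed-down invocation is often sufficient for a fabric-only setup. A minimal sketch, keeping the socket directory, memory size, and core mask from above:

# Start the vhost target without a PCI allow-list; the NVMe-oF connection is
# established afterwards through rpc.py.
bin/vhost -S /var/tmp -s 1024 -m 0x3 &
# Sanity check that the RPC socket is answering:
./rpc.py spdk_get_version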

2. Attaching the NVMe-oF Target to SPDK

Next, we instruct SPDK to connect to the NVMe-oF target and make its storage available as a block device (bdev).

./rpc.py bdev_nvme_attach_controller -t tcp -a 10.0.0.4 -s 4420 -f ipv4 -n nqn.2024-10.placeholder:bd --name placeholder
  • ./rpc.py: The SPDK command-line interface for interacting with the SPDK services.
  • bdev_nvme_attach_controller: The RPC command to attach an NVMe controller.
  • -t tcp: Specifies the NVMe-oF transport. RDMA (transport type rdma, which covers both RoCE and InfiniBand) is usually preferred for performance, but TCP is simpler to set up and useful for troubleshooting; switch to rdma if your network infrastructure supports it.
  • -a 10.0.0.4: The IP address of the NVMe-oF target.
  • -s 4420: The transport service ID, i.e., the TCP port of the target; 4420 is the standard NVMe-oF port.
  • -f ipv4: Specifies the IP address family.
  • -n nqn.2024-10.placeholder:bd: The NVMe Qualified Name (NQN) of the target subsystem to connect to; it identifies the storage subsystem exported by the NVMe-oF target.
  • --name placeholder: The base name for this controller within SPDK. SPDK typically names the resulting block devices <name>n<namespace-id> (e.g., placeholdern1 for the first namespace), and that bdev name is what subsequent RPCs must reference.

Verification Step: After running this command, verify that the NVMe-oF namespace is visible to SPDK by listing the block devices with ./rpc.py bdev_get_bdevs, or the attached controllers with ./rpc.py bdev_nvme_get_controllers. A bdev derived from the placeholder name should appear in the output.
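
A minimal verification sketch using these two RPCs; the only assumption is the placeholder controller name chosen above:

# List all bdevs and pick out the one created by the attach call; the name is
# typically the controller name with an 'n<nsid>' suffix (e.g., placeholdern1).
./rpc.py bdev_get_bdevs | grep -E '"name"|"block_size"|"num_blocks"'
# Controller-level view (transport, address, state):
./rpc.py bdev_nvme_get_controllers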

3. Creating the vhost-user-blk-pci Controller

Now, we create the vhost-user block controller, which will expose the SPDK-managed NVMe-oF block device to QEMU.

./rpc.py vhost_create_blk_controller --cpumask 0x1 vhost.0 placeholder
  • vhost_create_blk_controller: The RPC command to create a vhost-user block controller.
  • --cpumask 0x1: Specifies the CPU cores that will handle the I/O for this specific vhost controller. Here, it’s core 0. This ensures efficient data transfer and processing, avoiding contention with other CPU-intensive tasks.
  • vhost.0: The name of the vhost-user controller; SPDK creates the corresponding socket under the directory passed to the vhost daemon’s -S option (here /var/tmp/vhost.0), and that path is what QEMU connects to.
  • placeholder: The name of the SPDK block device (the NVMe-oF bdev attached earlier) that this controller exposes. Use the exact bdev name reported by bdev_get_bdevs.
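
With the controller created, you can confirm both the SPDK-side object and the socket QEMU will connect to. A small sketch:

# List the vhost controllers known to SPDK (name, cpumask, backing bdev).
./rpc.py vhost_get_controllers
# The vhost-user socket that QEMU connects to should now exist.
ls -l /var/tmp/vhost.0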

4. Launching QEMU with the vhost-user-blk-pci Device

This is the critical step where we instruct QEMU to use the vhost-user-blk-pci device. The challenge in booting lies in ensuring the guest’s firmware (BIOS/UEFI) can detect and initialize this device as a bootable medium.

taskset -c 2,3 qemu-system-x86_64 \
    -enable-kvm \
    -m 1G \
    -smp 8 \
    -nographic \
    -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem0 \
    -chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.0,reconnect=1 \
    -device vhost-user-blk-pci,chardev=spdk_vhost_blk0,bootindex=1,num-queues=2
  • taskset -c 2,3: Pins the QEMU process to CPU cores 2 and 3 for dedicated performance.
  • -enable-kvm: Enables hardware virtualization acceleration.
  • -m 1G: Allocates 1GB of RAM to the VM.
  • -smp 8: Configures 8 virtual CPUs for the VM.
  • -nographic: Disables graphical output, useful for server environments.
  • -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on: Configures memory for the VM using hugepages, which is standard for SPDK-accelerated VMs.
  • -numa node,memdev=mem0: Configures NUMA settings.
  • -chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.0,reconnect=1: Defines the character device that QEMU will use to communicate with the vhost-user socket. id=spdk_vhost_blk0 is a unique identifier for this character device.
  • -device vhost-user-blk-pci,chardev=spdk_vhost_blk0,bootindex=1,num-queues=2: This is the crucial line for attaching the block device.
    • vhost-user-blk-pci: Specifies the device model QEMU should instantiate.
    • chardev=spdk_vhost_blk0: Links this device to the previously defined character device.
    • bootindex=1: Signals that this device should be offered to the guest firmware as a boot device. Lower values have higher priority, and devices without a bootindex are considered only after those that have one; since no other device on this command line sets a bootindex, this device becomes the firmware’s first boot choice.
    • num-queues=2: Configures the number of I/O queues for the virtual device.

The Bootability Problem: Despite setting bootindex=1, the guest may still fail to boot from the device. From the guest’s point of view, a vhost-user-blk-pci device looks like a standard virtio-blk PCI device, so the firmware (SeaBIOS or OVMF) has to detect it during its PCI scan and drive it with its built-in virtio-blk code before any OS driver is available. If that detection or initialization fails, or the device never makes it into the firmware’s boot list, the bootloader on the disk is never read, even though the very same device works fine for I/O once the guest OS has booted and loaded its own virtio drivers.
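
A quick way to confirm how the guest sees the device is to attach it as a secondary (non-boot) disk to an already-running guest and inspect it there. A minimal sketch, assuming a Linux guest:

# Inside the guest: the vhost-user-blk-pci device should enumerate as a
# virtio block device (PCI vendor 0x1af4) ...
lspci -nn | grep -i virtio
# ... and as an ordinary block disk, typically /dev/vda (or vdb if another
# virtio disk is present).
lsblk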

Troubleshooting and Advanced Techniques for Boot Success

Since the firmware does not automatically boot from the device, we need to explore ways to make the vhost-user-blk-pci device discoverable and bootable by the guest firmware.

1. Ensuring the Guest OS Supports vhost-user-blk-pci for Boot

The guest operating system itself needs drivers for the device, but here the requirement is modest: from inside the guest, vhost-user-blk-pci is an ordinary virtio-blk disk, and virtually every modern distribution kernel ships the virtio_pci and virtio_blk drivers. What matters for booting is that these drivers are available in the guest’s initramfs (they normally are for images installed on virtio disks).

  • Kernel Configuration: If you build a custom kernel for your VM image, ensure CONFIG_VIRTIO_PCI and CONFIG_VIRTIO_BLK are enabled (built in, or available as modules in the initramfs).
  • UEFI Boot: For UEFI booting, ensure that the firmware used by QEMU (e.g., OVMF) includes virtio-blk support so the device can be detected during the boot process; stock OVMF builds ship the necessary virtio drivers.

2. The Role of bootindex and Device Ordering

The bootindex parameter is critical. QEMU gathers all bootindex values and hands the guest firmware an ordered boot list, lower values first; devices without a bootindex are only tried after every device that has one. Setting bootindex=1 therefore puts the vhost-user-blk-pci device at the front of the list, but the firmware still has to be able to detect and drive it.
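
One way to see what the firmware itself detects, before drawing conclusions about bootindex, is to enable the interactive boot menu. A minimal sketch of the extra option, appended to the QEMU command line used above:

    # Show the firmware boot menu for ~5 seconds at startup; if the vhost-user
    # disk is not listed there, the problem is at the firmware level rather
    # than inside the guest OS.
    -boot menu=on,splash-time=5000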

What we’ve checked aligns with common issues:

  • Mounting Works: The fact that you can mount the NVMe-oF disk in the VM when it is provided as an additional (non-boot) drive confirms that the underlying vhost-user-blk-pci device is functioning correctly for I/O: the guest can see and access the block data.
  • Local NVMe-oF Boots: The success when providing the image locally via the host-kernel NVMe-oF driver is telling. It indicates that the OS image itself is bootable and the issue is specifically with how the vhost-user-blk-pci device is presented for boot by QEMU.

The fundamental problem, then, sits at the firmware stage: the guest firmware scans the PCI bus, identifies storage controllers it knows how to drive, and reads the bootloader from one of them before any OS driver is loaded. If any of those steps fails for the vhost-user-blk-pci device, the VM falls through to the firmware’s “no bootable device” path even though the disk is perfectly usable later in the boot.

3. Possible Solutions and Investigations

  • Device Emulation Parameters: Explore if the vhost-user-blk-pci device model in QEMU has any additional parameters that can influence its PCI enumeration or declare it as a bootable device. This is often not explicitly documented but might be discoverable through QEMU source code or community discussions. Unfortunately, as of current QEMU versions, there isn’t a direct is-bootable flag for this specific device type.

  • Alternative QEMU Devices for Boot:

    • NVMe Emulation: QEMU has its own NVMe controller emulation (-device nvme,drive=<drive-id>,serial=<serial>), which both SeaBIOS and OVMF can boot from. It requires a backing drive that QEMU itself can open, so it cannot sit on top of the vhost-user socket; a practical variant is to let the host kernel’s NVMe-oF initiator provide the block device, as sketched near the end of this article.
    • vhost-user-scsi as an Alternative vhost Path: SPDK’s vhost target can also expose the same bdev through a vhost-user-scsi-pci controller (vhost_create_scsi_controller followed by vhost_scsi_controller_add_target), which the guest sees as a virtio-scsi disk. Firmware handles virtio-scsi and virtio-blk through different drivers, so this can be worth trying if the blk variant refuses to boot. However, the goal here is to boot directly from the vhost-user-blk-pci.
  • The “Floppy Disk” Trick (Less Likely for NVMe-oF): In some very specific scenarios with legacy devices, people have tried to attach a dummy floppy or CD-ROM to get the initial bootloader in, which then loads drivers to access the “real” storage. This is generally not applicable for modern NVMe-oF boot.

  • Guest Bootloader Configuration: If the device is recognized as any block device by the guest, but not specifically for booting, you might need to manually configure the guest’s bootloader. This usually involves:

    • Booting the VM with a rescue CD/ISO.
    • Mounting the root filesystem from your NVMe-oF target.
    • Installing or configuring GRUB (for Linux) or the UEFI boot manager so that it references the vhost-user-blk-pci disk. Inside the guest this disk is just another virtio-blk device (typically /dev/vda), so the procedure is the same as for any virtio disk; a sketch follows this list.
  • Checking SPDK and QEMU Development Mailing Lists/Issues: The most direct way to find solutions for this specific bootability challenge is to search the SPDK and QEMU developer mailing lists and issue trackers. This problem, of making a custom user-space device bootable, is a common point of discussion. Look for issues related to “UEFI boot vhost-user-blk-pci” or “boot from NVMe SPDK QEMU”.
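
For the bootloader route mentioned above, the procedure inside a rescue environment is the same as for any virtio disk. A hedged sketch, assuming a BIOS/GRUB Linux guest whose root filesystem sits on /dev/vda1; UEFI guests would instead reinstall GRUB with --target=x86_64-efi and adjust the boot entries with efibootmgr:

# Booted from a rescue ISO in the same VM, with the vhost-user-blk disk attached:
mount /dev/vda1 /mnt
for d in dev proc sys; do mount --bind /$d /mnt/$d; done
chroot /mnt grub-install /dev/vda
chroot /mnt update-grub   # Debian/Ubuntu; grub2-mkconfig -o /boot/grub2/grub.cfg on RPM-based guests
umount -R /mnt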

4. Direct NVMe-oF Integration in QEMU (A Conceptual Alternative)

While your setup uses vhost-user-blk-pci, it’s worth noting that QEMU has a native NVMe driver (-device nvme). The question is whether SPDK’s NVMe-oF stack could be integrated more directly into QEMU itself, perhaps by modifying QEMU to accept an NVMe-oF target directly, bypassing the vhost-user layer for the boot device. This is a more complex architectural change.

5. The Kernel’s Role in Boot Device Detection

When a system boots, the firmware (BIOS or UEFI) performs a PCI scan and looks for devices whose class codes identify them as storage controllers it can drive with its built-in drivers. The vhost-user-blk-pci device advertises itself as a virtio block device, so whether it is bootable depends on the firmware’s virtio-blk support rather than on any NVMe-specific logic; there is no NVMe controller visible to the guest in this setup.

If the guest OS enumerates the vhost-user-blk-pci device as a block device (which it does, since you can mount it), then the bootloader can in principle be configured to use it. The catch is that the bootloader itself is loaded by the firmware, which only considers devices it detected and initialized before any OS driver runs.

6. Specific QEMU VM Configuration Tweaks for Bootability

Consider how QEMU presents the PCI topology.

  • PCI Slot Allocation: Ensure the vhost-user-blk-pci device is assigned a PCI slot that the guest firmware is known to scan for boot devices. This is usually handled automatically by QEMU, but sometimes specific controller types might prefer certain bus/slot/function combinations.
  • Firmware Type: If you are using SeaBIOS, it has specific ways of enumerating devices. If you are using OVMF (UEFI), it might expect different PCI configurations. The problem might be more pronounced with older SeaBIOS versions than with OVMF.
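
If you want to rule SeaBIOS out, the same VM can be started under OVMF. A hedged sketch: the firmware paths below are typical for Debian/Ubuntu packages and may differ on your distribution, and OVMF_VARS.fd should be a per-VM writable copy:

cp /usr/share/OVMF/OVMF_VARS.fd /var/tmp/ovmf_vars.fd
taskset -c 2,3 qemu-system-x86_64 \
    -enable-kvm -m 1G -smp 8 -nographic \
    -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd \
    -drive if=pflash,format=raw,file=/var/tmp/ovmf_vars.fd \
    -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem0 \
    -chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.0,reconnect=1 \
    -device vhost-user-blk-pci,chardev=spdk_vhost_blk0,bootindex=1,num-queues=2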

7. Using a Multi-Stage Boot Process

A common workaround when a primary boot device is not recognized is to use a minimal initial boot environment.

  • Stage 1: Boot a small kernel and initrd first, either passed to QEMU directly with -kernel/-initrd or stored on a small conventional boot disk. The initrd only needs the guest-side virtio drivers (virtio_pci, virtio_blk); from inside the guest, the vhost-user-blk-pci device is just a virtio-blk disk (typically /dev/vda). The initrd mounts the real root filesystem from that disk and pivots into it. All SPDK components (the vhost daemon, bdev_nvme_attach_controller, vhost_create_blk_controller) keep running on the host exactly as configured above; the guest never speaks NVMe-oF itself.
  • How to achieve this:
    1. Build a small initrd that contains busybox plus the virtio_pci and virtio_blk kernel modules (or use a guest kernel with them built in).
    2. Either pass the kernel and initrd to QEMU with -kernel/-initrd, or place them on a small bootable image (e.g., a qcow2 with GRUB) that QEMU boots from.
    3. The initrd’s init script would:
      • Load virtio_pci and virtio_blk and wait for /dev/vda to appear.
      • Mount the root filesystem from the vhost-user-blk-pci disk.
      • Execute switch_root (or pivot_root) to hand control to the full OS on the NVMe-oF-backed device.

This multi-stage boot process effectively bypasses the firmware’s direct boot device detection limitations by using a software-based method to reach the desired storage.
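
If the guest kernel and initrd can be extracted from the image (or built separately), the simplest variant of this idea is QEMU’s direct kernel boot, which sidesteps the firmware’s boot-device selection entirely. A minimal sketch, assuming hypothetical file names vmlinuz-guest and initrd-guest.img and a root filesystem on the first partition of the vhost-user-blk disk:

taskset -c 2,3 qemu-system-x86_64 \
    -enable-kvm -m 1G -smp 8 -nographic \
    -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem0 \
    -chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.0,reconnect=1 \
    -device vhost-user-blk-pci,chardev=spdk_vhost_blk0,num-queues=2 \
    -kernel vmlinuz-guest \
    -initrd initrd-guest.img \
    -append "console=ttyS0 root=/dev/vda1 rw"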

Review of Your Specific Configuration and Potential Refinements

Let’s re-examine your provided configuration in light of these considerations.

# Starting vhost
bin/vhost -S /var/tmp -s 1024 -m 0x3 -A 0000:82:00.1
# This starts the vhost-user target and its RPC socket. The -A (--pci-allowed) flag only restricts which local PCI devices SPDK may claim; the NVMe-oF backend is attached via RPC below.

# Connecting to NVMe-oF and creating bdev
./rpc.py bdev_nvme_attach_controller -t tcp -a 10.0.0.4 -s 4420 -f ipv4 -n nqn.2024-10.placeholder:bd --name placeholder
# This correctly exposes your NVMe-oF target as an SPDK bdev named 'placeholder'.

./rpc.py vhost_create_blk_controller --cpumask 0x1 vhost.0 placeholder
# This makes the 'placeholder' bdev available via the vhost.0 socket as a vhost-user block device.

# Launching QEMU
taskset -c 2,3 qemu-system-x86_64 \
    -enable-kvm \
    -m 1G \
    -smp 8 \
    -nographic \
    -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem0 \
    -chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.0,reconnect=1 \
    -device vhost-user-blk-pci,chardev=spdk_vhost_blk0,bootindex=1,num-queues=2
# This QEMU command attaches the vhost-user block device. The critical part is bootindex=1.

The bootindex=1 tells QEMU that this is a potential boot device. However, for it to be actually bootable, the guest firmware must detect it.

A Subtle but Important Point: QEMU NVMe Driver vs. vhost-user-blk-pci

You mentioned QEMU has an NVMe driver. QEMU’s emulated NVMe controller (-device nvme) presents a real NVMe controller to the guest and is backed by an ordinary QEMU block drive. The vhost-user-blk-pci device is different: to the guest it is a virtio-blk PCI device, while the data path is serviced by the SPDK vhost target over the vhost-user protocol. Bootability with the latter therefore hinges on the firmware’s virtio-blk handling rather than on any NVMe awareness.

If the goal is simply to boot from NVMe-oF storage and the vhost-user-blk-pci approach is not yielding boot capability, a pragmatic alternative is to let the host kernel’s NVMe-oF initiator connect to the target and hand the resulting block device to QEMU’s native NVMe (or virtio-blk) emulation. This bypasses the vhost-user-blk-pci interface and SPDK’s user-space data path for that disk, trading some performance for a well-trodden boot path.
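
For completeness, the sketch below shows the host-kernel route your tests already proved bootable: the host’s nvme-tcp initiator connects to the target, and QEMU’s native NVMe controller is backed by the resulting block device. The device node /dev/nvme0n1 and the serial string are assumptions; check nvme list for the actual name.

# Host side: connect with the kernel initiator (nvme-cli).
modprobe nvme-tcp
nvme connect -t tcp -a 10.0.0.4 -s 4420 -n nqn.2024-10.placeholder:bd
nvme list   # note the device node; assumed to be /dev/nvme0n1 below

# Boot QEMU from it via the emulated NVMe controller (serial= is mandatory).
qemu-system-x86_64 -enable-kvm -m 1G -smp 8 -nographic \
    -drive file=/dev/nvme0n1,format=raw,if=none,id=nvof0,cache=none \
    -device nvme,drive=nvof0,serial=nvof-boot,bootindex=1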

Final Considerations for Success

The key takeaway is that while SPDK’s vhost-user-blk-pci is excellent for high-performance I/O after boot, using it as the boot device depends heavily on guest firmware compatibility and on how the device is enumerated during the firmware’s boot scan.

For your specific situation, if the bootindex=1 is not sufficient, the most robust solutions likely involve either:

  1. Guest-level bootloader configuration: Booting with a rescue disk to configure GRUB/UEFI to recognize the vhost-user-blk-pci device.
  2. Multi-stage boot: Using an initrd that contains the necessary SPDK components to access the NVMe-oF target and then transition to the main OS.

Given the complexity and potential for subtle incompatibilities, thorough testing across different guest OS versions and QEMU firmware (SeaBIOS vs. OVMF) is recommended. The SPDK and QEMU community resources remain invaluable for uncovering the latest solutions and known issues around such advanced configurations.