Monday, January 28, 2019

libvirt v4.10 released, providing PCI passthrough support

libvirt v4.10, available for download at the libvirt project website, adds support for PCI passthrough devices on IBM Z (requires Linux kernel 4.14 and QEMU v2.11).
To set up passthrough for a PCI device, follow these steps:
  1. Make sure the vfio-pci module is available, e.g. using the modinfo command:
       $ modinfo vfio-pci
       filename:       /lib/modules/4.18.0/kernel/drivers/vfio/pci/vfio-pci.ko
       description:    VFIO PCI - User Level meta-driver
  2. Verify, using your distro's package manager, that the pciutils package (which provides the lspci command, among others) is installed.
  3. Determine the PCI device's address using the lspci command:
       $ lspci
       0002:00:00.0 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family
                    [ConnectX-3/ConnectX-3 Pro Virtual Function]
     
  4. Add the following element to the guest domain XML's devices section:
       <hostdev mode='subsystem' type='pci' managed='yes'>
         <source>
           <address domain='0x0002' bus='0x06' slot='0x00' function='0x0'/>
         </source>
       </hostdev>

    Note that if the managed attribute is set to no (which is the default), it is the user's responsibility to unbind the PCI device from its device driver and rebind it to vfio-pci in the host before starting the guest.
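Steps 3 and 4 can be tied together with a small shell helper. The following is only a sketch: pci_to_hostdev is a hypothetical function name (not a libvirt or pciutils tool), and it assumes the address is given in lspci's domain:bus:slot.function notation. It prints a hostdev element that can be pasted into the devices section of the domain XML.

```shell
#!/bin/sh
# Hypothetical helper (not part of libvirt): print a <hostdev> element for a
# host PCI address given in lspci notation, i.e. domain:bus:slot.function.
pci_to_hostdev() {
    addr=$1
    # Split "DDDD:BB:SS.F" into its four hexadecimal fields.
    domain=${addr%%:*}
    rest=${addr#*:}
    bus=${rest%%:*}
    rest=${rest#*:}
    slot=${rest%%.*}
    func=${rest#*.}
    cat <<EOF
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x$domain' bus='0x$bus' slot='0x$slot' function='0x$func'/>
  </source>
</hostdev>
EOF
}

# Example call with the address reported by lspci in step 3.
pci_to_hostdev 0002:00:00.0
```

The helper hard-codes managed='yes' so that libvirt takes care of rebinding the device to vfio-pci.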
Once done and the guest is started, running the lspci command in the guest should show the PCI device, and one can proceed to configure it as needed.
It is well worth checking out the expanded domain XML:
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0002' bus='0x06' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0002' bus='0x00' slot='0x01' function='0x0'>
        <zpci uid='0x0001' fid='0x00000000'/>
      </address>
    </hostdev>

Theoretically, the PCI address in the guest can change between boots. However, the <zpci> element guarantees address persistence inside the guest. The actual address of the passthrough device is based solely on the uid attribute: the uid becomes the PCI domain, and all remaining values of the address (PCI bus, slot and function) are set to zero. Therefore, in this example, the PCI address in the guest would be 0001:00:00.0.
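The mapping from uid to guest address described above can be sketched in a few lines of shell. guest_pci_addr is a hypothetical helper name introduced here for illustration; it merely applies the rule stated in the text and does not query libvirt or QEMU.

```shell
#!/bin/sh
# Hypothetical helper: derive the guest-visible PCI address from a zpci uid.
# Per the rule above, the uid becomes the PCI domain, while bus, slot and
# function are all zero.
guest_pci_addr() {
    # $(( )) accepts the 0x-prefixed hexadecimal notation used in the domain XML.
    printf '%04x:00:00.0\n' "$(( $1 ))"
}

# Example: the uid from the expanded domain XML shown earlier.
guest_pci_addr 0x0001
```

For uid='0x0001' this yields 0001:00:00.0, the address that lspci should report inside the guest.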
Take note of the fid attribute, whose value is required to hotplug/hotunplug PCI devices within a guest.
Furthermore, note that the target PCI address is not visible anywhere except within the QEMU process. That is, it is unrelated to the PCI address observed within the KVM guest and could be set to an arbitrary value. However, choosing the "wrong" values might have subtle, undesired side effects in QEMU. We therefore strongly recommend not specifying a target address and relying on auto-assignment instead. If the guest's PCI address has to be chosen, restrict the target address element to at most uid (which defines the PCI address) and fid (so that, e.g., scripts in the guest for hotplugging PCI devices can rely on a specific value), as follows:
   <address type='pci'>
     <zpci uid='0x0001' fid='0x00000000'/>
   </address>


For further (rather technical) details see here and here (git commit).

Monday, December 17, 2018

QEMU v3.1 released

QEMU v3.1 is out. Besides a number of small enhancements, some items that we would like to highlight from a KVM on Z perspective:
  • Huge Pages Support: KVM guests can now utilize 1MB pages. As this removes one layer of address translation for the guest backing, fewer page faults need to be processed, and fewer translation lookaside buffer (TLB) entries are needed to hold translations. This, as well as the TLB improvements in z14, will improve KVM guest performance.
    To use:
    Create the config file /etc/modprobe.d/kvmhpage.conf with the following content to enable huge pages for KVM:

       options kvm hpage=1


    Furthermore, add the following line to /etc/sysctl.conf to reserve N huge pages:

       vm.nr_hugepages = N

    Alternatively, append the following statement to the kernel parameter line in case support is compiled into the kernel: kvm.hpage=1 hugepages=N.
    Note that huge pages can also be added dynamically after boot, but because of effects such as memory fragmentation, it is preferable to define huge pages as early as possible.
    If successful, the file /proc/sys/vm/nr_hugepages should show N huge pages. See here for further documentation.
    Then, to enable huge pages for a guest, add the following element to the respective domain XML:

       <memoryBacking>
         <hugepages/>
       </memoryBacking>


    The use of huge pages in the host is orthogonal to the use of huge pages in the guest. Both will improve the performance independently by reducing the number of page faults and the number of page table walks after a TLB miss.
    The biggest performance improvement can be achieved by using huge pages in both host and guest, e.g. with libhugetlbfs, as this will also make use of the larger 1M TLB entries in the hardware.
    Requires Linux kernel 4.19.
  • vfio-ap: The Adjunct Processor (AP) facility is an IBM Z cryptographic facility comprising three AP instructions and up to 256 cryptographic adapter cards, each of which can be grouped into up to 85 domains, providing cryptographic services. vfio-ap maps a subset of the AP devices/domains to one or more KVM guests, such that the host and each guest have exclusive access to a discrete set of AP devices.
    Here is a small sample script illustrating host setup:

       # load vfio-ap device driver
       modprobe vfio-ap

       # choose a UUID for the mediated device (or generate one with uuidgen)
       UUID=e926839d-a0b4-4f9c-95d0-c9b34190c4ba

       # reserve AP queue 7 on adapter 3 for use by a KVM guest
       echo -0x3 > /sys/bus/ap/apmask
       echo -0x7 > /sys/bus/ap/aqmask

       # create a mediated device (mdev) to provide userspace access
       # to a device in a secure manner
       echo $UUID > /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/create
       # assign adapter, domain and control domain
       echo +0x3 > /sys/devices/vfio_ap/matrix/${UUID}/assign_adapter
       echo +0x7 > /sys/devices/vfio_ap/matrix/${UUID}/assign_domain
       echo +0x7 > /sys/devices/vfio_ap/matrix/${UUID}/assign_control_domain


    To make use of the AP device in a KVM guest, add the following element to the respective domain XML:

       <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-ap'>
         <source>
           <address uuid='e926839d-a0b4-4f9c-95d0-c9b34190c4ba'/>
         </source>
       </hostdev>


    Once complete, use the passthrough device in a KVM guest just like a regular crypto adapter.
    Requires Linux kernel 4.20 and libvirt 4.9.
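To double-check the AP host setup before starting the guest, one could compare the assigned queue against what the mediated device reports. The sketch below rests on two assumptions that are not stated in this post: that the mediated device directory contains a read-only matrix file next to assign_adapter, and that this file lists each queue as two hex digits for the adapter, a dot, and four hex digits for the domain. apqn and mdev_has_queue are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: format an adapter/domain pair as an AP queue number.
# Assumption: queues are written as <2 hex digits>.<4 hex digits>.
apqn() {
    printf '%02x.%04x\n' "$(( $1 ))" "$(( $2 ))"
}

# Hypothetical check: succeed if the given matrix file lists the queue.
# $1 = path to the matrix file, $2 = adapter, $3 = domain
mdev_has_queue() {
    grep -qxF "$(apqn "$2" "$3")" "$1"
}

# Print the queue reserved in the sample script (adapter 3, domain 7).
apqn 0x3 0x7

# On a host prepared as above, one might then run (path assumed, not run here):
#   mdev_has_queue /sys/devices/vfio_ap/matrix/$UUID/matrix 0x3 0x7
```

Passing the file path as an argument keeps the check usable on a copy of the file, e.g. when inspecting a configuration offline.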

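For the huge page setup in the QEMU v3.1 post above, the reservation can be verified with a short script. This is a minimal sketch: have_hugepages is a hypothetical helper name, the value 1024 merely stands in for N, and the check accepts any count of at least N rather than exactly N.

```shell
#!/bin/sh
# Hypothetical helper: succeed if at least $1 huge pages are reserved.
# $2 optionally overrides the file to read; the default is the path
# named in the text.
have_hugepages() {
    wanted=$1
    file=${2:-/proc/sys/vm/nr_hugepages}
    [ -r "$file" ] || return 1
    have=$(cat "$file")
    [ "$have" -ge "$wanted" ]
}

# Example: report whether a reservation of N=1024 pages succeeded
# (1024 is an arbitrary example value).
if have_hugepages 1024; then
    echo "huge pages reserved"
else
    echo "fewer than 1024 huge pages reserved"
fi
```

Running the check early in the boot process, before guests are started, helps catch the fragmentation issue mentioned above.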
Thursday, December 13, 2018

SLES 12 SP4 released

SLES 12 SP4 is out! See the announcement and their release note with Z-specific changes.
It ships the following code levels:
  • Linux kernel 4.12 (SP3: 4.4),
  • QEMU v2.11 (SP3: v2.9), and
  • libvirt v4.0 (SP3: v3.3).
See previous blog entries on QEMU v2.10 and v2.11 for details on new features that become available by the QEMU package update.
See previous blog entries on Linux kernel 4.8 and 4.11 for details on new features becoming available through the kernel update, e.g. nested virtualization support.
An additional feature in this release is the availability of STHYI information in LPAR environments. Requires qclib v1.3 or later. See this blog post for general information on qclib.
Furthermore, note that these changes provide a full CPU model, which protects against live guest migration compatibility problems. E.g. migrating a guest exploiting the latest features to a KVM instance running on an earlier IBM Z machine lacking those features would be detected and prevented.
Note: With this feature, live guest migration back to a KVM instance that does not yet support CPU models (e.g. SLES 12 SP3) will not work anymore.

Friday, October 19, 2018