[Translation] QEMU Internals: Big Picture Overview

Series:

  1. [Translation] QEMU Internals: Overall architecture and threading model
  2. [Translation] QEMU Internals: vhost architecture
  3. [Translation] QEMU Internals: Big picture overview (this post)

Original post: http://blog.vmsplice.net/2011/03/qemu-internals-big-picture-overview.html

Original date: March 9, 2011

About the author: Stefan Hajnoczi works on the virtualization team at Red Hat, developing and maintaining QEMU's block layer, network subsystem, and tracing subsystem.

He currently works on multi-core device emulation in QEMU and host/guest file sharing using vsock; in the past he has worked on disk image formats, storage migration, and I/O performance optimization.

 

QEMU Internals: Big Picture Overview

In the previous post, [Translation] QEMU Internals: Overall architecture and threading model, I dove straight into QEMU's threading model without describing what the top-level architecture looks like.

This post gives that big-picture overview so that the threading model is easier to understand.

The story of a guest

A guest is created by running the qemu program, usually known as qemu-kvm or just kvm. On a host running 3 VMs you will see 3 corresponding qemu processes:

When a guest shuts down, the qemu process exits. For convenience, a reboot does not restart the qemu process, although it would be perfectly fine to shut qemu down and then start it again.

Guest RAM

Guest RAM is allocated when qemu starts up. The -mem-path option allows file-backed memory to be used, for example on hugetlbfs. Either way, the RAM is mapped into the qemu process' address space and, from the guest's point of view, acts as its "physical" memory.
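To make "mapped into the qemu process' address space" concrete, here is a minimal sketch (not QEMU's actual code; the function name and the hugetlbfs path are made up for illustration) of backing guest RAM with either anonymous memory or a file:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate a region that plays the role of guest RAM.  If mem_path is
 * non-NULL it is assumed to name a file on a hugetlbfs mount chosen by
 * the caller (a hypothetical path such as "/dev/hugepages/guest0"). */
static void *alloc_guest_ram(size_t size, const char *mem_path)
{
    void *ram;

    if (mem_path) {
        int fd = open(mem_path, O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, size) < 0) {
            perror("file-backed guest RAM");
            exit(1);
        }
        /* File-backed mapping, comparable in spirit to -mem-path. */
        ram = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    } else {
        /* Default case: a plain anonymous mapping. */
        ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    if (ram == MAP_FAILED) {
        perror("mmap guest RAM");
        exit(1);
    }
    return ram;
}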

QEMU supports both big-endian and little-endian target architectures, so guest memory must be accessed with care from QEMU code. Instead of dereferencing guest RAM directly, endian conversion is performed by helper functions. This makes it possible to run a guest whose byte order differs from the host's.
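A simplified illustration of the idea (not QEMU's real helpers, though its bswap header follows the same pattern): a 32-bit little-endian load from guest RAM is assembled byte by byte, so the result is correct regardless of the host's native byte order.

#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value from guest RAM without relying on
 * the host's native byte order. */
static uint32_t guest_ldl_le(const void *guest_ram, size_t offset)
{
    const uint8_t *p = (const uint8_t *)guest_ram + offset;

    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}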

KVM virtualization

KVM is a virtualization feature in the Linux kernel that lets a program like qemu safely execute guest code directly on the host CPU. This is only possible when the target architecture is supported by the host CPU. Today KVM is available on x86, ARMv8, ppc, s390, and MIPS CPUs.

To execute guest code with KVM, the qemu process opens /dev/kvm and issues the KVM_RUN ioctl. The KVM kernel module uses the hardware virtualization extensions found on Intel and AMD CPUs to run the guest code directly. When the guest accesses a hardware device register, halts the guest CPU, or performs other special operations, KVM exits back to qemu. At that point qemu emulates the desired outcome of the operation, or simply waits for the next guest interrupt in the case of a halted guest CPU.

The basic flow of a guest CPU is as follows (a fuller sketch against the real /dev/kvm interface follows this pseudo-code):
open("/dev/kvm")
ioctl(KVM_CREATE_VM)
ioctl(KVM_CREATE_VCPU)
for (;;) {
ioctl(KVM_RUN)
switch (exit_reason) {
case KVM_EXIT_IO: /* ... */
case KVM_EXIT_HLT: /* ... */
}
}
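For readers who want to see the real interface behind the pseudo-code, below is a hedged, self-contained sketch against /dev/kvm. Error handling and, importantly, the guest memory setup (KVM_SET_USER_MEMORY_REGION) and the guest code itself are omitted, so as written it only shows the shape of the loop rather than running a useful guest:

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    int vm = ioctl(kvm, KVM_CREATE_VM, 0);

    /* Guest RAM would be registered here with KVM_SET_USER_MEMORY_REGION
     * and loaded with guest code; omitted to keep the sketch short. */

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);

    /* The kernel shares a kvm_run structure with userspace via mmap. */
    long mmap_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);            /* execute guest code */
        switch (run->exit_reason) {
        case KVM_EXIT_IO:                   /* guest touched an I/O port */
            /* emulate the device register access here */
            break;
        case KVM_EXIT_HLT:                  /* guest CPU halted */
            return 0;
        default:
            fprintf(stderr, "unhandled exit %d\n", run->exit_reason);
            return 1;
        }
    }
}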

The host's view of a running guest

The host kernel schedules the qemu process like any other process. Multiple guests run alongside each other with no knowledge of one another. Applications like Firefox or Apache also compete with qemu for host resources, although resource controls can be used to isolate qemu and give it higher priority.

qemu emulates a complete virtual machine inside a userspace process and has no knowledge of which processes run inside the guest. In other words, qemu provides the guest's RAM, the ability to execute guest code, and emulated hardware devices, so any kind of operating system can run inside the guest. The host cannot "peek" inside an arbitrary guest.

A guest has one vcpu thread per virtual CPU. A dedicated iothread runs a select event loop to process I/O such as network packets and disk requests. The details were discussed in the previous post, [Translation] QEMU Internals: Overall architecture and threading model.
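As a rough picture of what that iothread does (a generic sketch, not QEMU's actual main loop, which also handles timers, bottom halves, and more):

#include <sys/select.h>

/* Generic event-loop skeleton: wait for any registered fd to become
 * readable, then call its handler (e.g. read a packet from a tap device
 * or complete a disk request). */
void event_loop(int *fds, void (**handlers)(int), int nfds)
{
    for (;;) {
        fd_set rfds;
        int i, maxfd = -1;

        FD_ZERO(&rfds);
        for (i = 0; i < nfds; i++) {
            FD_SET(fds[i], &rfds);
            if (fds[i] > maxfd) {
                maxfd = fds[i];
            }
        }
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) <= 0) {
            continue;
        }
        for (i = 0; i < nfds; i++) {
            if (FD_ISSET(fds[i], &rfds)) {
                handlers[i](fds[i]);
            }
        }
    }
}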

The following diagram illustrates the qemu process as seen from the host:

Further information

I hope this post has helped you become familiar with the QEMU and KVM architecture.
Here are two presentations on KVM architecture that go into more detail:
Jan Kiszka's Linux Kongress 2010: Architecture of the Kernel-based Virtual Machine (KVM)
My own KVM Architecture Overview

 

 

=== Selected comments ===
-> Which thread opens the /dev/kvm device, the vcpu thread or the iothread?
It is opened by the main thread when qemu starts up. Note that each vcpu has its own file descriptor; /dev/kvm is the VM-global file descriptor and does not belong to any particular vcpu.

-> How does a vcpu thread give the illusion of a CPU? Does it have to maintain CPU context across context switches, or does it rely on hardware virtualization support?
Yes, hardware support is required. The kvm module's ioctl interface allows the vcpu register state to be manipulated (see the sketch below). Just as a physical CPU has an initial register state at power-on, QEMU initializes the vcpu at reset.
KVM depends on hardware support, called VMX on Intel CPUs and SVM on AMD CPUs. The two are not compatible, so kvm contains separate code to support each.
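To illustrate "manipulating vcpu register state through ioctls", here is a hedged x86 sketch using the KVM_GET_REGS/KVM_SET_REGS calls; the function name and the idea of pointing RIP at a reset vector are only for illustration, not QEMU's reset code:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* vcpu_fd is the file descriptor returned by KVM_CREATE_VCPU.
 * reset_rip is wherever the machine's reset vector is mapped. */
int reset_vcpu_rip(int vcpu_fd, unsigned long long reset_rip)
{
    struct kvm_regs regs;

    if (ioctl(vcpu_fd, KVM_GET_REGS, &regs) < 0) {
        return -1;
    }
    regs.rip = reset_rip;       /* start executing at the reset vector */
    regs.rflags = 0x2;          /* x86 requires bit 1 of RFLAGS to be set */
    return ioctl(vcpu_fd, KVM_SET_REGS, &regs);
}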

-> How can the hypervisor and the guest interact directly?
It depends on what kind of interaction you need; virtio-serial can be used for arbitrary guest/host communication.
The qemu guest agent is built on top of virtio-serial and exposes a JSON RPC API.
It lets the host invoke a set of commands inside the guest (for example querying IP addresses or quiescing applications for backup).

-> Can you explain the flow of an I/O operation issued by an application inside the guest?
There are several different code paths depending on KVM vs TCG, MMIO vs PIO, and whether ioeventfd is enabled.
The basic flow is that a guest memory or I/O access traps into the kvm kernel module; kvm returns from the KVM_RUN ioctl and hands the exit information to QEMU for processing.
QEMU finds the emulated device responsible for that address and calls its handler, as sketched below. When the device has finished processing, QEMU re-enters the guest using the KVM_RUN ioctl.
If passthrough is not used, the guest sees an emulated NIC, and kvm and qemu never touch the physical NIC directly; they hand packets to the Linux network stack (for example via a tap device) for processing.
virtio provides network, storage, and other virtual devices; it applies some optimizations, but the basic principle is still emulating a "real" device.
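The "find the emulated device responsible for that address" step can be pictured with a deliberately simplified dispatch table. QEMU's real memory API is built around MemoryRegion objects, so treat this only as an illustration of the idea:

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t base;                                  /* guest physical base */
    uint64_t size;                                  /* length of the region */
    void (*write)(void *dev, uint64_t addr, uint64_t val, unsigned len);
    void *dev;                                      /* device state */
} MmioRegion;

/* Called after an MMIO exit: find the emulated device that owns the
 * address and let it handle the store, then the vcpu re-enters the guest. */
static void dispatch_mmio_write(MmioRegion *regions, size_t n,
                                uint64_t addr, uint64_t val, unsigned len)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= regions[i].base &&
            addr < regions[i].base + regions[i].size) {
            regions[i].write(regions[i].dev, addr - regions[i].base, val, len);
            return;
        }
    }
    /* unassigned address: real QEMU logs or ignores the access */
}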

-> How does the kvm kernel module catch a virtqueue kick?
The ioeventfd is registered as a device on KVM's I/O bus (see the registration sketch after the trace), and kvm signals it via ioeventfd_write().
The trace is:
vmx_handle_exit with EXIT_REASON_IO_INSTRUCTION
--> handle_io
--> emulate_instruction
--> x86_emulate_instruction
--> x86_emulate_insn
--> writeback
--> segmented_write
--> emulator_write_emulated
--> emulator_read_write_onepage
--> vcpu_mmio_write
--> ioeventfd_write
This shows how the ioeventfd gets signalled when the guest kicks the host.
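On the userspace side, the registration uses the KVM_IOEVENTFD ioctl: an eventfd plus the doorbell address are handed to KVM, so a guest write to that address merely signals the eventfd instead of requiring a full exit to QEMU. A hedged sketch (the function name is made up; the ioctl and struct are the KVM API):

#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

/* vm_fd comes from KVM_CREATE_VM; addr/len describe the virtqueue
 * notification (doorbell) register.  Returns the eventfd on success. */
int register_doorbell(int vm_fd, uint64_t addr, uint32_t len, int is_pio)
{
    struct kvm_ioeventfd cfg;
    int efd = eventfd(0, 0);

    memset(&cfg, 0, sizeof(cfg));
    cfg.addr  = addr;
    cfg.len   = len;
    cfg.fd    = efd;
    cfg.flags = is_pio ? KVM_IOEVENTFD_FLAG_PIO : 0;

    if (efd < 0 || ioctl(vm_fd, KVM_IOEVENTFD, &cfg) < 0) {
        return -1;
    }
    return efd;   /* QEMU or the vhost driver polls this fd for guest kicks */
}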

 

 

The original text follows:

QEMU Internals: Big picture overview

Last week I started the QEMU Internals series to share knowledge of how QEMU works. I dove straight in to the threading model without a high-level overview. I want to go back and provide the big picture so that the details of the threading model can be understood more easily.

The story of a guest

A guest is created by running the qemu program, also known as qemu-kvm or just kvm. On a host that is running 3 virtual machines there are 3 qemu processes:


When a guest shuts down the qemu process exits. Reboot can be performed without restarting the qemu process for convenience although it would be fine to shut down and then start qemu again.

Guest RAM

Guest RAM is simply allocated when qemu starts up. It is possible to pass in file-backed memory with -mem-path such that hugetlbfs can be used. Either way, the RAM is mapped in to the qemu process' address space and acts as the "physical" memory as seen by the guest:


QEMU supports both big-endian and little-endian target architectures so guest memory needs to be accessed with care from QEMU code. Endian conversion is performed by helper functions instead of accessing guest RAM directly. This makes it possible to run a target with a different endianness from the host.

KVM virtualization

KVM is a virtualization feature in the Linux kernel that lets a program like qemu safely execute guest code directly on the host CPU. This is only possible when the target architecture is supported by the host CPU. Today KVM is available on x86, ARMv8, ppc, s390, and MIPS CPUs.
In order to execute guest code using KVM, the qemu process opens /dev/kvm and issues the KVM_RUN ioctl. The KVM kernel module uses hardware virtualization extensions found on modern Intel and AMD CPUs to directly execute guest code. When the guest accesses a hardware device register, halts the guest CPU, or performs other special operations, KVM exits back to qemu. At that point qemu can emulate the desired outcome of the operation or simply wait for the next guest interrupt in the case of a halted guest CPU.
The basic flow of a guest CPU is as follows:
open("/dev/kvm")
ioctl(KVM_CREATE_VM)
ioctl(KVM_CREATE_VCPU)
for (;;) {
     ioctl(KVM_RUN)
     switch (exit_reason) {
     case KVM_EXIT_IO:  /* ... */
     case KVM_EXIT_HLT: /* ... */
     }
}

The host's view of a running guest

The host kernel schedules qemu like a regular process. Multiple guests run alongside without knowledge of each other. Applications like Firefox or Apache also compete for the same host resources as qemu although resource controls can be used to isolate and prioritize qemu.

Since qemu system emulation provides a full virtual machine inside the qemu userspace process, the details of what processes are running inside the guest are not directly visible from the host. One way of understanding this is that qemu provides a slab of guest RAM, the ability to execute guest code, and emulated hardware devices; therefore any operating system (or no operating system at all) can run inside the guest. There is no ability for the host to peek inside an arbitrary guest.

Guests have a so-called vcpu thread per virtual CPU. A dedicated iothread runs a select(2) event loop to process I/O such as network packets and disk I/O completion. For more details and possible alternate configuration, see the threading model post.

The following diagram illustrates the qemu process as seen from the host:

Further information

Hopefully this gives you an overview of QEMU and KVM architecture. Feel free to leave questions in the comments and check out other QEMU Internals posts for details on these aspects of QEMU.

Here are two presentations on KVM architecture that cover similar areas if you are interested in reading more:
Jan Kiszka's Linux Kongress 2010: Architecture of the Kernel-based Virtual Machine (KVM)
KVM Architecture Overview
