April 2019
S	M	T	W	T	F	S
	1	2 LLDB/LLVM report for March 2019	3	4 Continuation of signal semantics improvements	5	6
7	8	9 From Zero to NVMM	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

Michał Górny (NetBSD Blog)

Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of LLVM/Clang toolchain, and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.

Originally, LLDB was ported to NetBSD by Kamil Rytarowski. However, multiple upstream changes and lack of continuous testing have resulted in decline of support. So far we haven't been able to restore the previous state.

In February, I have started working on LLDB, as contracted by the NetBSD Foundation. My initial effort was focused on restoring continuous integration via buildbot and restoring core file support. You can read more about that in my Feb 2019 report.

In March, I have been continuing this work and this report aims to summarize what I have done and what challenges still lie ahead of me.

Followup from last month and buildbot updates

By the end of February, I was working on fixing the most urgent test failures in order to resume continuous testing on buildbot. In early March, I was able to get all the necessary fixes committed. The list includes:

fixing tests to always use libc++ to avoid libstdc++-related breakage: r355273,
enabling passing -pthread flag: r355274,
enabling support for finding libc++ headers relatively to the executable location: r355282,
fixing additional case of bind/connect address mismatch: r355285,
passing appropriate -L and -Wl,-rpath flags for tests: r355502 with followup test fix in r355510.

The commit series also included an update for the code finding main executable location to use sysctl(): r355283. However, I've reverted it afterwards as it did not work reliably: r355302. Since this was neither really necessary nor unanimously considered correct, I've abandoned the idea.

Once those initial issues were fixed, I was able to enable the full LLDB test suite on the buildbot. Based on the initial results, I updated the list of tests known to fail on NetBSD: r355320, and later r355774, and start looking into ‘flaky’ tests.

Tests are called flaky if they can either pass or fail unpredictably — usually as a result of race conditions, timeouts and other events that may depend on execution order, system load, etc. The LLDB test suite provides a workaround for flaky tests — through executing them multiple times, and requiring only one of the runs to pass.

I have initially attempted to take advantage of this, committing r355830, then r355838. However, this approach turned out to be suboptimal. Sadly, marking more tests flaky only yielded more failures than previously — presumably because of the load increased through rerunning failing tests. Therefore, I've decided it more prudent to instead focus on finding the root issue.

Currently we are facing a temporary interruption in our buildbot service. We will restore it as soon as possible.

Threaded and AArch64 core file support

The next task on the list was to finish work on improving NetBSD core file support that was started by Kamil Rytarowski from almost two years ago. This involved rebasing his old patches after major upstream refactoring, adding tests and addressing remarks from upstream.

Firstly, I've addressed support for core files created from threaded programs. The new code itself landed as r355736. However, one of the tests was initially broken as it relied on symbol names provided by libc, and therefore failed on non-NetBSD systems. After initially disabling the test, I've finally fixed it by refactoring the code to fail in regular program function rather than libc call: r355786.

Secondly, I've added support for core files from AArch64 systems. To achieve this, I have set up a QEMU VM with a lot of help from Jared McNeill. For completeness, I include his very helpful instructions here:

Fetch http://snapshots.linaro.org/components/kernel/leg-virt-tianocore-edk2-upstream/latest/QEMU-AARCH64/RELEASE_GCC49/QEMU_EFI.fd
Fetch and uncompress latest arm64.img.gz from http://nycdn.netbsd.org/pub/NetBSD-daily/HEAD/latest/evbarm-aarch64/binary/gzimg/
Use qemu-img resize arm64.img <newsize> to expand the image for your needs

Start qemu:

SMP=4
MEM=2g
qemu-system-aarch64 -M virt -machine gic-version=3 -cpu cortex-a53 -smp $SMP -m $MEM \
   -drive if=none,file=arm64.img,id=hd0 -device virtio-blk-device,drive=hd0 \
   -netdev type=user,id=net0 -device virtio-net-device,netdev=net0,mac=00:11:22:33:44:55 \
   -bios QEMU_EFI.fd \
   -nographic

With this approach, I've finally been able to run NetBSD/arm64 via QEMU. I used it to create matching core dumps for the tests, and afterwards implemented AArch64 support: r357399.

Other improvements

During the month, upstream has introduced a new SBReproducer module that wrapped most of the LLDB API. The original implementation involved generating the wrappers for all modules in a single file. Combined with heavy use of templates, building this file caused huge memory consumption, making it impossible to build on systems with 4 GiB of RAM. I've discussed possible solutions with upstream and finally implemented one splitting and moving wrappers to individual modules: r356481.

Furthermore, as an effort to reduce flakiness of tests, I've worked on catching and fixing more functions potentially interrupted via signals (EINTR): r356703, with fixup in r356960. Once the LLVM bot will be back, we will check whether that solved our flakiness problem.

LLVM 8 release

Additionally, during the month LLVM 8.0.0 was finally released. Following Kamil's request, by the end of the month I've started working on updating NetBSD src for this release. This is still a work in progress, and I've only managed to get LLVM and Clang to build (with other system components failing afterwards). Nevertheless, if you're interested the current set of changes can be seen on my GitHub fork of netbsd/src, llvm8 branch. The URL for comparison is: https://github.com/NetBSD/src/compare/6e11444..mgorny:llvm8

Please note that for practical reasons this omits the commit updating distributed LLVM and Clang sources.

Future plans

The plans for the nearest future include finishing the efforts mentioned here and working with others on resuming the buildbot. I have also discussed with Pavel Labath and he suggested working on improving error reporting to help us resolve current and future test failures.

The next milestones in the LLDB development plan are:

Add support for FPU registers support for NetBSD/i386 and NetBSD/amd64.
Support XSAVE, XSAVEOPT, ... registers in core(5) files on NetBSD/amd64.

This work is sponsored by The NetBSD Foundation

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

http://netbsd.org/donations/#how-to-donate

Posted late Tuesday evening, April 2nd, 2019 Tags: blog

Kamil Rytarowski (NetBSD Blog)

Over the past month I've finally managed to correct masking semantics of crash signals (SIGSEGV, SIGTRAP, SIGILL, SIGFPE, SIGBUS). Additionally I've fixed masking semantics in forks(2) and vforks(2) (they trigger a crash signal SIGTRAP). There is remaining work in signal semantics for other types of events (mainly thread related). The coverage of signal code in ptrace(2) regression tests keeps continuously incrementing.

Crash signal masking

Certain applications and frameworks mask signals that occur during crashes. This can happen deliberately or by an accident when masking all signals in a process.

There are two basic types of signals in this regard:

emitted by a debugger-related event (such as software or hardware breakpoint),
emitted by other source such as other process (kill(2)) or raised in a thread (raise(2)).

The NetBSD kernel had no subtlety to distinguish these two events and regular signal masking was affecting both types of sources of these signals. This caused various side effects such as a developer being unable to single step a code or after placing a software trap and silently moving over it crashing an application due to abnormal conditions.

Not only debuggers were affected, but software reusing the debugging APIs internally, including the DTrace tools in userland.

Right now the semantics of crash signals has been fixed for traps issued by crashes (such as software breakpoint of segmentation fault) and fork(2)/vfork(2) events.

New ATF tests for ptrace(2)

Browsing the available Linux resources with tests against ptrace(2), I got an inspiration to validate whether unaligned memory access through the PT_READ/PT_WRITE and PIOD READ/WRITE/READ_AUXV operations. These calls are needed to transfer data between the memory of a debugger and a debuggee. They are documented and expected to be safe for a potentially misaligned access. Newly added tests validate whether it is true.

It's much better to detect a potential problem with ATF rather than a kernel crash on a more sensitive CPU (most RISC-ones) during operation.

Plan for the next milestone

Keep preparing kernel fixes and after thorough verification applying them to the mainline distribution.

This work was sponsored by The NetBSD Foundation.

http://netbsd.org/donations/#how-to-donate

Posted in the wee hours of Wednesday night, April 4th, 2019 Tags: blog

Maxime Villard (NetBSD Blog)

Six months ago, I told myself I would write a small hypervisor for an old x86 AMD CPU I had. Just to learn more about virtualization, and see how far I could go alone on my spare time. Today, it turns out that I've gone as far as implementing a full, fast and flexible virtualization stack for NetBSD. I'd like to present here some aspects of it.

Design Aspects

General Considerations

In order to achieve hardware-accelerated virtualization, two components need to interact together:

A kernel driver that will switch machine's CPU to a mode where it will be able to safely execute guest instructions.
A userland emulator, which talks to the kernel driver to run virtual machines.

Simply said, the emulator asks the kernel driver to run virtual machines, and the kernel driver will run them until a VM exit occurs. When this happens, the kernel driver returns to the emulator, telling it along the way why the VM exit occurred. Such exits can be IO accesses for instance, that a virtual machine is not allowed to perform, and that require the emulator to virtualize them.

The NVMM Design

NVMM provides the infrastructure needed for both the kernel driver and the userland emulators.

The kernel NVMM driver comes as a kernel module that can be dynamically loaded into the kernel. It is made of a generic machine-independent frontend, and of several machine-dependent backends. In practice, it means that NVMM is not specific to x86, and could support ARM 64bit for example. During initialization, NVMM selects the appropriate backend for the system. The frontend handles everything that is not CPU-specific: the virtual machines, the virtual CPUs, the guest physical address spaces, and so forth. The frontend also has an IOCTL interface, that a userland emulator can use to communicate with the driver.

When it comes to the userland emulators, NVMM does not provide one. In other words, it does not re-implement a Qemu, a VirtualBox, a Bhyve (FreeBSD) or a VMD (OpenBSD). Rather, it provides a virtualization API via the libnvmm library, which allows to effortlessly add NVMM support in already existing emulators. This API is meant to be simple and straightforward, and is fully documented. It has some similarities with WHPX on Windows and HVF on MacOS.

Fig. A: General overview of the NVMM design.

The Virtualization API: An Example

The virtualization API is installed by default on NetBSD. The idea is to provide an easy way for applications to use NVMM to implement services, which can go from small sandboxing systems to advanced system emulators.

Let's put ourselves in the context of a simple C application we want to write, to briefly showcase the virtualization API. Note that this API may change a little in the future.

Creating Machines and VCPUs

In libnvmm, each machine is described by an opaque nvmm_machine structure. We start with:

#include <nvmm.h>
...
	struct nvmm_machine mach;
	nvmm_machine_create(&mach);
	nvmm_vcpu_create(&mach, 0);

This creates a machine in 'mach', and then creates VCPU number zero (VCPU0) in this machine. This VM is associated with our process, so if our application gets killed or exits bluntly, NVMM will automatically destroy the VM.

Fetching and Setting the VCPU State

In order to operate our VM, we need to be able to fetch and set the state of its VCPU0, that is, the content of VCPU0's registers. Let's say we want to set the value '123' in VCPU0's RAX register. We can do this by adding four more lines:

	struct nvmm_x64_state state;
	nvmm_vcpu_getstate(&mach, 0, &state, NVMM_X64_STATE_GPRS);
	state.gprs[NVMM_X64_GPR_RAX] = 123;
	nvmm_vcpu_setstate(&mach, 0, &state, NVMM_X64_STATE_GPRS);

Here, we fetch the GPR component of the VCPU0 state (GPR stands for General Purpose Registers), we set RAX to '123', and we put the state back into VCPU0. We're done.

Allocating Guest Memory

Now is time to give our VM some memory, let's say one single page. (What follows is a bit technical.)

The VM has its own MMU, which translates guest virtual addresses (GVA) to guest physical addresses (GPA). A secondary MMU (which we won't discuss) is set up by the host to translate the GPAs to host physical addresses. To give our single page of memory to our VM, we need to tell the host to create this secondary MMU.

Then, we will want to read/write data in the guest memory, that is to say, read/write data into our guest's single GPA. To do that, in NVMM, we also need to tell the host to associate the GPA we want to read/write with a host virtual address (HVA) in our application. The big picture:

Fig. B: Memory relations between our application and our VM.

In Fig. B above, if the VM wants to read data at virtual address 0x4000, the CPU will perform a GVA→GPA translation towards the GPA 0x3000. Our application is able to see the content of this GPA, via its virtual address 0x2000. For example, if our application wants to zero out the page, it can simply invoke:

	memset((void *)0x2000, 0, PAGE_SIZE);

With this system, our application can modify guest memory, by reading/writing to it as if it was its own memory. All of this sounds complex, but comes down to only the following four lines of code:

	uintptr_t hva = (uintptr_t)mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0);
	gpaddr_t gpa = 0x3000;
	nvmm_hva_map(&mach, hva, PAGE_SIZE);
	nvmm_gpa_map(&mach, hva, gpa, PAGE_SIZE, PROT_READ|PROT_WRITE);

Here we allocate a simple HVA in our application via mmap. Then, we turn this HVA into a special buffer that NVMM will be able to use. Finally, we tell the host to link the GPA (0x3000) towards the HVA. From then on, the guest is allowed to touch what it perceives as being a simple physical page located at address 0x3000, and our application can directly modify the content of this page by reading and writing into the address pointed to by 'hva'.

Running the VM

The final step is running the VM for real. This is achieved with a VCPU Loop, which runs our VCPU0 and processes the different exit reasons, typically in the following form:

	struct nvmm_exit exit;
	while (1) {
		nvmm_vcpu_run(&mach, 0, &exit);
		switch (exit.reason) {
		case NVMM_EXIT_NONE:
			break; /* nothing to do */
		case ... /* completed as needed */
		}
	}

The nvmm_vcpu_run function blocks, and runs the VM until an exit or a rescheduling occurs.

Full Code

We're done now: we know how to create a VM and give it VCPUs, we know how to modify the registers of the VCPUs, we know how to allocate and modify guest memory, and we know how to run a guest.

Let's sum it all up in one concrete example: a calculator that runs inside a VM. This simple application receives two 16bit ints as parameters, launches a VM that performs the addition of these two ints, fetches the result, and displays it.

Full code: calc-vm.c

That's about it, we have our first NVMM-based application in less than 100 lines of C code, and it is an example of how NetBSD's new virtualization API can be used to easily implement VM-related services.

Advanced Use of the Virtualization API

Libnvmm can go farther than just providing wrapper functions around IOCTLs. Simply said, certain exit reasons are very complex to handle, and libnvmm provides assists that can emulate certain guest operations on behalf of the userland emulator.

Libnvmm embeds a comprehensive machinery, made of three main components:

The MMU Walker: the component in charge of performing a manual GVA→GPA translation. It basically walks the MMU page tree of the guest; if the guest is running in x86 64bit mode for example, it will walk the four layers of pages in the guest to obtain a GPA.
The Instruction decoder: fetches and disassembles the guest instructions that cause MMIO exits. The disassembler uses a Finite State Machine. The result of the disassembly is summed up in a structure that is passed to the instruction emulator, possibly several times consecutively.
The instruction emulator: as its name indicates, it emulates the execution of an instruction. Contrary to many other disassemblers and hypervisors, NVMM makes a clear distinction between the decoder and the emulator.

An NVMM-based application can therefore avoid the burden of implementing these components, by just leveraging the assists provided in libnvmm.

Security Aspects

NVMM can be used in security products, such as sandboxing systems, to provide contained environments. Without elaborating more on my warplans, this is a project I've been thinking about for some time on NetBSD.

One thing you may have noticed from Fig. A, is that the complex emulation machinery is not in the kernel, but in userland. This is an excellent security property of NVMM, because it reduces the risk for the host in case of bug or vulnerability – the host kernel remains unaffected –, and also has the advantage of making the machinery easily fuzzable. Currently, this property is not found in other hypervisors such as KVM, HAXM or Bhyve, and I hope we'll be able to preserve it as we move forward with more backends.

Another security property of NVMM is that the assists provided by libnvmm are invoked only if the emulator explicitly called them. In other words, the complex machinery is not launched automatically, and an emulator is free not to use it if it doesn't want to. This can limit the attack surface of applications that create limited VMs, and want to keep things simple and under control as much as possible.

Finally, NVMM naturally benefits from the modern bug detection features available in NetBSD (KASAN, KUBSAN, and more), and from NetBSD's automated test framework.

Performance Aspects

Contrary to other pseudo-cross-platform kernel drivers such as VirtualBox or HAXM, NVMM is well integrated into the NetBSD kernel, and this allows us to optimize the context switches between the guests and the host, in order to avoid expensive operations in certain cases.

Another performance aspect of NVMM is the fact that in order to implement the secondary MMU, NVMM uses NetBSD's pmap subsystem. This allows us to have pageable guest pages, that the host can allocate on-demand to limit memory consumption, and can then swap out when it comes under memory pressure.

It also goes without saying that NVMM is fully MP-safe, and uses fine-grained locking to be able to run many VMs and many VCPUs simultaneously.

On the userland side, libnvmm tries to minimize the processing cost, by for example doing only a partial emulation of certain instructions, or by batching together certain guest IO operations. A lot of work has been done to try to reduce the number of syscalls an emulator would have to make, in order to increase the overall performance on the userland side; but there are several cases where it is not easy to keep a clean design.

Hardware Support

As of this writing, NVMM supports two backends, x86-SVM for AMD CPUs and x86-VMX for Intel CPUs. In each case, NVMM can support up to 128 virtual machines, each having a maximum of 256 VCPUs and 128GB of RAM.

Emulator Support

Armed with our full virtualization stack, our flexible backends, our user-friendly virtualization API, our comprehensive assists, and our swag NVMM logo, we can now add NVMM support in whatever existing emulator we want.

That's what was done in Qemu, with this patch, which shall soon be upstreamed. It uses libnvmm to provide hardware-accelerated virtualization on NetBSD.

It is now fully functional, and can run a wide variety of operating systems, such as NetBSD (of course), FreeBSD, OpenBSD, Linux, Windows XP/7/8.1/10, among others. All of that works equally across the currently supported NVMM backends, which means that Qemu+NVMM can be used on both AMD and Intel CPUs.

Fig. C: Example, Windows 10 running on Qemu+NVMM, with 3 VCPUs, on a host that has a quad-core AMD CPU.

Fig. D: Example, Fedora 29 running on Qemu+NVMM, with 8 VCPUs, on a host that has a quad-core Intel CPU.

The instructions on how to use Qemu+NVMM are available on this page.

What Now

All of NVMM is available in NetBSD-current, and will be part of the NetBSD 9 release.

Even if perfectly functional, the Intel backend of NVMM is younger than its AMD counterpart, and it will probably receive some more performance and stability improvements.

There still are, also, several design aspects that I haven't yet settled, because I haven't yet decided the best way to fix them.

Overall, I expect new backends to be added for other architectures than x86, and I also expect to add NVMM support in more emulators.

That's all, ladies and gentlemen. In six months of spare time, we went from Zero to NVMM, and now have a full virtualization stack that can run advanced operating systems in a flexible, fast and secure fashion.