Porting Helios to aarch64 for my FOSDEM talk, part one

Helios is a microkernel written in the Hare programming language, and the subject of a talk I did at FOSDEM earlier this month. You can watch the talk here if you like:

A while ago I promised someone that I would not do any talks on Helios until I could present them from Helios itself, and at FOSDEM I made good on that promise: my talk was presented from a Raspberry Pi 4 running Helios. The kernel was originally designed for x86_64 (though we were careful to avoid painting ourselves into any corners so that we could port it to more architectures later on), and I initially planned to write an Intel HD Graphics driver so that I could drive the projector from my laptop. But, after a few days spent trying to comprehend the IHD manuals, I decided it would be much easier to port the entire system to aarch64 and write a driver for the much-simpler RPi GPU instead. 42 days later the port was complete, and a week or so after that I successfully presented the talk at FOSDEM. In a series of blog posts, I will take a look at those 42 days of work and explain how the aarch64 port works. Today’s post focuses on the bootloader.

The Helios boot-up process is:

Bootloader starts up and loads the kernel, then jumps to it
The kernel configures the system and loads the init process
Kernel provides runtime services to init (and any subsequent processes)

In theory, the port to aarch64 would address these steps in order, but in practice step (2) relies heavily on the runtime services provided by step (3), so much of the work was ordered 1, 3, 2. This blog post focuses on part 1, I’ll cover parts 2 and 3 and all of the fun problems they caused in later posts.

In any case, the bootloader was the first step. Some basic changes to the build system established boot/+aarch64 as the aarch64 bootloader, and a simple qemu-specific ARM kernel was prepared which just gave a little “hello world” to demonstrate the multi-arch build system was working as intended. More build system refinements would come later, but it’s off to the races from here. Targeting qemu’s aarch64 virt platform was useful for most of the initial debugging and bring-up (and is generally useful at all times, as a much easier platform to debug than real hardware); the first tests on real hardware came much later.

Booting up is a sore point on most systems. It involves a lot of arch-specific procedures, but also generally calls for custom binary formats and annoying things like disk drivers — which don’t belong in a microkernel. So the Helios bootloaders are separated from the kernel proper, which is a simple ELF executable. The bootloader loads this ELF file into memory, configures a few simple things, then passes some information along to the kernel entry point. The bootloader’s memory and other resources are hereafter abandoned and are later reclaimed for general use.

On aarch64 the boot story is pretty abysmal, and I wanted to avoid adding the SoC-specific complexity which is endemic to the platform. Thus, two solutions are called for: EFI and device trees. At the bootloader level, EFI is the more important concern. For qemu-virt and Raspberry Pi, edk2 is the free-software implementation of choice when it comes to EFI. The first order of business is producing an executable which can be loaded by EFI, which is, rather unfortunately, based on the Windows COFF/PE32+ format. I took inspiration from Linux and made an disgusting EFI stub solution, which involves hand-writing a PE32+ header in assembly and doing some truly horrifying things with binutils to massage everything into order. Much of the header is lifted from Linux:

.section .text.head
.global base
base:
.L_head:
	/* DOS header */
	.ascii "MZ"
	.skip 58
	.short .Lpe_header - .L_head
	.align 4
.Lpe_header:
	.ascii "PE\0\0"
	.short 0xAA64                              /* Machine = AARCH64 */
	.short 2                                   /* NumberOfSections */
	.long 0                                    /* TimeDateStamp */
	.long 0                                    /* PointerToSymbolTable */
	.long 0                                    /* NumberOfSymbols */
	.short .Lsection_table - .Loptional_header /* SizeOfOptionalHeader */
	/* Characteristics:
	 * IMAGE_FILE_EXECUTABLE_IMAGE |
	 * IMAGE_FILE_LINE_NUMS_STRIPPED |
	 * IMAGE_FILE_DEBUG_STRIPPED */
	.short 0x206
.Loptional_header:
	.short 0x20b                     /* Magic = PE32+ (64-bit) */
	.byte 0x02                       /* MajorLinkerVersion */
	.byte 0x14                       /* MinorLinkerVersion */
	.long _data - .Lefi_header_end   /* SizeOfCode */
	.long __pecoff_data_size         /* SizeOfInitializedData */
	.long 0                          /* SizeOfUninitializedData */
	.long _start - .L_head           /* AddressOfEntryPoint */
	.long .Lefi_header_end - .L_head /* BaseOfCode */
.Lextra_header:
	.quad 0                          /* ImageBase */
	.long 4096                       /* SectionAlignment */
	.long 512                        /* FileAlignment */
	.short 0                         /* MajorOperatingSystemVersion */
	.short 0                         /* MinorOperatingSystemVersion */
	.short 0                         /* MajorImageVersion */
	.short 0                         /* MinorImageVersion */
	.short 0                         /* MajorSubsystemVersion */
	.short 0                         /* MinorSubsystemVersion */
	.long 0                          /* Reserved */

	.long _end - .L_head             /* SizeOfImage */

	.long .Lefi_header_end - .L_head /* SizeOfHeaders */
	.long 0                          /* CheckSum */
	.short 10                        /* Subsystem = EFI application */
	.short 0                         /* DLLCharacteristics */
	.quad 0                          /* SizeOfStackReserve */
	.quad 0                          /* SizeOfStackCommit */
	.quad 0                          /* SizeOfHeapReserve */
	.quad 0                          /* SizeOfHeapCommit */
	.long 0                          /* LoaderFlags */
	.long 6                          /* NumberOfRvaAndSizes */

	.quad 0 /* Export table */
	.quad 0 /* Import table */
	.quad 0 /* Resource table */
	.quad 0 /* Exception table */
	.quad 0 /* Certificate table */
	.quad 0 /* Base relocation table */

.Lsection_table:
	.ascii ".text\0\0\0"              /* Name */
	.long _etext - .Lefi_header_end   /* VirtualSize */
	.long .Lefi_header_end - .L_head  /* VirtualAddress */
	.long _etext - .Lefi_header_end   /* SizeOfRawData */
	.long .Lefi_header_end - .L_head  /* PointerToRawData */
	.long 0                           /* PointerToRelocations */
	.long 0                           /* PointerToLinenumbers */
	.short 0                          /* NumberOfRelocations */
	.short 0                          /* NumberOfLinenumbers */
	/* IMAGE_SCN_CNT_CODE | IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_EXECUTE */
	.long 0x60000020

	.ascii ".data\0\0\0"        /* Name */
	.long __pecoff_data_size    /* VirtualSize */
	.long _data - .L_head       /* VirtualAddress */
	.long __pecoff_data_rawsize /* SizeOfRawData */
	.long _data - .L_head       /* PointerToRawData */
	.long 0                     /* PointerToRelocations */
	.long 0                     /* PointerToLinenumbers */
	.short 0                    /* NumberOfRelocations */
	.short 0                    /* NumberOfLinenumbers */
	/* IMAGE_SCN_CNT_INITIALIZED_DATA | IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_WRITE */
	.long 0xc0000040

.balign 0x10000
.Lefi_header_end:

.global _start
_start:
	stp x0, x1, [sp, -16]!

	adrp x0, base
	add x0, x0, #:lo12:base
	adrp x1, _DYNAMIC
	add x1, x1, #:lo12:_DYNAMIC
	bl relocate
	cmp w0, #0
	bne 0f

	ldp x0, x1, [sp], 16

	b bmain

0:
	/* relocation failed */
	add sp, sp, -16
	ret

The specific details about how any of this works are complex and unpleasant, I’ll refer you to the spec if you’re curious, and offer a general suggestion that cargo-culting my work here would be a lot easier than understanding it should you need to build something similar.¹

Note the entry point for later; we store two arguments from EFI (x0 and x1) on the stack and eventually branch to bmain.

This file is assisted by the linker script:

ENTRY(_start)
OUTPUT_FORMAT(elf64-littleaarch64)

SECTIONS {
	/DISCARD/ : {
		*(.rel.reloc)
		*(.eh_frame)
		*(.note.GNU-stack)
		*(.interp)
		*(.dynsym .dynstr .hash .gnu.hash)
	}

	. = 0xffff800000000000;

	.text.head : {
		_head = .;
		KEEP(*(.text.head))
	}

	.text : ALIGN(64K) {
		_text = .;
		KEEP(*(.text))
		*(.text.*)
		. = ALIGN(16);
		*(.got)
	}

	. = ALIGN(64K);
	_etext = .;

	.dynamic : {
		*(.dynamic)
	}

	.data : ALIGN(64K) {
		_data = .;
		KEEP(*(.data))
		*(.data.*)

		/* Reserve page tables */
		. = ALIGN(4K);
		L0 = .;
		. += 512 * 8;
		L1_ident = .;
		. += 512 * 8;
		L1_devident = .;
		. += 512 * 8;
		L1_kernel = .;
		. += 512 * 8;
		L2_kernel = .;
		. += 512 * 8;
		L3_kernel = .;
		. += 512 * 8;
	}

	.rela.text : {
		*(.rela.text)
		*(.rela.text*)
	}
	.rela.dyn : {
		*(.rela.dyn)
	}
	.rela.plt : {
		*(.rela.plt)
	}
	.rela.got : {
		*(.rela.got)
	}
	.rela.data : {
		*(.rela.data)
		*(.rela.data*)
	}

	.pecoff_edata_padding : {
		BYTE(0);
		. = ALIGN(512);
	}
	__pecoff_data_rawsize = ABSOLUTE(. - _data);
	_edata = .;

	.bss : ALIGN(4K) {
		KEEP(*(.bss))
		*(.bss.*)
		*(.dynbss)
	}

	. = ALIGN(64K);
	__pecoff_data_size = ABSOLUTE(. - _data);
	_end = .;
}

Items of note here are the careful treatment of relocation sections (cargo-culted from earlier work on RISC-V with Hare; not actually necessary as qbe generates PIC for aarch64)² and the extra symbols used to gather information for the PE32+ header. Padding is also added in the required places, and static aarch64 page tables are defined for later use.

This is built as a shared object, and the Makefile ~~mutilates~~ reformats the resulting ELF file to produce a PE32+ executable:

$(BOOT)/bootaa64.so: $(BOOT_OBJS) $(BOOT)/link.ld
	$(LD) -Bsymbolic -shared --no-undefined \
		-T $(BOOT)/link.ld \
		$(BOOT_OBJS) \
		-o $@

$(BOOT)/bootaa64.efi: $(BOOT)/bootaa64.so
	$(OBJCOPY) -Obinary \
		-j .text.head -j .text -j .dynamic -j .data \
		-j .pecoff_edata_padding \
		-j .dynstr -j .dynsym \
		-j .rel -j .rel.* -j .rel* \
		-j .rela -j .rela.* -j .rela* \
		$< $@

With all of this mess sorted, and the PE32+ entry point branching to bmain, we can finally enter some Hare code:

export fn bmain(
	image_handle: efi::HANDLE,
	systab: *efi::SYSTEM_TABLE,
) efi::STATUS = {
    // ...
};

Getting just this far took 3 full days of work.

Initially, the Hare code incorporated a lot of proof-of-concept work from Alexey Yerin’s “carrot” kernel prototype for RISC-V, which also booted via EFI. Following the early bringing-up of the bootloader environment, this was refactored into a more robust and general-purpose EFI support layer for Helios, which will be applicable to future ports. The purpose of this module is to provide an idiomatic Hare-oriented interface to the EFI boot services, which the bootloader makes use of mainly to read files from the boot media and examine the system’s memory map.

Let’s take a look at the first few lines of bmain:

efi::init(image_handle, systab)!;

const eficons = eficons_init(systab);
log::setcons(&eficons);
log::printfln("Booting Helios aarch64 via EFI");

if (readel() == el::EL3) {
	log::printfln("Booting from EL3 is not supported");
	return efi::STATUS::LOAD_ERROR;
};

let mem = allocator { ... };
init_mmap(&mem);
init_pagetables();

Significant build system overhauls were required such that Hare modules from the kernel like log (and, later, other modules like elf) could be incorporated into the bootloader, simplifying the process of implementing more complex bootloaders. The first call of note here is init_mmap, which scans the EFI memory map and prepares a simple high-watermark allocator to be used by the bootloader to allocate memory for the kernel image and other items of interest. It’s quite simple, it just finds the largest area of general-purpose memory and sets up an allocator with it:

// Loads the memory map from EFI and initializes a page allocator using the
// largest area of physical memory.
fn init_mmap(mem: *allocator) void = {
	const iter = efi::iter_mmap()!;
	let maxphys: uintptr = 0, maxpages = 0u64;
	for (true) {
		const desc = match (efi::mmap_next(&iter)) {
		case let desc: *efi::MEMORY_DESCRIPTOR =>
			yield desc;
		case void =>
			break;
		};
		if (desc.DescriptorType != efi::MEMORY_TYPE::CONVENTIONAL) {
			continue;
		};
		if (desc.NumberOfPages > maxpages) {
			maxphys = desc.PhysicalStart;
			maxpages = desc.NumberOfPages;
		};
	};
	assert(maxphys != 0, "No suitable memory area found for kernel loader");
	assert(maxpages <= types::UINT_MAX);
	pagealloc_init(mem, maxphys, maxpages: uint);
};

init_pagetables is next. This populates the page tables reserved by the linker with the desired higher-half memory map, illustrated in the comments shown here:

fn init_pagetables() void = {
	// 0xFFFF0000xxxxxxxx - 0xFFFF0200xxxxxxxx: identity map
	// 0xFFFF0200xxxxxxxx - 0xFFFF0400xxxxxxxx: identity map (dev)
	// 0xFFFF8000xxxxxxxx - 0xFFFF8000xxxxxxxx: kernel image
	//
	// L0[0x000]    => L1_ident
	// L0[0x004]    => L1_devident
	// L1_ident[*]  => 1 GiB identity mappings
	// L0[0x100]    => L1_kernel
	// L1_kernel[0] => L2_kernel
	// L2_kernel[0] => L3_kernel
	// L3_kernel[0] => 4 KiB kernel pages
	L0[0x000] = PT_TABLE | &L1_ident: uintptr | PT_AF;
	L0[0x004] = PT_TABLE | &L1_devident: uintptr | PT_AF;
	L0[0x100] = PT_TABLE | &L1_kernel: uintptr | PT_AF;
	L1_kernel[0] = PT_TABLE | &L2_kernel: uintptr | PT_AF;
	L2_kernel[0] = PT_TABLE | &L3_kernel: uintptr | PT_AF;
	for (let i = 0u64; i < len(L1_ident): u64; i += 1) {
		L1_ident[i] = PT_BLOCK | (i * 0x40000000): uintptr |
			PT_NORMAL | PT_AF | PT_ISM | PT_RW;
	};
	for (let i = 0u64; i < len(L1_devident): u64; i += 1) {
		L1_devident[i] = PT_BLOCK | (i * 0x40000000): uintptr |
			PT_DEVICE | PT_AF | PT_ISM | PT_RW;
	};
};

In short, we want three larger memory regions to be available: an identity map, where physical memory addresses correlate 1:1 with virtual memory, an identity map configured for device MMIO (e.g. with caching disabled), and an area to load the kernel image. The first two are straightforward, they use uniform 1 GiB mappings to populate their respective page tables. The latter is slightly more complex, ultimately the kernel is loaded in 4 KiB pages so we need to set up intermediate page tables for that purpose.

We cannot actually enable these page tables until we’re finished making use of the EFI boot services — the EFI specification requires us to preserve the online memory map at this stage of affairs. However, this does lay the groundwork for the kernel loader: we have an allocator to provide pages of memory, and page tables to set up virtual memory mappings that can be activated once we’re done with EFI. bmain thus proceeds with loading the kernel:

const kernel = match (efi::open("\\helios", efi::FILE_MODE::READ)) {
case let file: *efi::FILE_PROTOCOL =>
	yield file;
case let err: efi::error =>
	log::printfln("Error: no kernel found at /helios");
	return err: efi::STATUS;
};

log::printfln("Load kernel /helios");
const kentry = match (load(&mem, kernel)) {
case let err: efi::error =>
	return err: efi::STATUS;
case let entry: uintptr =>
	yield entry: *kentry;
};
efi::close(kernel)!;

The loader itself (the “load” function here) is a relatively straightforward ELF loader; if you’ve seen one you’ve seen them all. Nevertheless, you may browse it online if you so wish. The only item of note here is the function used for mapping kernel pages:

// Maps a physical page into the kernel's virtual address space.
fn kmmap(virt: uintptr, phys: uintptr, flags: uintptr) void = {
	assert(virt & ~0x1ff000 == 0xffff800000000000: uintptr);
	const offs = (virt >> 12) & 0x1ff;
	L3_kernel[offs] = PT_PAGE | PT_NORMAL | PT_AF | PT_ISM | phys | flags;
};

The assertion enforces a constraint which is implemented by our kernel linker script, namely that all loadable kernel program headers are located within the kernel’s reserved address space. With this constraint in place, the implementation is simpler than many mmap implementations; we can assume that L3_kernel is the correct page table and just load it up with the desired physical address and mapping flags.

Following the kernel loader, the bootloader addresses other items of interest, such as loading the device tree and boot modules — which includes, for instance, the init process image and an initramfs. It also allocates & populates data structures with information which will be of later use to the kernel, including the memory map. This code is relatively straightforward and not particularly interesting; most of these processes takes advantage of the same straightforward Hare function:

// Loads a file into continuous pages of memory and returns its physical
// address.
fn load_file(
	mem: *allocator,
	file: *efi::FILE_PROTOCOL,
) (uintptr | efi::error) = {
	const info = efi::file_info(file)?;
	const fsize = info.FileSize: size;
	let npage = fsize / PAGESIZE;
	if (fsize % PAGESIZE != 0) {
		npage += 1;
	};

	let base: uintptr = 0;
	for (let i = 0z; i < npage; i += 1) {
		const phys = pagealloc(mem);
		if (base == 0) {
			base = phys;
		};

		const nbyte = if ((i + 1) * PAGESIZE > fsize) {
			yield fsize % PAGESIZE;
		} else {
			yield PAGESIZE;
		};
		let dest = (phys: *[*]u8)[..nbyte];
		const n = efi::read(file, dest)?;
		assert(n == nbyte);
	};

	return base;
};

It is not necessary to map these into virtual memory anywhere, the kernel later uses the identity-mapped physical memory region in the higher half to read them. Tasks of interest resume at the end of bmain:

efi::exit_boot_services();
init_mmu();
enter_kernel(kentry, ctx);

Once we exit boot services, we are free to configure the MMU according to our desired specifications and make good use of all of the work done earlier to prepare a kernel memory map. Thus, init_mmu:

// Initializes the ARM MMU to our desired specifications. This should take place
// *after* EFI boot services have exited because we're going to mess up the MMU
// configuration that it depends on.
fn init_mmu() void = {
	// Disable MMU
	const sctlr_el1 = rdsctlr_el1();
	wrsctlr_el1(sctlr_el1 & ~SCTLR_EL1_M);

	// Configure MAIR
	const mair: u64 =
		(0xFF << 0) | // Attr0: Normal memory; IWBWA, OWBWA, NTR
		(0x00 << 8);  // Attr1: Device memory; nGnRnE, OSH
	wrmair_el1(mair);

	const tsz: u64 = 64 - 48;
	const ips = rdtcr_el1() & TCR_EL1_IPS_MASK;
	const tcr_el1: u64 =
		TCR_EL1_IPS_42B_4T |	// 4 TiB IPS
		TCR_EL1_TG1_4K |	// Higher half: 4K granule size
		TCR_EL1_SH1_IS |	// Higher half: inner shareable
		TCR_EL1_ORGN1_WB |	// Higher half: outer write-back
		TCR_EL1_IRGN1_WB |	// Higher half: inner write-back
		(tsz << TCR_EL1_T1SZ) |	// Higher half: 48 bits
		TCR_EL1_TG0_4K |	// Lower half: 4K granule size
		TCR_EL1_SH0_IS |	// Lower half: inner sharable
		TCR_EL1_ORGN0_WB |	// Lower half: outer write-back
		TCR_EL1_IRGN0_WB |	// Lower half: inner write-back
		(tsz << TCR_EL1_T0SZ);	// Lower half: 48 bits
	wrtcr_el1(tcr_el1);

	// Load page tables
	wrttbr0_el1(&L0[0]: uintptr);
	wrttbr1_el1(&L0[0]: uintptr);
	invlall();

	// Enable MMU
	const sctlr_el1: u64 =
		SCTLR_EL1_M |		// Enable MMU
		SCTLR_EL1_C |		// Enable cache
		SCTLR_EL1_I |		// Enable instruction cache
		SCTLR_EL1_SPAN |	// SPAN?
		SCTLR_EL1_NTLSMD |	// NTLSMD?
		SCTLR_EL1_LSMAOE |	// LSMAOE?
		SCTLR_EL1_TSCXT |	// TSCXT?
		SCTLR_EL1_ITD;		// ITD?
	wrsctlr_el1(sctlr_el1);
};

There are a lot of bits here! Figuring out which ones to enable or disable was a project in and of itself. One of the major challenges, funnily enough, was finding the correct ARM manual to reference to understand all of these registers. I’ll save you some time and link to it directly, should you ever find yourself writing similar code. Some question marks in comments towards the end point out some flags that I’m still not sure about. The ARM CPU is very configurable and identifying the configuration that produces the desired behavior for a general-purpose kernel requires some effort.

After this function completes, the MMU is initialized and we are up and running with the kernel memory map we prepared earlier; the kernel is loaded in the higher half and the MMU is prepared to service it. So, we can jump to the kernel via enter_kernel:

@noreturn fn enter_kernel(entry: *kentry, ctx: *bootctx) void = {
	const el = readel();
	switch (el) {
	case el::EL0 =>
		abort("Bootloader running in EL0, breaks EFI invariant");
	case el::EL1 =>
		// Can boot immediately
		entry(ctx);
	case el::EL2 =>
		// Boot from EL2 => EL1
		//
		// This is the bare minimum necessary to get to EL1. Future
		// improvements might be called for here if anyone wants to
		// implement hardware virtualization on aarch64. Good luck to
		// this future hacker.

		// Enable EL1 access to the physical counter register
		const cnt = rdcnthctl_el2();
		wrcnthctl_el2(cnt | 0b11);

		// Enable aarch64 in EL1 & SWIO, disable most other EL2 things
		// Note: I bet someday I'll return to this line because of
		// Problems
		const hcr: u64 = (1 << 1) | (1 << 31);
		wrhcr_el2(hcr);

		// Set up SPSR for EL1
		// XXX: Magic constant I have not bothered to understand
		wrspsr_el2(0x3c4);

		enter_el1(ctx, entry);
	case el::EL3 =>
		// Not supported, tested earlier on
		abort("Unsupported boot configuration");
	};
};

Here we see the detritus from one of many battles I fought to port this kernel: the EL2 => EL1 transition. aarch64 has several “exception levels”, which are semantically similar to the x86_64 concept of protection rings. EL0 is used for userspace code, which is not applicable under these circumstances; an assertion sanity-checks this invariant. EL1 is the simplest case, this is used for normal kernel code and in this situation we can jump directly to the kernel. The EL2 case is used for hypervisor code, and this presented me with a challenge. When I tested my bootloader in qemu-virt, it worked initially, but on real hardware it failed. After much wailing and gnashing of teeth, the cause was found to be that our bootloader was started in EL2 on real hardware, and EL1 on qemu-virt. qemu can be configured to boot in EL2, which was crucial in debugging this problem, via -M virt,virtualization=on. From this environment I was able to identify a few important steps to drop to EL1 and into the kernel, though from the comments you can probably ascertain that this process was not well-understood. I do have a better understanding of it now than I did when this code was written, but the code is still serviceable and I see no reason to change it at this stage.

At this point, 14 days into the port, I successfully reached kmain on qemu-virt. Some initial kernel porting work was done after this, but when I was prepared to test it on real hardware I ran into this EL2 problem — the first kmain on real hardware ran at T+18.

That sums it up for the aarch64 EFI bootloader work. 24 days later the kernel and userspace ports would be complete, and a couple of weeks after that it was running on stage at FOSDEM. The next post will cover the kernel port (maybe more than one post will be required, we’ll see), and the final post will address the userspace port and the inner workings of the slidedeck demo that was shown on stage. Look forward to it, and thanks for reading!

A cursory review of this code while writing this blog post draws my attention to a few things that ought to be improved as well. ↩︎
PIC stands for “position independent code”. EFI can load executables at any location in memory and the code needs to be prepared to deal with that; PIC is the tool we use for this purpose. ↩︎

Porting Helios to aarch64 for my FOSDEM talk, part one February 20, 2023 on Drew DeVault's blog

Articles from blogs I read Generated by openring

Vindicated: Icons suck

Upcoming changes to Hare's event loop library

SourceHut is now accepting payments in Euro