MSRC

What’s the smallest variety of CHERI?

The Portmeirion project is a collaboration between Microsoft Research Cambridge, Microsoft Security Response Center, and Azure Silicon Engineering & Solutions. Over the past year, we have been exploring how to scale the key ideas from CHERI down to tiny cores on the scale of the cheapest microcontrollers. These cores are very different from the desktop and server-class processors that have been the focus of the Morello project.

Microcontrollers are still typically in-order systems with short pipelines and tens to hundreds of kilobytes of local SRAM. In contrast, systems such as Morello have wide and deep pipelines, perform out-of-order execution, and have gigabytes to terabytes of DRAM hidden behind layers of caches and a memory management unit with multiple levels of page tables. There are billions of microcontrollers in the world and they are increasingly likely to be connected to the Internet. The lack of virtual memory means that they typically don’t have any kind of process-like abstraction and so run unsafe languages in a single privilege domain.

This project has now reached the stage where we have a working RTOS running existing C/C++ components in compartments. We will be open sourcing the software stack over the coming months and are working to verify a production-quality implementation of our proposed ISA extension based on the lowRISC project’s Ibex core, which we intend to contribute back upstream.

Our CHERI microcontroller project aimed to explore whether we can get very strong security guarantees if we are willing to co-design the instruction set architecture (ISA), the application binary interface (ABI), isolation model, and the core parts of the software stack. We applied the same two fundamental security principles as the wider CHERI project throughout:

  • The principle of least privilege. No component should run with more privileges than it needs to complete its task.
  • The principle of intentionality. No component should exercise privilege without explicitly trying to do so.

Existing hardware security features do not fully respect either of these. Traditional processors describe privilege in terms of protection rings, where each ring is strictly more privileged than the next. Anything running in one privilege mode has, effectively, full control over things running in lower privilege states, which is far more power than most of the code in a typical kernel or hypervisor needs. Similarly, while executing in one privilege mode, any instruction automatically operates with this privilege, even if this was not the intention. Techniques such as SMAP are intended to help address this, preventing a kernel from acting as a confused deputy and accessing userspace memory accidentally.

Memory protection on microcontrollers is typically done via a memory protection unit (MPU), which RISC-V calls a physical memory protection (PMP) unit. This does a subset of what a memory management unit (MMU) on a large system can do: it provides protection for different address ranges but does not perform address translation. Again, this does not respect the principle of intentionality because the permission that is granted (access to a range of the physical address space) is divorced from the operation that accesses the range (load and store instructions with an arbitrary integer address). For example, a PMP may define an explicit region for the stack, but pointer arithmetic on a heap or global pointer that goes out of bounds can still end up writing to the stack region.
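To make the intentionality gap concrete, here is a minimal plain-C model (purely illustrative, not real MPU or CHERI hardware; the region layout and names are invented) contrasting the two styles of check. The MPU-style check asks only whether an integer address lands in *any* writable region, so a heap pointer that wanders into the stack region is permitted; the capability-style check is tied to the bounds of the specific pointer used for the access.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flat SRAM with a heap region and a stack region. */
#define HEAP_BASE  0x000u
#define HEAP_TOP   0x800u   /* heap:  [0x000, 0x800) */
#define STACK_BASE 0x800u
#define STACK_TOP  0xC00u   /* stack: [0x800, 0xC00) */

/* MPU/PMP-style check: a store is allowed if the integer address falls in
 * ANY writable region, regardless of which object the code meant to access. */
static bool mpu_store_ok(uint32_t addr) {
    return (addr >= HEAP_BASE && addr < HEAP_TOP) ||
           (addr >= STACK_BASE && addr < STACK_TOP);
}

/* Capability-style check: the store must be within the bounds of the
 * specific pointer (capability) used as the operand of the access. */
typedef struct { uint32_t base, top; } cap_t;

static bool cap_store_ok(cap_t c, uint32_t addr) {
    return addr >= c.base && addr < c.top;
}
```

A heap pointer with bounds [0x100, 0x140) that is offset to 0x900 passes the MPU-style check but fails the capability check, because the capability captures which object the access was intended for.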

Our architecture is an extension to RV32E, the smallest RISC-V base ISA. This has only 15 general-purpose registers (the standard base ISA has 31; in both, one additional register is hardwired to zero), a 32-bit address space, a single privilege level, and no PMP. We extend all of the registers to hold 64-bit capabilities. All of the RISC-V load and store instructions are modified to require a capability as the operand. Unlike the big CHERI systems, we expect everything that runs on a microcontroller to be recompiled and so we don’t provide compatibility for legacy integer-addressed loads and stores.

What do we get from CHERI?

CHERI, originally developed by the University of Cambridge and SRI International under DARPA funding, provides a capability model for accessing memory. Every memory access (load, store, or instruction fetch) must be authorized by a capability. A CHERI capability is a hardware-protected data type that both describes and authorizes access to memory; the hardware permits only guarded manipulation of it. On a system with a 32-bit address space, CHERI capabilities are 64-bit values protected by a non-addressable tag bit (65 bits in total). They cannot be created out of thin air. At system boot, the register file contains capabilities that grant full access to the address space. Every capability in the system is derived from one of these by copying it, removing permissions, or restricting the range of memory that it covers.

Capabilities are the hardware type that the compiler uses to represent pointers and so a C/C++ programmer can think of capabilities and pointers as equivalent. A pointer in a CHERI-C system is unforgeable, has non-bypassable bounds checks, and may have reduced permissions (for example, it might be read-only). Every function pointer, every data pointer, and every implicit pointer such as the stack pointer or global pointer, is a capability and so the hardware enforces bounds checks on every access. This gives us a building block that can be used both for object-granularity memory safety and for fine-grained compartmentalization.

CHERI naturally respects both of our security principles. Every load or store instruction must have a capability as the base address. If you use an offset that would take a pointer from one object that you own to another object then it will fail because your intent is captured by the instruction (you meant to access the object identified by the pointer that you gave as an operand). This also holds for indirect jumps: they take an executable capability as an operand and will fail if the function pointer does not have execute permission. Trying to use a data pointer as a function pointer will trap. The set of memory that a piece of running code can access is limited by the set of capabilities that it holds, which makes it easy to enforce the principle of least privilege.
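The guarded-manipulation rules above can be sketched in plain C. This is an illustrative software model, not the hardware capability format: the struct layout and the names `cap_set_bounds`, `cap_and_perms`, and `cap_can_load` are invented here. The key property is monotonicity: bounds may only shrink and permissions may only be removed, with any invalid manipulation clearing the tag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of a capability: an address plus protected bounds,
 * permissions, and a validity tag (not the real encoding). */
enum { PERM_READ = 1, PERM_WRITE = 2, PERM_EXECUTE = 4 };

typedef struct {
    uint32_t base, top, addr;
    uint32_t perms;
    bool     tag;      /* cleared by any invalid manipulation */
} cap_t;

/* Guarded manipulation: bounds may only shrink; growing them detags. */
static cap_t cap_set_bounds(cap_t c, uint32_t base, uint32_t top) {
    if (base < c.base || top > c.top || base > top)
        c.tag = false;
    c.base = base;
    c.top  = top;
    c.addr = base;
    return c;
}

/* Permissions may only be removed, never added. */
static cap_t cap_and_perms(cap_t c, uint32_t mask) {
    c.perms &= mask;
    return c;
}

/* Every load must be authorized: valid tag, read permission, in bounds. */
static bool cap_can_load(cap_t c) {
    return c.tag && (c.perms & PERM_READ) &&
           c.addr >= c.base && c.addr < c.top;
}
```

Deriving a narrower capability from a broader one keeps the tag; any attempt to widen bounds (or, in the real hardware, to forge a capability from raw bits) leaves an untagged value that no load or store will accept.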

Scaling CHERI down

Most of the CHERI work to date has focused on 64-bit architectures with 128-bit capabilities. The Morello capability format has 64 bits of address, 20 bits of bounds, 16 bits of object type, and 18 bits of permissions. A 32-bit processor has only 32 bits to store all of the metadata, and a direct transliteration would use more than half of that for permissions. There has been some early work on 32-bit CHERI systems, but the encoding has a lot of limitations, such as a 3-bit mantissa for bounds precision, which means that you need a lot of padding for large allocations. It can provide byte-granularity capabilities for objects up to around 64 bytes but then requires stronger alignment on the base and top. This was enough to prove that a 32-bit CHERI is possible but not good enough for real-world deployment.

Our encoding saves space by observing that a secure system will never use many of the combinations of permissions that CHERI provides, so we don’t need to be able to represent them. One of these decompositions was proposed a few years ago by our friends at the University of Cambridge: capabilities that convey sealing or unsealing permissions operate in a different namespace from all other capabilities and so can use a separate format. They also proposed separating writable and executable capabilities, but this proved not to be possible without invasive changes to POSIX or Windows software.

We have built on these ideas, compressing 13 architectural permissions down to 7 bits of encoding space. Sealing and unsealing permissions cannot be combined with any memory-access permissions. Execute and store permissions are disjoint and so no capability can convey both the rights to execute and write memory. Microcontroller software often assumes at least the option of running on a Harvard architecture and so generally avoids the assumptions that make this kind of change problematic in desktop or server codebases.

This compression gives us more space for bounds encoding and allows us to have byte-granularity bounds for any object (or sub-object) up to 510 bytes: ample for embedded systems.
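The idea behind compressed bounds can be sketched with a small helper that computes the representable (rounded-up) length for a given allocation. This is a simplified model, not the actual encoding: the real format stores a mantissa and exponent with additional refinements, and the 9-bit mantissa here is chosen only to illustrate the shape of the trade-off (small lengths are exact, larger ones are rounded up to a power-of-two granule).

```c
#include <stdint.h>

/* Sketch of floating-point-style compressed bounds: a length is stored as
 * a mantissa plus an exponent.  With a 9-bit mantissa, small lengths are
 * exact; larger lengths must be rounded up to a multiple of 2^e. */
#define MANTISSA_BITS 9u

static uint32_t representable_length(uint32_t len) {
    uint32_t e = 0;
    /* Find the smallest exponent whose mantissa can represent the length. */
    while ((len >> e) >= (1u << MANTISSA_BITS))
        e++;
    /* Round the length up to the granule size 2^e. */
    uint32_t granule = 1u << e;
    return (len + granule - 1) & ~(granule - 1);
}
```

With more mantissa bits available (thanks to the permission compression), the exact range grows and the padding needed for larger allocations shrinks, which is what moves the design from "proof of concept" to "deployable".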

The privilege compression also means that there is no single omnipotent capability in the system. When a 64-bit CHERI system boots, it provides the initial loader with capabilities that grant all forms of access to the entire address space. When our core boots, it provides three different root capabilities, one for sealing, one for executing, and one for writing to memory. All capabilities in a running system are derived from one of these. By construction, CHERI does not provide any mechanism for adding permissions and so there is no way in our system to ever build a write-and-execute capability because doing so would require adding either write or execute to one of our root capabilities.

This does not preclude code from holding two capabilities, one granting write access to some memory and another allowing execution of the same memory. Doing so still respects the principle of intentionality: even when you are allowed both to write to and to execute from the same memory, you must explicitly choose which operation you are performing and authorize it with the correct capability.

Adding temporal memory safety

Our work on large CHERI systems has proposed a lot of optimizations that, we hope, will improve the performance of temporal safety on server-class CHERI systems. On small systems, we have a somewhat simpler problem. The lack of virtual memory means that we don’t need to worry about aliasing. The small quantity of physical memory means that we are able to scan all of memory in a very small amount of time. In addition, embedded systems are often single core, which eliminates the need to handle races between a memory access and an object being freed.

We provide a hardware implementation of a revocation bitmap, as used in Cornucopia, a 1-bit tag per 8 bytes of SRAM that is used to indicate whether memory has been deallocated. On free, the memory allocator sets the bits for an allocation and defers reuse until after it has had a chance to do a revocation scan. A potent innovation for small CHERI systems is to have the main CPU pipeline check this bit whenever it loads a capability, clearing the tag bit if the capability points to memory that has been marked as revoked. This means that no capability that points to a deallocated object can be loaded into the register file. As long as we spill and reload registers on context switch (which happens by definition) and on return from free (which happens as a side effect of calling into the memory allocator’s compartment), registers never hold stale pointers. This, in turn, means that we don’t need to do checks on data loads and stores (though we would have to either do this or explicitly serialize cores on free for a multicore microcontroller).
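The revocation bitmap and the load-time filter can be modelled in a few lines of plain C. This is a software sketch of the mechanism described above, not the hardware implementation; the sizes and function names are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a Cornucopia-style revocation bitmap: one bit per 8-byte
 * granule of SRAM, set when the granule belongs to freed memory. */
#define SRAM_BYTES   (64u * 1024u)
#define GRANULE      8u
#define BITMAP_WORDS (SRAM_BYTES / GRANULE / 32u)

static uint32_t revocation_bitmap[BITMAP_WORDS];

/* On free, the allocator paints the freed range in the bitmap and defers
 * reuse until after a revocation scan. */
static void mark_freed(uint32_t addr, uint32_t size) {
    for (uint32_t a = addr; a < addr + size; a += GRANULE) {
        uint32_t g = a / GRANULE;
        revocation_bitmap[g / 32u] |= 1u << (g % 32u);
    }
}

static bool is_revoked(uint32_t addr) {
    uint32_t g = addr / GRANULE;
    return (revocation_bitmap[g / 32u] >> (g % 32u)) & 1u;
}

/* The load filter: when a capability is loaded, its tag is cleared if it
 * points into revoked memory, so stale pointers never reach registers. */
typedef struct { uint32_t base; bool tag; } cap_t;

static cap_t filter_loaded_cap(cap_t c) {
    if (c.tag && is_revoked(c.base))
        c.tag = false;
    return c;
}
```

Because every capability must pass through this filter on its way into the register file, the revocation scan only needs to clear tags in memory; it never has to chase pointers already held in registers.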

This is enough for use-after-free protection, but in a world with multiple compartments it’s often useful to have temporary delegation. If I pass a pointer to an object from one compartment to another, I don’t want to have to free the object to ensure that the callee no longer holds a pointer to it; I want to know that I can reuse it immediately. We provide a lexically scoped delegation mechanism, which allows access to an object graph to be delegated for the duration of a cross-compartment call.

We implement lexical delegation on top of the 2-bit information-flow-control mechanism that has been part of CHERI since the beginning. This involves two permissions: global and store-local. A capability without the global permission is called a local capability and may be stored only via a capability with the store-local permission.

In our system, only two kinds of capabilities have the store-local permission: stacks and the register-save area that is used for context switches. This means that you can pass a local capability from one compartment to another and the only place that it can be stored is on the stack. We then, in software, simply need to ensure that the stack is cleared on return.

For our CHERI variant, we’ve extended the local/global mechanism with one extra permission: permit-indirect-load-global. If you have this permission, you may load capabilities with the global permission set. Without this permission, any capability that you load will have both global and permit-indirect-load-global cleared. This is a similar mechanism to the deep-immutability support that we worked with Arm to add to Morello. If you clear the permit-indirect-load-global and global permissions from a capability then you can pass it to another compartment and guarantee that nothing reachable from it has been captured by the callee when the call returns. You can combine this with the deep-immutability model to temporarily grant read-only access to a complex data structure, for the duration of a call.
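The two rules above (deep attenuation on load, and the store-local restriction) can be sketched as a plain-C model. This is illustrative only; the permission names and functions are invented here and do not reflect the real encoding.

```c
#include <stdint.h>

/* Sketch of the local/global mechanism.  Loading a capability through an
 * authorizing capability without PERM_LOAD_GLOBAL strips GLOBAL and
 * PERM_LOAD_GLOBAL from the loaded value, so the restriction applies
 * transitively to everything reachable from it. */
enum {
    PERM_GLOBAL      = 1u << 0,
    PERM_LOAD_GLOBAL = 1u << 1,
    PERM_STORE_LOCAL = 1u << 2,
};

typedef struct { uint32_t perms; } cap_t;

static cap_t load_cap(cap_t authority, cap_t loaded) {
    if (!(authority.perms & PERM_LOAD_GLOBAL))
        loaded.perms &= ~(PERM_GLOBAL | PERM_LOAD_GLOBAL);
    return loaded;
}

/* A local capability (no GLOBAL permission) may be stored only through a
 * capability with PERM_STORE_LOCAL, such as a stack capability. */
static int store_cap_ok(cap_t authority, cap_t value) {
    return (value.perms & PERM_GLOBAL) ||
           (authority.perms & PERM_STORE_LOCAL);
}
```

Clearing `PERM_GLOBAL` and `PERM_LOAD_GLOBAL` on a pointer before a cross-compartment call therefore guarantees that neither the pointer nor anything loaded via it can be stashed anywhere except the (cleared-on-return) stack.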

Compartments and threads

Our software model has two key concepts for isolation: compartments and threads. A compartment defines spatial ownership, a thread defines temporal ownership. Compartments are a combination of code and data (global variables) that expose functions as entry points. Threads are schedulable entities that own a stack and invoke compartments. At any given time, the system is running one thread in one compartment.

A compartment is defined to the CPU as two capability registers. The program counter capability (PCC) defines the code (and read-only data), and the capability global pointer (CGP) defines the range of (mutable) globals for that compartment. Function calls within a compartment are direct jumps that don’t change the PCC value. Accesses to globals are all via the CGP register (with the compiler inserting bounds restriction if you take the address of a global).

In software, a compartment also defines a set of entry points that are used as valid targets for domain transitions. Calls between compartments look like normal C function calls in the source code but the compiler inserts a call sequence that jumps via a compartment switcher. The switcher is responsible for ensuring cross-compartment isolation. It saves callee-save registers, clears temporary and unused argument registers, truncates the stack, and zeroes the delegated part of the stack, before finally jumping to an entry point identified by the callee’s export table. The switcher manages a trusted stack, containing the saved stack pointer and cross-compartment return address, which is not accessible to the main compartment code. The trusted stack also allows returning from a cross-compartment call in the event that a compartment crashed.

Stack truncation ensures that the called compartment cannot access any part of the caller’s stack that was not explicitly passed as an argument. Stack zeroing (which also happens on return) ensures that no secrets or capabilities are leaked between compartments.
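The switcher's stack handling can be sketched as follows. This is a simplified model, not the real switcher code: it assumes a downward-growing stack represented as a base pointer plus offsets, so the callee receives only the region below the caller's current stack pointer, and that region is zeroed before the call.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the switcher's stack truncation and zeroing.  Stacks grow
 * downward, so the caller's live frames occupy [sp, top) and the unused
 * region [base, sp) is what gets delegated to the callee. */
typedef struct {
    uint8_t *base;   /* lowest address of the stack */
    uint32_t top;    /* one past the highest usable offset */
    uint32_t sp;     /* current stack-pointer offset within [0, top] */
} stack_cap_t;

static stack_cap_t truncate_and_zero(stack_cap_t caller) {
    stack_cap_t callee = caller;
    /* Truncate: the callee's stack ends where the caller's live frames
     * begin, so the caller's frames are unreachable from the callee. */
    callee.top = caller.sp;
    callee.sp  = caller.sp;
    /* Zero the delegated region so no stale data or capabilities leak. */
    memset(callee.base, 0, callee.top);
    return callee;
}
```

The same zeroing happens on return, which is what makes it safe to place local capabilities on the stack during a cross-compartment call.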

Calling between compartments is quite a bit slower than a function call, but still quite fast (on the order of a few hundred cycles). Zeroing the stack sounds slow, but remember that this is an embedded system, where the typical stack size is 1-2 KiB, sometimes smaller. Even on a fairly slow 50 MHz embedded system, zeroing 1 KiB of local SRAM is quite fast. Unfortunately, we can’t easily use this technique on systems such as CheriBSD on Morello, where the stack is typically 8 MiB of DRAM.

The fact that stacks and register-save area are the targets of the only store-local capabilities in the system means that there is no way to pass a stack pointer from one thread to another, even when both are executing in the same compartment. Attempting to store a pointer to a stack object into a heap object or a global will trap. This provides strong non-interference guarantees between threads in line with the principle of intentionality: only stores via an intentionally shared object are visible in another thread.

Shared libraries

Having to duplicate all code between compartments would significantly increase the memory requirements for some embedded software. To avoid this, we also provide a notion of a shared library. This can be viewed as an immutable compartment: it contains code, but no mutable globals and so is safe to invoke by jumping to a sentry capability.

A sentry (sealed entry) capability is an existing CHERI feature: an executable capability sealed with a magic object type that is unsealed automatically by a jump instruction. As with any other sealed capability, it cannot be modified, and so sentries provide a way of calling a function without allowing the caller access to that function’s code or read-only globals.

Not allowing shared libraries to own globals is far less of a restriction on embedded software than it would be for large systems. We’re even able to fit a JavaScript interpreter in this model, allowing multiple mutually distrusting compartments to all run JavaScript code.

Some library routines need to run with interrupts disabled. On RISC-V, interrupts are disabled by writing a flag bit in a control and status register (CSR). We can prevent untrusted code from accessing this register by removing the access-system-registers CHERI permission, but this is a very coarse-grained control and grants more privileges than library routines should typically have. Instead, we have extended the sentry mechanism to encode interrupt posture. We have three sentry types: one that disables interrupts on jump, one that enables them, and one that does not alter the interrupt status. A jump-and-link instruction always creates a return capability that captures the interrupt posture at the point of the jump. These are exposed to C as function attributes, which means that the interrupt status is controlled via structured programming and so is very easy to reason about at the source level.
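The structured behaviour of posture sentries can be modelled in plain C. This is a host-side simulation of the semantics, not real sentry hardware or the actual attribute syntax: the `sentry_t` type, the global flag, and `call_sentry` are invented here purely to show how the posture is applied on jump and restored on return.

```c
#include <stdbool.h>

/* Sketch of interrupt-posture sentries: jumping through a sentry can
 * disable interrupts, enable them, or leave them unchanged, and the
 * return capability restores the posture that held at the call site. */
typedef enum { SENTRY_DISABLE, SENTRY_ENABLE, SENTRY_INHERIT } sentry_kind;

static bool interrupts_enabled = true;

static bool observed_enabled;
static void probe(void) { observed_enabled = interrupts_enabled; }

typedef struct {
    void (*fn)(void);
    sentry_kind kind;
} sentry_t;

static void call_sentry(sentry_t s) {
    /* The return capability records the posture at the point of the jump. */
    bool saved = interrupts_enabled;
    switch (s.kind) {
    case SENTRY_DISABLE: interrupts_enabled = false; break;
    case SENTRY_ENABLE:  interrupts_enabled = true;  break;
    case SENTRY_INHERIT: break;
    }
    s.fn();
    interrupts_enabled = saved;   /* return restores the caller's posture */
}
```

Because the posture change is bound to the call and undone by the return, an interrupts-disabled region can never outlive the function it annotates, which is what makes the property easy to audit at the source level.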

A privilege-separated kernel

There are several parts of the system that need to run with greater privileges than a normal compartment but there is only one component that runs with complete authority: the loader. The loader is responsible for setting up the compartments and runs as soon as the system has booted. This means that it begins executing with the set of capabilities that, between them, allow all accesses. It then derives more restricted capabilities from them. The loader keeps the three root capabilities around during its execution, but everything that it does other than derive new capabilities from these roots is done with a more restricted one. Once the loader finishes, nothing in the system will ever run with full privileges until the next system reset.

The next most privileged component is the switcher. This is responsible for all domain transitions. It forms part of the trusted computing base (TCB) because it enforces some of the key guarantees that normal compartments depend on. It is privileged because its program counter capability grants it explicit access to the trusted stack register, a special capability register (SCR) holding the capability that points to a small per-thread stack used for tracking cross-compartment calls. The trusted stack also contains a pointer to the thread’s register-save area. On context switch (either via interrupt or by explicitly yielding), the switcher is responsible for saving the register state and then passing a sealed capability to the thread state to the scheduler.

The switcher has no state other than the state borrowed from the running thread via the trusted stack, always runs with interrupts disabled, and is small enough to be audited easily. This matters because a buggy switcher could violate compartment isolation by not properly clearing state on compartment transition and could violate thread isolation by not sealing the pointer to the thread state before passing it to the scheduler.

The switcher is the only component that both deals with untrusted data and runs with the access-system-registers permission and it is fewer than 200 RISC-V instructions.

Note that the sealing operation means that the scheduler does not have access to a thread’s register state. The scheduler can tell the switcher which thread to run next, but it cannot violate compartment or thread isolation. It is in the TCB for availability (it can refuse to run any threads) but not for confidentiality or integrity. The scheduler is responsible for configuring the interrupt controller and so has access to the memory-mapped I/O (MMIO) space that grants this access but that just gives it control over availability: it can control whether interrupts are delivered and choose which thread to schedule when they do.

The final TCB component is the memory allocator. This is always a critical part of the TCB for any CHERI system because it is responsible for setting bounds on objects. If it does this incorrectly then you don’t even have spatial memory safety. On our system, it is also responsible for managing revocation and so bugs could introduce exploitable use-after-free vulnerabilities.

The memory allocator is responsible for managing a heap that’s shared between compartments and so bugs in it could lead to cross-compartment memory disclosure. Note that this is mostly limited to compartments that use the heap (not all do on embedded systems). The majority of the allocator has capabilities to the heap memory, but nothing else. The only exception is a small (isolated) component that provides the revocation service. This must be able to scan all mutable memory and invalidate dangling pointers. The revocation service can be implemented either in hardware or in a loop of around ten RISC-V instructions.

Save the seals

CHERI has a sealing mechanism that allows a capability with permit-seal permission to turn another capability into an opaque token that cannot be modified. This sealed capability has the address field of the sealing capability embedded in its ‘object type’ field and can be turned back into a usable capability only via an unseal operation with a capability whose bounds include that value and which has the permit-unseal permission. On Morello, the object type is 18 bits and so it’s possible to have a lot of different opaque types that are passed between compartments.

In our 32-bit capability encoding, we have only 3 bits spare for object types. We reserve all of these for use by privileged components. The switcher has one for unsealing pointers to compartments that are being invoked and for sealing thread state. The scheduler has one for protecting message queues. The allocator also has one, which we use to provide a software-defined capability mechanism.

The software-defined capability mechanism uses a special entry point to the allocator, which allocates an object with a header containing the value that would normally go in the otype field of a capability. Because this is in a heap allocation, it can be a full 32-bit value (minus the handful of values used by the hardware). The allocator returns a sealed capability to this object and, if presented with a permit-unseal capability with a matching address, will return an unsealed capability to the part of the object excluding the header word. This ensures that the header is tamper proof (accessible only within the allocator) and allows a huge number of sealing types, with the restriction that only pointers to entire objects can be sealed.
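The header-based scheme above can be sketched in plain C. This is a simplified model of the idea, not the allocator's real interface: the names `token_sealed_alloc` and `token_unseal` are invented, and the real system enforces the header's inaccessibility with capability bounds rather than convention.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of allocator-provided sealing: each sealable object carries a
 * hidden header word holding a software-defined type, and the payload is
 * handed back only when a matching key unseals it. */
typedef struct {
    uint32_t otype;      /* hidden header: only the allocator can read it */
    uint8_t  payload[];  /* what the unsealed capability would cover */
} sealed_obj_t;

/* "Seal": allocate an object whose header records the sealer's type. */
static sealed_obj_t *token_sealed_alloc(uint32_t otype, size_t size) {
    sealed_obj_t *o = malloc(sizeof(sealed_obj_t) + size);
    if (o)
        o->otype = otype;
    return o;
}

/* "Unseal": return the payload only if the presented key matches; in the
 * real system the caller holds only a sealed (opaque) capability and the
 * key is the address of a permit-unseal capability. */
static void *token_unseal(sealed_obj_t *o, uint32_t key) {
    if (o == NULL || o->otype != key)
        return NULL;
    return o->payload;
}
```

Because the type lives in a full 32-bit header word rather than in the capability encoding, this scheme supports a huge number of sealing types at the cost of working only on whole heap objects.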

We can use this mechanism, for example, for the network stack to seal connection state and return it to the caller, then unseal it when invoked, protecting different connections from each other. This lets us build the principle of intentionality into higher-level abstractions: you may not send data over the network stack unless you present a software-defined capability that authorizes it.

What about safe languages?

The core parts of any OS, including an RTOS, involve doing unsafe things. The memory allocator, for example, has to construct a notion of objects out of a flat address range. Safe Rust can’t express these things and unsafe Rust doesn’t give us significant benefits over modern C++ code that uses the type system to convey security properties.

For the remainder of the code running in compartments, we want to be able to adopt existing code without modification. Rewriting things like a network stack, TLS layer, or JavaScript interpreter in Rust may end up with fewer memory safety bugs but, simply by being a new implementation, would be more likely to contain logic bugs than a mature and well-tested implementation. With CHERI, we can retrofit memory safety to these codebases without requiring a rewrite.

For new code, safe languages are a far better choice. A safe systems language such as Rust is going to be a fantastic fit for a lot of the performance-critical code. There is work underway on a CHERI Rust target. Rust’s delegation model is very similar to the set of properties that we can enforce across compartment boundaries, so we expect to be able to enforce, between compartments written by mutually distrusting authors, the properties that Rust provides within a trust domain via the type system. This will make Rust an excellent language for development in this environment.

The safe language spectrum doesn’t stop at Rust though. A lot of code that runs on microcontrollers is control-plane software that isn’t performance (throughput or latency) sensitive. We already have a JavaScript VM running on the system and hope to support other fully managed languages. JavaScript code running in a compartment is protected from C code in other compartments and can be provided as a development environment for programmers who want a fully garbage-collected type-safe environment.

Today’s announcement

Over the next few weeks, we intend to publish a technical report with our adaptation of the RISC-V CHERI ISA and ABIs to embedded systems for external feedback, review, and collaboration, as the first step towards proposing it as an official RISC-V standard extension. By the end of 2022, we aim to upstream our implementation of the ISA to the lowRISC project’s Ibex core and release our software stack, including LLVM-based compiler and RTOS.

David Chisnall, Hongyan Xia, Wes Filardo, Robert Norton – Microsoft Research

Saar Amar – Microsoft Security Response Center

Yucong Tao, Kunyan Liu – Azure Silicon Engineering & Solutions

Tony Chen – Azure Edge & Platform

