Technical Challenges in Athena

Three Things #131: August 4, 2024

Aug 05, 2024

What working on Athena feels like most of the time.

For a professional software developer I don’t actually write much technical content here. One of my goals with Three Things has been to make technical ideas more accessible, which generally means not going too deep. This means that, while I’ve written quite a bit about the Athena project from a 30,000 foot perspective, I haven’t written anything that explains some of the “in the weeds” technical aspects of the project. But Athena is a deeply technical project and I’ve been deep in those weeds lately, so it seems like a good time to talk about some of the more technical details. Here are some of the hardest technical challenges of the project so far.

Thing #1: Rust Toolchain 🦀

I’ve said it many times and I’ll say it again: Athena is just Rust. What I mean by that is, writing programs for Athena just means writing Rust code. You shouldn’t need a new language. You shouldn’t need a new compiler. You shouldn’t need any new tools. Athena is just Rust.

That’s the goal, anyway. And we’re not so far from achieving it! Athena is, really, just Rust. You write Athena programs in Rust and compile them using Rust.

In fact, Rust is more than a programming language and a compiler. It’s an extremely mature, active ecosystem and a big toolchain with dozens of components and utilities. What’s more, it’s built on top of LLVM, which is the most mature compiler toolchain there is. Rust and LLVM support many target architectures and instruction sets. They’ve included support for RISC-V for years. Rust even gained support for RV32IM, one of the instruction sets that Athena supports, earlier this year, thanks to our friends at RiscZero.

Unfortunately, however, Rust still doesn’t support RV32EM, which is the primary instruction set we’re using for Athena programs. The difference is that RV32IM uses 32 registers, while RV32EM uses 16 (the “E” stands for “embedded”).

Why use a nonstandard instruction set, when doing so makes our lives harder? There are a few reasons. One is that most of the underlying hardware that’ll be running Athena, including systems that run amd64, the most popular ISA, actually have 16 registers, which means there should be no performance hit at runtime and that compilation from RISC-V to native machine code should be even easier. Another is that we need to wrap Athena programs in a special runtime kernel when we create ZK proofs of execution, and reserving the upper 16 registers makes this task much easier (since we know the user space program can’t touch them).
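To make the register restriction concrete, here’s a minimal sketch in Rust of how a loader could check that an instruction only names registers that exist in RV32E. The helper names are hypothetical and not taken from Athena, and the check only covers I-type instructions, which have a destination register and one source register.

```rust
/// Number of general-purpose registers in the RV32E base ISA (x0..=x15).
const RV32E_REGS: u32 = 16;

/// Extracts the destination register field (rd), bits 11:7 of the instruction word.
fn rd(instr: u32) -> u32 {
    (instr >> 7) & 0x1f
}

/// Extracts the first source register field (rs1), bits 19:15.
fn rs1(instr: u32) -> u32 {
    (instr >> 15) & 0x1f
}

/// Returns true if an I-type instruction only uses registers available in RV32E.
/// Hypothetical helper for illustration; not part of Athena.
fn i_type_fits_rv32e(instr: u32) -> bool {
    rd(instr) < RV32E_REGS && rs1(instr) < RV32E_REGS
}
```

For example, `addi x10, x0, 0x100` (encoded as 0x10000513) passes the check, while `addi x17, x0, 1` (0x00100893) does not, because x17 is one of the upper 16 registers.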

In fact, LLVM gained partial support for RV32EM earlier this year, and adding support to Rust isn’t too hard. This is exactly what our toolchain does. (You can see the exact changes required in our patches.) There’s no reason Rust can’t also support the instruction set upstream; it really just requires that someone make the effort to add it. We can do that when we’re ready.

In the meantime, however, the only way to support RV32EM for Athena was to compile a custom Rust toolchain. I hoped this would be as simple as adding a custom target to Rust. In the end, it turned out to be a much bigger task than I anticipated. Every time you compile Rust, it takes more than 30 minutes, even on a fast system. The super slow feedback loop made testing changes very difficult. There were a lot of mysterious errors that took days to debug, even with the help of AI tools. Fortunately, I did have the work of others including RiscZero and PolkaVM to rely on. Unfortunately, I also ended up having to work around mysterious bugs in other people’s code.

The other challenge was cross-compilation. Compiling and running a program on the same system is straightforward. While it isn’t quite as easy, it’s also possible to compile, say, a Windows application on Linux or a macOS application on Windows using Rust. Using a nonstandard instruction set, and trying to compile a compiler, made this much harder. I was constantly getting confused: I’m using Rust on linux-amd64 to compile a Rust compiler that runs on linux-amd64 and compiles RISC-V code to run on Athena. So… which is the host and which is the guest? 🤔 (This sort of strange loop is exactly the reason I so enjoy working on compilers and VMs!)

RV32EM caused headaches on the RISC-V side as well. The standard RISC-V toolchain also doesn’t support RV32EM out of the box, and I had a lot of trouble getting it working. As with the Rust toolchain, it was mostly getting the right set of libraries together and figuring out the right set of commands. Eventually, after a lot of trial and error, I was able to compile programs to our instruction set, including the memory management code in Athena and some simple test code (see below).

The final product, a patched version of the latest Rust toolchain that supports the instruction set we need, is embarrassingly simple given how long it took to put together, but I’m proud of this work. I learned a lot about compilers in general and about Rust in particular. This understanding is extremely important if we want Athena to be a mature project with a target that, eventually, we’ll lobby to add to mainline Rust. Until then, those who want to compile Athena programs will need to download the custom toolchain—which, sadly, for now only supports Linux. Fortunately, the athup command automates this process.

Thing #2: FFI and Lifetimes 🧩

One of the things that makes Rust such a fun and powerful language is a unique tool called the borrow checker. This is a large part of how Rust achieves its memory safety, one of its primary design goals. The borrow checker keeps track of the ownership of every bit of memory, and simply won’t allow a program to compile if there are errors. In this way Rust almost totally eliminates null pointer exceptions and many other types of common memory issues.

One of the Rust features the borrow checker relies on is called lifetimes, which track how long a reference remains valid, so that no reference can outlive the data it points to. While Rust is usually able to figure out lifetimes on its own, in some situations the developer must explicitly annotate references with a lifetime, which tells Rust how long a reference should be valid. Rust doesn’t have a garbage collector. Instead, each value has a single owner and is freed when that owner goes out of scope, and all of this is checked at compile time. (Runtime reference counting is available too, through types like Rc and Arc, but it’s opt-in.) Annotating with lifetimes is necessary only when Rust can’t tell how long a particular reference lives. Lifetimes allow the borrow checker to disambiguate how references relate to one another and to the data they borrow from.
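As a small illustration of an explicit lifetime annotation, here’s the classic case where the compiler can’t infer which input the returned reference borrows from. This is a generic Rust sketch, not code from Athena.

```rust
/// Returns the longer of two string slices.
/// The `'a` annotation tells the compiler that the returned reference
/// is only valid for as long as both inputs are valid.
fn longest<'a>(a: &'a str, b: &'a str) -> &'a str {
    if a.len() >= b.len() {
        a
    } else {
        b
    }
}
```

Without the `'a` annotations this function doesn’t compile, because the return value could borrow from either argument and Rust has no way to know which.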

Rust has a fairly steep learning curve and these memory-related features—ownership, the borrow checker, and lifetimes—are a big reason why. Having said that, most of the time it’s not too hard to understand and implement ownership correctly, once you get the hang of it. Each object can only have one mutable reference at a time (or many immutable references), and “moving” the object into another scope transfers its ownership. A mutable reference can be safely cast into an immutable reference, but not vice versa. That’s most of what you need to know.
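The rules above fit in a few lines of Rust. This is a generic sketch with made-up names, showing shared borrows, an exclusive borrow, a mutable reference reborrowed as immutable, and a move.

```rust
/// Takes ownership of the vector; the caller can no longer use it afterwards.
fn consume(v: Vec<u32>) -> usize {
    v.len()
}

/// Walks through the borrowing rules and returns (sum, final length).
fn borrow_rules_demo() -> (u32, usize) {
    let mut balances = vec![1, 2, 3];

    // Any number of immutable references may coexist.
    let first = &balances;
    let second = &balances;
    let sum: u32 = first.iter().sum::<u32>() + second[0];

    // Only one mutable reference at a time, and only once the shared
    // borrows above are no longer in use.
    let exclusive = &mut balances;
    exclusive.push(4);

    // A mutable reference can be reborrowed as an immutable one.
    let read_only: &Vec<u32> = &*exclusive;
    assert_eq!(read_only.len(), 4);

    // Moving transfers ownership; using `balances` after this line
    // would be a compile error.
    let len = consume(balances);
    (sum, len)
}
```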

But FFI makes this much harder. FFI stands for foreign function interface, and as I wrote in the last Athena update, it’s how programs in one language call code from another language. In our case, for now, this means go-spacemesh, written in Go, calling into Athena, written in Rust. When programs communicate over FFI, they have to conform to C standards, which is more or less the lowest common denominator. This requires passing raw pointers, which means that all memory safety bets are off. In Rust and in most languages, FFI code is considered “unsafe” as a result.

Athena includes a sophisticated FFI, which took a long time to design. I previously wrote more about the process of designing and building it, but in brief, one hard part was implementing dozens of pages of wrapper and conversion code that converts back and forth between C types and Rust types (and, later, Go types for use in go-spacemesh). This code is straightforward but error prone, repetitive, and a pain to write. Writing these “bindings” that allow Athena to talk to the “Athcon” (Athena Connector) interface in Rust is actually fairly straightforward, since Rust allows directly calling C functions and working with C types. By contrast, adding support in Go was more work, since Go requires the use of an intermediate layer called CGO and can’t as easily talk to C code as Rust can.
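To give a flavor of what this wrapper and conversion code looks like, here’s a minimal sketch. The type names, the 24-byte address length, and the helper function are all assumptions for illustration and are not taken from the actual Athcon bindings.

```rust
/// C-compatible address type as it might appear in an FFI header.
/// The name and the 24-byte length are assumptions for illustration.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct CAddress {
    pub bytes: [u8; 24],
}

/// Native Rust wrapper that the rest of a (hypothetical) codebase would use.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Address([u8; 24]);

impl From<CAddress> for Address {
    fn from(c: CAddress) -> Self {
        Address(c.bytes)
    }
}

impl From<Address> for CAddress {
    fn from(a: Address) -> Self {
        CAddress { bytes: a.0 }
    }
}

/// Copies a byte buffer received over FFI into an owned Rust `Vec`.
///
/// # Safety
/// If `ptr` is non-null, it must point to at least `len` readable bytes.
pub unsafe fn bytes_from_raw(ptr: *const u8, len: usize) -> Vec<u8> {
    if ptr.is_null() || len == 0 {
        return Vec::new();
    }
    // SAFETY: the caller guarantees `ptr` points to `len` readable bytes.
    unsafe { std::slice::from_raw_parts(ptr, len).to_vec() }
}
```

Each conversion is trivial on its own. The pain comes from writing one of these for every type that crosses the boundary, in both directions.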

The FFI also caused some problems for the Rust borrow checker. When the host, e.g., go-spacemesh, calls into the Athena VM, it needs to pass an “opaque pointer” back to itself. More specifically, this is how programs running in Athena call back into the host, using host functions, which are an interface defined on this pointer.

The problem is, the host pointer is a raw C pointer, defined in the FFI. All of the other data input from the host is sanitized and wrapped in clean, Rust native types, so the vast majority of Athena code is clean, safe Rust.

But Rust doesn’t know what to do with the host pointer. It doesn’t know how long it should live. By default, it assumes it should have a “static” lifetime, which is to say that it lives for the entire life of the program. But that has downstream consequences for all of the code that touches it. It’s even more complicated by recursion: the host calls into the VM, which may in turn call into the host (using the call host function), which may in turn call back into the VM, etc., as a program works its way through multiple call frames. Rust needs to be able to understand the lifetimes and ownership for every active frame, and unwind everything correctly when execution terminates, successfully or with an error.
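One common Rust technique for this kind of problem is to wrap the raw pointer in a struct that carries a lifetime through PhantomData, so the borrow checker can reason about it like any other borrow. The sketch below shows that general idea only. It is not Athena’s actual design, and every name in it (HostContext, GetBalanceFn, demo_get_balance, run_frame) is hypothetical. A plain u64 stands in for the host’s state.

```rust
use std::ffi::c_void;
use std::marker::PhantomData;

/// Signature of a hypothetical host callback that reads a balance
/// through the opaque context pointer.
type GetBalanceFn = unsafe extern "C" fn(ctx: *mut c_void) -> u64;

/// Safe wrapper that ties the raw host pointer to a borrow with lifetime `'a`,
/// so it cannot outlive the state it points to.
struct HostContext<'a> {
    raw: *mut c_void,
    get_balance: GetBalanceFn,
    _frame: PhantomData<&'a mut u64>,
}

impl<'a> HostContext<'a> {
    /// # Safety
    /// `get_balance` must be safe to call with a pointer to `state`.
    unsafe fn new(state: &'a mut u64, get_balance: GetBalanceFn) -> Self {
        HostContext {
            raw: state as *mut u64 as *mut c_void,
            get_balance,
            _frame: PhantomData,
        }
    }

    /// Calls back into the "host" through the opaque pointer.
    fn balance(&mut self) -> u64 {
        // SAFETY: `raw` comes from a `&'a mut u64` that is still borrowed,
        // and `new` requires the callback to accept such a pointer.
        unsafe { (self.get_balance)(self.raw) }
    }
}

/// Stand-in for a host function; a real host would live on the other side of the FFI.
unsafe extern "C" fn demo_get_balance(ctx: *mut c_void) -> u64 {
    // SAFETY: in this sketch `ctx` always points at a live u64.
    unsafe { *(ctx as *const u64) }
}

/// Simulates nested call frames: each frame reborrows the same host context.
fn run_frame<'a>(host: &mut HostContext<'a>, depth: u32) -> u64 {
    if depth == 0 {
        host.balance()
    } else {
        run_frame(host, depth - 1)
    }
}
```

With this shape, each nested frame holds a short reborrow of the context, and the compiler rejects any attempt to keep the wrapper around after the underlying state is gone.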

I’ll spare you the details, but it took a very long time to get this all working. I’m really proud that the tests are passing and everything seems to be working well, in spite of this complexity. And I learned an enormous amount about Rust memory management in the process of figuring it out. One of the really exciting things about implementing Athena in Rust is that, aside from a little bit of unsafe glue code, we get a lot of strong safety guarantees from Rust’s type system and borrow checker.

Thing #3: Minimal Program 🤏

Athena primarily relies upon the ELF standard for storing, loading, and executing binaries. This is great because ELF is about as OG as a standard can get. It’s existed for decades, it hasn’t changed in forever, and it’s the format that Linux uses for executables and libraries. It’s old and a little crufty, but it’s old for a reason: it works well, it’s battle-tested, and there isn’t a big reason to change it.

Of course ELF isn’t perfect. One downside is that it produces files that are larger than we need. This has to do with the way ELF files are constructed: ELF files have a header and various sections containing metadata. Even a very tiny Rust program compiles to a 72 KB ELF. To give you a sense of the overhead involved, the PolkaVM project built a custom pipeline that includes a custom linker and a stripper to remove unused code and data and shrink the binary as much as possible. It turns an 800 KB binary produced by Rust into just 85 bytes! That’s nearly 1,000,000% inflation!
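To show where some of that fixed overhead comes from, here’s a minimal sketch that reads a few fields from the ELF file header, which alone takes up 52 bytes in a 32-bit ELF before any code or section metadata. The struct and function names are made up for illustration, and this is not the loader Athena uses. The offsets and constants follow the standard ELF header layout.

```rust
/// Summary of the few ELF header fields this sketch inspects.
#[derive(Debug, PartialEq, Eq)]
struct ElfSummary {
    is_32_bit: bool,
    is_little_endian: bool,
    machine: u16,
}

/// `e_machine` value assigned to RISC-V.
const EM_RISCV: u16 = 243;

/// Size in bytes of the fixed ELF32 file header.
const ELF32_HEADER_LEN: usize = 52;

/// Parses the identification bytes and `e_machine` from an ELF header.
/// Returns `None` if the input is too short or lacks the ELF magic.
fn parse_elf_header(bytes: &[u8]) -> Option<ElfSummary> {
    if bytes.len() < ELF32_HEADER_LEN || !bytes.starts_with(b"\x7fELF") {
        return None;
    }
    let is_32_bit = bytes[4] == 1; // EI_CLASS: 1 = 32-bit
    let is_little_endian = bytes[5] == 1; // EI_DATA: 1 = little-endian
    let raw = [bytes[18], bytes[19]]; // e_machine lives at offset 18
    let machine = if is_little_endian {
        u16::from_le_bytes(raw)
    } else {
        u16::from_be_bytes(raw)
    };
    Some(ElfSummary {
        is_32_bit,
        is_little_endian,
        machine,
    })
}
```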

This is fine in most places ELF is used. These days hard drives are large, disk space is cheap, and no one cares about a few extra kilobytes here or there. But it does matter for blockchain, at least for now. Big ELF files mean potentially big transactions to deploy new program code to the blockchain, on the order of tens or even hundreds of kilobytes, and that’ll be very expensive. (For comparison, most blockchain transactions these days are on the order of 100 bytes, and programs tend to be a few hundred to a few thousand bytes.)

I initially tried to create some really minimal programs to see what was possible, but I kept running into various problems. When I began removing sections from the ELF, or hand-coding them, the interpreter failed to load and execute the code. When I was about to give up I discovered two fascinating articles, and I felt reinspired to try again.

Understanding what is and isn’t possible required a deep dive into the ELF format, and into the decompiler that Athena uses to read it. It required understanding not only how program code is compiled, linked, and stored, but also how code and data end up split into different sections in the ELF file. It required getting a full RV32EM compiler toolchain up and running, a challenge that I wrote about above, and learning how to use tools I’d never used before, including as, ld, strip, objcopy, nm, readelf, and hexdump. It required learning how to write RISC-V instructions by hand. Finally, it required modifying Athena to accept raw RISC-V bytecode rather than an ELF.

I eventually ended up writing a minimal test of the Athena host functions that consists of only eight instructions. It looks like this:

    .section .text
    .global _start

    _start:
        addi x10, x0, 0x100   # load address to write result
        addi x5, x0, 0xa3     # load host getbalance syscall number
        ecall
        addi x10, x0, 3       # load fd (3)
        addi x11, x0, 0x100   # load writebuf address (result address)
        addi x12, x0, 32      # load nbytes (32)
        addi x5, x0, 2        # load write syscall number
        ecall

I compiled it directly to RISC-V bytecode, without using ELF. (It was possible in this test case since the code is simple, and it doesn’t depend on any libraries or data.) I added a new Athena magic number at the start of the file and, voilà, we were off to the races. The final file size is 36 bytes (four bytes per instruction, times eight instructions, plus the four-byte magic number), a far cry from the 72 KB we started at! (It’s also smaller than the 45 byte program the author of the first article mentioned above finally achieved!)
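The size arithmetic can be checked with a short Rust sketch that encodes the same eight instructions by hand using the standard RISC-V I-type layout. The magic number below is a placeholder, since the real Athena value isn’t given here, and the function names are made up for illustration.

```rust
/// Placeholder four-byte magic number; the real Athena value is not given in this post.
const HYPOTHETICAL_MAGIC: [u8; 4] = *b"ATH0";

/// Encoding of `ecall`.
const ECALL: u32 = 0x0000_0073;

/// Encodes `addi rd, rs1, imm` (I-type, opcode 0x13, funct3 0).
fn addi(rd: u32, rs1: u32, imm: i32) -> u32 {
    assert!(rd < 16 && rs1 < 16, "RV32E only has x0..=x15");
    assert!((-2048..=2047).contains(&imm), "immediate must fit in 12 bits");
    (((imm as u32) & 0xfff) << 20) | (rs1 << 15) | (rd << 7) | 0x13
}

/// Builds the raw program: magic number followed by eight little-endian instructions.
fn minimal_program() -> Vec<u8> {
    let instrs = [
        addi(10, 0, 0x100), // address to write result
        addi(5, 0, 0xa3),   // host getbalance syscall number
        ECALL,
        addi(10, 0, 3),     // fd
        addi(11, 0, 0x100), // write buffer address
        addi(12, 0, 32),    // nbytes
        addi(5, 0, 2),      // write syscall number
        ECALL,
    ];
    let mut out = HYPOTHETICAL_MAGIC.to_vec();
    for instr in instrs.iter() {
        out.extend_from_slice(&instr.to_le_bytes());
    }
    out
}
```

Calling `minimal_program()` yields 36 bytes: 4 for the magic number plus 8 × 4 for the instructions.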

This took the better part of a day. When I saw the program running successfully, I nearly fell out of my chair. It feels like a superpower to write raw instructions and see them successfully trigger a complex chain of events in a system like Athena, with everything working exactly as it should. For now it’s only used in tests, but it also shows what’s possible in the context of blockchain programs. The Athena tooling, because it’s just Rust, will continue to produce bloated ELF files for now, but at least now we know how we can reduce this size dramatically if we need to.

It’s also reassuring to know that the clever folks working on PolkaVM are a bit further along than we are, as mentioned above. They’ve already developed a custom compiler, linker, and binary format for their VM, which removes the bloat of the ELF. We may end up going the same route, but in the interest of remaining standards-compliant and keeping things as simple as possible for as long as possible, we’re not doing that yet. It’s a tradeoff between code size on the one hand and toolchain complexity on the other, and for now we’re choosing simplicity.
