I wrote two weeks ago about some of the difficult but interesting technical challenges we’ve encountered while working on Athena. As a complex project with many moving parts, Athena presents no shortage of technical challenges. Here are three more.
Thing #4: Upstream Changes 🚣🏻‍♀️
"To improve is to change; to be perfect is to change often." — Winston Churchill
Athena has many components and serves multiple functions. But at its core, Athena is a straightforward RISC-V interpreter that runs programs compiled to RV32EM and RV32IM, the two RISC-V variants that Athena supports. Initially, I couldn't find a mature, “off-the-shelf” RISC-V interpreter that met my needs, so I began writing this interpreter from scratch.
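To make "straightforward RISC-V interpreter" concrete, here's a toy fetch-decode-execute loop in the same spirit. It handles just two RV32I instructions (ADDI and ADD), skips the funct3/funct7 disambiguation a real decoder needs, and is purely illustrative; it is not Athena's actual code.

```rust
/// A toy RV32 interpreter core: 32 registers, a program counter, and a
/// fetch-decode-execute loop. Illustrative only, not Athena's implementation.
struct Cpu {
    regs: [u32; 32],
    pc: u32,
}

impl Cpu {
    fn new() -> Self {
        Cpu { regs: [0; 32], pc: 0 }
    }

    /// Execute instructions from `program` until the pc runs off the end.
    fn run(&mut self, program: &[u32]) {
        while let Some(&insn) = program.get((self.pc / 4) as usize) {
            self.execute(insn);
            self.pc += 4;
        }
    }

    fn execute(&mut self, insn: u32) {
        // Decode the common RV32 fields.
        let opcode = insn & 0x7f;
        let rd = ((insn >> 7) & 0x1f) as usize;
        let rs1 = ((insn >> 15) & 0x1f) as usize;
        let rs2 = ((insn >> 20) & 0x1f) as usize;
        match opcode {
            // ADDI: rd = rs1 + sign-extended 12-bit immediate.
            // (A real decoder would also check funct3.)
            0x13 => {
                let imm = (insn as i32) >> 20; // arithmetic shift sign-extends
                self.regs[rd] = self.regs[rs1].wrapping_add(imm as u32);
            }
            // ADD: rd = rs1 + rs2. (A real decoder would also check funct7,
            // which distinguishes ADD from SUB.)
            0x33 => {
                self.regs[rd] = self.regs[rs1].wrapping_add(self.regs[rs2]);
            }
            _ => panic!("unimplemented opcode {opcode:#x}"),
        }
        self.regs[0] = 0; // x0 is hardwired to zero
    }
}
```

The real core must also handle memory, control flow, and host calls, but the shape is the same: a loop that decodes and dispatches one instruction at a time.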
Later, while studying the code for SP1—a RISC-V zkVM we plan to use for generating ZK proofs for Athena programs when we transition to a ZK rollup—I realized I was looking at precisely what I needed: at its core, SP1 is also a straightforward RISC-V interpreter, albeit wrapped in a lot of ZK magic. To prove correct execution, a zkVM needs to demonstrate not only that individual instructions were executed in the right order but also that the entire state of the virtual machine is correct, including memory access, the state of registers, the stack pointer, and the program counter. SP1 includes logic to handle all this.
I decided to experiment. I took the SP1 core code and, piece by piece, began stripping away all the ZK-related components and “extra magic.” For instance, whenever the SP1 interpreter reads from or writes to memory, it creates a record of the memory access and includes it in the program transcript, which is what gets proven. Athena doesn’t need this functionality. The process took a few months, but eventually, all of the ZK-related code was removed from the core. The experiment worked: this is where the Athena core originally came from, although we’ve made numerous changes and improvements since then.
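As a rough illustration of the kind of surgery involved (a paraphrase, not actual SP1 or Athena code): a zkVM-style memory both applies each write and records it for the proof transcript, while a plain interpreter can simply apply it.

```rust
use std::collections::HashMap;

/// A record of one memory access, of the sort a zkVM appends to its
/// transcript so the proof can attest to every read and write.
#[derive(Debug, PartialEq)]
struct MemoryWriteRecord {
    addr: u32,
    value: u32,
    clk: u64, // cycle count at which the write occurred
}

/// zkVM-style memory: every write is both applied and recorded.
struct ProvingMemory {
    cells: HashMap<u32, u32>,
    transcript: Vec<MemoryWriteRecord>,
    clk: u64,
}

impl ProvingMemory {
    fn new() -> Self {
        ProvingMemory { cells: HashMap::new(), transcript: Vec::new(), clk: 0 }
    }

    fn write(&mut self, addr: u32, value: u32) {
        self.clk += 1;
        self.cells.insert(addr, value);
        // The extra bookkeeping a plain interpreter doesn't need:
        self.transcript.push(MemoryWriteRecord { addr, value, clk: self.clk });
    }
}

/// Plain interpreter memory: the same write, with the proof machinery removed.
struct PlainMemory {
    cells: HashMap<u32, u32>,
}

impl PlainMemory {
    fn new() -> Self {
        PlainMemory { cells: HashMap::new() }
    }

    fn write(&mut self, addr: u32, value: u32) {
        self.cells.insert(addr, value);
    }
}
```

Multiply this pattern across every instruction, register, and memory access and you get a sense of how much code had to come out.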
This raises an obvious question: why not just use the SP1 code as a library? In an ideal world, the SP1 core would be available as a standalone Rust crate, without all the ZK bits and bobs. Then we could import just the code we need and rely on it, rather than maintaining a fork of their code. I actually tried to do this, but it wasn’t possible. SP1’s core wasn’t designed or built to be used as a third-party library, and there’s no way to “turn off the ZK” without forking the code.
The downside is that when SP1 releases updates, there’s no easy way to pull them in. It took a full day to review the updates in the 1.0 release and integrate just the changes relevant to Athena. Athena’s core may still be SP1’s core at heart, but the two codebases have diverged so significantly that even Git, with its powerful diff engine, can’t reconcile them, so I had to do it by hand. The good news is that the core code is relatively stable and likely won’t see many updates in the future.
I have some ideas for partially automating this process. We could probably use GitHub Actions to monitor SP1 releases, filter them based on file location, and highlight what’s changed, but there’s no tool powerful enough to automate the entire process today. (In the not-too-distant future, I could imagine an AI engine handling this.)
Another, cleaner option is to “upstream” our changes, which would involve working with the SP1 team to refactor their core into something that other projects like Athena can rely on. However, this is a huge project that isn’t a high priority and won’t happen anytime soon. For now, we’re stuck with this annoying, time-consuming manual reconciliation process.
Thing #5: It’s Just Rust 🦀
"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away." — Antoine de Saint-Exupéry
Athena is far from the first project to use Rust for blockchain programmability. Other projects, including Polkadot, NEAR, and Solana, also use Rust, and domain-specific languages like Move and Cairo are very similar to Rust. However, as I dug deeper, I began to understand why none of these projects are truly just using Rust and why doing so is so challenging.
Let’s start with the first point: why other projects only sort of use Rust. Solana, for example, uses Rust, but there are two issues. First, they maintain their own fork of Rust and the Rust toolchain. As far as I know, the main reason is Solana’s eBPF-derived compilation target, which required Solana-specific customizations to the compiler backend. While this isn’t catastrophic, it does mean that the Solana version of Rust will always be slightly out of date, and maintaining the fork adds ongoing work.
The second issue is that Solana programs aren’t written in idiomatic Rust; instead, they require Solana-specific conventions that contradict Rust conventions. This effectively turns Solana Rust into a DSL built on top of Rust. Three examples:

1. Programs must have a single entry point, making it impossible to use natural, idiomatic Rust constructs like free functions or a struct with methods. Instead, each program must internally dispatch calls to other functions.
2. Calling another program using cross-program invocation (CPI) is extremely complicated and bears little resemblance to calling a method on an object or calling into a foreign library.
3. Programs can’t return values directly; instead, they must use a convoluted set of syscalls to return something, and even this wasn’t originally possible.

While none of these issues is a dealbreaker on its own, taken together they mean that writing programs for Solana isn’t really writing Rust as you would if you were compiling with the mainline Rust compiler. You’re writing Solana-flavored Rust. There isn’t space here to go through all the other chains and VMs I mentioned, but they all have similar limitations.
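To make the first limitation concrete, here's a schematic sketch of the single-entry-point pattern in plain Rust. The tag byte, function names, and error type are invented for illustration; this is not the real Solana API.

```rust
/// Schematic sketch of a single-entry-point program: one exported function
/// that decodes a tag from the raw instruction data and dispatches by hand.
/// (Invented encoding; not the actual Solana API.)
fn process_instruction(instruction_data: &[u8]) -> Result<u64, String> {
    // The first byte tags which "method" the caller wants; everything must
    // be routed manually from this one entry point.
    match instruction_data.split_first() {
        Some((&0, rest)) => deposit(rest),
        Some((&1, rest)) => withdraw(rest),
        _ => Err("unknown instruction".to_string()),
    }
}

fn deposit(args: &[u8]) -> Result<u64, String> {
    decode_amount(args) // a real program would update account state here
}

fn withdraw(args: &[u8]) -> Result<u64, String> {
    decode_amount(args)
}

fn decode_amount(args: &[u8]) -> Result<u64, String> {
    let bytes: [u8; 8] = args.try_into().map_err(|_| "bad args".to_string())?;
    Ok(u64::from_le_bytes(bytes))
}
```

Compare this to simply exposing `deposit` and `withdraw` as ordinary public functions or methods, which is how idiomatic Rust would express the same program.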
Now, onto the second point: why is it so difficult to just use Rust, and why don’t more projects do it? Why only sort of use Rust rather than going all in on the actual Rust toolchain? It took me some time to understand this, but it comes down to the degree of control the execution environment has over how programs are compiled and run.
At one extreme, you have the EVM and Solidity. Ethereum has its own compiler and its own domain-specific VM, giving it full control over every detail of how Ethereum smart contracts are written, compiled, linked, stored, and executed. The tradeoff is that developers can’t use mature, familiar tools like Rust when writing Ethereum smart contracts.
Next are projects like PolkaVM, which also use Rust but only partially. PolkaVM includes complex macros (more on these in a moment), a custom linker, and a custom bytecode format. This means they’re only using part of the Rust toolchain, requiring special, domain-specific tools to write programs for PolkaVM. Like Solana, these programs don’t look like idiomatic Rust, and the result is significant complexity. Solana, NEAR, and others are similar to PolkaVM in this regard.
Athena, however, is positioned at the opposite end of the spectrum. We aim to use only the standard Rust toolchain, with no special or custom modifications. I’ve discussed the benefits of this decision repeatedly, but in short, it means we’re using the most mature, battle-tested, and widely adopted tools, which are familiar to the most developers. Writing programs for Athena should feel very familiar to any Rust developer and shouldn’t require additional training. Accordingly, Athena programs are designed to look and feel like idiomatic Rust.
The tradeoff here is control. Since we’re using the default Rust tools, we have much less control over how programs are compiled, linked, and stored. In particular, we have no way to make deep changes to the program AST at compile time.
However, we do have one powerful tool at our disposal: procedural macros. This is a complex and powerful Rust feature that gives us some control over how a program is compiled. A Rust macro can make deep changes to a program’s structure. We currently use macros to bootstrap a program’s entry point, set up memory and the stack pointer, and wrap template methods to make them externally callable. Procedural macros are somewhat limited—for example, they don’t have access to the full AST with type information—but I believe they’re powerful enough for our purposes. They allow us to write an Athena program that looks like this.
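Roughly speaking, that is (the names below are invented, and I've omitted the attribute macros themselves, since the real annotations are exactly the part the macro machinery owns): the program is just an ordinary Rust struct with ordinary methods.

```rust
// Illustrative only: the names are invented, and the attribute macros that
// would mark `Wallet` and its methods as an Athena template are omitted.
pub struct Wallet {
    balance: u64,
}

impl Wallet {
    pub fn new(balance: u64) -> Self {
        Wallet { balance }
    }

    /// An externally callable method, written as plain, idiomatic Rust:
    /// no manual dispatch and no hand-rolled entry point in the author's code.
    pub fn deposit(&mut self, amount: u64) -> u64 {
        self.balance += amount;
        self.balance
    }

    pub fn balance(&self) -> u64 {
        self.balance
    }
}
```

The entry point, FFI plumbing, and state handling are generated behind the scenes, which is the whole point: the author writes Rust, not Athena-flavored Rust.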
The macros we use with Athena must perform several tasks. First, they expose functions in the program binary so they can be called directly by external callers. Second, they wrap the function, its arguments, and its return value into FFI-compatible C types. We also use the macro to “rehydrate” template structs using the program state stored on the blockchain so that programs can be written as ordinary Rust structs.
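Here's a hand-written sketch of the kind of glue the macro has to generate. Everything in it is simplified and assumed: the state encoding, the symbol name, and the signature are invented for illustration, and the real generated code is more involved.

```rust
// A hand-written stand-in (simplified and assumed) for what the wrapping
// macro generates: an FFI-compatible entry point that rehydrates the program
// struct from stored state, calls the idiomatic method, and returns a C type.

pub struct Counter {
    count: u64,
}

impl Counter {
    /// The idiomatic method the program author actually writes.
    pub fn increment(&mut self, by: u64) -> u64 {
        self.count += by;
        self.count
    }
}

/// "Rehydrate" the struct from its serialized on-chain state. Plain
/// little-endian bytes here; the real encoding is an assumption left out.
fn rehydrate(state: &[u8]) -> Counter {
    let bytes: [u8; 8] = state.try_into().expect("bad state");
    Counter { count: u64::from_le_bytes(bytes) }
}

/// The kind of wrapper a macro would generate: an exported, unmangled symbol
/// with an FFI-safe signature (raw pointer and lengths in, raw value out).
#[no_mangle]
pub extern "C" fn counter_increment(state_ptr: *const u8, state_len: usize, by: u64) -> u64 {
    // SAFETY: the host guarantees `state_ptr` points to `state_len` valid bytes.
    let state = unsafe { std::slice::from_raw_parts(state_ptr, state_len) };
    let mut counter = rehydrate(state);
    counter.increment(by)
}
```

The macro's job is to emit this boilerplate for every callable method so that none of it appears in the program author's source.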
None of this is straightforward, and it’s not finished yet, but it is working (here’s an end-to-end test that passes). In fact, I’m amazed at how far we’ve already come. We may ultimately decide to follow the path taken by PolkaVM, Solana, and others, taking more control of the toolchain. This would make certain things easier and allow us to do things that we currently can’t. For example, as I wrote previously, if we decide we want smaller binaries on-chain, we may need a custom linker and possibly a custom bytecode format. But we’ll make those decisions as necessary, and not before. The goal is for Athena to rely as much as possible on the entire Rust toolchain for as long as possible.
Thing #6: Platform Independence 🗽
"The art of simplicity is a puzzle of complexity." — Douglas Horton
It’s challenging enough to get something as complex as Athena working on a single platform. Expanding it to work across multiple platforms and architectures is even more difficult. For software developers, the holy grail has long been a framework—whether a language or a VM—that’s truly platform-independent and allows you to “write once, run anywhere.”
To some extent, this is now possible with languages like Python and JavaScript and platforms like the web. Both languages do a good job of isolating you from the operating system and hardware, enabling you to build a wide variety of applications without much concern for the underlying platform. Moreover, it’s safe to assume that most users, regardless of their system’s architecture, can run your web application if it’s built using modern web standards. A web browser and its JavaScript runtime essentially function as a big VM running platform-independent code, and in this respect a browser is not unlike Athena.
However, building that underlying platform itself is far from straightforward. Ultimately, Athena programs, like web apps, need to run on actual, physical hardware. One way to think about a web browser is as a translation layer that runs programs in isolation from the underlying operating system, proxying platform-specific behavior down to the OS when hardware access is required. Athena has to perform a similar role, though its capabilities are currently far more limited than those of a web browser (e.g., no networking, no file access, no direct hardware access). You can think of Athena as a translation layer between the Athena program and the underlying system.
For this reason, Athena needs to run on virtually any modern system. This means supporting the “big three” operating systems—Linux, macOS, and Windows—and supporting both AMD64 and ARM64 architectures, at a minimum. Together, that’s six permutations, and we may choose to add support for additional platforms in the future.
To be clear, we already face this challenge with our other applications, including go-spacemesh and its dependencies. This has led to complex tooling, such as the CI/CD scripts we use to compile and import libraries across multiple platforms, a process that must often be repeated three times, once for each operating system. But it’s a bit harder for Athena because Athena needs to translate the host’s underlying hardware instruction set, something go-spacemesh doesn’t have to worry about (it’s compiled to native code, and there aren’t any applications running on top of it).
Currently, we interpret programs on Athena. This is the simplest and most correct approach, but it’s inefficient. Many other projects, including Solana, NEAR, and PolkaVM, opt to run programs using a JIT (“just-in-time”) compiler for performance reasons, and efforts are underway to develop similar tools for the EVM. The difference between interpretation and compilation is significant. Compilation involves translating each Athena RISC-V instruction into one or more instructions that the underlying hardware can understand, rather than emulating those instructions purely in software, as an interpreter does. If we want Athena to be truly cross-platform, we’d potentially need six different translators/cross-compilers/transpilers, one for each OS × architecture permutation mentioned above.
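The contrast can be sketched in miniature. In the toy below (an invented two-instruction ISA, not RISC-V, with closures standing in for emitted machine code), the interpreter decodes every instruction on every execution, while the translator decodes once, up front, into directly executable host code.

```rust
/// A made-up two-instruction ISA, used only to contrast the two strategies.
#[derive(Clone, Copy)]
enum Insn {
    AddImm { rd: usize, rs: usize, imm: u32 },
    Add { rd: usize, rs1: usize, rs2: usize },
}

/// Interpretation: decode and dispatch every instruction, every time it runs.
fn interpret(program: &[Insn], regs: &mut [u32; 8]) {
    for insn in program {
        match *insn {
            Insn::AddImm { rd, rs, imm } => regs[rd] = regs[rs].wrapping_add(imm),
            Insn::Add { rd, rs1, rs2 } => regs[rd] = regs[rs1].wrapping_add(regs[rs2]),
        }
    }
}

/// Translation: pay the decoding cost once, producing host code that runs
/// directly afterward. (A real JIT emits native machine instructions; boxed
/// closures stand in for them here.)
fn translate(program: &[Insn]) -> Vec<Box<dyn Fn(&mut [u32; 8])>> {
    program
        .iter()
        .map(|insn| -> Box<dyn Fn(&mut [u32; 8])> {
            match *insn {
                Insn::AddImm { rd, rs, imm } => {
                    Box::new(move |regs: &mut [u32; 8]| regs[rd] = regs[rs].wrapping_add(imm))
                }
                Insn::Add { rd, rs1, rs2 } => {
                    Box::new(move |regs: &mut [u32; 8]| {
                        regs[rd] = regs[rs1].wrapping_add(regs[rs2])
                    })
                }
            }
        })
        .collect()
}

/// Running translated code is just running the host code in order.
fn run_translated(code: &[Box<dyn Fn(&mut [u32; 8])>], regs: &mut [u32; 8]) {
    for op in code {
        op(regs);
    }
}
```

Both paths compute the same result; the difference is where the decoding work happens, which is exactly why translated code wins on hot loops.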
In reality, we’ll start with just one: Linux on AMD64. This single combination accounts for the vast majority of systems that will run Athena nodes in production for the foreseeable future. For now, all other platforms will fall back on an interpreter-based “compatibility mode,” which is the solution adopted by the other projects I mentioned. I hope we’ll be able to add native support for each platform over time, and as RISC-V tooling improves, this may become easier.
Our goal is for Athena to support the wide variety of platforms where people already run Spacemesh. By far the hardest platform to support is Windows, for various reasons. It’s simply more challenging to develop for, and it doesn’t support many of the libraries and tools that macOS and Linux do. None of our developers personally run Windows, so development work has to be done on emulators or remote systems, which can be frustrating. Fortunately, Rust has robust cross-platform support, so I don’t foresee major issues in getting Athena running everywhere. I even hope it will eventually run on resource-constrained devices such as a Raspberry Pi or a smartphone.