The first project I worked on in Ethereum was Ewasm: an Ethereum-flavored VM based on WebAssembly (Wasm). It was never deployed on Ethereum, but it had a large impact on the Wasm-based VMs that later appeared in projects including Polkadot, NEAR, and Arbitrum.
At the time, Spacemesh was working on a similar Wasm-based VM called SVM (Spacemesh Virtual Machine). Like Ewasm, SVM was retired without ultimately being deployed: we made the difficult but correct decision to launch the initial version of Spacemesh without a full VM. But the work done on SVM has had a profound impact on our current plans for a Spacemesh VM. (This is how technology R&D works: dead ends are sometimes just as instructive as successes.)
Without a full VM, Spacemesh in its current form isn’t much more than a cryptocurrency like Bitcoin. With a full VM, it’ll become a general purpose compute platform that can run any sort of application our incredible community of developers can dream of (within certain resource bounds). For this reason the VM has always been central to our vision of a mature Spacemesh network and protocol, and it’s the biggest thing we’re working on today other than general network stability and usability. It’s also always been the part of Spacemesh I’m most excited to work on, because it’s what makes Spacemesh useful and exciting to millions of existing and potential blockchain developers, and because it gives us the opportunity to do something genuinely unique: to offer a much better developer experience than any existing blockchain VM.
We previously released a position paper outlining our goals and high level approach to designing the eventual Spacemesh VM, but it was scant on details. With the goal of advancing our VM design and filling in more of these details, the Spacemesh research team met face to face last week in Athens to discuss goals, candidate designs, and tradeoffs. It was a busy week full of deep tech, intense debate, and heated conversations. We managed to walk through the entire VM stack and address pretty much every open question, and in the end we reached consensus on a large number of decisions (which is why it’s so important to put everyone in the same room from time to time!). Here are the main things we discussed and the main decisions we made.
Thing #1: Athena 🔱
In proper Spacemesh fashion we’ve chosen a substantially novel VM design. We built proof of spacetime, our novel consensus mechanism, as a hybrid of the best parts of proof of work and proof of stake because we were dissatisfied with both alternatives; we’re equally dissatisfied with the existing VM options on offer.
The first successful, fully expressive blockchain VM was of course Ethereum’s EVM, and today there are many such L1 smart contract platforms to choose from. In theory they let you build anything, but in practice they all face the same constraint: scalability. L1 blockchains require that every single full node in the network execute every single transaction in order to independently verify that those transactions, and the blocks containing them, are valid. This is robust and secure, but it cannot begin to provide the scalability we need if we’re going to run the world’s financial infrastructure (and continue to trade meme coins) on chain.
Various L2 solutions, most famously rollups, have arisen to address this shortcoming. Rollup designs vary, and the nuances of the tradeoffs are beyond the scope of what we can cover here, but in general rollups trade security and sovereignty for scalability. They store their data and settle disputes elsewhere, typically on an L1 chain like Ethereum, and as a result allow a form of “vertical” scaling in which the rollup processes transactions much more quickly than the base layer chain. If you build many such rollups and introduce “quadratic rollups” (i.e., rollups-within-rollups) you get theoretically infinite scaling: indeed, this is precisely Ethereum’s vision for achieving massive, modular scalability.
There are two major downsides to the rollup model. The first is fragmentation. There are already dozens of popular rollups built on top of Ethereum, for instance, and moving among them requires managing assets on different chains, managing multiple accounts, traversing rickety bridges, and even using different wallet software. Applications deployed on one rollup cannot natively “talk to” (i.e., exchange data or assets with) applications deployed on other rollups.
The second downside is security and centralization. While all rollups have plans for progressive decentralization, none has yet achieved a degree of decentralization even remotely approaching that provided by an L1 chain like Ethereum. Decentralization really matters: it’s the reason we’re all here, the reason we’re building blockchains rather than centralized databases. Every major rollup today is run by a single, privileged, centralized operator and is controlled by a centralized multisig that manages things like upgrades. This is simply unsatisfactory, and it’s opposed to everything we stand for.
Spacemesh’s answer to all of this is a novel rollup design that we call Athena in honor of the city where it was first conceived and its patron goddess. Athena is the goddess of wisdom, craft, and warfare, which somehow feels highly appropriate for a blockchain VM! The full details of the Athena design are unfortunately also beyond the scope of what I can cover here, and the design isn’t complete, but we’ll share a more detailed technical description soon. For now I can at least sketch the outline.
Rather than relying on a single, centralized sequencer, Athena divides responsibility among three roles: Miner, Relay, and Executor. Together these roles bring Athena to life and address the aforementioned shortcomings of existing designs. Importantly, participation in each role is permissionless, ensuring that the entire Spacemesh stack, top to bottom, including the VM, is and remains permissionless and maximally decentralized.
Users transacting with the Athena VM, and applications built on it, interact with the network through nodes known as Relays. Relays maintain funds and accounts on both the Spacemesh L1 network and the Athena network, and expose public RPC endpoints that users and applications can talk to. They’re necessary because we don’t want users to need both an L1 account and an Athena account in order to transact on Athena, and because we equally don’t want L1 miners to have to run the Athena VM or validate Athena transactions. Relays bundle up Athena transactions and publish those bundles to the L1 Spacemesh network; they collect fees on Athena and subsequently pay L1 miners in $SMH for storing those bundles. Note that users who are willing and able to maintain L1 accounts can publish their own transactions directly to L1, which ensures that Athena enjoys the same censorship resistance guarantees as the Spacemesh L1 network itself.
L1 miners receive Athena transaction bundles from Relays and, while producing L1 blocks as they do today, deduplicate and sort Athena transactions and include them in the final, cooperatively produced L1 block for a given layer. This is the most distinctive aspect of the Athena design relative to existing rollup designs: because Athena is an “enshrined” rollup layer built on top of the Spacemesh L1, L1 miners have a unique degree of insight into Athena transactions. They don’t need to actually run the Athena VM, which is how Athena achieves scalability, but they do play an essential role in sequencing Athena transactions in a decentralized fashion (for those familiar with the idea, this is a bit similar to the notion of a based rollup). To reiterate: rather than relying on a single, centralized sequencer, and in contrast to existing production rollup designs, Athena is fully and meaningfully decentralized from day one.
Finally, once Athena transactions have been successfully committed to L1 blocks, the third class of actors, Executors, are responsible for executing those transactions in situ and maintaining the Athena state. What happens next depends on whether the design is Optimistic or ZK-based.
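To make the division of labor concrete, here’s a minimal sketch of the transaction flow in Rust. Every type and function name here is invented for illustration; none of this comes from the actual Spacemesh codebase.

```rust
// A minimal sketch of the Athena transaction flow. All names are invented
// for illustration; none of this comes from the Spacemesh codebase.

/// An Athena transaction as submitted by a user to a Relay.
struct AthenaTx {
    id: [u8; 32],     // content hash, used for deduplication
    payload: Vec<u8>, // the encoded call, opaque to L1 miners
}

/// Athena state, maintained only by Executors.
struct State;
impl State {
    fn apply(&mut self, _tx: &AthenaTx) { /* run the Athena VM here */ }
}

/// Step 1: a Relay bundles user transactions and publishes the bundle to L1,
/// paying L1 miners in $SMH out of the fees it collects on Athena.
fn relay_bundle(mempool: Vec<AthenaTx>) -> Vec<AthenaTx> {
    mempool
}

/// Step 2: L1 miners merge bundles from many Relays into the cooperatively
/// produced L1 block for a layer, deduplicating and sorting the transactions
/// WITHOUT executing them. This is where decentralized sequencing happens.
fn miner_sequence(bundles: Vec<Vec<AthenaTx>>) -> Vec<AthenaTx> {
    let mut txs: Vec<AthenaTx> = bundles.into_iter().flatten().collect();
    txs.sort_by(|a, b| a.id.cmp(&b.id)); // placeholder canonical ordering
    txs.dedup_by(|a, b| a.id == b.id);
    txs
}

/// Step 3: an Executor runs the now-sequenced transactions and maintains
/// the resulting Athena state.
fn executor_apply(txs: &[AthenaTx], state: &mut State) {
    for tx in txs {
        state.apply(tx);
    }
}
```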
Thing #2: Rollup 🌯
The security and correctness of transaction execution inside a rollup is guaranteed in one of two ways: optimistically or using zero knowledge proofs. In an optimistic rollup transactions are applied to state immediately (i.e., optimistically) but assets can’t be moved out of the rollup until a challenge period has passed. During this challenge period proofs of bad behavior known as fraud proofs can be submitted. Zero knowledge rollups work in the opposite fashion: in order for a transaction to be considered correct and canonical it must come with a ZK proof of correctness (known as a validity proof) attached. There’s no need for fraud proofs or a challenge window in ZK rollups since a validity proof by definition cannot be generated for an invalid transaction.
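A toy sketch makes the difference concrete. Everything here is invented for illustration, including the seven-day window, which is typical of optimistic rollups on Ethereum but, as we’ll see below, not something Athena will need:

```rust
// Toy comparison of the two assurance models, with invented types.

struct Proof; // stand-in for a ZK validity proof

fn verify(_proof: &Proof) -> bool {
    true // placeholder: a real verifier checks the proof cryptographically
}

const CHALLENGE_PERIOD_SECS: u64 = 7 * 24 * 60 * 60; // e.g., one week

/// Optimistic: state updates apply immediately, but assets can leave the
/// rollup only after the fraud proof window has passed without a challenge.
fn can_withdraw_optimistic(committed_at: u64, now: u64, fraud_proven: bool) -> bool {
    !fraud_proven && now >= committed_at + CHALLENGE_PERIOD_SECS
}

/// ZK: a transaction is canonical only if its validity proof verifies,
/// so there's no waiting period at all.
fn can_withdraw_zk(proof: &Proof) -> bool {
    verify(proof)
}
```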
There are major advantages and disadvantages to each design and, despite what you may hear on CryptoTwitter, neither design is optimal nor dominant at this stage. Optimistic rollups permit the efficient execution of transactions using existing VMs such as EVM that smart contract developers are already used to. The happy path for optimistic rollups is quite, well, happy and optimistic, but things get much more complicated when something goes wrong. Fraud proofs require a complex challenge-response game whereby an offending transaction or single instruction inside a transaction is isolated, the violator is punished, and the challenger is rewarded. Different rollups take different approaches to this protocol—some perform the entire thing on chain, others off chain, for instance—but the process is so delicate and so complex that no optimistic rollup has yet enabled permissionless fraud proofs.
By contrast, ZK rollups have the advantage of not requiring fraud proofs or a challenge period, but they have other significant downsides. First and most obviously, generating ZK validity proofs is expensive and slow. ZK proofs have gotten much more efficient than they used to be, but proving correct execution of a VM that isn’t designed to be ZK friendly is still slow and expensive. In our benchmarks, it takes on the order of 10 minutes on an ordinary, mid-range desktop computer to generate a single proof for a single transaction using a ZK friendly VM; doing the same for EVM, for example, could easily be 100x as expensive. In practice ZK rollups use enormously powerful infrastructure to generate these proofs in a timely fashion, but this doesn’t lend itself to decentralization and permissionless participation (most of us don’t have systems with 128 cores and 1TB of RAM lying around).
One popular ZK friendly VM is Cairo, but ZK friendly VMs like Cairo require programming languages and paradigms that are unfamiliar and confusing to most Web2 and Web3 developers. ZK is so complex that building a ZK rollup and proving architecture requires either spending tens of millions of dollars to recruit extraordinarily scarce talent, or else relying on the work of others who have already done this (such as StarkWare, the team behind Cairo). The downside here is that if something goes wrong you can’t fix it yourself and must wait for someone else to fix it, which means sacrificing some degree of sovereignty over your tech stack.
Finally, building a decentralized, permissionless set of incentives around the generation of ZK proofs is a hard, unsolved problem (which is another reason that ZK rollups use centralized operators and provers). How do you ensure that you have enough redundancy to handle failures but not too much duplication? (This tension is a well-understood problem in probability known as the coupon collector’s problem, and it gets significantly harder in a distributed context.) One nice property of ZK proofs is that the work to be done (e.g., the transactions in a block or layer) can be split up, with many proofs generated in parallel and later combined using recursive proofs, i.e., proofs of proofs, but this further complicates the game theory. Who’s responsible for generating recursive proofs? What happens if a proof fails to appear? And in case of a deep reorg, who’s responsible for regenerating proofs of old blocks and layers, and how are they compensated? These are just some of the unanswered questions we’re grappling with as we consider various ZK designs.
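To see the redundancy problem concretely: if each of n proof tasks is picked uniformly at random by independent provers, the coupon collector’s problem says the expected number of proof runs before every task is covered at least once is n·H(n), roughly n ln n, i.e., about a ln n factor of duplicated work under naive random assignment. A quick back-of-the-envelope illustration:

```rust
// Coupon collector arithmetic: with n proof tasks each assigned uniformly
// at random, the expected number of proof runs before every task is covered
// at least once is n * H(n) ~ n ln n.

fn expected_runs_to_cover_all(n: u64) -> f64 {
    let harmonic: f64 = (1..=n).map(|k| 1.0 / k as f64).sum();
    n as f64 * harmonic
}

fn main() {
    for n in [10u64, 100, 1000] {
        let runs = expected_runs_to_cover_all(n);
        // e.g., 100 tasks need ~519 random runs: over 5x duplicated work
        println!("{n} tasks: ~{runs:.0} runs ({:.1}x redundancy)", runs / n as f64);
    }
}
```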
Given these constraints and tradeoffs we decided to go with an optimistic design for now. In spite of its downsides we feel that the optimistic rollup design is more mature and more viable for Spacemesh today. We plan to ensure future ZK compatibility by choosing a ZK friendly machine (more on this below), but ZK and its complexity won’t be a blocker to designing and launching Athena. We can also avoid some of the biggest downsides of the optimistic design, such as the complex challenge-response protocol and the lengthy challenge period, due to the unique role of Spacemesh L1 miners in Athena. Athena transactions will have precisely the same finality as Spacemesh L1 transactions, obviating the need to wait a long time when moving assets from Athena to L1 (i.e., in Spacemesh, an Athena reorg is by definition a Spacemesh L1 reorg). And L1 miners can resolve challenges by directly executing challenged transactions, which vastly simplifies the optimistic protocol.
The final piece of the puzzle is the “machine” that’s used to maintain state and execute transactions!
Thing #3: Machine ⚙️
One of the most important decisions when designing a smart contract engine is choosing the “machine” that actually executes smart contract code. There are many options, including Bitcoin Script, EVM, the Move VM, the Solana VM, and more universal standards such as WebAssembly (Wasm) and RISC-V. Within each of these “families” of VMs there are further choices: ensuring that execution is deterministic (which usually means removing support for floating point operations), choosing a bytecode format and deciding how to interpret or compile it, choosing a word size and register count, working out gas metering and runtime constraints, deciding whether to support any degree of parallelism, etc. As you can imagine, there are significant tradeoffs among these many choices! And all of this is before even considering zero knowledge support.
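To give a flavor of just one of these choices, here’s how gas metering might look inside a simple bytecode interpreter loop. The opcodes and costs are invented for illustration and don’t correspond to any real VM:

```rust
// Gas metering sketched inside a naive bytecode interpreter loop.

enum Op {
    Add,   // cheap arithmetic
    Store, // state writes are priced far above arithmetic
    Halt,
}

fn run(program: &[Op], mut gas: u64) -> Result<(), &'static str> {
    for op in program {
        // Charge for the instruction before executing it, so a transaction
        // can never do work it hasn't paid for.
        let cost = match op {
            Op::Add => 3,
            Op::Store => 100,
            Op::Halt => 0,
        };
        gas = gas.checked_sub(cost).ok_or("out of gas")?;
        match op {
            Op::Halt => return Ok(()),
            _ => { /* execute the instruction, update stack and state */ }
        }
    }
    Ok(())
}
```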
Years ago I was turned off by blockchain-specific VMs and languages like EVM and Solidity, for the simple reason that I’ve already learned a huge number of programming languages, frameworks, and toolchains in my career as a developer, and I’m not super excited about learning even more. I loved the idea of using a universal standard like Wasm that would let me develop smart contracts in familiar languages like C and Rust using familiar, mature tools (gcc, cargo, my favorite IDE and macros, my debugger, etc.) rather than learning an entirely new set of tools. I thought this would make smart contract engineering appealing to millions of existing developers who would be less than excited about learning things like Solidity and EVM.
But over time I’ve come to realize that there are big downsides to this approach as well. The main one is that universal, standard machines and languages are by definition not tailored to the blockchain use case, which in reality is a very different programming paradigm from what most developers are used to, for several reasons. Security matters more than in other types of software development: your code is public, anyone can audit and call it, and bugs can cost hundreds of millions of dollars. Blockchain applications need to optimize for things like reproducible, deterministic execution and ZK friendliness that most applications don’t, and unlike other types of applications, code size matters enormously. For these reasons a domain-specific language and VM such as EVM or Move has some big advantages.
Another more nuanced and less discussed downside to adopting a global standard like Wasm is that doing so means losing control of the spec. Put another way, it means moving the entire Wasm toolchain, including the interpreter or compiler, inside your trusted computing base, such that a bug in either could subject you to attack and, in the extreme case, could even crash your blockchain. Adopting potentially millions of lines of third-party code in this fashion is a decision that shouldn’t be taken lightly. It also means that, when the upstream spec is updated, as happens quite often with global standards like RISC-V and Wasm, you must either adopt those updates or decline them and, over time, end up effectively maintaining a noncompliant fork of the standard, another decision that shouldn’t be taken lightly. By contrast, if you “own” your own bespoke VM design you’re never forced into a corner in this fashion.
Of course the execution engine is more than just the programming language and instruction set; it’s also the SDK. The SDK is the glue that allows smart contract code, regardless of which language it’s written in, to interact with the blockchain: transferring assets, reading and writing state, working with accounts, etc. You get this “for free” with a domain-specific machine and language like EVM or Move: accounts and coins are first-class citizens in EVM, for instance, and you can read and write state and transfer coins using built-in opcodes, while Move includes some elegant constraints that help a lot with safety. If you’re using an off-the-shelf option like Wasm, you need to write this SDK glue code yourself, which increases the surface area for bugs and exploits.
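To illustrate what that glue looks like, here’s a sketch of the kind of host-function boundary a chain team has to define and audit themselves when building on a generic machine. These bindings are invented for illustration; they’re not a real Spacemesh or Wasm API:

```rust
// Hypothetical host-function "glue" between contract code and the chain.

extern "C" {
    // Syscalls the host chain exposes to contract code:
    fn host_transfer(to: *const u8, amount: u64) -> i32;
    fn host_state_write(key: *const u8, key_len: usize, val: *const u8, val_len: usize) -> i32;
}

/// A safe wrapper the SDK would offer contract authors, hiding the raw FFI.
/// Every line of this boundary is attack surface; with EVM or Move the
/// equivalents are built-in opcodes or language features.
pub fn transfer(to: &[u8; 24], amount: u64) -> Result<(), ()> {
    // Assumption for illustration: 24-byte addresses, return code 0 = success.
    match unsafe { host_transfer(to.as_ptr(), amount) } {
        0 => Ok(()),
        _ => Err(()),
    }
}
```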
One more piece of the puzzle is the account model itself. Does your blockchain use accounts or UTXOs, or some novel hybrid such as Move’s resource model or Miden’s actor model? Where does state live: is a state item, e.g., someone’s token balance, attached to the token object, to the owner’s account object, or to something else? Can you access multiple state items in parallel? And how do you expire or evict unused or unpaid state items over time to prevent state bloat? That last idea sounds simple but is in practice borderline impossible to retrofit onto an existing system like EVM, and far easier if you design for it from the beginning.
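As one example of designing for it from the beginning, here’s a minimal sketch of what per-item state rent might look like. The fields and rules are invented for illustration; this is not the Athena design:

```rust
// A sketch of per-item state rent. Because every item carries its own
// funding metadata, eviction is a cheap local check rather than a global
// scan over all state, but only if it's designed in from the start.

struct StateItem {
    value: Vec<u8>,
    rent_balance: u64,   // fees prepaid to keep this item alive
    rent_per_layer: u64, // storage price per layer
    last_charged: u64,   // layer at which rent was last deducted
}

impl StateItem {
    /// Deduct accrued rent; returns true if the item should be evicted.
    fn charge_rent(&mut self, current_layer: u64) -> bool {
        let due = current_layer.saturating_sub(self.last_charged) * self.rent_per_layer;
        self.last_charged = current_layer;
        if due >= self.rent_balance {
            self.rent_balance = 0;
            true // unpaid: evict (perhaps keeping a hash so it can be revived)
        } else {
            self.rent_balance -= due;
            false
        }
    }
}
```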
This is a subset of the questions we’re considering as we decide the rest of the details for the Spacemesh VM. With all of this context in place I’ll explain in brief what we’re planning for Athena. The design isn’t final yet but we have the rough outlines.
We’ve decided to use RISC-V as our instruction set, i.e., the underlying “machine” that will run Spacemesh smart contracts. Given everything we know today, this seems like the optimal decision (note that others have come to the same conclusion). It allows us to adopt a global, open standard for which support inside and outside the blockchain ecosystem is rapidly growing, and it will allow smart contract developers to write code in their favorite existing languages using existing tools. Best of all, thanks to the innovative RISC Zero framework, it’s also reasonably ZK friendly.
What about the high-level language and VM design? Unfortunately, running the EVM inside RISC Zero is for now extremely inefficient (some projects, including Taiko, are doing this, and we wish them luck), and there’s currently no good way to compile Move to RISC-V. The simplest possible option is to use Rust: provide a barebones account model and an SDK that developers can use to build Spacemesh-compatible smart contracts in Rust (or, eventually, other supported languages), compile everything to RISC-V, and prove it in zero knowledge using RISC Zero.
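To give a sense of the developer experience we’re aiming for, here’s what a barebones contract might look like under this plan. The athena_sdk crate and everything in it is hypothetical; this is a sketch of the idea, not a real API:

```rust
// A hypothetical Athena contract: ordinary Rust, ordinary tooling,
// compiled to RISC-V and proven with RISC Zero. The athena_sdk crate
// and all of its items are invented for illustration.

use athena_sdk::{Address, Context}; // hypothetical SDK crate

/// A toy counter contract.
#[no_mangle]
pub extern "C" fn increment(ctx: &mut Context, by: u64) {
    let count: u64 = ctx.state_read("count").unwrap_or(0);
    ctx.state_write("count", count + by);
}

/// Transfers are mediated by the SDK, which enforces balance and
/// authorization checks so individual contracts don't have to.
#[no_mangle]
pub extern "C" fn pay_out(ctx: &mut Context, to: Address, amount: u64) {
    ctx.transfer(to, amount).expect("insufficient balance");
}
```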
We’re very excited about this plan but the devil’s in the details. We have to better understand existing blockchain VM SDKs and we have to finish designing our Athena account model to ensure that things like rent/state expiry are feasible. We really like Move’s resource model but need to figure out how to implement and enforce those constraints, whether at compile time or at run time. We have to figure out whether we can realistically support any degree of parallel execution. We have to better understand the RISC-V toolchain and work out the details of how smart contract code is compiled, stored, interpreted, metered, and safely executed. We hope to make rapid progress on these important questions and we’ll share progress as we do!
There’s obviously still a lot of work to be done but we couldn’t be more excited about the fact that we’re one step closer to a future where motivated developers can write, test, deploy, and release all sorts of applications on Spacemesh using cutting-edge tools.
Special thanks to Amira Bouguera and Tal Moran for reviewing this issue and providing helpful comments and suggestions.