Last week I wrote about something non-technical; this week, it’s time to go deep tech again. I wrote several articles late last year explaining some of the challenges we’ve faced at Spacemesh, and some of the design decisions we’ve made. I want to go a step further and focus on some of the most controversial decisions we’ve made. To be clear, all of the design decisions I wrote about previously are contrarian and indeed quite different from what other blockchains are doing. This week I want to explain why these decisions are so contrarian, why other projects haven’t adopted them, the associated tradeoffs, and why we’ve chosen them.
Let me start by saying that the act of launching a new layer one blockchain in 2023 is already contrarian, or so I’m regularly reminded. I’m actually not sure why that’s the case. Yes, there are many existing blockchains, but none of them are perfect and in fact each has large flaws and downsides (which is only natural given the tradeoffs inherent in such a complex design space). To me, launching a new blockchain in 2023 is a bit like launching a new search engine in 1998 or a new social network in 2004. Lots of people probably thought Google and Facebook were crazy to enter those crowded markets when they did, but we all know how those stories turned out.
If you’re going to enter a crowded market, the worst thing you can do is copy an incumbent’s strategy. The best thing you can do is something fresh and remarkably different. Yes, Spacemesh is taking risks, both social and technical, but those risks are calculated and intentional. We’ll succeed or fail on the merits of these decisions.
Thing #1: Self-Healing
I briefly mentioned self-healing in one of my previous articles. It’s one of the most distinctive aspects of the Spacemesh protocol, and also one of the most complex and technically difficult. To understand its purpose, it’s helpful to start with a fundamental tradeoff in distributed systems between safety and liveness (closely related to consistency and availability), best captured by the CAP theorem.
One of the best aspects of Bitcoin-style proof of work and Nakamoto Consensus is the way they strongly favor liveness. Bitcoin never stops producing blocks. It doesn’t matter how severe an attack is, or how many miners go offline. As long as a single miner is still running, it will continue to produce blocks. (Of course, block production will temporarily slow down if many miners suddenly disappear, but it will speed up again after a few difficulty adjustments. We saw this happen in June 2021 when Chinese miners went offline.) Bitcoin can thus “heal” from an attack of any severity or Byzantine behavior of any degree.
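The difficulty adjustment is what lets Bitcoin recover its block rate after hash power disappears. The sketch below shows the core retargeting arithmetic: every 2016 blocks the target is rescaled by how long those blocks actually took versus the intended two weeks, clamped to a factor of 4. It is simplified (the real consensus code also caps the target at a network-wide maximum and works on a compact encoding), so treat it as an illustration rather than Bitcoin Core’s implementation.

```python
# Bitcoin retargets every 2016 blocks, aiming for one block per 600 seconds.
RETARGET_INTERVAL = 2016
TARGET_BLOCK_TIME = 600  # seconds
TARGET_TIMESPAN = RETARGET_INTERVAL * TARGET_BLOCK_TIME  # two weeks


def retarget(old_target: int, actual_timespan: int) -> int:
    """Return the new proof-of-work target after one retarget period.

    A higher target means an easier puzzle. If blocks came too slowly
    (e.g., because miners went offline), actual_timespan is large and
    the target grows, so the remaining miners find blocks faster.
    """
    # Consensus rules limit each adjustment to a factor of 4 either way.
    clamped = max(TARGET_TIMESPAN // 4, min(actual_timespan, TARGET_TIMESPAN * 4))
    return old_target * clamped // TARGET_TIMESPAN
```

For example, if half the hash rate vanishes and the period takes four weeks instead of two, `retarget(t, 2 * TARGET_TIMESPAN)` doubles the target (halving the difficulty), and block times drift back toward ten minutes.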
Proof of stake systems, by contrast, don’t have this property and favor safety over liveness. In this respect their consensus is much more fragile. If consensus breaks temporarily for any reason (e.g., because assumptions are temporarily violated and fewer than ⅔ of validators are online, voting, and acting honestly), the protocol is effectively stuck with no way to reestablish consensus. This is due to the way in which proof of stake systems process transitions in the validator set: these involve ordinary transactions. New validators that want to join declare their intent and deposit their stake as a transaction; incumbent validators that want to stop validating send exit transactions. This leads to a circular dependency. If there isn’t a sufficiently large set of honest validators to notarize these transactions, then there also isn’t a way to repair consensus by changing the validator set. QED.
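To make the circular dependency concrete, here is a toy model, assuming a BFT-style rule where a transaction is finalized only if more than ⅔ of stake votes for it (the exact threshold varies by protocol). Validator joins and exits are modeled as ordinary transactions that must clear that same quorum, so once too many validators are offline, the change that would fix the validator set can never be finalized. The function names are hypothetical.

```python
from fractions import Fraction

QUORUM = Fraction(2, 3)  # assumption: finalization needs strictly more than 2/3 of stake


def has_quorum(online_stake: int, total_stake: int) -> bool:
    """True if the online, honest stake is enough to finalize anything."""
    return Fraction(online_stake, total_stake) > QUORUM


def apply_validator_change(validators: dict[str, int], online: set[str],
                           change: tuple[str, str, int]) -> bool:
    """Try to finalize a join/exit transaction; return False if consensus is stuck."""
    total = sum(validators.values())
    online_stake = sum(stake for v, stake in validators.items() if v in online)
    if not has_quorum(online_stake, total):
        # Circular dependency: the set cannot be repaired without a quorum,
        # and the quorum cannot be restored without repairing the set.
        return False
    kind, name, stake = change
    if kind == "join":
        validators[name] = stake
    elif kind == "exit":
        validators.pop(name, None)
    return True
```

With three equal validators and only one online, even an attempt to remove an offline validator (`("exit", "c", 0)`) fails, which is exactly the situation that ends in a manual restart.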
In this situation the only choice for a proof of stake network is manual intervention, and a manual restart. We’ve seen this happen several times recently to networks like Solana. To put it mildly, such an intervention is suboptimal because it violates many of the trust assumptions of Bitcoin.
Self-healing is the mechanism by which Spacemesh guarantees both liveness and safety. The key thing to understand about Spacemesh is that it’s a hybrid of proof of work and proof of stake that attempts to capture the best aspects of both. Spacemesh is a bit more like proof of stake in that it has a fixed set of smeshers (what we call our miners) in each epoch. Unlike proof of stake, however, the Spacemesh protocol is able to repair itself even if consensus breaks temporarily, thanks to the self-healing mechanism.
As I wrote about recently, Spacemesh consensus in fact relies on two separate mechanisms: Tortoise and Hare. Hare relies on strict consensus about the current “active set” of miners (although we’re working on a new design that doesn’t have this requirement). If this consensus breaks, Hare stops working, and this is where Tortoise comes in. Tortoise does not rely on such strict consensus. It is slower than Hare (as the name suggests), but it can establish consensus even without perfect agreement on the active set. This is because it counts votes for and against individual blocks and relies on a simple “yes” or “no” majority, rather than on votes for specific chains as other protocols do. Each vote (we call them “ballots”) is freestanding and valid in its own right, assuming it was generated by a valid miner.
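Here is a minimal sketch of per-block vote counting in the spirit described above: each ballot independently says “yes” or “no” to individual blocks, and a block is decided once the net weighted margin crosses a threshold. The `Ballot` fields, the use of a net margin, and the 0.5 threshold relative to expected weight are all illustrative assumptions, not the actual Tortoise parameters.

```python
from dataclasses import dataclass, field


@dataclass
class Ballot:
    weight: int  # hypothetical: the smesher's eligibility weight
    support: set[str] = field(default_factory=set)  # block IDs voted "yes"
    against: set[str] = field(default_factory=set)  # block IDs voted "no"


def tally(block_id: str, ballots: list[Ballot], expected_weight: int,
          threshold: float = 0.5) -> str:
    """Return 'valid', 'invalid', or 'undecided' for a single block.

    Each ballot stands on its own: it adds its weight for or against the
    block, and ballots with no opinion on this block contribute nothing.
    """
    margin = 0
    for b in ballots:
        if block_id in b.support:
            margin += b.weight
        elif block_id in b.against:
            margin -= b.weight
    if margin > threshold * expected_weight:
        return "valid"
    if margin < -threshold * expected_weight:
        return "invalid"
    return "undecided"
```

Because the decision depends only on summed weights, a node that is missing a ballot or two still tends to reach the same verdict; a genuinely close split is what leaves a block `"undecided"`.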
Even in the very worst case, if the Tortoise vote-counting mechanism is attacked or if not enough votes are received to cross the threshold for or against a given block, self-healing kicks in as a fallback mechanism. In a nutshell, self-healing relies on a pseudorandom coin toss that all honest miners will agree on. If Tortoise doesn’t count enough votes to cross the threshold for or against a block and enough time has passed, all honest miners will instead vote according to this coin toss, allowing them to reestablish consensus.
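The fallback rule can be sketched as a small decision function: a miner follows the tally when it is decisive, waits while the block is still young, and votes with the shared coin once enough time has passed. The `healing_distance` parameter (how many layers to wait before falling back) is a hypothetical name for “enough time has passed”, not a documented Spacemesh setting.

```python
from typing import Optional


def final_vote(tally_result: str, layers_since_block: int, coin: bool,
               healing_distance: int) -> Optional[bool]:
    """Decide how an honest miner votes on one block.

    tally_result is 'valid', 'invalid', or 'undecided'; coin is the
    pseudorandom value that all honest miners agree on.
    """
    if tally_result == "valid":
        return True
    if tally_result == "invalid":
        return False
    if layers_since_block >= healing_distance:
        # Self-healing: every honest miner votes the same way, so the
        # next round of ballots pushes the margin past the threshold.
        return coin
    return None  # not decided yet and too early to fall back
```

Since all honest miners see the same coin, their fallback votes line up, which is what lets the ordinary vote count cross the threshold again.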
How is the coin toss implemented and what happens if it, too, is attacked? Here’s the secret: in order to prevent circular dependencies of the sort that plague proof of stake, it cannot depend upon Hare or Tortoise. It cannot depend upon previous consensus in any fashion. All miners may take part in a separate random beacon protocol that generates the coin toss; they establish eligibility to participate in a particular round via their identity (and not through proof of spacetime as in the rest of the protocol). A limited number of miners are eligible to participate in each round of the beacon, and a miner cannot control or change their eligibility without creating a new identity, which has a cost associated with it.
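As an illustration of identity-based eligibility, the sketch below runs a simple lottery: hash the miner’s identity together with the round number and admit the miner if the result falls in the lowest fraction of the hash space, so that about `expected_participants` miners qualify per round. This hash-based rule is a stand-in chosen for clarity; a production design would more likely use a VRF so others cannot precompute who is eligible, and it is not a description of Spacemesh’s actual beacon.

```python
import hashlib


def is_eligible(identity: bytes, round_no: int,
                expected_participants: int, total_miners: int) -> bool:
    """Hypothetical beacon eligibility check based only on identity and round.

    The result is fixed for a given identity, so a miner can only get a
    different outcome by creating (and paying for) a new identity.
    """
    digest = hashlib.sha256(identity + round_no.to_bytes(8, "big")).digest()
    value = int.from_bytes(digest, "big")
    # Admit the lowest expected/total fraction of the 256-bit hash space.
    threshold = (2**256) * expected_participants // total_miners
    return value < threshold
```

Out of 1,000 miners with `expected_participants=50`, roughly 50 qualify in any given round, and a different subset qualifies in the next.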
It is theoretically possible for an attacker to exert some degree of influence over this random beacon, but not indefinitely. Eventually, all of the honest miners will converge on a coin toss value and, once they do, transactions and blocks will once more become finalized and consensus will be reestablished.
The main price we pay for self-healing is added complexity, and a lack of strict consensus on the active set of miners in any given epoch (more on this in a moment), but it’s a price worth paying to strengthen consensus and improve the protocol’s trust model.
Thing #2: Blocks and Pointers
In every other blockchain that I’m aware of, starting with Bitcoin, each block contains a pointer to the previous block in the chain. This deceptively simple idea is one of the core enabling technologies behind blockchain! It’s the thing that makes a blockchain a chain, an append-only ledger (as opposed to a simpler type of database). It makes the chain immutable (you can only append onto the existing data structure, and can never overwrite what came before). It makes the chain secure (in order to change something in a past block you’d need to recreate that block and all the blocks that came after, including all of the accumulated proof of work, and then convince the network to accept them all).
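A toy hash-linked chain shows why the backward pointer makes history tamper-evident: each block stores the hash of its predecessor, so editing any earlier block breaks the link that the next block recorded. This omits proof of work, Merkle trees, and real serialization; it only demonstrates the pointer idea.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Block:
    prev_hash: str  # the backward pointer: hash of the previous block
    data: str

    def digest(self) -> str:
        return hashlib.sha256((self.prev_hash + self.data).encode()).hexdigest()


def build_chain(items: list[str]) -> list[Block]:
    chain: list[Block] = []
    prev = "0" * 64  # placeholder "genesis parent"
    for item in items:
        block = Block(prev, item)
        chain.append(block)
        prev = block.digest()
    return chain


def verify(chain: list[Block]) -> bool:
    """Check that every block points at the actual hash of its predecessor."""
    return all(block.prev_hash == parent.digest()
               for parent, block in zip(chain, chain[1:]))
```

Changing `chain[0].data` makes `verify` fail; to hide the edit you would have to rebuild every later block (and, in Bitcoin, redo all of their proof of work).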
There are even more subtle benefits to these pointers. When a block points to the previous canonical block in the chain, the node building the next block always knows the exact state at that point in time. Knowing the exact state means it knows the exact outcome of executing each transaction in the order it places them in the block. This is fantastically useful because it allows miners to exclude ineffective transactions (i.e., transactions that would fail when executed against that state), reducing spam and saving space.
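Here is a minimal sketch of what knowing the state buys a block builder, assuming a simple balance-transfer model: the builder applies candidate transactions one at a time in its chosen order and drops any that would fail against the running state.

```python
def select_transactions(balances: dict[str, int],
                        candidates: list[tuple[str, str, int]]) -> list[tuple[str, str, int]]:
    """Keep only transfers that succeed when executed in order against known state."""
    state = dict(balances)  # work on a copy of the parent block's state
    selected = []
    for sender, recipient, amount in candidates:
        if state.get(sender, 0) >= amount:
            state[sender] -= amount
            state[recipient] = state.get(recipient, 0) + amount
            selected.append((sender, recipient, amount))
        # Otherwise the transfer would fail here, so the builder leaves it out.
    return selected
```

If Alice holds 10 coins and submits two 7-coin transfers, only the first makes it into the block; the second never wastes block space.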
If you take away the pointer, you lose all of this. Spacemesh takes away the pointer.
The logic behind this decision is a bit convoluted, but I’ll try to explain. Spacemesh was originally designed not as a block chain but as a block mesh, i.e., a directed acyclic graph (DAG) with multiple blocks at each unit of time (we call them layers). In a mesh topology you can’t point to a single canonical block because there isn’t just one; there are many.
We later changed the design so that, barring an attack or severe network split, Spacemesh does produce a single, canonical block per layer. Spacemesh is technically still a DAG, but it’s one in which only a single, canonical block is elected each layer. The DAG topology is part of what enables self-healing as described above, so it’s useful when assumptions fail or the network is under attack.
Even so, the canonical block at each layer is produced using a process called cooperative mining, in which no single miner has exclusive control over the creation of a block or the ordering of the transactions within that block. So, even if the miners did know the state as of the previous block, it wouldn’t help much. They still wouldn’t know the outcome of each transaction they select because they don’t control the final ordering of those transactions in the block.
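The point about ordering can be shown directly: with the same starting state, the same two transactions produce different outcomes depending on which runs first. The `final_order` function below is a hypothetical stand-in for whatever rule fixes the order after miners contribute transactions (here, sorting by a hash seeded with a shared value); it is not Spacemesh’s actual ordering rule.

```python
import hashlib


def execute(balances: dict[str, int],
            txs: list[tuple[str, str, int]]) -> list[tuple[str, str, int]]:
    """Run transfers in order and return the ones that succeeded."""
    state = dict(balances)
    applied = []
    for sender, recipient, amount in txs:
        if state.get(sender, 0) >= amount:
            state[sender] -= amount
            state[recipient] = state.get(recipient, 0) + amount
            applied.append((sender, recipient, amount))
    return applied


def final_order(txs: list[tuple[str, str, int]], seed: bytes) -> list[tuple[str, str, int]]:
    """Hypothetical ordering rule that no single contributing miner controls."""
    return sorted(txs, key=lambda tx: hashlib.sha256(seed + repr(tx).encode()).digest())
```

If Alice holds 10 coins and sends 7 to Bob and 5 to Carol, only one transfer can succeed, and which one depends entirely on the final order. A miner proposing both cannot know in advance which one will turn out to be ineffective.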
The other reason we take away the pointer is to allow for reorgs. If the protocol changes its mind about the validity of a previous block produced several layers ago (because, e.g., new information has arrived), it can invalidate or swap that block without instantly invalidating all intervening blocks and transactions, as would happen if there were backward pointers. Bitcoin and Nakamoto Consensus handle this situation elegantly: only another chain of blocks can compete with the existing canonical chain, and if you swap one block in that chain, you’re swapping the entire chain up to the tip. Proof of stake protocols, by contrast, are much more fragile in this situation. This is why they emphasize fast finality: if they fail to finalize a few blocks and a reorg then occurs, they have no way to fill in the intervening blocks.
Spacemesh is more like Bitcoin and less like proof of stake protocols in terms of safety and liveness, as described above. We emphasize liveness—Spacemesh is always producing another block, every five minutes (the “layer time”), even if the previous one wasn’t finalized—but not at the cost of safety. Like Bitcoin, finality in Spacemesh is probabilistic. Reorgs can happen if our security assumptions are violated, and if this happens, then all the blocks and transactions that came after the reorg’d block are reinterpreted in light of the new information.
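Without backward pointers, a reorg can be modeled as swapping the block at one layer and replaying every layer in order. The sketch below assumes one canonical block per layer, represented as a list of simple transfers, and a genesis balance map; it illustrates the idea that later blocks are kept but reinterpreted, not how Spacemesh actually stores or applies state.

```python
def replay(genesis: dict[str, int], layers: list[list[tuple[str, str, int]]]) -> dict[str, int]:
    """Recompute state by applying each layer's canonical block in layer order."""
    state = dict(genesis)
    for block in layers:
        for sender, recipient, amount in block:
            # A transaction that no longer succeeds is simply ineffective;
            # the block that contains it stays in place.
            if state.get(sender, 0) >= amount:
                state[sender] -= amount
                state[recipient] = state.get(recipient, 0) + amount
    return state


genesis = {"alice": 10}
layers = [
    [("alice", "bob", 6)],   # layer 0
    [("bob", "carol", 5)],   # layer 1
    [("alice", "dave", 3)],  # layer 2
]
before = replay(genesis, layers)

# Reorg: the protocol decides layer 0's block was invalid and swaps in an empty one.
swapped = [[]] + layers[1:]
after = replay(genesis, swapped)
```

Layers 1 and 2 survive the swap. Bob’s transfer to Carol becomes ineffective because Bob never received the coins, while Alice’s transfer to Dave still goes through. In a hash-linked chain, replacing layer 0 would instead orphan both later blocks.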
Yes, this makes many things harder. Spacemesh achieves security and immutability in a slightly different way that does not depend on these pointers. As in Bitcoin, changing or attacking a historical block becomes harder the older the block becomes and the more layers are built on top of it. Unlike Bitcoin, however, this isn’t because other blocks are built “on top of” the block, so to speak. Rather, it’s because of the overwhelming weight of ballots explicitly voting for the block that accumulates over time.
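A back-of-the-envelope model of why older blocks are harder to overturn: assume each new layer contributes a fixed amount of ballot weight, honest smeshers vote for the block, and an attacker controlling a fraction of the weight votes against it. The net margin in the block’s favor then grows linearly with the number of layers since it was produced. Both the constant per-layer weight and the all-or-nothing voting are simplifying assumptions made for illustration.

```python
from fractions import Fraction


def honest_margin(layers_since: int, weight_per_layer: int,
                  attacker_fraction: Fraction) -> Fraction:
    """Net ballot weight for a block after `layers_since` layers of voting."""
    total = weight_per_layer * layers_since
    honest = (1 - attacker_fraction) * total
    adversarial = attacker_fraction * total
    return honest - adversarial  # equals (1 - 2f) * W * k
```

With a 25% attacker and 100 units of weight per layer, the margin is 500 after 10 layers and 5,000 after 100, so overturning an old block requires outweighing an ever-larger pile of explicit votes.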
This design makes picking transactions and building blocks a bit harder (since, as mentioned, miners don’t always have complete information about state when doing so). But it’s all part of the design of self-healing.
Thing #3: Permissionless Rotation of Miners
In order to facilitate the creation of new blocks, blockchain consensus mechanisms rely on a feature called leader election (the “leader” is the miner selected by the protocol, in one fashion or another, to produce the next canonical block). Without leader election, there’d be a free-for-all with many miners simultaneously creating many competing blocks, which would make coming to consensus on the next canonical block more or less impossible.
Proof of work and proof of stake handle leader election quite differently. In proof of work, a single leader that wins a single computational race is allowed to create and broadcast a single block to the network, one block at a time. By contrast, in proof of stake, there’s a set of active validators in each epoch that are eligible to produce blocks in certain slots, chosen deterministically. Neither solution is perfect: the former because it requires a huge amount of computational waste and the latter because it isn’t permissionless.
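The two leader-election styles can be contrasted in a few lines. In the proof of work sketch, whoever first finds a nonce whose hash falls below the target earns the right to publish a block. In the proof of stake sketch, the leader for each slot is computed deterministically from a shared epoch seed and the known validator list. Both are simplified: Bitcoin uses double SHA-256 over a full block header, and real proof of stake systems weight selection by stake.

```python
import hashlib
from typing import Optional


def pow_attempt(header: bytes, nonce: int, target: int) -> bool:
    """One try in the computational race: does this nonce win?"""
    h = int.from_bytes(hashlib.sha256(header + nonce.to_bytes(8, "big")).digest(), "big")
    return h < target


def mine(header: bytes, target: int, max_nonce: int = 1_000_000) -> Optional[int]:
    """Brute-force nonces; anyone with hardware can join this race."""
    for nonce in range(max_nonce):
        if pow_attempt(header, nonce, target):
            return nonce
    return None


def pos_slot_leader(validators: list[str], epoch_seed: bytes, slot: int) -> str:
    """Deterministic choice among an already-registered validator set."""
    h = hashlib.sha256(epoch_seed + slot.to_bytes(8, "big")).digest()
    return validators[int.from_bytes(h, "big") % len(validators)]
```

The asymmetry is visible in the signatures: `mine` needs only a header and hash power, while `pos_slot_leader` can only ever pick from the `validators` list that the existing set has already accepted.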
There’s another important difference between these approaches. A miner doesn’t need anyone’s permission to participate in the computational race in proof of work. All they need is access to mining hardware, electricity, and an internet connection. Proof of stake works quite differently, and as a result it’s not truly permissionless, for two reasons. The first is that you need coins in order to stake, and acquiring those coins requires permission (since you cannot mine them yourself). You need to find someone willing to sell you coins, which might require a benevolent friend, a credit card, access to an exchange, regulatory approval, etc. The second is a bit more nuanced, and I alluded to it above: the existing validator set needs to notarize incoming, new validators. Therefore, a cartel of colluding validators could censor new, incoming validators if they wished.
We designed Spacemesh to be truly permissionless, like Bitcoin: you don’t need anyone’s permission to create a Bitcoin account, mine coins, or send transactions. The same is true for Spacemesh. In fact, in a sense, it’s even more true for Spacemesh than it is for Bitcoin, since mining Bitcoin requires specialized hardware, whereas you can easily mine Spacemesh from home without special hardware, and will be able to do so for a very long time.
Miners in Spacemesh do not need any coins in order to start mining (as in proof of work). And acquiring coins to transact on the network is trivial because it’s so easy to run the Spacemesh software and mine from home.
With respect to how miners become eligible to join the active set and produce blocks, Spacemesh takes yet another approach. We remove the computational race of proof of work (and, thus, the energy consumption), but we allow truly permissionless rotation of the miners in each epoch. Existing miners don’t need to recognize or approve the addition of new, incoming miners. As a result, at any given moment, strictly speaking, the Spacemesh protocol doesn’t know exactly which miners are online and eligible to participate in block creation. How can such a protocol function? Let me explain.
The Spacemesh trust model for introducing new miners is very similar to Bitcoin’s trust model. Unlike proof of stake networks, Spacemesh does not require an ordinary transaction for a new miner to enter the set of active miners. Instead, new miners send a special “activation transaction” that doesn’t need to be recognized by a majority of miners and mined into a block. Simply connecting to the P2P gossip network (which is permissionless) and broadcasting the activation transaction (which establishes a miner’s eligibility to produce blocks in the following epoch) is sufficient to begin mining and earning rewards—just like a Bitcoin miner mining and broadcasting a chain tip with greater accumulated proof of work.
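The sketch below models the activation flow described above: a new miner gossips an activation message, and every node that receives it checks it locally and adds the miner to its own view of the next epoch’s active set. No block inclusion and no majority approval are involved. The `Activation` fields and the `proof_ok` flag (standing in for real proof verification) are hypothetical simplifications, not the actual activation transaction format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activation:
    miner_id: str
    target_epoch: int  # the epoch in which the miner becomes eligible
    proof_ok: bool     # hypothetical stand-in for verifying the miner's proof


class NodeView:
    """One node's local, independently built view of the active set."""

    def __init__(self) -> None:
        self.active: dict[int, set[str]] = {}

    def on_gossip(self, atx: Activation) -> None:
        # Validation is purely local: nobody else has to approve the miner.
        if atx.proof_ok:
            self.active.setdefault(atx.target_epoch, set()).add(atx.miner_id)

    def active_set(self, epoch: int) -> set[str]:
        return self.active.get(epoch, set())
```

A miner whose activation arrives at a node is in that node’s view immediately; a node that hasn’t heard the gossip yet simply doesn’t count that miner, which is the source of the slight disagreement discussed next.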
One consequence of this design, and in particular of the fact that the current set of miners does not notarize or approve the incoming set of miners, is that Spacemesh does not need, nor does it want, strict consensus on the set of active miners. Miners can disagree slightly about which other miners are online, eligible, and producing proposals and votes. As described above, the Tortoise consensus mechanism has no problem establishing eventual global consensus even without this strict agreement. In fact, if Spacemesh required strict global consensus on the set of active miners at any point in time, not only would mining no longer be strictly permissionless, but consensus in Spacemesh would also be as fragile as in proof of stake.
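To illustrate rough consensus, here is a toy decision rule: each node counts only ballots from miners in its own view of the active set, compares the weighted margin to a threshold scaled by the weight it expects, and returns a verdict. Two nodes whose views differ by one miner still reach the same verdict. The 0.5 threshold and the equal weights are illustrative assumptions, not Tortoise parameters.

```python
def decide(view: set[str], ballots: dict[str, bool], weights: dict[str, int],
           threshold: float = 0.5) -> str:
    """Verdict on one block from a single node's (possibly incomplete) view."""
    expected = sum(weights[m] for m in view)
    margin = sum(weights[m] if vote else -weights[m]
                 for m, vote in ballots.items() if m in view)  # ignore unknown miners
    if margin > threshold * expected:
        return "valid"
    if margin < -threshold * expected:
        return "invalid"
    return "undecided"


weights = {m: 10 for m in ["m1", "m2", "m3", "m4", "m5"]}
ballots = {"m1": True, "m2": True, "m3": True, "m4": True, "m5": False}
node_a = decide({"m1", "m2", "m3", "m4", "m5"}, ballots, weights)
node_b = decide({"m1", "m2", "m3", "m4"}, ballots, weights)  # never saw m5's activation
```

Node A computes a margin of 30 against a bar of 25, and node B computes 40 against a bar of 20; both conclude `"valid"` even though they disagree about who is in the active set.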
If, hypothetically, the protocol required strict consensus among all existing miners on the current and incoming set of miners, as in proof of stake, then a majority of existing miners would need to see and notarize the set. This is a catch-22, and it breaks permissionlessness. Thus, Spacemesh’s insistence upon having only “rough consensus” on the set of active miners is a huge advantage for two reasons: it means that new miners can join permissionlessly, and it enables self-healing as described above.
The downside to this tradeoff is that not having strict consensus on the set of miners makes other aspects of the protocol, such as the Hare consensus mechanism and issuing block rewards, harder. But I’ll leave the details for a later issue.