Three Things #49: December 25, 2022

Remaining Technical Challenges

Dec 25, 2022

Almost the last issue of the season! Here’s one more on Spacemesh tech to round out the lot.

As I’ve written previously, Spacemesh is a technically complex project that’s hard along several dimensions. After more than five years of development there’s a light at the end of the tunnel and we’re days away from finishing development and being ready for genesis. Of course, “ready” is not really a technical term. We’re very nearly code complete in the sense that all of the features we intended to implement for genesis have been implemented and basic tests have been written, but as we discover on a daily basis there’s still a lot more work to do.

The most obvious remaining work is testing: in addition to unit tests, we need to write more comprehensive black-box tests, we need to do adversarial testing such as red teaming and pen testing, and we also need to stress test the network. As we begin this testing work, we’ve already discovered a number of things that still need to be fixed or improved. This is exciting, because the point of testing is precisely to uncover these issues before giving someone the opportunity to exploit them in a live, production network, but it’s also a bit terrifying and exhausting. Here are three of the remaining technical challenges we’re currently focused on.

An inside view of the core R&D process in the Spacemesh lab, courtesy of Stable Diffusion. You don’t need a glowing purple headset to work on our protocol, but it helps.

Thing #1: Equivocation

I’ve written and spoken previously about how much I like proof of work and Nakamoto consensus because it’s simple, elegant, secure, and self-contained. In other words, it doesn’t have many moving parts and it achieves a number of properties that are desirable in a consensus mechanism, such as safety, liveness, and self-healing, all using the same basic primitives. In building proof of spacetime, we’re attempting to emulate proof of work by keeping the parts that we like best. Unfortunately, when you make even a slight change or tweak to an existing complex protocol, you break lots of other aspects of that protocol and end up bolting on lots of additional gadgets and features to fix the broken parts: e.g., Spacemesh has its own, relatively complex self-healing mechanism.

Equivocation means that an eligible protocol participant commits to many messages in a single slot. This can’t happen in proof of work, since each message (i.e., each block) has a large intrinsic cost to produce. In proof of stake and proof of spacetime, however, miners/validators have to pay an upfront price to obtain eligibility, but once that’s done, producing and distributing messages is basically costless. This core flaw is at the root of many of the issues of proof of stake like costless simulation, weak subjectivity, and the long-range attack. Equivocation is the bane of these BFT protocols; it’s their Achilles’ heel.

Proof of stake protocols tend to punish equivocation with slashing: if a validator is detected sending two or more conflicting messages, a portion of the validator’s stake is destroyed and it’s ejected from the validator set. Spacemesh handles equivocation in a similar fashion: the protocol deactivates the guilty miner’s identity, and the miner loses the cost they paid to initialize that identity plus the opportunity cost of all future rewards they would’ve earned.

But, as with so many things in blockchain, it’s not that simple. One reason is that a misbehaving miner can still cause a lot of damage even before their identity is canceled. In particular, they can flood the network with a more or less unlimited number of valid-looking messages that every other participant needs to receive, process, and possibly store. This is especially problematic if they have the option of crafting many unique, valid messages—since discarding identical messages isn’t too hard. And they can theoretically do this in a very clever way, such that most of the network doesn’t immediately discover their malfeasant behavior, until many such messages have already been processed, stored, and possibly even built on top of. It’s theoretically possible to use equivocation to carry out a particularly pernicious attack known as a balancing attack, where the opinion of honest nodes becomes split. In the extreme case, consensus across the network could fail as the result of such an attack.

The outlines of the solution are pretty straightforward and easy to intuit: detect the conflicting messages and immediately stop routing messages from the miscreant. But as always the devil’s in the details and it’s nuanced. For example, it’s not as simple as just blacklisting the malfeasant node. You need to also construct and broadcast a “malfeasance proof” so that others know to do the same; in other words, the network needs to achieve common knowledge on the known set of bad actors. Precisely how to construct, share, and synchronize this set is not straightforward. And you also need to add a delay before processing or building on top of any possibly malfeasant messages to allow such proofs to propagate across the network, which slows everything down. What’s more, handling an equivocating leader is much harder than handling equivocation in a leaderless protocol since the leader has the unique ability to single-handedly convince other nodes to vote the way it wants them to, at least temporarily.
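To make the "detect and stop routing" idea concrete, here is a minimal sketch in Python. It is not the Spacemesh implementation: the names Message, MalfeasanceProof, and EquivocationDetector are hypothetical, signature verification is left out entirely, and the proof is simply the pair of conflicting messages.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Message:
    miner_id: str   # hypothetical identity field
    slot: int       # the slot (e.g. a layer or round) the message is for
    payload: bytes

    def digest(self) -> str:
        return hashlib.sha256(self.payload).hexdigest()


@dataclass(frozen=True)
class MalfeasanceProof:
    # Two different messages from the same miner for the same slot.
    first: Message
    second: Message


class EquivocationDetector:
    def __init__(self) -> None:
        self._seen: Dict[Tuple[str, int], Message] = {}
        self.malfeasant: Set[str] = set()

    def observe(self, msg: Message) -> Optional[MalfeasanceProof]:
        """Return a proof the first time a miner is caught equivocating."""
        if msg.miner_id in self.malfeasant:
            return None  # already known to be bad: drop without processing
        key = (msg.miner_id, msg.slot)
        earlier = self._seen.get(key)
        if earlier is None:
            self._seen[key] = msg
            return None
        if earlier.digest() == msg.digest():
            return None  # exact duplicate, not equivocation
        self.malfeasant.add(msg.miner_id)
        return MalfeasanceProof(earlier, msg)
```

A node would broadcast the returned proof so that its peers can add the same identity to their own sets. The hard parts described above (synchronizing that set across the network, and delaying processing until proofs have had time to propagate) are deliberately not modeled here.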

Of course, there’s also a tricky balancing act involving economics: making sure the cost of equivocation, i.e., the cost of having one’s identity deactivated, is high enough while also making mining as accessible as possible to as many people as possible. Err too far in one direction and the network will be insecure and vulnerable to attack; err too far in the other direction and it’ll be too difficult to mine from home, so mining will be more centralized. (More on this below.)

For more: Equivocation is an issue for every (non-POW) blockchain protocol. Read about how it’s handled in Ethereum, Solana, and Tendermint. Read this Spacemesh research forum thread on the topic for more information on our approach.

Thing #2: Streamlining P2P

Good protocol designers write protocols that are complete, secure, and incentive compatible. Great protocol designers go a step further and make sure their protocol can actually be implemented and run with limited resources.

The Spacemesh protocol is complete, secure, and incentive compatible (provably so), and it’s been implemented, but one thing we didn’t test until recently is the requirements for running a node. Our research team is world class, but some things can only be tested “in the wild” (which is why it’s so important to do genesis sooner rather than later).

Bandwidth is one of those things. In the pristine lab conditions where protocols are designed it’s simply impossible to account for the unpredictable, variable network conditions that thousands of Spacemesh nodes around the world will experience. For one thing, users will be running other things on their computers, they’ll have a variety of different types of internet uplinks, they’ll experience internet and wifi hiccups, and their computers will reboot occasionally to install updates. None of this can be easily forecast, modeled, or accounted for. For another thing, the Spacemesh protocol is agnostic to the p2p layer, but the bandwidth requirements depend directly on the p2p layer, which we changed about a year ago when we switched to libp2p.

We discovered recently that, due to a naive implementation of the Hare protocol, the bandwidth required for running a full node is actually surprisingly high. The Hare can use 1 GB of data in a single round, and it runs hundreds of rounds a day, so the total usage can reach into terabytes per day. That’s not an issue if you have fiber broadband at home or if you’re running your node in a professional data center, but it’s a potential issue for millions of home miners. The Spacemesh protocol is designed to make mining from home as easy as possible, so we need to make sure that bandwidth is not a limiting factor.

The issue arises due to the way the Hare protocol gossips messages, proofs, and signatures. There are around 800 participants in each Hare round, and each participant needs to gossip its eligibility proof to all of the other participants. In the later rounds of Hare, in order for a participant to be able to commit to an output set, they need to include a “safe value proof” that justifies their decision to commit to the set: i.e., a certificate that contains commit messages from more than half of the other nodes. Each of those messages in turn contains the participant’s eligibility proof. And each participant needs to generate and gossip this safe value proof. Given the size of the eligibility proofs and signatures, after multiplying by 800 a few times, the amount of data being passed around gets quite large.
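A rough back-of-the-envelope sketch shows where the "multiplying by 800 a few times" comes from. The committee size of 800 is from the text above; the byte sizes (an 80-byte eligibility proof and a 64-byte signature) are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
def safe_value_proof_bytes(participants: int, proof_bytes: int, sig_bytes: int) -> int:
    """Bytes in one certificate: commit messages from just over half the committee."""
    commits_needed = participants // 2 + 1
    return commits_needed * (proof_bytes + sig_bytes)


def round_traffic_bytes(participants: int, proof_bytes: int, sig_bytes: int) -> int:
    """Bytes a node receives if every participant gossips its own certificate once."""
    return participants * safe_value_proof_bytes(participants, proof_bytes, sig_bytes)


# Assumed sizes, for illustration only.
per_certificate = safe_value_proof_bytes(800, 80, 64)
per_round = round_traffic_bytes(800, 80, 64)
print(f"one certificate: {per_certificate / 1e3:.1f} kB")
print(f"one round, before gossip redundancy: {per_round / 1e6:.1f} MB")
```

The point is the shape of the growth rather than the exact figures: traffic scales with the square of the committee size, so doubling the committee roughly quadruples it. Real messages carry more than a proof and a signature, and gossip delivers each message to a node more than once, so this sketch is not meant to reproduce the measured usage.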

This issue is much less troubling than the equivocation issue described above because it’s more straightforward to fix and doesn’t require deep surgery to the protocol itself. For one thing, we can do some quick, superficial tweaks to make things easier. We can reduce the bandwidth requirements in libp2p by reducing the number of times messages are propagated (at the cost of some redundancy) or using compression. We can also temporarily reduce the Hare committee size (at the cost of some security). A longer-term fix will involve redesigning the structure of the Hare messages so that redundant data aren’t being gossiped again and again. That may not happen in time for genesis, but if it doesn’t, it’ll happen shortly after. We’ll give miners a heads up ahead of time how much bandwidth they should expect Spacemesh to consume, and they can test it out themselves on the testnet.
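One way to picture "not gossiping redundant data again and again" is to send an eligibility proof in full only once and refer to it by a short hash afterwards. The sketch below is an illustration of that general idea only, not the planned Hare message format; the function names, the one-byte tags, and the byte sizes are all assumptions.

```python
import hashlib
from typing import Set

FULL, REF = b"\x00", b"\x01"  # hypothetical one-byte tags so a receiver can tell the forms apart


def proof_ref(proof: bytes) -> bytes:
    """Fixed-size (32-byte) reference to an eligibility proof."""
    return hashlib.sha256(proof).digest()


def encode_commit(proof: bytes, sig: bytes, already_gossiped: Set[bytes]) -> bytes:
    """Send the full proof the first time; afterwards send only its reference."""
    ref = proof_ref(proof)
    if ref in already_gossiped:
        return REF + ref + sig
    already_gossiped.add(ref)
    return FULL + proof + sig


# Assumed sizes: an 80-byte eligibility proof and a 64-byte signature.
proof, sig = bytes(80), bytes(64)
gossiped: Set[bytes] = set()
first = encode_commit(proof, sig, gossiped)   # tag + full proof + signature
repeat = encode_commit(proof, sig, gossiped)  # tag + 32-byte reference + signature
print(len(first), len(repeat))
```

The saving per message grows with the size of the proof being replaced, and it only works if receivers can be relied on to already hold the referenced proof, which is exactly the kind of detail a real redesign has to settle.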

For more: Run a Spacemesh testnet node and check your bandwidth usage. Read this research forum thread. If you plan to run a Spacemesh node at genesis, make sure you have plenty of bandwidth (for now!).

Thing #3: Parameterization

We’ve been running a testnet for over a year. Several of our testnets, including the most recent, have been surprisingly stable: the current network has been running for a month with no hiccups. However, other than the fact that testnet coins have no value, one important thing differentiates testnets from mainnet: parameters.

Our testnet runs with parameters that allow the network to boot up quickly and that make it as easy as possible to join the network and start mining. This means reducing the layer time from the target five minutes to two minutes, reducing the length of an epoch from two weeks to two hours, and only requiring miners to initialize a tiny amount of storage space to become eligible to join. In total, there are several dozen parameters that we need to set for the network. Each of the subprotocols, such as Tortoise, Hare, the Beacon, etc., has its own set. Some of these parameters can be set by the user on a per-node basis, but for most of the values, all nodes must agree on the value (and changing one would result in a network fork).
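To see how different the two parameter sets are, here is a small sketch that derives the number of layers per epoch from the layer and epoch durations quoted above. The NetworkParams class and its field names are hypothetical and are not the go-spacemesh config keys.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class NetworkParams:
    layer_duration: timedelta
    epoch_duration: timedelta

    @property
    def layers_per_epoch(self) -> int:
        # Dividing one timedelta by another with // yields a whole number.
        return self.epoch_duration // self.layer_duration


# Values taken from the text: mainnet targets vs. the current testnet.
MAINNET_TARGET = NetworkParams(timedelta(minutes=5), timedelta(weeks=2))
TESTNET = NetworkParams(timedelta(minutes=2), timedelta(hours=2))

print(MAINNET_TARGET.layers_per_epoch)  # 4032
print(TESTNET.layers_per_epoch)         # 60
```

An epoch on the testnet is therefore dozens of layers long, while an epoch at the mainnet targets is thousands of layers long, which is one reason testnet stability alone doesn't tell us everything about mainnet behavior.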

We’re in the final stages of setting these parameters now, one of the final steps before genesis. Some of the parameters, especially for Tortoise, Hare, and P2P, will be set on the basis of testing and simulation in order to maximize performance and security, and many of these can be tweaked in the future.

The hardest parameters to set are the ones that don’t directly impact performance or security, but rather impact “fairness.” We have a difficult balancing act here. On the one hand, as I’ve said many times before, the raison d'être of Spacemesh is to make profitable home mining accessible to millions of people. So we have to keep the resources and the needs of the average home miner in mind. This suggests parameters (e.g., low minimum storage space and infrequent, small proof generation) that make the network accessible to as many miners as possible. On the other hand, network security is non-negotiable! This suggests parameters (e.g., high minimum storage space and frequent, large proofs) that raise the bar for certain kinds of attacks. Other things being equal, the more skin in the game each miner has in terms of its committed storage space, the more secure the network.

Our goal is that it should take a home miner using commodity hardware, such as a $200-300 second- or third-generation GPU, no more than a few days and a few dollars worth of electricity to complete the initialization process. We don’t make any guarantees for exactly how long it will take—obviously, relatively faster GPUs will perform relatively better—only that it shouldn’t take too long, and it should be accessible to most home miners. We’re running benchmarks across a wide range of hardware and software now to pick the final parameters to achieve this.
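As a sanity check on "a few dollars worth of electricity," the arithmetic is simple enough to sketch. Every number below (a 150 W GPU, three days of initialization, $0.15 per kWh) is an assumption for illustration, not a benchmark result, and the function name is hypothetical.

```python
def init_electricity_cost(gpu_watts: float, hours: float, usd_per_kwh: float) -> float:
    """Electricity cost in USD of running a GPU at a steady power draw."""
    kwh = gpu_watts / 1000 * hours
    return kwh * usd_per_kwh


# Assumed inputs: 150 W draw, 72 hours of initialization, $0.15 per kWh.
cost = init_electricity_cost(gpu_watts=150, hours=72, usd_per_kwh=0.15)
print(f"${cost:.2f}")
```

Under these assumptions the run uses 10.8 kWh and costs about $1.62. A faster card or cheaper power lowers the figure, while a longer initialization raises it proportionally.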

(Miners without access to a GPU will have other options that we’ll announce later. The init process can be run on a CPU but it will take much longer.)

Even if we get these parameters perfect today, we have to consider the future as well! Over time, storage will continue to get cheaper and processors will continue to get faster, so the overall spacetime on the network will increase. We need to ensure that we can throttle the number of new miners coming online if necessary by making it harder to join, e.g., by increasing the minimum amount of storage space or increasing the amount of work required to initialize each unit of space, ideally without creating a burden for existing miners. We also need to ensure that users can regenerate all or part of their storage in a reasonable amount of time, can easily scale up and down their total storage, etc. All of these factors have to be taken into consideration as we set the mainnet parameters.

Finally, we also need to make sure that we’re running the fastest possible PoET servers for the community. If someone else develops a faster PoET, then they (and anyone using that PoET server) would have a disproportionate advantage over the rest of the network in terms of voting weight and rewards. Developing such a system is surprisingly difficult because the PoET workload by design cannot be parallelized. Compute-intensive workloads long ago migrated to massive parallelization across many CPU cores. If you have a parallelizable workload it’s quite straightforward to find suitable hardware, both for purchase and for rent. By contrast, since PoET cannot be parallelized, it requires a custom build with a single, very fast core, an architecture that has very few applications today.
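The reason extra cores don't help can be illustrated with a plain hash chain. This is a simplification and not the actual PoET construction, which also has to produce a proof that others can verify cheaply; the function name is hypothetical.

```python
import hashlib


def sequential_work(seed: bytes, steps: int) -> bytes:
    """Hash `steps` times in a chain; step i cannot start until step i-1 is done."""
    state = seed
    for _ in range(steps):
        # Each input is the previous output, so the loop cannot be split across cores.
        state = hashlib.sha256(state).digest()
    return state


result = sequential_work(b"example seed", 100_000)
print(result.hex())
```

Because every iteration depends on the one before it, the only way to complete more steps in a fixed amount of time is to make a single core hash faster, which is why single-core speed matters so much here.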

For more: To get an idea of the number and types of parameters that need to be set, take a look at the go-spacemesh sample config file. Keep in mind that some parameters have default values and aren’t included in the config file, but can be overridden.

Venus Black

Dec 26, 2022

I really loved this one. It seems like it’s more geared towards cyber security though. I was hoping for some personal computing security options. Often we find users just don’t know how to protect their data or even see a reason for their data to be protected.
