Recap of Dry Run 5
Dry Run 5 ran on the 30th of October with 34 external validators participating. Our goal for this dry run was to establish a stable testnet and then ramp up transactions into the cluster, testing its transaction capacity for the first time. Unfortunately, a loss of consensus and network stability prevented us from running these throughput tests. The DR lasted roughly 2.5 hours before losing consensus. The high-level sequence of events during the DR is provided below:
The cluster booted successfully at 8:00 AM and we soon had 38 validators online.
The RPC port on tds.solana.com died around 9:30 AM PST. Most nodes on the cluster seemed to experience a similar issue around the same time. The cluster continued to function fine (according to the metrics server). Restarting tds.solana.com manually restored the RPC port.
Consensus was lost at 10:28:07 AM PST.
The final stake distributions on the two nodes we investigated were identical:
- slot 34617: votes: 22, stake: 22.5 SOL (63.2%)
- slot 34611: votes: 7, stake: 7.1 SOL (19.9%)
- slot 34614: votes: 1, stake: 1.0 SOL (2.8%)
- Absent: votes: 5, stake: 5.0 SOL (14.0%)
Slot 34617 failed to reach 66.7% of the stake and thus the cluster stalled.
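The percentages above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming the 66.7% figure corresponds to a strict two-thirds supermajority of total stake; the function names (`stake_share`, `has_supermajority`, `shortfall`) are illustrative and not part of the Solana codebase.

```python
# Stake (in SOL) behind each slot, copied from the list above.
stake_by_slot = {
    "34617": 22.5,
    "34611": 7.1,
    "34614": 1.0,
    "absent": 5.0,
}

# Assumption: "66.7%" means strictly more than two thirds of total stake.
SUPERMAJORITY = 2 / 3


def stake_share(stakes):
    """Return each entry's fraction of the total stake."""
    total = sum(stakes.values())
    return {slot: stake / total for slot, stake in stakes.items()}


def has_supermajority(stakes, slot, threshold=SUPERMAJORITY):
    """True if `slot` holds more than `threshold` of the total stake."""
    return stake_share(stakes)[slot] > threshold


def shortfall(stakes, slot, threshold=SUPERMAJORITY):
    """Additional stake (in SOL) `slot` would need to reach the threshold."""
    total = sum(stakes.values())
    return max(0.0, threshold * total - stakes[slot])


shares = stake_share(stake_by_slot)
for slot, share in shares.items():
    print(f"{slot}: {share:.1%}")

print("supermajority:", has_supermajority(stake_by_slot, "34617"))
print(f"shortfall: {shortfall(stake_by_slot, '34617'):.2f} SOL")
```

Under these assumptions slot 34617 was about 1.23 SOL short of the threshold, so even the single 1.0 SOL vote on slot 34614 switching over would not have been enough to unblock the cluster.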
We had 34 external validators participating in total.
Thanks to the community, we managed to identify some previously unknown issues, which is ultimately the purpose of these Dry Runs. By reviewing the logs from the attempt, we identified two state reconciliation bugs that led to the crash. Details of the two bugs are provided below:
Before we dive into the details of the issue, we first need to explain Solana’s network architecture. Solana has a gossip network that generally operates like any traditional gossip network within blockchain protocols (everybody talks to each other and makes sure everyone else receives those messages). Its goal is to ensure that all votes are propagated to the rest of the Validators in the network.
Let’s imagine a scenario where there are 3 nodes within this gossip network: Alice, Bob and Charlie. Each of these Validators always stores only the latest vote, so each Validator’s pubkey points to its most recent vote, with new votes being produced by the leader every 400ms.
With the above in mind, let’s say Alice is the block producer, and she has just voted (block #11), referencing the latest block (block #10), and her vote is now being propagated to the rest of the network. The problem occurs when there is a fork and the next block producer in the network, Bob, doesn’t include Alice’s vote in the next block (block #12) because Alice’s message hasn’t propagated to him yet. Alice’s vote is effectively dropped from the network as a result.
This can happen because Validators only store the latest vote, so each pubkey only ever references its most recent vote. Because the new block produced by Bob (block #12) is on a separate fork, the vote can’t be included on a child of block #10 that doesn’t include block #12 and skips to #13. Effectively we lose this vote and are unable to agree on a common parent, and that’s where the partition occurs.
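To make the lost-vote problem concrete, here is a deliberately simplified sketch. It assumes a gossip table keyed by pubkey and compares keeping only the latest vote with keeping every published vote. The class and function names (`LatestVoteTable`, `VoteStackTable`, `includable`) and the slot numbers are hypothetical illustrations, not Solana’s actual data structures.

```python
class LatestVoteTable:
    """Gossip table that keeps only the most recent vote per pubkey."""

    def __init__(self):
        self.votes = {}  # pubkey -> slot of the latest vote

    def push(self, pubkey, slot):
        # A newer vote silently overwrites the previous one.
        self.votes[pubkey] = slot

    def votes_for(self, pubkey):
        return [self.votes[pubkey]] if pubkey in self.votes else []


class VoteStackTable:
    """Gossip table that keeps every vote a pubkey has published."""

    def __init__(self):
        self.votes = {}  # pubkey -> list of voted slots, oldest first

    def push(self, pubkey, slot):
        self.votes.setdefault(pubkey, []).append(slot)

    def votes_for(self, pubkey):
        return list(self.votes.get(pubkey, []))


def includable(table, pubkey, fork_slots):
    """Votes from `pubkey` that a leader building on `fork_slots` could include."""
    return [slot for slot in table.votes_for(pubkey) if slot in fork_slots]


# Hypothetical scenario: Alice votes on slot 10, then on slot 11.
# Bob builds slot 12 on a fork that contains slot 10 but not slot 11.
bob_fork = {10, 12}

latest = LatestVoteTable()
stack = VoteStackTable()
for table in (latest, stack):
    table.push("alice", 10)
    table.push("alice", 11)

print(includable(latest, "alice", bob_fork))  # nothing usable on Bob's fork
print(includable(stack, "alice", bob_fork))   # the slot 10 vote survives
```

In this toy model the latest-only table leaves Bob with no vote from Alice that applies to his fork, while the full history still exposes her vote on slot 10.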
When validators look at the pool of votes from the entire network, they keep a timestamp of the last time they looked at this dataset (e.g. 100ms ago). Using this timestamp as a reference the next time they review the dataset, they extract all the votes they haven’t seen since that point, pull them into a proposed fork and try again. The issue arises when a second fork appears and reads the set of votes using a timestamp that was already advanced by the first fork, even though the two forks are separate and don’t include each other’s votes.
In other words, the failure occurs when the network tries to reconcile all these votes from several forks with divergent states using a global timestamp.
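The effect of a shared timestamp can be shown with a small model. This is a sketch under stated assumptions: the vote pool is a list of timestamped entries, and a reader asks for everything newer than a cursor. `VotePool`, `GlobalCursorReader` and `PerForkCursorReader` are hypothetical names, and the per-fork cursor is shown only as one possible remedy, not necessarily the fix that was shipped.

```python
class VotePool:
    """All votes seen via gossip, tagged with their arrival time in ms."""

    def __init__(self):
        self.entries = []  # list of (timestamp_ms, vote)

    def add(self, timestamp_ms, vote):
        self.entries.append((timestamp_ms, vote))

    def since(self, timestamp_ms):
        return [vote for t, vote in self.entries if t > timestamp_ms]


class GlobalCursorReader:
    """One 'last looked' timestamp shared by every fork (the buggy shape)."""

    def __init__(self, pool):
        self.pool = pool
        self.last_seen = 0

    def new_votes(self, fork, now_ms):
        votes = self.pool.since(self.last_seen)
        self.last_seen = now_ms  # advanced for all forks at once
        return votes


class PerForkCursorReader:
    """A separate 'last looked' timestamp for each fork."""

    def __init__(self, pool):
        self.pool = pool
        self.last_seen = {}  # fork id -> timestamp_ms

    def new_votes(self, fork, now_ms):
        votes = self.pool.since(self.last_seen.get(fork, 0))
        self.last_seen[fork] = now_ms
        return votes


pool = VotePool()
pool.add(50, "alice votes on 10")
pool.add(90, "charlie votes on 10")

shared = GlobalCursorReader(pool)
fork_a_shared = shared.new_votes("A", now_ms=100)
fork_b_shared = shared.new_votes("B", now_ms=110)  # cursor already moved by A

per_fork = PerForkCursorReader(pool)
fork_a_own = per_fork.new_votes("A", now_ms=100)
fork_b_own = per_fork.new_votes("B", now_ms=110)

print(len(fork_a_shared), len(fork_b_shared))
print(len(fork_a_own), len(fork_b_own))
```

With the shared cursor, fork B receives no votes because fork A already advanced the timestamp; with one cursor per fork, both forks see both votes.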
Our internal testing uncovered another consensus bug. Our consensus implementation was using a greedy algorithm that always voted for the heaviest available block. While nodes would never fully commit to a minority partition, this rule created a situation where nodes on minority forks were much more likely to continue voting for their own minority fork, because at the time of voting those nodes were locked out of the majority fork.
This required multiple fixes:
- Compute a fork weight, which is the accumulated weight of the block and all of its ancestors.
- Wait for the heaviest fork, instead of continuing building on the minority fork.
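The difference between "heaviest block" and "heaviest fork" is easiest to see on a tiny block tree. The sketch below assumes each block records its parent and the stake that voted on it directly; the tree, the weights and the function names (`fork_weight`, `heaviest_block`, `heaviest_fork`) are hypothetical and only illustrate the accumulated-weight idea described above.

```python
# Hypothetical block tree: slot -> (parent slot, stake voting directly on it).
blocks = {
    10: (None, 30.0),
    11: (10, 4.0),   # fork A: 10 -> 11
    12: (10, 3.0),   # fork B: 10 -> 12 -> 13
    13: (12, 3.0),
}


def tips(tree):
    """Blocks that have no children, i.e. the current end of each fork."""
    parents = {parent for parent, _ in tree.values() if parent is not None}
    return [slot for slot in tree if slot not in parents]


def fork_weight(tree, slot):
    """Accumulated weight of a block and all of its ancestors."""
    weight = 0.0
    while slot is not None:
        parent, own_weight = tree[slot]
        weight += own_weight
        slot = parent
    return weight


def heaviest_block(tree):
    """Greedy rule: pick the tip with the most stake on that block alone."""
    return max(tips(tree), key=lambda slot: tree[slot][1])


def heaviest_fork(tree):
    """Fork-weight rule: pick the tip with the largest accumulated weight."""
    return max(tips(tree), key=lambda slot: fork_weight(tree, slot))


print("greedy choice:", heaviest_block(blocks))
print("fork-weight choice:", heaviest_fork(blocks))
print("fork weights:", {slot: fork_weight(blocks, slot) for slot in tips(blocks)})
```

In this example the greedy rule picks slot 11 because it has the most stake on a single block, while the accumulated weight favors the fork ending at slot 13. The second fix, waiting for the heaviest fork instead of building on a minority one, is a scheduling decision and is not modeled here.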
During DR5 and while inspecting the logs afterwards, we noticed unexpected bursts of high packet loss in the network, so we’re internally testing with netem to simulate adverse network conditions (e.g. 15% packet loss) to ensure that our network can survive even in the harshest of conditions. That should address some of the reasons why forks are occurring to begin with.
To stress test consensus and network behavior under load, we added new tools that can induce arbitrary network partitions and packet loss:
- ./net/net.sh netem --config-file topology.txt
Continuous partition testing uncovered a couple more bugs, and after a bit more hard work we are able to run the network under high TPS load while inducing a partition every 10 minutes without stalls or memory leaks!
Relevant GitHub issues and pull requests for reference:
- https://github.com/solana-labs/solana/pull/7079 - Fork weight changes
- https://github.com/solana-labs/solana/pull/6624 - Bench TPS client improvement
- https://github.com/solana-labs/solana/pull/6622 - Upgrade JSONRPC HTTP server
- https://github.com/solana-labs/solana/issues/6627 - Add RPC metrics
- https://github.com/solana-labs/solana/issues/6628 - Verify a validator’s vote account actually exists at startup
- https://github.com/solana-labs/solana/issues/6630 - Increase the validator boot period
- https://github.com/solana-labs/solana/issues/6656 - Add fork visualizer
- https://github.com/solana-labs/solana/issues/6676 - Ledger/log upload tool
- https://github.com/solana-labs/solana/issues/6675 - Add vote retry logic
- https://github.com/solana-labs/solana/issues/6704 - Cleanly exit when AVX is not supported
- https://github.com/solana-labs/solana/pull/6760 - Correctly serialize interrupted leader slots in the ledger
- https://github.com/solana-labs/solana/pull/6695 - Store and persist the full stack of tower votes in gossip
- https://github.com/solana-labs/solana/pull/6696 - Correctly sign gossip messages
- https://github.com/solana-labs/solana/pull/6719 - Allow voting on empty banks
Legal Documentation for Tour de SOL
We got some feedback while on the road about the thorough registration process requested for participants in Tour de SOL, particularly around KYC/AML, W8-BEN, W9 and Participation Agreement documents.
We know being presented with 12 pages of legal documentation can be pretty daunting, so we wanted to provide some context as to why it has been set up this way.
For Solana to incentivize this testnet, we are essentially issuing tokens to service providers. The service providers in this case are the Validators who are helping us stress test the network. More specifically, we are using a 701 exemption, which is somewhat similar to how company options are issued. The documentation is there to clearly define that relationship. We’re happy to go into detail on the specifics of the terms for anyone interested.
While this might cause a little more friction, given that we’re a US company our preference is to take the conservative path and remain compliant while still allowing participants to be appropriately compensated in SOL tokens.
We’re proponents of having 100% slashing because we believe that 100% slashing could potentially be a forcing function for decentralization and security. For example, if you’re an exchange that has staking enabled and there is 100% slashing, you wouldn’t want to be the sole Validator responsible for all the exchange users’ tokens in case something goes wrong, right? You would want to diversify your risk.
Likewise, if you’re an investor looking to delegate your tokens, you would want to act in a similar fashion and ensure your stake is evenly distributed across a series of Validators rather than just one. With regard to security, the security of the network depends on the amount of capital at risk, in this case the percentage of the staked assets that can be slashed. Therefore, if slashing is increased, the amount of capital at risk is higher, thereby providing stronger security guarantees.
Note that we also appreciate that the risk of being slashed is not only dependent on the Validators running the nodes; it can also be due to bugs in the software on which the protocol is built. Given that, we intend to propose increased slashing only over time, as the network matures and relative to the security and robustness of the code.
We’d love more thoughts/comments/discussion from the community. This is obviously a long-term goal, so community alignment and shared understanding are key!