On the 28th of August we ran our 4th Dry Run (DR4). We made some really strong progress following on from our previous dry run, which I’ll break down below. Slowly but surely, we’re getting closer to the much-anticipated Stage 1 of Tour de SOL.
The goal for DR4 was to further iterate on the testnet’s stability at nominal transaction load. Our previous Dry Run, DR3, lasted around 5 hours before voting started to break down. If you’re interested in learning more, and want to take a deep dive into our previous Dry Run debrief, you can find it here.
The most interesting thing about the dry run this time was that the software performed exactly as intended, and the network only died because participants turned off their validators. In other words, “it works” (as long as humans don’t kill it)!
Recap on changes between DR3 and DR4:
- Snapshots were enabled, meaning it would take Validators minutes, not hours, to catch up to the cluster
- MTU Issue - to date we’ve been making incremental fixes to this issue, which has consistently affected throughput throughout Stage 0. Further improvements were made for DR4, which we expected would provide even greater stability and higher transaction throughput
- Various Usability Fixes - better errors, better Validator logs for analytics
- Other - solana-wallet is now called solana
Parameters for DR4
Parameters remained similar to those of the previous attempt. We expect them to remain relatively consistent until the end of Stage 0.
- 3 Solana nodes were used to boot the cluster, with 39 external validators expected to join shortly thereafter
- All nodes received 1000 SOL to start with; the recommendation was to stake 0.5 SOL (matching the Solana nodes)
- GPUs were not required, as the goal was to sustain an idling cluster. Having said that, the intention was to push tps in bursts during this run to around 1000-3000 tps
- All nodes used the Mavericks v0.18.0 release
Play by Play
As per usual, we booted up the network at 8am PST, with Validators both new and old joining us. Quick shoutout to my Oceanic folks who stayed up late to get in on the action! Great to get some representation out that way in this run.
Almost immediately after the network kicked off, Validators were already seeing the benefits of having snapshots enabled.
But we soon found out that it wasn’t completely perfect, with some of our validator nodes starting to panic. This was due to a bug in snapshots not properly restoring the validator state.
Overall, though, the network was progressing well, and its stability and performance in a permissionless environment closely matched what we had seen in our own internal tests between Dry Runs.
I imagine it was due to the stark improvement in DR4, but before long we had Validators eagerly asking when we’d hit the red button and pump up the transaction throughput. However, we let the network idle a little longer to confirm it was stable first, which it was. By around 1:00pm PST we had officially survived longer than ~5 hours, exceeding DR3! At that point we were comfortable that the network was idling stably, so we finally hit the red button and inserted a few bursts of transactions to see how the network would react. If you happen to review the full log for the Dry Run, you’ll see some irregular peaks of tps in the graph. While it’s just a start, we successfully managed to push just under 4,000 tps.
As the network was generally stable during idle conditions (unlike the previous Dry Runs, where we were mostly working through Validator on-boarding issues), this time we had the luxury of talking a little more leisurely with Validators to:
- Upskill Validators and help with educating them on features, commands and functionality specific to the Solana network
- Discuss quality of life improvements, such as adding the ability for Validators to easily identify how much of their total stake is warmed-up/still warming-up at any given moment
At 14 hours into DR4, we hit two new milestones:
- Our Validators finally had to sleep!
- We had survived long enough to reach a normal hour in the APAC region, so Validators from there finally had a chance to participate in the testnet!
Our intention was to leave the testnet up until we hit the 24-hour mark before shutting it down to analyse the data and start working through the issues identified. However, as Validators started going to sleep, issues started to arise.
This was because nodes started to go down while they still had active stake. We woke up to find the network had died around 3am, putting the total duration at a whopping 19 hours!
So thanks again to all 39 Validators who participated for making the success of this network possible. Also just wanted to share one last screenshot with everyone, shoutout to the community for always bringing the humour.
Major Bugs Identified
- A significant number of validators struggled to join the cluster due to https://github.com/solana-labs/solana/issues/5568. This issue was observed prior to DR4 but was not believed to be significant.
- Minor metrics reporting issue: https://github.com/solana-labs/solana/issues/5718
- Improve bench-tps client filtering: https://github.com/solana-labs/solana/issues/5719
- Various minor Network Explorer issues:
- CUDA workflow issues: https://github.com/solana-labs/solana/issues/5722
- Metrics server needs more CPU: https://github.com/solana-labs/solana/issues/5724
- CUDA memory pinning appears to be causing issues: https://github.com/solana-labs/solana/issues/5710
- Misconfigured NATs continue to be an issue: https://github.com/solana-labs/solana/issues/3915
- Unexpected RPC service failure at node boot: https://github.com/solana-labs/solana/issues/5725
- Network Explorer uptime calculation improvements: https://github.com/solana-labs/blockexplorer/issues/321
- Add absoluteSlot to getEpochInfo for easier validator log monitoring: https://github.com/solana-labs/solana/issues/5726
- Improve error message on a genesis ledger mismatch: https://github.com/solana-labs/solana/issues/5727
- Consider adding a validator uptime command to solana-cli: https://github.com/solana-labs/solana/issues/5728
- solana-cli should show account balances in SOL by default: https://github.com/solana-labs/solana/issues/5729
- Merge solana-validator-info into solana-cli: https://github.com/solana-labs/solana/issues/5730
- Improve solana-cli default config messages
- Document procedure for recovering a cluster when consensus is lost: https://github.com/solana-labs/solana/issues/5735
- Improve solana-cli --help output: https://github.com/solana-labs/solana/issues/5736
- Document procedure for deactivating stake: https://github.com/solana-labs/solana/issues/5737
- Display more info about the status of stake warm up/cool down: https://github.com/solana-labs/solana/issues/5738
- Improve visibility of activated stake vs. total stake for a validator: https://github.com/solana-labs/solana/issues/5739
Next Step and Final Comments
Good News - With DR4, we have significantly de-risked the soft launch of MVP mainnet in October. The cluster idles fine, and as long as validators don’t walk away it should continue to do so. That is effectively what we intend an MVP mainnet to look like, before we gradually start to roll out more features.
Bad news - Internally we initially aimed to start Stage 1 of Tour de SOL in early September, but with the issues identified within DR4, it was clear that this was no longer possible. There are a few factors that influenced this, but the high level takeaways are:
- Stability of the network at a high TPS is still not quite where we want it to be, and the timeline to fix it is not yet clear. The root cause of this stability regression was the networking rewrite that was necessary to enable a successful DR4. Two steps forward, one step back!
- Snapshots need some more work to fix some bugs that many people ran into during DR4
Before we feel comfortable initiating Stage 1 of Tour de SOL we want to have at least 1 Dry Run with max tps + GPUs. We haven’t achieved that yet, so we intend on hosting another Dry Run during Stage 0. The date for this is proposed to be the 23rd of September.
Assuming this next Dry Run goes well, we’re optimistically looking at initiating Stage 1 of Tour de SOL at the very end of September or early October. This timeline is still looking pretty aggressive for us, but we want to make sure we take the appropriate time to work through the issues. We understand that everyone is sacrificing their personal time to participate, and we believe it is only fair that we rectify the issues identified in the previous Dry Run before starting the next attempt. Not only that, internally we take pride in what we do as well, and despite Dry Runs being a test of a WIP product, we still hope to put our best foot forward, rather than throwing garbage over the fence.
Resources and Links
- Here’s why the network died. Only 52% of the stake was online so the network couldn’t progress (66.6% of stake is required to reach consensus):
- Log files from some of the Solana-run validators: https://drive.google.com/drive/folders/1rmyFVtsNsdlAnhGUmTgUe8PL3m1XmT28?usp=sharing
- Grafana metrics snapshot of the entire run: https://metrics.solana.com:3000/dashboard/snapshot/rX63DPq7z1GWiSgpsOzfXG2Akps9SCpE
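To make the stake arithmetic in the first resource above concrete, here is a minimal Python sketch of the check that explains why the network stopped. It is not Solana's actual implementation: the function names are hypothetical, and it assumes the threshold is "at least two thirds of the stake online", which is how the 66.6% figure above is read here. The equal-stake helper further assumes every node staked the same amount (such as the recommended 0.5 SOL), which may not have been true for every participant.

```python
from fractions import Fraction

# Assumption: consensus needs at least 2/3 of the total stake online,
# matching the "66.6% of stake" figure quoted in this post.
SUPERMAJORITY = Fraction(2, 3)


def can_reach_consensus(online_stake, total_stake):
    """Return True if the online share of stake meets the assumed threshold."""
    if total_stake <= 0:
        raise ValueError("total_stake must be positive")
    return Fraction(online_stake) / Fraction(total_stake) >= SUPERMAJORITY


def min_online_validators(n_validators):
    """Smallest number of equally staked validators that must stay online.

    Hypothetical helper: assumes every validator has an identical stake.
    """
    # Integer ceiling of 2 * n / 3, avoiding floating point rounding.
    return -(-2 * n_validators // 3)


if __name__ == "__main__":
    # 52% of the stake online, as reported when the network died.
    print(can_reach_consensus(52, 100))
    # 3 Solana nodes + 39 external validators, if all were staked equally.
    print(min_online_validators(3 + 39))
```

Under these assumptions, 52% online stake fails the check, which is consistent with the network being unable to progress once enough staked nodes had gone offline.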