Yesterday we ran our second dry run for Tour de SOL, a little over a week after our first. For those who missed the debrief from the first dry run, you can catch up on it here.
We had a lot of new interest in Tour de SOL this time around, given our recent press exposure, so there were plenty of new faces. We also saw many familiar faces from earlier attempts, which was fantastic.
Before we dive into what happened during Dry Run #2 (DR2), I’d like to recap what changed between this attempt and the one prior:
- A fix to the UDP packet issue: We saw roughly 50% packet loss with 64KB UDP packets in the first dry run, which left leaders unable to catch up to the network. Packet sizes have now been reduced to ~1,500 bytes and the change has been tested internally. We expect this to remain a short-term performance bottleneck until the implementation is refined over the coming weeks (link to issue: https://github.com/solana-labs/solana/issues/5294)
- Virtual ticks were not being flushed as soon as the PoH recorder was given a bank, leading to a bad vote that killed the network. The PoH recorder has been updated to flush virtual ticks immediately (link to issue: https://github.com/solana-labs/solana/pull/5277)
- Staking setup was previously configured impractically. This has now been fixed so that an operator (1) starts the validator with no stake, (2) observes that the validator has caught up and is voting consistently, and then (3) adds the stake.
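As an aside on the UDP fix above: a 64KB datagram gets split into roughly 45 IP fragments on a standard 1,500-byte MTU link, and losing any one fragment loses the whole datagram, so even a modest per-fragment loss rate compounds into massive datagram loss. A rough sketch of the math (the ~1.5% per-fragment loss rate is an illustrative assumption, not a measured value):

```python
import math

MTU_PAYLOAD = 1480        # UDP payload bytes per IP fragment on a 1500-byte MTU link
DATAGRAM = 64 * 1024      # the 64KB datagrams used in the first dry run

def datagram_loss(per_fragment_loss: float, size: int) -> float:
    """Probability an entire datagram is lost, given that losing any one
    IP fragment drops the whole datagram."""
    fragments = math.ceil(size / MTU_PAYLOAD)
    return 1 - (1 - per_fragment_loss) ** fragments

# An assumed ~1.5% per-fragment loss compounds across 45 fragments into
# roughly 50% datagram loss, while a single sub-MTU packet sees only ~1.5%.
print(round(datagram_loss(0.015, DATAGRAM), 2))
print(round(datagram_loss(0.015, 1400), 3))
```

Keeping each packet under the MTU avoids this amplification entirely, which is why the reduction to ~1,500 bytes matters even before further refinement.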
All of this was tested internally before announcing DR2. Great, so we're all on the same page now. Let's get into what happened in DR2.
Parameters for DR2:
Overall parameters remained extremely similar to that of the previous attempt. We expect this to remain relatively consistent until the end of Stage 0.
- 3 Solana nodes were used to boot the cluster, with 28 external validators expected to join shortly thereafter
- All nodes received 1 SOL to start with; the recommendation was to stake 50% of it (matching the Solana nodes)
- GPUs were not required, as the goal was to sustain an idling cluster rather than send a large volume of transactions
- All nodes used the latest pre-release, v0.18.0-pre0
Play by Play
The on-boarding process for new validators looking to participate seemed slightly more manageable this time around, as some participants had already familiarised themselves with our process. Credit also to the OG validators who helped lessen that burden by sharing their knowledge with newcomers.
We booted the cluster at around 8:00am PST, with validators joining around 8:13am, and we immediately started receiving a flood of logs from participants. Some queries/issues that consistently stood out during the first 10 minutes were:
- Being unable to create a vote account
- Inability to receive data due to port misconfiguration
- Inability to delegate, as the vote account had no root slot
- Queries about voting power and warm up period for stake
None of these were unresolvable, but they highlighted areas where communication and instructions could be improved. Outside of that, though, we were off to what seemed to be a strong start!
Unlike our first attempt, validators were properly catching up to the network this time, meaning they were actually receiving packets from other validators on the network and successfully voting.
This was fantastic to see, as it re-affirmed what happened in our internal dry run just a few days prior, which idled quite comfortably for 12 hours or so before we got bored…
So I dare say at this point we expected DR2 to perform similarly. Thus our fearless leader put his faith in us to hold down the fort while IRL commitments called for his attention.
Funnily enough, the network started to gradually break down almost immediately after he went AFK, and the deterioration continued until around 9:07pm. By then it had become too unstable, so we officially announced the end of DR2 and began collecting console logs from validators.
However, it was clear that the network had effectively died much earlier than when we officially wrapped up DR2. The image below shows the network dying around 8:35am, followed by a series of panics after the cluster died.
Major Bugs Identified:
Too much stake, too fast (this is what killed the cluster): We've identified a bug whereby the network cannot tolerate large increases (or shifts in general) in stake as it crosses an epoch boundary. Stakes warm up every epoch, which is why the boundaries matter. In our internal runs, our nodes start pretty close together, so even though total stake goes from 1x (bootstrap stake) to Nx (N nodes in the testnet), it happens during the phase where epochs are so small that total stake grows by less than 2/3 per epoch, effectively spreading out the activation. In TdS, it seems a bunch of nodes came online simultaneously during epoch 6 (the longer epoch). Since the cluster had been up for a few minutes before these nodes joined, we were past the short epochs (0-5) and into the stage where an epoch was long enough that many nodes (each with stake equal to our own nodes) came up within it. As the cluster entered epoch 7, total stake shifted dramatically and we hit this bug.
Gossip packet panic: We need to decouple gossip from the size restriction on data_blobs
Out of (virtual) memory: During the dry run, validators reported OOM errors. We believe the vm_max_map_count limit may have been hit
Solana-wallet cli usability improvements
Cluster log/metrics/rpc triage improvements
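To make the "too much stake, too fast" scenario concrete, here's a toy comparison of the two cases described above. The stake figures follow the DR2 parameters (each node staking 0.5 SOL); the "one node per epoch" join rate for internal runs is an assumption for illustration, not a recorded figure:

```python
# Toy comparison of stake activation: internal runs vs. DR2 (illustrative only).
STAKE_PER_NODE = 0.5                 # 50% of the 1 SOL each node received
bootstrap = 3 * STAKE_PER_NODE       # the 3 Solana nodes that booted the cluster

# Internal runs: nodes come up spread across the tiny early epochs,
# so the relative stake increase at each epoch boundary stays bounded.
active = bootstrap
for _ in range(7):
    prev, active = active, active + STAKE_PER_NODE
    assert (active - prev) / prev < 2 / 3   # growth per boundary stays under 2/3

# DR2: 28 equal-stake validators all activate within the same (longer) epoch,
# so the boundary into epoch 7 sees one huge shift instead of many small ones.
prev, active = bootstrap, bootstrap + 28 * STAKE_PER_NODE
print(f"single-boundary stake shift: {(active - prev) / prev:.0%}")
```

The point isn't the exact percentage, but the shape: gradual joins keep each boundary's shift small, while simultaneous joins concentrate the entire shift into one boundary, which is exactly where the bug bit.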
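For the OOM report, one quick thing an operator can check while we investigate: the current kernel limit on memory mappings. This is a generic Linux check, assuming the vm_max_map_count hypothesis is right; the 65530 figure mentioned below is just the common Linux default:

```python
from pathlib import Path

def max_map_count() -> int:
    """Read the kernel's per-process limit on memory mappings (Linux only)."""
    return int(Path("/proc/sys/vm/max_map_count").read_text())

try:
    limit = max_map_count()
    # The usual Linux default is 65530. A process approaching this many
    # mappings sees mmap() fail with ENOMEM even when plenty of RAM is free,
    # which can surface as an apparent out-of-memory condition.
    print(f"vm.max_map_count = {limit}")
except FileNotFoundError:
    print("not a Linux host; nothing to check")
```

If this does turn out to be the culprit, the limit can be raised with `sysctl -w vm.max_map_count=<larger value>` (root required).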
Next Steps & Final Comments:
We filed 11 issues in total as a direct result of DR2, which is great! Of the issues identified above, the main blocker before we attempt our next dry run is this one. That said, we're hoping to have most, if not all, of the other issues fixed before the next attempt. In fact, a majority of the DR2 issues have already been fixed at the time of writing!
I think a lot of the prior observations around documentation quality still stand. Rest assured we're working on them behind the scenes, and hopefully some of that work will roll out before DR3 or DR4. We're actively collecting feedback on this and on any other aspect the community feels we should show some love, so please don't hesitate to ping us or start a thread in our forums with your thoughts.
A big thank you again to everyone who participated in this dry run and the broader community! You guys rock.
Resources & Links