Apologies for the late update on this one folks. Been running around Berlin so communication and administration on the Public Registry has been lagging. But really happy to say that we had a fairly successful Dry Run (DR3) this past Tuesday!
The goal for DR3 was to improve network stability to the point where we could idle without any issues. Previous Dry Runs only lasted somewhere between 30 and 60 minutes, so this time we really wanted to see it last for days instead. For those who missed our previous debrief, you can find it here.
Before we dive in, I just wanted to share some quick statistics (as of Monday 19th of August PT) pulled together by rshea#2622 and mvines#6646 from our team:
- 232 registrations for Tour de SOL
- 113 have completed KYC/AML
- 92 of which have passed KYC/AML
Keybase & Public Key:
- 152 of these applicants have provided us with a Keybase Username
- 67 of these applicants have shared their Public Key and are fully onboarded
Call me a pessimist, but given that we ran DR3 during Berlin Blockchain Week, I was expecting turnout to be a little lower. So you can probably imagine my surprise when 48 Validators decided to participate. You guys never cease to exceed expectations!
To recap on the changes between DR2 and DR3:
- The bug that killed the cluster in DR2 happened because we had too much stake come online too fast. We had some internal discussions about how to fix this. For now, we’ve rectified it by slowing down the rate at which stake comes online. This is a temporary fix rather than a permanent solution, as our main priority is flushing out as many bugs as possible during these Dry Runs.
- We were seeing a lot of nodes panic towards the end of DR2 due to gossip packets. So we’ve now introduced a limit to the bloom size.
- Validators were running out of virtual memory. It turns out Linux doesn’t like it when we mmap too much. We only need a large number of append_vecs if the number of accounts is high, so we now create them opportunistically as accounts are created.
- We released a Validator on-boarding document to try and bring together all the disparate instructions for getting involved in Tour de SOL. It’s open to public feedback, so feel free to review or contribute your thoughts here.
- Several other issues were also identified and rectified, which you can find in the DR2 debrief here, under the ‘Major Bugs Identified’ section.
Parameters for DR3
Overall, the parameters remained extremely similar to those of the previous attempt. We expect them to remain relatively consistent until the end of Stage 0.
- 3 Solana nodes were used to boot the cluster, with 48 external validators expected to join shortly thereafter
- All nodes received 1 SOL to start with, recommendation was to stake 50% of that (matching the Solana nodes)
- GPUs not required as the goal was to sustain an idling cluster and not send a large volume of transactions.
- All nodes used the Mavericks v0.18.0-pre1 release
- Snapshots disabled
Play by Play
We booted the network at 8:06am PST. One immediate issue that has come up time and time again was port forwarding setup. Some of the veterans deftly handled this, but it’s definitely something we’d like to improve on. If you’re familiar with WebRTC code and interested in helping us with this, let us know.
What’s funny is that the start of DR3 essentially coincided with Anatoly’s presentation at Web3. We couldn’t have timed it better even if we wanted to. It was awesome to see some of our validators spinning up their nodes from within the audience at Web3.
We were off to a hot start, successfully making it into epoch 7, which meant that DR3 officially survived longer than DR2! The network continued to run consistently, reaching approximately 250,000 total vote transactions at 90 minutes into DR3 (about 2,778 vote transactions per minute, or roughly 48 tps while idling). Hopefully in the near future we’ll be able to chew up 250,000 vote transactions within 5 seconds or less!
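For anyone who wants to sanity-check those numbers, here is a rough back-of-the-envelope sketch. The function name is ours and purely illustrative, and the inputs are just the approximate figures quoted above, so treat the results as ballpark values.

```python
def average_rate(total_transactions: int, elapsed_seconds: float) -> float:
    """Return the average number of transactions per second over an interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_transactions / elapsed_seconds


# Approximate figures from the debrief: ~250,000 votes in the first 90 minutes.
votes = 250_000
idle_tps = average_rate(votes, 90 * 60)

# The aspirational goal: the same number of votes in 5 seconds.
target_tps = average_rate(votes, 5)

print(f"idling: {idle_tps:.1f} tps ({idle_tps * 60:.0f} per minute)")
print(f"target: {target_tps:.0f} tps")
```

This works out to roughly 46 tps while idling, which is in the same ballpark as the ~48 tps quoted, and 50,000 tps for the 5 second goal.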
If you ever wanted to know what gets us blockchain engineers excited, the secret answer is watching lines converge. Below is a visual representation of the nodes coming online, and the network converging to a point where we were no longer the majority stake within the network:
Followed by some obligatory trolls from mvines#6646 about going AFK, after he unofficially caused DR2 to fall over by doing the same thing:
Alas, all good things must come to an end. After going strong for another 3-4 hours, bringing the total uptime for DR3 to ~5 hours, voting started to break down and we called it a wrap at 1:23pm PST.
Above is a snapshot of the network towards the end. The majority of the cluster stopped voting after root slot 44632. Post-processing of the dash output around the time of failure indicated that 88% of the total stake was offline. With that much offline stake, the cluster couldn’t make progress. Most of the stake went offline either due to the crash detailed in #5570 or, in a couple of cases, because validators accidentally terminated their node without first deactivating their delegated stake.
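To make that failure mode a bit more concrete, here is a minimal sketch of the kind of check involved. It assumes the cluster needs more than two thirds of the total stake to be voting in order to make progress. That threshold, the function name, and the sample stake numbers are illustrative assumptions on our part, not values pulled from the DR3 logs.

```python
from fractions import Fraction

# Assumed supermajority threshold (illustrative, not taken from the DR3 config).
SUPERMAJORITY = Fraction(2, 3)


def can_make_progress(online_stake: int, total_stake: int,
                      threshold: Fraction = SUPERMAJORITY) -> bool:
    """Return True if strictly more than `threshold` of the stake is online."""
    if total_stake <= 0:
        raise ValueError("total_stake must be positive")
    return Fraction(online_stake, total_stake) > threshold


# Hypothetical cluster with 100 units of stake, 88% of it offline as in DR3.
total = 100
offline = 88
print(can_make_progress(total - offline, total))  # only 12% online

# For comparison, a cluster with 70% of its stake online.
print(can_make_progress(70, total))
```

With only 12% of the stake online, the check fails by a wide margin, which matches what we saw once voting stopped.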
Major Bugs Identified
- Cluster Killer: Solana-Window panicking at ‘assertion failed’
- Solana-Wallet Delegate-Stake crashes when the stake keypair file doesn’t exist
- Replay Stage Panic
- Solana-Validator Panics when it can’t open the ledger
- Improve logging on consensus failures
- Vote and stake programs report useless errors
Next Steps and Final Comments
We saw significant progress in DR3 and we’re hopeful that after the next Dry Run, we’ll not only be able to run for a longer period of time, but also push up the transaction count as well. Touch wood, but if DR4 works out as intended we think we’ll be pretty well positioned to transition into what everyone has really been waiting for - Stage 1 Tour de SOL!
As for the quality of life issues, port forwarding is definitely a bit painful right now. The other issue is how to successfully get vote accounts created, which is hopefully resolved now with the revised FAQ/On-Boarding doc for Validators. On this note, we’ve also created a new thread to collect feedback on what Validators would like to see fixed most. So please, if there’s anything that would make your life easier, or issues that you’re constantly bumping into, let us know!
Thank you to all 48 validators for participating in DR3!
Resources and Links