[Tour de SOL] Dry Run #1 Debrief

Debrief

After a year of building Solana, yesterday we made our first attempt at booting a fully decentralized cluster with external validators in preparation for Tour de SOL, our incentivized testnet. To say that we were excited is a severe understatement. Being at a point where we were able to put our technology out into the hands of the community was simply amazing.

Today we want to share the journey, our thoughts and some of the issues identified during the dry run with all those who participated and with the broader community.

Parameters for the Dry Run:

  • 5 Solana nodes were used to boot the cluster, with 50 external validators expected to join shortly thereafter
  • All nodes received 1 SOL to start with; the recommendation was to stake 50% of that (matching the Solana nodes)
  • Snapshots (for faster validator catch-up) were disabled due to some last-minute bugs
  • GPUs not required as the goal was to sustain an idling cluster and not send a large volume of transactions.
  • All nodes used the latest stable release, v0.16.6.

Play by Play:

Two hours before the Dry Run started, we worked through final issues to ensure Validators were properly onboarded, e.g. collecting public keys, Keybase usernames, etc.

At around 8:05am we started to boot up the Dry Run cluster. Before that, we had to make sure that everyone had disconnected, because there were still some unresolved bugs in our support for multiple forks. Two cheeky validators still refused to disconnect from the cluster before the restart, but we pushed ahead.

As soon as we announced the cluster boot was happening, the memes started:

[images: three memes]

At the same time we started our own pool internally, betting on how long it would take for the cluster to shut down, for fun and to help calm our nerves.

Before we even had the opportunity to tell validators the cluster had successfully booted, some eager beavers had already started piling on!

We officially asked validators to connect at 8:27am Pacific Time. Immediately afterwards we started seeing error logs, most of which reported “No next leader found”.

Despite this, the cluster made progress as expected until the 5th epoch, at which point it started to get a little spotty. Epoch 5 was when we started to pick up a bunch of new stake from external validators, followed by absent leaders and large blocks of frozen/new forks.

Eventually, at around 8:50am PT, the cluster stopped progressing, wrapping up our first dry run in a little under 30 minutes. That was surprisingly longer than we had anticipated.

After the cluster went down the first time, we re-booted the cluster for a second attempt at 9:45am Pacific Time where we ran into the same issues as those mentioned above, causing the cluster to fall over again.

It was at this point that we decided to call it a day and end our first dry run. We had collected some great data and didn’t see much value in constantly restarting the cluster further until we had some time to analyse the data and rectify the issues we discovered.

Major Bugs Identified:

  • BlockhashNotFound issue between leader/validator nodes that can kill forks.
    • We had observed this error inconsistently and infrequently in our internal dry run prior… But on the external dry run this was observed very quickly across all nodes. The logs we collected from the external dry run were exactly what we needed to diagnose and fix the issue.
  • Staking setup is backwards.
    • We had configured our scripts to spin up Validators very easily, which is optimal for a development workflow but isn’t an ideal way to configure a validator in the real world. The script delegated stake before the validator had even started and begun functioning, which in retrospect is just plain upside down. The steps should actually be: (1) start the validator with no stake, (2) observe that the validator has caught up and is voting consistently, (3) add the stake. This is really just a simple workflow fix.
  • Turns out 64 KB UDP packets are a problem.
    • Most of our prior testing had been between nodes on cloud instances that happily send large UDP packets with almost no packet loss. This appears to not generalize to the wider Internet and even a seemingly low packet loss of 1% can be quite significant and prevent a validator from making progress. This will require some surgery to the networking subsystem to limit UDP packet size, a bit of work but again another invaluable learning from this effort.
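
To illustrate why a seemingly low loss rate hurts large UDP packets so much, here is a rough back-of-the-envelope sketch. It assumes things the write-up above does not state: a typical 1500-byte path MTU, a 20-byte IPv4 header with no options, and the 1% loss applying independently to each IP fragment. Under those assumptions, a datagram only arrives if every one of its fragments arrives. The function name and parameters are hypothetical and are not part of the Solana codebase.

```python
import math

def datagram_delivery_probability(payload_bytes, mtu=1500, fragment_loss=0.01):
    """Return (fragment_count, probability that the whole UDP datagram arrives).

    Assumptions (not from the debrief): IPv4 with a 20-byte header,
    independent loss per fragment, and no retransmission.
    """
    # Each fragment carries at most MTU minus the IP header,
    # rounded down to a multiple of 8 bytes as IPv4 fragmentation requires.
    per_fragment = (mtu - 20) // 8 * 8
    # The 8-byte UDP header is part of the data that gets fragmented.
    fragments = math.ceil((payload_bytes + 8) / per_fragment)
    # The datagram is usable only if every fragment survives.
    return fragments, (1.0 - fragment_loss) ** fragments

# Largest possible UDP payload over IPv4 (just under 64 KB).
big_fragments, big_p = datagram_delivery_probability(65507)
# A payload that fits in a single packet.
small_fragments, small_p = datagram_delivery_probability(1200)

print(big_fragments, round(big_p, 3))      # 45 0.636
print(small_fragments, round(small_p, 3))  # 1 0.99
```

In this simplified model, roughly one in three maximum-size datagrams is lost at 1% per-fragment loss, compared with one in a hundred for a datagram that fits in a single packet.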

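The corrected staking order listed above (start with no stake, confirm the validator has caught up and is voting consistently, then add the stake) can be sketched as a simple readiness check. This is only an illustration: the `ValidatorStatus` fields, the thresholds, and the function name are all hypothetical and do not correspond to any real Solana script or CLI.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ValidatorStatus:
    slots_behind: int         # hypothetical metric: distance from the cluster tip
    recent_votes: List[bool]  # hypothetical: True for each recent slot voted on

def ready_for_stake(status, max_slots_behind=10, min_vote_rate=0.9, window=50):
    """Step (2): decide whether an unstaked validator is safe to delegate to.

    All thresholds are illustrative assumptions, not recommended values.
    """
    # Not caught up yet: keep waiting before delegating.
    if status.slots_behind > max_slots_behind:
        return False
    # Require a full observation window before judging vote consistency.
    recent = status.recent_votes[-window:]
    if len(recent) < window:
        return False
    return sum(recent) / window >= min_vote_rate

# Step (1): the validator starts with no stake and is observed for a while.
catching_up = ValidatorStatus(slots_behind=500, recent_votes=[True] * 50)
healthy = ValidatorStatus(slots_behind=2, recent_votes=[True] * 48 + [False] * 2)

print(ready_for_stake(catching_up))  # False
print(ready_for_stake(healthy))      # True, so step (3) can proceed: add the stake
```
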
Other Observations:

  • Our documentation leading up to the first dry run was definitely lacking
  • Confusion around what certain scripts were used for
  • Some validators had a setup with GPUs which were under CUDA capability 3.5 and as a result were knocked out immediately

Next Steps:

  • We’ll be working to improve documentation for future attempts, by collating some of the commonly encountered issues, frequently asked questions and transforming that into a comprehensive document for Validators
  • In light of some of the issues identified, we’ll be replacing Stage 1 of Tour de SOL with a new Stage 0, with all stages from Stage 1 onwards being pushed out by a month.
  • Stage 0 will run throughout the balance of August, during which we’ll continue to run dry runs as we work through the issues. Dry runs during August will be ad hoc in nature, but we’ll be giving a minimum of 48 hours’ notice to any Validators interested in participating.

Final Comments:

For us, External Validator Dry Run #1 was a massive success. Not only did the cluster take longer than we expected to fall over, but we also identified many major bugs which weren’t previously observed. So we want to say thank you to all 50 validators that participated in our first attempt. We were overwhelmed by the excitement leading up to, during and even after the dry run. We were also really surprised by the immense level of patience, positivity and support from the community the whole way through.

Suffice to say, we had a lot of fun, and we couldn’t have done it without you all. So from all of us at Solana, we wanted to say thanks!

A final shoutout as well to Aurel from Dokia Capital, who is officially the “cluster-killah” for our first dry run. Let’s see who takes the cake next time! Hehe.

Excellent retrospective @dominic :slight_smile:

I particularly like this! Excellent debrief!

Communication from the team’s side has been great so far. Keep up the good work, guys! :slight_smile:
