Welcome back to everyone who’s been following our progress during Tour de SOL. It’s been another eventful week. We’ve shipped fixes for several issues identified over the past few days, while others remain outstanding because they will take longer to resolve.
PROGRESS ON CRITICAL BUGS
Certus One’s DoS Attack
For the first several days of last week, we kept the network offline while we attempted to quickly fix the bug highlighted by Certus One’s successful DoS attack. However, after working on it for several days, we concluded that the fix would require additional time and effort. We therefore decided to bring the network back up first, because:
- We had fixes for several other issues ready to deploy
- Bringing the network back up would let us continue identifying more bugs
- The Validator community agreed that the DoS attack would be off-limits until advised otherwise
Other High-Priority Stability Issues
Outside of the attack mentioned above, we still have several other issues which we’ll be focusing on for the coming weeks:
- Snapshots loading very slowly: Snapshots are taking over 1,000 slots to unpack and process across both TdS and SLP due to the large number of account files they contain, which slows Validators’ ability to boot and restart quickly.
- Validators can accidentally fetch snapshots from delinquent Validators: When a Validator comes up and looks for a snapshot over RPC, it can easily pick a delinquent Validator and thus receive a very old snapshot.
- Intermittent Consensus Issue
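To illustrate the snapshot-source problem above, a natural mitigation is to compare each candidate node’s latest slot against the cluster tip and skip any node that lags far behind. The sketch below is a minimal, hypothetical illustration of that idea in Python — the `rpc_nodes` shape, node addresses, and the 500-slot lag threshold are all assumptions for the example, not Solana’s actual API or behavior.

```python
# Hypothetical sketch: prefer fresh snapshot sources over delinquent ones.
# A delinquent Validator lags far behind the cluster tip, so its snapshot
# would force the booting node to replay a huge number of slots.

MAX_SLOT_LAG = 500  # assumed threshold for "recent enough"; illustrative only


def pick_snapshot_sources(rpc_nodes):
    """Return only the nodes whose latest slot is close to the cluster tip."""
    if not rpc_nodes:
        return []
    tip = max(node["slot"] for node in rpc_nodes)
    return [n for n in rpc_nodes if tip - n["slot"] <= MAX_SLOT_LAG]


nodes = [
    {"addr": "10.0.0.1", "slot": 120_000},  # at the tip
    {"addr": "10.0.0.2", "slot": 119_900},  # slightly behind, still fine
    {"addr": "10.0.0.3", "slot": 40_000},   # delinquent: serves a very old snapshot
]
fresh = pick_snapshot_sources(nodes)  # keeps the first two nodes only
```

Without a freshness check like this, a restarting Validator that happens to query the delinquent node gets a snapshot tens of thousands of slots old, compounding the slow-unpack problem described above.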
The cluster was restarted on the 12th of February with version v0.23.4, capturing fixes in the following areas:
- Gossip Network
We successfully upgraded the cluster with the new version the following day, and our bootstrap Validator node finally managed to distribute its stake to the rest of the cluster such that it represented less than 33% of the active stake.
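The 33% figure matters because a node holding a third or more of the active stake can single-handedly stall consensus if it goes offline. The check itself is simple arithmetic; the sketch below uses made-up stake figures purely to illustrate the threshold.

```python
# Illustrative arithmetic: is any single node's share of active stake
# below the one-third consensus-safety threshold? Stake values are invented.


def stake_share(stakes, node):
    """Fraction of total active stake held by `node`."""
    total = sum(stakes.values())
    return stakes[node] / total


stakes = {"bootstrap": 30, "v1": 25, "v2": 25, "v3": 20}  # arbitrary units
share = stake_share(stakes, "bootstrap")
safe = share < 1 / 3  # True: the bootstrap node holds 30% of active stake
```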
We followed up with another release on the 14th of February to rectify a long-standing out-of-memory issue that was affecting some Validators.
As of today, the network is still down due to the issue mentioned above, where Validators accidentally fetch snapshots from delinquent nodes. Until this is resolved we won’t restart the network, as it would likely crash again after a few days.