Solana Outage: Perspective From a Developer

I’m Tanner from Code, a new app being built with Kin. I’ve been a part of the Solana ecosystem for about three years now, since well before the launch of Testnet and Mainnet Beta. We were the first major project to publicly commit to moving to Solana, and we recently completed the largest blockchain migration that we are aware of (60M+ wallets). Over the last three years we’ve met a lot of developers who are building, or thinking about building, on Solana, as well as a lot of funds that are investing, or thinking about investing, in the Solana ecosystem.

When the last outage happened in December 2020 we fielded a lot of questions 1:1 as people tried to understand what happened and how they should think about it going forward. That seemed to be helpful. When this recent outage happened, a bunch of people again reached out with similar questions, including some new people. We started to respond 1:1 again, but given the volume we figured it would be better to put this out publicly in the hope that it’s helpful to others. Here is our simple summary and some answers to anticipated follow-up questions - if you have any others, or we missed something in this post, please let me know.

Required Context

Unlike Bitcoin and Ethereum, which validate transactions serially, Solana parallelizes transactions. This is made possible through Proof-of-History (POH), which creates a canonical ordering of events across the network. POH is a novel mechanism that enables the entire network to agree on time, using something called a Verifiable Delay Function (VDF) to prove the passage of time between block productions. Effectively, each subsequent block has an appended proof that it came after the previous block. This serialization of block production enables parallelization of verification, because validators can all agree on the sequencing.

This breakthrough enables Solana to optimize the concept of a “mempool” (memory pool), which is effectively a queue of yet-to-be-confirmed transactions. At the time of this writing Ethereum has a mempool of ~180k transactions, which at 30 tps would take about 1.5-2 hours to churn through (180,000 / 30 = 6,000 seconds). This is why we see rising gas costs: with a supply/demand imbalance, prices rise.
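To make the Proof-of-History idea above a little more concrete, here is a minimal sketch of a sequential hash chain. It is only an illustration under simplifying assumptions, not Solana's actual implementation: the function names (`next_hash`, `produce`, `verify`) and the entry format are hypothetical. It shows the asymmetry described above: producing the chain is strictly serial, while each link can be checked independently and therefore in parallel.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def next_hash(prev: bytes, event: bytes = b"") -> bytes:
    """One step of the chain: hash the previous state together with an optional event."""
    return hashlib.sha256(prev + event).digest()


def produce(seed: bytes, events: list[bytes]) -> list[tuple[bytes, bytes]]:
    """Build (event, hash) entries one after another.

    Each hash depends on the previous one, so this loop cannot be parallelized.
    """
    entries = []
    prev = seed
    for event in events:
        prev = next_hash(prev, event)
        entries.append((event, prev))
    return entries


def verify(seed: bytes, entries: list[tuple[bytes, bytes]]) -> bool:
    """Check every link of the chain.

    The starting hash of each link is already recorded in the previous entry,
    so the checks are independent of each other. Threads are used here only
    to show that independence.
    """
    prevs = [seed] + [h for _, h in entries[:-1]]

    def check(i: int) -> bool:
        event, recorded = entries[i]
        return next_hash(prevs[i], event) == recorded

    with ThreadPoolExecutor() as pool:
        return all(pool.map(check, range(len(entries))))


if __name__ == "__main__":
    chain = produce(b"genesis", [b"tx-a", b"", b"tx-b"])
    print(verify(b"genesis", chain))        # the untouched chain verifies
    print(verify(b"genesis", chain[::-1]))  # reordered entries do not
```

Because every entry commits to everything before it, reordering or dropping an entry breaks at least one link, which is what lets validators agree on sequencing without talking to each other first.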

Solana uses something called Gulf Stream to mitigate this bottleneck. Because POH lets validators coordinate ordering more efficiently, clients know who the leader will be and can forward transactions to that leader without waiting for serialized confirmation. This lets validators move through the verification process much faster. In a practical sense, validators are verifying in parallel on “mini forks” and consolidating as they go.
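The forwarding idea can be sketched roughly as follows. This is a toy model, not the Gulf Stream protocol itself: the round-robin schedule, the `lookahead` parameter, and the `leader_for_slot` and `forward` helpers are all assumptions made for illustration. The point is only that when the schedule is known ahead of time, a client can push a transaction straight to the upcoming leaders instead of leaving it in a shared queue.

```python
from collections import defaultdict


def leader_for_slot(schedule: list[str], slot: int) -> str:
    """Look up the leader for a slot. A simple round-robin schedule is assumed here."""
    return schedule[slot % len(schedule)]


def forward(
    tx: str,
    current_slot: int,
    schedule: list[str],
    inboxes: dict[str, list[str]],
    lookahead: int = 2,
) -> list[str]:
    """Deliver tx to the leader of the current slot and the next few slots.

    Returns the leaders that received the transaction, in slot order.
    """
    targets: list[str] = []
    for slot in range(current_slot, current_slot + lookahead):
        leader = leader_for_slot(schedule, slot)
        if leader not in targets:  # avoid sending the same tx to one leader twice
            targets.append(leader)
            inboxes[leader].append(tx)
    return targets


if __name__ == "__main__":
    schedule = ["validator-A", "validator-B", "validator-C"]
    inboxes: dict[str, list[str]] = defaultdict(list)
    print(forward("tx1", current_slot=4, schedule=schedule, inboxes=inboxes))
```

Note that in this model the per-leader inboxes are unbounded, which hints at why a large enough burst of forwarded transactions can become a memory problem for whoever is on the receiving end.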

09.14 Outage

On 09.14 there was a spike in proposed transactions, on the order of 300k / second. This overloaded the “forwarders” (the part of the Gulf Stream protocol that pushes transactions to validators), which resulted in validators crashing from memory overload. To mitigate this, block producers started to automatically propose a number of forks, which is what they are supposed to do. The challenge was that validators could not agree on a fork, a byproduct of the parallelization, because validators were trying to reconcile varying states. With this overload, the automatic system of forking came to a halt with <80% consensus on any proposed fork. A bug related to transaction prioritization was found and addressed, and the network was “restarted” after 80% of validators agreed on a state of the chain and did a manual hard fork.
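The 80% condition above can be pictured with a small sketch. This is a simplified model and not how Solana's consensus is implemented: the `agreed_state` function and the idea of giving each validator a weight are assumptions for illustration (equal weights reproduce a plain headcount). It shows why a network that splits across several forks stalls, and why a restart needs a large enough share of validators to settle on one state.

```python
from collections import Counter
from typing import Optional


def agreed_state(
    votes: dict[str, str],
    weights: dict[str, float],
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the chain state backed by at least `threshold` of total weight, else None.

    Validators that have not voted (for example, because they crashed) still
    count toward the total, so they make the threshold harder to reach.
    """
    total = sum(weights.values())
    if total <= 0 or not votes:
        return None

    tally: Counter = Counter()
    for validator, state in votes.items():
        tally[state] += weights.get(validator, 0.0)

    state, weight = max(tally.items(), key=lambda item: item[1])
    return state if weight / total >= threshold else None


if __name__ == "__main__":
    weights = {"v1": 40, "v2": 30, "v3": 20, "v4": 10}
    split = {"v1": "fork-x", "v2": "fork-x", "v3": "fork-y", "v4": "fork-z"}
    print(agreed_state(split, weights))    # no fork reaches the threshold
    settled = {"v1": "fork-x", "v2": "fork-x", "v3": "fork-x", "v4": "fork-y"}
    print(agreed_state(settled, weights))  # one fork clears it
```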

No funds were lost, and proposed transactions that had not been confirmed were processed after the outage. Even with the bug addressed, there are additional lessons to take from this outage that should help us continue to refine the Solana network.

Going Forward

I don’t work for the Solana Foundation, or Solana Labs, and have no information outside of what has been shared publicly so far. Our view is that:

  1. Validators should continue to upgrade their memory capacity
  2. There should be increased data compression to make memory usage more efficient
  3. Transaction encoding at the client level can be improved to be more memory efficient

One option we’ve been discussing that could be a strong economic incentive to drive these improvements is a dynamic fee structure. This should incentivize efficient transaction encoding and disincentivize inefficient transaction encoding, and for a complex transaction, the fee generated would offset the increased hardware cost of improving memory capacity. This will become increasingly important as the Solana ecosystem starts to implement privacy-preserving transaction types (i.e. proofs) and, in the future, transaction types that increase scalability (i.e. roll ups).
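As a rough illustration of what a dynamic fee might look like, here is a minimal sketch. Everything in it is hypothetical: the `dynamic_fee` function, its parameters, and the default rates are made up for the example and are not a proposal for actual values or an existing Solana mechanism. It only shows the shape of the incentive, where a larger or more complex transaction pays more than a compactly encoded one.

```python
def dynamic_fee(
    size_bytes: int,
    compute_units: int,
    base_fee: int = 5000,   # hypothetical flat fee per transaction
    per_byte: int = 10,     # hypothetical charge per encoded byte
    per_unit: int = 1,      # hypothetical charge per unit of compute
) -> int:
    """Fee that grows with the encoded size and complexity of a transaction."""
    if size_bytes < 0 or compute_units < 0:
        raise ValueError("size_bytes and compute_units must be non-negative")
    return base_fee + per_byte * size_bytes + per_unit * compute_units


if __name__ == "__main__":
    compact = dynamic_fee(size_bytes=200, compute_units=1000)
    bloated = dynamic_fee(size_bytes=1200, compute_units=1000)
    # Same work, but the less efficient encoding costs more.
    print(compact, bloated)
```

Under a scheme like this, a client that encodes the same transfer in fewer bytes pays less, and validators that process heavier transactions collect more to put toward hardware.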

Thoughts on Solana Long Term

We’ve been building in the Solana ecosystem for almost three years now and have seen the rate of progress. We have also seen two outages, so the question we’ve gotten a lot is: “Can Solana be trusted long term?” Nothing is guaranteed, but what we have seen from these two outages is that nothing so far suggests the Solana infrastructure is fundamentally flawed. There haven’t been any double spends or censored transactions; the issues have been related to transaction confirmation.

Our pragmatic view is that Solana is solving complex problems with proven technology (VDFs, erasure coding, etc.) and the edges of how these fit together will need to be smoothed out over time. Building a high performance consumer product requires high confidence in the underlying technology, so we are watching this closely and helping where we can.

Our hope is that these edges are smoothed out and we don’t see another outage like this. One positive we did see was how quickly the network coordinated, so if there were a challenge like this in the future, our expectation is that the response time would be compressed by an order of magnitude.

Overall we believe that:

  1. Solana is still nascent, and it has a lot of room to grow before it reaches the level of reliability expected of infrastructure that is meant to support large scale consumer experiences

  2. There are options we should be exploring to incentivize more efficient use of, and contributions to, the network (i.e. dynamic fees)

  3. Of the options, Solana is the best we’ve seen for high performance, and the ecosystem is in a good position to build on this and increase its leadership in this category

Interested to hear others’ thoughts on this and what might be a good next step as we move forward as an ecosystem.