DR5 Postponement Update

Hi All!

Our announced target date for DR5 was October 3rd. However, during the internal testnet run on our latest release prior to the external event, we found that transaction throughput was still bottlenecked by the CPU. Ideally we want DR5 to be both the final dry run before Stage 1 and a test of validators running on GPUs at a high transaction rate. For these reasons we’ve decided that it is in everyone’s best interest to delay it until we can improve the transaction rate further (the latest release is stable at around 5,000 - 8,000 average tx/s for 50 validator nodes).

There is, however, some good news. Stable transaction throughput has improved significantly since DR4, which averaged ~1,000 tx/s (for those who haven’t been following Tour de SOL, note that we regressed TPS significantly in the 0.18 release when we rewrote our entire networking stack to actually work on the real Internet). Everything outside of transaction throughput is also looking really solid, including usability improvements such as:

  • Snapshots are working great, with many new bugs squashed, and validators can typically catch up to the cluster within a minute.
  • Much improved detection of misconfigured routers/NATs, which was a major source of pain for new validators.
  • Support for both SOL (default) and lamports in the solana command-line tool, the new “solana uptime” command, and even better error handling all around.

This is progress worth celebrating, but we believe that achieving throughput an order of magnitude greater unlocks use cases that have so far been technically or economically infeasible on a blockchain, so we remain committed to lighting cigars only when the average transaction throughput reaches a much higher range. To be frank, it’s difficult to estimate how long this will take, but we have the utmost confidence that it’s a matter of when, not if. We won’t commit to a revised date for DR5 just yet, but the goal has not changed. If you’re interested in reviewing our progress in working through the blockers for DR5, you can find our tracker HERE.

The other important factor that led us to this decision is that, after running our internal testnet on release 0.19.0, we’re confident we can easily reproduce the transaction throughput issues on our own. Until everything looks clean internally, external validator participation would waste the time and money of our community, and would provide no real benefit beyond hosting a Dry Run for optics while slowing down progress on solving the real problems.

We hope you all understand, and thank you for sticking with us. If you have any questions, concerns or issues don’t hesitate to reach out to us.

Cheers,

Solana Team


Is this indefinite? No tentative future dates?

The good news is we’re at a point where we’re confident that we can go much higher than 8,000 tps with some more time.

But there are two issues right now. First, we’re finding that the current release still has us CPU-bound. Second, we’re not happy with 5,000 - 8,000 tps.

That isn’t in the spirit of what we promised DR5 to be, so we want to do it right, when we can deliver higher throughput, but it’s just hard to estimate how long that will take. I daresay it won’t be unreasonably long, but we don’t want to mismanage expectations by committing to a time frame right now.

Sorry if that seems like a little bit of a cop-out answer. Just speaking openly. We still intend to give updates along the way, and you’re always welcome to ping us with any questions.


Hey All,

As promised, I just wanted to update you on our progress towards improving transaction throughput in preparation for Dry Run 5. We’ve made good progress these past few weeks identifying and working through the bottlenecks.

Just a bit of background for those who haven’t been following: we reduced the packet size from 64k down to 1k after DR1 due to massive packet loss. You can read more about what happened in the debrief for Dry Run 1 here. As a result, the whole system is effectively handling 64x as much packet-processing workload. If we were previously able to do 50,000 transactions per second on a local network, we are now technically asking it to do the equivalent of 3,200,000 tx/s (50,000 x 64 = 3,200,000). That isn’t exactly true, since the actual number of transactions isn’t increasing, but every loop in the code, every turbine computation, and every verification step now needs to run 64x more often.
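To make that arithmetic concrete, here’s a quick back-of-the-envelope sketch in Rust. The 64k/1k packet sizes and the 50,000 tx/s baseline come from the explanation above; everything else is purely illustrative and not actual Solana code:

```rust
// Back-of-the-envelope illustration of the packet-count blow-up described above.
// Packet sizes and baseline throughput are taken from the post; the rest is
// illustrative only.

fn main() {
    let old_packet_bytes: u64 = 64 * 1024; // ~64k packets before DR1
    let new_packet_bytes: u64 = 1024;      // ~1k MTU-sized packets after DR1

    // The same amount of data now needs 64x as many packets.
    let packet_multiplier = old_packet_bytes / new_packet_bytes;
    assert_eq!(packet_multiplier, 64);

    // The transaction count doesn't change, but every per-packet loop
    // (turbine fan-out, verification, etc.) runs 64x more often.
    let baseline_tps: u64 = 50_000;
    let effective_packet_workload = baseline_tps * packet_multiplier;
    println!(
        "{} tx/s at 1k packets looks like ~{} tx/s of per-packet work",
        baseline_tps, effective_packet_workload
    ); // prints: 50000 tx/s ... ~3200000 tx/s of per-packet work
}
```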

This meant we ran into single-core performance limitations in many parts of the system, which delayed how fast we were delivering data to everyone.

In addition to that, we identified a bug where network repair responses were significantly amplifying the number of 1k MTU packets on the network. We’ve fixed this by no longer sending 1k MTU packets received via repair to the rest of the cluster.
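For a rough picture of what that fix does, here’s a hypothetical sketch. The `Packet` struct, `from_repair` flag, and `packets_to_retransmit` function are illustrative names only, not the actual code; see the PRs below for the real change:

```rust
// Hypothetical sketch of the repair-amplification fix described above; names
// and types are illustrative, not the actual code from the linked PRs.

struct Packet {
    /// True if this packet arrived in response to a repair request rather
    /// than via the normal turbine broadcast path.
    from_repair: bool,
}

/// Retransmit only packets received via normal broadcast. Packets obtained
/// through repair are kept locally and not re-broadcast, so a single repair
/// response no longer fans out into many extra 1k MTU packets.
fn packets_to_retransmit(packets: &[Packet]) -> Vec<&Packet> {
    packets.iter().filter(|p| !p.from_repair).collect()
}

fn main() {
    let packets = vec![
        Packet { from_repair: false },
        Packet { from_repair: true },
        Packet { from_repair: false },
    ];
    // Only the two broadcast packets are forwarded to the rest of the cluster.
    assert_eq!(packets_to_retransmit(&packets).len(), 2);
}
```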

The relevant GitHub PRs are listed below if you’re interested in diving into the details: