Improving Network Reliability and Scalability
This blog post first appeared on the Solana Labs Medium.
Delivering a Fast, Reliable, and Scalable Network
Delivering a fast, reliable, and scalable network remains a top priority in moving toward a better, decentralized web. The issues around last week’s 1.14 network update, which focused on improvements for speed and scale, made it clear that maintaining stability during these major updates remains a challenge.
Ongoing Investigation and Next Steps
An investigation is still ongoing and more details will be provided here when available. In the meantime, I want to share the plans in motion to balance reliability with building a scalable and fast network, and where we go from here.
Addressing Live Problems and Prioritizing the User Experience
Up to the 1.14 release, core engineers were working to fix live problems that were impacting the network’s speed and usability. These issues included invalid gas metering, lack of flow control for transactions, lack of fee markets, and spiraling RAM, storage, and restart overhead.
Addressing these issues was prioritized to improve the user experience on the network. Following the latest release, core engineers plan to improve the software release rollout process by bringing in additional external developers and auditors to test and find exploits, and by continuing to support external core engineers, including the Firedancer team building a second validator client.
Improving the Upgrade Process
Core engineers will work with validators to improve the software release process. Previous releases followed a pattern like the one used for 1.14:
- Mainnet-beta validators run 1.13
- Testnet validators run 1.14
- Devnet validators run 1.14
- Mainnet-beta validators begin running 1.14 on master canary nodes (i.e. test nodes)
- Validators, RPC operators, as well as teams deploying dApps on the network, provide feedback on 1.14
- Mainnet-beta validators begin a full deployment of 1.14, initiating the upgrade process
Even with nodes of mixed versions running against mainnet-beta, the behavior of the network changes once the supermajority of stake switches versions, so canary nodes alone cannot reveal how the upgraded network will behave.
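To illustrate why a canary phase and a full rollout can behave differently, here is a minimal sketch that computes how much stake runs each version and whether any version holds a supermajority. The function names and stake figures are hypothetical, and the threshold assumes a supermajority means strictly more than two thirds of active stake.

```python
from collections import defaultdict
from fractions import Fraction

# Assumption: a supermajority is strictly more than 2/3 of active stake.
SUPERMAJORITY = Fraction(2, 3)


def stake_share_by_version(validators):
    """Return {version: share of total stake} for (version, stake) pairs."""
    totals = defaultdict(int)
    for version, stake in validators:
        totals[version] += stake
    total_stake = sum(totals.values())
    if total_stake == 0:
        return {}
    # Fractions avoid rounding error right at the 2/3 boundary.
    return {v: Fraction(s, total_stake) for v, s in totals.items()}


def supermajority_version(validators):
    """Return the version holding a supermajority of stake, or None."""
    for version, share in stake_share_by_version(validators).items():
        if share > SUPERMAJORITY:
            return version
    return None


# Hypothetical clusters; the stake figures are made up for illustration.
canary_phase = [("1.13", 90), ("1.14", 10)]
full_rollout = [("1.13", 30), ("1.14", 70)]

print(supermajority_version(canary_phase))
print(supermajority_version(full_rollout))
```

In the canary phase the old version still holds the supermajority, so the new code paths that depend on supermajority behavior are never exercised until the full rollout.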
Core engineers plan to help improve the process as follows:
- Before the mainnet-beta upgrade, downgrade testnet to the current mainnet-beta version and feature-set
- Upgrade testnet to the release candidate of the new version
- Observe how the testnet migration goes in real-time
- Downgrade testnet back to current mainnet-beta version
- Repeat this process while stress-testing the testnet
- Release new version to mainnet-beta validators for upgrade
This would require regenesis of the testnet image during the first downgrade. Part of this simulation should include changing the stake distribution to mirror mainnet-beta.
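The rehearsal cycle above can be sketched as a simple plan generator, together with a helper that scales mainnet-beta stake amounts onto a testnet while keeping the same proportions. This is an illustrative sketch only: `rehearsal_plan`, `mirror_stake`, the number of rounds, and the stake values are all hypothetical and do not correspond to any real Solana tooling.

```python
def rehearsal_plan(current, candidate, rounds=2):
    """List the testnet steps for `rounds` upgrade/downgrade cycles."""
    if rounds < 1:
        raise ValueError("at least one rehearsal round is required")
    # The first downgrade needs a regenesis of the testnet image.
    steps = [
        f"regenesis testnet on {current} with the mainnet-beta "
        "feature set and stake distribution"
    ]
    for i in range(1, rounds + 1):
        steps.append(f"round {i}: upgrade testnet to {candidate} release candidate under stress test")
        steps.append(f"round {i}: observe the migration in real time")
        steps.append(f"round {i}: downgrade testnet back to {current}")
    steps.append(f"release {candidate} to mainnet-beta validators")
    return steps


def mirror_stake(mainnet_stakes, testnet_total):
    """Scale mainnet-beta stakes to `testnet_total`, largest first."""
    mainnet_total = sum(mainnet_stakes)
    if mainnet_total <= 0:
        raise ValueError("mainnet stake must be positive")
    ranked = sorted(mainnet_stakes, reverse=True)
    return [testnet_total * stake / mainnet_total for stake in ranked]


for step in rehearsal_plan("1.13", "1.14"):
    print(step)
```

Mirroring proportions rather than absolute amounts keeps the share of stake needed to reach a supermajority the same on testnet as on mainnet-beta, which is the property the rehearsal is meant to exercise.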
Forming an Adversarial Team
Core engineers previously performed integration testing. In addition, an adversarial team has now been formed, comprising nearly one third of the Solana Labs core engineering team. This team will build additional hooks and instrumentation into the validator code to help find exploits across the underlying protocols, and will provide hardware to run medium to large clusters for adversarial simulation.
Improving the Restart Process
While fully automating the restart process is difficult, different kinds of failures can be handled with simpler procedures. Nodes should automatically discover the latest optimistically confirmed slot and share the ledger with each other if it is missing.
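As a rough illustration of that idea, the sketch below takes each node's view of the cluster, picks the latest optimistically confirmed slot that any node reports, and pairs nodes whose ledger stops short of that slot with peers that can serve it. The `NodeView` type and `restart_plan` function are hypothetical, and the logic is deliberately simplified: a real procedure would weigh reports by stake and verify them rather than trust the maximum.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    name: str
    confirmed_slot: int  # highest optimistically confirmed slot this node saw
    ledger_tip: int      # highest slot this node holds in its local ledger


def restart_plan(views):
    """Return (restart_slot, {node missing ledger: [peers that hold it]})."""
    if not views:
        raise ValueError("no node views provided")
    # Simplification: trust the highest slot any node reports as confirmed.
    restart_slot = max(v.confirmed_slot for v in views)
    holders = [v.name for v in views if v.ledger_tip >= restart_slot]
    needs_ledger = {
        v.name: holders for v in views if v.ledger_tip < restart_slot
    }
    return restart_slot, needs_ledger


# Hypothetical cluster state after a halt.
views = [
    NodeView("a", confirmed_slot=100, ledger_tip=100),
    NodeView("b", confirmed_slot=98, ledger_tip=98),
    NodeView("c", confirmed_slot=100, ledger_tip=105),
]
slot, needs_ledger = restart_plan(views)
print(slot, needs_ledger)
```

Here node "b" would fetch the missing ledger from "a" or "c" before the cluster restarts from the agreed slot.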
Continuing to Focus on Stability
Over the last 12 months, Solana Labs and third-party core engineering teams have also been working to improve the network, and will continue to do so with a focus on stability. For example:
- A second validator client is being built by Jump Crypto’s Firedancer team, focused on increasing the network’s throughput, efficiency, and resiliency.
- Mango DAO developers are focused on the tooling needed to build on Solana.
- Network communication technology transitioned to QUIC, a more advanced networking protocol.
- Local fee markets were implemented.
- Stake weighted QoS was incorporated to improve the ability to land transactions.
- Jito’s MEV client is providing alternative paths for landing transactions.
- RPC infrastructure was improved to reduce its load.
Conclusion
Today, there are more than 2,000 developers building thousands of programs on Solana. These developers were attracted to Solana because it lets them build things they can build nowhere else, but those developers also need a stable and predictable foundation. Core engineers are committed to making these changes so that reliability does not suffer for the sake of innovation and speed.
FAQs
Q: What is the current status of the investigation?
A: An investigation is still ongoing and more details will be provided here when available.
Q: What are the plans to improve the upgrade process?
A: Core engineers will work with validators to improve the software release process by repeatedly rehearsing each upgrade on testnet under stress before releasing it to mainnet-beta, by bringing in additional external developers and auditors to test and find exploits, and by continuing to support external core engineers, including the Firedancer team building a second validator client.
Q: What is the purpose of the adversarial team?
A: The adversarial team was formed to build additional hooks and instrumentation into the validator code to help find exploits across the underlying protocols, and to provide hardware to run medium to large clusters for adversarial simulation.
Q: How will the restart process be improved?
A: While fully automating the restart process is difficult, different kinds of failures can be handled with simpler procedures. Nodes should automatically discover the latest optimistically confirmed slot and share the ledger with each other if it is missing.
Q: What are the plans to continue focusing on stability?
A: Solana Labs and third-party core engineering teams will continue to work on improving the network with a focus on stability, including building a second validator client, implementing local fee markets, and incorporating stake weighted QoS.