Lighthouse: Lessons Learned from Testnet Collapse

Author: Blair Fraser

Translation: A Jian

Source: Ethereum enthusiasts

One testnet falls, and thousands of testnets rise

A week ago (note: this article was written on December 17, 2019), we announced the launch of a large public testnet running the Lighthouse client. The testnet started successfully and ran for a week, demonstrating for the first time that an Eth2 testnet with a production configuration can run.

When we launched the testnet, we said, "We're going to try to crash this testnet, and I'm confident we can succeed." The testnet did crash, twice: first on Saturday morning and again on Monday morning (Sydney time). After the first crash (more than 100 epochs without finality), we managed to bring the testnet back; after the second, we decided to shut it down and start over.

The "crash" mentioned here means that the testnet could not finalize epochs. Finalization failed because more than 1/3 of the validators were offline. By design, this testnet does not limp along when it runs into trouble; it fails quickly and visibly.

The backbone of this testnet was four AWS t2.medium instances (2 vCPU, 4 GB RAM, 32 GB SSD); each instance served as a public boot node and ran 4096 validators. Frankly, we are surprised they lasted so long: this is a heavy load for a handful of machines with modest hardware, and if just two of them go offline, the testnet can no longer finalize.
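The arithmetic behind this fragility can be sketched in a few lines. The snippet below is a simplified illustration: it assumes every validator carries equal weight, that finality needs at least two thirds of validators online, and it ignores validators run by community members. The function name `can_finalize` is hypothetical and not part of Lighthouse.

```python
from fractions import Fraction

NODES = 4
VALIDATORS_PER_NODE = 4096
# Assumption: equal-weight validators and a 2/3 supermajority for finality.
FINALITY_THRESHOLD = Fraction(2, 3)


def can_finalize(nodes_online: int) -> bool:
    """Return True if the validators on the online nodes reach the threshold."""
    online = nodes_online * VALIDATORS_PER_NODE
    total = NODES * VALIDATORS_PER_NODE
    return Fraction(online, total) >= FINALITY_THRESHOLD


for n in range(NODES, -1, -1):
    print(f"{n} nodes online: finalizes = {can_finalize(n)}")
```

With three nodes online, 75% of these validators are active and the threshold is met; with two, only 50% are active and it is not.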

We analyzed both crashes and learned a lot (details in the sections below). Our team has returned to development work and hopes to release a new testnet next week (or possibly in the next few weeks, since the holidays may affect our work). You can follow our progress on the v0.1.1 milestone page.

Lessons learned

Primary cause of the testnet crashes

The direct cause of the first testnet crash was a loop in the client's networking component, in which certain attestations were "seen" as being published over and over. The loop appeared on two of the four main nodes we deployed and exhausted their resources, leaving them unable to produce blocks and attestations. The same problem was the direct cause of the second crash as well.

We have updated our gossipsub implementation so that each message is addressed by its content: if we receive two messages with identical content, the gossipsub protocol ignores the second. We have also added duplicate-message checks to the Lighthouse client code to avoid sending or receiving duplicates.
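The idea of content-addressed deduplication can be illustrated with a minimal sketch. This is not Lighthouse's code (Lighthouse is written in Rust); the use of SHA-256 as the ID function and the names `message_id` and `SeenCache` are assumptions made for the example.

```python
import hashlib


def message_id(payload: bytes) -> bytes:
    """Derive a message ID from the payload alone (content addressing)."""
    # Assumption: SHA-256 stands in for whatever hash the real protocol uses.
    return hashlib.sha256(payload).digest()


class SeenCache:
    """Remembers IDs of messages already handled so duplicates can be dropped."""

    def __init__(self):
        self._seen = set()

    def accept(self, payload: bytes) -> bool:
        """Return True the first time a payload is seen, False for repeats."""
        mid = message_id(payload)
        if mid in self._seen:
            return False
        self._seen.add(mid)
        return True


cache = SeenCache()
attestation = b"attestation for slot 42"  # placeholder bytes, not a real encoding
print(cache.accept(attestation))  # True: the first copy is processed
print(cache.accept(attestation))  # False: identical content is ignored
```

Because the ID depends only on the content, a message that loops back through the network maps to the same ID and is dropped. A real cache would also need to expire old entries, which this sketch omits.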

Secondary causes of the testnet crashes

Data volume skyrocketed

Once two beacon nodes went down, the testnet could no longer finalize blocks (because 50% of the validators were offline). The remaining two nodes kept sending and receiving blocks, which is what we want to see. However, after the network lost finality, they could no longer prune and compact their databases, which then grew by several GB per hour. Because we had limited the disks of the testnet nodes to 32 GB (including the space taken by the operating system), the disks eventually filled up with old data and could not accept new blocks. This took the other two nodes offline as well.

In this situation, bringing the testnet back is straightforward: enlarge the disk and restart the node. We are quite satisfied with this recovery path, because it means that nodes with larger disks would have been barely affected by the two crashes.

At the time of writing, Michael is working on a solution that aims to shrink the database by a factor of 32. Although we are glad that the nodes could recover after more than 100 epochs without finality, as things stand a node with a disk smaller than 64 GB has only about 10 hours of survival time. Resilience matters to the Lighthouse client, and Michael's update would extend those 10 hours to about 13 days.
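The jump from 10 hours to 13 days follows directly from the 32-fold reduction. The short calculation below checks the figures, assuming that survival time scales linearly with the reduction in database size.

```python
HOURS_PER_DAY = 24

survival_hours_before = 10  # from the text: about 10 hours on a disk under 64 GB
size_reduction_factor = 32  # from the text: database shrunk by a factor of 32

# Assumption: time until the disk fills scales linearly with the reduction.
survival_hours_after = survival_hours_before * size_reduction_factor
survival_days_after = survival_hours_after / HOURS_PER_DAY

print(f"{survival_hours_after} hours = {survival_days_after:.1f} days")
# prints "320 hours = 13.3 days"
```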

Fork choice

We also observed that fork choice run times on the network grew to 8 seconds. In our view, this is unacceptable and must be addressed. We traced the problem to excessive loading of the beacon state from disk, and we have written a PR to fix it.

Community feedback

It's great to see people joining the Lighthouse testnet and running their own validators. More than 400 participants took part in our testnet! Thank you for your feedback. The following suggestions came up repeatedly:

Faster sync times: We are working on it, and we expect syncing to be 1.5 to 2 times faster in version 0.1.1.

Better Docker documentation: Scott is improving these documents, and the new testnet will be deployed with Docker (i.e., we will use Docker ourselves).

A more stable eth1 node: We provided a public eth1 node for users' convenience, but it turned out that this node also caused some validators to go down. For the next testnet, we will deploy a small number of nodes in different regions and load-balance across them.

More API endpoints: The beaconcha.in team contacted us asking for more API endpoints for their block explorer. We have submitted a PR, which is expected to be merged in version 0.1.1.
