Lighthouse: Lessons Learned from Testnet Collapse

Author: Blair Fraser

Translation: A Jian

Source: Ethereum enthusiasts

One testnet fell, and thousands of testnets rose up

A week ago (Note: This article was written on December 17, 2019), we announced the launch of a large public testnet running the Lighthouse client. The testnet started successfully and ran for a week, showing for the first time that an Eth2 testnet with a production-style configuration can run.

When we launched the testnet, we said, "We're going to try to crash this testnet, and I'm confident we can succeed." The testnet did go down, twice: first on Saturday morning and again on Monday morning (both Sydney time). After the first failure (more than 100 epochs went unfinalized), we brought the testnet back; after the second, we decided to shut it down and start over.

Here, "crash" and "hang" mean that the testnet could not finalize epochs. Finalization fails when more than 1/3 of the validators are offline. By design, this testnet does not limp along when something goes wrong; it fails quickly and visibly.

The backbone of this testnet was 4 AWS t2.medium instances (2 vCPU, 4 GB RAM, 32 GB SSD each); each instance served as a public boot node and ran 4096 validators. Frankly, we were surprised they lasted as long as they did: this is a heavy load for a handful of modestly specified machines, and if any two of them went offline, the testnet could no longer finalize.
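The finality arithmetic behind that last point can be sketched in a few lines of Python. This is a simplified illustration, not Lighthouse code: it assumes every validator has equal weight, counts only the 4 x 4096 validators on the boot nodes (ignoring community validators), and treats "at least 2/3 of validators online" as the condition for finality. The function name `can_finalize` is made up for this example.

```python
from fractions import Fraction

VALIDATORS_PER_NODE = 4096  # figure from the article
NODE_COUNT = 4              # figure from the article


def can_finalize(nodes_online: int) -> bool:
    """Return True if the online validators make up at least 2/3 of the total.

    Simplifying assumption: all validators carry equal weight and only the
    validators hosted on the four boot nodes are counted.
    """
    total = NODE_COUNT * VALIDATORS_PER_NODE
    online = nodes_online * VALIDATORS_PER_NODE
    # Exact rational comparison avoids floating-point edge cases at 2/3.
    return Fraction(online, total) >= Fraction(2, 3)


if __name__ == "__main__":
    for n in range(NODE_COUNT, -1, -1):
        print(f"{n} nodes online -> can finalize: {can_finalize(n)}")
```

Under these assumptions, three nodes online (3/4 of validators) is enough, while two nodes online (1/2) is not.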

We analyzed both crashes and learned a lot (details in the sections below). Our team is back at development work and hopes to launch a new testnet next week (it may slip by a few weeks because of the holidays). You can follow our progress on the v0.1.1 milestone page.


Primary cause of the testnet crashes

The direct cause of the testnet crash was a loop in the client's networking component, which kept "seeing" certain attestations being published over and over again. This loop appeared on two of the four main nodes we deployed and exhausted their resources, leaving them unable to produce blocks and attestations. This problem was the direct cause of both crashes.

We have updated our gossipsub implementation so that each message is now addressed by its content: if we receive two messages with identical content, the gossipsub protocol ignores the second. We have also added duplicate-message checks to the Lighthouse client code to prevent duplicates from being sent or received.
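Content-based addressing can be pictured with a minimal Python sketch. It is not the Lighthouse or gossipsub implementation: the class name `SeenCache`, the use of SHA-256 as the message ID, and the size-bounded cache are all assumptions made for this example.

```python
import hashlib
from collections import OrderedDict


class SeenCache:
    """Remembers content-derived message IDs so duplicates can be ignored.

    Hypothetical helper for illustration; real clients typically also
    expire entries by time.
    """

    def __init__(self, capacity: int = 1024) -> None:
        self._capacity = capacity
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    @staticmethod
    def message_id(payload: bytes) -> str:
        # Assumption: the ID is simply the SHA-256 hash of the message content.
        return hashlib.sha256(payload).hexdigest()

    def observe(self, payload: bytes) -> bool:
        """Return True if the payload is new, False if it is a duplicate."""
        mid = self.message_id(payload)
        if mid in self._seen:
            return False
        self._seen[mid] = None
        if len(self._seen) > self._capacity:
            # Drop the oldest ID to keep memory use bounded.
            self._seen.popitem(last=False)
        return True


if __name__ == "__main__":
    cache = SeenCache()
    print(cache.observe(b"attestation-1"))  # first sighting
    print(cache.observe(b"attestation-1"))  # same content again
```

Because the ID depends only on the content, a message that is re-published in a loop maps to an ID that is already in the cache and is dropped instead of being processed and forwarded again.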

Secondary causes of the testnet crashes

Runaway database growth

Once two beacon nodes had gone down, the testnet could no longer finalize (because 50% of the validators were offline). The remaining two nodes kept sending and receiving blocks, which is what we want to see. However, once the network lost finality, they could no longer prune and compact their databases, which then grew by a few GB per hour. Because we had limited the testnet nodes' disks to 32 GB (including the space used by the operating system), the disks eventually filled up with old data and could not accept new blocks. This took the other two nodes offline as well.

In this case, resuming testnet operation was simple: enlarge the disk and restart the node. We are quite happy with this recovery path, because it means that nodes with larger disks were barely affected by the two crashes.

At the time of writing, Michael is working on a fix for this problem, aiming to reduce the size of the database by a factor of 32. Although we are glad that nodes could recover after 100 epochs without finality, as things stand a node with less than 64 GB of disk can survive for only about 10 hours. Resilience is important to the Lighthouse client, and Michael's update extends those 10 hours to about 13 days.
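The jump from 10 hours to roughly 13 days follows directly from the 32-fold reduction, as the short Python calculation below shows. It assumes that survival time scales linearly with the reduction in database growth; the function name `survival_days` is made up for this example.

```python
BASELINE_SURVIVAL_HOURS = 10  # article: about 10 hours before the disk fills
REDUCTION_FACTOR = 32         # article: database size reduced by a factor of 32


def survival_days(baseline_hours: float, reduction_factor: float) -> float:
    """Estimate survival time in days, assuming disk usage grows
    reduction_factor times more slowly than before."""
    return baseline_hours * reduction_factor / 24


if __name__ == "__main__":
    days = survival_days(BASELINE_SURVIVAL_HOURS, REDUCTION_FACTOR)
    print(f"{days:.1f} days")  # 10 h * 32 = 320 h, i.e. about 13.3 days
```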

Fork choice

We also observed that the time taken to run fork choice on the network grew to 8 seconds. In our view this is unacceptable and must be fixed. We found that the problem was caused by loading the beacon state from disk too often, so we have written a PR to address it.
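The article does not say how the PR reduces the disk loads. One common way to avoid repeated reads is a small in-memory cache of recently used states, sketched below in Python. Everything here is hypothetical and is not taken from Lighthouse: the `StateCache` class, the `loader` callback standing in for a disk read, and the least-recently-used eviction policy.

```python
from collections import OrderedDict
from typing import Any, Callable


class StateCache:
    """Least-recently-used cache in front of an expensive state loader.

    Illustrative only; `loader` stands in for reading a beacon state
    from disk.
    """

    def __init__(self, loader: Callable[[str], Any], capacity: int = 4) -> None:
        self._loader = loader
        self._capacity = capacity
        self._entries: "OrderedDict[str, Any]" = OrderedDict()
        self.disk_loads = 0  # counts how often the loader was actually called

    def get(self, state_root: str) -> Any:
        if state_root in self._entries:
            # Cache hit: mark as most recently used and skip the disk read.
            self._entries.move_to_end(state_root)
            return self._entries[state_root]
        state = self._loader(state_root)
        self.disk_loads += 1
        self._entries[state_root] = state
        if len(self._entries) > self._capacity:
            # Evict the least recently used state.
            self._entries.popitem(last=False)
        return state


if __name__ == "__main__":
    cache = StateCache(loader=lambda root: {"root": root}, capacity=2)
    for root in ["a", "a", "a", "b"]:
        cache.get(root)
    print(cache.disk_loads)  # only the first sighting of each root hits the loader
```

The trade-off in a design like this is memory for latency: states that fork choice touches repeatedly stay in memory, and only states not seen recently trigger a read from disk.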

Community feedback

It's great to see people joining the Lighthouse testnet and running their own validators. More than 400 participants took part! Thank you for the feedback. The following suggestions came up repeatedly:

Faster sync times: We are working on this and expect syncing to be 1.5 to 2 times faster in version 0.1.1.

Better docker documentation: Scott is improving these docs, and the new testnet will be deployed with docker (i.e., we will use docker ourselves).

More stable eth1 node: We provided a public eth1 node for users' convenience, but it turned out that this node also caused some validators to go down. For the next testnet, we will deploy a small number of nodes in different regions and load-balance across them.

More API endpoints: A team building a block explorer contacted us asking for more API endpoints. We have submitted a PR, which is expected to be merged in version 0.1.1.