Lighthouse: Lessons Learned from Testnet Collapse

Author: Blair Fraser

Translation: A Jian

Source: Ethereum enthusiasts

One testnet falls; thousands more stand up

A week ago (note: this article was written on December 17, 2019), we announced the launch of a large public testnet using the Lighthouse client. The testnet started successfully and ran for a week, proving for the first time that an Eth2 testnet with a production-scale configuration can run.

When we launched the testnet, we said: "We're going to try to crash this testnet, and I'm confident we can succeed." The testnet did indeed crash, twice: the first time on Saturday morning and the second on Monday morning (both Sydney time). After the first crash (more than 100 epochs went unfinalized), we managed to bring the testnet back; after the second, we decided to shut it down and start over.

The "crash" and "hang up" mentioned here mean that the testnet cannot determine the epoch. The reason why the epoch cannot be determined is that more than 1/3 of the validators are offline. In design, this testnet will not stand dead when it encounters problems, but will quickly and clearly show failure.

The backbone of this testnet was 4 AWS t2.medium instances (2 vCPUs, 4 GB RAM, 32 GB SSD each); each instance served as a public boot node and hosted 4,096 validators. Frankly, we were surprised they lasted as long as they did; for a handful of machines with modest hardware, this is a heavy load, and if just two of them go offline, the testnet can no longer finalize.
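To make that finality threshold concrete, here is a minimal Rust sketch using the validator counts above. The `can_finalize` helper is purely illustrative (it is not Lighthouse code); it just encodes the Casper FFG requirement that finality needs attestations from more than 2/3 of the validator set, which is why losing two of the four nodes (50% participation) halts finality:

```rust
/// Illustrative check of the 2/3 finality threshold described above.
/// Not Lighthouse's actual finality code; the names here are made up.
fn can_finalize(active_validators: u64, total_validators: u64) -> bool {
    // Casper FFG needs attestations from more than 2/3 of all validators.
    3 * active_validators > 2 * total_validators
}

fn main() {
    let per_node = 4096;
    let total = 4 * per_node; // 16,384 validators across the four boot nodes

    // All four nodes online: 100% participation, finality proceeds.
    assert!(can_finalize(4 * per_node, total));

    // One node offline: 75% participation, still above 2/3.
    assert!(can_finalize(3 * per_node, total));

    // Two nodes offline: only 50% of validators attest, below the threshold.
    assert!(!can_finalize(2 * per_node, total));
}
```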

We analyzed the two crashes and learned a lot (details in the sections below). The team is back at development and hopes to release a new testnet next week (it may slip by a few weeks, since the work may be affected by the holidays). You can follow our progress on the v0.1.1 milestone page.

Lessons learned

Primary cause of the testnet crashes

The direct cause of the testnet's failures was a loop in the client's networking components that caused certain attestations to be "seen" and re-published over and over. This loop appeared on two of the four boot nodes we deployed, exhausting their resources and leaving them unable to produce blocks or attestations; it was the direct cause of both crashes.

We have updated our gossipsub implementation so that messages are now addressed by their content: if we receive two messages with identical content, gossipsub ignores the second one. We have also added duplicate-message checks to the Lighthouse client code to prevent sending and receiving the same message twice.
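As a rough illustration of the idea (not the actual gossipsub or Lighthouse code), the sketch below derives a message ID from the message bytes and uses it to recognise and drop re-published copies. A real implementation would use a cryptographic hash and a bounded (e.g. time-windowed) cache; std's `DefaultHasher` and an unbounded set are used here only to keep the example self-contained:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Content-based message ID: two messages with identical bytes map to the
/// same ID. (Real gossipsub implementations typically use a cryptographic
/// hash; DefaultHasher keeps this sketch dependency-free.)
fn message_id(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// Naive duplicate filter; a production client would bound this cache.
struct SeenCache {
    seen: HashSet<u64>,
}

impl SeenCache {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true if the message is new and should be processed/forwarded,
    /// false if it is a duplicate and should be dropped.
    fn observe(&mut self, data: &[u8]) -> bool {
        self.seen.insert(message_id(data))
    }
}

fn main() {
    let mut cache = SeenCache::new();
    let attestation = b"attestation bytes";
    assert!(cache.observe(attestation));  // first copy: forward it
    assert!(!cache.observe(attestation)); // re-published copy: drop it
}
```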

Secondary causes of the testnet crashes

Database size skyrocketed

After the two beacon nodes went down, the testnet could no longer finalize (50% of the validators were offline). The remaining two nodes kept sending and receiving blocks, which is exactly what we want to see. However, once the network lost finality they could no longer prune and compress their databases, which then grew by several GB per hour. Because we had limited the testnet nodes' disks to 32 GB (including the space used by the operating system), the disks eventually filled up with old data and could no longer accept new blocks. That knocked the other two nodes offline as well.
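The sketch below illustrates the mechanism with hypothetical types (`BeaconStore` and `bytes_per_epoch` are made up for this example, and the per-epoch size is an arbitrary illustrative number): pruning can only drop data behind the finalized checkpoint, so once finality stalls, nothing ever becomes prunable and disk usage grows without bound:

```rust
use std::collections::BTreeMap;

/// Hypothetical store layout; not Lighthouse's actual database code.
struct BeaconStore {
    /// epoch -> approximate bytes stored for that epoch's blocks and states
    bytes_per_epoch: BTreeMap<u64, u64>,
    finalized_epoch: u64,
}

impl BeaconStore {
    fn prune(&mut self) {
        // Only data strictly before the finalized checkpoint is safe to drop.
        let finalized = self.finalized_epoch;
        self.bytes_per_epoch.retain(|epoch, _| *epoch >= finalized);
    }

    fn disk_usage(&self) -> u64 {
        self.bytes_per_epoch.values().sum()
    }
}

fn main() {
    // Suppose finality stalled at epoch 10 while the chain kept advancing.
    let mut store = BeaconStore {
        bytes_per_epoch: (0..100u64).map(|e| (e, 50_000_000)).collect(),
        finalized_epoch: 10,
    };
    store.prune();
    // Epochs 10..100 must all be kept: pruning recovers almost nothing.
    println!("bytes still on disk: {}", store.disk_usage());
}
```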

Recovering from this failure is also very simple: enlarge the disk and restart the node. We are quite happy with that recovery path, since it means nodes with larger disks were barely affected by the two crashes.

At the time of writing, Michael is working on a fix that should shrink the database by roughly 32x. While we were glad to see nodes recover after more than 100 epochs without finality, the current behaviour means a node with less than 64 GB of disk can only survive roughly 10 hours of non-finality. Resilience matters to the Lighthouse client, and Michael's update stretches those 10 hours to about 13 days.
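The 13-day figure follows directly from the 32x reduction; a quick back-of-the-envelope check (the growth rate is implied by the numbers above rather than a measured constant):

```rust
fn main() {
    let survival_hours_now = 10.0_f64; // ~10 hours on a <64 GB disk today
    let reduction_factor = 32.0;       // planned database size reduction
    let survival_days_after = survival_hours_now * reduction_factor / 24.0;
    // Prints ~13.3 days, matching the "about 13 days" above.
    println!("~{:.1} days of non-finality before the disk fills", survival_days_after);
}
```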

Fork choice

We also observed fork-choice times on the network stretching to 8 seconds. In our view that is unacceptable and must be addressed. We found that the problem was caused by loading the beacon state from disk too often, and we have written a PR to fix it.
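One common way to attack this kind of problem is to keep recently used states in memory so that fork choice does not hit the disk on every run. The sketch below shows that general idea with hypothetical types (`BeaconState`, `StateCache`); it is an illustration of the technique, not the contents of the actual PR:

```rust
use std::collections::HashMap;

/// Illustrative stand-in; Lighthouse's real BeaconState has many more fields.
#[derive(Clone)]
struct BeaconState {
    slot: u64,
}

/// Hypothetical in-memory cache in front of the on-disk store, keyed by
/// state root, so repeated fork-choice runs reuse already-loaded states.
struct StateCache {
    states: HashMap<[u8; 32], BeaconState>,
}

impl StateCache {
    fn new() -> Self {
        Self { states: HashMap::new() }
    }

    /// Return the cached state if present, otherwise fall back to the
    /// (slow) disk load and remember the result for next time.
    fn get_or_load<F>(&mut self, root: [u8; 32], load_from_disk: F) -> BeaconState
    where
        F: FnOnce() -> BeaconState,
    {
        self.states.entry(root).or_insert_with(load_from_disk).clone()
    }
}

fn main() {
    let mut cache = StateCache::new();
    let root = [0u8; 32];

    // First access hits "disk"; later accesses are served from memory.
    let _ = cache.get_or_load(root, || BeaconState { slot: 42 });
    let state = cache.get_or_load(root, || unreachable!("should be cached"));
    assert_eq!(state.slot, 42);
}
```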

Community feedback

It has been great to see people joining the Lighthouse testnet and running their own validators; more than 400 participants took part! Thanks to everyone for the feedback. The following suggestions came up repeatedly:

Faster sync times: We are working on it, and we expect syncing to be 1.5x to 2x faster in v0.1.1.

Better Docker documentation: Scott is improving these docs, and we will deploy the new testnet with Docker ourselves (i.e., we will be dog-fooding our own Docker setup).

A more stable eth1 node: We provide a public eth1 node for users' convenience, but it turned out that this node also caused some validators to go down. For the next testnet we will deploy a handful of eth1 nodes in different regions and load-balance across them.

More API endpoints: The beaconcha.in team contacted us asking for more API endpoints for their block explorer. We have submitted a PR that is expected to be merged in v0.1.1.

