Why did Ethereum have two brief outages in a row? An analysis of the causes of the incident.
A brief analysis of the causes of two consecutive Ethereum outages.
Author: imToken Labs
Overview
On the nights of May 11th and 12th, the Ethereum consensus layer briefly behaved abnormally. imToken's analysis attributes the anomaly mainly to high load on nodes running certain Ethereum consensus layer clients, which took their Validators offline. As a result, Epoch votes could not reach the 2/3 threshold and the consensus layer could not confirm finality. The Ethereum network recovered on its own shortly afterwards, which shows that the Ethereum PoS consensus algorithm is resilient and self-healing.
Event and Background
Under normal circumstances, the state of the Ethereum PoS consensus network is confirmed (Finalized) within 2 Epochs, but Epoch confirmation was delayed twice last week.
The first delay occurred on May 11th, and Epoch confirmation was delayed by 3 Epochs, about 20 minutes.
The second delay occurred on May 12th, and Epoch confirmation was delayed by 8 Epochs, about 51 minutes.
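As a quick check of the durations quoted above, the sketch below converts Epochs to minutes. It assumes the Ethereum mainnet parameters of 12 seconds per slot and 32 slots per Epoch; the function name `epochs_to_minutes` is only illustrative.

```python
# Mainnet consensus parameters (assumed here; they are not stated in this article).
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32


def epochs_to_minutes(epochs: int) -> float:
    """Return the wall-clock duration of the given number of Epochs, in minutes."""
    return epochs * SLOTS_PER_EPOCH * SECONDS_PER_SLOT / 60


# The two delays reported for May 11th and May 12th.
for delay in (3, 8):
    print(f"{delay} Epochs = {epochs_to_minutes(delay):.1f} minutes")
```

Under these assumptions one Epoch lasts 6.4 minutes, so 3 Epochs come to 19.2 minutes and 8 Epochs to 51.2 minutes, which matches the rounded figures in the text.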
During the event, the Ethereum network continued to produce blocks and process transactions. However, because the voting rate of Validators was insufficient, the affected Epochs could not be confirmed (that is, they did not receive the finality guarantee of the Ethereum PoS network consensus). An unconfirmed Epoch means that if most Validators behaved maliciously and caused a fork, the Epoch could be rolled back, and its transactions with it.
In practice, there was no fork in the Ethereum network during the event, and Validators did not engage in malicious voting. The Epochs could not be confirmed only because a large number of Validators went offline, which left the voting rate too low.
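The 2/3 voting threshold mentioned above can be sketched as a simple stake-weighted comparison. This is a minimal illustration, not client code: the function name is hypothetical, and the balances in the example are made-up numbers rather than measurements from the incident.

```python
def has_supermajority(attesting_balance: int, total_active_balance: int) -> bool:
    """Return True if the attesting stake is at least 2/3 of the total active stake.

    Integer cross-multiplication avoids floating-point rounding.
    """
    return attesting_balance * 3 >= total_active_balance * 2


# Hypothetical balances (arbitrary units), for illustration only.
total = 900
print(has_supermajority(600, total))  # exactly 2/3 of the stake voted
print(has_supermajority(599, total))  # just below 2/3 of the stake voted
```

When enough Validators are offline that the attesting stake drops below this threshold, the Epoch cannot be justified and therefore cannot be finalized, even though blocks are still being produced.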
Observations indicate that the offline Validators experienced abnormal CPU overload, which is believed to be the direct cause of their going offline.
In the second event, Epoch confirmation was delayed by 8 Epochs. Because this delay was greater than MIN_EPOCHS_TO_INACTIVITY_PENALTY (=4), it triggered the Inactivity Leak mechanism of the Ethereum consensus algorithm, with the following effects:
- Offline validators were penalized by reducing their staked funds; around 28 ETH was confiscated.
- Attestation rewards were cancelled, so around 50 ETH was not issued.
- This mechanism ensures that online validators ultimately control 2/3 of the total staked funds of Ethereum, thus allowing the network state to be finalized.
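The trigger condition for the Inactivity Leak can be sketched as follows. This is a simplified illustration modeled on the finality-delay check in the consensus specification; the function signatures take plain epoch numbers instead of a full beacon state, and the epoch numbers in the example are hypothetical, not taken from the incident.

```python
MIN_EPOCHS_TO_INACTIVITY_PENALTY = 4


def finality_delay(previous_epoch: int, finalized_epoch: int) -> int:
    """Number of Epochs between the previous Epoch and the last finalized Epoch."""
    return previous_epoch - finalized_epoch


def is_in_inactivity_leak(previous_epoch: int, finalized_epoch: int) -> bool:
    """The leak is active once the finality delay exceeds the threshold."""
    return finality_delay(previous_epoch, finalized_epoch) > MIN_EPOCHS_TO_INACTIVITY_PENALTY


# Hypothetical epoch numbers, for illustration only.
print(is_in_inactivity_leak(previous_epoch=100, finalized_epoch=98))  # healthy network
print(is_in_inactivity_leak(previous_epoch=100, finalized_epoch=92))  # finality stalled
```

The comparison is strict, so a delay of exactly 4 Epochs does not yet trigger the leak in this sketch.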
imToken’s node service also detected this incident by monitoring the voting of Ethereum’s consensus layer validators in real-time, thus providing an early warning of the abnormality of the Ethereum consensus network before the epoch could be finalized. The following image shows the node status when the first incident occurred.
Under the PoW mechanism, a transaction is considered successful once enough consecutive blocks have been built on top of it that a rollback is unlikely, whereas PoS uses the block height returned by the Safe Head as the criterion for a successful transaction. Currently, the specification treats the Justified Checkpoint as the Safe Head, so the judgment is based on the previous Epoch's state and may lag by up to 6.4 minutes, which is a poor user experience.
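For reference, execution clients expose the specification's Safe Head through the standard `safe` block tag of the `eth_getBlockByNumber` JSON-RPC method, alongside `latest` and `finalized`. The sketch below only builds the request bodies and parses responses; it does not contact a node, the helper names are hypothetical, and the sample responses are made up. It illustrates the standard tag, not imToken's own Safe Head service.

```python
import json


def block_by_tag_request(tag: str, request_id: int = 1) -> str:
    """Build an eth_getBlockByNumber request body for a block tag."""
    if tag not in ("latest", "safe", "finalized"):
        raise ValueError(f"unsupported block tag: {tag}")
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getBlockByNumber",
        "params": [tag, False],  # False: return transaction hashes only
    }
    return json.dumps(payload)


def block_number_from_response(response: dict) -> int:
    """Extract the block height from a JSON-RPC response (hex-encoded quantity)."""
    return int(response["result"]["number"], 16)


def confirmation_lag(latest_response: dict, safe_response: dict) -> int:
    """Number of blocks by which the safe block trails the latest block."""
    return block_number_from_response(latest_response) - block_number_from_response(safe_response)


# Made-up responses, for illustration only.
latest = {"jsonrpc": "2.0", "id": 1, "result": {"number": "0x64"}}
safe = {"jsonrpc": "2.0", "id": 2, "result": {"number": "0x44"}}
print(block_by_tag_request("safe"))
print(confirmation_lag(latest, safe))
```

An application that waits for the `safe` block accepts this lag in exchange for the stronger guarantee; when Epochs stop being justified, the lag keeps growing until voting recovers.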
imToken’s self-developed Safe Head service calculates secure blocks for transaction confirmation based on real-time Ethereum consensus data, shortening the time for transaction confirmation while ensuring user safety. Under normal circumstances, the block height returned by imToken’s Safe Head algorithm (yellow in the above figure) will be very close to the latest block height (green), thus improving the user experience.
More information about the Safe Head mechanism:
- Ethereum: Introduction to the Safe Head Mechanism (Part 1)
- Ethereum: Introduction to the Safe Head Mechanism (Part 2)
Cause Analysis
The direct cause of the incidents above is that nodes running several Ethereum consensus layer clients were overloaded, so their validators went offline and could not vote normally. Our analysis found that these nodes were overloaded for the following reasons:
When receiving attestations pointing to outdated blocks, nodes need to recalculate the beacon chain state to verify these attestations, a process that consumes a lot of CPU and memory resources.
When a large number of attestations pointing to outdated blocks are received simultaneously, the node’s CPU and memory resources are exhausted, causing these validators to go offline.
This type of problem can be mitigated by caching the recomputed state, keyed by the block that the attestations point to. However, because the number of Validators had grown and a large number of such attestations arrived at once, the cache in the affected client implementations was overwhelmed, and the nodes had to spend a lot of resources recalculating the beacon chain state.
Additionally, the second incident triggered the Inactivity Leak mechanism, which is mainly designed to ensure that Ethereum can still regain finality in extreme situations (when a large number of validators are offline for a long time).
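The caching idea described above, and the way it breaks down under load, can be illustrated with a small LRU cache. This is a toy model under stated assumptions, not any client's actual implementation: the expensive state replay is replaced by a counter, and all names and sizes are hypothetical.

```python
from functools import lru_cache


def count_state_replays(target_roots: list[str], cache_size: int) -> int:
    """Process attestations in order and count how many state replays are needed.

    Each attestation points to a block root. A replay is only required when
    the state for that root is not already in the LRU cache.
    """
    replays = 0

    @lru_cache(maxsize=cache_size)
    def state_for_block(block_root: str) -> str:
        nonlocal replays
        replays += 1  # stands in for the CPU- and memory-heavy state recalculation
        return f"state@{block_root}"

    for root in target_roots:
        state_for_block(root)
    return replays


# 1000 attestations that all point to the same outdated block: one replay.
same_target = ["root0"] * 1000
print(count_state_replays(same_target, cache_size=4))

# 1000 attestations cycling over 8 outdated blocks with room for only 4 states:
# every lookup evicts an entry that is needed again shortly after.
many_targets = [f"root{i % 8}" for i in range(1000)]
print(count_state_replays(many_targets, cache_size=4))
```

In this toy model the first case needs a single replay, while the second needs one replay per attestation because the working set no longer fits in the cache. That is the general shape of the failure described here: once the number of distinct outdated targets exceeds what the cache can hold, the cache stops helping.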
Implications for Ethereum Applications
Although the Ethereum network is robust, occasional instability can still affect applications, so applications must handle these unstable scenarios correctly.
- The deposit time from Layer1 to Layer2 will become longer. An important prerequisite for Layer2 when minting is to ensure that L1 deposit transactions will not be rolled back. Therefore, when the Ethereum network Epoch is delayed, the deposit time from L1 to L2 will also be correspondingly prolonged.
- Similarly, exchanges also need to prevent the rollback of on-chain deposit transactions, so their deposit time will also be correspondingly prolonged.
- On-chain oracle quotes are at risk of rollback, so high-value services that rely on them should be suspended as appropriate.
- In this event, Uniswap did not display the balance and could only buy but not sell, while dYdX suspended deposits.
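The deposit-handling points above can be sketched as a simple policy: only credit a deposit once the block that contains it is finalized. This is a minimal sketch under that assumption; the `Deposit` type, the function name, and the sample values are hypothetical and do not describe any particular exchange or Layer2.

```python
from dataclasses import dataclass


@dataclass
class Deposit:
    tx_hash: str
    block_number: int


def creditable_deposits(deposits: list[Deposit], finalized_block: int) -> list[Deposit]:
    """Return the deposits whose containing block is at or below the finalized height."""
    return [d for d in deposits if d.block_number <= finalized_block]


# Made-up deposits, for illustration only.
pending = [Deposit("0xaaa", 100), Deposit("0xbbb", 130), Deposit("0xccc", 160)]

# While finality is stalled the finalized height stops advancing,
# so newer deposits simply wait instead of being credited early.
print([d.tx_hash for d in creditable_deposits(pending, finalized_block=128)])
```

With this policy a finality delay lengthens the deposit time, as described above, but never exposes the service to a rolled-back deposit.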
Summary
This event demonstrated the resilience and self-healing ability of Ethereum's PoS consensus algorithm, as well as the client teams' timely response and fixes after the incident. The Ethereum ecosystem as a whole still needs to keep investing in the following areas: increasing client diversity, improving real-time monitoring and early warning of network status, deepening user education (for practitioners as well as ordinary users), and preparing emergency plans for ecosystem participants when the network behaves abnormally.
Reference links
- Finality issue updates May 2023
- https://twitter.com/robplust/status/1657044364382846978
- https://twitter.com/superphiz/status/1656780594326405121
- https://twitter.com/terencechain/status/1657021042110631936