DHT, the distributed hash table, is a key technology for distributed storage and retrieval, and is now widely used in P2P networks. To understand how a distributed hash table is implemented, you first need to know what a hash algorithm and a hash table are.
A hash algorithm is, simply put, a function with some special properties:
1. The output has a fixed length, no matter how long or complex the input is.
2. Even the smallest change to the input produces a completely different output.
3. The original input cannot be recovered from the hash value; the function is one-way (irreversible).
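These three properties are easy to see in practice. A small sketch using Python's standard `hashlib` (the inputs here are arbitrary examples):

```python
import hashlib

# Hash two inputs that differ by only a single character.
h1 = hashlib.sha256(b"hello world").hexdigest()
h2 = hashlib.sha256(b"hello worle").hexdigest()

print(h1)
print(h2)

# Property 1: fixed output length (SHA-256 -> 64 hex characters).
assert len(h1) == len(h2) == 64
# Property 2: a tiny input change yields a completely different output.
assert h1 != h2
```

No matter how large the input file is, the SHA-256 digest is always 256 bits; and there is no known way to recover the input from the digest (property 3).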
A hash table, also known as a hash map, stores key-value pairs. When you hash a file, you need to store not only the file's hash but also the file itself, so files and their hashes are stored in pairs for easy lookup.
For an ordinary hash table, resizing is expensive, because the bucket address is computed as
Hash(Key) % M. After the table is resized, M changes, so elements land in different buckets. When the table is doubled, each element either stays in place or moves to exactly one new position (its old index plus M), which keeps the rehashing to a minimum. This is why hash tables are conventionally grown to twice their original size: it reduces the movement of elements during expansion.
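The effect of the resize factor is easy to measure. A sketch (using integer keys directly as their own hash values, an assumption for simplicity) counting how many of 100 keys change bucket when a table addressed by `key % M` grows:

```python
# Count how many of 100 keys change bucket when a table addressed by
# key % M is resized. Doubling moves about half the keys; an arbitrary
# new size moves almost all of them.
keys = range(100)

moved_double = sum(1 for k in keys if k % 8 != k % 16)   # M: 8 -> 16 (doubled)
moved_other  = sum(1 for k in keys if k % 8 != k % 13)   # M: 8 -> 13 (arbitrary)

print(moved_double, "keys move when doubling")      # 48 of 100
print(moved_other, "keys move with an odd resize")  # 92 of 100
```

Doubling bounds the damage, but even so, roughly half the entries must be rehashed on every expansion.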
So the essence of a hash table lookup is taking a hash value and mapping it to a location. But as we have seen, whenever a new node joins or an old node leaves, the key-value pairs must be redistributed. When the amount of data is large, this easily leads to network congestion.
To solve the problem of node churn, Karger et al. at MIT invented consistent hashing in 1997, which truly allowed distributed storage to enter a stage where it could scale.
What is consistent hashing?
First, organize the hash space into a virtual ring. Take SHA-256 as an example: it has 2^256 possible hash values. Arrange all of them on a circle from 0 to 2^256 − 1, sorted clockwise, with 0 and 2^256 − 1 meeting at the 12 o'clock position.
Next, hash the server that each node uses for storage. You can hash the server's IP address or its hostname (using the IP address avoids hostname collisions, which matters here), and place each node's server at the corresponding position on the ring.
Finally, compute the hash of the data's key with the same algorithm to determine the data's position on the ring. From that position, walk clockwise along the ring; the first server you encounter is the one the data should be stored on.
This way, when a node stops working for some reason, only the data between it and the previous node in the counterclockwise direction is affected. Likewise, when a new node is added, only the data between it and the previous node in the counterclockwise direction is affected.
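The three steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the node addresses are made-up examples, and `bisect` performs the clockwise walk:

```python
import hashlib
from bisect import bisect_right

def h(s: str) -> int:
    """Position on the ring: the SHA-256 value of the string."""
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

# Step 2: hash each server's IP address and place it on the ring.
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ring = sorted((h(n), n) for n in nodes)

def locate(key: str) -> str:
    """Step 3: hash the key, then walk clockwise to the first server."""
    positions = [p for p, _ in ring]
    i = bisect_right(positions, h(key)) % len(ring)  # wrap past 12 o'clock
    return ring[i][1]

owner = locate("my-file.txt")
print("my-file.txt belongs to", owner)
```

If one server is removed from `ring`, only the keys that previously landed on it are re-routed (to its clockwise successor); every other key's owner is unchanged.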
Does plain consistent hashing have drawbacks? Of course. In short: because it does not account for the uneven placement of nodes on the ring or the differing capacities of machines, it cannot achieve good load balancing on its own.
Is there a solution? Yes: virtual nodes. For example, consider a hash ring with only two nodes:
As you can see, this distribution may well be uneven. To achieve load balancing without adding physical machines, we append a suffix to each node's IP (for example, #1, #2), hash again, and obtain Node 1.1, Node 1.2, Node 2.1, and Node 2.2, which produces the following situation:
When these virtual nodes join, the distribution of data naturally changes. Of course, each group of virtual nodes corresponds to a single physical machine, and data is ultimately stored on that real machine; the result is that the load is balanced as evenly as possible.
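The virtual-node trick is a small extension of the ring sketch: each physical node is hashed many times with a suffix (#1, #2, ...), and every virtual point maps back to its physical owner. The addresses and replica count here are illustrative assumptions:

```python
import hashlib
from bisect import bisect_right

def h(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

REPLICAS = 100                      # virtual points per physical node
physical = ["10.0.0.1", "10.0.0.2"]

# Each virtual point "ip#i" lands somewhere on the ring but maps back
# to its physical node.
ring = sorted(
    (h(f"{node}#{i}"), node)
    for node in physical
    for i in range(1, REPLICAS + 1)
)

def locate(key: str) -> str:
    positions = [p for p, _ in ring]
    i = bisect_right(positions, h(key)) % len(ring)
    return ring[i][1]

# With many virtual points, the load spreads close to evenly.
counts = {n: 0 for n in physical}
for k in range(1000):
    counts[locate(f"key-{k}")] += 1
print(counts)
```

With only two physical points, one node could easily own most of the ring; with 100 virtual points each, the 1000 test keys split close to 500/500.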
DHT is a very important part of IPFS's underlying technology, and the reason IPFS prefers nodes with fixed public IP addresses is that a fixed IP does not change its position on the hash ring, and therefore does not cause the extra network load that node churn brings. This is one reason why such nodes' mining revenues are higher and more stable.
This concludes our introduction to the basic technology behind the distributed hash table, DHT.
(Author: white area)