Unprecedented demand for AI computing power, what is the role of Web3?

Key points:

  • There are currently two major directions for combining AI and Crypto: distributed computing power and ZKML. This article analyzes and reflects on decentralized distributed computing power networks.
  • With the rise of large AI models, computing power will be the battleground of the next decade and one of the most important resources of future society, and the competition will not stop at the commercial level: compute is becoming a strategic resource in great-power rivalry. Investment in high-performance computing infrastructure and compute reserves will grow exponentially.
  • Demand for decentralized distributed computing power networks is greatest in large-model training, but that is also where the challenges and technical bottlenecks are greatest, including complex data synchronization and network optimization problems. Data privacy and security are further constraints. Some existing techniques offer preliminary solutions, but their enormous computational and communication overheads keep them from being applicable to large-scale distributed training.
  • Decentralized distributed computing power networks have a better chance of landing in model inference, and the expected incremental market is large enough. But they still face challenges such as communication latency, data privacy, and model security. Compared with training, inference has lower computational complexity and weaker data interaction, making it better suited to distributed environments.
  • Using the two startups Together and Gensyn.ai as cases, this article explains the overall research directions and concrete approaches of decentralized distributed computing power networks from the perspectives of technical optimization and incentive-layer design.

I. Distributed Computing Power – Large Model Training

When discussing the application of distributed computing power to training, we generally focus on the training of large language models. The main reason is that training small models does not require much computing power, and once data privacy and a pile of engineering problems are factored in, distributing the work is not cost-effective; it is better handled centrally. Large language models, by contrast, have enormous compute requirements, and that demand is now in the initial stage of an explosion. From 2012 to 2018, the compute used in AI training doubled roughly every four months. Large models are now the concentration point of compute demand, and it is safe to predict that they will remain a huge source of incremental demand over the next 5 to 8 years.

While the opportunities are huge, we also need to be clear-eyed about the problems. Everyone knows the scenario is enormous, but where exactly are the challenges? The ability to target these problems, rather than blindly entering the market, is the core criterion for judging the best projects in this field.

(NVIDIA NeMo Megatron Framework)

1. Overall training process

Take the training of a large model with 175 billion parameters as an example. Because the model is so large, it must be trained in parallel on many GPU devices. Suppose there is a centralized data center with 100 GPUs, each with 32 GB of memory.

  • Data preparation: First, a huge dataset is needed, which includes various data such as internet information, news, books, etc. Before training, these data need to be preprocessed, including text cleaning, tokenization, and vocabulary construction.

  • Data segmentation: The processed data will be divided into multiple batches for parallel processing on multiple GPUs. Suppose the selected batch size is 512, that is, each batch contains 512 text sequences. Then, we divide the entire dataset into multiple batches to form a batch queue.

  • Inter-device data transmission: At the beginning of each training step, the CPU takes out a batch from the batch queue and then sends the data of this batch to the GPU through the PCIe bus. Assuming that the average length of each text sequence is 1024 tokens, the data size of each batch is about 2MB (assuming that each token is represented by a single-precision floating-point number of 4 bytes). This data transmission process usually takes only a few milliseconds.

  • Parallel training: After receiving the data, each GPU device starts forward propagation and backward propagation calculation, and calculates the gradient of each parameter. Because the model is very large, the memory of a single GPU cannot store all parameters, so we use model parallelism to distribute model parameters on multiple GPUs.

  • Gradient aggregation and parameter update: After the backward propagation calculation is completed, each GPU gets the gradient of a part of the parameters. Then, these gradients need to be aggregated among all GPU devices to calculate the global gradient. This requires data transmission through the network. Assuming a 25Gbps network is used, it takes about 224 seconds to transmit 700GB of data (assuming that each parameter uses single-precision floating-point numbers, 175 billion parameters are about 700GB). Then, each GPU updates its stored parameters based on the global gradient.

  • Synchronization: After parameter update, all GPU devices need to be synchronized to ensure that they use consistent model parameters for the next training step. This also requires data transmission through the network.

  • Repeat training steps: Repeat the above steps until all batches are trained or the predetermined number of training rounds (epoch) is reached.

This process involves a large amount of data transfer and synchronization, which may become a bottleneck for training efficiency. Therefore, optimizing network bandwidth and latency, as well as using efficient parallel and synchronization strategies, is crucial for large-scale model training.
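To make the numbers in the list above concrete, here is a quick back-of-envelope check in Python. All values are the article's assumptions (4-byte single-precision values, 175 billion parameters, a 25 Gbps link), not measurements:

```python
# Back-of-envelope reproduction of the figures above (article's assumptions only).
BATCH_SIZE = 512            # sequences per batch
SEQ_LEN = 1024              # tokens per sequence
BYTES_PER_TOKEN = 4         # each token stored as a 4-byte single-precision value
PARAMS = 175e9              # 175 billion parameters
BYTES_PER_PARAM = 4         # fp32
NET_GBPS = 25               # assumed inter-node bandwidth, Gbit/s

batch_bytes = BATCH_SIZE * SEQ_LEN * BYTES_PER_TOKEN
print(f"data per batch: {batch_bytes / 2**20:.1f} MB")          # ~2 MB

grad_bytes = PARAMS * BYTES_PER_PARAM                            # ~700 GB
sync_seconds = grad_bytes * 8 / (NET_GBPS * 1e9)
print(f"gradient volume: {grad_bytes / 1e9:.0f} GB, "
      f"naive sync over {NET_GBPS} Gbps: {sync_seconds:.0f} s")  # ~224 s
```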

2. Bottleneck of Communication Overhead:

It should be noted that communication overhead is also a key reason why current distributed computing power networks cannot handle large language model training.

Each node needs to exchange information frequently to collaborate, which creates communication overhead. For large language models, this problem is particularly severe due to the huge number of model parameters. Communication overhead can be divided into several aspects:

  • Data transmission: During training, nodes need to exchange model parameters and gradient information frequently. This requires a large amount of data to be transmitted over the network, consuming a lot of network bandwidth. If the network conditions are poor or the distance between computing nodes is large, the delay of data transmission will be high, further increasing communication overhead.

  • Synchronization problem: During training, nodes need to collaborate to ensure the correct progress of training. This requires frequent synchronization operations between nodes, such as updating model parameters and calculating global gradients. These synchronization operations require a large amount of data to be transmitted over the network and require all nodes to complete the operation, which leads to a lot of communication overhead and waiting time.

  • Gradient accumulation and updating: During training, each node calculates its own gradients and sends them to other nodes for accumulation and updating. This requires a large amount of gradient data to be transmitted over the network and requires all nodes to finish computing and transmitting their gradients, which is another source of heavy communication overhead (a toy sketch of this aggregation step follows this list).

  • Data consistency: It is necessary to ensure that the model parameters of each node are consistent. This requires frequent data verification and synchronization operations between nodes, which leads to a lot of communication overhead.
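To make the aggregation and synchronization items above concrete, here is a purely illustrative toy simulation (tiny sizes, no real networking) of what a node moves per step. Real systems use all-reduce collectives, but the per-node traffic still scales with the full parameter count:

```python
import numpy as np

# Toy simulation of the "gradient aggregation" step: every node holds a local
# gradient the size of the full model, and the global gradient is their
# element-wise average.
NUM_NODES = 4
MODEL_SIZE = 1_000                     # stand-in for 175 billion parameters

rng = np.random.default_rng(0)
local_grads = [rng.normal(size=MODEL_SIZE) for _ in range(NUM_NODES)]

# Naive aggregation: each node ships its full gradient and receives the mean back.
global_grad = np.mean(local_grads, axis=0)

bytes_per_grad = MODEL_SIZE * 4        # fp32
traffic_per_node = 2 * bytes_per_grad  # send local gradient, receive global one
print(f"per-node traffic per step: {traffic_per_node} bytes "
      f"(scales linearly with model size)")
```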

Although there are some methods to reduce communication overhead, such as compression of parameters and gradients, efficient parallel strategies, etc., these methods may introduce additional computational burdens or have negative effects on the training performance of the model. Moreover, these methods cannot completely solve the problem of communication overhead, especially in the case of poor network conditions or large distance between computing nodes.

Example:

Decentralized Distributed Computing Network

The GPT-3 model has 175 billion parameters. If we use single-precision floating-point numbers (4 bytes per parameter) to represent these parameters, storing them requires about 700 GB of memory. In distributed training, these parameters need to be frequently transmitted and updated between various computing nodes.

Assuming there are 100 computing nodes and each node needs to update all the parameters at each step, each step requires transmitting about 70 TB (700 GB × 100) of data. If we optimistically assume one step takes 1 second, then 70 TB of data would need to be transmitted every second. This bandwidth requirement already exceeds that of most networks; it is a question of feasibility.

In reality, because of communication latency and network congestion, data transmission may take far longer than one second, which means the computing nodes may spend most of their time waiting for data rather than computing. This drastically reduces training efficiency, and it is not the kind of inefficiency that can be fixed by simply waiting longer; it is the difference between feasible and infeasible, and it can make the entire training process impractical.

Centralized Data Center

Even in a centralized data center environment, training large models still requires heavy communication optimization.

In the centralized data center environment, high-performance computing devices act as clusters and are connected via high-speed networks to share computing tasks. However, even in this high-speed network environment, the communication overhead is still a bottleneck when training models with a large number of parameters, because the model’s parameters and gradients need to be frequently transmitted and updated between computing devices.

As previously mentioned, assuming there are 100 computing nodes and each server has a network bandwidth of 25 Gbps. If each server needs to update all the parameters at each training step, each training step requires transmitting about 700 GB of data, which takes about 224 seconds. With the advantage of a centralized data center, developers can optimize network topology within the data center and use techniques such as model parallelism to significantly reduce this time.

By comparison, if the same training is done in a distributed environment, assuming there are still 100 computing nodes distributed around the world, the average network bandwidth of each node is only 1 Gbps. In this case, transmitting the same 700 GB of data takes about 5,600 seconds, much longer than in a centralized data center. Moreover, due to network latency and congestion, the actual time required may be even longer.
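The two transfer-time figures follow directly from the assumed bandwidths; a quick sanity check in Python, using the same assumptions as above:

```python
# Transfer time for one full gradient exchange under the two bandwidth
# assumptions above: 25 Gbps inside a data center vs. 1 Gbps across the internet.
GRAD_GB = 700                     # ~175B fp32 parameters

for label, gbps in [("centralized data center", 25), ("globally distributed", 1)]:
    seconds = GRAD_GB * 8 / gbps  # GB -> Gbit, divide by Gbit/s
    print(f"{label:>24}: {seconds:>6.0f} s per step at {gbps} Gbps")
# -> ~224 s vs ~5600 s, before accounting for latency and congestion
```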

However, compared to the situation in a distributed computing network, optimizing communication overhead in a centralized data center environment is relatively easy. This is because in a centralized data center environment, computing devices usually connect to the same high-speed network, with relatively good network bandwidth and latency. In a distributed computing network, however, computing nodes may be distributed around the world, and network conditions may be relatively poor, making communication overhead a more serious issue.

OpenAI used a model parallel framework called Megatron to address the communication overhead problem during the training of GPT-3. Megatron parallelizes the model’s parameters across multiple GPUs by partitioning the parameters, with each device responsible for storing and updating a portion of the parameters, reducing the parameter volume each device needs to handle and lowering communication overhead. At the same time, high-speed interconnect networks were used during training, and network topology was optimized to reduce communication path length.
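To illustrate the partitioning idea (this is a conceptual toy sketch with made-up sizes, not Megatron's actual implementation), a layer's weight matrix can be split column-wise across "devices", with each device computing only its slice of the output:

```python
import numpy as np

# Conceptual sketch of tensor (model) parallelism: split one layer's weight
# matrix column-wise across devices, compute partial outputs, concatenate.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))             # batch of activations
W = rng.normal(size=(64, 128))           # full layer weight

NUM_DEVICES = 4
W_shards = np.split(W, NUM_DEVICES, axis=1)   # each device stores 1/4 of W

partial_outputs = [x @ w for w in W_shards]   # computed "on each device"
y_parallel = np.concatenate(partial_outputs, axis=1)

assert np.allclose(y_parallel, x @ W)         # same result as the unsplit layer
```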

3. Why can’t distributed computing networks do these optimizations?

They can, but the effects of these optimizations are limited compared to centralized data centers.

Network topology optimization: In a centralized data center, the network hardware and layout are under direct control, so the topology can be designed and optimized as needed. In a distributed environment, however, the computing nodes sit in different geographical locations, perhaps one in China and one in the United States, and there is no way to directly control the network connections between them. Data transmission paths can be optimized in software, but that is far less effective than optimizing the physical network directly. At the same time, differences in geography mean that network latency and bandwidth vary greatly, which further limits the effectiveness of topology optimization.

Model parallelism: Model parallelism is a technique that partitions a model’s parameters across multiple computing nodes and parallelizes processing to speed up training. However, this method usually requires frequent data transmission between nodes, so it has high requirements for network bandwidth and latency. Model parallelism is very effective in centralized data centers due to high network bandwidth and low latency. However, in a distributed environment, model parallelism is greatly restricted due to poor network conditions.

4. Challenges in data security and privacy

Almost all aspects involving data processing and transmission may affect data security and privacy:

  • Data distribution: Training data needs to be distributed to various computing nodes. Malicious use/leakage of data may occur during this process on distributed nodes.

  • Model training: During training, each node computes using its assigned data and outputs model parameter updates or gradients. If the node’s computation process is stolen or the results are maliciously parsed, data may also be leaked.

  • Parameter and gradient aggregation: The outputs of each node need to be aggregated to update the global model, and communication during the aggregation process may also leak information about the training data.

What are the solutions to data privacy issues?

  • Secure multi-party computation (SMC): SMC has been successfully applied in some specific and small-scale computing tasks. However, it has not been widely used in large-scale distributed training tasks due to its high computational and communication overhead.

  • Differential privacy (DP): DP is used in some data collection and analysis tasks, such as Chrome’s user statistics. However, in large-scale deep learning tasks DP degrades model accuracy, and designing appropriate noise-injection mechanisms is itself a challenge (a minimal sketch follows this list).

  • Federated learning (FL): FL is applied in some model training tasks of edge devices, such as vocabulary prediction of Android keyboard. However, in larger-scale distributed training tasks, FL faces problems such as high communication overhead and complex coordination.

  • Homomorphic encryption (HE): HE has been successfully applied in some computationally less complex tasks. However, in large-scale distributed training tasks, it has not been widely used due to its high computational overhead.
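For reference, the core mechanic behind differentially private training (clip each per-example gradient to a fixed norm, then add calibrated Gaussian noise to the aggregate) can be sketched in a few lines. The constants here are illustrative only, not a calibrated privacy budget:

```python
import numpy as np

# Minimal DP-SGD-style sketch: clip per-example gradients, add Gaussian noise.
CLIP_NORM = 1.0
NOISE_MULTIPLIER = 1.1

rng = np.random.default_rng(0)
per_example_grads = rng.normal(size=(32, 1000))   # 32 examples, 1000 params

# Clip each example's gradient to norm <= CLIP_NORM.
norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
clipped = per_example_grads * np.minimum(1.0, CLIP_NORM / norms)

# Add noise to the aggregate, then average; this noisy gradient drives the update.
noisy_sum = clipped.sum(axis=0) + rng.normal(
    scale=NOISE_MULTIPLIER * CLIP_NORM, size=clipped.shape[1])
dp_grad = noisy_sum / len(per_example_grads)
```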

Summary:

Each of the above methods has its own suitable scenarios and limitations. There is no single method that can completely solve data privacy issues in large-scale model training in a distributed computing network.

Can ZK, which is highly anticipated, solve data privacy issues in large-scale model training?

Theoretically, zero-knowledge proofs (ZKPs) can be used to ensure data privacy in distributed computing, allowing a node to prove that it has performed a calculation according to the rules, but without revealing the actual input and output data.

However, in practice, using ZKPs in scenarios of large-scale distributed computing networks to train large models faces the following bottlenecks:

Computation and communication overhead: Constructing and verifying zero-knowledge proofs requires a large amount of computing resources. In addition, the communication overhead of ZKP is also large, as the proof itself needs to be transmitted. In the case of large model training, these overheads may become particularly significant. For example, if a proof needs to be generated for each small batch of computation, this will significantly increase the overall time and cost of training.

Complexity of ZK protocols: Designing and implementing a ZKP protocol that is suitable for large-scale model training will be very complex. This protocol needs to be able to handle large-scale data and complex computations, and needs to be able to handle possible exceptions and errors.

Hardware and software compatibility: Using ZKP requires specific hardware and software support, which may not be available on all distributed computing devices.

Summary

Using ZKP in a large-scale distributed computing network to train large models will require years of research and development, as well as more academic resources and effort devoted to this direction.

II. Distributed Computing – Model Inference

Another major scenario for distributed computing is model inference. According to our judgment on the development path of large models, the demand for model training will gradually slow down after reaching a high point, but the demand for model inference will correspondingly increase exponentially with the maturity of large models and AIGC.

Model inference tasks usually have lower computing complexity and weaker data interaction, making them more suitable for distributed environments.

(Power LLM inference with NVIDIA Triton)

1. Challenges

Communication latency:

Communication between nodes is essential in a distributed environment. In a decentralized distributed computing network, nodes may be scattered around the world, so network latency can be a problem, especially for real-time response inference tasks.

Model deployment and updates:

The model needs to be deployed to each node. If the model is updated, each node needs to update its model, which consumes a lot of network bandwidth and time.

Data privacy:

Although inference tasks typically require only input data and models, without returning large amounts of intermediate data and parameters, the input data may still contain sensitive information, such as users’ personal information.

Model security:

In a decentralized network, the model must be deployed to untrusted nodes, which can lead to model leakage and, in turn, copyright and abuse problems. It can also create security and privacy issues: if a model is used to process sensitive data, a node may infer sensitive information by analyzing the model’s behavior.

Quality control:

Each node in the decentralized distributed computing network may have different computing capabilities and resources, which may make it difficult to guarantee the performance and quality of inference tasks.

2. Feasibility

Computational Complexity:

During the training phase, the model needs to iterate repeatedly, calculating forward and backward propagation for each layer, including activation function calculation, loss function calculation, gradient calculation, and weight update. Therefore, the computational complexity of model training is relatively high.

During the inference phase, only one forward propagation calculation is needed to predict the results. For example, in GPT-3, the input text needs to be converted into a vector, and then the forward propagation is performed through the model’s various layers (usually Transformer layers), and finally the output probability distribution is obtained, and the next word is generated based on this distribution. In GANs, the model needs to generate an image based on the input noise vector. These operations only involve the model’s forward propagation, and do not require gradient calculation or parameter update, so the computational complexity is low.
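As a toy illustration of why inference is light, here is a deliberately simplified stand-in for an autoregressive model in Python: one forward pass per generated token, no gradients and no optimizer state. The "model" is just a random embedding plus a projection, not a real transformer:

```python
import numpy as np

# Forward-only autoregressive decoding with a toy stand-in for a language model.
rng = np.random.default_rng(0)
VOCAB, DIM = 100, 32
embed = rng.normal(size=(VOCAB, DIM))    # token embedding table
proj = rng.normal(size=(DIM, VOCAB))     # output projection

def forward(token_ids):
    """Forward pass only: average token embeddings, project to vocab logits."""
    h = embed[token_ids].mean(axis=0)
    return h @ proj

tokens = [1, 5, 7]                        # prompt
for _ in range(5):                        # greedy decoding, one forward pass per token
    logits = forward(np.array(tokens))
    tokens.append(int(np.argmax(logits)))
print(tokens)
```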

Data Interactivity:

During the inference phase, the model usually processes a single input, rather than a large batch of data during training. The result of each inference also depends only on the current input, not on other inputs or outputs, so there is no need for a large amount of data interaction, and the communication pressure is also smaller.

For example, in a generative image model using GANs, we only need to input a noise vector to the model, and the model will generate a corresponding image. During this process, each input only generates one output, and there is no dependency between the outputs, so there is no need for data interaction.

For GPT-3, generating each next word only requires the current text input and the model’s state, without the need for interaction with other inputs or outputs, so the requirement for data interactivity is weak.

Summary

Whether it is a large language model or a generative image model, the computational complexity and data interactivity of inference tasks are relatively low, making them more suitable for decentralized distributed computing networks. This is also a direction that most projects are currently focusing on.

III. Projects

The technical threshold and breadth required of decentralized distributed computing power networks are very high, and hardware resources are also needed, so we have not seen many attempts so far. Take Together and Gensyn.ai as examples:

1. Together

Together is a company focused on open-source large models and committed to decentralized AI computing solutions, with the vision that anyone, anywhere can access and use AI. Together recently closed a seed round of 20 million USD led by Lux Capital.

Together was co-founded by Chris, Percy, and Ce. Its starting point was the observation that large-model training demands large GPU clusters and expensive spending, and that these resources and capabilities are concentrated in a few large companies.

From my perspective, a reasonable entrepreneurial plan for distributed computing power is:

Step 1. Open-source model

In order to achieve model inference in a decentralized distributed computing network, a prerequisite is that nodes must be able to obtain models at low cost. This means that models used on a decentralized computing network need to be open-source (if a model requires use under a corresponding license, it will increase the complexity and cost of implementation). For example, ChatGPT, as a non-open source model, is not suitable for execution on a decentralized computing network.

Therefore, it can be inferred that a company building a decentralized computing power network faces an implicit barrier to entry: it needs strong capabilities in large-model development and maintenance. Developing and open-sourcing a powerful base model removes, to some extent, the dependence on third parties open-sourcing their models, solves the most basic problem of a decentralized computing power network, and makes it easier to prove that the network is effective for large-model training and inference.

Together has taken exactly this approach. Recently, Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, and Hazy Research jointly launched RedPajama, a project based on LLaMA whose goal is to develop a series of fully open-source large language models.

Step 2. Landing distributed computing power in model inference

As mentioned in the previous two sections, compared with model training, model inference has lower computational complexity and data interaction, making it more suitable for decentralized distributed environments.

Building on the open-source model, Together’s R&D team has made a series of updates to the RedPajama-INCITE-3B model, such as using LoRA for low-cost fine-tuning and making the model run more smoothly on CPUs (especially MacBook Pros with M2 Pro processors). Although the model is relatively small, its capabilities exceed those of other models of the same size, and it has already been put to practical use in legal, social, and other scenarios.

Step 3. Landing distributed computing power in model training

In the medium to long term, despite great challenges and technical bottlenecks, the most attractive prize is serving the computing power demand of large-scale AI model training. From the outset, Together began laying the groundwork for overcoming communication bottlenecks in decentralized training, and published a related paper at NeurIPS 2022: Overcoming Communication Bottlenecks for Decentralized Training. The main directions can be summarized as follows:

  • Scheduling optimization

When training in a decentralized environment, connections between nodes have different latencies and bandwidths, so it matters that communication-heavy tasks are assigned to devices with faster connections. Together models the cost of a given scheduling strategy, which lets it optimize scheduling to minimize communication cost and maximize training throughput. The Together team also found that even when the network is 100 times slower, end-to-end training throughput only drops by a factor of 1.7 to 2.3. Scheduling optimization therefore offers a realistic chance of closing the gap between distributed networks and centralized clusters.

  • Communication Compression Optimization

Together proposes compressing the communication of forward activations and backward gradients and introduces the AQ-SGD algorithm, which comes with rigorous convergence guarantees for stochastic gradient descent. AQ-SGD can fine-tune large base models on slow networks (such as 500 Mbps) and is only about 31% slower than uncompressed end-to-end training on centralized compute networks (such as 10 Gbps). In addition, AQ-SGD can be combined with state-of-the-art gradient compression techniques (such as QuantizedAdam) to achieve a 10% end-to-end speedup. (A toy quantization sketch illustrating this style of compression appears at the end of this subsection.)

  • Project Summary

The Together team is comprehensively staffed: members have strong academic backgrounds, and industry experts cover everything from large-scale model development and cloud computing to hardware optimization. Together has also shown a patient, long-term approach to its roadmap, from developing open-source large models, to testing idle compute (such as Macs) in a distributed compute network for model inference, to laying out distributed compute for large-model training: a sense of long accumulation building toward a breakthrough 🙂

However, Together has not yet published any work on the incentive layer, which I believe is just as important as the technical research and is a key factor in the success of decentralized compute networks.
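As a toy illustration of the kind of compression such schemes build on (this is not Together's AQ-SGD, just plain 8-bit quantize/dequantize of a tensor before it crosses the network), note how the payload shrinks 4x at the cost of a small reconstruction error:

```python
import numpy as np

# Quantize a float32 tensor to uint8 before sending it over the wire, then
# dequantize on the receiving side.
def quantize_uint8(x):
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 255 if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

grad = np.random.default_rng(0).normal(size=10_000).astype(np.float32)
q, lo, scale = quantize_uint8(grad)
restored = dequantize(q, lo, scale)

print(f"bytes on the wire: {grad.nbytes} -> {q.nbytes} (4x less traffic)")
print(f"max abs reconstruction error: {np.abs(grad - restored).max():.4f}")
```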

2. Gensyn.ai

From Together’s technical path, we can roughly understand the landing process and corresponding research focus of decentralized compute networks in model training and inference.

Another important point to consider is the design of the computing network incentive layer/consensus algorithm. For example, an excellent network needs to:

  • Ensure that the rewards are attractive enough;

  • Ensure that each miner receives the appropriate rewards, including anti-cheating and “the more you work, the more you get”;

  • Ensure that tasks are reasonably scheduled and assigned among different nodes, without many idle nodes or some nodes being excessively congested;

  • Ensure that the incentive algorithm is concise and efficient, without causing too much system burden and delay;

……

Let’s take a look at how Gensyn.ai does it:

  • Become a node

First, solvers in the computing network bid for the right to process tasks submitted by users and, based on the task size and the risk of being caught cheating, each solver must put down a deposit.

  • Verification

While updating the parameters, the solver generates multiple checkpoints (to ensure the transparency and traceability of the work) and periodically produces cryptographic proofs of inference (proofs of work progress) for the task. When the solver finishes the work and produces part of the computation result, the protocol selects a verifier. The verifier also puts down a deposit (to ensure it verifies honestly) and, based on the proofs above, decides which part of the computation result needs to be checked.

  • If the solver and verifier disagree

Using a Merkle-tree-based data structure, the exact location of the discrepancy in the computation results can be pinpointed. The entire verification operation is recorded on-chain, and the cheater’s deposit is slashed.

Project summary

The design of the incentive and verification algorithms means that Gensyn.ai does not need to replay the entire calculation task during the verification process, but only needs to copy and verify a part of the results based on the proofs provided, which greatly improves the efficiency of verification. At the same time, nodes only need to store part of the calculation results, which also reduces storage space and computing resources consumption. In addition, potential cheating nodes cannot predict which parts will be selected for verification, so this also reduces cheating risks.

This way of verifying discrepancies and discovering cheaters can also quickly find the wrong place in the calculation process (starting from the root node of the Merkle tree, and gradually traversing down) without comparing the entire calculation result, which is very effective in processing large-scale calculation tasks.
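A minimal sketch of the general technique (chunk the claimed results, hash them into a Merkle tree, compare roots, and descend only along the mismatching branch) is shown below; it illustrates the idea, not Gensyn.ai's actual protocol:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_levels(chunks):
    """Build all levels of a Merkle tree: levels[0] = leaves, levels[-1] = [root]."""
    level = [h(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[min(i + 1, len(level) - 1)])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def find_divergence(chunks_a, chunks_b):
    """Return the index of the first differing chunk, or None if roots match."""
    la, lb = merkle_levels(chunks_a), merkle_levels(chunks_b)
    if la[-1] == lb[-1]:
        return None
    idx = 0
    for depth in range(len(la) - 2, -1, -1):   # walk from root toward leaves
        left = 2 * idx
        right = min(left + 1, len(la[depth]) - 1)
        idx = left if la[depth][left] != lb[depth][left] else right
    return idx

solver_results = [b"r0", b"r1", b"r2", b"r3"]
verifier_results = [b"r0", b"r1", b"BAD", b"r3"]
print(find_divergence(solver_results, verifier_results))   # -> 2
```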

In short, the design goal of the incentive/validation layer of Gensyn.ai is to be concise and efficient. However, currently this is only at the theoretical level, and specific implementation may still face the following challenges:

In terms of the economic model, how to set appropriate parameters to effectively prevent fraud, while not creating too high a barrier to entry for participants.

In terms of technical implementation, developing effective periodic cryptographic proofs of inference is also a complex problem that requires advanced knowledge of cryptography.

In terms of task allocation, it is necessary to have reasonable scheduling algorithms to support the selection and assignment of tasks to different solvers in the computing power network. Simply allocating tasks according to bid mechanisms is obviously questionable in terms of efficiency and feasibility. For example, nodes with strong computing power may be able to process larger-scale tasks, but may not participate in bids (this involves the issue of incentives for node availability), while nodes with low computing power may bid the highest but may not be suitable for handling some complex large-scale computing tasks.

IV. Some Thoughts on the Future

The question of who actually needs a decentralized computing power network has not really been validated. Using idle computing power for large-scale model training, which demands enormous compute, is obviously the most sensible use and the one with the greatest room for imagination. But in practice, bottlenecks such as communication and privacy force us to rethink:

Is training large models in a decentralized manner really promising?

If we step outside this consensus about “the most reasonable landing scenario,” is applying decentralized computing power to the training of small AI models also a big opportunity? From a technical point of view, the current limiting factors are tied to the model’s scale and architecture, so they are far less severe for small models. From the market perspective, we have always believed that large-model training will be huge from now into the future, but does that mean the market for small AI models has no appeal?

I don’t think so. Compared with large models, small AI models are easier to deploy and manage, and are more efficient in terms of processing speed and memory usage. In many application scenarios, users or companies do not need the more general reasoning ability of large language models, but only focus on a very refined prediction target. Therefore, in most scenarios, small AI models are still a more feasible choice and should not be prematurely ignored in the tide of large models.
