Anticipating the Prospects of a Potential Track: The Decentralized Computing Power Market (Part 1)

Author: Zeke, YBB Capital

Introduction

Since the birth of GPT-3, generative AI has brought explosive breakthroughs and broad applications to artificial intelligence, sending tech giants flocking into the AI race. But problems have come with it. Training and inference for large language models (LLMs) require massive computing power, and with each model iteration the demand and cost grow exponentially. GPT-3's 175 billion parameters are roughly 117 times GPT-2's 1.5 billion, and GPT-3's training cost reached about $12 million at the public GPU cloud pricing of the time, some 200 times that of GPT-2. In actual use, every user query requires inference computation: at the roughly 13 million independent daily visits seen earlier this year, the corresponding requirement is more than 30,000 A100 GPUs — an initial hardware investment of a staggering $800 million, with estimated daily inference costs of around $700,000.
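As a sanity check on those figures, here is a back-of-envelope sketch in Python. The per-GPU price and daily operating cost are assumptions chosen for illustration, not sourced numbers:

```python
# Back-of-envelope estimate of the inference hardware figures cited above.
# All inputs are assumptions for illustration, not measured values.

a100_unit_cost = 25_000             # assumed all-in cost per A100, USD
num_gpus = 30_000                   # GPUs needed for ~13M daily users (cited above)
power_and_hosting_per_gpu_day = 20  # assumed USD per GPU per day, energy + hosting

capex = a100_unit_cost * num_gpus
daily_opex = power_and_hosting_per_gpu_day * num_gpus

print(f"Initial investment: ${capex / 1e6:.0f}M")        # ~$750M, same order as $800M
print(f"Daily inference cost: ${daily_opex / 1e3:.0f}k") # ~$600k, close to $700k
```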

Insufficient computing power and high costs have become challenges for the entire AI industry, and the same problem looks set to trouble the blockchain industry as well. On the one hand, with the fourth Bitcoin halving approaching and spot ETFs awaiting approval, demand for mining hardware will inevitably rise as prices climb. On the other hand, zero-knowledge proof (ZKP) technology is flourishing, and Vitalik has repeatedly stressed that over the next decade ZK will be as important to the blockchain industry as blockchain itself. Yet while the industry holds high hopes for this technology, ZK, like AI, consumes considerable computing power and time due to its complex computation process.

In the foreseeable future, a computing power shortage looks inevitable. So, will a decentralized computing power market be a good business?

Definition of Decentralized Computational Power Market

The decentralized computing power market is essentially equivalent to the decentralized cloud computing track, though I personally find the former term more apt for the newer projects discussed below. The decentralized computing power market belongs to the DePIN (Decentralized Physical Infrastructure Networks) category. Its goal is to create an open computing power market where anyone with idle computing resources can offer them, incentivized by tokens, mainly serving B-end users and developer communities. Well-known projects such as Render Network (a decentralized GPU rendering solution) and Akash Network (a distributed peer-to-peer cloud computing marketplace) both belong to this track.

In what follows, I will start from the basic concepts and then discuss three emerging markets in this track: the AGI computing power market, the Bitcoin computing power market, and the ZK hardware acceleration market. The latter two will be covered in "Anticipating the Prospects of a Potential Track: The Decentralized Computing Power Market (Part 2)".

Overview of Computing Power

The concept of computing power can be traced back to the early days of computer invention, when computers were mechanical devices used for computing tasks. Computing power refers to the computational capability of these mechanical devices. As computer technology advanced, the concept of computing power evolved to encompass the collaborative work of computer hardware (CPU, GPU, FPGA, etc.) and software (operating system, compiler, applications, etc.).

Definition

Computing power refers to the amount of data that a computer or other computing device can process or the number of computational tasks it can complete within a certain period of time. Computing power is often used to describe the performance of a computer or computing device and is an important metric for measuring its processing capability.

Measurement Standards

Computing power can be measured in various ways, such as computing speed, computing energy consumption, computing accuracy, and parallelism. In the field of computers, commonly used metrics for measuring computing power include FLOPS (floating-point operations per second), IPS (instructions per second), TPS (transactions per second), etc.

FLOPS (floating-point operations per second) refers to the computational capability of a computer in performing floating-point operations (mathematical operations involving decimal numbers, which require consideration of precision and rounding errors). FLOPS measures how many floating-point operations a computer can complete per second. FLOPS is a metric for measuring high-performance computing capabilities and is commonly used to measure the computing power of supercomputers, high-performance computing servers, and graphics processing units (GPUs). For example, a computer system with 1 TFLOPS (1 trillion floating-point operations per second) means it can complete 1 trillion floating-point operations per second.

IPS (instructions per second) refers to the speed at which a computer processes instructions, measuring how many instructions it can execute per second. IPS measures single-instruction performance and is commonly used for central processing units (CPUs). For example, a 3 GHz CPU that retires one instruction per clock cycle executes about 3 billion instructions per second.

TPS (transactions per second) refers to the capability of a computer to process transactions, measuring how many transactions a computer can complete per second. TPS is commonly used to measure the performance of database servers. For example, a database server with a TPS of 1000 means it can process 1000 database transactions per second.

In addition, there are specific computing power metrics for different application scenarios, such as inference speed, image processing speed, and speech recognition accuracy.
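To give these units a sense of scale, the sketch below relates FLOPS to wall-clock time. The A100's ~312 TFLOPS peak is a rough public figure, and the 2.15e25 FLOPs total anticipates the GPT-4 training estimate cited later in this article:

```python
# Relating the FLOPS metric to wall-clock time at realistic scales.
# Hardware figures are rough public numbers, used only for a sense of scale.

a100_peak_flops = 312e12   # ~312 TFLOPS peak FP16 for one NVIDIA A100
total_flops = 2.15e25      # training compute cited later for GPT-4

seconds = total_flops / a100_peak_flops
gpu_years = seconds / (3600 * 24 * 365)
print(f"{gpu_years:,.0f} GPU-years on a single A100 at peak")  # ~2,185 years
print(f"or ~{gpu_years / 10_000:.2f} years on 10,000 A100s")   # ~0.22 years
```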

Types of Computing Power

GPU computing power refers to the computational capability of graphics processing units. Unlike central processing units (CPUs), GPUs are hardware designed specifically for processing graphics and video data. They contain a large number of processing units and efficient parallel computing capability, able to execute enormous numbers of floating-point operations simultaneously. Because GPUs were originally built for game graphics processing, they typically offer much greater memory bandwidth and parallel throughput than CPUs, supporting complex graphics computation.

Differences between CPUs and GPUs

Architecture: CPUs and GPUs have different computing architectures. CPUs typically have one or multiple cores, where each core is a general-purpose processor capable of executing various operations. GPUs, on the other hand, have a large number of stream processors and shaders dedicated to executing image processing-related operations;

Parallel computing: GPUs typically have far higher parallel computing capability. CPUs have a limited number of cores, each executing only one or a few instructions at a time, whereas GPUs can have thousands of streaming processors that execute many instructions and operations simultaneously. GPUs are therefore usually better suited to parallel workloads such as machine learning and deep learning, which involve massive amounts of parallel computation;

Programming: GPU programming is more complex compared to CPU programming, as it requires the use of specific programming languages (such as CUDA or OpenCL) and specific programming techniques to leverage the parallel computing capabilities of GPUs. In contrast, CPU programming is simpler and can utilize general-purpose programming languages and tools.
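A minimal sketch of this difference in practice, assuming PyTorch is installed: the same matrix multiplication is dispatched to the CPU and, if available, to a CUDA GPU, where thousands of stream processors execute the dot products in parallel:

```python
# Minimal sketch contrasting CPU and GPU execution of the same matrix multiply.
# Requires PyTorch; falls back to CPU-only if no CUDA device is present.
import time
import torch

def timed_matmul(device: str, n: int = 4096) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # make sure timing excludes setup
    start = time.perf_counter()
    _ = a @ b  # thousands of dot products run in parallel on a GPU
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the asynchronous kernel to finish
    return time.perf_counter() - start

print(f"CPU: {timed_matmul('cpu'):.3f}s")
if torch.cuda.is_available():
    print(f"GPU: {timed_matmul('cuda'):.3f}s")  # typically orders of magnitude faster
```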

Importance of Computing Power

In the era of the Industrial Revolution, oil was the lifeblood of the world, penetrating every industry. In the coming AI era, computing power will be the world's "digital oil." From major companies' frenzied scramble for AI chips and Nvidia's market capitalization breaking one trillion dollars, to the recent US blockade of high-end chips against China — reaching down to compute capacity, chip area, and even plans to ban GPU clouds — the importance of computing power is self-evident. Computing power will be the next era's great commodity.


Overview of Artificial General Intelligence (AGI)

Artificial Intelligence (AI) is a scientific and technological discipline that researches and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Originating in the 1950s and 1960s, it has passed through three waves — symbolism, connectionism, and behaviorism — over more than half a century of evolution, and today, as an emerging general-purpose technology, it is driving huge changes in social life and across industries. The more specific definition of the current generation of AI is Artificial General Intelligence (AGI): an artificial intelligence system with broad understanding that can demonstrate intelligence similar to or surpassing humans' across a wide range of tasks and domains. AGI fundamentally requires three elements: deep learning (DL), big data, and massive computing power.

Deep Learning

Deep learning is a subfield of machine learning (ML), and deep learning algorithms are modeled after the human brain's neural networks. The human brain contains billions of interconnected neurons that work together to learn and process information; similarly, deep learning neural networks (artificial neural networks) are composed of multiple layers of artificial neurons working together inside a computer. An artificial neuron is a software module, called a node, that processes data using mathematical calculations. Artificial neural networks use these nodes to solve complex problems in deep learning algorithms.


Neural networks can be divided into input layer, hidden layers, and output layer, with parameters connecting different layers.

Input Layer: The input layer is the first layer of the neural network and is responsible for receiving external input data. Each neuron in the input layer corresponds to a feature of the input data. For example, when processing image data, each neuron may correspond to a pixel value of the image;

Hidden Layers: The input layer processes the data and passes it to deeper layers of the neural network. These hidden layers process information at different levels and adjust their behavior when receiving new information. Deep learning networks can have hundreds of hidden layers, which can be used to analyze problems from multiple perspectives. For example, if you have an unknown animal image that needs to be classified, you can compare it with animals you already know. You can determine what animal it is based on ear shape, number of legs, or the size of the pupils. Hidden layers in deep neural networks work in the same way. If a deep learning algorithm attempts to classify animal images, each hidden layer will process different features of the animal and try to classify it accurately;

Output Layer: The output layer is the last layer of the neural network and is responsible for generating the network’s output. Each neuron in the output layer represents a possible output category or value. For example, in a classification problem, each output layer neuron may correspond to a category, while in a regression problem, the output layer may have only one neuron, which represents the predicted result;

Parameters: In a neural network, the connections between different layers are represented by weight and bias parameters, which are optimized during the training process to enable the network to accurately identify patterns and make predictions in the data. Increasing the parameters can enhance the model capacity of a neural network, which means the model’s ability to learn and represent complex patterns in the data. However, the increase in parameters also increases the demand for computing power.
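The toy network below makes the layer structure and the parameter count concrete. It is a sketch using PyTorch; the 784-feature input and 10-class output are illustrative choices (MNIST-style image classification), not anything from the text above:

```python
# A minimal input/hidden/output network matching the structure described above,
# with a count of its trainable parameters (weights and biases).
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),  # input -> hidden: one input per feature (e.g. pixel)
    nn.ReLU(),
    nn.Linear(256, 256),  # hidden layer
    nn.ReLU(),
    nn.Linear(256, 10),   # output layer: one neuron per class
)

n_params = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_params:,}")  # 269,322 — grows fast with width/depth
```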

Big Data

To train effectively, neural networks usually require large amounts of diverse, high-quality data from multiple sources. Such data is the foundation for training and validating machine learning models: by analyzing big data, the models learn patterns and relationships in the data, enabling prediction and classification.

Massive Computing Power

The multi-layer complex structure, huge parameter counts, big-data processing demands, and iterative training process (during training, the model must iterate through forward and backward propagation for each layer, computing activations, losses, gradients, and weight updates), together with high-precision computation, parallel computing capability, optimization and regularization techniques, and model evaluation and validation, all drive the demand for high computing power. As deep learning advances, AGI's requirement for massive computing power grows roughly tenfold each year. The latest model, GPT-4, reportedly contains 1.8 trillion parameters, with a single training run costing over $60 million and requiring 2.15e25 FLOPs (about 21.5 septillion floating-point operations). Demand for computing power will only expand as newer models are developed.
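A widely used rule of thumb from the scaling-law literature approximates training compute as 6 × parameters × training tokens. The sketch below applies it; note that GPT-4's token count (and its effective parameter count, given its reported mixture-of-experts design) is not public, so the token figure here is an assumption picked to show the formula lands near the compute cited above:

```python
# Rule-of-thumb training compute: FLOPs ≈ 6 * N_params * N_tokens.
# The token count below is an assumption for illustration; GPT-4's is not public.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

n_params = 1.8e12  # parameter count cited above
n_tokens = 2.0e12  # assumed training tokens (illustrative)
print(f"{training_flops(n_params, n_tokens):.2e} FLOPs")  # ~2.16e25, near the cited 2.15e25
```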

AI Computing Economics

Future Market Size

According to the "2022-2023 Global Computing Power Index Assessment Report," jointly compiled by the International Data Corporation (IDC), Inspur Information, and Tsinghua University's Global Industry Research Institute, the global AI computing market will grow from $19.5 billion in 2022 to $34.66 billion in 2026. Within it, the generative AI computing market will grow from $820 million in 2022 to $10.99 billion in 2026, rising from 4.2% to 31.7% of the overall AI computing market.


Monopoly of Computing Power Economy

AI GPU production is currently monopolized by NVIDIA, and the chips are extremely expensive (a single H100 has been bid up to $40,000). As soon as GPUs are released they are snapped up by the Silicon Valley giants: some are used to train their own new models, and the rest are rented out to AI developers through cloud platforms. Google, Amazon, and Microsoft's cloud platforms control vast computing resources — servers, GPUs, and TPUs — and computing power has become a new resource these giants monopolize. Many AI developers cannot even buy a dedicated GPU without paying a premium; to use the latest hardware, they are forced to rent AWS or Microsoft cloud servers. The financial reports show how profitable this business is: AWS's cloud services carry a 61% gross margin, and Microsoft's is even higher at 72%.


Must we accept this centralized authority and control, paying a 72% gross margin for computing resources? Will the Web2 giants' monopolies carry over into the next era?

Challenges of Decentralized AGI Computing Power

When it comes to countering monopoly, decentralization is usually held up as the optimal solution. But judging from existing projects, can the massive computing power AI needs be achieved through DePIN storage protocols or idle-GPU projects like Render (RNDR)? The answer is no. The road to slaying the dragon is not that simple. Early projects were not designed specifically for AGI computing power and are not feasible for it. Bringing computing power onto the blockchain faces at least the following five challenges (illustrative sketches of the first and third follow the list):

1. Job Verification: To build a truly trustless computing network with economic incentives for participants, the network must be able to verify that deep learning computation was actually performed. The core issue is the state dependency of deep learning models: in such a model, each layer's input depends on the previous layer's output, so a specific layer cannot be verified in isolation from everything before it. Each layer's computation builds on the results of all preceding layers, so verifying work done at a given point (such as a specific layer) requires executing all the work from the beginning of the model up to that point;

2. Market: As an emerging market, the AI computing power market is subject to supply-and-demand constraints such as the cold-start problem: supply and demand liquidity must roughly match from the start for the market to grow. To capture potential supply, participants must be offered clear rewards in exchange for their computing resources, and the market needs a mechanism to track completed computational work and pay providers promptly. In traditional markets, intermediaries handle management and onboarding while cutting operating costs by setting minimum payment thresholds, but this approach becomes cost-prohibitive at scale. Only a small portion of the supply can be captured economically, leading to an equilibrium in which the market can only capture and maintain a limited supply and cannot grow further;

3. Halting Problem: The halting problem is a fundamental problem in computational theory. It involves determining whether a given computation task will complete within a finite time or never halt. This problem is undecidable, meaning that there is no universal algorithm that can predict whether all computation tasks will halt within a finite time. For example, executing smart contracts on Ethereum also faces similar halting problems. It is not possible to determine in advance how much computational resources a smart contract execution will require or whether it will complete within a reasonable time;

(In the context of deep learning, this problem becomes even more complex because the models and frameworks transition from static graph construction to dynamic construction and execution.)

4. Privacy: Privacy-aware design and development are essential for project teams. While much machine learning research can be conducted on public datasets, fine-tuning a model on proprietary user data is often necessary to improve performance and adapt it to specific applications. That fine-tuning may involve handling personal data, so privacy protection requirements must be taken into account;

5. Parallelization: This is the critical factor that makes current projects infeasible. Deep learning models are usually trained in parallel on large hardware clusters with proprietary architectures and extremely low latency, whereas GPUs in a distributed network would need to exchange data frequently, introducing latency and leaving the system limited by the slowest GPU. With untrusted and unreliable computing resources, achieving heterogeneous parallelization is a problem that must be solved. One currently feasible approach is parallelization via transformer models such as Switch Transformers, which already have highly parallel characteristics.
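To make the first challenge concrete, here is a minimal sketch (plain NumPy, purely illustrative) of the state dependency: the output of layer k can only be reproduced by replaying every layer before it, so a verifier cannot check one layer in isolation.

```python
# Why verifying one layer requires all earlier layers: each layer's input is
# the previous layer's output, so there is no shortcut to an intermediate state.
import numpy as np

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) for _ in range(5)]  # a toy 5-layer model

def forward_up_to(x: np.ndarray, k: int) -> np.ndarray:
    """Layer k's output can only be produced by running layers 0..k in order."""
    for w in weights[: k + 1]:
        x = np.tanh(w @ x)
    return x

x0 = rng.standard_normal(8)
claimed = forward_up_to(x0, 3)           # a solver's claimed layer-3 output
recomputed = forward_up_to(x0, 3)        # the verifier must replay layers 0..3
print(np.allclose(claimed, recomputed))  # True only after replaying the full prefix
```

And for the third challenge, a toy illustration of the gas-style workaround Ethereum uses: since no algorithm can decide in advance whether an arbitrary task halts, execution is metered and aborted once a step budget is exhausted. The function names and budget here are hypothetical.

```python
# A gas-style workaround for the halting problem: meter execution and abort
# past a budget, instead of trying to predict whether the task halts.
def run_with_budget(task_steps, budget: int):
    """task_steps: an iterator of computation steps (possibly infinite)."""
    for used, _ in enumerate(task_steps, start=1):
        if used > budget:
            raise RuntimeError("out of gas: task exceeded its step budget")
    return "halted within budget"

def endless():  # a task that never halts on its own
    while True:
        yield None

try:
    run_with_budget(endless(), budget=1_000_000)
except RuntimeError as e:
    print(e)  # the network charges for metered steps instead of hanging forever
```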

Solution: Although attempts to decentralize the AGI computing power market are still at an early stage, two projects have made preliminary progress on the consensus design of decentralized networks and on putting decentralized computing networks into practice for model training and inference. The following analysis uses Gensyn and Together as examples to explore the design, and the challenges, of a decentralized AGI computing power market.

Gensyn


Gensyn is an AGI computing power market still under construction. It aims to solve the various challenges of decentralized deep learning computation and to reduce the cost of deep learning today. Gensyn is essentially a layer-1 proof-of-stake protocol built on Polkadot that rewards solvers directly through smart contracts in exchange for their idle GPUs performing computation and executing machine learning tasks.

Returning to the earlier question: the core of building a truly trustless computing network is verifying completed machine learning work. This is a highly complex problem that requires a balance among complexity theory, game theory, cryptography, and optimization.

Gensyn proposes a simple starting point: a solver submits the results of a completed machine learning task, and to verify their accuracy, an independent verifier attempts to re-execute the same work. This can be called single replication, since only one verifier re-executes — adding only one unit of extra work to verify the original. But if the verifier is not the original requester, the trust problem remains: the verifier may itself be dishonest, so its work needs verification too, which requires yet another verifier, who may also be distrusted, and so on into an infinite replication chain. Gensyn introduces three key concepts, woven together into a four-role participant system, to break this infinite chain.

Probabilistic proof-of-learning: Metadata from gradient-based optimization processes is used to construct certificates of completed work. By replicating certain stages, these certificates can be verified quickly, ensuring the work was completed as claimed.

Graph-based pinpoint protocol: A multi-granularity, graph-based pinpoint protocol, with cross-evaluator consistent execution, allows verification work to be re-run and compared for consistency, ultimately confirmed by the chain itself.

Truebit-style incentive game: Staking and slashing are used to construct an incentive game that ensures every economically rational participant acts honestly and performs their assigned tasks.
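A toy payoff model shows why such a game can work: once the probability of being caught is high enough relative to the stake, cheating has negative expected value. All numbers below are illustrative assumptions, not Gensyn's actual parameters:

```python
# Toy payoff model for a staking-and-slashing game: with a high enough
# detection probability, cheating has negative expected value.
def expected_profit(cheat: bool, stake: float, reward: float,
                    compute_cost: float, p_caught: float) -> float:
    if not cheat:
        return reward - compute_cost  # honest work always pays out
    # A cheater skips the compute cost but is slashed if caught by a verifier.
    return (1 - p_caught) * reward - p_caught * stake

print(expected_profit(cheat=False, stake=100, reward=20, compute_cost=10, p_caught=0.9))  # 10.0
print(expected_profit(cheat=True,  stake=100, reward=20, compute_cost=10, p_caught=0.9))  # -88.0
```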

The participant system consists of submitters, solvers, verifiers, and whistleblowers.

Submitters:

Submitters are the end users of the system, providing tasks to be computed and paying for completed work units.

Solvers:

Solvers are the main workers of the system, performing model training and generating proofs to be checked by verifiers.

Verifiers:

Verifiers are crucial in linking the non-deterministic training process to deterministic linear computation. They replicate a portion of a solver's proof and compare distances against expected thresholds.

Whistleblowers:

Whistleblowers are the final line of defense. They inspect the work of verifiers and challenge them, hoping to receive hefty reward payouts.

System Operation

The game system designed by the protocol operates in eight stages, covering the four participant roles, to carry the process from task submission through final settlement.

1. Task Submission: A task consists of three specific pieces of information:

  • Metadata describing the task and hyperparameters.
  • A model binary file (or base architecture).
  • Preprocessed training data that is publicly accessible.

To submit a task, the submitter specifies its details in a machine-readable format and submits them to the chain, together with the model binary (or machine-readable architecture) and the publicly accessible location of the preprocessed training data. The public data can be stored in simple object storage such as AWS S3, or in a decentralized storage system such as IPFS, Arweave, or Subspace.

2. Profiling: The profiling process establishes a baseline distance threshold for proof verification. Verifiers periodically fetch profiling tasks and generate variation thresholds for comparing proofs of learning. To generate a threshold, a verifier deterministically runs and reruns part of the training with different random seeds, generating and checking its own proofs. Through this process, the verifier establishes an overall expected distance threshold for the non-deterministic work, which can later be used to verify solutions.

3. Training: After profiling, the task enters the public task pool (similar to Ethereum's mempool). A solver is selected to execute the task, which is removed from the pool. The solver performs the task according to the submitter's metadata and the provided model and training data. During training, the solver also generates a proof of learning by periodically checkpointing and storing metadata (including parameters) of the training process, so that verifiers can later replicate the optimization steps as closely as possible.

4. Proof Generation: The solver periodically stores the model weights or updates and the corresponding indices of the training dataset to identify the samples used for generating weight updates. Checkpoint frequency can be adjusted to provide stronger guarantees or save storage space. Proofs can be "stacked," meaning they can start from a random distribution used to initialize the weights or from pre-trained weights generated with their own proofs. This allows the protocol to establish a set of proven, pre-trained base models that can be fine-tuned for more specific tasks.

5. Verification of Proof: After completing the task, the solver registers completion with the chain and publishes its proof of learning for verifiers to access. Verifiers pull verification tasks from the public task pool and perform the computational work to reproduce part of the proof and run distance calculations. The chain then uses the resulting distance, together with the threshold computed during profiling, to determine whether the verification matches the proof. (A simplified sketch of this threshold check follows the stage list.)

6. Graph-based Pinpoint Challenge: After a proof of learning is verified, a whistleblower can replicate the verifier's work to check whether the verification itself was performed correctly. If a whistleblower believes a verification was performed incorrectly (maliciously or not), they can challenge it before contract arbitration for a reward. The reward can come from the solver's and verifier's deposits (in true-positive cases) or from the jackpot pool (in false-positive cases), with arbitration executed by the chain itself. Whistleblowers verify and subsequently challenge work only when they expect appropriate compensation, so in practice they are expected to join and leave the network depending on the number of other active whistleblowers (i.e., live deposits and challenges). A whistleblower's default strategy is therefore to join the network when few others are active, post a deposit, randomly select an active task, and begin verification. After finishing one task, they grab another random active task and repeat, until the number of whistleblowers exceeds their payout threshold, at which point they leave the network (or, more likely, switch to another role — verifier or solver — depending on their hardware) until conditions reverse again.

7. Contract Arbitration: When a verifier is challenged by a whistleblower, they enter an on-chain process to pinpoint the disputed operation or input, with the chain ultimately performing the final basic operation and deciding whether the challenge is justified. To keep whistleblowers honest and overcome the verifier's dilemma, periodic forced errors and jackpot payments are built in.

8. Settlement: In settlement, participants are paid according to the conclusions of the probabilistic and deterministic checks, with payouts varying by scenario based on the earlier verifications and challenges. If the work is deemed correctly executed and all checks have passed, the solver and the verifier are rewarded according to the operations performed.
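A highly simplified sketch of the profiling-plus-verification loop described in stages 2 and 5: training is re-run from a checkpoint with the same seed, and the parameter distance is compared against a threshold. The distance metric, tolerance, and mock update rule are all illustrative assumptions, not Gensyn's exact scheme:

```python
# Sketch of threshold-based verification: rerun a slice of training from a
# checkpoint and compare parameter distance against the profiled threshold.
import numpy as np

def train_segment(params: np.ndarray, seed: int, steps: int = 10) -> np.ndarray:
    rng = np.random.default_rng(seed)  # seeded, so reruns are reproducible
    for _ in range(steps):
        params = params - 0.01 * rng.standard_normal(params.shape)  # mock updates
    return params

checkpoint = np.zeros(1000)
solver_result = train_segment(checkpoint, seed=42)    # from the proof of learning
verifier_result = train_segment(checkpoint, seed=42)  # the verifier's replication

# In reality, hardware non-determinism leaves a small nonzero distance, which is
# why a profiled threshold is needed rather than exact equality.
distance = np.linalg.norm(solver_result - verifier_result)
threshold = 1e-6  # established during the profiling stage
print("proof accepted" if distance <= threshold else "raise a challenge")
```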

Project Summary

Gensyn has designed a fascinating game system at the verification and incentive layers that can quickly pinpoint errors by identifying points of divergence in the network. Many details are still missing from the current system, however. For instance, how should parameters be set so that rewards and penalties are reasonable without raising the barrier to entry too high? Has the game design considered extreme scenarios and the differing computing power of solvers? The current version of the whitepaper also lacks a detailed account of heterogeneous parallel execution. For now, Gensyn's path to implementation remains full of obstacles and challenges.

Together.ai

Together is a company focused on open-source large models and committed to decentralized AI computing solutions, aiming to make AI accessible and usable by anyone, anywhere. Strictly speaking, Together is not a blockchain project, but it has preliminary answers to the latency problem of decentralized AGI computing networks. The following analysis therefore covers only Together's solution, without evaluating the company itself.

In a decentralized network that is 100 times slower than a data center, how can we achieve training and inference for large-scale models?

Imagine how the GPUs participating in such a network would be distributed in a decentralized scenario: devices spread across continents and cities, needing to be connected, with widely varying latency and bandwidth between them. The diagram below simulates such a scenario, with devices distributed across North America, Europe, and Asia, each pair with different bandwidth and latency. How can they be stitched together?

[Figure: simulated distribution of devices across North America, Europe, and Asia, with varying bandwidth and latency between regions]

Distributed training modeling: The diagram below shows a base model being trained across multiple devices, involving three communication types: forward activation, backward gradient, and lateral communication.

[Figure: the three communication types — forward activation, backward gradient, and lateral communication — in multi-device training]

Taking into account the communication bandwidth and latency, two forms of parallelism need to be considered: pipeline parallelism and data parallelism, corresponding to the three communication types in a multi-device scenario:

In pipeline parallelism, all layers of the model are divided into multiple stages, with each device processing one stage. The stage is a consecutive sequence of layers, such as multiple Transformer blocks. During forward propagation, activations are passed to the next stage, while during backward propagation, gradients of the activations are passed to the previous stage.

In data parallelism, devices independently compute gradients for different mini-batches but need to synchronize these gradients through communication.
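The sketch below mimics the two forms with plain Python objects standing in for devices; no real networking or gradient computation is involved:

```python
# Minimal sketch of the two parallelism forms described above.
import numpy as np

layers = [lambda x, i=i: np.tanh(x + i) for i in range(8)]  # a toy 8-layer model

# Pipeline parallelism: consecutive layers grouped into stages, one per device.
stages = [layers[0:4], layers[4:8]]  # "device 0" and "device 1"
def pipeline_forward(x):
    for stage in stages:             # activations "sent" between devices here
        for layer in stage:
            x = layer(x)
    return x

# Data parallelism: each replica computes gradients on its own mini-batch,
# then gradients are synchronized (here: averaged) across devices.
replica_grads = [np.random.default_rng(s).standard_normal(4) for s in range(3)]
synced_grad = np.mean(replica_grads, axis=0)  # the all-reduce communication step

print(pipeline_forward(np.zeros(4)))
print(synced_grad)
```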

Scheduling Optimization:

In a decentralized environment, the training process is often communication-bound. Scheduling algorithms generally assign communication-heavy tasks to devices with faster connections. Given the dependencies between tasks and the heterogeneity of the network, the cost of a particular scheduling strategy must be modeled. To capture the complex communication cost of training base models, Together proposes a novel formulation that decomposes the cost model into two levels using graph theory:

  • Graph theory is a branch of mathematics that primarily studies the properties and structures of graphs (networks). A graph consists of vertices (nodes) and edges (lines connecting the nodes). The main objective in graph theory is to study various properties of graphs, such as connectivity, coloring, and properties of paths and cycles in the graph.
  • The first layer is a balanced graph partitioning problem (dividing the set of vertices in a graph into several subsets of equal or approximately equal size, while minimizing the number of edges between the subsets. In this partitioning, each subset represents a partition, and communication costs are reduced by minimizing the edges between the partitions). This corresponds to the communication cost of data parallelism.
  • The second layer is a joint graph matching and traveling salesman problem (a combinatorial optimization problem that combines elements of graph matching and the traveling salesman problem. The graph matching problem is to find a matching in a graph that minimizes or maximizes a certain cost. The traveling salesman problem is to find the shortest path that visits all nodes in a graph), corresponding to the communication cost of pipeline parallelism.

[Figure: flowchart of Together's scheduling optimization process]

The figure above is a flowchart of the process; the actual implementation involves some fairly complex mathematical formulas. To make it easier to follow, the explanation below is simplified; for the detailed implementation, see the documentation on Together's official website.

Assume a device set D containing N devices, where communication between devices has uncertain delays (matrix A) and bandwidths (matrix B). From D, we first generate a balanced graph partition. Each partition (group of devices) holds roughly the same number of devices, and all devices in a group handle the same pipeline stage; this ensures that under data parallelism each group performs a similar amount of work. (Data parallelism means multiple devices execute the same task; pipeline stages mean devices execute different task steps in a specific order.) From the communication delays and bandwidths, the cost of transferring data between device groups can be calculated with the cost formula. Each balanced device group is then merged into a fully connected coarse graph, in which each node represents a pipeline stage and each edge the communication cost between two stages. To minimize communication cost, a matching algorithm determines which device groups should work together.

To further optimize, this problem can be modeled as an open-loop traveling salesman problem (open-loop means there is no need to return to the starting point), in order to find the optimal path for transferring data among all devices. Finally, Together uses an innovative scheduling algorithm to find the best allocation strategy for a given cost model, thereby minimizing communication costs and maximizing training throughput. According to measurements, even if the network is 100 times slower, the end-to-end training throughput is only approximately 1.7 to 2.3 times slower under this optimized scheduling.
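A toy version of this two-level idea can be sketched with networkx, assuming a complete device graph whose edge weights stand in for communication cost. For brevity, the open-loop TSP here runs over individual devices rather than over merged device groups, and the partition is a single bisection rather than Together's full partitioning:

```python
# Toy two-level scheduling sketch: balanced partition for data parallelism,
# then an open-loop TSP ordering for pipeline parallelism.
import itertools
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection
from networkx.algorithms.approximation import traveling_salesman_problem

# Complete graph over 8 devices; edge weight stands in for the latency/
# bandwidth-derived communication cost between a pair of devices.
G = nx.Graph()
for i, j in itertools.combinations(range(8), 2):
    G.add_edge(i, j, weight=abs(i - j) + 1)

# Level 1 (data parallelism): balanced bisection minimizing cut edges, so each
# data-parallel group does similar work with cheap internal communication.
group_a, group_b = kernighan_lin_bisection(G, weight="weight")
print("balanced groups:", sorted(group_a), sorted(group_b))

# Level 2 (pipeline parallelism): order stages with an open-loop TSP
# (cycle=False: no return to the start), minimizing inter-stage transfer cost.
order = traveling_salesman_problem(G, cycle=False)
print("pipeline order:", order)
```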

Communication Compression Optimization:


For communication compression, Together introduces the AQ-SGD algorithm (for the detailed derivation, see the paper "Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees"). AQ-SGD is a novel activation compression technique designed for the communication efficiency problem of pipeline-parallel training over slow networks. Unlike previous methods that compress activation values directly, AQ-SGD compresses the change in the activations of the same training sample across training, which introduces an interesting self-executing dynamic: as training stabilizes, the algorithm's performance is expected to improve. AQ-SGD has undergone rigorous theoretical analysis, which shows a good convergence rate under certain technical conditions and with bounded-error quantization functions. The algorithm can be implemented efficiently without adding end-to-end runtime overhead, at the cost of more memory and SSD space to store activations. Extensive experiments on sequence classification and language modeling datasets show that AQ-SGD can compress activations to 2-4 bits without sacrificing convergence. AQ-SGD can also be combined with state-of-the-art gradient compression algorithms to achieve "end-to-end communication compression," in which all data exchanged between machines — model gradients, forward activations, and backward gradients — is compressed to low precision, greatly improving the communication efficiency of distributed training. Compared with uncompressed end-to-end training on a centralized network (say, 10 Gbps), it is currently only 31% slower. Combined with the scheduling-optimization results, there is still a gap versus centralized computing networks, but the potential to catch up is significant.
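The core idea, stripped of the paper's guarantees and machinery, can be sketched as follows: cache the previous activation for a sample and transmit only a low-bit quantization of the change. The uniform 4-bit quantizer here is a simplification of AQ-SGD's actual scheme:

```python
# Sketch of the core AQ-SGD idea: quantize the *change* in activations for the
# same sample between visits, not the raw activations themselves.
import numpy as np

def quantize(delta: np.ndarray, bits: int = 4):
    scale = np.abs(delta).max() / (2 ** (bits - 1) - 1) or 1.0
    q = np.round(delta / scale).astype(np.int8)  # small integers: cheap to send
    return q, scale

rng = np.random.default_rng(0)
prev_act = rng.standard_normal(1024)                     # cached from last visit
curr_act = prev_act + 0.01 * rng.standard_normal(1024)   # training has stabilized

q, scale = quantize(curr_act - prev_act)  # communicate only the 4-bit delta
reconstructed = prev_act + q * scale      # receiver keeps prev_act cached
print(f"max reconstruction error: {np.abs(reconstructed - curr_act).max():.2e}")
```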

Conclusion

In the era of the AI wave, the AGI computing power market is undoubtedly the one with the greatest potential and highest demand among the various computing power markets — and also the one with the highest development difficulty, hardware requirements, and capital needs. Judging from the two projects above, the AGI computing power market is still some distance from being realized, and a truly decentralized network is far more complex than the ideal scenario: clearly not yet enough to compete with the cloud giants. While writing this article, I also noticed that some smaller projects still in their infancy (the PPT stage) have begun exploring new entry points, such as the less demanding inference stage or the training of small models — more practical attempts.

What final form the AGI computing power market will take is still unknown. Despite the many challenges, the decentralization and permissionlessness of AGI computing power matter in the long run: the rights to inference and training should not be concentrated in a few centralized giants. Humanity needs no new "religion" and no new "pope" — and certainly should not have to pay expensive "membership dues."

References:

1. Gensyn Litepaper: https://docs.gensyn.ai/litepaper/

2. NeurIPS 2022: Overcoming Communication Bottlenecks for Decentralized Training: https://together.ai/blog/neurips-2022-overcoming-communication-bottlenecks-for-decentralized-training-12

3. Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees: https://arxiv.org/abs/2206.01299

4. The Machine Learning Compute Protocol and our future: https://mirror.xyz/gensyn.eth/_K2v2uuFZdNnsHxVL3Bjrs4GORu3COCMJZJi7_MxByo

5. Microsoft: Earnings Release FY23 Q2: https://www.microsoft.com/en-us/Investor/earnings/FY-2023-Q2/performance

6. Contest for AI Tickets: BAT, ByteDance, Meituan competing for GPUs: https://m.huxiu.com/article/1676290.html

7. IDC: 2022-2023 Global Computing Power Index Evaluation Report: https://www.tsinghua.edu.cn/info/1175/105480.htm

8. Guosheng Securities’ Large-scale Model Training Estimation: https://www.fxbaogao.com/detail/3565665

9. Wings of Information: What is the relationship between computing power and AI? https://zhuanlan.zhihu.com/p/627645270
