In-depth Analysis | Current Status, Competitive Landscape, and Future Opportunities of the Fusion of AI and Web3 Data Industries

Uncovering the Power of AI and Web3: A Comprehensive Look into Opportunities, Competitors, and Progress in Data Fusion

The emergence of GPT has attracted global attention to large language models, and various industries are attempting to use this breakthrough technology to improve work efficiency and accelerate industry development. Future3 Campus and Footprint Analytics have jointly conducted in-depth research on the possibilities of combining AI and Web3, and have jointly released the research report “Analysis of the Integration Status, Competitive Landscape, and Future Opportunities of AI and Web3 Data Industry”.

Abstract:

  • The development of LLM technology has made people pay more attention to the combination of AI and Web3, and a new application paradigm is gradually unfolding. In this article, we will focus on how to use AI to improve the user experience and productivity of Web3 data.

  • Due to the early stage of the industry and the characteristics of blockchain technology, the Web3 data industry faces many challenges, including data sources, update frequency, anonymity, etc., making solving these problems using AI a new focus.

  • Compared to traditional artificial intelligence, LLM opens up ample room to improve the experience and productivity of blockchain data, with advantages such as scalability, adaptability, efficiency gains, task decomposition, accessibility, and ease of use.

  • LLM requires a large amount of high-quality data for training, and the blockchain field is rich in vertical knowledge and open data, which can provide learning materials for LLM.

  • LLM can also help produce and enhance the value of blockchain data, such as data cleaning, labeling, generating structured data, etc.

  • LLM is not a panacea and should be applied according to specific business needs. It is necessary to utilize the efficiency of LLM while ensuring the accuracy of the results.

  • The combination of AI and Web3 data is promoting the improvement of data processing efficiency and user experience. Currently, the exploration of LLM in the blockchain data industry mainly focuses on improving data processing efficiency through AI technology, using LLM’s interactive advantages to build AI agents, and using AI for pricing and transaction strategy analysis.

  • Currently, the application of AI in the Web3 data field still faces some challenges, such as accuracy, interpretability, commercialization, etc. Completely replacing human involvement still has a long way to go.

  • The core competitive advantage of Web3 data companies lies not only in AI technology itself, but also in data accumulation capability and the ability to deeply analyze and apply data.

  • AI may not be the solution to the commercialization of data products in the short term, and commercialization requires more productization efforts.

1. Development and Combination of AI and Web3

1.1 Development History of AI

The history of artificial intelligence (AI) can be traced back to the 1950s. Since 1956, people have paid attention to the field of artificial intelligence and gradually developed early expert systems to help solve problems in various professional fields. Subsequently, the rise of machine learning expanded the scope of AI, and AI began to be widely used across industries. Today, breakthroughs in deep learning and generative artificial intelligence have opened up vast possibilities, and each step has been marked by continuous challenges and innovation in pursuit of higher levels of intelligence and broader fields of application.

Figure 1: AI Development Journey

On November 30, 2022, ChatGPT was introduced, showcasing for the first time the possibility of low-threshold, efficient interaction between AI and humans. ChatGPT sparked a broader discussion on artificial intelligence, redefining the way we interact with AI to make it more efficient, intuitive, and human-centered. It also brought other generative AI efforts into public view, such as Anthropic (backed by Amazon), Google DeepMind, and Meta’s Llama. At the same time, professionals from various industries started actively exploring how AI can drive development in their respective fields, or sought to stand out in their industries by combining with AI technology, further accelerating the penetration of AI across domains.

1.2 The Fusion of AI and Web3

Web3’s vision began with reforming the financial system, aiming to empower users and potentially lead the transformation of modern economy and culture. Blockchain technology provides a solid technical foundation to achieve this goal, not only redesigning the mechanisms for value transfer and incentives but also supporting resource allocation and decentralized power distribution.

Figure 2: Web3 Development Journey

As early as 2020, Fourth Revolution Capital (4RC), an investment firm in the blockchain field, pointed out that the combination of blockchain technology and AI would disrupt global industries such as finance, healthcare, e-commerce, and entertainment through decentralization.

Currently, the integration of AI and Web3 focuses mainly on two directions:

  • Utilizing AI to enhance productivity and user experience.

  • Combining the transparent, secure, decentralized storage, traceability, and verifiability of blockchain technology, along with the decentralized production relations of Web3, to address pain points that traditional technology cannot solve or incentivize community participation, thereby improving production efficiency.

In the market, AI and Web3 integration are explored in the following directions:

Figure 3: Panorama of AI and Web3 Integration

  • Data: Blockchain technology can be applied to model data storage, providing encrypted datasets to protect data privacy and record the source and usage of model data, as well as verify its authenticity. By accessing and analyzing data stored on the blockchain, AI can extract valuable information for model training and optimization. At the same time, AI can also serve as a data production tool to improve the production efficiency of Web3 data.

  • Algorithms: The algorithms in Web3 can provide a more secure, trustworthy, and autonomous computing environment for AI, offering encryption protection for model parameters to prevent system abuse or malicious operations. AI can interact with the algorithms in Web3, such as executing tasks and validating data through smart contracts. At the same time, AI algorithms can provide Web3 with more intelligent and efficient decision-making and services.

  • Computing Power: Web3’s decentralized computing resources can provide high-performance computing capabilities for AI. AI can utilize the decentralized computing resources in Web3 for model training, data analysis, and predictions. By distributing computing tasks to multiple nodes on the network, AI can accelerate computation speed and handle larger-scale data.

In this article, we will focus on exploring how to use AI technology to improve the productivity and user experience of Web3 data.

2. Web3 Data Status

2.1 Web2 & Web3 Data Industry Comparison

As the core component of AI, “data” in Web3 is different from what we are familiar with in Web2. The difference lies mainly in the application architecture of Web2 and Web3, which leads to different data characteristics.

2.1.1 Web2 & Web3 Application Architecture Comparison

Figure 4: Web2 & Web3 Application Architecture

In the Web2 architecture, web pages or apps are usually controlled by a single entity (usually a company). The company has absolute control over the content and logic they build. They can decide who can access the content and logic on their servers, as well as the rights users have and the duration that the content exists online. Many cases have shown that internet companies have the right to change the rules on their platforms or even terminate services for users, and users cannot retain the value they have created.

On the other hand, the Web3 architecture relies on the concept of a universal state layer, where some or all of the content and logic are placed on a public blockchain. This content and logic are publicly recorded on the blockchain and accessible to everyone, and users can directly control the on-chain content and logic. In Web2, users need an account or API key granted by a company to interact with the content on its servers; in Web3, by contrast, users can interact with on-chain content and logic without authorization from any single entity (except for specific administrative operations).

2.1.2 Web2 vs Web3 Data Characteristics

Figure 5: Web2 vs Web3 Data Characteristics

Web2 data is typically closed and highly restricted, with complex permission controls, mature standards for various data formats, and complex business logic abstractions. These data are large in scale but have relatively low interoperability. They are usually stored on central servers and do not focus on privacy protection, most of which are non-anonymous.

In contrast, Web3 data is more open and accessible, although it has lower maturity and mainly consists of unstructured data with rare standardization and simplified business logic abstractions. Web3 data has a relatively smaller scale compared to Web2, but it has higher interoperability (such as EVM compatibility) and can be stored in a decentralized or centralized manner. It also emphasizes user privacy, and users usually interact with it in an anonymous manner.

2.2 Web3 Data Industry Status, Outlook, and Challenges

In the Web2 era, data was as precious as oil reserves, and accessing and obtaining large-scale data was always a major challenge. In Web3, the openness and sharing of data make it seem like “oil is everywhere,” enabling AI models to easily obtain more training data, which is crucial for improving model performance and intelligence. However, there are still many unresolved issues in handling the data of this “new oil” in Web3, including the following challenges:

  • Data Source: On-chain data is “standard” but complex and scattered, requiring a significant amount of manual labor for data processing.

When processing on-chain data, the time-consuming and labor-intensive indexing process needs to be repeatedly executed. Developers and data analysts need to spend a lot of time and resources to adapt to the data differences between different chains and projects. The on-chain data industry lacks unified production and processing standards. In addition to what is recorded on the blockchain ledger, events, logs, and traces are mostly defined and produced (or generated) by individual projects. This makes it difficult for non-professional traders to identify and find the most accurate and reliable data, increasing the difficulties they face in on-chain trading and investment decisions. For example, decentralized exchanges like Uniswap and PancakeSwap may have differences in data processing methods and data accuracy. The checks and standardization processes during these procedures further complicate data processing.

  • Data Updates: On-chain data is large in volume and updated frequently, making it difficult to process into structured data in a timely manner.

Blockchain is constantly changing, with data updates occurring at the level of seconds or even milliseconds. The frequent generation and updates of data make it difficult to maintain high-quality data processing and timely updates. Therefore, an automated processing workflow is crucial, but it also poses a major challenge in terms of cost and efficiency. The Web3 data industry is still in its early stages. With the emergence of new contracts and iterative updates, the lack of data standards and the diversity of formats further complicate data processing.

  • Data Analysis: The anonymous nature of on-chain data makes it difficult to distinguish data identities.

On-chain data usually does not contain sufficient information to clearly identify the identity of each address, making it difficult to link the data with off-chain economic, social, or legal trends. However, the movements of on-chain data are closely related to the real world, and understanding the correlation between on-chain activities and specific individuals or entities in the real world is crucial for specific scenarios such as data analysis.

As discussions on productivity changes triggered by Large Language Models (LLMs) technology arise, whether AI can be used to solve these challenges has become one of the focal points of the Web3 field.

3. The Chemical Reaction Produced by the Collision of AI and Web3 Data

3.1 Comparison of Traditional AI and LLM Features

In terms of model training, traditional AI models are usually smaller in scale, with parameter numbers ranging from tens of thousands to millions, but to ensure the accuracy of output results, a large amount of manually labeled data is required. The reason why LLM is so powerful is partly because it uses massive corpora to fit hundreds of billions or even trillions of parameters, greatly enhancing its ability to understand natural language. However, this also means that more data is needed for training, resulting in high training costs.

In terms of capabilities and operating modes, traditional AI is more suitable for specific domain tasks, providing relatively accurate and specialized answers. In comparison, LLM is more suitable for general tasks but is prone to hallucination. This means that in some cases its answers may not be precise or professional, or may even be completely wrong. Therefore, if objective, trustworthy, and traceable results are required, it may be necessary to perform multiple checks, multiple rounds of training, or introduce additional error-correction mechanisms and frameworks.

Figure 6: Comparison of Traditional AI and Large Language Model (LLM) Features

3.1.1. Traditional AI in the Web3 Data Field

Traditional AI has shown its importance in the blockchain data industry, bringing more innovation and efficiency to this field. For example, the 0xScope team adopts AI technology to build a graph-based cluster analysis algorithm that accurately identifies related addresses among users through weight distribution of different rules. The application of this deep learning algorithm improves the accuracy of address clustering and provides more accurate tools for data analysis. Nansen uses AI for NFT price prediction, providing insights into trends in the NFT market through data analysis and natural language processing techniques. Trusta Labs, on the other hand, uses machine learning methods based on asset graph mining and user behavior sequence analysis to enhance the reliability and stability of its Sybil detection solution, helping to maintain the security of the blockchain network ecosystem. Additionally, Goplus utilizes traditional AI in its operations to improve the security and efficiency of decentralized applications (dApps). It collects and analyzes security information from dApps and provides fast risk alerts, helping to reduce the risk exposure of these platforms. This includes detecting risks in dApp master contracts by evaluating factors such as open-source status and potential malicious behavior, as well as collecting detailed audit information including audit company credentials, audit time, and audit report links. Footprint Analytics uses AI to generate the code that produces structured data and to support investigations such as NFT trade analysis, wash-trading detection, and bot account filtering.

However, traditional AI has limited information and focuses on using pre-defined algorithms and rules to perform preset tasks, while LLM can understand and generate natural language through large-scale natural language data learning, making it more suitable for handling complex and massive text data.

Recently, with significant advancements in LLM, people have started to think and explore the combination of AI and Web3 data.

3.1.2. Advantages of LLM

LLM has the following advantages compared to traditional artificial intelligence:

  • Scalability: LLM supports large-scale data processing

LLM has excellent scalability and can efficiently handle large amounts of data and user interactions. This makes it highly suitable for tasks that require large-scale information processing, such as text analysis or large-scale data cleaning. Its high data processing capability provides strong analytical and application potential for the blockchain data industry.

  • Adaptability: LLM can learn and adapt to multiple domain requirements

LLM has exceptional adaptability, allowing it to be fine-tuned for specific tasks or embedded in industry or private databases, enabling it to quickly learn and adapt to subtle differences in different domains. This feature makes LLM an ideal choice for solving multi-domain and multi-purpose problems, providing broader support for the diversity of blockchain applications.

  • Improved efficiency: LLM automates tasks to enhance efficiency

LLM’s high efficiency brings significant convenience to the blockchain data industry. It automates tasks that would normally require a large amount of manual time and resources, thereby increasing productivity and reducing costs. LLM can generate large amounts of text, analyze massive datasets, or perform various repetitive tasks in seconds, reducing waiting and processing times, making blockchain data processing more efficient.

  • Task decomposition: Generates specific plans for certain tasks, breaking down big tasks into small steps

LLM Agent has the unique capability to generate specific plans for certain tasks, breaking down complex tasks into manageable small steps. This feature is beneficial for handling large-scale blockchain data and performing complex data analysis tasks. By breaking down large tasks into smaller ones, LLM can better manage the data processing flow and produce high-quality analysis.

This capability is crucial for AI systems that execute complex tasks, such as robot automation, project management, natural language understanding and generation, enabling them to transform high-level task objectives into detailed action plans, improving task execution efficiency and accuracy.

  • Accessibility and user-friendliness: LLM provides user-friendly interaction through natural language

LLM’s accessibility allows more users to easily interact with data and systems, making these interactions more user-friendly. Through natural language, LLM makes data and systems more accessible and interactive without users having to learn complex technical terms or specific commands such as SQL, R, Python, etc., for data retrieval and analysis. This feature broadens the audience range for blockchain applications, allowing more people to access and use Web3 applications and services, regardless of their technical proficiency, thereby promoting the development and popularization of the blockchain data industry.

3.2 Integration of LLM and Web3 Data

Figure 7: Integration of blockchain data and LLM

Training large language models requires reliance on large-scale data, building models by learning patterns within the data. The interactions and behavioral patterns inherent in blockchain data fuel LLM’s learning. The quantity and quality of data also directly affect the learning effectiveness of LLM models.

Data is not only a consumable for LLM, but LLM also contributes to data production and can provide feedback. For example, LLM can assist data analysts in data preprocessing, such as data cleaning and labeling, or generating structured data, removing noise from the data, and highlighting useful information.

3.3 Enhancing LLM with Common Technical Solutions

The emergence of ChatGPT not only showcases the general capability of LLM in solving complex problems but also triggers a global exploration of overlaying external capabilities onto the general capability. This includes enhancing general capabilities (such as context length, complex reasoning, mathematics, code, multimodality, etc.) and expanding external capabilities (handling unstructured data, using more sophisticated tools, interacting with the physical world, etc.). Integrating domain-specific knowledge in the crypto field and personalized private data with the general capabilities of large models is a core technological challenge in commercializing large models in the crypto vertical.

Currently, most applications focus on Retrieval-Augmented Generation (RAG), built on techniques such as prompt engineering and embeddings. Existing agent tools also mainly focus on improving the efficiency and accuracy of RAG workflows. The main reference architectures for the LLM-based application stack in the market are as follows:

  • Prompt Engineering

Figure 8: Prompt Engineering

Currently, most practitioners adopt the most basic solution, prompt engineering, when building applications. This approach designs specific prompts that shape the model’s input to meet the application’s specific requirements, and it is the most convenient and fastest practice. However, basic prompt engineering has limitations, such as the model’s built-in knowledge becoming outdated, content redundancy, and limits on input context length and multi-turn question answering.
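As an illustration, a minimal prompt-engineering sketch in Python follows; the `call_llm` helper and the example schema are hypothetical placeholders for whichever LLM API and dataset a team actually uses, not part of the original report.

```python
# Minimal prompt-engineering sketch (call_llm is a hypothetical stand-in for any LLM API).

def build_prompt(question: str, table_schema: str) -> str:
    """Wrap a user question with role, context, and output-format instructions."""
    return (
        "You are an analyst of on-chain data. Answer strictly from the schema below.\n"
        f"Schema:\n{table_schema}\n\n"
        "If the question cannot be answered from the schema, say so explicitly.\n"
        f"Question: {question}\n"
        "Answer concisely and cite the columns you used."
    )

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real client call (e.g. an OpenAI- or Llama-style chat API).
    raise NotImplementedError

if __name__ == "__main__":
    schema = "dex_trades(block_time TIMESTAMP, project TEXT, amount_usd DOUBLE)"
    prompt = build_prompt("Which project had the highest USD volume last week?", schema)
    print(prompt)  # In practice, the prompt would be passed to call_llm(prompt).
```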

Therefore, the industry is also researching more advanced improvement solutions, including embedding and fine-tuning.

  • Embedding

Embedding is a widely used data representation method in the field of artificial intelligence that efficiently captures the semantic information of objects. By mapping object attributes into vector form, embedding technology can quickly find the most likely correct answer by analyzing the relationships between vectors. Embedding can be built on top of LLM to leverage the rich language knowledge learned by the model on a wide-ranging corpus. By introducing specific task or domain information into the pre-trained large model using embedding technology, the model becomes more specialized and adapted to specific tasks while retaining the generality of the base model.

In simple terms, embedding is like giving a well-rounded college student a specialized reference book with task-specific knowledge to complete a task. They can refer to the reference book at any time to solve specific problems.
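A minimal sketch of this retrieval idea is shown below, assuming the open-source sentence-transformers package; the model name and documents are illustrative only and not drawn from the report.

```python
# Embedding-based retrieval sketch: map texts to vectors, then rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed to be installed

docs = [
    "Uniswap is a decentralized exchange protocol on Ethereum.",
    "An NFT wash trade is a sale between addresses controlled by the same party.",
    "Gas fees are paid to validators for including transactions in a block.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
doc_vecs = model.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2):
    """Return the k documents whose embeddings are closest to the query embedding."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]

# The retrieved snippets would then be pasted into the LLM prompt as task-specific context.
print(retrieve("What is wash trading on NFT marketplaces?"))
```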

  • Fine-tuning

Figure 9: Fine Tuning

Fine-tuning, unlike embedding, adapts a pre-trained language model to a specific task by updating its parameters. This approach allows the model to perform better on specific tasks while maintaining its generality. The core idea of fine-tuning is to adjust model parameters to capture specific patterns and relationships relevant to the target task. However, the general capability of the fine-tuned model is still limited by the base model itself.

In simple terms, fine-tuning is like giving a well-rounded college student specialized courses in their field of study, allowing them to master specialized knowledge beyond their general abilities and solve problems in their specialized area.
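The sketch below shows only the data-preparation half of fine-tuning: domain question/answer pairs are serialized into the JSONL chat format commonly accepted by hosted fine-tuning APIs. The example pairs and file name are made up for illustration.

```python
# Fine-tuning data-preparation sketch: write domain Q&A pairs as chat-format JSONL.
import json

examples = [
    ("What does TVL mean?", "Total Value Locked: the value of assets deposited in a protocol."),
    ("What is a Sybil address?", "An address controlled by one actor pretending to be many users."),
]

with open("crypto_finetune.jsonl", "w", encoding="utf-8") as f:
    for question, answer in examples:
        record = {
            "messages": [
                {"role": "system", "content": "You are a crypto data analyst."},
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# The resulting file would then be uploaded to a fine-tuning job; the base model and
# hyperparameters depend on the provider and are omitted here.
```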

  • Re-training LLM

Although the current LLM is powerful, it may not meet all requirements. Retraining LLM is a highly customized solution that involves introducing new datasets and adjusting model weights to make it more suitable for specific tasks, requirements, or domains. However, this approach requires a significant amount of computational resources and data, and managing and maintaining the retrained model is also a challenge.

  • Agent Model

Figure 10: Agent Model

The Agent Model is a method of building intelligent agents, with LLM as the core controller. This system also includes several key components to provide a more comprehensive intelligence.

  • Planning: Breaking down large tasks into smaller tasks to make them easier to complete

  • Memory: Improving future plans by reflecting on past actions

  • Tools: The agent can call external tools to obtain more information, such as search engines, calculators, etc.

The AI agent model has powerful language understanding and generation capabilities; it can solve general problems, decompose tasks, and self-reflect, giving it wide-ranging potential across applications. However, the agent model also has limitations, such as limited context length, error-prone long-term planning and task decomposition, and unreliable output content. These limitations require long-term research and innovation to further expand the application of agent models in different fields.
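A minimal agent-loop sketch is shown below, combining the planning, memory, and tool components listed above. The `llm()` function and both tools are hypothetical placeholders; a real system would plug in an actual LLM API and data services.

```python
# Agent-model sketch: plan -> call tools -> reflect. llm() and the tools are placeholders.
from typing import Callable, Dict, List

def llm(prompt: str) -> str:
    # Placeholder for a real chat-completion call.
    raise NotImplementedError

TOOLS: Dict[str, Callable[[str], str]] = {
    "sql_query": lambda q: "rows: ...",        # e.g. run q against a data warehouse
    "price_lookup": lambda sym: "price: ...",  # e.g. call a market-data API
}

def run_agent(task: str, max_steps: int = 5) -> str:
    memory: List[str] = []  # reflection on past actions
    plan = llm(f"Break this task into numbered steps: {task}")  # planning
    for step in plan.splitlines()[:max_steps]:
        decision = llm(
            f"Task: {task}\nStep: {step}\nMemory: {memory}\n"
            f"Pick one tool from {list(TOOLS)} and give its input, as 'tool: input'."
        )
        tool_name, _, tool_input = decision.partition(":")
        observation = TOOLS.get(tool_name.strip(), lambda x: "unknown tool")(tool_input.strip())
        memory.append(f"{step} -> {observation}")  # memory of what happened
    return llm(f"Task: {task}\nObservations: {memory}\nWrite the final answer.")
```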

The various technologies mentioned above are not mutually exclusive. They can be used together in the training and enhancement process of the same model. Developers can fully leverage the potential of existing large language models and try different approaches to meet increasingly complex application requirements. This integrated use not only helps improve the performance of the model but also promotes rapid innovation and progress in Web3 technologies.

However, we believe that while existing LLMs (such as OpenAI’s models, Llama 2, and other open-source LLMs) have played an important role in the rapid development of Web3, teams should progress from shallow to deep: start with prompt engineering and RAG strategies such as embeddings, and only then carefully consider fine-tuning and retraining base models.

3.4 How LLM Accelerates Various Processes of Blockchain Data Production

3.4.1 General Processing Flow of Blockchain Data

In the current blockchain space, builders are gradually realizing the value of data products. This value covers several areas, including product operation monitoring, predictive modeling, recommendation systems, and data-driven applications. Although this awareness is growing, data processing, which is an essential step from data acquisition to data application, is often overlooked.

Figure 12: Blockchain Data Processing Flow

  • Convert the blockchain’s original unstructured data, such as events or logs, into structured data

Every transaction or event on the blockchain generates events or logs, which are usually unstructured data. This step is the first entry point to obtain data, but the data still needs further processing to extract useful information and obtain structured raw data. This includes organizing data, handling exceptional cases, and converting it into a common format.

  • Convert structured raw data into meaningful abstract tables

After obtaining structured raw data, further business abstraction is needed to map the data to business entities and metrics, such as transaction volume, user count, and other business indicators. This transformation turns raw data into meaningful data for business and decision-making purposes.

  • Calculate and extract business metrics from abstract tables

With abstracted business data, further calculations can be performed to derive important metrics, such as the monthly growth rate of total transaction volume, user retention rate, and other core indicators. These indicators can be computed using tools such as SQL and Python, making it possible to monitor business health, understand user behavior and trends, and support decision-making and strategic planning.
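The toy pipeline below sketches these three steps with pandas: parse a raw event payload into structured rows, abstract them into monthly business figures, and derive a growth metric. The event layout and numbers are invented for illustration and do not reflect any real chain’s format.

```python
# Sketch of the three-step flow: raw event -> structured rows -> business metric.
import pandas as pd

raw_logs = [  # step 1 input: unstructured event payloads (illustrative)
    {"block_time": "2023-09-03", "data": "swap|0xabc|1200.5"},
    {"block_time": "2023-10-14", "data": "swap|0xdef|800.0"},
    {"block_time": "2023-10-20", "data": "swap|0xabc|300.0"},
]

# Step 1: parse each payload into structured columns.
rows = []
for log in raw_logs:
    action, trader, amount = log["data"].split("|")
    rows.append({"block_time": pd.Timestamp(log["block_time"]),
                 "action": action, "trader": trader, "amount_usd": float(amount)})
trades = pd.DataFrame(rows)

# Step 2: business abstraction -- monthly volume and active traders.
trades["month"] = trades["block_time"].dt.to_period("M")
monthly = trades.groupby("month").agg(
    volume_usd=("amount_usd", "sum"), traders=("trader", "nunique"))

# Step 3: derived metric -- month-over-month volume growth.
monthly["volume_growth"] = monthly["volume_usd"].pct_change()
print(monthly)
```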

3.4.2 Optimization of Blockchain Data Generation Process with the Integration of LLM

LLM solves multiple issues in blockchain data processing, including but not limited to:

Processing unstructured data:

  • Extracting structured information from transaction logs and events: LLM can analyze blockchain transaction logs and events to extract key information such as transaction amount, transaction addresses, timestamps, etc., and convert unstructured data into business-meaningful data, making it easier to analyze and understand.

  • Cleansing data and identifying anomalies: LLM can automatically identify and cleanse inconsistent or abnormal data, helping to ensure data accuracy and consistency, thereby improving data quality.

Performing business abstraction:

  • Mapping raw on-chain data to business entities: LLM can map raw blockchain data to business entities, for example, mapping blockchain addresses to actual users or assets, making business processing more intuitive and effective.

  • Processing unstructured on-chain content and labeling: LLM can analyze unstructured data, such as sentiment analysis results from Twitter, and label them as positive, negative, or neutral sentiments, helping users better understand sentiment tendencies on social media.

Interpreting natural language data:

  • Calculating core metrics: Based on business abstraction, LLM can calculate core business metrics such as user transaction volume, asset value, market share, etc., to help users better understand the key performance of their business.

  • Querying data: LLM can understand user intent expressed in natural language and generate the corresponding SQL queries, allowing users to submit query requests in plain language without writing complex SQL. This increases the accessibility of database queries (a minimal sketch follows this list).

  • Indicator selection, sorting, and correlation analysis: LLM can assist users in selecting, sorting, and analyzing multiple indicators to better understand their relationships and correlations, thereby supporting deeper data analysis and decision-making.

  • Generating natural language descriptions of business abstractions: LLM can generate natural language summaries or explanations based on factual data, helping users better understand business abstractions and data indicators, increasing interpretability, and making decisions more rational.
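Below is a minimal natural-language-to-SQL sketch referenced in the “Querying data” item above. The schema, question, and `call_llm()` function are hypothetical placeholders; the only safeguard shown is a simple read-only check on the generated statement.

```python
# Natural-language-to-SQL sketch: schema, question, and call_llm() are placeholders.

SCHEMA = (
    "nft_trades(block_time TIMESTAMP, collection TEXT, "
    "buyer TEXT, seller TEXT, price_usd DOUBLE)"
)

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call.
    raise NotImplementedError

def question_to_sql(question: str) -> str:
    prompt = (
        "Translate the question into one read-only SQL SELECT statement.\n"
        f"Tables:\n{SCHEMA}\n"
        f"Question: {question}\nSQL:"
    )
    sql = call_llm(prompt).strip().rstrip(";")
    # Guardrail: never trust generated SQL blindly; reject anything that is not a SELECT.
    if not sql.lower().startswith("select"):
        raise ValueError(f"Refusing to run non-SELECT statement: {sql}")
    return sql

# Example usage (the generated SQL would then run against a read-only replica):
# print(question_to_sql("Which collection had the highest USD volume in October 2023?"))
```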

3.5 Current Use Cases

Based on LLM’s own technology and product experience advantages, it can be applied to different on-chain data scenarios. These scenarios can be divided into four categories from easy to difficult:

  • Data transformation: performing operations such as data enhancement and reconstruction, such as text summarization, classification, and information extraction. These applications are developed quickly but are more suitable for general scenarios and not ideal for simple batch processing of large amounts of data.

  • Natural language interface: connecting LLM to knowledge bases or tools to automate question answering or basic tool usage. This can be used to build professional chatbots, but its actual value is influenced by factors such as the quality of the connected knowledge base.

  • Workflow automation: using LLM to standardize and automate business processes. This can be applied to more complex blockchain data processing processes, such as deconstructing smart contract execution processes, risk identification, etc.

  • Assisting robots and assistant systems: Assistant systems are enhanced systems that integrate more data sources and functionalities on top of natural language interfaces, greatly improving user work efficiency.

Figure 11: LLM application scenarios

3.6 Limitations of LLM

3.6.1 Industry status: relatively mature applications and unresolved challenges

In the Web3 data field, despite some important progress, there are still some challenges.

Relatively mature applications:

  • Using LLM for information processing: LLM and other AI technologies have been successfully used for tasks such as text summarization, abstraction, and explanation, helping users extract key information from long articles and professional reports and improving the readability and comprehensibility of data.

  • Using AI to solve development problems: LLM has been applied to solve problems in the development process, such as replacing Stack Overflow or search engines, providing developers with problem-solving and programming support.

Challenges to be addressed and explored:

  • Generating code using LLM: The industry is working to apply LLM technology to the conversion from natural language to SQL query language, in order to improve the automation and understandability of database queries. However, there are many difficulties in this process. For example, in certain contexts, the generated code requires high accuracy, syntax must be 100% correct to ensure bug-free execution of the program and obtain correct results. Challenges also include ensuring successful and accurate answers to questions and deep understanding of the business.

  • Data labeling: Data labeling is crucial for training machine learning and deep learning models, but in the Web3 data field, especially when dealing with anonymous blockchain data, the complexity of labeling data is high.

  • Accuracy and hallucination issues: The occurrence of hallucinations in AI models can be influenced by multiple factors, including biased or insufficient training data, overfitting, limited contextual understanding, lack of domain knowledge, adversarial attacks, and model architecture. Researchers and developers need to continuously improve the training and calibration methods of models to improve the credibility and accuracy of generated text.

  • Using data for business analysis and content generation: Using data for business analysis and content generation is still a challenging problem. The complexity of the problem, the need for well-designed prompts, high-quality data, data quantity, and methods to reduce hallucination issues are all unresolved issues.

  • Automatically indexing smart contract data for data abstraction based on the business domain: Automatically indexing smart contract data for data abstraction in different business domains is still an unsolved problem. This requires comprehensive consideration of the characteristics of different business domains, as well as the diversity and complexity of data.

  • Handling more complex modalities such as time-series data and tabular document data: models like DALL·E 2 excel at generating images from text, and related generative models cover speech and other common modalities. However, in the blockchain and financial domains, some time-series data needs special treatment, and simply vectorizing text is not enough. Joint training on time-series data, cross-modal joint training, and other techniques are important research directions for achieving intelligent analysis and applications.

3.6.2 Why relying solely on LLM cannot solve the problems of the blockchain data industry

As a language model, LLM is more suitable for handling scenarios that require fluency, but when it comes to accuracy, further adjustments may be needed. When applying LLM to the blockchain data industry, the following framework can provide some guidance.

Figure 13: Fluency, accuracy, and use case risk of LLM output in the blockchain data industry

When assessing the suitability of LLM in different applications, focusing on fluency and accuracy is crucial. Fluency refers to whether the model’s output is natural and coherent, while accuracy indicates whether the model’s answers are correct. These two dimensions have different requirements in different application scenarios.

For tasks that require high fluency, such as natural language generation and creative writing, LLM typically performs well because of its powerful natural language processing capabilities, which enable it to generate smooth text.

Blockchain data faces challenges in data parsing, data processing, and data application, among others. LLM possesses excellent language comprehension and reasoning abilities, making it an ideal tool for interacting with, organizing, and summarizing blockchain data. However, LLM cannot solve all problems in the field of blockchain data.

In terms of data processing, LLM is more suitable for rapid iteration and exploratory processing of on-chain data, constantly trying out new processing methods. However, LLM still has certain limitations in tasks such as detailed cross-checking in production environments. Typical issues include insufficient token length to handle long contexts, time-consuming prompts, unstable answers affecting downstream tasks, and inefficiency in executing large-scale tasks.

Furthermore, LLM is prone to hallucination during content processing. It is estimated that ChatGPT hallucinates roughly 15% to 20% of the time, and many errors are difficult to detect due to the opacity of its processing. Therefore, establishing frameworks and combining them with expert knowledge become crucial. In addition, there are many challenges in combining LLM with on-chain data:

  • On-chain data comprises various entity types and a large quantity. Feeding it to LLM in an effective manner for specific commercial scenarios, similar to other vertical industries, requires more research and exploration.

  • On-chain data includes structured and unstructured data. Currently, most data solutions in the industry are based on understanding business data. In the process of parsing on-chain data, using ETL to filter, clean, supplement, and restore business logic can further organize unstructured data into structured data, providing more efficient analysis for various business scenarios in the later stages. For example, structured DEX trades, NFT marketplace transactions, wallet address portfolios, etc., possess the aforementioned characteristics of high quality, high value, accuracy, and authenticity, and can provide efficient supplements to general-purpose LLM.

4. Common Misunderstandings about LLM

4.1 Can LLM directly handle unstructured data, rendering structured data unnecessary?

LLM is usually pre-trained on massive amounts of text data, making it naturally suitable for processing various types of unstructured textual data. However, different industries already possess a large amount of structured data, especially in the Web3 field where data has been parsed. How to effectively utilize this data and enhance LLM is a hot research topic in the industry.

For LLM, structured data still possesses the following advantages:

  • Massive: A large amount of data is stored in databases and other standard formats behind various applications, especially private data. Each company and industry still possess a large amount of data behind closed doors that LLM has not been trained on.

  • Already existing: This data does not need to be reproduced and has a very low entry cost. The only problem is how to make use of it.

  • High quality and high value: Accumulated expertise within the field is often embedded in structured data and used for industry-academia research. The quality of structured data is key to data usability, including completeness, consistency, accuracy, uniqueness, and factualness.

  • Efficiency: Structured data is stored in tables, databases, or other standardized formats. The schema is predefined and remains consistent throughout the dataset. This means that data format, type, and relationships are predictable and controllable, making data analysis and querying simpler and more reliable. Moreover, the industry already has mature ETL and various data processing and management tools, which are more efficient and convenient for use. LLM can utilize this data through APIs.

  • Accuracy and factualness: LLM’s output, being based on token probabilities, currently cannot produce exact answers stably, and the hallucination problem has always been a core, fundamental problem for LLM to solve. For many industries and scenarios, such as healthcare and finance, this can lead to safety and reliability issues. Structured data can help LLM correct and mitigate these problems.

  • Representation of relationship graphs and specific business logic: Different types of structured data can be input into LLM in specific organizational forms (such as relational databases, graph databases, etc.) to address different domain problems. Structured data uses standardized query languages (such as SQL), making complex queries and analysis more efficient and accurate. Knowledge graphs can better express the relationships between entities and make association queries easier.

  • Low usage cost: LLM does not need to retrain the entire underlying model from scratch every time. By combining agents and LLM APIs, LLM can be accessed faster and at a lower cost.

Currently, there are still overly optimistic views in the market holding that LLM is so strong at processing textual and unstructured information that simply importing raw data, including unstructured data, into LLM will produce the desired results. This is similar to expecting a general LLM to solve math problems: without a specifically built mathematical model, most LLMs make errors even on simple elementary arithmetic. Instead, building vertical models for the crypto domain, analogous to dedicated math or image-generation models, is the practical way for LLM to be applied well in the crypto field.

4.2 Can LLM infer content from text information like news and Twitter, eliminating the need for on-chain data analysis to draw conclusions?

While LLM can gather information from sources such as news and social media, insights directly obtained from on-chain data are still invaluable. The main reasons are:

  • On-chain data is the raw, firsthand information, whereas news and social media can be biased or misleading. Analyzing on-chain data directly can reduce information bias. Although there is a risk of understanding bias when using LLM for text analysis, directly analyzing on-chain data can minimize misinterpretation.

  • On-chain data contains comprehensive historical interactions and transaction records, allowing for the discovery of long-term trends and patterns. On-chain data can also provide a holistic view of the entire ecosystem, such as fund flows and relationships between parties. These macro insights contribute to a deeper understanding of the situation. News and social media information, on the other hand, are often fragmented and short-term.

  • On-chain data is open. Anyone can verify the analysis results, avoiding information asymmetry. News and social media may not always disclose information accurately. Textual information and on-chain data can verify each other. Integrating the two can lead to more comprehensive and accurate judgments.

On-chain data analysis remains indispensable. LLM’s role is to assist in extracting information from text, but it cannot replace direct analysis of on-chain data. Utilizing the strengths of both is key to achieving the best results.

4.3 Is it easy to build blockchain data solutions on top of LLM using tools like LangChain, LlamaIndex, or other AI tools?

Tools like LangChain and LlamaIndex provide convenience for building custom LLM applications, making rapid development possible. However, successfully applying these tools in real production environments presents more challenges. Building an efficient and high-quality LLM application is a complex task that requires a deep understanding of blockchain technology and the workings of AI tools, as well as effectively integrating them. This is an important but challenging task for the blockchain data industry.

In this process, it is crucial to recognize the characteristics of blockchain data, which demand high precision and verifiability. Once data is processed and analyzed through LLM, users have high expectations for its accuracy and trustworthiness. This potentially conflicts with LLM’s fuzzy fault-tolerant nature. Therefore, when constructing blockchain data solutions, it is essential to carefully weigh these two demands to meet user expectations.

In the current market, although some foundational tools already exist, this field is still rapidly evolving and iterating. Similar to the development process in the Web2 world, which evolved from early scripting languages like PHP to more mature and scalable solutions like Java, Ruby, Python, as well as JavaScript and Node.js, and newer technologies like Go and Rust, the AI tool landscape is also continuously changing. Emerging GPT frameworks like AutoGPT, Microsoft AutoGen, and OpenAI’s recently released GPT-4 Turbo showcase only a fraction of the future possibilities. This indicates that there is still ample room for development in the blockchain data industry and AI technology, requiring continuous effort and innovation.

When using LLM, there are two traps to be particularly aware of:

  • High expectations: Many people believe that LLM can solve all problems, but in reality LLM has clear limitations. It requires substantial computational resources, training is costly, and the training process may be unstable. Expectations of LLM’s capabilities should be realistic: it performs well in certain scenarios, such as natural language processing and text generation, but may not be competent in other fields.

  • Ignoring business requirements: Another trap is forcefully applying LLM technology without fully considering business needs. Before applying LLM, it is essential to clarify specific business requirements. The suitability of LLM as the best technology choice needs to be evaluated, and risk assessment and control should be done. The effective application of LLM emphasizes careful consideration based on the actual situation to avoid misuse.

Although LLM has great potential in many fields, developers and researchers need to exercise caution and maintain an open exploratory attitude when using LLM in order to find more suitable application scenarios and maximize its advantages.

5. Current Status and Development Roadmap of Web3 Data Industry Combined with AI

5.1 Dune

Dune is a leading open data analytics community in the current Web3 industry. It provides tools for blockchain querying, extraction, and visualization of a large amount of data. Users and data analysts can use simple SQL queries to query on-chain data from the pre-populated Dune database and create corresponding charts and insights.

In March 2023, Dune proposed plans regarding AI and future integration with LLM, and released its Dune AI product in October. The core focus of Dune AI-related products is to utilize LLM’s powerful language and analytical capabilities to enhance the Wizard UX and provide better data querying and SQL writing on Dune.

(1) Query interpretation: In the product released in March, users can obtain a natural language explanation of SQL queries by clicking a button. This aims to help users better understand complex SQL queries, thereby improving the efficiency and accuracy of data analysis.

(2) Query translation: Dune plans to unify the different SQL query engines on Dune (such as Postgres and Spark SQL) into DuneSQL, with LLM providing automated query-language translation to help users transition smoothly, which also benefits the promotion of the DuneSQL product.

(3) Natural language querying: Dune AI, released in October, allows users to ask questions and retrieve data using plain English. The goal of this feature is to enable users without SQL knowledge to easily access and analyze data.

(4) Search optimization: Dune plans to improve search functionality using LLM, helping users filter information more effectively.

(5) Wizard knowledge base: Dune plans to release a chatbot that helps users quickly navigate blockchain and SQL knowledge in the Spellbook and Dune documentation.

(6) Simplifying SQL writing (Dune Wand): In August, Dune introduced the Wand series of SQL tools. Create Wand allows users to generate complete queries from natural language prompts, Edit Wand allows users to modify existing queries, and the Debug feature automatically debugs syntax errors in queries. The core of these tools is LLM technology, which simplifies the query writing process, allowing analysts to focus on the core logic of analyzing data without worrying about code and syntax issues.

5.2 Footprint Analytics

Footprint Analytics is a blockchain data solutions provider that utilizes artificial intelligence technology to offer a codeless data analysis platform, a unified data API product, and the Footprint Growth Analytics BI platform for Web3 projects.

One of Footprint’s strengths lies in its development of an on-chain data production line and ecological tools. By establishing a unified data lake to bridge on-chain and off-chain data, as well as meta-databases for business registration similar to on-chain records, Footprint ensures the accessibility, user-friendliness, and quality of data for analysis and usage. Footprint’s long-term strategy is to focus on technological depth and platform construction in order to create a “machine factory” capable of producing on-chain data and applications.

The combination of Footprint’s products with AI is as follows:

Since the release of the LLM model, Footprint has been exploring the integration of existing data products with AI to improve the efficiency of data processing and analysis, and to create more user-friendly products. In May 2023, Footprint began offering data analysis capabilities with natural language interaction to its users. It upgraded its existing codeless platform to a more advanced product feature, allowing users to quickly obtain data and generate charts through conversation without the need for familiarity with platform tables and design.

Currently, most data products in the market that combine LLM and Web3 technology primarily focus on reducing user barriers and changing interaction paradigms. However, Footprint’s key focus in product and AI development is not only to help users improve their data analysis experience but also to accumulate specialized data and business understanding in the crypto field. This includes training language models specific to the crypto domain to enhance the efficiency and accuracy of vertical application scenarios. Footprint’s advantages in this area will be reflected in the following aspects:

  • Data knowledge (the quality and quantity of knowledge repositories). The efficiency of data accumulation, sources, volume, and categories. Especially in Footprint’s MetaMosaic sub-product, the accumulation of relationship graphs and static data related to specific business logics.

  • Knowledge architecture. Footprint has accumulated structured data tables that abstract more than 30 public chains according to business sectors. The knowledge of the data production process from raw data to structured data can enhance the understanding of raw data, thereby improving model training.

  • Data types. There is a noticeable gap in training starting from unstructured raw data on the blockchain to structured data and meaningful business indicators, both in terms of training efficiency and machine costs. A typical example is the need for more data to provide LLM, which requires professional data from the crypto field as well as more readable and structured data. Additionally, larger user volumes serve as feedback data.

  • Crypto fund flow data. Footprint has abstracted fund flow data closely related to investments. It includes information such as the time, subject (including flow direction), token type, amount (with related token price at that time), business types, as well as labels for tokens and subjects. This data serves as the knowledge repository and data source for LLM, enabling analysis of token funds, chip distribution, monitoring of fund flows, identification of on-chain anomalies, and tracking of savvy capital.

  • Injection of private data. Footprint divides its model into three major layers: a large base model with world knowledge (such as OpenAI and other open-source models), domain-specific vertical models, and personalized expert knowledge models. Users can unify their different sources of knowledge repositories on Footprint and utilize private data to train private LLM models, suitable for more personalized application scenarios.

In exploring the combination of Footprint with LLM models, we have also encountered a series of challenges, the most typical being insufficient tokens, time-consuming prompts, and unstable answers. In the vertical field of on-chain data where Footprint operates, the greater challenge is that on-chain data has many entity types, large volumes, and fast changes, and the industry still needs more research and exploration into the form in which it should be fed to LLM. The current toolchain is still relatively early-stage and needs more tools to solve specific problems.

In the future, the integration of Footprint with AI in terms of technology and products includes the following:

(1) In terms of technology, Footprint will combine with LLM model to explore and optimize in three aspects:

  • Support LLM in reasoning on structured data, so that the structured data and knowledge already accumulated in the crypto field can be applied to LLM’s data consumption and production.

  • Help users establish personalized knowledge bases (including knowledge, data, and experience), and use private data to enhance the ability of already optimized crypto LLM, allowing everyone to build their own models.

  • Assist in AI analysis and content production. Users can use dialogue to create their own GPT, combining fund flow data and private knowledge bases to generate and share crypto investment content.

(2) In terms of products, Footprint will focus on exploring innovations in AI product applications and business models. According to Footprint’s recent promotion plan for products, it will launch an AI crypto content generation and sharing platform for users.

In addition, in the exploration of future partners, Footprint will focus on two aspects:

First, strengthen cooperation with KOLs to facilitate the production of valuable content, community operation, and knowledge monetization.

Second, expand more cooperative project parties and data providers, create an open and win-win user incentive and data cooperation, and establish a mutually beneficial one-stop data service platform.

5.3 GoPlus Security

GoPlus Security is currently the leading user security infrastructure in the Web3 industry, providing a variety of security services for users. It has been integrated into mainstream digital wallets, market websites, DEXs, and other Web3 applications on the market. Users can directly use security features such as asset security detection, transfer authorization checks, and anti-phishing. The user security solutions provided by GoPlus cover the entire lifecycle of user security, protecting user assets from various types of attackers.

The development and planning of GoPlus with AI are as follows:

GoPlus mainly explores AI technology in its two products, AI automated detection and AI security assistant:

(1) AI Automated Detection

Starting in 2022, GoPlus has developed an AI-based automated detection engine to comprehensively improve the efficiency and accuracy of security detection. GoPlus’s security engine adopts a multi-level, funnel-shaped analysis method that includes static code detection, dynamic detection, and feature or behavior detection. This composite detection process enables the engine to effectively identify and analyze the characteristics of potential risk samples and to model attack types and behaviors. These models are critical for identifying and preventing security threats, helping the engine determine whether risk samples carry specific attack characteristics. In addition, through long-term iteration and optimization, GoPlus’s security engine has accumulated abundant security data and experience, and its architecture can respond quickly and effectively to emerging security threats, ensuring timely detection and prevention of various complex and novel attacks and providing all-round protection for users. Currently, AI-related algorithms and techniques are used in multiple security scenarios, such as risky contract detection, phishing website detection, malicious address detection, and risky transaction detection. On the one hand, AI accelerates risk mitigation, improves detection efficiency, and reduces detection costs; on the other hand, it reduces the complexity and time cost of manual involvement and improves the accuracy of judging risk samples. For new scenarios that are difficult to delineate manually or to recognize with rule-based engines, AI can better collect features and form more effective analysis methods.

In 2023, with the development of large models, GoPlus quickly adapted and adopted LLM. Compared to traditional AI algorithms, LLM has significantly improved efficiency and effectiveness in data recognition, processing, and analysis. The emergence of LLM has helped GoPlus accelerate its technical exploration in AI automated detection. In the direction of dynamic fuzz testing, GoPlus uses LLM technology to effectively generate transaction sequences and explore deeper states to discover contract risks.

(2) AI Security Assistant

GoPlus is also leveraging LLM-based natural language processing capabilities to develop an AI security assistant that provides real-time security consultation and improves the user experience. Based on the GPT large model, GoPlus has built a proprietary user security agent that analyzes front-end business data, generates solutions, breaks tasks down, and executes them, providing users with the security services they need. The AI assistant simplifies communication around security issues and lowers the threshold for users to understand them.
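A minimal sketch of such an assistant loop is shown below, assuming a hypothetical set of local security tools that the agent calls after decomposing a request; in the actual product the planning step would be performed by the GPT-based model rather than the keyword stub used here, and the tool names are invented for illustration.

```python
# Hypothetical local tools the assistant can call; GoPlus's real services differ.
TOOLS = {
    "check_approval": lambda target: f"approvals granted to {target} look risky",
    "check_phishing": lambda url: f"{url} is not on the current blocklist",
}

def plan(user_request: str) -> list[tuple[str, str]]:
    """In production an LLM maps the request to tool calls; keyword stub here."""
    if "approve" in user_request.lower():
        return [("check_approval", "0xabc...")]
    return [("check_phishing", user_request)]

def run_assistant(user_request: str) -> str:
    """Analyze the request, break it into tool calls, execute, and summarise."""
    steps = plan(user_request)
    results = [TOOLS[name](arg) for name, arg in steps]
    return "\n".join(results)

print(run_assistant("Is it safe to approve this token contract?"))
```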

In terms of product features, given the importance of AI in the security field, AI has the potential to completely reshape existing security and antivirus engines in the future and introduce a brand-new engine architecture centered on AI. GoPlus will continue to train and optimize its AI models, aiming to transform AI from an auxiliary tool into the core of its security detection engine.

In terms of business models, although GoPlus's services currently mainly target developers and project owners, the company is exploring more products and services aimed directly at C-end users, as well as new AI-related revenue models. Providing efficient, accurate, and low-cost C-end services will be GoPlus's core competitive advantage in the future, which requires continuous research to train and refine the large models that interact with users. At the same time, GoPlus will also collaborate with other teams, share its security data, and promote AI applications in the security field to prepare for possible industry changes.

5.4 Trusta Labs

Trusta Labs, established in 2022, is an AI-driven data startup in the Web3 space. Trusta Labs focuses on utilizing advanced AI technology for efficient processing and precise analysis of blockchain data, to build the on-chain reputation and security infrastructure for blockchain. Currently, Trusta Labs’ business mainly includes two products: TrustScan and TrustGo.

(1) TrustScan is a product designed for B-end clients, mainly used to help Web3 projects analyze and refine on-chain user behavior in terms of user acquisition, user activity, and user retention, to identify high-value and genuine users.

(2) TrustGo is a product targeting C-end clients. The analysis tools it provides evaluate on-chain addresses across five dimensions: funding amount, activity, diversity, identity, and loyalty. The product emphasizes in-depth analysis of on-chain data to improve the quality and security of transaction decisions.
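To illustrate how a multi-dimensional address score of this kind could be composed, the sketch below combines the five dimensions with illustrative weights; the weights, key names, and normalisation are assumptions, not TrustGo's actual formula.

```python
# Illustrative weights for the five dimensions described above (not TrustGo's).
WEIGHTS = {"funding": 0.25, "activity": 0.25, "diversity": 0.2,
           "identity": 0.15, "loyalty": 0.15}

def address_score(features: dict) -> float:
    """features maps each dimension to a value already normalised to [0, 1]."""
    return sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)

print(address_score({"funding": 0.8, "activity": 0.6, "diversity": 0.5,
                     "identity": 1.0, "loyalty": 0.4}))
```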

Trusta Labs' development and planning with AI are as follows:

Currently, both of Trusta Labs' products use AI models to process and analyze interaction data from on-chain addresses. The behavioral data of on-chain address interactions is sequential data, which is well suited to training AI models. In cleaning, organizing, and labeling on-chain data, Trusta Labs delegates a large amount of work to AI, greatly improving the quality and efficiency of data processing while significantly reducing labor costs. Trusta Labs uses AI to conduct in-depth analysis and mining of on-chain address interaction data. For B-end clients, it can effectively identify addresses that are likely to be Sybil accounts, and multiple projects using Trusta Labs' products have successfully prevented potential Sybil attacks. For C-end clients, the TrustGo product and its existing AI models effectively help users gain a deeper understanding of their own on-chain behavioral data.
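As a hedged illustration of how sequential interaction data could feed a Sybil-detection model, the sketch below derives a few simple sequence features per address and fits an off-the-shelf classifier, assuming a labelled dataset; the feature choices and model are illustrative and not Trusta Labs' pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def featurise(txs: list[dict]) -> list[float]:
    """txs: chronologically ordered transactions for one address."""
    gaps = np.diff([t["timestamp"] for t in txs]) if len(txs) > 1 else np.array([0.0])
    return [float(len(txs)),                     # interaction count
            float(np.mean(gaps)),                # mean time between transactions
            float(np.std(gaps)),                 # burstiness of activity
            float(len({t["to"] for t in txs}))]  # distinct counterparties

def train_sybil_model(histories: list[list[dict]], labels: list[int]):
    """labels: 1 for known Sybil addresses, 0 for genuine ones."""
    X = np.array([featurise(h) for h in histories])
    return GradientBoostingClassifier().fit(X, labels)
```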

Trusta Labs has been closely following the technological progress and practical application of LLMs. As model training and inference costs continue to fall, and as a large corpus and rich user behavioral data accumulate in the Web3 field, Trusta Labs will look for the right timing to introduce LLM technology and use AI productivity to provide deeper data mining and analysis capabilities for its products and users. With the rich data Trusta Labs has already accumulated, the team hopes to use AI-based intelligent analysis models to provide more reasonable and objective interpretations of data results, for example offering B-end users qualitative and quantitative analyses of flagged Sybil accounts so that they can better understand the reasoning behind the results and present stronger evidence when handling appeals from their own users.

On the other hand, Trusta Labs also plans to use open-source or relatively mature LLMs, combined with intent-based design concepts, to build AI Agents that help users solve on-chain interaction problems more quickly and efficiently. In terms of specific application scenarios, users will in the future be able to communicate in natural language with an AI assistant trained by Trusta Labs on LLMs. The assistant can "intelligently" provide feedback based on on-chain data and suggest and plan subsequent operations from the information provided, truly realizing user-centric, one-stop intelligent operation, greatly lowering the threshold for using data and simplifying the execution of on-chain operations.
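The sketch below illustrates the intent-based idea under stated assumptions: a natural-language request is mapped to a structured intent (the LLM call is stubbed out) and then expanded by a planner into suggested on-chain steps for the user to confirm; the schema and steps are hypothetical, not Trusta Labs' design.

```python
from dataclasses import dataclass

@dataclass
class Intent:
    action: str   # e.g. "bridge", "swap", "claim"
    params: dict  # action-specific parameters

def parse_intent(user_text: str) -> Intent:
    """In production an LLM would produce this structure; keyword stub here."""
    if "bridge" in user_text.lower():
        return Intent("bridge", {"asset": "ETH", "src": "ethereum", "dst": "arbitrum"})
    return Intent("unknown", {})

def plan(intent: Intent) -> list[str]:
    """Expand an intent into an ordered list of suggested on-chain operations."""
    if intent.action == "bridge":
        p = intent.params
        return [f"approve {p['asset']} for the bridge contract on {p['src']}",
                f"deposit {p['asset']} into the bridge on {p['src']}",
                f"claim {p['asset']} on {p['dst']}"]
    return ["ask the user a clarifying question"]

print(plan(parse_intent("Help me bridge some ETH to Arbitrum")))
```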

In addition, Trusta believes that in the future, as more and more AI-based data products emerge, the core competitive factor of each product may lie not in which LLM is used, but in a deeper understanding and interpretation of the data already acquired. Only on the basis of analyzing that data, combined with LLMs, can more "intelligent" AI models be trained.

5.5 0xScope

0xScope, established in 2022, is a data-oriented innovation platform that focuses on the combination of blockchain technology and artificial intelligence. 0xScope aims to change the way people handle, use, and perceive data. Currently, 0xScope has launched 0xScope SaaS products and 0xScopescan for B-end and C-end clients respectively.

(1) The 0xScope SaaS product, a SaaS solution for enterprises, helps enterprise customers carry out post-investment management, make better investment decisions, understand user behavior, and closely monitor competitive dynamics.

(2) 0xScopescan, a B2C product, allows cryptocurrency traders to investigate fund flows and activity on selected blockchains.

The focus of 0xScope's business is to abstract general data models from on-chain data, simplify on-chain data analysis, and transform raw on-chain data into understandable operational data, thereby helping users conduct in-depth analysis. With 0xScope's data tools platform, the quality of on-chain data can be improved, hidden information can be unearthed and surfaced to users, and the threshold for data mining is greatly reduced.

0xScope’s development and planning with AI are as follows:

0xScope’s products are being upgraded by integrating large-scale models, which includes two directions: first, further reducing the user’s threshold of use through natural language interaction; second, improving processing efficiency in data cleaning, parsing, modeling, and analysis through AI models. At the same time, 0xScope’s products will soon launch an AI interactive module with Chat functionality, which will greatly reduce the threshold for users to query and analyze data, allowing them to interact and query underlying data using natural language only.

However, 0xScope has encountered the following challenges in training and using AI. First, the cost and time required for AI training are high, and after a question is asked the AI takes a long time to reply; this forces the team to streamline and focus its business processes, concentrating on vertical Q&A rather than building an all-in-one super AI assistant. Second, the output of LLMs is hard to control. Data products are expected to deliver accurate results, but the results produced by LLMs may deviate from reality, which is very damaging to the experience of a data product. In addition, the output of large models may involve users' private data. Therefore, when using LLMs in the product, the team needs to impose strict restrictions to keep the AI model's output controllable and accurate.
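The kind of restriction described above could look like the hedged sketch below, which rejects generated statements that are not read-only, that fall outside a whitelisted set of tables, or that touch columns holding private data; the rules are illustrative assumptions rather than 0xScope's actual guardrails.

```python
ALLOWED_TABLES = {"transfers", "balances"}
PRIVATE_COLUMNS = {"email", "ip_address"}

def validate_sql(sql: str) -> str:
    """Reject generated queries that are not read-only, off-whitelist, or private."""
    lowered = sql.lower()
    if not lowered.lstrip().startswith("select"):
        raise ValueError("only read-only SELECT statements are allowed")
    if not any(table in lowered for table in ALLOWED_TABLES):
        raise ValueError("query must target a whitelisted table")
    if any(column in lowered for column in PRIVATE_COLUMNS):
        raise ValueError("query touches columns marked as private")
    return sql
```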

In the future, 0xScope plans to focus its AI efforts on specific vertical tracks and cultivate them deeply. Building on its accumulation of a large amount of on-chain data, 0xScope can define the identity of on-chain users and will continue to use AI tools to abstract on-chain user behavior, creating a unique data modeling system and revealing information hidden in on-chain data.

In terms of cooperation, 0xScope will focus on two groups: the first includes developers, project teams, VCs, exchanges, and other entities that can directly benefit from the product and need the data it provides; the second includes partners with a demand for AI Chat, such as Debank and Chainbase, which only need the relevant knowledge and data and can use AI Chat directly.

6. VC Insights: The Commercialization and Future Development of AI+Web3 Data Companies

This section explores the current status and development of the AI+Web3 data industry from the perspectives of four experienced VC investors. It discusses the core competitiveness of Web3 data companies and their future commercialization pathways.

6.1 Current Status and Development of the AI+Web3 Data Industry

Currently, the combination of AI and Web3 data is in an exploratory stage. Judging from the development directions of leading Web3 data companies, combining with AI technology and LLMs is an essential trend. However, LLMs have their own technical limitations and cannot solve many of the problems facing the current data industry.

Therefore, we need to recognize that blindly bolting on AI does not necessarily enhance a project's advantages, and the concept of AI should not be used merely for hype. Instead, we need to explore practical and promising application areas. From the VC perspective, the combination of AI and Web3 data has so far been explored in the following aspects:

1) Enhancing the capabilities of Web3 data products through AI technology, including the use of AI technology to improve the internal data processing and analysis efficiency of enterprises, as well as automated analysis, retrieval, and other capabilities for user data products. For example, Yuxing from SevenX Ventures mentioned that the main help of AI technology in Web3 data is efficiency improvement. For instance, Dune uses LLM models for code anomaly detection and natural language to SQL conversion for information indexing. There are also projects that use AI for security alerts, where AI algorithms are more effective than pure mathematical statistics for anomaly detection, enabling more effective monitoring in the security aspect. Additionally, Zi from Wei Capital mentioned that companies can save a lot of labor costs by training AI models for data pre-labeling. Nevertheless, VCs believe that AI plays a supportive role in enhancing the capabilities and efficiency of Web3 data products, such as data pre-labeling, and manual verification may still be required to ensure accuracy.

2) Building AI Agents/Bots using LLM’s advantages in adaptability and interaction. For example, using large language models to retrieve data from the entire Web3, including on-chain data and off-chain news data, for information aggregation and sentiment analysis. Harper from Hashkey Capital believes that such AI Agents are more inclined towards information integration, generation, and interaction with users, and they may be relatively weaker in terms of information accuracy and efficiency.

Although there have been numerous cases in these two application areas, the technology and products are still in the early exploration stage. Therefore, continuous technological optimization and product improvement are needed in the future.

3) Using AI for pricing and trading strategy analysis: Currently, there are projects in the market that utilize AI technology for price estimation of NFTs, such as NFTGo invested by Aurora Ventures, and some professional trading teams use AI for data analysis and trade execution. Additionally, Ocean Protocol recently launched an AI product for price prediction. These types of products seem imaginative, but they still need validation in terms of product implementation and user acceptance, especially in terms of accuracy.

On the other hand, many VCs, especially those who have also invested in Web2, pay more attention to the advantages and applications that Web3 and blockchain technology can bring to AI. Blockchain technology is publicly verifiable and decentralized, and cryptographic technology provides privacy protection. In addition, Web3 is reshaping production relations, which may bring new opportunities for AI:

(1) AI data rights and verification. The emergence of AI has made content generation abundant and cheap. Tang Yi from Sequoia Capital mentioned that it is difficult to determine the quality and authorship of digital works and other content. In this regard, a new system for data content rights is needed, and blockchain may be able to help. Zi Xi from GGV Capital mentioned that there are data exchanges that trade data in the form of NFTs, which can address the problem of data rights.

In addition, Yuxing from SevenX Ventures mentioned that Web3 data can help mitigate AI forgery and black-box issues. Currently, AI models have black-box problems in both their algorithms and their data, which can lead to biased output. Web3 data, by contrast, is transparent and publicly verifiable, making the sources and results of AI models clearer, making AI fairer, and reducing bias and errors. However, the current volume of Web3 data is not yet enough to support AI training on its own, so this will not be realized in the short term; but this property can already be used to put Web2 data on-chain to prevent AI deepfakes.

(2) AI data annotation crowdsourcing and UGC communities: Currently, traditional AI annotation faces the problems of low efficiency and quality, especially in specialized knowledge areas that may require interdisciplinary knowledge. Traditional general data annotation companies cannot cover these areas, often requiring internal professional teams. However, blockchain and Web3 concepts can help improve this problem by introducing data annotation crowdsourcing services. For example, Questlab, invested by GGV Capital, uses blockchain technology to provide data annotation crowdsourcing services. Additionally, blockchain concepts can be used to solve the economic problems of model creators in some open-source model communities.

(3) Data privacy protection: Blockchain combined with cryptographic technology can ensure data privacy and decentralization. Zi Xi from GGV Capital mentioned a synthetic data company in their portfolio that uses large models to generate synthetic data for software testing, data analysis, and AI model training. The company faces many data privacy and compliance issues when handling data and uses the Oasis blockchain to effectively avoid privacy and regulatory problems.

6.2 AI+Web3: How to Build Core Competitiveness for Data Companies

For Web3 technology companies, introducing AI can increase a project's attractiveness or visibility to some extent, but at present most Web3 technology companies' AI-related products are not strong enough to be the company's core competitiveness; they mainly provide a friendlier user experience and improved efficiency. For example, the threshold for building AI Agents is not high, so early movers may gain a first-mover advantage in the market, but this does not create a lasting barrier.

The core competitive advantage and barriers for teams in the Web3 data industry should be their data capabilities and how they apply AI technology to solve specific analytical scenarios.

Firstly, the team’s data capabilities include the ability to analyze and adjust models based on data sources, which is the foundation for subsequent work. In interviews, SevenX Ventures, Matrix Partners, and Hashkey Capital all mentioned that the core competitive advantage of AI+Web3 data companies depends on the quality of the data sources. In addition, engineers need to be proficient in model fine-tuning, data processing, and parsing based on the data sources.

On the other hand, how the team combines AI technology with specific scenarios is also very important, and those scenarios should be valuable. Harper believes that although Web3 data companies currently combine with AI mainly by starting with AI Agents, their positioning differs. For example, Space and Time, invested by Hashkey Capital, and chainML have launched infrastructure for creating AI agents, and the DeFi agent created with it is used by Space and Time.

6.3 Future Commercialization of Web3 Data Companies

Another important topic for Web3 data companies is commercialization. For a long time, the profit models of data analytics companies have been relatively one-dimensional: products are mostly free for C-end users, and revenue mainly comes from B-end clients, which depends heavily on their willingness to pay. In Web3, enterprises' willingness to pay is generally low, and since the industry is dominated by startups, it is difficult for project teams to sustain long-term payments. As a result, current Web3 data companies face challenges in commercialization.

On this issue, VCs generally believe that, for now, AI technology is mainly applied to internal production processes and has not fundamentally changed the difficulty of monetization. Some new product forms, such as AI Bots, may increase users' willingness to pay in the ToC field to some extent, but that willingness is still not strong. AI may not solve the commercialization problem of data products in the short term; more productization efforts are needed, such as finding more suitable scenarios and innovative business models.

In the future integration of Web3 and AI, combining Web3's economic models with AI products and data may generate new business models, mainly in the ToC field. Subin from Matrix Partners mentioned that AI products can be combined with token mechanics to increase the stickiness, daily activity, and emotional connection of the whole community; this is feasible and easier to monetize. Tang Yi from Qiming Venture Partners mentioned that, conceptually, Web3's value system can be combined with AI and is well suited as a bot's account system or value conversion system: for example, a robot could have its own account, earn money through its intelligent capabilities, and pay for the computing power that sustains it. But this concept belongs to future imagination, and practical application may still be a long way off.

Under the original business model, that is, direct user payment, strong product capability is needed to give users a greater willingness to pay, for example higher-quality data sources and data whose benefits outweigh the cost of paying for it. This depends not only on the application of AI technology but also on the capabilities of the data team itself.
