How to efficiently extract and apply semantic information from on-chain transactions

Each transaction on the chain is not just a simple fund transfer, but also carries a lot of semantic information waiting to be discovered.

Guest: Wu Zhiying, PhD student at Sun Yat-sen University

Edited by: aididiaojp.eth, Foresight News

This article is a written summary of a video talk by Wu Zhiying, a doctoral student at Sun Yat-sen University, given as part of the Web3 Youth Scholar Program. The program is jointly launched by DRK Lab, imToken, and Cryptape, and invites well-known young scholars in the crypto field to share their latest research results with the Chinese community.

Hello everyone, I am Wu Zhiying, a first-year doctoral student at Sun Yat-sen University. I am also one of the technical leaders of XBlock.pro, a blockchain data website. My main research focus is on blockchain transaction data mining. Today, our topic is “Blockchain Transaction Semantic Analysis and Applications,” mainly discussing transaction semantics and blockchain transaction tracking. Later, we will briefly introduce some downstream applications of blockchain transaction semantics.

The title contains the keyword “blockchain transaction semantics,” so what does it mean? In our view, this is a relatively new concept. A large number of Web3 applications now run on blockchain platforms, supporting DeFi, GameFi, and other business scenarios. Users initiate on-chain transactions to interact with these applications and achieve their goals, and these transactions carry a lot of semantic information.

Transactions in blockchain systems can carry a great deal of semantic information beyond a simple fund transfer. In the above image, “transfer” represents a simple fund transfer, moving USDT from one account to another. A transaction can also perform more complex business logic, such as a token exchange. For example, an account can send USDT to a decentralized exchange and receive 0.2 Ether in return, and the logic behind this transaction is quite complex. Here, we give a definition of blockchain transaction semantics: the transaction itself together with the intent information reflected during its execution. The bottom right image shows the distribution of transaction types in the data we crawled in 2021. We found that over 30% of transactions are no longer simple fund transfers, and we expect more such transactions to appear in the future. Therefore, studying the semantic information of transactions becomes necessary.

Extracting blockchain transaction semantics presents certain challenges. Here, I list three. First, unlike in traditional financial systems, blockchain accounts carry no identity information, or that information is difficult to obtain, so it is hard to infer transaction intent from identity. As a result, fraudulent and other malicious behavior can hide inside blockchain transactions. Second, semantic extraction must be efficient. We need to extract transaction semantics faster than the blockchain produces blocks. Otherwise, we cannot keep up with the latest transactions, and many downstream tasks, such as detecting fraudulent transactions, become ineffective. Third, the solution must be general. The Web3 ecosystem is developing rapidly, with new concepts appearing every year, and our extraction method needs to adapt to newly emerging businesses, which is also a challenge.

Our idea is to ask whether a transaction can be represented as a general-purpose vector. Once transactions are turned into vectors, many downstream models become available, from traditional machine learning methods such as random forests to more advanced neural networks. These models can then be applied to downstream tasks related to transaction semantics, especially attack detection and account classification. We can use them to detect fraudsters, analyze the ecosystem of blockchain platforms, and discover new hotspots.

We proposed a solution that roughly consists of three steps. First, we speed up the workflow by accessing the blockchain client's RPC data interfaces in parallel. Once the transaction data acquisition module is faster, the overall pipeline improves significantly. Second, we construct the obtained transaction data into a network and use the concept of network motifs to represent the semantic information of each transaction as a vector. Third, once represented as a vector, the semantics can be applied to various downstream tasks, such as transaction classification, account classification, and ecosystem analysis.

The first part is the acquisition of transaction data. There are three types of schemes. The first is the RPC scheme, a remote call interface provided by the blockchain client for obtaining on-chain data. Its main drawback at present is that this type of network request is usually slow. The second is the data interfaces provided by some blockchain explorers, through which transaction data can be accessed easily. However, these websites usually have anti-crawling mechanisms, and extracting transaction data on a large scale may get the IP blocked, so this is not a scalable solution. The third is the commonly used modified full node: we run our own nodes and modify the client code, inserting logic that exports the transaction data. This scheme may require more resources, and if the node software is significantly updated, the inserted code may have to be rewritten. So we chose the first scheme.

With the RPC scheme, network request speed determines the efficiency of data acquisition, so our method is to request the RPC interfaces in parallel. There are two parallel request schemes. The first issues multiple RPC requests simultaneously, waits until all of them are completed, and only then requests the transaction data of the next block. From the graph, it can be seen that threads spend a lot of time waiting idly, which seriously wastes resources. To improve this, we just need to eliminate the idle waiting time. Therefore, we choose the second parallel scheme, shown in the bottom-right graph: different threads request multiple interfaces in parallel, but no thread waits for the other requests to complete. To know how far the collection has progressed, we design an additional synchronization component that tracks the progress of data collection. This improves the efficiency of the entire transaction data acquisition.
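The second scheme can be sketched roughly as follows. This is a minimal illustration, not the implementation from the talk: `fetch_block` is a hypothetical stand-in for a real RPC request (for example `eth_getBlockByNumber`), and `ProgressTracker` is an assumed design for the synchronization component, which records the lowest block number that has not yet been fully collected.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ProgressTracker:
    """Assumed synchronization component: every block below the watermark is done."""

    def __init__(self, start_block):
        self._lock = threading.Lock()
        self._done = set()
        self._next = start_block  # lowest block not yet confirmed complete

    def mark_done(self, block_number):
        with self._lock:
            self._done.add(block_number)
            # Advance the watermark over any contiguous run of finished blocks.
            while self._next in self._done:
                self._done.remove(self._next)
                self._next += 1

    def watermark(self):
        with self._lock:
            return self._next


def fetch_block(block_number):
    # Hypothetical placeholder for an RPC call; returns a fake block here.
    return {"number": block_number, "transactions": []}


def collect(start_block, end_block, workers=8):
    """Fetch blocks [start_block, end_block) without a per-block barrier."""
    tracker = ProgressTracker(start_block)
    results = {}
    results_lock = threading.Lock()

    def task(n):
        block = fetch_block(n)
        with results_lock:
            results[n] = block
        tracker.mark_done(n)

    # Each worker takes the next block as soon as it is free; nothing waits
    # for sibling requests. Leaving the `with` block waits for all tasks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for n in range(start_block, end_block):
            pool.submit(task, n)
    return results, tracker.watermark()
```

Because blocks may finish out of order, the watermark is what a downstream consumer would read to learn which prefix of the chain is safe to process.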

In the second part, we discuss how to extract semantic information from the acquired transaction data. Some tools already existed before our work, as discussed in the referenced work, that extract semantic information through expert experience or rules, but such methods may not cope well with new business models that appear in the future. Take the example in the lower left corner. The blockchain explorer offers the ability to interpret transaction intent, and we can see it showing that a SWAP operation is being performed. A rule-based approach, however, will fail to identify the semantics of some transactions. For the second transaction, the explorer cannot identify the semantics and displays only a string, which is actually the function signature of the external transaction. Function signatures are difficult for non-professional blockchain users to understand and are not conducive to computer processing. Of course, some work has also proposed transaction semantic extraction methods for the DeFi field.

How do we create a more general transaction semantic extraction technique? We assume that a transaction triggers a series of low-level semantics, and that higher-level semantic information emerges when these low-level semantics are combined. Here, I define low-level semantics as behaviors defined by the blockchain system and smart contracts. Fund transfer is a classic example: its semantics are usually defined by the blockchain system or by smart contract protocols, such as an ERC20 token transfer. The image on the right shows the code and interpretation of a token exchange. The person initiating the token exchange first sends an external transaction, then the router asks them to send assets to the pair, and then the pair returns the exchanged assets to them. By modeling this fund transfer relationship, we realize that the substructure formed by the three nodes in the image actually reflects the semantic information of a token exchange. We therefore model the fund transfer process as a graph and then mine high-order structures within the graph, which reflect the high-level semantic information in the transaction. For this we draw on motif-based methods. Specifically, we represent the semantic information of a transaction as a vector, which we call the semantic vector. The i-th element is the frequency with which the i-th motif appears. We use 16 motifs, which are network substructures composed of a few nodes and edges.
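To make the idea of a motif-frequency vector concrete, here is a small sketch. The three patterns it counts (one-way edges, reciprocal edge pairs, and directed three-node cycles) are illustrative stand-ins chosen for this example, not the 16 motifs used in the talk, and the edge list for the token exchange is one possible reading of the description above with hypothetical node names.

```python
from itertools import permutations


def build_graph(transfers):
    """Collapse (sender, receiver) fund transfers into a set of directed edges."""
    return {(s, r) for s, r in transfers if s != r}


def motif_vector(edges):
    """Count three illustrative patterns: [one-way edges, reciprocal pairs, 3-cycles]."""
    nodes = {n for edge in edges for n in edge}
    one_way = sum(1 for (a, b) in edges if (b, a) not in edges)
    # Each reciprocal pair is seen once from each direction, hence // 2.
    two_way = sum(1 for (a, b) in edges if (b, a) in edges) // 2
    # Each directed cycle a->b->c->a is found once per rotation, hence // 3.
    cycles = sum(
        1
        for a, b, c in permutations(nodes, 3)
        if (a, b) in edges and (b, c) in edges and (c, a) in edges
    ) // 3
    return [one_way, two_way, cycles]


# Assumed transfers for the token exchange: the external transaction to the
# router, assets sent to the pair, and exchanged assets returned by the pair.
swap_transfers = [("user", "router"), ("user", "pair"), ("pair", "user")]
swap_vector = motif_vector(build_graph(swap_transfers))
```

Under these assumptions the exchange produces one one-way edge and one reciprocal pair, which already distinguishes it from a plain transfer (a single one-way edge).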

Why choose 16 motifs? This follows a reference article we cite, which argues that these 16 network motifs reveal high-order information well in various complex systems. For example, the M16 pattern often appears in neural networks, the triangle motif appears frequently in aviation networks, and other motifs appear in protein network structures. Since these 16 motifs are informative across many kinds of systems, we chose them as well. On the right, I give the calculation method for the first nine motifs only, as the calculation for the remaining motifs is too complex to show.

If we use the network motif method to calculate transaction semantics, what will the computational cost be? We estimate the number of fund transfers in a block in the worst case, to determine whether efficiency can still be achieved when there are a large number of fund transfers. Every fund transfer in a block consumes gas, so we assume that all fund transfers take the more gas-efficient form of log-based transfers, each consuming gl gas. With gb denoting the gas available in a block, gb/gl then bounds the number of fund transfers in a block, which gives us the worst case. Under the constraints that the semantic vectors of all transactions in a block must satisfy, shown in the above graph, we fill in these two numbers according to the Ethereum Yellow Paper and conclude that most CPUs can complete the calculation within one second.
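The worst-case bound can be worked through with concrete numbers. The values below are assumptions for illustration and may differ from the ones used in the talk: the per-log costs are the fee schedule constants from the Ethereum Yellow Paper, a transfer log is assumed to look like an ERC20 `Transfer` event (3 topics, 32 bytes of data), and the block gas figure of 30,000,000 is simply an assumed value for gb.

```python
# Fee schedule constants from the Ethereum Yellow Paper (Appendix G).
G_LOG = 375        # base cost of a LOG operation
G_LOG_TOPIC = 375  # cost per topic
G_LOG_DATA = 8     # cost per byte of log data


def gas_per_log_transfer(topics=3, data_bytes=32):
    """Gas for one log shaped like an ERC20 Transfer event (assumed shape)."""
    return G_LOG + topics * G_LOG_TOPIC + data_bytes * G_LOG_DATA


def max_transfers(block_gas, gas_per_transfer):
    """Upper bound on fund transfers in a block: floor(gb / gl)."""
    return block_gas // gas_per_transfer


g_l = gas_per_log_transfer()      # 375 + 3 * 375 + 32 * 8 = 1756
g_b = 30_000_000                  # assumed block gas budget
bound = max_transfers(g_b, g_l)   # 17084
```

The bound ignores every other cost in a transaction, so real blocks contain far fewer transfers. It only shows that the input to the motif computation stays in the tens of thousands of edges per block under these assumptions.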

Next, we present the experiments on data retrieval efficiency and their conclusions. First, let’s look at the lower left graph. The horizontal axis represents the average time cost of data retrieval per block, and the vertical axis lists the different categories of methods. From the graph, we can see that the modified full node is the fastest, and our method is already close to its speed while consuming much less memory. Some well-known blockchain platforms have block intervals of 2.3 seconds. If the time cost per block exceeds 2.3 seconds, an RPC-based solution cannot be used, because it cannot keep up with the latest data. The middle graph evaluates the cost of extracting semantic vectors. We used an indirect approach here. The horizontal axis represents the tool’s concurrency, increasing from left to right, and the vertical axis represents the time cost of the extraction tool. We drew two lines, one for data retrieval alone and the other for data retrieval connected to the semantic vector calculation module. As concurrency increases, the time cost of ETL keeps decreasing. With the semantic vector calculation module connected, however, the time gradually converges, and the value it converges to is the time cost of semantic vector calculation. In other words, we estimated the cost of semantic vector calculation by increasing the concurrency. The graph on the right is a low-dimensional visualization of the semantic vectors. We selected the nine most common transaction labels and visualized the vectors under each label after t-SNE dimension reduction. Some semantics have clear boundaries and some overlap, but overall this shows that our method for extracting transaction semantics distinguishes transaction types well.

Next, let’s discuss the downstream applications of transaction semantics, mainly transaction classification and account classification. We use semantic vectors as the features of transactions and then classify transactions with simple, typical machine learning methods. The table shows five common transaction categories. In addition, we demonstrate in the paper how to use transaction semantics in account classification, specifically in classifying the contract accounts of counterfeit tokens. To construct the data, we treat contracts and their creator accounts as nodes, and transactions between accounts as edges. For these transactions, we calculate semantic vectors and then perform contract node classification, adding the transaction semantic features to the features of the contract nodes. For each contract node, we take the set of semantic vectors associated with it and calculate statistical features of those vectors, such as maximum, minimum, mean, and variance. Finally, we concatenate these statistical features with the features of the contract node itself. We ran the contract node classification experiments with traditional machine learning methods such as random forests and MLP, as well as with graph neural networks, and found that all of these methods improve to varying degrees. Furthermore, we applied transaction semantics to the ecosystem analysis of some blockchains by observing how the frequencies of the network motifs corresponding to different elements of the vector changed over time in 2021.
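The feature construction for a contract node can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, population variance is used because the talk does not say which variance was computed, and each contract is assumed to have at least one associated semantic vector.

```python
from statistics import mean, pvariance


def aggregate(vectors):
    """Per-dimension max, min, mean, and variance over a set of semantic vectors."""
    columns = list(zip(*vectors))  # one tuple per vector dimension
    features = []
    for stat in (max, min, mean, pvariance):
        features.extend(float(stat(column)) for column in columns)
    return features


def contract_features(own_features, vectors):
    """Concatenate the contract's own features with the aggregated statistics."""
    return list(own_features) + aggregate(vectors)


# Two toy 3-dimensional semantic vectors associated with one contract.
example = contract_features([0.5], [[1, 0, 2], [3, 0, 0]])
```

The resulting fixed-length vector can be fed to any of the classifiers mentioned above, regardless of how many transactions touch the contract.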

The three images in the bottom left corner depict some distinctive trends in motif changes. First, smart contracts are being used more and more frequently, as shown in the figures for M1 and M11. After modeling the fund transfer process as a network, if a transaction triggers fund transfers beyond the external transaction itself, those additional transfers are most likely triggered by smart contracts. So as the number of unidirectional edges in M1 increases, it is likely that more fund transfer activity is being triggered by smart contracts. Second, NFTs gained popularity before 2022, although the market is now bleak. This is related to M3, which increased significantly in 2021. The graphic on the right depicts the underlying logic of NFT trading. When we model all of its fund transfer activities as a graph, we find a circular structure composed of three nodes, which supports our speculation. We then ran a correlation analysis between the trend of M3 and the daily market trading volume of NFTs, and found a correlation coefficient of 0.82, indicating a strong correlation.

Another downstream application is transaction tracking, which is a post-event regulatory measure. Pre-event measures such as warnings and audits are not always effective, so once theft or fraud has already occurred, the goal of transaction tracking is to find the fund flow connections between different transactions and thereby discover where the hacked funds went. If we find that the laundered funds ended up in an exchange, we can contact the exchange and use off-chain means to recover the funds. The core idea of a tracking task is to start from a few accounts, such as one or two victim nodes, search for the hacker’s transaction addresses, and then search for associated transactions through those addresses.
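The core idea of tracking, expanding outward from a few seed accounts, can be sketched as a breadth-first search over the transfer graph. This is a deliberate simplification for illustration and not the tracking method from the talk: it follows reachability rather than actual fund amounts, the account names are made up, and the hop limit `max_hops` is an assumed parameter.

```python
from collections import deque


def trace(transfers, seeds, max_hops=3):
    """Breadth-first search over (sender, receiver) transfers from seed accounts.

    Returns a dict mapping each reached account to its hop distance.
    """
    outgoing = {}
    for sender, receiver in transfers:
        outgoing.setdefault(sender, []).append(receiver)

    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        account = queue.popleft()
        if hops[account] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in outgoing.get(account, []):
            if nxt not in hops:
                hops[nxt] = hops[account] + 1
                queue.append(nxt)
    return hops


# Hypothetical transfers: funds leave the victim and pass through three hops.
example_transfers = [
    ("victim", "hacker"),
    ("hacker", "mixer"),
    ("mixer", "exchange"),
    ("exchange", "cold_wallet"),
    ("bystander", "shop"),
]
reached = trace(example_transfers, ["victim"], max_hops=3)
```

The hop limit matters in practice because the reachable set grows quickly on a large transaction network, and unrelated accounts are pulled in the further the search expands.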

Currently, transaction tracking methods have several pain points. First, because the transaction network is very large, data processing speed is limited. Some systems rely on interaction with experts, but human intervention weakens real-time response. Second, the existing tracking scope is not precise enough: accounts are pseudonymous, so tracking solutions have to expand the search scope to make sure they cover the illicit fund flows. Third, current tracking methods may ignore the semantic information of transactions, so their results deviate from the actual fund flows.

We have designed a fast and accurate tracking solution based on transaction semantics. We divide on-chain transactions into two modes: Xfer mode, where assets are transferred from one account to another, and Swap mode, where one token is exchanged for another, which often occurs in liquidity additions, staking, and token minting. By identifying these two modes, we can design more accurate transaction tracking. In the above diagram, if we want to track the final flow of assets from S to U, some traditional methods would eventually trace it to the hash2 transaction and stop there, which is obviously incorrect. Looking at the outgoing transactions, we notice that hash2 involves both incoming and outgoing funds, indicating that it is a token exchange that essentially converts 2.5K USDC into 0.8 ETH. Therefore, the 2.5K USDC ultimately flows to the two orange nodes. We call this way of improving tracking accuracy token redirection.
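A rough sketch of mode identification and token redirection follows. The rule used here (an account that both sends one token and receives a different token within the same transaction is in Swap mode) is a simplified reading of the description above, and the function names, account names, and transfer tuple layout are assumptions for illustration.

```python
def classify_mode(token_transfers, account):
    """Classify one transaction as 'Swap' or 'Xfer' from an account's viewpoint.

    token_transfers: list of (sender, receiver, token, amount) in one transaction.
    """
    sent = {token for s, r, token, amount in token_transfers if s == account}
    received = {token for s, r, token, amount in token_transfers if r == account}
    if sent and received and sent != received:
        return "Swap"
    return "Xfer"


def redirect(token_transfers, account, tracked_token):
    """Return the token(s) to keep following after this transaction."""
    if classify_mode(token_transfers, account) != "Swap":
        return {tracked_token}
    sent = {token for s, r, token, amount in token_transfers if s == account}
    if tracked_token not in sent:
        return {tracked_token}
    # The tracked token was swapped away: follow what the account got back.
    return {
        token
        for s, r, token, amount in token_transfers
        if r == account and token != tracked_token
    }


# Hypothetical version of the hash2 transaction: 2.5K USDC swapped for 0.8 ETH.
hash2 = [("trader", "pool", "USDC", 2500.0), ("pool", "trader", "ETH", 0.8)]
mode = classify_mode(hash2, "trader")
follow = redirect(hash2, "trader", "USDC")
```

Without the redirection step, a tracker following USDC would stop at the pool, while the value actually continues onward as ETH held by the trader.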

We have also conducted related experiments, such as an ablation experiment. If we remove the semantic information identification module, precision decreases by 12%, which indicates that semantic information is very helpful in transaction tracing. In the real case of the Cryptopia hack, our solution not only discovered 32% of the fund flows mentioned in the expert report, but also identified over $10.32 million in assets flowing to two other exchanges that had not been detected, which raised our overall tracing accuracy to 97%.

Currently, the work on transaction semantics is still in the exploratory stage, and these two papers serve as a starting point. We found that transaction semantics benefit various downstream tasks, but the current solution still has limitations. For example, we only consider the low-level semantics of fund transfers, and we have not analyzed the execution process of smart contracts. In future work, we may look at more modalities of transaction data, such as logs and trace information from the execution process. In addition, we will take on more downstream tasks, such as attack detection and large-scale account classification.
