Last month, AI engineer Adam King drew on the strengths of artificial intelligence in forecasting and proposed using deep reinforcement learning to build an automated trading agent for cryptocurrencies. In the demonstration model, the program's rate of return reached a staggering 60 times (this is a technical discussion only, not investment advice).

At the time, however, the demonstration model was rather rough. It could make a profit, but it was not stable: it might earn you a lot of money, or it might lose you everything, very much a case of "high risk, high reward".

The instability kept nagging at Adam. After a month of quiet work, he proposed introducing feature engineering and Bayesian optimization into the model. Will these improvements work? How much can the rate of return increase? Let's take a look at Adam's latest work!

In the previous article, we used deep reinforcement learning to create an automated Bitcoin trading agent that can make money. Although that agent could trade Bitcoin automatically for a profit, its returns were not especially impressive. Today we will **greatly improve this Bitcoin trading agent and thereby increase its profitability.**

It should be noted that the **purpose of this article is to test whether state-of-the-art deep reinforcement learning techniques can be combined with the blockchain to create a profitable automated Bitcoin trading agent.** At present, the industry does not seem to have recognized the potential of deep reinforcement learning in automated trading; instead, it is widely dismissed as not being a "tool that can be used to build trading algorithms." However, recent advances in deep learning have shown that, on the same problem, reinforcement learning agents can often learn more than ordinary supervised learning agents.

For this reason, I ran experiments to explore how profitable a trading agent based on deep reinforcement learning can be. Of course, the **result may be that deep reinforcement learning has strong limitations that make it unsuitable for trading agents, but how would we know without trying?**

First, we will **improve the policy network of the deep reinforcement learning model** and make the input data stationary so that the trading agent can learn more from very little data.

Next, **we will use modern feature engineering methods to improve the observation space of the trading agent**, while fine-tuning its reward function to help it find better trading strategies.

Finally, before training the trading agents and testing their returns, we will **use Bayesian optimization to find the hyperparameters that maximize returns**.

There is a lot ahead, so fasten your seat belts and let's begin this exploration.

**Improvement on deep reinforcement learning model**

In the previous article, we have implemented the basic functions of the deep reinforcement learning model.

GitHub address:

https://github.com/notadamking/Bitcoin-Trader-RL

The first priority is to **improve the profitability of the deep reinforcement learning agent**, in other words, to make some improvements to the model.

**Recurrent Neural Network (RNN)**

The first improvement we need to make is to **use a recurrent neural network as the policy network**, that is, to replace the previously used Multi-Layer Perceptron (MLP) network with a Long Short-Term Memory (LSTM) network. Because a recurrent neural network maintains an internal state over time, we no longer need a sliding "look-back window" to capture the price action that precedes each step; the recurrent nature of the network captures it automatically at run time. At each time step, the new data from the input data set is fed into the network together with the output from the previous time step.

As a result, the LSTM network maintains an internal state at all times. At each time step, the agent remembers some new relationships in the data and forgets some earlier ones, and the internal state is updated accordingly.
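To make the idea of a carried internal state concrete, here is a minimal sketch of a single scalar LSTM cell written with the standard library only. It is not the policy network used in the article: the weights are made-up constants (a trained network would learn them), and the names `LSTMState`, `lstm_step` and `run_sequence` are hypothetical.

```python
import math
from dataclasses import dataclass


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class LSTMState:
    hidden: float = 0.0  # output of the previous time step
    cell: float = 0.0    # longer-term memory carried between steps


# Hypothetical fixed weights per gate: (input weight, recurrent weight, bias).
WEIGHTS = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "output": (0.4, 0.3, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}


def _pre_activation(gate: str, x: float, h_prev: float) -> float:
    w_x, w_h, bias = WEIGHTS[gate]
    return w_x * x + w_h * h_prev + bias


def lstm_step(x: float, state: LSTMState) -> LSTMState:
    """Combine the new observation with the previous output and memory."""
    forget = sigmoid(_pre_activation("forget", x, state.hidden))
    write = sigmoid(_pre_activation("input", x, state.hidden))
    expose = sigmoid(_pre_activation("output", x, state.hidden))
    candidate = math.tanh(_pre_activation("candidate", x, state.hidden))
    cell = forget * state.cell + write * candidate  # forget some old, store some new
    hidden = expose * math.tanh(cell)
    return LSTMState(hidden=hidden, cell=cell)


def run_sequence(observations):
    """Feed observations one at a time, carrying the state forward."""
    state = LSTMState()
    outputs = []
    for x in observations:
        state = lstm_step(x, state)
        outputs.append(state.hidden)
    return outputs
```

Because the state is carried forward, the output for the same observation depends on what came before it, which is what lets the agent do without an explicit look-back window.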

The recurrent neural network receives the output of the previous time step

How the recurrent neural network handles the output of the previous time step and the input of the current time step

LSTM network implementation code: LSTM_model.py

Thanks to the unique advantage of the recurrent network's internal state, here we update the PPO2 (Proximal Policy Optimization) model to use the LSTM policy.

**Data stationarity**

In the previous article I pointed out that Bitcoin trading data is non-stationary (that is, it contains trends, not just random fluctuations), so it is difficult for any machine learning model to predict its future values.

A stationary time series is a time series whose mean, variance, and autocorrelation (its correlation with lagged copies of itself) are constant over time.

Moreover, **the time series of cryptocurrency prices has obvious trends and seasonal effects (seasonal effects are abnormal market returns associated with the seasons, a market "anomaly" that runs counter to market efficiency)**, both of which hurt the accuracy of time series prediction algorithms. We therefore use differencing and transformation methods to process the input data and produce a more normal distribution from the existing time series.

In principle, differencing subtracts the value at each time step from the value at the next, which yields the derivative of the cryptocurrency price (that is, the rate of return) at each step. Ideally, this removes the trend from the input time series, but differencing does not remove seasonal effects, and the differenced data still shows a strong seasonal effect. To eliminate it, **we take the logarithm of the data before differencing. After this processing, we obtain a stationary input time series**, as shown in the rightmost chart below.

From left to right: the cryptocurrency's closing price, the closing price after differencing, and the closing price after log transformation and differencing.

Log transformation and differencing code: diff_and_log_time_series.py
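As a minimal sketch of log transformation followed by differencing, using only the standard library (this is not the article's diff_and_log_time_series.py; the function name `log_and_difference` and the sample prices are hypothetical):

```python
import math


def log_and_difference(prices):
    """Return log(p[t]) - log(p[t-1]) for each consecutive pair of prices."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be strictly positive to take logarithms")
    logs = [math.log(p) for p in prices]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]


# Made-up closing prices: +10%, +10%, then -10%.
closes = [100.0, 110.0, 121.0, 108.9]
log_returns = log_and_difference(closes)
```

Two equal percentage moves produce the same value after this transformation regardless of the price level, which is why taking the logarithm first helps when the raw price swings grow with the price itself.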

After processing the input time series, we can verify that it is stationary using the Augmented Dickey-Fuller test.

Run the following code:

Augmented Dickey-Fuller test code: adfuller_test.py

We get a p-value of 0.00, which means we **reject the null hypothesis of the test and confirm that the processed input time series is stationary**.

We can run the Augmented Dickey-Fuller test code above to verify the stationarity of the input time series.
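For intuition about what such a test measures, here is a sketch of the plain (non-augmented) Dickey-Fuller regression in pure Python. It is a simplification and not the article's adfuller_test.py: it has no lag terms, reports only the t-statistic rather than a p-value, and compares it with -2.86, the asymptotic 5% critical value for a regression with a constant. The function name and the synthetic data are assumptions.

```python
import math
import random


def dickey_fuller_stat(series):
    """t-statistic of rho in the regression  dy[t] = a + rho * y[t-1] + e[t]."""
    lagged = series[:-1]
    deltas = [b - a for a, b in zip(series, series[1:])]
    n = len(deltas)
    if n < 3:
        raise ValueError("need at least four observations")
    mean_x = sum(lagged) / n
    mean_d = sum(deltas) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxy = sum((x - mean_x) * (d - mean_d) for x, d in zip(lagged, deltas))
    rho = sxy / sxx
    intercept = mean_d - rho * mean_x
    ssr = sum((d - intercept - rho * x) ** 2 for x, d in zip(lagged, deltas))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    return rho / std_err


CRITICAL_5_PERCENT = -2.86  # asymptotic value for the constant-only case

# Synthetic data: a random walk, then its first difference.
rng = random.Random(0)
random_walk = [0.0]
for _ in range(500):
    random_walk.append(random_walk[-1] + rng.gauss(0.0, 1.0))
differenced = [b - a for a, b in zip(random_walk, random_walk[1:])]

stat = dickey_fuller_stat(differenced)
looks_stationary = stat < CRITICAL_5_PERCENT  # reject the unit-root hypothesis
```

A statistic far below the critical value plays the same role as the near-zero p-value reported above: the unit-root (non-stationary) hypothesis is rejected.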

After completing this work, we will **use feature engineering to further optimize the observation space of the trading agent**.

**Feature engineering**

In order to further improve the profitability of trading agents, we need to do some feature engineering.

Feature engineering is the process of using domain knowledge to generate additional input data that improves a machine learning model.

For the trading agent specifically, we will **add some common and effective technical indicators to the input data set, as well as the output of the seasonal prediction model SARIMAX from the Python statistics library StatsModels**. The technical indicators bring some relevant, though potentially lagging, information into our input data set, which should improve the accuracy of the trading agent's forecasts. Together, these additions give the trading agent a much richer observation space, allowing it to learn more and earn more.

**Technical analysis**

To select the technical indicators, we will **compare the correlations of all 32 indicators (58 features) available in the Python technical analysis library**. We can use the data analysis tool pandas to calculate the correlation between the indicators of each type (momentum, volume, trend, volatility) and then select only the least correlated indicators of each type as features. This way we get as much value as possible from the technical indicators without adding excessive noise to the observation space.

Heat map of technical indicator correlations on the Bitcoin data set, made with the Python visualization library seaborn

The results show that the volatility indicators and some of the momentum indicators are highly correlated. After removing all duplicate features (features whose mean absolute correlation within their type is greater than 0.5), we add the remaining 38 technical features to the observation space of the trading agent.
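The selection idea can be sketched in pure Python. Note that this is a simplified greedy variant, not the article's exact rule: instead of thresholding each feature's mean absolute correlation, it keeps a feature only if its absolute Pearson correlation with every feature already kept is at most 0.5. The feature names and values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_redundant(features, threshold=0.5):
    """Greedy filter: keep a feature only if it is weakly correlated with all kept ones."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical indicators of one type (for example, momentum).
momentum_features = {
    "indicator_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "indicator_b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],   # nearly a copy of indicator_a
    "indicator_c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],  # mostly unrelated
}
selected = drop_redundant(momentum_features)
```

Here indicator_b is dropped because it carries almost the same information as indicator_a, while indicator_c survives. The result of a greedy filter depends on the order in which features are visited, which is one reason to treat this as a sketch only.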

In the code, we need to **create a function called add_indicators to add these features to the data frame**. To avoid recalculating these features at every time step, we call add_indicators only once, when the trading agent's environment is initialized.

Code for initializing the trading agent's environment: initialize_env.py

Here, the trading agent's environment is initialized and the features are added to the data frame before the data is made stationary.

**Statistical Analysis**

Next we need to add a predictive model.

**Because the Seasonal Auto Regressive Integrated Moving Average (SARIMA) model can compute a price forecast quickly at each time step and is quite accurate on our stationary data set, we use it to forecast the cryptocurrency price.**

In addition to these advantages, the model is very simple to implement, and it can also give a confidence interval for each predicted value, which is usually more informative than a predicted value alone. For example, when the confidence interval is narrow, the trading agent can place more trust in the prediction; when the confidence interval is wide, the agent knows it is taking on more risk.

Code for adding the SARIMAX predictions: add_sarimax_predictions.py

Here we add the SARIMAX model's predictions and confidence intervals to the observation space of the trading agent.

Now that we **have updated the policy to a better-performing recurrent neural network and improved the trading agent's observation space through feature engineering**, it is time to optimize the other parts.

**Reward optimization**

Some people may feel that the reward function from the previous article (that is, rewarding any increase in total asset value) is already the best solution, but after further research I found room for improvement. **Although the simple reward function we used before was able to make a profit, the investment strategies it produced were very unstable and often led to serious losses.** To improve on this, we need to consider other reward metrics in addition to the increase in profit.

A simple improvement is **to reward not only the profit from holding Bitcoin while its price rises, but also the loss avoided by selling Bitcoin before its price falls.** For example, we can reward the agent both for buying Bitcoin when that increases total assets and for selling Bitcoin when that avoids a decrease in total assets.

**While this reward metric is excellent at increasing profitability, it does not take into account the high risk that comes with high returns.** Investors discovered the flaw in this simple measure long ago and replaced it with risk-adjusted reward metrics.

**Volatility-based reward indicator**

A typical example of a risk-adjusted reward metric is the Sharpe ratio, also known as the Sharpe index. It calculates the ratio of a portfolio's excess return to its volatility over a specified time period. The formula is as follows:

Sharpe ratio formula: (portfolio return - market return) / standard deviation of the portfolio

From the formula we can conclude that, in order to maintain a high Sharpe ratio, a portfolio must deliver high returns and low volatility (that is, low risk) at the same time.
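The formula translates almost directly into code. The following is a minimal per-period sketch (not annualized) using the standard library; the function name, the use of a list of benchmark returns, and the sample data are assumptions for illustration.

```python
import statistics


def sharpe_ratio(portfolio_returns, benchmark_returns):
    """Mean excess return divided by the standard deviation of portfolio returns."""
    excess = [p - b for p, b in zip(portfolio_returns, benchmark_returns)]
    volatility = statistics.stdev(portfolio_returns)
    if volatility == 0:
        raise ValueError("volatility is zero, the ratio is undefined")
    return statistics.mean(excess) / volatility


# Two made-up return series with the same average return but different volatility.
benchmark = [0.0, 0.0, 0.0, 0.0]
steady = [0.02, 0.01, 0.03, 0.02]
wild = [0.10, -0.06, 0.12, -0.08]

steady_sharpe = sharpe_ratio(steady, benchmark)
wild_sharpe = sharpe_ratio(wild, benchmark)
```

Both series average 2% per period, but the steadier one earns a much higher ratio, which is exactly the behavior the formula is meant to reward.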

As a reward metric, the Sharpe ratio has stood the test of time, but it is not ideal for an automated trading agent because it penalizes upside volatility. In Bitcoin trading we sometimes want to take advantage of upside volatility, because upside volatility (that is, a sharp rise in the Bitcoin price) is usually a good window of opportunity.

The Sortino ratio solves this problem well. It is very similar to the Sharpe ratio, except that it considers only the downside standard deviation rather than the overall standard deviation, so it does not penalize upside volatility. We therefore choose the Sortino ratio as the first reward metric for the trading agent. Its formula is as follows:

Sortino ratio formula: (portfolio return - market return) / downside standard deviation of the portfolio
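A minimal sketch of the Sortino ratio, assuming a target of zero excess return and the common convention of averaging the squared shortfalls over all periods. The function names and the sample data are hypothetical, and the result is per period rather than annualized.

```python
import math


def downside_deviation(excess_returns):
    """Root mean square of the negative excess returns (positive ones count as zero)."""
    squared_shortfalls = [min(r, 0.0) ** 2 for r in excess_returns]
    return math.sqrt(sum(squared_shortfalls) / len(excess_returns))


def sortino_ratio(portfolio_returns, benchmark_returns):
    """Mean excess return divided by the downside deviation."""
    excess = [p - b for p, b in zip(portfolio_returns, benchmark_returns)]
    downside = downside_deviation(excess)
    if downside == 0:
        return math.inf  # no losing periods at all
    return (sum(excess) / len(excess)) / downside


# Same losses in both series; the second simply has much larger gains.
benchmark = [0.0, 0.0, 0.0, 0.0]
modest_gains = [0.05, -0.02, 0.05, -0.02]
large_gains = [0.20, -0.02, 0.20, -0.02]

modest_sortino = sortino_ratio(modest_gains, benchmark)
large_sortino = sortino_ratio(large_gains, benchmark)
```

The two series have the same downside deviation, so the larger gains raise the ratio instead of being penalized as extra volatility.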

**Other reward indicators**

We chose the Calmar ratio as the second reward metric for the trading agent. So far, none of our reward metrics has taken into account a key factor: the maximum drawdown of Bitcoin.

Maximum drawdown is the largest drop in value from a price peak to the subsequent trough, and it describes the worst case after buying Bitcoin.

Large drawdowns are fatal to our investment strategy, **because a single sudden drop in price can wipe out high returns accumulated over a long period**.

Maximum drawdown

To eliminate the negative impact of large drawdowns, we need a reward metric that handles this situation, for example **the Calmar ratio**. This ratio is similar to the Sharpe ratio, except that it replaces the standard deviation of the portfolio in the denominator with the maximum drawdown.

Calmar ratio formula: (portfolio return - market return) / maximum drawdown
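As a rough sketch of these two quantities, assuming we track the agent's net worth at each step: the Calmar ratio is usually computed from an annualized return, whereas this simplified version uses the total return over the window. The function names and the sample net worths are made up.

```python
def max_drawdown(net_worths):
    """Largest peak-to-trough decline, as a fraction of the peak (0.25 means -25%)."""
    peak = net_worths[0]
    worst = 0.0
    for value in net_worths:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def calmar_ratio(net_worths, benchmark_return=0.0):
    """Excess total return over the window divided by the maximum drawdown."""
    total_return = net_worths[-1] / net_worths[0] - 1.0
    drawdown = max_drawdown(net_worths)
    if drawdown == 0:
        return float("inf")  # the net worth never fell below a previous peak
    return (total_return - benchmark_return) / drawdown


# Made-up net worth path: up to 120, down to 90, then up to 130.
net_worths = [100.0, 120.0, 90.0, 130.0]
drawdown = max_drawdown(net_worths)
calmar = calmar_ratio(net_worths)
```

In this path the fall from 120 to 90 is a 25% drawdown, and the overall 30% gain is divided by it, so a strategy that reaches the same end point with a shallower dip would score higher.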

Our last reward metric is the Omega ratio, which is widely used in the hedge fund industry. In theory, the Omega ratio should be better than both the Sortino ratio and the Calmar ratio at measuring risk against return, because it accounts for the entire distribution of returns in a single metric.

To calculate the Omega ratio, we compute the probability-weighted returns of the portfolio above and below a specific benchmark separately, and then divide the former by the latter. The higher the Omega ratio, the higher the probability that Bitcoin's upside potential exceeds its downside potential.

Omega ratio calculation formula

The formula for the Omega ratio looks complicated, but don't worry, **it is not hard to implement in code**.
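The usual definition divides the integral of 1 - F(r) above the threshold by the integral of F(r) below it, where F is the cumulative distribution of returns. On a finite sample this reduces to the total gains above the threshold divided by the total shortfall below it, which is what the following sketch computes. The function name, the default threshold of zero, and the sample returns are assumptions.

```python
def omega_ratio(returns, threshold=0.0):
    """Sum of gains above the threshold divided by the sum of shortfalls below it."""
    gains = sum(r - threshold for r in returns if r > threshold)
    losses = sum(threshold - r for r in returns if r < threshold)
    if losses == 0:
        return float("inf")  # no return fell below the threshold
    return gains / losses


# Made-up per-period returns.
sample_returns = [0.04, -0.01, 0.03, -0.02, 0.02]
omega = omega_ratio(sample_returns)
```

With these numbers the gains total 0.09 and the shortfalls total 0.03, so the ratio is about 3: the upside outweighs the downside three to one.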

**Code**

While writing the code for each reward metric ourselves sounds interesting and challenging, to keep things easy to follow I am going to use **the Python quantitative finance package empyrical to calculate them**. Fortunately, this package contains exactly the three reward metrics we defined above, so at each time step we only need to pass the lists of portfolio returns and market returns for the period to the corresponding empyrical function, and it returns the ratio.

Code for calculating the three reward metrics with the empyrical package: risk_adjusted_reward.py

In the code, we set the reward for each time step using the predefined reward function.

So far, we have decided how to measure the success of a trading strategy, and now it is time to figure out **which metrics will bring higher returns**. We need to feed these reward functions into the automatic hyperparameter optimization framework Optuna, and then use Bayesian optimization to find the optimal hyperparameters for the input data set.

**Toolset**

As the saying goes, a good horse deserves a good saddle. Any good engineer needs a useful set of tools; without them, even the cleverest cook cannot make a meal without rice.

But I am not saying that we have to reinvent the wheel. We should learn to use the tools that other programmers have developed for us at the cost of their hair, so that their work is not in vain. **For the trading agent we are developing, the most important tool is the automatic hyperparameter optimization framework Optuna**. Under the hood, it implements Bayesian optimization using Tree-structured Parzen Estimators (TPEs). This estimation method can be run in parallel, which lets us put our graphics card to work and greatly shortens the time the search takes. In short:

Bayesian optimization is an efficient technique for searching a hyperparameter space to find the hyperparameters that maximize a given objective function.

In other words, Bayesian optimization can effectively improve any black-box model. In terms of how it works, **Bayesian optimization models the objective function to be optimized using a surrogate function or a distribution of surrogate functions.** Over time, as the algorithm keeps exploring the hyperparameter space for hyperparameters that maximize the objective function, this distribution gradually improves.

So much for the theory. How do we apply it to the automated Bitcoin trading agent? In essence, we can use this technique to find the set of hyperparameters that gives the agent the highest returns. The process is like searching for a needle in a haystack of hyperparameters, and Bayesian optimization is the magnet that leads us to the needle. Let's get started.

**Using Optuna to optimize hyperparameters is not difficult.**

First, we need to create an Optuna study, which is the container for all the hyperparameter trials. In each trial we adjust the hyperparameter settings and compute the corresponding loss value of the objective function. Once the study has been initialized, we pass the objective function to study.optimize() to start the optimization. Optuna then uses Bayesian optimization to find the hyperparameter configuration that minimizes the loss.

Code for Bayesian optimization with the Optuna library: optimization_with_optuna.py

In this example, the **objective function trains and tests the agent in the Bitcoin trading environment, and its loss value is defined as the negative of the agent's average profit during the test period.** We negate the profit because a higher average return is better, whereas here Optuna treats a lower loss as better; a minus sign resolves the mismatch. The optimize function passes a trial object to the objective function, and in the code we use that trial object to specify the variables to optimize.

Objective function code: optimize_objective_fn.py

The optimize_ppo2 function (for the agent) and the optimize_envs function (for the agent's environment) take the trial object as input and return a dictionary of the parameters to test. The search space of each variable is defined by the suggest function we call on the trial and by the arguments we pass to it.

**For example, if a parameter should follow a uniform distribution on a logarithmic scale**, we call:

trial.suggest_loguniform('n_steps', 16, 2048),

This gives the function a new floating-point number between 16 and 2048, sampled uniformly on a log scale, so each doubling interval (16 to 32, 32 to 64, …, 1024 to 2048) is equally likely to be chosen.

**Similarly, if a parameter should follow a uniform distribution on a linear scale**, we call:

trial.suggest_uniform('cliprange', 0.1, 0.4),

This gives the function a new floating-point number anywhere between 0.1 and 0.4, with every value in that range equally likely.

You have probably spotted the pattern by now. Categorical variables are set in the same way:

trial.suggest_categorical('categorical', ['option_one', 'option_two']), where 'categorical' is the name of the variable and option_one and option_two are the two options it can take. In the previous functions, the two numeric arguments play the same role by defining the range of the variable. With that understood, the code that follows should not be hard to read.
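To make the three sampling rules concrete, here is a small standard-library sketch of what log-uniform, uniform and categorical draws look like. It illustrates the distributions only and is not how Optuna chooses its suggestions (Optuna's sampler also learns from earlier trials); the function names are hypothetical.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw a float whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_uniform(rng, low, high):
    """Draw a float uniformly on [low, high]."""
    return rng.uniform(low, high)


def sample_categorical(rng, choices):
    """Pick one of the listed options with equal probability."""
    return rng.choice(choices)


rng = random.Random(42)
n_steps_draws = [sample_loguniform(rng, 16, 2048) for _ in range(10_000)]
cliprange = sample_uniform(rng, 0.1, 0.4)
option = sample_categorical(rng, ["option_one", "option_two"])

# 181 is roughly the geometric midpoint of 16 and 2048 (sqrt(16 * 2048)).
share_below_midpoint = sum(d < 181 for d in n_steps_draws) / len(n_steps_draws)
```

Roughly half of the log-uniform draws fall below 181, whereas a plain uniform draw over the same range would land there less than a tenth of the time. That is why the log scale suits parameters such as n_steps, whose useful values span several orders of magnitude.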

Code for optimizing the trading agent: optimize_ppo2.py

Code for optimizing the trading environment: optimize_envs.py

Once the code is written, we **run the optimization on a high-performance server using the CPU and the graphics card together**. In our setup, Optuna creates a SQLite database from which we can load the optimization study. The study records the best-performing trial, from which we can derive the optimal set of hyperparameters for the agent and its trading environment.

Code for loading the Optuna study: load_optuna_study.py

**At this point, we have improved the model, improved the feature set, and optimized all the hyperparameters.** But as the saying goes, the proof of the pudding is in the eating.

**So how does the trading agent perform under the new reward indicator?**

During training, I used **profit, the Sortino ratio, the Calmar ratio and the Omega ratio** as reward metrics to optimize the agent. Next, we need to test which reward metric produces the most profitable agent in the test environment. Of course, the data in the test environment consists of Bitcoin price movements that the agent never saw during training, which keeps the test fair.

**Income comparison**

Before we look at the results, we need to know what a successful trading strategy looks like. For this reason, we will benchmark against some common and effective Bitcoin trading strategies. Surprisingly, **one of the most effective Bitcoin trading strategies of the past decade has been simply to buy and hold, while the other two strategies use simple but effective technical analysis to generate the buy and sell signals that guide trading.**

**1. Buy and hold**

This strategy means buying as much Bitcoin as possible and holding on to it indefinitely (what the blockchain community calls "HODLing"). Although this strategy is not particularly sophisticated, it has had a high chance of making money in the past.

**2. Relative Strength Index (RSI) divergence**

When the closing price keeps rising while the RSI keeps falling, that is a sell signal; when the closing price keeps falling while the RSI keeps rising, that is a buy signal.
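A rough sketch of this rule in pure Python follows. It makes two simplifying assumptions that are not from the article: the RSI is computed with simple averages of gains and losses over the last `period` changes, and "keeps rising" or "keeps falling" is approximated by comparing the latest value with the value `lookback` steps earlier. All names and sample numbers are hypothetical.

```python
def rsi(closes, period=14):
    """Relative Strength Index over the last `period` price changes (simple averages)."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    recent = closes[-period - 1:]
    changes = [b - a for a, b in zip(recent, recent[1:])]
    avg_gain = sum(max(c, 0.0) for c in changes) / period
    avg_loss = sum(max(-c, 0.0) for c in changes) / period
    if avg_gain == 0 and avg_loss == 0:
        return 50.0  # flat prices: neither overbought nor oversold
    if avg_loss == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)


def divergence_signal(closes, rsi_values, lookback=5):
    """'sell' if price rose while RSI fell over the lookback, 'buy' for the reverse."""
    price_change = closes[-1] - closes[-1 - lookback]
    rsi_change = rsi_values[-1] - rsi_values[-1 - lookback]
    if price_change > 0 and rsi_change < 0:
        return "sell"
    if price_change < 0 and rsi_change > 0:
        return "buy"
    return "hold"


# Made-up example: price drifts up while momentum fades.
rising_prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
fading_rsi = [70.0, 68.0, 66.0, 64.0, 62.0, 60.0]
signal = divergence_signal(rising_prices, fading_rsi)
```

In this made-up example the price has risen over the window while the RSI has fallen, so the rule reports a sell signal.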

**3. Simple Moving Average (SMA) crossover**

When the long-term simple moving average crosses above the short-term simple moving average, that is a sell signal; when the short-term simple moving average crosses above the long-term simple moving average, that is a buy signal.
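The crossover rule can be sketched as follows. The window lengths here are deliberately tiny so the example is easy to follow by hand; they, the function names and the sample prices are assumptions, not the settings used in the article.

```python
def sma(values, window):
    """Simple moving average of the last `window` values."""
    return sum(values[-window:]) / window


def sma_crossover_signal(closes, short_window=3, long_window=5):
    """Compare the two averages before and after the latest close to detect a cross."""
    if len(closes) < long_window + 1:
        raise ValueError("need at least long_window + 1 closing prices")
    prev_short = sma(closes[:-1], short_window)
    prev_long = sma(closes[:-1], long_window)
    curr_short = sma(closes, short_window)
    curr_long = sma(closes, long_window)
    if prev_short <= prev_long and curr_short > curr_long:
        return "buy"   # short-term average crossed above the long-term one
    if prev_short >= prev_long and curr_short < curr_long:
        return "sell"  # long-term average crossed above the short-term one
    return "hold"


# Made-up prices: a decline followed by a sharp rebound, and the mirror image.
rebound = [10.0, 9.0, 8.0, 7.0, 6.0, 14.0]
collapse = [6.0, 7.0, 8.0, 9.0, 10.0, 2.0]
```

Calling `sma_crossover_signal(rebound)` returns "buy" and `sma_crossover_signal(collapse)` returns "sell", while a steadily trending series produces no cross and therefore "hold".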

You might ask, why bother with these simple benchmarks? The point is **to prove by comparison that our reinforcement learning trading agent can actually add value in the Bitcoin market.** If the agent's returns cannot even beat these simple benchmarks, then we will have spent a great deal of development time and graphics card power on nothing more than a science experiment. Now let us prove that this is not the case.

**Experimental results**

Our data set is hourly open, high, low, close and volume (OHLCV) data downloaded from the cryptocurrency data website CryptoDataDownload. The first 80% of the data is used to train the agent, and the last 20% is held out as new data to measure the agent's profitability. This simple form of cross-validation is sufficient for our needs; if the automated Bitcoin trading agent were really being prepared for production, we could train on the full data set and then test on the new data generated each day.
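A chronological 80/20 split of this kind takes only a few lines. This sketch assumes the rows are already sorted from oldest to newest; the function name and the placeholder data are hypothetical.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows into an earlier training part and a later test part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    split_index = int(len(rows) * train_fraction)
    return rows[:split_index], rows[split_index:]


# Placeholder for 1,000 hourly rows, oldest first.
hourly_rows = list(range(1000))
train_rows, test_rows = chronological_split(hourly_rows)
```

The rows are not shuffled: every test row is later in time than every training row, so the agent is never evaluated on prices that precede what it trained on.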

**Not much nonsense, let's take a look at the results.**

As you can see, **the agent that uses the Omega ratio as its reward metric did not make any impressive trades during the testing phase.**

Total asset value of the agent that uses the Omega ratio as its reward metric, over more than 3,500 hours of trading

Analyzing the trades made by this agent, it is clear that the Omega ratio reward metric produces a strategy that over-trades, so the agent fails to seize market opportunities and earn a return.

**The agent that uses the Calmar ratio as its reward metric does slightly better than the one that uses the Omega ratio, but the end result is very similar.** It seems that we have invested a lot of time and energy just to make things worse…

Total asset value of the agent that uses the Calmar ratio as its reward metric, over more than 3,500 hours of trading

**What if profit alone is used as the reward metric?** In the last article, this reward metric proved to be something of a failure. Can all the modifications and optimizations made this time turn that failure around?

During the four-month test period, the **average profit of the agent using profit as its reward metric reached 350% of the account's initial balance.** You may be stunned by this result. Surely this must be the best that reinforcement learning can achieve, right?

Total asset value of the agent that uses profit as its reward metric, over more than 3,500 hours of trading

This is not the case. **The average profit of the agent using the Sortino ratio as its reward metric reached 850% of the account's initial balance.** When I saw this number, I could not believe my eyes, so I immediately went back to check whether there was a problem in the code. After a thorough inspection, it was clear that there were no errors in the code, which means that these agents have learned how to trade Bitcoin.

Total asset value of the agent that uses the Sortino ratio as its reward metric, over more than 3,500 hours of trading

It seems that the agent using the Sortino ratio as its reward metric has learned the importance of buying low and selling high while minimizing the risk of holding Bitcoin, and that it has avoided the twin traps of over-trading and under-investing. Although we do not know the specific trading strategy the agent has learned, we can clearly see that it has learned to make money by trading Bitcoin.

The agent using the Sortino ratio as its reward metric trading Bitcoin. Green triangles indicate buy signals and red triangles indicate sell signals.

I have not let the excitement of a successful experiment go to my head. I know very well that the automated Bitcoin trading agent is far from ready for production. Having said that, these results are more impressive than those of any trading strategy I have seen so far. What is striking is that we **gave the agent no prior knowledge of how the cryptocurrency market works or how to make money in it. The agent achieved these results purely through repeated trial and error**, albeit a great deal of it.

**Closing thoughts**

In this article, we optimized the reinforcement learning based automated Bitcoin trading agent so that it makes better decisions when trading Bitcoin and therefore earns more! Along the way we spent a lot of time and energy and ran into many difficulties. We broke the difficulties down, overcame them one by one, and finally completed the optimization of the agent. The specific steps were as follows:

1. Upgrade the existing model to a recurrent neural network, specifically an LSTM network, trained on stationary data;

2. Use domain knowledge and statistical analysis for feature engineering, giving the agent more than 40 new features to learn from;

3. Introduce investment risk into the agent's reward metric, rather than profit alone;

4. Use Bayesian optimization to find the optimal hyperparameters for the model;

5. Benchmark against common trading strategies to ensure that the agent's returns can beat the market.

**In theory, this high-yield trading agent has done a good job.**

However, I have received a lot of feedback claiming that the trading agent is merely learning to fit a curve, so that **faced with real-time data in a production environment, it would never be able to make a profit**. Although training and testing the agent on separate data sets should address this problem, the **model may indeed overfit the data set and may not generalize well to real-time data**. That said, in my opinion these trading agents have learned far more than simple curve fitting, so I think they can make a profit in live trading.

To test this idea, I will be taking these reinforcement learning agents to production over the coming period. To do so, we must first update the agent's environment to support other cryptocurrencies such as Ethereum and Litecoin. Then we will upgrade the agent so that it can trade in real time on the cryptocurrency exchange Coinbase Pro.

This will be an exciting experiment, so please don't miss it.

It should be emphasized that all the methods and investment strategies in this article are for educational purposes only and should not be considered investment advice. Our automated Bitcoin trading agent is far from production ready, so please look after your wallet.

Reference resources:

1) Recurrent neural network and LSTM tutorial in Python and TensorFlow

https://adventuresinmachinelearning.com/recurrent-neural-networks-lstm-tutorial-tensorflow/

2) SARIMA for time series forecasting in Python

https://machinelearningmastery.com/sarima-for-time-series-forecasting-in-python/

3) Handling non-stationary time series in Python

https://www.analyticsvidhya.com/blog/2018/09/non-stationary-time-series-python/

4) Algorithms for hyperparameter optimization

https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

5) Recent advances in machine learning methods in the financial sector

https://dwz.cn/iUahVt2u

Source | Towards Data Science Compilation | Guoxi Editor | George Producer | Blockchain Base Camp (Blockchain_camp)