How does Coinbase ensure reliability when demand surges?

Jordan Sitkin and Luke Demi discuss how Coinbase responded to the surge of interest in cryptocurrency in 2017, and how its engineers have used the lessons from that experience to build new tools and techniques for capacity planning, so that they are ready for the next wave of the cryptocurrency boom. Luke Demi is a software engineer on the Coinbase Reliability team. Jordan Sitkin is a software engineer at Coinbase. This article is adapted from their talk at QCon, "Capacity Planning for Crypto Mania."

Demi: I'll get straight to it. If you tried to access Coinbase in May 2017, you might have seen a page like this. You might have been trying to check your balance, buy a new cryptocurrency, or withdraw funds from Coinbase's website, but cryptocurrency prices were soaring, people had suddenly become interested in buying, and the trading system went down. A bit like the QCon website, right? Just kidding, though we can't wait to hear that war story from whoever was in charge.

Anyway, settle down. This was terrible for a number of reasons, but mainly because, if you were a Coinbase customer at the time, you were worried that your money was gone. Had Coinbase just been hacked? What happened? We started to see some ugly headlines, like this one: "Coinbase has fallen with Bitcoin and Ethereum." Fair enough, but it hurt. For years we had kept our heads down, trying to build this ecosystem and lead this change. The comments on Reddit were somewhat funny, but they captured the mood of the day. One of them read: "Interestingly, Coinbase should use AWS, so capacity shouldn't be a problem. Unless they don't use auto-scaling or are too cheap to buy more AWS resources, or just lazy."

This is a picture of what that day looked like for us. It was an interesting day, because we spent all of it worried. For about 8 hours we sat in a room with a sofa, trying to find a way to bring the website back. The New York Times published an article that day saying that Coinbase was down and unable to handle the load. It was an important turning point for the company. We realized that if we could not get our act together, Coinbase would not survive.

Sitkin: We call this talk "Capacity Planning for Crypto Mania." It is about how we turned things around: how we reversed the story, from the dark days Luke just described through the past year. We are going to share what we learned during that time. Within a few months, Coinbase's traffic increased 20-fold. We made some mistakes, learned some lessons, and were fortunate to work on some interesting challenges. Of course, our tools and systems had to mature quickly during this period. We will tell the story of Coinbase from that day forward, sharing insights and experiences along the way. Then we will talk about what we are doing now to be better prepared for the future.

Let's start with a brief introduction. I am Jordan, and this is Luke. We are members of the Coinbase Reliability team. Note that we are only two of the people who were involved during last year's craze. We are also responsible for keeping the system reliable now and in the future.

Buying and selling digital currency

 

As many of you have indicated, you are already familiar with Coinbase. Most people know it as an application. We want it to be the easiest and most trusted place to buy, sell, and manage digital currencies. But Coinbase is now more than a single application: it is a small collection of brands and services under the Coinbase name. To give an overview of our technology stack, what we are talking about today is a monolithic Rails application. It is backed by MongoDB, and its infrastructure is deployed and managed on AWS.

We want you to take two important things away from today's talk. First, the topic is capacity planning, but we will actually spend most of the time talking about load testing. The reason is that we feel load testing (or capacity testing, stress testing, batch testing) is our most important capacity planning tool. Whatever name you use for it, load testing lets you simulate and study realistic failures that could occur in production. That will become clearer as we go.

Second, perfect parity with production is not a requirement for getting good results from a load testing system. One way we will address this is by introducing what we call the capacity cycle. It is a theme we will return to throughout the talk, so I want to introduce it now. We will explain what it means in detail, and we will also share some of our experiences through that lens.

Backend RPM

 

Demi: OK, let's step back and talk about the background to this story. This simply shows what our traffic patterns looked like before things got bad. They were very calm. This is backend traffic to our Rails service. You can see that traffic rises and falls, but it fluctuates within a fairly narrow range. There is a red line drawn here for no special reason. It just marks a level above which things would be very bad, and we were still far below it.

At that point, crossing the red line would have meant that people were really interested in cryptocurrency and that the business was doing well, so it was not the kind of thing we worried about. We did not think we would ever reach it. However, the red line eventually became a disaster. This is our traffic chart as of July 2017. It shows how our traffic changed and how we blew through the red line. This was clearly driven by the enthusiasm for Ethereum and Bitcoin.

We stayed above the red line for several days. During those days, from 3 am Pacific time until midnight, we received more than 100,000 requests per minute, because people just kept trying to log in and buy. Our system had collapsed, but that did not matter; people kept coming back. The best way to describe it is an "explosion", with people scrambling everywhere. Here is another New York Times article. You can see how bad things got during that time.

Web service time breakdown

 

Things went badly because, although traffic followed the same pattern as before, it grew far too fast. This is New Relic. It is very simple: it shows our Ruby time and our MongoDB time. Ruby time obviously accounts for a large share of what happens inside our application, and MongoDB for only a little. This is what we normally see when someone visits the site: 80% of the time is Ruby, a little is MongoDB, and there are other services we are not counting here. However, when we ran into these problems and the website was down, it looked like this. This is confusing for a whole range of reasons. First, Ruby and MongoDB are tightly correlated: for some reason they move in lockstep. They have also clearly risen a lot. Look at the scale on the left: that is a 4-second response time for Ruby and Mongo alone. You can imagine that nobody was getting through at that point.

This is a very confusing chart. What is even more confusing is that, at the time, it was the only chart we had. We sat there staring at it and said, "Well, obviously Mongo is slow, but Ruby is also very slow." "Is anything else going on?" "Man, that's what the chart says." We had no tools for digging into what was really happening. As it turned out, the chart was wrong, and it sent us down some very unproductive paths. We spent many days doing ridiculous things. The most plausible idea we had at the time was: "What if we tune the kernel parameters?" That made sense to us then. Sure, maybe the kernel was the problem. We tried a lot of things that were hard to justify.

Wake up the dreamer

 

To put this in terms of the cycle Jordan introduced, let me describe what was happening at the time. We were getting a thorough, consistent load test every morning, or really throughout the day. But when we moved on to analyzing it, we got it completely wrong. In the improvement step of the cycle, what we should have done was add instrumentation to help us understand what happened during the load test. At the time, we did not do that. We just stared at the same tool again and came up with strange ideas.

To get past this problem and some of its consequences, we did what everyone loves to do: we blamed the database and upgraded versions. We upgraded MongoDB version by version until it was up to date. If you are familiar with MongoDB, you will know that the version we started on was very bad, so the upgrades brought considerable improvements. We did other things too, such as splitting our clusters. For example, where a users collection and a transactions collection were hosted on the same host, we separated them. That gave each of them more room to grow and got us out of the immediate predicament. However, we realized that if we wanted to withstand more load, we needed a better approach, because at this point it still felt as though traffic was increasing and things were going to get worse. And indeed they did.

We did take away a major lesson here. It is obvious, but it is important to state it clearly: good instrumentation surfaces problems, while bad instrumentation obscures and confuses them, and sooner or later it will make you look like an idiot.

Sitkin: Let's go back to our timeline, to about mid-2017, and to the traffic chart we have been looking at. After that breakout, we went through some forced panic, quickly improved the system, and survived. Traffic had increased and then become relatively stable at a new baseline level. By this point we were basically coping. There were still some fluctuations, but we had reached a new equilibrium.

At this point, the load tests our users were giving us no longer took the site down in the same way. As we understood only later, we had dropped out of the cycle. We had greatly improved our instrumentation and tuned what we already had, but we were not creative enough to push our system into its next failure mode. So we were still uneasy and worried about the future. In terms of the cycle, this is what we ended up calling the YOLO load testing phase, where the load test step of the cycle was a weak one.

We realized that we needed to bring the load test step back ourselves in order to return to a cycle of fast improvement. So we did the simplest thing we could think of: we took some off-the-shelf tools and ran some synthetic load tests against our development environment. We were not sure where best to start, so we simply started.

Pillars of realistic load testing

 

Before getting into the details of what we did, this is a good time to introduce what we call the pillars of realistic load testing. It is a way of breaking a load testing strategy down into three categories, and a better way of assessing how realistic it is. A key point is that each of the three categories can be worked on separately while still making our load tests more realistic. We do not need to improve them together, and they do not even have to be equally important.

The first pillar is data: the shape of the data in our database. For example, how many rows of each type are there in our test environment? Are the relationships between those records realistic compared with production? The second is traffic: the shape of the traffic coming out of the load testing system. We can ask ourselves how many requests there are, at what rate, and which endpoints they hit. Does the distribution of those requests match the traffic we actually see in production? The third pillar, systems, is the simplest of the three: do the physical systems that make up our load test environment match what we actually run in production? How many servers do we run in the load test environment? How are they connected? Does the network layer look the same?

With that in mind, back to what we were actually doing at this point. We started with an off-the-shelf open source tool called Locust. It seemed like the easiest way for us to get started with distributed load testing, and we knew we could get quite a lot of throughput out of it. If you are not familiar with Locust, here is a brief introduction. It is a tool that lets you write simulated user flows in Python and then play them back across multiple nodes to produce a distributed load test. The main controls you have are the number of users in the simulated user pool and the rate at which users are added to it.

In a bit more detail, this is the basic architecture of Locust. There is a master node that we interact with to control the test, and it in turn controls a number of slave nodes. Each slave node is responsible for maintaining a small pool of users, each of which runs through the simulated user flow we wrote in Python.
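
To make the master-and-pool idea concrete, here is a minimal sketch in plain Python. It does not use Locust's API; it only imitates the shape described above. The endpoint names, the weights, and the helper names (`split_users`, `simulate_worker`, `run_test`) are assumptions made up for illustration, and instead of sending HTTP requests it simply counts which endpoints the simulated users would hit.

```python
import random
from collections import Counter

# Hypothetical user flow: (endpoint, weight) pairs. These names are invented.
USER_FLOW = [("/signin", 1), ("/dashboard", 3), ("/buy", 1)]


def split_users(total_users, worker_count):
    """Divide the simulated user pool as evenly as possible across workers."""
    base, extra = divmod(total_users, worker_count)
    return [base + (1 if i < extra else 0) for i in range(worker_count)]


def simulate_worker(user_count, steps_per_user, rng):
    """Run each user in one worker's pool through the weighted flow.

    Returns a Counter of how many requests each endpoint would receive.
    """
    endpoints = [endpoint for endpoint, _ in USER_FLOW]
    weights = [weight for _, weight in USER_FLOW]
    hits = Counter()
    for _ in range(user_count):
        for endpoint in rng.choices(endpoints, weights=weights, k=steps_per_user):
            hits[endpoint] += 1
    return hits


def run_test(total_users, worker_count, steps_per_user=5, seed=0):
    """Play the master's role: hand out users, then aggregate worker results."""
    rng = random.Random(seed)
    totals = Counter()
    for users in split_users(total_users, worker_count):
        totals += simulate_worker(users, steps_per_user, rng)
    return totals
```

Calling `run_test(10, 3)` spreads 10 users over 3 workers and reports 50 simulated requests in total. In a real tool the user count and the rate at which users are added are the knobs you turn; here only the user count is modeled.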

That was our first attempt. In the end, this test was a toy. It did not give us the kind of results we needed, we did not really understand why, and we did not trust the results it did give us. Despite these problems, it helped us understand what a good load testing system should look like.

In terms of the pillars we just introduced, we knew that data and systems did not match production. That was by design: we knew this was a naive first pass. The real problem was that we did not understand in what ways the traffic coming out of the system was unrealistic. We had not built a process for making it more realistic, not even for creating the initial user flow, and we had not worked from first principles. So the system was never going to become an important tool in our toolbox, because we did not trust it. As it happened, this quickly became irrelevant.

The holy grail

 

Demi: That is because in late 2017, around October, cryptocurrency truly went mainstream. We became the kind of company that your mom has heard of. Our app reached first place in the app store download charts, and the news coverage started to look like this: "Bitcoin mania," "Cryptocurrency mania," "Should we buy some cryptocurrency?", "Should you invest your money in cryptocurrency?"

Our traffic now looked like this. The red line is drawn in the same place as before, for scale. At this larger scale, you can see a jump in traffic much bigger than the one that originally took us down. This time, things were not as bad. We certainly had some failures, depending on how closely you were watching, believe it or not. But the reason we survived is that we had something we call the "Holy Grail of the Capacity Cycle." To deal with the basic problem we had recognized, we settled into the rhythm of this cycle. For example, we would wake up at 4:30 in the morning, when people on the East Coast were thinking, "Should I buy Bitcoin?" and starting to hammer our website. We would spend the morning just keeping the site online. At lunch, we would analyze the results: what caused today's downtime, and what are we going to do about it for tomorrow?

For example, say the users collection in Mongo had already been split out into its own cluster and we had no more room to scale it vertically. What could we do? We would spend the night trying to improve things. We added instrumentation to increase visibility and made whatever improvements we could. Then, on day 2, we faced the load test again. This cycle kept going. Every day we woke up and had a cup of coffee: "OK, time for today's load test." Then, in the afternoon, we analyzed what had happened and improved our tooling.

Here is one of the important things we did for the users cluster during this time. Over a weekend, we added a full caching layer using an identity cache backed by Memcached. This reduced the load on the users cluster by 90%. It gave us enough headroom for at least a few weeks, which let us make it to the next day and solve the next problem. This process is what allowed us to keep solving problems. It also taught us an important lesson: the faster the feedback, the faster the progress. Every pass through the cycle is an opportunity to improve, and the more passes we make, the more improvements we can make.
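
The talk does not show how the cache was wired in, so the following is only a sketch of the general read-through pattern such a layer relies on. An in-memory dictionary stands in for Memcached, and `ReadThroughCache`, `load_from_db`, and `db_load_fraction` are hypothetical names introduced for illustration.

```python
class ReadThroughCache:
    """Read-through cache sketch: check the cache first, fall back to the database."""

    def __init__(self, load_from_db):
        self._load = load_from_db  # function that fetches a record from the database
        self._store = {}           # stand-in for Memcached
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: only now does the database see a read.
        self.misses += 1
        value = self._load(key)
        self._store[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached record, for example after the underlying row is updated."""
        self._store.pop(key, None)


def db_load_fraction(cache):
    """Fraction of reads that still reached the database."""
    total = cache.hits + cache.misses
    return cache.misses / total if total else 0.0
```

If the same user record is read ten times in a row, only the first read reaches the database, so `db_load_fraction` returns 0.1. How much load a real cache removes depends on the hit rate of the actual workload.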

Sitkin: Let's go back to our timeline. It is now 2018, and we have just come through this painful, exhausting period. During it we had a fast feedback cycle and rapid improvement, and we made a lot of progress on our system. We then entered a period of stability: no longer a new record every day, with record highs becoming less frequent. You might think we felt relieved. Yes, believe me, it was great to get more sleep.

However, every morning there was an underlying anxiety, because we were no longer getting that wonderful load test. When it came to increasing our capacity, we no longer had a real catalyst telling us what the most important thing to work on was. In terms of the cycle, this is when the load test step was slipping away from us. That was a little scary, because we knew there were still lingering performance issues, and new features kept being added as new situations came up.

In particular, we were now very focused on launching new currencies. Each launch produced interesting and slightly different traffic spikes, and we wanted to be prepared for those different traffic patterns. So at this point we felt the need to restore the load test step of the cycle. We needed to be able to see realistic load in a test environment.

Production traffic capture and playback

 

After our experience with the YOLO synthetic load testing approach, we were not very confident in traffic produced by a synthetic system. One question kept coming up as we worked on this: why can't we use real production traffic in our load tests? It seemed a natural and very interesting question. If we did, the failures we created in our load tests would naturally match the kinds of failures we see in production, because we would be using actual user behavior.

In addition, we were fairly sure that MongoDB was the most sensitive part of our stack. It was the most heavily shared resource, and it was what kept taking us down. In general, testing one component in isolation is not a great idea, so I want to use another concept to explain why we felt it was useful to test something like a database on its own. Neil Gunther has a concept called the Universal Scalability Law. It is one of several ways of describing how systems scale. This simple chart has two lines. The dotted line represents how a system with no shared resources scales. As load increases, throughput increases in a perfectly linear relationship, so the system scales linearly. Because there are no shared resources in the system, there is no reason for that relationship to change as load increases.

In contrast, the red line represents how a system with shared resources scales when there is contention for a shared resource such as a database. As load increases, throughput falls away from the ideal because of contention around the shared resource. The exact shape of the curve depends on the actual design of the system. The key point is that, as the system grows, returns diminish, and as load grows further, throughput even starts to go backwards a little. This is a way of saying that load on the database is almost always what takes us down, precisely because it is a shared resource. Most of the time we have a fairly clear scaling story for things like application nodes. Those nodes are stateless and are not shared resources themselves, but they usually put load on the database in question.
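
For readers who want to see the two curves numerically, Gunther's Universal Scalability Law is usually written as X(N) = λN / (1 + σ(N − 1) + κN(N − 1)), where N is the load, λ is the throughput of a single unit of load, σ is the contention coefficient, and κ is the coherency (crosstalk) coefficient. The sketch below evaluates it; the coefficient values used in the note that follows are arbitrary illustrations, not measurements from Coinbase.

```python
import math


def usl_throughput(n, lam, sigma, kappa):
    """Universal Scalability Law: throughput at load n.

    lam   -- throughput of a single unit of load
    sigma -- contention coefficient (queueing on shared resources)
    kappa -- coherency coefficient (cost of keeping shared state consistent)
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def peak_load(sigma, kappa):
    """Load at which throughput peaks; beyond it, throughput goes backwards."""
    return math.sqrt((1 - sigma) / kappa)
```

With `sigma=0` and `kappa=0` the function reproduces the dotted line: throughput grows linearly with load. With, say, `sigma=0.05` and `kappa=0.001`, throughput peaks near a load of 31 and is lower at a load of 100 than at a load of 31, which is the retrograde region of the red line.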

Mongoreplay cannon

So we designed a capture and playback setup for load testing the database. In the end, we built something we call the Mongoreplay cannon. It has two main parts. The first is a Mongoreplay capture process that sits alongside our application nodes and background workers. It opens an extra socket at the database driver that listens to Mongo wire traffic and stores it. The second is a separate process that we run when we are ready to load test: it takes the capture files, merges them into a single file, and plays that back against a test environment at a multiple of the real volume. That test environment is usually a clone of our production database.
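
The merge-and-amplify step can be sketched in a few lines of Python. This is not the Mongoreplay cannon itself and does not read Mongoreplay's file format; it assumes each capture is simply a list of `(timestamp_seconds, operation)` tuples already sorted by time, and it amplifies volume by compressing the gaps between operations, which is one possible way to replay at a multiple of the real rate.

```python
import heapq


def merge_captures(captures):
    """Merge per-node captures (each sorted by timestamp) into one ordered stream."""
    return list(heapq.merge(*captures, key=lambda op: op[0]))


def replay_schedule(ops, speed):
    """Return (offset_seconds, operation) pairs for replay.

    Offsets are measured from the first captured operation and divided by
    `speed`, so speed=2.0 sends the same operations in half the time.
    """
    if not ops:
        return []
    start = ops[0][0]
    return [((timestamp - start) / speed, operation) for timestamp, operation in ops]
```

For example, merging a capture from one node containing operations at 0 s and 2 s with a capture from another node containing one operation at 1 s, and then replaying with `speed=2.0`, yields offsets of 0.0, 0.5, and 1.0 seconds.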

This was a huge win. It works very well when we want to test the database in isolation. We have used it to do things like right-size our clusters with confidence, because we can change different tunables on the database and know exactly how it will respond in the real world. And because we are playing back the same load that the application actually generates, we can create very realistic failures in our test environment. Naturally, the next thing we asked was this: if the approach works so well for testing the database in isolation, what about testing the rest of our system with the same strategy?

Demi: As we have seen, we can clearly test MongoDB in isolation. We can literally capture the traffic between Rails and Mongo and play it back. That is as realistic as it gets. However, we realized that there are many boundaries in our system, and we needed a way of testing them that matched the success of our Mongoreplay story. All of this effort is about making sure we do not go down at a critical moment. Testing in isolation can tell us a lot about an individual system, but it does not test the relationships and boundaries between systems. What if there is a new regression in a less prominent system, or in some other shared resource in our environment? So we knew we needed to find a different approach.

Traffic, data and systems

 

Let's go back to the three pillars of realism, starting with traffic. We have ranked the options by how realistic they are. The lowest, most basic and least realistic method is a simple single-user flow. That means testing only what happens when a user clicks through three or four pages, much as if we were running an A/B test or something like it.

The second, more realistic approach is to generate synthetic traffic based on real user flows. For example, we might have a tool that generates traffic synthetically, but we look at our log data to work out what to play back in the new environment. Finally, the holy grail is capture and playback: we use real production traffic, point it at something else, and see how that responds.

Capture and playback sounds great. The main problem we ran into was that it is extremely difficult to record POST bodies, especially for our traffic. We carry a lot of sensitive information in POST bodies, and replaying that information raises many problems, whether in a non-production environment or in production. In addition, many of the IDs used by subsequent requests are not generated deterministically, so when we try to replay traffic it is hard to match those requests up. As a result, we would have to rewrite them in some way to make them work, which hurts realism.

Next, data. The simplest approach is to use the seed data from a development environment: just create some basic users and plug them into the load testing framework. The next best option is to generate the data synthetically, but in a more realistic way. That means actually looking at the shape of our user base, for example how many users hold many accounts and how those users relate to one another, and then recreating those things synthetically in our environment.

Finally, at the top of the list, there is scrubbed data. We can take the production database and put it in a non-production environment, or in another production environment, as long as we explicitly remove the important customer information we want to protect. Clearly, the most realistic load test possible is one that runs against the data in the production environment itself.
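
The talk does not describe how scrubbing would be done, so the following is only one possible sketch. It replaces sensitive values with a keyed hash, so that the same input always maps to the same token and relationships between records keep their shape. The field names in `SENSITIVE_FIELDS` and the function name `scrub_record` are assumptions for illustration, and a real scrubbing process would need a proper review of which fields count as sensitive.

```python
import hashlib
import hmac

# Hypothetical list of fields to protect; a real schema would differ.
SENSITIVE_FIELDS = {"email", "name"}


def scrub_record(record, secret):
    """Return a copy of `record` with sensitive values replaced by stable tokens.

    `secret` is a bytes key; keeping it out of the test environment prevents
    tokens from being recomputed from guessed inputs.
    """
    scrubbed = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            scrubbed[field] = digest.hexdigest()[:16]
        else:
            scrubbed[field] = value
    return scrubbed
```

Because the mapping is deterministic, two records that shared an email address in production still share a token after scrubbing, which preserves the shape of the data without exposing the original value.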

Finally, systems, which is fairly intuitive. If we test in a simple development environment, we will not get realistic results; if the problem lies in the size of our clusters, for example, we will never see it. The next step up is a production-parity environment. Perhaps we have a framework that builds all of our production resources, and we point it at another AWS account and run it. That is very realistic. Clearly, though, testing in production is the ultimate strategy: you get the exact production environment, with all of its problems, nodes, and so on.

When we started thinking about this problem, we realized that capture and playback was hard and that the synthetic approach was hard too. But we are a blockchain company. Why couldn't we solve this with a blockchain solution? So we decided to take a bold new approach. Today, we are pleased to announce the release of Load Testing Coin at Coinbase. Load Testing Coin is an ERC20-compatible coin that lets you get your users to load test your site for you. I am just kidding. We don't want to get everyone too excited. We had to do it. There is no Load Testing Coin; it is fake. But it would be cool, right? If you could just tell the user, …, no, not a coin, the coin is silly. Seriously, would you buy it?

Sitkin: Yes, I might consider buying it.

Demi: We would cheer for it. Sorry, this talk is being transcribed, right? Can we cut this part? The key point is that the ideal solution is to get actual users onto the website doing real things. Like what happened to us: we want to create crypto mania on demand, and if we could give people an incentive to take part, that would be great. Unfortunately, the idea was rejected. What we actually decided to do was return to our initial strategy with Locust. We wanted to see how much we could improve it along each of these pillars. How could we move up these levels and turn this strategy into something practical that we could use in our daily work?

Capacity cycle in practice

 

Sitkin: So, as Luke said, we returned to the testing strategy we used to call YOLO. This time, however, we started from a better understanding of what we needed from a load testing system. In this section I will walk through what it looks like to apply the capacity cycle in our day-to-day work. We start with a load testing system in a very basic state, use the cycle to make it more realistic, and end up finding some interesting scaling problems in our system.

Here is the setup for this example. We wanted to get started as quickly as possible. For data, we kept things to just enough seed data, created with our spec factories, to give us a baseline to work from. For traffic, we scripted a single, simplified single-user flow, the most basic thing that could work. For systems, we used the development environment we were already running: a few nodes behind a load balancer, no background workers, and a single instance of each of our three main backend databases. So it was straightforward, very simple, and very stripped down.

Let's begin. In the first load test we reached up to 400 requests per second. The first thing we saw was that the CPU on Redis was maxed out. We had hit a ceiling. Naturally, we took a closer look at how the system was laid out. The problem was obvious: in production we isolate workloads by running several different Redis clusters, whereas in the test environment we were running a single cluster for everything. That was a quick cycle. The easiest thing to do was to make our systems more realistic, so we improved that pillar by splitting the Redis cluster in the test environment to match what we run in production.

OK, let's go through the cycle again. This time, we failed at about the same point. Looking at the CPU statistics for our Redis clusters, we again saw one of them running at full capacity. As it turned out, we had simply forgotten to disable a tracing tool that consumes a lot of compute and that we do not use in production. That gave us another good opportunity to make the systems pillar of our load test environment more realistic, simply by disabling the tool. So this was another fast feedback cycle.

OK, let's test again. This time we got further, to 850 requests per second. Looking at the metrics in our system, something different happened. The Redis statistics looked fine. The Mongo statistics looked fine. So we zoomed in and compared end-to-end traces of production requests to this endpoint with traces of the same requests in the load test system, and they looked almost identical. Yet we were clearly failing earlier than we expected. Then we noticed that the CPU on our application nodes was maxed out. Basically, Ruby time was the root of the problem.

This was another opportunity to make the systems pillar more realistic, by matching the ratio between application nodes and backing services to what we actually run in production. The easiest way was to increase the number of application nodes in the test environment, which brought us closer to the production system. We could make that change quickly and complete the cycle again.

On the fourth run we got further still, to 1,500 requests per second. This time the first thing we saw was that the CPU load on our MongoDB cluster was too high. If you use MongoDB a lot, you probably know the excellent tool mloginfo, and that is what we used. It is a great tool for parsing MongoDB server logs and producing ranked breakdowns, in particular of which query shapes are expensive and slow in the system. It pointed us to a single expensive query that we had never seen before, ranked near the top. That suggested a significant difference between the shape of the data in the load test system and in production. At that point we realized that we had not been properly resetting our database state between load test runs. So we needed to make the data pillar more realistic, in this case by fixing the reset script.
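
To illustrate what grouping by query shape means, here is a small sketch of the idea in Python. It is not how mloginfo is implemented and does not parse real MongoDB logs; it assumes the slow-query entries have already been extracted as `(query_dict, duration_ms)` pairs, and the collection fields in the example are invented.

```python
from collections import defaultdict


def query_shape(query):
    """Reduce a query to its shape: keep keys and nesting, drop the values."""
    if isinstance(query, dict):
        parts = (f"{key}: {query_shape(value)}" for key, value in sorted(query.items()))
        return "{" + ", ".join(parts) + "}"
    return "?"


def rank_shapes(log_entries):
    """Group entries by query shape and rank shapes by total time spent.

    Returns a list of (shape, count, total_ms), most expensive first.
    """
    totals = defaultdict(lambda: [0, 0])
    for query, duration_ms in log_entries:
        entry = totals[query_shape(query)]
        entry[0] += 1
        entry[1] += duration_ms
    rows = [(shape, count, total) for shape, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Two lookups by `user_id` with different values collapse into the single shape `{user_id: ?}`, while one slow query on `status` and `created_at` shows up as its own shape at the top of the ranking. A shape that suddenly ranks high in the test environment but not in production is the kind of signal described above.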

We did another run with the fixed environment. This time we failed at 1,400 requests per second, a very similar range. MongoDB's CPU statistics were fine again, but the traces showed that responses from Memcached were slow, which is definitely not something we are used to seeing. So we dug into it a little and looked at hardware-level statistics. The first thing we noticed was that we were trying to run production-level load against a single t2.micro. That is a very small instance to run Memcached on, so of course it fell over quickly.

Again, there was a simple way to add realism: increase the size of the instance on the Memcached node. That gave us another chance to go around the cycle. This time we actually got pretty far, past 1,700 requests per second on this user flow. From a resource perspective this is a particularly expensive flow, and we don't often see this much load on it in production. Reaching 1,750 requests per second in a test environment is very encouraging; it's a good result.

The CPU statistics on Mongo looked pretty good, and the hardware statistics across the rest of the stack looked healthy too. That gave us pause, because if this user flow were running in a truly realistic way, we would not expect to be able to serve that much load. So we suspected a difference in the traffic category between what we see in production and what we see in the load test environment. Digging a little deeper, that turned out to be exactly the case. To reproduce this user behavior in the script, we were hitting only a small subset of the endpoints that really get hit. By investigating more carefully and writing better scripts that reconstruct what the logs show users doing when they actually go through this flow, we improved the realism of the traffic.
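One way to spot this kind of gap is to compare the endpoints that appear in production logs with the endpoints the load test script actually exercises. The sketch below assumes a simplified log format of `METHOD path status`; the log lines, the endpoint names, and both helper functions are hypothetical.

```python
from collections import Counter


def endpoint_weights(log_lines):
    """Return each 'METHOD path' as a fraction of all observed requests."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 2:
            counts[f"{parts[0]} {parts[1]}"] += 1
    total = sum(counts.values())
    return {endpoint: count / total for endpoint, count in counts.most_common()}


def uncovered(production_weights, scripted_endpoints):
    """Endpoints seen in production that the load test script never hits."""
    return {
        endpoint: weight
        for endpoint, weight in production_weights.items()
        if endpoint not in scripted_endpoints
    }


# Invented access log lines, for illustration only.
log_lines = [
    "GET /accounts 200",
    "GET /prices 200",
    "GET /prices 200",
    "POST /buys 201",
]
weights = endpoint_weights(log_lines)
missing = uncovered(weights, scripted_endpoints={"GET /accounts", "POST /buys"})
print(weights)
print(missing)
```

The same weights could then be used to decide how often each request should be issued by the load generator, so that the scripted flow costs roughly what the real one does.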

On the last run of the cycle we got up to 700 requests per second, which is probably closer to what we should expect. The analysis showed that most hardware statistics across the stack, and Mongo in particular, were fine. We are used to seeing failures in Mongo, so the database-level hardware statistics are usually the first place we look. However, the CPU on our web nodes was maxed out again. The last time we hit this, we doubled the number of application nodes in the test environment, because that brought us closer to production. This time we were not going to do that, because the ratio of nodes in the load test environment already matched production. Adding more nodes would be the wrong move; it would reduce realism.

When we dug into profiling to see what the application was actually spending CPU on, we found something interesting. It was a problem we already knew about, but now we could see that this particular performance issue was the bottleneck that made the test fail: we had some event tracking that was very inefficient. This is the result we hope to get. We had found a convincing, realistic bottleneck, one that might well take us down in production at this level of load. So the right thing to do here was to apply a fix to the system itself. We improved the system, and then we could jump back into the cycle and learn more.

I'll leave the example there. I hope that walking through a real-world example has shown how powerful this fast feedback cycle is. It improves the system under test and it improves the test system itself. Every iteration of the cycle is an opportunity to improve one or the other, and you quickly find out where the most important problem is.

One important point: we probably already know that our system has many potential performance problems. The real value of this process is that it forces one of them to the surface as the most prominent, the one most likely to make us fail. That makes planning our work very easy. It tells us the most important thing to fix in order to take capacity to the next level.

We have some bigger plans for the future, but the takeaway I want to leave from this section is this: we are all pursuing realism in a load test environment, but the process of adding realism step by step is as important as the end goal, because we learn so much along the way.

Future

Demi: That brings us to today. Right now we use load testing to prepare for new currency launches. At the moment it is more of a tool for Luke and Jordan. We use it to find and confirm anything that could cause trouble during the traffic surge that comes each time we launch a new currency. We can test the flows we know will be prominent during a launch and make sure, before it happens, that there are no bottlenecks we have missed.

We are currently working on the realism of these tests. Our top priority is building automated ways to learn the latest traffic patterns and apply them to the tests we actually run. At the same time, we need to keep improving how we handle data. We want to get to the point where our traffic is based on completely real traffic and our data is based on real, scrubbed user data. Ideally, our systems would be the same as well.

This cycle will continue to guide us. We are very excited about the capacity cycle, and we promote it to everyone internally; some people believe in it and some don't. What we really have to do internally is make the idea stick. We don't want this to be a tool for Luke and Jordan, because a tool for Luke and Jordan won't survive for long. What happens when we get pulled onto other work? What happens when the tool starts to drift and fall apart? It becomes useless.

What we are trying to do is make this a repeatable, lasting part of how we deliver code to production. We want this cycle to provide fast feedback for everyone, so that when someone pushes a new change, they know whether it will affect performance in production. One way we are approaching this is to make it part of the build process, run in a squeeze test environment, meaning a very consistent environment: how many requests can we handle on the most important flows? These are the things we are focusing on now.
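A build-time check of this kind could be as simple as comparing the throughput measured for each important flow against a stored baseline. The following is a minimal sketch under that assumption; the flow names, the numbers, the 10% tolerance, and the `check_throughput` helper are all hypothetical and do not describe an existing Coinbase pipeline.

```python
def check_throughput(baseline_rps, measured_rps, tolerance=0.10):
    """Return flows whose throughput fell more than `tolerance` below baseline.

    Both arguments map a flow name to the highest requests per second the
    flow sustained in a consistent test environment.
    """
    regressions = {}
    for flow, baseline in baseline_rps.items():
        # A flow missing from the new measurements counts as a regression.
        measured = measured_rps.get(flow, 0)
        if measured < baseline * (1 - tolerance):
            regressions[flow] = (baseline, measured)
    return regressions


# Invented numbers, for illustration only.
baseline = {"login": 700, "buy": 300}
measured = {"login": 690, "buy": 240}
regressions = check_throughput(baseline, measured)
print(regressions)
```

In a build pipeline, a non-empty result would fail the build or at least flag the change for review. The tolerance exists because even a very consistent environment will show some run-to-run noise.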

Lessons

But before we leave, I want to quickly review our lessons. First, good tools surface problems, and bad tools obscure them and leave us confused. I think this applies to almost anything, especially if you are the one who often digs into problems in production. If you don't have the right tools, add them. If you don't understand your tools, be curious and study them. Don't assume that what you see is what is actually there.

Faster feedback means faster progress. This applies to everything, but it applied especially to surviving and moving forward during those periods of crazy traffic growth, because we could use the fast feedback to guide the actions we took. Finally, we got excellent results from load testing by starting with a simplified environment and increasing realism over time. With load testing, the process is as valuable as the goal. Don't be afraid to dive in and start right away to get results. In the past we ran into trouble here, many times over: we put realism first and tried to make everything identical to production, and in doing so we missed out on what we would have learned along the way.

That's the end of the talk. We learned a lot of really valuable lessons about capacity planning, and we hope this talk helps you avoid the mistakes we made. These are our Twitter handles. We would really like to hear how you are doing load testing in the field, and we hope this sparks some interesting conversations. Of course, you can also reach us on Twitter at any other time during the conference.

Q&A

Participant 1: Thank you for your presentation. It was very insightful and closely reflects some of the work my team is doing right now, so I would like to learn from your experience and apply it. My first question is, where do you trigger these tests from? And when you run load tests at 5, 10, or 50 times production load, how far do you push? Or are you satisfied with asking, "Well, this is the current design, what can it support?"

Sitkin: Locust provides a web UI that lets us kick off the tests, which is one of the benefits of using this open source tool. As for how far we push, so far we have basically been turning it up to a ridiculous level and looking for the point where performance starts to degrade and the service falls over. So at this point we are really doing stress testing. We look for the point where the system stops working, measure what that point is, and keep raising it. We have a rough idea of where we want to get to, which is a rough multiple of the load we saw during the last surge. But a large part of it is simply learning where a particular user flow starts to fall over.
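The stress-testing approach described here, raising load until the system stops working and recording where that happens, can be sketched as a simple loop. This is an illustration only: `measure` is a hypothetical callback that would run one load test step (for example, a Locust run at a fixed rate) and return an error rate and a p95 latency, `fake_system` is an invented stand-in so the example runs on its own, and the thresholds are assumptions.

```python
def find_breaking_point(measure, start_rps=100, step_rps=100, max_rps=5000,
                        max_error_rate=0.01, max_p95_ms=1000):
    """Step load upward; return (last healthy rate, first failing rate)."""
    last_healthy = None
    rps = start_rps
    while rps <= max_rps:
        error_rate, p95_ms = measure(rps)
        if error_rate > max_error_rate or p95_ms > max_p95_ms:
            return last_healthy, rps
        last_healthy = rps
        rps += step_rps
    # The system stayed healthy all the way up to max_rps.
    return last_healthy, None


def fake_system(rps):
    """Invented stand-in for a real test run: degrades sharply past 700 rps."""
    if rps <= 700:
        return 0.0, 120.0
    return 0.25, 4000.0


healthy, failing = find_breaking_point(fake_system)
print(healthy, failing)
```

The interesting output is less the number itself than what broke at the failing step, since that is the next thing to investigate in the cycle.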

Participant 1: One more question. How do you handle dependent services that are not under your control? If they turn out to be the bottleneck, do you have any suggestions for dealing with that, or for working with those teams to solve the problem?

Sitkin: Yes, this is the part of load testing that I said is a bit of an art. One of the benefits of starting out very simple is that if a dependency is hard to include, you can just leave it out at first. Then, once you are confident in the realism of one small flow, you can start adding those dependencies one at a time. Each one has its own characteristics, and its own best way to be stubbed out or simulated.

For example, as a cryptocurrency company, many of the things we ultimately depend on are blockchains. Of course, we have to be creative about stubbing those out, and we can only get so much realism with things like that. So it's hard for me to give a general answer to this kind of question. But going back to the process, it's all about tackling them one at a time. Get a fast feedback cycle and add only one thing at a time, rather than chasing complete realism all at once, because the observations you get from adding a single change can be very useful.

Participant 2: Thank you for your wonderful speech. My question is, as a fast-growing company, when you quickly release new products and new features, do you see in production that the shape of the data and the pattern of requests often change? If so, how do you get your load test environment to keep up with these changes?

Demi: Yes. For traffic in particular, this is one of the main benefits of capture and playback: if you capture and replay, you can guarantee that the traffic is up to date. But the way we handle it today is to run a script over slices of the data and extract the flows that exist in it. Some flows are stateful, so you can't just replay the requests in order; you need to pull the flows out.
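As an illustration of pulling flows out of captured traffic, the sketch below groups requests by session and counts how often each ordered sequence of endpoints occurs. The record format `(timestamp, session_id, endpoint)` and the sample data are assumptions made for the example, not the format Coinbase captures.

```python
from collections import Counter, defaultdict


def extract_flows(requests):
    """Group (timestamp, session_id, endpoint) records into ordered flows.

    Returns a Counter mapping each distinct endpoint sequence to the number
    of sessions that followed it.
    """
    by_session = defaultdict(list)
    # Sorting by timestamp keeps each session's requests in the order sent.
    for timestamp, session_id, endpoint in sorted(requests):
        by_session[session_id].append(endpoint)
    return Counter(tuple(endpoints) for endpoints in by_session.values())


# Invented capture, for illustration only.
captured = [
    (1, "s1", "POST /login"),
    (2, "s2", "POST /login"),
    (3, "s1", "GET /accounts"),
    (4, "s2", "GET /accounts"),
    (5, "s1", "POST /buys"),
    (6, "s3", "POST /login"),
    (7, "s3", "GET /accounts"),
]
flows = extract_flows(captured)
for flow, sessions in flows.most_common():
    print(sessions, " -> ".join(flow))
```

Each extracted flow can then be replayed as a unit with fresh state (a new login, new tokens), which a naive in-order replay of raw requests cannot do.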

What we can do is generate the shape of the traffic and apply it to Locust. That part is actually fairly easy. The shape of the data is extremely difficult, and it's a problem we have not solved at all. We know what the shape of the data is, and we can use the analysis tools our data team uses. They can tell us, for example, that our largest users have 10,000 different transactions, so we can build a distribution from that. But keeping it up to date is something we haven't solved at all. It is very hard. Ideally, we would hook into those tools and adjust as we scrub and refresh the test data.
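Building a distribution of data shape could look something like the sketch below, which draws a transaction count for each synthetic test user from a bucketed histogram. The buckets and their shares are invented for illustration (only the 10,000-transaction upper end echoes the example in the talk), and `sample_transaction_counts` is a hypothetical helper.

```python
import random


def sample_transaction_counts(histogram, users, seed=0):
    """Draw a transaction count per synthetic user from a bucketed histogram.

    `histogram` maps an inclusive (low, high) bucket to its share of users.
    A fixed seed keeps the generated data set reproducible between runs.
    """
    rng = random.Random(seed)
    buckets = list(histogram)
    shares = [histogram[bucket] for bucket in buckets]
    counts = []
    for low, high in rng.choices(buckets, weights=shares, k=users):
        counts.append(rng.randint(low, high))
    return counts


# Invented distribution: most users are small, a few are very large.
histogram = {(0, 10): 0.90, (11, 500): 0.09, (501, 10000): 0.01}
counts = sample_transaction_counts(histogram, users=1000)
print(max(counts), sum(counts) / len(counts))
```

Keeping the histogram in a separate, regenerable file would be one way to refresh it as production data changes, which is the part described here as unsolved.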

Participant 3: Great presentation. Generating load tests is quite expensive; it's basically like an attack, because you are trying to bring the site down. Have you found a way to calculate theoretical limits for realistic situations instead? Is it possible to extrapolate from a few samples and find the next bottleneck?

Sitkin: I have looked a little at some of the concepts in general scaling theory. They are worth keeping in mind when you take this fairly hands-on approach to load testing, and they help us form hypotheses about what might go wrong. But what we found is that once we were in the cycle, we could test those hypotheses very quickly, and that sharpens our performance theory too, because we see how these things really relate in the real world. To answer your question more directly: no, we have not taken a more academic approach. We have followed the intuition we built up from our own system. Do you have a different take?

Demi: It is expensive. What we have to do now is build some fairly sophisticated tooling so that cost doesn't sink us. That comes down to how we scale things up and down. Fortunately, AWS makes this very easy. For the databases and for the application services, we have scripts to spin them up and shut them down. That helps.

What we would like to do, once we adopt this squeeze test idea, is run it on every pull request or commit to our main project. I don't think we have really discussed this yet, but as a start we could take the same data we currently have, run it in the squeeze test environment and in a full production-sized environment, and see how much each can handle. Then we could extrapolate from one to the other, which would be a pretty good deal. What do you think of that idea?

Sitkin: Good idea, Luke.

Demi: It's not easy to give a talk together. We have to run every idea by each other. I'm glad you like this one.

Read the original English text: Capacity Planning for Crypto Mania; https://www.infoq.com/presentations/coinbase-cryptocurrency

Source: blockchain outpost

Translator | Yao Jialing