Topic: Cookie Run: Kingdom, a Retrospective on 56 Hours of Emergency Maintenance (Why We Drew the Sword Back Then)
Speakers: Lee Chang-won, Park Sae-mi (Devsisters)
Field: Programming / Server Engineering
Recommended audience: Server Engineer, DevOps Engineer
Difficulty: Basic prior knowledge required
[Lecture Topic] Cookie Run: Kingdom was released in 2021 and was much loved, but unfortunately it also went through long unscheduled maintenance. In this talk, we would like to share our experience of 56 hours of maintenance early last year: the background before and after the incident, how we resolved it, and the lessons we took from it.
Regular maintenance, unscheduled maintenance, emergency maintenance, and extended maintenance: these are the four types of maintenance that online game players commonly refer to. Regular maintenance is accepted as a routine part of stable operation, updates, and patches, but the other three fall outside players' routine and expectations, so they are a sword that operators would rather not draw. Still, keeping a game running stably often leaves no other choice.
Cookie Run: Kingdom, released in January 2021, faced this problem soon after launch. Its 56 hours of maintenance in total, 36 hours on January 25 and 20 hours on February 19, were covered in some media outlets as well as in the community. Why did Devsisters draw the two swords it did not want to draw, emergency maintenance and extended maintenance, and what was happening inside the company at the time? Devsisters' Park Sae-mi and Lee Chang-won walked through the causes of the problems, the response process, and the actual timeline.
■ DB storage filling faster than expected: from the first maintenance to migrating more than 7 TB
Before the main part of the talk, the speakers introduced some of the measures Devsisters takes to keep Cookie Run: Kingdom stable. User data in the database is replicated across seven replicas and backed up periodically. AWS Global Accelerator is used to cope with network failures, and the workloads required for the database and game server infrastructure are forecast and distributed across three data centers (IDCs). Real-time analysis of client-server logs is also used for optimization.
Before launch, the DevOps team worked with the server development team to run 79 load tests, selecting and combining instance types and sizes suited to the Cookie Run: Kingdom workload and repeating a cycle of finding and fixing architectural problems. After this process, Cookie Run: Kingdom launched on January 21, 2021 and handled all of the launch traffic. Even though far more users arrived than projected, the service stayed stable thanks to generously set load targets and solid scalability.
The first weekend after launch is usually the most critical period for server operations, so Devsisters kept monitoring throughout it. Fortunately the weekend passed without incident, but the embers of a problem remained. Cookie Run: Kingdom uses CockroachDB, a SQL database designed for distributed environments, and its storage was filling faster than expected. The development team had not failed to anticipate storage growth, but after checking database capacity during the weekend monitoring, it decided to make the issue its top priority on Monday and work out a strategy.
When the team assessed the DB storage risk after arriving at work on Monday, it calculated that the cluster would reach a catastrophic failure state within 36 hours. User growth curves usually flatten out over time, but the number of Cookie Run: Kingdom users was still climbing, so the team could not count on that happening. It needed insurance in case the disks filled up.
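The talk does not describe how the 36-hour figure was derived. As a minimal sketch of one common approach, the following fits a straight line to disk usage samples and extrapolates to the disk capacity. The `DiskSample` type, the function name, and all sample numbers are hypothetical and chosen only for illustration; they are not Devsisters' data.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    hours: float    # hours elapsed since the first sample
    used_gb: float  # disk usage observed at that time


def hours_until_full(samples: list[DiskSample], capacity_gb: float) -> float:
    """Least-squares linear fit of usage over time, extrapolated to capacity.

    Returns the hours remaining after the most recent sample, or infinity
    if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(s.hours for s in samples) / n
    mean_u = sum(s.used_gb for s in samples) / n
    var_t = sum((s.hours - mean_t) ** 2 for s in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((s.hours - mean_t) * (s.used_gb - mean_u) for s in samples)
    slope = cov / var_t  # growth in GB per hour
    if slope <= 0:
        return float("inf")
    intercept = mean_u - slope * mean_t
    t_full = (capacity_gb - intercept) / slope
    return max(0.0, t_full - samples[-1].hours)


# Illustrative numbers only: usage grows 5 GB/hour toward a 1,000 GB disk.
weekend = [DiskSample(0, 700), DiskSample(12, 760), DiskSample(24, 820)]
print(hours_until_full(weekend, capacity_gb=1000))
```

A linear fit is deliberately pessimistic when growth is expected to flatten, which matches the concern in the text that the team could not rely on the curve bending in time.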
Since a fatal failure was certain unless action was taken within 36 hours, the engineering team, which had already been crunching for two weeks on launch preparation and monitoring, began its response. During this work an unintended configuration issue occurred, leaving the cluster in an inconsistent state, and transaction processing stopped automatically. As a result, emergency maintenance began at 4:52 pm that day.
Park Sae-mi summarized the situation at the time as the cluster no longer recognizing its data and refusing reads. She asked Cockroach Labs whether the data could be extracted from the storage layer of the nodes, but was told that they could not say how many weeks it would take or whether it would succeed. The team had to find another approach, one that would also shorten the work.
To make matters worse, one of the nodes unaffected by the configuration issue suddenly went down with an AWS host status failure. Because the old cluster refused reads, the data could not simply be moved to a new cluster through the cluster itself. The on-disk binary data had to be converted to CSV instead, and the team concluded that this could be done by combining CockroachDB commands with shell scripts.
The problem was that the binary files totaled about 7,577 GB, and converting them to CSV alone would take 24 hours. Since CockroachDB is open source, the team read the source code and customized the command.
At 10:30 am the next day, the CockroachDB source modifications were finished, producing a custom build, CRDB2CSV, that ran 10 times faster than before. With this build, the binary data was converted to CSV in an hour and a half. Even so, the data was large enough to need a distributed processing environment. Devsisters mainly used the Tokyo region, and during the conversion, at about 3 pm, it exhausted all of the R-series instances in that region. Because the work was urgent, the team freed up resources by tearing down some of its other infrastructure. At that point it had pulled in 350 r5.8xlarge instances, which amounts to 11,200 CPU cores and 89,600 GB of memory.
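The core and memory totals follow directly from the published r5.8xlarge specification of 32 vCPUs and 256 GiB of memory per instance. The short calculation below reproduces them; the function name is just for illustration.

```python
# Published r5.8xlarge specification (per instance).
R5_8XLARGE_VCPUS = 32
R5_8XLARGE_MEMORY_GIB = 256


def fleet_capacity(instance_count: int) -> tuple[int, int]:
    """Return (total vCPUs, total memory in GiB) for a fleet of r5.8xlarge."""
    return (
        instance_count * R5_8XLARGE_VCPUS,
        instance_count * R5_8XLARGE_MEMORY_GIB,
    )


vcpus, memory_gib = fleet_capacity(350)
print(vcpus, memory_gib)  # 11200 89600
```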
After porting the data with these resources, the team checked the backed-up data and logs and confirmed that the ported data matched the existing data 100%. On that basis, the migration test ran from 10:02 pm on the day after the maintenance began; the checks found no problems, and the maintenance ended 31 hours and 45 minutes after it started.
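The talk does not say how the 100% match was verified. One generic way to compare two CSV exports that may list rows in a different order is an order-independent fingerprint: hash every row, then combine the hashes with addition so that row order does not matter. The sketch below is an assumption-laden illustration of that idea, with hypothetical function names and toy data, not the team's actual procedure.

```python
import csv
import hashlib
import io
from typing import Iterable


def table_fingerprint(rows: Iterable[list[str]]) -> tuple[int, int]:
    """Return (row count, order-independent digest) for a table of rows."""
    count = 0
    total = 0
    for row in rows:
        h = hashlib.sha256()
        for field in row:
            data = field.encode("utf-8")
            # Length-prefix each field so ["ab", "c"] and ["a", "bc"] differ.
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
        # Modular addition is commutative, so row order does not matter,
        # and unlike XOR, duplicate rows do not cancel each other out.
        total = (total + int.from_bytes(h.digest(), "big")) % (1 << 256)
        count += 1
    return count, total


def read_csv(text: str) -> list[list[str]]:
    """Parse CSV text into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


# Toy data: same rows in a different order, then one corrupted field.
original = read_csv("1,alice\n2,bob\n")
migrated = read_csv("2,bob\n1,alice\n")
corrupted = read_csv("1,alice\n2,bop\n")

print(table_fingerprint(original) == table_fingerprint(migrated))
print(table_fingerprint(original) == table_fingerprint(corrupted))
```

Because the fingerprint is built one row at a time, it can be computed per shard on separate workers and the per-shard results added together, which suits a dataset too large for a single machine.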
The maintenance ended at 11:37 pm, but the flood of returning users overloaded the platform servers with about 700,000 API calls per minute. The load naturally carried through to the database, causing a DB deadlock on the platform login server, and the team eventually had to perform a failover. While the platform side was being handled, DevOps engineers were monitoring the CockroachDB cluster the data had been moved to, and the overall picture was not good. The error rate was high, and there were many timeout errors in which the DB could not handle requests within the configured time limit. With no clear remedy available, the team kept monitoring, and when the metrics crossed the critical threshold at 2:40 am, it entered a second round of emergency maintenance.
In the second emergency maintenance, the team applied optimizations based on what it had seen on the second cluster, then prepared a new cluster and moved the data to it. The work took 2 hours, and after the data transfer the maintenance ended at 8:30 am, about 36 hours and 30 minutes after it first began.
In short, the first incident began when an unintended configuration issue halted the cluster, and it ended when the team, working from CockroachDB's source code and technical documentation, found a way to move the data quickly to a newly built cluster and migrated 100% of it. Problems such as traffic overload arose along the way, but these were also resolved after optimizations and fixes were applied and the data was moved to a third cluster.
After the maintenance, Devsisters judged that additional DB capacity was needed and expanded to a 60-node cluster; it now operates 90 DB nodes in total. It also improved and documented its infrastructure work process to prevent configuration issues. Even when work must be finished within 36 hours, the guideline now requires the operator to share their screen with at least one observer. A queue server has also been prepared for use in emergencies.
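The talk gives no details about the queue server. As a rough sketch of the general idea, a waiting room admits players in arrival order at a fixed rate, so that a login surge after downtime cannot hit the platform and database all at once. The `WaitingRoom` class and its parameters below are hypothetical.

```python
from collections import deque


class WaitingRoom:
    """FIFO queue that admits at most `admits_per_tick` users per tick."""

    def __init__(self, admits_per_tick: int):
        if admits_per_tick <= 0:
            raise ValueError("admits_per_tick must be positive")
        self.admits_per_tick = admits_per_tick
        self._queue: deque[str] = deque()

    def join(self, user_id: str) -> int:
        """Enqueue a user and return their 1-based position in line."""
        self._queue.append(user_id)
        return len(self._queue)

    def tick(self) -> list[str]:
        """Admit the next batch of users, oldest first."""
        batch_size = min(self.admits_per_tick, len(self._queue))
        return [self._queue.popleft() for _ in range(batch_size)]


room = WaitingRoom(admits_per_tick=2)
for user in ["a", "b", "c"]:
    room.join(user)
print(room.tick())  # ['a', 'b']
print(room.tick())  # ['c']
```

In a real deployment the admission rate would be tuned to what the login servers and database can sustain, and the queue would live in shared storage instead of process memory.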
■ The second maintenance: recovering the DB with black magic (?)
Three weeks after the first long maintenance ended, Devsisters held a team dinner to mark the launch and to console the team after the long maintenance. During the dinner, an alert arrived saying that a Kingdom DB node had gone down due to a hardware failure, and the situation escalated quickly until six DB nodes were down. The cause was a power outage that stopped the cooling units in an AWS data center in the Tokyo region. The server room temperature rose sharply, and within about 12 minutes six Cookie Run: Kingdom DB nodes had become inoperable.
Cookie Run: Kingdom uses CockroachDB, which is internally a key-value store, so its data is stored on disk in a fixed binary format. This data is split into appropriately sized pieces and managed in units called ranges. When the six DB nodes went down, 34 of the 25,000 ranges were lost, two of which held user data. In particular, 4 out of the seven replicas of the lost ranges