Deep Reinforcement Learning research in RTS and MOBA games (Around 2020)
Video games are some of the best testbeds for an AI model's capabilities thanks to their complex yet controlled environments and clear sets of rules. They often involve complicated state and action spaces, which makes the knowledge gained from training agents in a game environment valuable for real-world use. RTS and MOBA are among the most popular video game genres. Real-Time Strategy (RTS) games feature a third-person, bird’s-eye view of a map filled with resources, armies, and bases. Players in an RTS game compete against one another in real time for control of the map. The Multiplayer Online Battle Arena (MOBA) genre branches off from RTS, with the main difference being that each player controls only one hero unit.
This article covers two papers on building AIs that can match professional human players using deep reinforcement learning. The first paper is “Grandmaster level in StarCraft II using multi-agent reinforcement learning”, which addresses the challenges of StarCraft using general-purpose learning methods that are in principle applicable to other complex domains. The second paper is “Mastering Complex Control in MOBA Games with Deep Reinforcement Learning”, which features a low-coupling, highly scalable actor-critic system that can defeat top professional human players in a 1 v 1 MOBA game. Both papers made significant progress in model design, training systems, and outcomes.
StarCraft II poses a unique set of challenges for its players. In order to win, the agent needs to excel both at grand strategy, such as maintaining an economy, and at micro-management when skirmishes take place. This demands a highly robust model that can perform complex perception over a large state space and choose from a large action space. On top of that, the model must react quickly to the situations it perceives, whether expected or not. The proposed model incorporates both supervised and reinforcement learning.
At a high level, the Markov Decision Process of this model includes three stages: perception/map knowledge, action/decision, and actuation. The input to this model is the real-time screen image, just as for a human player. The model immediately runs a perception process to extract information from both the main screen and the minimap. It also identifies friendly and hostile units and keeps track of them. Once the details on screen are gathered, they are sent to the action/decision stage for evaluation. Unlike in Go, the information is imperfect, and the action/decision stage must cope with that. The decisions are formulated into a structured action space and issued to the units through the actuation stage. All of this happens in under 400 ms.
The model uses several types of neural networks in different roles. A Transformer architecture serves as the attention mechanism that determines which image sections are relevant. A deep LSTM is used to infer partially observed information. An auto-regressive policy head transforms an N-dimensional action into a sequence of N one-dimensional choices, with a pointer network to help determine unit type. Finally, a ResNet retains pixel information such as the minimap encoding. An action is produced as a sequence of instructions. According to DeepMind, this design may “help with many other challenges involving long-term sequence modelling and large output spaces”.
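As a rough illustration of the auto-regressive idea, the sketch below samples a structured action one component at a time, with each component conditioned on the components already chosen. The component names, the options, and the `toy_logits` scoring rule are all hypothetical stand-ins; the real model uses learned network heads over a far larger action space.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical action components (not taken from the paper).
COMPONENTS = {
    "action_type": ["move", "attack", "build"],
    "unit": ["marine", "worker", "tank"],
    "target": ["enemy_base", "mineral_field", "own_base"],
}


def toy_logits(name, options, chosen):
    """Stand-in for a learned head: scores depend on earlier choices."""
    logits = [0.0] * len(options)
    # Illustrative rule only: after choosing "attack", favour
    # the "marine" unit and the "enemy_base" target.
    if chosen.get("action_type") == "attack":
        for i, opt in enumerate(options):
            if opt in ("marine", "enemy_base"):
                logits[i] += 2.0
    return logits


def sample_action(rng):
    """Sample each component in order, conditioning on previous ones."""
    chosen = {}
    log_prob = 0.0
    for name, options in COMPONENTS.items():
        probs = softmax(toy_logits(name, options, chosen))
        idx = rng.choices(range(len(options)), weights=probs)[0]
        chosen[name] = options[idx]
        # Chain rule: the joint log-probability is the sum of the
        # conditional log-probabilities of each component.
        log_prob += math.log(probs[idx])
    return chosen, log_prob


action, log_prob = sample_action(random.Random(0))
print(action, log_prob)
```

The point of the factorization is that the model never has to score every combination of components at once; it only scores the options for one component given what has been chosen so far.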
To ensure model robustness and effectiveness, the researchers applied a distinctive “Supervised Learning + Adversarial Learning + Multi-agent Learning” method. The model is first trained with supervision on human knowledge so that it can learn basic moves and strategies faster. Once this initial stage reached the “Gold” level, the researchers switched to population-based, multi-agent reinforcement learning.
Three different roles of agents are created:
- Main agent: The one to be deployed
- League Exploiter: To find common weaknesses across past iterations in the league
- Main Exploiter: To find weaknesses of the current main agent
They incorporated a method called “fictitious self-play” by keeping all previous agents in a league. A new agent plays against the league and learns from its games against all other competitors. After that, it joins the league as well. Since no strategy guarantees a win, this process enables the model to keep learning new approaches and patching weaknesses. It also ensures that a new agent can defeat all previous iterations, rather than being tunnel-visioned on the best-performing one.
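The league idea can be sketched in a few lines. This is a minimal illustration, not the paper's matchmaking scheme: the `League` class, the dict-based agent, and the uniform sampling over past snapshots are all assumptions made for the example.

```python
import copy
import random


class League:
    """Keeps frozen snapshots of past agents to use as opponents."""

    def __init__(self, seed=0):
        self.snapshots = []
        self.rng = random.Random(seed)

    def add_snapshot(self, agent):
        # Deep-copy so that later training does not alter the stored opponent.
        self.snapshots.append(copy.deepcopy(agent))

    def sample_opponent(self):
        # Assumption: uniform sampling over all past snapshots.
        return self.rng.choice(self.snapshots)


# Hypothetical agent, represented here as a plain dict of parameters.
agent = {"version": 0, "skill": 0.0}
league = League()
league.add_snapshot(agent)

for step in range(1, 4):
    opponent = league.sample_opponent()
    # Placeholder for a real training step against the sampled opponent.
    agent["skill"] += 0.1
    agent["version"] = step
    # The updated agent joins the league as a new frozen snapshot.
    league.add_snapshot(agent)

print([s["version"] for s in league.snapshots])  # [0, 1, 2, 3]
```

Because every snapshot stays in the pool, the current agent keeps being tested against old strategies instead of only the most recent one.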
A modified off-policy actor-critic reinforcement learning algorithm (ACER) serves as the weight update rule, combined with experience replay, self-imitation learning, and policy distillation. Updates also follow UPGO (upgoing policy update), which pushes the policy toward trajectories that turned out better than expected, a form of self-imitation that helps in sparse-reward situations.
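To make the "upgoing" idea more concrete, here is a small sketch of an upgoing return as it is commonly described: the return keeps following the observed trajectory while the next step looks at least as good as the value estimate, and falls back to the value estimate otherwise. The function name, the undiscounted setting, and the one-step estimate of the action value are assumptions of this sketch rather than details confirmed by the article.

```python
def upgoing_returns(rewards, values, bootstrap_value):
    """Compute upgoing returns for one trajectory.

    rewards[t]      : reward received after acting in state s_t
    values[t]       : value estimate V(s_t)
    bootstrap_value : V(s_T), the value after the last step (0.0 if terminal)
    """
    T = len(rewards)
    if T == 0:
        return []
    # next_values[t] holds V(s_{t+1}).
    next_values = list(values[1:]) + [bootstrap_value]
    returns = [0.0] * T
    # Last step: nothing further to follow, so bootstrap from V(s_T).
    returns[T - 1] = rewards[T - 1] + bootstrap_value
    for t in range(T - 2, -1, -1):
        # One-step estimate of Q(s_{t+1}, a_{t+1}) (assumption of this sketch).
        q_next = rewards[t + 1] + next_values[t + 1]
        if q_next >= next_values[t]:
            # The next action did at least as well as expected: keep following.
            returns[t] = rewards[t] + returns[t + 1]
        else:
            # The next action did worse than expected: cut off and bootstrap.
            returns[t] = rewards[t] + next_values[t]
    return returns


# A trajectory that ends badly: earlier steps bootstrap instead of
# inheriting the poor outcome.
print(upgoing_returns([0.0, 0.0, -1.0], [0.5, 0.6, 0.4], 0.0))  # [0.6, 0.4, -1.0]
# A trajectory that ends well: the good outcome propagates back.
print(upgoing_returns([0.0, 0.0, 1.0], [0.5, 0.2, 0.9], 0.0))   # [1.0, 1.0, 1.0]
```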
Training AlphaStar takes 32 Google v3 TPUs and 44 days. The model proved effective, achieving high win rates and EPM (effective actions per minute). An earlier version of AlphaStar defeated two professional players in 1 vs 1 matches. Not only is this model a sign of AI matching humans in complex dynamic tasks, but it also points to a future of training AIs on large action spaces over long time horizons. In addition, the combination of multi-agent reinforcement learning with supervised and reinforcement learning might inspire research in other fields that have limited human knowledge and high strategic flexibility.
Unlike the AlphaStar paper, the second paper, “Mastering Complex Control in MOBA Games with Deep Reinforcement Learning”, places more emphasis on a low-coupling, high-scalability reinforcement learning infrastructure. Like the AlphaStar paper, however, it also takes advantage of self-play learning, and it features some algorithmic improvements.
Compared to conventional RTS games, MOBA games are micro-management intensive, as each player focuses on only one hero. More specifically, MOBA players need to identify enemy abilities and use their own abilities and equipment more precisely. Since the state and action spaces remain huge, training system design remains a big challenge.
The proposed system includes four parts:
- The AI Server with game environments: A cluster of servers that run game environments and agents and generate gameplay information, with actions sampled through softmax
- Dispatch module: Gathers information such as rewards/penalties and feature values, then packs it before sending it to the memory pool
- Memory Pool: A server running a memory-efficient circular queue.
- Reinforcement Learner: A distributed training environment that pulls information from the memory pool and does RL training. Information is shared through RAM for faster transmission. The updated weights and rewards/penalties are sent directly back to the AI server.
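The memory pool's circular queue can be illustrated with a small sketch. The `MemoryPool` class, its method names, and the uniform batch sampling are hypothetical choices for this example; the paper's actual implementation is a dedicated server and is not shown here.

```python
import random
from collections import deque


class MemoryPool:
    """Fixed-capacity circular queue: the oldest samples are overwritten."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, sample):
        """Called by the dispatch side with a packed training sample."""
        self.buffer.append(sample)

    def sample_batch(self, batch_size, rng=random):
        """Called by the learner side; assumes uniform sampling."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


pool = MemoryPool(capacity=3)
for i in range(5):
    # Hypothetical packed sample: an id plus a reward value.
    pool.push({"id": i, "reward": 0.1 * i})

print([s["id"] for s in pool.buffer])  # [2, 3, 4]
batch = pool.sample_batch(2, rng=random.Random(0))
print(len(batch))  # 2
```

A bounded queue like this keeps memory use constant and naturally biases training toward recent experience, since stale samples fall out as new ones arrive.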
With regard to algorithms, the paper proposes a few design elements that help improve robustness. These include an attention mechanism, an LSTM, and an action mask. At first glance, this may look similar to the AlphaStar design, but the roles of these elements differ because of the distinction between the two genres. The attention mechanism is still a target selection technique, but it mainly focuses on enemy heroes. The LSTM is used for precision in casting skills, which is not as important in a conventional RTS game. The action mask incorporates correlations between the action and the final output layer, based on prior human player knowledge, to increase training efficiency. The model input is more comprehensive thanks to its training environment: it takes in the game interface image, a unit aggregation, and the game state information. One other change mentioned in the paper is “Dual Clip PPO”, a variant of PPO that adds a second clipping bound so that a very large policy ratio combined with a negative advantage cannot produce an unbounded update.
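Two of these ideas, the action mask and the dual clip, are small enough to sketch directly. The code below is an illustration under stated assumptions rather than the authors' implementation: the function names are made up for this example, and the defaults `eps=0.2` and `c=3.0` are illustrative values only.

```python
import math


def masked_softmax(logits, mask):
    """Softmax that assigns zero probability to masked-out actions.

    mask[i] is True when action i is allowed in the current state.
    """
    allowed = [x for x, ok in zip(logits, mask) if ok]
    if not allowed:
        raise ValueError("at least one action must be allowed")
    m = max(allowed)
    exps = [math.exp(x - m) if ok else 0.0 for x, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def dual_clip_ppo_objective(ratio, advantage, eps=0.2, c=3.0):
    """Per-sample dual-clip PPO surrogate objective (to be maximized).

    ratio     : new policy probability divided by old policy probability
    advantage : advantage estimate for the sampled action
    eps       : standard PPO clip range
    c         : lower-bound factor (> 1) used when the advantage is negative
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    standard = min(ratio * advantage, clipped_ratio * advantage)
    if advantage < 0:
        # Second clip: bound the objective from below by c * advantage.
        return max(standard, c * advantage)
    return standard


probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
print(probs)

print(dual_clip_ppo_objective(10.0, -1.0))  # -3.0 (standard PPO would give -10.0)
print(dual_clip_ppo_objective(10.0, 1.0))   # 1.2
print(dual_clip_ppo_objective(1.0, -1.0))   # -1.0
```

The first call shows why the second clip matters: with a ratio of 10 and a negative advantage, the standard clipped objective is still -10, while the dual clip caps its magnitude at `c` times the advantage.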
The authors of this paper pitted trained agents against professional human players in 1 vs 1 battles with 5 different heroes. The trained agents crushed their rivals in these scenarios. This outcome is not surprising given the well-thought-out model design and the streamlined training infrastructure. However, it is worth noting that 1 vs 1 with the same hero is not very common in a MOBA game, and there is a lot of room for future progress in this sense. Moreover, the clear training system pipeline lets researchers focus on algorithm design in future work on reinforcement learning tasks, which can be an invaluable asset for the field.
1. Duffy, J. (2020, July 26). MOBA, FPS, RTS and more: An ESPORTS Genre Guide. Retrieved April 06, 2021, from https://thesmartwallet.com/moba-fps-rts-and-more-a-guide-to-esports-genres/?articleid=15988
2. AlphaStar: Mastering the real-time strategy Game StarCraft II. (2019). Retrieved April 09, 2021, from https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
3. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., … & Silver, D. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354.
4. Ye, D., Liu, Z., Sun, M., Shi, B., Zhao, P., Wu, H., … & Huang, L. (2020, April). Mastering complex control in MOBA games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 6672–6679).