Creating Pokemon AI

Recap and Reflection of a Pokemon Reinforcement Learning project from a project leader's perspective

Intro

During the Winter 2021 semester, I decided to lead a project team within the Michigan Data Science Team (MDST). Overall, I think the project was successful; each team generated interesting results. Many members enjoyed the project, tackled each work session with dedication, and finished with newfound knowledge. Leading PokeRL will resonate with me for years to come.

Usually, MDST projects follow a traditional ‘data science’ workflow: develop an objective, create or find a meaningful dataset, understand the data, build a model, analyze the model's performance on the objective, and iterate.

I opted to try a non-traditional, exploratory topic: Reinforcement Learning. Specifically, I tasked my team with creating AI agents for a game called Pokemon Showdown, which I will call Showdown for the rest of this post.

Motivation

Quite a few reasons compelled me to choose Showdown over other games. First, Showdown is a fairly popular online game, with around 15,000 users online at any given time, so we could test AI agents against real people using the site's Elo rating system. Second, Showdown is an open, largely unstudied game, unlike chess, Tetris, or poker; I didn't want students copying from existing literature, and I hoped the lack of prior work would encourage fresh ideas unique to Showdown. Third, Showdown is open source, and a Python library exists for interacting with the website (both the production and local versions) to create AI agents. Lastly, Showdown is a zero-sum, two-player, imperfect-information game. What does this mean? Showdown games have two players, with one winner and no draws, and players do not know the full state of the game (e.g. a player doesn't know everything about their opponent's Pokemon).

The search space of each game is effectively infinite, with hidden branches (each move has an element of randomness, and we do not know which moves the opponent has), so a tree-search-based algorithm such as the MCTS used in AlphaGo doesn't quite make sense. I liked the idea of a balance between chess (two-player, zero-sum) and poker (imperfect information). Regarding project feasibility, I felt (without proof) that RL for Showdown would require far less compute than, say, the estimated $35 million spent training AlphaGo. I am still not sure about that last point, but I can say for a fact that Google Colab's free resources are definitely not enough!

Team

PokeRL had around 10 members, ranging from freshmen who had just learned to code to a PhD student with published research and graduate machine learning courses under his belt. Because I hoped for each student to have a meaningful project experience while learning new material, setting expectations and developing educational material and lectures was challenging.

I surveyed each member and partitioned the team based on interest, experience, and the algorithms they wished to explore. I challenged the teams to create the best AI possible, to be evaluated against the other teams on game day. Three teams formed, self-named Elite4, Virn, and Solo Pokemon. I considered fielding my own team as well, but I had underestimated the time overhead of running the project, helping teams debug, discussing algorithmic details and implementation, and teaching general software and machine learning skills.

We met weekly on Sundays from 11am to 2pm for seven weeks during the semester. Initially, I planned to split each session into a one-hour lecture and a two-hour team work period. However, by the third session, I decided to shorten or cut the lecture: personalized focus time with each group was a more valuable use of the time. Groups varied in technical experience, and focus time let me tailor my help to each one. For the more advanced members, discussions replaced help.

Set Up

To kickstart each team, I provided starter code with instructions to run it, a greedy agent implementation, a simple DQN agent implementation (adapted from the official documentation), and a tf/keras-based player interface.

Initially, I expected members to run their code locally. Instead, some were met with technical difficulties, and since some of those members were less experienced, we lost valuable project session time helping them debug. I felt bad - being unable to even get set up is frustrating and demotivating, especially during a long three-hour work session. This mistake on my part cost us scarce productive hours before I decided to shift everything to Google Colab. The migration was not so simple, because I had to figure out how to run a local server within Colab (games are played through a socket connection to a local Showdown server). These issues would never have happened if I had been more proactive in my project preparation.
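For anyone curious, the setup boiled down to roughly the kind of Colab cell sketched below. This is a hedged illustration, not our exact notebook: it assumes the standard smogon/pokemon-showdown server and that git, npm, and Node.js are available in the runtime.

```python
import shutil
import subprocess

# Clone the official Showdown server (assumes git and Node.js are available).
subprocess.run(
    ["git", "clone", "https://github.com/smogon/pokemon-showdown.git"],
    check=True,
)
subprocess.run(["npm", "install"], cwd="pokemon-showdown", check=True)

# Use the example config; authentication is disabled below so local agents can log in.
shutil.copy(
    "pokemon-showdown/config/config-example.js",
    "pokemon-showdown/config/config.js",
)

# Run the server in the background; agents connect to it over a local socket.
server = subprocess.Popen(
    ["node", "pokemon-showdown", "start", "--no-security"],
    cwd="pokemon-showdown",
)
```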

Algorithms

As a baseline, I created a greedy agent that (almost) always makes the move dealing the most damage to the opponent's current Pokemon. When deployed to the official Showdown server, the greedy agent hovers slightly above 1000 Elo, the default player rating.
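For a concrete sense of what this baseline looks like, here is a minimal sketch using poke-env (the Python library mentioned earlier). It mirrors the library's documented max-damage example rather than our exact starter code, and import paths may differ slightly between poke-env versions.

```python
from poke_env.player import Player

class GreedyPlayer(Player):
    """Picks the available move with the highest base power."""

    def choose_move(self, battle):
        if battle.available_moves:
            # Choose the move with the largest base power, ignoring type matchups.
            best_move = max(battle.available_moves, key=lambda move: move.base_power)
            return self.create_order(best_move)
        # No attacking move available (e.g. a forced switch): fall back to random.
        return self.choose_random_move(battle)
```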

Solo Pokemon and Virn developed unique heuristic, rule-based algorithms. Elite4 decided to try a neural-network-powered reinforcement learning algorithm.

Solo Pokemon

Solo Pokemon's algorithm decided whether to play a greedy move, switch to another Pokemon (players start each game with six Pokemon), or dynamax (a feature unique to the specific game mode played on Showdown). If the current Pokemon had an unfavorable speed, a type disadvantage, and a high probability of fainting (a fainted Pokemon is unusable for the rest of the game) based on its current health and defense stats, the agent switched to another Pokemon. The decision to dynamax was based on how many Pokemon in the party were still alive, a strong type advantage, and the base stats of the current Pokemon. In all other cases, the agent chose a greedy move.
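Structurally, the policy looked something like the sketch below. The attribute names follow poke-env, but the helpers passed in and the numeric threshold are purely illustrative stand-ins, not the team's actual code.

```python
def solo_pokemon_choice(battle, switch_to_best, should_dynamax, dynamax_move, greedy_move):
    """Illustrative skeleton only: the four helpers passed in stand in for the
    team's actual switch/dynamax/greedy logic, which is not reproduced here."""
    me = battle.active_pokemon
    opp = battle.opponent_active_pokemon

    # Rough proxies for the three switch conditions described above
    # (the 0.3 threshold is a placeholder, not the team's value).
    slower = me.base_stats["spe"] < opp.base_stats["spe"]
    type_disadvantage = max(me.damage_multiplier(t) for t in opp.types if t) > 1
    likely_to_faint = me.current_hp_fraction < 0.3

    if slower and type_disadvantage and likely_to_faint and battle.available_switches:
        return switch_to_best(battle)
    if battle.can_dynamax and should_dynamax(battle):
        return dynamax_move(battle)
    return greedy_move(battle)  # fall back to the max-damage move
```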

Virn

Virn had a simple but effective approach: if the current Pokemon had the best move against the opposing Pokemon, it used that move, dynamaxing if the move was super effective (a favorable type matchup); otherwise, it switched to the benched Pokemon with the best move.
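In the same spirit, here is a hypothetical skeleton of that rule. Both `order` and `best_move_score` are my own stand-ins (each team computed its best move in its own way, as the next paragraph notes), with `order` playing the role of something like poke-env's `create_order`.

```python
def virn_choice(battle, order, best_move_score):
    """Hypothetical skeleton: order(move_or_pokemon, dynamax=False) builds the
    battle command and best_move_score(move, defender) scores a move against
    the opposing Pokemon; both are placeholders, not Virn's actual code."""
    opp = battle.opponent_active_pokemon
    bench = battle.available_switches

    def bench_score(pokemon):
        # Best known move of a benched teammate against the current opponent.
        return max((best_move_score(m, opp) for m in pokemon.moves.values()), default=0)

    if not battle.available_moves:
        # Forced switch (e.g. the active Pokemon fainted).
        return order(max(bench, key=bench_score))

    my_best = max(battle.available_moves, key=lambda m: best_move_score(m, opp))
    if not bench or best_move_score(my_best, opp) >= max(map(bench_score, bench)):
        # Use the current Pokemon's best move, dynamaxing when it is super effective.
        super_effective = opp.damage_multiplier(my_best) > 1
        return order(my_best, dynamax=battle.can_dynamax and super_effective)
    # Otherwise switch to the benched Pokemon with the strongest move.
    return order(max(bench, key=bench_score))
```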

I think it is interesting to note that Solo Pokemon and Virn calculated the best move differently, although I won’t go into specific details as they require a bit more knowledge about the game.

Elite4

Elite4 used Deep Q-Learning with experience replay and Double DQN. More information about the exact details can be found here: DQN paper, Double DQN paper, library implementation. The algorithm is well documented and widely implemented, which let the team spend their time experimenting with state vectorization, network architecture, reward calculation, opponent selection, and more. I doubt DQN is anywhere near an optimal approach to this problem, but I think it is a very solid choice of algorithm for a weekly, semester-long team project with less experienced students, and I found it easier to help interested members build intuition for it.

Elite4 experimented with different ways to represent state based on the data exposed by the environment, settling on a state vector with 21 components combining Pokemon types, move powers and multipliers, statuses, and dynamax availability. They also experimented with reward definitions and model architectures, settling on a model with roughly 7,000 parameters. The model's input is the state vector, and its output is an array of estimated action values for the 22 possible actions. I do not recall the exact values they used for hyperparameters such as $\epsilon$ for the $\epsilon$-greedy policy, the experience replay memory size, or the reward discount factor $\gamma$.
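I don't have the team's exact architecture to share, but for a sense of scale, a small fully connected network of that shape lands right around 7,000 parameters. The layer sizes below are my own illustration, not necessarily what Elite4 used.

```python
import tensorflow as tf

STATE_SIZE = 21  # components of the state vector described above
N_ACTIONS = 22   # possible actions the agent can take

# Illustrative only: two 64-unit hidden layers give
# (21*64 + 64) + (64*64 + 64) + (64*22 + 22) = 6,998 trainable parameters.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_SIZE,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(N_ACTIONS, activation="linear"),  # one Q-value per action
])
model.summary()
```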

To train, Elite4 first pitted their model against the greedy agent until it reached a ~60% win rate. They then trained against an opponent running a fixed snapshot of their own model, periodically updating the opponent's snapshot whenever the current model consistently beat it. The team attempted to train on Colab, but their methods were still too time-consuming and computationally intensive to reach satisfactory results. Nevertheless, they ended up with a model trained for about 15 hours.
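Stripped of library details, the training schedule amounted to something like the outline below; `train`, `evaluate`, and `clone` are placeholders for whatever battle-running and model-copying code a team actually wires up, and the counts are illustrative.

```python
def snapshot_training(agent, greedy_opponent, train, evaluate, clone,
                      iterations=50, win_threshold=0.6, battles_per_round=1000):
    """Hypothetical outline of the two-phase schedule described above."""
    # Phase 1: bootstrap against the fixed greedy baseline until ~60% win rate.
    while evaluate(agent, greedy_opponent) < win_threshold:
        train(agent, greedy_opponent, n_battles=battles_per_round)

    # Phase 2: train against a frozen snapshot of the agent itself,
    # refreshing the snapshot whenever the live agent pulls clearly ahead.
    opponent = clone(agent)
    for _ in range(iterations):
        train(agent, opponent, n_battles=battles_per_round)
        if evaluate(agent, opponent) > win_threshold:
            opponent = clone(agent)
    return agent
```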

Results

Against the greedy agent, Solo Pokemon and Virn claimed a 75% win rate, while Elite4 had a 65% win rate.

The table below shows the results of a round robin tournament between the three teams.

Games won out of 1000; row represents challenger and column represents opponent

|              | Solo Pokemon | Virn | Elite4 |
| ------------ | ------------ | ---- | ------ |
| Solo Pokemon | -            | 451  | 596    |
| Virn         | 549          | -    | 656    |
| Elite4       | 404          | 344  | -      |

Interestingly, the battles were quite fair - no team completely wiped out another, nor was any team completely wiped out. I was surprised by Elite4's relatively formidable results, despite the training difficulties and the challenges the team faced throughout the semester. With the current approaches, Virn is the winner of MDST PokeRL, with a 55% win rate over Solo Pokemon and a 66% win rate over Elite4!

Below, you can watch one battle between each pair of teams.

From a human perspective, we can see the mistakes that the agents make. These mistakes are generally easier to fix for heuristic-based agents, but not so simple for an RL-trained agent.

Further Work

I mentioned above that I was also interested in creating my own RL agent, with more sophisticated modeling, algorithm selection, and training based on my own knowledge. Although I did not have time during the semester because of my role as project lead, I decided to tackle the problem afterward. Due to my internship, I haven't had much time to work on it, but I am making progress with my approach. My current results are not in a presentable state, but I will update this post with results in the future. I aim to build as good an AI as possible, with a solid Elo rating and acceptable performance against most active users on Showdown.

2022 Update: I ported the state and network model representations to PyTorch, then used a library called mushroom_rl to implement a training loop. I spun up an AWS GPU instance to put all my hard work together. Exciting stuff! A few hours into the training run, I check in anddddd... the code is hanging. Hmm, what's wrong?! I diagnose the problem as an issue in battle simulation and, after further investigation, find a concurrency bug. I've thought about this bug on and off for a long, long time but have no working solution. My priorities have changed, and perhaps with a full-time headbashing effort I could engineer a solution. But for now, this quest is on pause.

Either way, I don't think pure reinforcement learning is the best way to approach this problem. Ideally, I'd collect a large database of games (perhaps a good use of my scraping knowledge), pretrain an attention-based network, and add search and/or self-play-based techniques at the very end.