Creating Pokemon AI
Recap and reflection on a Pokemon Reinforcement Learning project from a project leader's perspective
During the Winter 2021 semester, I decided to lead a project team within Michigan Data Science Team. Overall, I think the project was successful; each team generated interesting results. Many enjoyed the project, tackled each work session with dedication, and finished the project with newfound knowledge. This experience was highly valuable to me and taught me to become a better project lead.
Usually, MDST projects follow a traditional ‘data science’ workflow: develop an objective, create or find a meaningful dataset, understand the data, build a model, analyze the model’s performance against the objective, and iterate.
I opted to try a non-traditional, exploratory topic: Reinforcement Learning. Specifically, I tasked my team to try to create AI for a game called Pokemon Showdown, which I will call Showdown for the rest of this post.
Quite a few reasons compelled me to choose Showdown over other games. First, Showdown is a fairly popular online game with around 15,000 users online at a given time, so we can test AI agents against real people using the ELO system. Second, Showdown is an open, unstudied game, unlike chess, Tetris, or poker. I wanted students to come up with fresh ideas unique to Showdown rather than copy from existing literature. Third, Showdown is open source, and a Python library exists to interact with the website (both prod and local versions) to create AI agents. Lastly, Showdown is a zero-sum, two-player, imperfect information game. What does this mean? Showdown games have two players, and only one wins. Players do not have knowledge of the full state of the game (e.g. a player doesn’t know data about their opponent’s Pokemon). The search space of each game is effectively infinite with hidden branches (each move has an element of randomness associated with it, and we do not know which moves the opponent has), so a tree-search-based algorithm such as MCTS in AlphaGo doesn’t quite make sense. I liked the idea of a balance between chess (two-player, zero-sum) and poker (imperfect information).
Regarding project feasibility, I felt (without proof) that Showdown RL algorithms have a much lower computational requirement than, say, the estimated $35 million price tag of AlphaGo. I am still not sure about this last point, but I can say for a fact that Google Colab’s free resources are definitely not enough!
PokeRL had around 10 members, ranging from freshmen who had just learned to code to a PhD student with published research and graduate machine learning courses under his belt. As I hoped for each student to have a meaningful project experience while learning new material, setting expectations and developing educational material and lectures were challenging for me.
I surveyed each member and partitioned teams based on interest, experience, and the algorithms that they wished to explore. I challenged teams to create the best AI that they could, to be evaluated against the other teams in the last session - gameday. Three teams were formed, self-named Elite4, Virn, and Solo Pokemon. I considered making my own team, but I had underestimated the time overhead of running the project, helping teams debug, discussing algorithmic details and implementation, and teaching general software and machine learning skills.
We met weekly on Sundays from 11am-2pm for seven weeks during the semester. Initially, I planned to split each session into a 1 hour combined lecture and a 2 hour team work period. By the third session, however, I realized that the experience level of the members was on the lower end: most were fairly new programmers without many math/CS courses under their belts. Once I understood this, I shortened or cut each lecture and focused on less technical details. I spent the time gained talking to individual groups and explaining topics that they were interested in or were struggling to apply to their project. For students with more experience or interest in technical details, I explored relevant topics with them individually.
To kickstart each team, I provided starter code with instructions to run it, a greedy agent implementation, a simple DQN agent implementation (from the official documentation), and a tf/keras-modelled player interface.
Initially, I expected members to run their code locally. Instead, some ran into technical difficulties. As some of those people were less experienced, we lost valuable project session time helping them debug. I felt bad - being unable to get set up is frustrating, and it saps a student's motivation, especially during a long three hour work session. This mistake on my part cost us some of our limited productive hours before I decided to shift everything to Google Colab. The migration was not so simple because I had to figure out how to run a local server within Colab (games are played through a socket connection to a local Showdown server). These issues could have been avoided if I had been more proactive in my project preparation.
As a baseline model, I created a greedy agent that is (almost) guaranteed to make the move that deals the most damage to the opponent’s current Pokemon. The greedy agent hovers slightly above 1000 ELO, the default player ELO, when deployed to the official Showdown server.
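To make the greedy idea concrete, here is a minimal sketch that does not use the Showdown library at all. The `Move` class, its fields, and the damage proxy (base power times type multiplier) are assumptions for illustration, not the actual starter code, which may estimate damage differently.

```python
from dataclasses import dataclass


@dataclass
class Move:
    # Hypothetical, simplified view of a move; not the library's own class.
    name: str
    base_power: int
    multiplier: float  # type effectiveness against the opponent's active Pokemon


def greedy_move(moves):
    """Return the move with the highest damage proxy, or None if no move is available."""
    if not moves:
        # Stands in for turns where attacking is not possible (e.g. a forced switch).
        return None
    return max(moves, key=lambda m: m.base_power * m.multiplier)


moves = [Move("Tackle", 40, 1.0), Move("Ember", 40, 2.0), Move("Hyper Beam", 150, 0.5)]
print(greedy_move(moves).name)
```

In this toy example the weaker but super effective move wins over the stronger resisted one, which is the kind of choice a purely greedy agent makes every turn.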
Solo Pokemon and Virn developed unique heuristic, rule-based algorithms. Elite4 decided to try a neural network powered reinforcement learning algorithm.
Solo Pokemon’s algorithm attempted to determine whether to play a greedy move, switch to another Pokemon (at the beginning of each game, players start with six Pokemon), or dynamax (a feature unique to the specific game mode in Showdown). If the current Pokemon had an unfavorable speed, a type disadvantage, and a high probability of fainting (after which the Pokemon is unusable for the rest of the game) based on its current health and defense stats, Solo Pokemon opted to switch to another Pokemon. The decision to dynamax was based on the Pokemon still alive in the party, a strong type advantage, and the base stats of the current Pokemon. In all other cases, Solo Pokemon chose a greedy move.
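The decision order described above can be sketched as a small rule function. This is only an illustration of the structure: the `Situation` fields, the thresholds (`faint_threshold`, `min_alive`, `min_stats`), and the way the dynamax conditions are combined are all assumptions, since I am not reproducing the team's actual values here.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    # Hypothetical summary of the current turn; every field is an assumption.
    slower_than_opponent: bool
    type_disadvantage: bool
    faint_risk: float            # estimated probability of fainting, in [0, 1]
    strong_type_advantage: bool
    alive_in_party: int
    base_stat_total: int
    can_dynamax: bool


def choose_action(s, faint_threshold=0.5, min_alive=3, min_stats=500):
    """Return 'switch', 'dynamax', or 'greedy' following the rule order in the text."""
    # Switch only when all three warning signs are present.
    if s.slower_than_opponent and s.type_disadvantage and s.faint_risk > faint_threshold:
        return "switch"
    # Dynamax when it is still available and the situation looks favorable.
    if (s.can_dynamax and s.strong_type_advantage
            and s.alive_in_party >= min_alive and s.base_stat_total >= min_stats):
        return "dynamax"
    # Everything else falls back to the greedy move.
    return "greedy"


print(choose_action(Situation(True, True, 0.8, False, 4, 480, True)))
```

Keeping each rule as a separate early return makes it easy to tweak one condition at a time and re-run battles against the greedy baseline.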
Virn had a simple but effective approach. If the current Pokemon had the best move against the opponent's Pokemon, then it would use that move, dynamaxing if the move was super effective (favorable type advantage). Otherwise, Virn switched to the Pokemon on the bench with the best move.
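A rough sketch of this rule follows. How Virn actually scored the "best move" is not covered in this post, so the score used here (base power times type multiplier) and the tuple layout for moves are assumptions made purely for illustration.

```python
def move_score(move):
    """Score a move given as (name, base_power, type_multiplier). Assumed proxy, not Virn's formula."""
    _name, base_power, multiplier = move
    return base_power * multiplier


def virn_action(active_moves, bench):
    """Decide between attacking and switching.

    active_moves: non-empty list of moves for the current Pokemon.
    bench: dict mapping a benched Pokemon's name to its list of moves.
    Returns ("move", move_name, dynamax_flag) or ("switch", pokemon_name, False).
    """
    best_active = max(active_moves, key=move_score)
    best_bench_name, best_bench_score = None, float("-inf")
    for name, moves in bench.items():
        if not moves:
            continue
        score = max(move_score(m) for m in moves)
        if score > best_bench_score:
            best_bench_name, best_bench_score = name, score
    if move_score(best_active) >= best_bench_score:
        # Dynamax only when the chosen move is super effective (multiplier above 1).
        return ("move", best_active[0], best_active[2] > 1.0)
    return ("switch", best_bench_name, False)


active = [("Ember", 40, 2.0), ("Tackle", 40, 1.0)]
print(virn_action(active, {"Squirtle": [("Water Gun", 40, 0.5)]}))
print(virn_action(active, {"Pikachu": [("Thunderbolt", 90, 2.0)]}))
```

Ties go to the active Pokemon in this sketch, which avoids spending a turn on a switch that gains nothing.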
I think it is interesting to note that Solo Pokemon and Virn calculated the best move differently, although I won’t go into specific details as they require a bit more knowledge about the game.
Elite4 used Deep Q-Learning with Experience Replay and Double DQN. More information about exact details can be found here: DQN paper, Double DQN paper, library implementation. This algorithm is fairly well documented and implemented, allowing the team to spend their time experimenting with state vectorization, network architecture, reward calculation, opponent selection, and more. I doubt DQN is near an optimal approach to this problem, but I think it is a very solid choice of algorithm for a weekly, semester-long team project with less experienced students. I also found it easier to help interested members build an intuition for this algorithm.
Elite4 experimented with different ways to represent state based on the data given by the environment, settling on a state vector with 21 components combining Pokemon types, move powers and multipliers, statuses, and dynamax ability. They also experimented with reward definitions and model architectures, settling on a model with 7000 parameters. The model’s input is a state array and its output is an array of estimated action values (Q-values), one for each of 22 possible actions. I do not recall the exact values they used for parameters such as $\epsilon$ for the $\epsilon$-greedy policy, the experience replay memory size, and $\gamma$ for the reward discount factor, among other tunable parameters.
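For readers who want intuition for the two pieces named above, here is a library-free sketch of $\epsilon$-greedy action selection and the Double DQN target. It is not Elite4's code: the function names are hypothetical, plain Python lists stand in for network outputs, and the values of `gamma` and `epsilon` are placeholders rather than the team's settings.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: the online network selects the next action,
    the target network evaluates it."""
    if done:
        # No future value after the battle ends.
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Toy example with 3 actions instead of 22.
target = double_dqn_target(
    reward=1.0, done=False,
    q_online_next=[0.2, 0.9, 0.1],
    q_target_next=[0.5, 0.3, 0.8],
    gamma=0.5,
)
print(target)
```

The point of splitting selection and evaluation across two networks is to reduce the overestimation that comes from taking a max over noisy value estimates with a single network.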
Elite4 first trained their model through battles against the greedy agent until it reached a ~60% win rate. Then, Elite4 trained against an opponent using a fixed snapshot of the model, periodically updating the opponent’s snapshot if the current model consistently beat it. The team attempted to train on Colab, but their methods were still too time consuming and computationally intensive to reach satisfactory results. Nevertheless, they ended up with a model trained for about 15 hours.
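The snapshot schedule can be sketched as a small bookkeeping class. What "consistently beat" meant in practice is not recorded here, so the rolling `window` and `threshold` below are assumptions, as is the `SnapshotOpponent` name; the weights are a plain list standing in for model parameters.

```python
import copy
from collections import deque


class SnapshotOpponent:
    """Holds a frozen copy of the learner and refreshes it once the learner wins often enough."""

    def __init__(self, weights, window=100, threshold=0.6):
        self.weights = copy.deepcopy(weights)  # frozen opponent parameters
        self.results = deque(maxlen=window)    # 1 = learner won, 0 = learner lost
        self.threshold = threshold

    def record(self, learner_won, learner_weights):
        """Log one battle; return True if the snapshot was refreshed."""
        self.results.append(1 if learner_won else 0)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) >= self.threshold:
            self.weights = copy.deepcopy(learner_weights)
            self.results.clear()  # start a fresh window against the new snapshot
            return True
        return False


opponent = SnapshotOpponent([0.0], window=4, threshold=0.75)
learner = [1.0]
updates = [opponent.record(True, learner) for _ in range(4)]
print(updates, opponent.weights)
```

Copying the weights, rather than sharing them, is what keeps the opponent fixed while the learner keeps training.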
Against the greedy agent, Solo Pokemon and Virn claimed a 75% win rate, while Elite4 had a 65% win rate.
The table below shows results of a round robin tournament between the three teams.
Games won out of 1000; row represents challenger and column represents opponent
I thought it was cool that the battles were quite fair - no team was completely wiped out, and no team wiped out the others. I was surprised by Elite4’s relatively formidable results, despite the difficulties in training and the challenges that the team faced throughout the semester. With the current approaches, Virn is the winner of MDST PokeRL with a 55% win rate over Solo Pokemon and a 65% win rate over Elite4!
Below you can watch one battle between each pair of teams. From a human perspective, we can see the mistakes that the agents are making. These mistakes are generally easier to fix for heuristic-based agents but not so simple for an RL-trained agent.
I mentioned above that I was also interested in creating my own RL agent, with a more sophisticated model and with algorithm selection and training based on my own knowledge. Although I did not have time during the semester due to my position, I decided to tackle this problem afterwards. My internship has left me little time to work on it, but I am making progress. My current results are not in a presentable state, but I will update this post in the future with results. I aim to get as good an AI as possible, with a solid ELO rating and acceptable performance against most active users on Showdown.
If you want updates, feel free to contact me and ask!