RL Mahjong Engine (Japanese Riichi or Chinese Mahjong)
Overview: A 180-day curriculum to develop a Mahjong AI using reinforcement learning. The program will create a Mahjong environment (Japanese Riichi or a Chinese variant) and train a bot through self-play with policy/value networks, carefully shaped rewards, and rule-based guidance due to the game’s complexity (imperfect information, large action space, sparse rewards). Additionally, a mimicry system will be developed to learn a specific player’s style from game logs (e.g., how a particular player discards tiles). We will utilize frameworks like RLCard (which supports Mahjong) and possibly custom simulators to handle game logic. The final output is a Mahjong AI that can play at a reasonable level, plus an imitation model that replicates a human player’s strategy for discards and hand decisions.
- Tech Stack: RLCard toolkit (for a pre-built Mahjong environment and card game RL framework), or a custom OpenAI Gym environment encoding Mahjong rules. Python-based game logic if custom (a Mahjong rules engine for scoring, win detection, etc.). PyTorch for policy and value networks (likely with recurrent layers to handle partial information). Experience replay and self-play infrastructure. Data for mimicry: game logs from platforms like Tenhou (Japanese Mahjong server) or MahjongSoul, in formats like JSON/CSV or captured via API. Logging and analysis tools to track agent decisions (since understanding why the agent discarded a tile is key for debugging).
- Key Algorithms & Concepts: Reinforcement learning in imperfect-information games – possibly use Deep CFR (counterfactual regret) or policy gradients with opponent modeling. However, simpler approach: treat it as multi-agent self-play with each agent using the same learning algorithm (like self-play in poker or Bridge). Use reward shaping: since wins are rare (a random agent may win ~0.2% of the time), give intermediate rewards for partial accomplishments (forming melds, calling Riichi, etc.) to guide learning. Policy network outputs action probabilities (draw, discard which tile, call chi/pon/kan or not, Riichi or not) and value network estimates hand-winning potential. Use masking to ensure the network only chooses legal actions each turn. For mimicry: supervised learning (classification) on a dataset of a player’s discard decisions given hand state.
- Project References: Microsoft Research’s Suphx – a superhuman Mahjong AI that used deep RL with novel techniques (global reward prediction, etc.), which demonstrates the feasibility of RL in Mahjong. RLCard’s Mahjong implementation and paper for the environment and baseline algorithms. The “Tenhou” platform for Japanese Mahjong – game logs from top players for mimicry. Research literature on Mahjong AI is less abundant than for chess or poker, so Suphx is the main reference. Also, general imperfect-information RL approaches (like DeepStack or Libratus from poker), which could inspire how to handle hidden information.
Block 1 (Days 1–15): Environment Setup and Rule Encoding
- Day 1-5: Decide on the Mahjong rule set (Japanese Riichi vs. Chinese). Assume Japanese Riichi for concreteness (rich rules, and reference systems like Suphx are Riichi). Install and explore RLCard, which includes Mahjong. Write a small script to have four random RLCard agents play Mahjong and verify the environment runs (monitor some random games to ensure the rules seem correct); a minimal sanity-check script appears at the end of this block. Alternatively, if building a custom environment, start implementing core rules: tile representation, shuffling and dealing 13 tiles to 4 players, turns (draw and discard), basic meld calls (Chi, Pon, Kan), and win conditions (Ron/Tsumo). Given the timeline, using RLCard’s environment can accelerate development since it handles these complexities.
- Day 6-10: Understand the observation space. Mahjong is imperfect information (each player knows only their own hand, plus the discards and open melds of others). Decide how to represent state for the RL agent: possibly a concatenation of (hand tiles, known tiles, scores, round info like dora indicators, etc.). If using RLCard, see how it encodes state (likely a large vector or matrix). If custom, create an encoding (e.g., a 34-dimensional vector of per-tile-type counts for the hand and for the discards). Also determine the action space: actions include “discard tile X”, “declare Riichi”, “call Pon/Chi on a discard”, “declare Kan”, “Ron (win)”, or “Pass”. This is quite large and situation-dependent (many actions are only available in specific contexts). Plan to use action masking to enable only legal actions at a given time.
- Day 11-15: Implement reward logic. Final reward: +1 for winning (perhaps scaled by the points won), -1 for a loss. But this is extremely sparse – to make learning tractable, design auxiliary rewards: e.g., +0.1 for calling Riichi (since that usually means the hand is ready), +0.05 for completing a meld (Chi/Pon) if it potentially improves hand value, -0.05 for discarding a tile that another player immediately Rons on (dealing into someone’s win – a bad outcome in Riichi), etc. These are tricky to balance; note them but be ready to adjust (a shaping-hook sketch follows this block). Also, ensure the environment can signal end-of-round and provide final scores. Deliverable: A Mahjong environment ready for RL: either the confirmed RLCard environment or a custom one, with observation and action spaces defined and the reward structure initialized. A test run of random agents should produce episodes with rewards (mostly zeros and occasional wins). You should be able to log a sample game sequence (with actions like discards and calls) for verification.
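For the Day 1-5 environment check, a minimal sanity-check script along these lines can confirm that games run end to end. It assumes a recent RLCard release (attribute names such as `num_players`, `num_actions`, and `env.run` may differ between versions), and it treats any positive payoff as a win, which is an assumption about RLCard's Mahjong payoff convention.

```python
# Sanity check for Day 1-5: four random agents play RLCard Mahjong.
# Attribute names follow recent RLCard releases and may differ in other versions.
import rlcard
from rlcard.agents import RandomAgent

env = rlcard.make('mahjong', config={'seed': 42})
env.set_agents([RandomAgent(num_actions=env.num_actions) for _ in range(env.num_players)])

wins = [0] * env.num_players
for _ in range(100):
    trajectories, payoffs = env.run(is_training=False)
    for pid, payoff in enumerate(payoffs):
        if payoff > 0:          # assume a positive payoff marks the hand's winner
            wins[pid] += 1
print('wins per seat over 100 hands:', wins)
```

Whatever the exact counts, the point is only that all four seats look statistically similar and no game crashes.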
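For the Day 11-15 reward design, one way to keep the shaping terms adjustable is to funnel everything through a single helper. This is a minimal sketch, assuming your own environment wrapper populates the hypothetical event flags (`riichi_declared`, `meld_completed`, `dealt_in`); the magnitudes are the starting values suggested above and will need tuning.

```python
# Sketch of the Day 11-15 shaping idea: wrap the terminal payoff with small auxiliary
# bonuses/penalties. The event flags are hypothetical fields that your environment
# wrapper would have to populate per step or per hand.
def shaped_reward(terminal_payoff, events):
    reward = float(terminal_payoff)          # +1 win / -1 loss (optionally score-scaled)
    if events.get('riichi_declared'):
        reward += 0.10                       # reached tenpai and declared Riichi
    if events.get('meld_completed'):
        reward += 0.05                       # Chi/Pon that plausibly improves the hand
    if events.get('dealt_in'):
        reward -= 0.05                       # discarded the tile another player won on
    return reward
```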
Block 2 (Days 16–30): Baseline Policy and Model Architecture
- Day 16-20: Create a very basic policy to use as a baseline and to generate some initial data. For example, a rule-based policy that discards the tile it holds the most copies of, or a completely random policy. This is just to have something to compare learning against initially and to ensure the game can run to completion with meaningful results (even if random). Use this policy to simulate, say, 1,000 games and record each player’s results. As expected, all random players should perform roughly equally: with random play most hands end in exhaustive draws, and the rare wins (on the order of the ~0.2% noted earlier) should be split about evenly across the four seats. This sanity check ensures the environment isn’t biased or broken.
- Day 21-25: Design the neural network for the RL agent. It will likely need to handle a sequence of moves (partial observability), so consider an LSTM or GRU that processes each turn as the game state evolves. Alternatively, use a feed-forward network on a concatenated state if you encode the entire observable game state in one vector (including everything discarded so far, which is a lot of information). A possible architecture: an input vector encoding hand composition (34 tile types), plus counts of seen tiles, plus one-hot encodings of round wind, seat wind, etc. Pass this through fully connected layers to output a policy over actions and a value. Since the action space is large (say, 34 possible discards plus calls), mask invalid actions by setting their logits to -inf before the softmax. Initialize this network (a sketch appears at the end of this block).
- Day 26-30: Set up the training loop for self-play. Four instances of the network (or one network controlling all players, since the game is symmetric) play against each other. Since Mahjong is 4-player, you can do self-play by having one neural agent and 3 copies of it (all learning together). Alternatively, start with one learning agent and 3 dummy agents to ease it in. Implement an RL algorithm; a good choice is policy gradient with a baseline (advantage actor-critic). For each game, the agent(s) collect trajectories of (state, action, reward) – though rewards arrive only at the end of a hand or on certain events. Given the sparsity, using a Monte Carlo rollout to end-of-game for reward assignment is feasible but high-variance. Consider treating each hand (round) as an episode to give more frequent updates. Deliverable: The initial RL training pipeline is running (an actor-critic update sketch appears at the end of this block, after the network sketch). The agent at this point is likely still making near-random moves (the policy network is untrained), but you can verify that gradients flow and the network parameters update. This sets the stage for scaling up learning.
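A minimal sketch of the Day 21-25 feed-forward policy/value network with action masking, in PyTorch. The input size (34×4 tile planes plus a few extra features) and the action count (~38) are illustrative assumptions, not a fixed encoding.

```python
# Sketch of the Day 21-25 architecture: a feed-forward policy/value net with action masking.
import torch
import torch.nn as nn

class MahjongNet(nn.Module):
    def __init__(self, obs_dim=34 * 4 + 16, num_actions=38, hidden=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs, legal_mask):
        # legal_mask: bool tensor, True where the action is currently legal.
        h = self.body(obs)
        logits = self.policy_head(h)
        # Push illegal actions to -inf before the softmax so they get zero probability.
        logits = logits.masked_fill(~legal_mask, float('-inf'))
        return torch.distributions.Categorical(logits=logits), self.value_head(h).squeeze(-1)
```

Masking at the logit level keeps the softmax well defined while guaranteeing zero probability on illegal actions.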
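And a minimal sketch of the Day 26-30 advantage actor-critic update over one collected hand, assuming the self-play loop has already packaged each decision as a (state, legal-mask, action, return-to-go) tuple; `MahjongNet` is the network sketched above.

```python
# Sketch of an advantage actor-critic update over one hand's transitions.
# `transitions` is assumed to be a list of (obs, legal_mask, action, return_to_go) tensors/ints.
import torch

def a2c_update(net, optimizer, transitions, value_coef=0.5, entropy_coef=0.01):
    obs = torch.stack([t[0] for t in transitions])
    masks = torch.stack([t[1] for t in transitions])
    actions = torch.tensor([t[2] for t in transitions])
    returns = torch.tensor([t[3] for t in transitions], dtype=torch.float32)

    dist, values = net(obs, masks)
    advantages = returns - values.detach()          # Monte Carlo return minus value baseline
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    loss = policy_loss + value_coef * value_loss - entropy_coef * dist.entropy().mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```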
Block 3 (Days 31–45): Self-Play Training and Strategy Emergence
- Day 31-35: Run self-play training for many iterations. Because Mahjong games are long, even a few hundred games may take a while. Optimize by running games in parallel if possible (e.g., multiple environment instances in separate processes). After some training, check whether the agent has learned basic behaviors: one indicator – is it winning more than its fair 25% share against initial versions of itself or a dummy agent? Does it learn to declare Riichi when it is one tile away from winning (tenpai)? You can measure this by tracking how often it declares Riichi when in tenpai versus when not – a rational agent should almost always Riichi from tenpai if the hand has enough value (a metrics-tracking sketch appears at the end of this block).
- Day 36-40: Introduce curriculum or phased learning. Early on, the agent might learn simply not to throw dangerous tiles to avoid losses (since a big loss is worse than not winning). To ensure it also learns to win (not just avoid defeat), you might run phases where reward shaping is adjusted: e.g., in one phase, give extra reward for winning a hand; in another, focus on minimizing deal-in (defensive discarding). Alternatively, train agents with different reward focuses and have them play together (this can diversify strategies). Another approach: fix some opponents as slightly smarter (perhaps a rule-based bot that at least completes basic melds) so the learning agent has pressure to improve.
- Day 41-45: Evaluate intermediate performance. Use metrics like average placement (1st, 2nd, 3rd, 4th) or average score per hand. See if it’s better than random. It might not be expert yet – full Mahjong proficiency is very complex – but even mild improvement like a higher tenpai (ready hand) rate or lower deal-in rate is progress. Inspect some games the agent plays: does it make obvious mistakes (e.g., discarding tiles that it just called – a sign of confusion)? Use these observations to adjust the reward or network. For instance, if it never calls Chi/Pon, maybe the reward for making a meld could be increased to encourage it. Deliverable: A report on the learning progress so far, with any noticeable skills acquired by the agent (e.g., “Agent learned to aim for melds and reaches tenpai 10% more often than random”). The training framework is now solid to incorporate more advanced techniques.
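A small sketch of the behaviour metrics discussed in Days 31-45 (tenpai rate, Riichi-given-tenpai, deal-in rate). It assumes your logging emits one dict per finished hand for the learning agent, with hypothetical flags such as `reached_tenpai`, `declared_riichi`, `dealt_in`, and `won`.

```python
# Sketch of per-hand behaviour metrics; the field names are hypothetical and would be
# produced by your own logging around the environment.
def summarize(hands):
    n = len(hands)
    tenpai_hands = [h for h in hands if h.get('reached_tenpai')]
    riichi_when_tenpai = sum(h.get('declared_riichi', False) for h in tenpai_hands)
    return {
        'win_rate': sum(h.get('won', False) for h in hands) / n,
        'tenpai_rate': len(tenpai_hands) / n,
        'deal_in_rate': sum(h.get('dealt_in', False) for h in hands) / n,
        'riichi_given_tenpai': riichi_when_tenpai / max(1, len(tenpai_hands)),
    }
```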
Block 4 (Days 46–60): Enhancing Strategy with Domain Knowledge
- Day 46-50: Integrate some domain knowledge to guide the agent. Mahjong has many nuances (efficiency, defense, hand value). Incorporate a heuristic module that provides suggestions or features. For example, add a feature for shanten (the number of tiles away from a complete hand). A low shanten number is good, so give a small continuous reward for reducing shanten during play (a shanten-bonus sketch appears at the end of this block). Or incorporate common defensive heuristics: if another player has declared Riichi, a human would avoid discarding risky tiles; you could give a negative reward if the agent deals into someone’s win after a Riichi (partially covered already by the penalty for discarding a winning tile). Adjust the training to include these heuristic-based rewards. This biases the policy towards known good Mahjong practice.
- Day 51-55: Experiment with oracle-style guiding (in the spirit of Suphx): at certain states, use a brute-force check or heuristic to nudge the agent. For instance, if the agent clearly has a winning hand but has not declared the win, force that action during training games (so it learns to claim wins). Or if it can declare an obvious Kan (quad) safely, do it. This prevents the agent from missing rare but critical actions (which might never be learned due to their rarity). Use these guided interventions sparingly so as not to over-script the agent.
- Day 56-60: Resume training with these enhancements and observe improvements. Ideally, the agent now should consistently reach tenpai and win more often. It might start to develop preferences for certain hand types (perhaps going for flushes if it gets many same-suit tiles, etc.). Compare its performance with and without the domain-guided rewards to justify the enhancements. Deliverable: An improved RL agent that embodies some human-like strategies (due to injected domain knowledge) while still learning on its own. For example, it might now rarely discard dora (bonus tiles) early, recognizing their value – an emergent behavior possibly encouraged by reward shaping. This block shows blending expert knowledge with RL to handle complex games.
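A sketch of the Day 46-50 shanten-shaping idea: a small bonus whenever an action lowers the shanten count. The shanten calculator is passed in as a parameter because it is an assumption here – you could wrap an existing library (the `mahjong` package on PyPI ships one) or write your own search.

```python
# Sketch of shanten-based shaping: reward the agent slightly whenever an action
# brings the hand closer to tenpai. `calc_shanten` is supplied by the caller.
def shanten_bonus(calc_shanten, hand_before, hand_after, scale=0.02):
    before = calc_shanten(hand_before)   # tiles-from-tenpai before the action
    after = calc_shanten(hand_after)     # ... and after it
    return scale * (before - after)      # positive when the hand improved
```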
Block 5 (Days 61–75): Multi-Agent Considerations and Competition
- Day 61-65: Up to now, all four players may have been using the same learning agent (self-play). Consider introducing opponent variety to make the policy more robust. Add a fixed-rule AI as one of the four players during training (e.g., an AI that always calls Pon/Chi and pushes for quick, low-value hands). This exposes the learning agent to diverse styles; it might adapt by playing more defensively against an overly aggressive opponent, for instance. Alternatively, if all agents are learning, occasionally swap one seat’s policy for an older snapshot (as in fictitious self-play) to stabilize training and avoid all agents converging to a suboptimal local equilibrium (a snapshot-pool sketch appears at the end of this block).
- Day 66-70: Implement evaluation against a known AI or human data. If there’s a publicly available Mahjong AI (even a simple one) or some recorded games of average human players, use those to benchmark. For example, have your agent play 100 games against three copies of a baseline bot (maybe RLCard has a baseline AI). Measure win rates and average score. This will give an external measure of strength. If it’s not doing well, identify weak points: does it miss opportunities (maybe it rarely calls Ron on someone’s discard – check that)? Does it chase expensive hands at the cost of winning (common problem: agent might always go for maximum yaku and forget to just win cheaply)? Tweak the reward or training if needed (perhaps add a small reward for simply winning even a low-value hand, to balance).
- Day 71-75: Prepare the agent for competitive play. Ensure that it obeys all rules (e.g., does it correctly handle furiten – the “sacred discard” rule that forbids winning by Ron on a tile it has already discarded, or on a wait it passed on earlier in the same go-around? These are edge cases to verify). Also, add a mechanism for the agent to handle tie-breaks and exhaustive draws (these happen regularly in Riichi; the agent should at least not do something silly in the final turns of a round). Deliverable: A robust Mahjong RL agent that can play full games against different styles of opponents without rule violations. It is likely at least as good as a basic heuristic player by now. You should have logs of some games where the agent wins impressively or folds (defends) wisely – include those as examples of its learned behavior.
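A sketch of the Day 61-65 snapshot pool for fictitious-self-play-style opponent variety. It assumes the `MahjongNet` sketched in Block 2; the pool size and uniform sampling rule are arbitrary choices.

```python
# Sketch of an opponent snapshot pool: keep frozen copies of past policies and seat
# one of them at the table each game.
import copy
import random

class OpponentPool:
    def __init__(self, max_size=10):
        self.snapshots = []
        self.max_size = max_size

    def add(self, net):
        frozen = copy.deepcopy(net).eval()       # freeze a copy of the current policy
        for p in frozen.parameters():
            p.requires_grad_(False)
        self.snapshots.append(frozen)
        self.snapshots = self.snapshots[-self.max_size:]

    def sample(self, current_net):
        # With no snapshots yet, fall back to pure self-play.
        return random.choice(self.snapshots) if self.snapshots else current_net
```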
Block 6 (Days 76–90): Mimicry System Development
- Day 76-80: Shift focus to the mimicry system. Choose a target player whose style to imitate. This could be a famous Mahjong player or simply the style of an average club player. Obtain game logs for this target: for Riichi, Tenhou.net provides game records for high-ranked players (some are shared in communities), or use data from Mahjong tournaments if available. Alternatively, use the games your agent played (if you want to mimic its style as a sanity test). Assume you have a dataset of sequences of actions (especially discards and calls) from the player.
- Day 81-85: Preprocess the data. Represent each state in those games in a way similar to your RL agent’s state representation (though here you must include the player’s hand and what they could see). For each decision point (usually a discard or a call), record the state and the action the player took. Then train a supervised model (you can reuse the policy network architecture) on this dataset: the input is the state, the output is the action the human chose. This is similar to how the mimic chess engine was trained, but here with partial information. You may include some of the player’s hidden information (their hand), since the mimic model is supposed to replicate their internal decision-making. Train until the model’s action-prediction accuracy plateaus. It may be low (Mahjong has many possible actions), but focus on matching important decisions: for example, measure discard accuracy specifically when the player is in tenpai or after declaring Riichi (a supervised training-loop sketch appears at the end of this block).
- Day 86-90: Evaluate the mimic model. Let it play in the Mahjong environment (the other seats can be random agents, or copies of the mimic if you want four of that style). Does it play similarly to the human? For instance, compare the distribution of waits chosen: if the human often accepted edge waits (penchan shapes waiting on a 3 or a 7) or preferred closed hands over open ones, see if the model does too. Check specific instances: if the human tended to fold (not push for the win) when danger was high, does the mimic model do the same? This may require analyzing decisions in context, not just overall accuracy. It won’t be perfect, but it should capture some signature moves (e.g., if the target always called Kan when available, as some aggressive players do, the model will likely mimic that). Deliverable: A trained mimic Mahjong model that approximates a given player’s style. You can demonstrate it by showing that in a certain logged game situation, the model predicts the same discard the player made, and does so consistently across many examples. It forms the basis for a “human-like” Mahjong AI.
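A minimal sketch of the Day 81-85 supervised mimic training: cross-entropy between the masked policy and the human's recorded action, reusing the `MahjongNet` sketched in Block 2. The dataset is assumed to yield (observation, legal-mask, action-index) tensors built from the parsed logs.

```python
# Sketch of supervised mimic training on (state, human action) pairs.
import torch
from torch.utils.data import DataLoader

def train_mimic(net, dataset, epochs=10, lr=1e-3, batch_size=256):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for epoch in range(epochs):
        correct, total = 0, 0
        for obs, legal_mask, action in loader:
            dist, _ = net(obs, legal_mask)              # reuse the masked policy head
            loss = -dist.log_prob(action).mean()        # cross-entropy against the human's choice
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            correct += (dist.logits.argmax(-1) == action).sum().item()
            total += action.numel()
        print(f'epoch {epoch}: action-prediction accuracy {correct / total:.3f}')
    return net
```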
Block 7 (Days 91–105): Integrating Mimic Model and RL Agent
- Day 91-95: Set up a side-by-side comparison between the RL agent and the mimic agent. Perhaps have them play at the same table along with some baseline bots. Observe differences: the RL agent may play closer to the mathematical optimum (maximizing win probability and expected value), whereas the mimic may take more risks or follow human-like intuition (such as chasing a rare hand). This is interesting to document. Also consider integrating the two – for instance, fine-tune the RL agent with a small amount of supervised learning from human data to see whether that makes it more human-like in play (or improves it by guiding it towards good strategies). This imitation-augmented RL could be an experiment: initialize a new agent with the mimic network and then let it self-play train (a warm-start sketch appears at the end of this block). See whether it learns faster or differently than the purely self-play agent.
- Day 96-100: Polish both agents. Ensure the mimic agent is robust: since it is trained on one player’s decisions, it may face situations it has never seen when others play differently. Possibly constrain it with a default behavior for unfamiliar scenarios (e.g., if the model’s confidence is low, fall back to a basic rule like discarding the safest tile; a fallback sketch appears at the end of this block). For the RL agent, finalize its hyperparameters and freeze it. Both agents should now be able to play a full game without intervention.
- Day 101-105: If possible, arrange a test where human players play against your agents (perhaps online or with colleagues). Feedback from experienced players would be valuable: they might say “The RL agent plays oddly, like breaking a winning hand for defense too early” or “The mimic bot really feels like playing against a typical club player, it even discarded safe tiles after I declared Riichi.” Such feedback can be used to qualitatively assess if the objectives were met. Deliverable: Two distinct Mahjong AI agents ready for final demonstration: one optimized by RL for winning, another mimicking a human style. Also, any insights from combining them or from human feedback are noted for the final report.
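For the Day 91-95 imitation-augmented RL experiment, warm-starting is mostly a matter of copying weights. A minimal sketch, assuming both agents share the `MahjongNet` architecture from Block 2 (the value head is reset because the mimic was never trained on returns):

```python
# Sketch of warm-starting a self-play agent from the mimic network's weights.
import copy

def warm_start_from_mimic(mimic_net):
    rl_net = copy.deepcopy(mimic_net)        # same architecture, human-like policy weights
    rl_net.value_head.reset_parameters()     # value head starts fresh for RL training
    return rl_net
```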
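And a sketch of the Day 96-100 confidence fallback for the mimic agent: if the top action's probability is low, defer to a simple safety rule. Both `safe_discard_fn` and the 0.3 threshold are placeholders, not tuned values.

```python
# Sketch of a low-confidence fallback: use the mimic policy unless it is unsure,
# then defer to a caller-supplied "discard the safest tile" rule.
import torch

def mimic_act(net, obs, legal_mask, safe_discard_fn, threshold=0.3):
    with torch.no_grad():
        dist, _ = net(obs.unsqueeze(0), legal_mask.unsqueeze(0))
        probs = dist.probs.squeeze(0)
    top_prob, top_action = probs.max(dim=0)
    if top_prob.item() < threshold:
        return safe_discard_fn(obs, legal_mask)   # e.g. a tile already seen in others' discards
    return top_action.item()
```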
Block 8 (Days 106–120): Performance Evaluation
- Day 106-110: Evaluate the RL Mahjong bot’s performance thoroughly. Since there is no standard Elo-like rating for Mahjong AIs, use relative performance. Have it play a large number of games against baseline agents (random or a simple rule-based AI). It should consistently outperform them (higher average placement). If possible, also pit it against the mimic agent (if the mimic represents an average human, the RL agent should ideally do better in the long run, even if the mimic occasionally wins by luck). Track metrics: win rate, average score, deal-in rate, frequency of furiten waits, etc. If the RL agent approaches even intermediate human performance, that is a success. (Note: reaching pro level is extremely hard and likely out of reach for a 6-month solo project without huge compute, considering the effort Suphx required.)
- Day 111-115: Evaluate the mimic agent. While its goal isn’t to “win” per se, measure how well it preserves the characteristics of the target style. If the target human had a known win rate or style, compare against it. For example, if the human won 25% of games in the dataset, does the mimic bot hover around that against similar opposition? More stylistically, measure the distribution of waits (single, pair, sequence) it ends up with when winning, or how often it calls Pon/Chi compared to the target. These will show whether the imitation is faithful.
- Day 116-120: Summarize the evaluation results. Perhaps run a small “Mahjong competition” between the different agents: random, rule-based, RL agent, mimic agent (a round-robin tally sketch follows below). After many games, rank them by performance. Ideally the RL agent comes first, the mimic around second if it mimics a decent player, rule-based next, random last. If the mimic agent imitates a very strong player, it might even challenge the RL agent. Analyze edge cases: does the RL agent ever do anything illegal or absurd? (It shouldn’t if the environment prevents illegal moves, but watch for strategic absurdities like declaring Riichi on a bad wait with plenty of the game left – which might be fine, but is worth checking.) Deliverable: A comprehensive evaluation document with charts/tables of performance. This will feed into your final report, demonstrating the effectiveness of the RL training and the fidelity of the mimic model. It also sets the context for how these AIs might perform in the real world (e.g., “Our RL bot would likely place in the top 30% of players on a platform like Tenhou based on its win rate” – a speculative conclusion supported by data).
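A sketch of the Day 116-120 mini competition: rotate the four agents through all seatings and rank them by average placement. `play_game` is a hypothetical function that runs one full game and returns final scores per seat, and each agent is assumed to expose a `name` attribute.

```python
# Sketch of a round-robin competition tally across seat rotations.
from collections import defaultdict
from itertools import permutations

def round_robin(agents, play_game, games_per_seating=25):
    placements = defaultdict(list)
    for seating in permutations(agents):            # rotate seats to remove position bias
        for _ in range(games_per_seating):
            scores = play_game(list(seating))       # hypothetical: final scores per seat
            ranked = sorted(range(4), key=lambda i: scores[i], reverse=True)
            for place, seat in enumerate(ranked, start=1):
                placements[seating[seat].name].append(place)
    return {name: sum(p) / len(p) for name, p in placements.items()}  # avg placement
```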
Block 9 (Days 121–135): Deployment and Interface
- Day 121-125: Focus on deploying the Mahjong AI in a usable form. Create a command-line or simple GUI application where a user can play against the AI or watch AI vs AI games. Since Mahjong is four-player, perhaps allow the user to take one seat and AI controls the other three. Implement a display of the tiles (text-based or graphical). This is a significant UI challenge, but you can simplify by text notations (like “Player East discards 5 of Characters”). If a full GUI is too much, at least ensure the engine can play on existing online platforms: for example, write a wrapper that interfaces with a Mahjong client bot API if any exist (some online services might allow bots in private rooms).
- Day 126-130: Add UX features: the ability to choose which AI controls which seat (e.g., watch two different agents play by assigning the RL bot to East and West and the mimic to North), and logging options to record the game in a human-readable format so a Mahjong player can review it. If the mimic agent imitates a specific, named player, label it as such in the interface (“AI playing as [Player Name]”). If possible, have the AI explain its decisions – full explanations are hard, but even a basic output like “agent estimates win probability X%” or listing the top 3 candidate discards with probabilities can be insightful for users (a sketch of such a display appears at the end of this block).
- Day 131-135: Finalize the interface and test with a few trial games. Ensure that all phases of the game are handled (wall depletion draws, repeat counters, scoring calculations). Validate that the scoring is correct by comparing a few rounds to known outcomes (scoring in Riichi is complex; RLCard likely handles it, but double-check point assignments). Also test stability: does the program run through many rounds without crashing or mis-dealing tiles? Fix any bugs. Deliverable: A functional Mahjong AI application or service where one can observe or play games with the RL and mimic agents. This “productizes” your 6-month work, allowing others (even non-developers) to experience the AI’s gameplay.
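A sketch of the Day 126-130 explanation feature: print the top three candidate actions with probabilities plus the value estimate, using the masked network from Block 2. `action_names` is an assumed index-to-label mapping (e.g., “discard 5 of Characters”) that the interface layer would maintain.

```python
# Sketch of a simple "why this move" display: top-k actions and the value estimate.
import torch

def explain_move(net, obs, legal_mask, action_names, k=3):
    with torch.no_grad():
        dist, value = net(obs.unsqueeze(0), legal_mask.unsqueeze(0))
    probs = dist.probs.squeeze(0)
    top = torch.topk(probs, k)
    for p, idx in zip(top.values.tolist(), top.indices.tolist()):
        print(f'{action_names[idx]}: {p:.1%}')
    print(f'estimated hand value: {value.item():+.2f}')
```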
Block 10 (Days 136–150): Final Demonstrations
- Day 136-140: Demonstrate the RL Mahjong AI’s gameplay. Set up a match of four AI agents: ideally, three baseline or mimic agents and one RL agent, to highlight how the RL agent performs. Record a full game (several rounds) and then annotate key moments. For example, “South (our RL agent) folds here despite a decent hand – anticipating East’s likely win; this is a very strategic play that suggests defense” or “East (RL) calls Riichi early, maximizing pressure – a learned behavior balancing risk/reward.” If possible, find a round where the RL agent wins with a good hand and explain the sequence of decisions that led there (maybe it kept two possibilities open and got the right tile).
- Day 141-145: Demonstrate the mimic agent. Perhaps take a game record from the actual human player and a similar game played by the mimic agent, and compare them side by side. Or have the mimic agent play against three copies of the RL agent. It might come last (if the RL agent is strong), but what’s interesting is how it plays: point out a human-like quirk – “Notice the mimic bot discards safe tiles after the opponent’s Riichi, even at the cost of its own hand – this defensive style is exactly what our target player was known for.” If the mimic agent represents a weaker player, you might show it making a mistake that humans often make (e.g., dealing into an obvious hand); the educational angle is that the mimic replicates those errors, whereas the RL bot would not.
- Day 146-150: Prepare the final presentation that ties everything together. Explain the complexity of Mahjong and how RL was applied (note that a purely random policy wins only about 1 hand in 500, which is why heavy reward shaping and large-scale self-play were needed). Show learning curves or improvements. Present the Suphx result as context: Suphx achieved top-human performance – your bot is not at that level, but you can discuss what separates yours from Suphx (chiefly the advanced techniques and training scale Suphx used). Then highlight your unique contribution: the mimic agent and the combination of RL and mimicry. Perhaps end with a short video clip of the Mahjong AI in action with commentary. Deliverable: The polished showcase of the Mahjong AI project, including annotated game replays and slides, ready for an audience.
Block 11 (Days 151–165): Documentation and Release
- Day 151-155: Write the documentation for using the Mahjong AI. Include instructions for setting up the environment and any dependencies (whether RLCard is needed or specific libraries for Mahjong logic). Provide details on how to configure the game (Riichi or Chinese rules, if that is an option). Document the training process: how one would retrain the model or train for a different rule set – for academic completeness. Also document the mimic training process, so others can create a mimic bot if they have game logs of a different player. Be sure to describe the file format expected for game logs and how to feed them into the training script.
- Day 156-160: Write a section on “Understanding the AI’s decisions” – basically an analysis of the learned strategy. This will serve as a guide for those interested in the AI’s style (for instance: “The RL agent values keeping dora tiles (bonus tiles) significantly – in our analysis it was 30% more likely to keep a dora in hand than a non-dora in similar situations, indicating it learned their importance for scoring.”). Also mention any limitations noticed (e.g., “The bot sometimes plays too cautiously in the endgame – likely due to heavy penalty on dealing in; adjusting that penalty could make it more balanced.”) Acknowledge that the game’s complexity means it’s not perfect, but it’s a step towards an RL Mahjong player.
- Day 161-165: Package everything for release: code, trained models for both RL and mimic agents, example game logs, documentation, and possibly a few replay files of interesting games. If open-sourcing, publish to a repository with an open license. If handing off internally, ensure all pieces are organized and delivered. Deliverable: Complete project package and documentation delivered. This includes the Mahjong environment code, the trained AI models, the mimic model and data (with any privacy considerations if using someone’s data), and a user manual. This allows others to reproduce or extend the work, marking the end of the 6-month journey.
Block 12 (Days 166–180): Future Directions and Closing
- Day 166-170: In the final report or presentation, include a section on future work. For Mahjong, a big next step could be applying the techniques to other variations (Chinese Official rules, etc.), or incorporating Monte Carlo simulations during play (some Mahjong AIs do a form of partial information MCTS by sampling opponents’ hidden tiles – something your bot hasn’t done explicitly). Also mention the possibility of using the trained model as a starting point and then employing search (like how chess engines use neural nets + tree search – Mahjong could use a similar approach where the policy net guides a depth-limited search of likely outcomes).
- Day 171-175: Reflect on the learning from this domain versus others. Mahjong, being multi-player and imperfect info, required combining ideas from poker AI (dealing with hidden info) and techniques like those used in AlphaZero (self-play). Summarize how the mimic part adds value: such techniques could apply to any game – learning not just the optimal play but how specific people play, which is useful in settings like teaching or customizing AI opponents to different skill levels (e.g., a novice mode that imitates novice mistakes so human players can practice).
- Day 176-180: Conclude the project. Possibly organize an exhibition match: your Mahjong AI against human players or against Suphx (if that were possible) just as a fun experiment. Celebrate the accomplishments: you have built, from the ground up, a working Mahjong AI that can compete and a framework to learn any player’s style from data. Final Deliverables:
- RL Mahjong Agent: A self-play trained AI that plays Mahjong with a solid strategy, balancing offense and defense, tested in simulation.
- Mimic Mahjong Agent: An AI that captures a human’s style of play (e.g., replicating defensive or aggressive tendencies of a given player). Both are delivered with usage instructions. The project concludes with a detailed report and demonstration, fulfilling the goal of applying deep RL to the complex domain of Mahjong, and paving the way for future enhancements (perhaps one day reaching the superhuman level achieved by systems like Suphx).