Presenting Athena: Follow how this transformer-based chess model achieves an Elo of 2000!
I distill Stockfish 17's ability to play chess into a transformer model, labeling 15.3 billion chess positions with Stockfish 17 to create a comprehensive training dataset. This work is heavily inspired by Ruoss et al.
The dataset consists of all unique board positions drawn from 10 million Lichess games. For each unique position, every legal move was evaluated with Stockfish 17 for 0.05 seconds at unbounded depth to obtain a win probability or mate-in-N label.
We convert Stockfish 17's centipawn evaluation to a win percentage with a sigmoid function:
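A minimal sketch of this conversion, assuming the slope constant 0.00368208 used by Lichess's centipawn-to-win% conversion (the exact constant Athena uses is not stated here):

```python
import math

def centipawns_to_win_prob(cp: float, k: float = 0.00368208) -> float:
    """Map a centipawn evaluation to a win probability via a sigmoid.

    k = 0.00368208 is Lichess's conversion constant; treating it as
    Athena's exact value is an assumption, not confirmed by the write-up.
    """
    return 1.0 / (1.0 + math.exp(-k * cp))
```

An even evaluation (0 centipawns) maps to a 50% win probability, and large advantages saturate toward 100%.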
This results in 15.3 billion position-and-move pairs. Note that positions with a forced mate-in-N are labeled with a win probability of 0% or 100%, depending on which side is being mated.
Athena is a 270-million-parameter transformer trained on the Chessbenchmate dataset. Other architectures, including ResNets, ViTs, and Mamba, were explored; however, in initial testing, the transformer yielded the strongest results.
Each position is converted from a FEN string into tokens by assigning each possible FEN character an integer token ID, for a total of 78 character tokens. The UCI move is encoded by assigning every possible move for every piece a unique integer token ID, for a total of 1968 possible moves.
fen_tokens → [0, 34, 12, 7, ..., 45]
move_token → 843
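A small sketch of the character-level FEN tokenization, assuming an illustrative vocabulary and ID ordering (the real model's 78-token vocabulary and its exact ID assignments are not published here; this subset is smaller and purely for demonstration):

```python
# Illustrative character vocabulary covering FEN strings: piece letters,
# digits, en-passant files, and FEN punctuation. The real vocabulary has
# 78 tokens; this reduced set and its ordering are assumptions.
FEN_CHARS = sorted(set(
    "pnbrqkPNBRQK"   # piece letters (both colors)
    "0123456789"     # empty-square counts and move clocks
    "abcdefgh"       # en-passant target files
    "/ -w"           # rank separator, space, dash, side to move
))
CHAR_TO_ID = {c: i for i, c in enumerate(FEN_CHARS)}

def tokenize_fen(fen: str) -> list[int]:
    """Map each FEN character to its integer token ID."""
    return [CHAR_TO_ID[c] for c in fen]

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
tokens = tokenize_fen(start)
```

The move side works the same way, but with a fixed table mapping each of the 1968 possible UCI move strings (e.g. "e2e4", "e7e8q") to a single token ID.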
Chessbenchmate's win-probability and mate-in-N labels are encoded into evaluation bins, ordered so that the lower a bin's index, the more likely the side to move is to lose, and vice versa. The bins are structured like so:
[ negative mates | win prob bins | positive mates | checkmate ]
# or
[-M, ..., -2, -1, wp_0, wp_1, ..., wp_(k-1), +M, ..., +2, +1, #]
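The layout above can be sketched as a mapping from a (win probability, mate-in-N) label to a single bin index. The bin counts below (128 win-probability bins, mates tracked up to 10 moves) are illustrative assumptions; the write-up does not give the exact sizes:

```python
def evaluation_bin(win_prob, mate_in, num_wp_bins=128, max_mate=10):
    """Map a label to a bin index following the layout
    [-M .. -1 | wp_0 .. wp_{k-1} | +M .. +1 | #].

    mate_in < 0: the side to move gets mated in |N| moves.
    mate_in > 0: the side to move delivers mate in N moves.
    mate_in == 0: the position is already checkmate (the "#" bin).
    Bin sizes here are assumptions, not Athena's published values.
    """
    if mate_in is not None:
        if mate_in == 0:
            return max_mate + num_wp_bins + max_mate  # final "#" bin
        if mate_in < 0:
            return max_mate + mate_in  # -M maps to 0, -1 to max_mate - 1
        return max_mate + num_wp_bins + (max_mate - mate_in)  # +M first, +1 last
    # no forced mate: quantize win probability into k uniform bins
    idx = min(int(win_prob * num_wp_bins), num_wp_bins - 1)
    return max_mate + idx
```

With these sizes the model predicts over 2 * max_mate + num_wp_bins + 1 = 149 classes, and an even position (50% win probability) lands in the middle of the win-probability region.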
We evaluate our trained models on a set of 10,000 chess puzzles. The more puzzles a model can solve, the better it should play chess.
During our training run, Athena achieves a puzzle accuracy of 80.15% after training on 40% of Chessbenchmate, over 6.12 billion positions. We do not train for a full epoch due to diminishing returns on invested time and compute.

Athena's puzzle accuracy decreases significantly as puzzle difficulty increases, but it is still able to solve a decent chunk of the hardest puzzles.

Additionally, Athena reaches a surprisingly strong peak Elo of 2000 on Lichess against other bots. You can challenge Athena or watch it play other bots here.
The original Ruoss et al. paper used a rather inelegant solution to a limitation it discusses: "Indecisiveness in the face of overwhelming victory". This occurs when the agent is in a very strong position and many moves map to a high output bin. Since the agent has no position history, it tends to shuffle its pieces back and forth, because the resulting evaluation remains high. The authors hand the game off to Stockfish whenever the agent reaches an "inevitable win" position to close out the game.
We try to solve this by adding the mating bins, dramatically increasing the number of bins, and experimenting with an arcsine output encoding.
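One plausible reading of the arcsine output encoding, and the reading assumed in this sketch, is that the win-probability bin edges are spaced uniformly under the arcsine transform rather than linearly, packing more bins near 0% and 100%, exactly where the agent must distinguish between many near-winning moves:

```python
import math

def arcsin_bin_edges(k: int) -> list[float]:
    """Bin edges uniform under the transform u = arcsin(sqrt(p)),
    i.e. p_i = sin(u_i)^2 for evenly spaced u_i in [0, pi/2].

    This concentrates bin edges near 0 and 1, giving finer resolution
    in clearly winning or losing positions. Whether this matches
    Athena's exact encoder is an assumption.
    """
    return [math.sin(math.pi / 2 * i / k) ** 2 for i in range(k + 1)]
```

For k = 10, the first interior edge sits near 2.4% rather than the 10% a uniform grid would give, so nearly-won positions are spread across more bins instead of collapsing into one.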

From these results, it is clear that adding mating bins improves performance, but only by a small margin; the arcsine bin encoding does not significantly improve performance.
Despite Athena not matching the performance reported by Ruoss et al. due to compute constraints, it is still a very strong chess engine. It absolutely stomps me in every game I've played against it lol.