
Ever since the first Claude vs. Stockfish game, whether an LLM can beat a weakened chess engine has been an open question, along with what it would take to make the model play smarter.
Re-introducing function calling
We previously discussed Databricks' tools support. It is natural to think of functions as existing to achieve a human objective; without any feedback loop, the LLM would not be able to correct itself and find a move that beats the chess engine. In other words, if the LLM can't figure it out on its own, and we give it a grading function without giving it the formula, is it smart enough to reverse-engineer the algorithm and pick the best move without brute force? That is what we are going to find out in this article.
What Is A Centipawn?
A centipawn is one hundredth of a pawn, the unit used to quantify the strategic features of a position in a chess game and determine which player has an advantage. Centipawn loss measures how much a player's position worsens after making a move. There is no single definitive "best move", because in the end the object of chess is to checkmate the King. However, we can employ a chess engine like Stockfish to search to virtually unlimited depth, find the strongest move in a given position, and evaluate it against the move actually played. The numbers are anchored to the point values of the chess pieces. Once again, we do not win a chess game by accumulating or minimizing scores; but if we can minimize our own centipawn loss and maximize the opponent's, we stand a better chance of delivering checkmate.
Generally accepted point values of chess pieces. Image: ChessKid
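To make the centipawn score concrete, here is a minimal sketch of obtaining one with python-chess and a local Stockfish binary; the binary path and search depth are assumptions, not values from this article.

```python
# Minimal sketch: score a position in centipawns with Stockfish,
# assuming python-chess and a Stockfish binary on PATH.
import chess
import chess.engine

board = chess.Board()  # starting position
board.push_san("e4")   # 1. e4

with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    info = engine.analyse(board, chess.engine.Limit(depth=15))
    # Score relative to White; e.g. +30 means White is up ~0.3 pawns.
    print(info["score"].white().score(mate_score=10000))
```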
Chess eval agentic workflow
To build the agentic workflow, we will create a LangGraph agent with function calling. The function, evaluate_llm_move, can be registered in Unity Catalog for governance purposes. It accepts a FEN string (the current position of the board) and the move proposed by the LLM, and returns the following output:
| Output | Value |
| --- | --- |
| llm_move | The move played by the LLM |
| post_move_fen | The FEN (board position) after applying llm_move |
| llm_eval | Centipawn score after llm_move (int) |
| centipawn_loss | Signed loss from the LLM's position (int) |
| move_quality | One of ["Excellent", "Good", "Inaccuracy", "Mistake", "Blunder"] |
| llm_color | "white" or "black" |
Internally, the centipawn loss is calculated as:
centipawn_loss = stockfish_eval - llm_eval
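For illustration, here is a minimal sketch of what such a function could look like, using python-chess and a local Stockfish binary. The search depth and move-quality thresholds are assumptions, and the Unity Catalog registration and error handling are omitted.

```python
import chess
import chess.engine

# Illustrative quality cutoffs in centipawns; the article does not
# publish its exact thresholds.
QUALITY_BANDS = [
    (20, "Excellent"),
    (50, "Good"),
    (100, "Inaccuracy"),
    (300, "Mistake"),
    (float("inf"), "Blunder"),
]

def evaluate_llm_move(fen: str, llm_move: str, engine_path: str = "stockfish") -> dict:
    board = chess.Board(fen)
    mover = board.turn  # the side the LLM plays in this position
    llm_color = "white" if mover == chess.WHITE else "black"

    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        limit = chess.engine.Limit(depth=18)

        # Stockfish's evaluation of the position before the move,
        # from the mover's point of view (best play available).
        stockfish_eval = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=10000)

        # Apply the LLM's move (UCI notation, e.g. "e2e4") and re-evaluate,
        # still from the mover's point of view.
        board.push_uci(llm_move)
        llm_eval = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=10000)

    # centipawn_loss = stockfish_eval - llm_eval (signed)
    centipawn_loss = stockfish_eval - llm_eval
    move_quality = next(label for cutoff, label in QUALITY_BANDS if centipawn_loss <= cutoff)

    return {
        "llm_move": llm_move,
        "post_move_fen": board.fen(),
        "llm_eval": llm_eval,
        "centipawn_loss": centipawn_loss,
        "move_quality": move_quality,
        "llm_color": llm_color,
    }
```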
The formula above leaks a small hint of how well Stockfish can play in the position, but one thing is certain: a chess novice could not use this tool to improve their play instantly. And believe it or not, this is the winning formula for the LLM!
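To close the loop on the workflow described above, here is a minimal sketch of exposing the UC function to a LangGraph agent, assuming the databricks-langchain and langgraph packages; the catalog, schema, endpoint name, and prompt are placeholders.

```python
# Wire the governed UC function into a LangGraph function-calling agent.
# Catalog, schema, and endpoint names below are placeholders.
from databricks_langchain import ChatDatabricks, UCFunctionToolkit
from langgraph.prebuilt import create_react_agent

llm = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct")

# Expose the Unity Catalog function as a LangChain-compatible tool.
toolkit = UCFunctionToolkit(function_names=["main.chess.evaluate_llm_move"])

agent = create_react_agent(
    llm,
    toolkit.tools,
    prompt=(
        "You are playing chess. Propose candidate moves, grade them with "
        "evaluate_llm_move, and play the move with the lowest centipawn loss."
    ),
)

result = agent.invoke({"messages": [("user", "FEN: <current position>. Your move?")]})
```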
Databricks Agent Tracing
We discussed how to connect an LLM to a UC function in the previous post, but simply connecting the function is not enough. We need to verify in the agent trace that the function is actually being called, and we can evaluate its output from the tracing UI.
As seen in the screenshot above, the function is called in the workflow for evaluation, and we can inspect its output to confirm that it is correct and does not raise any errors.
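For reference, traces like this can be captured with MLflow autologging; a minimal sketch, assuming a LangChain/LangGraph-based agent (the experiment path is a placeholder):

```python
# Enable MLflow tracing so every LLM call and tool invocation
# (including evaluate_llm_move) is captured and visible in the trace UI.
import mlflow

mlflow.langchain.autolog()

# Traces land in the active experiment; set one explicitly if needed.
mlflow.set_experiment("/Users/<user>/chess-agent")  # placeholder path
```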
The Ultimate Evaluation
Once again, we will deploy the agent to a model serving endpoint, and the new endpoint will be ready to be consumed with the OpenAI SDK. Everything follows the architecture of our agentic chess app so far, except that now we are empowering the LLM with a chess evaluation tool.
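A minimal sketch of querying the serving endpoint with the OpenAI SDK, assuming a Databricks workspace URL and a personal access token; the endpoint name is a placeholder.

```python
# Query the deployed agent through its Databricks serving endpoint.
# Workspace host, token, and endpoint name are placeholders.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url="https://<workspace-host>/serving-endpoints",
)

response = client.chat.completions.create(
    model="chessmate-agent",  # the model serving endpoint name
    messages=[{"role": "user", "content": "FEN: <current position>. Your move?"}],
)
print(response.choices[0].message.content)
```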
The ultimate evaluation comes down to whether the LLM, armed with a tool, can checkmate Stockfish on a weak setting, against which it previously lost 100% of its games.
Surprisingly, the new LLM, which we will call ChessMate, indeed checkmated Stockfish and won by a wide margin!
The full game with commentary from the LLM can be found below:
https://www.chess.com/analysis/game/pgn/e4xEpSeMU?tab=analysis
Conclusion
In a surprising turn, the LLM understands a long-term objective when coupled with the right tool: it instantly gained master-level knowledge, playing at ~2000 Elo, and beat Stockfish's weak setting. Function calling isn't just about meeting human objectives anymore. The function by itself only says whether a move is good or bad; the LLM has to decide how to find a good move without brute force, which would instantly hit the rate limit of the Stockfish API. When designing functions, we can now think about the long-term objective we want to achieve and allow the LLM some flexibility in helping us reach it.
Appendix:
Stockfish setting:
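The exact settings from the original post are not preserved in this export; an illustrative weak configuration with python-chess might look like the following (all values are assumptions).

```python
# Illustrative weak Stockfish configuration; the exact settings used
# in the game are assumptions, not the article's published values.
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
engine.configure({"Skill Level": 0, "UCI_LimitStrength": True, "UCI_Elo": 1320})

# Cap search depth and time so the engine plays weakly.
weak_limit = chess.engine.Limit(depth=1, time=0.05)
```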
Game: https://www.chess.com/analysis/game/pgn/e4xEpSeMU?tab=analysis
Author
Jason Yip
Director of Data and AI, Tredence Inc.