
Ever since the first Claude vs. Stockfish game, whether an LLM can beat a weakened chess engine has been an open question, along with what it would take to make the model play smarter.
Re-introducing function calling
We previously discussed Databricks' tools support. It is natural to think of functions as existing to achieve a human objective; without any feedback loop, the LLM would not be able to correct itself and find a move that beats the chess engine. In other words, if the LLM can't figure it out on its own, and we give it a grading function without giving it the formula, is it smart enough to reverse-engineer the algorithm and pick the best move without brute force? That is what we are going to find out in this article.
What Is A Centipawn?
A centipawn is one hundredth of a pawn, the unit used to quantify the strategic features of a position in a chess game and determine which player has an advantage. Centipawn loss measures how much a player's position worsens after making a move. There is no single definitive "best move", because in the end the object of chess is to checkmate the King. However, we can employ a chess engine like Stockfish to search to virtually unlimited depth, find the strongest move in a given position, and evaluate it against the move actually played. The numbers are anchored to the point values of the chess pieces. Once again, we do not win a chess game by accumulating or minimizing scores; but if we can minimize our own centipawn loss and maximize the opponent's, we stand a better chance of delivering checkmate.
Generally accepted point values of chess pieces. Image: ChessKid
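To make the centipawn score concrete, here is a minimal sketch of obtaining one with python-chess and a local Stockfish binary; the binary path and search depth are assumptions, not values from this article.

```python
# Minimal sketch: score a position in centipawns with Stockfish,
# assuming python-chess and a Stockfish binary on PATH.
import chess
import chess.engine

board = chess.Board()  # starting position
board.push_san("e4")   # 1. e4

with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    info = engine.analyse(board, chess.engine.Limit(depth=15))
    # Score relative to White; e.g. +30 means White is up ~0.3 pawns.
    print(info["score"].white().score(mate_score=10000))
```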
Chess eval agentic workflow
To build the agentic workflow, we will create a LangGraph agent with function calling. The function, evaluate_llm_move, can be registered in Unity Catalog for governance purposes. It accepts a FEN string (the current position of the board) and the move proposed by the LLM, and returns the following output:
| Output | Value |
| --- | --- |
| llm_move | The move played by the LLM |
| post_move_fen | The FEN (board position) after applying llm_move |
| llm_eval | Centipawn score after llm_move (int) |
| centipawn_loss | Signed loss from the LLM's position (int) |
| move_quality | One of ["Excellent", "Good", "Inaccuracy", "Mistake", "Blunder"] |
| llm_color | "white" or "black" |
Internally, the centipawn loss is calculated as:
centipawn_loss = stockfish_eval - llm_eval
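For illustration, here is a minimal sketch of what such a function could look like, using python-chess and a local Stockfish binary. The search depth and move-quality thresholds are assumptions, and the Unity Catalog registration and error handling are omitted.

```python
import chess
import chess.engine

# Illustrative quality cutoffs in centipawns; the article does not
# publish its exact thresholds.
QUALITY_BANDS = [
    (20, "Excellent"),
    (50, "Good"),
    (100, "Inaccuracy"),
    (300, "Mistake"),
    (float("inf"), "Blunder"),
]

def evaluate_llm_move(fen: str, llm_move: str, engine_path: str = "stockfish") -> dict:
    board = chess.Board(fen)
    mover = board.turn  # the side the LLM plays in this position
    llm_color = "white" if mover == chess.WHITE else "black"

    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        limit = chess.engine.Limit(depth=18)

        # Stockfish's evaluation of the position before the move,
        # from the mover's point of view (best play available).
        stockfish_eval = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=10000)

        # Apply the LLM's move (UCI notation, e.g. "e2e4") and re-evaluate,
        # still from the mover's point of view.
        board.push_uci(llm_move)
        llm_eval = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=10000)

    # centipawn_loss = stockfish_eval - llm_eval (signed)
    centipawn_loss = stockfish_eval - llm_eval
    move_quality = next(label for cutoff, label in QUALITY_BANDS if centipawn_loss <= cutoff)

    return {
        "llm_move": llm_move,
        "post_move_fen": board.fen(),
        "llm_eval": llm_eval,
        "centipawn_loss": centipawn_loss,
        "move_quality": move_quality,
        "llm_color": llm_color,
    }
```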
The formula above leaks a small hint of how well Stockfish can play in the position, but one thing is certain: a chess novice could not use this tool to improve their play instantly. And believe it or not, this is the winning formula for the LLM!
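To close the loop on the workflow described above, here is a minimal sketch of exposing the UC function to a LangGraph agent, assuming the databricks-langchain and langgraph packages; the catalog, schema, endpoint name, and prompt are placeholders.

```python
# Wire the governed UC function into a LangGraph function-calling agent.
# Catalog, schema, and endpoint names below are placeholders.
from databricks_langchain import ChatDatabricks, UCFunctionToolkit
from langgraph.prebuilt import create_react_agent

llm = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct")

# Expose the Unity Catalog function as a LangChain-compatible tool.
toolkit = UCFunctionToolkit(function_names=["main.chess.evaluate_llm_move"])

agent = create_react_agent(
    llm,
    toolkit.tools,
    prompt=(
        "You are playing chess. Propose candidate moves, grade them with "
        "evaluate_llm_move, and play the move with the lowest centipawn loss."
    ),
)

result = agent.invoke({"messages": [("user", "FEN: <current position>. Your move?")]})
```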
Databricks Agent Tracing
We discussed how to connect an LLM to a UC function in the previous post, but simply connecting the function is not enough. We need to verify in the agent trace that the function is actually being called, and we can evaluate its output from the tracing UI.
As seen in the screenshot above, the function is called in the workflow for evaluation, and we can inspect its output to confirm that it is correct and does not raise any errors.
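For reference, traces like this can be captured with MLflow autologging; a minimal sketch, assuming a LangChain/LangGraph-based agent (the experiment path is a placeholder):

```python
# Enable MLflow tracing so every LLM call and tool invocation
# (including evaluate_llm_move) is captured and visible in the trace UI.
import mlflow

mlflow.langchain.autolog()

# Traces land in the active experiment; set one explicitly if needed.
mlflow.set_experiment("/Users/<user>/chess-agent")  # placeholder path
```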
The Ultimate Evaluation
Once again, we will deploy the agent to a model serving endpoint, and the new endpoint will be ready to be consumed with the OpenAI SDK. Everything follows the architecture of our agentic chess app so far, except that now we are empowering the LLM with a chess evaluation tool.
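A minimal sketch of querying the serving endpoint with the OpenAI SDK, assuming a Databricks workspace URL and a personal access token; the endpoint name is a placeholder.

```python
# Query the deployed agent through its Databricks serving endpoint.
# Workspace host, token, and endpoint name are placeholders.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url="https://<workspace-host>/serving-endpoints",
)

response = client.chat.completions.create(
    model="chessmate-agent",  # the model serving endpoint name
    messages=[{"role": "user", "content": "FEN: <current position>. Your move?"}],
)
print(response.choices[0].message.content)
```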
The ultimate evaluation comes down to whether the LLM, armed with a tool, can checkmate Stockfish on a weak setting, against which it previously lost 100% of its games.
Surprisingly, the new LLM, which we will call ChessMate, indeed checkmated Stockfish and won by a wide margin!
The full game with commentary from the LLM can be found below:
https://www.chess.com/analysis/game/pgn/e4xEpSeMU?tab=analysis
Conclusion
In a surprising turn, the LLM understands a long-term objective when coupled with the right tool: it instantly gained master-level knowledge, playing at ~2000 Elo, and beat Stockfish's weak setting. Function calling isn't just about meeting human objectives anymore. The function by itself only says whether a move is good or bad; the LLM has to decide how to find a good move without brute force, which would instantly hit the rate limit of the Stockfish API. When designing functions, we can now think about the long-term objective we want to achieve and allow the LLM some flexibility in helping us reach it.
Appendix:
Stockfish setting:
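The exact settings from the original post are not preserved in this export; an illustrative weak configuration with python-chess might look like the following (all values are assumptions).

```python
# Illustrative weak Stockfish configuration; the exact settings used
# in the game are assumptions, not the article's published values.
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
engine.configure({"Skill Level": 0, "UCI_LimitStrength": True, "UCI_Elo": 1320})

# Cap search depth and time so the engine plays weakly.
weak_limit = chess.engine.Limit(depth=1, time=0.05)
```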
Game: https://www.chess.com/analysis/game/pgn/e4xEpSeMU?tab=analysis
Author
Jason Yip
Director of Data and AI, Tredence Inc.