TensorZero supports inference strategies more sophisticated than a single chat completion. Among others, it supports best-of-n sampling and mixture-of-n sampling, which sample multiple candidate responses concurrently and then use an additional LLM call to select or combine them into a final answer. See Inference-Time Optimizations in our documentation for more details.
We'll show that we can improve the performance of an LLM system by increasing the compute budget without changing the prompt or model used.
Chess puzzles are tactical challenges designed to test and improve a player's chess skills. They typically present a specific board position where the player must find the best move or sequence of moves to achieve a particular goal, such as checkmate, gaining material advantage, or forcing a draw. These puzzles are not only excellent training tools for chess players of all levels but also serve as an engaging way to assess an AI's ability to understand and apply chess strategies in complex situations.
This example uses a dataset of chess puzzles from Kaggle. After filtering for popular puzzles that were rated to be relatively easy (800-1200 rating), we produced training and testing datasets with ~130k and ~15k puzzles respectively.
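As a rough illustration of that preprocessing (a hedged sketch: the column names follow the Lichess puzzle dump, e.g. `Rating` and `Popularity`, and the file name, popularity threshold, and split ratio are placeholders rather than what the repository actually uses):

```python
import pandas as pd

# Load the raw puzzle dump (placeholder filename).
puzzles = pd.read_csv("lichess_db_puzzle.csv")

# Keep puzzles that are relatively easy (800-1200 rating) and reasonably popular.
easy = puzzles[
    puzzles["Rating"].between(800, 1200)
    & (puzzles["Popularity"] >= 90)  # placeholder popularity threshold
]

# Hold out a test split for later evaluation.
test = easy.sample(frac=0.1, random_state=0)
train = easy.drop(test.index)
print(len(train), len(test))
```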
Each puzzle consists of a board position and a sequence of moves that solves it. For example, here's a sample puzzle:
The solution consists of a sequence of moves that are "forced": at each step, the player or their opponent would be significantly disadvantaged by playing anything other than the expected move. We can therefore evaluate an LLM by whether its moves exactly match the puzzle solution (or by whether it manages to deliver a checkmate). Give this puzzle a try!
Solution
- B3f7 G6f7
- G5f7 G8f7
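To make the scoring concrete, here is a minimal sketch of that exact-match check using the python-chess library (an assumption on our part: the notebook's actual evaluation code may differ, and the SAN-formatted arguments are illustrative):

```python
import chess

def move_is_correct(fen: str, predicted_san: str, solution_san: str) -> bool:
    """Score one ply of a puzzle: the prediction counts if it matches the
    solution move exactly, or if it delivers checkmate on its own."""
    board = chess.Board(fen)
    try:
        predicted = board.parse_san(predicted_san)  # rejects illegal/unparseable moves
    except ValueError:
        return False
    if predicted == board.parse_san(solution_san):
        return True
    board.push(predicted)
    return board.is_checkmate()  # an alternative mate also counts as a success
```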
In `config/tensorzero.toml`, we define a `play_chess_board` function that takes a chess board position and returns a sequence of moves that solves it.
The function includes a baseline variant using GPT-4o Mini (OpenAI), as well as best-of-5 and mixture-of-5 variants that each query the baseline variant five times and combine the results using a final LLM call.
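A configuration along these lines could look roughly like the sketch below (hedged: the variant names, template paths, and evaluator/fuser settings are illustrative and may not match the repository's actual `config/tensorzero.toml`):

```toml
[functions.play_chess_board]
type = "chat"

# Baseline: a single chat completion with GPT-4o Mini.
[functions.play_chess_board.variants.baseline]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_template = "functions/play_chess_board/baseline/system_template.minijinja"  # illustrative path

# Best-of-5: sample the baseline five times, then ask an LLM judge to pick the best candidate.
[functions.play_chess_board.variants.best_of_5]
type = "experimental_best_of_n_sampling"
candidates = ["baseline", "baseline", "baseline", "baseline", "baseline"]

[functions.play_chess_board.variants.best_of_5.evaluator]
model = "openai::gpt-4o-mini"

# Mixture-of-5: sample the baseline five times, then ask an LLM to fuse the candidates into one answer.
[functions.play_chess_board.variants.mixture_of_5]
type = "experimental_mixture_of_n_sampling"
candidates = ["baseline", "baseline", "baseline", "baseline", "baseline"]

[functions.play_chess_board.variants.mixture_of_5.fuser]
model = "openai::gpt-4o-mini"
```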
- Install Docker.
- Install Python 3.9+.
- Install the Python dependencies. We recommend using `uv`: `uv sync`
- Generate an API key for OpenAI (`OPENAI_API_KEY`).
- Create a `.env` file with the `OPENAI_API_KEY` environment variable (see `.env.example` for an example).
- Run `docker compose up` to launch the TensorZero Gateway, the TensorZero UI, and a development ClickHouse database.
- Run the `chess_puzzles.ipynb` Jupyter notebook.
The notebook will evaluate the performance of the `baseline`, `best_of_5`, and `mixture_of_5` variants on the training set of chess puzzles.
Without tuning the prompt or model used, we can spend more inference tokens to boost performance. We see a ~10% relative improvement in success rate.
You might want to try diverse instructions for each LLM call. We've included several other prompt templates in the `config/functions/play_chess_board/chess_*` directories. You can also modify the `variant_name` variable to try out different mixture-of-n variants. You can mix and match these in the `tensorzero.toml` file to try out different configurations, as sketched below.
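For instance, a "diverse" mixture variant could point its candidates at variants that use different prompt templates (or even different models). This is only a sketch; the variant and template names below are hypothetical and should be adapted to the variants actually defined in the config:

```toml
# Hypothetical "diverse" mixture: each candidate is a differently prompted variant
# defined elsewhere in this file (they could also use different models).
[functions.play_chess_board.variants.mixture_of_5_diverse]
type = "experimental_mixture_of_n_sampling"
candidates = ["baseline", "chess_master", "chess_engine", "chess_tutor", "chess_analyst"]

[functions.play_chess_board.variants.mixture_of_5_diverse.fuser]
model = "openai::gpt-4o-mini"
```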
We saw the diverse variant squeeze a few extra percentage points of performance from the same model and compute budget.
See for yourself if you can get better performance than our `gpt-4o-mini_mixture_of_5_diverse` example!
For example, you could try mixing candidates that use different LLMs.
You now have a ClickHouse database with a ton of trajectories of LLMs trying to solve chess puzzles.
Consider our library of recipes for ideas on how to use this dataset to further improve performance!
Since that data ends up in ClickHouse, we also include a held-out test set at `data/lichess_easy_puzzles_test.csv` for evaluating variants; use `dryrun=True` for those inferences so they aren't logged to ClickHouse and the test set doesn't leak into future optimization.
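As a rough sketch of such a held-out evaluation call against the gateway's HTTP inference endpoint (assumptions: the gateway is at `localhost:3000` per the Docker Compose setup, and the message content shown is a placeholder for however the notebook actually formats a puzzle; the notebook itself may use the TensorZero Python client instead of raw HTTP):

```python
import requests

# Query a specific variant on a held-out puzzle without logging the inference.
response = requests.post(
    "http://localhost:3000/inference",
    json={
        "function_name": "play_chess_board",
        "variant_name": "mixture_of_5",  # pin the variant under evaluation
        "input": {
            "messages": [
                {"role": "user", "content": "FEN: <position from lichess_easy_puzzles_test.csv>"}
            ]
        },
        "dryrun": True,  # don't record this inference in ClickHouse
    },
)
print(response.json())
```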