Instruction-Following Under Uncertainty: Small LLMs for Poker Decision-Making

Overview

This project investigates whether small language models (SLMs, <4B parameters) can learn to make Game Theory Optimal (GTO) poker decisions through parameter-efficient fine-tuning. Using the PokerBench benchmark, we evaluate both zero-shot and fine-tuned performance across four model families (Qwen3, Gemma-3, LLaMA-3.2, and LLaMA-3) and compare them against larger closed-source models accessed via API (GPT-4o, Gemini 2.0 Flash, Claude 4.0 Haiku). Fine-tuning uses QLoRA: the base model is quantized to 4 bits and kept frozen, and trainable LoRA adapters are injected into the attention and MLP layers.
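The QLoRA setup described above can be sketched roughly as follows, using the Hugging Face `transformers` and `peft` libraries. This is a minimal illustration, not the project's actual training code: the function name `build_qlora_model`, the rank, alpha, and dropout values, and the NF4 / double-quantization settings are all assumptions chosen for the example.

```python
# Projection names used by Qwen3, LLaMA-3.x and Gemma-3 decoder blocks in
# Hugging Face transformers: four attention projections, three MLP projections.
ATTENTION_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_MODULES = ["gate_proj", "up_proj", "down_proj"]


def build_qlora_model(model_name, rank=16, alpha=32, dropout=0.05):
    """Load a frozen 4-bit base model and attach trainable LoRA adapters.

    rank, alpha and dropout are illustrative defaults (assumptions),
    not values taken from this project.
    """
    # Heavy dependencies are imported lazily so this file can be imported
    # on a machine that does not have the GPU stack installed.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit quantization of the base weights (NF4 is an assumed choice).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )
    # Prepares the quantized model for training (e.g. casts norm layers).
    base = prepare_model_for_kbit_training(base)

    # LoRA adapters on attention and MLP projections; base weights stay frozen.
    lora_config = LoraConfig(
        r=rank,
        lora_alpha=alpha,
        lora_dropout=dropout,
        target_modules=ATTENTION_MODULES + MLP_MODULES,
        bias="none",
        task_type="CAUSAL_LM",
    )
    return get_peft_model(base, lora_config)
```

Calling `build_qlora_model("Qwen/Qwen3-0.6B")` (an assumed Hub model ID) and then `print_trainable_parameters()` on the result should show that only the adapter weights are trainable.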

Research Questions

Can fine-tuned SLMs (<4B parameters) develop strong instruction-following and strategic reasoning, approaching the performance of larger models?
What efficiency–performance trade-offs emerge when applying parameter-efficient fine-tuning (e.g., QLoRA) to these small models?

Dataset

We use PokerBench (Zhuang et al., 2025), an open-source poker decision-making benchmark of GTO-labeled instruction–action pairs. Each sample contains a natural-language poker scenario as the instruction and a GTO action (e.g., bet 18, raise 89, fold) as the output.
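To make the output format concrete, here is a small sketch of how an action string such as `bet 18` or `fold` could be split into an action type and an optional amount. The `parse_action` helper, the `Action` tuple, and the `instruction` / `output` key names in the sample record are illustrative assumptions, not necessarily what the repository's code uses.

```python
import re
from typing import NamedTuple, Optional


class Action(NamedTuple):
    kind: str                # action type, e.g. "bet", "raise", "fold"
    amount: Optional[float]  # bet/raise size, or None when no amount is given


# One lowercase word, optionally followed by a single number.
_ACTION_RE = re.compile(r"^\s*([a-z]+)(?:\s+(\d+(?:\.\d+)?))?\s*$")


def parse_action(text: str) -> Action:
    """Parse a GTO action string into (kind, amount)."""
    match = _ACTION_RE.match(text.lower())
    if match is None:
        raise ValueError(f"unrecognized action: {text!r}")
    kind, amount = match.groups()
    return Action(kind, float(amount) if amount is not None else None)


# Hypothetical record shape: scenario text in, GTO action out.
sample = {
    "instruction": "<natural-language poker scenario>",
    "output": "raise 89",
}
gold = parse_action(sample["output"])
```

Raising on unparseable text keeps malformed model outputs visible instead of silently counting them as some default action.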

Data Type	Pre-flop Spots	Post-flop Spots
Evaluation (Benchmark)	1,000	10,000
Training (Full)	60,000	500,000

Results

All models are evaluated on two metrics:

AA (Action Accuracy): fraction of samples where the predicted action type matches the ground-truth type
EM (Exact Match): fraction of samples where the predicted action type and amount both match the ground truth exactly
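The two metrics can be sketched as below, assuming each prediction and label has already been parsed into an `(action type, amount)` pair. The function names and the tuple representation are assumptions for illustration; the repository's `src/metrics.py` may be organized differently.

```python
from typing import Optional, Sequence, Tuple

# (action type, amount); amount is None for actions without a size.
Action = Tuple[str, Optional[float]]


def _check(preds: Sequence[Action], golds: Sequence[Action]) -> None:
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")


def action_accuracy(preds: Sequence[Action], golds: Sequence[Action]) -> float:
    """AA: share of samples whose action type matches the label."""
    _check(preds, golds)
    if not golds:
        return 0.0
    hits = sum(p[0] == g[0] for p, g in zip(preds, golds))
    return hits / len(golds)


def exact_match(preds: Sequence[Action], golds: Sequence[Action]) -> float:
    """EM: share of samples whose action type AND amount both match."""
    _check(preds, golds)
    if not golds:
        return 0.0
    hits = sum(p == g for p, g in zip(preds, golds))
    return hits / len(golds)


golds = [("bet", 18.0), ("raise", 89.0), ("fold", None), ("fold", None)]
preds = [("bet", 18.0), ("raise", 60.0), ("fold", None), ("bet", 5.0)]

aa = action_accuracy(preds, golds)  # 3 of 4 types match -> 0.75
em = exact_match(preds, golds)      # 2 of 4 match fully  -> 0.5
```

Because an exact match requires the type to match as well, EM can never exceed AA on the same set of samples, which is consistent with every row in the tables below.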

Zero-shot: Large LLMs (via API)

Model	Overall AA	Overall EM	Preflop AA	Preflop EM	Postflop AA	Postflop EM
GPT-4o	0.647	0.579	0.648	0.576	0.460	0.320
Claude 4.0 Haiku	0.479	0.432	0.745	0.593	0.453	0.416
Gemini 2.0 Flash	0.723	0.613	0.726	0.613	0.722	0.610
LLaMA-3.1-8B-Instruct	0.275	0.194	0.360	0.360	0.267	0.178
LLaMA-2-13B-chat-hf	0.309	0.265	0.227	0.136	0.317	0.278

Zero-shot: Small LLMs (local inference)

Model	Overall AA	Overall EM	Preflop AA	Preflop EM	Postflop AA	Postflop EM
Qwen3-0.6B	0.246	0.239	0.276	0.276	0.216	0.201
Qwen3-1.7B	0.166	0.009	0.250	0.016	0.082	0.002
Gemma2-2B	0.246	0.116	0.233	0.105	0.258	0.126
LLaMA-3.2-1B-Instruct	0.227	0.222	0.225	0.225	0.229	0.219

Fine-tuned SLMs (QLoRA, 18k steps, lr=5e-5)

Model	Overall AA	Overall EM	Preflop AA	Preflop EM	Postflop AA	Postflop EM
Qwen3-0.6B	0.842	0.829	0.909	0.881	0.836	0.824
Qwen3-1.7B	0.835	0.822	0.899	0.868	0.829	0.817
Gemma-3-1B-IT	0.790	0.787	0.822	0.816	0.788	0.784
LLaMA-3.2-1B-Instruct	0.830	0.825	0.900	0.891	0.824	0.819

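For the three models evaluated in both settings (Qwen3-0.6B, Qwen3-1.7B, LLaMA-3.2-1B-Instruct), QLoRA fine-tuning raises overall AA by roughly 3.4× to 5× and overall EM by roughly 3.5× or more over the zero-shot baseline. Gemma-3-1B-IT has no matching zero-shot row, since the zero-shot table reports Gemma2-2B. Every fine-tuned SLM also exceeds every zero-shot model reported above in overall AA and EM, including Gemini 2.0 Flash (0.723 AA, 0.613 EM) and LLaMA-3.1-8B-Instruct, despite having far fewer parameters than the 8B model.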

Project Structure

pokerbench-repo/
│
├── evaluation/
│   ├── zero_shot_slm/           # Local zero-shot evaluation of small models
│   │   ├── evaluate_llama.py    # LLaMA-3.2-1B/3B, LLaMA-3-8B
│   │   └── evaluate_qwen_gemma.py  # Qwen3-0.6B, Qwen3-1.7B, Gemma2-2B
│   └── zero_shot_llm/           # API-based evaluation of large models
│       ├── infer_openai.py      # GPT-4o
│       ├── infer_gemini.py      # Gemini 2.0 Flash
│       ├── Baseline_claude_4.0_Haiku.ipynb
│       ├── Baseline_llama3_1_8b.ipynb
│       └── Baseline_llama2_13b.ipynb
│
├── finetuning/
│   ├── FineTuning_Gemma-3-1B-IT.ipynb   # Gemma-3-1B fine-tuning
│   ├── FineTuning_Llama3.2_1B.ipynb     # LLaMA-3.2-1B fine-tuning
│   └── qwen/                             # Qwen3-0.6B and Qwen3-1.7B fine-tuning
│       ├── train.py                      # Training entry point
│       ├── test.py                       # Evaluation entry point
│       ├── models/                       # Qwen model wrappers
│       ├── poker_datasets/               # Dataset loading and preprocessing
│       └── src/                          # Training pipeline and config
│
├── simulation/
│   ├── head2head_simulator.py   # Head-to-head model comparison simulator
│   └── SIMULATION_USAGE.md
│
├── src/                         # Shared library
│   ├── clients/
│   │   ├── openai_client.py
│   │   └── gemini_client.py
│   ├── metrics.py               # AA and EM evaluation metrics
│   ├── pipeline.py              # End-to-end evaluation pipeline
│   ├── prompts.py               # Prompt templates
│   └── train_config.py          # Shared training configuration
│
├── data/
│   └── results/zero_shot/llama/ # Saved zero-shot evaluation results (JSON)
│
└── utils/
    └── __init__.py              # Re-exports from src.metrics for compatibility

References

Zhuang, R., et al. (2025). PokerBench: Training large language models to become professional poker players. arXiv:2501.08328.
Dettmers, T., et al. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv:2305.14314.
Hu, E. J., et al. (2021). LoRA: Low-rank adaptation of large language models. arXiv:2106.09685.
Yang, A., et al. (2025). Qwen3 technical report. arXiv:2505.09388.
Grattafiori, A., et al. (2024). The Llama 3 herd of models. arXiv:2407.21783.
Gemma Team, et al. (2025). Gemma 3 technical report. arXiv:2503.19786.
Brown, N., & Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science, 365(6456), 885–890.
