Kioberry/PokerBench

Instruction-Following Under Uncertainty: Small LLMs for Poker Decision-Making


Overview

This project investigates whether small language models (SLMs, <4B parameters) can learn to make Game Theory Optimal (GTO) poker decisions through parameter-efficient fine-tuning. Using the PokerBench benchmark, we evaluate zero-shot and fine-tuned performance across four model families (Qwen3, Gemma-3, LLaMA-3.2, and LLaMA-3) and compare them against larger closed-source APIs (GPT-4o, Gemini 2.0 Flash, Claude 4.0 Haiku). Fine-tuning uses QLoRA: the base model is quantized to 4 bits and kept frozen, and only the LoRA adapters injected into the attention and MLP layers are trained.
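To make the adapter idea concrete, here is a minimal pure-Python sketch of a single LoRA-augmented linear layer. It is an illustration only, not the training code in this repository: the class name `LoRALinear`, the rank, the `alpha` scaling, and the initialization are assumptions, and the 4-bit quantization of the frozen weights is left out entirely.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical illustration: y = W x + (alpha / r) * B (A x).

    W is the frozen base weight; only A and B would be updated in training.
    """

    def __init__(self, W, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen base weight, never updated
        # A starts with small random values, B starts at zero, so the
        # adapter contributes nothing before training begins.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # r * d_in for A plus d_out * r for B
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy usage: an 8x8 identity as the frozen weight, rank-2 adapter.
W = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(W, rank=2, alpha=4)
x = [float(i) for i in range(8)]
y = layer.forward(x)
```

In this toy case the adapter adds 32 trainable values next to the 64 frozen ones. For realistic layer widths, the gap between `r * (d_in + d_out)` and `d_in * d_out` is much larger, which is what makes the approach parameter-efficient.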


Research Questions

  1. Can fine-tuned Small LMs (<4B) develop strong instruction-following and strategic reasoning, approaching the performance of larger models?
  2. What efficiency–performance trade-offs emerge when applying parameter-efficient fine-tuning (e.g., QLoRA) to these small models?

Dataset

We use PokerBench (Zhuang et al., 2025), an open-source poker decision-making benchmark of GTO-labeled instruction–action pairs. Each sample contains a natural-language poker scenario as the instruction and a GTO action (e.g., bet 18, raise 89, fold) as the output.
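As a rough illustration of the sample format described above, the sketch below splits an output string such as `bet 18` or `fold` into an action type and an optional amount. The dictionary keys `instruction` and `output`, the `Action` type, and `parse_action` are hypothetical names chosen for this example; the actual field names and parsing logic in the dataset loaders may differ.

```python
from typing import NamedTuple, Optional


class Action(NamedTuple):
    kind: str                 # e.g. "bet", "raise", "fold"
    amount: Optional[float]   # None for actions without a size


def parse_action(text: str) -> Action:
    """Split an action string into its type and optional numeric amount."""
    parts = text.strip().lower().split()
    if not parts:
        raise ValueError("empty action string")
    amount = float(parts[1]) if len(parts) > 1 else None
    return Action(parts[0], amount)


# Hypothetical sample layout; the scenario text is a placeholder.
sample = {
    "instruction": "<natural-language poker scenario>",
    "output": "raise 89",
}
parsed = parse_action(sample["output"])
```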

| Data Type              | Pre-flop Spots | Post-flop Spots |
|------------------------|----------------|-----------------|
| Evaluation (Benchmark) | 1,000          | 10,000          |
| Training (Full)        | 60,000         | 500,000         |

Results

All models are evaluated on two metrics:

  • AA (Action Accuracy): the predicted action type matches the ground-truth action type
  • EM (Exact Match): the predicted action type and amount both match the ground truth exactly
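The two metrics can be sketched as follows, assuming predictions and references have already been parsed into `(action_type, amount)` tuples, with `None` as the amount for actions such as fold. The function name `score` and the tuple layout are assumptions made for this example; the repository's own implementation lives in `src/metrics.py` and may handle normalization differently.

```python
def score(predictions, references):
    """Return AA and EM over paired (action_type, amount) tuples."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")

    aa_hits = 0
    em_hits = 0
    for (pred_type, pred_amount), (ref_type, ref_amount) in zip(predictions, references):
        if pred_type == ref_type:
            aa_hits += 1              # action type is correct
            if pred_amount == ref_amount:
                em_hits += 1          # type and amount are both correct

    n = len(references)
    return {"AA": aa_hits / n, "EM": em_hits / n}


references = [("bet", 18), ("raise", 89), ("fold", None), ("call", None)]
predictions = [("bet", 18), ("raise", 60), ("fold", None), ("check", None)]
result = score(predictions, references)
```

Under this definition EM can never exceed AA, since an exact match requires the action type to be correct first.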

Zero-shot: Large LLMs (via API)

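| Model                 | Overall AA | Overall EM | Preflop AA | Preflop EM | Postflop AA | Postflop EM |
|-----------------------|------------|------------|------------|------------|-------------|-------------|
| GPT-4o                | 0.647      | 0.579      | 0.648      | 0.576      | 0.460       | 0.320       |
| Claude 4.0 Haiku      | 0.479      | 0.432      | 0.745      | 0.593      | 0.453       | 0.416       |
| Gemini 2.0 Flash      | 0.723      | 0.613      | 0.726      | 0.613      | 0.722       | 0.610       |
| LLaMA-3.1-8B-Instruct | 0.275      | 0.194      | 0.360      | 0.360      | 0.267       | 0.178       |
| LLaMA-2-13B-chat-hf   | 0.309      | 0.265      | 0.227      | 0.136      | 0.317       | 0.278       |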

Zero-shot: Small LLMs (local inference)

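| Model                 | Overall AA | Overall EM | Preflop AA | Preflop EM | Postflop AA | Postflop EM |
|-----------------------|------------|------------|------------|------------|-------------|-------------|
| Qwen3-0.6B            | 0.246      | 0.239      | 0.276      | 0.276      | 0.216       | 0.201       |
| Qwen3-1.7B            | 0.166      | 0.009      | 0.250      | 0.016      | 0.082       | 0.002       |
| Gemma2-2B             | 0.246      | 0.116      | 0.233      | 0.105      | 0.258       | 0.126       |
| LLaMA-3.2-1B-Instruct | 0.227      | 0.222      | 0.225      | 0.225      | 0.229       | 0.219       |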

Fine-tuned SLMs (QLoRA, 18k steps, lr=5e-5)

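| Model                 | Overall AA | Overall EM | Preflop AA | Preflop EM | Postflop AA | Postflop EM |
|-----------------------|------------|------------|------------|------------|-------------|-------------|
| Qwen3-0.6B            | 0.842      | 0.829      | 0.909      | 0.881      | 0.836       | 0.824       |
| Qwen3-1.7B            | 0.835      | 0.822      | 0.899      | 0.868      | 0.829       | 0.817       |
| Gemma-3-1B-IT         | 0.790      | 0.787      | 0.822      | 0.816      | 0.788       | 0.784       |
| LLaMA-3.2-1B-Instruct | 0.830      | 0.825      | 0.900      | 0.891      | 0.824       | 0.819       |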

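For the three models evaluated in both settings (Qwen3-0.6B, Qwen3-1.7B, and LLaMA-3.2-1B-Instruct), QLoRA fine-tuning raises overall AA by roughly 3.4× to 5× and overall EM by at least 3.4× over the zero-shot baseline. All four fine-tuned SLMs also exceed every zero-shot baseline above in overall AA and EM, including the best zero-shot result (Gemini 2.0 Flash, 0.723 AA) and LLaMA-3.1-8B, which has far more parameters.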
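The improvement factors can be recomputed directly from the overall AA and EM columns of the tables above. The snippet below is a small arithmetic sketch covering only the models that appear in both the zero-shot and fine-tuned tables; the variable names are illustrative and not part of the repository.

```python
# (overall AA, overall EM), copied from the zero-shot and fine-tuned tables
zero_shot = {
    "Qwen3-0.6B": (0.246, 0.239),
    "Qwen3-1.7B": (0.166, 0.009),
    "LLaMA-3.2-1B-Instruct": (0.227, 0.222),
}
fine_tuned = {
    "Qwen3-0.6B": (0.842, 0.829),
    "Qwen3-1.7B": (0.835, 0.822),
    "LLaMA-3.2-1B-Instruct": (0.830, 0.825),
}

# Ratio of fine-tuned to zero-shot score, per model and per metric
ratios = {}
for name, (zs_aa, zs_em) in zero_shot.items():
    ft_aa, ft_em = fine_tuned[name]
    ratios[name] = (ft_aa / zs_aa, ft_em / zs_em)

for name, (aa_ratio, em_ratio) in ratios.items():
    print(f"{name}: AA x{aa_ratio:.2f}, EM x{em_ratio:.2f}")
```

The EM ratio for Qwen3-1.7B is dominated by its near-zero zero-shot EM (0.009), so it is better read as a sign that the zero-shot model rarely produced a correctly sized action than as a meaningful multiplier.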


Project Structure

pokerbench-repo/
│
├── evaluation/
│   ├── zero_shot_slm/           # Local zero-shot evaluation of small models
│   │   ├── evaluate_llama.py    # LLaMA-3.2-1B/3B, LLaMA-3-8B
│   │   └── evaluate_qwen_gemma.py  # Qwen3-0.6B, Qwen3-1.7B, Gemma2-2B
│   └── zero_shot_llm/           # API-based evaluation of large models
│       ├── infer_openai.py      # GPT-4o
│       ├── infer_gemini.py      # Gemini 2.0 Flash
│       ├── Baseline_claude_4.0_Haiku.ipynb
│       ├── Baseline_llama3_1_8b.ipynb
│       └── Baseline_llama2_13b.ipynb
│
├── finetuning/
│   ├── FineTuning_Gemma-3-1B-IT.ipynb   # Gemma-3-1B fine-tuning
│   ├── FineTuning_Llama3.2_1B.ipynb     # LLaMA-3.2-1B fine-tuning
│   └── qwen/                             # Qwen3-0.6B and Qwen3-1.7B fine-tuning
│       ├── train.py                      # Training entry point
│       ├── test.py                       # Evaluation entry point
│       ├── models/                       # Qwen model wrappers
│       ├── poker_datasets/               # Dataset loading and preprocessing
│       └── src/                          # Training pipeline and config
│
├── simulation/
│   ├── head2head_simulator.py   # Head-to-head model comparison simulator
│   └── SIMULATION_USAGE.md
│
├── src/                         # Shared library
│   ├── clients/
│   │   ├── openai_client.py
│   │   └── gemini_client.py
│   ├── metrics.py               # AA and EM evaluation metrics
│   ├── pipeline.py              # End-to-end evaluation pipeline
│   ├── prompts.py               # Prompt templates
│   └── train_config.py          # Shared training configuration
│
├── data/
│   └── results/zero_shot/llama/ # Saved zero-shot evaluation results (JSON)
│
└── utils/
    └── __init__.py              # Re-exports from src.metrics for compatibility

References

  • Zhuang, R., et al. (2025). PokerBench: Training large language models to become professional poker players. arXiv:2501.08328.
  • Dettmers, T., et al. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv:2305.14314.
  • Hu, E. J., et al. (2021). LoRA: Low-rank adaptation of large language models. arXiv:2106.09685.
  • Yang, A., et al. (2025). Qwen3 technical report. arXiv:2505.09388.
  • Grattafiori, A., et al. (2024). The Llama 3 herd of models. arXiv:2407.21783.
  • Gemma Team, et al. (2025). Gemma 3 technical report. arXiv:2503.19786.
  • Brown, N., & Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science, 365(6456), 885–890.
