This project investigates whether small language models (SLMs, <4B parameters) can learn to make Game Theory Optimal (GTO) poker decisions through parameter-efficient fine-tuning. Using the PokerBench benchmark, we evaluate zero-shot and fine-tuned performance across four model families (Qwen3, Gemma-3, LLaMA-3.2, and LLaMA-3) and compare them against larger models: closed-source APIs (GPT-4o, Gemini 2.0 Flash, Claude 4.0 Haiku) and open-weight baselines (LLaMA-3.1-8B-Instruct, LLaMA-2-13B-chat-hf). Fine-tuning uses QLoRA: the base model is quantized to 4 bits and kept frozen, while trainable LoRA adapters are injected into the attention and MLP layers.
- Can fine-tuned Small LMs (<4B) develop strong instruction-following and strategic reasoning, approaching the performance of larger models?
- What efficiency–performance trade-offs emerge when applying parameter-efficient fine-tuning (e.g., QLoRA) to these small models?
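The QLoRA setup described above (a frozen 4-bit base model with LoRA adapters in the attention and MLP layers) could be configured roughly as follows with Hugging Face `transformers`, `peft`, and `bitsandbytes`. This is a minimal sketch, not the project's actual training code: the model id, rank, alpha, dropout, and projection-module names are assumptions (the module names follow the usual LLaMA/Qwen layer naming), and `build_qlora_model` is a hypothetical helper.

```python
"""Sketch of a QLoRA setup; every hyperparameter below is an illustrative assumption."""

# Projection layers that receive LoRA adapters (assumed LLaMA/Qwen-style names).
ATTENTION_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_MODULES = ["gate_proj", "up_proj", "down_proj"]
TARGET_MODULES = ATTENTION_MODULES + MLP_MODULES

# Assumed LoRA hyperparameters, not the values used in this project.
LORA_RANK = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05


def build_qlora_model(model_id: str = "Qwen/Qwen3-0.6B"):
    """Load a 4-bit quantized base model and attach trainable LoRA adapters."""
    # Heavy dependencies are imported lazily so the constants above can be
    # inspected on a machine without a GPU stack installed.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit NF4 quantization of the base weights.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    base = prepare_model_for_kbit_training(base)

    # Only the low-rank adapter weights are trainable; the base stays frozen.
    lora_config = LoraConfig(
        r=LORA_RANK,
        lora_alpha=LORA_ALPHA,
        lora_dropout=LORA_DROPOUT,
        bias="none",
        target_modules=TARGET_MODULES,
        task_type="CAUSAL_LM",
    )
    return get_peft_model(base, lora_config)
```

On a machine with the dependencies installed, calling `print_trainable_parameters()` on the returned model reports what fraction of the weights is actually being trained.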
We use PokerBench (Zhuang et al., 2025), an open-source poker decision-making benchmark of GTO-labeled instruction–action pairs. Each sample contains a natural-language poker scenario as the instruction and a GTO action (e.g., bet 18, raise 89, fold) as the output.
| Data Type | Pre-flop Spots | Post-flop Spots |
|---|---|---|
| Evaluation (Benchmark) | 1,000 | 10,000 |
| Training (Full) | 60,000 | 500,000 |
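Because each label is a short action string such as `bet 18`, `raise 89`, or `fold`, it helps to turn it into a structured value before scoring. The sketch below is one possible parser; the `Action` class and `parse_action` function are hypothetical names, and the assumption that the amount is always the second whitespace-separated token is ours, not a documented PokerBench guarantee.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    kind: str                # action type, e.g. "fold", "bet", "raise"
    amount: Optional[float]  # None for actions that carry no size


def parse_action(text: str) -> Action:
    """Parse a label such as 'bet 18' or 'fold' into a structured Action."""
    tokens = text.strip().lower().split()
    if not tokens:
        raise ValueError("empty action string")
    kind = tokens[0]
    amount = None
    if len(tokens) > 1:
        try:
            amount = float(tokens[1])
        except ValueError:
            amount = None  # a non-numeric second token is treated as "no amount"
    return Action(kind, amount)


if __name__ == "__main__":
    for label in ["bet 18", "raise 89", "fold"]:
        print(parse_action(label))
```

Lower-casing and stripping the text first makes the parser tolerant of the formatting noise that free-form model output tends to contain.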
All models are evaluated on two metrics:
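- AA (Action Accuracy): fraction of samples where the predicted action type (e.g., fold, bet, raise) matches the ground-truth type
- EM (Exact Match): fraction of samples where both the predicted action type and the amount match the ground truth exactly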
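The two metrics can be sketched in a few lines. This is an illustration of the definitions above, not the code in `src/metrics.py`: it assumes predictions and labels have already been parsed into `(action_type, amount)` tuples, with `None` as the amount for actions without a size, and the function name `score` is hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

ParsedAction = Tuple[str, Optional[float]]  # (action type, amount or None)


def score(
    predictions: Sequence[ParsedAction], references: Sequence[ParsedAction]
) -> Dict[str, float]:
    """Compute Action Accuracy (AA) and Exact Match (EM) over paired samples."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")

    aa_hits = 0
    em_hits = 0
    for (pred_kind, pred_amount), (ref_kind, ref_amount) in zip(predictions, references):
        if pred_kind == ref_kind:
            aa_hits += 1  # action type matches
            if pred_amount == ref_amount:
                em_hits += 1  # type and amount both match
    total = len(references)
    return {"AA": aa_hits / total, "EM": em_hits / total}


if __name__ == "__main__":
    preds = [("bet", 18.0), ("raise", 50.0), ("fold", None), ("call", None)]
    refs = [("bet", 18.0), ("raise", 89.0), ("fold", None), ("fold", None)]
    print(score(preds, refs))
```

Since an exact match requires the action type to match first, EM can never exceed AA for the same set of predictions.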

Zero-shot results for the larger baseline models:

| Model | Overall AA | Overall EM | Preflop AA | Preflop EM | Postflop AA | Postflop EM |
|---|---|---|---|---|---|---|
| GPT-4o | 0.647 | 0.579 | 0.648 | 0.576 | 0.460 | 0.320 |
| Claude 4.0 Haiku | 0.479 | 0.432 | 0.745 | 0.593 | 0.453 | 0.416 |
| Gemini 2.0 Flash | 0.723 | 0.613 | 0.726 | 0.613 | 0.722 | 0.610 |
| LLaMA-3.1-8B-Instruct | 0.275 | 0.194 | 0.360 | 0.360 | 0.267 | 0.178 |
| LLaMA-2-13B-chat-hf | 0.309 | 0.265 | 0.227 | 0.136 | 0.317 | 0.278 |

Zero-shot results for the small models:

| Model | Overall AA | Overall EM | Preflop AA | Preflop EM | Postflop AA | Postflop EM |
|---|---|---|---|---|---|---|
| Qwen3-0.6B | 0.246 | 0.239 | 0.276 | 0.276 | 0.216 | 0.201 |
| Qwen3-1.7B | 0.166 | 0.009 | 0.250 | 0.016 | 0.082 | 0.002 |
| Gemma2-2B | 0.246 | 0.116 | 0.233 | 0.105 | 0.258 | 0.126 |
| LLaMA-3.2-1B-Instruct | 0.227 | 0.222 | 0.225 | 0.225 | 0.229 | 0.219 |

Fine-tuned (QLoRA) results for the small models:

| Model | Overall AA | Overall EM | Preflop AA | Preflop EM | Postflop AA | Postflop EM |
|---|---|---|---|---|---|---|
| Qwen3-0.6B | 0.842 | 0.829 | 0.909 | 0.881 | 0.836 | 0.824 |
| Qwen3-1.7B | 0.835 | 0.822 | 0.899 | 0.868 | 0.829 | 0.817 |
| Gemma-3-1B-IT | 0.790 | 0.787 | 0.822 | 0.816 | 0.788 | 0.784 |
| LLaMA-3.2-1B-Instruct | 0.830 | 0.825 | 0.900 | 0.891 | 0.824 | 0.819 |
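
Fine-tuning with QLoRA raises overall AA and EM by at least 3.4× over the zero-shot baselines for the three models evaluated in both settings (Qwen3-0.6B, Qwen3-1.7B, LLaMA-3.2-1B-Instruct); for example, Qwen3-0.6B goes from 0.246 to 0.842 overall AA. Every fine-tuned SLM also surpasses the zero-shot LLaMA-3.1-8B-Instruct baseline (0.275 AA) despite having far fewer parameters, and exceeds the best zero-shot model in the comparison, Gemini 2.0 Flash (0.723 AA, 0.613 EM).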
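
The size of the gain can be checked directly from the overall scores in the tables above. The sketch below computes the fine-tuned to zero-shot ratio for the three models that appear in both the zero-shot and fine-tuned tables; Gemma is left out because the two tables list different Gemma models. The helper names are illustrative.

```python
from typing import Dict, Tuple

# (zero-shot, fine-tuned) overall scores copied from the results tables.
OVERALL: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Qwen3-0.6B": {"AA": (0.246, 0.842), "EM": (0.239, 0.829)},
    "Qwen3-1.7B": {"AA": (0.166, 0.835), "EM": (0.009, 0.822)},
    "LLaMA-3.2-1B-Instruct": {"AA": (0.227, 0.830), "EM": (0.222, 0.825)},
}


def improvement(zero_shot: float, fine_tuned: float) -> float:
    """Multiplicative gain of the fine-tuned score over the zero-shot score."""
    if zero_shot <= 0:
        raise ValueError("zero-shot score must be positive to form a ratio")
    return fine_tuned / zero_shot


def improvement_table(
    scores: Dict[str, Dict[str, Tuple[float, float]]]
) -> Dict[str, Dict[str, float]]:
    """Return the gain per model and metric, rounded to two decimals."""
    return {
        model: {metric: round(improvement(*pair), 2) for metric, pair in metrics.items()}
        for model, metrics in scores.items()
    }


if __name__ == "__main__":
    for model, gains in improvement_table(OVERALL).items():
        print(f"{model}: AA x{gains['AA']}, EM x{gains['EM']}")
```

The very large EM ratio for Qwen3-1.7B reflects its near-zero zero-shot EM (0.009) rather than an unusually strong fine-tuned score.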
pokerbench-repo/
│
├── evaluation/
│   ├── zero_shot_slm/                  # Local zero-shot evaluation of small models
│   │   ├── evaluate_llama.py           # LLaMA-3.2-1B/3B, LLaMA-3-8B
│   │   └── evaluate_qwen_gemma.py      # Qwen3-0.6B, Qwen3-1.7B, Gemma2-2B
│   └── zero_shot_llm/                  # API-based evaluation of large models
│       ├── infer_openai.py             # GPT-4o
│       ├── infer_gemini.py             # Gemini 2.0 Flash
│       ├── Baseline_claude_4.0_Haiku.ipynb
│       ├── Baseline_llama3_1_8b.ipynb
│       └── Baseline_llama2_13b.ipynb
│
├── finetuning/
│   ├── FineTuning_Gemma-3-1B-IT.ipynb  # Gemma-3-1B fine-tuning
│   ├── FineTuning_Llama3.2_1B.ipynb    # LLaMA-3.2-1B fine-tuning
│   └── qwen/                           # Qwen3-0.6B and Qwen3-1.7B fine-tuning
│       ├── train.py                    # Training entry point
│       ├── test.py                     # Evaluation entry point
│       ├── models/                     # Qwen model wrappers
│       ├── poker_datasets/             # Dataset loading and preprocessing
│       └── src/                        # Training pipeline and config
│
├── simulation/
│   ├── head2head_simulator.py          # Head-to-head model comparison simulator
│   └── SIMULATION_USAGE.md
│
├── src/                                # Shared library
│   ├── clients/
│   │   ├── openai_client.py
│   │   └── gemini_client.py
│   ├── metrics.py                      # AA and EM evaluation metrics
│   ├── pipeline.py                     # End-to-end evaluation pipeline
│   ├── prompts.py                      # Prompt templates
│   └── train_config.py                 # Shared training configuration
│
├── data/
│   └── results/zero_shot/llama/        # Saved zero-shot evaluation results (JSON)
│
└── utils/
    └── __init__.py                     # Re-exports from src.metrics for compatibility
- Zhuang, R., et al. (2025). PokerBench: Training large language models to become professional poker players. arXiv:2501.08328.
- Dettmers, T., et al. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv:2305.14314.
- Hu, E. J., et al. (2021). LoRA: Low-rank adaptation of large language models. arXiv:2106.09685.
- Yang, A., et al. (2025). Qwen3 technical report. arXiv:2505.09388.
- Grattafiori, A., et al. (2024). The Llama 3 herd of models. arXiv:2407.21783.
- Gemma Team, et al. (2025). Gemma 3 technical report. arXiv:2503.19786.
- Brown, N., & Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science, 365(6456), 885–890.