Exploration and Exploitation Errors Are Measurable for Language Model Agents

Jaden Park¹,* · Jungtaek Kim¹,* · Jongwon Jeong¹,* · Robert D. Nowak¹ · Kangwook Lee²,³ · Yong Jae Lee¹

¹ University of Wisconsin-Madison · ² KRAFTON · ³ Ludo Robotics
* Equal contribution

This repository provides a deterministic framework for measuring exploration and exploitation behavior from language-model agent trajectories alone. Each episode pairs a partially observed 2D grid map with an unknown symbolic task DAG. The release includes environment generation, policy-agnostic error metrics, heuristic baselines, LLM agents, harness-engineered memory variants, and visualization/export utilities.

What This Framework Exposes

A controllable 3 x 3 evaluation grid over ExplorationLevel x TaskDAGDifficulty
Symbolic task nodes that suppress semantic priors and isolate in-environment reasoning
Per-trajectory metrics for success, exploration_error, and exploitation_error
Built-in heuristic agents, direct LLM agents, and memory/harness variants
Export utilities for PNG, PDF, GIF, *.trajectory.json, and *.llm_trace.jsonl

| Axis | Values | Main effect |
| --- | --- | --- |
| `ExplorationLevel` | `high`, `medium`, `low` | grid footprint, corridor width, and spatial exploration burden |
| `TaskDAGDifficulty` | `easy`, `medium`, `hard` | symbolic dependency depth/branching and exploitation burden |
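As a minimal sketch of what the 3 x 3 grid means in practice, the snippet below enumerates every evaluation cell from the axis values listed in the table. The function names and the `cell_stem` naming scheme are assumptions for illustration (the stem format simply mirrors the `medium-hard-s0007` example used in the Quickstart); they are not part of the package API.

```python
from itertools import product

# Axis values as listed in the table above.
EXPLORATION_LEVELS = ("high", "medium", "low")
TASK_DAG_DIFFICULTIES = ("easy", "medium", "hard")


def evaluation_cells():
    """Return every (exploration_level, task_dag_difficulty) pair in the grid."""
    return list(product(EXPLORATION_LEVELS, TASK_DAG_DIFFICULTIES))


def cell_stem(exploration_level, task_dag_difficulty, seed):
    """Build a hypothetical file stem such as 'medium-hard-s0007'.

    The naming scheme is an assumption, not something the package enforces.
    """
    return f"{exploration_level}-{task_dag_difficulty}-s{seed:04d}"


if __name__ == "__main__":
    for level, difficulty in evaluation_cells():
        print(cell_stem(level, difficulty, seed=0))
```

Running the script prints one stem per cell, nine in total.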

The framework is designed for settings where we want to distinguish failure to discover useful information from failure to use already discovered information. This is the same abstraction that motivates the paper: AI coding, workflow automation, and embodied-style navigation all require both exploring to discover information and exploiting what has already been discovered.

Main Empirical Takeaways

Low exploration error is a strong predictor of success in the current frontier-model sweep (R^2 = 0.947).
Agents with similar success rates can still exhibit qualitatively different behaviors.

On the same episode, Claude Haiku 4.5 and Gemini 3.1 Flash Lite display clearly different exploration and exploitation patterns.
Claude Haiku 4.5 shows moderate exploration and exploitation errors and still reaches a successful trajectory, whereas Gemini 3.1 Flash Lite shows large exploration error and ends in a failed trajectory.
Prompt steering changes failure modes: exploration-focused prompts reduce exploration error, while exploitation-focused prompts reduce exploitation error.
Simple harness engineering helps substantially. In our experiments, GPT-4.1 improves from 63.0% to 92.6% success, and Gemini 3.1 Flash Lite improves from 51.9% to 88.9%.
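The R^2 figure quoted above measures how much of the variation in success is explained by exploration error. The source does not state how the fit was performed; the sketch below assumes an ordinary least-squares line with a single predictor, for which R^2 equals the squared Pearson correlation. The function name and the toy inputs are illustrative placeholders, not results from the paper.

```python
def r_squared(xs, ys):
    """R^2 of a one-predictor least-squares fit (squared Pearson correlation)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either variable is constant")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Toy placeholder values only: a perfectly linear relationship,
    # so the result is numerically close to 1.0.
    toy_exploration_error = [0.1, 0.2, 0.3, 0.4]
    toy_success_rate = [0.9, 0.8, 0.7, 0.6]
    print(r_squared(toy_exploration_error, toy_success_rate))
```

To apply this to real runs, the two input lists would hold one exploration-error value and one success rate per model.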

Quickstart

Install the package:

# editable install
pip install -e .

# standard install
pip install .

Set up API keys for LLM-backed runs:

bash scripts/setup_llm_keys.sh
source ~/.symbolic_environment_llm_keys

Run one LLM-controlled episode:

symbolic-environment simulate \
  --exploration-level high \
  --task-dag-difficulty hard \
  --seed 0 \
  --agent llm_agent \
  --llm-provider azure \
  --llm-model gpt-4.1 \
  --llm-prompt-set reasoning-exploration \
  --gif

Compare multiple LLM agents on the exact same generated episode:

symbolic-environment generate \
  --exploration-level medium \
  --task-dag-difficulty hard \
  --seed 7 \
  --output outputs/medium-hard-s0007.episode.json

symbolic-environment simulate \
  --episode outputs/medium-hard-s0007.episode.json \
  --agent llm_agent \
  --llm-provider azure \
  --llm-model gpt-4.1 \
  --gif

symbolic-environment simulate \
  --episode outputs/medium-hard-s0007.episode.json \
  --agent llm_memory_agent \
  --llm-provider azure \
  --llm-model gpt-4.1 \
  --gif

For the memory-augmented harness variant, use llm_memory_agent. For Azure, --llm-model should be the deployment name.
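When comparing several agents on one saved episode, it can help to build the commands programmatically. The following is a minimal sketch that assembles the `simulate` invocation using only the flags shown above; the helper name `simulate_command` is hypothetical, and the commands are printed rather than executed.

```python
import shlex


def simulate_command(episode_path, agent, provider, model, gif=True):
    """Assemble the argv list for one `symbolic-environment simulate` run.

    Only flags that appear in the Quickstart are used.
    """
    cmd = [
        "symbolic-environment", "simulate",
        "--episode", episode_path,
        "--agent", agent,
        "--llm-provider", provider,
        "--llm-model", model,
    ]
    if gif:
        cmd.append("--gif")
    return cmd


if __name__ == "__main__":
    episode = "outputs/medium-hard-s0007.episode.json"
    for agent in ("llm_agent", "llm_memory_agent"):
        cmd = simulate_command(episode, agent, "azure", "gpt-4.1")
        # Print the shell-quoted command; to execute it instead, pass `cmd`
        # to subprocess.run(cmd, check=True) with the package installed.
        print(shlex.join(cmd))
```

Because both runs read the same episode file, any difference in the resulting metrics comes from the agent rather than from the generated environment.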

Reproducing The Standard Evaluation Suite

The standard generated evaluation suite is the full 3 x 3 matrix over exploration level and task-DAG difficulty.

bash scripts/run_all_llm.sh --num-seeds 3
python scripts/analyze_all_results.py

scripts/run_all_llm.sh auto-sources ~/.symbolic_environment_llm_keys when that file exists. The analysis script summarizes success rate, average steps, exploration error rate, and exploitation error rate for each evaluation cell.
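The per-cell summary can be pictured as a simple group-and-average step. The sketch below is not the code in `scripts/analyze_all_results.py`; it assumes each run has been loaded into a dict, and every key name (`exploration_level`, `task_dag_difficulty`, `success`, `steps`, `exploration_error`, `exploitation_error`) is a hypothetical schema chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_cell(records):
    """Average run-level metrics within each (exploration, difficulty) cell.

    `records` is a list of dicts using the hypothetical keys described above.
    """
    groups = defaultdict(list)
    for record in records:
        cell = (record["exploration_level"], record["task_dag_difficulty"])
        groups[cell].append(record)

    summary = {}
    for cell, runs in groups.items():
        summary[cell] = {
            "num_runs": len(runs),
            "success_rate": mean(1.0 if r["success"] else 0.0 for r in runs),
            "avg_steps": mean(r["steps"] for r in runs),
            "exploration_error": mean(r["exploration_error"] for r in runs),
            "exploitation_error": mean(r["exploitation_error"] for r in runs),
        }
    return summary


if __name__ == "__main__":
    # Toy placeholder records, not real results.
    toy_records = [
        {"exploration_level": "high", "task_dag_difficulty": "hard",
         "success": True, "steps": 10,
         "exploration_error": 0.2, "exploitation_error": 0.0},
        {"exploration_level": "high", "task_dag_difficulty": "hard",
         "success": False, "steps": 20,
         "exploration_error": 0.4, "exploitation_error": 0.5},
    ]
    for cell, stats in summarize_by_cell(toy_records).items():
        print(cell, stats)
```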

Sample Trajectories

All examples below use exploration=high, task_dag=hard, and seed=2.

GIF gallery (animations not reproduced here), one trajectory per model: Claude Opus 4.6, Claude Haiku 4.5, Gemini 3.1 Pro, Gemini 3.1 Flash Lite, GPT-5.4, GPT-4.1 Nano.

Output Artifacts

Each run writes a common stem under outputs/:

*.episode.json: generated environment instance
*.trajectory.json: action and observation trajectory
*.metrics.json: success and exploration/exploitation summaries
*.llm_trace.jsonl: model-facing prompt/response traces for LLM agents
*.png, *.pdf, *.gif: visualizations
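Since all artifacts of a run share one stem, they can be collected by stripping the known suffixes. The following is a minimal sketch under that assumption; the helper names are hypothetical, and the JSONL reader makes no assumption about the fields inside each trace record.

```python
import json
from collections import defaultdict
from pathlib import Path

# Suffixes taken from the artifact list above.
ARTIFACT_SUFFIXES = (
    ".episode.json", ".trajectory.json", ".metrics.json",
    ".llm_trace.jsonl", ".png", ".pdf", ".gif",
)


def split_stem(filename):
    """Return (stem, suffix) for a known artifact name, else None."""
    for suffix in ARTIFACT_SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)], suffix
    return None


def group_by_stem(output_dir):
    """Map each run stem to a {suffix: path} dict of its artifacts."""
    groups = defaultdict(dict)
    for path in sorted(Path(output_dir).iterdir()):
        parts = split_stem(path.name)
        if parts is not None:
            stem, suffix = parts
            groups[stem][suffix] = path
    return dict(groups)


def read_jsonl(path):
    """Load a JSON Lines file (one JSON object per non-empty line)."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    if Path("outputs").is_dir():
        for stem, files in group_by_stem("outputs").items():
            print(stem, sorted(files))
```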

Repository Layout

assets/: README figures and sampled media used for documentation
src/symbolic_environment/: environment, generation, metrics, agents, prompts, visualization
scripts/: evaluation runners and summary utilities
tests/: regression and unit tests for the environment, agents, metrics, and CLI

Citation

If you find this work useful, please cite our paper:

@article{park2026exploration,
  title={Exploration and Exploitation Errors Are Measurable for Language Model Agents}, 
  author={Jaden Park and Jungtaek Kim and Jongwon Jeong and Robert D. Nowak and Kangwook Lee and Yong Jae Lee},
  journal={arXiv preprint arXiv:2604.13151},
  year={2026}
}

License

This repository is released under the MIT License.
