Jaden Park1,* · Jungtaek Kim1,* · Jongwon Jeong1,* · Robert D. Nowak1 · Kangwook Lee2,3 · Yong Jae Lee1
1 University of Wisconsin-Madison · 2 KRAFTON · 3 Ludo Robotics
* Equal Contribution
🏛️ arXiv · 📄 Project Page · 💻 GitHub
This repository provides a deterministic framework for measuring exploration and exploitation behavior from language-model agent trajectories alone. Each episode pairs a partially observed 2D grid map with an unknown symbolic task DAG. The release includes environment generation, policy-agnostic error metrics, heuristic baselines, LLM agents, harness-engineered memory variants, and visualization/export utilities.
- A controllable `3 x 3` evaluation grid over `ExplorationLevel x TaskDAGDifficulty`
- Symbolic task nodes that suppress semantic priors and isolate in-environment reasoning
- Per-trajectory metrics for `success`, `exploration_error`, and `exploitation_error`
- Built-in heuristic agents, direct LLM agents, and memory/harness variants
- Export utilities for `PNG`, `PDF`, `GIF`, `*.trajectory.json`, and `*.llm_trace.jsonl`
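
To make the idea of a symbolic task DAG concrete, here is a minimal sketch using Python's standard `graphlib`. The node names (`n0` to `n3`), the dict layout, and the `available` helper are illustrative assumptions, not the repository's actual representation. Opaque node names carry no semantic hints, so an agent can only learn the dependency structure from the environment.

```python
from graphlib import TopologicalSorter

# Hypothetical task DAG: each node maps to the set of nodes that must be
# completed first. The names are deliberately meaningless symbols.
task_dag = {
    "n0": set(),
    "n1": {"n0"},
    "n2": {"n0"},
    "n3": {"n1", "n2"},
}

def available(dag, done):
    """Return the nodes whose prerequisites are all in `done` (illustrative helper)."""
    return sorted(n for n, prereqs in dag.items() if n not in done and prereqs <= done)

# One valid completion order that respects every dependency.
order = list(TopologicalSorter(task_dag).static_order())

print(available(task_dag, set()))
print(available(task_dag, {"n0"}))
print(order)
```

In this toy graph, `n3` only becomes available once both `n1` and `n2` are done, which is the kind of dependency an agent has to exploit after discovering it.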

| Axis | Values | Main effect |
|---|---|---|
| `ExplorationLevel` | `high`, `medium`, `low` | grid footprint, corridor width, and spatial exploration burden |
| `TaskDAGDifficulty` | `easy`, `medium`, `hard` | symbolic dependency depth/branching and exploitation burden |

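
The two axes above combine into nine evaluation cells. A minimal sketch of enumerating them follows; the constant and function names are hypothetical, while the axis values and the two CLI flags are the ones shown in this README.

```python
from itertools import product

# Axis values as listed in the table above.
EXPLORATION_LEVELS = ["high", "medium", "low"]
TASK_DAG_DIFFICULTIES = ["easy", "medium", "hard"]

def evaluation_cells():
    """Return every (exploration level, task-DAG difficulty) pair of the 3 x 3 grid."""
    return list(product(EXPLORATION_LEVELS, TASK_DAG_DIFFICULTIES))

cells = evaluation_cells()
for exploration_level, difficulty in cells:
    # Print the flag pair that would select this cell on the command line.
    print(f"--exploration-level {exploration_level} --task-dag-difficulty {difficulty}")
```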
The framework is designed for settings where we want to distinguish failure to discover useful information from failure to use already discovered information. This is the same abstraction that motivates the paper: AI coding, workflow automation, and embodied-style navigation all require both discovering information and acting on it.
- Low exploration error is a strong predictor of success in the current frontier-model sweep (`R^2 = 0.947`).
- Agents with similar success rates can still exhibit qualitatively different behaviors.
- On the same episode, Claude Haiku 4.5 and Gemini 3.1 Flash Lite display clearly different exploration and exploitation patterns.
- Claude Haiku 4.5 shows moderate exploration and exploitation errors and still reaches a successful trajectory, whereas Gemini 3.1 Flash Lite shows large exploration error and ends in a failed trajectory.
- Prompt steering changes failure modes: exploration-focused prompts reduce exploration error, while exploitation-focused prompts reduce exploitation error.
- Simple harness engineering helps substantially. In our experiments, GPT-4.1 improves from `63.0%` to `92.6%` success, and Gemini 3.1 Flash Lite improves from `51.9%` to `88.9%`.
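
As a sanity check on the success rates above, the sketch below shows that they are consistent with 27 episodes per model, that is, the 3 x 3 grid with three seeds (the sweep command in this README uses `--num-seeds 3`). The episode count and the per-model success counts are assumptions inferred from the percentages; this README does not state them, and any multiple of 27 would give the same rates.

```python
# Assumption: 9 evaluation cells x 3 seeds = 27 episodes per model.
TOTAL_EPISODES = 27

def success_rate(successes, total=TOTAL_EPISODES):
    """Success rate in percent, rounded to one decimal place."""
    return round(100 * successes / total, 1)

# Inferred (not reported) success counts before and after harness engineering.
inferred_counts = {
    "GPT-4.1": (17, 25),
    "Gemini 3.1 Flash Lite": (14, 24),
}

rates = {
    model: (success_rate(before), success_rate(after))
    for model, (before, after) in inferred_counts.items()
}
for model, (before, after) in rates.items():
    print(f"{model}: {before}% -> {after}%")
```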
Install the package:

```bash
# editable install
pip install -e .
# standard install
pip install .
```

Set up API keys for LLM-backed runs:

```bash
bash scripts/setup_llm_keys.sh
source ~/.symbolic_environment_llm_keys
```

Run one LLM-controlled episode:

```bash
symbolic-environment simulate \
  --exploration-level high \
  --task-dag-difficulty hard \
  --seed 0 \
  --agent llm_agent \
  --llm-provider azure \
  --llm-model gpt-4.1 \
  --llm-prompt-set reasoning-exploration \
  --gif
```

Compare multiple LLM agents on the exact same generated episode:

```bash
symbolic-environment generate \
  --exploration-level medium \
  --task-dag-difficulty hard \
  --seed 7 \
  --output outputs/medium-hard-s0007.episode.json

symbolic-environment simulate \
  --episode outputs/medium-hard-s0007.episode.json \
  --agent llm_agent \
  --llm-provider azure \
  --llm-model gpt-4.1 \
  --gif

symbolic-environment simulate \
  --episode outputs/medium-hard-s0007.episode.json \
  --agent llm_memory_agent \
  --llm-provider azure \
  --llm-model gpt-4.1 \
  --gif
```

For the memory-augmented harness variant, use `llm_memory_agent`. For Azure, `--llm-model` should be the deployment name.

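
When comparing more than a couple of agents, the same-episode comparison above can be scripted. The sketch below only builds and prints the command lines; `build_simulate_cmd` is a hypothetical helper, and it uses only the flags that appear in the commands above.

```python
import shlex

def build_simulate_cmd(episode_path, agent, provider, model, gif=True):
    """Build a `symbolic-environment simulate` argument list for a saved episode."""
    cmd = [
        "symbolic-environment", "simulate",
        "--episode", episode_path,
        "--agent", agent,
        "--llm-provider", provider,
        "--llm-model", model,
    ]
    if gif:
        cmd.append("--gif")
    return cmd

EPISODE = "outputs/medium-hard-s0007.episode.json"

# Same episode, same model, two agent variants.
commands = [
    build_simulate_cmd(EPISODE, agent, "azure", "gpt-4.1")
    for agent in ("llm_agent", "llm_memory_agent")
]

for cmd in commands:
    # Printed for inspection; to execute, import subprocess and call
    # subprocess.run(cmd, check=True) instead.
    print(shlex.join(cmd))
```

Keeping the episode file fixed and varying only `--agent` means any difference between runs comes from the agent rather than from the generated environment.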
The standard generated evaluation suite is the full 3 x 3 matrix over exploration level and task-DAG difficulty.
```bash
bash scripts/run_all_llm.sh --num-seeds 3
python scripts/analyze_all_results.py
```

`scripts/run_all_llm.sh` auto-sources `~/.symbolic_environment_llm_keys` when that file exists. The analysis script summarizes success rate, average steps, exploration error rate, and exploitation error rate for each evaluation cell.

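
The per-cell summary described above amounts to a group-by over episodes. The following is a rough sketch of that aggregation, not the logic of `scripts/analyze_all_results.py`: the record layout, field names, and toy values are all hypothetical. Exploration and exploitation error rates could be averaged per cell in the same way.

```python
from collections import defaultdict
from statistics import mean

# Toy per-episode records; field names and values are illustrative only.
records = [
    {"exploration_level": "high", "task_dag_difficulty": "hard", "success": True, "steps": 40},
    {"exploration_level": "high", "task_dag_difficulty": "hard", "success": False, "steps": 60},
    {"exploration_level": "low", "task_dag_difficulty": "easy", "success": True, "steps": 10},
]

def summarize(records):
    """Group episodes by evaluation cell and average success and step count."""
    by_cell = defaultdict(list)
    for record in records:
        cell = (record["exploration_level"], record["task_dag_difficulty"])
        by_cell[cell].append(record)
    return {
        cell: {
            "episodes": len(group),
            "success_rate": mean(1.0 if r["success"] else 0.0 for r in group),
            "avg_steps": mean(r["steps"] for r in group),
        }
        for cell, group in by_cell.items()
    }

summary = summarize(records)
for cell, stats in summary.items():
    print(cell, stats)
```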
The multi-model GIF gallery uses `exploration=high`, `task_dag=hard`, and `seed=2` for every example.

Each run writes a common stem under `outputs/`:

- `*.episode.json`: generated environment instance
- `*.trajectory.json`: action and observation trajectory
- `*.metrics.json`: success and exploration/exploitation summaries
- `*.llm_trace.jsonl`: model-facing prompt/response traces for LLM agents
- `*.png`, `*.pdf`, `*.gif`: visualizations

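
Because the artifacts share a stem, the files for one run can be located from that stem alone. A small sketch with `pathlib` follows; the helper names are hypothetical, and the example stem `medium-hard-s0007` is borrowed from the `generate` command in this README on the assumption that later outputs reuse it.

```python
from pathlib import Path

# Suffixes listed in the README; the trace exists only for LLM agents,
# and some visualizations may depend on the flags used for the run.
OUTPUT_SUFFIXES = [
    ".episode.json",
    ".trajectory.json",
    ".metrics.json",
    ".llm_trace.jsonl",
    ".png",
    ".pdf",
    ".gif",
]

def expected_outputs(stem, root="outputs"):
    """Return every path a run with this stem might have written."""
    return [Path(root) / f"{stem}{suffix}" for suffix in OUTPUT_SUFFIXES]

def existing_outputs(stem, root="outputs"):
    """Return only the artifacts that are actually present on disk."""
    return [path for path in expected_outputs(stem, root) if path.exists()]

paths = expected_outputs("medium-hard-s0007")
for path in paths:
    print(path)
```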
Repository layout:

- `assets/`: README figures and sampled media used for documentation
- `src/symbolic_environment/`: environment, generation, metrics, agents, prompts, visualization
- `scripts/`: evaluation runners and summary utilities
- `tests/`: regression and unit tests for the environment, agents, metrics, and CLI

If you find this work useful, please cite our paper:
```bibtex
@article{park2026exploration,
  title={Exploration and Exploitation Errors Are Measurable for Language Model Agents},
  author={Jaden Park and Jungtaek Kim and Jongwon Jeong and Robert D. Nowak and Kangwook Lee and Yong Jae Lee},
  journal={arXiv preprint arXiv:2604.13151},
  year={2026}
}
```

This repository is released under the MIT License.








