jjj-madison/measurable-explore-exploit


Exploration and Exploitation Errors Are Measurable for Language Model Agents

Jaden Park1,* · Jungtaek Kim1,* · Jongwon Jeong1,* · Robert D. Nowak1 · Kangwook Lee2,3 · Yong Jae Lee1

1 University of Wisconsin-Madison · 2 KRAFTON · 3 Ludo Robotics
* Equal Contribution

🏛️ arXiv · 📄 Project Page · 💻 GitHub

Overview of the partially observed grid and symbolic task DAG framework.

This repository provides a deterministic framework for measuring exploration and exploitation behavior from language-model agent trajectories alone. Each episode pairs a partially observed 2D grid map with an unknown symbolic task DAG. The release includes environment generation, policy-agnostic error metrics, heuristic baselines, LLM agents, harness-engineered memory variants, and visualization/export utilities.

What This Framework Exposes

  • A controllable 3 x 3 evaluation grid over ExplorationLevel x TaskDAGDifficulty
  • Symbolic task nodes that suppress semantic priors and isolate in-environment reasoning
  • Per-trajectory metrics for success, exploration_error, and exploitation_error
  • Built-in heuristic agents, direct LLM agents, and memory/harness variants
  • Export utilities for PNG, PDF, GIF, *.trajectory.json, and *.llm_trace.jsonl

| Axis | Values | Main effect |
| --- | --- | --- |
| ExplorationLevel | high, medium, low | grid footprint, corridor width, and spatial exploration burden |
| TaskDAGDifficulty | easy, medium, hard | symbolic dependency depth/branching and exploitation burden |
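
The nine evaluation cells are the Cartesian product of the two axes. The sketch below only illustrates that structure: the level names come from the axes described above, while the constant names and the `cells` list are assumptions for illustration, not part of the package's API.

```python
from itertools import product

# Axis values as listed above; the constant names are hypothetical.
EXPLORATION_LEVELS = ["high", "medium", "low"]
TASK_DAG_DIFFICULTIES = ["easy", "medium", "hard"]

# Each cell is one (exploration level, task-DAG difficulty) pair.
cells = list(product(EXPLORATION_LEVELS, TASK_DAG_DIFFICULTIES))

for exploration_level, task_dag_difficulty in cells:
    print(f"{exploration_level}-{task_dag_difficulty}")
```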

The framework targets settings where failure to discover useful information (an exploration error) must be distinguished from failure to use information that has already been discovered (an exploitation error). The same abstraction motivates the paper: AI coding, workflow automation, and embodied-style navigation all require both exploring and exploiting.

Main Empirical Takeaways

Success rate versus exploration error across language models.

  • Low exploration error is a strong predictor of success in the current frontier-model sweep (R^2 = 0.947).
  • Agents with similar success rates can still exhibit qualitatively different behaviors.
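
To check a similar relationship on your own runs, R^2 for a one-variable least-squares line can be computed as sketched below. This is a minimal illustration, not the paper's analysis code: the function name `r_squared` is hypothetical, and the input numbers are placeholders rather than reported results.

```python
def r_squared(x, y):
    """Coefficient of determination for the least-squares line y ~ a*x + b."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    # For simple linear regression, R^2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


# Placeholder values for illustration only (not data from the paper):
# one exploration error and one success rate per model.
exploration_error = [0.1, 0.2, 0.3, 0.4]
success_rate = [0.9, 0.8, 0.7, 0.6]
print(round(r_squared(exploration_error, success_rate), 3))
```

Because the fit has a single predictor, the squared correlation is enough and no regression library is needed.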

Side-by-side comparison of Claude Haiku 4.5 and Gemini 3.1 Flash Lite trajectories on the same task.

  • On the same episode, Claude Haiku 4.5 and Gemini 3.1 Flash Lite display clearly different exploration and exploitation patterns.

  • Claude Haiku 4.5 shows moderate exploration and exploitation errors and still reaches a successful trajectory, whereas Gemini 3.1 Flash Lite shows large exploration error and ends in a failed trajectory.

  • Prompt steering changes failure modes: exploration-focused prompts reduce exploration error, while exploitation-focused prompts reduce exploitation error.

  • Simple harness engineering helps substantially. In our experiments, GPT-4.1 improves from 63.0% to 92.6% success, and Gemini 3.1 Flash Lite improves from 51.9% to 88.9%.

Quickstart

Install the package from the repository root, using one of the following:

# editable install (for development)
pip install -e .

# or: standard install
pip install .

Set up API keys for LLM-backed runs:

bash scripts/setup_llm_keys.sh
source ~/.symbolic_environment_llm_keys

Run one LLM-controlled episode:

symbolic-environment simulate \
  --exploration-level high \
  --task-dag-difficulty hard \
  --seed 0 \
  --agent llm_agent \
  --llm-provider azure \
  --llm-model gpt-4.1 \
  --llm-prompt-set reasoning-exploration \
  --gif

Compare multiple LLM agents on the exact same generated episode:

symbolic-environment generate \
  --exploration-level medium \
  --task-dag-difficulty hard \
  --seed 7 \
  --output outputs/medium-hard-s0007.episode.json

symbolic-environment simulate \
  --episode outputs/medium-hard-s0007.episode.json \
  --agent llm_agent \
  --llm-provider azure \
  --llm-model gpt-4.1 \
  --gif

symbolic-environment simulate \
  --episode outputs/medium-hard-s0007.episode.json \
  --agent llm_memory_agent \
  --llm-provider azure \
  --llm-model gpt-4.1 \
  --gif

For the memory-augmented harness variant, use llm_memory_agent. For Azure, --llm-model should be the deployment name.
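
When comparing several agents on one episode, it can help to build the commands programmatically. The sketch below assembles the two `simulate` invocations shown above using only the flags that appear in this README. The helper name `simulate_command` and its defaults are hypothetical, and the commands are printed rather than executed.

```python
import shlex

EPISODE = "outputs/medium-hard-s0007.episode.json"


def simulate_command(agent, episode=EPISODE, provider="azure", model="gpt-4.1"):
    """Build the argument list for one `symbolic-environment simulate` call."""
    return [
        "symbolic-environment", "simulate",
        "--episode", episode,
        "--agent", agent,
        "--llm-provider", provider,
        "--llm-model", model,
        "--gif",
    ]


commands = [simulate_command(agent) for agent in ("llm_agent", "llm_memory_agent")]

for command in commands:
    # Printed only; pass `command` to subprocess.run(command, check=True)
    # to launch the run once the package and API keys are set up.
    print(shlex.join(command))
```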

Reproducing The Standard Evaluation Suite

The standard generated evaluation suite is the full 3 x 3 matrix over exploration level and task-DAG difficulty.

bash scripts/run_all_llm.sh --num-seeds 3
python scripts/analyze_all_results.py

scripts/run_all_llm.sh auto-sources ~/.symbolic_environment_llm_keys when that file exists. The analysis script summarizes success rate, average steps, exploration error rate, and exploitation error rate for each evaluation cell.
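
The per-cell summary amounts to grouping runs by evaluation cell and averaging their metrics. The sketch below shows that aggregation under an assumed record schema. The keys (`exploration_level`, `task_dag_difficulty`, `success`, `steps`, `exploration_error`, `exploitation_error`) and the function `summarize_by_cell` are hypothetical and may not match the actual *.metrics.json layout or scripts/analyze_all_results.py.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_cell(records):
    """Average per-run metrics within each (exploration, task-DAG) cell."""
    grouped = defaultdict(list)
    for record in records:
        cell = (record["exploration_level"], record["task_dag_difficulty"])
        grouped[cell].append(record)

    summary = {}
    for cell, runs in grouped.items():
        summary[cell] = {
            "num_runs": len(runs),
            "success_rate": mean(1.0 if run["success"] else 0.0 for run in runs),
            "avg_steps": mean(run["steps"] for run in runs),
            "exploration_error": mean(run["exploration_error"] for run in runs),
            "exploitation_error": mean(run["exploitation_error"] for run in runs),
        }
    return summary


# Placeholder records for illustration only (not results from the paper).
records = [
    {"exploration_level": "high", "task_dag_difficulty": "hard",
     "success": True, "steps": 40,
     "exploration_error": 0.25, "exploitation_error": 0.0},
    {"exploration_level": "high", "task_dag_difficulty": "hard",
     "success": False, "steps": 60,
     "exploration_error": 0.75, "exploitation_error": 0.5},
]
for cell, stats in summarize_by_cell(records).items():
    print(cell, stats)
```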

Sample Trajectories

All examples below use exploration=high, task_dag=hard, and seed=2.

The gallery contains one sample trajectory GIF per model:

  • Claude Opus 4.6
  • Claude Haiku 4.5
  • Gemini 3.1 Pro
  • Gemini 3.1 Flash Lite
  • GPT-5.4
  • GPT-4.1 Nano

Output Artifacts

Each run writes its artifacts under outputs/, all sharing a common file stem:

  • *.episode.json: generated environment instance
  • *.trajectory.json: action and observation trajectory
  • *.metrics.json: success and exploration/exploitation summaries
  • *.llm_trace.jsonl: model-facing prompt/response traces for LLM agents
  • *.png, *.pdf, *.gif: visualizations
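
Because the artifacts of a run share a stem, they can be collected by stripping the known suffixes. The sketch below groups file names that way. The suffix list mirrors the bullets above, while `group_by_stem` and the example file names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Artifact suffixes as listed above.
ARTIFACT_SUFFIXES = [
    ".episode.json", ".trajectory.json", ".metrics.json", ".llm_trace.jsonl",
    ".png", ".pdf", ".gif",
]


def group_by_stem(paths):
    """Map each run stem to the sorted artifact suffixes found for it."""
    runs = defaultdict(list)
    for path in paths:
        name = Path(path).name
        for suffix in ARTIFACT_SUFFIXES:
            if name.endswith(suffix):
                runs[name[: -len(suffix)]].append(suffix)
                break
    return {stem: sorted(suffixes) for stem, suffixes in runs.items()}


# Hypothetical file names; real stems depend on the run configuration.
example = [
    "outputs/run-a.episode.json",
    "outputs/run-a.trajectory.json",
    "outputs/run-a.metrics.json",
    "outputs/run-a.gif",
    "outputs/run-b.episode.json",
]
print(group_by_stem(example))
```

To apply this to files on disk, pass `Path("outputs").iterdir()` instead of the example list.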

Repository Layout

  • assets/: README figures and sampled media used for documentation
  • src/symbolic_environment/: environment, generation, metrics, agents, prompts, visualization
  • scripts/: evaluation runners and summary utilities
  • tests/: regression and unit tests for the environment, agents, metrics, and CLI

Citation

If you find this work useful, please cite our paper:

@article{park2026exploration,
  title={Exploration and Exploitation Errors Are Measurable for Language Model Agents}, 
  author={Jaden Park and Jungtaek Kim and Jongwon Jeong and Robert D. Nowak and Kangwook Lee and Yong Jae Lee},
  journal={arXiv preprint arXiv:2604.13151},
  year={2026}
}

License

This repository is released under the MIT License.
