This is the official implementation of ProPlay: Procedural Pre-play for Self-Evolving LLM Agents.
ProPlay targets self-evolving agents in partially observable environments, where an agent must continually refine its internal understanding of the environment's dynamics. It introduces a pre-play framework built on an evolving procedural world model, which keeps planning and memory exchanging information within a single unified architecture.
ProPlay represents environment knowledge as a procedure graph where:
- Nodes are abstracted procedures induced from successful task trajectories.
- Directed edges (procedure transitions) encode induced causal transitions among task stages.
- Reliability record embeddings on each edge track how consistently a transition contributed to success on semantically similar tasks.
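The structure above can be illustrated with a minimal sketch. It is not the repository's `WorkflowGraph` API: the names `ProcedureGraph`, `ReliabilityRecord`, and `reliability` are hypothetical, and weighting past outcomes by cosine similarity between task embeddings is an assumption about how "semantically similar tasks" might be scored.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ReliabilityRecord:
    """One observed use of a transition: the task embedding and the episode outcome."""
    task_embedding: list
    success: bool


@dataclass
class ProcedureGraph:
    """Hypothetical procedure graph; not the repository's WorkflowGraph."""
    # procedure name -> short description of the abstracted procedure
    nodes: dict = field(default_factory=dict)
    # (source, target) -> reliability records collected for that transition
    edges: dict = field(default_factory=dict)

    def add_procedure(self, name, description):
        self.nodes[name] = description

    def record_transition(self, src, dst, task_embedding, success):
        record = ReliabilityRecord(list(task_embedding), bool(success))
        self.edges.setdefault((src, dst), []).append(record)

    def reliability(self, src, dst, task_embedding):
        """Similarity-weighted success rate of src -> dst for the given task (assumed scoring rule)."""
        records = self.edges.get((src, dst), [])
        weights = [max(0.0, cosine(task_embedding, r.task_embedding)) for r in records]
        total = sum(weights)
        if total == 0.0:
            return 0.0  # no evidence from similar tasks
        return sum(w for w, r in zip(weights, records) if r.success) / total
```

Under this scoring rule, a transition that succeeded only on dissimilar tasks contributes little to the estimate for the current task, which is one way to read the per-edge reliability records described above.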
Each episode follows a three-phase loop:
- Pre-play: Before acting, ProPlay queries the procedure world model to construct a procedural trajectory as structured soft guidance.
- Execute: The agent acts under this guidance while retaining full freedom to deviate and explore.
- Refine: After execution, new procedures are induced from successful trajectory fragments, and the world model is refined for future task episodes.
This episodic query–execute–refine loop enables ProPlay to progressively internalize environment dynamics, combining the strengths of memory (consolidated procedural knowledge) and planning (task-specific trajectory lookahead) based on a unified world model.
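As a rough illustration of the loop, the sketch below reduces the world model to success counts per transition. Everything here is an assumption made for the example: the names `preplay`, `refine`, and `run_episode`, the greedy lookahead, and the count-based update are hypothetical stand-ins, whereas the actual system induces procedures with an LLM.

```python
# Hypothetical world model: (source procedure, target procedure) -> [successes, attempts]


def preplay(model, start, max_steps=5):
    """Pre-play: greedily follow the transition with the best historical success rate."""
    plan, current = [start], start
    for _ in range(max_steps):
        candidates = {
            dst: s / n
            for (src, dst), (s, n) in model.items()
            if src == current and n > 0 and dst not in plan
        }
        if not candidates:
            break
        current = max(candidates, key=candidates.get)
        plan.append(current)
    return plan


def refine(model, executed, success):
    """Refine: update transition statistics from the trajectory that was actually executed."""
    for src, dst in zip(executed, executed[1:]):
        stats = model.setdefault((src, dst), [0, 0])
        stats[0] += int(success)
        stats[1] += 1


def run_episode(model, start, execute):
    """One pre-play / execute / refine cycle.

    `execute` takes the guidance and returns (executed_procedures, success).
    It may ignore the guidance, mirroring the agent's freedom to deviate.
    """
    guidance = preplay(model, start)
    executed, success = execute(guidance)
    refine(model, executed, success)
    return guidance, success
```

With an empty model the first episode receives only the start procedure as guidance; once a trajectory has been recorded, later calls to `preplay` return a longer plan drawn from it.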
| Benchmark | Domain | Implementation |
|---|---|---|
| ScienceWorld | Text-based scientific reasoning (23 task types) | benchmarks/sciworld/ |
| PlanCraft | Minecraft crafting (187 tasks, 3 difficulty levels) | benchmarks/plancraft/ |
| τ-bench | Customer service tool use (retail & airline) | benchmarks/taubench/ |
proplay/
├── proplay/  # Core library (benchmark-agnostic)
│   ├── graph.py  # WorkflowGraph: nodes, edges, reliability record embeddings
│   ├── env.py  # BaseEnv interface
│   └── llm.py  # LLMClient (OpenAI-compatible)
│
├── benchmarks/
│   ├── sciworld/
│   │   ├── router.py  # SciWorldEnv (AgentGym REST wrapper)
│   │   ├── agent.py  # ProPlay agent: think/act loop
│   │   ├── induction.py  # Procedure induction from episode summaries
│   │   ├── preplay.py  # Pre-play trajectory construction and graph recording
│   │   ├── prompts.py  # LLM prompt templates
│   │   ├── prompt/  # preplay_instruction.txt, preplay_one_shot.txt
│   │   └── pipeline.py  # End-to-end evaluation loop
│   ├── plancraft/
│   │   ├── router.py  # PlanCraft environment wrapper
│   │   ├── agent.py  # ProPlay agent for Minecraft crafting
│   │   ├── graph.py  # WorkflowGraph (plancraft copy)
│   │   ├── induction.py  # Recipe library induction
│   │   ├── preplay.py  # Pre-play for recipe ordering
│   │   ├── prompts.py  # LLM prompt templates
│   │   ├── prompt/  # preplay_instruction.txt, preplay_one_shot.txt
│   │   └── pipeline.py  # Evaluation loop
│   └── taubench/
│       ├── router.py  # tau_bench.envs.get_env wrapper (retail, airline)
│       ├── agent.py  # ProPlay agent with tool-calling support
│       ├── graph.py  # WorkflowGraph (taubench copy)
│       ├── induction.py  # Workflow induction from tool trajectories
│       ├── preplay.py  # Pre-play for tool ordering
│       ├── prompts.py  # LLM prompt templates
│       ├── prompt/  # preplay_instruction.txt, preplay_one_shot.txt
│       ├── llm_client.py  # LLM client with tool-calling extension
│       └── pipeline.py  # Evaluation loop
│
├── data/
│   ├── sciworld/
│   │   ├── gen_online_splits.py  # Generate online evaluation split (shuffled)
│   │   └── splits/  # Generated
│   ├── plancraft/
│   │   ├── gen_splits.py  # Generate evaluation split from plancraft package
│   │   └── splits/  # Generated
│   └── taubench/
│       └── load_data.py  # Data loading utilities (data bundled in package)
│
├── prompts/  # Source copies of pre-play prompt text files
│   ├── sciworld/
│   ├── plancraft/
│   └── taubench/
│
└── scripts/
    ├── run_sciworld.sh
    ├── run_plancraft.sh
    └── run_taubench.sh
git clone <this-repo>
cd proplay
pip install -e .
# Benchmark-specific dependencies
pip install -e ".[sciworld]" # ScienceWorld (agentenv-sciworld + scienceworld)
pip install -e ".[plancraft]" # PlanCraft
# τ-bench — install from source
pip install git+https://github.com/sierra-research/tau-bench

Generate task splits before running ProPlay:
# ScienceWorld
cd data/sciworld
python gen_online_splits.py # → splits/online_shuffled_ids.json
# PlanCraft (reads val/test splits directly from the installed plancraft package)
cd data/plancraft
python gen_splits.py # → splits/merged_187_by_complexity.json

τ-bench task data is bundled with the tau_bench package, so no preprocessing is needed.
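Before launching a run, it can help to confirm that the generated split files are in place. The helper below is a small sketch, not part of the repository: the function name `missing_splits` is hypothetical, and the paths are taken from the output names shown in the commands above, assuming the script is run from the repository root.

```python
from pathlib import Path

# Split files the generation scripts are expected to produce (relative to the repo root).
EXPECTED_SPLITS = {
    "sciworld": Path("data/sciworld/splits/online_shuffled_ids.json"),
    "plancraft": Path("data/plancraft/splits/merged_187_by_complexity.json"),
}


def missing_splits(repo_root):
    """Return the benchmark names whose split file does not exist under repo_root."""
    root = Path(repo_root)
    return [name for name, rel in EXPECTED_SPLITS.items() if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_splits(Path.cwd())
    if missing:
        print("Missing splits for:", ", ".join(missing))
    else:
        print("All expected split files are present.")
```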
# Start AgentGym SciWorld server
python -m uvicorn agentenv_sciworld.server:app --host 0.0.0.0 --port <your_port> &
export OPENAI_API_KEY=<your_key>
bash scripts/run_sciworld.sh

# PlanCraft
export OPENAI_API_KEY=<your_key>
bash scripts/run_plancraft.sh

# τ-bench
export OPENAI_API_KEY=<your_key>
DOMAIN=retail bash scripts/run_taubench.sh
DOMAIN=airline bash scripts/run_taubench.sh

For questions, please contact yma7@nd.edu.
