---
title: OpenEnv-WolfeClick Environment
emoji: 🎮
colorFrom: blue
colorTo: gray
sdk: docker
app_port: 7860
tags:
---
An OpenEnv-compatible environment for training LLMs to play competitive Pokemon Showdown battles using GRPO.
Competitive Pokemon has hidden information, constrained legal actions, long-term resource tradeoffs, and an active opponent. This repo turns that setting into a trainable RL environment with a reset() / step() loop, shaped rewards, an OpenEnv server wrapper, and a GRPO training pipeline.
Try the live demo — watch a GRPO-trained model play a full battle turn by turn.
```bash
git clone https://github.com/Atharva2099/OpenEnv-WolfeClick.git
cd OpenEnv-WolfeClick
pip install -e .

# Run a battle with random actions (needs local Pokemon Showdown on port 8000)
python examples/run_single_episode.py

# Watch a trained model battle
python examples/watch_model_battle.py --revision grpo-qwen3-4b-run3
```

```
src/smogon_rl/       Core environment: state formatting, action validation,
                     reward shaping, poke-env client
env/                 OpenEnv server package (env.server.app:app)
examples/            Runnable scripts for local battles
trainer.ipynb        Colab: rollout collection + GRPO training
watch_battle.ipynb   Colab: run one live watched battle
benchmarks/          Checkpoint comparison notebook + results
record_battle.py     Record a battle to JSON for replay
space_app.py         Gradio HF Space battle viewer
openenv.yaml         OpenEnv deployment config
Dockerfile           HF Spaces Docker deployment
```
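The reset()/step() loop at the heart of the environment can be sketched as below. The `reset()`/`step()` contract comes from this repo's description; the stub env, its constructor arguments, and the exact return shapes are assumptions for illustration, not the repo's actual API.

```python
import random

class StubShowdownEnv:
    """Stand-in for PokemonShowdownEnv (assumed interface: reset() -> state,
    step(action) -> (state, reward, done)). The real env drives a live
    Pokemon Showdown battle; this stub just ends after a fixed turn count."""

    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.turn = 0

    def reset(self):
        self.turn = 0
        return "Part A: Active Field\n(markdown battle state)"

    def step(self, action):
        self.turn += 1
        reward = random.uniform(-1.0, 1.0)  # shaped reward in the real env
        done = self.turn >= self.max_turns
        return "(next markdown state)", reward, done

def run_episode(env, policy):
    """Generic episode loop: `policy` maps a state string to a JSON action dict."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total += reward
    return total

total = run_episode(StubShowdownEnv(), lambda s: {"action": "move", "choice": "Thunderbolt"})
```

The same loop works unchanged against the real environment: only the env object is swapped.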
Each turn the model receives a structured markdown state:
| Section | Contents |
|---|---|
| Part A: Active Field | Active Pokemon for both sides — HP, status, ability, item, stat modifiers, opponent speed range |
| Part B: Full Self Roster | All 6 team Pokemon with HP, status, item, and known moves (type + base power) |
| Part C: Opponent History | Every revealed opponent Pokemon — last known HP, status, moves, items, abilities |
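A minimal sketch of how such a three-part markdown state could be assembled. The part headings mirror the table above; the input dictionaries and field names are illustrative, not the repo's actual schema.

```python
def format_state(active, roster, opponent_history):
    """Render a three-part markdown battle state (fields are illustrative)."""
    lines = ["Part A: Active Field"]
    for side, mon in active.items():
        lines.append(f"- {side}: {mon['name']} (HP {mon['hp']}%, status {mon['status']})")
    lines.append("Part B: Full Self Roster")
    for mon in roster:
        lines.append(f"- {mon['name']} (HP {mon['hp']}%): {', '.join(mon['moves'])}")
    lines.append("Part C: Opponent History")
    for mon in opponent_history:
        lines.append(f"- {mon['name']}: last known HP {mon['hp']}%")
    return "\n".join(lines)
```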
The model outputs one JSON action:

```json
{"action": "move" | "switch", "choice": "Exact Name of Move or Pokemon"}
```

Up to 4 moves and 5 switches are available per turn. The environment validates the action, executes it in a real Showdown battle, and returns the next state + shaped reward.
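Validation can be sketched as below: parse the JSON, check the action type, and match `choice` against the legal options for this turn. The case-insensitive matching rule and the sample move/switch lists are assumptions for illustration; the repo's actual matcher may differ.

```python
import json

LEGAL_MOVES = ["Thunderbolt", "Volt Switch", "Protect", "Surf"]      # up to 4 per turn
LEGAL_SWITCHES = ["Ferrothorn", "Heatran", "Clefable",
                  "Dragapult", "Landorus-Therian"]                   # up to 5 per turn

def validate_action(raw, moves, switches):
    """Return (ok, canonical_name_or_error) for the model's raw JSON output."""
    try:
        act = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if act.get("action") not in ("move", "switch"):
        return False, "action must be 'move' or 'switch'"
    pool = moves if act["action"] == "move" else switches
    for name in pool:
        if name.lower() == str(act.get("choice", "")).lower():
            return True, name
    # hallucinated move/Pokemon: the env answers this with the -10.0 penalty
    return False, f"illegal {act['action']}: {act.get('choice')}"
```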
Dense reward signal tied to battle progress:
| Component | Signal |
|---|---|
| Damage dealt | +1.0 per 10% opponent HP reduced |
| Damage taken | -1.0 per 10% self HP lost |
| Knockouts | +3.0 per opponent faint, -3.0 per self faint |
| Healing | +1.0 per 10% healed (capped 3.0/battle) |
| Setup | +0.5 per stat stage gained (capped 2.0/mon) |
| Type effectiveness | +0.5 super effective, -1.0 immune |
| Illegal action | -10.0 for hallucinated moves/Pokemon |
| Step penalty | -0.05 per turn (anti-stall) |
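The components in the table combine additively per turn; a minimal sketch, with the per-battle healing cap and per-Pokemon setup cap simplified to per-call caps for brevity (the real env tracks them across the battle):

```python
def shaped_reward(dealt_pct, taken_pct, opp_faints, self_faints,
                  healed_pct, stat_stages, effectiveness, illegal):
    """Sum the shaped reward components for one turn (caps simplified)."""
    r = 0.0
    r += 1.0 * (dealt_pct / 10)              # +1.0 per 10% opponent HP reduced
    r -= 1.0 * (taken_pct / 10)              # -1.0 per 10% self HP lost
    r += 3.0 * opp_faints - 3.0 * self_faints
    r += min(1.0 * (healed_pct / 10), 3.0)   # cap is per battle in the real env
    r += min(0.5 * stat_stages, 2.0)         # cap is per Pokemon in the real env
    if effectiveness == "super":
        r += 0.5
    elif effectiveness == "immune":
        r -= 1.0
    if illegal:
        r -= 10.0                            # hallucinated move/Pokemon
    r -= 0.05                                # step penalty (anti-stall)
    return r
```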
```
Base Model (Qwen3-4B-Instruct)
        |
[JSON Warm-up SFT]       establish legal action baseline
        |
[Rollout Collection]     live Pokemon Showdown battles
        |
[GRPO Training]          optimize policy on real trajectories
        |
LoRA Checkpoint  --->  Hugging Face Hub
```
- Start local Pokemon Showdown in Colab
- Collect rollout trajectories from live battles
- Store prompt, chosen action, and environment reward
- Train a LoRA adapter with GRPO on real trajectories
- Benchmark checkpoints against each other
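GRPO's core step is group-relative: each prompt is rolled out several times, and each rollout's environment reward is normalized against its own group. A common formulation is sketched below; the repo's exact normalization is not shown here, so treat this as an assumption.

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """GRPO-style advantage: center each rollout's reward on its sampling
    group's mean and scale by the group's std (guarded against zero)."""
    mu = mean(rewards)
    sd = pstdev(rewards) or 1.0   # uniform groups get zero advantage, not a div-by-zero
    return [(r - mu) / sd for r in rewards]
```

These per-rollout advantages then weight the policy-gradient update on the stored (prompt, chosen action) pairs.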
```
Pokemon Showdown (Node.js, port 8000)
        | WebSocket
PokeEnvClient (async background loop)
        |-- RLPlayer (queue-driven)
        |-- RandomPlayer (opponent)
        |
PokemonShowdownEnv (sync wrapper: reset/step)
        |-- state_formatter    -> markdown state for LLM
        |-- action_space       -> JSON validation + matching
        |-- reward calculator  -> shaped multi-component reward
        |
OpenEnv Server (FastAPI on port 8001)
```
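The "async background loop + sync wrapper" pairing above follows a standard pattern: run an asyncio event loop in a background thread and have the synchronous reset()/step() side submit coroutines to it and block on the result. A minimal sketch of that pattern, with an illustrative stand-in for the WebSocket round-trip (all names here are assumptions, not the repo's classes):

```python
import asyncio
import threading

class SyncBridge:
    """Run an asyncio loop on a daemon thread; sync callers block on results."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    def call(self, coro):
        # Schedule the coroutine on the background loop and wait synchronously.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()

async def fake_showdown_step(action):
    await asyncio.sleep(0.01)   # stands in for a WebSocket round-trip
    return {"state": "(markdown)", "reward": 0.0, "done": False}

bridge = SyncBridge()
obs = bridge.call(fake_showdown_step({"action": "move", "choice": "Surf"}))
```

This keeps the poke-env client fully async while exposing the blocking reset()/step() interface that RL training loops expect.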
Model repo: Atharva2099/openenv-smogon-rl
| Checkpoint | Description |
|---|---|
| `grpo-qwen3-4b-run1` | First GRPO training run |
| `grpo-qwen3-4b-run2` | Second run, tuned reward shaping |
| `grpo-qwen3-4b-run3` | Third run, best performing |
| Notebook | Purpose |
|---|---|
| `trainer.ipynb` | Rollout collection + GRPO training (Colab GPU) |
| `watch_battle.ipynb` | Run one live watched battle |
| `benchmarks/benchmark.ipynb` | Compare checkpoint performance |
The environment follows the OpenEnv standard. Config:

```yaml
# openenv.yaml
spec_version: 1
name: openenv-wolfeclick
type: space
runtime: fastapi
app: env.server.app:app
port: 8001
```

Server package: `env/server/app.py`, `env/server/environment.py`, `env/models.py`
The Dockerfile builds a lightweight Gradio app that replays pre-recorded model battles:

```bash
docker build -t wolfeclick . && docker run -p 7860:7860 wolfeclick
```

MIT