This project trains a large language model (LLM) to ask optimal clarifying questions when given ambiguous coding problems - but under a strict budget on how many questions it can ask. The interaction is framed as a Constrained Markov Decision Process (CMDP), where the agent must maximise code correctness while satisfying a hard constraint on average question count. It uses Reinforcement Learning with PPO-Lagrangian on top of Qwen2.5-Coder-7B-Instruct with LoRA adapters, evaluated on the HumanEvalComm benchmark.
Contributors: Abhinav Rajput · Acey Vogelstein · Deepali Balakrishna Ksheersagar
- What This Project Does
- Hardware Requirements
- Installation
- API Keys
- Project Structure
- Dataset
- System Prompts
- Training
- Evaluation
- Results
- Key Takeaways
- Trained Model Checkpoints
- Configuration Reference
- Key Design Decisions
When an LLM is given a vague coding task like "Return list with elements incremented by a number", it can either:
- Guess and write code (fast, but may be wrong)
- Ask a clarifying question like "What number should each element be incremented by?" (gets better information, but costs the user a turn)
This project trains the model to make that decision optimally - asking only when it is truly worth it - under a configurable budget d₁ that caps the average number of questions per problem.
We train three policies:
- d₁ = 0 - the agent learns to never ask; it must guess from the degraded spec alone
- d₁ = 0.5 - the agent may ask at most 0.5 questions on average per problem
- d₁ = 1 - the agent may ask at most 1 question on average per problem
These three policies trace a Pareto frontier: a family of policies ranging from "never ask" to "ask freely", each optimal for its budget.
The training algorithm is PPO-Lagrangian, where a Lagrange multiplier λ₁ automatically learns how expensive each question should be to satisfy the budget. No manual penalty tuning is needed.
| Requirement | Minimum | Recommended |
|---|---|---|
| GPUs | 2× A100-40GB | 2× A100-40GB |
| CPU RAM | 64 GB | 128 GB |
| Disk | 50 GB free | 100 GB free |
| Internet | Required (HuggingFace + OpenAI API) | - |
GPU layout:
- cuda:0 - Policy training (Qwen2.5-Coder-7B + LoRA + value heads + optimizer, ~19 GB)
- cuda:1 - Rollout inference + frozen reference model (~17 GB)
To change GPU assignment, edit model.train_device and model.rollout_device in configs/default.yaml.
# Clone the repository
git clone <repo-url>
cd rl_llm_multiturn_project
# Create and activate a virtual environment (Python 3.10+)
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Install CUDA-enabled PyTorch separately if needed (the version in requirements.txt is CPU-safe but may not match your CUDA driver):
# Example for CUDA 12.1
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121
Verify GPU access:
python3 -c "import torch; print(torch.cuda.device_count(), 'GPUs available')"
Download the base model: The base model (Qwen2.5-Coder-7B-Instruct) is openly available on HuggingFace - no access request or token needed:
python -c "from transformers import AutoTokenizer, AutoModelForCausalLM; AutoTokenizer.from_pretrained('Qwen/Qwen2.5-Coder-7B-Instruct'); AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-Coder-7B-Instruct', torch_dtype='bfloat16')"
This project uses two external APIs:
The user simulator (GPT-4o-mini) answers the agent's clarifying questions during every training episode. This is the only API cost during training (~$5–10 total for a full run).
export OPENAI_API_KEY="sk-..."
Or add it to a .env file in the project root (never commit this file):
OPENAI_API_KEY=sk-...
Then load it before running (set -a exports the variables so that Python subprocesses can see them):
set -a; source .env; set +a # if using a .env file
Cost estimate: GPT-4o-mini charges ~$0.15 per 1M input tokens. With 32 episodes/iteration × 80 iterations × ~2 questions/episode × ~200 tokens/call ≈ 1M tokens ≈ **$0.15 per d₁ setting**.
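The arithmetic behind the estimate can be checked with a few lines of Python. This is only a back-of-envelope sketch: the constants are the rough figures quoted above, the function name is hypothetical, and output tokens are not counted.

```python
# Rough input-token cost of the user simulator for one d1 setting.
# All constants are the approximate figures quoted in the text.
EPISODES_PER_ITER = 32
ITERATIONS = 80
QUESTIONS_PER_EPISODE = 2       # rough estimate, not a measured value
TOKENS_PER_CALL = 200           # rough input tokens per simulator call
USD_PER_MILLION_INPUT = 0.15    # approximate GPT-4o-mini input price


def simulator_input_cost() -> tuple[int, float]:
    """Return (total input tokens, estimated cost in USD)."""
    tokens = (EPISODES_PER_ITER * ITERATIONS
              * QUESTIONS_PER_EPISODE * TOKENS_PER_CALL)
    cost = tokens / 1_000_000 * USD_PER_MILLION_INPUT
    return tokens, cost


tokens, cost = simulator_input_cost()
print(f"{tokens:,} input tokens, about ${cost:.2f}")
```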
Qwen2.5-Coder-7B-Instruct is not gated - no HuggingFace token is needed to download it.
rl_llm_multiturn_project/
│
├── configs/
│ └── default.yaml # All hyperparameters (model, training, constraints)
│
├── data/
│ ├── raw/ # Downloaded raw data (auto-populated, gitignored)
│ └── processed/ # Preprocessed problem files (gitignored)
│
├── src/
│ ├── data/
│ │ ├── dataset.py # Problem dataclass + HumanEvalComm loader
│ │ └── augmentation.py # MBPP degradation via GPT-4o (one-time script)
│ │
│ ├── environment/
│ │ ├── env.py # ClarificationEnv: the RL state machine
│ │ ├── user_simulator.py # GPT-4o-mini wrapper (async, with atomic question counting)
│ │ └── code_executor.py # Sandboxed Python runner → pass@1 score
│ │
│ ├── models/
│ │ ├── agent.py # Qwen2.5-Coder-7B + LoRA: generate() and score()
│ │ └── value_heads.py # Three MLP value heads (reward, q_cost, t_cost)
│ │
│ ├── training/
│ │ ├── rollout.py # Async episode collection + RolloutBuffer
│ │ ├── ppo.py # PPO loss, GAE, KL penalty, entropy bonus
│ │ ├── lagrangian.py # Lagrange multiplier dual update
│ │ └── trainer.py # PPOLagrangianTrainer: main training loop
│ │
│ └── evaluation/
│ └── evaluator.py # pass@1 metrics, Pareto frontier, plots
│
├── scripts/
│ ├── train.py # Training entry point
│ ├── evaluate.py # Evaluation entry point
│ ├── smoke_test.py # End-to-end pipeline validation
│ └── baseline_eval.py # Base model coding ability test (supports --checkpoint)
│
├── checkpoints/ # Saved model checkpoints (gitignored)
│ ├── d1_0/
│ │ ├── iter_0019/ # Checkpoint every 20 iterations
│ │ ├── iter_0039/
│ │ ├── best/ # Checkpoint at peak mid-training eval reward
│ │ └── final/ # Final checkpoint after all iterations
│ └── d1_1/
│ ├── iter_0019/
│ ├── best/
│ └── final/
│
├── outputs/ # Eval results, plots (gitignored)
└── requirements.txt
The dataset is HumanEvalComm - 164 Python coding problems from HumanEval, each with multiple degraded versions of the problem specification.
It is automatically downloaded from HuggingFace the first time you run training or evaluation. No manual download is needed.
What the degradations look like:
| Type | Field | What changes |
|---|---|---|
| Ambiguity | prompt1a | Specific values replaced with vague terms ("by 1" → "by a number") |
| Inconsistency | prompt1c | Examples contradict the description |
| Incompleteness | prompt1p | All examples and details stripped; only a stub remains |
| Ambiguity + Inconsistency | prompt2ac | Both combined |
| Ambiguity + Incompleteness | prompt2ap | Both combined |
| Inconsistency + Incompleteness | prompt2cp | Both combined |
| All three | prompt3acp | Ambiguity + Inconsistency + Incompleteness |
Train/test split: The split is stratified at the base problem level - problems are grouped by their rarest available variant, then each group is split proportionally (~60% eval, ~40% train). This guarantees all 7 degradation types appear in both train and eval sets. All variants of a base problem go to the same set (no leakage). Enforced in src/data/dataset.py.
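A minimal sketch of such a split is shown below. It is not the code in src/data/dataset.py: the function name, the input shape (a dict from base problem ID to its set of variant names), and the rule that every group of two or more problems contributes at least one problem to each side are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(problem_variants, eval_fraction=0.6, seed=42):
    """Split base problem IDs into (train, eval) sets.

    problem_variants: dict mapping base problem ID -> set of variant names.
    Problems are grouped by their globally rarest variant, and each group
    is split proportionally, so all variants of a problem stay together.
    """
    counts = Counter(v for vs in problem_variants.values() for v in vs)
    groups = defaultdict(list)
    for base_id in sorted(problem_variants):
        rarest = min(problem_variants[base_id], key=lambda v: (counts[v], v))
        groups[rarest].append(base_id)

    rng = random.Random(seed)
    train, eval_ = set(), set()
    for variant in sorted(groups):
        ids = groups[variant]
        rng.shuffle(ids)
        n_eval = round(len(ids) * eval_fraction)
        if len(ids) >= 2:
            # Assumption: keep at least one problem on each side of the split.
            n_eval = min(max(n_eval, 1), len(ids) - 1)
        eval_.update(ids[:n_eval])
        train.update(ids[n_eval:])
    return train, eval_


# Toy data: every problem has prompt1a; a few also have rarer variants.
toy = {f"p{i}": {"prompt1a"} for i in range(10)}
for i in range(4):
    toy[f"p{i}"].add("prompt2cp")
for i in range(2):
    toy[f"p{i}"].add("prompt3acp")
train_ids, eval_ids = stratified_split(toy)
```

Because the split is made over base problem IDs, no variant of an eval problem can leak into training.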
Which variants are used for training is controlled by data.use_variants in the config:
data:
  use_variants:
    - prompt1a
    - prompt1c
    - prompt1p
    - prompt2ac
    - prompt2ap
    - prompt2cp
    - prompt3acp
There are two system prompts in this project. Both are defined in source code, not config files.
Located in src/environment/env.py - this is what the agent sees at every turn. Prompts are formatted using the model's chat template (<|im_start|>/<|im_end|> for Qwen):
You are a coding assistant. Given a coding task below, you must either:
- Ask a clarifying question by responding with [ASK] followed by your question.
- Write your Python solution by responding with [ANSWER] followed by the code.
Important rules:
- Respond with ONLY one action per turn ([ASK] or [ANSWER]).
- When you have enough information, write the code.
- Do not explain your reasoning - just output the action directly.
The full conversation history is appended below this prompt at each turn.
Located in src/environment/user_simulator.py - this is what GPT-4o-mini receives:
Mode: count (default) - the simulator counts atomic questions and returns QUESTION_COUNT: N:
You are a helpful assistant who holds the complete specification for a coding problem.
Full specification:
{original_prompt}
Note: The agent may refer to the function by a different name (e.g., "candidate").
Treat any function name the agent uses as referring to the function described above.
Rules:
1. Answer ONLY the specific question(s) the agent asks. Do not volunteer extra information.
2. Do not reveal test cases or the full solution.
3. Keep your answer brief and factual.
4. After your answer, on a new line write EXACTLY:
QUESTION_COUNT: N
where N is the number of distinct atomic questions you identified in the agent's message.
Counting rules for N:
- Each distinct piece of information being requested = 1.
- "What is X and what is Y?" = 2.
- Conjunctions like "and", "also", "additionally" between separate requests = separate questions.
Mode: truncate - only the first question (up to the first ?) is passed to the simulator, cost_q is always 1:
You are a helpful assistant who holds the complete specification for a coding problem.
Full specification:
{original_prompt}
Note: The agent may refer to the function by a different name (e.g., "candidate").
Treat any function name the agent uses as referring to the function described above.
Rules:
1. Answer ONLY the single question below. Do not volunteer extra information.
2. Do not reveal test cases or the full solution.
3. Keep your answer brief and factual.
Switch between modes via environment.multi_question_mode in configs/default.yaml.
Why two modes? The agent may learn to pack multiple questions into one [ASK] turn. In count mode, each atomic question is charged separately (2 questions in one turn = cost_q = 2). In truncate mode, only the first question is answered and cost_q = 1 always.
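The two modes could be handled roughly as follows. This is a sketch, not the code in src/environment/user_simulator.py: both function names are hypothetical, and charging one question when the QUESTION_COUNT line is missing is an assumed fallback.

```python
import re

# Matches the trailing "QUESTION_COUNT: N" line requested by the count-mode prompt.
_COUNT_RE = re.compile(r"^\s*QUESTION_COUNT:\s*(\d+)\s*$", re.MULTILINE)


def parse_count_reply(reply: str) -> tuple[str, int]:
    """Count mode: split a simulator reply into (answer_text, cost_q)."""
    match = _COUNT_RE.search(reply)
    if match is None:
        # Assumption: if the simulator omits the count line, charge one question.
        return reply.strip(), 1
    return reply[:match.start()].strip(), int(match.group(1))


def truncate_to_first_question(message: str) -> str:
    """Truncate mode: keep only the text up to and including the first '?'."""
    head, sep, _ = message.partition("?")
    return (head + sep).strip()


answer, cost_q = parse_count_reply("Increment by 1.\nQUESTION_COUNT: 2")
first = truncate_to_first_question("What is X? And what is Y?")
```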
# Train d₁=0 policy (agent learns to never ask)
python scripts/train.py --d1 0
# Train d₁=0.5 policy (agent may ask ≤0.5 questions on average)
python scripts/train.py --d1 0.5 constraint.d1=0.5
# Train d₁=1 policy (agent may ask ≤1 question on average)
python scripts/train.py --d1 1
Train all three sequentially:
python scripts/train.py --d1 0 && python scripts/train.py --d1 0.5 constraint.d1=0.5 && python scripts/train.py --d1 1
Resume from a checkpoint:
python scripts/train.py --d1 1 --resume checkpoints/d1_1/iter_0039
# Reduce iterations for a quick smoke-test
python scripts/train.py --d1 1 training.n_iterations=5 training.rollout_batch_size=16
# Change the multi-question mode
python scripts/train.py --d1 1 environment.multi_question_mode=truncate
# Increase the question budget
python scripts/train.py --d1 2 constraint.d1=2
Each iteration consists of:
- Rollout collection (~5–6 min): 32 episodes collected per iteration. The agent sees a degraded problem spec, generates [ASK] or [ANSWER] actions, and the environment responds. API calls to GPT-4o-mini happen asynchronously. Code execution happens in a subprocess sandbox.
- Advantage computation: GAE advantages are computed for three return streams: reward (pass@1), question cost, and turn cost.
- PPO update (~3 min): 4 epochs over the buffer in mini-batches (size set by training.ppo_mini_batch_size). The Lagrangian advantage A_lag = A_reward - λ₁·A_q - λ₂·A_t is used. Only LoRA weights are updated.
- Lagrange update: λ₁ += lr_lambda × (avg_questions - d₁), where the step size is constraint.lr_lambda. If the agent asked too many questions, λ₁ rises - making questions more expensive next iteration.
- Logging: Per-iteration stats printed to stdout.
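The Lagrangian advantage and the dual update from the steps above can be sketched in plain Python. The function names are hypothetical, clamping λ to [0, lambda_max] is an assumption based on the lambda_max setting, and the step size 0.05 in the usage lines is purely illustrative.

```python
def lagrangian_advantage(a_reward: float, a_q: float, a_t: float,
                         lam1: float, lam2: float) -> float:
    """Combine the three advantage streams: A_lag = A_reward - λ1·A_q - λ2·A_t."""
    return a_reward - lam1 * a_q - lam2 * a_t


def dual_update(lam: float, avg_cost: float, budget: float,
                lr: float, lam_max: float) -> float:
    """One dual-ascent step on a Lagrange multiplier.

    λ rises when the average cost exceeds the budget and falls otherwise.
    Assumption: the result is clamped to the range [0, lam_max].
    """
    return min(max(lam + lr * (avg_cost - budget), 0.0), lam_max)


# Over budget (0.75 questions against a budget of 0): the multiplier rises.
lam1 = dual_update(0.0, avg_cost=0.75, budget=0.0, lr=0.05, lam_max=10.0)
# Under budget (0.5 questions against a budget of 1): the multiplier stays at 0.
lam1_slack = dual_update(0.0, avg_cost=0.5, budget=1.0, lr=0.05, lam_max=10.0)
```

With λ₁ at zero the question-cost advantage has no effect on the policy gradient, which is why a budget that never binds leaves the agent free to ask.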
Expected training output:
iter= 0 | reward=0.7070 | q=0.75 (budget=0) | λ₁=0.0375 | ppo=-0.0036 | vf=1.5297 | approx_kl=0.0058 | kl_tok=0.0000 | kl_seq=-0.0062 | kl_max=0.1864 | t=475s
iter= 1 | reward=0.7069 | q=0.89 (budget=0) | λ₁=0.0820 | ppo=0.0161 | vf=0.6088 | approx_kl=0.0064 | kl_tok=0.0002 | kl_seq=0.0016 | kl_max=0.2715 | t=537s
...
Key metrics to monitor:
- approx_kl: should be 0.01-0.05 (if ~3+, log-prob computation has a mismatch)
- kl_tok: per-token KL from reference model, should stay near 0
- kl_max: worst-case sequence KL, should stay under 0.5
- reward: should trend upward over many iterations
- q: average questions per episode, should converge toward the budget
Total training time estimate:
- ~8–9 min/iteration × 80 iterations = ~11 hours per d₁ setting
- d₁=0, d₁=0.5, and d₁=1 = ~33 hours total (run sequentially)
Per-iteration logs are saved alongside each checkpoint as log.json:
checkpoints/d1_1/iter_0079/log.json
python scripts/evaluate.py --checkpoint checkpoints/d1_1/final --d1 1
Results are saved to outputs/eval/results_d1_1.json.
After all three policies (d₁=0, d₁=0.5, d₁=1) are trained, run the sweep in either of two modes:
# Pareto over best-eval checkpoints (default) - uses checkpoints/d1_N/best/
python scripts/evaluate.py --sweep --output_dir outputs/pareto
# Pareto over final checkpoints - uses checkpoints/d1_N/final/
python scripts/evaluate.py --sweep --sweep_mode final --output_dir outputs/pareto
--sweep_mode best uses the checkpoint saved when mid-training eval reward peaked (best generalisation). --sweep_mode final uses the end-of-training checkpoint. If the preferred checkpoint does not exist, the sweep falls back to the other, then to the latest iter_*.
Both modes can write to the same --output_dir - output files are named with the mode suffix:
| Mode | JSON | Plot |
|---|---|---|
| best | pareto_best.json | pareto_frontier_best.png |
| final | pareto_final.json | pareto_frontier_final.png |
Each sweep:
- Finds all checkpoints/d1_N/ directories
- Runs each on the held-out eval set: the 100 held-out base problems and their degraded variants (greedy decoding, temperature=0)
- Prints a results table
- Saves a PNG plot (avg_questions on x-axis, pass@1 on y-axis, each point labelled with its d₁)
Example output table:
d1 pass@1 avg_q avg_turns n
---- -------- ------- --------- ------
0 0.4120 0.00 1.00 400
1 0.5830 0.97 1.92 400
Eval set: 469 problems (full HumanEvalComm eval split, all 7 degradation types) · Eval temp: 0.0 (greedy) · Model: Qwen2.5-Coder-7B-Instruct + LoRA (rank 16)
Two evaluation modes are reported:
- MT pass@1: multi-turn mode, where the agent may ask clarifying questions before answering
- ST pass@1: single-turn mode, where the agent is told to write code immediately with no option to ask questions; measures raw code generation ability as a lower bound
| Policy | Checkpoint | MT pass@1 | 95% CI | Avg Qs | Δ vs Baseline | p |
|---|---|---|---|---|---|---|
| Baseline (untrained) | - | 0.685 | [0.648, 0.721] | 0.855 | - | - |
| D1=1 | iter_0029 | 0.747 | [0.713, 0.781] | 0.846 | +6.2pp | p < 0.0001 |
| D1=0.5 | iter_0069 | 0.660 | [0.624, 0.696] | 0.480 | −2.5pp | p = 0.173 (n.s.) |
| D1=0 | iter_0049 | 0.604 | [0.567, 0.642] | 0.337 | −8.0pp | p < 0.0001 |
The D1=1 policy is the only trained policy that strictly outperforms the untrained baseline. D1=0.5 is statistically indistinguishable from baseline. D1=0 significantly hurts performance.
| Type | n | Baseline MT | D1=1 MT | Gain | p |
|---|---|---|---|---|---|
| 1a (ambiguity) | 100 | 0.743 | 0.765 | +2.2pp | p = 0.384 |
| 1c (inconsistency) | 99 | 0.833 | 0.877 | +4.4pp | p = 0.027 |
| 1p (incompleteness) | 100 | 0.636 | 0.716 | +8.0pp | p = 0.006 |
| 2ac (ambiguity + inconsistency) | 99 | 0.621 | 0.702 | +8.1pp | p = 0.002 |
| 2ap (ambiguity + incompleteness) | 43 | 0.561 | 0.607 | +4.6pp | p = 0.316 |
| 2cp (inconsistency + incompleteness) | 21 | 0.502 | 0.706 | +20.4pp | p = 0.007 |
| 3acp (all three) | 7 | 0.654 | 0.724 | +7.0pp | p = 0.474 |
| Overall | 469 | 0.685 | 0.747 | +6.2pp | p < 0.0001 |
The largest gains are on the most degraded problem types (2cp, 2ac, 1p), exactly where clarification is most valuable.
| Comparison | ΔMT | 95% CI | p |
|---|---|---|---|
| D1=1 vs Baseline | +6.24pp | [+3.46pp, +9.03pp] | p < 0.0001 |
| D1=0.5 vs Baseline | −2.48pp | [−6.04pp, +1.08pp] | p = 0.173 ✗ |
| D1=0 vs Baseline | −8.03pp | [−11.56pp, −4.50pp] | p < 0.0001 |
| D1=0.5 vs D1=1 | −8.72pp | [−12.21pp, −5.24pp] | p < 0.0001 |
| D1=0 vs D1=1 | −14.27pp | [−17.93pp, −10.62pp] | p < 0.0001 |
| D1=0 vs D1=0.5 | −5.55pp | [−8.41pp, −2.70pp] | p = 0.0001 |
Each unit reduction in avg questions costs progressively more pass@1:
| Policy | Avg Qs | MT pass@1 | Δ pass@1 per −0.1 Qs |
|---|---|---|---|
| Baseline | 0.855 | 0.685 | - |
| D1=1 (iter_0029) | 0.846 | 0.747 | +6.9pp (RL gain, not constraint cost) |
| D1=0.5 (iter_0069) | 0.480 | 0.660 | −2.4pp per −0.1 Qs |
| D1=0 (iter_0049) | 0.337 | 0.604 | −3.9pp per −0.1 Qs |
The curve turns negative at D1=0: restricting questions below the model's natural rate hurts more than never training at all.
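The per-0.1-question rates in the table can be reproduced from the rounded table values. The helper name below is hypothetical, and because the inputs are rounded the results agree with the table only to one decimal place.

```python
def cost_per_tenth_question(pass_from: float, qs_from: float,
                            pass_to: float, qs_to: float) -> float:
    """Change in pass@1 (percentage points) per 0.1 reduction in avg questions."""
    delta_pp = (pass_to - pass_from) * 100
    tenths_removed = (qs_from - qs_to) / 0.1
    return delta_pp / tenths_removed


# D1=1 (0.747 pass@1, 0.846 Qs) -> D1=0.5 (0.660 pass@1, 0.480 Qs)
rate_1_to_half = cost_per_tenth_question(0.747, 0.846, 0.660, 0.480)
# D1=0.5 (0.660 pass@1, 0.480 Qs) -> D1=0 (0.604 pass@1, 0.337 Qs)
rate_half_to_0 = cost_per_tenth_question(0.660, 0.480, 0.604, 0.337)
print(f"{rate_1_to_half:.1f}pp and {rate_half_to_0:.1f}pp per -0.1 Qs")
```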
1. RL training significantly improves multi-turn performance The D1=1 policy (iter_0029) achieves a +6.24pp gain over the untrained baseline (paired t=4.39, p<0.0001, 95% CI [+3.5pp, +9.0pp]). This is the scientifically honest D1=1 result, chosen as the last checkpoint before memorization-driven instability sets in after iter ~50.
2. Gains are universal but concentrated on harder problem types Every degradation type improves at iter_0029. The largest gains are on the most degraded problems: 2cp (+20.4pp, p=0.007), 2ac (+8.1pp, p=0.002), and 1p (+8.0pp, p=0.006). Gains on 1a (+2.2pp) and 2ap (+4.6pp) exist but are not individually significant.
3. The model learns when to ask, not just how much to ask Average questions barely changes (0.855 → 0.846). The improvement comes from asking more strategically: fewer questions on easy types (1a: −0.07, 1c: −0.03) where clarification adds noise, and slightly more on hard types (2ap: +0.07, 2cp: +0.05) where it genuinely helps.
4. Answer quality also improves (0-question episodes) When the D1=1 trained model answers directly (0 questions), its pass@1 is 0.792 vs 0.754 for baseline. RL training improved code generation quality, not just clarification strategy.
5. The Lagrange constraint barely activated for D1=1 λ₁ ≈ 0.000 throughout almost all of training. Budget=1 is effectively never binding; the model naturally asks ≤1 question on average. The constraint machinery works, but the natural question rate is already near the budget.
6. Original prompt outperforms few-shot prompt at baseline The untrained model with the original prompt achieves MT pass@1 = 0.685, vs 0.586 for the few-shot prompt baseline, a 9.9pp advantage before any training. Few-shot examples appear to hurt the model's natural clarification behavior.
7. D1=1 shows instability and overfitting beginning around iteration 34 At iter 34, avg_Qs spikes to 1.094 (first and only budget breach), λ₁ briefly activates, and both avg_Qs and training reward become volatile. The HumanEvalComm training split has 64 base problems (302 training problems once variants are counted); by iter ~34, the model sees each base problem ~34 times on average, enough to memorize answers and stop asking. Iter_0029 is the last checkpoint before this instability.
8. The D1=0.5 constraint costs a statistically significant 8.7pp vs D1=1 The D1=0.5 policy (iter_0069) achieves pass@1=0.660 vs D1=1's 0.747, a −8.7pp gap (p<0.0001, 95% CI [−12.21pp, −5.24pp]). The constraint is fully internalized (avg_Qs=0.480, just under the 0.5 budget). The cost is concentrated on types requiring clarification most: 1p (−14.2pp) and 2cp (−23.3pp).
9. D1=0 training collapses pass@1 to below baseline Mid-train evals show pass@1 collapsing from 0.734 (iter 29) → 0.550 (iter 49) as λ₁ climbs and avg_Qs → 0. Full eval confirms MT pass@1 = 0.604, below the untrained baseline (0.685). Training with budget=0 is counterproductive.
10. ST pass@1 is effectively invariant to training (0.614–0.624 across all policies) RL training does not meaningfully change raw coding ability. Across baseline, D1=1, D1=0.5, and D1=0, ST pass@1 ranges only 0.614–0.624 (+1.0pp across the full spectrum). The LoRA update affects question-asking behavior and MT strategy but not underlying code generation competence.
11. Asking 2+ questions is a failure signal Across all policies, episodes with 2+ questions score lower than 1-question episodes. For D1=0.5: 2-Qs→0.588, 3-Qs→0.278, 6-Qs→0.000. Multiple questions indicate either poorly targeted asks, a problem genuinely unsolvable through clarification, or the model failing to use the answers it receives. A well-calibrated policy should rarely need more than 1 question per episode.
12. Diminishing (and eventually negative) returns to restricting questions Each unit of question budget sacrificed costs progressively more pass@1. Going from D1=1 → D1=0.5 (−0.37 avg_Qs) costs 8.7pp, or ~2.4pp per −0.1 Qs. Going from D1=0.5 → D1=0 (−0.14 avg_Qs) costs another 5.6pp, or ~3.9pp per −0.1 Qs, a 60% steeper rate. The curve turns negative at D1=0: MT pass@1 (0.604) falls below the untrained baseline (0.685). The optimal operating point appears near D1=1 (≈ the model's natural question rate), where RL can improve question quality without paying a constraint cost.
13. D1=0 questions actively hurt performance (MT < ST) D1=0 MT pass@1 (0.604) is below D1=0 ST pass@1 (0.624). When the D1=0 model asks a question (avg 0.337 per episode despite budget=0), the answer makes things worse. By-Qs breakdown: 0-Qs → MT=0.641 (n=346); 1-Qs → MT=0.537 (n=109). The policy learned to suppress questions but not to suppress them entirely; the residual questions are poorly targeted or the model fails to use the clarifications it receives.
| Budget | Model | Description |
|---|---|---|
| d1=0 | | Never asks - guesses from degraded spec alone |
| d1=0.5 | | Asks sparingly - at most 0.5 questions on average |
| d1=1 | | Asks when worth it - at most 1 question on average |
All checkpoints are saved under checkpoints/ (gitignored):
checkpoints/
├── d1_0/
│ ├── iter_0019/ # Saved every 20 iterations
│ │ ├── adapter_config.json
│ │ ├── adapter_model.safetensors ← LoRA weights (~80 MB)
│ │ ├── tokenizer.json
│ │ ├── value_heads.pt ← Three MLP heads
│ │ ├── dual_variables.pt ← λ₁, λ₂ values
│ │ ├── train_state.pt ← Optimizer + scheduler state
│ │ └── log.json ← Training log up to this point
│ ├── iter_0039/
│ ├── best/ ← Overwritten whenever mid-training eval reward improves
│ └── final/ ← Checkpoint after the last training iteration
└── d1_1/
├── iter_0019/
├── best/
└── final/
| File | Contents | Size |
|---|---|---|
| adapter_model.safetensors | LoRA adapter weights (the only trained LLM parameters) | ~80 MB |
| adapter_config.json | LoRA configuration (rank, alpha, target modules) | <1 KB |
| tokenizer.json | Tokenizer files (copied from base model) | ~10 MB |
| value_heads.pt | Three MLP value heads (reward, q_cost, t_cost) | ~50 MB |
| dual_variables.pt | Current λ₁ and λ₂ values | <1 KB |
| train_state.pt | Optimizer and LR scheduler state (for resuming) | ~500 MB |
| log.json | Per-iteration training metrics | ~100 KB |
Note: The base Qwen2.5-Coder-7B weights (~14 GB) are NOT saved - they are downloaded from HuggingFace and remain frozen. Only the LoRA adapter (~80 MB) is saved. To deploy a checkpoint, you need both the base model and the LoRA adapter.
from omegaconf import OmegaConf
from src.models.agent import Agent
cfg = OmegaConf.load("configs/default.yaml")
agent = Agent(cfg)
agent.load_lora("checkpoints/d1_1/final")
# Generate a response
action_text, _, _, _, _ = agent.generate(prompt)
print(action_text) # "[ASK] What number should each element be incremented by?"
All configuration lives in configs/default.yaml. Every value can be overridden at the command line using OmegaConf dot-notation (e.g., training.n_iterations=40).
model:
  name: Qwen/Qwen2.5-Coder-7B-Instruct # HuggingFace model ID
  dtype: bfloat16 # bf16 for A100s
  lora_rank: 16 # LoRA rank (~40M trainable params)
  lora_alpha: 32 # LoRA scaling factor
  lora_dropout: 0.0 # Set to 0 to avoid train/eval mode mismatch
  lora_target_modules: # Which linear layers to apply LoRA to
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
  gradient_checkpointing: true # Reduces activation memory from ~10GB to ~3GB
  train_device: cuda:0 # GPU for PPO update
  rollout_device: cuda:1 # GPU for rollout inference
environment:
  max_turns: 6 # Hard cap on conversation length
  max_new_tokens: 512 # Max tokens per agent action
  max_seq_len: 2048 # Max total prompt length (tokens)
  rollout_temperature: 0.8 # Exploration temperature during rollout
  multi_question_mode: count # "count" or "truncate" (see Section 7)
  efficiency_alpha: 0.025 # Small bonus for fewer turns (tiebreaker vs waste)
  efficiency_beta: 0.025 # Small bonus for fewer questions
user_simulator:
  model: gpt-4o-mini # OpenAI model for the user simulator
  temperature: 0.0 # Deterministic responses
  max_tokens: 300
  max_concurrent_api: 15 # Max parallel API calls (rate limit safety)
code_executor:
  timeout: 10.0 # seconds per subprocess execution
  partial_credit: true # reward = fraction of passing assertions, not binary
training:
  rollout_batch_size: 32 # Episodes per iteration
  ppo_epochs: 4 # PPO update passes per batch
  ppo_mini_batch_size: 8 # Mini-batch size per GPU
  clip_epsilon: 0.2 # PPO clip range
  gamma: 1.0 # Discount factor (1.0 = no discounting)
  gae_lambda: 0.95 # GAE smoothing
  kl_coeff: 0.25 # KL penalty (keeps policy near reference)
  target_kl: 0.05 # Early-exit PPO epoch when approx_kl exceeds this
  entropy_coeff: 0.01 # Entropy bonus (prevents collapse)
  lr_policy: 5.0e-6 # LoRA learning rate
  lr_value: 1.0e-4 # Value head learning rate
  optimizer: adamw_8bit # 8-bit AdamW (saves ~8 GB vs full)
  warmup_steps: 20
  n_iterations: 80 # Training iterations per d₁ setting
  save_interval: 5 # Save checkpoint every N iterations
  eval_interval: 10 # Run eval every N iterations
constraint:
  d1: 1 # Question budget (set to 0 or 1 for this run)
  lambda_init: 0.0 # Starting value for λ₁
  lambda_max: 10.0 # Maximum value for λ₁
  lr_lambda: 0.1 # Lagrange multiplier step size
  d2: 4 # Turn budget (soft, secondary constraint)
  lambda2_init: 0.0
  lambda2_max: 5.0
  lr_lambda2: 0.005
data:
  hf_dataset: jie-jw-wu/HumanEvalComm # HuggingFace dataset identifier
  eval_size: 100 # Held-out base problems for evaluation
  seed: 42
  use_variants: # Degradation types used for training
    - prompt1a
    - prompt1c
    - prompt1p
    - prompt2ac
    - prompt2ap
    - prompt2cp
    - prompt3acp
Why Qwen2.5-Coder-7B and not Llama-3.1-8B? The HumanEvalComm paper shows that code-specialized models (CodeQwen, DeepSeek Coder) significantly outperform general-purpose models on degraded specs. Qwen2.5-Coder-7B scores ~70% on standard HumanEval vs ~55% for Llama-3.1-8B. It is also a similar size (~14GB bf16), uses the same LoRA config, and is not gated on HuggingFace. We verified that Llama-3.1-8B scored 0% on smoke test episodes; Qwen Coder provides a much stronger coding baseline for PPO to build on.
Why LoRA and not full fine-tuning? Full fine-tuning on 7B with AdamW would require ~56 GB for optimizer states alone. LoRA limits trainable parameters to ~40M, reducing optimizer memory to ~800 MB. The frozen base weights also prevent catastrophic forgetting of Python syntax knowledge.
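The ~40M and ~56 GB figures can be sanity-checked with simple arithmetic. The layer dimensions below are assumptions taken from the published Qwen2.5-7B architecture (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128) rather than from this repository, and the optimizer estimate assumes two fp32 AdamW moments per parameter.

```python
# Assumed Qwen2.5-7B dimensions (not read from this repository).
HIDDEN = 3584
INTERMEDIATE = 18944
KV_DIM = 4 * 128        # 4 key/value heads x head dimension 128
LAYERS = 28
RANK = 16               # model.lora_rank

# (input features, output features) of each LoRA-targeted linear layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# A rank-r adapter adds an (in x r) and an (r x out) matrix per layer.
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in TARGETS.values())
lora_params = per_layer * LAYERS
adapter_mb_bf16 = lora_params * 2 / 1e6

# Full fine-tuning: two fp32 AdamW moments (8 bytes) per parameter.
full_adam_state_gb = 7_000_000_000 * 8 / 1e9

print(f"{lora_params:,} LoRA params, about {adapter_mb_bf16:.0f} MB in bf16")
print(f"{full_adam_state_gb:.0f} GB of AdamW state for full fine-tuning")
```

Under these assumptions the adapter size in bf16 also lines up with the ~80 MB adapter_model.safetensors file.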
Why constrained prefix decoding?
The base model sometimes outputs code without the required [ASK] or [ANSWER] prefix, resulting in malformed actions and zero reward. Constrained prefix decoding forces every generation to start with one of the two valid prefixes. The model still chooses which prefix by comparing their log-probs given the prompt - so the decision is learned, not random. This eliminates wasted training iterations on formatting errors.
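The prefix decision can be illustrated without a model. In the sketch below the two log-probabilities are passed in as plain numbers (in the real pipeline they would come from the policy), the function name is hypothetical, and sampling in proportion to the renormalised probabilities is an assumption; a greedy variant would simply take the larger of the two.

```python
import math
import random


def choose_prefix(logp_ask: float, logp_answer: float,
                  rng: random.Random) -> tuple[str, float]:
    """Pick [ASK] or [ANSWER] from the policy's log-probs for each prefix.

    The two scores are renormalised so they sum to 1, then one prefix is
    sampled. Returns (chosen prefix, probability of [ASK]).
    """
    m = max(logp_ask, logp_answer)          # subtract the max for stability
    w_ask = math.exp(logp_ask - m)
    w_answer = math.exp(logp_answer - m)
    p_ask = w_ask / (w_ask + w_answer)
    prefix = "[ASK]" if rng.random() < p_ask else "[ANSWER]"
    return prefix, p_ask


rng = random.Random(0)
_, p_equal = choose_prefix(-1.0, -1.0, rng)           # equal scores
prefix, p_small = choose_prefix(-100.0, -0.01, rng)   # [ANSWER] dominates
```

Either way the output always begins with a valid prefix, so no episode is lost to a malformed action.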
Why PPO-Lagrangian and not fixed-penalty RL? A fixed penalty requires hand-tuning - you don't know in advance how large the penalty needs to be to achieve exactly d₁=1 question on average. PPO-Lagrangian finds this value automatically via dual ascent: if the agent asks too many questions, λ₁ rises until it stops; if it asks too few, λ₁ falls. This also enables sweeping multiple d₁ values without re-tuning.
Why two Lagrange multipliers?
λ₁ enforces the question budget d₁. λ₂ is a soft secondary constraint on turns. Turn cost exists because asking many focused questions across many turns is still expensive, even if question count is low.
Multi-question handling (the multi_question_mode setting):
When the agent writes [ASK] What is X? And what is Y?, it is asking two questions in one turn. In count mode (default), the user simulator counts 2 atomic questions and the environment charges cost_q = 2. This prevents the agent from exploiting the budget by batching questions. In truncate mode, only the first question is answered and cost_q = 1 always. The default count mode is more principled but depends on GPT-4o-mini counting accurately; truncate mode is simpler but restricts the agent's action space. Switch with environment.multi_question_mode=truncate.
Function name handling in code execution:
Degraded specs sometimes rename functions to candidate, but test cases use the original name. The code executor aliases the last top-level function in the agent's code to the expected entry_point. This avoids breaking helper functions (e.g., is_prime defined before is_multiply_prime). Additionally, helper functions from the degraded spec (e.g., poly for find_zero) are automatically extracted and prepended to the test program so tests can reference them.
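One way to implement the aliasing step is with the standard ast module. This is a sketch rather than the code in src/environment/code_executor.py: the function name is hypothetical, and skipping the alias when entry_point is already defined is an assumption.

```python
import ast


def alias_entry_point(code: str, entry_point: str) -> str:
    """Append `entry_point = <last top-level function>` to the agent's code."""
    tree = ast.parse(code)
    funcs = [node.name for node in tree.body
             if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
    if not funcs or entry_point in funcs:
        # Assumption: nothing to alias, or the expected name already exists.
        return code
    return f"{code.rstrip()}\n\n{entry_point} = {funcs[-1]}\n"


agent_code = (
    "def helper(x):\n"
    "    return x + 1\n"
    "\n"
    "def candidate(x):\n"
    "    return helper(x) * 2\n"
)
namespace = {}
exec(alias_entry_point(agent_code, "double_next"), namespace)
result = namespace["double_next"](3)
```

Aliasing the last function rather than the first keeps helpers defined earlier in the file intact.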
String output quoting in test assertions:
Some test case outputs are bare strings (e.g., fdcb not 'fdcb'). The executor detects these and wraps them in repr() so assertions compare against the correct type. Template-based test relations (using $demo$ and $input$ placeholders) are also expanded correctly.
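A minimal sketch of the bare-string check follows. The function name is hypothetical, and the rule used here (anything that does not parse as a Python literal is treated as a bare string) is an assumption about how detection could work.

```python
import ast


def normalise_expected(output: str) -> str:
    """Return source text that evaluates to the expected output.

    Outputs that already parse as Python literals are kept unchanged;
    anything else is treated as a bare string and quoted with repr().
    """
    try:
        ast.literal_eval(output)
        return output
    except (ValueError, SyntaxError):
        return repr(output)


quoted = normalise_expected("fdcb")        # bare string gets quotes
unchanged = normalise_expected("[1, 2]")   # already a valid literal
```

One caveat of this rule: a bare string that happens to spell a literal, such as None or 123, is left as that literal.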
Function name handling in the user simulator:
The simulator's system prompt tells GPT-4o-mini to treat any function name the agent uses (e.g., candidate) as referring to the function in the original spec. Without this, the simulator would say "I don't know about candidate" when the original spec defines decode_cyclic.
Why partial credit for code evaluation? Binary pass/fail gives a flat reward landscape. Partial credit (fraction of assertions passing) gives smoother gradients - an agent that gets 8/10 tests right receives reward 0.8, not 0. This significantly stabilises PPO training.
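The reward computation can be sketched as below. For brevity this version runs everything in-process with exec, which is an illustration only; the project's executor runs code in a sandboxed subprocess with a timeout. The function name is hypothetical.

```python
def partial_credit(solution_code: str, assertions: list[str]) -> float:
    """Fraction of assertion statements that pass against the solution."""
    if not assertions:
        return 0.0
    namespace: dict = {}
    try:
        exec(solution_code, namespace)   # define the candidate function(s)
    except Exception:
        return 0.0                       # code that fails to load scores 0
    passed = 0
    for assertion in assertions:
        try:
            exec(assertion, namespace)
            passed += 1
        except Exception:                # AssertionError or a runtime error
            pass
    return passed / len(assertions)


solution = "def incr(xs):\n    return [x + 1 for x in xs]\n"
tests = [
    "assert incr([1]) == [2]",
    "assert incr([]) == []",
    "assert incr([1, 2]) == [3, 4]",   # deliberately wrong expectation
    "assert incr([0]) == [1]",
]
reward = partial_credit(solution, tests)
```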
Why compute old_log_probs via forward pass instead of generate() scores?
This was the most critical debugging finding in the project. HuggingFace's model.generate() returns scores that have been processed by internal logit processors - in Qwen2.5-Coder-7B, this resulted in 152,063 out of 152,064 vocabulary tokens being set to -inf, even with top_k=0. The model's own generation config applies aggressive filtering that cannot be disabled through standard parameters.
This caused a systematic log-prob mismatch: during rollout, log_softmax over the filtered scores normalized over ~1-4 tokens, while score() during PPO update used a clean forward pass normalizing over the full 152k vocabulary. The same token received different log-probs depending on which path computed them, inflating approx_kl to ~3 and permanently saturating PPO's clipping mechanism.
The fix was to stop using generate() scores entirely for log-prob computation. After generate() produces the token sequence, a separate forward pass computes old_log_probs from raw model logits - the exact same method score() uses. This brings approx_kl down to ~0.007.
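The size of the mismatch is easy to see on a toy example. The sketch below uses a made-up five-token vocabulary and plain Python instead of the real model: the same token gets a different log-prob depending on whether the softmax normalises over the raw logits or over logits where most entries were filtered to -inf.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


NEG_INF = float("-inf")
raw_logits = [2.0, 1.0, 0.5, 0.0, -1.0]              # clean forward pass
filtered_logits = [2.0, 1.0, NEG_INF, NEG_INF, NEG_INF]  # after logit processors

token = 0  # the token that was actually sampled
lp_raw = log_softmax(raw_logits)[token]
lp_filtered = log_softmax(filtered_logits)[token]
gap = lp_filtered - lp_raw
print(f"raw={lp_raw:.4f} filtered={lp_filtered:.4f} gap={gap:.4f}")
```

The gap exists before any policy update has happened, so a PPO ratio built from one log-prob of each kind starts away from 1; computing both from raw logits removes the discrepancy.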
The debugging process went through three stages:
- Prefix scoring mismatch - brought approx_kl from ~12 to ~3 (rollout scored only 2 of 4-5 prefix tokens; fix: both sides skip prefix and score only continuation tokens)
- False leads - top_k=0, reducing PPO epochs, disabling LoRA dropout - none addressed the remaining ~3
- Root cause - the thorough debug script revealed the logit processor filtering, leading to the forward pass fix that resolved the issue completely
Why a stratified train/eval split?
Rare degradation types (prompt3acp = 12 problems, prompt2cp = 35 problems) could end up entirely in one set with a random split. The stratified split groups problems by their rarest variant and splits each group proportionally, guaranteeing all 7 types appear in both sets.
Why ast.literal_eval and not json.loads for test cases?
HumanEvalComm stores test cases as Python-literal strings (single-quoted dicts), not JSON. json.loads will fail on them. Always use ast.literal_eval.
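The difference is easy to demonstrate. The test-case string below is a made-up example in the single-quoted style described above, not an actual dataset row, and the helper name is hypothetical.

```python
import ast
import json

# Hypothetical test case stored as a Python-literal string (single quotes).
raw_case = "{'input': '[1, 2, 3]', 'output': '[2, 3, 4]'}"


def parse_test_case(raw: str) -> dict:
    """Parse a Python-literal test case without executing arbitrary code."""
    return ast.literal_eval(raw)


try:
    json.loads(raw_case)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False          # JSON requires double-quoted strings

case = parse_test_case(raw_case)
print(json_ok, case["output"])
```

ast.literal_eval only accepts literals (strings, numbers, tuples, lists, dicts, sets, booleans, None), so it is also safer than eval on untrusted dataset content.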