To Ask or Not to Ask: Strategic Clarification in LLM Agents via CMDPs

This project trains a large language model (LLM) to ask optimal clarifying questions when given ambiguous coding problems - but under a strict budget on how many questions it can ask. The interaction is framed as a Constrained Markov Decision Process (CMDP), where the agent must maximise code correctness while satisfying a hard constraint on average question count. It uses Reinforcement Learning with PPO-Lagrangian on top of Qwen2.5-Coder-7B-Instruct with LoRA adapters, evaluated on the HumanEvalComm benchmark.

Contributors: Abhinav Rajput · Acey Vogelstein · Deepali Balakrishna Ksheersagar

NYU Center for Data Science

What This Project Does

When an LLM is given a vague coding task like "Return list with elements incremented by a number", it can either:

Guess and write code (fast, but may be wrong)
Ask a clarifying question like "What number should each element be incremented by?" (gets better information, but costs the user a turn)

This project trains the model to make that decision optimally - asking only when it is truly worth it - under a configurable budget d₁ that caps the average number of questions per problem.

We train three policies:

d₁ = 0 - the agent learns to never ask; it must guess from the degraded spec alone
d₁ = 0.5 - the agent may ask at most 0.5 questions on average per problem
d₁ = 1 - the agent may ask at most 1 question on average per problem

These three policies trace a Pareto frontier: a family of policies ranging from "never ask" to "ask freely", each trained to maximise code correctness within its own budget.

The training algorithm is PPO-Lagrangian, where a Lagrange multiplier λ₁ automatically learns how expensive each question should be to satisfy the budget. No manual penalty tuning is needed.

Hardware Requirements

Requirement	Minimum	Recommended
GPUs	2× A100-40GB	2× A100-40GB
CPU RAM	64 GB	128 GB
Disk	50 GB free	100 GB free
Internet	Required (HuggingFace + OpenAI API)	-

GPU layout:

cuda:0 - Policy training (Qwen2.5-Coder-7B + LoRA + value heads + optimizer, ~19 GB)
cuda:1 - Rollout inference + frozen reference model (~17 GB)

To change GPU assignment, edit model.train_device and model.rollout_device in configs/default.yaml.

Installation

# Clone the repository
git clone <repo-url>
cd rl_llm_multiturn_project

# Create and activate a virtual environment (Python 3.10+)
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Install CUDA-enabled PyTorch separately if needed (the version in requirements.txt is CPU-safe but may not match your CUDA driver):

# Example for CUDA 12.1
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121

Verify GPU access:

python3 -c "import torch; print(torch.cuda.device_count(), 'GPUs available')"

Download the base model: The base model (Qwen2.5-Coder-7B-Instruct) is openly available on HuggingFace - no access request or token needed:

python -c "from transformers import AutoTokenizer, AutoModelForCausalLM; AutoTokenizer.from_pretrained('Qwen/Qwen2.5-Coder-7B-Instruct'); AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-Coder-7B-Instruct', torch_dtype='bfloat16')"

API Keys

This project uses two external APIs:

OpenAI API (required for training)

The user simulator (GPT-4o-mini) answers the agent's clarifying questions during every training episode. This is the only API cost during training (~$5–10 total for a full run).

export OPENAI_API_KEY="sk-..."

Or add it to a .env file in the project root (never commit this file):

OPENAI_API_KEY=sk-...

Then load it before running:

set -a; source .env; set +a  # if using a .env file; set -a exports the variables so Python can see them

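Cost estimate (training-rollout input tokens only): GPT-4o-mini charges ~$0.15 per 1M input tokens. With 32 episodes/iteration × 80 iterations × ~2 questions/episode × ~200 tokens/call ≈ 1M tokens ≈ **$0.15 per d₁ setting**. Output tokens and mid-training evaluation calls come on top of this.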
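The arithmetic behind this estimate can be reproduced with a few lines of Python. This is only a sketch using the rough figures quoted above; the constant names are illustrative and do not come from the project code.

```python
# Back-of-the-envelope simulator cost, using the approximate figures quoted above.
EPISODES_PER_ITER = 32         # training.rollout_batch_size
ITERATIONS = 80                # training.n_iterations
CALLS_PER_EPISODE = 2          # ~2 questions per episode (rough assumption)
TOKENS_PER_CALL = 200          # ~200 input tokens per call (rough assumption)
USD_PER_MILLION_INPUT = 0.15   # GPT-4o-mini input price quoted above

total_tokens = EPISODES_PER_ITER * ITERATIONS * CALLS_PER_EPISODE * TOKENS_PER_CALL
cost_usd = total_tokens / 1_000_000 * USD_PER_MILLION_INPUT

print(f"{total_tokens:,} input tokens ≈ ${cost_usd:.2f} per d1 setting")
```

Changing any of the constants (for example a longer specification in the simulator prompt) scales the estimate linearly.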

HuggingFace Token (not required)

Qwen2.5-Coder-7B-Instruct is not gated - no HuggingFace token is needed to download it.

Project Structure

rl_llm_multiturn_project/
│
├── configs/
│   └── default.yaml          # All hyperparameters (model, training, constraints)
│
├── data/
│   ├── raw/                  # Downloaded raw data (auto-populated, gitignored)
│   └── processed/            # Preprocessed problem files (gitignored)
│
├── src/
│   ├── data/
│   │   ├── dataset.py        # Problem dataclass + HumanEvalComm loader
│   │   └── augmentation.py   # MBPP degradation via GPT-4o (one-time script)
│   │
│   ├── environment/
│   │   ├── env.py            # ClarificationEnv: the RL state machine
│   │   ├── user_simulator.py # GPT-4o-mini wrapper (async, with atomic question counting)
│   │   └── code_executor.py  # Sandboxed Python runner → pass@1 score
│   │
│   ├── models/
│   │   ├── agent.py          # Qwen2.5-Coder-7B + LoRA: generate() and score()
│   │   └── value_heads.py    # Three MLP value heads (reward, q_cost, t_cost)
│   │
│   ├── training/
│   │   ├── rollout.py        # Async episode collection + RolloutBuffer
│   │   ├── ppo.py            # PPO loss, GAE, KL penalty, entropy bonus
│   │   ├── lagrangian.py     # Lagrange multiplier dual update
│   │   └── trainer.py        # PPOLagrangianTrainer: main training loop
│   │
│   └── evaluation/
│       └── evaluator.py      # pass@1 metrics, Pareto frontier, plots
│
├── scripts/
│   ├── train.py              # Training entry point
│   ├── evaluate.py           # Evaluation entry point
│   ├── smoke_test.py         # End-to-end pipeline validation
│   └── baseline_eval.py      # Base model coding ability test (supports --checkpoint)
│
├── checkpoints/              # Saved model checkpoints (gitignored)
│   ├── d1_0/
│   │   ├── iter_0019/        # Periodic checkpoint (every training.save_interval iterations)
│   │   ├── iter_0039/
│   │   ├── best/             # Checkpoint at peak mid-training eval reward
│   │   └── final/            # Final checkpoint after all iterations
│   └── d1_1/
│       ├── iter_0019/
│       ├── best/
│       └── final/
│
├── outputs/                  # Eval results, plots (gitignored)
└── requirements.txt

Dataset

The dataset is HumanEvalComm - 164 Python coding problems from HumanEval, each with multiple degraded versions of the problem specification.

It is automatically downloaded from HuggingFace the first time you run training or evaluation. No manual download is needed.

What the degradations look like:

Type	Field	What changes
Ambiguity	`prompt1a`	Specific values replaced with vague terms ("by 1" → "by a number")
Inconsistency	`prompt1c`	Examples contradict the description
Incompleteness	`prompt1p`	All examples and details stripped; only a stub remains
Ambiguity + Inconsistency	`prompt2ac`	Both combined
Ambiguity + Incompleteness	`prompt2ap`	Both combined
Inconsistency + Incompleteness	`prompt2cp`	Both combined
All three	`prompt3acp`	Ambiguity + Inconsistency + Incompleteness

Train/test split: The split is stratified at the base problem level - problems are grouped by their rarest available variant, then each group is split proportionally (~60% eval, ~40% train). This guarantees all 7 degradation types appear in both train and eval sets. All variants of a base problem go to the same set (no leakage). Enforced in src/data/dataset.py.
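The grouping logic described above can be sketched as follows. This is a minimal illustration, not the code in src/data/dataset.py: the function name `stratified_split`, the toy data, and the rule that every group with at least two problems contributes to both sets are all assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(problems, eval_frac=0.6, seed=42):
    """Split base problems so that all variants of a problem stay together.

    problems: dict mapping base problem id -> set of variant names it has.
    Returns (train_ids, eval_ids).
    """
    # How many base problems offer each variant (rarer = smaller count).
    counts = Counter(v for variants in problems.values() for v in variants)

    # Group each base problem under its rarest available variant.
    groups = defaultdict(list)
    for pid in sorted(problems):
        rarest = min(problems[pid], key=lambda v: (counts[v], v))
        groups[rarest].append(pid)

    rng = random.Random(seed)
    train_ids, eval_ids = [], []
    for key in sorted(groups):
        ids = groups[key]
        rng.shuffle(ids)
        n_eval = round(len(ids) * eval_frac)
        if len(ids) >= 2:
            # Keep at least one problem on each side of the split.
            n_eval = min(max(n_eval, 1), len(ids) - 1)
        eval_ids.extend(ids[:n_eval])
        train_ids.extend(ids[n_eval:])
    return train_ids, eval_ids

# Toy data: every problem has prompt1a, a few also have rarer variants.
toy = {i: {"prompt1a"} for i in range(10)}
for i in range(4):
    toy[i].add("prompt2cp")
for i in range(2):
    toy[i].add("prompt3acp")

train_ids, eval_ids = stratified_split(toy)
```

Because whole base problems are assigned to one side, no variant of a training problem can leak into the eval set.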

Which variants are used for training is controlled by data.use_variants in the config:

data:
  use_variants:
    - prompt1a
    - prompt1c
    - prompt1p
    - prompt2ac
    - prompt2ap
    - prompt2cp
    - prompt3acp

System Prompts

There are two system prompts in this project. Both are defined in source code, not config files.

Agent System Prompt

Located in src/environment/env.py - this is what the agent sees at every turn. Prompts are formatted using the model's chat template (<|im_start|>/<|im_end|> for Qwen):

You are a coding assistant. Given a coding task below, you must either:
  - Ask a clarifying question by responding with [ASK] followed by your question.
  - Write your Python solution by responding with [ANSWER] followed by the code.

Important rules:
  - Respond with ONLY one action per turn ([ASK] or [ANSWER]).
  - When you have enough information, write the code.
  - Do not explain your reasoning - just output the action directly.

The full conversation history is appended below this prompt at each turn.
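For illustration, an agent response in this format could be split into an action type and a payload as shown below. The function `parse_action` and the "malformed" fallback are hypothetical; the actual parsing in src/environment/env.py may differ.

```python
from typing import Tuple

def parse_action(text: str) -> Tuple[str, str]:
    """Return (kind, payload), where kind is 'ask', 'answer', or 'malformed'."""
    stripped = text.strip()
    for tag, kind in (("[ASK]", "ask"), ("[ANSWER]", "answer")):
        if stripped.startswith(tag):
            return kind, stripped[len(tag):].strip()
    # Neither prefix present: treat the whole response as a malformed action.
    return "malformed", stripped

print(parse_action("[ASK] What number should each element be incremented by?"))
```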

User Simulator System Prompt

Located in src/environment/user_simulator.py - this is what GPT-4o-mini receives:

Mode: count (default) - the simulator counts atomic questions and returns QUESTION_COUNT: N:

You are a helpful assistant who holds the complete specification for a coding problem.

Full specification:
{original_prompt}

Note: The agent may refer to the function by a different name (e.g., "candidate").
Treat any function name the agent uses as referring to the function described above.

Rules:
1. Answer ONLY the specific question(s) the agent asks. Do not volunteer extra information.
2. Do not reveal test cases or the full solution.
3. Keep your answer brief and factual.
4. After your answer, on a new line write EXACTLY:
   QUESTION_COUNT: N
   where N is the number of distinct atomic questions you identified in the agent's message.

Counting rules for N:
- Each distinct piece of information being requested = 1.
- "What is X and what is Y?" = 2.
- Conjunctions like "and", "also", "additionally" between separate requests = separate questions.

Mode: truncate - only the first question (up to the first ?) is passed to the simulator, and cost_q is always 1:

You are a helpful assistant who holds the complete specification for a coding problem.

Full specification:
{original_prompt}

Note: The agent may refer to the function by a different name (e.g., "candidate").
Treat any function name the agent uses as referring to the function described above.

Rules:
1. Answer ONLY the single question below. Do not volunteer extra information.
2. Do not reveal test cases or the full solution.
3. Keep your answer brief and factual.

Switch between modes via environment.multi_question_mode in configs/default.yaml.

Why two modes? The agent may learn to pack multiple questions into one [ASK] turn. In count mode, each atomic question is charged separately (2 questions in one turn = cost_q = 2). In truncate mode, only the first question is answered and cost_q = 1 always.
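The two modes can be sketched in a few lines. Both helpers below are hypothetical (they are not taken from src/environment/user_simulator.py), and the fallback of charging 1 question when the QUESTION_COUNT line is missing is an assumption.

```python
import re
from typing import Tuple

_COUNT_RE = re.compile(r"^\s*QUESTION_COUNT:\s*(\d+)\s*$", re.MULTILINE)

def parse_count_reply(reply: str, default: int = 1) -> Tuple[str, int]:
    """count mode: split the simulator reply into (answer, cost_q)."""
    match = _COUNT_RE.search(reply)
    if match is None:
        # Assumed fallback: charge one question if the count line is missing.
        return reply.strip(), default
    answer = (reply[:match.start()] + reply[match.end():]).strip()
    return answer, int(match.group(1))

def truncate_to_first_question(message: str) -> str:
    """truncate mode: keep only the text up to and including the first '?'."""
    head, sep, _ = message.partition("?")
    return (head + sep).strip()

answer, cost_q = parse_count_reply("X is 3 and Y is 5.\nQUESTION_COUNT: 2")
first = truncate_to_first_question("What is X? And what is Y?")
```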

Training

Quick Start

# Train d₁=0 policy (agent learns to never ask)
python scripts/train.py --d1 0

# Train d₁=0.5 policy (agent may ask ≤0.5 questions on average)
python scripts/train.py --d1 0.5 constraint.d1=0.5

# Train d₁=1 policy (agent may ask ≤1 question on average)
python scripts/train.py --d1 1

Train all three sequentially:

python scripts/train.py --d1 0 && python scripts/train.py --d1 0.5 constraint.d1=0.5 && python scripts/train.py --d1 1

Resume from a Checkpoint

python scripts/train.py --d1 1 --resume checkpoints/d1_1/iter_0039

Override Config Values at the Command Line

# Reduce iterations for a quick smoke-test
python scripts/train.py --d1 1 training.n_iterations=5 training.rollout_batch_size=16

# Change the multi-question mode
python scripts/train.py --d1 1 environment.multi_question_mode=truncate

# Increase the question budget
python scripts/train.py --d1 2 constraint.d1=2

What Happens During Training

Each iteration consists of:

Rollout collection (~5–6 min): 32 episodes collected per iteration. The agent sees a degraded problem spec, generates [ASK] or [ANSWER] actions, and the environment responds. API calls to GPT-4o-mini happen asynchronously. Code execution happens in a subprocess sandbox.
Advantage computation: GAE advantages are computed for three return streams: reward (pass@1), question cost, and turn cost.
PPO update (~3 min): 4 epochs over the buffer in mini-batches of 16. The Lagrangian advantage A_lag = A_reward - λ₁·A_q - λ₂·A_t is used. Only LoRA weights are updated.
Lagrange update: λ₁ += lr_lambda × (avg_questions - d₁), where lr_lambda is constraint.lr_lambda and λ₁ is capped at constraint.lambda_max. If the agent asked too many questions, λ₁ rises - making questions more expensive next iteration.
Logging: Per-iteration stats printed to stdout.
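The Lagrangian advantage and the dual update from the steps above can be sketched as follows. The function names are illustrative and not taken from src/training/lagrangian.py. Clipping λ at zero is the usual projection for a Lagrange multiplier and is assumed here, and the step size of 0.05 in the usage lines is an assumption chosen because it reproduces the first λ₁ value in the sample log of this section.

```python
from typing import List, Sequence

def lagrangian_advantage(a_reward: Sequence[float], a_q: Sequence[float],
                         a_t: Sequence[float], lam1: float, lam2: float) -> List[float]:
    """A_lag = A_reward - lam1 * A_q - lam2 * A_t, element-wise."""
    return [r - lam1 * q - lam2 * t for r, q, t in zip(a_reward, a_q, a_t)]

def dual_update(lam: float, avg_cost: float, budget: float,
                lr: float, lam_max: float) -> float:
    """One dual-ascent step, projected onto [0, lam_max]."""
    return min(max(lam + lr * (avg_cost - budget), 0.0), lam_max)

# Budget d1 = 0, the agent asked 0.75 questions on average: lambda rises.
lam1 = dual_update(0.0, avg_cost=0.75, budget=0.0, lr=0.05, lam_max=10.0)

# Budget d1 = 1, the agent asked 0.5 questions on average: lambda stays at 0.
lam1_slack = dual_update(0.0, avg_cost=0.5, budget=1.0, lr=0.05, lam_max=10.0)
```

When the constraint is slack, λ₁ stays at zero and the Lagrangian advantage reduces to the plain reward advantage minus the turn-cost term.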

Expected training output:

iter=   0 | reward=0.7070 | q=0.75 (budget=0) | λ₁=0.0375 | ppo=-0.0036 | vf=1.5297 | approx_kl=0.0058 | kl_tok=0.0000 | kl_seq=-0.0062 | kl_max=0.1864 | t=475s
iter=   1 | reward=0.7069 | q=0.89 (budget=0) | λ₁=0.0820 | ppo=0.0161 | vf=0.6088 | approx_kl=0.0064 | kl_tok=0.0002 | kl_seq=0.0016 | kl_max=0.2715 | t=537s
...

Key metrics to monitor:

approx_kl: should stay below target_kl (0.05), typically around 0.005-0.01 as in the log above (if ~3+, log-prob computation has a mismatch)
kl_tok: per-token KL from reference model, should stay near 0
kl_max: worst-case sequence KL, should stay under 0.5
reward: should trend upward over many iterations
q: average questions per episode, should converge toward the budget

Total training time estimate:

~8–9 min/iteration × 80 iterations = ~11 hours per d₁ setting
d₁=0, d₁=0.5, and d₁=1 = ~33 hours total (run sequentially)

Training Logs

Per-iteration logs are saved alongside each checkpoint as log.json:

checkpoints/d1_1/iter_0079/log.json

Evaluation

Evaluate a Single Checkpoint

python scripts/evaluate.py --checkpoint checkpoints/d1_1/final --d1 1

Results are saved to outputs/eval/results_d1_1.json.

Evaluate All Policies and Plot the Pareto Frontier

After all three policies (d₁=0, d₁=0.5, d₁=1) are trained, run the sweep in either of two modes:

# Pareto over best-eval checkpoints (default) - uses checkpoints/d1_N/best/
python scripts/evaluate.py --sweep --output_dir outputs/pareto

# Pareto over final checkpoints - uses checkpoints/d1_N/final/
python scripts/evaluate.py --sweep --sweep_mode final --output_dir outputs/pareto

--sweep_mode best uses the checkpoint saved when mid-training eval reward peaked (best generalisation). --sweep_mode final uses the end-of-training checkpoint. If the preferred checkpoint does not exist, the sweep falls back to the other, then to the latest iter_*.

Both modes can write to the same --output_dir - output files are named with the mode suffix:

Mode	JSON	Plot
`best`	`pareto_best.json`	`pareto_frontier_best.png`
`final`	`pareto_final.json`	`pareto_frontier_final.png`

Each sweep:

Finds all checkpoints/d1_N/ directories
Runs each on the held-out eval split (100 base problems, all selected variants; greedy decoding, temperature=0)
Prints a results table
Saves a PNG plot (avg_questions on x-axis, pass@1 on y-axis, each point labelled with its d₁)

Example output table:

  d1    pass@1    avg_q  avg_turns       n
----  --------  -------  ---------  ------
   0    0.4120     0.00       1.00     400
   1    0.5830     0.97       1.92     400

Results

Eval set: 469 problems (full HumanEvalComm eval split, all 7 degradation types) · Eval temp: 0.0 (greedy) · Model: Qwen2.5-Coder-7B-Instruct + LoRA (rank 16)

Two evaluation modes are reported:

MT pass@1: multi-turn mode, where the agent may ask clarifying questions before answering
ST pass@1: single-turn mode, where the agent is told to write code immediately with no option to ask questions; measures raw code generation ability as a lower bound

Overall Summary

Policy	Checkpoint	MT pass@1	95% CI	Avg Qs	Δ vs Baseline	p
Baseline (untrained)	-	0.685	[0.648, 0.721]	0.855	-	-
D1=1	iter_0029	0.747	[0.713, 0.781]	0.846	+6.2pp	p < 0.0001
D1=0.5	iter_0069	0.660	[0.624, 0.696]	0.480	−2.5pp	p = 0.173 (n.s.)
D1=0	iter_0049	0.604	[0.567, 0.642]	0.337	−8.0pp	p < 0.0001

The D1=1 policy is the only trained policy that strictly outperforms the untrained baseline. D1=0.5 is statistically indistinguishable from baseline. D1=0 significantly hurts performance.
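To see which of these points are non-dominated, the summary table can be checked with a small Pareto filter. This is a sketch over the point estimates only (it ignores the confidence intervals), and `pareto_front` is a hypothetical helper, not part of src/evaluation/evaluator.py.

```python
def pareto_front(points):
    """points: dict name -> (avg_questions, pass_at_1).

    A point is dominated if another point asks no more questions and scores
    at least as well, and is strictly better on at least one of the two.
    """
    front = []
    for name, (q, p) in points.items():
        dominated = any(
            (q2 <= q and p2 >= p) and (q2 < q or p2 > p)
            for other, (q2, p2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: points[n][0])

results = {
    "Baseline": (0.855, 0.685),
    "D1=1": (0.846, 0.747),
    "D1=0.5": (0.480, 0.660),
    "D1=0": (0.337, 0.604),
}
front = pareto_front(results)
print(front)
```

On these numbers the untrained baseline is dominated by D1=1, which asks slightly fewer questions and scores higher.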

By Degradation Type (D1=1 vs Baseline)

Type	n	Baseline MT	D1=1 MT	Gain	p
1a (ambiguity)	100	0.743	0.765	+2.2pp	p = 0.384
1c (inconsistency)	99	0.833	0.877	+4.4pp	p = 0.027
1p (incompleteness)	100	0.636	0.716	+8.0pp	p = 0.006
2ac (ambiguity + inconsistency)	99	0.621	0.702	+8.1pp	p = 0.002
2ap (ambiguity + incompleteness)	43	0.561	0.607	+4.6pp	p = 0.316
2cp (inconsistency + incompleteness)	21	0.502	0.706	+20.4pp	p = 0.007
3acp (all three)	7	0.654	0.724	+7.0pp	p = 0.474
Overall	469	0.685	0.747	+6.2pp	p < 0.0001

The largest gains are on 2cp, 2ac, and 1p, the types where clarification is most valuable. The 3acp gain is similar in size but rests on only 7 problems.

All Pairwise Comparisons (paired t-test, n=469)

Comparison	ΔMT	95% CI	p
D1=1 vs Baseline	+6.24pp	[+3.46pp, +9.03pp]	p < 0.0001
D1=0.5 vs Baseline	−2.48pp	[−6.04pp, +1.08pp]	p = 0.173 ✗
D1=0 vs Baseline	−8.03pp	[−11.56pp, −4.50pp]	p < 0.0001
D1=0.5 vs D1=1	−8.72pp	[−12.21pp, −5.24pp]	p < 0.0001
D1=0 vs D1=1	−14.27pp	[−17.93pp, −10.62pp]	p < 0.0001
D1=0 vs D1=0.5	−5.55pp	[−8.41pp, −2.70pp]	p = 0.0001

Diminishing Returns: Question Budget vs. MT pass@1

Each unit reduction in avg questions costs progressively more pass@1:

Policy	Avg Qs	MT pass@1	Δ pass@1 per −0.1 Qs
Baseline	0.855	0.685	-
D1=1 (iter_0029)	0.846	0.747	+6.9pp (RL gain, not constraint cost)
D1=0.5 (iter_0069)	0.480	0.660	−2.4pp per −0.1 Qs
D1=0 (iter_0049)	0.337	0.604	−3.9pp per −0.1 Qs

The curve turns negative at D1=0: restricting questions below the model's natural rate hurts more than never training at all.
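The per-0.1-question rates in the table can be recomputed from the Avg Qs and MT pass@1 columns. The helper below is a sketch with a hypothetical name; it simply divides the change in pass@1 by the reduction in average questions.

```python
def pp_per_tenth_question(q_from, p_from, q_to, p_to):
    """Change in pass@1 (percentage points) per 0.1 reduction in avg questions."""
    delta_pp = (p_to - p_from) * 100   # pass@1 change in percentage points
    delta_q = q_from - q_to            # reduction in average questions
    return delta_pp / (delta_q / 0.1)

# D1=1 (0.846 Qs, 0.747) -> D1=0.5 (0.480 Qs, 0.660)
rate_1_to_half = pp_per_tenth_question(0.846, 0.747, 0.480, 0.660)
# D1=0.5 (0.480 Qs, 0.660) -> D1=0 (0.337 Qs, 0.604)
rate_half_to_0 = pp_per_tenth_question(0.480, 0.660, 0.337, 0.604)

print(round(rate_1_to_half, 1), round(rate_half_to_0, 1))
```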

Key Takeaways

1. RL training significantly improves multi-turn performance. The D1=1 policy (iter_0029) achieves a +6.24pp gain over the untrained baseline (paired t=4.39, p<0.0001, 95% CI [+3.5pp, +9.0pp]). This is the D1=1 result we report: iter_0029 is the last checkpoint before memorization-driven instability sets in (see takeaway 7).

2. Gains are universal but concentrated on harder problem types. Every degradation type improves at iter_0029. The largest gains are on 2cp (+20.4pp, p=0.007), 2ac (+8.1pp, p=0.002), and 1p (+8.0pp, p=0.006). Gains on 1a (+2.2pp), 2ap (+4.6pp), and 3acp (+7.0pp) exist but are not individually significant.

3. The model learns when to ask, not just how much to ask. Average questions barely changes (0.855 → 0.846). The improvement comes from asking more strategically: fewer questions on easy types (1a: −0.07, 1c: −0.03) where clarification adds noise, and slightly more on hard types (2ap: +0.07, 2cp: +0.05) where it genuinely helps.

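4. Answer quality also improves (0-question episodes). When the D1=1 trained model answers directly (0 questions), its pass@1 is 0.792 vs 0.754 for baseline. This suggests RL training improved answers in the episodes where the agent chooses not to ask, not just clarification strategy, even though forced single-turn pass@1 is essentially unchanged (see takeaway 10).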

5. The Lagrange constraint barely activated for D1=1. λ₁ ≈ 0.000 throughout almost all of training. Budget=1 is effectively never binding; the model naturally asks ≤1 question on average. The constraint machinery works, but the natural question rate is already near the budget.

6. Original prompt outperforms few-shot prompt at baseline. The untrained model with the original prompt achieves MT pass@1 = 0.685, vs 0.586 for the few-shot prompt baseline, a 9.9pp advantage before any training. Few-shot examples appear to hurt the model's natural clarification behavior.

7. D1=1 shows instability and overfitting beginning in the mid-30s. At iter 34, avg_Qs spikes to 1.094 (first and only budget breach), λ₁ briefly activates, and both avg_Qs and training reward become volatile. The training split has 64 base problems (302 training problems once variants are counted); by iter ~34, the model sees each base problem ~34 times on average, enough to memorize answers and stop asking. Iter_0029 is the last checkpoint before this instability.

8. D1=0.5 constraint is statistically significant at −8.7pp vs D1=1. The D1=0.5 policy (iter_0069) achieves pass@1=0.660 vs D1=1's 0.747, a −8.7pp gap (p<0.0001, 95% CI [−12.21pp, −5.24pp]). The constraint is fully internalized (avg_Qs=0.480, just under the 0.5 budget). The cost is concentrated on types requiring clarification most: 1p (−14.2pp) and 2cp (−23.3pp).

9. D1=0 training collapses pass@1 to below baseline. Mid-train evals show pass@1 collapsing from 0.734 (iter 29) → 0.550 (iter 49) as λ₁ climbs and avg_Qs → 0. Full eval confirms MT pass@1 = 0.604, below the untrained baseline (0.685). Training with budget=0 is counterproductive.

10. ST pass@1 is effectively invariant to training (0.614–0.624 across all policies). RL training does not meaningfully change raw coding ability. Across baseline, D1=1, D1=0.5, and D1=0, ST pass@1 ranges only from 0.614 to 0.624 (a 1.0pp spread). The LoRA update affects question-asking behavior and MT strategy but not underlying code generation competence.

11. Asking 2+ questions is a failure signal. Across all policies, episodes with 2+ questions score lower than 1-question episodes. For D1=0.5: 2-Qs→0.588, 3-Qs→0.278, 6-Qs→0.000. Multiple questions indicate either poorly targeted asks, a problem genuinely unsolvable through clarification, or the model failing to use the answers it receives. A well-calibrated policy should rarely need more than 1 question per episode.

12. Diminishing (and eventually negative) returns to restricting questions. Each unit of question budget sacrificed costs progressively more pass@1. Going from D1=1 → D1=0.5 (−0.37 avg_Qs) costs 8.7pp, or ~2.4pp per −0.1 Qs. Going from D1=0.5 → D1=0 (−0.14 avg_Qs) costs another 5.6pp, or ~3.9pp per −0.1 Qs, a 60% steeper rate. The curve turns negative at D1=0: MT pass@1 (0.604) falls below the untrained baseline (0.685). The optimal operating point appears near D1=1 (≈ the model's natural question rate), where RL can improve question quality without paying a constraint cost.

13. D1=0 questions actively hurt performance (MT < ST). D1=0 MT pass@1 (0.604) is below D1=0 ST pass@1 (0.624). When the D1=0 model asks a question (avg 0.337 per episode despite budget=0), the answer makes things worse. By-Qs breakdown: 0-Qs → MT=0.641 (n=346); 1-Qs → MT=0.537 (n=109). The policy learned to suppress questions but not to suppress them entirely; the residual questions are poorly targeted or the model fails to use the clarifications it receives.

Trained Model Checkpoints

Hugging Face

Budget	Model	Description
`d1=0`		Never asks - guesses from degraded spec alone
`d1=0.5`		Asks sparingly - at most 0.5 questions on average
`d1=1`		Asks when worth it - at most 1 question on average

Local Checkpoints

All checkpoints are saved under checkpoints/ (gitignored):

checkpoints/
├── d1_0/
│   ├── iter_0019/          # Periodic checkpoint (every training.save_interval iterations)
│   │   ├── adapter_config.json
│   │   ├── adapter_model.safetensors   ← LoRA weights (~80 MB)
│   │   ├── tokenizer.json
│   │   ├── value_heads.pt              ← Three MLP heads
│   │   ├── dual_variables.pt           ← λ₁, λ₂ values
│   │   ├── train_state.pt              ← Optimizer + scheduler state
│   │   └── log.json                    ← Training log up to this point
│   ├── iter_0039/
│   ├── best/               ← Overwritten whenever mid-training eval reward improves
│   └── final/              ← Checkpoint after the last training iteration
└── d1_1/
    ├── iter_0019/
    ├── best/
    └── final/

What is Saved

File	Contents	Size
`adapter_model.safetensors`	LoRA adapter weights (the only trained LLM parameters)	~80 MB
`adapter_config.json`	LoRA configuration (rank, alpha, target modules)	<1 KB
`tokenizer.json`	Tokenizer files (copied from base model)	~10 MB
`value_heads.pt`	Three MLP value heads (reward, q_cost, t_cost)	~50 MB
`dual_variables.pt`	Current λ₁ and λ₂ values	<1 KB
`train_state.pt`	Optimizer and LR scheduler state (for resuming)	~500 MB
`log.json`	Per-iteration training metrics	~100 KB

Note: The base Qwen2.5-Coder-7B weights (~14 GB) are NOT saved - they are downloaded from HuggingFace and remain frozen. Only the LoRA adapter (~80 MB) is saved. To deploy a checkpoint, you need both the base model and the LoRA adapter.

Loading a Checkpoint for Inference

from omegaconf import OmegaConf
from src.models.agent import Agent

cfg = OmegaConf.load("configs/default.yaml")
agent = Agent(cfg)
agent.load_lora("checkpoints/d1_1/final")

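# Generate a response. `prompt` must be the chat-formatted string for the current turn
# (system prompt + conversation history); the plain task text below is only a placeholder.
prompt = "Return list with elements incremented by a number."
action_text, _, _, _, _ = agent.generate(prompt)
print(action_text)  # e.g. "[ASK] What number should each element be incremented by?"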

Configuration Reference

All configuration lives in configs/default.yaml. Every value can be overridden at the command line using OmegaConf dot-notation (e.g., training.n_iterations=40).

Model

model:
  name: Qwen/Qwen2.5-Coder-7B-Instruct      # HuggingFace model ID
  dtype: bfloat16                            # bf16 for A100s
  lora_rank: 16                              # LoRA rank (~40M trainable params)
  lora_alpha: 32                             # LoRA scaling factor
  lora_dropout: 0.0                           # Set to 0 to avoid train/eval mode mismatch
  lora_target_modules:                       # Which linear layers to apply LoRA to
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
  gradient_checkpointing: true               # Reduces activation memory from ~10GB to ~3GB
  train_device: cuda:0                       # GPU for PPO update
  rollout_device: cuda:1                     # GPU for rollout inference

Environment

environment:
  max_turns: 6                               # Hard cap on conversation length
  max_new_tokens: 512                        # Max tokens per agent action
  max_seq_len: 2048                          # Max total prompt length (tokens)
  rollout_temperature: 0.8                   # Exploration temperature during rollout
  multi_question_mode: count                 # "count" or "truncate" (see System Prompts)
  efficiency_alpha: 0.025                    # Small bonus for fewer turns (tiebreaker vs waste)
  efficiency_beta: 0.025                     # Small bonus for fewer questions

User Simulator

user_simulator:
  model: gpt-4o-mini                         # OpenAI model for the user simulator
  temperature: 0.0                           # Deterministic responses
  max_tokens: 300
  max_concurrent_api: 15                     # Max parallel API calls (rate limit safety)

Code Executor

code_executor:
  timeout: 10.0             # seconds per subprocess execution
  partial_credit: true      # reward = fraction of passing assertions, not binary

Training

training:
  rollout_batch_size: 32                     # Episodes per iteration
  ppo_epochs: 4                              # PPO update passes per batch
  ppo_mini_batch_size: 8                     # Mini-batch size per GPU
  clip_epsilon: 0.2                          # PPO clip range
  gamma: 1.0                                 # Discount factor (1.0 = no discounting)
  gae_lambda: 0.95                           # GAE smoothing
  kl_coeff: 0.25                             # KL penalty (keeps policy near reference)
  target_kl: 0.05                            # Early-exit PPO epoch when approx_kl exceeds this
  entropy_coeff: 0.01                        # Entropy bonus (prevents collapse)
  lr_policy: 5.0e-6                          # LoRA learning rate
  lr_value: 1.0e-4                           # Value head learning rate
  optimizer: adamw_8bit                      # 8-bit AdamW (saves ~8 GB vs full)
  warmup_steps: 20
  n_iterations: 80                           # Training iterations per d₁ setting
  save_interval: 5                           # Save checkpoint every N iterations
  eval_interval: 10                          # Run eval every N iterations
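To make gamma and gae_lambda concrete, here is a minimal sketch of Generalized Advantage Estimation for a single episode. It is a textbook formulation with a hypothetical function name, not the code in src/training/ppo.py, and it assumes the value after the final step is 0 (terminal episode).

```python
from typing import List, Sequence

def compute_gae(rewards: Sequence[float], values: Sequence[float],
                gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Return per-step GAE advantages for one terminated episode."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # value after the terminal step
    running = 0.0      # discounted sum of future TD errors
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# Two [ASK] turns with no reward, then an [ANSWER] that passes all tests.
adv = compute_gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5])
print(adv)
```

With gamma = 1.0 the terminal reward is not discounted, so credit for a correct answer flows back to earlier [ASK] turns, attenuated only by gae_lambda.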

Constraint (CMDP)

constraint:
  d1: 1                                      # Question budget (0, 0.5, or 1 in the reported runs)
  lambda_init: 0.0                           # Starting value for λ₁
  lambda_max: 10.0                           # Maximum value for λ₁
  lr_lambda: 0.1                             # Lagrange multiplier step size
  d2: 4                                      # Turn budget (soft, secondary constraint)
  lambda2_init: 0.0
  lambda2_max: 5.0
  lr_lambda2: 0.005

Data

data:
  hf_dataset: jie-jw-wu/HumanEvalComm       # HuggingFace dataset identifier
  eval_size: 100                             # Held-out base problems for evaluation
  seed: 42
  use_variants:                              # Degradation types used for training
    - prompt1a
    - prompt1c
    - prompt1p
    - prompt2ac
    - prompt2ap
    - prompt2cp
    - prompt3acp

Key Design Decisions

Why Qwen2.5-Coder-7B and not Llama-3.1-8B? The HumanEvalComm paper shows that code-specialized models (CodeQwen, DeepSeek Coder) significantly outperform general-purpose models on degraded specs. Qwen2.5-Coder-7B scores ~70% on standard HumanEval vs ~55% for Llama-3.1-8B. It is also similar in size (~14 GB in bf16), works with the same LoRA config, and is not gated on HuggingFace. In our smoke-test episodes Llama-3.1-8B scored 0%; Qwen Coder provides a much stronger coding baseline for PPO to build on.

Why LoRA and not full fine-tuning? Full fine-tuning on 7B with AdamW would require ~56 GB for optimizer states alone. LoRA limits trainable parameters to ~40M, reducing optimizer memory to ~800 MB. The frozen base weights also prevent catastrophic forgetting of Python syntax knowledge.

Why constrained prefix decoding? The base model sometimes outputs code without the required [ASK] or [ANSWER] prefix, resulting in malformed actions and zero reward. Constrained prefix decoding forces every generation to start with one of the two valid prefixes. The model still chooses which prefix by comparing their log-probs given the prompt - so the decision is learned, not random. This eliminates wasted training iterations on formatting errors.
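One way to picture the prefix decision is as a two-way choice over the log-probabilities of the two valid prefixes. The sketch below is an assumption about how such a choice could work (renormalise over the two prefixes, then sample); the helper names are hypothetical, and the actual implementation in src/models/agent.py may select the prefix differently.

```python
import math
import random

def prefix_probs(logp_ask: float, logp_answer: float) -> dict:
    """Renormalise the two prefix log-probs into a two-way distribution."""
    m = max(logp_ask, logp_answer)          # subtract the max for stability
    e_ask = math.exp(logp_ask - m)
    e_answer = math.exp(logp_answer - m)
    z = e_ask + e_answer
    return {"[ASK]": e_ask / z, "[ANSWER]": e_answer / z}

def choose_prefix(logp_ask: float, logp_answer: float, rng: random.Random) -> str:
    """Sample a prefix according to the renormalised probabilities."""
    probs = prefix_probs(logp_ask, logp_answer)
    return "[ASK]" if rng.random() < probs["[ASK]"] else "[ANSWER]"

# If the model puts 0.3 on [ASK] and 0.1 on [ANSWER], [ASK] is chosen 75% of the time.
probs = prefix_probs(math.log(0.3), math.log(0.1))
prefix = choose_prefix(math.log(0.3), math.log(0.1), random.Random(0))
```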

Why PPO-Lagrangian and not fixed-penalty RL? A fixed penalty requires hand-tuning - you don't know in advance how large the penalty needs to be to achieve exactly d₁=1 question on average. PPO-Lagrangian finds this value automatically via dual ascent: if the agent asks too many questions, λ₁ rises until it stops; if it asks too few, λ₁ falls. This also enables sweeping multiple d₁ values without re-tuning.

Why two Lagrange multipliers? λ₁ enforces the question budget d₁. λ₂ is a soft secondary constraint on turns. Turn cost exists because asking many focused questions across many turns is still expensive, even if question count is low.

Multi-question handling (the multi_question_mode setting): When the agent writes [ASK] What is X? And what is Y?, it is asking two questions in one turn. In count mode (default), the user simulator counts 2 atomic questions and the environment charges cost_q = 2. This prevents the agent from exploiting the budget by batching questions. In truncate mode, only the first question is answered and cost_q = 1 always. The default count mode is more principled but depends on GPT-4o-mini counting accurately; truncate mode is simpler but restricts the agent's action space. Switch with environment.multi_question_mode=truncate.

Function name handling in code execution: Degraded specs sometimes rename functions to candidate, but test cases use the original name. The code executor aliases the last top-level function in the agent's code to the expected entry_point. This avoids breaking helper functions (e.g., is_prime defined before is_multiply_prime). Additionally, helper functions from the degraded spec (e.g., poly for find_zero) are automatically extracted and prepended to the test program so tests can reference them.
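The aliasing step can be illustrated with the standard ast module. This is a sketch under assumptions: `alias_entry_point` is a hypothetical helper, and leaving the code untouched when the expected name is already defined is our choice, not a confirmed detail of src/environment/code_executor.py.

```python
import ast

def alias_entry_point(code: str, entry_point: str) -> str:
    """Alias the last top-level function in `code` to `entry_point`."""
    tree = ast.parse(code)
    funcs = [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]
    if not funcs or entry_point in funcs:
        return code  # nothing to alias, or the expected name already exists
    return f"{code.rstrip()}\n{entry_point} = {funcs[-1]}\n"

agent_code = (
    "def is_prime(n):\n"
    "    return n > 1 and all(n % i for i in range(2, n))\n"
    "\n"
    "def candidate(a):\n"
    "    return is_prime(a)\n"
)
patched = alias_entry_point(agent_code, "is_multiply_prime")
namespace = {}
exec(patched, namespace)
```

Only the last top-level function is aliased, so the helper is_prime keeps its own name and remains callable from the aliased function.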

String output quoting in test assertions: Some test case outputs are bare strings (e.g., fdcb not 'fdcb'). The executor detects these and wraps them in repr() so assertions compare against the correct type. Template-based test relations (using $demo$ and $input$ placeholders) are also expanded correctly.

Function name handling in the user simulator: The simulator's system prompt tells GPT-4o-mini to treat any function name the agent uses (e.g., candidate) as referring to the function in the original spec. Without this, the simulator would say "I don't know about candidate" when the original spec defines decode_cyclic.

Why partial credit for code evaluation? Binary pass/fail gives a flat reward landscape. Partial credit (fraction of assertions passing) gives smoother gradients - an agent that gets 8/10 tests right receives reward 0.8, not 0. This significantly stabilises PPO training.
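A partial-credit reward can be sketched as the fraction of assertions that run without error. The function below is illustrative only: it executes code in-process with exec, whereas the project runs code in a sandboxed subprocess with a timeout, and the name `partial_credit` is hypothetical.

```python
from typing import Sequence

def partial_credit(code: str, assertions: Sequence[str]) -> float:
    """Return the fraction of assertion statements that pass against `code`.

    WARNING: uses exec() in-process for illustration; do not run untrusted
    code this way.
    """
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return 0.0  # code that does not even load earns no reward
    if not assertions:
        return 0.0
    passed = 0
    for statement in assertions:
        try:
            exec(statement, dict(namespace))
            passed += 1
        except Exception:
            pass  # failed assertion or runtime error: no credit for this test
    return passed / len(assertions)

solution = "def inc(xs, k=1):\n    return [x + k for x in xs]\n"
tests = [
    "assert inc([1, 2]) == [2, 3]",
    "assert inc([]) == []",
    "assert inc([1], 2) == [3]",
    "assert inc([0]) == [5]",   # fails: the default increment is 1
]
reward = partial_credit(solution, tests)
```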

Why compute old_log_probs via forward pass instead of generate() scores? This was the most critical debugging finding in the project. HuggingFace's model.generate() returns scores that have been processed by internal logit processors - in Qwen2.5-Coder-7B, this resulted in 152,063 out of 152,064 vocabulary tokens being set to -inf, even with top_k=0. The model's own generation config applies aggressive filtering that cannot be disabled through standard parameters.

This caused a systematic log-prob mismatch: during rollout, log_softmax over the filtered scores normalized over ~1-4 tokens, while score() during PPO update used a clean forward pass normalizing over the full 152k vocabulary. The same token received different log-probs depending on which path computed them, inflating approx_kl to ~3 and permanently saturating PPO's clipping mechanism.

The fix was to stop using generate() scores entirely for log-prob computation. After generate() produces the token sequence, a separate forward pass computes old_log_probs from raw model logits - the exact same method score() uses. This brings approx_kl down to ~0.007.

The debugging process went through three stages:

Prefix scoring mismatch - fixing it brought approx_kl from ~12 to ~3 (rollout scored only 2 of 4-5 prefix tokens; fix: both sides skip the prefix and score only continuation tokens)
False leads - top_k=0, reducing PPO epochs, disabling LoRA dropout - none addressed the remaining ~3
Root cause - a thorough debug script revealed the logit processor filtering, leading to the forward pass fix that resolved the issue completely
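The effect of the logit filtering can be reproduced on a toy vocabulary without any model. The sketch below uses made-up logits and a pure-Python log-softmax; it only illustrates why normalising over filtered scores and over raw logits gives different log-probs for the same token.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

NEG_INF = float("-inf")

raw_logits = [2.0, 1.0, 0.5, -1.0]               # clean forward pass (toy values)
filtered_scores = [2.0, NEG_INF, NEG_INF, NEG_INF]  # all but one token filtered out

lp_raw = log_softmax(raw_logits)[0]          # log-prob of token 0, full vocabulary
lp_filtered = log_softmax(filtered_scores)[0]  # log-prob of token 0, filtered scores

# PPO's probability ratio for this token before any policy update has happened.
ratio = math.exp(lp_raw - lp_filtered)
print(lp_raw, lp_filtered, ratio)
```

With identical weights on both sides the ratio should be exactly 1; here it is not, which is the kind of mismatch that inflates approx_kl and saturates clipping.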

Why a stratified train/eval split? Rare degradation types (prompt3acp = 12 problems, prompt2cp = 35 problems) could end up entirely in one set with a random split. The stratified split groups problems by their rarest variant and splits each group proportionally, guaranteeing all 7 types appear in both sets.

Why ast.literal_eval and not json.loads for test cases? HumanEvalComm stores test cases as Python-literal strings (single-quoted dicts), not JSON. json.loads will fail on them. Always use ast.literal_eval.
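A short demonstration of the difference follows. The test-case string and its field names are made up for illustration and are not copied from the dataset.

```python
import ast
import json

# Python-literal string with single quotes, as described above (hypothetical fields).
raw = "{'input': [1, 2, 3], 'output': 'fdcb'}"

try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False  # JSON requires double-quoted strings

parsed = ast.literal_eval(raw)  # safely evaluates Python literals only
print(json_ok, parsed["output"])
```

ast.literal_eval only accepts literals (strings, numbers, tuples, lists, dicts, sets, booleans, None), so it does not execute arbitrary code the way eval would.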
