feat(compute_scale): GPU hardware scaler + H200 mode for LLM pretrain/RL #9
Adds a `compute_scale` config knob (1.0 = H100 baseline; 0.5 = H200 ~= 2x H100) that lets the LLM pretraining and RL tasks run more efficiently on larger GPUs without changing results, and is offered optionally for the other tasks.

- LLM pretraining (11 tasks): at 0.5, compute 4->2 GPUs with BATCH_SIZE x2 / GRAD_ACCUM /2 (via a per-entry `h200` env block). The macro-batch (BATCH x GRAD_ACCUM) is invariant, so results are unchanged.
- LLM RL (4 tasks): at 0.5, compute 2->1 GPU and train.sh -> train_1gpu.sh (vLLM tensor_model_parallel_size 2->1). The global train_batch_size is unchanged, so results are unchanged.
- Other tasks: fractional (compute<=1) entries scale for denser packing; a multi-GPU (compute>1) task without an `h200` block is left untouched and prints a warning (cutting its GPU count alone would change its global batch).

Implementation:
- scale_test_cmd_entries() in agent/tools.py applies the scale / h200 override; test_cmd entries now honor a per-entry `env` dict, injected in all execution backends (apptainer, docker, local, SLURM, and the rootless-docker session).
- the standalone GPU scheduler mirrors the compute transform.
- base.py / cli.py / discover_agent.py forward compute_scale from config; configs/react.yaml documents the knob.

Note: the Harbor export path runs tasks independently and does not consume compute_scale.
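The macro-batch invariance claimed for the pretraining tasks can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the repo: `macro_batch` is a hypothetical helper, and it assumes (as the PR description states) that the training script divides GRAD_ACCUM by the world size internally. The numbers are the `llm-pretrain-loss` values quoted in the PR description (32 x 16 on 4 H100s, 64 x 8 on 2 H200s).

```python
def macro_batch(batch_size: int, grad_accum: int, world_size: int) -> int:
    """Samples contributing to one optimizer step (hypothetical helper).

    Assumption: the training script divides GRAD_ACCUM by the world size,
    so each GPU runs grad_accum // world_size accumulation steps.
    """
    if grad_accum % world_size != 0:
        raise ValueError("GRAD_ACCUM must be divisible by the world size")
    accum_steps_per_gpu = grad_accum // world_size
    # per-GPU micro-batch x number of GPUs x accumulation steps per GPU
    return batch_size * world_size * accum_steps_per_gpu


# H100 profile: 4 GPUs, BATCH_SIZE=32, GRAD_ACCUM=16
h100 = macro_batch(batch_size=32, grad_accum=16, world_size=4)
# H200 profile: 2 GPUs, BATCH_SIZE x2, GRAD_ACCUM /2
h200 = macro_batch(batch_size=64, grad_accum=8, world_size=2)
print(h100, h200)
```

Under that assumption the world size cancels out, which is why halving the GPU count alone is not enough: the per-GPU batch and the accumulation count have to move together to keep the product fixed.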
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 479d90b504
    - "package": "verl"
    + "package": "verl",
    + "h200": {
    +     "cmd": "scripts/train_1gpu.sh",
Keep the H200 advantage run on the same dataset
When compute_scale is enabled this override switches llm-rl-advantage from scripts/train.sh to scripts/train_1gpu.sh, but that script trains on /root/data/deepmath/train_lv3-5.parquet while the H100 script on lines 14-18 of tasks/llm-rl-advantage/scripts/train.sh uses /root/data/deepmath/train_5k.parquet. This means H200-mode results for this task are no longer comparable to the baseline configuration; the 1-GPU script should preserve the same training split and only change GPU/memory-related settings.
    for k, v in (cmd_entry.get("env") or {}).items():
        docker_cmd.extend(["-e", f"{k}={v}"])
Avoid leaking per-entry env into Docker sessions
When rootless Docker runs a grouped set of commands, _run_docker_entries_in_session launches one long-lived container using only cmd_entries[0]; adding that entry's env to the base docker run means any variables present only on the first entry remain in the container environment for later docker exec commands that do not explicitly override them. In grouped tasks with mixed per-entry env, this can silently run later experiments with stale settings, so the session container should not inherit per-command env beyond the exec-specific -e values.
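One way to follow this suggestion is to keep per-entry env off the long-lived `docker run` and pass it only on each `docker exec`. The sketch below is illustrative only: `build_session_run_cmd`, `build_exec_cmd`, the image name, and the container name are hypothetical and are not the repo's `_run_docker_entries_in_session`. It only builds argument lists and relies on the standard `-e` flag of `docker exec`.

```python
from typing import Dict, List, Optional


def build_session_run_cmd(image: str, name: str) -> List[str]:
    """Base command for the long-lived container: no per-entry env here."""
    return ["docker", "run", "-d", "--rm", "--name", name, image,
            "sleep", "infinity"]


def build_exec_cmd(name: str, cmd: str,
                   env: Optional[Dict[str, object]] = None) -> List[str]:
    """Per-command exec: env is scoped to this exec only."""
    exec_cmd = ["docker", "exec"]
    for k, v in (env or {}).items():
        exec_cmd.extend(["-e", f"{k}={v}"])
    exec_cmd.extend([name, "bash", "-lc", cmd])
    return exec_cmd


# Hypothetical grouped entries: only the first one carries an env block.
entries = [
    {"cmd": "bash run_a.sh", "env": {"BATCH_SIZE": 64}},
    {"cmd": "bash run_b.sh"},
]
run_cmd = build_session_run_cmd("bench:latest", "session0")
exec_cmds = [build_exec_cmd("session0", e["cmd"], e.get("env")) for e in entries]
```

With this split, the second entry's exec carries no `-e` flags and the container itself never saw `BATCH_SIZE`, so nothing from the first entry can carry over.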
    if compute <= 1.0:
        # Write even when `compute` was implicit (default 1.0) so this
        # matches scheduler._scaled_compute, which scales the default.
        entry["compute"] = compute * scale
Do not halve every single-GPU task by default
With compute_scale: 0.5, this rewrites every non-h200 entry with compute: 1.0 to 0.5, so grouped tasks outside the LLM families are no longer left as-is: for example, configs such as cv-classification-loss have multiple compute: 1.0 commands in the same group, and the GPU packers will now co-locate formerly dedicated single-GPU runs on one GPU. That can introduce OOMs/timeouts or different runtime conditions for tasks that were not explicitly retuned for H200, so the fallback scaling should be opt-in or limited to entries known to be safe.
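The packing effect described here can be illustrated with a toy first-fit packer. This is a sketch under stated assumptions: `pack_first_fit` is hypothetical and is not the repo's GPU packer; it only shows how halving `compute` changes how many jobs share one GPU.

```python
from typing import List


def pack_first_fit(computes: List[float], capacity: float = 1.0) -> List[List[float]]:
    """Place each job on the first GPU with enough free capacity."""
    gpus: List[List[float]] = []
    for c in computes:
        for gpu in gpus:
            # Small tolerance so fractional loads that sum to capacity still fit.
            if sum(gpu) + c <= capacity + 1e-9:
                gpu.append(c)
                break
        else:
            gpus.append([c])  # no GPU had room: open a new one
    return gpus


group = [1.0, 1.0, 1.0, 1.0]               # four dedicated single-GPU runs
baseline = pack_first_fit(group)            # one job per GPU
scaled = pack_first_fit([c * 0.5 for c in group])  # two jobs per GPU
print(len(baseline), len(scaled))
```

In this toy model the same four runs go from four GPUs to two, so each run now shares memory and compute with a neighbour that it did not have at `compute: 1.0`.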
… fix advantage data
Make the RL tasks' H200 profile use the same per-entry `env` mechanism as the
pretrain tasks instead of switching scripts:
- train.sh is parametrised: TP_SIZE / MAX_TOKEN_LEN_PER_GPU / GPU_MEM_UTIL, with
the H100 defaults (2 / 17408 / 0.4).
- the h200 block becomes {compute: 1, env: {TP_SIZE: 1, MAX_TOKEN_LEN_PER_GPU:
20480, GPU_MEM_UTIL: 0.5}}, matching the old train_1gpu.sh.
- train_1gpu.sh is removed (4 files).
Also fix llm-rl-advantage training data: it is a 200-step task but train.sh
pointed at deepmath/train_5k (a 5K subset), while its real 1-GPU baseline and
the other 200-step tasks use deepmath/train_lv3-5 (30K). Corrected to
train_lv3-5.
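A rough sketch of how an `h200` block of shape {compute, env} could be resolved against an entry. This is not the repo's `scale_test_cmd_entries()`: the function name `scale_entries_sketch` is hypothetical, and applying the `h200` block whenever the scale is below 1.0 is an assumption. The entry keys (`cmd`, `compute`, `env`, `h200`) and the RL values come from this PR.

```python
import copy
import sys
from typing import Any, Dict, List


def scale_entries_sketch(entries: List[Dict[str, Any]], scale: float) -> List[Dict[str, Any]]:
    """Return scaled copies of the entries; the input is never mutated."""
    resolved = []
    for entry in copy.deepcopy(entries):
        h200 = entry.pop("h200", None)  # helper key never reaches the runner
        if scale != 1.0:
            compute = entry.get("compute", 1.0)
            if h200 is not None and scale < 1.0:  # assumption: <1.0 means H200 mode
                env = {**entry.get("env", {}), **h200.get("env", {})}
                entry.update({k: v for k, v in h200.items() if k != "env"})
                if env:
                    entry["env"] = env
            elif compute <= 1.0:
                # Written even when `compute` was implicit (default 1.0).
                entry["compute"] = compute * scale
            else:
                print(f"warning: multi-GPU entry without h200 block left untouched: "
                      f"{entry.get('cmd')}", file=sys.stderr)
        resolved.append(entry)
    return resolved


rl_entry = {
    "cmd": "scripts/train.sh",
    "compute": 2,
    "h200": {"compute": 1,
             "env": {"TP_SIZE": 1, "MAX_TOKEN_LEN_PER_GPU": 20480, "GPU_MEM_UTIL": 0.5}},
}
h100_view = scale_entries_sketch([rl_entry], 1.0)[0]
h200_view = scale_entries_sketch([rl_entry], 0.5)[0]
```

At scale 1.0 the entry keeps `compute: 2` and no env, so train.sh falls back to its H100 defaults (2 / 17408 / 0.4); at 0.5 it resolves to `compute: 1` plus the three overrides.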
feat(compute_scale): GPU hardware scaler + H200 mode for LLM pretrain/RL
Summary
Reintroduces a `compute_scale` config knob so the benchmark can run efficiently on GPUs larger than the H100 the tasks are tuned on (e.g. H200 ≈ 2× H100), in response to #4. Set `compute_scale: 0.5` for H200; the default `1.0` (H100) is a strict no-op.

The macro-batch of every adapted task is held constant, so results are unchanged; only the GPU count, per-GPU batch, and wall-clock time differ. Both task families are adapted through the same per-entry `h200` block of shape `{compute, env}`, with no script switching.

Design

LLM pretraining (11 `llm-pretrain-*` tasks)
At `0.5` the training command drops from 4 → 2 GPUs and compensates per GPU: `BATCH_SIZE` ×2, `GRAD_ACCUM` ÷2.

The macro-batch `= BATCH_SIZE × GRAD_ACCUM` is invariant (the script divides grad-accum by world size internally). For example, in `llm-pretrain-loss`, H100 `32×16=512` equals H200 `64×8=512`. The larger per-GPU micro-batch uses the H200's extra memory; the gradient is mathematically identical, so the metric is unchanged.

LLM RL (4 `llm-rl-*` tasks)
At `0.5` the command stays the same `train.sh` but drops from 2 → 1 GPU and applies its H200 profile via env (same mechanism as pretrain, no separate script):
- `TP_SIZE=1`: vLLM `tensor_model_parallel_size` 2 → 1
- `MAX_TOKEN_LEN_PER_GPU=20480`: per-GPU dynamic-batch token budget (17408 → 20480)
- `GPU_MEM_UTIL=0.5`: `gpu_memory_utilization` (0.4 → 0.5)

`train.sh` is parametrised with the H100 values as defaults; the old `train_1gpu.sh` script-switch variants are removed. The global `data.train_batch_size` (128 prompts) is unchanged, so the RL result is unchanged.

Other tasks (optional)
`compute_scale` is offered for every task but is only really needed for the two families above:
- Fractional entries (`compute ≤ 1`, e.g. eval jobs and the RL-control tasks): `compute ×= scale` for denser packing; this never changes a single job's result.
- Multi-GPU tasks without an `h200` block (`compute > 1`, e.g. `cv-vae-loss`): left untouched plus a stderr warning, since cutting a data-parallel job's GPU count alone would change its global batch / result.

Also: fix `llm-rl-advantage` training data
While unifying the RL scripts, found a data drift. `llm-rl-advantage` is a 200-step task, but its `train.sh` pointed at `deepmath/train_5k` (a 5K random subset), whereas its real single-GPU baseline (the leaderboard was produced with `train_1gpu.sh` on H200) and the other two 200-step tasks (`kl-estimator`, `reward-normalization`) use `deepmath/train_lv3-5` (30K). Corrected `train.sh` to `train_lv3-5`. (The 100-step `importance-sampling` correctly keeps `train_5k`.)

Implementation
- `scale_test_cmd_entries()` (`agent/tools.py`): single source of truth for the transform (deep-copies entries, strips the `h200` helper key).
- `test_cmd` entries honor a per-entry `env` dict, injected last in all execution backends: apptainer, docker, local, SLURM (via `_build_container_cmd`), and the rootless-docker long-lived session.
- `base.py` / `cli.py` / `discover_agent.py` forward `compute_scale` from config; `configs/react.yaml` documents the knob.

Not covered
The Harbor export path (`harbor/` + `datasets/.../score_task.py`) runs tasks independently of the native runtime, so it does not consume `compute_scale`.

Testing
Static checks: entry resolution (h200 override for both families; fractional scaling; multi-GPU untouched; `1.0` no-op), per-entry env rendered into apptainer/docker commands, tools/scheduler transform parity, parametrised `train.sh` passes `bash -n`, and all task configs pass edit-range validation.

Refs #4