feat(compute_scale): GPU hardware scaler + H200 mode for LLM pretrain/RL #9

Merged
Imbernoulli merged 3 commits into main from compute-scale-h200 on Jun 4, 2026

@Imbernoulli Imbernoulli commented Jun 4, 2026

Summary

Reintroduces a compute_scale config knob so the benchmark can run efficiently on GPUs larger than the H100 the tasks are tuned on (e.g. H200 ≈ 2× H100), in response to #4. Set compute_scale: 0.5 for H200; the default 1.0 (H100) is a strict no-op.

The macro-batch of every adapted task is held constant, so results are unchanged — only the GPU count / per-GPU batch / wall-clock differ. Both task families are adapted through the same per-entry h200 block of shape {compute, env} — no script switching.

Design

LLM pretraining (11 llm-pretrain-* tasks)

At 0.5 the training command drops from 4 → 2 GPUs and compensates per GPU: BATCH_SIZE ×2, GRAD_ACCUM ÷2.

The macro-batch = BATCH_SIZE × GRAD_ACCUM is invariant (the script divides grad-accum by world size internally) — e.g. llm-pretrain-loss: H100 32×16=512 equals H200 64×8=512. The larger per-GPU micro-batch uses the H200's extra memory; the gradient is mathematically identical, so the metric is unchanged.
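The invariance can be checked with a few lines of arithmetic. The sketch below is illustrative only: `per_rank_accum_steps` and `macro_batch` are hypothetical helpers (not functions from this repo), and it assumes, per the note above, that the training script divides GRAD_ACCUM by the world size.

```python
def per_rank_accum_steps(grad_accum: int, world_size: int) -> int:
    """Accumulation steps each rank runs, assuming the script divides
    GRAD_ACCUM by the world size (an assumption taken from the PR text)."""
    if grad_accum % world_size != 0:
        raise ValueError("GRAD_ACCUM must be divisible by the world size")
    return grad_accum // world_size


def macro_batch(batch_size: int, grad_accum: int, world_size: int) -> int:
    """Samples that contribute to one optimizer step across all ranks."""
    steps = per_rank_accum_steps(grad_accum, world_size)
    return batch_size * steps * world_size


# llm-pretrain-loss numbers quoted in the description.
h100 = macro_batch(batch_size=32, grad_accum=16, world_size=4)
h200 = macro_batch(batch_size=64, grad_accum=8, world_size=2)
assert h100 == h200 == 512
```

Both profiles run 4 accumulation steps per rank; the H200 profile trades two ranks for a doubled per-GPU micro-batch.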

"h200": { "compute": 2.0, "env": { "BATCH_SIZE": "64", "GRAD_ACCUM": "8" } }

LLM RL (4 llm-rl-* tasks)

At 0.5 the command stays the same train.sh but drops from 2 → 1 GPU and applies its H200 profile via env (same mechanism as pretrain — no separate script):

  • TP_SIZE=1: vLLM tensor_model_parallel_size 2 → 1
  • MAX_TOKEN_LEN_PER_GPU=20480: per-GPU dynamic-batch token budget (17408 → 20480)
  • GPU_MEM_UTIL=0.5: gpu_memory_utilization (0.4 → 0.5)

train.sh is parametrised with the H100 values as defaults; the old train_1gpu.sh script-switch variants are removed. The global data.train_batch_size (128 prompts) is unchanged, so the RL result is unchanged.
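As a rough model of "H100 values as defaults, H200 values via env", the following Python sketch mirrors the resolution order. It is not the actual train.sh (which is a shell script and is not shown here); `H100_DEFAULTS` and `resolve_rl_profile` are hypothetical names used only for illustration.

```python
# H100 defaults as stated in this PR (TP_SIZE 2, token budget 17408, mem util 0.4).
H100_DEFAULTS = {
    "TP_SIZE": "2",
    "MAX_TOKEN_LEN_PER_GPU": "17408",
    "GPU_MEM_UTIL": "0.4",
}


def resolve_rl_profile(entry_env=None):
    """Return the effective settings: defaults, overridden by per-entry env."""
    profile = dict(H100_DEFAULTS)      # copy so the defaults stay untouched
    profile.update(entry_env or {})    # per-entry env wins where present
    return profile


h200_env = {"TP_SIZE": "1", "MAX_TOKEN_LEN_PER_GPU": "20480", "GPU_MEM_UTIL": "0.5"}

assert resolve_rl_profile() == H100_DEFAULTS
assert resolve_rl_profile(h200_env) == h200_env
```

With no env set, the run keeps the H100 profile, which is what makes the default path a no-op.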

"h200": { "compute": 1, "env": { "TP_SIZE": "1", "MAX_TOKEN_LEN_PER_GPU": "20480", "GPU_MEM_UTIL": "0.5" } }

Other tasks (optional)

compute_scale is offered for every task but is only really needed for the two families above:

  • fractional / single-GPU entries (compute ≤ 1, e.g. eval jobs and the RL-control tasks): compute ×= scale for denser packing — never changes a single job's result.
  • multi-GPU without an h200 block (compute > 1, e.g. cv-vae-loss): left untouched, with a warning printed to stderr, since cutting a data-parallel job's GPU count alone would change its global batch and therefore its result.
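The three cases (h200 override, fractional scaling, multi-GPU left alone) can be sketched as below. This is a hypothetical re-implementation for illustration, not the code in agent/tools.py: the name `scale_entries`, the `cmd` paths, and the choice to apply the h200 block for any scale other than 1.0 are assumptions.

```python
import copy
import sys


def scale_entries(entries, scale):
    """Illustrative transform following the rules described in this PR."""
    out = copy.deepcopy(entries)            # never mutate the task config
    for entry in out:
        h200 = entry.pop("h200", None)      # helper key is always stripped
        if scale == 1.0:
            continue                        # H100 baseline: no-op
        compute = entry.get("compute", 1.0)
        if h200 is not None:
            # Adapted task: take GPU count and env from the h200 block.
            entry["compute"] = h200["compute"]
            entry["env"] = {**(entry.get("env") or {}), **h200.get("env", {})}
        elif compute <= 1.0:
            # Fractional / single-GPU entry: pack more densely.
            entry["compute"] = compute * scale
        else:
            # Multi-GPU without an h200 block: changing it would alter results.
            print(f"warning: {entry.get('cmd', '<entry>')} uses compute={compute} "
                  "without an h200 block; left untouched", file=sys.stderr)
    return out


entries = [
    {"cmd": "scripts/train.sh", "compute": 4.0,
     "h200": {"compute": 2.0, "env": {"BATCH_SIZE": "64", "GRAD_ACCUM": "8"}}},
    {"cmd": "scripts/eval.sh", "compute": 0.5},
    {"cmd": "scripts/train.sh", "compute": 2.0},
]
scaled = scale_entries(entries, 0.5)
assert [e["compute"] for e in scaled] == [2.0, 0.25, 2.0]
```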

Also: fix llm-rl-advantage training data

While unifying the RL scripts, I found a data drift. llm-rl-advantage is a 200-step task, but its train.sh pointed at deepmath/train_5k (a 5K random subset). Its real single-GPU baseline (the leaderboard was produced with train_1gpu.sh on H200) and the other two 200-step tasks (kl-estimator, reward-normalization) use deepmath/train_lv3-5 (30K). Corrected train.sh to train_lv3-5. (The 100-step importance-sampling task correctly keeps train_5k.)

Implementation

  • scale_test_cmd_entries() (agent/tools.py): single source of truth for the transform (deep-copies entries, strips the h200 helper key).
  • test_cmd entries now honor a per-entry env dict, injected last in all execution backends: apptainer, docker, local, SLURM (via _build_container_cmd), and the rootless-docker long-lived session.
  • the standalone GPU scheduler mirrors the compute-number transform so its bin-packing matches actual demand.
  • base.py / cli.py / discover_agent.py forward compute_scale from config; configs/react.yaml documents the knob.
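For the Docker backend, "injected last" can be pictured roughly as follows. This is a sketch under assumptions: `docker_env_flags` and `base_env` are hypothetical names, and the merge is done in Python before emitting flags so the outcome does not depend on how duplicate `-e` flags are handled.

```python
def docker_env_flags(base_env, entry):
    """Build `-e KEY=VALUE` flags, letting the per-entry env override
    any backend-level variable of the same name."""
    merged = {**base_env, **(entry.get("env") or {})}
    flags = []
    for key, value in merged.items():
        flags.extend(["-e", f"{key}={value}"])
    return flags


entry = {"compute": 2.0, "env": {"BATCH_SIZE": "64", "GRAD_ACCUM": "8"}}
flags = docker_env_flags({"BATCH_SIZE": "32", "SEED": "0"}, entry)
assert flags == ["-e", "BATCH_SIZE=64", "-e", "SEED=0", "-e", "GRAD_ACCUM=8"]
```

An entry without an `env` dict yields only the backend-level flags, so existing configs behave as before.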

Not covered

The Harbor export path (harbor/ + datasets/.../score_task.py) runs tasks independently of the native runtime, so it does not consume compute_scale.

Testing

Static checks: entry resolution (h200 override for both families; fractional scaling; multi-GPU untouched; 1.0 no-op), per-entry env rendered into apptainer/docker commands, tools/scheduler transform parity, parametrised train.sh passes bash -n, and all task configs pass edit-range validation.

Refs #4

Adds a `compute_scale` config knob (1.0 = H100 baseline; 0.5 = H200 ~= 2x H100)
that lets the LLM pretraining and RL tasks run more efficiently on larger GPUs
without changing results, and is offered optionally for the other tasks.

- LLM pretraining (11 tasks): at 0.5, compute 4->2 GPUs with BATCH_SIZE x2 /
  GRAD_ACCUM /2 (via a per-entry `h200` env block). The macro-batch
  (BATCH x GRAD_ACCUM) is invariant, so results are unchanged.
- LLM RL (4 tasks): at 0.5, compute 2->1 GPU and train.sh -> train_1gpu.sh
  (vLLM tensor_model_parallel_size 2->1). The global train_batch_size is
  unchanged, so results are unchanged.
- Other tasks: fractional (compute<=1) entries scale for denser packing; a
  multi-GPU (compute>1) task without an `h200` block is left untouched and
  prints a warning (cutting its GPU count alone would change its global batch).

Implementation:
- scale_test_cmd_entries() in agent/tools.py applies the scale / h200 override;
  test_cmd entries now honor a per-entry `env` dict, injected in all execution
  backends (apptainer, docker, local, SLURM, and the rootless-docker session).
- the standalone GPU scheduler mirrors the compute transform.
- base.py / cli.py / discover_agent.py forward compute_scale from config;
  configs/react.yaml documents the knob.

Note: the Harbor export path runs tasks independently and does not consume
compute_scale.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 479d90b504

Comment thread tasks/llm-rl-advantage/config.json Outdated
-    "package": "verl"
+    "package": "verl",
+    "h200": {
+      "cmd": "scripts/train_1gpu.sh",

P1 Badge Keep the H200 advantage run on the same dataset

When compute_scale is enabled this override switches llm-rl-advantage from scripts/train.sh to scripts/train_1gpu.sh, but that script trains on /root/data/deepmath/train_lv3-5.parquet while the H100 script on lines 14-18 of tasks/llm-rl-advantage/scripts/train.sh uses /root/data/deepmath/train_5k.parquet. This means H200-mode results for this task are no longer comparable to the baseline configuration; the 1-GPU script should preserve the same training split and only change GPU/memory-related settings.

Comment on lines +1619 to +1620
for k, v in (cmd_entry.get("env") or {}).items():
    docker_cmd.extend(["-e", f"{k}={v}"])

P2 Badge Avoid leaking per-entry env into Docker sessions

When rootless Docker runs a grouped set of commands, _run_docker_entries_in_session launches one long-lived container using only cmd_entries[0]; adding that entry's env to the base docker run means any variables present only on the first entry remain in the container environment for later docker exec commands that do not explicitly override them. In grouped tasks with mixed per-entry env, this can silently run later experiments with stale settings, so the session container should not inherit per-command env beyond the exec-specific -e values.
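One possible shape for that separation is sketched below. It is an assumption-laden illustration, not the repo's `_run_docker_entries_in_session`: `session_run_cmd` and `exec_cmd` are hypothetical helpers, and `sleep infinity` is assumed as the keep-alive command. Only the standard `docker run -d --name` and `docker exec -e` options are used.

```python
def session_run_cmd(image, name):
    """Start the long-lived container with no per-entry env at all."""
    return ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"]


def exec_cmd(name, entry):
    """Run one entry in the session, scoping its env to this exec only."""
    cmd = ["docker", "exec"]
    for key, value in (entry.get("env") or {}).items():
        cmd.extend(["-e", f"{key}={value}"])
    cmd.extend([name, "bash", entry["cmd"]])
    return cmd


first = {"cmd": "scripts/train.sh", "env": {"TP_SIZE": "1"}}
second = {"cmd": "scripts/train.sh"}

assert "-e" not in session_run_cmd("image:tag", "session")
assert exec_cmd("session", first) == [
    "docker", "exec", "-e", "TP_SIZE=1", "session", "bash", "scripts/train.sh"]
assert "-e" not in exec_cmd("session", second)
```

Because the session container starts without entry-specific variables, the second entry cannot pick up `TP_SIZE` from the first.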

Comment on lines +282 to +285
if compute <= 1.0:
    # Write even when `compute` was implicit (default 1.0) so this
    # matches scheduler._scaled_compute, which scales the default.
    entry["compute"] = compute * scale

P2 Badge Do not halve every single-GPU task by default

With compute_scale: 0.5, this rewrites every non-h200 entry with compute: 1.0 to 0.5, so grouped tasks outside the LLM families are no longer left as-is: for example, configs such as cv-classification-loss have multiple compute: 1.0 commands in the same group, and the GPU packers will now co-locate formerly dedicated single-GPU runs on one GPU. That can introduce OOMs/timeouts or different runtime conditions for tasks that were not explicitly retuned for H200, so the fallback scaling should be opt-in or limited to entries known to be safe.

… fix advantage data

Make the RL tasks' H200 profile use the same per-entry `env` mechanism as the
pretrain tasks instead of switching scripts:
- train.sh is parametrised: TP_SIZE / MAX_TOKEN_LEN_PER_GPU / GPU_MEM_UTIL, with
  the H100 defaults (2 / 17408 / 0.4).
- the h200 block becomes {compute: 1, env: {TP_SIZE: 1, MAX_TOKEN_LEN_PER_GPU:
  20480, GPU_MEM_UTIL: 0.5}}, matching the old train_1gpu.sh.
- train_1gpu.sh is removed (4 files).

Also fix llm-rl-advantage training data: it is a 200-step task but train.sh
pointed at deepmath/train_5k (a 5K subset), while its real 1-GPU baseline and
the other 200-step tasks use deepmath/train_lv3-5 (30K). Corrected to
train_lv3-5.
@Imbernoulli Imbernoulli merged commit 13b1d61 into main Jun 4, 2026
@Imbernoulli Imbernoulli deleted the compute-scale-h200 branch June 4, 2026 08:59
Imbernoulli added a commit that referenced this pull request Jun 5, 2026
feat(compute_scale): GPU hardware scaler + H200 mode for LLM pretrain/RL
@Imbernoulli Imbernoulli mentioned this pull request Jun 9, 2026