feat(compute_scale): GPU hardware scaler + H200 mode for LLM pretrain/RL by Imbernoulli · Pull Request #9 · Imbernoulli/MLS-Bench

Imbernoulli · 2026-06-04T08:25:22Z

Summary

Reintroduces a compute_scale config knob so the benchmark can run efficiently on GPUs larger than the H100 the tasks are tuned on (e.g. H200 ≈ 2× H100), in response to #4. Set compute_scale: 0.5 for H200; the default 1.0 (H100) is a strict no-op.

The macro-batch of every adapted task is held constant, so results are unchanged — only the GPU count / per-GPU batch / wall-clock differ. Both task families are adapted through the same per-entry h200 block of shape {compute, env} — no script switching.

Design

LLM pretraining (11 `llm-pretrain-*` tasks)

At 0.5 the training command drops from 4 → 2 GPUs and compensates per GPU: BATCH_SIZE ×2, GRAD_ACCUM ÷2.

The macro-batch = BATCH_SIZE × GRAD_ACCUM is invariant (the script divides grad-accum by world size internally) — e.g. llm-pretrain-loss: H100 32×16=512 equals H200 64×8=512. The larger per-GPU micro-batch uses the H200's extra memory; the gradient is mathematically identical, so the metric is unchanged.

"h200": { "compute": 2.0, "env": { "BATCH_SIZE": "64", "GRAD_ACCUM": "8" } }
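The invariance can be checked with a few lines of arithmetic. This is a minimal sketch, assuming (per the note above) that the training script divides GRAD_ACCUM by the world size and that each GPU processes BATCH_SIZE sequences per micro-step. The helper name `macro_batch` is hypothetical and does not exist in the repository.

```python
def macro_batch(batch_size: int, grad_accum: int, world_size: int) -> int:
    """Sequences contributing to one optimizer step, summed over all GPUs."""
    if grad_accum % world_size != 0:
        raise ValueError("grad_accum must be divisible by world_size")
    # Assumption taken from the PR text: the script divides grad-accum by world size.
    per_gpu_accum = grad_accum // world_size
    return batch_size * per_gpu_accum * world_size


# llm-pretrain-loss values quoted above: 4 GPUs on H100, 2 GPUs on H200.
h100 = macro_batch(batch_size=32, grad_accum=16, world_size=4)
h200 = macro_batch(batch_size=64, grad_accum=8, world_size=2)
print(h100, h200)  # 512 512
```

Halving the GPU count while doubling the per-GPU batch and halving GRAD_ACCUM leaves the per-GPU accumulation steps at 4 in both cases, which is why the product is unchanged.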

LLM RL (4 `llm-rl-*` tasks)

At 0.5 the command stays the same train.sh but drops from 2 → 1 GPU and applies its H200 profile via env (same mechanism as pretrain — no separate script):

- TP_SIZE=1: vLLM tensor_model_parallel_size 2 → 1
- MAX_TOKEN_LEN_PER_GPU=20480: per-GPU dynamic-batch token budget (17408 → 20480)
- GPU_MEM_UTIL=0.5: gpu_memory_utilization (0.4 → 0.5)

train.sh is parametrised with the H100 values as defaults; the old train_1gpu.sh script-switch variants are removed. The global data.train_batch_size (128 prompts) is unchanged, so the RL result is unchanged.

"h200": { "compute": 1, "env": { "TP_SIZE": "1", "MAX_TOKEN_LEN_PER_GPU": "20480", "GPU_MEM_UTIL": "0.5" } }
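How the H200 profile overlays the H100 defaults can be modelled in a few lines. This is only an illustration: train.sh is a shell script, and the assumption here is that it falls back to the H100 value whenever a variable is unset. The names `H100_DEFAULTS` and `effective_rl_settings` are hypothetical.

```python
# H100 defaults as stated in the PR (TP_SIZE 2, token budget 17408, memory util 0.4).
H100_DEFAULTS = {
    "TP_SIZE": "2",
    "MAX_TOKEN_LEN_PER_GPU": "17408",
    "GPU_MEM_UTIL": "0.4",
}


def effective_rl_settings(entry_env=None):
    """Overlay a per-entry env dict on the H100 defaults without mutating them."""
    settings = dict(H100_DEFAULTS)
    settings.update(entry_env or {})
    return settings


h200_env = {"TP_SIZE": "1", "MAX_TOKEN_LEN_PER_GPU": "20480", "GPU_MEM_UTIL": "0.5"}
print(effective_rl_settings())          # H100 values, no env supplied
print(effective_rl_settings(h200_env))  # H200 profile from the h200 block
```

With no env supplied the result equals the H100 configuration, which matches the claim that the default run is unaffected.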

Other tasks (optional)

compute_scale is offered for every task but is only really needed for the two families above:

- fractional / single-GPU entries (compute ≤ 1, e.g. eval jobs and the RL-control tasks): compute ×= scale for denser packing; this never changes a single job's result.
- multi-GPU without an h200 block (compute > 1, e.g. cv-vae-loss): left untouched plus a stderr warning, since cutting a data-parallel job's GPU count alone would change its global batch / result.
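The three cases (h200 override, fractional scaling, multi-GPU left untouched) can be sketched as a single transform. This is an illustrative model and not the repository's scale_test_cmd_entries(): the function name `scale_entries`, the sample entries, and the choice to apply the h200 block for any scale other than 1.0 are all assumptions.

```python
import copy
import sys


def scale_entries(entries, scale):
    """Illustrative model of the compute_scale rules (not the repository code)."""
    scaled = copy.deepcopy(entries)  # never mutate the caller's config
    for entry in scaled:
        h200 = entry.pop("h200", None)  # helper key is stripped from the output
        if scale == 1.0:
            continue  # H100 baseline: leave compute and env as configured
        compute = entry.get("compute", 1.0)
        if h200 is not None:
            # Adapted task: take the hand-tuned GPU count and env overrides.
            entry["compute"] = h200["compute"]
            entry["env"] = {**(entry.get("env") or {}), **h200.get("env", {})}
        elif compute <= 1.0:
            # Fractional / single-GPU job: shrink its share for denser packing.
            entry["compute"] = compute * scale
        else:
            # Multi-GPU job without a profile: fewer GPUs would change the batch.
            print(f"warning: no h200 block for compute={compute}; left untouched",
                  file=sys.stderr)
    return scaled


# Hypothetical entries, one per case.
entries = [
    {"cmd": "train.sh", "compute": 4.0,
     "h200": {"compute": 2.0, "env": {"BATCH_SIZE": "64", "GRAD_ACCUM": "8"}}},
    {"cmd": "eval.sh", "compute": 0.5},
    {"cmd": "vae.sh", "compute": 4.0},
]
scaled = scale_entries(entries, 0.5)
print([e["compute"] for e in scaled])  # [2.0, 0.25, 4.0]
```

Deep-copying first keeps the original task config reusable, so the same entries can be resolved again at a different scale.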

Also: fix `llm-rl-advantage` training data

While unifying the RL scripts, I found a data drift. llm-rl-advantage is a 200-step task, but its train.sh pointed at deepmath/train_5k (a 5K random subset). Its real single-GPU baseline (the leaderboard was produced with train_1gpu.sh on H200) and the other two 200-step tasks (kl-estimator, reward-normalization) use deepmath/train_lv3-5 (30K). Corrected train.sh to train_lv3-5. (The 100-step importance-sampling task correctly keeps train_5k.)

Implementation

- scale_test_cmd_entries() (agent/tools.py): single source of truth for the transform (deep-copies entries, strips the h200 helper key).
- test_cmd entries now honor a per-entry env dict, injected last in all execution backends: apptainer, docker, local, SLURM (via _build_container_cmd), and the rootless-docker long-lived session.
- the standalone GPU scheduler mirrors the compute-number transform so its bin-packing matches actual demand.
- base.py / cli.py / discover_agent.py forward compute_scale from config; configs/react.yaml documents the knob.
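One way to picture a per-entry env that is "injected last" is a dictionary overlay in which the entry's values are applied after everything the backend already set. This is a minimal sketch under that assumption; `merged_env` and the sample values are hypothetical and are not taken from the backends' code.

```python
def merged_env(base_env, cmd_entry):
    """Return base_env with the entry's env applied last, so entry values win."""
    env = dict(base_env)
    # Values are stringified because process environments only hold strings.
    env.update({k: str(v) for k, v in (cmd_entry.get("env") or {}).items()})
    return env


base = {"PATH": "/usr/bin", "BATCH_SIZE": "32"}       # hypothetical backend env
entry = {"env": {"BATCH_SIZE": "64", "GRAD_ACCUM": 8}}  # hypothetical test_cmd entry
print(merged_env(base, entry))
# {'PATH': '/usr/bin', 'BATCH_SIZE': '64', 'GRAD_ACCUM': '8'}
```

An entry without an env key falls through unchanged, which is consistent with compute_scale 1.0 being a no-op.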

Not covered

The Harbor export path (harbor/ + datasets/.../score_task.py) runs tasks independently of the native runtime, so it does not consume compute_scale.

Testing

Static checks: entry resolution (h200 override for both families; fractional scaling; multi-GPU untouched; 1.0 no-op), per-entry env rendered into apptainer/docker commands, tools/scheduler transform parity, parametrised train.sh passes bash -n, and all task configs pass edit-range validation.

Refs #4

Adds a `compute_scale` config knob (1.0 = H100 baseline; 0.5 = H200 ~= 2x H100) that lets the LLM pretraining and RL tasks run more efficiently on larger GPUs without changing results, and is offered optionally for the other tasks.

- LLM pretraining (11 tasks): at 0.5, compute 4->2 GPUs with BATCH_SIZE x2 / GRAD_ACCUM /2 (via a per-entry `h200` env block). The macro-batch (BATCH x GRAD_ACCUM) is invariant, so results are unchanged.
- LLM RL (4 tasks): at 0.5, compute 2->1 GPU and train.sh -> train_1gpu.sh (vLLM tensor_model_parallel_size 2->1). The global train_batch_size is unchanged, so results are unchanged.
- Other tasks: fractional (compute<=1) entries scale for denser packing; a multi-GPU (compute>1) task without an `h200` block is left untouched and prints a warning (cutting its GPU count alone would change its global batch).

Implementation:

- scale_test_cmd_entries() in agent/tools.py applies the scale / h200 override; test_cmd entries now honor a per-entry `env` dict, injected in all execution backends (apptainer, docker, local, SLURM, and the rootless-docker session).
- the standalone GPU scheduler mirrors the compute transform.
- base.py / cli.py / discover_agent.py forward compute_scale from config; configs/react.yaml documents the knob.

Note: the Harbor export path runs tasks independently and does not consume compute_scale.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 479d90b504

chatgpt-codex-connector · 2026-06-04T08:33:03Z

-      "package": "verl"
+      "package": "verl",
+      "h200": {
+        "cmd": "scripts/train_1gpu.sh",


Keep the H200 advantage run on the same dataset

When compute_scale is enabled this override switches llm-rl-advantage from scripts/train.sh to scripts/train_1gpu.sh, but that script trains on /root/data/deepmath/train_lv3-5.parquet while the H100 script on lines 14-18 of tasks/llm-rl-advantage/scripts/train.sh uses /root/data/deepmath/train_5k.parquet. This means H200-mode results for this task are no longer comparable to the baseline configuration; the 1-GPU script should preserve the same training split and only change GPU/memory-related settings.

chatgpt-codex-connector · 2026-06-04T08:33:03Z

+        for k, v in (cmd_entry.get("env") or {}).items():
+            docker_cmd.extend(["-e", f"{k}={v}"])


Avoid leaking per-entry env into Docker sessions

When rootless Docker runs a grouped set of commands, _run_docker_entries_in_session launches one long-lived container using only cmd_entries[0]; adding that entry's env to the base docker run means any variables present only on the first entry remain in the container environment for later docker exec commands that do not explicitly override them. In grouped tasks with mixed per-entry env, this can silently run later experiments with stale settings, so the session container should not inherit per-command env beyond the exec-specific -e values.
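The separation the review asks for can be sketched as two command builders: one for the long-lived container, which carries no per-entry env, and one for each `docker exec`, which carries only that entry's variables. The function names, image name, and container name below are hypothetical; this is not the code of _run_docker_entries_in_session.

```python
def session_run_cmd(image, name):
    """Start the long-lived container without any per-entry env."""
    return ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"]


def exec_cmd(name, cmd_entry):
    """Run one entry inside the session, passing only that entry's env."""
    cmd = ["docker", "exec"]
    for key, value in (cmd_entry.get("env") or {}).items():
        cmd.extend(["-e", f"{key}={value}"])
    cmd.extend([name, "bash", "-lc", cmd_entry["cmd"]])
    return cmd


# Hypothetical grouped entries: only the first one sets BATCH_SIZE.
first = {"cmd": "bash scripts/train.sh", "env": {"BATCH_SIZE": "64"}}
second = {"cmd": "bash scripts/eval.sh"}

print(session_run_cmd("example-image", "session"))
print(exec_cmd("session", first))
print(exec_cmd("session", second))  # no -e flags, so nothing from `first` leaks
```

Because the session command never sees `first["env"]`, a later entry without its own env runs with the container's original environment.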

chatgpt-codex-connector · 2026-06-04T08:33:03Z

+            if compute <= 1.0:
+                # Write even when `compute` was implicit (default 1.0) so this
+                # matches scheduler._scaled_compute, which scales the default.
+                entry["compute"] = compute * scale


Do not halve every single-GPU task by default

With compute_scale: 0.5, this rewrites every non-h200 entry with compute: 1.0 to 0.5, so grouped tasks outside the LLM families are no longer left as-is: for example, configs such as cv-classification-loss have multiple compute: 1.0 commands in the same group, and the GPU packers will now co-locate formerly dedicated single-GPU runs on one GPU. That can introduce OOMs/timeouts or different runtime conditions for tasks that were not explicitly retuned for H200, so the fallback scaling should be opt-in or limited to entries known to be safe.

… fix advantage data

Make the RL tasks' H200 profile use the same per-entry `env` mechanism as the pretrain tasks instead of switching scripts:

- train.sh is parametrised: TP_SIZE / MAX_TOKEN_LEN_PER_GPU / GPU_MEM_UTIL, with the H100 defaults (2 / 17408 / 0.4).
- the h200 block becomes {compute: 1, env: {TP_SIZE: 1, MAX_TOKEN_LEN_PER_GPU: 20480, GPU_MEM_UTIL: 0.5}}, matching the old train_1gpu.sh.
- train_1gpu.sh is removed (4 files).

Also fix llm-rl-advantage training data: it is a 200-step task but train.sh pointed at deepmath/train_5k (a 5K subset), while its real 1-GPU baseline and the other 200-step tasks use deepmath/train_lv3-5 (30K). Corrected to train_lv3-5.

feat(compute_scale): GPU hardware scaler + H200 mode for LLM pretrain/RL

Imbernoulli added 2 commits June 4, 2026 01:24

docs(README): link issue #4 and PR #9 in the compute_scale news entry

479d90b

Imbernoulli mentioned this pull request Jun 4, 2026

Adaptation to different GPU types #4

Open

chatgpt-codex-connector Bot reviewed Jun 4, 2026

Imbernoulli merged commit 13b1d61 into main Jun 4, 2026

Imbernoulli deleted the compute-scale-h200 branch June 4, 2026 08:59

Imbernoulli added a commit that referenced this pull request Jun 5, 2026

docs(README): link issue #4 and PR #9 in the compute_scale news entry

bc7f911

Imbernoulli added a commit that referenced this pull request Jun 5, 2026

Merge pull request #9 from Imbernoulli/compute-scale-h200

09881e9

feat(compute_scale): GPU hardware scaler + H200 mode for LLM pretrain/RL

Imbernoulli mentioned this pull request Jun 9, 2026

Some Possible Bugs #25

Closed

Imbernoulli merged 3 commits into main from compute-scale-h200
