@cquil11 cquil11 commented Nov 21, 2025

Design Doc: Multi-Node First Class Integration

Disclaimer: No changes in this PR affect performance for either AMD or NVIDIA.

Test runs:

Introduction

Currently, multi-node benchmarks (in particular, GB200 benchmarks) are second-class. That is, they are executed in a fundamentally different way than single node benchmarks within the InferenceMAX framework.

For instance, while all single node benchmarks follow the rule of one single scenario per GitHub Actions runner, the runners/launch_gb200-nv.sh script launches all configurations at once (which are hard-coded into the Bash script) and allows SLURM to handle the scheduling.

This PR seeks to standardize multinode benchmarks such that:

  • The amount of logic and the number of variables hard-coded inside the low-level Bash scripts are kept to a minimum
  • All multinode scenarios can be represented by a configuration in a master configuration file
  • One GitHub Actions runner runs one single benchmark scenario (at one time)
    • This makes debugging much easier by isolating runs within the GitHub actions framework
  • In general, they are executed in a similar fashion to single node benchmarks

This design doc will explain the proposed architecture from "highest" to "lowest" levels of abstraction:

  1. Master config changes, config parsing script
  2. Benchmark template / workflow changes
  3. Runner / benchmark Bash script changes
  4. Physical runner changes

Proposal

Master Configs

The first step in supporting multinode configs as first-class citizens in InferenceMAX is allowing developers to define them in the master configs. This provides a strict template for passing information down to the runner and benchmark scripts that perform the actual execution.

Below is an example config employing the structure that multinode configurations will use (the new fields are described in the table that follows):

dsr1-fp4-gb200-dynamo-trt:
  image: nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3
  model: deepseek-r1-fp4
  model-prefix: dsr1
  runner: gb200
  precision: fp4
  framework: dynamo-trt
  multinode: true
  disagg: true
  seq-len-configs:
  - isl: 1024
    osl: 1024
    search-space:
    - spec-decoding: "mtp"
      conc-list: [ 1, 2, 4, 8, 16, 36 ]
      prefill:
        num-worker: 1
        tp: 4
        ep: 4
        dp-attn: false
        additional-settings:
        - "PREFILL_MAX_NUM_TOKENS=4608"
        - "PREFILL_MAX_BATCH_SIZE=4"
      decode:
        num-worker: 4
        tp: 8
        ep: 8
        dp-attn: false
        additional-settings:
        - "DECODE_MAX_NUM_TOKENS=128"
        - "DECODE_MAX_BATCH_SIZE=32"
        - "DECODE_GPU_MEM_FRACTION=0.9"
        - "DECODE_MTP_SIZE=3"
        - "DECODE_EPLB_NUM_SLOTS=0"

Below is a description of the additions:

Field | Type | Description
--- | --- | ---
multinode | bool | Will be added to all new configs AND all existing configs to indicate whether or not a config is multinode; true indicates multinode, false indicates single node
spec-decoding | string (optional) | One of mtp, draft_models, or none (defaults to none)
prefill / decode | dict | Nested JSON objects holding the config information specific to the prefill/decode instances, respectively
num-worker | int | The number of prefill/decode workers
tp / ep | int | The TP and EP for the prefill/decode instances
dp-attn | bool | Whether DP attention should be enabled for the prefill/decode instances
additional-settings | list[string] | A list of strings representing additional environment variables the developer wants passed to the underlying Bash script. This is split between prefill/decode to discourage developers from passing in environment variables unrelated to prefill/decode. However, no explicit validation will be done on these.
conc-list | list[int] | A list of concurrencies to run for this scenario. Note: this will be available for single node configs as well. The developer now gets a choice between specifying the full list of concurrencies or the classic conc-start and conc-end with a step factor of 2. The motivation is that, with the complexity added when creating multi-node recipes, it is not always as simple as sweeping across a "neat" distribution of concurrencies.
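
For illustration, here is a minimal sketch (in Python) of how a config parser could normalize the two concurrency formats into a single list; the helper name is hypothetical, and the conc-start/conc-end handling shown is only indicative:

def resolve_concurrencies(cfg: dict) -> list[int]:
    # Return the concurrencies for a scenario, accepting either an explicit
    # conc-list or the classic conc-start/conc-end pair swept with a step factor of 2.
    if "conc-list" in cfg:
        return list(cfg["conc-list"])
    conc, out = cfg["conc-start"], []
    while conc <= cfg["conc-end"]:
        out.append(conc)
        conc *= 2
    return out

# {"conc-list": [1, 2, 4, 8, 16, 36]}  ->  [1, 2, 4, 8, 16, 36]
# {"conc-start": 1, "conc-end": 16}    ->  [1, 2, 4, 8, 16]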

Motivation for additional-settings

As we know, multinode runs are significantly more complex than single node runs, as they introduce advanced techniques such as PD disaggregation, KV cache offload, and more. Furthermore, different serving frameworks such as vLLM, Dynamo + TRT, and SGLang will all have slightly different parameters that must be set.

Therefore, it is not practical to strictly define all of the settings (as required fields in the master config entries) required for, say, Dynamo when those fields may not be relevant to, say, SGLang.

In this proposal, we seek to define a standard set of variables needed to run multinode benchmarks (such as num-worker, max-num-tokens, etc.) and provide an additional section that allows developers to pass arbitrary values as environment variables to the runner launch script.
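
To make this concrete, here is a minimal sketch of how the additional-settings entries could be turned into environment variables and handed to a launch script. The helper function and the way the environment is passed are assumptions for illustration, not the final implementation:

import os
import subprocess

def parse_additional_settings(settings: list[str]) -> dict[str, str]:
    # Turn entries like "DECODE_MAX_NUM_TOKENS=128" into an env-var mapping.
    env = {}
    for entry in settings:
        key, _, value = entry.partition("=")
        env[key.strip()] = value.strip()
    return env

# Hypothetical usage: merge the prefill settings from the example config above
# into the environment the runner launch script is invoked with.
extra = parse_additional_settings(["PREFILL_MAX_NUM_TOKENS=4608", "PREFILL_MAX_BATCH_SIZE=4"])
subprocess.run(["bash", "runners/launch_gb200-nv.sh"], env={**os.environ, **extra}, check=True)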

While this adds a bit of complexity both in the master configs as well as the scripts responsible for parsing the configs, we believe it is ultimately necessary as we add more and more configurations to InferenceMAX. We believe this can also be useful for some single node configs in the future.

The overarching point of this is to try and define as much information as possible in the "source of truth" master configs. InferenceMAX prides itself on transparency and ease of understanding/reproducibility – having settings defined at the top level config rather than as complex logic at the Bash script level is important.


Master Config Parsing Scripts

Whenever additions are made to the master configs structure, corresponding changes must be made in the utils/generate_sweep_configs.py script.

In particular, the changes that must be made as part of this proposal are as follows:

  1. Validation logic must be added for the new input configs mentioned above

    • All configurations must have multinode [bool]
    • If multinode, then check for multinode specific fields (both required and optional)
    • If not multinode, then check for single node specific fields
  2. Validation for outputs must be updated to support new configs

    • Recall: output validation is in place to maintain strict integrity with what is expected from the benchmark-tmpl.yml workflow files
  3. Logic must be added to the runner-model-sweep and full-sweep functions to parse the correct configs from the configs and pass them as JSON objects to stdout (for consumption by the workflow files)

    • Will have options --single-node or --multi-node to get either the single node configs or multi node configs
    • We believe it is probably best to keep single/multi node configs separate and not mix them as they will still have some inherent differences when running

Additionally, this PR will include splitting up the single parsing script into two: one for validation and one for the core logic. This is necessary because the single file is getting a bit overwhelming to maintain.
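
For concreteness, below is a minimal sketch of the CLI shape the parsing script could expose. The flag names come from the list above; the validation body and config-selection logic are only indicative, not the final implementation:

import argparse
import json
import sys

import yaml  # PyYAML, assumed to already be available to the tooling

def validate(name: str, cfg: dict) -> None:
    # Indicative only: every config must declare `multinode`; multinode configs
    # would additionally be checked for prefill/decode fields, and single node
    # configs for their own required fields.
    if "multinode" not in cfg:
        raise ValueError(f"{name}: missing required field 'multinode'")

def main() -> None:
    parser = argparse.ArgumentParser(description="Emit sweep configs as JSON on stdout")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--single-node", action="store_true")
    mode.add_argument("--multi-node", action="store_true")
    parser.add_argument("--master-config", required=True, help="path to the master config YAML")
    args = parser.parse_args()

    with open(args.master_config) as f:
        configs = yaml.safe_load(f)
    for name, cfg in configs.items():
        validate(name, cfg)
    selected = {name: cfg for name, cfg in configs.items()
                if bool(cfg.get("multinode", False)) == args.multi_node}
    json.dump(selected, sys.stdout)

if __name__ == "__main__":
    main()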


Benchmark Template / Workflow

There are two approaches to consider for integrating the workflows to support multinode:

Option 1: Combined Template

Combine benchmark-tmpl.yml and benchmark-multinode-tmpl.yml into a single workflow template. Inputs to this file will be three-fold:

  1. Single node specific inputs
  2. Multinode specific inputs
  3. Common arguments

The common arguments will be required while the mode specific arguments will be optional. Then, jobs/steps are run conditioned upon whether or not it is a single or multinode benchmark.

Option 2: Separate Templates

Keep benchmark-tmpl.yml and benchmark-multinode-tmpl.yml separate, with separate inputs and slightly different logic.

Analysis

Option | Pros | Cons
--- | --- | ---
Option 1 | Single "point of entry" to run jobs on runners; top-level scheduler workflows don't need to split calls | Adds complex logic to the template file
Option 2 | Straightforward logic in the benchmark template script | Adds complexity to top-level workflow files; requires splitting single and multi node runs into separate jobs

Recommendation

We lean towards Option 2. While it adds more lines of code in the top level workflows, we believe single and multinode runs are inherently different and should be split up accordingly.

The structure will be as follows:

name: "Full Sweep Scheduler - 1k1k"

on:
  workflow_dispatch:
  schedule:
    - cron: "0 0 * * *"

jobs:
  # Get DeepSeek configs and store as separate output variables for single and multi node
  get-dsr1-configs:
    ...

  # Get GPT OSS configs and store as separate output variables for single and multi node
  get-gptoss-configs:
    ...

  benchmark-dsr1-single-node:
    needs: get-dsr1-configs
    uses: ./.github/workflows/benchmark-tmpl.yml
    name: dsr1 1k1k / ...

  benchmark-gptoss-single-node:
    needs: get-gptoss-configs
    uses: ./.github/workflows/benchmark-tmpl.yml
    name: gptoss 1k1k / ...

  benchmark-dsr1-multi-node:
    needs: get-dsr1-configs
    uses: ./.github/workflows/benchmark-multinode-tmpl.yml
    name: dsr1 1k1k / ...

  benchmark-gptoss-multi-node:
    needs: get-gptoss-configs
    uses: ./.github/workflows/benchmark-multinode-tmpl.yml
    name: gptoss 1k1k / ...

  collect-dsr1-results:
    needs: [benchmark-dsr1-single-node, benchmark-dsr1-multi-node]
    if: ${{ always() }}
    uses: ./.github/workflows/collect-results.yml
    secrets: inherit
    with:
      exp-name: "dsr1_1k1k"

  collect-gptoss-results:
    needs: [benchmark-gptoss-single-node, benchmark-gptoss-multi-node]
    if: ${{ always() }}
    uses: ./.github/workflows/collect-results.yml
    secrets: inherit
    with:
      exp-name: "gptoss_1k1k"

  calc-success-rate:
    needs: [benchmark-dsr1-single-node, benchmark-dsr1-multi-node, benchmark-gptoss-single-node, benchmark-gptoss-multi-node]
    if: ${{ always() }}
    runs-on: ubuntu-latest
    ...

Concurrency Handling

We propose running all concurrencies for each scenario in the same run, on the same inference server. The reason we opt for this over the traditional one-concurrency-per-server approach is the time it takes to spin up disaggregated inference servers; we don't want to waste a substantial amount of compute ($40-50 USD) on each concurrency.

This will require some changes to the bench_serving script to allow multiple concurrencies to be run at once.
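
A rough sketch of the shape this change could take: run each requested concurrency back-to-back against the same already-running endpoint, so the expensive disaggregated spin-up is paid only once per scenario. The function names here are placeholders, not the existing bench_serving API:

from typing import Awaitable, Callable

# Placeholder type for whatever bench_serving uses to run a single benchmark pass.
RunOnce = Callable[[str, int], Awaitable[dict]]

async def run_concurrency_sweep(endpoint: str, concurrencies: list[int], run_once: RunOnce) -> list[dict]:
    # Sweep all concurrencies against one server instead of one server per concurrency.
    results = []
    for conc in concurrencies:
        metrics = await run_once(endpoint, conc)
        results.append({"concurrency": conc, **metrics})
    return results

# Hypothetical usage with the conc-list from the example config above:
# asyncio.run(run_concurrency_sweep("http://head-node:8000", [1, 2, 4, 8, 16, 36], run_once))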


Runner/Bash Scripts

Recall that PR #227 standardizes the architecture in which all single node benchmarks are run. Specifically, benchmark-tmpl.yml uses the name of the runner, e.g., b200-nvd_0 to decide which runners/launch_XXXX.sh script to invoke, which subsequently launches a container. The entrypoint to this container is a benchmarks/MODEL_PRECISION_GPU_FRAMEWORK.sh script, based on the associated inputs. As such, this benchmarks/ script will run entirely inside a container.

We will follow a similar approach for multinode integration. However, the runners/ script will not launch a container that the entire benchmark runs in; instead, it will call a script specific to the model/precision/GPU/framework combination, which kicks off a series of scripts that start the prefill, decode, server, and benchmark processes.
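
As an illustration of that dispatch, the per-scenario launch could look roughly like the following (shown in Python for brevity even though the real entry point will be a Bash script; the environment variable names and the flattened cfg shape are assumptions):

import os
import subprocess

def launch_multinode_scenario(cfg: dict) -> None:
    # Resolve the per-combination script from the config fields and pass the
    # prefill/decode settings down as environment variables.
    script = f"benchmarks/{cfg['model-prefix']}_{cfg['precision']}_{cfg['runner']}_{cfg['framework']}.sh"
    env = {
        "NUM_PREFILL_WORKERS": str(cfg["prefill"]["num-worker"]),
        "NUM_DECODE_WORKERS": str(cfg["decode"]["num-worker"]),
        "PREFILL_TP": str(cfg["prefill"]["tp"]),
        "DECODE_TP": str(cfg["decode"]["tp"]),
    }
    subprocess.run(["bash", script], env={**os.environ, **env}, check=True)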


Physical Runners

Recall that currently, the only multinode job (running on GB200) launches a single script that submits ALL scenarios at once to SLURM. This is all done on one single GitHub Actions runner. This massively oversubscribes and takes advantage of the SLURM scheduler (however, we note that on the current GB200 cluster, the scheduling method is FIFO).

As mentioned previously, it is beneficial to have one scenario per job per runner (for debugging purposes, testing, reproducibility, etc). However, we would still like to oversubscribe and take advantage of the SLURM scheduler.

Proposal

Add three additional GitHub Actions runners listening on the login node.

Current GB200 NVL72 configurations average 8.4 nodes per job. With 4 runners submitting jobs concurrently on our 18-node cluster, this results in an oversubscription factor of approximately 1.86.
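
The back-of-envelope arithmetic for that estimate:

avg_nodes_per_job = 8.4
concurrent_runners = 4
cluster_nodes = 18
oversubscription = concurrent_runners * avg_nodes_per_job / cluster_nodes  # 33.6 / 18 = 1.866...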

We propose starting with 4 runners and increasing or decreasing that number as necessary.

Note: This will require either getting sudo access on the GB200 cluster (to add the new gharunner users) or getting an SRE to do this for us.

Tradeoffs

This architecture runs one job per runner, oversubscribing up to 4 jobs. The tradeoff here is that each scenario is run in random order, and since all jobs are not submitted at once, we cannot take full advantage of traditional SLURM scheduling techniques.

In exchange, as mentioned previously, we gain better visibility into the multi-node runs and they become easier to debug, as each run is isolated to a single job.


Frontend Considerations

There are no frontend considerations.


Testing Workflows

As part of this PR, we will make all of the necessary changes to make multinode tests first class as well, and remove the existing workaround logic specific to multinode (GB200).


Follow-Ups

In designing this proposal, we realize that there are a lot of downstream scripts and programs that we should upstream into the InferenceMAX repo (at the very least as a submodule). Again, InferenceMAX prides itself on being transparent and easily reproducible. When someone has to go to a completely separate repo and sift through a stack trace of 8 Dynamo files, we are going against this principle.

We should minimize the amount of git clone-ing that we do.

In particular, below is a list of files that should be upstreamed as follow up PRs:

  1. Kimbo's bench_serving script

    • We should standardize this and add it (the actual script) to the upstream InferenceMAX repo
  2. Dynamo scripts for PD-disagg

    • We should upstream the SBATCH and Python scripts, or perhaps even rewrite them to make them more straightforward
    • Right now, they are quite confusing with many levels of indirection

@cquil11 cquil11 force-pushed the multinode-integration branch from 1dd76e5 to 9ca96b1 on November 24, 2025 14:40
@cquil11 cquil11 merged commit 93e1b3c into main Dec 5, 2025
7 of 8 checks passed
@cquil11 cquil11 deleted the multinode-integration branch December 5, 2025 20:51
@cquil11 cquil11 restored the multinode-integration branch December 5, 2025 20:52
@functionstackx functionstackx deleted the multinode-integration branch January 11, 2026 19:50