@cquil11 cquil11 commented Nov 21, 2025

Design Doc: Multi-Node First Class Integration

Disclaimer: No changes in this PR affect performance for either AMD or NVIDIA.

Test runs:

Introduction

Currently, multi-node benchmarks (in particular, GB200 benchmarks) are second-class. That is, they are executed in a fundamentally different way than single node benchmarks within the InferenceMAX framework.

For instance, while all single node benchmarks follow the rule of one single scenario per GitHub Actions runner, the runners/launch_gb200-nv.sh script launches all configurations at once (which are hard-coded into the Bash script) and allows SLURM to handle the scheduling.

This PR seeks to standardize multinode benchmarks such that:

  • The amount of logic and the number of variables hard-coded inside the low-level Bash scripts are kept to a minimum
  • All multinode scenarios can be represented by a configuration in a master configuration file
  • One GitHub Actions runner runs one single benchmark scenario (at one time)
    • This makes debugging much easier by isolating runs within the GitHub actions framework
  • In general, they are executed in a similar fashion to single node benchmarks

This design doc will explain the proposed architecture from "highest" to "lowest" levels of abstraction:

  1. Master config changes, config parsing script
  2. Benchmark template / workflow changes
  3. Runner / benchmark Bash script changes
  4. Physical runner changes

Proposal

Master Configs

The first step in supporting multinode configs as first-class citizens in InferenceMAX is allowing developers to define them in the master configs. This provides a strict template for passing information down to the runner and benchmark scripts that perform the actual execution.

Below is an example config employing the structure that multinode configurations will use (the new fields are described in the table that follows):

dsr1-fp4-gb200-dynamo-trt:
  image: nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3
  model: deepseek-r1-fp4
  model-prefix: dsr1
  runner: gb200
  precision: fp4
  framework: dynamo-trt
  multinode: true
  disagg: true
  seq-len-configs:
  - isl: 1024
    osl: 1024
    search-space:
    - spec-decoding: "mtp"
      conc-list: [ 1, 2, 4, 8, 16, 36 ]
      prefill:
        num-worker: 1
        tp: 4
        ep: 4
        dp-attn: false
        additional-settings:
        - "PREFILL_MAX_NUM_TOKENS=4608"
        - "PREFILL_MAX_BATCH_SIZE=4"
      decode:
        num-worker: 4
        tp: 8
        ep: 8
        dp-attn: false
        additional-settings:
        - "DECODE_MAX_NUM_TOKENS=128"
        - "DECODE_MAX_BATCH_SIZE=32"
        - "DECODE_GPU_MEM_FRACTION=0.9"
        - "DECODE_MTP_SIZE=3"
        - "DECODE_EPLB_NUM_SLOTS=0"

Below is a description of the additions:

Field | Type | Description
--- | --- | ---
multinode | bool | Will be added to all new configs AND all existing configs to indicate whether or not a config is multinode; true indicates multinode, false indicates single node
spec-decoding | string (optional) | One of mtp, draft_models, or none (defaults to none)
prefill / decode | dict | Nested JSON objects holding the config information specific to the prefill/decode instances, respectively
num-worker | int | The number of prefill/decode workers
tp / ep | int | The TP and EP for the prefill/decode instances
dp-attn | bool | Whether DP attention should be enabled for the prefill/decode instances
additional-settings | list[string] | A list of strings representing additional environment variables the developer wants passed to the underlying Bash script. This is split between prefill/decode to discourage developers from passing in environment variables unrelated to prefill/decode. However, no explicit validation will be done on these.
conc-list | list[int] | A list of concurrencies to run for this scenario. Note: this will be available for single node configs as well. The developer now gets a choice between specifying the full list of concurrencies or the classic conc-start and conc-end with a step factor of 2. The motivation is that, with the complexity added when creating multi-node recipes, it is not always as simple as sweeping across a "neat" distribution of concurrencies.
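
For illustration, here is a minimal sketch (in Python) of how a config parser could normalize the two concurrency formats into a single list; the helper name is hypothetical, and the conc-start/conc-end handling shown is only indicative:

def resolve_concurrencies(cfg: dict) -> list[int]:
    # Return the concurrencies for a scenario, accepting either an explicit
    # conc-list or the classic conc-start/conc-end pair swept with a step factor of 2.
    if "conc-list" in cfg:
        return list(cfg["conc-list"])
    conc, out = cfg["conc-start"], []
    while conc <= cfg["conc-end"]:
        out.append(conc)
        conc *= 2
    return out

# {"conc-list": [1, 2, 4, 8, 16, 36]}  ->  [1, 2, 4, 8, 16, 36]
# {"conc-start": 1, "conc-end": 16}    ->  [1, 2, 4, 8, 16]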

Motivation for additional-settings

As we know, multinode runs are significantly more complex than single node runs, as they introduce advanced techniques such as PD disaggregation, KV cache offload, and more. Furthermore, different serving frameworks such as vLLM, Dynamo + TRT, and SGLang will all have slightly different parameters that must be set.

Therefore, it is not practical to strictly define all of the settings (as required fields in the master config entries) required for, say, Dynamo when those fields may not be relevant to, say, SGLang.

In this proposal, we seek to define a standard set of variables needed to run multinode benchmarks (such as num-worker, max-num-tokens, etc.) and provide an additional section that allows developers to pass arbitrary values as environment variables to the runner launch script.
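
To make this concrete, here is a minimal sketch of how the additional-settings entries could be turned into environment variables and handed to a launch script. The helper function and the way the environment is passed are assumptions for illustration, not the final implementation:

import os
import subprocess

def parse_additional_settings(settings: list[str]) -> dict[str, str]:
    # Turn entries like "DECODE_MAX_NUM_TOKENS=128" into an env-var mapping.
    env = {}
    for entry in settings:
        key, _, value = entry.partition("=")
        env[key.strip()] = value.strip()
    return env

# Hypothetical usage: merge the prefill settings from the example config above
# into the environment the runner launch script is invoked with.
extra = parse_additional_settings(["PREFILL_MAX_NUM_TOKENS=4608", "PREFILL_MAX_BATCH_SIZE=4"])
subprocess.run(["bash", "runners/launch_gb200-nv.sh"], env={**os.environ, **extra}, check=True)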

While this adds a bit of complexity both in the master configs as well as the scripts responsible for parsing the configs, we believe it is ultimately necessary as we add more and more configurations to InferenceMAX. We believe this can also be useful for some single node configs in the future.

The overarching point of this is to try and define as much information as possible in the "source of truth" master configs. InferenceMAX prides itself on transparency and ease of understanding/reproducibility – having settings defined at the top level config rather than as complex logic at the Bash script level is important.


Master Config Parsing Scripts

Whenever additions are made to the master configs structure, corresponding changes must be made in the utils/generate_sweep_configs.py script.

In particular, the changes that must be made as part of this proposal are as follows:

  1. Validation logic must be added for the new input configs mentioned above

    • All configurations must have multinode [bool]
    • If multinode, then check for multinode specific fields (both required and optional)
    • If not multinode, then check for single node specific fields
  2. Validation for outputs must be updated to support new configs

    • Recall: output validation is in place to maintain strict integrity with what is expected from the benchmark-tmpl.yml workflow files
  3. Logic must be added to the runner-model-sweep and full-sweep functions to parse the correct configs from the configs and pass them as JSON objects to stdout (for consumption by the workflow files)

    • Will have options --single-node or --multi-node to get either the single node configs or multi node configs
    • We believe it is probably best to keep single/multi node configs separate and not mix them as they will still have some inherent differences when running

Additionally, this PR will include splitting up the single parsing script into two: one for validation and one for the core logic. This is necessary because the single file is getting a bit overwhelming to maintain.
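
For concreteness, below is a minimal sketch of the CLI shape the parsing script could expose. The flag names come from the list above; the validation body and config-selection logic are only indicative, not the final implementation:

import argparse
import json
import sys

import yaml  # PyYAML, assumed to already be available to the tooling

def validate(name: str, cfg: dict) -> None:
    # Indicative only: every config must declare `multinode`; multinode configs
    # would additionally be checked for prefill/decode fields, and single node
    # configs for their own required fields.
    if "multinode" not in cfg:
        raise ValueError(f"{name}: missing required field 'multinode'")

def main() -> None:
    parser = argparse.ArgumentParser(description="Emit sweep configs as JSON on stdout")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--single-node", action="store_true")
    mode.add_argument("--multi-node", action="store_true")
    parser.add_argument("--master-config", required=True, help="path to the master config YAML")
    args = parser.parse_args()

    with open(args.master_config) as f:
        configs = yaml.safe_load(f)
    for name, cfg in configs.items():
        validate(name, cfg)
    selected = {name: cfg for name, cfg in configs.items()
                if bool(cfg.get("multinode", False)) == args.multi_node}
    json.dump(selected, sys.stdout)

if __name__ == "__main__":
    main()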


Benchmark Template / Workflow

There are two approaches to consider for integrating the workflows to support multinode:

Option 1: Combined Template

Combine benchmark-tmpl.yml and benchmark-multinode-tmpl.yml into a single workflow template. Inputs to this file will be three-fold:

  1. Single node specific inputs
  2. Multinode specific inputs
  3. Common arguments

The common arguments will be required while the mode specific arguments will be optional. Then, jobs/steps are run conditioned upon whether or not it is a single or multinode benchmark.

Option 2: Separate Templates

Keep benchmark-tmpl.yml and benchmark-multinode-tmpl.yml separate, with separate inputs and slightly different logic.

Analysis

Option | Pros | Cons
--- | --- | ---
Option 1 | Single "point of entry" to run jobs on runners; top-level scheduler workflows don't need to split calls | Adds complex logic to the template file
Option 2 | Straightforward logic in the benchmark template script | Adds complexity to top-level workflow files; requires splitting single and multi node runs into separate jobs

Recommendation

We lean towards Option 2. While it adds more lines of code in the top level workflows, we believe single and multinode runs are inherently different and should be split up accordingly.

The structure will be as follows:

name: "Full Sweep Scheduler - 1k1k"

on:
  workflow_dispatch:
  schedule:
    - cron: "0 0 * * *"

jobs:
  # Get DeepSeek configs and store as separate output variables for single and multi node
  get-dsr1-configs:
    ...

  # Get GPT OSS configs and store as separate output variables for single and multi node
  get-gptoss-configs:
    ...

  benchmark-dsr1-single-node:
    needs: get-dsr1-configs
    uses: ./.github/workflows/benchmark-tmpl.yml
    name: dsr1 1k1k / ...

  benchmark-gptoss-single-node:
    needs: get-gptoss-configs
    uses: ./.github/workflows/benchmark-tmpl.yml
    name: gptoss 1k1k / ...

  benchmark-dsr1-multi-node:
    needs: get-dsr1-configs
    uses: ./.github/workflows/benchmark-multinode-tmpl.yml
    name: dsr1 1k1k / ...

  benchmark-gptoss-multi-node:
    needs: get-gptoss-configs
    uses: ./.github/workflows/benchmark-multinode-tmpl.yml
    name: gptoss 1k1k / ...

  collect-dsr1-results:
    needs: [benchmark-dsr1-single-node, benchmark-dsr1-multi-node]
    if: ${{ always() }}
    uses: ./.github/workflows/collect-results.yml
    secrets: inherit
    with:
      exp-name: "dsr1_1k1k"

  collect-gptoss-results:
    needs: [benchmark-gptoss-single-node, benchmark-gptoss-multi-node]
    if: ${{ always() }}
    uses: ./.github/workflows/collect-results.yml
    secrets: inherit
    with:
      exp-name: "gptoss_1k1k"

  calc-success-rate:
    needs: [benchmark-dsr1-single-node, benchmark-dsr1-multi-node, benchmark-gptoss-single-node, benchmark-gptoss-multi-node]
    if: ${{ always() }}
    runs-on: ubuntu-latest
    ...

Concurrency Handling

We propose running all concurrencies for each scenario in the same run, on the same inference server. The reason we opt for this over the traditional one-concurrency-per-server approach is the time it takes to spin up disaggregated inference servers; we don't want to waste a substantial amount of compute ($40-50 USD) on each concurrency.

This will require some changes to the bench_serving script to allow multiple concurrencies to be run at once.
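
A rough sketch of the shape this change could take: run each requested concurrency back-to-back against the same already-running endpoint, so the expensive disaggregated spin-up is paid only once per scenario. The function names here are placeholders, not the existing bench_serving API:

from typing import Awaitable, Callable

# Placeholder type for whatever bench_serving uses to run a single benchmark pass.
RunOnce = Callable[[str, int], Awaitable[dict]]

async def run_concurrency_sweep(endpoint: str, concurrencies: list[int], run_once: RunOnce) -> list[dict]:
    # Sweep all concurrencies against one server instead of one server per concurrency.
    results = []
    for conc in concurrencies:
        metrics = await run_once(endpoint, conc)
        results.append({"concurrency": conc, **metrics})
    return results

# Hypothetical usage with the conc-list from the example config above:
# asyncio.run(run_concurrency_sweep("http://head-node:8000", [1, 2, 4, 8, 16, 36], run_once))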


Runner/Bash Scripts

Recall that PR #227 standardizes the architecture in which all single node benchmarks are run. Specifically, benchmark-tmpl.yml uses the name of the runner, e.g., b200-nvd_0 to decide which runners/launch_XXXX.sh script to invoke, which subsequently launches a container. The entrypoint to this container is a benchmarks/MODEL_PRECISION_GPU_FRAMEWORK.sh script, based on the associated inputs. As such, this benchmarks/ script will run entirely inside a container.

We will follow a similar approach for multinode integration. However, the runners/ script will not launch a container that the entire benchmark runs in; instead, it will call a script specific to the model/precision/GPU/framework combination, which kicks off a series of scripts that start the prefill, decode, server, and benchmark processes.
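
As an illustration of that dispatch, the per-scenario launch could look roughly like the following (shown in Python for brevity even though the real entry point will be a Bash script; the environment variable names and the flattened cfg shape are assumptions):

import os
import subprocess

def launch_multinode_scenario(cfg: dict) -> None:
    # Resolve the per-combination script from the config fields and pass the
    # prefill/decode settings down as environment variables.
    script = f"benchmarks/{cfg['model-prefix']}_{cfg['precision']}_{cfg['runner']}_{cfg['framework']}.sh"
    env = {
        "NUM_PREFILL_WORKERS": str(cfg["prefill"]["num-worker"]),
        "NUM_DECODE_WORKERS": str(cfg["decode"]["num-worker"]),
        "PREFILL_TP": str(cfg["prefill"]["tp"]),
        "DECODE_TP": str(cfg["decode"]["tp"]),
    }
    subprocess.run(["bash", script], env={**os.environ, **env}, check=True)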


Physical Runners

Recall that currently, the only multinode job (running on GB200) launches a single script that submits ALL scenarios at once to SLURM. This is all done on one single GitHub Actions runner. This massively oversubscribes and takes advantage of the SLURM scheduler (however, we note that on the current GB200 cluster, the scheduling method is FIFO).

As mentioned previously, it is beneficial to have one scenario per job per runner (for debugging purposes, testing, reproducibility, etc). However, we would still like to oversubscribe and take advantage of the SLURM scheduler.

Proposal

Add three additional GitHub Actions runners listening on the login node.

Current GB200 NVL72 configurations average 8.4 nodes per job. With 4 runners submitting jobs concurrently on our 18-node cluster, this results in an oversubscription factor of approximately 1.86.
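
The back-of-envelope arithmetic for that estimate:

avg_nodes_per_job = 8.4
concurrent_runners = 4
cluster_nodes = 18
oversubscription = concurrent_runners * avg_nodes_per_job / cluster_nodes  # 33.6 / 18 = 1.866...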

We propose starting with 4 runners and increasing or decreasing that number as necessary.

Note: This will require either getting sudo access on the GB200 cluster (to add the new gharunner users) or getting an SRE to do this for us.

Tradeoffs

This architecture runs one job per runner, oversubscribing up to 4 jobs. The tradeoff here is that each scenario is run in random order, and since all jobs are not submitted at once, we cannot take full advantage of traditional SLURM scheduling techniques.

In exchange, as mentioned previously, we gain better visibility into the multi-node runs and they become easier to debug, as each run is isolated to a single job.


Frontend Considerations

There are no frontend considerations.


Testing Workflows

As part of this PR, we will make all of the necessary changes to make multinode tests first class as well, and remove the existing workaround logic specific to multinode (GB200).


Follow-Ups

In designing this proposal, we realize that there are a lot of downstream scripts and programs that we should upstream into the InferenceMAX repo (at the very least as a submodule). Again, InferenceMAX prides itself on being transparent and easily reproducible. When someone has to go to a completely separate repo and sift through a stack trace of 8 Dynamo files, we are going against this principle.

We should minimize the amount of git clone-ing that we do.

In particular, below is a list of files that should be upstreamed as follow up PRs:

  1. Kimbo's bench_serving script

    • We should standardize this and add it (the actual script) to the upstream InferenceMAX repo
  2. Dynamo scripts for PD-disagg

    • We should upstream the SBATCH and Python scripts, or perhaps even rewrite them to make them more straightforward
    • Right now, they are quite confusing with many levels of indirection

@cquil11 cquil11 force-pushed the multinode-integration branch from 1dd76e5 to 9ca96b1 on November 24, 2025 14:40
@cquil11 cquil11 merged commit 93e1b3c into main Dec 5, 2025
7 of 8 checks passed
@cquil11 cquil11 deleted the multinode-integration branch December 5, 2025 20:51
@cquil11 cquil11 restored the multinode-integration branch December 5, 2025 20:52
@functionstackx functionstackx deleted the multinode-integration branch January 11, 2026 19:50