Design Doc: Multi-Node First Class Integration
Test runs:
Introduction
Currently, multi-node benchmarks (in particular, GB200 benchmarks) are second-class. That is, they are executed in a fundamentally different way than single node benchmarks within the InferenceMAX framework.
For instance, while all single-node benchmarks follow the rule of one scenario per GitHub Actions runner, the `runners/launch_gb200-nv.sh` script launches all configurations at once (hard-coded into the Bash script) and lets SLURM handle the scheduling. This PR seeks to standardize multinode benchmarks so that they are run through the same framework as single-node benchmarks.
This design doc explains the proposed architecture from the "highest" to the "lowest" levels of abstraction.
Proposal
Master Configs
The first step in supporting multinode configs as first-class citizens in InferenceMAX is allowing developers to define them in the master configs. This provides a strict template for passing information down to the runner and benchmark execution.
Below is an example config employing the structure that multinode configurations will use (note the new fields are highlighted in bold):
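A hypothetical sketch of such an entry is shown here (the entry name and all values are invented for illustration; new fields are flagged with comments rather than bold):

```yaml
# Hypothetical multinode master-config entry. Field names follow this
# proposal, but the entry name and all values are illustrative only.
example-model_fp8_gb200_dynamo:
  multinode: true              # new: true = multinode, false = single node
  spec-decoding: mtp           # new (optional): mtp, draft_models, or none
  prefill:                     # new: per-role worker settings
    num-worker: 2
    tp: 4
    ep: 4
    dp-attn: false
    additional-settings:
      - "EXAMPLE_SETTING=1"    # arbitrary KEY=VALUE passed to the launch script
  decode:                      # new: per-role worker settings
    num-worker: 4
    tp: 8
    ep: 8
    dp-attn: true
    additional-settings: []
  conc-list: [4, 16, 64, 256]  # new: explicit list instead of conc-start/conc-end
```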
Below is a description of the additions:
- `multinode` (`bool`): `true` indicates multinode, `false` indicates single node
- `spec-decoding` (`string`, optional): `mtp`, `draft_models`, or `none` (defaults to `none`)
- `prefill` / `decode` (`dict`): per-role settings, each containing:
  - `num-worker` (`int`)
  - `tp` / `ep` (`int`)
  - `dp-attn` (`bool`)
  - `additional-settings` (`list[string]`)
- `conc-list` (`list[int]`): an explicit list of concurrencies, replacing `conc-start` and `conc-end` with a step factor of 2. The motivation for this is that, with the complexity added when creating multi-node recipes, it is not always as simple as sweeping across a "neat" distribution of concurrencies.

Motivation for `additional-settings`

As we know, multinode runs are significantly more complex than single node, as they introduce advanced techniques such as PD disaggregation, KV cache offload, and more. Furthermore, different serving frameworks such as vLLM, Dynamo + TRT, and SGLang all have slightly different parameters that must be set.
Therefore, it is not practical to strictly define all of the settings (as required fields in the master config entries) required for, say, Dynamo when a given field may not be relevant to, say, SGLang.

In this proposal, we define a standard set of variables needed to run multinode benchmarks (such as `num-worker`, `max-num-tokens`, etc.) and provide an additional section that allows developers to pass arbitrary values as environment variables to the runner launch script.

While this adds a bit of complexity both to the master configs and to the scripts responsible for parsing them, we believe it is ultimately necessary as we add more and more configurations to InferenceMAX. This can also be useful for some single-node configs in the future.
The overarching point is to define as much information as possible in the "source of truth" master configs. InferenceMAX prides itself on transparency and ease of understanding/reproducibility, and having settings defined in the top-level config rather than as complex logic at the Bash script level is important.
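One way the parser could forward `additional-settings` entries to the runner launch script is sketched below; the function name and the `KEY=VALUE` string convention are assumptions for illustration, not the repo's actual implementation:

```python
import os

def apply_additional_settings(settings, env=None):
    """Parse KEY=VALUE strings from an additional-settings list into an
    environment mapping for the runner launch script.
    (Sketch only; the actual InferenceMAX parser may differ.)"""
    env = dict(env if env is not None else os.environ)
    for entry in settings:
        key, sep, value = entry.partition("=")
        if not sep or not key:
            raise ValueError(f"additional-settings entry must be KEY=VALUE: {entry!r}")
        env[key] = value
    return env

# Example: these variables would be visible to the launch script's subprocess.
env = apply_additional_settings(["DYN_KV_OFFLOAD=1", "SGL_EXTRA_ARGS=--foo"], env={})
```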
Master Config Parsing Scripts
Whenever additions are made to the master configs structure, corresponding changes must be made in the `utils/generate_sweep_configs.py` script. In particular, the changes required as part of this proposal are as follows:
- Validation logic must be added for the new input configs mentioned above (e.g., `multinode` [`bool`])
- Validation for outputs must be updated to support the new configs consumed by the `benchmark-tmpl.yml` workflow files
- Logic must be added to the `runner-model-sweep` and `full-sweep` functions to parse the correct configs and pass them as JSON objects to stdout (for consumption by the workflow files), with a flag such as `--single-node` or `--multi-node` to get either the single-node configs or the multinode configs

Additionally, this PR will split the single parsing script into two: one for validation and one for the core logic. This is necessary because the single file is becoming difficult to maintain.
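A hedged sketch of what the added validation could look like; any schema details beyond the fields described in this doc are assumptions:

```python
# Sketch of validation for the new multinode fields; the real checks in
# generate_sweep_configs.py will differ in structure and coverage.
VALID_SPEC_DECODING = {"mtp", "draft_models", "none"}

def validate_multinode_config(cfg: dict) -> list[str]:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    if not isinstance(cfg.get("multinode"), bool):
        errors.append("multinode must be a bool")
    spec = cfg.get("spec-decoding", "none")
    if spec not in VALID_SPEC_DECODING:
        errors.append(f"spec-decoding must be one of {sorted(VALID_SPEC_DECODING)}")
    for role in ("prefill", "decode"):
        section = cfg.get(role)
        if not isinstance(section, dict):
            errors.append(f"{role} section must be a dict")
            continue
        if not isinstance(section.get("num-worker"), int):
            errors.append(f"{role}.num-worker must be an int")
    conc = cfg.get("conc-list")
    if not (isinstance(conc, list) and all(isinstance(c, int) for c in conc)):
        errors.append("conc-list must be a list of ints")
    return errors
```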
Benchmark Template / Workflow
There are two approaches to consider for integrating the workflows to support multinode:
Option 1: Combined Template
Combine `benchmark-tmpl.yml` and `benchmark-multinode-tmpl.yml` into a single workflow template. Inputs to this file would be three-fold: common arguments, single-node-specific arguments, and multinode-specific arguments. The common arguments would be required, while the mode-specific arguments would be optional. Jobs/steps would then run conditioned on whether the benchmark is single node or multinode.
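A rough sketch of what Option 1's input surface could look like; all input and job names here are invented for illustration:

```yaml
# Sketch of a combined benchmark template (Option 1); names illustrative.
on:
  workflow_call:
    inputs:
      model:           # common argument: required in both modes
        type: string
        required: true
      multinode:       # common argument: selects which jobs/steps run
        type: boolean
        required: true
      conc-list:       # multinode-specific argument: optional
        type: string
        required: false

jobs:
  multinode-benchmark:
    if: ${{ inputs.multinode }}       # runs only for multinode scenarios
    runs-on: [self-hosted]
    steps:
      - run: echo "multinode path"
  single-node-benchmark:
    if: ${{ !inputs.multinode }}      # runs only for single-node scenarios
    runs-on: [self-hosted]
    steps:
      - run: echo "single-node path"
```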
Option 2: Separate Templates
Keep `benchmark-tmpl.yml` and `benchmark-multinode-tmpl.yml` separate, with separate inputs and slightly different logic.

Analysis
Recommendation
We lean towards Option 2. While it adds more lines of code in the top level workflows, we believe single and multinode runs are inherently different and should be split up accordingly.
The structure will be as follows:
Concurrency Handling
We propose running all concurrencies for each scenario in the same run, on the same inference server. The reason we opt for this instead of the traditional one-concurrency-per-server approach is the time it takes to spin up disaggregated inference servers; we don't want to waste an insane amount of compute ($40-50 USD) for each concurrency.
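The idea can be sketched as follows; `run_one_benchmark` is a hypothetical stand-in for the measurement loop, not the actual `bench_serving` API:

```python
# Sketch: sweep several concurrencies against one already-running server,
# instead of restarting the (expensive) disaggregated server per concurrency.
def run_one_benchmark(concurrency: int) -> dict:
    # Placeholder: would issue `concurrency` simultaneous requests
    # against the live server and record metrics.
    return {"concurrency": concurrency, "throughput_tok_s": None}

def sweep_concurrencies(conc_list: list[int]) -> list[dict]:
    """Run each concurrency sequentially against the same server."""
    results = []
    for conc in conc_list:
        results.append(run_one_benchmark(conc))
    return results

results = sweep_concurrencies([4, 16, 64])
```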
This will require some changes to the `bench_serving` script to allow multiple concurrencies to be run at once.

Runner/Bash Scripts
Recall that PR #227 standardizes the architecture in which all single-node benchmarks are run. Specifically, `benchmark-tmpl.yml` uses the name of the runner, e.g., `b200-nvd_0`, to decide which `runners/launch_XXXX.sh` script to invoke, which subsequently launches a container. The entrypoint to this container is a `benchmarks/MODEL_PRECISION_GPU_FRAMEWORK.sh` script, chosen based on the associated inputs. As such, this `benchmarks/` script runs entirely inside a container.

We will follow a similar approach for multinode integration. However, the `runners/` script will not launch a container that the entire benchmark runs in; instead, it will call a script specific to the model, precision, GPU, and framework combination, which kicks off a series of scripts starting up the prefill, decode, server, and benchmark processes.

Physical Runners
Recall that currently, the only multinode job (running on GB200) launches a single script that submits ALL scenarios at once to SLURM. This is all done on one single GitHub Actions runner. This massively oversubscribes and takes advantage of the SLURM scheduler (however, we note that on the current GB200 cluster, the scheduling method is FIFO).
As mentioned previously, it is beneficial to have one scenario per job per runner (for debugging purposes, testing, reproducibility, etc). However, we would still like to oversubscribe and take advantage of the SLURM scheduler.
Proposal
Add three additional GitHub Actions runners listening on the login node.
Current GB200 NVL72 configurations average 8.4 nodes per job. With 4 runners submitting jobs concurrently on our 18-node cluster, this results in an oversubscription factor of approximately 1.86.
We propose starting at 4 and decreasing/increasing as necessary.
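The oversubscription factor cited above comes from simple arithmetic:

```python
# Oversubscription = total nodes requested by concurrent jobs / physical nodes.
avg_nodes_per_job = 8.4   # average nodes per GB200 NVL72 job (from this doc)
num_runners = 4           # GitHub Actions runners submitting concurrently
cluster_nodes = 18        # physical nodes in the cluster

oversubscription = num_runners * avg_nodes_per_job / cluster_nodes
# 4 * 8.4 / 18 = 1.866..., i.e. approximately 1.86
```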
Tradeoffs
This architecture runs one job per runner, oversubscribing up to 4 jobs. The tradeoff is that scenarios run in an arbitrary order, and since not all jobs are submitted at once, we cannot take full advantage of traditional SLURM scheduling techniques.

In exchange, as mentioned previously, we gain better visibility into the multi-node runs, and they become easier to debug, as each run is isolated to a single job.
Frontend Considerations
There are no frontend considerations.
Testing Workflows
As part of this PR, we will make all the necessary changes to make tests for multinode first class as well. We will get rid of the specific workaround logic for multinode (GB200).
Follow-Ups
In designing this proposal, we realize that there are a lot of downstream scripts and programs that we should upstream into the InferenceMAX repo (at the very least as a submodule). Again, InferenceMAX prides itself on being transparent and easily reproducible. When someone has to go to a completely separate repo and sift through a stack trace of 8 Dynamo files, we are going against this principle.
We should try to minimize the amount of `git clone`-ing that we do at all costs. In particular, below is a list of files that should be upstreamed as follow-up PRs:
- Kimbo's `bench_serving` script
- Dynamo scripts for PD-disagg