feat: candidate search space (no perf changes for amd/nvidia) #145
Candidate Search Space
Note
No changes in this PR affect performance for either the AMD or NVIDIA benchmarks; it is purely a refactor.
Example Workflows:
See .github/README.md for more examples.
Follow-up TODOs:
Problems with Current Approach
Currently, the InferenceMAX architecture uses what can best be described as a “trickle-down” approach. More specifically, we start by separating the top-level workflows by sequence length (1k1k, 1k8k, 8k1k); each of these then invokes the “Full Sweep Template,” which can be thought of as a function used by another workflow. Like any function, these templates specify certain variables that must be passed to them, declared in their “signature,” or inputs. The Full Sweep Template then spawns a series of jobs conditioned on the input (e.g., the 1k1k scheduler only spawns jobs for ISL/OSL equal to 1024). The jobs spawned by the Full Sweep Template call another workflow function specific to each model (e.g., the 70b Template). These workflow functions then invoke jobs across all hardware and precisions compatible with the particular model. Finally, these jobs call the bottom-level function, the Benchmark Template, which encapsulates all of the logic for actually launching the benchmark script and scheduling it on a self-hosted runner. The “Current Architecture Diagram” entry in Appendix A of this document gives a visual representation of this flow.
The problems with the current approach are as follows:
Let us elaborate on point 2 above. Our workflows rely on GitHub Actions’ matrix strategy to generate jobs for all parallelism and concurrency levels in the Cartesian product tp-list x conc-list. For instance, a matrix along the lines of the sketch below (the exact syntax in the current workflow files may differ)
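```yaml
# Illustrative sketch only; the real workflow files define these lists inline
# and the exact key names may differ.
strategy:
  matrix:
    tp: [4, 8]
    conc: [1, 2, 4]
```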
would generate 6 jobs: tp4 conc1, tp4 conc2, tp4 conc4, tp8 conc1, tp8 conc2, tp8 conc4. There are some limitations with this approach:
Proposed Solution
To reconcile all of the aforementioned issues, we propose a simpler design that eliminates many of the layers of workflow functions by specifying all benchmark configurations in an external master configuration file, processing it at the top-level workflow, and creating a flattened matrix to launch all of the jobs. This effectively removes the need for the Full Sweep Template as well as the Model Templates. The diagram in Appendix B can be referenced to gain a high-level understanding of the proposed architecture.
In other words, all possible benchmark configurations are defined in one place, which will be considered the primary “source of truth.” Further, the logic for deciding which jobs run will be completely self-contained in a Python script.
Workflow Files
The top-level schedulers will continue to be split up by sequence length – recall this is due to the limit of approximately 500 jobs per workflow on GitHub Actions; beyond that, the UI fails to load and times out after 10 seconds. Furthermore, each scheduler workflow is split up by model, again due to a GitHub Actions limitation where no more than 256 jobs may be generated via a matrix. Each model has two jobs associated with it: one to retrieve the appropriate configs from the master config and dump the output JSON to an environment variable, and another to consume that JSON and generate a matrix of jobs, one for each parallelism-concurrency combination.
Consider the simplified example below:
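A minimal sketch of the two-job pattern is shown here (the second job's name, the get_configs.py flags, and the use of a job output rather than an environment variable are illustrative assumptions; the actual workflow in the PR may differ):

```yaml
jobs:
  get-gptoss-configs:
    runs-on: ubuntu-latest
    outputs:
      configs: ${{ steps.configs.outputs.configs }}
    steps:
      - uses: actions/checkout@v4
      - id: configs
        # Hypothetical flags; the real script may take different arguments.
        run: |
          echo "configs=$(python3 utils/get_configs.py \
            --config .github/configs/nvidia-master.yaml \
            --seq-len 1k1k --model gptoss)" >> "$GITHUB_OUTPUT"

  gptoss-benchmarks:
    needs: get-gptoss-configs
    strategy:
      matrix:
        config: ${{ fromJson(needs.get-gptoss-configs.outputs.configs) }}
    # One benchmark job per parallelism-concurrency combination.
    uses: ./.github/workflows/benchmark-tmpl.yml
    with:
      benchmark-config: ${{ toJson(matrix.config) }}
```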
The get-gptoss-configs job calls the utils/get_configs.py script with the appropriate inputs (the config file, sequence length, and model for which to retrieve configs). The get_configs.py script iterates through all configs in the master configuration and processes them so that they can be loaded via GitHub Actions’ fromJson function. Upon success, a matrix of jobs is created for each model.
Master Config Structure
The master config YAML is parsed by the utils/get_configs.py script and, as such, must adhere to a strict structure. The structure for an entry, along with a concrete example, is laid out below.
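As a rough illustration only (the key names below are assumptions, not the enforced schema), an entry might look something like:

```yaml
# Hypothetical entry; key names are illustrative, not the enforced format.
- model: gptoss
  precision: fp4
  runner: b200
  isl: 1024
  osl: 1024
  # Each parallelism level carries its own concurrency list, so only a
  # subset of the full TP x CONC product is ever launched.
  parallelism:
    - tp: 4
      ep: 1
      dp-attn: false
      conc: [1, 4, 16, 64]
    - tp: 8
      ep: 8
      dp-attn: true
      conc: [4, 16, 64]
```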
In order to keep the pipelines reproducible and reliable, every config entry must adhere exactly to this format. This will be both implicitly and explicitly enforced within the workflows.
Test Workflows
Many of the existing workflows used for testing, such as runner-model-sweep-test.yml, runner-sweep-test.yml, runner-test.yml, and more, are useful and must be ported/refactored as part of these proposed changes.
The above diagram depicts the general flow of any workflow using this proposed architecture.
Single Python Script
At a high level, the Python script (utils/matrix-logic/generate_sweep_configs.py) referenced in the above diagram works as follows:
1. The user enters a series of pre-defined inputs/arguments describing which configs they would like to run, drawn from the “source of truth” master configs.
2. These inputs trigger an appropriate function that invokes logic to fetch a list of all appropriate benchmarks to run, each represented as an individual JSON entry.
3. This final JSON object with all jobs to run is dumped to stdout so that it can be loaded via the GHA fromJson command and used to generate a matrix of jobs.
Here is the usage message for the script:
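As a rough stand-in, here is a sketch of the CLI surface and overall flow, using only the full-sweep subcommand and the flags that appear in the examples below; the real script's parser, defaults, and filtering logic may differ:

```python
import argparse
import json
import sys

import yaml  # PyYAML, assumed available in the CI environment


def load_entries(paths):
    """Load and concatenate every entry from the given master config files."""
    entries = []
    for path in paths:
        with open(path) as f:
            entries.extend(yaml.safe_load(f))
    return entries


def main():
    parser = argparse.ArgumentParser(prog="generate_sweep_configs.py")
    sub = parser.add_subparsers(dest="command", required=True)

    full = sub.add_parser("full-sweep", help="emit every matching benchmark job")
    full.add_argument("--model-prefix")
    full.add_argument("--runner-type", nargs="*", default=[])
    full.add_argument("--precision", nargs="*", default=[])
    full.add_argument("--test-mode", action="store_true")
    full.add_argument("--config-files", nargs="+", required=True)
    full.add_argument("--runner-config", required=True)
    args = parser.parse_args()

    jobs = []
    for entry in load_entries(args.config_files):
        # Illustrative filtering; the real script also consults the runner
        # config, expands each entry into one job per parallelism-concurrency
        # pair, and applies the test-mode reduction of the search space.
        if args.model_prefix and not str(entry.get("model", "")).startswith(args.model_prefix):
            continue
        if args.runner_type and entry.get("runner") not in args.runner_type:
            continue
        if args.precision and entry.get("precision") not in args.precision:
            continue
        jobs.append(entry)

    # One flat JSON list on stdout, ready to be consumed via GHA's fromJson.
    json.dump(jobs, sys.stdout)


if __name__ == "__main__":
    main()
```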
Let’s run through some of the existing test workflows and show that we can get the equivalent functionality from this script:
runner-test.yml. Meant to: test 1 model on 1 runner node.
Output:
runner-model-sweep.yml. Meant to: test multiple models on multiple runner nodes.
Output:
Notice the identical output – and it is now even better, since no one has to maintain all of the redundant values in the runner-model-sweep.yml workflow; they all live in one source-of-truth config.
runner-sweep-test.yml. Meant to: test 1 model on multiple runner nodes.
In English: “Test all h200 runner nodes that are compatible with a gptoss configuration as described in the master configs.”
In English: “Test all b200 runner nodes that are compatible with a dsr1 fp8 configuration as described in the master configs.”
full-sweep-test.yml. Meant to: test full sweep functionality.
full-sweep-test.yml is edited to use the Python script to run the selected configurations.
New:
e2e-tests.yml. Provides even more control over which tests to run, as the user inputs the Python args directly to the workflow, which allows maximum control over testing.
Tradeoff: it doesn’t replicate a full sweep exactly, i.e., no call to collect.results.yml and no integration for GB200 (yet) – however, this provides a great testing utility to exercise all cases quickly.
Examples:
python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix gptoss --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length → 362
In English: “Run all gptoss configurations.”
python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix gptoss --runner-type b200 --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length → 46
In English: “Run all gptoss configurations that run on B200.”
python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix dsr1 --runner-type b200 mi300x --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length → 63
In English: “Run all dsr1 configurations that run on b200 or mi300x.”
python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix dsr1 --runner-type b200 mi300x --precision fp8 --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length → 30
In English: “Run all dsr1 configurations that run on b200 or mi300x and use fp8.”
python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix dsr1 --runner-type b200 mi300x --precision fp8 --test-mode --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length → 6
In English: “Run all dsr1 configurations that run on b200 or mi300x and use fp8, but reduce the parallelism-concurrency search space to a single run (picking the highest parallelism and lowest concurrency level available for that configuration).”
Frontend Considerations
The proposed changes do not break anything on the frontend critical path. In particular, the artifacts generated by the full sweeps keep the same names, locations, etc. Do note that one discovery from this PR is that the frontend critically depends on the names of the full sweep schedulers, as these are what is used to hash the runs and collect their information.
As a follow-up, artifacts now carry the additional fields ep and dp-attn, so these should be added to the hover information for data points accordingly.
Tradeoffs
There are clear tradeoffs between the proposed architecture and the existing one. For one, while the proposed approach has far fewer layers of workflows (i.e., no more model templates, fewer workflow “functions”), it has more indirection through the Python script and master config. At the same time, most of the remaining logic is self-contained within the workflow files themselves: this is good for reducing the room for error, but it means having to create a separate workflow file for each new type of functionality.
GB200 Integration
Currently, the GB200 multinode benchmarks run using a separate workflow call template from all other benchmarks (benchmark-multinode-tmpl.yaml). This causes some issues when trying to integrate them directly into the proposed architecture. There are some ideas on how to integrate it into the master configuration, but for now the workaround is to simply add the GB200 runs as a separate job in the XkYk-sweep.yml workflow files, as sketched below for the 1k1k-sweep.yml file:
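The job name, input names, and values here are illustrative assumptions; only benchmark-multinode-tmpl.yaml comes from the current setup:

```yaml
  # Hypothetical GB200 job added alongside the matrix jobs in 1k1k-sweep.yml;
  # it bypasses the master config and calls the multinode template directly.
  gb200-dsr1-fp4:
    uses: ./.github/workflows/benchmark-multinode-tmpl.yaml
    with:
      model: dsr1
      precision: fp4
      isl: 1024
      osl: 1024
    secrets: inherit
```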
The plan is to integrate GB200 into the master configs at a later date. For now, this tradeoff is well worth it, as it comes with faster development time, allowing InferenceMAX to continue moving at the speed of light.
As far as tests for GB200 go, the same logic as above applies. This refactor will provide a separate gb200-tests.yml workflow file that will allow the user to manually test GB200 nodes. Note that even in the current state of the repo, before the refactor, there is very little support for testing GB200.
One Giant Python File
This refactor proposes one giant “god” Python file that encapsulates all logic for deciding which benchmarks should run. This adds a new level of indirection and will make it more difficult to add new functionality. However, it is necessary in order to move away from in-workflow configurations to master configurations. It is also necessary in order to run a subset of concurrencies for each parallelism level, instead of the entire Cartesian product of TP x CONC – this will reduce CI time by roughly 20%, leaving more time for running more interesting, multinode tests.
Further, with this new level of indirection, developers may choose whether to use it when creating their own testing workflows. Despite being used to generate the full sweeps, much of this script is intended to be used for input to e2e-tests.yml (testing) – if developers decide they would rather write their own logic/workflow files, they are free to do so.
If developers do not like this god file for testing, we will go back to the drawing board and refactor again.
It is hard for a developer who wants to add one simple piece of functionality to now have to understand a 900-line god file.
Finally, one giant Python script encapsulating all of the logic used to generate tests will naturally be more error-prone. In order to guard against bugs, any input from the config files is validated before being sent to the script, and all output is validated after being generated by the script and before being sent to benchmark-tmpl.yml.
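As a rough illustration of the kind of output check this refers to (the required field names below are assumptions, and the real validation code may differ):

```python
# Hypothetical required keys; the actual validation may check a different set.
REQUIRED_JOB_KEYS = {"model", "precision", "runner", "tp", "conc", "isl", "osl"}


def validate_jobs(jobs: list[dict]) -> list[dict]:
    """Fail fast if any generated matrix entry is missing a required field."""
    for i, job in enumerate(jobs):
        missing = REQUIRED_JOB_KEYS - job.keys()
        if missing:
            raise ValueError(f"job {i} is missing required fields: {sorted(missing)}")
    return jobs
```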
Removed Graph Plotting
This PR removes graph plotting to reduce the amount of code that needs to be maintained and to speed up the CI pipeline. While this will increase development speed, it removes the ability to see plotted results from different days, as our frontend currently does not show historical comparisons.
Future Considerations
For a while now, we have been thinking about ways to move away from a simple nightly cron job trigger to a code-diff-based trigger, i.e., only running CI pipelines for configs/files that have changed in a commit. This PR helps move towards this goal, as there is now a single, stateful representation of all configurations that could be run. Now it is just a matter of deciding which changes affect which configurations.
Additionally, we are working towards pre-merge CI validation (based on labels, or something similar). The Python script will also help enable this, as it gives fine control over what tests should be run.
To further guard against bugs in the Python script, we intend to add an additional CI workflow that runs upon any diff to the master configs to validate their structure before it is pushed to main.
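A minimal sketch of what that workflow could look like, assuming a hypothetical validation entry point (the trigger paths and script name are illustrative):

```yaml
name: validate-master-configs
on:
  pull_request:
    paths:
      - ".github/configs/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate master config structure
        # Hypothetical script; the actual validation entry point may differ.
        run: python3 utils/matrix-logic/validate_configs.py .github/configs/*.yaml
```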
Appendix
A Current Architecture Diagram
B Proposed Architecture Diagram