@cquil11 commented on Oct 24, 2025

Candidate Search Space

Note

No changes in this PR affect performance for either AMD or NVIDIA benchmarks; it is purely a refactor.

Example Workflows:

Follow-up TODOs:

Problems with Current Approach

Currently, the InferenceMAX architecture uses what can best be described as a “trickle-down” approach. We start by separating the top-level workflows by sequence length (1k1k, 1k8k, 8k1k); each of these invokes the “Full Sweep Template,” which can be thought of as a function used by another workflow. Like any function, these templates specify certain variables that must be passed, declared in their “signature” (inputs). The Full Sweep Template then spawns a series of jobs conditioned on the input (e.g., the 1k1k scheduler only spawns jobs for ISL/OSL equal to 1024). The jobs spawned by the Full Sweep Template call another workflow function specific to each model (e.g., the 70b Template). These workflow functions then invoke jobs across all hardware and precisions compatible with the particular model. Finally, these jobs call the bottom-level function, the Benchmark Template, which encapsulates all of the logic for actually launching the benchmark script and scheduling it on a self-hosted runner. The “Current Architecture Diagram” in Appendix A gives a visual representation of this flow.

The problems with the current approach are as follows:

  • Too many layers of abstraction, i.e., too many workflows, workflow functions, sub-workflows
    • Difficult to create new test workflows for specific cases, as this requires a separate job and/or workflow for each use case
    • Hard for contributors, especially first-time contributors, to make changes
  • Vanilla GitHub Actions does not provide enough native support for the expressivity we want from configs. Put more simply, YAML + GHA primitives is not a programming language.

Let us elaborate on the second point above. Our workflows rely on GitHub Actions’ matrix strategy to generate jobs for all parallelism and concurrency levels in the Cartesian product tp-list x conc-list. For instance,

bmk-b200-trt-fp8:
  if: ${{ inputs.use_b200 }}
  uses: ./.github/workflows/benchmark-tmpl.yml
  with:
    runner: h200-trt
    image: 'nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2'
    model: 'deepseek-ai/DeepSeek-R1-0528'
    framework: 'trt'
    precision: 'fp8'
    exp-name: ${{ inputs.exp-name }}
    isl: ${{ inputs.isl }}
    osl: ${{ inputs.osl }}
    max-model-len: ${{ inputs.max-model-len }}
    random-range-ratio: ${{ inputs.random-range-ratio }}
    conc-list: '[1, 2, 4]'
    tp-list: '[4, 8]'

would generate 6 jobs: tp4 conc1, tp4 conc2, tp4 conc4, tp8 conc1, tp8 conc2, tp8 conc4. There are some limitations with this approach:

  1. There is no way to specify one concurrency range for one level of parallelism and a different range for another. This is a problem because first principles tell us that certain tp/conc configurations are not realistic for performance and will not fall on the Pareto frontier.
  2. With the current approach, all points are swept (even the ones that will not fall on the Pareto frontier), leading to wasted CI time and, more importantly, wasted compute.
  3. There is no way to specify different parallelism techniques such as expert parallelism or DP attention. All of the logic for this is handled at the benchmark script level, which is not ideal – it is preferable that this logic be lifted out of Bash scripts so that these options can be more easily specified by users. Below is an example of how things like EP and DP attention are currently specified:
# Defaults: EP matches TP, MoE backend is CUTLASS, DP attention off.
EP_SIZE="$TP"
MOE_BACKEND="CUTLASS"
DP_ATTENTION=false

# DP attention is enabled only above a per-sequence-length concurrency threshold.
if [[ "$ISL" == "1024" && "$OSL" == "1024" ]]; then
    if [[ $CONC -gt 64 ]]; then
        DP_ATTENTION=true
    fi
elif [[ "$ISL" == "1024" && "$OSL" == "8192" ]]; then
    if [[ $CONC -gt 64 ]]; then
        DP_ATTENTION=true
    fi
elif [[ "$ISL" == "8192" && "$OSL" == "1024" ]]; then
    if [[ $CONC -gt 32 ]]; then
        DP_ATTENTION=true
    fi
fi

Proposed Solution

To address the aforementioned issues, we propose a simpler design that eliminates many of the layers of workflow functions by specifying all benchmark configurations in an external master configuration file, processing it in the top-level workflow, and creating a flattened matrix to launch all of the jobs. This removes the need for the Full Sweep Template as well as the Model Templates. The diagram in Appendix B gives a high-level view of the proposed architecture.

In other words, all possible benchmark configurations are defined in one place, which will be considered the primary “source of truth.” Further, the logic for deciding which jobs run will be completely self-contained in a Python script.

Workflow Files

The top-level schedulers will continue to be split up by sequence length – recall this is due to the limitation of approximately 500 jobs per workflow on GitHub Actions, beyond which the UI fails to load and times out after 10 seconds. Furthermore, each scheduler workflow is split up by model, again due to a GitHub Actions limitation: no more than 256 jobs may be generated from a single matrix. Each model has two jobs associated with it: one to retrieve the appropriate configs from the master config and dump the resulting JSON to a job output, and another to consume that JSON and generate a matrix of jobs, one for each parallelism-concurrency combination.

Consider the simplified example below:

name: '1K/1K Sweep'

jobs:
  get-gptoss-configs:
    runs-on: ubuntu-latest
    outputs:
        search-space-config: ${{ steps.get-gptoss-configs.outputs.search-space-config }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - id: get-gptoss-configs
        run: |
          CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/get_configs.py ${GITHUB_WORKSPACE}/.github/configs/master.yaml 1k1k gptoss)
          echo "search-space-config=$CONFIG_JSON" >> $GITHUB_OUTPUT

  benchmark-gptoss:
    needs: get-gptoss-configs
    uses: ./.github/workflows/benchmark-tmpl.yml
    name: gptoss 1k1k
    strategy:
      fail-fast: false
      matrix:
        config: ${{ fromJson(needs.get-gptoss-configs.outputs.search-space-config) }}
    secrets: inherit
    with:
      exp-name: "gptoss_1k1k"
      isl: 1024
      osl: 1024
      max-model-len: 2048
      runner: ${{ matrix.config.runner }}
      image: ${{ matrix.config.image }}
      model: ${{ matrix.config.model }}
      framework: ${{ matrix.config.framework }}
      precision: ${{ matrix.config.precision }}
      tp: ${{ matrix.config.tp }}
      conc: ${{ matrix.config.conc }}

The get-gptoss-configs job calls the utils/get_configs.py script with the appropriate inputs (the config file, sequence length, and model for which to retrieve configs). The get_configs.py script iterates through all configs in the master configuration and processes them so that they can be loaded via GitHub Actions’ fromJson function. Upon success, a matrix of jobs is created for each model.
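
To make that flow concrete, below is a minimal sketch of what get_configs.py could look like. This is illustrative only: the filtering keys mirror the master config structure described later in this document, and the expansion of conc-start/conc-end into individual concurrency values (doubling here) is an assumption about the actual script, not a description of it.

#!/usr/bin/env python3
"""Minimal sketch of get_configs.py (illustrative; assumptions noted inline)."""
import json
import sys

import yaml  # assumes PyYAML is available on the runner

SEQ_LENS = {"1k1k": (1024, 1024), "1k8k": (1024, 8192), "8k1k": (8192, 1024)}


def expand_concurrencies(start, end):
    """Assumed expansion rule: double concurrency from conc-start to conc-end."""
    conc = start
    while conc <= end:
        yield conc
        conc *= 2


def main():
    config_path, seq_len_key, model_prefix = sys.argv[1:4]
    isl, osl = SEQ_LENS[seq_len_key]

    with open(config_path) as f:
        master = yaml.safe_load(f)

    jobs = []
    for name, entry in master.items():
        if not name.startswith(model_prefix):
            continue
        for slc in entry["seq-len-configs"]:
            if slc["isl"] != isl or slc["osl"] != osl:
                continue
            for space in slc["bmk-space"]:
                for conc in expand_concurrencies(space["conc-start"], space["conc-end"]):
                    job = {
                        "runner": entry["runner"],
                        "image": entry["image"],
                        "model": entry["model"],
                        "framework": entry["framework"],
                        "precision": entry["precision"],
                        "conc": conc,
                    }
                    # Carry through per-space fields (tp, and optionally ep, dp-attn),
                    # everything except the concurrency range itself.
                    job.update({k: v for k, v in space.items()
                                if k not in ("conc-start", "conc-end")})
                    jobs.append(job)

    # Dumped to stdout so the workflow can capture it into a job output
    # and feed it to fromJson for the matrix.
    print(json.dumps(jobs))


if __name__ == "__main__":
    main()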

[Screenshot: resulting matrix of benchmark jobs in the Actions UI]

Master Config Structure

The master config YAML is parsed by the utils/get_configs.py script and as such, must adhere to a strict structure. The structure for an entry is laid out below:

<INFMAX_MODEL_NAME>-<PRECISION>-<RUNNER>:
  image:
  model:
  runner:
  precision:
  framework:
  seq-len-configs:
  - isl: 1024
    osl: 1024
    bmk-space:
    - { tp: 1, conc-start: , conc-end: }
    - { tp: 2, conc-start: , conc-end: }
    ...
  - isl: 1024
    osl: 8192
    bmk-space:
    ...
  - isl: 8192
    osl: 1024
    bmk-space:
    ...

Below is a concrete example:

dsr1-fp8-h200-trt:
  image: nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2
  model: deepseek-ai/DeepSeek-R1-0528
  runner: h200-trt
  precision: fp8
  framework: trt
  # For all sequence lengths, EP=TP
  seq-len-configs:
  - isl: 1024
    osl: 1024
    # If CONC > 64, then DP_ATTN=true
    bmk-space:
    - { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }
  - isl: 1024
    osl: 8192
    # If CONC > 64, then DP_ATTN=true
    bmk-space:
    - { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }
  - isl: 8192
    osl: 1024
    # If CONC > 32, then DP_ATTN=true
    bmk-space:
    - { tp: 8, ep: 8, conc-start: 4, conc-end: 32 }
    - { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 64 }
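
For example, assuming concurrency doubles from conc-start to conc-end (an illustrative assumption; the exact expansion rule lives in the matrix-generation script), the 8k1k entry above would expand into the following jobs:

tp=8  ep=8  conc=4
tp=8  ep=8  conc=8
tp=8  ep=8  conc=16
tp=8  ep=8  conc=32
tp=8  ep=8  dp-attn=true  conc=64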

To keep the pipelines reproducible and reliable, config entries must adhere exactly to this structure. This will be enforced both implicitly and explicitly within the workflows.

Test Workflows

Many of the existing workflows used for testing, such as runner-model-sweep-test.yml, runner-sweep-test.yml, runner-test.yml, and others, are useful and must be ported/refactored as part of these proposed changes.

[Diagram: general flow of a workflow under the proposed architecture]

The above diagram depicts the general flow of any workflow using this proposed architecture.

Single Python Script

At a high level, the Python script (utils/matrix-logic/generate_sweep_configs.py) referenced in the above diagram works as follows:

  1. The user enters a series of pre-defined inputs/arguments describing which configs they would like to run, drawn from the “source of truth” master configs.
  2. These inputs trigger the appropriate function, which fetches a list of all matching benchmarks to run, each represented as an individual JSON entry.
  3. The final JSON object with all jobs to run is dumped to stdout so that it can be loaded via the GHA fromJson function and used to generate a matrix of jobs.
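
For orientation, here is a sketch of what the script’s CLI surface might look like, reconstructed from the example invocations later in this description. The subcommand names and flags are taken from those examples; defaults, help text, and which flags are required are assumptions.

#!/usr/bin/env python3
"""Sketch of the CLI surface of generate_sweep_configs.py (illustrative only)."""
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="generate_sweep_configs.py")
    sub = parser.add_subparsers(dest="mode", required=True)

    # Flags shared by every mode, as seen in the example invocations.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--config-files", nargs="+", required=True,
                        help="Master config YAML files (AMD and/or NVIDIA)")
    common.add_argument("--runner-config", required=True,
                        help="runners.yaml describing available runner nodes")

    custom = sub.add_parser("custom", parents=[common],
                            help="Run one explicit configuration")
    custom.add_argument("--runner-label", required=True)
    custom.add_argument("--framework")
    custom.add_argument("--precision")
    custom.add_argument("--exp-name")
    custom.add_argument("--image")
    custom.add_argument("--model")

    rms = sub.add_parser("runner-model-sweep", parents=[common],
                         help="Test multiple models on multiple runner nodes")
    rms.add_argument("--runner-type", nargs="+", required=True)

    rs = sub.add_parser("runner-sweep", parents=[common],
                        help="Test one model across runner nodes")
    rs.add_argument("--model-prefix", required=True)
    rs.add_argument("--runner-type", nargs="+")
    rs.add_argument("--precision")

    full = sub.add_parser("full-sweep", parents=[common],
                          help="Generate the full (or filtered) sweep")
    full.add_argument("--model-prefix")
    full.add_argument("--runner-type", nargs="+")
    full.add_argument("--precision")
    full.add_argument("--test-mode", action="store_true",
                      help="Reduce each config to a single representative run")

    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())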

Here is the usage message for the script:

[Screenshot: usage/help message for generate_sweep_configs.py]

Let’s run through some of the existing test workflows and show how we can get equivalent functionality from this script:

runner-test.yml

Meant to: test 1 model on 1 runner node.

python3 utils/matrix-logic/generate_sweep_configs.py custom --config-files .github/configs/amd-master.yaml .github/configs/nvidia-master.yaml --runner-config  .github/configs/runners.yaml --runner-label h200-nv_1 --framework vllm --precision fp8 --exp-name 70b_test --image vllm/vllm-openai:v0.10.2 --model deepseek-ai/DeepSeek-R1-0528

Output:
[Screenshot: generated configuration JSON output]

runner-model-sweep.yml

Meant to: test multiple models on multiple runner nodes.

python3 utils/matrix-logic/generate_sweep_configs.py runner-model-sweep --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --runner-type h100

Output:
[Screenshot: generated configuration JSON output]

Notice the output is identical, and it is now even better since no one has to maintain all of the redundant values in the runner-model-sweep.yml workflow – they all live in the single source-of-truth config.

runner-sweep-test.yml

Meant to: test 1 model on multiple runner nodes

  • Never actually used (not even once) in the current repo.
  • Slightly confusing for testers: users are prompted to choose a runner type, Docker image, model, framework, and precision – it is hard for newer users/contributors to know which combinations of these inputs are actually meant to be run.
  • This refactor eliminates this challenge to an extent. Instead, the user just inputs the model they want to sweep across runners, and the script finds all configs from the master configs that include the specified model.
  • There is also an option to filter by framework and precision. If the user inputs an invalid combination, the script fails immediately instead of failing later at the benchmark script level, which takes longer.
[Screenshot]
python3 utils/matrix-logic/generate_sweep_configs.py runner-sweep --config-files .github/configs/amd-master.yaml .github/configs/nvidia-master.yaml --runner-config .github/configs/runners.yaml  --model-prefix gptoss --runner-type h200

In English: “Test all h200 runner nodes that are compatible with a gptoss configuration as described in the master configs.”

python3 utils/matrix-logic/generate_sweep_configs.py runner-sweep --config-files .github/configs/amd-master.yaml .github/configs/nvidia-master.yaml --runner-config .github/configs/runners.yaml  --model-prefix dsr1 --precision fp8 --runner-type b200

In English: “Test all b200 runner nodes that are compatible with a dsr1 fp8 configuration as described in the master configs.”

full-sweep-test.yml

Meant to: test full sweep functionality.

  • Currently, the user inputs boolean selections to run some combination of runners and ISL/OSL.
  • To replicate this, full-sweep-test.yml is edited to use the Python script to run the selected configurations.
    • Quick and easy way to test, but still requires the “full” sweeps to run.

New: e2e-tests.yml

Provides even more control over which tests to run, as the user passes the Python script’s arguments directly to the workflow.
Tradeoff: it doesn’t replicate a full sweep exactly, i.e., no call to collect.results.yml and no GB200 integration (yet) – however, it provides a great utility for testing all cases quickly.
Examples:

python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix gptoss --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length
362

In English: “Run all gptoss configurations.”

python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix gptoss --runner-type b200 --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length
46

In English: “Run all gptoss configurations that run on B200.”

 python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix dsr1 --runner-type b200 mi300x --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length
63

In English: “Run all dsr1 configurations that run on b200 or mi300x.”

python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix dsr1 --runner-type b200 mi300x --precision fp8 --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length
30

In English: “Run all dsr1 configurations that run on b200 or mi300x with fp8 precision.”

python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix dsr1 --runner-type b200 mi300x --precision fp8 --test-mode --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length
6

In English: “Run all dsr1 configurations that run on b200 or mi300x with fp8 precision, but reduce the parallelism-concurrency search space to a single run per configuration (pick the highest parallelism and lowest concurrency levels available for that configuration).”
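
For reference, here is a sketch of how the --test-mode reduction might be implemented, based purely on the description above; the real logic in the script may differ.

def reduce_to_test_mode(jobs):
    """For each (runner, model, framework, precision) configuration, keep a single
    representative job: the highest parallelism and, among ties, the lowest
    concurrency available (per the --test-mode description above)."""
    best = {}
    for job in jobs:
        key = (job["runner"], job["model"], job["framework"], job["precision"])
        current = best.get(key)
        if (current is None
                or job["tp"] > current["tp"]
                or (job["tp"] == current["tp"] and job["conc"] < current["conc"])):
            best[key] = job
    return list(best.values())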

Frontend Considerations

The proposed changes do not break anything on the frontend critical path. In particular, the artifacts generated by the full sweeps keep the same names and locations. Do note that a discovery of this PR is that the frontend critically depends on the names of the full sweep schedulers, as these are used to hash the runs and collect their information.

As a follow-up, artifacts now include the additional fields ep and dp-attn, which should be surfaced when hovering over data points.

Tradeoffs

There are clear tradeoffs between the proposed architecture and the existing one. For one, while the proposed approach has far fewer levels of workflows (i.e., no more model templates, fewer workflow “functions”), it has more indirection: most logic is no longer self-contained within the workflow files themselves. This is good for reducing the room for error, but it means having to create a separate workflow file for each new type of functionality.

GB200 Integration

Currently, the GB200 multinode benchmarks run using a separate workflow call template (benchmark-multinode-tmpl.yaml) from all other benchmarks. This causes some issues when trying to integrate directly with the proposed architecture. There are some ideas on how to integrate it into the master configuration, but for now the workaround is to add the GB200 runs as a separate job in the XkYk-sweep.yml workflow files. See the example below from the 1k1k-sweep.yml file:

   # This is a workaround until we can integrate GB200 into master configs.
    benchmark-gb200:
        uses: ./.github/workflows/benchmark-multinode-tmpl.yml
        name: gb200 1k1k sweep
        strategy:
            fail-fast: false
            matrix:
                config:
                    - {
                          "image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
                          "model": "deepseek-r1-fp4",
                          "model-prefix": "dsr1",
                          "precision": "fp4",
                          "framework": "dynamo-trtllm",
                          "mtp": "off",
                      }
                    - {
                          "image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
                          "model": "deepseek-r1-fp4",
                          "model-prefix": "dsr1",
                          "precision": "fp4",
                          "framework": "dynamo-trtllm",
                          "mtp": "on",
                      }
                    - {
                          "image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
                          "model": "deepseek-ai/DeepSeek-R1-0528",
                          "model-prefix": "dsr1",
                          "precision": "fp8",
                          "framework": "dynamo-sglang",
                          "mtp": "off",
                      }
        secrets: inherit
        with:
            runner: gb200
            image: ${{ matrix.config.image }}
            model: ${{ matrix.config.model }}
            framework: ${{ matrix.config.framework }}
            precision: ${{ matrix.config.precision }}
            exp-name: ${{ matrix.config.model-prefix }}_1k1k
            isl: 1024
            osl: 1024
            max-model-len: 2048
            mtp-mode: ${{ matrix.config.mtp }}

The plan is to integrate GB200 into the master configs at a later date. For now, this tradeoff is well worth it, as it allows faster development and lets InferenceMAX keep moving at the speed of light.

As far as tests for GB200 go, the same logic as above applies. This refactor will provide a separate gb200-tests.yml workflow file that allows the user to manually test GB200 nodes. Note that even in the current state of the repo, before the refactor, there is very little support for testing GB200.

One Giant Python File

This refactor proposes one giant “god” Python file that encapsulates all logic for deciding which benchmarks should run. This adds a new level of indirection and will make it more difficult to add new functionality. However, it is necessary in order to move away from in-workflow configurations to master configurations. It is also necessary in order to run a subset of concurrencies for each parallelism level instead of the entire Cartesian product of TP x CONC – this will allow CI time to be reduced by roughly 20%, freeing up time for more interesting multinode tests.

Further, with this new level of indirection, developers may choose not to use it when creating their own testing workflows. Although it is used to generate the full sweeps, much of this script is intended as an input to e2e-tests.yml (testing) – if developers decide they would rather write their own logic/workflow files, they are free to do so.

If developers do not like this god file for testing, we will go back to the drawing board and refactor again.

It is hard for developers who want to add one simple piece of functionality to have to understand a ~900-line god file.

Finally, one giant Python script encapsulating all of the logic used to generate tests will naturally be more error prone. To guard against bugs, any input from the config files is validated before being sent to the script, and all output is validated after being generated by the script and before being sent to benchmark-tmpl.yml.
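
As an illustration, the output validation could look roughly like the sketch below. The field names are taken from the workflow inputs shown earlier in this description; the actual checks in the script may be stricter or differ in detail.

REQUIRED_JOB_KEYS = {"runner", "image", "model", "framework", "precision", "tp", "conc"}


def validate_jobs(jobs):
    """Fail fast if any generated job entry is missing a field that
    benchmark-tmpl.yml expects, or has an obviously invalid value."""
    if not isinstance(jobs, list) or not jobs:
        raise ValueError("expected a non-empty list of job entries")
    for i, job in enumerate(jobs):
        missing = REQUIRED_JOB_KEYS - job.keys()
        if missing:
            raise ValueError(f"job {i} is missing required keys: {sorted(missing)}")
        if not isinstance(job["tp"], int) or not isinstance(job["conc"], int):
            raise ValueError(f"job {i}: tp and conc must be integers")
    return jobs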

Removed Graph Plotting

This PR removes graph plotting to reduce the amount of code that needs to be maintained and to speed up the CI pipeline. While this increases development speed, it removes the ability to see plotted results from different days, as our frontend currently does not show historical comparisons.

Future considerations

For a while now, we have been thinking about ways to move away from a simple nightly cron trigger to a code-diff-based trigger, i.e., only run CI pipelines for configs/files that have changed in a commit. This PR helps move towards that goal, as there is now a single stateful representation of all configurations that could be run. Now it is just a matter of deciding which changes affect which configurations.

Additionally, we are working towards pre-merge CI validation (based on labels, or something similar). The Python script will also help enable this, as it gives fine control over what tests should be run.

To further guard against bugs in the Python script, we intend to add an additional CI workflow that runs on any diff to the master configs and validates their structure before merging to main.

Appendix

A Current Architecture Diagram

[Diagram: current architecture]

B Proposed Architecture Diagram

[Diagram: proposed architecture]
