@cquil11 commented on Oct 24, 2025

Candidate Search Space

Note

No changes in this PR affect performance for either AMD or NVIDIA benchmarks; it is purely a refactor.

Example Workflows:

Follow-up TODOs:

Problems with Current Approach

Currently, the InferenceMAX architecture uses what can best be described as a “trickle-down” approach. We start by separating the top-level workflows by sequence length (1k1k, 1k8k, 8k1k); each of these invokes the “Full Sweep Template,” which can be thought of as a function used by another workflow. Like any function, these templates specify certain variables that must be passed, declared in their “signature” (inputs). The Full Sweep Template then spawns a series of jobs conditioned on the input (e.g., the 1k1k scheduler only spawns jobs for ISL/OSL equal to 1024). The jobs spawned by the Full Sweep Template call another workflow function specific to each model (e.g., the 70b Template). These workflow functions then invoke jobs across all hardware and precisions compatible with the particular model. Finally, these jobs call the bottom-level function, the Benchmark Template, which encapsulates all of the logic for actually launching the benchmark script and scheduling it on a self-hosted runner. The “Current Architecture Diagram” in Appendix A gives a visual representation of this flow.

The problems with the current approach are as follows:

  • Too many layers of abstraction, i.e., too many workflows, workflow functions, sub-workflows
    • Difficult to create new test workflows for specific cases, as this requires a separate job and/or workflow for each use case
    • Hard for contributors, especially first-time contributors, to make changes
  • Vanilla GitHub Actions does not provide enough native support for the expressivity we want from configs. Put more simply, YAML + GHA primitives is not a programming language.

Let us elaborate on the second point above. Our workflows rely on GitHub Actions’ matrix strategy to generate jobs for all parallelism and concurrency levels in the Cartesian product tp-list x conc-list. For instance,

bmk-b200-trt-fp8:
  if: ${{ inputs.use_b200 }}
  uses: ./.github/workflows/benchmark-tmpl.yml
  with:
    runner: h200-trt
    image: 'nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2'
    model: 'deepseek-ai/DeepSeek-R1-0528'
    framework: 'trt'
    precision: 'fp8'
    exp-name: ${{ inputs.exp-name }}
    isl: ${{ inputs.isl }}
    osl: ${{ inputs.osl }}
    max-model-len: ${{ inputs.max-model-len }}
    random-range-ratio: ${{ inputs.random-range-ratio }}
    conc-list: '[1, 2, 4]'
    tp-list: '[4, 8]'

would generate 6 jobs: tp4 conc1, tp4 conc2, tp4 conc4, tp8 conc1, tp8 conc2, tp8 conc4. There are some limitations with this approach:

  1. There is no way to specify one concurrency range for one level of parallelism and a different range for another. This is a problem because first principles tell us that certain tp/conc configurations are not realistic for performance and will not fall on the Pareto frontier.
  2. With the current approach, all points are swept (even the ones that will not fall on the Pareto frontier), leading to wasted CI time and, more importantly, wasted compute.
  3. There is no way to specify different parallelism techniques such as expert parallelism or DP attention. All of the logic for this is handled at the benchmark script level, which is not ideal – it is preferable that this logic be lifted out of Bash scripts so that these options can be more easily specified by users. Below is an example of how things like EP and DP attention are currently specified:
# Defaults: EP matches TP, MoE backend is CUTLASS, DP attention off.
EP_SIZE="$TP"
MOE_BACKEND="CUTLASS"
DP_ATTENTION=false

# DP attention is enabled only above a per-sequence-length concurrency threshold.
if [[ "$ISL" == "1024" && "$OSL" == "1024" ]]; then
    if [[ $CONC -gt 64 ]]; then
        DP_ATTENTION=true
    fi
elif [[ "$ISL" == "1024" && "$OSL" == "8192" ]]; then
    if [[ $CONC -gt 64 ]]; then
        DP_ATTENTION=true
    fi
elif [[ "$ISL" == "8192" && "$OSL" == "1024" ]]; then
    if [[ $CONC -gt 32 ]]; then
        DP_ATTENTION=true
    fi
fi

Proposed Solution

To address the aforementioned issues, we propose a simpler design that eliminates many of the layers of workflow functions by specifying all benchmark configurations in an external master configuration file, processing it in the top-level workflow, and creating a flattened matrix to launch all of the jobs. This removes the need for the Full Sweep Template as well as the Model Templates. The diagram in Appendix B gives a high-level view of the proposed architecture.

In other words, all possible benchmark configurations are defined in one place, which will be considered the primary “source of truth.” Further, the logic for deciding which jobs run will be completely self-contained in a Python script.

Workflow Files

The top-level schedulers will continue to be split up by sequence length – recall this is due to the limitation of approximately 500 jobs per workflow on GitHub Actions, beyond which the UI fails to load and times out after 10 seconds. Furthermore, each scheduler workflow is split up by model, again due to a GitHub Actions limitation: no more than 256 jobs may be generated from a single matrix. Each model has two jobs associated with it: one to retrieve the appropriate configs from the master config and dump the resulting JSON to a job output, and another to consume that JSON and generate a matrix of jobs, one for each parallelism-concurrency combination.

Consider the simplified example below:

name: '1K/1K Sweep'

jobs:
  get-gptoss-configs:
    runs-on: ubuntu-latest
    outputs:
        search-space-config: ${{ steps.get-gptoss-configs.outputs.search-space-config }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - id: get-gptoss-configs
        run: |
          CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/get_configs.py ${GITHUB_WORKSPACE}/.github/configs/master.yaml 1k1k gptoss)
          echo "search-space-config=$CONFIG_JSON" >> $GITHUB_OUTPUT

  benchmark-gptoss:
    needs: get-gptoss-configs
    uses: ./.github/workflows/benchmark-tmpl.yml
    name: gptoss 1k1k
    strategy:
      fail-fast: false
      matrix:
        config: ${{ fromJson(needs.get-gptoss-configs.outputs.search-space-config) }}
    secrets: inherit
    with:
      exp-name: "gptoss_1k1k"
      isl: 1024
      osl: 1024
      max-model-len: 2048
      runner: ${{ matrix.config.runner }}
      image: ${{ matrix.config.image }}
      model: ${{ matrix.config.model }}
      framework: ${{ matrix.config.framework }}
      precision: ${{ matrix.config.precision }}
      tp: ${{ matrix.config.tp }}
      conc: ${{ matrix.config.conc }}

The get-gptoss-configs job calls the utils/get_configs.py script with the appropriate inputs (the config file, sequence length, and model for which to retrieve configs). The get_configs.py script iterates through all configs in the master configuration and processes them so that they can be loaded via GitHub Actions’ fromJson function. Upon success, a matrix of jobs is created for each model.
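
To make that flow concrete, below is a minimal sketch of what get_configs.py could look like. This is illustrative only: the filtering keys mirror the master config structure described later in this document, and the expansion of conc-start/conc-end into individual concurrency values (doubling here) is an assumption about the actual script, not a description of it.

#!/usr/bin/env python3
"""Minimal sketch of get_configs.py (illustrative; assumptions noted inline)."""
import json
import sys

import yaml  # assumes PyYAML is available on the runner

SEQ_LENS = {"1k1k": (1024, 1024), "1k8k": (1024, 8192), "8k1k": (8192, 1024)}


def expand_concurrencies(start, end):
    """Assumed expansion rule: double concurrency from conc-start to conc-end."""
    conc = start
    while conc <= end:
        yield conc
        conc *= 2


def main():
    config_path, seq_len_key, model_prefix = sys.argv[1:4]
    isl, osl = SEQ_LENS[seq_len_key]

    with open(config_path) as f:
        master = yaml.safe_load(f)

    jobs = []
    for name, entry in master.items():
        if not name.startswith(model_prefix):
            continue
        for slc in entry["seq-len-configs"]:
            if slc["isl"] != isl or slc["osl"] != osl:
                continue
            for space in slc["bmk-space"]:
                for conc in expand_concurrencies(space["conc-start"], space["conc-end"]):
                    job = {
                        "runner": entry["runner"],
                        "image": entry["image"],
                        "model": entry["model"],
                        "framework": entry["framework"],
                        "precision": entry["precision"],
                        "conc": conc,
                    }
                    # Carry through per-space fields (tp, and optionally ep, dp-attn),
                    # everything except the concurrency range itself.
                    job.update({k: v for k, v in space.items()
                                if k not in ("conc-start", "conc-end")})
                    jobs.append(job)

    # Dumped to stdout so the workflow can capture it into a job output
    # and feed it to fromJson for the matrix.
    print(json.dumps(jobs))


if __name__ == "__main__":
    main()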

[Screenshot: resulting matrix of benchmark jobs in the Actions UI]

Master Config Structure

The master config YAML is parsed by the utils/get_configs.py script and as such, must adhere to a strict structure. The structure for an entry is laid out below:

<INFMAX_MODEL_NAME>-<PRECISION>-<RUNNER>:
  image:
  model:
  runner:
  precision:
  framework:
  seq-len-configs:
  - isl: 1024
    osl: 1024
    bmk-space:
    - { tp: 1, conc-start: , conc-end: }
    - { tp: 2, conc-start: , conc-end: }
    ...
  - isl: 1024
    osl: 8192
    bmk-space:
    ...
  - isl: 8192
    osl: 1024
    bmk-space:
    ...

Below is a concrete example:

dsr1-fp8-h200-trt:
  image: nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2
  model: deepseek-ai/DeepSeek-R1-0528
  runner: h200-trt
  precision: fp8
  framework: trt
  # For all sequence lengths, EP=TP
  seq-len-configs:
  - isl: 1024
    osl: 1024
    # If CONC > 64, then DP_ATTN=true
    bmk-space:
    - { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }
  - isl: 1024
    osl: 8192
    # If CONC > 64, then DP_ATTN=true
    bmk-space:
    - { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }
  - isl: 8192
    osl: 1024
    # If CONC > 32, then DP_ATTN=true
    bmk-space:
    - { tp: 8, ep: 8, conc-start: 4, conc-end: 32 }
    - { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 64 }
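
For example, assuming concurrency doubles from conc-start to conc-end (an illustrative assumption; the exact expansion rule lives in the matrix-generation script), the 8k1k entry above would expand into the following jobs:

tp=8  ep=8  conc=4
tp=8  ep=8  conc=8
tp=8  ep=8  conc=16
tp=8  ep=8  conc=32
tp=8  ep=8  dp-attn=true  conc=64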

To keep the pipelines reproducible and reliable, config entries must adhere exactly to this structure. This will be enforced both implicitly and explicitly within the workflows.

Test Workflows

Many of the existing workflows used for testing, such as runner-model-sweep-test.yml, runner-sweep-test.yml, runner-test.yml, and others, are useful and must be ported/refactored as part of these proposed changes.

[Diagram: general flow of a workflow under the proposed architecture]

The above diagram depicts the general flow of any workflow using this proposed architecture.

Single Python Script

At a high level, the Python script (utils/matrix-logic/generate_sweep_configs.py) referenced in the above diagram works as follows:

  1. The user enters a series of pre-defined inputs/arguments describing which configs they would like to run, drawn from the “source of truth” master configs.
  2. These inputs trigger the appropriate function, which fetches a list of all matching benchmarks to run, each represented as an individual JSON entry.
  3. The final JSON object with all jobs to run is dumped to stdout so that it can be loaded via the GHA fromJson function and used to generate a matrix of jobs.
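
For orientation, here is a sketch of what the script’s CLI surface might look like, reconstructed from the example invocations later in this description. The subcommand names and flags are taken from those examples; defaults, help text, and which flags are required are assumptions.

#!/usr/bin/env python3
"""Sketch of the CLI surface of generate_sweep_configs.py (illustrative only)."""
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="generate_sweep_configs.py")
    sub = parser.add_subparsers(dest="mode", required=True)

    # Flags shared by every mode, as seen in the example invocations.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--config-files", nargs="+", required=True,
                        help="Master config YAML files (AMD and/or NVIDIA)")
    common.add_argument("--runner-config", required=True,
                        help="runners.yaml describing available runner nodes")

    custom = sub.add_parser("custom", parents=[common],
                            help="Run one explicit configuration")
    custom.add_argument("--runner-label", required=True)
    custom.add_argument("--framework")
    custom.add_argument("--precision")
    custom.add_argument("--exp-name")
    custom.add_argument("--image")
    custom.add_argument("--model")

    rms = sub.add_parser("runner-model-sweep", parents=[common],
                         help="Test multiple models on multiple runner nodes")
    rms.add_argument("--runner-type", nargs="+", required=True)

    rs = sub.add_parser("runner-sweep", parents=[common],
                        help="Test one model across runner nodes")
    rs.add_argument("--model-prefix", required=True)
    rs.add_argument("--runner-type", nargs="+")
    rs.add_argument("--precision")

    full = sub.add_parser("full-sweep", parents=[common],
                          help="Generate the full (or filtered) sweep")
    full.add_argument("--model-prefix")
    full.add_argument("--runner-type", nargs="+")
    full.add_argument("--precision")
    full.add_argument("--test-mode", action="store_true",
                      help="Reduce each config to a single representative run")

    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())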

Here is the usage message for the script:

[Screenshot: usage/help message for generate_sweep_configs.py]

Let’s run through some of the existing test workflows and show how we can get equivalent functionality from this script:

runner-test.yml

Meant to: test 1 model on 1 runner node.

python3 utils/matrix-logic/generate_sweep_configs.py custom --config-files .github/configs/amd-master.yaml .github/configs/nvidia-master.yaml --runner-config  .github/configs/runners.yaml --runner-label h200-nv_1 --framework vllm --precision fp8 --exp-name 70b_test --image vllm/vllm-openai:v0.10.2 --model deepseek-ai/DeepSeek-R1-0528

Output:
[Screenshot: generated configuration JSON output]

runner-model-sweep.yml

Meant to: test multiple models on multiple runner nodes.

python3 utils/matrix-logic/generate_sweep_configs.py runner-model-sweep --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --runner-type h100

Output:
[Screenshot: generated configuration JSON output]

Notice the output is identical, and it is now even better since no one has to maintain all of the redundant values in the runner-model-sweep.yml workflow – they all live in the single source-of-truth config.

runner-sweep-test.yml

Meant to: test 1 model on multiple runner nodes

  • Never actually used (not even once) in the current repo.
  • Slightly confusing for testers: users are prompted to choose a runner type, Docker image, model, framework, and precision – it is hard for newer users/contributors to know which combinations of these inputs are actually meant to be run.
  • This refactor eliminates this challenge to an extent. Instead, the user just inputs the model they want to sweep across runners, and the script finds all configs from the master configs that include the specified model.
  • There is also an option to filter by framework and precision. If the user inputs an invalid combination, the script fails immediately instead of failing later at the benchmark script level, which takes longer.
[Screenshot]
python3 utils/matrix-logic/generate_sweep_configs.py runner-sweep --config-files .github/configs/amd-master.yaml .github/configs/nvidia-master.yaml --runner-config .github/configs/runners.yaml  --model-prefix gptoss --runner-type h200

In English: “Test all h200 runner nodes that are compatible with a gptoss configuration as described in the master configs.”

python3 utils/matrix-logic/generate_sweep_configs.py runner-sweep --config-files .github/configs/amd-master.yaml .github/configs/nvidia-master.yaml --runner-config .github/configs/runners.yaml  --model-prefix dsr1 --precision fp8 --runner-type b200

In English: “Test all b200 runner nodes that are compatible with a dsr1 fp8 configuration as described in the master configs.”

full-sweep-test.yml

Meant to: test full sweep functionality.

  • Currently, the user inputs boolean selections to run some combination of runners and ISL/OSL.
  • To replicate this, full-sweep-test.yml is edited to use the Python script to run the selected configurations.
    • Quick and easy way to test, but still requires the “full” sweeps to run.

New: e2e-tests.yml

Provides even more control over which tests to run, as the user passes the Python script’s arguments directly to the workflow.
Tradeoff: it doesn’t replicate a full sweep exactly, i.e., no call to collect.results.yml and no GB200 integration (yet) – however, it provides a great utility for testing all cases quickly.
Examples:

python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix gptoss --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length
362

In English: “Run all gptoss configurations.”

python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix gptoss --runner-type b200 --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length
46

In English: “Run all gptoss configurations that run on B200.”

 python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix dsr1 --runner-type b200 mi300x --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length
63

In English: “Run all dsr1 configurations that run on b200 or mi300x.”

python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix dsr1 --runner-type b200 mi300x --precision fp8 --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length
30

In English: “Run all dsr1 configurations that run on b200 or mi300x with fp8 precision.”

python3 utils/matrix-logic/generate_sweep_configs.py full-sweep --model-prefix dsr1 --runner-type b200 mi300x --precision fp8 --test-mode --config-files .github/configs/nvidia-master.yaml .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml | jq length
6

In English: “Run all dsr1 configurations that run on b200 or mi300x with fp8 precision, but reduce the parallelism-concurrency search space to a single run per configuration (pick the highest parallelism and lowest concurrency levels available for that configuration).”
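
For reference, here is a sketch of how the --test-mode reduction might be implemented, based purely on the description above; the real logic in the script may differ.

def reduce_to_test_mode(jobs):
    """For each (runner, model, framework, precision) configuration, keep a single
    representative job: the highest parallelism and, among ties, the lowest
    concurrency available (per the --test-mode description above)."""
    best = {}
    for job in jobs:
        key = (job["runner"], job["model"], job["framework"], job["precision"])
        current = best.get(key)
        if (current is None
                or job["tp"] > current["tp"]
                or (job["tp"] == current["tp"] and job["conc"] < current["conc"])):
            best[key] = job
    return list(best.values())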

Frontend Considerations

The proposed changes do not break anything on the frontend critical path. In particular, the artifacts generated by the full sweeps keep the same names and locations. Do note that a discovery of this PR is that the frontend critically depends on the names of the full sweep schedulers, as these are used to hash the runs and collect their information.

As a follow-up, artifacts now include the additional fields ep and dp-attn, which should be surfaced when hovering over data points.

Tradeoffs

There are clear tradeoffs between the proposed architecture and the existing one. For one, while the proposed approach has far fewer levels of workflows (i.e., no more model templates, fewer workflow “functions”), it has more indirection: most logic is no longer self-contained within the workflow files themselves. This is good for reducing the room for error, but it means having to create a separate workflow file for each new type of functionality.

GB200 Integration

Currently, the GB200 multinode benchmarks run using a separate workflow call template (benchmark-multinode-tmpl.yaml) from all other benchmarks. This causes some issues when trying to integrate directly with the proposed architecture. There are some ideas on how to integrate it into the master configuration, but for now the workaround is to add the GB200 runs as a separate job in the XkYk-sweep.yml workflow files. See the example below from the 1k1k-sweep.yml file:

   # This is a workaround until we can integrate GB200 into master configs.
    benchmark-gb200:
        uses: ./.github/workflows/benchmark-multinode-tmpl.yml
        name: gb200 1k1k sweep
        strategy:
            fail-fast: false
            matrix:
                config:
                    - {
                          "image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
                          "model": "deepseek-r1-fp4",
                          "model-prefix": "dsr1",
                          "precision": "fp4",
                          "framework": "dynamo-trtllm",
                          "mtp": "off",
                      }
                    - {
                          "image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
                          "model": "deepseek-r1-fp4",
                          "model-prefix": "dsr1",
                          "precision": "fp4",
                          "framework": "dynamo-trtllm",
                          "mtp": "on",
                      }
                    - {
                          "image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
                          "model": "deepseek-ai/DeepSeek-R1-0528",
                          "model-prefix": "dsr1",
                          "precision": "fp8",
                          "framework": "dynamo-sglang",
                          "mtp": "off",
                      }
        secrets: inherit
        with:
            runner: gb200
            image: ${{ matrix.config.image }}
            model: ${{ matrix.config.model }}
            framework: ${{ matrix.config.framework }}
            precision: ${{ matrix.config.precision }}
            exp-name: ${{ matrix.config.model-prefix }}_1k1k
            isl: 1024
            osl: 1024
            max-model-len: 2048
            mtp-mode: ${{ matrix.config.mtp }}

The plan is to integrate GB200 into the master configs at a later date. For now, this tradeoff is well worth it, as it allows faster development and lets InferenceMAX keep moving at the speed of light.

As far as tests for GB200 go, the same logic as above applies. This refactor will provide a separate gb200-tests.yml workflow file that allows the user to manually test GB200 nodes. Note that even in the current state of the repo, before the refactor, there is very little support for testing GB200.

One Giant Python File

This refactor proposes one giant “god” Python file that encapsulates all logic for deciding which benchmarks should run. This adds a new level of indirection and will make it more difficult to add new functionality. However, it is necessary in order to move away from in-workflow configurations to master configurations. It is also necessary in order to run a subset of concurrencies for each parallelism level instead of the entire Cartesian product of TP x CONC – this will allow CI time to be reduced by roughly 20%, freeing up time for more interesting multinode tests.

Further, with this new level of indirection, developers may choose not to use it when creating their own testing workflows. Although it is used to generate the full sweeps, much of this script is intended as an input to e2e-tests.yml (testing) – if developers decide they would rather write their own logic/workflow files, they are free to do so.

If developers do not like this god file for testing, we will go back to the drawing board and refactor again.

It is hard for developers who want to add one simple piece of functionality to have to understand a ~900-line god file.

Finally, one giant Python script encapsulating all of the logic used to generate tests will naturally be more error prone. To guard against bugs, any input from the config files is validated before being sent to the script, and all output is validated after being generated by the script and before being sent to benchmark-tmpl.yml.
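
As an illustration, the output validation could look roughly like the sketch below. The field names are taken from the workflow inputs shown earlier in this description; the actual checks in the script may be stricter or differ in detail.

REQUIRED_JOB_KEYS = {"runner", "image", "model", "framework", "precision", "tp", "conc"}


def validate_jobs(jobs):
    """Fail fast if any generated job entry is missing a field that
    benchmark-tmpl.yml expects, or has an obviously invalid value."""
    if not isinstance(jobs, list) or not jobs:
        raise ValueError("expected a non-empty list of job entries")
    for i, job in enumerate(jobs):
        missing = REQUIRED_JOB_KEYS - job.keys()
        if missing:
            raise ValueError(f"job {i} is missing required keys: {sorted(missing)}")
        if not isinstance(job["tp"], int) or not isinstance(job["conc"], int):
            raise ValueError(f"job {i}: tp and conc must be integers")
    return jobs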

Removed Graph Plotting

This PR removes graph plotting to reduce the amount of code that needs to be maintained and to speed up the CI pipeline. While this increases development speed, it removes the ability to see plotted results from different days, as our frontend currently does not show historical comparisons.

Future considerations

For a while now, we have been thinking about ways to move away from a simple nightly cron trigger to a code-diff-based trigger, i.e., only run CI pipelines for configs/files that have changed in a commit. This PR helps move towards that goal, as there is now a single stateful representation of all configurations that could be run. Now it is just a matter of deciding which changes affect which configurations.

Additionally, we are working towards pre-merge CI validation (based on labels, or something similar). The Python script will also help enable this, as it gives fine control over what tests should be run.

To further guard against bugs in the Python script, we intend to add an additional CI workflow that runs on any diff to the master configs and validates their structure before merging to main.

Appendix

A Current Architecture Diagram

[Diagram: current architecture]

B Proposed Architecture Diagram

[Diagram: proposed architecture]
