
Conversation

@cquil11
Collaborator

@cquil11 cquil11 commented Dec 2, 2025

Design Doc: Changelog Triggered Only Runs

Solves: #298 (comment)

Test full sweep: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20081821674
Test with changelog: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20218553330

Introduction and Motivation

InferenceMAX prides itself on running regularly to keep up with the fast pace of software changes. However, software release builds and release-candidate builds don't typically happen on a _nightly_ basis, so it is unnecessary to perform a full sweep every single night. Instead, it is sufficient to run only the configs (from the master configs) whose performance is affected by any particular change.

Implementation

After the major refactors (#251 and #145), all run information is defined in source-of-truth master configuration files. These files are parsed by the utils/matrix_logic/generate_sweep_configs.py script, whose JSON output is then passed to a benchmark-*-tmpl.yml workflow. This sets up the proposal nicely, as each configuration has a “key” string associated with it. We propose a high-level “performance changelog” at the root of the repo (perf-changelog.yaml) that developers will use to specify the config keys whose performance is affected by a particular PR.

The main change here can be summarized as follows: instead of running each config every night, configs will only be run on an “as needed” basis (i.e., when a change is made that affects their performance). This will more efficiently use compute resources while leading to a faster feedback cycle.

Example Developer Workflow

As an example, suppose a developer updates some code that will affect the performance of the dsr1-fp4-b200-sglang config. Perhaps they change the version of the SGLang serving image. The developer should then append the following to the end of perf-changelog.yaml:
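For illustration, an entry might look like this. The field names below are assumptions, not the actual schema defined by perf-changelog.yaml; only the config key dsr1-fp4-b200-sglang comes from the example above.

```yaml
# Hypothetical changelog entry -- field names are illustrative.
- pr: <PR number>
  configs:
    - dsr1-fp4-b200-sglang
  reason: Bump SGLang serving image version
```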

Then, upon merge to main, the run-sweep.yml workflow is triggered (if and only if a change to perf-changelog.yaml is detected). This workflow then runs a script called process-changelog.py, which takes the following arguments:

  • Path to the perf changelog file
  • A base Git ref (e.g., main/HEAD~1)
  • A head Git ref (e.g., main/HEAD)

The script then computes the diff between the base and head versions of the perf changelog. If there is any negative diff, the script fails and so does the workflow (there should never be a negative diff, since the changelog is meant to be an append-only record of all perf changes, growing from the bottom). The positive diff is pre-processed to remove the line-diff indicators and then loaded into a list of dicts representing the changes. A function then retrieves the list of all config keys to run and passes them to generate_sweep_configs.py with the new subcommand test-config. This subcommand takes a list of config keys (from the master config YAMLs) and retrieves the cases to run for each config. Once all results are retrieved, we organize them by sequence length and single/multi node, dump them to stdout, and direct them to an output of the invoking job in the workflow.
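The append-only diff check described above can be sketched in a few lines. This is a minimal sketch using the standard library's difflib, not the actual process-changelog.py; it only shows the "fail on negative diff, collect added lines" logic.

```python
import difflib

def added_changelog_lines(base_text: str, head_text: str) -> list[str]:
    """Return lines added to the changelog; raise if any line was removed.

    The changelog is append-only, so any '-' line in the diff (other than
    the unified-diff file headers) is an error.
    """
    diff = difflib.unified_diff(
        base_text.splitlines(), head_text.splitlines(), lineterm=""
    )
    added = []
    for line in diff:
        if line.startswith("+++") or line.startswith("---"):
            continue  # unified-diff file headers, not content
        if line.startswith("-"):
            raise ValueError(f"negative diff in changelog: {line!r}")
        if line.startswith("+"):
            added.append(line[1:])  # strip the '+' diff indicator
    return added
```

The added lines would then be parsed as YAML into the list of change dicts; that step is omitted here.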

Workflow Mechanisms

The run-sweep.yml workflow will be the main workflow used for running “official” sweeps. As described above, it runs on any merge to main that includes an _addition_ to perf-changelog.yaml. The workflow begins with a setup job that retrieves which configs to run based on the diff:
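A sketch of what that setup job could look like. The job name, script path, checkout depth, and output wiring are assumptions, not the repo's actual workflow:

```yaml
# Illustrative only -- the real run-sweep.yml may differ.
jobs:
  setup:
    runs-on: ubuntu-latest
    outputs:
      configs: ${{ steps.changelog.outputs.configs }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2   # need the previous commit to diff the changelog
      - id: changelog
        run: |
          python process-changelog.py perf-changelog.yaml HEAD~1 HEAD \
            >> "$GITHUB_OUTPUT"
```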

The output of process-changelog.py is split by single/multi node and sequence length. Recall that this is due to the 256-job matrix generation limit that GitHub Actions imposes; if this limit didn't exist, we would split only by single and multi node. Note that this implementation is not very scalable if we add more sequence lengths and/or more models: each new sequence length requires hard-coding another sweep-single/multi-node-XkYk job, and each new model increases the number of configs in every scenario above, eventually forcing a further split by model. We hope to have a more permanent solution in the near future (likely involving a sub-workflow). However, this is not a major issue, since we don't anticipate needing to run “full” sweeps often, if at all.

We also note that the GitHub Actions engineers are close to implementing lazy loading for workflows with many jobs, so the 10s loading timeout will hopefully no longer be an issue.

Testing

To allow sweeps to be tested and validated before the “official” sweep is merged and run, developers can enable sweeps on their pull request by ensuring three things:

  1. The PR is marked as “ready for review”
  2. The PR includes valid changes to perf-changelog.yaml
  3. The PR includes the label sweep-enabled

This will trigger the run-sweep.yml workflow on the PR branch.

Ensuing Frontend Changes

These changes have many frontend implications. First, we will describe the high-level considerations (what the UI displays), then the low-level considerations (how data is scraped from the artifacts).

In order to not delay main InferenceMAX development, the plan is to “freeze” the InferenceMAX UI on the December 9th run and implement the frontend changes over the course of the following week.

High-Level Considerations

Currently, the InferenceMAX frontend displays runs for each date, with an option to view historical runs via a date selector. This method assumes that a full sweep is run each day. However, with this proposal, we will run configurations relatively infrequently (only when necessary), so it no longer makes sense to display runs for every date in the dropdown selector. We propose the following: in the “Select a run date” dropdown calendar, make only the dates on which a run was triggered clickable. When such a date is selected, there are two scenarios:

  1. For the configs that were _unchanged_ by the triggered run (i.e., the configs that were not specified in the perf-changelog.yaml change associated with this run), carry forward the data points from that config’s most recent run. Note that we must make sure all runs are completely successful (i.e., all jobs completed successfully) so as not to carry forward incomplete run data.
  2. For the configs that were _updated_ by the triggered run, display the new data points associated with the run.
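The carry-forward rule above can be sketched as follows. The function and field names are hypothetical, not the frontend's actual API; it assumes per-config runs keyed by ISO date (which sort lexically).

```python
def points_for_date(runs_by_config: dict[str, dict[str, list]], date: str) -> dict:
    """For each config, return the newest run on or before `date`,
    flagged as 'new' only if it landed exactly on that date
    (otherwise it is carried forward from an earlier run)."""
    view = {}
    for config, runs in runs_by_config.items():
        eligible = [d for d in runs if d <= date]  # ISO dates compare correctly as strings
        if not eligible:
            continue  # config has no data yet on this date
        latest = max(eligible)
        view[config] = {"data": runs[latest], "new": latest == date}
    return view
```

A carried-forward entry (`"new": False`) would then be rendered differently from fresh data, per the differentiation discussed below.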

With this strategy, we should differentiate the data points/frontier line so that it is obvious to the user, on any given date, which data is _new_ and which is carried forward. Implementation details will be left up to Akira.

We should also get rid of reliability data (at least in its current form). If we want to keep it, we can simply display “success rate per GPU SKU” filtered over the “last X runs,” or something similar. Even so, reliability largely loses its meaning now that we aren't running every night: when running via a trigger, we will want to test rigorously _before_ merging so that the “official” run succeeds.

Low-Level Considerations

Recall that we currently run three separate workflows each night (separated by sequence length: 1k1k, 1k8k, and 8k1k). With this proposal, we will run _all_ desired configs in a _single_ run. The resulting artifact will be called results_all (as opposed to being split by model and sequence length, e.g., results_dsr1_1k1k). This shouldn't be too difficult to implement; in fact, it should make things easier, since we won't have to aggregate all results into a single file.

Further, we need to ensure backwards compatibility such that the old workflows can still be processed correctly.

Additional Considerations

We plan on back-filling perf-changelog.yaml to reflect perf changes since the launch of InferenceMAX v1. We will not run CI on these back-filled changes – they exist only so that the changes are recorded. To accomplish this, we add a mechanism to the run-sweep.yml workflow to skip running the actual CI pipeline upon merge to main, via the following condition on the setup job:
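A condition along these lines would do it. This is a sketch; the actual expression in run-sweep.yml may differ.

```yaml
jobs:
  setup:
    # Skip the sweep when the merge commit message opts out.
    if: ${{ !contains(github.event.head_commit.message, '[skip-sweep]') }}
```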

Now, if the merge commit message contains the “[skip-sweep]” string, CI will not run. This allows developers to “force” alterations to perf-changelog.yaml without actually running CI.

@cquil11 cquil11 force-pushed the diff-only-runs branch 2 times, most recently from 0051fe8 to 184b4ae Compare December 4, 2025 17:31
@cquil11 cquil11 changed the base branch from main to multinode-integration December 4, 2025 17:31
Base automatically changed from multinode-integration to main December 5, 2025 20:51
@cquil11 cquil11 force-pushed the diff-only-runs branch 2 times, most recently from 22ddb79 to 9acd1e7 Compare December 5, 2025 21:07
@functionstackx functionstackx moved this to In Progress in InferenceMAX Board Dec 7, 2025
@cquil11 cquil11 marked this pull request as ready for review December 9, 2025 23:20
@cquil11 cquil11 requested a review from a team as a code owner December 9, 2025 23:20
@github-actions

github-actions bot commented Dec 9, 2025

📊 Line Count Report

File: utils/matrix_logic/generate_sweep_configs.py

Total Lines: 682

Base Lines: 570

Change: +112 lines 📈

@cquil11 cquil11 temporarily deployed to fork-pr-validation December 9, 2025 23:23 — with GitHub Actions Inactive
@cquil11 cquil11 changed the title [WIP]: Diff only runs feat: performance changelog triggered runs (as opposed to nightly) Dec 10, 2025
new single workflow that runs on merge to main, new perg-changelog.yaml to track performance changes, new logic to parse changelog, removed cron job in full sweep schedulers
@functionstackx
Contributor

@cquil11 do the result filenames change, since it is now one mega run instead of 3 separate jobs? The frontend regex is based on filenames.

Also, shouldn't all 3 of these files be deleted since they're not needed anymore?

@functionstackx
Contributor

@cquil11 can you show a run where process_result.py & the GitHub summary & utils/collect_results.py all show similar results before and after?

Please include a link to the test run & screenshots after you unzip the file.

@cquil11
Collaborator Author

cquil11 commented Dec 10, 2025

@functionstackx yes, this will have frontend implications, which I briefly mentioned in the PR description.
I was planning on keeping these for a bit to still allow developers to re-run failed workflows in the legacy way. I can remove them if you think that's best.

can you show a run where process_result.py & the GitHub summary & utils/collect_results.py all show similar results before and after?
please include a link to the test run & screenshots after you unzip the file

yes, one is finishing up now:
here is the run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20081821674
here is the aggregated results_all with all data points: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20081821674/artifacts/4830384278

Contributor

@functionstackx functionstackx left a comment


left some comments, mostly around validation of the correctness of this PR

Collaborator

@chunfangamd chunfangamd left a comment


Please refer to the messages in the code

@cquil11 cquil11 temporarily deployed to fork-pr-validation December 12, 2025 14:42 — with GitHub Actions Inactive
@cquil11 cquil11 merged commit 9c0fcc9 into main Dec 15, 2025
11 checks passed
@cquil11 cquil11 deleted the diff-only-runs branch December 15, 2025 15:18
@github-project-automation github-project-automation bot moved this from In Progress to Done in InferenceMAX Board Dec 15, 2025
@cquil11 cquil11 restored the diff-only-runs branch December 16, 2025 20:35
Oseltamivir pushed a commit that referenced this pull request Dec 17, 2025
) [skip-sweep]

* add logic for event driven runs

new single workflow that runs on merge to main, new perg-changelog.yaml to track performance changes, new logic to parse changelog, removed cron job in full sweep schedulers

* testing pt 1

* raise error if yaml diff in perf changelog is not valid

* remove unused imports in process_changelog.py

* config data key fix

* raise error if test-config subprocess fails to run

* backfill changelog

* backfill changelog pt 2

* backfill changelog pt 3

* backfill changelog pt 4

* backfill changelog pt 5

* backfill changelog pt 6

* add always() condition to upload changelog metadata

* backfill changelog pt 7 (test)

* backfill changelog pt 8 (revert test)

* backfill changelog pt 9

* backfill changelog pt 11

* change if condition for jobs in run sweep workflow

* debugging run sweep workflow

* debugging run sweep workflow pt 2

* debugging run sweep workflow pt 3 (revert)

* debugging run sweep workflow pt 4

* debugging run sweep workflow pt 5

* debugging run sweep workflow pt 6

* debugging run sweep workflow pt 7

* add always() condition to upload changelog metadata (add back, this got removed)

* add bmk prefix to results

* backfill changelog official

* for concurrency group, use more unique sha
cquil11 added a commit that referenced this pull request Dec 17, 2025
* Initial commit, for #304

* Allow testing on own PR

* condense workflow

* Rename Workflow

* Use environments

* Changed environment location

* Stricter activation

* Test replies

* Test replies

* Use token for comment perm

* Forgot validation

* feat: performance changelog triggered runs (as opposed to nightly) (#267) [skip-sweep]

* chore(deps): bump the github-actions group across 1 directory with 3 updates (#331)

Bumps the github-actions group with 3 updates in the / directory: [actions/checkout](https://github.com/actions/checkout), [actions/upload-artifact](https://github.com/actions/upload-artifact) and [actions/download-artifact](https://github.com/actions/download-artifact).


Updates `actions/checkout` from 6.0.0 to 6.0.1
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v6...8e8c483)

Updates `actions/upload-artifact` from 5.0.0 to 6.0.0
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@330a01c...b7c566a)

Updates `actions/download-artifact` from 6.0.0 to 7.0.0
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](actions/download-artifact@018cc2c...37930b1)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: 6.0.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: github-actions
- dependency-name: actions/upload-artifact
  dependency-version: 6.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: actions/download-artifact
  dependency-version: 7.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: add final newline to original perf-changelog.yaml so that there wont be erroneous negative diff [skip-sweep] (#333)

* Update MI355x Deepseek-R1 FP4  SGLang Image to v0.5.6.post1 (#330)

* Update amd-master.yaml

* Update perf-changelog.yaml

* Update dsr1_fp4_mi355x_docker.sh

* Update dsr1_fp4_mi355x_docker.sh

---------

Co-authored-by: Cameron Quilici <cjquilici@gmail.com>

* TOCTOU

* Test new env

* Ready for merge

* Add benchmark script for GPTOSS FP4 B200 TRT-LLM (#256)

* Add benchmark script for GPTOSS FP4 B200 TRT-LLM

* make changes to perf changelog

---------

Co-authored-by: Cameron Quilici <cjquilici@gmail.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Cameron Quilici <cjquilici@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: ppalanga <ppalanga@amd.com>
Co-authored-by: Ankur Singh <ankusingh@nvidia.com>
@functionstackx functionstackx deleted the diff-only-runs branch January 11, 2026 19:50
