feat: performance changelog triggered runs (as opposed to nightly) #267
Conversation
📊 Line Count Report: Total Lines: 682, Base Lines: 570, Change: +112 lines
New single workflow that runs on merge to main, new perf-changelog.yaml to track performance changes, new logic to parse the changelog, and removed the cron job from the full-sweep schedulers.
@cquil11 do the result filenames change, since it is now one mega run instead of 3 separate jobs? The frontend regex is based on filenames. Also, shouldn't all 3 of these files be deleted since they are not needed anymore?
@cquil11 can you show a run where process_result.py, the GitHub summary, and utils/collect_results.py all show similar results before and after? Please include a link to the test run and screenshots after you unzip the file.
@functionstackx yes, this will have frontend implications, which I briefly mentioned in the PR description.
Yes, one is finishing up now:
functionstackx left a comment:
Left some comments, mostly around validation of the correctness of this PR.
chunfangamd left a comment:
Please refer to the messages in the code.
feat: performance changelog triggered runs (as opposed to nightly) (#267) [skip-sweep]
* add logic for event driven runs: new single workflow that runs on merge to main, new perg-changelog.yaml to track performance changes, new logic to parse changelog, removed cron job in full sweep schedulers
* testing pt 1
* raise error if yaml diff in perf changelog is not valid
* remove unused imports in process_changelog.py
* config data key fix
* raise error if test-config subprocess fails to run
* backfill changelog
* backfill changelog pt 2
* backfill changelog pt 3
* backfill changelog pt 4
* backfill changelog pt 5
* backfill changelog pt 6
* add always() condition to upload changelog metadata
* backfill changelog pt 7 (test)
* backfill changelog pt 8 (revert test)
* backfill changelog pt 9
* backfill changelog pt 11
* change if condition for jobs in run sweep workflow
* debugging run sweep workflow
* debugging run sweep workflow pt 2
* debugging run sweep workflow pt 3 (revert)
* debugging run sweep workflow pt 4
* debugging run sweep workflow pt 5
* debugging run sweep workflow pt 6
* debugging run sweep workflow pt 7
* add always() condition to upload changelog metadata (add back, this got removed)
* add bmk prefix to results
* backfill changelog official
* for concurrency group, use more unique sha

Design Doc: Changelog Triggered Only Runs
Solves: #298 (comment)
Test full sweep: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20081821674
Test with changelog: https://github.com/InferenceMAX/InferenceMAX/actions/runs/20218553330
Introduction and Motivation
InferenceMAX prides itself on running regularly to keep up with the fast pace of software changes. However, software release builds and release-candidate builds don't typically happen on a nightly basis, so it is unnecessary to perform a full sweep every single night. Instead, it is sufficient to run only the configs (from the master configs) whose performance is affected by any particular change.
Implementation
After the major refactors (#251 and #145), all run information is defined in source-of-truth master configuration files. These files are then parsed via the utils/matrix_logic/generate_sweep_configs.py script, whose JSON output is subsequently passed to a benchmark-*-tmpl.yml workflow. This sets up this proposal nicely, as each configuration has a “key” string associated with it. We propose a high-level “performance changelog” at the root of the repo (perf-changelog.yaml) that developers will use to specify config keys whose performance is affected in a particular PR.
The main change here can be summarized as follows: instead of running each config every night, configs will only be run on an “as needed” basis (i.e., when a change is made that affects their performance). This will more efficiently use compute resources while leading to a faster feedback cycle.
Example Developer Workflow
As an example, suppose a developer updates some code that will affect the performance of the dsr1-fp4-b200-sglang config. Perhaps they change the version of the SGLang serving image. The developer should then append an entry like the following to the end of perf-changelog.yaml:
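As a minimal, hypothetical sketch (the field names below, such as date, description, and configs, are assumptions; only the file name and the dsr1-fp4-b200-sglang key come from this doc), such an entry might look like:

```yaml
# perf-changelog.yaml: hypothetical appended entry; field names are illustrative, not the real schema
- date: 2025-12-09                 # assumed field: when the change lands on main
  description: Bump SGLang serving image for DeepSeek-R1 FP4 on B200
  configs:
    - dsr1-fp4-b200-sglang         # config key from the master configs whose perf is affected
```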
Then, upon merge to main, the run-sweep.yml workflow will be triggered (if and only if a change to perf-changelog.yaml has been detected). This workflow subsequently runs a script called process-changelog.py, which takes the base and head versions of the perf changelog as input.
The script then gets the diff between the base perf changelog and the head perf changelog. If there is any negative diff, the script fails and so does the workflow (there should never be a negative diff, as this file is meant to be a running changelog of all perf changes, with new entries appended at the bottom). The positive diff is then pre-processed to remove the line-diff indicators and loaded into a list of dicts representing the changes. A function then retrieves the list of all config keys to run and passes them to generate_sweep_configs.py with the new subcommand test-config. This subcommand simply takes a list of config keys (from the master config YAMLs) and retrieves the cases to run for each config. Once all results are retrieved, we organize them by sequence length and single/multi node, dump them to stdout, and direct them to an output of the invoking job in the workflow.
Workflow Mechanisms
The run-sweep.yml workflow will be the main workflow used for running “official” sweeps. As described above, it runs on any merge to main where there is an addition to perf-changelog.yaml. The workflow begins with a job that retrieves which jobs to run based on the diff:
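The actual job definition is not reproduced above, so the following is only a rough sketch of what the trigger and setup job could look like; the path filter, step id, output names, and script invocation are all assumptions:

```yaml
# Rough sketch of run-sweep.yml's entry point; job, step, output, and path names are assumptions.
on:
  push:
    branches: [main]
    paths:
      - perf-changelog.yaml             # only trigger when the perf changelog changes

jobs:
  setup:
    runs-on: ubuntu-latest
    outputs:
      # hypothetical outputs, one per single/multi-node and sequence-length bucket
      single-node-1k1k: ${{ steps.changelog.outputs.single-node-1k1k }}
      multi-node-1k1k: ${{ steps.changelog.outputs.multi-node-1k1k }}
    steps:
      - uses: actions/checkout@v6
        with:
          fetch-depth: 0                # full history, so the base changelog can be diffed against head
      - id: changelog
        # assumes the script prints "<bucket>=<json>" lines that become job outputs
        run: python process-changelog.py >> "$GITHUB_OUTPUT"
```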
The output of process-changelog.py will be split by single/multi node and sequence length. Recall that this is due to the 256-job matrix generation limit that GitHub Actions imposes; if this limit didn't exist, we would only split by single and multi node. We note that this implementation is not very scalable if we add more sequence lengths and/or more models: for each new sequence length we will have to hard-code new sweep-single/multi-node-XkYk jobs, and for each new model we will have to further split by model, since adding models increases the number of configs in each scenario described above. We hope to have a more permanent solution in the near future (likely involving a sub-workflow). However, this is not as big of an issue, since we don't anticipate needing to run “full” sweeps often, if at all. We also note that the GitHub Actions engineers are close to implementing lazy loading for workflows with many jobs, so the 10-second loading timeout will hopefully no longer be an issue.
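To make the hard-coded split concrete, one of those per-bucket jobs might look roughly like this; the reusable-workflow path and input name are assumptions, and only the sweep-single/multi-node-XkYk naming pattern comes from this doc:

```yaml
# Continues the jobs: block sketched above; hypothetical per-bucket job, one per node type and sequence length.
  sweep-single-node-1k1k:
    needs: setup
    # skip this bucket if the changelog produced no configs for it
    if: ${{ needs.setup.outputs.single-node-1k1k != '' }}
    uses: ./.github/workflows/benchmark-amd-tmpl.yml    # assumed instance of benchmark-*-tmpl.yml
    with:
      # JSON describing the cases to run; each bucket stays under the 256-job matrix limit
      config-json: ${{ needs.setup.outputs.single-node-1k1k }}
```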
Testing
In order to ensure that sweeps can be tested and validated before merging and running the “official” sweep, developers can enable sweeps on their pull request by ensuring three things, among them:
- a change to perf-changelog.yaml
- sweep-enabled
This will trigger the run-sweep.yml workflow on the PR branch.
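The exact mechanism behind these conditions is not spelled out here, so the following is a loose, hypothetical sketch; the pull_request trigger, the paths filter, and the use of a sweep-enabled label are all assumptions:

```yaml
# Hypothetical PR-side gate for run-sweep.yml; the label name and trigger shape are assumptions.
on:
  pull_request:
    paths:
      - perf-changelog.yaml   # the PR must touch the perf changelog
jobs:
  setup:
    # only run when the PR explicitly opts in, e.g. via a "sweep-enabled" label
    if: ${{ contains(github.event.pull_request.labels.*.name, 'sweep-enabled') }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
```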
Ensuing Frontend Changes
These changes would have many frontend implications. First we will describe the high-level considerations (such as what the UI displays) and then we will discuss the low-level considerations (how data is scraped from the artifacts).
In order to not delay main InferenceMAX development, the plan is to “freeze” the InferenceMAX UI on the December 9th run and implement the frontend changes over the course of the following week.
High-Level Considerations
Currently, the InferenceMAX frontend displays runs for each date, with an option to view historical runs via a date selector. This method assumes that a full sweep of runs is being run each day. However, with this proposal we will be running configurations relatively infrequently (only when necessary), so it doesn't make sense to display runs for every date in the date dropdown selector. We propose the following: for the “Select a run date” dropdown calendar, only make clickable the dates on which a run was triggered. When such a date is selected, there are two scenarios:
- If a config was run on that date (i.e., it had a perf-changelog.yaml change associated with this run), display the newly collected data points.
- If a config was not run on that date (i.e., there was no perf-changelog.yaml change associated with this run), carry forward the data points from that config's most recent run. One thing to note here is that we ought to make sure that all runs are completely successful (i.e., all jobs completed successfully) so as not to carry forward incomplete run data.
With this strategy, we should differentiate the data points/frontier line in a way such that it is obvious to the user, on any given date, which data is new and which is carried forward. Implementation details will be left up to Akira.
We should also get rid of reliability data (at least in its current form). If we want to keep it, we can simply display “success rate per GPU SKU” filtered over the “last X runs,” or something like that. Even so, reliability largely loses its meaning now that we aren't running every night: when running via a trigger, we will want to test rigorously before merging so that the “official” run is successful.
Low-Level Considerations
Recall that we are currently running three separate workflows each night (separated by sequence length: 1k1k, 1k8k, and 8k1k). With this proposal, we will be running all desired configs in a single run. The resulting artifact will be called results_all (as opposed to being split up by model and sequence length, e.g., results_dsr1_1k1k). This shouldn't be too difficult to implement; in fact, it should make things easier since we won't have to aggregate all results into a single file. Further, we need to ensure backwards compatibility so that results from the old workflows can still be processed correctly.
Additional Considerations
We plan on back-filling perf-changelog.yaml to reflect perf changes since the launch of InferenceMAX v1. We will not run CI on these back-filled changes; the entries exist only so that the changes are reflected in the changelog. To accomplish this, we need a mechanism in the run-sweep.yml workflow to skip running the actual CI pipeline upon merge to main. We add the following condition to the setup job:
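A minimal sketch of such a condition, assuming the marker is checked against the head commit message of the push event, could be:

```yaml
# Sketch of a skip condition on the setup job; the exact expression in run-sweep.yml may differ.
  setup:
    # skip the whole sweep when the merge commit message carries the "[skip-sweep]" marker
    if: ${{ !contains(github.event.head_commit.message, '[skip-sweep]') }}
    runs-on: ubuntu-latest
```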
Now, if the merge commit contains the “[skip-sweep]” string, CI will not be run. This allows developers to “force” changes to perf-changelog.yaml without actually running CI.