Enable failing diffs on regression by laithsakka · Pull Request #136551 · pytorch/pytorch

laithsakka · 2024-09-24T18:00:57Z

Stack from ghstack (oldest at bottom):

Example of a failing diff: [no land] test fail due to win #136740

Test this by running:
python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv

Results:

WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it

MISSING REGRESSION TEST does not fail, but it is logged.
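
For readers without the diff at hand, the check can be sketched roughly as follows. This is a minimal illustration that reproduces the three messages above, not the contents of check_results.py. `ExpectedEntry`, `check`, and the in-memory dictionaries are hypothetical; only `expected_value`, `noise_margin`, `low`, `ratio`, and `fail` echo identifiers visible in the reviewed hunks. Treating a result with no expected entry as the "missing" case is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class ExpectedEntry:
    # Hypothetical container; field names follow the reviewed hunks.
    expected_value: int
    noise_margin: float  # fraction, e.g. 0.01 for ±1%


def check(expected, results):
    """Compare measured results with expected entries; return (fail, messages)."""
    fail = False
    messages = []
    for key, result in results.items():
        entry = expected.get(key)
        if entry is None:
            # Assumption: a result without an expected entry is only logged.
            messages.append(
                f"MISSING REGRESSION TEST: benchmark {key} does not have "
                f"a regression test enabled for it"
            )
            continue
        low = entry.expected_value * (1 - entry.noise_margin)
        high = entry.expected_value * (1 + entry.noise_margin)
        margin = f"±{entry.noise_margin * 100:.2f}%"
        if result < low:
            # Better than expected: still fails so the baseline gets refreshed.
            fail = True
            ratio = (entry.expected_value - result) * 100 / entry.expected_value
            messages.append(
                f"WIN: benchmark {key} failed, actual result {result} is "
                f"{ratio:.2f}% lower than expected {entry.expected_value} "
                f"{margin} please update the expected results."
            )
        elif result > high:
            fail = True
            ratio = (result - entry.expected_value) * 100 / entry.expected_value
            messages.append(
                f"REGRESSION: benchmark {key} failed, actual result {result} is "
                f"{ratio:.2f}% higher than expected {entry.expected_value} "
                f"{margin} if this is an expected regression, please update "
                f"the expected results."
            )
    return fail, messages


expected = {
    ("a", " instruction count"): ExpectedEntry(110, 0.01),
    ("b", " memory"): ExpectedEntry(100, 0.10),
}
results = {
    ("a", " instruction count"): 90,
    ("b", " memory"): 200,
    ("d", " missing-test"): 50,
}
fail, messages = check(expected, results)
print("\n".join(messages))
```

With these inputs, `fail` ends up `True` because of the win and the regression; the missing-test entry only contributes a log line.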

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @rec

[ghstack-poisoned]

pytorch-bot · 2024-09-24T18:01:01Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136551

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

PyTorch Testing Nodes Undergoing ROCm 6.2.1 Upgrades

✅ No Failures

As of commit 32b2b90 with merge base d2455b9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Regression introduced by setting a low expected value:

```
**REGRESSION** benchmark add_loop_eager_dynamic failed, actual instruction count 5486451976 is higher than expected 1 with noise margin 0.01 if this is an expected regression, please update the expected instruction count in the benchmark.
```

Win introduced by setting a high expected value:

```
**WIN** benchmark add_loop_eager failed, actual instruction count 2758450573 is lower than expected 10000000000 with noise margin 0.01 please update the expected instruction count in the benchmark.
```

I will follow up with diffs that enable the regressions at diff time.

ezyang · 2024-09-25T03:52:21Z

You're putting the expected count number in a Python file, which means that when you write the auto-updater it will be harder to update programmatically. Can we plan for programmatic updates now, or have you decided you don't want them?

oulgen · 2024-09-25T06:21:56Z

Earlier today we discussed how you should log each PR you block so that we can know/track when a PR is blocked. Should we add that functionality before landing this?

ezyang · 2024-09-25T13:27:28Z

Doesn't this PR have those logs?

laithsakka · 2024-09-25T19:01:47Z

Earlier today we discussed how you should log each PR you block so that we can know/track when a PR is blocked. Should we add that functionality before landing this?

Yes, this PR has the logs. cc @ezyang @oulgen

laithsakka · 2024-09-25T19:04:24Z

You're putting the expected count number in a Python file, which means that when you write the auto-updater it will be harder to update programmatically. Can we plan for programmatic updates now, or have you decided you don't want them?

Hmm, I see. I can change the way we do it and keep the expected values in a separate file.
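
To make the "separate file" idea concrete, here is a minimal sketch of reading expected values from a CSV instead of hard-coding them in Python. The column layout (benchmark, metric, expected value, noise margin) and the name `load_expected` are assumptions for illustration; the PR's actual expected_test.csv layout is not shown in this conversation.

```python
import csv
import io

# Assumed layout: benchmark name, metric name, expected value, noise margin.
SAMPLE = """a,instruction count,110,0.01
b,memory,100,0.1
"""


def load_expected(fileobj):
    """Parse expected values into {(benchmark, metric): (value, margin)}."""
    expected = {}
    for row in csv.reader(fileobj):
        if not row:
            continue  # skip blank lines
        name, metric, value, margin = (cell.strip() for cell in row)
        expected[(name, metric)] = (int(value), float(margin))
    return expected


expected = load_expected(io.StringIO(SAMPLE))
print(expected[("a", "instruction count")])
```

Keeping the numbers in a data file means a tool can rewrite them without parsing or editing Python source.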

ezyang · 2024-09-26T01:22:23Z

To be super explicit, I don't want to land this /without/ the autoupdater. It's a package deal.

laithsakka · 2024-09-26T20:25:00Z

addressed comments @ezyang @oulgen

benchmarks/dynamo/pr_time_benchmarks/test_check_result/result_test.csv

ezyang · 2024-09-27T12:23:17Z

benchmarks/dynamo/pr_time_benchmarks/check_results.py

+
+        if result < low:
+            fail = True
+            ratio = (float)(entry.expected_value - result) * 100 / entry.expected_value


the heck is this, just do float(...) like a normal person lol

lol pardon my C++ background and lack of Python, your majesty, I will update it
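
For context on the nit: `(float)(x)` does run in Python, because the parentheses around `float` are plain grouping, so it is the same call as `float(x)`. In Python 3 the `/` operator already returns a float for integer operands, so the conversion could arguably be dropped altogether. A small check, using the numbers from the PR description:

```python
expected_value = 110
result = 90

# C-style spelling: valid Python, but just float(...) with extra parentheses.
c_style = (float)(expected_value - result) * 100 / expected_value

# Idiomatic spelling suggested in the review.
idiomatic = float(expected_value - result) * 100 / expected_value

# No conversion at all: "/" is true division in Python 3.
no_cast = (expected_value - result) * 100 / expected_value

print(f"{c_style:.2f} {idiomatic:.2f} {no_cast:.2f}")
```

All three expressions yield the same value, which formats as the 18.18 shown in the WIN message.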

ezyang · 2024-09-27T12:23:55Z

benchmarks/dynamo/pr_time_benchmarks/check_results.py

+            print(
+                f"WIN: benchmark {key} failed, actual result {result} is {ratio:.2f}% lower than "
+                f"expected {entry.expected_value} ±{entry.noise_margin*100:.2f}% "
+                f"please update the expected results."


OK, so are you going to write the updater script too?

not in this diff, I will follow up in a different diff.
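
The updater is not part of this PR, so the following is only a hypothetical sketch of what such a script could do once expected values live in a CSV: rewrite the expected value of every benchmark that has a fresh measurement and leave the rest untouched. The function name `update_expected` and the four-column layout (benchmark, metric, expected value, noise margin) are assumptions, not the follow-up diff's design.

```python
import csv
import io


def update_expected(expected_csv, results):
    """Return new CSV text with expected values replaced by measured results.

    Rows whose (benchmark, metric) key has no measured result are kept as-is.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(expected_csv)):
        if not row:
            continue  # skip blank lines
        name, metric, value, margin = (cell.strip() for cell in row)
        new_value = results.get((name, metric), value)
        writer.writerow([name, metric, new_value, margin])
    return out.getvalue()


old = "a,instruction count,110,0.01\nb,memory,100,0.1\n"
new = update_expected(old, {("a", "instruction count"): 90})
print(new, end="")
```

Here only row `a` changes (110 becomes 90); row `b` is written back unchanged because no new measurement was supplied for it.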

laithsakka · 2024-09-27T17:47:18Z

Addressed comments.

laithsakka · 2024-09-29T17:43:55Z

@pytorchbot merge

pytorchmergebot · 2024-09-29T17:45:37Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

enable failing diffs on regression

c28f9fb

[ghstack-poisoned]

pytorch-bot bot added ciflow/inductor module: dynamo labels Sep 24, 2024

laithsakka changed the title ~~enable failing diffs on regression~~ Enable failing diffs on regression Sep 24, 2024

Update on "Enable failing diffs on regression"

eb704c0

laithsakka added ci-scribe Enable logging to Scribe on the CI job topic: not user facing topic category labels Sep 24, 2024

laithsakka requested a review from a team as a code owner September 24, 2024 19:21

laithsakka had a problem deploying to scribe-pr September 24, 2024 19:23 — with GitHub Actions Error

laithsakka added 2 commits September 24, 2024 14:01

pytorch-bot bot mentioned this pull request Sep 24, 2024

Enable regression test for add loop benchmarks #136573

Closed

laithsakka added 7 commits September 25, 2024 23:14

laithsakka mentioned this pull request Sep 26, 2024

[no land] test fail due to win #136740

Closed

ezyang reviewed Sep 27, 2024

ezyang reviewed Sep 27, 2024

laithsakka added 2 commits September 27, 2024 10:48

ezyang approved these changes Sep 28, 2024

5 participants