Skip to content

Enable failing diffs on regression#136551

Closed
laithsakka wants to merge 20 commits intogh/laithsakka/71/basefrom
gh/laithsakka/71/head
Closed

Enable failing diffs on regression#136551
laithsakka wants to merge 20 commits intogh/laithsakka/71/basefrom
gh/laithsakka/71/head

Conversation

@laithsakka
Copy link
Contributor

@laithsakka laithsakka commented Sep 24, 2024

Stack from ghstack (oldest at bottom):

  1. example of failing diff
    [no land] test fail due to win #136740

  2. test this by running
    python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv

results

WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it

MISSING REGRESSION TEST does not fail but its logged.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @rec

@pytorch-bot
Copy link

pytorch-bot bot commented Sep 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136551

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ No Failures

As of commit 32b2b90 with merge base d2455b9 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@laithsakka laithsakka changed the title enable failing diffs on regression Enable failing diffs on regression Sep 24, 2024
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
@laithsakka laithsakka added ci-scribe Enable logging to Scribe on the CI job topic: not user facing topic category labels Sep 24, 2024
regression introduced by setting low expected 
```
**REGRESSION** benchmark add_loop_eager_dynamic failed, actual instruction count 5486451976 is higher than expected 1 with noise margin 0.01 if this is an expected regression, please update the expected instruction count in the benchmark.
```

win introduced by setting high expected 
```
**WIN** benchmark add_loop_eager failed, actual instruction count 2758450573 is lower than expected 10000000000 with noise margin 0.01 please update the expected instruction count in the benchmark.
```
I will follow up with diffs that enable the regressions at diff time. 
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
@laithsakka laithsakka requested a review from a team as a code owner September 24, 2024 19:21
regression introduced by setting low expected 
```
**REGRESSION** benchmark add_loop_eager_dynamic failed, actual instruction count 5486451976 is higher than expected 1 with noise margin 0.01 if this is an expected regression, please update the expected instruction count in the benchmark.
```

win introduced by setting high expected 
```
**WIN** benchmark add_loop_eager failed, actual instruction count 2758450573 is lower than expected 10000000000 with noise margin 0.01 please update the expected instruction count in the benchmark.
```
I will follow up with diffs that enable the regressions at diff time. 
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
regression introduced by setting low expected 
```
**REGRESSION** benchmark add_loop_eager_dynamic failed, actual instruction count 5486451976 is higher than expected 1 with noise margin 0.01 if this is an expected regression, please update the expected instruction count in the benchmark.
```

win introduced by setting high expected 
```
**WIN** benchmark add_loop_eager failed, actual instruction count 2758450573 is lower than expected 10000000000 with noise margin 0.01 please update the expected instruction count in the benchmark.
```
I will follow up with diffs that enable the regressions at diff time. 
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
regression introduced by setting low expected 
```
**REGRESSION** benchmark add_loop_eager_dynamic failed, actual instruction count 5486451976 is higher than expected 1 with noise margin 0.01 if this is an expected regression, please update the expected instruction count in the benchmark.
```

win introduced by setting high expected 
```
**WIN** benchmark add_loop_eager failed, actual instruction count 2758450573 is lower than expected 10000000000 with noise margin 0.01 please update the expected instruction count in the benchmark.
```
I will follow up with diffs that enable the regressions at diff time. 
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
@ezyang
Copy link
Contributor

ezyang commented Sep 25, 2024

You're putting the expected count number in a Python file, which means when you write the auto-updater it will be harder to programatically update. Can we plan for programattic updates now, or have you decided you don't want them?

@oulgen
Copy link
Contributor

oulgen commented Sep 25, 2024

Earlier today we discussed how you should log each PR you block so that we can know/track when a PR is blocked. Should we add that functionality before landing this?

@ezyang
Copy link
Contributor

ezyang commented Sep 25, 2024

Doesn't this PR have those logs?

@laithsakka
Copy link
Contributor Author

Earlier today we discussed how you should log each PR you block so that we can know/track when a PR is blocked. Should we add that functionality before landing this?

yeh this PR have the logs cc @ezyang @oulgen

@laithsakka
Copy link
Contributor Author

You're putting the expected count number in a Python file, which means when you write the auto-updater it will be harder to programatically update. Can we plan for programattic updates now, or have you decided you don't want them?

mhmm i see i can change the way we do it to have them in a separate file

@ezyang
Copy link
Contributor

ezyang commented Sep 26, 2024

To be super explicit, I don't want to land this /without/ the autoupdater. It's a package deal.

regression introduced by setting low expected 
```
**REGRESSION** benchmark add_loop_eager_dynamic failed, actual instruction count 5486451976 is higher than expected 1 with noise margin 0.01 if this is an expected regression, please update the expected instruction count in the benchmark.
```

win introduced by setting high expected 
```
**WIN** benchmark add_loop_eager failed, actual instruction count 2758450573 is lower than expected 10000000000 with noise margin 0.01 please update the expected instruction count in the benchmark.
```
I will follow up with diffs that enable the regressions at diff time. 
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
regression introduced by setting low expected 
```
**REGRESSION** benchmark add_loop_eager_dynamic failed, actual instruction count 5486451976 is higher than expected 1 with noise margin 0.01 if this is an expected regression, please update the expected instruction count in the benchmark.
```

win introduced by setting high expected 
```
**WIN** benchmark add_loop_eager failed, actual instruction count 2758450573 is lower than expected 10000000000 with noise margin 0.01 please update the expected instruction count in the benchmark.
```
I will follow up with diffs that enable the regressions at diff time. 
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
test this by running 
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv 

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged. 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
test this by running 
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv 

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged. 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
test this by running 
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv 

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged. 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
test this by running 
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv 

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged. 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
test this by running 
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv 

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged. 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
test this by running 
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv 

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged. 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
@laithsakka
Copy link
Contributor Author

laithsakka commented Sep 26, 2024

addressed comments @ezyang @oulgen


if result < low:
fail = True
ratio = (float)(entry.expected_value - result) * 100 / entry.expected_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the heck is this, just do float(...) like a normal person lol

Copy link
Contributor Author

@laithsakka laithsakka Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol pardon my c++ background and lack of python your majesty, i will update it

print(
f"WIN: benchmark {key} failed, actual result {result} is {ratio:.2f}% lower than "
f"expected {entry.expected_value} ±{entry.noise_margin*100:.2f}% "
f"please update the expected results."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so are you going to write the updater script too?

Copy link
Contributor Author

@laithsakka laithsakka Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in this diff, I will follow up in a different diff.

1. example of failing diff
#136740

2. test this by running 
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv 

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged. 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
@laithsakka
Copy link
Contributor Author

adress comments

1. example of failing diff
#136740

2. test this by running 
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv 

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged. 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
1. example of failing diff
#136740

2. test this by running 
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv 

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged. 

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
@laithsakka
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-scribe Enable logging to Scribe on the CI job ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged module: dynamo topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants