
Add regression detection [DNM yet] #226

Merged
ian-r-rose merged 28 commits into main from regression-vis on Aug 12, 2022
Conversation

@ncclementi (Contributor) commented Aug 4, 2022

Comment on lines +369 to +376
# - name: Upload benchmark db
# env:
# AWS_ACCESS_KEY_ID: ${{ secrets.RUNTIME_CI_BOT_AWS_ACCESS_KEY_ID }}
# AWS_SECRET_ACCESS_KEY: ${{ secrets.RUNTIME_CI_BOT_AWS_SECRET_ACCESS_KEY }}
# AWS_DEFAULT_REGION: us-east-2 # this is needed for boto for some reason
# DB_NAME: benchmark.db
# run: |
# aws s3 cp $DB_NAME s3://coiled-runtime-ci/benchmarks/
Contributor:

Thank you for being careful here :)

Contributor Author:

@ian-r-rose is it safe to uncomment this now? Unless there are any other things you want to check?

Contributor:

I think we're ready

@ncclementi (Contributor Author):

Raising an exception with a summary of what's failing seems to be working; see https://github.com/coiled/coiled-runtime/runs/7681316108?check_suite_focus=true.

But I've just realized that I didn't use the latest run for the comparison, because the individual databases only get updated on main, so I modified that to test it.

Separate comment: currently the threshold is (mean + 1 std), where the mean is computed over the latest 10 runs, excluding the one being evaluated for regression. That seems tight, although in some cases we will miss regressions due to averaging, as in test_download_throughput[s3fs] in https://coiled.github.io/coiled-runtime/coiled-upstream-py3.9.html
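The threshold described above could be sketched roughly like this (a minimal illustration under stated assumptions; the function names and series handling are hypothetical, not the actual script):

```python
import pandas as pd

def regression_threshold(durations: pd.Series, window: int = 10, n_std: int = 1) -> float:
    """Mean + n_std * std over the last `window` runs, excluding the
    final entry (the run being evaluated for regression)."""
    history = durations.iloc[-(window + 1):-1]
    return history.mean() + n_std * history.std()

def is_regression(durations: pd.Series) -> bool:
    # The newest run regresses if it exceeds the historical threshold.
    return durations.iloc[-1] > regression_threshold(durations)
```

With a flat history the threshold collapses to the mean itself (std is zero), which is part of why a 1-std band can be tight.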

@ncclementi (Contributor Author) commented Aug 4, 2022

I'm having trouble getting the last value of the time series to be the one from the run that just happened. It seems the database is not getting updated: I'm using the database artifact from after the single DBs were merged, and I keep getting the incorrect last value.

  • I put a sleep in benchmark/test_coiled.py::test_default_cluster_spinup_time; that should have triggered the exception, but it didn't.
  • If I look at one of the failed examples, I see
E               runtime= 'coiled-0.0.3-py3.7', name= 'test_shuffle_parquet', duration_last = 226.18878746032715, dur_threshold= 223.97877921614503 

and if I look at the last uploaded value of that test, I see that that is the value it's picking up.
(Screenshot: Screen Shot 2022-08-04 at 6:55 PM)

I'm not sure what I'm doing wrong here, or why the database is not updating locally after the combine step. @ian-r-rose do you have any idea what could be happening here?

EDIT: I forgot to modify the workflow to push individual artifacts, hence the DB was not updating.

@ncclementi changed the title from "Add regression detection [WIP]" to "Add regression detection" on Aug 5, 2022
@ncclementi changed the title from "Add regression detection" to "Add regression detection [DNM yet]" on Aug 5, 2022
Comment on lines +108 to +110
# convert dict to dataframe to write to xml to use later ask ian if this is what he had in mind.
# df_stats = pandas.DataFrame.from_dict(stats_dict, orient="index")
# df_stats.to_xml("stats.xml")
Contributor:

This is what I had in mind: https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#adding-a-job-summary

Looks like it's unstructured markdown, not XML

Contributor Author:

So it seems that we can do df.to_markdown(). After seeing the example in the link, I assume that what you want in the summary is a table of the regressions, not all the stats for every test.

Having a summary of the regressions as a nice markdown table is, I think, possible. I'm thinking something like: every row is a test_name, and as columns we have [runtime_ver, regression_type, mean, last_val, last_val-1, last_val-2, threshold].

Would something like this work for a summary?
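A minimal sketch of emitting such a table as a job summary (the row and column values are the hypothetical ones above; GITHUB_STEP_SUMMARY is the file GitHub Actions renders as the job summary):

```python
import os

def regression_summary_table(rows, columns):
    """Build a markdown table (one row per regressing test) and append it
    to the GitHub Actions job summary file when running in CI."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    table = "\n".join(["## Regressions detected", "", header, divider, *body]) + "\n"
    summary_file = os.environ.get("GITHUB_STEP_SUMMARY")  # set by GitHub Actions
    if summary_file:
        with open(summary_file, "a") as f:
            f.write(table)
    return table

# Hypothetical example row matching the proposed columns:
print(regression_summary_table(
    [["test_shuffle_parquet", "coiled-0.0.3-py3.7", "duration", 223.98, 226.19]],
    ["test_name", "runtime_ver", "regression_type", "threshold", "last_val"],
))
```

df.to_markdown() would produce the same kind of table directly from a DataFrame (it needs the tabulate package installed); the manual version above just avoids that dependency.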

Contributor:

Sure, I don't have a strong idea of the best presentation, so if you come up with something that you think is useful, that would be a great starting point.

@ncclementi (Contributor Author):

@ian-r-rose I know you mentioned that you preferred having this as a script instead of a test. I noticed that by running this as a script and not a test, we don't get the red on the CI run.

If the reason behind having a script is that we want to be able to run this again, maybe when creating the static page and building a nice table there with all the stats, then we could just save the dict with all the data, upload it as an artifact, and use it during the static step. Thoughts?

@ian-r-rose (Contributor):

@ian-r-rose I know you mentioned that you preferred having this as a script instead of a test. I noticed that by running this as a script and not a test, we don't get the red on the CI run.

I'm not sure I follow -- I see a red X on your regressions test here. It should be a failure if the last command of the script has a non-zero exit code (which your assert accomplishes)

If the reason behind having a script is that we want to be able to run this again, maybe when creating the static page and building a nice table there with all the stats, then we could just save the dict with all the data, upload it as an artifact, and use it during the static step. Thoughts?

I'm not against an artifact, but it does seem a little over-complicated to me, since it's not expensive to re-compute these values. Maybe if we start wanting to do a Monte Carlo Bayesian changepoint analysis or something...

@ncclementi (Contributor Author):

I'm not sure I follow -- I see a red X on your regressions test here. It should be a failure if the last command of the script has a non-zero exit code (which your assert accomplishes)

Yes, I meant to say the traceback wasn't red like here for example. But I guess it's ok.

I'm not against an artifact, but it does seem a little over-complicated to me, since it's not expensive to re-compute these values. Maybe if we start wanting to do a Monte Carlo Bayesian changepoint analysis or something...

👍

@ian-r-rose (Contributor):

Yes, I meant to say the traceback wasn't red like here for example. But I guess it's ok.

Oh, interesting. I'm not sure how to accomplish that! I'll poke around Google.

@ian-r-rose (Contributor):

Looks like GitHub actions supports ANSI colors: https://github.com/ian-r-rose/gha-test/runs/7700189190?check_suite_focus=true

You could try to use something like rich to format nice text output if you wanted to (with the force_terminal option).
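A minimal sketch of the plain-ANSI approach (raw escape codes, which GitHub Actions log viewers render as color; rich with force_terminal=True would be the fancier route):

```python
RED, RESET = "\x1b[31m", "\x1b[0m"

def format_regressions(regressions):
    """Wrap the regression report in red ANSI codes so the
    failure stands out in the CI log."""
    header = f"Regressions detected {len(regressions)}:"
    return f"{RED}{header}\n" + "\n".join(regressions) + RESET

# Hypothetical report line for illustration:
print(format_regressions(["test_shuffle_parquet: last=226.19 > threshold=223.98"]))
```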

run: |
python detect_regressions.py

- name: Create regressions summary
Contributor Author:

It seems that this step doesn't run when the previous one fails. I might have to create an artifact and a separate workflow; not sure yet.

Contributor:

You can add an if: always() or similar condition to get around that.

@ncclementi (Contributor Author):

I was able to get some extra color on the regression report; locally I see all the regressions in red, but in CI I don't.
https://github.com/coiled/coiled-runtime/runs/7734995707?check_suite_focus=true

I was able to get the markdown table https://github.com/coiled/coiled-runtime/actions/runs/2820970035/attempts/1#summary-7734995707

@ncclementi (Contributor Author):

After a conversation with @ian-r-rose :

  • Split the script into two separate functions so we can reuse detect_regressions() on its own.
  • Add a check to only compute stats when there are more than 6 values. I'd like this number to be bigger, but since we don't have enough entries, I'm using 6 for now.
  • Add a check to only compute regression stats if the test is not obsolete (there is a run in the last 7 days).

I think this is ready for a thorough review. So far I haven't reverted the changes that allow things to run on this PR; that will be the last step.

cc: @jrbourbeau in case you want to take a look too.
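The two gating checks described above might look roughly like this (a sketch only; the "start" column name and the function name are assumptions, not the actual script's API):

```python
import pandas as pd

MIN_RUNS = 6        # need more history than this to compute stats
MAX_AGE_DAYS = 7    # skip tests with no run in the last week (obsolete)

def should_check_regressions(df_test: pd.DataFrame) -> bool:
    """Only compute regression stats for tests that have enough
    history and have run recently."""
    if len(df_test) <= MIN_RUNS:
        return False
    last_run = pd.to_datetime(df_test["start"]).max()  # "start" column is assumed
    return (pd.Timestamp.now() - last_run) <= pd.Timedelta(days=MAX_AGE_DAYS)
```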

@ian-r-rose self-requested a review on August 10, 2022 15:35
@ian-r-rose (Contributor) left a comment:

Thanks @ncclementi!

f"\x1b[31m Regressions detected {len(regressions)}: \n{''.join(regressions)} \x1b[0m"
)
else:
assert not regressions
Contributor:

This should never happen, since you've already checked if regressions above.

@ian-r-rose (Contributor) left a comment:

Thanks @ncclementi! I'll take another look tomorrow when I'm fresher, but this looks like it's ready to start pestering us

metric_threshold = (
df_test[metric][-13:-3].mean()
+ 2 * df_test[metric][-13:-3].std()
+ 1 * df_test[metric][-13:-3].std()
Contributor:

I'd still advocate for 2 standard deviations: a ~40% false positive rate would get annoying pretty quickly
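Under a rough normality assumption, the false-positive difference between 1 and 2 standard deviations is easy to see (the test count here is illustrative, not the suite's actual size):

```python
import math

def exceed_prob(k: float) -> float:
    """One-sided probability that a normal sample exceeds mean + k * std."""
    return 1 - 0.5 * (1 + math.erf(k / math.sqrt(2)))

n_tests = 3  # illustrative; the real suite runs many more benchmarks
for k in (1, 2):
    per_test = exceed_prob(k)
    any_test = 1 - (1 - per_test) ** n_tests
    print(f"{k} std: {per_test:.1%} per test, {any_test:.1%} chance of at least one flag")
```

Even with just three tests, a 1-std threshold flags something on roughly 40% of runs, while 2 std drops that to about 7%.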

Contributor Author:

Yes, I agree; I was just trying to trigger regressions on purpose to see things work. I'll revert this soon.

@ncclementi (Contributor Author):

Things work as expected; see https://github.com/coiled/coiled-runtime/runs/7810509623?check_suite_focus=true and the summary https://github.com/coiled/coiled-runtime/actions/runs/2847981410#summary-7810509623

I'll move to 2 standard deviations, as 1 is too strict, and I'll start removing things that were only there to test on this PR.

python detect_regressions.py

- name: Create regressions summary
if: always() && github.ref == 'refs/heads/main' && github.repository == 'coiled/coiled-runtime'
Contributor:

I don't think we need the ref or repository check here

Contributor Author:

Do we want to run this on PRs too? I thought that since we only update the database on ref = main, we should keep this. Or will this be covered, since the check is also at the step level?

Contributor:

Isn't this job gated by the same check above? I just mean that it's redundant.

That being said, I actually think that running this on PRs would be good practice. We mostly just don't want to upload the new db on PRs


process-results:
needs: [runtime, benchmarks, stability]
name: Combine separate benchmark results
Contributor:

So, if we wanted to run the regression check on PRs, I think we'd have to remove the github.ref check on process-results, but then add it to the "Upload benchmark db" step. Then regressions could just always() run.

@ian-r-rose (Contributor):

This looks great! Let's watch CI to make sure things behave as we expect, then merge!

@ncclementi (Contributor Author):

@ian-r-rose thanks for pushing that fix, it looks like we don't have regressions as of this PR. : )

@ian-r-rose (Contributor):

Let's take it for a spin!

@ian-r-rose merged commit 1758b3e into main on Aug 12, 2022
@ncclementi deleted the regression-vis branch on December 28, 2022 19:05

Development

Successfully merging this pull request may close these issues: Performance regression visibility