
Add regression detection [DNM yet] #226

Merged
ian-r-rose merged 28 commits into main from regression-vis on Aug 12, 2022
Conversation

@ncclementi (Contributor) commented Aug 4, 2022

Comment on lines +369 to +376
# - name: Upload benchmark db
# env:
# AWS_ACCESS_KEY_ID: ${{ secrets.RUNTIME_CI_BOT_AWS_ACCESS_KEY_ID }}
# AWS_SECRET_ACCESS_KEY: ${{ secrets.RUNTIME_CI_BOT_AWS_SECRET_ACCESS_KEY }}
# AWS_DEFAULT_REGION: us-east-2 # this is needed for boto for some reason
# DB_NAME: benchmark.db
# run: |
# aws s3 cp $DB_NAME s3://coiled-runtime-ci/benchmarks/
Contributor:

Thank you for being careful here :)

Contributor Author:

@ian-r-rose is it safe to uncomment this now? Unless there are any other things you want to check?

Contributor:

I think we're ready

@ncclementi (Contributor Author):

Raising an exception with a summary of what's failing seems to be working; see https://github.com/coiled/coiled-runtime/runs/7681316108?check_suite_focus=true.

But I've just realized that I didn't use the latest run for the comparison, because the individual databases only get updated on main, so I modified that to test it.

Separate comment: currently the threshold is (mean + 1 std), where the mean is computed over the latest 10 runs, excluding the one being evaluated for regression. That seems tight, although in some cases we will miss regressions due to averaging, as in test_download_throughput[s3fs] in https://coiled.github.io/coiled-runtime/coiled-upstream-py3.9.html
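The threshold described above could be sketched roughly like this (a minimal illustration under stated assumptions; the function names and series handling are hypothetical, not the actual script):

```python
import pandas as pd

def regression_threshold(durations: pd.Series, window: int = 10, n_std: int = 1) -> float:
    """Mean + n_std * std over the last `window` runs, excluding the
    final entry (the run being evaluated for regression)."""
    history = durations.iloc[-(window + 1):-1]
    return history.mean() + n_std * history.std()

def is_regression(durations: pd.Series) -> bool:
    # The newest run regresses if it exceeds the historical threshold.
    return durations.iloc[-1] > regression_threshold(durations)
```

With a flat history the threshold collapses to the mean itself (std is zero), which is part of why a 1-std band can be tight.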

@ncclementi (Contributor Author) commented Aug 4, 2022

I'm having trouble getting the last value of the time series to be the one from the run that just happened. It seems the database is not getting updated: I'm using the database artifact from after the single DBs were merged, and I keep getting the incorrect last value.

  • I put a sleep in benchmark/test_coiled.py::test_default_cluster_spinup_time; that should have triggered the exception, but it didn't.
  • If I look at one of the failed examples, I see
E               runtime= 'coiled-0.0.3-py3.7', name= 'test_shuffle_parquet', duration_last = 226.18878746032715, dur_threshold= 223.97877921614503 

and if I look at the last uploaded value of that test, I see that that is the value it's picking up.
(Screenshot: Screen Shot 2022-08-04 at 6:55 PM)

I'm not sure what I'm doing wrong here, or why the database is not updating locally after the combine step. @ian-r-rose do you have any idea what could be happening here?

EDIT: I forgot to modify the workflow to push individual artifacts, hence the DB was not updating.

@ncclementi changed the title from "Add regression detection [WIP]" to "Add regression detection" on Aug 5, 2022
@ncclementi changed the title from "Add regression detection" to "Add regression detection [DNM yet]" on Aug 5, 2022
Comment on lines +108 to +110
# convert dict to dataframe to write to xml to use later ask ian if this is what he had in mind.
# df_stats = pandas.DataFrame.from_dict(stats_dict, orient="index")
# df_stats.to_xml("stats.xml")
Contributor:

This is what I had in mind: https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#adding-a-job-summary

Looks like it's unstructured markdown, not XML

Contributor Author:

So it seems that we can do df.to_markdown(). After seeing the example in the link, I assume that what you want in the summary is a table of the regressions, not all the stats for every test.

Having a summary of the regressions as a nice markdown table is, I think, possible. I'm thinking something like: every row is a test_name, and as columns we have [runtime_ver, regression_type, mean, last_val, last_val-1, last_val-2, threshold].

Would something like this work for a summary?
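A minimal sketch of emitting such a table as a job summary (the row and column values are the hypothetical ones above; GITHUB_STEP_SUMMARY is the file GitHub Actions renders as the job summary):

```python
import os

def regression_summary_table(rows, columns):
    """Build a markdown table (one row per regressing test) and append it
    to the GitHub Actions job summary file when running in CI."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    table = "\n".join(["## Regressions detected", "", header, divider, *body]) + "\n"
    summary_file = os.environ.get("GITHUB_STEP_SUMMARY")  # set by GitHub Actions
    if summary_file:
        with open(summary_file, "a") as f:
            f.write(table)
    return table

# Hypothetical example row matching the proposed columns:
print(regression_summary_table(
    [["test_shuffle_parquet", "coiled-0.0.3-py3.7", "duration", 223.98, 226.19]],
    ["test_name", "runtime_ver", "regression_type", "threshold", "last_val"],
))
```

df.to_markdown() would produce the same kind of table directly from a DataFrame (it needs the tabulate package installed); the manual version above just avoids that dependency.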

Contributor:

Sure, I don't have a strong idea of the best presentation, so if you come up with something that you think is useful, that would be a great starting point.

@ncclementi (Contributor Author):

@ian-r-rose I know you mentioned that you preferred having this as a script instead of a test. I noticed that by running this as a script and not a test, we don't get the red on the CI run.

If the reason behind having a script is that we want to be able to run this again, maybe when creating the static page and building a nice table there with all the stats, then we could just save the dict with all the data, upload it as an artifact, and use it during the static step. Thoughts?

@ian-r-rose (Contributor):

@ian-r-rose I know you mentioned that you preferred having this as a script instead of a test. I noticed that by running this as a script and not a test, we don't get the red on the CI run.

I'm not sure I follow -- I see a red X on your regressions test here. It should be a failure if the last command of the script has a non-zero exit code (which your assert accomplishes)

If the reason behind having a script is that we want to be able to run this again, maybe when creating the static page and building a nice table there with all the stats, then we could just save the dict with all the data, upload it as an artifact, and use it during the static step. Thoughts?

I'm not against an artifact, but it does seem a little over-complicated to me, since it's not expensive to re-compute these values. Maybe if we start wanting to do a Monte Carlo Bayesian changepoint analysis or something...

@ncclementi (Contributor Author):

I'm not sure I follow -- I see a red X on your regressions test here. It should be a failure if the last command of the script has a non-zero exit code (which your assert accomplishes)

Yes, I meant to say the traceback wasn't red like here for example. But I guess it's ok.

I'm not against an artifact, but it does seem a little over-complicated to me, since it's not expensive to re-compute these values. Maybe if we start wanting to do a Monte Carlo Bayesian changepoint analysis or something...

👍

@ian-r-rose (Contributor):

Yes, I meant to say the traceback wasn't red like here for example. But I guess it's ok.

Oh, interesting. I'm not sure how to accomplish that! I'll poke around Google.

@ian-r-rose (Contributor):

Looks like GitHub actions supports ANSI colors: https://github.com/ian-r-rose/gha-test/runs/7700189190?check_suite_focus=true

You could try to use something like rich to format nice text output if you wanted to (with the force_terminal option).
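A minimal sketch of the plain-ANSI approach (raw escape codes, which GitHub Actions log viewers render as color; rich with force_terminal=True would be the fancier route):

```python
RED, RESET = "\x1b[31m", "\x1b[0m"

def format_regressions(regressions):
    """Wrap the regression report in red ANSI codes so the
    failure stands out in the CI log."""
    header = f"Regressions detected {len(regressions)}:"
    return f"{RED}{header}\n" + "\n".join(regressions) + RESET

# Hypothetical report line for illustration:
print(format_regressions(["test_shuffle_parquet: last=226.19 > threshold=223.98"]))
```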

run: |
python detect_regressions.py

- name: Create regressions summary
Contributor Author:

It seems that this step doesn't run when the previous one fails. I might have to create an artifact and a separate workflow; not sure yet.

Contributor:

You can add an if: always() or similar condition to get around that.

@ncclementi (Contributor Author):

I was able to get some extra color on the regression report; locally I see all the regressions in red, but in CI I don't.
https://github.com/coiled/coiled-runtime/runs/7734995707?check_suite_focus=true

I was able to get the markdown table https://github.com/coiled/coiled-runtime/actions/runs/2820970035/attempts/1#summary-7734995707

@ncclementi (Contributor Author):

After a conversation with @ian-r-rose :

  • Split the script into two separate functions so we can reuse detect_regressions() on its own.
  • Add a check to only compute stats when there are more than 6 values. I'd like this number to be bigger, but since we don't have enough entries, I'm using 6 for now.
  • Add a check to only compute regression stats if the test is not obsolete (there is a run in the last 7 days).

I think this is ready for a thorough review. So far I haven't reverted the changes that allow things to run on this PR; that will be the last step.

cc: @jrbourbeau in case you want to take a look too.
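The two gating checks described above might look roughly like this (a sketch only; the "start" column name and the function name are assumptions, not the actual script's API):

```python
import pandas as pd

MIN_RUNS = 6        # need more history than this to compute stats
MAX_AGE_DAYS = 7    # skip tests with no run in the last week (obsolete)

def should_check_regressions(df_test: pd.DataFrame) -> bool:
    """Only compute regression stats for tests that have enough
    history and have run recently."""
    if len(df_test) <= MIN_RUNS:
        return False
    last_run = pd.to_datetime(df_test["start"]).max()  # "start" column is assumed
    return (pd.Timestamp.now() - last_run) <= pd.Timedelta(days=MAX_AGE_DAYS)
```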

@ian-r-rose self-requested a review on August 10, 2022 15:35
@ian-r-rose (Contributor) left a comment:

Thanks @ncclementi!

f"\x1b[31m Regressions detected {len(regressions)}: \n{''.join(regressions)} \x1b[0m"
)
else:
assert not regressions
Contributor:

This should never happen, since you've already checked if regressions above.

@ian-r-rose (Contributor) left a comment:

Thanks @ncclementi! I'll take another look tomorrow when I'm fresher, but this looks like it's ready to start pestering us

metric_threshold = (
df_test[metric][-13:-3].mean()
+ 2 * df_test[metric][-13:-3].std()
+ 1 * df_test[metric][-13:-3].std()
Contributor:

I'd still advocate for 2 standard deviations: a ~40% false positive rate would get annoying pretty quickly
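Under a rough normality assumption, the false-positive difference between 1 and 2 standard deviations is easy to see (the test count here is illustrative, not the suite's actual size):

```python
import math

def exceed_prob(k: float) -> float:
    """One-sided probability that a normal sample exceeds mean + k * std."""
    return 1 - 0.5 * (1 + math.erf(k / math.sqrt(2)))

n_tests = 3  # illustrative; the real suite runs many more benchmarks
for k in (1, 2):
    per_test = exceed_prob(k)
    any_test = 1 - (1 - per_test) ** n_tests
    print(f"{k} std: {per_test:.1%} per test, {any_test:.1%} chance of at least one flag")
```

Even with just three tests, a 1-std threshold flags something on roughly 40% of runs, while 2 std drops that to about 7%.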

Contributor Author:

Yes, I agree; I was just trying to trigger regressions on purpose to see things work. I'll revert this soon.

@ncclementi (Contributor Author):

Things work as expected; see https://github.com/coiled/coiled-runtime/runs/7810509623?check_suite_focus=true and the summary https://github.com/coiled/coiled-runtime/actions/runs/2847981410#summary-7810509623

I'll move to 2 standard deviations, as 1 is too strict, and I'll start removing things that were only there to test on this PR.

python detect_regressions.py

- name: Create regressions summary
if: always() && github.ref == 'refs/heads/main' && github.repository == 'coiled/coiled-runtime'
Contributor:

I don't think we need the ref or repository check here

Contributor Author:

Do we want to run this on PRs too? I thought that since we only update the database on ref = main, we should keep this. Or will this be covered, since the check is also at the step level?

Contributor:

Isn't this job gated by the same check above? I just mean that it's redundant.

That being said, I actually think that running this on PRs would be good practice. We mostly just don't want to upload the new db on PRs


process-results:
needs: [runtime, benchmarks, stability]
name: Combine separate benchmark results
Contributor:

So, if we wanted to run the regression check on PRs, I think we'd have to remove the github.ref check on process-results, but then add it to the "Upload benchmark db" step. Then regressions could just always() run.

@ian-r-rose (Contributor):

This looks great! Let's watch CI to make sure things behave as we expect, then merge!

@ncclementi (Contributor Author):

@ian-r-rose thanks for pushing that fix, it looks like we don't have regressions as of this PR. : )

@ian-r-rose (Contributor):

Let's take it for a spin!

@ian-r-rose merged commit 1758b3e into main on Aug 12, 2022
@ncclementi deleted the regression-vis branch on December 28, 2022 19:05

Development

Successfully merging this pull request may close these issues: Performance regression visibility