Conversation
.github/workflows/tests.yml
Outdated
# - name: Upload benchmark db
#   env:
#     AWS_ACCESS_KEY_ID: ${{ secrets.RUNTIME_CI_BOT_AWS_ACCESS_KEY_ID }}
#     AWS_SECRET_ACCESS_KEY: ${{ secrets.RUNTIME_CI_BOT_AWS_SECRET_ACCESS_KEY }}
#     AWS_DEFAULT_REGION: us-east-2  # this is needed for boto for some reason
#     DB_NAME: benchmark.db
#   run: |
#     aws s3 cp $DB_NAME s3://coiled-runtime-ci/benchmarks/
Thank you for being careful here :)
@ian-r-rose is it safe to uncomment this now? Unless there are any other things you want to check?
|
Raising an exception with a summary of what's failing seems to be working; see https://github.com/coiled/coiled-runtime/runs/7681316108?check_suite_focus=true. But I just realized that I wasn't using the latest run in the comparison, because the individual databases only get updated on main, so I modified that to test it. Separate comment: currently the threshold is (mean + 1 std), where the mean is computed over the latest 10 runs, excluding the one being evaluated for regression. That seems tight, although in some cases we will miss regressions due to averaging, as in test_download_throughput[s3fs] in https://coiled.github.io/coiled-runtime/coiled-upstream-py3.9.html |
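The threshold described above can be sketched in a few lines of pandas. This is a simplified illustration (the column handling and slicing in the real script differ; the series values here are made up):

```python
import pandas as pd

def regression_threshold(durations: pd.Series, n_history: int = 10, n_std: float = 1.0) -> float:
    """Mean + n_std * std over the last `n_history` runs,
    excluding the run currently being evaluated."""
    history = durations.iloc[-(n_history + 1):-1]  # drop the newest point
    return history.mean() + n_std * history.std()

# Ten stable runs followed by one slow run
durations = pd.Series([10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 10.0, 10.1, 9.9, 12.5])
threshold = regression_threshold(durations)
is_regression = durations.iloc[-1] > threshold  # True: 12.5 exceeds the threshold
```

With a 1-std threshold even modest noise in the last run can trip the check, which is the "tight" behavior described above.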
|
I'm having problems with getting the last value of the timeseries being the one of the run that just happened. It seems the database is not getting updated because I'm using the database artifact after the single DBs were merged and I keep getting the incorrect last value.
E runtime='coiled-0.0.3-py3.7', name='test_shuffle_parquet', duration_last=226.18878746032715, dur_threshold=223.97877921614503, and if I look at the last uploaded value of that test, I see that is the value being picked up. I'm not sure what I'm doing wrong here, or why the database is not updating locally after the combine step. @ian-r-rose do you have any idea what could be happening here? EDIT: I forgot to modify the workflow to push individual artifacts, hence the DB was not updating. |
detect_regressions.py
Outdated
# convert dict to dataframe to write to xml to use later ask ian if this is what he had in mind.
# df_stats = pandas.DataFrame.from_dict(stats_dict, orient="index")
# df_stats.to_xml("stats.xml")
This is what I had in mind: https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#adding-a-job-summary
Looks like it's unstructured markdown, not XML
So it seems that we can do df.to_markdown(). Now, after seeing the example in the docs, I assume that what you want in the summary is a summary of the regressions as a table, not all the stats for every test.
I think a nice markdown summary of the regressions is possible. I'm thinking something like: every row is a test_name, and the columns are [runtime_ver, regression_type, mean, last_val, last_val-1, last_val-2, threshold].
Would something like this work for a summary?
Sure, I don't have a strong idea of the best presentation, so if you come up with something that you think is useful, that would be a great starting point.
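A minimal sketch of what writing such a table to the job summary could look like, assuming a hypothetical `df_regressions` DataFrame (the column names and values below are made up, not from the real script):

```python
import os
import pandas as pd

# Hypothetical regression rows; the real script would build these from the db.
df_regressions = pd.DataFrame(
    {
        "runtime_ver": ["coiled-0.0.3-py3.7"],
        "regression_type": ["duration"],
        "last_val": [226.19],
        "threshold": [223.98],
    },
    index=["test_shuffle_parquet"],
)

# GitHub Actions renders anything appended to the $GITHUB_STEP_SUMMARY file
# as markdown on the run's summary page.
summary_path = os.environ.get("GITHUB_STEP_SUMMARY", "summary.md")
try:
    table = df_regressions.to_markdown()  # requires the `tabulate` package
except ImportError:
    table = df_regressions.to_string()  # plain-text fallback
with open(summary_path, "a") as f:
    f.write("## Regressions detected\n\n" + table + "\n")
```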
|
@ian-r-rose I know you mentioned that you preferred having this as a script instead of a test. I noticed that by running this as a script and not a test, we don't get the red on the CI run. If the reason behind having a script is that we want to be able to run this again, maybe when creating the static page and building a nice table there with all the stats, we could just save the dict with all the data, upload it as an artifact, and use it during the static-page step. Thoughts? |
I'm not sure I follow -- I see a red X on your regressions test here. It should be a failure if the last command of the script has a non-zero exit code (which your
I'm not against an artifact, but it does seem a little over-complicated to me, since it's not expensive to re-compute these values. Maybe if we start wanting to do a monte-carlo Bayesian changepoint analysis or something... |
Yes, I meant to say the traceback wasn't red like here for example. But I guess it's ok.
👍 |
Oh, interesting. I'm not sure how to accomplish that! I'll poke around google |
|
Looks like GitHub actions supports ANSI colors: https://github.com/ian-r-rose/gha-test/runs/7700189190?check_suite_focus=true You could try to use something like rich to format nice text output if you wanted to (with the |
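As a sketch, raw ANSI escape codes work without extra dependencies (the message text below is hypothetical); if using `rich` instead, `Console(force_terminal=True)` is needed because CI logs are not attached to a TTY:

```python
# \x1b[31m ... \x1b[0m is the ANSI escape sequence for red text;
# the GitHub Actions log viewer renders these colors.
RED = "\x1b[31m"
RESET = "\x1b[0m"

# Hypothetical regression messages; the real script builds these from the db.
regressions = ["test_shuffle_parquet: 226.19s > threshold 223.98s"]

report = f"{RED}Regressions detected ({len(regressions)}):{RESET}\n" + "\n".join(regressions)
print(report)
```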
run: |
  python detect_regressions.py

- name: Create regressions summary
It seems that this step doesn't run when the previous one failed. I might have to upload an artifact and create a separate workflow, not sure yet.
You can add an if: always() or similar condition to get around that.
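For reference, the condition goes on the step itself (the step body here is a hypothetical sketch):

```yaml
- name: Create regressions summary
  # always() makes this step run even when an earlier step failed
  if: always()
  run: |
    cat summary.md >> $GITHUB_STEP_SUMMARY
```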
|
I was able to get some extra color on the regression report: locally I see all the regressions in red, but in CI I don't. I was also able to get the markdown table: https://github.com/coiled/coiled-runtime/actions/runs/2820970035/attempts/1#summary-7734995707 |
|
After a conversation with @ian-r-rose:
I think this is ready for a thorough review. So far I haven't reverted the changes that allow things to run on this PR; that will be the last step. cc @jrbourbeau in case you want to take a look too. |
detect_regressions.py
Outdated
    f"\x1b[31m Regressions detected {len(regressions)}: \n{''.join(regressions)} \x1b[0m"
)
else:
    assert not regressions
This should never happen, since you've already checked if regressions above.
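Put differently, the `else` branch is unreachable once the `if regressions` check has been made; a simplified sketch of the same logic (the list contents here are hypothetical):

```python
def report_regressions(regressions: list) -> None:
    if regressions:
        raise Exception(
            f"\x1b[31m Regressions detected {len(regressions)}:\n{''.join(regressions)} \x1b[0m"
        )
    # Reaching this point already means there were no regressions,
    # so an `else: assert not regressions` branch would be redundant.

report_regressions([])  # no regressions: returns silently
```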
ian-r-rose
left a comment
Thanks @ncclementi! I'll take another look tomorrow when I'm fresher, but this looks like it's ready to start pestering us
detect_regressions.py
Outdated
metric_threshold = (
    df_test[metric][-13:-3].mean()
    + 2 * df_test[metric][-13:-3].std()
    + 1 * df_test[metric][-13:-3].std()
I'd still advocate for 2 standard deviations: a ~40% false positive rate would get annoying pretty quickly
Yes, I agree; I was just trying to trigger regressions on purpose to see things work. I'll revert this back soon.
|
Things work as expected, see https://github.com/coiled/coiled-runtime/runs/7810509623?check_suite_focus=true and the summary https://github.com/coiled/coiled-runtime/actions/runs/2847981410#summary-7810509623. I'll move to 2 standard deviations, as 1 is too strict, and I'll start removing things that were only for testing on this PR. |
.github/workflows/tests.yml
Outdated
    python detect_regressions.py

- name: Create regressions summary
  if: always() && github.ref == 'refs/heads/main' && github.repository == 'coiled/coiled-runtime'
I don't think we need the ref or repository check here
Do we want to run this on PRs too? I thought that since we are only updating the database on ref = main, we should keep this. Or will this be covered since the check is also at the step level?
Isn't this job gated by the same check above? I just mean that it's redundant.
That being said, I actually think that running this on PRs would be good practice. We mostly just don't want to upload the new db on PRs
|
process-results:
  needs: [runtime, benchmarks, stability]
  name: Combine separate benchmark results
So, if we wanted to run the regression check on PRs, I think we'd have to remove the github.ref check on process-results, but then add it to the "Upload benchmark db" step. Then regressions could just always() run.
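That restructuring might look roughly like this (a hypothetical sketch, not the actual workflow):

```yaml
# Gate only the upload on main, so the regression check can also run on PRs
- name: Upload benchmark db
  if: github.ref == 'refs/heads/main' && github.repository == 'coiled/coiled-runtime'
  env:
    DB_NAME: benchmark.db
  run: |
    aws s3 cp $DB_NAME s3://coiled-runtime-ci/benchmarks/

- name: Detect regressions
  if: always()
  run: |
    python detect_regressions.py
```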
|
This looks great! Let's watch CI to make sure things behave as we expect, then merge! |
final upload to s3
|
@ian-r-rose thanks for pushing that fix, it looks like we don't have regressions as of this PR. : ) |
|
Let's take it for a spin! |
