Generate a representative 100k sample#1032
Conversation
The GitHub Action is unavailable as it's introduced on this branch. An AWS build (triggered locally) is ongoing:
Force-pushed 95c5942 to 329b2f6
corneliusroemer left a comment
Looks good, I suppose there's no harm in setting this up and testing/improving as we go.
```yaml
50k_late:
  group_by: "year month region"
  max_sequences: 50000
  min_date: "--min-date 1Y"
```
Does this produce clean boundaries? Or will a certain month be (partially) double-sampled?
I'll say that this sample isn't supposed to be truly representative for proper analyses -- it's just a way to test things locally and get an idea of what the output would look like.
I think if we get subsampling right, the "100k" dataset should produce nearly indistinguishable output for ncov global 6m (for example) compared with using the entire 14M-sequence database.
This should produce clean boundaries since 1Y translates to an exact day.
Both --min-date and --max-date are inclusive so there may be double-sampling of that day, but no more than that.
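A minimal sketch of the inclusive-bounds behavior, assuming for illustration that `1Y` resolves to exactly 365 days before a fixed "today" (the real resolution is done by the filtering tool, not this snippet):

```python
from datetime import date, timedelta

# Assumption: "1Y" resolves to one exact calendar day, here taken as
# 365 days before a fixed "today" for illustration.
today = date(2023, 3, 8)
cutoff = today - timedelta(days=365)  # 2022-03-08

dates = [cutoff - timedelta(days=1), cutoff, cutoff + timedelta(days=1)]

# Both bounds are inclusive, so the cutoff day itself passes both filters
# and can appear in both samples.
early = [d for d in dates if d <= cutoff]  # mimics --max-date
late = [d for d in dates if d >= cutoff]   # mimics --min-date

overlap = sorted(set(early) & set(late))
print(overlap)  # [datetime.date(2022, 3, 8)]
```

Only the single cutoff day lands in both samples; every other day falls cleanly on one side.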
> This should produce clean boundaries since 1Y translates to an exact day.
Assuming the month that day falls in is X, some genomes from X will end up in both samples (50k_early, 50k_late). Since grouping partitions things evenly by month, 50k_early will contain ~50,000*1/24 sequences from X and 50k_late ~50,000*1/12. This won't matter for the 6M build, but it's a little less than ideal.
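The back-of-envelope arithmetic above, assuming 50k_early spreads its quota evenly over 24 months and 50k_late over 12:

```python
# Expected sequences drawn from the boundary month X, per the estimate above
# (assumes perfectly even per-month allocation, which is an idealization).
MAX_SEQUENCES = 50_000

from_x_in_early = MAX_SEQUENCES * 1 / 24  # contribution of X to 50k_early
from_x_in_late = MAX_SEQUENCES * 1 / 12   # contribution of X to 50k_late

print(round(from_x_in_early), round(from_x_in_late))  # 2083 4167
```

So month X is represented roughly 1.5x overall across the two samples relative to any other month.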
Oh you're right, yes there will be double sampling of the month due to `--group-by month`.
The behavior with the exact date cutoff is worth pointing out. For example:
- `--max-date 2022-03-08 --group-by month` will sample the first week of March as much as each of the previous full months.
- `--min-date 2022-03-08 --group-by month` will sample the remainder of March as much as each of the following full months.
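A toy simulation of that cutoff effect (not the real tool; the quota and dates are illustrative): grouping by month with an equal per-group quota samples March's eight-day stub as heavily as a full month.

```python
from collections import Counter
from datetime import date, timedelta
import random

random.seed(0)

# One candidate sequence per day of 2022, filtered to --max-date 2022-03-08.
days = [date(2022, 1, 1) + timedelta(days=i) for i in range(365)]
kept = [d for d in days if d <= date(2022, 3, 8)]

available = Counter(d.month for d in kept)
print(available)  # Counter({1: 31, 2: 28, 3: 8})

# Equal per-month quota, mimicking --group-by month with a fixed target.
QUOTA = 5
sampled = {m: random.sample([d for d in kept if d.month == m], min(QUOTA, n))
           for m, n in available.items()}

# Per-day inclusion probability: January 5/31 ~ 0.16 vs March 5/8 ~ 0.62,
# so days in the partial month are heavily oversampled.
for m, n in available.items():
    print(m, round(min(QUOTA, n) / n, 2))
```

The same logic applied to `--min-date` oversamples the trailing stub of the cutoff month instead.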
Thanks for the clarifications! Added a comment to the build YAML explaining this.
P.S. This isn't just for our builds, it's going to happen everywhere. Even a simple single-sample scheme grouped by year month will oversample the current month unless we are running things on the final day of the month.
A subsampled input dataset (metadata + sequences) is useful for local testing / development purposes. It is not expensive to compute this weekly, so a GitHub action is added to run each Monday. (No timestamps are used as it's not intended to be accessed retrospectively.) Helpful review comments and code from @corneliusroemer, @trvrb, @victorlin and @tsibley (#1032). Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
Force-pushed 329b2f6 to a41a9f6
Finally got back to this. I've incorporated all comments and am currently running on AWS. Update: succeeded after 2h40m; based on this price page, that's around $2 of compute. I'll probably add a 21L version to this.
A subsampled input dataset (metadata + sequences) is useful for local testing / development purposes. It is not expensive to compute this weekly, so a GitHub action is added to run each Monday. (No timestamps are used as it's not intended to be accessed retrospectively.) There was a previous `config["upload"]` parameter which was only used by Nextstrain team builds and was removed in 9abad2a (December 2021). This parameter remained in some build YAMLs and has been removed here. The new usage of the parameter is now documented. Helpful review comments and code from @corneliusroemer, @trvrb, @victorlin and @tsibley (#1032). Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
Force-pushed a41a9f6 to 6b7c9e9
This sample (metadata + sequences) is useful for local testing / dev work. It is not expensive to compute this weekly, so a GitHub Actions workflow is set to run each Monday. No timestamps are used as it's not intended to be accessed retrospectively.