Generate a representative 100k sample by jameshadfield · Pull Request #1032 · nextstrain/ncov

jameshadfield · 2022-11-28T01:36:13Z

This sample (metadata + sequences) is useful for local testing / dev work. It is not expensive to compute this weekly, so a GitHub Action is set to run each Monday. No timestamps are used, as it's not intended to be accessed retrospectively.

jameshadfield · 2022-11-28T01:58:47Z

The GitHub Action is unavailable because it's introduced on this branch. An AWS build (triggered locally) is ongoing: AWS Batch Job ID: 05c94df1-a0ca-4209-92a1-6714aa89db58. CloudWatch logs here

corneliusroemer

Looks good. I suppose there's no harm in setting this up and testing/improving it as we go.

corneliusroemer · 2022-11-28T21:49:49Z

+    50k_late:
+      group_by: "year month region"
+      max_sequences: 50000
+      min_date: "--min-date 1Y"


Does this produce clean boundaries? Or will a certain month be (partially) double sampled?

Good question! cc @victorlin

I'll say that this sample isn't supposed to be truly representative for proper analyses -- it's just a way to test things locally and get an idea of what the output would look like.

I think if we get subsampling right, the "100k" dataset should produce nearly indistinguishable output for ncov global 6m (for example) compared to using the entire 14M sequence database.

This should produce clean boundaries since 1Y translates to an exact day.

Both --min-date and --max-date are inclusive so there may be double-sampling of that day, but no more than that.
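To make the inclusive-boundary point concrete, here is a minimal sketch. It assumes the early sample uses a `--max-date` at the same cutoff that the late sample uses for `--min-date`; the function names and the cutoff date are hypothetical and only illustrate the comparison, not augur's implementation.

```python
from datetime import date

def in_early(d: date, cutoff: date) -> bool:
    # Hypothetical "early" sample: an inclusive upper bound, like --max-date.
    return d <= cutoff

def in_late(d: date, cutoff: date) -> bool:
    # Hypothetical "late" sample: an inclusive lower bound, like --min-date.
    return d >= cutoff

# Illustrative cutoff only; the real value depends on when "1Y" is resolved.
cutoff = date(2021, 11, 28)
days = [date(2021, 11, 27), date(2021, 11, 28), date(2021, 11, 29)]

# Days eligible for both samples: with two inclusive bounds, only the cutoff day.
overlap = [d for d in days if in_early(d, cutoff) and in_late(d, cutoff)]
print(overlap)
```

Only the cutoff day itself satisfies both conditions, which is why the overlap from the date filters alone is limited to a single day.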

This should produce clean boundaries since 1Y translates to an exact day.

Assuming the month that the cutoff falls in is X, some genomes from X will end up in both samples (50k_early, 50k_late). Sampling is partitioned evenly by month, so 50k_early will contain ~50,000*1/24 samples from X and 50k_late will contain ~50,000*1/12 samples from X. This won't matter for the 6M build, but it's a little less than ideal.
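The arithmetic in that comment can be checked with a short sketch. The month counts (24 and 12) are taken from the comment itself; the helper name is hypothetical, and the even split across months is the comment's simplifying assumption rather than a guarantee about augur's behaviour.

```python
def per_month_quota(max_sequences: int, n_months: int) -> float:
    # Even split of the sample budget across month groups (simplifying assumption).
    return max_sequences / n_months

early_from_x = per_month_quota(50_000, 24)  # share of month X in 50k_early
late_from_x = per_month_quota(50_000, 12)   # share of month X in 50k_late

print(round(early_from_x), round(late_from_x), round(early_from_x + late_from_x))
```

Under these assumptions month X contributes roughly 2,083 genomes to the early sample and roughly 4,167 to the late one, so about 6,250 in total, compared with about 4,167 for a typical month in the late window.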

Oh you're right, yes there will be double sampling of the month due to --group-by month.

The behavior with the exact date cutoff is worth pointing out. For example:

--max-date 2022-03-08 --group-by month will sample the first week of March as much as each of the previous full months.

--min-date 2022-03-08 --group-by month will sample the remainder of March as much as each of the following full months.
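The effect of an exact date cutoff combined with monthly grouping can be sketched as a per-day sampling rate. This is a simplified model, not augur's actual algorithm: it assumes every month group receives the same quota and that enough sequences are available on every day. The function names, the start date and the budget of 3,000 are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

def days_per_month(start: date, end: date) -> Counter:
    """Count the days in [start, end] (inclusive) that fall in each (year, month)."""
    counts = Counter()
    d = start
    while d <= end:
        counts[(d.year, d.month)] += 1
        d += timedelta(days=1)
    return counts

def per_day_quota(start: date, end: date, max_sequences: int) -> dict:
    """Equal quota per month group, divided by the days of that month inside the window."""
    counts = days_per_month(start, end)
    per_group = max_sequences / len(counts)
    return {month: per_group / n_days for month, n_days in counts.items()}

# Window ending on 2022-03-08, as in the --max-date example above.
quota = per_day_quota(date(2022, 1, 1), date(2022, 3, 8), 3000)
for month, rate in quota.items():
    print(month, round(rate, 1))
```

In this model the eight days of March share the same quota as all 31 days of January, so each March day is sampled roughly four times as densely.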

Thanks for the clarifications! Added a comment to the build YAML explaining this.

P.S. This isn't just for our builds, it's going to happen everywhere. Even a simple single-sample scheme grouped by year month will oversample the current month unless we are running things on the final day of the month.

@corneliusroemer

A subsampled input dataset (metadata + sequences) is useful for local testing / development purposes. It is not expensive to compute this weekly, so a GitHub action is added to run each Monday. (No timestamps are used as it's not intended to be accessed retrospectively.) Helpful review comments and code from @corneliusroemer, @trvrb, @victorlin and @tsibley (#1032). Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>

jameshadfield · 2023-04-10T04:12:46Z

Finally got back to this. I have incorporated all comments and am currently running on AWS. Update: succeeded after 2h40m; based on this price page, that's around $2 of compute.

I'll probably add a 21L version to this builds.yml, but will do so after the merge of #1029.

@corneliusroemer

A subsampled input dataset (metadata + sequences) is useful for local testing / development purposes. It is not expensive to compute this weekly, so a GitHub action is added to run each Monday. (No timestamps are used as it's not intended to be accessed retrospectively.) There was a previous `config["upload"]` parameter which was only used by nextstrain team builds and was removed in 9abad2a (December 2021). This parameter remained in some build YAMLs and has been removed here. The new usage of the parameter is now documented. Helpful review comments and code from @corneliusroemer, @trvrb, @victorlin and @tsibley (#1032). Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>

jameshadfield commented Nov 28, 2022

Comment thread nextstrain_profiles/100k/config.yaml Outdated

jameshadfield force-pushed the 100k-sample branch from 95c5942 to 329b2f6 on November 28, 2022 21:12

jameshadfield requested a review from a team November 28, 2022 21:17

corneliusroemer reviewed Nov 28, 2022

trvrb reviewed Nov 28, 2022

Comment thread nextstrain_profiles/100k/config.yaml Outdated

tsibley reviewed Dec 1, 2022

Comment thread nextstrain_profiles/100k/README.md Outdated

tsibley reviewed Dec 1, 2022

Comment thread nextstrain_profiles/100k/README.md Outdated

tsibley suggested changes Dec 1, 2022

Comment thread .github/workflows/rebuild-100k.yml Outdated

Comment thread workflow/snakemake_rules/common.smk Outdated

corneliusroemer assigned tsibley and unassigned tsibley Mar 30, 2023

jameshadfield force-pushed the 100k-sample branch from 329b2f6 to a41a9f6 on April 10, 2023 03:48

jameshadfield force-pushed the 100k-sample branch from a41a9f6 to 6b7c9e9 on April 10, 2023 22:22

tsibley mentioned this pull request Apr 12, 2023

Include 21L rooted Nextstrain build for GISAID data #1029

Merged

jameshadfield merged commit 16b34cb into master Apr 16, 2023

jameshadfield deleted the 100k-sample branch April 16, 2023 21:51
