Generate a representative 100k sample #1032

Merged
jameshadfield merged 1 commit into master from 100k-sample
Apr 16, 2023

Conversation

@jameshadfield
Member

This sample (metadata + sequences) is useful for local testing / dev work. It is not expensive to compute weekly, so a GitHub Action is set to run each Monday. No timestamps are used as it's not intended to be accessed retrospectively.

@jameshadfield
Member Author

jameshadfield commented Nov 28, 2022

GitHub action is unavailable as it's introduced on this branch. AWS build (triggered locally) ongoing: AWS Batch Job ID: 05c94df1-a0ca-4209-92a1-6714aa89db58. Cloudwatch logs here

Comment thread nextstrain_profiles/100k/config.yaml Outdated
Member

@corneliusroemer corneliusroemer left a comment

Looks good. I suppose there's no harm in setting this up and testing/improving as we go.

Comment thread .github/workflows/rebuild-100k.yml Outdated
Comment thread .github/workflows/rebuild-100k.yml Outdated
Comment thread .github/workflows/rebuild-100k.yml Outdated
Comment thread .github/workflows/rebuild-100k.yml Outdated
Comment thread .github/workflows/rebuild-100k.yml
Comment thread nextstrain_profiles/100k/README.md
50k_late:
  group_by: "year month region"
  max_sequences: 50000
  min_date: "--min-date 1Y"
Member

Does this produce clean boundaries? Or will a certain month be (partially) double sampled?

Member Author

Good question! cc @victorlin

Member Author

I'll say that this sample isn't supposed to be truly representative for proper analyses -- it's just a way to test things locally and get an idea of what the output would look like.

Member

I think if we get subsampling right, the "100k" dataset should produce output for ncov global 6m (for example) that is nearly indistinguishable from using the entire 14M sequence database.

Member

This should produce clean boundaries since 1Y translates to an exact day.

Both --min-date and --max-date are inclusive so there may be double-sampling of that day, but no more than that.
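
The single-day overlap can be checked with a small sketch. It assumes that `1Y` resolves to the same calendar day one year earlier, and that the early sample uses a matching `--max-date 1Y`; neither is confirmed in this thread. The function names and the example run date are hypothetical.

```python
from datetime import date, timedelta


def one_year_before(today: date) -> date:
    """Assumed meaning of '1Y': the same calendar day, one year earlier."""
    try:
        return today.replace(year=today.year - 1)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        return today.replace(year=today.year - 1, day=28)


def in_early(d: date, cutoff: date) -> bool:
    # --max-date is inclusive, so the cutoff day itself passes.
    return d <= cutoff


def in_late(d: date, cutoff: date) -> bool:
    # --min-date is inclusive, so the cutoff day itself passes.
    return d >= cutoff


# Hypothetical run date; any date works.
cutoff = one_year_before(date(2022, 11, 30))
days = [cutoff + timedelta(days=offset) for offset in range(-2, 3)]
overlap = [d for d in days if in_early(d, cutoff) and in_late(d, cutoff)]
print(overlap)  # only the cutoff day appears in both samples
```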

Member Author

@jameshadfield jameshadfield Nov 30, 2022

> This should produce clean boundaries since 1Y translates to an exact day.

Assuming the month that day falls in is X, some genomes from X will end up in both samples (50k_early, 50k_late). Sampling is partitioned evenly by month, so 50k_early will have ~50,000*1/24 samples from X and 50k_late will have ~50,000*1/12. Won't matter for the 6M build, but a little less than ideal.
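
As a rough worked version of that arithmetic, the sketch below uses the approximate month counts from the comment (24 month groups in 50k_early, 12 in 50k_late) and ignores the `region` part of the grouping. The function name is hypothetical, and the even split per month is a simplification of how the sampler allocates sequences.

```python
MAX_SEQUENCES = 50_000


def per_month_quota(max_sequences: int, n_months: int) -> float:
    """Even split of one sample across its month groups (regions ignored)."""
    return max_sequences / n_months


# Month X straddles the cutoff, so it is a group in both samples.
early_share = per_month_quota(MAX_SEQUENCES, 24)  # contribution from 50k_early
late_share = per_month_quota(MAX_SEQUENCES, 12)   # contribution from 50k_late
total_for_x = early_share + late_share

print(round(early_share), round(late_share), round(total_for_x))
# prints: 2083 4167 6250
```

Under these assumptions month X ends up with roughly 6,250 sequences, about 1.5 times the allocation of any other month in the most recent year.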

Member

Oh you're right, yes there will be double sampling of the month due to --group-by month.

The behavior with the exact date cutoff is worth pointing out. For example:

  • --max-date 2022-03-08 --group-by month will sample the first week of March as much as each of the previous full months.
  • --min-date 2022-03-08 --group-by month will sample the remainder of March as much as each of the following full months.
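
The effect of the two cutoffs above can be estimated with a minimal sketch. It assumes every month group receives the same quota and that sequences are spread evenly across days, which real data will not do; the helper names are hypothetical and not part of any existing tool.

```python
import calendar
from datetime import date
from typing import Optional


def days_in_window(year: int, month: int,
                   min_date: Optional[date] = None,
                   max_date: Optional[date] = None) -> int:
    """Count the days of a month that pass the inclusive date filters."""
    last_day = calendar.monthrange(year, month)[1]
    start = date(year, month, 1)
    end = date(year, month, last_day)
    if min_date is not None:
        start = max(start, min_date)
    if max_date is not None:
        end = min(end, max_date)
    return max((end - start).days + 1, 0)


def relative_density(year: int, month: int,
                     min_date: Optional[date] = None,
                     max_date: Optional[date] = None) -> float:
    """Per-day sampling density of a truncated month relative to the same
    month fully covered, assuming an identical per-month quota."""
    covered = days_in_window(year, month, min_date, max_date)
    if covered == 0:
        raise ValueError("month lies entirely outside the date window")
    return calendar.monthrange(year, month)[1] / covered


cutoff = date(2022, 3, 8)
print(relative_density(2022, 3, max_date=cutoff))  # March 1-8: 31 / 8
print(relative_density(2022, 3, min_date=cutoff))  # March 8-31: 31 / 24
```

With these assumptions, the eight days kept by `--max-date` are sampled almost four times as densely per day as a full March would be, while the 24 days kept by `--min-date` are only mildly oversampled.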

Member Author

Thanks for the clarifications! Added a comment to the build YAML explaining this.

P.S. This isn't just for our builds, it's going to happen everywhere. Even a simple single-sample scheme grouped by year month will oversample the current month unless we are running things on the final day of the month.

Comment thread nextstrain_profiles/100k/config.yaml Outdated
Comment thread nextstrain_profiles/100k/README.md Outdated
Comment thread nextstrain_profiles/100k/README.md Outdated
Comment thread .github/workflows/rebuild-100k.yml Outdated
Comment thread workflow/snakemake_rules/common.smk Outdated
@corneliusroemer corneliusroemer assigned tsibley and unassigned tsibley Mar 30, 2023
jameshadfield added a commit that referenced this pull request Apr 10, 2023
A subsampled input dataset (metadata + sequences) is useful for local
testing / development purposes. It is not expensive to compute this
weekly, so a GitHub action is added to run each Monday.
(No timestamps are used as it's not intended to be accessed
retrospectively.)

Helpful review comments and code from @corneliusroemer, @trvrb,
@victorlin and @tsibley (#1032).

Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
@jameshadfield
Member Author

jameshadfield commented Apr 10, 2023

Finally got back to this. Have incorporated all comments & am currently running on AWS. Update: succeeded after 2h40m; based on this price page, that's around $2 of compute.

I'll probably add a 21L version to this builds.yml, but will do so after the merge of #1029.

A subsampled input dataset (metadata + sequences) is useful for local
testing / development purposes. It is not expensive to compute this
weekly, so a GitHub action is added to run each Monday.
(No timestamps are used as it's not intended to be accessed
retrospectively.)

There was a previous `config["upload"]` parameter which was only used
by nextstrain team builds and was removed in 9abad2a
(December 2021). This parameter remained in some build YAMLs and has
been removed here. The new usage of the parameter is now documented.

Helpful review comments and code from @corneliusroemer, @trvrb,
@victorlin and @tsibley (#1032).

Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
5 participants