
Commit c334b31

Expand subsampling docs
Add an example for --subsample-max-sequences and describe a scenario which leads to actual size differing from target size.
1 parent 20a0f3b commit c334b31

1 file changed: +31 -0 lines changed

src/guides/bioinformatics/filtering-and-subsampling.rst

Lines changed: 31 additions & 0 deletions
@@ -132,6 +132,37 @@ sequence per month from each country:
     --output-sequences subsampled_sequences.fasta \
     --output-metadata subsampled_metadata.tsv

An alternative to ``--sequences-per-group`` is ``--subsample-max-sequences``.
This is useful if you don't know how many groups the metadata will be
partitioned into but you have a target sample size. For example, to target 100
total sequences:

.. code-block:: bash

   augur filter \
     --sequences data/sequences.fasta \
     --metadata data/metadata.tsv \
     --min-date 2012 \
     --exclude exclude.txt \
     --group-by country year month \
     --subsample-max-sequences 100 \
     --output-sequences subsampled_sequences.fasta \
     --output-metadata subsampled_metadata.tsv

``augur filter`` automatically determines a value for
``--sequences-per-group`` based on the number of available groups, then
samples uniformly across those groups.

.. note::

   For both options, the number of targeted sequences per group does not take
   into account the number of sequences actually available in the input data,
   so the output can be smaller than the target. For example, consider a
   dataset with 200 sequences available from 2023 and 100 sequences available
   from 2024. With these two groups,
   ``--group-by year --subsample-max-sequences 300`` is equivalent to
   ``--group-by year --sequences-per-group 150``. This will take 150 sequences
   from 2023 and all 100 sequences from 2024 for a total of 250 sequences,
   which is fewer than the target of 300.

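The arithmetic behind this shortfall can be sketched in shell. This is a
simplified model for illustration, not the implementation used by
``augur filter``: ``simulate_subsample`` is a hypothetical helper, and the even
split of the target across groups by integer division is an assumption.

.. code-block:: bash

   # Simplified model of the shortfall described above; NOT augur's implementation.
   # simulate_subsample is a hypothetical helper: the first argument is the
   # target total, the remaining arguments are the sequences available per group.
   simulate_subsample() {
     target=$1
     shift
     # Assumption: the target is split evenly across groups (integer division).
     per_group=$(( target / $# ))
     total=0
     for available in "$@"; do
       # A group cannot contribute more sequences than it contains.
       if [ "$available" -lt "$per_group" ]; then
         total=$(( total + available ))
       else
         total=$(( total + per_group ))
       fi
     done
     echo "$total"
   }

   # 200 sequences from 2023 and 100 from 2024, with a target of 300.
   simulate_subsample 300 200 100   # prints 250

To compare the actual size with the target on real output, counting the FASTA
header lines, for example with ``grep -c '^>' subsampled_sequences.fasta``,
gives the number of sequences written.
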
Subsampling using multiple ``augur filter`` commands
====================================================
