@@ -132,6 +132,37 @@ sequence per month from each country:
132132 --output-sequences subsampled_sequences.fasta \
133133 --output-metadata subsampled_metadata.tsv
134134
135+ An alternative to ``--sequences-per-group`` is ``--subsample-max-sequences``.
136+ This is useful if you don't know how many groups the metadata will be
137+ partitioned into but you have a target sample size. For example, to target 100
138+ total sequences:
139+
140+ .. code-block:: bash
141+
142+ augur filter \
143+ --sequences data/sequences.fasta \
144+ --metadata data/metadata.tsv \
145+ --min-date 2012 \
146+ --exclude exclude.txt \
147+ --group-by country year month \
148+ --subsample-max-sequences 100 \
149+ --output-sequences subsampled_sequences.fasta \
150+ --output-metadata subsampled_metadata.tsv
151+
152+ ``augur filter`` will automatically determine a value for
153+ ``--sequences-per-group`` based on the number of available groups and sample
154+ uniformly.
155+
156+ .. note::
157+
158+ For these options, the number of targeted sequences per group does not take
159+ into account the actual number of sequences available in the input data. For
160+ example, consider a dataset with 200 sequences available from 2023 and 100
161+ sequences available from 2024. ``--group-by year --subsample-max-sequences 300``
162+ is equivalent to ``--group-by year --sequences-per-group 150``. This
163+ will take 150 sequences from 2023 and all 100 sequences from 2024 for a total
164+ of 250 sequences, which is fewer than the target of 300.
165+
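The arithmetic in the note above can be sketched in plain shell. This is an illustration only, not how ``augur filter`` is implemented: the variable names are hypothetical, the counts are the ones from the note, and the per-group target is assumed to be the overall target divided evenly across groups.

```bash
# Illustrative arithmetic only; this does not call or reproduce augur.
# Hypothetical input: "group:available" pairs taken from the note above.
max_sequences=300
groups="2023:200 2024:100"

# Count the groups.
n_groups=0
for g in $groups; do
  n_groups=$(( n_groups + 1 ))
done

# Assumed per-group target: the overall target split evenly across groups,
# without looking at how many sequences each group actually has.
per_group=$(( max_sequences / n_groups ))

total=0
for g in $groups; do
  year=${g%%:*}    # text before the colon
  avail=${g##*:}   # text after the colon
  # A group can contribute at most the sequences it has available.
  if [ "$avail" -lt "$per_group" ]; then
    taken=$avail
  else
    taken=$per_group
  fi
  echo "$year: took $taken of $avail"
  total=$(( total + taken ))
done

echo "total: $total (target: $max_sequences)"
```

With these counts the per-group target is 150 and the total is 250, matching the note. Any group smaller than the per-group target pulls the total below ``--subsample-max-sequences``.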
135166Subsampling using multiple ``augur filter`` commands
136167====================================================
137168