Expand subsampling docs

victorlin · victorlin · commit c334b314ee32 · 2024-08-16T14:31:49.000-07:00
Add an example for --subsample-max-sequences and describe a scenario
which leads to actual size differing from target size.
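
The scenario in the added note (200 sequences from 2023, 100 from 2024, target of 300) can be sanity-checked with a small sketch. This is not ``augur filter``'s implementation. It assumes only what the note states: the target is split evenly across groups without looking at group sizes, and a group with fewer sequences than its share contributes everything it has. The function name ``expected_sample_size`` is hypothetical.

```python
def expected_sample_size(available_per_group, target_total):
    """Return (per_group_target, actual_total) under the simplified model in the note.

    Assumption: the per-group target is the total target divided evenly by the
    number of groups, ignoring how many sequences each group actually has.
    """
    per_group_target = target_total // len(available_per_group)
    # A group can contribute at most the sequences it actually contains.
    actual_total = sum(
        min(available, per_group_target)
        for available in available_per_group.values()
    )
    return per_group_target, actual_total


# Scenario from the note: --group-by year --subsample-max-sequences 300
available = {"2023": 200, "2024": 100}
per_group, total = expected_sample_size(available, target_total=300)
print(per_group, total)  # 150 250
```

The shortfall of 50 comes entirely from 2024, which has only 100 of the 150 sequences targeted for it.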
diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -132,6 +132,37 @@ sequence per month from each country:
      --output-sequences subsampled_sequences.fasta \
      --output-metadata subsampled_metadata.tsv
 
+An alternative to ``--sequences-per-group`` is ``--subsample-max-sequences``.
+This is useful if you don't know how many groups the metadata will be
+partitioned into but you have a target sample size. For example, to target 100
+total sequences:
+
+.. code-block:: bash
+
+   augur filter \
+     --sequences data/sequences.fasta \
+     --metadata data/metadata.tsv \
+     --min-date 2012 \
+     --exclude exclude.txt \
+     --group-by country year month \
+     --subsample-max-sequences 100 \
+     --output-sequences subsampled_sequences.fasta \
+     --output-metadata subsampled_metadata.tsv
+
+``augur filter`` will automatically determine a value for
+``--sequences-per-group`` based on the number of available groups and then
+sample uniformly across those groups.
+
+.. note::
+
+   For both options, the targeted number of sequences per group does not take
+   into account the number of sequences actually available in each group. For
+   example, consider a dataset with 200 sequences from 2023 and 100 sequences
+   from 2024. ``--group-by year --subsample-max-sequences 300`` is equivalent
+   to ``--group-by year --sequences-per-group 150``. This takes 150 sequences
+   from 2023 and all 100 sequences from 2024, for a total of 250 sequences,
+   which is less than the target of 300.
+
 Subsampling using multiple ``augur filter`` commands
 ====================================================