
Introducing sample2: fix potential sampling bias issue and ensure deterministic -n behavior#566

Merged
shenwei356 merged 5 commits into shenwei356:master from stahiga:dev-sample-algo
Jan 4, 2026

Conversation

@stahiga
Contributor

@stahiga stahiga commented Jan 4, 2026

This PR introduces a new subcommand sample2 to address a few issues in the existing sample subcommand:

Problems with existing sample -n

The current implementation uses proportion-based Bernoulli sampling with an amplification factor and early termination. This approach has several limitations:

  1. Sampling bias: Records appearing earlier in the input have a higher probability of being selected.
  2. Non-deterministic sample size: Does not guarantee an exact target count.
  3. Limited control: Two-pass mode (-2) with proportion sampling (-p) doesn't provide explicit number control.
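
The first two points can be illustrated with a small simulation. This is a minimal sketch, not the seqkit code: the function name `bernoulli_early_stop`, the amplification factor of 1.1, and the toy sizes (100 records, target of 10) are assumptions chosen only to show what happens when per-record coin flips are combined with an early stop.

```python
import random
from collections import Counter


def bernoulli_early_stop(total, n, rng, amplify=1.1):
    """Return indices kept by amplified Bernoulli sampling that stops at n picks.

    `amplify` is a hypothetical factor; the value used by `sample` may differ.
    """
    p = min(1.0, amplify * n / total)
    picked = []
    for idx in range(total):
        if rng.random() < p:
            picked.append(idx)
            if len(picked) == n:  # early termination: later records are never seen
                break
    return picked


def simulate(trials=5000, total=100, n=10, seed=1):
    """Repeat the sampling and tally per-record hits and output sizes."""
    rng = random.Random(seed)
    hits = Counter()   # how often each record index was selected
    sizes = Counter()  # how often each output size occurred
    for _ in range(trials):
        picked = bernoulli_early_stop(total, n, rng)
        hits.update(picked)
        sizes[len(picked)] += 1
    return hits, sizes


if __name__ == "__main__":
    hits, sizes = simulate()
    print("first record selected:", hits[0], "times")
    print("last record selected: ", hits[99], "times")
    short = sum(count for size, count in sizes.items() if size < 10)
    print("runs that returned fewer than 10 records:", short)
```

In this toy setup the last record is only selected when fewer than 10 earlier records were kept, so it is picked less often than the first one, and a run that never reaches 10 picks returns a short sample.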

sample2 subcommand solution

-n -2 (Reservoir sampling for fixed-size sampling)

  • Provides unbiased, fixed-size sampling with controlled memory usage
  • Guarantees exact target count with equal probability for each record
  • Memory efficient: tested on large datasets, with memory usage far below the output size
    • 2,195,354 records: <200 MB memory usage (output: 38 GB long-read FASTQ)
    • 124,437,023 records: 2.05 GB memory usage (output: 43 GB short-read FASTQ)
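
For readers unfamiliar with the technique, here is a minimal sketch of two-pass reservoir sampling (Algorithm R). It assumes the first pass samples record indices and the second pass emits the matching records; the PR text does not spell out how the two passes are divided, so treat that split, the helper names, and the seed value as illustrative assumptions.

```python
import random


def reservoir_sample_indices(records, k, rng):
    """Algorithm R: pick k record indices uniformly at random in one pass."""
    reservoir = []
    for i, _ in enumerate(records):
        if i < k:
            reservoir.append(i)        # fill the reservoir with the first k indices
        else:
            j = rng.randint(0, i)      # inclusive on both ends: i + 1 choices
            if j < k:                  # happens with probability k / (i + 1)
                reservoir[j] = i
    return set(reservoir)


def two_pass_sample(open_records, k, seed=11):
    """`open_records` is a hypothetical callable that returns a fresh iterator,
    standing in for re-reading the input file on the second pass."""
    rng = random.Random(seed)
    chosen = reservoir_sample_indices(open_records(), k, rng)           # pass 1
    return [rec for i, rec in enumerate(open_records()) if i in chosen]  # pass 2


if __name__ == "__main__":
    reads = [f"read{i}" for i in range(1000)]
    sample = two_pass_sample(lambda: iter(reads), 5)
    print(len(sample), sample)
```

Only the k chosen indices are held in memory in this sketch, and the second pass keeps the records in input order. If the input has fewer than k records, all of them are returned.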

-n (Partial in-memory shuffle)

  • Produces exactly the requested number of records
  • Loads the complete dataset into memory and applies a partial shuffle for unbiased sampling
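
A partial shuffle can be sketched as a Fisher-Yates shuffle that stops after the first k positions. This is an illustration under the assumption that all records fit in a list; the function name is hypothetical, and whether sample2 restores input order afterwards is not stated in this PR.

```python
import random


def partial_shuffle_sample(records, k, rng):
    """Pick k records uniformly without replacement via a partial Fisher-Yates shuffle."""
    pool = list(records)               # the whole dataset is held in memory
    k = min(k, len(pool))
    for i in range(k):
        j = rng.randrange(i, len(pool))        # uniform over the untouched tail [i, len)
        pool[i], pool[j] = pool[j], pool[i]    # move the chosen record to position i
    return pool[:k]                    # the first k slots now hold the sample


if __name__ == "__main__":
    reads = [f"read{i}" for i in range(20)]
    print(partial_shuffle_sample(reads, 5, random.Random(11)))
```

Only k swaps are needed rather than a full shuffle, and the result always has exactly min(k, number of records) distinct entries.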

-p -2 (Two-pass reservoir sampling)

  • Two-pass mode now takes effect when -p is specified with -2
  • Target sample size is explicitly defined as floor(p × n)
  • Combines proportion flexibility with exact count guarantee
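
As a worked example of the floor(p × n) rule, the sketch below counts records in a first pass and derives the target size. The helper names are hypothetical, and the validation of p is an assumption rather than documented behavior.

```python
import math


def count_records(records):
    """First pass: count the records without storing them."""
    return sum(1 for _ in records)


def target_size(p, n):
    """Exact sample size for proportion p over n records: floor(p * n)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.floor(p * n)


if __name__ == "__main__":
    n = count_records(iter(range(1001)))
    print(target_size(0.25, n))  # 0.25 * 1001 = 250.25, so the target is 250
```

The resulting integer is then used as the fixed sample size for the second pass. Since p × n is a floating-point product, an implementation may need to guard against rounding error near integer boundaries.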

-p (Original behavior)

  • Retains the original Bernoulli sampling behavior for backward compatibility
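
For contrast with the fixed-size modes, plain Bernoulli sampling can be sketched as follows. This is a generic illustration of the technique, not code taken from seqkit; the function name and seed are assumptions.

```python
import random


def bernoulli_sample(records, p, rng):
    """Yield each record independently with probability p in a single streaming pass."""
    for rec in records:
        if rng.random() < p:
            yield rec


if __name__ == "__main__":
    kept = list(bernoulli_sample(range(10000), 0.1, random.Random(11)))
    print(len(kept))  # close to 1000, but the exact count varies with the seed
```

Each record is treated the same way, so there is no positional bias, but the output size follows a binomial distribution instead of being fixed.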

Backward compatibility

The original sample subcommand remains unchanged to ensure backward compatibility.

@shenwei356 shenwei356 merged commit 94203ac into shenwei356:master Jan 4, 2026
@shenwei356
Owner

Thanks!

@shenwei356
Owner

Thanks a lot!

@stahiga
Contributor Author

stahiga commented Jan 4, 2026

You're welcome! Glad to contribute to the project.

shenwei356 added a commit that referenced this pull request Jan 4, 2026
@shenwei356
Owner

I've also updated the help message to recommend that users use seqkit sample2.
