Skip to content

Sampling sequences by number is not truly random #386

@alvanuffelen

Description

@alvanuffelen

Prerequisites

  • make sure you're are using the latest version by seqkit version
  • read the usage

Describe your issue

  • describe the problem
    When using seqkit sample -n, the first number of sequences in the file will a higher chance of being picked than later sequences.
  • provide a reproducible example
    Porting your logic:

    seqkit/seqkit/cmd/sample.go

    Lines 155 to 163 in 22f71ff

    proportion = float64(number) / float64(len(records))
    for _, record := range records {
    // if <-randg <= proportion {
    if rand.Float64() <= proportion {
    n++
    record.FormatToWriter(outfh, config.LineWidth)
    if n == number {
    break

    to Python:
import random
import matplotlib.pyplot as plt

# Toy data to sample from
sample = range(1, 101)

# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}

# Iterate a million times
for _ in range(1_000_000):
    # Counter for the number of selected elements in each iteration
    added_element_counter = 0
    for element in sample:
        if random.random() <= proportion:
            element_count[element] += 1
            added_element_counter += 1
        if number == added_element_counter:
            break

plt.bar(element_count.keys(), height=element_count.values())
plt.show()

image

A solution could be to shuffle the records before iterating over them:

import random
import matplotlib.pyplot as plt

# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}
for _ in range(1_000_000):
    # Toy data to sample from
    sample = list(range(1, 101))
    # Shuffle the list each iteration
    random.shuffle(sample)
    # Counter for the number of selected elements in each iteration
    added_element_counter = 0
    for element in sample:
        if random.random() <= proportion:
            element_count[element] += 1
            added_element_counter += 1
        if number == added_element_counter:
            break

plt.bar(element_count.keys(), height=element_count.values())
plt.xlabel('Element')
plt.ylabel('Number of times element was picked.')
plt.tight_layout()
plt.show()

image

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions