-
Notifications
You must be signed in to change notification settings - Fork 176
Closed
Description
Prerequisites
- make sure you're are using the latest version by
seqkit version - read the usage
Describe your issue
- describe the problem
When usingseqkit sample -n, the first number of sequences in the file will a higher chance of being picked than later sequences. - provide a reproducible example
Porting your logic:
Lines 155 to 163 in 22f71ff
proportion = float64(number) / float64(len(records)) for _, record := range records { // if <-randg <= proportion { if rand.Float64() <= proportion { n++ record.FormatToWriter(outfh, config.LineWidth) if n == number { break
to Python:
import random
import matplotlib.pyplot as plt
# Toy data to sample from
sample = range(1, 101)
# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}
# Iterate a million times
for _ in range(1_000_000):
# Counter for the number of selected elements in each iteration
added_element_counter = 0
for element in sample:
if random.random() <= proportion:
element_count[element] += 1
added_element_counter += 1
if number == added_element_counter:
break
plt.bar(element_count.keys(), height=element_count.values())
plt.show()A solution could be to shuffle the records before iterating over them:
import random
import matplotlib.pyplot as plt
# Wanted number of elements (-n)
number = 10
# Calculate proportion based of the wanted number and total number
proportion = number / len(sample)
# Init dictionary to count the number of times each element was selected
element_count = {element: 0 for element in range(1, 101)}
for _ in range(1_000_000):
# Toy data to sample from
sample = list(range(1, 101))
# Shuffle the list each iteration
random.shuffle(sample)
# Counter for the number of selected elements in each iteration
added_element_counter = 0
for element in sample:
if random.random() <= proportion:
element_count[element] += 1
added_element_counter += 1
if number == added_element_counter:
break
plt.bar(element_count.keys(), height=element_count.values())
plt.xlabel('Element')
plt.ylabel('Number of times element was picked.')
plt.tight_layout()
plt.show()Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels

