Skip to content

Add note on probability for interleave_randomly #1063

@random-eight

Description

@random-eight

After reading discussion elsewhere (see references), i'm not sure if people in general are aware of the differences in probability distributions of different implementations. The documentation for the current implementation (#1048) doesn't specify which distribution it uses. This issue is discussed in multiple answers and comments in the references, but i'll give my own version here.

Consider 2 lists of 2 elements each:

a = [1, 2]
b = [3, 4]
c = list(interleave_randomly(a, b))

There are 6 possible values for c: [1, 2, 3, 4], [1, 3, 2, 4], [1, 3, 4, 2], [3, 1, 2, 4], [3, 1, 4, 2], and [3, 4, 1, 2]. People may expect, at least subconsciously, that all of these have a 1/6 chance of happening, but in the current implementation, [1, 2, 3, 4] and [3, 4, 1, 2] have a 1/4 chance each, while the rest have a 1/8 chance each.

These discrepancies are more pronounced if one list is much longer than another—the elements of the shorter list are much more likely to appear at the beginning of the result, rather than spread evenly among the other elements.


In order to make the distribution fair, the lengths of the inputs must be known in advance, which isn't always feasible for arbitrary iterables. Plus, there may be cases in which the distribution isn't a real concern, or the current version is actually more correct. This is why i'm not proposing to fix this, but just to add some documentation.

That being said, if a fairer version is desired, i have a couple possible implementations. Would those go here or in a separate issue?

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions