Skip to content

Seeding #18

@curiousleo

Description

@curiousleo

Initialising PRNGs correctly is hard. For casual users in particular, it should be easy to Do The Right Thing.

Typical things users should be able to do and the common solutions are:

  1. User: initialise a PRNG without specifying a seed at all
    Solution: draw non-deterministic randomness from a system source to create a good seed
  2. User: initialise a PRNG from a low entropy constant (e.g. small integers like 1337 or 42) without getting punished for it
    Solution: mix user-provided seed deterministically to turn low-entropy seed (e.g. with lots of zero bytes) into high-entropy seed
  3. User: initialise multiple instances of a PRNG
    Solution: provide a seed sequence with a spawn function, see below

Prior art

Some RNG APIs have a separate "seeding" layer that takes care of this, see #13.

C++ randutils / seed_seq

  • Initialising a PRNG without parameters uses a system randomness source to seed it (source here)
  • Builds on top of the seed_seq idea from the C++ standard library; this is a kind of low level PRNG that can only fill arrays of 32-bit numbers with random values, nothing else - its only purpose is to seed other (more convenient and faster) PRNGs.

numpy SeedSequence

  • Initialising a SeedSequence without parameters uses a system randomness source to seed it
  • SeedSequence.spawn initialises a number of new SeedSequence instances. The unique spawn key is hashed into the sequence's entropy
  • SeedSequence.generate_state returns an array of random numbers for seeding
  • PRNG constructors take a SeedSequence from which they draw their seed

mwc-random

  • create and initialize combine a user-provided seed with a fixed default seed - this is a form of deterministic mixing
  • withSystemRandom and createSystemRandom seed the PRNG with system randomness

splitmix

  • seedSMGen uses the given seeds without mixing
  • mkSMGen mixes the given Word64 before initialising the SMGen
  • initSMGen uses system time
  • newSMGen splits an new generator off from the global generator

random right now

  • mkStdGen (public) throws away all but the lowest 32 bits of the given Int and then creates the StdGen via mkStdGen32. createStdGen (private) does the same, but takes an Integer.
  • split is used to create new RNG instances

Design considerations

Mixing user seeds

random as it is designed right now provides no method that deterministically mixes the user seed. This is easily fixed e.g. by calling out to the relevant splitmix function.

Spawning and splitting

In the random API, split is part of the RandomGen class. This means that there is an assumption that each RNG must be split in its own way.
This is in contrast to the C++ and numpy APIs, where there is a source of seeds separate from any particular RNG implementation (seed_seq and SeedSequence, respectively) which is used to seed all RNGs.

For concreteness' sake, what I have in mind is something along those lines:

-- | Size of the entropy pool, in bytes (default 4 or 8)
type PoolSize = Int

-- | Non-deterministic seed using system randomness source
systemSeed :: IO SeedSeq
systemSeedWithPoolSize :: PoolSize -> IO SeedSeq
-- | Deterministic seed generated from user specified input
fixedSeed :: Word64 -> SeedSeq
fixedSeedWithPoolSize :: PoolSize -> Word64 -> SeedSeq

-- | Seed a single 'RandomGen'
spawn :: RandomGen g => SeedSeq -> (g, SeedSeq)
-- | Seed multiple 'RandomGen'
spawnN :: RandomGen g => SeedSeq -> Int -> ([g], SeedSeq)

This is a simplified translation of the numpy / C++ approach. In the background, it would use something like this initialize function in RandomGen to create RNGs.

Pros of such an API with a separate, opaque source of seeds:

  • Allows us to build a 'good seed by default' API for any compatible RNG (see systemSeed and fixedSeed above)
  • Intent is communicated clearly: Seed is meant to be used for seeding, so its design constraints are different: it must create reasonable seeds for any RNG, but it can be slower than other RNGs because in the expected use case, only a small number of seeds are required compared to the overall number of random values generated by RNGs over the course of a program

Cons:

  • A bad source of seeds would lead to a 'bad seed by default' API
  • It might interact weirdly with some particular RNG algorithm

Implementation notes

entropy could be used to get entropy from the system.
The PRNG behind Seed could be

  • the non-cryptographic algorithm designed by O'Neill for randutils, which is also used by numpy: it passes statistical tests and is now widely used
  • a cryptographic RNG

Thoughts?

Edit: renamed Seed to SeedSeq to avoid clashes with the existing definition of Seed inside MonadRandom.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions