Fast sampling of special binary strings
We are going to define a simple little language. A word in this language is a binary string where the longest run of consecutive $0$s, is shorter than every run of $1$s which is not contained within a larger run (a "maximal" run). So for example: \begin{align} 001110111110011 \end{align} which we will write as \begin{align} 0^21^301^50^21^2 \end{align} is not legal, because the following red segment is just as long as the blue segment: \begin{align} {\color{red}0^2}1^301^50^2{\color{blue}1^2} \end{align} The string: \begin{align} 0^31^401^80^31^5 \end{align} is valid, because the longest run of $0$s has length three and the shortest maximal run of $1$s has length four.
Strings of the form $0^n$ are somewhat ambiguous in the above description. Technically they should be allowed, since there are no runs of $1$s at all, but they seem to in some sense violate the spirit of the description. Whether they are included is up to you, it will not make a huge difference for the challenge.
Challenge
Given a non-negative integer $n$, uniformly sample from the words of length $n$ in the language. Your algorithm must have sub-exponential average-case time complexity in terms of the input value.
You may choose a word using a built-in source of randomness, or you can accept as an additional input a source of randomness (e.g. a infinite list of 1s and 0s, or a stateful blackbox function).
This is code-golf so the goal is to minimize the size of your source code as measured in bytes.
Non-solutions
An easy would-be solution is to sample randomly on binary strings of the correct length until you get something in the language.
We will show that this does not work. This gives a uniform distribution but fails because the average-case time complexity is exponential. On average you will fail exponentially many times before you succeed.
To see this notice that the string $010$ is never a legal substring in our language, regardless of context. So there are at most 7 possible length 3 segments that can extend a valid word at any point.
So the rate at which words in our language grow is $O\left(\sqrt[3]{7}^n\right)$ (we use the $O$-notation here precisely, meaning it is bounded above) while all binary strings grow at a rate of $2^n$. $2 = \sqrt[3]{8}>\sqrt[3]{7}\approx 1.913$.
You might try to generate random strings which don't contain $010$ instead (this can be done uniformly and efficiently), and then reject the ones that don't work. However a general version of this argument can show that any finite set of forbidden substrings will either rule out some words you want to sample (i.e. make the sampling non-uniform) or it will require exponentially many samples (i.e. the base of the exponent is too large).

1 comment thread