Statisfaction - I can't get no

The MAP is not the territory

Rémi Bardenet — Mon, 24 Nov 2025 23:00:00 GMT

TL;DR: The MAP estimator is sometimes a Bayes estimator in disguise, but this comes at a price.

Say you are inferring a parameter $\theta\in\mathbb{R}^d$, and you have come up with a prior density $p(\theta)$ with respect to the Lebesgue measure on $\mathbb{R}^d$, and a family of densities $p(\cdot\mid\theta)$ for your observations, with respect to some reference measure on the space where the data live. After observing data $y$, a common practice, especially in the literature on inverse problems, is to estimate $\theta$ by maximizing the posterior density,

$$\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta\in\mathbb{R}^d} p(\theta\mid y) = \arg\max_{\theta\in\mathbb{R}^d} p(y\mid\theta)\,p(\theta). \tag{1}$$

With the right assumptions on the two densities in the RHS of Equation 1, the argmax is unique, thus justifying the definition. The MAP estimator is popular in inverse problems, e.g. for restoring corrupted images, where $p(y\mid\theta)$ is typically Gaussian or Poisson and the prior typically expresses a regularization, e.g. a soft constraint on coefficients in a basis or a frame. This popularity of the MAP estimator is largely explained, I think, by the availability of efficient numerical optimization procedures to solve Equation 1 for the likelihood-prior pairs that are common in inverse problems.

Yet some Bayesians dislike the MAP estimator. For starters, the primitives of Bayesian inferential procedures are usually probability measures, not their densities. In particular, I can arbitrarily change the MAP estimator by modifying e.g. the prior density in Equation 1 on a set of Lebesgue measure zero. That alone was enough of an argument for me against the MAP until about ten years ago. At that time, I saw a talk by Marcelo Pereyra in Bordeaux, presenting this paper. Marcelo was trying to salvage the MAP estimator, by casting the MAP as a Bayes action in a (twisted) decision-theoretical framework. I remember thinking a lot about this at the time, after which I put these thoughts in a mental drawer for a while. At coffee time during the last GRETSI, Rémi Gribonval mentioned his past work on exactly this issue, and I couldn’t but reopen that drawer. I thought the basic ingredients of this discussion would make a nice blog post.

As a palate cleanser before the theorems, the picture in Figure 1 is Alfred Korzybski (1879-1950), the Polish-American philosopher of science who coined the sentence making the punny title of this post. According to Wikipedia, his views are that our understanding of the world is impeded by our nervous system, language, etc. and that mathematics is a language that helps us formulate a discourse that best approximates reality. Amusingly, this resonates with the post’s content: we will see that talking of the MAP as maximizing a posterior, and thus intuitively according some modelling role to the densities appearing in Equation 1, is maybe not the best way to express the mental assumptions we are making on the world when choosing the MAP estimator.

Figure 1: Left: Alfred Korzybski, whose Wikipedia picture would actually benefit from some denoising. Right: Denoising is classically implemented as a MAP, here I used a Gaussian likelihood and a “total variation” prior; see the documentation of scikit-image.

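We start with part of Theorem 3.1 in Marcelo Pereyra’s above-mentioned paper. With the notation of Equation 1, let $\varphi(\theta) = -\log p(y\mid\theta) - \log p(\theta)$ be minus the log density of the posterior obtained from $p(y\mid\theta)$ and $p(\theta)$, defined up to an additive constant. Assume $\varphi$ is strongly convex and $C^3$, and that $e^{-\varphi(\theta)}$ decays fast as $\|\theta\|$ grows. Consider the so-called Bregman divergence

$$D_\varphi(u,\theta) = \varphi(u) - \varphi(\theta) - \nabla\varphi(\theta)^\top (u-\theta).$$

The result states that

$$\hat\theta_{\mathrm{MAP}} = \arg\min_{u}\ \mathbb{E}_{\theta\mid y}\, D_\varphi(u,\theta), \tag{2}$$

and in particular the minimum in the RHS is unique. Informally, the proof follows from plugging the definition of $D_\varphi$ in the expectation, and noting that the only non-trivial term is

$$\int \nabla\varphi(\theta)\, e^{-\varphi(\theta)}\,\mathrm{d}\theta = -\int \nabla\big(e^{-\varphi(\theta)}\big)\,\mathrm{d}\theta.$$

The latter integral is zero under the right decay assumption by the divergence theorem.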

Now, Equation 2 implies that the MAP estimator is a Bayes estimator, in the sense that it maximizes an expected utility (equivalently, minimizes an expected loss) with respect to a probability measure, here the posterior measure. The twist is that the loss function depends on the posterior through its negative log density $\varphi$, and in particular it depends on the data $y$. In subjective Bayes terms, the utility is state-dependent. This violates the most common sets of axioms of Bayesian decision theory, such as the Anscombe-Aumann axioms presented in Schervish’s book. In particular, this makes it hard to interpret the posterior as a degree of belief. Yet, I have grown inclined to weaken my definition of being Bayesian, and it would be interesting to understand how the choice of the Bregman divergence $D_\varphi$ as a loss impacts the statistician’s ranking of actions. As a remark of independent interest, $D_\varphi$ is not symmetric, and if one swaps the two arguments of $D_\varphi$ in Equation 2, Pereyra shows that the Bayes action becomes the posterior mean!
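Here is a small numerical sanity check of both statements, as a sketch under my own assumptions rather than anything taken from Pereyra's paper. I pick a one-dimensional toy negative log posterior `phi(t) = t**2 / 2 + exp(t)` (strongly convex, smooth, and not quadratic, so that the MAP and the posterior mean differ), approximate posterior expectations on a grid, and minimize the two expected Bregman losses over a grid of candidate actions `u`. All names and grid sizes are illustrative.

```python
import numpy as np

# Toy negative log posterior (an assumption for illustration only):
# strongly convex, smooth, and non-quadratic.
def phi(t):
    return 0.5 * t**2 + np.exp(t)

def dphi(t):
    return t + np.exp(t)

# Posterior weights on a fine grid; the mass outside the grid is negligible.
theta = np.linspace(-12.0, 6.0, 20001)
w = np.exp(-phi(theta))
w /= w.sum()

# Posterior expectations that appear in the two expected losses.
E_phi = np.sum(w * phi(theta))
E_dphi = np.sum(w * dphi(theta))            # close to 0, as in the proof sketch
E_dphi_theta = np.sum(w * dphi(theta) * theta)
theta_mean = np.sum(w * theta)
theta_map = theta[np.argmin(phi(theta))]

u = np.linspace(-3.0, 2.0, 2001)            # candidate actions

# E[D_phi(u, theta)] = phi(u) - E[phi] - u E[phi'] + E[phi' * theta]
loss_first_arg = phi(u) - E_phi - u * E_dphi + E_dphi_theta
# E[D_phi(theta, u)] = E[phi] - phi(u) - phi'(u) * (E[theta] - u)
loss_second_arg = E_phi - phi(u) - dphi(u) * (theta_mean - u)

bayes_action_first = u[np.argmin(loss_first_arg)]
bayes_action_second = u[np.argmin(loss_second_arg)]

print("MAP:           ", theta_map, " vs argmin E[D(u, theta)]:", bayes_action_first)
print("posterior mean:", theta_mean, " vs argmin E[D(theta, u)]:", bayes_action_second)
```

Up to the grid resolution, the first minimizer should coincide with the MAP and the second with the posterior mean, while the MAP and the mean themselves are visibly different for this skewed toy posterior.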

I can’t have an exhaustive bibliography in this post, but I should at least mention that Pereyra’s result generalizes an earlier result by Burger and Lucka who focussed on Gaussian likelihoods. Pereyra also mentions a generalization to non-Gaussian likelihoods akin to his by Burger, Dong, and Sciacchitano. Pereyra also cites a 2011 paper by Rémi Gribonval, which was the start of a line of work by Gribonval and Nikolova. The next result I’d like to cover is in the last paper in that line of work, a 2019 paper by Gribonval and Nikolova. According to a footnote, Mila Nikolova passed away during their writing of the paper. I have good memories of her lectures on optimization in Cachan.

Gribonval and Nikolova’s take is rather different. They start from a mean posterior estimator, where the posterior is defined by what I will call the initial likelihood-prior pair. Under conditions on the initial likelihood, they manage to rewrite the mean posterior estimator in the form of Equation 1, for a different likelihood-prior pair than the initial one. Let me call this new pair of densities appearing in the MAP reformulation the computational pair. For the authors, the computational pair of densities are not thought of as modelling the data generation process or a prior belief, they are simply intermediate quantities that appear in a formal rewriting of the original Bayesian estimator. Compared to Pereyra and Burger et al., the procedure has the benefit of keeping the loss function untouched: it remains the squared loss throughout. The price to pay is, from what I understand, a limited number of initial likelihoods that can be treated, and a rather intricate definition of the computational prior. An important message from the paper is that, if you choose to go for a MAP estimator (say, a LASSO estimator in linear regression, or the total-variation denoiser I used in Figure 1), your likelihood-prior pair is of the computational kind: your modelling choices are encoded in the implicit initial pair of densities.

Their fundamental tool is their Lemma 1 on proximal operators, i.e. operators that map data to the solution of a regularized least-squares problem. Formally, for a function $\psi$ that is not identically $+\infty$, define

$$\mathrm{prox}_\psi(y) = \arg\min_{u}\ \Big\{ \tfrac12 \|y-u\|^2 + \psi(u) \Big\}.$$

Proximal operators are a key notion in optimization of non-differentiable functions, as solving a regularized least squares problem can intuitively replace a gradient descent step. In a companion paper, Gribonval and Nikolova had found a characterization of proximal operators, and they apply it here to posterior means: under (stringent) conditions on the initial likelihood-prior pairs, the mean of the posterior can be rewritten as a MAP for a different likelihood-prior pair. They give many examples of the resulting MAP reformulations. To cite only one, their Proposition 1 states that if $p(\cdot\mid\theta)$ is a Poisson law, and the prior on $\theta$ is whatever you want, then there exists a function $\psi$ on the positive reals such that the posterior mean $\mathbb{E}[\theta\mid y]$ minimizes $u\mapsto \tfrac12 (y-u)^2 + \psi(u)$. Put differently, the posterior mean for a model with Poisson noise has a MAP formulation as in Equation 1, for a computational likelihood that looks like a Gaussian!
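To make the notion of a proximal operator concrete, here is a minimal sketch on the textbook example of the absolute value penalty, whose proximal operator is the soft-thresholding map. This example is not from Gribonval and Nikolova's paper; the function names, the value of `lam` and the grid are my own illustrative choices. The brute-force minimization over a grid is only there to check the closed form.

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form proximal operator of psi(u) = lam * |u|."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_by_grid(y, psi, grid):
    """Brute-force prox: minimise 0.5 * (y - u)**2 + psi(u) over a grid of u."""
    objective = 0.5 * (y - grid) ** 2 + psi(grid)
    return grid[np.argmin(objective)]

lam = 0.7
grid = np.linspace(-5.0, 5.0, 100001)          # grid step of 1e-4
ys = np.array([-3.0, -0.5, 0.2, 1.0, 2.5])

closed_form = soft_threshold(ys, lam)
brute_force = np.array(
    [prox_by_grid(y, lambda u: lam * np.abs(u), grid) for y in ys]
)

print(closed_form)
print(brute_force)
```

Read as a MAP in the sense of Equation 1, this is a Gaussian likelihood with a Laplace prior, i.e. the scalar version of the LASSO.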

Overall, a MAP can hide a Bayes estimator, either at the price of a data-dependent loss function, or because your MAP problem is the proximal rewriting of a posterior mean corresponding to a different likelihood-prior pair! Note that I’ve only scratched the surface of the papers I mention, and they all contain more nuggets than what I dug out.

Testing the RSS feed

Nicolas Chopin — Fri, 03 Oct 2025 22:00:00 GMT

This is just a test. Sorry for the inconvenience. I will delete this post shortly.

Nested sampling and SMC: numerical experiments

Nicolas Chopin — Sun, 29 Jun 2025 22:00:00 GMT

By “popular” demand (i.e., Adrien asked for it), here are the numerical experiments I promised in my previous post. I did these experiments initially to illustrate some points in the talk I gave at MaxEnt 2023.

NS-SMC vs tempering SMC for a Gaussian-like target

Salomone et al discuss how NS (nested sampling, both the vanilla and the SMC versions) may outperform tempering whenever the target distribution exhibits pathologies such as phase transition. It is not easy (to me at least) to grasp how phase transition may occur in a Bayesian posterior, but I suspect it tends to occur when the posterior is multi-modal. Please have a look at their paper for more details on this and their first numerical experiment which illustrates this point.

In this first experiment, I wanted to see whether NS-SMC is competitive with tempering SMC on a less challenging target distribution, i.e. the good old logistic regression posterior, which is typically Gaussian-like, and therefore unimodal.

In the plots below, I compare two instances of waste-free SMC, one based on a tempering sequence, the other on the NS sequence discussed in the previous post. Both algorithms automatically derive the next element in these sequences so that the ESS (effective sample size) is . Both rely on random walk Metropolis kernels, which are calibrated on the current particle sample.

I consider the sonar dataset (dim=61). The plot below shows how the MSE (over 100 runs) of the log-marginal likelihood evolves as a function of for both algorithms; considered values for are . This plot is a bit misleading, because, when changes, the CPU cost changes as well: the larger is, the larger the number of intermediate distributions.

So let’s do a second plot, where the axis is the work-normalised MSE; that is, MSE times number of total evaluations of the likelihood (which is a good proxy for overall CPU cost). See below.

Both variants seem to lead to the same level of performance (i.e. CPU vs error trade-off). One point to note is that the best performance for NS is obtained by taking small.

Bottom line: NS-SMC seems indeed competitive with tempering SMC. This is a bit surprising to me (as they rely on two very different sequences of distribution), but it shows that NS-SMC may deserve more scrutiny from the Bayesian computation community, I think.

A limitation of vanilla NS

Another experiment, possibly of more limited interest. For the same type of target distributions (logistic regression posterior, this time for the Pima dataset), the plot below illustrates the bias of vanilla NS as a function of the number of MCMC steps performed at each iteration. Recall that in vanilla NS, you discard one particle at each time (the one with smallest likelihood), choose randomly one of the remaining ones, apply MCMC steps to this selected particle, and add back the output to the particle sample.

This plot suggests that NS may be biased if is too small. I am not sure why this is happening. This may be because NS is valid only when . Or maybe because of the adaptive MCMC strategy I’m using: as in the previous section, I use random walk Metropolis, and I recursively adapt the proposal covariance to the empirical covariance matrix of the particles.

How to replicate these results

Some time ago, I added a nested module to particles, which implements both vanilla NS and NS-SMC. The numerical experiments reported above may be reproduced by running the scripts in folder papers/nested.

A simpler nested sampling identity

Nicolas Chopin — Tue, 03 Jun 2025 22:00:00 GMT

In this post, I am trying to come up with a simple introduction to NS (nested sampling), through the lens of SMC samplers. It should be interesting to readers who are familiar with the latter but not with the former.

This post is inspired by this paper by Salomone et al, which has just been accepted in JRSSB. Congrats to the authors!

Set up

Consider a model with parameter , prior , and likelihood . The posterior is then (The Bayesian interpretation is not essential. More generally, could be a proposal distribution, a target distribution, and a function proportional to .)

Let’s now introduce the following family of distributions: In words, is the prior truncated to the region , and the normalising constant is the prior probability that .

If we introduce a sequence , we can use a SMC sampler to approximate recursively and its normalising constant . Note that the particle weights at each time will be 0 or 1 in this particular SMC sampler, since:

The simpler NS identity

Now comes the identity. Let for an arbitrary function of . Then:

Thus, if we implement a SMC sampler that tracks the sequence , we will be able to approximate all the above quantities, and thus, through this identity, to approximate the marginal likelihood, , and posterior moments, .

Choosing the ’s

In practice, we need to choose the ’s. As in tempering, it seems reasonable to set them automatically, in such a way that the ESS (effective sample size) is , for some . Because the weight function is , this amounts to taking to be the upper quantile of the , where the ’s are the particles sampled at time by our SMC sampler. This is what Salomone et al recommend. In this case, we can replace in the identity above by , at least for .

The corresponding estimate will be something like: where I omitted the th term (it has a slightly different expression, i.e. ), and I used the fact that the unweighted sample generated at the beginning of iteration currently targets .
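To see the mechanics of this estimate at work, here is a toy sketch, under assumptions that are entirely mine: a uniform prior on (0, 1), a likelihood `exp(-theta)`, thresholds set adaptively at the sample median of the likelihood values (i.e. an ESS of half the sample size), and exact sampling from the truncated prior in place of the MCMC moves a real NS-SMC sampler would use. The estimate accumulates, for each threshold, the current estimate of the normalising constant times the average likelihood of the particles that fall below the next threshold, plus a final term for the last truncated prior. For this toy model the marginal likelihood is `1 - exp(-1)`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 10_000, 8   # particles per iteration, number of thresholds (illustrative)

def lik(theta):
    """Toy likelihood; with a U(0, 1) prior, Z = 1 - exp(-1)."""
    return np.exp(-theta)

def sample_truncated_prior(level, size):
    """Exact draws from the U(0, 1) prior restricted to {lik(theta) > level}.

    This replaces the resample-move step of a real sampler (toy shortcut).
    """
    upper = 1.0 if level <= 0.0 else min(1.0, -np.log(level))
    return rng.uniform(0.0, upper, size=size)

level = 0.0    # current threshold; level 0 means no truncation (the prior)
P_hat = 1.0    # estimate of the prior probability that lik(theta) > level
Z_hat = 0.0    # running estimate of the marginal likelihood

for t in range(T):
    L = lik(sample_truncated_prior(level, N))
    next_level = np.quantile(L, 0.5)        # keep the upper half of the sample
    below = L <= next_level                 # particles that get weight 0 next
    Z_hat += P_hat * np.mean(L * below)     # contribution of this likelihood shell
    P_hat *= np.mean(~below)                # update the normalising constant
    level = next_level

# Last term: everything above the final threshold.
L = lik(sample_truncated_prior(level, N))
Z_hat += P_hat * np.mean(L)

Z_true = 1.0 - np.exp(-1.0)
print(f"estimate: {Z_hat:.4f}, truth: {Z_true:.4f}")
```

Replacing `L` by `L * h(theta)` in the accumulated terms gives the numerator of a posterior moment in the same way.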

Vanilla NS as a particular waste-free SMC sampler

Now assume that, in your adaptive NS-SMC sampler, you set (or equivalently, ); that is, you discard only one particle, the one with smallest likelihood. In other words, you decide to move as slowly as possible up the likelihood function.

If you’d resample the surviving particles, and apply MCMC step to each of them, you would get a very expensive sampler: increasing means you both increase the cost of a single iteration, and the total number of iterations (since it makes larger).

A cheaper alternative is to choose randomly one of the surviving particles, apply a MCMC step to it, and take the output as your new th particle. Then, you get an algorithm which is very close to the original NS one. In particular, your estimate of becomes: with . (The original NS estimate has replaced by , which should be very close numerically for large .)

This idea of resampling particles and moving only one of them is reminiscent of waste-free SMC. In waste-free SMC, you resample only particles out of , . Then, assuming divides , i.e., for some , you apply to each resampled particle MCMC steps, and gather the resulting states to form a new particle sample of size . What if does not divide , i.e. , ? Then it makes sense to generate MCMC chains of length , and chains of length . This is what happens here, with , .

Why did I say we get a “simpler” identity?

The original NS algorithm by Skilling derives essentially the same identity as above, but through more convoluted steps, which involve the CDF of the random variable , when , its inverse, Beta distributions, etc. I find the derivation above simpler (at least, again, if you are familiar with SMC samplers). Of course, in return, you get a justification which is a bit hand-wavy for vanilla NS (but for NS-SMC, it is perfectly solid).

Should I care about NS?

There are two sub-questions:

NS vs NS-SMC: Salomone et al give numerical evidence (and arguments) suggesting that NS-SMC outperforms NS.
NS-SMC vs tempering SMC or other SMC schemes: Salomone et al also give numerical evidence suggesting NS-SMC is competitive with tempering SMC, which is intriguing (and in line with independent numerical experiments I did).

I will elaborate on these two points in my next post (coming soon). In the meantime, feel free to have a look at the aforementioned paper, it is well worth a read.

Numpy broke my heart

Nicolas Chopin — Wed, 24 Apr 2024 22:00:00 GMT

I swear, the title is kind of funny in French (try to figure out why). Anyway, in this post I wanted to dispel a misconception I had until recently on python, numpy and multi-processing, and which led me to say something silly in our SMC book.

no comment

Python and multi-processing

Most modern computers have several CPU cores; I guess even potato computers have at least two these days? On the other hand, a program written in Python will be executed on a single core, because of the GIL. This means that all the other cores will stay idle while you run your program. Which is frustrating when said program takes forever to complete.

There are different ways to make all your CPU cores work for you, but I will discuss the only two ways which I am (a bit) familiar with:

Use joblib or a similar library. (But seriously, just use joblib, it’s great.) This requires a bit of work, as you have to state explicitly which parts of your program may be turned into independent tasks that will be performed in parallel. The typical use case for me is to run several times the same SMC algorithm (perhaps with different parameters, e.g. a different number of particles); see for instance this.
Do nothing, and pray that your program relies on those Numpy operations which are already parallelised for you (multithreaded). Numpy relies on low-level (C/Fortran) linear algebra libraries such as BLAS and LAPACK, and these libraries are able to implement certain operations (e.g. matrix multiplication) on multiple cores. In this case, your python script still runs on a single core, but, when it encounters a multithreaded numpy operation, this operation spawns (temporarily, for this operation only) several threads that are executed on different cores.
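For the first option, a minimal sketch of what explicit parallelism with joblib can look like. The function `one_run` is a hypothetical stand-in for one independent run of an expensive algorithm (it is not an SMC sampler, and not code from the book); only `Parallel` and `delayed` come from joblib.

```python
import numpy as np
from joblib import Parallel, delayed

def one_run(seed, n_particles):
    """Hypothetical stand-in for one independent run of an expensive algorithm."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_particles)
    return float(np.mean(x**2))   # some scalar summary of the run

# Eight independent runs, dispatched to two worker processes.
results = Parallel(n_jobs=2)(
    delayed(one_run)(seed, 10_000) for seed in range(8)
)
print(results)
```

The tasks must be independent of each other, and each one gets its own seed so that the runs do not share a random stream.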

My bad

Ok, now for my misconception (a.k.a. what an idiot I am). When I run on a standard PC the following script, which implements the numerical experiment of Chapter 17 (on SMC samplers) in our book, all the cores are kept busy during the execution. This script does not rely on any form of explicit parallelism. Several SMC samplers are run, but sequentially (I don’t use multiSMC in this script). So clearly it’s numpy that is doing its thing (point 2 above). In fact, by profiling it, one can see that most of the CPU time is spent in the one line that computes the log-likelihood of the logistic regression model, and this involves a matrix multiplication. So this makes sense.

In the book (page 352 if you want to check), I said naively: if you have k cores, you get a x k speed-up for free in this particular experiment. I thought that that was the case, because all my CPU cores were 100% busy the whole time.

However, I did some more testing recently and tried to compare the running time of this script when numpy does multithreading and when it does not. (See here on how you may disable multithreading in numpy.)

On a standard PC, the speed-up is more like… one per-cent?
On a certain cloud-based architecture that I’m currently playing with (and which relies on Kubernetes containers), multithreading can actually slow down the script by a factor of 10 or more.
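A comparison of this kind can be sketched as follows. This is not the script from the book: the matrix size and the number of repetitions are arbitrary choices of mine, and the way to limit threads shown here (environment variables read by the OpenMP, OpenBLAS and MKL runtimes) is one common option among several. Run it once as is, once with the three `os.environ` lines commented out, and compare the timings.

```python
import os

# These variables are read when the BLAS library is first loaded, so they
# must be set before NumPy is imported for the first time in the process.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import time
import numpy as np

def time_matvec(n_rows, n_cols, n_repeats=200):
    """Average wall-clock time of a matrix/vector product of the given size."""
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n_rows, n_cols))
    beta = rng.standard_normal(n_cols)
    start = time.perf_counter()
    for _ in range(n_repeats):
        X @ beta
    return (time.perf_counter() - start) / n_repeats

# Illustrative sizes: a tall, thin matrix, as in a logistic regression likelihood.
elapsed = time_matvec(2_000, 61)
print(f"average time per matrix/vector product: {elapsed:.2e} s")
```

What matters is the measured wall-clock time, not how busy the cores look in the system monitor.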

What’s going on?

I am not sure, I’m a bit out of my depth here. I guess what happens is that, for this particular script, the speed-up brought by multithreading is cancelled by the time it takes to generate new threads at the beginning of the numpy operation. (Remember that this must be done each time a line with a multithreaded numpy operation is executed.) In fact, the multithreaded operation seems to be a matrix/vector multiplication, where the matrix is not very large. (It’s of size , where is the number of particles. I tried to increase N several times over, but it did not change the results.)

And things may get worse in containers, where either numpy might make wrong assumptions on the available resources, or you simply share resources with many other users. (Disclaimer, I don’t know what I’m talking about.)

Also, of course, this kind of result may depend on your hardware and on the versions of python and related libraries you are using (in particular, whether you use the OpenBLAS version of BLAS or the MKL one, which is specific to Intel CPUs; to see this, check the output of numpy.show_config() on your machine), and so on. The picture below summarises the situation.

Alice decided to better understand multiprocessing in Python

Enter joblib

The discussion above assumes you run a single program, and that Numpy may or may not get access to all the cores. What if you try to implement multi-processing (using joblib, multiprocessing or something else), but each task performs numpy operations? You could have an over-subscription problem, that is, you end up with many threads (more than the number of cores), and the computer wastes a lot of time trying to juggle between all these threads.

Fortunately, joblib is smart enough to tell numpy to calm down and generate fewer threads. This point is discussed here in the documentation. Well worth a read.

I managed to speed up my script significantly by using joblib, but I still cannot obtain a x 24 factor on my nifty 24-core PC. I am still ~~crying~~ investigating.

Take-home messages

Just because all your 20 CPU cores are busy does not mean that your script is running 20 times faster.
If you actually want to achieve a substantial speed-up on multi-core hardware, you might need to try different things, and check the actual results (i.e. measure the total running time).
Read this and this if you want to learn more about multiprocessing and numpy, I found these pages clear and authoritative on this topic.
Don’t believe everything you read in books? :-)

Quantum workers in Bernoulli factories

Rémi Bardenet — Tue, 13 Feb 2024 23:00:00 GMT

TL;DR: A quantum computer lets you provably build more general Bernoulli factories than your laptop.

I have developed an interest in quantum computing, both for fun and because it naturally applies to sampling my favourite distribution, determinantal point processes. One of the natural (and still quite open) big questions in quantum computing is, for a given computational task such as solving a linear system, whether having access to a quantum computer gives you any advantage over using your laptop in the smartest way possible. Maybe the quantum computer lets you solve part of your problem faster, or maybe it allows you to solve a more general class of problems. Dale, Jennings, and Rudolph (2015) prove a quantum advantage of the latter kind, for a task that appeals to a computational statistician: a quantum computer gives you access to strictly more Bernoulli factories than your laptop does. In this post, I discuss one of their examples.

Figure 1: An excerpt from an excellent comic strip by Scott Aaronson and Zach Weinersmith.

Bernoulli factories

First, I need to define what a Bernoulli factory is. Loosely speaking, a Bernoulli factory is an algorithm that, when fed with i.i.d. draws from a Bernoulli random variable with unknown parameter $p$, outputs a stream of independent Bernoullis with parameter $f(p)$. The algorithm does not have access to the value of $p$, and needs to work for as large a range of values of $p$ as possible. For instance, a trick attributed to von Neumann gives you a Bernoulli factory for the constant function $f\equiv 1/2$; can you guess how? If you have never seen this trick, take a break and think about it. Here is a hint: try to pair Bernoulli draws and define two events of equal probability.
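Spoiler ahead, for those who would rather see the trick than guess it. The sketch below pairs draws from a coin with parameter `p` and outputs the first draw of the first pair whose two draws differ: the events (1, 0) and (0, 1) both have probability `p * (1 - p)`, so the output is a fair coin whatever `p` is. The function name, the value of `p` and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def fair_coin_from_biased(p, rng):
    """Von Neumann's trick: turn p-coins into one fair coin.

    The value of p is only used to simulate the input coin; the decision
    rule itself never looks at it.
    """
    while True:
        a, b = rng.random(2) < p      # a pair of i.i.d. Bernoulli(p) draws
        if a != b:                    # (1, 0) and (0, 1) are equally likely
            return int(a)

draws = [fair_coin_from_biased(0.9, rng) for _ in range(20_000)]
freq = float(np.mean(draws))
print(f"empirical frequency of ones: {freq:.3f}")
```

The price is a random number of input draws per output: the more biased the input coin, the longer the wait for a discordant pair.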

The problem of determining what Bernoulli factories can be constructed on a classical (as opposed to quantum) computer has been answered by Keane and O’Brien (1994). Essentially, it is necessary and sufficient that $f$ be continuous on its domain $S\subset[0,1]$, and that either $f$ is constant or there exists an integer $n$ such that, for all $p\in S$,

$$\min\{f(p), 1-f(p)\} \geq \min\{p, 1-p\}^n.$$

In particular, a non-constant $f$ should not take the values $0$ or $1$ in $(0,1)$, and cannot approach these extreme values too fast. For instance, the doubling function $f(p)=2p$ defined on $[0,1/2]$ does not correspond to a Bernoulli factory, while its restriction to $[0,1/2-\epsilon]$ does, for any $\epsilon>0$. Another simple example is

$$f(p) = 4p(1-p), \tag{1}$$

defined on $[0,1]$, which does not correspond to a Bernoulli factory. Yet, the rest of the post shows that $f$ in Equation 1 does correspond to a specific weakening of the notion of Bernoulli factory, one that is natural in quantum computing.

Quantum computers and quantum coins

Now buckle up, because I need to define a mathematical model for a quantum computer. This model only requires basic algebra, albeit with strange notation. Let be a positive integer, and where the tensor product is taken times. An -qubit quantum computer is a machine that, when fed with

a positive semi-definite, Hermitian operator acting on , with trace norm (the state),
a Hermitian operator on (the observable),

outputs a draw from the random variable , with support included in the spectrum of , defined by Here is the operator that has the same eigenvectors as , but where each eigenvalue is replaced by . The correspondence in Equation 2 between a state-observable pair and a probability distribution on the spectrum of the observable is a cornerstone of quantum physics called Born’s rule, and it is the only bit of quantum theory we shall need. In other words, we see a quantum computer as a procedure to draw from probability distributions parametrized by state-observable pairs. We give two fundamental examples of such state-observable pairs, which can be respectively interpreted as describing one quantum coin and two quantum coins.

The quantum coin. Consider a one-qubit computer, i.e. . Then has dimension , and we fix an orthonormal basis, which we denote by . The strange notation is inherited from physics, and is very practical in computations, as you will see. In short, denote by (a bracket, or bra-ket) the inner product in . Now, a vector in is written (a ket). Similarly, define the linear form (a bra) by By construction, we can write things like so that the bra-ket notation for linear forms and vectors is consistent with the inner product.

Now, remember we have fixed a basis of . For , we define This definition is consistent with earlier notation, as when , for instance. Now, we define a quantum coin as the state . It is the projection onto , and in particular it is a positive semi-definite, Hermitian operator of trace , and hence a valid state. As observable, we take the projection onto the second vector of the basis, which we denote in the bra-ket notation by . What random variable does this state-observable pair define in Equation 2?

Well, the spectrum of the observable is , so we have defined a Bernoulli random variable. Moreover, the probability that it is equal to is given by taking in Equation 2, yielding by cyclicity of the trace. All of this to define a variable! Things get more interesting when you try to create two dependent Bernoulli variables.

Two quantum coins. Consider now a computer with two qubits, so that the Hilbert space is . From our orthonormal basis of , we can build an orthonormal basis of . To keep expressions short, it is customary to write as . To define a pair of quantum coins, we now consider the tensor product of two quantum coins, We think of the corresponding state as two quantum coins. Now consider for your observable an operator with four distinct eigenvalues, say for , each corresponding to eigenvector . In other words, the spectral decomposition of is The random variable , associated through Equation 2 to two quantum coins and our newly defined observable , has support in Moreover, taking in Equation 2, we obtain again by cyclicity of the trace and then carefully distributing our multiplication, noting that most terms are zero by orthogonality. Put differently, the indices of are a pair of independent Bernoullis with equal parameter . Again, this might feel like a lot of algebraic pain for no gain, but wait for it.

What if we had taken the same state, but with another observable? Say the observable with four distinct eigenvalues , and corresponding eigenvectors Then, the random variable defined by Born’s rule in Equation 2 is supported in with Similarly, and You can check that the four probabilities sum to . This time, if you map, e.g., to the string , to , to , and to , we no longer have independent Bernoulli draws, but a rather strange correlation structure. We shall see that allows us to build a Bernoulli factory that is beyond the reach of a classical computer.

A quantum Bernoulli factory

Imagine the following procedure. Draw the random variable . If you obtain or , then stop, and respectively output and . Otherwise, draw another independent realization of , etc. This is reminiscent of the von Neumann trick we mentioned earlier. What have we achieved? Well, the output is a Bernoulli draw with parameter Repeating the procedure as many times as you want draws, we thus have a Bernoulli factory for in Equation 1, which we know to be beyond the reach of classical Bernoulli factories!
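The procedure is easy to simulate classically for a fixed value of the parameter, which makes for a useful sanity check. The sketch below rests on assumptions that I spell out: the quantum coin is the pure state with amplitudes `sqrt(1 - p)` and `sqrt(p)` on the two basis vectors, the eigenvectors of the second observable are the four Bell states, and the two outcomes on which we stop are the ones I label `psi+` (output 1) and `phi-` (output 0). Probabilities are computed through Born's rule as the trace of the state times the projector onto each eigenvector. Labels, function names and the value `p = 0.3` are mine.

```python
import numpy as np

def coin_state(p):
    """Amplitudes of the quantum coin in the basis (|0>, |1>) (assumed form)."""
    return np.array([np.sqrt(1.0 - p), np.sqrt(p)])

# Bell states, written in the product basis |00>, |01>, |10>, |11>.
s = 1.0 / np.sqrt(2.0)
bell = {
    "phi+": s * np.array([1.0, 0.0, 0.0, 1.0]),
    "phi-": s * np.array([1.0, 0.0, 0.0, -1.0]),
    "psi+": s * np.array([0.0, 1.0, 1.0, 0.0]),
    "psi-": s * np.array([0.0, 1.0, -1.0, 0.0]),
}

def born_probabilities(p):
    """Born's rule: P(outcome v) = tr(rho |v><v|) for two quantum coins."""
    ket = np.kron(coin_state(p), coin_state(p))
    rho = np.outer(ket, ket)                 # the two-coin state (a projector)
    return {name: float(v @ rho @ v) for name, v in bell.items()}

def quantum_factory(p, rng):
    """Measure fresh pairs of coins until psi+ (output 1) or phi- (output 0)."""
    names = list(bell)
    probs = np.array([born_probabilities(p)[name] for name in names])
    while True:
        outcome = names[rng.choice(len(names), p=probs)]
        if outcome == "psi+":
            return 1
        if outcome == "phi-":
            return 0

rng = np.random.default_rng(3)
p = 0.3
probs = born_probabilities(p)
draws = [quantum_factory(p, rng) for _ in range(20_000)]
freq = float(np.mean(draws))
print(probs)
print(f"empirical frequency: {freq:.3f}, 4p(1-p) = {4 * p * (1 - p):.3f}")
```

Under these assumptions the four probabilities are `1/2`, `(1 - 2p)**2 / 2`, `2p(1 - p)` and `0`, and conditioning on the two stopping outcomes gives a success probability of `4p(1 - p)`, including the value 1 at `p = 1/2` that a classical factory cannot reach.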

The difference is that our Bernoulli factory is a quantum Bernoulli factory. In particular, our basic resource is (physically) independent copies of . This is asking for strictly more than (statistically) independent Bernoulli draws. Indeed, depending on your observable, two physically independent copies of can give you two i.i.d. Bernoullis , or something more complicated like . If you consider as equivalent the cost of preparing the two types of inputs, i.i.d. Bernoullis on one side and physically independent copies of on the other side, then you have a quantum advantage. It might be a big assumption, but I find it easier to swallow than similar caveats in other quantum advantages that I’ve read about.

Further remarks

The example in this post is from the paper by Dale, Jennings, and Rudolph (2015). The authors further characterize the Bernoulli factories that you can build with only single-qubit operations: they strictly include classical Bernoulli factories and the example from this post. In other words, it is not necessary to use pairs of qubits to build . Since then, there has been more work on quantum Bernoulli factories, for instance considering quantum-to-quantum Bernoulli factories, where the goal is to create independent copies of rather than a stream of Bernoulli random variables.

I thank my group for valuable comments during the writing of this post. One non-consensual point is that I have tried to reduce the quantum formalism to the correspondence in Equation 2 between a state-observable pair and a random variable. This has the advantage of keeping the necessary algebra to a minimum, but it forced me to introduce rather abstract observables, with a spectrum that we only use through its indices. A more standard (but arguably lengthier) treatment might have involved projection-valued measures.

Coulomb rhymes with variance reduction

Rémi Bardenet — Tue, 31 Oct 2023 23:00:00 GMT

… Well, it does rhyme if you read the title aloud with a French accent, hon hon hon.

To paraphrase Nicolas’s previous post, say I want to approximate the integral where is a compact set of . I could use plain old Monte Carlo with nodes, Intuitively, an i.i.d. uniform sample of quadrature nodes will however leave “holes”; see Figure 1 (a). In words, given a realization of the nodes, it is possible to insert a few large balls in that do not contain any . These holes may make us miss some large variations of . Part of the variance of the Monte Carlo estimator in Equation 1 could intuitively be removed if we managed to fill these holes, using some of the nodes that got lumped together by chance.

Many sampling algorithms, such as randomized quasi-Monte Carlo, impose similar space-filling constraints, yielding a random sample with guarantees of “well-spreadedness”. In the paper I describe in this post, Diala Hawat and her two advisors (Raphaël Lachièze-Rey and myself) obtained variance reduction by explicitly trying to fill the holes left by a realization of . In the remainder of the post, I will describe Diala’s main theoretical result.

(a) A Poisson sample

(b) The same sample after repulsion

Figure 1: Note how the repelled sample has fewer visible “holes” and “lumps”. The details of how we implemented the repulsion are interesting in themselves, and can be found in the paper and the associated code.

The basic intuition is to imagine the quadrature nodes as electrons. In physics, electrons (like all charged particles) are subject to the Coulomb force. The Coulomb force exerted by one electron onto another points away from the first electron, with a magnitude that is inversely proportional to the $(d-1)$th power of the Euclidean distance between the two, where $d$ is the dimension of the ambient space. As a result, electrons tend to repel each other, and electrons close to you will push you away harder than electrons at the other side of the support of $f$. This is the behaviour that we would like to emulate, so that our quadrature nodes avoid lumping together and rather go and fill holes where no particle causes any repulsion.

If we solved the differential equation implementing Coulomb’s repulsion on our i.i.d. nodes, however, the points would rapidly leave the support of and “go to infinity”, to make sure that the pairwise distances between nodes are as large as possible. One way to avoid this undesired behaviour is to consider an “infinite” uniform Monte Carlo sample in , so that, wherever an electron looks, there are an infinite number of electrons preventing it from escaping. To make the situation comparable with our initial -point estimator in Equation 1, we also require that there are roughly points inside the region where we integrate . Formally, we consider a homogeneous Poisson point process of intensity in , where is the volume of . Consider the modified Monte Carlo estimator This estimator is very similar to the -point crude Monte Carlo estimator , except the number of evaluations of in the sum is now Poisson-distributed, with mean and variance . What we have gained is that we can now intuitively apply the Coulomb force to the points of , and hope that both before and after repulsion, about points remain in our integration domain . Proving this remains technically thorny, however. For starters, for in , the series defining the Coulomb force exerted on by a collection of points in , namely is not absolutely convergent, so that the order of summation matters. However, it was observed as early as 1943 that, if you sum by increasing distance to the reference point , and is a homogeneous Poisson point process, then the (random) series converges almost surely. Interested readers are referred to a classical paper by Chatterjee, Peled, Peres, and Romik (2010) on the gravitational allocation of Poisson points, one of the inspirations behind Diala’s work.
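To make the Poisson-based estimator concrete, here is a minimal sketch, assuming the integration domain is the unit cube $[0,1]^d$ (volume 1) and that the estimator normalises the sum by the intensity, which is the choice that makes it unbiased. The function name `poisson_mc_estimate`, the toy integrand and all parameter values are hypothetical and only serve the illustration. Since only the points falling in the domain enter the sum, it is enough to sample the restriction of the Poisson process to the domain.

```python
import numpy as np

def poisson_mc_estimate(f, n_target, d, rng):
    """Estimate the integral of f over K = [0, 1]^d (assumed domain, volume 1).

    The nodes are the points of a homogeneous Poisson process of
    intensity rho = n_target / vol(K) that fall in K.
    """
    vol_K = 1.0
    rho = n_target / vol_K
    # Restricted to K, the process has a Poisson(rho * vol(K)) number of
    # points, placed independently and uniformly in K.
    n_points = rng.poisson(rho * vol_K)
    nodes = rng.uniform(size=(n_points, d))
    # Assumed normalisation: divide the sum by the intensity rho.
    return f(nodes).sum() / rho

rng = np.random.default_rng(0)
f = lambda x: np.prod(np.cos(x), axis=1)  # toy integrand, integral = sin(1)^d
estimates = np.array([poisson_mc_estimate(f, 500, 2, rng) for _ in range(2000)])
print(estimates.mean(), np.sin(1.0) ** 2)
```

Averaged over many repetitions, the estimates should concentrate around the true integral, even though the number of evaluations changes from one run to the next.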

Putting (important) technical issues aside, we are ready to state the main result of our paper. We prove that, for , the repelled Poisson point process is well-defined, and has on average points in . Moreover, is an unbiased estimator of . Finally, if is , for small enough, the variance of is lower than that of . To sum up, for any integrand, we can in principle reduce the variance of our Monte Carlo estimator by slightly repelling the quadrature nodes away from each other. This is it: by breaking lumps and filling holes in a postprocessing step, we obtain variance reduction over crude Monte Carlo. The proof is not trivial, and relies on the super-harmonicity of the potential behind the Coulomb force.
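A single repulsion step can be sketched as follows. This is only a rough illustration under strong assumptions, not the implementation from the paper: the Poisson process is sampled on a finite window around the unit cube, the force on each point is truncated to neighbours within a fixed radius (summed by increasing distance), and the dimension, intensity, step size and radius are arbitrary values picked for the example. The names `coulomb_force` and `repel` are hypothetical.

```python
import numpy as np

def coulomb_force(x, points, radius):
    """Truncated Coulomb force exerted on x by `points`, in dimension d.

    Each point z pushes x away with magnitude 1 / |x - z|^(d - 1).
    Only points within `radius` of x are kept (an assumption made here to
    get a finite sum), and they are summed by increasing distance.
    """
    d = x.shape[0]
    diffs = x - points
    dists = np.linalg.norm(diffs, axis=1)
    keep = (dists > 0) & (dists <= radius)  # also drops x itself
    order = np.argsort(dists[keep])
    diffs, dists = diffs[keep][order], dists[keep][order]
    return (diffs / dists[:, None] ** d).sum(axis=0)

def repel(points, epsilon, radius):
    """Move every point by epsilon times the force it experiences."""
    forces = np.array([coulomb_force(x, points, radius) for x in points])
    return points + epsilon * forces

rng = np.random.default_rng(1)
d, rho, margin = 3, 200.0, 0.5  # arbitrary illustrative values
# Poisson sample on a window larger than K = [0, 1]^d, so that points
# in K have neighbours on all sides.
side = 1.0 + 2 * margin
n = rng.poisson(rho * side ** d)
points = rng.uniform(-margin, 1.0 + margin, size=(n, d))
repelled = repel(points, epsilon=1e-5, radius=margin)

in_K = lambda p: np.all((p >= 0) & (p <= 1), axis=1)
print(in_K(points).sum(), in_K(repelled).sum())  # both should be near rho
```

The Monte Carlo estimator is then evaluated on the repelled points that fall in the integration domain, exactly as for the original Poisson sample.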

Let me close with two further pointers to the paper. First, we discuss a particular value of the “step size” parameter in the paper, which has an easily-implemented closed form, and reliably led to variance reduction across our experiments. Second, while our theoretical results only cover the Poisson case so far, we also show experiments on (stationary) point processes other than Poisson, which suggest that variance reduction is also achieved across point processes with varying second-order structure. In Monte Carlo terms, and being very optimistic, some sort of repulsion might become a standard postprocessing step in the future, to reduce the variance of one’s estimator, independently of the law of the nodes (Markov chain, thinned PDMP, you name it).

particles version 0.4: single-run variance estimates, FFBS variants, nested sampling

Nicolas Chopin — Thu, 31 Aug 2023 22:00:00 GMT

Version 0.4 of particles has just been released. Here are the main changes:

Single-run variance estimation for waste-free SMC

Waste-free SMC (Dau & Chopin, 2020) was already implemented in particles (since version 0.3), and even proposed by default. This is a variant of SMC samplers where you resample only $M$ particles (with $M$ much smaller than the total number $N$ of particles), apply to each resampled particle $P - 1$ MCMC steps, and then gather these $M \times P = N$ states to form the next particle sample; see the paper if you want to know why this is a good idea (short version: this tends to perform better than standard SMC samplers, and to be more robust to the choice of the number of MCMC steps).
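The move step of waste-free SMC can be sketched in a few lines. This is a toy illustration that does not use the particles API: the notation (M resampled particles, P - 1 MCMC steps per resampled particle), the random-walk Metropolis kernel, the Gaussian target and the function names are all assumptions made for the example, and the reweighting step of the sampler is left out.

```python
import numpy as np

def rw_metropolis_step(x, log_target, scale, rng):
    """One random-walk Metropolis step applied to each row of x."""
    proposal = x + scale * rng.standard_normal(x.shape)
    log_ratio = log_target(proposal) - log_target(x)
    accept = np.log(rng.uniform(size=x.shape[0])) < log_ratio
    return np.where(accept[:, None], proposal, x)

def waste_free_move(particles, weights, M, P, log_target, scale, rng):
    """Resample M particles, run P - 1 MCMC steps from each, keep all states."""
    idx = rng.choice(particles.shape[0], size=M, p=weights)
    states = [particles[idx]]
    for _ in range(P - 1):
        states.append(rw_metropolis_step(states[-1], log_target, scale, rng))
    # The M starting points and all intermediate states form the new sample.
    return np.concatenate(states, axis=0)  # shape (M * P, dim)

rng = np.random.default_rng(0)
N, M, P, dim = 1000, 50, 20, 2
log_target = lambda x: -0.5 * np.sum(x ** 2, axis=1)  # toy target: standard Gaussian
particles = rng.standard_normal((N, dim))
weights = np.full(N, 1.0 / N)
new_particles = waste_free_move(particles, weights, M, P, log_target, 0.5, rng)
print(new_particles.shape)
```

The point of the construction is that no intermediate MCMC state is thrown away: every state visited by the M chains ends up in the next particle sample.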

What was not yet implemented (but are, in this version) are the single-run variance estimates proposed in the same paper. Here is a simple illustration:

Both plots were obtained from runs of waste-free IBIS (i.e. target at time is the posterior based on the first observations, ) applied to Bayesian logistic regression and the Pima Indians dataset. The red line is the empirical variance of the output, and, since the number of runs is large, it should be close to the true variance. The lower (resp. upper) limit of the grey area is the (resp. ) quantile of the single-run variance estimates obtained from these runs. The considered output is either the posterior mean of the intercept (top) or the log marginal likelihood (bottom).

We can see from these plots that these single-run estimates are quite reliable, and make it possible, in case one uses IBIS, to obtain error bars even from a single run. See the documentation of module smc_samplers (or the scripts in papers/wastefreeSMC) for more details on how you may get such estimates.

New FFBS variants

I have already mentioned in a previous post, on the old blog, that particles now implements new FFBS algorithms (i.e. particle smoothing algorithms that rely on a backward step) that were proposed in this paper. On top of that, particles now also includes a hybrid version of the Paris algorithm.

Nested sampling

I was invited to this nested sampling workshop in Munich, so this gave me some incentive to:

clean up and document the “vanilla” nested sampling implementation which was in module nested.
add to the same module the NS-SMC samplers of Salomone et al (2018) to play with them and do some numerical experiments to illustrate my talk.

I will blog shortly about the interesting results I found (which essentially are in line with Salomone et al).

Other minor changes

Several distributions and a dataset (Liver) were added, see the change log.

Logo

I’ve added a logo. It’s… not great; if anyone has suggestions on how to design a better logo, I am all ears.

What’s next?

I guess what’s still missing from the package is stuff like:

the ensemble Kalman filter, which would be reasonably easy to add, and would be useful in various problems;
advanced methods to design better proposals, such as controlled SMC (Heng et al, 2020) or the iterated auxiliary particle filter (Guarniero et al, 2017).

If you have other ideas, let me know.

Feedback

I have not yet looked into how to enable comments on a quarto blog. You can comment by replying to this post on Mastodon, or to the same post on LinkedIn (coming soon); or you can raise an issue on github or send me an e-mail, of course.

Better than Monte Carlo (this post is not about QMC)

Nicolas Chopin — Fri, 18 Aug 2023 22:00:00 GMT

(This is a repost of this December 2022 post on the old website, but since math support is so poor on Wordpress, I’d rather have this post published here.)

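Say I want to approximate the integral
$$I(f) := \int_{[0,1]^s} f(u)\,du$$
based on $n$ evaluations of function $f$. I could use plain old Monte Carlo:
$$\hat{I}_{\mathrm{MC}}(f) = \frac{1}{n}\sum_{i=1}^n f(U_i), \qquad U_i \overset{\text{iid}}{\sim} \mathcal{U}([0,1]^s),$$
whose RMSE (root mean square error) is $O(n^{-1/2})$.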

Can I do better? That is, can I design an alternative estimator/algorithm, which performs $n$ evaluations and returns a random output, such that its RMSE converges more quickly?

Surprisingly, the answer to this question has been known for a long time. If I am ready to focus on functions $f \in \mathcal{C}^r([0,1]^s)$, Bakhvalov (1959) showed that the best rate I can hope for is
$$\mathrm{RMSE} = O(n^{-1/2 - r/s}),$$
where $n$ is the number of evaluations. That is, there exist algorithms that achieve this rate, and algorithms achieving a better rate simply do not exist.

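Ok, but how can I actually design such an algorithm? The proof of Bakhvalov contains a very simple recipe. Say I am able to construct a good approximation $f_n$ of $f$, based on $n$ evaluations; assume the approximation error is $\|f - f_n\|_\infty = O(n^{-\alpha})$, $\alpha > 0$. Then I could compute the following estimator, based on a second batch of $n$ evaluations:
$$\hat{I}(f) := I(f_n) + \frac{1}{n}\sum_{i=1}^n (f - f_n)(U_i), \qquad U_i \overset{\text{iid}}{\sim} \mathcal{U}([0,1]^s),$$
and it is easy to check that this new estimator is unbiased, that its variance is $O(n^{-1-2\alpha})$, and therefore its RMSE is $O(n^{-1/2-\alpha})$. (It is based on $2n$ evaluations.)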
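The recipe above can be sketched in a few lines, under the assumption that the approximation and its exact integral are available. The function name `control_variate_estimate`, the toy integrand (the exponential on the unit interval) and its Taylor approximation are hypothetical choices made for this illustration only.

```python
import numpy as np

def control_variate_estimate(f, f_approx, integral_of_approx, n, s, rng):
    """Exact integral of the approximation, plus a Monte Carlo estimate
    of the integral of the residual f - f_approx over [0, 1]^s."""
    u = rng.uniform(size=(n, s))
    return integral_of_approx + np.mean(f(u) - f_approx(u))

rng = np.random.default_rng(0)
f = lambda u: np.exp(u[:, 0])
# Second-order Taylor expansion of exp at 0, integrated exactly on [0, 1].
f_approx = lambda u: 1.0 + u[:, 0] + 0.5 * u[:, 0] ** 2
integral_of_approx = 1.0 + 1.0 / 2.0 + 1.0 / 6.0

estimate = control_variate_estimate(f, f_approx, integral_of_approx, 10_000, 1, rng)
print(estimate, np.e - 1.0)
```

The estimator stays unbiased whatever the quality of the approximation; a better approximation only shrinks the residual, and hence the variance.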

So there is a strong relation between Bakhvalov’s results and function approximation. In fact, the best rate you can achieve for the latter is $O(n^{-r/s})$ (for functions of $s$ variables with $r$ continuous derivatives), which explains the rate above for stochastic quadrature. You can see now why I gave this title to this post. QMC is about using points that are better than random points. But here I’m using IID points, and the improved rate comes from the fact that I use a better approximation of $f$.

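Here is a simple example of a good function approximation. Take dimension $s = 1$, and
$$f_n(u) = \sum_{i=1}^n f\left(\frac{2i-1}{2n}\right) \mathbb{1}_{\left[\frac{i-1}{n}, \frac{i}{n}\right)}(u),$$
that is, split $[0,1]$ into $n$ intervals $[(i-1)/n, i/n)$, and approximate $f$ inside a given interval by its value at the centre of the interval. You can quickly check that the approximation error is then $O(n^{-1})$ provided $f$ is $\mathcal{C}^1$. So you get a simple recipe to get the optimal rate for $s = 1$ and $r = 1$.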
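Here is a minimal numerical sketch of this one-dimensional construction, compared with plain Monte Carlo using the same total number of evaluations. The function names, the test integrand (the exponential) and the sample sizes are assumptions made for the illustration.

```python
import numpy as np

def piecewise_constant(f, n):
    """Approximate f on [0, 1] by its value at the centre of each of n cells.

    Returns the approximation and its exact integral.
    """
    centres = (np.arange(n) + 0.5) / n
    values = f(centres)  # first batch: n evaluations of f
    def f_n(u):
        cell = np.minimum((u * n).astype(int), n - 1)
        return values[cell]
    return f_n, values.mean()  # each cell has length 1 / n

def corrected_estimate(f, n, rng):
    """Integral of the approximation plus Monte Carlo on the residual."""
    f_n, integral_f_n = piecewise_constant(f, n)
    u = rng.uniform(size=n)  # second batch: n evaluations of f
    return integral_f_n + np.mean(f(u) - f_n(u))

def plain_mc(f, n, rng):
    return np.mean(f(rng.uniform(size=n)))

rng = np.random.default_rng(0)
truth, n, reps = np.e - 1.0, 100, 2000
err_new = np.array([corrected_estimate(np.exp, n, rng) - truth for _ in range(reps)])
err_mc = np.array([plain_mc(np.exp, 2 * n, rng) - truth for _ in range(reps)])
rmse_new = np.sqrt(np.mean(err_new ** 2))
rmse_mc = np.sqrt(np.mean(err_mc ** 2))
print(rmse_new, rmse_mc)
```

Both estimators use $2n$ evaluations; repeating the experiment for several values of $n$ is a quick way to see the two different convergence rates.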

Is it possible to generalise this type of construction to any smoothness $r$ and any dimension $s$? The answer is in our recent paper with Mathieu Gerber, which you can find here. You may also want to read Novak (2016), which is a very good entry on stochastic quadrature, and in particular gives a nice overview of Bakhvalov’s and related results.

Welcome to the new, quarto-based version of Statisfaction

Nicolas Chopin — Fri, 18 Aug 2023 22:00:00 GMT

Hey! We have just moved this blog from Wordpress to github. The old version is still available here. The new version is based on quarto, which will make it much easier to write mathematics, e.g. , and code, e.g.

import numpy as np

def fact(n):
    # n! as the product 1 * 2 * ... * n; int64 overflows silently for n > 20
    return np.prod(np.arange(1, n + 1, dtype=np.int64))