Model Misspecification and Simulation-Based Inference
July 2025
Model misspecification is common when analyzing real-world data, and its impact can be particularly severe when using machine learning–based approaches such as simulation-based inference. In our new work, led by my former intern Sébastien Pierre, we address this issue by developing a simple and effective strategy to stabilize inference and better exploit our data when observations fall outside the training distribution. Applied to galaxy clustering analyses, our method significantly improves robustness across models while preserving most of the available information. Check out our preprint!
Update 12/25: Accepted in Physical Review D!
Jump into The Well
December 2024
Happy to announce the release of The Well, a large-scale physics dataset collection for machine learning, featuring over 15 TB of data! This resource spans diverse domains, including biological systems, fluid dynamics, acoustic scattering, extragalactic magnetohydrodynamics, and supernova explosions. This project, led by Ruben Ohana and Michael McCabe, will support the training of our physics foundation models at Polymathic AI. We hope it will also serve as a valuable resource for the entire machine learning for science community, sparking new discoveries and innovations. Dive in and explore our website and GitHub repository!
Accepted at NeurIPS 2024!
Bayesian Blind Denoising with Gibbs Diffusion
February 2024
Blind denoising problems are not exclusive to natural image processing; they are also prevalent in many scientific applications where the noise distribution is unknown or hard to model. In our new preprint, we introduce GDiff, a novel solution to blind denoising in a fully Bayesian context. By combining Gibbs sampling and a diffusion model, we build a rigorous method to sample the posterior distribution of the signal and the noise parameters for any kind of diffusion-based signal prior!
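To give a feel for the alternation described above, here is a minimal toy sketch of a Gibbs sampler for blind denoising. It is not GDiff itself: as an assumption made purely for illustration, the diffusion-based signal prior is swapped for a standard Gaussian prior and the noise variance gets an inverse-gamma prior, so that both conditional distributions can be sampled in closed form. All variable names and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: signal x ~ N(0, 1), observed with an unknown noise variance.
n = 5000
true_sigma2 = 0.25
x_true = rng.normal(size=n)
y = x_true + np.sqrt(true_sigma2) * rng.normal(size=n)

# Weak inverse-gamma prior on the noise variance (illustrative choice).
a0, b0 = 2.0, 1.0

n_iter, n_burn = 2000, 500
sigma2 = 1.0  # arbitrary starting point
sigma2_chain = np.empty(n_iter)
x_mean = np.zeros(n)

for t in range(n_iter):
    # Step 1: sample the signal given the data and the current noise level.
    # With a N(0, 1) prior this conditional is Gaussian in closed form;
    # a diffusion-based sampler would take this role for a learned prior.
    post_var = sigma2 / (1.0 + sigma2)
    x = y / (1.0 + sigma2) + np.sqrt(post_var) * rng.normal(size=n)

    # Step 2: sample the noise variance given the residual y - x.
    a_post = a0 + 0.5 * n
    b_post = b0 + 0.5 * np.sum((y - x) ** 2)
    sigma2 = 1.0 / rng.gamma(a_post, 1.0 / b_post)

    sigma2_chain[t] = sigma2
    if t >= n_burn:
        # Running posterior-mean estimate of the signal.
        x_mean += x / (n_iter - n_burn)

sigma2_hat = sigma2_chain[n_burn:].mean()
mse_noisy = np.mean((y - x_true) ** 2)
mse_denoised = np.mean((x_mean - x_true) ** 2)
print(f"posterior mean of sigma^2: {sigma2_hat:.3f}")
print(f"MSE noisy: {mse_noisy:.3f}, MSE denoised: {mse_denoised:.3f}")
```

The chain yields joint samples of the signal and the noise variance, which is the structure of the problem; the real difficulty lies in Step 1, where a realistic signal prior has no closed-form conditional.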
We show that GDiff is directly relevant to the analysis of cosmic microwave background (CMB) data, by taking an original view on the problem of separating the CMB from its foregrounds. Have you ever thought of the CMB as the noise of a blind denoising problem, and the foregrounds as the signal? From that perspective, we show that GDiff can directly separate dust and CMB while solving cosmological inference at the same time! Stay tuned for future applications to observational data!
Update 05/24: Accepted at ICML 2024!
Removing Dust from CMB Observations with Diffusion Models
October 2023
Diffusion models have revolutionized the modeling of natural images. Can they also help us analyze cosmic microwave background (CMB) data? Thanks to my talented intern David Heurtel-Depeiges, and the collaboration of Blakesley Burkhart and Ruben Ohana, we give a first demonstration of the potential of diffusion models for the separation of Galactic dust and CMB. We show that dust+CMB observations can be seen as the result of a diffusion process that can be reversed in time, thus naturally solving source separation.
We are already working on the next step: a diffusion-based approach for cosmological inference. Stay tuned!
Update 11/23: Spotlight talk at ML4PS NeurIPS 2023 Workshop!
Stacking for Simulation-Based Inference
October 2023
With simulation-based inference, it is typical to end up with a multitude of models/approximations of the same target posterior distribution. This usually results from investigating different inference algorithms or architectures, or simply from the randomness of initialization and stochastic gradients. While most practitioners select the best of their models, with Yuling Yao and Justin Domke, we show that there is a much better option, and it's called stacking. We show that models can all be combined at once in a systematic way to improve precision, calibration, coverage, and bias at the same time. Check out our new preprint on Simulation-Based Stacking!
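As a rough illustration of the idea, here is a minimal sketch of one simple form of stacking: a weighted mixture of approximate posteriors, with weights chosen to maximize the average log density of reference parameter draws. This is an assumption-laden toy, not necessarily the formulation used in the preprint: the weights are constant (they do not depend on the observation), they are fitted with a plain EM update, and the function names `stacking_weights` and `mixture_log_score` are hypothetical.

```python
import numpy as np

def stacking_weights(log_q, n_iter=200):
    """Fit mixture weights on the simplex by EM.

    log_q has shape (n_samples, n_models) and holds log q_k(theta_i),
    the log density of each approximate posterior at reference draws.
    """
    n, k = log_q.shape
    w = np.full(k, 1.0 / k)
    # Subtract the row-wise max for stability; responsibilities are unchanged.
    q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
    for _ in range(n_iter):
        resp = q * w
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)
    return w

def mixture_log_score(log_q, w):
    """Average log density of the weighted mixture."""
    m = log_q.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.exp(log_q - m) @ w)))

def gaussian_logpdf(theta, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((theta - mean) / std) ** 2

rng = np.random.default_rng(0)
theta = rng.normal(size=2000)  # stand-in for draws from the true posterior

# Three imperfect approximations: two biased, one overdispersed.
models = [(-0.5, 1.0), (0.5, 1.0), (0.0, 3.0)]
log_q = np.stack([gaussian_logpdf(theta, m, s) for m, s in models], axis=1)

weights = stacking_weights(log_q)
stacked = mixture_log_score(log_q, weights)
singles = log_q.mean(axis=0)  # log score of each model on its own
print("weights:", np.round(weights, 3))
print("best single log score:", singles.max(), "stacked:", stacked)
```

In this toy the scores are computed on the same draws used to fit the weights; in practice one would fit on simulated parameter-data pairs and evaluate on held-out ones.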
Update 01/24: Accepted at AISTATS 2024!
SimBIG Collaboration: Second Wave of Papers
October 2023
We are taking simulation-based inference for the analysis of galaxy clustering to the next level with our second release of papers! We now explore galaxy clustering data through the lenses of the wavelet scattering transform, convolutional networks, and bispectrum statistics. For each of these, we get new cosmological constraints leveraging non-linear information from the data. Check out our new website for more information!
With Michael Eickenberg, we led the wavelet scattering transform (WST) analysis. The WST statistics capture a wealth of non-Gaussian information from the data, improving constraints on cosmological parameters. However, we show in our paper that these statistics might be too rich, as they can also capture unrealistic specifics of the forward models, raising model misspecification issues when applied to observational data. Our next challenge will be to address this in detail!
Update 02/24: Accepted in PRD!
Polymathic AI and Multiple Physics Pretraining
October 2023
I am lucky to be part of the amazing Polymathic AI initiative, which aims to create a foundation model for advancing scientific discovery. We recently released a series of papers. Check out our blog to find out more!
In particular, in a project led by Michael McCabe, we introduce “Multiple Physics Pretraining”, an autoregressive task-agnostic pretraining approach for physical surrogate modeling. In this paper, we notably show that a single transformer model trained on a broad range of physical tasks can perform better than task-specific models on a variety of downstream applications.
Update 12/24: Accepted at NeurIPS 2024!
Statistical Component Separation for Targeted Signal Recovery in Noisy Mixtures
June 2023
In a 2021 paper, we introduced a new algorithm to separate astrophysical signals with very distinctive statistical natures. Since then, this method has found use in various astrophysical applications such as the denoising of dust emission maps, the separation of dust and CIB, or the removal of glitches in seismic data from the InSight Mars mission. With Michael Eickenberg, we now explore some mathematical aspects of this method and provide first denoising benchmarks in our new preprint.
Update 02/24: Published in TMLR!
SimBIG: Simulation-Based Inference of Galaxies
November 2022
Glad to announce the release of the first two papers of the SimBIG collaboration (led by ChangHoon Hahn): letter, mock challenge. The SimBIG framework enables the analysis of cosmological information from galaxy surveys on small nonlinear scales using simulation-based inference. It relies on the SimBIG forward model, which connects the cosmological parameters to realistic mock galaxy surveys. Take a look at how this model compares to BOSS data!
Update 10/23: Published in PNAS and JCAP!
Generative Models of Multi-frequency Dust Emission Maps
August 2022
Check out our recent paper, where we use the Wavelet Phase Harmonic statistics to build generative models of multi-frequency dust emission maps from a single example. Want to try this on your own data? Take a look at the code associated with the paper.
Update 01/23: Published in the Astrophysical Journal!
Wavelet Moments for Cosmological Parameter Estimation
April 2022
I was recently involved in the Eickenberg et al. paper, which introduced a new set of wavelet statistics, called "Wavelet Moments", to extract non-Gaussian information from 3D cosmological fields. Fisher forecasts based on the Quijote simulations show that these statistics improve constraints on the cosmological parameters by a factor of 5 to 10 with respect to the power spectrum baseline.
Ph.D. Thesis: Statistical Modeling of the Polarized Emission of Interstellar Dust
November 2021
I conducted my Ph.D. research at the LPENS, École Normale Supérieure, Paris, under the supervision of François Levrier and François Boulanger. My work was motivated by challenges in analyzing cosmic microwave background (CMB) data. I focused on the statistical modeling of one of the CMB foregrounds, namely the emission of interstellar dust. These foregrounds constitute major obstacles for the next generation of CMB experiments. I developed data-driven models using the wavelet scattering transform — a technique closely related to the mathematics of convolutional neural networks. You can learn more about this in my Ph.D. thesis.