A Unification of Discrete, Gaussian, and Simplicial Diffusion

Nuria Alina Chandra*, Yucen Lily Li*, Alan N Amin*, Alex Ali, Joshua Rollins, Andrew Gordon Wilson
*Equal Contribution

Description

To model discrete sequences such as DNA, proteins, and language using diffusion, practitioners must choose between three major methods: diffusion in discrete space, Gaussian diffusion in Euclidean space, or diffusion on the simplex. Despite their shared goal, these models have disparate algorithms, theoretical structures, and tradeoffs: discrete diffusion has the most natural domain, Gaussian diffusion has more mature algorithms, and diffusion on the simplex in principle combines the strengths of the other two but in practice suffers from a numerically unstable stochastic process. Ideally, we could see each of these models as an instance of the same underlying framework and enable practitioners to switch between models for downstream applications. However, previous theories have only considered connections in special cases. Here we build a theory unifying all three methods of discrete diffusion as different parameterizations of the same underlying process: the Wright-Fisher population genetics model. In particular, we find simplicial and Gaussian diffusion as two large-population limits. Our theory formally connects the likelihoods and hyperparameters of these models and leverages decades of mathematical genetics literature to unlock stable simplicial diffusion. Finally, we relieve the practitioner of balancing model trade-offs by demonstrating that it is possible to train a single model that can perform diffusion in any of these three domains at test time. Our experiments show that Wright-Fisher simplicial diffusion is more stable and outperforms previous simplicial diffusion models on conditional DNA generation. We also show that we can train models on multiple domains at once that are competitive with models trained on any individual domain.

This codebase implements the unified Wright-Fisher diffusion framework along with instantiations corresponding to discrete, Gaussian, and simplicial diffusion. We also provide instructions to train unified models on protein data.
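
For readers unfamiliar with the Wright-Fisher model named in the description, the following is a minimal toy simulation of the textbook process with uniform mutation. It is not taken from this codebase: the function names, the mutation rule, and all parameter values are illustrative assumptions. It only shows that the allele-frequency vector always lies on the simplex, that a population of one is a single token, and that a large population gives frequencies that change in small steps.

```python
import random
from collections import Counter


def wright_fisher_step(counts, mutation_prob, rng):
    """One generation of a textbook Wright-Fisher model (illustrative only)."""
    alleles = list(range(len(counts)))
    population = sum(counts)
    # Each child copies the allele of a uniformly chosen parent, which is
    # multinomial sampling with the current allele frequencies.
    children = rng.choices(alleles, weights=counts, k=population)
    # Each child independently mutates to a uniformly random allele.
    children = [
        rng.choice(alleles) if rng.random() < mutation_prob else allele
        for allele in children
    ]
    tally = Counter(children)
    return [tally.get(allele, 0) for allele in alleles]


def simulate(start_allele, num_alleles, population, mutation_prob,
             generations, seed=0):
    """Start from a population fixed on one allele; return final frequencies."""
    rng = random.Random(seed)
    counts = [0] * num_alleles
    counts[start_allele] = population
    for _ in range(generations):
        counts = wright_fisher_step(counts, mutation_prob, rng)
    return [count / population for count in counts]


if __name__ == "__main__":
    # Hypothetical settings: 4 alleles (e.g. DNA bases), 1000 individuals.
    print(simulate(start_allele=2, num_alleles=4, population=1000,
                   mutation_prob=0.01, generations=50))
    # With a population of one, the state is always a single token (one-hot).
    print(simulate(start_allele=2, num_alleles=4, population=1,
                   mutation_prob=0.5, generations=10))
```
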

Code Usage

Installation

From the repository root, install the dependencies by running pip install . with a recent version of Python.

Train protein models

To train protein models, you can download Uniref50 data from here. Place this data in data/uniref_2020/uniref50/. Then you can train a unified model to do Gaussian, discrete, and simplicial diffusion by running

python3 train.py --config-name=protein_unified

Train DNA models

You can download the Enhancer design dataset from here and place it so that it has the filepath data/the_code/.... Then you can train a simplicial diffusion model by running

python3 train.py --config-name=dna_simplicial

Other usage

You can customize the training setup by adding a config file at configs/NEW.yaml and running python3 train.py --config-name=NEW.

The train parameters control the training of the diffusion model.
The architecture parameters control the underlying architecture.
The model parameters control the diffusion model setup.
- model.model can be set to UnifiedDiffusion, GaussianDiffusion, DiscreteDiffusion, or SimplicialDiffusion.
- model.schedule_type controls the noise rate function $\beta(t)$ and can be set to linear or cos.
- model.forward_kwargs controls the forward process, and model.forward_kwargs.ssp determines the usage of the sufficient statistic parameterization. Make sure that ssp is set to true for unified models.
- model.restart can be set to the folder of a checkpoint to restart training from it.
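
The constraints listed above can be checked before launching a run. The sketch below is not part of the codebase: check_model_config is a hypothetical helper, and it assumes the model section of a config has already been loaded into a plain Python dict whose keys mirror the option names above (model, schedule_type, forward_kwargs.ssp). The allowed values are the ones stated in this README.

```python
# Allowed values as listed in this README.
ALLOWED_MODELS = {
    "UnifiedDiffusion",
    "GaussianDiffusion",
    "DiscreteDiffusion",
    "SimplicialDiffusion",
}
ALLOWED_SCHEDULES = {"linear", "cos"}


def check_model_config(model_cfg):
    """Return a list of problems found in the `model` section (hypothetical helper)."""
    problems = []

    name = model_cfg.get("model")
    if name not in ALLOWED_MODELS:
        problems.append(f"model.model={name!r} is not one of {sorted(ALLOWED_MODELS)}")

    schedule = model_cfg.get("schedule_type")
    if schedule not in ALLOWED_SCHEDULES:
        problems.append(
            f"model.schedule_type={schedule!r} is not one of {sorted(ALLOWED_SCHEDULES)}"
        )

    # forward_kwargs may be missing or empty; treat that as ssp not being set.
    forward_kwargs = model_cfg.get("forward_kwargs") or {}
    if name == "UnifiedDiffusion" and forward_kwargs.get("ssp") is not True:
        problems.append("model.forward_kwargs.ssp must be true for unified models")

    return problems


if __name__ == "__main__":
    example = {
        "model": "UnifiedDiffusion",
        "schedule_type": "cos",
        "forward_kwargs": {"ssp": False},
    }
    for problem in check_model_config(example):
        print(problem)
```

Running the example reports the ssp setting, since the unified model is selected without the sufficient statistic parameterization.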

Citation

To cite this paper, please use

@misc{chandra2025unificationdiscretegaussiansimplicial,
      title={A Unification of Discrete, Gaussian, and Simplicial Diffusion}, 
      author={Nuria Alina Chandra and Yucen Lily Li and Alan N. Amin and Alex Ali and Joshua Rollins and Sebastian W. Ober and Aniruddh Raghu and Andrew Gordon Wilson},
      year={2025},
      eprint={2512.15923},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.15923}, 
}
