Aram Davtyan • Yusuf Sahin • Yasaman Haghighi • Sebastian Stapf • Pablo Acuaviva • Alexandre Alahi • Paolo Favaro
Abstract: Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a novel framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, the proposed attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.
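The iterative encoding procedure described in the abstract (observe a localized crop, integrate it into a fixed-budget discrete message, repeat) can be sketched as follows. This is a toy illustration only; every function, name, and update rule below is a hypothetical placeholder, not the actual COMiT implementation:

```python
# Toy sketch of the COMiT-style encoding loop described above.
# All functions, names, and shapes here are illustrative placeholders,
# NOT the actual COMiT API or implementation.

def observe_crop(image, step):
    """Placeholder: return a localized observation of the image at this step."""
    return image[step % len(image)]

def update_message(message, crop):
    """Placeholder: integrate the new observation into the fixed-budget
    token sequence, refining existing tokens rather than appending new ones."""
    slot = crop % len(message)            # which token to refine (toy rule)
    message[slot] = (message[slot] + crop) % 256  # toy discrete update
    return message

def encode(image, token_budget=8, num_iters=4):
    message = [0] * token_budget          # the token budget is fixed up front
    for step in range(num_iters):
        crop = observe_crop(image, step)
        message = update_message(message, crop)
    # In COMiT, the final message conditions a flow-matching decoder.
    return message

tokens = encode([3, 1, 4, 1, 5, 9], token_budget=8, num_iters=4)
print(len(tokens))  # the message length equals the budget, not the image size
```

The point of the sketch is the control flow: the message has a fixed length from the start, and each iteration rewrites it in place instead of growing it.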
- Clone the repository and navigate to the root directory:

```bash
git clone https://github.com/Araachie/comit.git
cd comit
```

- Create a new conda environment and install all the dependencies:

```bash
conda create -n comit python==3.11 -y
conda activate comit
pip install -e .
```

- [Optional] To enable Jupyter support, run the following instead of the last command:

```bash
pip install -e ".[notebook]"
python -m ipykernel install --user --name comit --display-name "Python (comit)"
```

We share the weights of the pretrained COMiT variants in three sizes.
| Model name | Layers | Model size | Dataset | Hugging Face Hub | Model weights |
|---|---|---|---|---|---|
| COMiT-B | 12 | 174M | ImageNet1k | cvg-unibe/comit-b | download |
| COMiT-L | 24 | 610M | ImageNet1k | cvg-unibe/comit-l | download |
| COMiT-XL | 28 | 900M | ImageNet1k | cvg-unibe/comit-xl | download |
For convenience, we have prepared a demo Jupyter notebook at ./notebooks/demo.ipynb that shows how to use COMiT; we recommend starting there. Below are some examples to get started quickly.
Example usage, downloading COMiT-XL from the Hugging Face Hub:
```python
import torch

from comit import COMiT

device = "cuda" if torch.cuda.is_available() else "cpu"

model = COMiT.from_pretrained("cvg-unibe/comit-xl")
model.eval().to(device)
```

With a pretrained COMiT model, images can be encoded into token sequences as follows:
```python
with torch.no_grad():
    token_dict = model.tokenize(
        batch,
        global_crop=False,  # whether to use the global crop as the first observation
        order="adaptive",   # one of ["raster_scan", "random", "adaptive"] or a list of crop indices
        num_crops=3,        # used to truncate the list of crops to embed
    )
```

By default, the tokenization pipeline returns a list of 256 6-dimensional tokens. If token indices are needed instead, they can be obtained via:
```python
token_ids = model.quantizer.codes_to_indices(token_dict["msgs"])
```

To visually probe the information in the token sequences, one can decode the tokens back into images:
```python
with torch.no_grad():
    detoken_dict = model.detokenize(
        msgs=token_dict["msgs"],
        offsets=token_dict["offsets"],
        num_steps=10,       # number of denoising steps
        odesolver="euler",  # the numerical velocity field integration method
        cfg_weight=7.5,     # CFG strength
    )
```

For convenience, we also provide the `reconstruct` method that pipelines `tokenize` and `detokenize` into a single call:
```python
with torch.no_grad():
    rec_dict = model.reconstruct(
        batch,
        global_crop=False,
        order="adaptive",
        num_crops=3,
        num_steps=10,
        odesolver="euler",
        cfg_weight=7.5,
    )
```

Unless otherwise noted, the code in this repository is licensed under LICENSE. This repository also includes third-party components under different licenses, some for non-commercial use only; see THIRD_PARTY_NOTICES.md. The overall project is intended for research and academic use.
If you find this repository helpful, please consider citing our work:
```bibtex
@misc{davtyan2026comit,
  title={Communication-Inspired Tokenization for Structured Image Representations},
  author={Aram Davtyan and Yusuf Sahin and Yasaman Haghighi and Sebastian Stapf and Pablo Acuaviva and Alexandre Alahi and Paolo Favaro},
  year={2026},
  eprint={2602.20731},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.20731},
}
```