Michael Tschannen (@mtschannen) / X

258 posts

Michael Tschannen

@mtschannen

Research Scientist @GoogleDeepMind. Representation learning for multimodal understanding and generation. Personal account.

Zurich, Switzerland

Joined June 2012

Pinned
Michael Tschannen
@mtschannen
Jun 3
For the past few years my research focus has been on unifying models and training paradigms across modalities. Today I'm excited that we're releasing our latest model aligned with this theme: Gemma 4 12B, a dense encoder-free model which processes raw text, image, and audio inputs! 1/
Michael Tschannen
@mtschannen
Dec 5, 2023
Decoder-only models only work with discrete tokens, right? 🤔 Excited to present 🎁GIVT: Generative Infinite-Vocabulary Transformers, a simple way to generate arbitrary vector sequences with real-valued entries using transformer decoder-only models! arxiv.org/abs/2312.02116 1/
Michael Tschannen
@mtschannen
Dec 2, 2024
Have you ever wondered how to train an autoregressive generative transformer on text and raw pixels, without a pretrained visual tokenizer (e.g. VQ-VAE)? We have been pondering this during summer and developed a new model: JetFormer 🌊🤖 arxiv.org/abs/2411.19722 A thread 👇 1/
Michael Tschannen
@mtschannen
Sep 29, 2023
New paper! We ask: How important is VQ for neural discrete representation learning? 🤔 I always felt it should be possible to “absorb” the VQ on top of a deep net into the net and use a simple grid-based quantization instead, without sacrificing much expressivity. Summary 🧵👇
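The grid-based quantization mentioned in the post above can be illustrated with a small sketch. The post does not spell out the scheme, so the details here are assumptions: each latent entry is bounded with `tanh` and rounded to the nearest point on a per-dimension grid, and only odd level counts are handled. The names `grid_quantize` and `code_index` are hypothetical, and the straight-through gradient needed for training is not shown.

```python
import math


def grid_quantize(z, levels):
    """Bound each entry with tanh, then round it to the nearest grid point.

    Assumes every entry of `levels` is odd, so the grid for a dimension
    with L levels is the integers -(L-1)/2, ..., (L-1)/2.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2.0
        bounded = math.tanh(value) * half  # lies strictly inside (-half, half)
        codes.append(round(bounded))       # nearest integer grid point
    return codes


def code_index(codes, levels):
    """Map a tuple of grid codes to a single integer (mixed-radix encoding)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + (code + (num_levels - 1) // 2)
    return index


levels = [3, 5, 5]  # implicit codebook size: 3 * 5 * 5 = 75
codes = grid_quantize([0.0, 10.0, -10.0], levels)
print(codes, code_index(codes, levels))
```

The codebook is implicit in this sketch: its size is the product of the level counts, and no embedding table has to be learned or looked up.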
Michael Tschannen
@mtschannen
Mar 22, 2024
We just released a big 🎁GIVT update! 📈 Larger models and improved image generation results across the board 💡 Improved GMM formulation and adapter module 💻 Code, model checkpoints, and a colab are now available at github.com/google-researc… More details below... 1/
Michael Tschannen
@mtschannen
Jun 21, 2023
Have you ever wondered how to equip a vanilla ViT with vision, language, and multimodal capabilities using images only? We present a simple contrastive approach - today @CVPR AM-264 We also released ‣ code and models github.com/google-researc… ‣ a colab colab.research.google.com/github/google-…
AK
@_akhaliq
Dec 16, 2022
Image-and-Language Understanding from Pixels Only abs: arxiv.org/abs/2212.08045
Michael Tschannen
@mtschannen
Feb 22, 2025
📢2⃣ Yesterday we released SigLIP 2! TL;DR: Improved high-level semantics, localization, dense features, and multilingual capabilities via drop-in replacement for v1. Bonus: Variants supporting native aspect and variable sequence length. A thread with interesting resources👇
Michael Tschannen
@mtschannen
Apr 17, 2020
New #CVPR2020 paper: Self-Supervised Learning of Video-Induced Visual Invariances (VIVI) arxiv.org/abs/1912.02783 We leverage invariances at the 1) frame level, 2) shot/clip level, 3) video level, to learn transferable image representations from raw videos in the wild. 1/4
Michael Tschannen
@mtschannen
Dec 5, 2023
I'm not much of a meme-creator, but for this one I couldn't resist... arxiv.org/abs/2312.02116
Michael Tschannen
@mtschannen
Dec 2, 2024
Replying to @mtschannen
Learning to generate high-fidelity images with maximum likelihood is tricky. To bias the model towards nicer-looking images we introduce a noise curriculum: Gaussian noise added to the input image and annealed to 0 during training, s.t. high-level details are learned first. 4/
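A noise curriculum of the kind described in this post could look roughly like the sketch below. The post only says that Gaussian noise is added to the input image and annealed to 0 during training; the cosine shape of the schedule, the starting level `sigma_start`, and the function names are assumptions made for illustration.

```python
import math
import random


def noise_std(step, total_steps, sigma_start=1.0):
    """Noise level for a training step: sigma_start at step 0, 0 at the end.

    The cosine shape is an assumption; any schedule decaying to 0 would fit
    the description in the post.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return sigma_start * 0.5 * (1.0 + math.cos(math.pi * progress))


def add_noise(pixels, step, total_steps, rng, sigma_start=1.0):
    """Add i.i.d. Gaussian noise with the scheduled level to a flat pixel list."""
    sigma = noise_std(step, total_steps, sigma_start)
    return [p + rng.gauss(0.0, sigma) for p in pixels]


rng = random.Random(0)
image = [0.1, 0.5, 0.9]
for step in (0, 50, 100):
    print(step, noise_std(step, 100), add_noise(image, step, 100, rng))
```

Early in training the heavy noise hides fine detail, so the model can only fit coarse structure; as the noise level falls, finer detail becomes learnable.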
Michael Tschannen
@mtschannen
Oct 2, 2023
How to combine the strengths of masked and autoregressive sequence models? M2T is one way to do this, including 🧑‍🏫 teacher-forced training ⏩ predicting multiple tokens per inference step according to a schedule 📁 activation caching #ICCV2023 1/4
Michael Tschannen
@mtschannen
Mar 12, 2023
It turns out that being smart about the patch embedding is enough to share a single ViT model across different patch sizes to adjust the accuracy/compute tradeoff. It was surprising to me how much more powerful the patch size is as a knob than e.g. depth.
Lucas Beyer (bl16)
@giffmana
Mar 10, 2023
Replying to @giffmana
The nice thing: These 3 models all have the exact same parameter shapes! The only thing that differs is their patchsize, influencing the "sequence length" and hence model capacity. This leads to the main idea: can we train a single B model that handles various patch-sizes? Yes!
Michael Tschannen
@mtschannen
Dec 2, 2024
Replying to @mtschannen
We leverage a normalizing flow (“jet”) to obtain a soft-token image representation that is end-to-end trained with a multimodal transformer for next-token prediction. The soft token distribution is modeled with a GMM à la GIVT. 2/
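To make the idea of modeling a real-valued soft token with a GMM concrete, here is a minimal sketch of the negative log-likelihood such a model would minimize for one token. It assumes diagonal covariances and takes the mixture parameters as plain lists; in the actual model these would be predicted by the transformer, which is not shown. The name `gmm_nll` is hypothetical.

```python
import math


def gmm_nll(x, weights, means, stds):
    """Negative log-likelihood of vector x under a mixture of diagonal Gaussians.

    weights: mixture weights summing to 1, one per component
    means, stds: per-component lists with one entry per dimension of x
    """
    log_terms = []
    for w, mu, sd in zip(weights, means, stds):
        log_p = math.log(w)
        for xi, mi, si in zip(x, mu, sd):
            log_p += (-0.5 * math.log(2.0 * math.pi)
                      - math.log(si)
                      - 0.5 * ((xi - mi) / si) ** 2)
        log_terms.append(log_p)
    # log-sum-exp over components for numerical stability
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))


# One 2-dimensional token under a 2-component mixture.
print(gmm_nll([0.2, -0.1],
              weights=[0.7, 0.3],
              means=[[0.0, 0.0], [1.0, 1.0]],
              stds=[[1.0, 1.0], [0.5, 0.5]]))
```

This loss plays the role that cross-entropy over a discrete vocabulary plays in a standard decoder-only model.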
Michael Tschannen
@mtschannen
May 9, 2019
Differentiable graphics renderers: A new enabler for representation learning? github.com/tensorflow/gra…