Mu Cai (@MuCai7) / X

Mu Cai

@MuCai7

Research @thinkymachines | Previous: multimodal, agents @GoogleDeepMind

Mountain View

pages.cs.wisc.edu/~mucai/

Joined May 2019

Mu Cai
@MuCai7
Apr 1, 2025
I am thrilled to join @GoogleDeepMind as a Research Scientist and continue working on multimodal research!
Mu Cai
@MuCai7
Oct 28, 2025
Our team at Google DeepMind is looking for a research intern (Summer 2026)! Multimodal agentic model, unified model (world model). Looking for candidates with multiple first-author papers in top ML conferences and strong engineering skills. Email: [email protected]
Mu Cai
@MuCai7
Aug 27, 2025
Try Gemini-2.5-Flash Image Generation, the world's best image editing model!
Logan Kilpatrick
@OfficialLoganK
Aug 26, 2025
Introducing Gemini 2.5 Flash Image (aka nano-banana), our SOTA image generation and editing model 🍌 As you might have already seen, this model excels at character consistency, creative edits, and has Gemini's world knowledge!
Mu Cai
@MuCai7
Dec 7, 2024
🚨 I’ll be at #NeurIPS2024! 🚨 I’m on the industry job market this year and eager to connect in person! 🔍 My research explores multimodal learning, with a focus on object-level understanding and video understanding. 📜 3 papers at NeurIPS 2024: Workshop on Video-Language Models 📅
Mu Cai
@MuCai7
Oct 4, 2024
(1/N) All current video models understand videos poorly, even when the videos are less than 10 seconds long! The best model, GPT-4o, achieves 35.0 while humans get 90.0 in group score. Existing LMMs severely struggle to distinguish temporal differences in Vinoground vinoground.github.io
Mu Cai
@MuCai7
Apr 25, 2025
I am excited to announce that I am not at #ICLR presenting Matryoshka Multimodal Models matryoshka-mm.github.io. 😀 Rather, I am online in the Bay Area. Ping me if you have any questions or ideas about the paper! Feel free to read the poster at Hall 3 + Hall 2B #86 this morning!
Mu Cai
@MuCai7
May 28, 2024
Thanks to @_akhaliq for sharing! (1/N) We propose M3: Matryoshka Multimodal Models, arxiv.org/abs/2405.17430, which (1) significantly reduces the number of visual tokens while matching the performance of a vanilla LMM and (2) organizes visual tokens in a coarse-to-fine nested way.
arxiv.org
Matryoshka Multimodal Models
Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a fixed large number of visual tokens and then feed...
Mu Cai
@MuCai7
Jan 22, 2025
Two papers are accepted to @iclr_conf #iclr #ICLR2025 (1) Efficient Multimodal LLM — Matryoshka Multimodal Models matryoshka-mm.github.io (2) Multimodal for Robotics — LLaRA: Supercharging Robot Learning Data for Vision-Language Model Policy arxiv.org/abs/2406.20095 I’m
Mu Cai
@MuCai7
Dec 4, 2023
(1/N) Can a large multimodal model not only understand bboxes, but also understand arbitrary visual prompts (scribbles, arrows, etc.) without explicit region embedding? Yes! Our latest work ViP-LLaVA arxiv.org/abs/2312.00784 shows an extremely simple but effective approach.
Mu Cai
@MuCai7
May 12, 2025
Thank you @yong_jae_lee! Without the support from you and our group members, this work would not have been possible. I'll miss the days working in our group.
Yong Jae Lee
@yong_jae_lee
May 12, 2025
Congratulations Dr. Mu Cai @MuCai7! Mu is my 8th PhD student and first to start in my group at UW–Madison after my move a few years ago. He made a number of important contributions in multimodal models during his PhD, and recently joined Google DeepMind. I will miss you a lot Mu!
Mu Cai
@MuCai7
Oct 15, 2024
(1/N) Are current large multimodal models like #GPT4o really good at video understanding? 🚀 We are thrilled to introduce TemporalBench to examine temporal dynamics understanding for LMMs! Our TemporalBench reveals even the SOTA LMM #GPT4o achieves only 38.5, far from
Mu Cai
@MuCai7
Aug 18, 2025
We have an opening for Multimodal LLM!
Alireza Fathi
@alirezafathi
Aug 18, 2025
We are hiring job-boards.greenhouse.io/deepmind/jobs/…
Mu Cai
@MuCai7
Feb 4, 2025
Want the simplest way to apply a multimodal model (LLaVA) to robotics tasks? Check out LLaRA (github.com/LostXine/LLaRA, accepted to #ICLR2025), which gives you a vision-language-action (VLA) policy! Joint work with @XiangLi54505720, @ryoo_michael, et al. from Stony Brook U,
Xiang Li
@XiangLi54505720
Feb 3, 2025
(1/5) Excited to present our #ICLR2025 paper, LLaRA, at NYC CV Day! LLaRA efficiently transforms a pretrained Vision-Language Model (VLM) into a robot Vision-Language-Action (VLA) policy, even with a limited amount of training data. More details are in the thread. ⬇️
Mu Cai
@MuCai7
Jun 25, 2025
LLaVA-PruMerge, the first work on visual token reduction for MLLMs, finally got accepted after being cited 146 times since last year. Congrats to the team! @yuzhang_shang @yong_jae_lee See how to make MLLM inference much cheaper while maintaining performance. llava-prumerge.github.io
AI Bites | YouTube Channel
@ai_bites
Mar 25, 2024
Visual tokens in current large multimodal models are spatially redundant, as indicated by the sparse attention maps. LLaVA-PruMerge proposes to first prune and then merge visual tokens, which can compress the visual tokens by 18 times (14 times on MME/TextVQA) on average while