Our team at Google DeepMind is looking for a research intern (Summer 2026)! Topics: multimodal agentic models and unified models (world models). We are looking for candidates with multiple first-author papers at top ML conferences and strong engineering skills. Email: [email protected]
Introducing Gemini 2.5 Flash Image (aka nano-banana), our SOTA image generation and editing model
As you might have already seen, this model excels at character consistency and creative edits, and has Gemini's world knowledge!
🚨 I'll be at #NeurIPS2024! 🚨 I'm on the industry job market this year and eager to connect in person!
My research explores multimodal learning, with a focus on object-level understanding and video understanding.
3 papers at NeurIPS 2024:
Workshop on Video-Language Models
(1/N) All current video models understand videos poorly, even when the videos are less than 10 seconds long! The best model, GPT-4o, achieves 35.0 while humans get 90.0 in group score. Existing LMMs struggle severely to distinguish temporal differences on Vinoground: vinoground.github.io
I am excited to announce that I am not at #ICLR presenting Matryoshka Multimodal Models matryoshka-mm.github.io.
Rather, I am online in the Bay Area. Ping me if you have any questions or ideas about the paper!
Feel free to read the poster at Hall 3 + Hall 2B #86 this morning!
Thanks to @_akhaliq for sharing! (1/N) We propose M3: Matryoshka Multimodal Models, arxiv.org/abs/2405.17430, which (1) significantly reduces the number of visual tokens while matching the performance of a vanilla LMM and (2) organizes visual tokens in a coarse-to-fine nested way.
(1/N) Can a large multimodal model not only understand bboxes, but also understand arbitrary visual prompts (scribbles, arrows, etc.) without explicit region embedding? Yes! Our latest work ViP-LLaVA arxiv.org/abs/2312.00784 shows an extremely simple but effective approach.
Thank you @yong_jae_lee! Without the support from you and our group members, this work would not have been possible. I'll miss the days working in our group.
Congratulations Dr. Mu Cai @MuCai7! Mu is my 8th PhD student and first to start in my group at UW–Madison after my move a few years ago. He made a number of important contributions in multimodal models during his PhD, and recently joined Google DeepMind. I will miss you a lot Mu!
(1/N) Are current large multimodal models like #GPT4o really good at video understanding?
We are thrilled to introduce TemporalBench to examine temporal dynamics understanding for LMMs!
Our TemporalBench reveals even the SOTA LMM #GPT4o achieves only 38.5, far from
Want the simplest way to apply a multimodal model (LLaVA) to robotics tasks? Check out LLaRA (github.com/LostXine/LLaRA, accepted to #ICLR2025), which gives you a vision-language-action (VLA) policy! Joint work with
@XiangLi54505720, @ryoo_michael, et al. from Stony Brook U,
(1/5)
Excited to present our #ICLR2025 paper, LLaRA, at NYC CV Day!
LLaRA efficiently transforms a pretrained Vision-Language Model (VLM) into a robot Vision-Language-Action (VLA) policy, even with a limited amount of training data.
More details are in the thread. ⬇️
LLaVA-PruMerge, the first work on visual token reduction for MLLMs, finally got accepted after being cited 146 times since last year.
Congrats to the team! @yuzhang_shang @yong_jae_lee
See how to make MLLM inference much cheaper while maintaining performance. llava-prumerge.github.io
Visual tokens in current large multimodal models are spatially redundant, as indicated by the sparse attention maps.
LLaVA-PruMerge proposes to first prune and then merge visual tokens, which can compress the visual tokens by 18 times (14 times on MME/TextVQA) on average while