Rowan Zellers
Thinking Machines
@rown
multimodal @thinkymachines. I also like to climb rocks and throw pottery. rowanzellers.com (he/him)
San Francisco, CA
Joined November 2008
Posts
  • Pinned
    People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/interacti…
  • life update: I've joined @thinkymachines lab! We're building the future of human-AI interaction through open science, research+product co-iteration, and with multimodal at the core. If you're interested in joining our fantastic team - reach out! DMs open 😀
  • I wrote a blog post on why I decided to join OpenAI instead of academia. (after I went on the academic & industry job markets, and got offers from both.) This post (pt2 in a series) took a while 😅- hoping my experience helps others make life decisions!
  • Belated announcement: after finishing my PhD, I’ve joined @OpenAI as a research scientist! I’m excited to continue working towards building multimodal, grounded AI - with a fantastic group of collaborators here.
  • OpenAI is nothing without its people
  • We've released the code and model checkpoints for Grover, our 1.5B parameter neural fake news detector (and generator): github.com/rowanz/grover Excited to see what the community will find, especially regarding how to better spot machine-generated disinformation!
  • Introducing 🍷MERLOT Reserve, a new model that learns about vision, language, and sound all together -- by watching millions of YouTube videos. To appear at #CVPR2022. page: rowanzellers.com/merlotreserve paper: arxiv.org/abs/2201.02639 demo: merlot-reserve.apps.allenai.org 🧵/5
  • Interested in computer vision/AI models that perform cognition-level visual reasoning? Today, I'm releasing @PyTorch code and pretrained models for Visual Commonsense Reasoning, available at github.com/rowanz/r2c More info (+paper) at visualcommonsense.com 😃
  • Impressive demo. Worth noting, it has memory -- call it twice and it remembers what was talked about in the last call. Personalization and memory are so critical for voice models to feel "human", and it feels like we're only at the beginning here.
    At Sesame, we believe in a future where computers are lifelike. Today we are unveiling an early glimpse of our expressive voice technology, highlighting our focus on lifelike interactions and our vision for all-day wearable voice companions. sesame.com/voicedemo
  • Introducing MERLOT: a new model that learns about language, vision, & the world from 6M YouTube videos. Out-of-the-box, MERLOT has intrinsic notions of multimodal temporal commonsense. When finetuned, we get SOTA performance on 12 video tasks + VCR. arxiv.org/abs/2106.02636
    An overview of our model named "MERLOT" (which is depicted as a wine glass.) We train our model on 6 million youtube videos in a self-supervised way. The model can be applied for a variety of tasks including:

Video QA (example: a video clip of two people interacting, and a question "What's she holding onto before he leaves?")

Commonsense Single-image QA (example: a single image of some people in a diner and the question "why is the man pointing")

Visual story ordering (example: a story "The old man was riding on an escalator. He was almost at the top. His kids were already there." and images for each sentence.)
  • Excited to introduce GPT-4o. Language, vision, and sound -- all together and all in real time. This thing has been so much fun to work on. It's been even more fun to play with -- with moments of magic where things feel totally fluid and I forget I'm video chatting with an AI.
  • Today we're announcing Visual Commonsense Reasoning, a new task and large-scale dataset, with the goal to bring AI systems from recognition- to cognition-level understanding. Joint work with @ybisk, Ali Farhadi, and @YejinChoinka. visualcommonsense.com arxiv.org/abs/1811.10830
  • If you’re excited to build the future of multimodal human/ai collaboration, and jam with Andrew, me, and many other talented people across the stack - dm me! 😀
    life update: I joined @thinkymachines! feeling so lucky to build with such a kind, brilliant team, esp pairing with researchers early on as a designer. looking forward to sharing more soon.
  • We've launched GPT4! Among other things -- I'm excited that it can read an image, and analyze it at a level beyond object- or scene recognition, communicating the result in helpful language. Still more followup work to be done characterizing this 😀 openai.com/research/gpt-4
    User: What is funny about this image? Describe it panel by panel.
GPT4: The image shows a package for a "Lightning Cable" adapter with three panels.

Panel 1: A smartphone with a VGA connector (a large, blue, 15-pin connector typically used for computer monitors) plugged into its charging port.

Panel 2: The package for the "Lightning Cable" adapter with a picture of a VGA connector on it.

Panel 3: A close-up of the VGA connector with a small Lightning connector (used for charging iPhones and other Apple devices) at the end.

The humor in this image comes from the absurdity of plugging a large, outdated VGA connector into a small, modern smartphone charging port.