<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Magenta</title>
    <description>A research project exploring the role of machine learning in the process of creating art and music.</description>
    <link>https://magenta.withgoogle.com/</link>
    <atom:link href="https://magenta.withgoogle.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Tue, 10 Mar 2026 09:01:42 -0700</pubDate>
    <lastBuildDate>Tue, 10 Mar 2026 09:01:42 -0700</lastBuildDate>
    <generator>Jekyll v4.4.1</generator>
    
      <item>
        <title>Open-sourcing The Infinite Crate DAW plugin</title>
        <description>
&lt;div align=&quot;center&quot; class=&quot;action-container&quot;&gt;
  &lt;a class=&quot;action grey&quot; href=&quot;https://github.com/magenta/the-infinite-crate&quot;&gt;
    &lt;span class=&quot;studio-icon&quot;&gt;&lt;svg viewBox=&quot;0 0 16 16&quot;&gt;&lt;path d=&quot;M7.999,0.431c-4.285,0-7.76,3.474-7.76,7.761 c0,3.428,2.223,6.337,5.307,7.363c0.388,0.071,0.53-0.168,0.53-0.374c0-0.184-0.007-0.672-0.01-1.32 c-2.159,0.469-2.614-1.04-2.614-1.04c-0.353-0.896-0.862-1.135-0.862-1.135c-0.705-0.481,0.053-0.472,0.053-0.472 c0.779,0.055,1.189,0.8,1.189,0.8c0.692,1.186,1.816,0.843,2.258,0.645c0.071-0.502,0.271-0.843,0.493-1.037 C4.86,11.425,3.049,10.76,3.049,7.786c0-0.847,0.302-1.54,0.799-2.082C3.768,5.507,3.501,4.718,3.924,3.65 c0,0,0.652-0.209,2.134,0.796C6.677,4.273,7.34,4.187,8,4.184c0.659,0.003,1.323,0.089,1.943,0.261 c1.482-1.004,2.132-0.796,2.132-0.796c0.423,1.068,0.157,1.857,0.077,2.054c0.497,0.542,0.798,1.235,0.798,2.082 c0,2.981-1.814,3.637-3.543,3.829c0.279,0.24,0.527,0.713,0.527,1.437c0,1.037-0.01,1.874-0.01,2.129 c0,0.208,0.14,0.449,0.534,0.373c3.081-1.028,5.302-3.935,5.302-7.362C15.76,3.906,12.285,0.431,7.999,0.431z&quot; /&gt;&lt;/svg&gt;
&lt;/span&gt;
    View on GitHub&lt;/a&gt;
  &lt;a class=&quot;action grey&quot; href=&quot;https://g.co/magenta/discord&quot;&gt;
    &lt;span class=&quot;studio-icon&quot;&gt;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&lt;svg id=&quot;Discord-Logo&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; viewBox=&quot;0 0 126.644 96&quot;&gt;&lt;defs&gt;&lt;style&gt;.cls-1{fill:#fff;}&lt;/style&gt;&lt;/defs&gt;&lt;path id=&quot;Discord-Symbol-White&quot; class=&quot;cls-1&quot; d=&quot;M81.15,0c-1.2376,2.1973-2.3489,4.4704-3.3591,6.794-9.5975-1.4396-19.3718-1.4396-28.9945,0-.985-2.3236-2.1216-4.5967-3.3591-6.794-9.0166,1.5407-17.8059,4.2431-26.1405,8.0568C2.779,32.5304-1.6914,56.3725.5312,79.8863c9.6732,7.1476,20.5083,12.603,32.0505,16.0884,2.6014-3.4854,4.8998-7.1981,6.8698-11.0623-3.738-1.3891-7.3497-3.1318-10.8098-5.1523.9092-.6567,1.7932-1.3386,2.6519-1.9953,20.281,9.547,43.7696,9.547,64.0758,0,.8587.7072,1.7427,1.3891,2.6519,1.9953-3.4601,2.0457-7.0718,3.7632-10.835,5.1776,1.97,3.8642,4.2683,7.5769,6.8698,11.0623,11.5419-3.4854,22.3769-8.9156,32.0509-16.0631,2.626-27.2771-4.496-50.9172-18.817-71.8548C98.9811,4.2684,90.1918,1.5659,81.1752.0505l-.0252-.0505ZM42.2802,65.4144c-6.2383,0-11.4159-5.6575-11.4159-12.6535s4.9755-12.6788,11.3907-12.6788,11.5169,5.708,11.4159,12.6788c-.101,6.9708-5.026,12.6535-11.3907,12.6535ZM84.3576,65.4144c-6.2637,0-11.3907-5.6575-11.3907-12.6535s4.9755-12.6788,11.3907-12.6788,11.4917,5.708,11.3906,12.6788c-.101,6.9708-5.026,12.6535-11.3906,12.6535Z&quot; /&gt;&lt;/svg&gt;&lt;/span&gt;
    Discuss on Discord&lt;/a&gt;
  &lt;a class=&quot;action grey&quot; href=&quot;https://magenta.withgoogle.com/infinite-crate&quot;&gt;
    &lt;span class=&quot;studio-icon&quot;&gt;
&lt;svg version=&quot;1.1&quot; id=&quot;Capa_1&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot; xmlns:xlink=&quot;http://www.w3.org/1999/xlink&quot; x=&quot;0px&quot; y=&quot;0px&quot; viewBox=&quot;0 0 24.637 24.637&quot; style=&quot;enable-background:new 0 0 24.637 24.637;&quot; xml:space=&quot;preserve&quot;&gt;
&lt;g&gt;
     &lt;path d=&quot;M18.537,6.945H1.432C0.641,6.945,0,7.582,0,8.369v14.262c0,0.785,0.641,1.426,1.432,1.426h17.105
          c0.785,0,1.426-0.641,1.426-1.426V8.369C19.963,7.582,19.322,6.945,18.537,6.945z M6.817,8.016c0.395,0,0.712,0.318,0.712,0.713
          c0,0.393-0.317,0.713-0.712,0.713c-0.392,0-0.71-0.32-0.71-0.713C6.107,8.334,6.426,8.016,6.817,8.016z M4.639,8.016
          c0.391,0,0.71,0.318,0.71,0.713c0,0.393-0.319,0.713-0.71,0.713c-0.397,0-0.717-0.32-0.717-0.713
          C3.922,8.334,4.241,8.016,4.639,8.016z M2.494,8.016c0.396,0,0.715,0.318,0.715,0.713c0,0.393-0.318,0.713-0.715,0.713
          c-0.39,0-0.709-0.32-0.709-0.713C1.785,8.334,2.104,8.016,2.494,8.016z M18.537,22.631H1.432V10.527h17.105
          C18.537,10.527,18.537,22.631,18.537,22.631z M18.537,9.101H8.559V8.387h9.979L18.537,9.101L18.537,9.101z&quot; /&gt;
     &lt;path d=&quot;M23.209,0.58H6.102c-0.79,0-1.426,0.637-1.426,1.426v4.133h1.426V4.162h17.107v12.104h-2.483v1.426h2.483
          c0.786,0,1.428-0.641,1.428-1.426V2.006C24.637,1.217,23.995,0.58,23.209,0.58z M7.17,3.076c-0.394,0-0.711-0.316-0.711-0.711
          c0-0.398,0.317-0.713,0.711-0.713c0.393,0,0.713,0.314,0.713,0.713C7.883,2.76,7.562,3.076,7.17,3.076z M9.309,3.076
          c-0.396,0-0.713-0.316-0.713-0.711c0-0.398,0.316-0.713,0.713-0.713c0.394,0,0.714,0.314,0.714,0.713
          C10.022,2.76,9.702,3.076,9.309,3.076z M11.494,3.076c-0.396,0-0.717-0.316-0.717-0.711c0-0.398,0.32-0.713,0.717-0.713
          c0.389,0,0.707,0.314,0.707,0.713C12.201,2.76,11.883,3.076,11.494,3.076z M23.209,2.734h-9.98V2.019h9.98V2.734z&quot; /&gt;
&lt;/g&gt;
&lt;/svg&gt;
&lt;/span&gt;
    Get the plugin&lt;/a&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Six months ago we released
&lt;a href=&quot;https://magenta.withgoogle.com/infinite-crate&quot;&gt;The Infinite Crate&lt;/a&gt;, a DAW
plugin that brings the
&lt;a href=&quot;https://magenta.withgoogle.com/lyria-realtime&quot;&gt;Lyria RealTime&lt;/a&gt; music model
into Digital Audio Workstations (DAWs) to improve the sampling workflow for
producers. Since its release it’s been used by some of our favorite artists —
including a wonderful showcase with
&lt;a href=&quot;https://daito.ws/en/&quot;&gt;Daito Manabe&lt;/a&gt; in Tokyo — and was featured as
&lt;a href=&quot;https://www.youtube.com/watch?v=BHY15gnMwtc&amp;amp;t=841s&quot;&gt;an exciting new music tool&lt;/a&gt;
at &lt;a href=&quot;https://www.namm.org/&quot;&gt;NAMM&lt;/a&gt; 2026.&lt;/p&gt;

&lt;p&gt;Today we’re fully open-sourcing the DAW plugin for developers to fork, modify,
and make their own under the permissive
&lt;a href=&quot;https://www.apache.org/licenses/LICENSE-2.0&quot;&gt;Apache 2.0 license&lt;/a&gt;.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;/assets/oss-infinite-crate/oss_infinite_crate.png&quot; alt=&quot;Plugin interface&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;The VST was born out of discussions and studio collaborations with musicians
and producers from around the world. Many were intrigued by music models as
creative partners but needed deeper integration into the tools they know and
trust — Ableton, Logic, and other DAWs that support VST3/AU plugins. Bridging
this gap simplifies audio routing and MIDI-mapping for studio recording and
live performance, allowing musicians to focus on what matters: the music.&lt;/p&gt;

&lt;p&gt;We architected the plugin using React/TypeScript for the UI layer and JUCE/C++
for DAW connection, audio processing, and WebSocket audio streaming from the
Gemini/Lyria API. This allowed us to rapidly iterate on the frontend using
hot-reload (Shadcn/Tailwind), while ensuring latency-sensitive operations
(audio streaming and playback) happen in a compiled and unmanaged language with
a tight clock. State is synced between TypeScript and C++ using Zustand’s
state management and nlohmann::json.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;/assets/oss-infinite-crate/plugin_architecture.png&quot; alt=&quot;Plugin architecture&quot; /&gt;
&lt;/p&gt;
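
&lt;p&gt;As a rough illustration of the TypeScript-to-C++ split described above, here is a
minimal sketch of mirroring UI state from a Zustand store into the native layer. The
window.__juceBridge callback and the JSON message shape are hypothetical stand-ins for
the plugin’s actual bridge, so treat this as an illustration rather than the plugin’s
real code.&lt;/p&gt;

&lt;div class=&quot;language-typescript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Minimal sketch: mirror UI state changes into the native (JUCE/C++) layer.
// window.__juceBridge is a hypothetical callback registered by the C++ side;
// the real plugin may use a different bridge and message schema.
import { create } from &quot;zustand&quot;;

interface PluginState {
  prompt: string;
  temperature: number;
  setPrompt: (prompt: string) =&amp;gt; void;
  setTemperature: (temperature: number) =&amp;gt; void;
}

declare global {
  interface Window {
    __juceBridge?: (json: string) =&amp;gt; void;
  }
}

export const usePluginStore = create&amp;lt;PluginState&amp;gt;((set) =&amp;gt; ({
  prompt: &quot;lo-fi hip hop&quot;,
  temperature: 1.0,
  setPrompt: (prompt) =&amp;gt; set({ prompt }),
  setTemperature: (temperature) =&amp;gt; set({ temperature }),
}));

// Serialize every state change to JSON and hand it to the native side,
// which can parse it with nlohmann::json and update the audio engine.
usePluginStore.subscribe((state) =&amp;gt; {
  window.__juceBridge?.(
    JSON.stringify({ prompt: state.prompt, temperature: state.temperature })
  );
});
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;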

&lt;p&gt;The plugin is a functional interface that exposes most of the controls
available on the Lyria RealTime API to the React frontend and feeds the
resulting audio stream into the DAW. Developers can fork the plugin and build
creative interfaces and visualizations for the API (like
&lt;a href=&quot;https://magenta.withgoogle.com/spacedj-announce&quot;&gt;Space DJ&lt;/a&gt;,
&lt;a href=&quot;https://mididj-dot-envisioning-studio.appspot.com/&quot;&gt;MIDI DJ&lt;/a&gt;, or
&lt;a href=&quot;https://x.com/poetengineer__/status/1944812105699356984?s=20&quot;&gt;creative controls&lt;/a&gt;)
directly in the DAW by spinning up the Vite server. Because
the frontend uses a standard set of web frameworks it’s easy to explore new
interfaces using AI-assisted coding tools like Gemini and
&lt;a href=&quot;https://antigravity.google/&quot;&gt;Antigravity&lt;/a&gt;.&lt;/p&gt;
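
&lt;p&gt;For example, a custom control can be as small as one component wired into the
plugin’s store. The component below is a hypothetical sketch that builds on the state
example above, not code from the actual plugin:&lt;/p&gt;

&lt;div class=&quot;language-typescript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Hypothetical custom control: a single slider wired into the plugin store
// from the earlier state-sync sketch. The module path and field names are
// illustrative, not the plugin&apos;s real ones.
import React from &quot;react&quot;;
import { usePluginStore } from &quot;./pluginStore&quot;; // hypothetical module path

export function TemperatureKnob() {
  const temperature = usePluginStore((s) =&amp;gt; s.temperature);
  const setTemperature = usePluginStore((s) =&amp;gt; s.setTemperature);
  return (
    &amp;lt;input
      type=&quot;range&quot;
      min={0}
      max={2}
      step={0.05}
      value={temperature}
      onChange={(e) =&amp;gt; setTemperature(Number(e.target.value))}
    /&amp;gt;
  );
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;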

&lt;h2 id=&quot;looking-ahead&quot;&gt;Looking ahead&lt;/h2&gt;

&lt;p&gt;In the near term, we hope to update the plugin to support on-device inference
of the &lt;a href=&quot;https://magenta.withgoogle.com/magenta-realtime&quot;&gt;Magenta RealTime&lt;/a&gt;
open-weights model for offline use. In the long term, we hope to support
future music models with improved controls, such as audio and MIDI input.&lt;/p&gt;

&lt;p&gt;We hope this open-source plugin can support, and be built together with, the growing
community of music makers using machine learning as part of their creative
process.&lt;/p&gt;

&lt;p&gt;Join the discussion on our
&lt;a href=&quot;https://g.co/magenta/discord&quot;&gt;Discord&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgments&quot;&gt;Acknowledgments&lt;/h2&gt;
&lt;p&gt;We thank: Spencer Salazar for his talk on prototyping DAW plugins in web
technologies at ADC 2020, JUCE for implementing a C++ to Web/JS bridge in
JUCE 8,
Tommy Cappel for rigorous testing,
Alberto Lalama and Joyce Xie for their work on the API,
Nikhil Bhanu for his work on the Windows build,
and the DeepMind research team that contributed to Lyria RealTime.&lt;/p&gt;
</description>
        <pubDate>Mon, 09 Mar 2026 12:00:00 -0700</pubDate>
        <link>https://magenta.withgoogle.com/oss-infinite-crate</link>
        <guid isPermaLink="true">https://magenta.withgoogle.com/oss-infinite-crate</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Lyria Camera: Soundtrack your life</title>
        <description>&lt;p&gt;Today we’re launching &lt;strong&gt;Lyria Camera&lt;/strong&gt;, an app that uses
&lt;a href=&quot;https://deepmind.google/models/lyria/lyria-realtime/&quot;&gt;Lyria RealTime&lt;/a&gt; to make
music with your camera. By combining Gemini’s image understanding and the
&lt;a href=&quot;https://ai.google.dev/gemini-api/docs/music-generation&quot;&gt;Lyria RealTime API&lt;/a&gt;,
Lyria Camera generates a musical score that adapts to your environment on the
fly.&lt;/p&gt;

&lt;p&gt;It works by translating the visual scene into &lt;strong&gt;musical descriptors via
Gemini&lt;/strong&gt;, producing prompts like &lt;em&gt;Reflective piano, cityscape calm&lt;/em&gt;. The Lyria
RealTime API uses these descriptors as prompts to generate a continuous stream
of music. As you move about your world, the prompts and the
music they create will evolve over time.&lt;/p&gt;
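
&lt;p&gt;As a minimal sketch of that loop, the cycle of capturing a frame, describing it, and
steering the music looks roughly like the code below. Here describeFrameWithGemini and
steerLyria are hypothetical placeholders for the Gemini and Lyria RealTime API calls,
not the app’s actual functions.&lt;/p&gt;

&lt;div class=&quot;language-typescript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Minimal sketch of the Lyria Camera loop: grab a frame, ask Gemini for
// musical descriptors, then steer the Lyria RealTime stream with them.
// describeFrameWithGemini and steerLyria are hypothetical placeholders;
// see the Gemini API docs for the actual interfaces.

async function describeFrameWithGemini(frameDataUrl: string): Promise&amp;lt;string[]&amp;gt; {
  // e.g. returns [&quot;reflective piano&quot;, &quot;cityscape calm&quot;]
  throw new Error(&quot;wire this to a Gemini image-understanding call&quot;);
}

async function steerLyria(prompts: string[]): Promise&amp;lt;void&amp;gt; {
  throw new Error(&quot;wire this to a Lyria RealTime weighted-prompt update&quot;);
}

function captureFrame(video: HTMLVideoElement): string {
  const canvas = document.createElement(&quot;canvas&quot;);
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext(&quot;2d&quot;)?.drawImage(video, 0, 0);
  return canvas.toDataURL(&quot;image/jpeg&quot;); // base64-encoded frame for Gemini
}

export async function soundtrackLoop(video: HTMLVideoElement): Promise&amp;lt;void&amp;gt; {
  // Every few seconds: frame in, descriptors out, prompts updated.
  // The music keeps streaming continuously; only the steering changes.
  while (true) {
    const descriptors = await describeFrameWithGemini(captureFrame(video));
    await steerLyria(descriptors);
    await new Promise((resolve) =&amp;gt; setTimeout(resolve, 5000));
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;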

&lt;p&gt;&lt;a href=&quot;https://aistudio.google.com/apps/bundled/lyria_camera?fullscreenApplet=true&amp;amp;showPreview=true&amp;amp;showAssistant=true&quot;&gt;Try Lyria Camera now&lt;/a&gt;
or remix it on
&lt;a href=&quot;https://aistudio.google.com/apps/bundled/lyria_camera?showPreview=true&amp;amp;showAssistant=true&quot;&gt;AI Studio&lt;/a&gt;.&lt;/p&gt;

&lt;style&gt;
  /* fallback */
  @font-face {
    font-family: &quot;Material Symbols Outlined&quot;;
    font-style: normal;
    font-weight: 400;
    src: url(https://fonts.gstatic.com/s/materialsymbolsoutlined/v250/kJF1BvYX7BgnkSrUwT8OhrdQw4oELdPIeeII9v6oDMzByHX9rA6RzaxHMPdY43zj-jCxv3fzvRNU22ZXGJpEpjC_1v-p_4MrImHCIJIZrDCvHOejbd5zrDAt.woff2)
      format(&quot;woff2&quot;);
  }
  .material-symbols-outlined {
    font-family: &quot;Material Symbols Outlined&quot;;
    font-weight: normal;
    font-style: normal;
    font-size: 36px;
    line-height: 1;
    letter-spacing: normal;
    text-transform: none;
    display: inline-block;
    white-space: nowrap;
    word-wrap: normal;
    direction: ltr;
    -webkit-font-feature-settings: &quot;liga&quot;;
    -webkit-font-smoothing: antialiased;
    color: black;
    border-radius: 100%;
    padding: 0.2em;
    background-color: rgb(241, 241, 241);
    transition: 0.1s;
  }
  .material-symbols-outlined:hover {
    background-color: rgb(160, 160, 160);
    color: white;
    transition: 0.1s;
    cursor: pointer;
  }
  .control-overlay {
    position: absolute;
    display: flex;
    top: 0;
    left: 0;
    width: 100%;
    height: 100%;
    background: rgba(0, 0, 0, 0);
    /* Align items to the end of the flex container (right) */
    justify-content: flex-end;
    /* Align items to the end of the cross axis (bottom) */
    align-items: flex-end;
    font-family: sans-serif;
    font-size: 1.2rem;
  }
  .video-container {
    position: relative;
    display: flex;
    justify-content: center;
  }
&lt;/style&gt;

&lt;div class=&quot;video-container&quot; id=&quot;video-container-lyria_camera_video&quot; style=&quot;width:100%;&quot;&gt;
  &lt;video height=&quot;100%&quot; src=&quot;/assets/lyria_camera/lyria_camera_video.mp4&quot; id=&quot;video-lyria_camera_video&quot; muted=&quot;&quot; loop=&quot;&quot; autoplay=&quot;&quot;&gt;&lt;/video&gt;
  &lt;div class=&quot;control-overlay&quot;&gt;
    &lt;span class=&quot;material-symbols-outlined&quot; id=&quot;unmute-button-lyria_camera_video&quot;&gt;volume_off&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;script&gt;
  document.addEventListener(&quot;DOMContentLoaded&quot;, () =&gt; {
    let videoContainer = document.getElementById(
      &quot;video-container-lyria_camera_video&quot;
    );
    let unmuteButton = document.getElementById(
      &quot;unmute-button-lyria_camera_video&quot;
    );
    let video = document.getElementById(&quot;video-lyria_camera_video&quot;);

    unmuteButton.addEventListener(&quot;click&quot;, function () {
      video.muted = !video.muted;
      if (video.muted) {
        unmuteButton.textContent = &quot;volume_off&quot;;
      } else {
        unmuteButton.textContent = &quot;volume_up&quot;;
      }
    });

    if (&quot;IntersectionObserver&quot; in window) {
      const observer = new IntersectionObserver(
        (entries) =&gt; {
          entries.forEach((entry) =&gt; {
            if (entry.isIntersecting) {
              if (unmuteButton.textContent == &quot;volume_up&quot;) {
                video.muted = false;
              }
            } else {
              video.muted = true;
            }
          });
        },
        {
          root: null,
          rootMargin: &quot;0px&quot;,
          threshold: 0.1,
        }
      );

      observer.observe(videoContainer);
    }
  });
&lt;/script&gt;

&lt;h2 id=&quot;the-world-is-your-instrument&quot;&gt;The world is your instrument&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Reward your curiosity:&lt;/strong&gt; When you’re using Lyria Camera, every image is a
new instrument. You can find songs in your sketchbook, at the laundromat or
in your breakfast cereal. Film around and see what you can find.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;DJ your commute:&lt;/strong&gt; Point your camera out the train window or mount it on
your dashboard. Lyria Camera responds to the shifting scenery—the rhythm of
passing streetlights or the calm of an open road—creating a drive-time score
that matches your journey beat for beat.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Score your screen:&lt;/strong&gt; On desktop, try the “Share Screen” feature to use a
browser tab instead of your camera. Actually, any app on your computer can
be used as a video feed. Try it while you’re working or gaming for a
tailor-made soundtrack.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;how-it-works&quot;&gt;How it Works&lt;/h2&gt;

&lt;p&gt;Lyria Camera brings together several AI capabilities to create a seamless
audiovisual feedback loop.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Multimodal Prompting:&lt;/strong&gt; This is the bridge between sight and sound. We use
Gemini to analyze your camera feed, translating visual cues into rich
textual descriptions. These descriptions act as musical instructions,
telling Lyria exactly how to interpret and “play” what you’re seeing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Continuous &amp;amp; Steerable Generation:&lt;/strong&gt; The Lyria RealTime API is designed
for continuous music generation. Instead of generating a static song, it
creates an endless stream of audio that you can “steer” in different
directions. This allows the music to morph smoothly from one mood to another
without ever stopping or skipping a beat.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;what-will-you-build&quot;&gt;What will you build?&lt;/h2&gt;

&lt;p&gt;Lyria Camera is a great companion for a walk or a drive, and it’s just one thing
you can do with the
&lt;a href=&quot;https://ai.google.dev/gemini-api/docs/music-generation&quot;&gt;Lyria RealTime API&lt;/a&gt;. We
built this app to demonstrate the possibilities of continuous, steerable music
generation, but the real potential lies in what comes next.&lt;/p&gt;

&lt;p&gt;You can
&lt;a href=&quot;https://aistudio.google.com/apps/bundled/lyria_camera?fullscreenApplet=true&amp;amp;showPreview=true&amp;amp;showAssistant=true&quot;&gt;try Lyria Camera&lt;/a&gt;
on your phone or desktop today. For developers ready to push the boundaries
further, the Lyria RealTime API can help you build the next generation of music
experiences.&lt;/p&gt;
</description>
        <pubDate>Wed, 03 Dec 2025 12:00:00 -0800</pubDate>
        <link>https://magenta.withgoogle.com/lyria-camera-announce</link>
        <guid isPermaLink="true">https://magenta.withgoogle.com/lyria-camera-announce</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Space DJ: Navigating a Musical Universe</title>
        <description>&lt;p&gt;Today, we’re excited to launch Space DJ, a web application from Magenta that
turns music exploration into an interactive journey through a constellation of
sounds. You pilot a spaceship through a galaxy where each star represents a
musical genre. As you navigate this universe, Space DJ uses the
&lt;a href=&quot;https://magenta.withgoogle.com/lyria-realtime&quot;&gt;Lyria RealTime API&lt;/a&gt; to generate
a continuous stream of music that reflects your position and selections in
real-time.&lt;/p&gt;

&lt;p&gt;We used the &lt;a href=&quot;https://cloud.google.com/blog/products/ai-machine-learning/ai-studio-to-cloud-run-and-cloud-run-mcp-server&quot;&gt;deploy app feature&lt;/a&gt; in AI Studio to make this available to everyone!&lt;br /&gt;
&lt;strong&gt;&lt;a href=&quot;https://spacedj-363947264390.us-west1.run.app&quot;&gt;Try Space DJ now&lt;/a&gt;&lt;/strong&gt;, or
&lt;strong&gt;view and fork the source code in &lt;a href=&quot;https://aistudio.google.com/apps/bundled/spacedj&quot;&gt;AI Studio&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;style&gt;
  /* fallback */
  @font-face {
    font-family: &quot;Material Symbols Outlined&quot;;
    font-style: normal;
    font-weight: 400;
    src: url(https://fonts.gstatic.com/s/materialsymbolsoutlined/v250/kJF1BvYX7BgnkSrUwT8OhrdQw4oELdPIeeII9v6oDMzByHX9rA6RzaxHMPdY43zj-jCxv3fzvRNU22ZXGJpEpjC_1v-p_4MrImHCIJIZrDCvHOejbd5zrDAt.woff2)
      format(&quot;woff2&quot;);
  }
  .material-symbols-outlined {
    font-family: &quot;Material Symbols Outlined&quot;;
    font-weight: normal;
    font-style: normal;
    font-size: 36px;
    line-height: 1;
    letter-spacing: normal;
    text-transform: none;
    display: inline-block;
    white-space: nowrap;
    word-wrap: normal;
    direction: ltr;
    -webkit-font-feature-settings: &quot;liga&quot;;
    -webkit-font-smoothing: antialiased;
    color: black;
    border-radius: 100%;
    padding: 0.2em;
    background-color: rgb(241, 241, 241);
    transition: 0.1s;
  }
  .material-symbols-outlined:hover {
    background-color: rgb(160, 160, 160);
    color: white;
    transition: 0.1s;
    cursor: pointer;
  }
  .control-overlay {
    position: absolute;
    display: flex;
    top: 0;
    left: 0;
    width: 100%;
    height: 100%;
    background: rgba(0, 0, 0, 0);
    /* Align items to the end of the flex container (right) */
    justify-content: flex-end;
    /* Align items to the end of the cross axis (bottom) */
    align-items: flex-end;
    font-family: sans-serif;
    font-size: 1.2rem;
  }
  .video-container {
    position: relative;
    display: flex;
    justify-content: center;
  }
&lt;/style&gt;

&lt;div class=&quot;video-container&quot; id=&quot;video-container-spacedj&quot; style=&quot;width:100%;&quot;&gt;
  &lt;video height=&quot;100%&quot; src=&quot;/assets/spacedj/spacedj-video.mp4&quot; id=&quot;video-spacedj&quot; muted=&quot;&quot; loop=&quot;&quot; autoplay=&quot;&quot;&gt;&lt;/video&gt;
  &lt;div class=&quot;control-overlay&quot;&gt;
    &lt;span class=&quot;material-symbols-outlined&quot; id=&quot;unmute-button-spacedj&quot;&gt;volume_off&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;script&gt;
  document.addEventListener(&quot;DOMContentLoaded&quot;, () =&gt; {
    let videoContainer = document.getElementById(
      &quot;video-container-spacedj&quot;
    );
    let unmuteButton = document.getElementById(
      &quot;unmute-button-spacedj&quot;
    );
    let video = document.getElementById(&quot;video-spacedj&quot;);

    unmuteButton.addEventListener(&quot;click&quot;, function () {
      video.muted = !video.muted;
      if (video.muted) {
        unmuteButton.textContent = &quot;volume_off&quot;;
      } else {
        unmuteButton.textContent = &quot;volume_up&quot;;
      }
    });

    if (&quot;IntersectionObserver&quot; in window) {
      const observer = new IntersectionObserver(
        (entries) =&gt; {
          entries.forEach((entry) =&gt; {
            if (entry.isIntersecting) {
              if (unmuteButton.textContent == &quot;volume_up&quot;) {
                video.muted = false;
              }
            } else {
              video.muted = true;
            }
          });
        },
        {
          root: null,
          rootMargin: &quot;0px&quot;,
          threshold: 0.1,
        }
      );

      observer.observe(videoContainer);
    }
  });
&lt;/script&gt;

&lt;h2 id=&quot;fly-through-music&quot;&gt;Fly Through Music&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Explore a Musical Universe:&lt;/strong&gt; Fly through a star constellation where each
star is labeled with a music genre. This galaxy is a 3D projection of genre
embeddings.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Generate Music in Real-Time:&lt;/strong&gt; As you fly, the stars close to the
spaceship light up and influence the music. Clicking on a star or a point in
space anchors your selection. The Lyria Realtime model blends the prompts of
nearby genres into a unique musical mashup that evolves dynamically as you
move.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Uncover Hidden Connections:&lt;/strong&gt; Similar genres appear close together in the
3D space. You can also enable “High-Dimensional Neighbors” to find genres
that are semantically similar in the original high-dimensional embedding
space, even if they aren’t visual neighbors in the projection.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Engage Auto-Pilot:&lt;/strong&gt; Randomly drift through space for an ever-changing,
generative soundscape.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;how-it-works&quot;&gt;How it Works&lt;/h2&gt;

&lt;p&gt;Space DJ combines several technologies to create an immersive experience:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Genre Embeddings:&lt;/strong&gt; We start with text prompts for 300 musical genres
drawn from a dataset of 1,000 genres. Each prompt is converted into a rich numerical
representation (embedding) using the open-source
&lt;a href=&quot;https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Demo.ipynb&quot;&gt;MagentaRT&lt;/a&gt;
model’s MusicCoCa embedder. These 768-dimensional embeddings are then
reduced to 128 dimensions using
&lt;a href=&quot;https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html&quot;&gt;Principal Component Analysis&lt;/a&gt;
for efficiency.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;3D Projection:&lt;/strong&gt; To render the embeddings in 3D, we use Uniform Manifold
Approximation and Projection
(&lt;a href=&quot;https://pair-code.github.io/understanding-umap/&quot;&gt;UMAP&lt;/a&gt;), an algorithm that
projects the data into 3D space while trying to preserve its
high-dimensional structure. You can tweak UMAP parameters in the settings
for different constellation shapes.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Interactive Rendering:&lt;/strong&gt; The 3D space, spaceship, and stars are rendered
in your browser using &lt;a href=&quot;https://threejs.org/&quot;&gt;three.js&lt;/a&gt;. You can select how many
stars to create and whether to randomize the selection.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Real-Time Audio Synthesis:&lt;/strong&gt; Your interactions within the 3D space are
translated into a set of weighted text prompts (e.g., Deep House: 0.7,
Ambient Techno: 0.3) based on proximity, as shown in the sketch after this
list. These prompts are sent to the
&lt;a href=&quot;https://ai.google.dev/gemini-api/docs/music-generation&quot;&gt;Lyria RealTime API&lt;/a&gt;,
which synthesizes the music you hear, responding instantly to the
spaceship’s position.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Development and Deployment:&lt;/strong&gt; We used AI Studio to develop the applet
through its interactive code editor. We leveraged AI Studio’s Cloud Run
integration to deploy the application. This approach simplifies the
deployment process and helps protect the Gemini API key by securely proxying
requests to the Lyria RealTime API.&lt;/li&gt;
&lt;/ul&gt;
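
&lt;p&gt;As a rough sketch of the proximity weighting described above, here is one way
distances to nearby stars could be turned into normalized prompt weights. It assumes
plain Euclidean distance in the projected 3D space, which may differ from the app’s
exact weighting scheme.&lt;/p&gt;

&lt;div class=&quot;language-typescript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Minimal sketch: convert the ship&apos;s distance to nearby genre stars into
// normalized prompt weights for the Lyria RealTime API. Assumes simple
// Euclidean distance in the projected 3D space; the app&apos;s real weighting
// scheme may differ.
interface GenreStar {
  name: string;
  position: [number, number, number];
}

export function weightedPrompts(
  ship: [number, number, number],
  stars: GenreStar[],
  maxPrompts = 3
): { text: string; weight: number }[] {
  const scored = stars
    .map((star) =&amp;gt; {
      const d = Math.hypot(
        star.position[0] - ship[0],
        star.position[1] - ship[1],
        star.position[2] - ship[2]
      );
      return { text: star.name, weight: 1 / (1 + d) }; // closer = heavier
    })
    .sort((a, b) =&amp;gt; b.weight - a.weight)
    .slice(0, maxPrompts);

  const total = scored.reduce((sum, p) =&amp;gt; sum + p.weight, 0);
  return scored.map((p) =&amp;gt; ({ text: p.text, weight: p.weight / total }));
}

// e.g. weightedPrompts([0, 0, 0], stars) might yield something like
// [{ text: &quot;Deep House&quot;, weight: 0.7 }, { text: &quot;Ambient Techno&quot;, weight: 0.3 }]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;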

&lt;h2 id=&quot;a-new-frontier-for-musical-interaction&quot;&gt;A New Frontier for Musical Interaction&lt;/h2&gt;

&lt;p&gt;Space DJ is an exploration into new ways of interacting with generative AI
models for music. We hope to inspire new forms of musical expression and
discovery.&lt;/p&gt;

&lt;p&gt;Ready to take flight?
&lt;strong&gt;&lt;a href=&quot;https://spacedj-363947264390.us-west1.run.app&quot;&gt;Try Space DJ Now!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
</description>
        <pubDate>Mon, 03 Nov 2025 12:00:00 -0800</pubDate>
        <link>https://magenta.withgoogle.com/spacedj-announce</link>
        <guid isPermaLink="true">https://magenta.withgoogle.com/spacedj-announce</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Lyria RealTime VST: The Infinite Crate</title>
        <description>&lt;style&gt;
  .crate-cover-container p {
    display: flex;
    justify-content: center;
    align-items: center;
  }
&lt;/style&gt;

&lt;table align=&quot;center&quot; class=&quot;overview&quot;&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;td&gt;🎵&lt;a href=&quot;https://magenta.withgoogle.com/infinite-crate&quot;&gt;Get the plugin&lt;/a&gt;&lt;/td&gt;
    &lt;td&gt;📖 &lt;a href=&quot;https://g.co/magenta/lyria-realtime&quot;&gt;Learn more&lt;/a&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;Live Generative Music in your DAW&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Today, we’re happy to share &lt;a href=&quot;https://magenta.withgoogle.com/infinite-crate&quot;&gt;&lt;em&gt;The Infinite Crate&lt;/em&gt;&lt;/a&gt;, a DAW plugin prototype that integrates the &lt;a href=&quot;https://g.co/magenta/lyria-realtime&quot;&gt;Lyria RealTime API&lt;/a&gt; directly into your favorite music software. Use text prompts to steer a continuously evolving stream of music and feed the audio directly into your DAW for sampling, live performance, or a backing track to jam with.&lt;/p&gt;

&lt;style&gt;
  /* fallback */
  @font-face {
    font-family: &quot;Material Symbols Outlined&quot;;
    font-style: normal;
    font-weight: 400;
    src: url(https://fonts.gstatic.com/s/materialsymbolsoutlined/v250/kJF1BvYX7BgnkSrUwT8OhrdQw4oELdPIeeII9v6oDMzByHX9rA6RzaxHMPdY43zj-jCxv3fzvRNU22ZXGJpEpjC_1v-p_4MrImHCIJIZrDCvHOejbd5zrDAt.woff2)
      format(&quot;woff2&quot;);
  }
  .material-symbols-outlined {
    font-family: &quot;Material Symbols Outlined&quot;;
    font-weight: normal;
    font-style: normal;
    font-size: 36px;
    line-height: 1;
    letter-spacing: normal;
    text-transform: none;
    display: inline-block;
    white-space: nowrap;
    word-wrap: normal;
    direction: ltr;
    -webkit-font-feature-settings: &quot;liga&quot;;
    -webkit-font-smoothing: antialiased;
    color: black;
    border-radius: 100%;
    padding: 0.2em;
    background-color: rgb(241, 241, 241);
    transition: 0.1s;
  }
  .material-symbols-outlined:hover {
    background-color: rgb(160, 160, 160);
    color: white;
    transition: 0.1s;
    cursor: pointer;
  }
  .control-overlay {
    position: absolute;
    display: flex;
    top: 0;
    left: 0;
    width: 100%;
    height: 100%;
    background: rgba(0, 0, 0, 0);
    /* Align items to the end of the flex container (right) */
    justify-content: flex-end;
    /* Align items to the end of the cross axis (bottom) */
    align-items: flex-end;
    font-family: sans-serif;
    font-size: 1.2rem;
  }
  .video-container {
    position: relative;
    display: flex;
    justify-content: center;
  }
&lt;/style&gt;

&lt;div class=&quot;video-container&quot; id=&quot;video-container-infinitecrate&quot; style=&quot;width:100%;&quot;&gt;
  &lt;video height=&quot;100%&quot; src=&quot;/assets/infinite-crate/hero_loop.mp4&quot; id=&quot;video-infinitecrate&quot; muted=&quot;&quot; loop=&quot;&quot; autoplay=&quot;&quot;&gt;&lt;/video&gt;
  &lt;div class=&quot;control-overlay&quot;&gt;
    &lt;span class=&quot;material-symbols-outlined&quot; id=&quot;unmute-button-infinitecrate&quot;&gt;volume_off&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;script&gt;
  document.addEventListener(&quot;DOMContentLoaded&quot;, () =&gt; {
    let videoContainer = document.getElementById(
      &quot;video-container-infinitecrate&quot;
    );
    let unmuteButton = document.getElementById(
      &quot;unmute-button-infinitecrate&quot;
    );
    let video = document.getElementById(&quot;video-infinitecrate&quot;);

    unmuteButton.addEventListener(&quot;click&quot;, function () {
      video.muted = !video.muted;
      if (video.muted) {
        unmuteButton.textContent = &quot;volume_off&quot;;
      } else {
        unmuteButton.textContent = &quot;volume_up&quot;;
      }
    });

    if (&quot;IntersectionObserver&quot; in window) {
      const observer = new IntersectionObserver(
        (entries) =&gt; {
          entries.forEach((entry) =&gt; {
            if (entry.isIntersecting) {
              if (unmuteButton.textContent == &quot;volume_up&quot;) {
                video.muted = false;
              }
            } else {
              video.muted = true;
            }
          });
        },
        {
          root: null,
          rootMargin: &quot;0px&quot;,
          threshold: 0.1,
        }
      );

      observer.observe(videoContainer);
    }
  });
&lt;/script&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Integrating generative models with existing creative workflows has always been
an important part of Magenta’s mission, as it allows people more control and
agency in how they use these models in their own practice.
Our previous experiments with plugins, including
&lt;a href=&quot;https://g.co/magenta/studio&quot;&gt;Magenta Studio&lt;/a&gt; for manipulating MIDI clips and
&lt;a href=&quot;https://g.co/magenta/ddsp-vst&quot;&gt;DDSP VST&lt;/a&gt; for realtime audio-to-audio
transformations, have over a million downloads combined and have validated for
us the value of making these tools creatively accessible.&lt;/p&gt;

&lt;p&gt;We hope The Infinite Crate will be a welcome addition to this lineup.
We were inspired to create it through our collaborations with musicians such as
&lt;a href=&quot;https://www.youtube.com/watch?v=IUQW5LgBZvQ&quot;&gt;Jacob Collier&lt;/a&gt; and
&lt;a href=&quot;https://www.youtube.com/watch?v=thAhd82XnMc&quot;&gt;Toro y Moi&lt;/a&gt;, where we saw the
potential for integrating capabilities similar to
&lt;a href=&quot;https://labs.google/fx/tools/music-fx-dj&quot;&gt;MusicFX DJ&lt;/a&gt; more directly into studio and live
performance workflows.&lt;/p&gt;

&lt;p&gt;The Infinite Crate is cross-platform, available for both Mac and Windows, as a
VST3 plugin, an AU component, and a standalone app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Looking ahead&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lyria RealTime cannot run locally on consumer hardware, so the plugin requires an &lt;a href=&quot;https://aistudio.google.com/apikey&quot;&gt;API key (free for Lyria RealTime)&lt;/a&gt; and internet access.
We’re excited to explore complementing this approach with more efficient variants that can run locally on consumer hardware such as our recently released open model &lt;a href=&quot;https://magenta.withgoogle.com/magenta-realtime&quot;&gt;Magenta RealTime&lt;/a&gt;, so stay tuned!&lt;/p&gt;
</description>
        <pubDate>Wed, 09 Jul 2025 07:00:01 -0700</pubDate>
        <link>https://magenta.withgoogle.com/infinite-crate-announce</link>
        <guid isPermaLink="true">https://magenta.withgoogle.com/infinite-crate-announce</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Magenta RealTime: An Open-Weights Live Music Model</title>
        <description>&lt;style&gt;
  table tr.wrap {
    display: flex;
    flex-direction: row;
    flex-wrap: wrap;
  }
  table tr.wrap &gt; td {
    display: block;
    flex: 1;
  }
  td {text-align: center !important}
  .from {background-color: #d3d3d3;}
  img.inline {
    vertical-align: middle;
    display: inline-block;
    max-height: 16px;
    width: auto !important;
    margin-right: 6px;
  }
  img.centered {
    max-width: 90%;
    margin: auto;
  }
  video {
    max-width: 100% !important;
  }
&lt;/style&gt;

&lt;h1 id=&quot;magenta-realtime&quot;&gt;Magenta RealTime&lt;/h1&gt;

&lt;p&gt;Today, we’re happy to share a research preview of Magenta RealTime (Magenta RT), an open-weights live music model that allows you to interactively create, control and perform music in the moment.&lt;/p&gt;

&lt;table align=&quot;center&quot; class=&quot;overview&quot;&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;td&gt;&lt;img alt=&quot;&quot; src=&quot;/assets/magenta_realtime/colab.jpg&quot; class=&quot;inline&quot; /&gt;&lt;a href=&quot;https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Demo.ipynb&quot;&gt;Colab Demo&lt;/a&gt;&lt;/td&gt;
    &lt;td&gt;📝&lt;a href=&quot;https://arxiv.org/abs/2508.04651&quot;&gt;Paper&lt;/a&gt;&lt;/td&gt;
    &lt;td&gt;&lt;img alt=&quot;&quot; src=&quot;/assets/magenta_realtime/github.png&quot; class=&quot;inline&quot; /&gt;&lt;a href=&quot;https://github.com/magenta/magenta-realtime&quot;&gt;GitHub Code&lt;/a&gt;&lt;/td&gt;
    &lt;td&gt;&lt;img alt=&quot;&quot; src=&quot;/assets/magenta_realtime/hf-logo.png&quot; class=&quot;inline&quot; /&gt;&lt;a href=&quot;https://huggingface.co/google/magenta-realtime&quot;&gt;Model Card&lt;/a&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;

&lt;p&gt;Magenta RT is the latest in a series of models and applications developed as part of the Magenta Project. It is the open-weights cousin of &lt;a href=&quot;http://g.co/magenta/lyria-realtime&quot;&gt;Lyria RealTime&lt;/a&gt;, the real-time generative music model powering &lt;a href=&quot;https://labs.google/fx/tools/music-fx-dj&quot;&gt;Music FX DJ&lt;/a&gt; and the &lt;a href=&quot;https://magenta.withgoogle.com/lyria-realtime&quot;&gt;real-time music API&lt;/a&gt; in &lt;a href=&quot;https://aistudio.google.com/app/apps/bundled/promptdj-midi?showPreview=true%3Futm_source%3Ddeepmind.google&amp;amp;utm_medium=referral&amp;amp;utm_campaign=gdm&amp;amp;utm_content=&quot;&gt;Google AI Studio&lt;/a&gt;, developed by Google DeepMind. Real-time music generation models open up unique opportunities for live music exploration and performance, and we’re excited to see what new tools, experiences, and art you create with them.&lt;/p&gt;

&lt;p&gt;As an open-weights model, Magenta RT is targeted towards eventually running locally on consumer hardware (currently runs on free-tier Colab TPUs). It is an 800 million parameter autoregressive transformer model trained on ~190k hours of stock music from multiple sources, mostly instrumental. The model code is &lt;a href=&quot;https://github.com/magenta/magenta-realtime&quot;&gt;available on GitHub&lt;/a&gt; and the weights are available on Google Cloud Storage and &lt;a href=&quot;https://huggingface.co/google/magenta-realtime&quot;&gt;Hugging Face&lt;/a&gt; under permissive licenses with some additional bespoke terms. To see how to run inference with the model and try it yourself, check out our &lt;a href=&quot;https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Demo.ipynb&quot;&gt;Colab Demo&lt;/a&gt;. You may also &lt;a href=&quot;https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Finetune.ipynb&quot;&gt;customize MagentaRT on your own audio&lt;/a&gt; or explore &lt;a href=&quot;https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Audio_Injection.ipynb&quot;&gt;live audio input&lt;/a&gt;. Options for local, on-device inference are coming soon.&lt;/p&gt;

&lt;figure&gt;
  &lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/Ae1Kz2zmh9M&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&quot; referrerpolicy=&quot;strict-origin-when-cross-origin&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;how-it-works&quot;&gt;How it Works&lt;/h2&gt;

&lt;p&gt;Live generative music is particularly difficult because it requires
real-time generation (i.e. real-time factor &amp;gt; 1, generating X seconds of audio
in less than X seconds), causal streaming (i.e. online generation), and
low-latency controllability.&lt;/p&gt;

&lt;figure&gt;
  &lt;video width=&quot;100%&quot; src=&quot;/assets/lyria_realtime/lyria_realtime_diagram.mp4&quot; autoplay=&quot;&quot; muted=&quot;&quot; loop=&quot;&quot;&gt;&lt;/video&gt;
&lt;/figure&gt;

&lt;p&gt;Magenta RT overcomes these challenges by adapting the &lt;a href=&quot;https://research.google/pubs/musiclm-generating-music-from-text/&quot;&gt;MusicLM&lt;/a&gt; architecture to perform block autoregression. The model generates a continuous stream of music in sequential chunks, each conditioned on the previous audio output (10s of coarse audio tokens) and a style embedding to produce the next audio chunk (2s of fine audio tokens). By manipulating the style embedding (weighted average of &lt;a href=&quot;https://www.youtube.com/watch?v=Ae1Kz2zmh9M&quot;&gt;text&lt;/a&gt; or &lt;a href=&quot;https://www.youtube.com/watch?v=vHIf2UKXmp4&quot;&gt;audio&lt;/a&gt; prompt embeddings), players can shape and morph the music in real-time, mixing together different styles, instruments, and musical attributes.&lt;/p&gt;
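
&lt;p&gt;The style mixing itself is conceptually simple. As a minimal sketch (with plain
number arrays standing in for MusicCoCa embeddings, so the dimensions and any extra
normalization here are illustrative rather than the model’s exact recipe), blending
prompts is a weighted average of their embedding vectors:&lt;/p&gt;

&lt;div class=&quot;language-typescript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Minimal sketch: the style conditioning is a weighted average of prompt
// embeddings (from text or audio). Plain number arrays stand in for
// MusicCoCa embeddings; the real dimension and normalization may differ.
export function blendStyles(
  prompts: { embedding: number[]; weight: number }[]
): number[] {
  const dim = prompts[0].embedding.length;
  const totalWeight = prompts.reduce((sum, p) =&amp;gt; sum + p.weight, 0);
  const blended = new Array&amp;lt;number&amp;gt;(dim).fill(0);
  for (const p of prompts) {
    for (let i = 0; i &amp;lt; dim; i++) {
      blended[i] += (p.weight / totalWeight) * p.embedding[i];
    }
  }
  return blended;
}

// e.g. 70% &quot;funk breakbeat&quot; and 30% &quot;ambient synth pads&quot;:
// blendStyles([
//   { embedding: funkEmbedding, weight: 0.7 },
//   { embedding: padsEmbedding, weight: 0.3 },
// ]);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;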

&lt;p&gt;The latency of controls is set by the chunk size, which has a maximum output
size of two seconds but can be reduced to increase reactivity. On a Colab
free-tier TPU (v2-8), these two seconds of audio are generated in 1.25
seconds, giving a real-time factor of 1.6.&lt;/p&gt;

&lt;p&gt;Compared to the original MusicLM, we’ve upgraded our representations to SpectroStream for high-fidelity (48kHz stereo) audio, which is a successor to SoundStream (&lt;a href=&quot;https://arxiv.org/abs/2107.03312&quot;&gt;Zeghidour+ 21&lt;/a&gt;). We also trained a new joint music+text embedding model called MusicCoCa that is influenced by both MuLan (&lt;a href=&quot;https://arxiv.org/abs/2208.12415&quot;&gt;Huang+ 22&lt;/a&gt;) and the CoCa models (&lt;a href=&quot;https://arxiv.org/abs/2205.01917&quot;&gt;Yu+ 22&lt;/a&gt;). Additional details are provided in the &lt;a href=&quot;https://huggingface.co/google/magenta-realtime&quot;&gt;model card&lt;/a&gt; and deeper technical descriptions are available in our &lt;a href=&quot;https://arxiv.org/abs/2508.04651&quot;&gt;paper&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;latent-space-exploration-in-real-time&quot;&gt;Latent Space Exploration… In Real Time&lt;/h2&gt;

&lt;p&gt;Magenta’s earlier work in latent music models for MIDI clips (&lt;a href=&quot;https://magenta.withgoogle.com/music-vae&quot;&gt;MusicVAE&lt;/a&gt;, &lt;a href=&quot;https://magenta.withgoogle.com/groovae&quot;&gt;GrooVAE&lt;/a&gt;) and instrumental timbre (&lt;a href=&quot;https://magenta.withgoogle.com/nsynth&quot;&gt;NSynth&lt;/a&gt;) &lt;a href=&quot;https://vibertthio.com/runn/&quot;&gt;offered&lt;/a&gt; &lt;a href=&quot;https://vibertthio.com/sornting/&quot;&gt;a&lt;/a&gt; &lt;a href=&quot;https://experiments.withgoogle.com/nsynth-super&quot;&gt;wide&lt;/a&gt; &lt;a href=&quot;https://experiments.withgoogle.com/ai/beat-blender/view/&quot;&gt;range&lt;/a&gt; &lt;a href=&quot;https://experiments.withgoogle.com/sound-maker&quot;&gt;of&lt;/a&gt; &lt;a href=&quot;https://teampieshop.github.io/latent-loops/&quot;&gt;possible&lt;/a&gt; &lt;a href=&quot;https://experiments.withgoogle.com/ai/melody-mixer/view/&quot;&gt;interfaces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With Magenta RT, it is now possible to traverse the space of multi-instrumental
audio: explore the never-before-heard music between genres, unusual instrument
combinations, or &lt;a href=&quot;https://www.youtube.com/watch?v=vHIf2UKXmp4&quot;&gt;your own audio samples&lt;/a&gt;.&lt;/p&gt;

&lt;figure&gt;
  &lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/vHIf2UKXmp4&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&quot; referrerpolicy=&quot;strict-origin-when-cross-origin&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/figure&gt;

&lt;p&gt;The ability to adjust prompt mixtures in real-time allows you to efficiently
explore the sonic landscape and find novel textures and loops to use as part of
a larger piece of music.&lt;/p&gt;

&lt;p&gt;Real-time interactivity also provides the possibility of this latent exploration being its own type of musical performance, the interpolation through space combined with anchoring of the audio context producing a structure similar to a &lt;a href=&quot;https://www.youtube.com/watch?v=thAhd82XnMc&quot;&gt;DJ set&lt;/a&gt; or improvisation session. Beyond performance, it can also be used to provide interactive soundscapes for physical spaces like artist installations or virtual spaces like video games.&lt;/p&gt;

&lt;p&gt;This opens up a world of possibilities to build new tools and interfaces, and below you can see three example applications built on the &lt;a href=&quot;http://g.co/magenta/lyria-realtime&quot;&gt;Lyria RealTime API&lt;/a&gt; in AI Studio. Over time, Magenta RT will open up similar opportunities for on-device applications.&lt;/p&gt;

&lt;table align=&quot;center&quot;&gt; &lt;tr&gt;
&lt;td&gt; &lt;video src=&quot;/assets/lyria_realtime/image2.gif.mp4&quot; muted=&quot;&quot; autoplay=&quot;&quot; loop=&quot;&quot; class=&quot;centered&quot;&gt;&lt;/video&gt; &lt;a href=&quot;https://aistudio.google.com/apps/bundled/promptdj?showPreview=true&quot;&gt;&lt;b&gt;PromptDJ&lt;/b&gt;&lt;/a&gt; &lt;/td&gt;
&lt;td&gt; &lt;video src=&quot;/assets/lyria_realtime/image4.gif.mp4&quot; muted=&quot;&quot; autoplay=&quot;&quot; loop=&quot;&quot; class=&quot;centered&quot;&gt;&lt;/video&gt; &lt;a href=&quot;https://aistudio.google.com/apps/bundled/promptdj-midi?showPreview=true&quot;&gt;&lt;b&gt;PromptDJ MIDI&lt;/b&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt; &lt;video src=&quot;/assets/lyria_realtime/image1.gif.mp4&quot; muted=&quot;&quot; autoplay=&quot;&quot; loop=&quot;&quot; class=&quot;centered&quot;&gt;&lt;/video&gt;&lt;a href=&quot;https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221_pYvZFu7gFns_0w21GsW55moNR0gNmbS%22%5D,%22action%22:%22open%22,%22userId%22:%22103620588905886731599%22,%22resourceKeys%22:%7B%7D%7D&amp;amp;usp=sharing&quot;&gt; &lt;b&gt;PromptDJ Pad&lt;/b&gt; &lt;/a&gt; &lt;/td&gt;
&lt;/tr&gt; &lt;/table&gt;

&lt;h2 id=&quot;why-magenta-realtime&quot;&gt;Why Magenta RealTime?&lt;/h2&gt;

&lt;p&gt;Enhancing human creativity (not replacing it) has always been at the core of Magenta’s mission. AI, however, can be a double-edged sword for creative agency. It offers new opportunities for accessibility and expression, but it can also create a deluge of more passive creation and consumption compared to traditional methods. With this in mind, we have always strived to build tools that help close the skill gap to make creation more accessible, while also valuing existing musical practices and encouraging people to dig deeper in their own creative journeys. In this regard, real-time interactive music models offer several important advantages that have motivated our research over the years (&lt;a href=&quot;https://magenta.withgoogle.com/pianogenie&quot;&gt;Piano Genie&lt;/a&gt;, &lt;a href=&quot;https://magenta.withgoogle.com/ddsp&quot;&gt;DDSP&lt;/a&gt;, &lt;a href=&quot;https://magenta.withgoogle.com/nsynth&quot;&gt;NSynth&lt;/a&gt;, &lt;a href=&quot;https://experiments.withgoogle.com/ai/ai-duet/view/&quot;&gt;AI Duet,&lt;/a&gt; and more).&lt;/p&gt;

&lt;p&gt;Live interaction demands more from the player but can offer more in return. The continuous perception-action loop between the human and the model provides access to a creative flow state, centering the experience on the joy of the process over the final product. The higher bandwidth channel of communication and control often results in outputs that are more unique and personal, as every action the player takes (or doesn’t) has an effect.&lt;/p&gt;

&lt;p&gt;Finally, live models naturally avoid creating a deluge of passive content,
because they intrinsically balance listening with generation in a 1:1 ratio.
They create a unique moment in time, shared by the player, the model, and
listeners.&lt;/p&gt;

&lt;p&gt;While &lt;a href=&quot;http://g.co/magenta/lyria-realtime&quot;&gt;Lyria RealTime&lt;/a&gt; gives developers
and users around the globe access to state-of-the-art live music generation,
the Magenta Project remains committed to providing more direct access to code
and models, so that researchers, artists, and creative coders can build upon
and adapt them to achieve their creative goals.&lt;/p&gt;

&lt;h2 id=&quot;known-limitations&quot;&gt;Known Limitations&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Coverage of broad musical styles&lt;/strong&gt;. Magenta RT’s training data primarily consists of Western instrumental music. As a consequence, Magenta RT has incomplete coverage of both vocal performance and the broader landscape of rich musical traditions worldwide. For real-time generation with broader style coverage, we refer users to our &lt;a href=&quot;http://g.co/magenta/lyria-realtime&quot;&gt;Lyria RealTime API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vocals&lt;/strong&gt;. While the model is capable of generating non-lexical vocalizations
and humming, it is not conditioned on lyrics and is unlikely to generate actual
words. However, there remains some risk of generating explicit or
culturally-insensitive lyrical content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt;. Because the Magenta RT LLM operates on two second chunks, user
inputs for the style prompt may take two or more seconds to influence the
musical output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limited context&lt;/strong&gt;. Because the Magenta RT encoder has a maximum audio context
window of ten seconds, the model is unable to directly reference music that has
been output earlier than that. While the context is sufficient to enable the
model to create melodies, rhythms, and chord progressions, the model is not
capable of automatically creating longer-term song structures.&lt;/p&gt;

&lt;h2 id=&quot;future-work&quot;&gt;Future Work&lt;/h2&gt;

&lt;p&gt;Magenta RT and Lyria RealTime are pushing the boundaries of live generative music, and
we are happy that Magenta RT marks the return of open releases from Magenta.&lt;/p&gt;

&lt;p&gt;We are hard at work making Magenta RT run locally on your own device - stay
tuned for more info!&lt;/p&gt;

&lt;p&gt;We are also working on the next generation of real-time models with higher
quality, lower latency, and more interactivity, to create truly playable
instruments and live accompaniment.&lt;/p&gt;

&lt;h2 id=&quot;how-to-cite&quot;&gt;How to cite&lt;/h2&gt;

&lt;p&gt;Please cite our technical report:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BibTeX:&lt;/strong&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@article{gdmlyria2025live,
    title={Live Music Models},
    author={Caillon, Antoine and McWilliams, Brian and Tarakajian, Cassie and Simon, Ian and Manco, Ilaria and Engel, Jesse and Constant, Noah and Li, Pen and Denk, Timo I. and Lalama, Alberto and Agostinelli, Andrea and Huang, Anna and Manilow, Ethan and Brower, George and Erdogan, Hakan and Lei, Heidi and Rolnick, Itai and Grishchenko, Ivan and Orsini, Manu and Kastelic, Matej and Zuluaga, Mauricio and Verzetti, Mauro and Dooley, Michael and Skopek, Ondrej and Ferrer, Rafael and Borsos, Zal{\&apos;a}n and van den Oord, {\&quot;A}aron and Eck, Douglas and Collins, Eli and Baldridge, Jason and Hume, Tom and Donahue, Chris and Han, Kehang and Roberts, Adam},
    journal={arXiv:2508.04651},
    year={2025}
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Fri, 20 Jun 2025 07:00:01 -0700</pubDate>
        <link>https://magenta.withgoogle.com/magenta-realtime</link>
        <guid isPermaLink="true">https://magenta.withgoogle.com/magenta-realtime</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Introducing Lyria RealTime API</title>
        <description>

&lt;h1 id=&quot;lyria-realtime-api&quot;&gt;Lyria RealTime API&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Lyria team&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the last few years, we have continued to explore how different ways of interacting with generative AI technologies for music can lead to new creative possibilities. A primary focus has been on what we refer to as “&lt;a href=&quot;https://arxiv.org/abs/2508.04651&quot;&gt;live music models&lt;/a&gt;”, which can be controlled by a user in real-time.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://goo.gle/lyria-realtime&quot;&gt;Lyria RealTime&lt;/a&gt; is Google DeepMind’s latest model developed for this purpose, and we are excited to share an experimental API that &lt;strong&gt;anyone&lt;/strong&gt; can use to explore the technology, create some jams, develop an app, or build their own musical instruments. You can try a demo app now in &lt;a href=&quot;https://aistudio.google.com/app/apps/bundled/promptdj-midi?showPreview=true%3Futm_source%3Ddeepmind.google&amp;amp;utm_medium=referral&amp;amp;utm_campaign=gdm&amp;amp;utm_content=&quot;&gt;Google AI Studio&lt;/a&gt;, fork it to build your own, or have a look at the &lt;a href=&quot;https://ai.google.dev/gemini-api/docs/music-generation&quot;&gt;API documentation&lt;/a&gt;. For more details on how Lyria RealTime works, see our &lt;a href=&quot;https://arxiv.org/abs/2508.04651&quot;&gt;technical report&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here are a few interfaces we have open sourced in Google AI Studio for inspiration; you can easily fork them and make them your own:&lt;/p&gt;

&lt;style&gt;
  /* Style the tab */
  .tab {
    overflow: hidden;
    border: 1px solid #ccc;
    background-color: #f1f1f1;
  }

  /* Style the buttons that are used to open the tab content */
  .tab button {
    background-color: inherit;
    float: left;
    border: none;
    outline: none;
    cursor: pointer;
    padding: 14px 16px;
    transition: 0.3s;
    font-size: revert;
  }

  /* Change background color of buttons on hover */
  .tab button:hover {
    background-color: #ddd;
  }

  /* Create an active/current tablink class */
  .tab button.active {
    background-color: #ccc;
  }

  /* Style the tab content */
  .tabcontent_demos {
    display: none;
    padding: 6px 12px;
    border: 1px solid #ccc;
    border-top: none;
    justify-content: center;
    align-items: center;
    flex-direction: row;
    height: 20em;
  }

  .tabcontent_demos.active {
    display: flex;
  }

  @media screen and (max-width: 700px) {
    .tabcontent_demos {
      height: initial;
      flex-direction: column;
    }
  }

  @media screen and (max-width: 400px) {
    .tab button {
      font-size: 0.68em;
    }
  }
&lt;/style&gt;

&lt;script&gt;
  function openTab(events, tabName) {
    // Declare all variables
    var i;
    var tabcontent;
    var tablinks;

    // Get all elements with class=&quot;tabcontent&quot; and hide them
    tabcontent = document.getElementsByClassName(&quot;tabcontent_demos&quot;);
    for (i = 0; i &lt; tabcontent.length; i++) {
      tabcontent[i].style.display = &quot;none&quot;;
    }

    // Get all elements with class=&quot;tablinks&quot; and remove the class &quot;active&quot;
    tablinks = document.getElementsByClassName(&quot;tablinks_demos&quot;);
    for (i = 0; i &lt; tablinks.length; i++) {
      tablinks[i].className = tablinks[i].className.replace(&quot; active&quot;, &quot;&quot;);
    }

    // Show the current tab, and add an &quot;active&quot; class to the button that opened the tab
    document.getElementById(tabName).style.display = &quot;flex&quot;;
    events.currentTarget.className += &quot; active&quot;;
  }
&lt;/script&gt;

&lt;div class=&quot;tab&quot;&gt;
  &lt;button class=&quot;tablinks_demos&quot; onclick=&quot;openTab(event, &apos;pdj&apos;)&quot;&gt;
    PromptDJ
  &lt;/button&gt;
  &lt;button class=&quot;tablinks_demos&quot; onclick=&quot;openTab(event, &apos;pdjmidi&apos;)&quot;&gt;
    PromptDJ MIDI
  &lt;/button&gt;
  &lt;button class=&quot;tablinks_demos&quot; onclick=&quot;openTab(event, &apos;pdjpad&apos;)&quot;&gt;
    PromptDJ Pad
  &lt;/button&gt;
&lt;/div&gt;

&lt;div id=&quot;pdj&quot; class=&quot;tabcontent_demos active&quot;&gt;
  &lt;div style=&quot;flex: 1&quot;&gt;
    &lt;a href=&quot;https://aistudio.google.com/apps/bundled/promptdj?showPreview=true&quot;&gt;&lt;h3&gt;PromptDJ&lt;/h3&gt;&lt;/a&gt;
    &lt;p&gt;
      Our most fully-featured demo allows you to add prompts and use sliders to
      control their relative impact on the music. Advanced Settings let you try
      out manual overrides for different musical aspects like note density,
      tempo, and key.
    &lt;/p&gt;
    &lt;a href=&quot;https://aistudio.google.com/apps/bundled/promptdj?showPreview=true&quot;&gt;&lt;b&gt;Try it now!&lt;/b&gt;&lt;/a&gt;
  &lt;/div&gt;
  &lt;div style=&quot;flex: 1; display: flex; justify-content: flex-end&quot;&gt;
    &lt;video src=&quot;/assets/lyria_realtime/image2.gif.mp4&quot; muted=&quot;&quot; autoplay=&quot;&quot; loop=&quot;&quot;&gt;&lt;/video&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;div id=&quot;pdjmidi&quot; class=&quot;tabcontent_demos&quot;&gt;
  &lt;div style=&quot;flex: 1&quot;&gt;
    &lt;a href=&quot;https://aistudio.google.com/apps/bundled/promptdj-midi?showPreview=true&quot;&gt;&lt;h3&gt;PromptDJ MIDI&lt;/h3&gt;&lt;/a&gt;
    &lt;p&gt;
      With PromptDJ MIDI, you can use a virtual MIDI controller to mix together
      text descriptors (that you can edit) and produce a single stream of music.
      You can even map the knobs to a physical MIDI controller via WebMIDI, as
      Toro y Moi did during the &lt;a href=&quot;https://www.youtube.com/watch?v=thAhd82XnMc&quot;&gt;I/O preshow&lt;/a&gt;.
    &lt;/p&gt;
    &lt;a href=&quot;https://aistudio.google.com/apps/bundled/promptdj-midi?showPreview=true&quot;&gt;&lt;b&gt;Try it now!&lt;/b&gt;&lt;/a&gt;
  &lt;/div&gt;
  &lt;div style=&quot;flex: 1; display: flex; justify-content: flex-end&quot;&gt;
    &lt;video src=&quot;/assets/lyria_realtime/image4.gif.mp4&quot; muted=&quot;&quot; autoplay=&quot;&quot; loop=&quot;&quot;&gt;&lt;/video&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;div id=&quot;pdjpad&quot; class=&quot;tabcontent_demos&quot;&gt;
  &lt;div style=&quot;flex: 1&quot;&gt;
    &lt;a href=&quot;https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221_pYvZFu7gFns_0w21GsW55moNR0gNmbS%22%5D,%22action%22:%22open%22,%22userId%22:%22103620588905886731599%22,%22resourceKeys%22:%7B%7D%7D&amp;amp;usp=sharing&quot;&gt;&lt;h3&gt;PromptDJ Pad&lt;/h3&gt;&lt;/a&gt;
    &lt;p&gt;
      PromptDJ Pad harkens back to our earlier experiments with latent-space
      interfaces like
      &lt;a href=&quot;https://nsynthsuper.withgoogle.com/&quot;&gt;NSynth Super&lt;/a&gt; and
      &lt;a href=&quot;https://experiments.withgoogle.com/ai/beat-blender/view/&quot;&gt;MusicVAE Beat Blender&lt;/a&gt;, allowing you to easily explore the space between four editable prompts.
    &lt;/p&gt;
    &lt;a href=&quot;https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221_pYvZFu7gFns_0w21GsW55moNR0gNmbS%22%5D,%22action%22:%22open%22,%22userId%22:%22103620588905886731599%22,%22resourceKeys%22:%7B%7D%7D&amp;amp;usp=sharing&quot;&gt;&lt;b&gt;Try it now!&lt;/b&gt;&lt;/a&gt;
  &lt;/div&gt;
  &lt;div style=&quot;flex: 1; display: flex; justify-content: flex-end&quot;&gt;
    &lt;video src=&quot;/assets/lyria_realtime/image1.gif.mp4&quot; muted=&quot;&quot; autoplay=&quot;&quot; loop=&quot;&quot;&gt;&lt;/video&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;A key advantage of the API is its versatility: it can be called from various platforms, not just web apps. For instance, we’ve developed a VST plugin called &lt;a href=&quot;https://magenta.withgoogle.com/infinite-crate-announce&quot;&gt;The Infinite Crate&lt;/a&gt;, which enables seamless interaction between Lyria RealTime and the digital audio workstation of your choice!&lt;/p&gt;

&lt;style&gt;
  /* fallback */
  @font-face {
    font-family: &quot;Material Symbols Outlined&quot;;
    font-style: normal;
    font-weight: 400;
    src: url(https://fonts.gstatic.com/s/materialsymbolsoutlined/v250/kJF1BvYX7BgnkSrUwT8OhrdQw4oELdPIeeII9v6oDMzByHX9rA6RzaxHMPdY43zj-jCxv3fzvRNU22ZXGJpEpjC_1v-p_4MrImHCIJIZrDCvHOejbd5zrDAt.woff2)
      format(&quot;woff2&quot;);
  }
  .material-symbols-outlined {
    font-family: &quot;Material Symbols Outlined&quot;;
    font-weight: normal;
    font-style: normal;
    font-size: 36px;
    line-height: 1;
    letter-spacing: normal;
    text-transform: none;
    display: inline-block;
    white-space: nowrap;
    word-wrap: normal;
    direction: ltr;
    -webkit-font-feature-settings: &quot;liga&quot;;
    -webkit-font-smoothing: antialiased;
    color: black;
    border-radius: 100%;
    padding: 0.2em;
    background-color: rgb(241, 241, 241);
    transition: 0.1s;
  }
  .material-symbols-outlined:hover {
    background-color: rgb(160, 160, 160);
    color: white;
    transition: 0.1s;
    cursor: pointer;
  }
  .control-overlay {
    position: absolute;
    display: flex;
    top: 0;
    left: 0;
    width: 100%;
    height: 100%;
    background: rgba(0, 0, 0, 0);
    /* Align items to the end of the flex container (right) */
    justify-content: flex-end;
    /* Align items to the end of the cross axis (bottom) */
    align-items: flex-end;
    font-family: sans-serif;
    font-size: 1.2rem;
  }
  .video-container {
    position: relative;
    display: flex;
    justify-content: center;
  }
&lt;/style&gt;

&lt;div class=&quot;video-container&quot; id=&quot;video-container-infinitecrate&quot; style=&quot;width:100%;&quot;&gt;
  &lt;video height=&quot;100%&quot; src=&quot;/assets/infinite-crate/hero_loop.mp4&quot; id=&quot;video-infinitecrate&quot; muted=&quot;&quot; loop=&quot;&quot; autoplay=&quot;&quot;&gt;&lt;/video&gt;
  &lt;div class=&quot;control-overlay&quot;&gt;
    &lt;span class=&quot;material-symbols-outlined&quot; id=&quot;unmute-button-infinitecrate&quot;&gt;volume_off&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;script&gt;
  document.addEventListener(&quot;DOMContentLoaded&quot;, () =&gt; {
    let videoContainer = document.getElementById(
      &quot;video-container-infinitecrate&quot;
    );
    let unmuteButton = document.getElementById(
      &quot;unmute-button-infinitecrate&quot;
    );
    let video = document.getElementById(&quot;video-infinitecrate&quot;);

    unmuteButton.addEventListener(&quot;click&quot;, function () {
      video.muted = !video.muted;
      if (video.muted) {
        unmuteButton.textContent = &quot;volume_off&quot;;
      } else {
        unmuteButton.textContent = &quot;volume_up&quot;;
      }
    });

    if (&quot;IntersectionObserver&quot; in window) {
      const observer = new IntersectionObserver(
        (entries) =&gt; {
          entries.forEach((entry) =&gt; {
            if (entry.isIntersecting) {
              if (unmuteButton.textContent == &quot;volume_up&quot;) {
                video.muted = false;
              }
            } else {
              video.muted = true;
            }
          });
        },
        {
          root: null,
          rootMargin: &quot;0px&quot;,
          threshold: 0.1,
        }
      );

      observer.observe(videoContainer);
    }
  });
&lt;/script&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h2 id=&quot;capabilities&quot;&gt;Capabilities&lt;/h2&gt;

&lt;p&gt;With &lt;strong&gt;Lyria RealTime&lt;/strong&gt;, it is possible to traverse the space of multi-instrumental audio: explore the never-before-heard music between genres, unusual instrument combinations, or abstract concepts.&lt;/p&gt;


&lt;div class=&quot;video-container&quot; id=&quot;video-container-musicfxdj&quot; style=&quot;width:100%;&quot;&gt;
  &lt;video height=&quot;100%&quot; src=&quot;/assets/lyria_realtime/mixing_cropped.mp4&quot; id=&quot;video-musicfxdj&quot; muted=&quot;&quot; loop=&quot;&quot; autoplay=&quot;&quot;&gt;&lt;/video&gt;
  &lt;div class=&quot;control-overlay&quot;&gt;
    &lt;span class=&quot;material-symbols-outlined&quot; id=&quot;unmute-button-musicfxdj&quot;&gt;volume_off&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;script&gt;
  document.addEventListener(&quot;DOMContentLoaded&quot;, () =&gt; {
    let videoContainer = document.getElementById(
      &quot;video-container-musicfxdj&quot;
    );
    let unmuteButton = document.getElementById(
      &quot;unmute-button-musicfxdj&quot;
    );
    let video = document.getElementById(&quot;video-musicfxdj&quot;);

    unmuteButton.addEventListener(&quot;click&quot;, function () {
      video.muted = !video.muted;
      if (video.muted) {
        unmuteButton.textContent = &quot;volume_off&quot;;
      } else {
        unmuteButton.textContent = &quot;volume_up&quot;;
      }
    });

    if (&quot;IntersectionObserver&quot; in window) {
      const observer = new IntersectionObserver(
        (entries) =&gt; {
          entries.forEach((entry) =&gt; {
            if (entry.isIntersecting) {
              if (unmuteButton.textContent == &quot;volume_up&quot;) {
                video.muted = false;
              }
            } else {
              video.muted = true;
            }
          });
        },
        {
          root: null,
          rootMargin: &quot;0px&quot;,
          threshold: 0.1,
        }
      );

      observer.observe(videoContainer);
    }
  });
&lt;/script&gt;

&lt;p&gt;The core capabilities of the model and API are as follows (a short control sketch appears after the list):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Continuous generation of a 48kHz stereo music stream.&lt;/li&gt;
  &lt;li&gt;Low latency: at most 2 seconds between a control change and its effect.&lt;/li&gt;
  &lt;li&gt;Latent space steering based on a mixture of text descriptors.&lt;/li&gt;
  &lt;li&gt;Manual control over music features
    &lt;ul&gt;
      &lt;li&gt;Tempo, key.&lt;/li&gt;
      &lt;li&gt;Options to reduce or silence particular instrument groups (drums, bass, other).&lt;/li&gt;
      &lt;li&gt;Control for density of note onsets.&lt;/li&gt;
      &lt;li&gt;Control for spectral brightness.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Sampling temperature and top-k settings (“chaos” control).&lt;/li&gt;
&lt;/ul&gt;
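
&lt;p&gt;As a sketch of how these capabilities surface in code (field names follow our reading of the music generation config in the API docs and may differ; &lt;code&gt;session&lt;/code&gt; is the Lyria RealTime session from the earlier sketch), manual controls can be adjusted while the stream keeps playing:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-javascript&quot;&gt;
// Nudge the music mid-stream without interrupting playback.
await session.setMusicGenerationConfig({
  musicGenerationConfig: {
    bpm: 128,          // manual tempo control
    density: 0.3,      // sparser note onsets
    brightness: 0.8,   // more spectral brightness
    muteDrums: true,   // silence the drum group
    temperature: 1.4,  // more &quot;chaos&quot;
    topK: 40,
  },
});
// Per the docs, some controls (e.g. tempo and key) only take effect
// after the generation context is reset.
await session.resetContext();
&lt;/code&gt;&lt;/pre&gt;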

&lt;div class=&quot;carousel__holder&quot; style=&quot;padding-bottom:50%;&quot;&gt;
  &lt;div id=&quot;carousel0&quot; class=&quot;carousel&quot;&gt;
    
    &lt;input class=&quot;carousel__activator&quot; type=&quot;radio&quot; name=&quot;carousel0&quot; id=&quot;slide-0-0&quot; checked=&quot;checked&quot; /&gt;
    
    &lt;input class=&quot;carousel__activator&quot; type=&quot;radio&quot; name=&quot;carousel0&quot; id=&quot;slide-0-1&quot; /&gt;
    
    &lt;input class=&quot;carousel__activator&quot; type=&quot;radio&quot; name=&quot;carousel0&quot; id=&quot;slide-0-2&quot; /&gt;
    
    &lt;input class=&quot;carousel__activator&quot; type=&quot;radio&quot; name=&quot;carousel0&quot; id=&quot;slide-0-3&quot; /&gt;
         

    &lt;div class=&quot;carousel__controls&quot;&gt;
      &lt;label class=&quot;carousel__control carousel__control--backward&quot; for=&quot;slide-0-3&quot;&gt;&lt;/label&gt;
      &lt;label class=&quot;carousel__control carousel__control--forward&quot; for=&quot;slide-0-1&quot;&gt;&lt;/label&gt;
    &lt;/div&gt;
       

    &lt;div class=&quot;carousel__controls&quot;&gt;
      &lt;label class=&quot;carousel__control carousel__control--backward&quot; for=&quot;slide-0-0&quot;&gt;&lt;/label&gt;
      &lt;label class=&quot;carousel__control carousel__control--forward&quot; for=&quot;slide-0-2&quot;&gt;&lt;/label&gt;
    &lt;/div&gt;
       

    &lt;div class=&quot;carousel__controls&quot;&gt;
      &lt;label class=&quot;carousel__control carousel__control--backward&quot; for=&quot;slide-0-1&quot;&gt;&lt;/label&gt;
      &lt;label class=&quot;carousel__control carousel__control--forward&quot; for=&quot;slide-0-3&quot;&gt;&lt;/label&gt;
    &lt;/div&gt;
       

    &lt;div class=&quot;carousel__controls&quot;&gt;
      &lt;label class=&quot;carousel__control carousel__control--backward&quot; for=&quot;slide-0-2&quot;&gt;&lt;/label&gt;
      &lt;label class=&quot;carousel__control carousel__control--forward&quot; for=&quot;slide-0-0&quot;&gt;&lt;/label&gt;
    &lt;/div&gt;
    

    &lt;div class=&quot;carousel__track&quot;&gt;
      &lt;ul&gt;
        
        &lt;li class=&quot;carousel__slide&quot;&gt;
          &lt;video id=&quot;video-0-0&quot; height=&quot;100%&quot; class=&quot;video-carousel-0&quot; src=&quot;https://deepmind.google/api/blob/website/media/Brightness.mp4&quot; loop=&quot;&quot; muted=&quot;&quot; autoplay=&quot;&quot;&gt;&lt;/video&gt;
        &lt;/li&gt;
        
        &lt;li class=&quot;carousel__slide&quot;&gt;
          &lt;video id=&quot;video-0-1&quot; height=&quot;100%&quot; class=&quot;video-carousel-0&quot; src=&quot;https://deepmind.google/api/blob/website/media/Chaos.mp4&quot; loop=&quot;&quot; muted=&quot;&quot; autoplay=&quot;&quot;&gt;&lt;/video&gt;
        &lt;/li&gt;
        
        &lt;li class=&quot;carousel__slide&quot;&gt;
          &lt;video id=&quot;video-0-2&quot; height=&quot;100%&quot; class=&quot;video-carousel-0&quot; src=&quot;https://deepmind.google/api/blob/website/media/Density.mp4&quot; loop=&quot;&quot; muted=&quot;&quot; autoplay=&quot;&quot;&gt;&lt;/video&gt;
        &lt;/li&gt;
        
        &lt;li class=&quot;carousel__slide&quot;&gt;
          &lt;video id=&quot;video-0-3&quot; height=&quot;100%&quot; class=&quot;video-carousel-0&quot; src=&quot;https://deepmind.google/api/blob/website/media/Instrumentation.mp4&quot; loop=&quot;&quot; muted=&quot;&quot; autoplay=&quot;&quot;&gt;&lt;/video&gt;
        &lt;/li&gt;
        
      &lt;/ul&gt;
    &lt;/div&gt;

    &lt;div class=&quot;carousel__indicators&quot;&gt;
      
      &lt;label class=&quot;carousel__indicator&quot; for=&quot;slide-0-0&quot;&gt;&lt;/label&gt;
      
      &lt;label class=&quot;carousel__indicator&quot; for=&quot;slide-0-1&quot;&gt;&lt;/label&gt;
      
      &lt;label class=&quot;carousel__indicator&quot; for=&quot;slide-0-2&quot;&gt;&lt;/label&gt;
      
      &lt;label class=&quot;carousel__indicator&quot; for=&quot;slide-0-3&quot;&gt;&lt;/label&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mute-overlay&quot;&gt;
    &lt;div id=&quot;mute-button-0&quot; class=&quot;mute-button&quot;&gt;
      &lt;span class=&quot;material-symbols-outlined&quot; id=&quot;mute-label-0&quot;&gt;volume_off&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;style&gt;
  /* fallback */
  @font-face {
    font-family: &quot;Material Symbols Outlined&quot;;
    font-style: normal;
    font-weight: 400;
    src: url(https://fonts.gstatic.com/s/materialsymbolsoutlined/v250/kJF1BvYX7BgnkSrUwT8OhrdQw4oELdPIeeII9v6oDMzByHX9rA6RzaxHMPdY43zj-jCxv3fzvRNU22ZXGJpEpjC_1v-p_4MrImHCIJIZrDCvHOejbd5zrDAt.woff2)
      format(&quot;woff2&quot;);
  }
  .material-symbols-outlined {
    font-family: &quot;Material Symbols Outlined&quot;;
    font-weight: normal;
    font-style: normal;
    font-size: 48px;
    line-height: 1;
    letter-spacing: normal;
    text-transform: none;
    display: inline-block;
    white-space: nowrap;
    word-wrap: normal;
    direction: ltr;
    -webkit-font-feature-settings: &quot;liga&quot;;
    -webkit-font-smoothing: antialiased;
    color: black;
    border-radius: 100%;
    padding: 0.2em;
    background-color: rgb(241, 241, 241);
    transition: 0.1s;
  }
  .material-symbols-outlined:hover {
    background-color: rgb(160, 160, 160);
    color: white;
    transition: 0.1s;
    cursor:pointer;
  }
  .mute-overlay {
    position: absolute;
    top: 0;
    left: 0;
    width: 100%;
    height: 100%;
    background: rgba(0, 0, 0, 0);
    display: flex;
    /* Align items to the end of the flex container (right) */
    justify-content: flex-end;
    /* Align items to the end of the cross axis (bottom) */
    align-items: flex-end;
    font-family: sans-serif;
    font-size: 1.2rem;
    transition: opacity 0.3s ease;
  }
  .carousel__holder {width: 100%; position: relative;  margin: 1rem 0 1rem;}
  .carousel {
    height: 100%;
    width: 100%;
    overflow: hidden;
    text-align: center;
    position: absolute;
    padding: 0;
  }
  .carousel__controls,
  .carousel__activator {
    display: none;
  }
  
  .carousel__activator:nth-of-type(1):checked ~ .carousel__track {
    -webkit-transform: translateX(0);
            transform: translateX(0);
  }
  .carousel__activator:nth-of-type(1):checked ~ .carousel__slide:nth-of-type(1) {
    transition: opacity 0.5s, -webkit-transform 0.5s;
    transition: opacity 0.5s, transform 0.5s;
    transition: opacity 0.5s, transform 0.5s, -webkit-transform 0.5s;
    top: 0;
    left: 0;
    right: 0;
    opacity: 1;
    -webkit-transform: scale(1);
            transform: scale(1);
  }
  .carousel__activator:nth-of-type(1):checked ~ .carousel__controls:nth-of-type(1) {
    display: block;
    opacity: 1;
  }
  .carousel__activator:nth-of-type(1):checked ~ .carousel__indicators .carousel__indicator:nth-of-type(1) {
    opacity: 1;
  }
  
  .carousel__activator:nth-of-type(2):checked ~ .carousel__track {
    -webkit-transform: translateX(-100%);
            transform: translateX(-100%);
  }
  .carousel__activator:nth-of-type(2):checked ~ .carousel__slide:nth-of-type(2) {
    transition: opacity 0.5s, -webkit-transform 0.5s;
    transition: opacity 0.5s, transform 0.5s;
    transition: opacity 0.5s, transform 0.5s, -webkit-transform 0.5s;
    top: 0;
    left: 0;
    right: 0;
    opacity: 1;
    -webkit-transform: scale(1);
            transform: scale(1);
  }
  .carousel__activator:nth-of-type(2):checked ~ .carousel__controls:nth-of-type(2) {
    display: block;
    opacity: 1;
  }
  .carousel__activator:nth-of-type(2):checked ~ .carousel__indicators .carousel__indicator:nth-of-type(2) {
    opacity: 1;
  }
  
  .carousel__activator:nth-of-type(3):checked ~ .carousel__track {
    -webkit-transform: translateX(-200%);
            transform: translateX(-200%);
  }
  .carousel__activator:nth-of-type(3):checked ~ .carousel__slide:nth-of-type(3) {
    transition: opacity 0.5s, -webkit-transform 0.5s;
    transition: opacity 0.5s, transform 0.5s;
    transition: opacity 0.5s, transform 0.5s, -webkit-transform 0.5s;
    top: 0;
    left: 0;
    right: 0;
    opacity: 1;
    -webkit-transform: scale(1);
            transform: scale(1);
  }
  .carousel__activator:nth-of-type(3):checked ~ .carousel__controls:nth-of-type(3) {
    display: block;
    opacity: 1;
  }
  .carousel__activator:nth-of-type(3):checked ~ .carousel__indicators .carousel__indicator:nth-of-type(3) {
    opacity: 1;
  }
  
  .carousel__activator:nth-of-type(4):checked ~ .carousel__track {
    -webkit-transform: translateX(-300%);
            transform: translateX(-300%);
  }
  .carousel__activator:nth-of-type(4):checked ~ .carousel__slide:nth-of-type(4) {
    transition: opacity 0.5s, -webkit-transform 0.5s;
    transition: opacity 0.5s, transform 0.5s;
    transition: opacity 0.5s, transform 0.5s, -webkit-transform 0.5s;
    top: 0;
    left: 0;
    right: 0;
    opacity: 1;
    -webkit-transform: scale(1);
            transform: scale(1);
  }
  .carousel__activator:nth-of-type(4):checked ~ .carousel__controls:nth-of-type(4) {
    display: block;
    opacity: 1;
  }
  .carousel__activator:nth-of-type(4):checked ~ .carousel__indicators .carousel__indicator:nth-of-type(4) {
    opacity: 1;
  }
  

  .carousel__control {
    height: 30px;
    width: 30px;
    margin-top: -15px;
    top: 50%;
    position: absolute;
    display: block;
    cursor: pointer;
    border-width: 5px 5px 0 0;
    border-style: solid;
    border-color:rgb(84, 84, 84);
    opacity: 1;
    outline: 0;
    z-index: 3;
  }
  .carousel__control:hover {
    opacity: 1;
  }
  .carousel__control--backward {
    left: 20px;
    -webkit-transform: rotate(-135deg);
            transform: rotate(-135deg);
  }
  .carousel__control--forward {
    right: 20px;
    -webkit-transform: rotate(45deg);
            transform: rotate(45deg);
  }
  .carousel__indicators {
    position: absolute;
    bottom: 20px;
    width: 100%;
    text-align: center;
  }
  .carousel__indicator {
    height: 15px;
    width: 15px;
    border-radius: 100%;
    display: inline-block;
    z-index: 2;
    cursor: pointer;
    opacity: 0.35;
    margin: 0 2.5px 0 2.5px;
  }
  .carousel__indicator:hover {
    opacity: 0.75;
  }
  .carousel__track {
    position: absolute;
    top: 0;
    right: 0;
    bottom: 0;
    left: 0;
    padding: 0;
    margin: 0;
    transition: -webkit-transform 0.5s ease 0s;
    transition: transform 0.5s ease 0s;
    transition: transform 0.5s ease 0s, -webkit-transform 0.5s ease 0s;
  }
  .carousel__track .carousel__slide {
    display: block;
    top: 0;
    left: 0;
    right: 0;
    opacity: 1;
  }
  
  .carousel__track .carousel__slide:nth-of-type(1) {
    -webkit-transform: translateX(0);
            transform: translateX(0);
  }
  
  .carousel__track .carousel__slide:nth-of-type(2) {
    -webkit-transform: translateX(100%);
            transform: translateX(100%);
  }
  
  .carousel__track .carousel__slide:nth-of-type(3) {
    -webkit-transform: translateX(200%);
            transform: translateX(200%);
  }
  
  .carousel__track .carousel__slide:nth-of-type(4) {
    -webkit-transform: translateX(300%);
            transform: translateX(300%);
  }
  

  .carousel--scale .carousel__slide {
    -webkit-transform: scale(0);
            transform: scale(0);
  }
  .carousel__slide {
    height: 100%;
    position: absolute;
    opacity: 0;
    overflow: hidden;
  }
  .carousel__slide .overlay {height: 100%;}
  .carousel--thumb .carousel__indicator {
    height: 30px;
    width: 30px;
  }
  .carousel__indicator {
    background-color: rgb(84, 84, 84);
  }
  
  .carousel__slide:nth-of-type(1),
  .carousel--thumb .carousel__indicators .carousel__indicator:nth-of-type(1) {
    background-size: cover;
    background-position: center;
  }
  
  .carousel__slide:nth-of-type(2),
  .carousel--thumb .carousel__indicators .carousel__indicator:nth-of-type(2) {
    background-size: cover;
    background-position: center;
  }
  
  .carousel__slide:nth-of-type(3),
  .carousel--thumb .carousel__indicators .carousel__indicator:nth-of-type(3) {
    background-size: cover;
    background-position: center;
  }
  
  .carousel__slide:nth-of-type(4),
  .carousel--thumb .carousel__indicators .carousel__indicator:nth-of-type(4) {
    background-size: cover;
    background-position: center;
  }
  
&lt;/style&gt;

&lt;script&gt;
  document.addEventListener(&quot;DOMContentLoaded&quot;, () =&gt; {
    const carouselId = &quot;carousel0&quot;;
    const muteButtonId = &quot;mute-button-0&quot;;
    const muteLabelId = &quot;mute-label-0&quot;;
    const carouselNumber = &quot;0&quot;; // Capture the Liquid number for use in IDs

    const carouselEl = document.getElementById(carouselId);
    const muteButton = document.getElementById(muteButtonId);
    const muteLabel = document.getElementById(muteLabelId);

    if (!carouselEl) return;

    carouselEl.muted = true;

    // --- Mute/Unmute Button Logic ---
    muteButton.addEventListener(&quot;click&quot;, () =&gt; {
      // Toggle the mute state for the carousel
      carouselEl.muted = !carouselEl.muted;

      // Update the mute label to reflect the new state
      muteLabel.textContent = carouselEl.muted ? &quot;volume_off&quot; : &quot;volume_up&quot;;

      // Find the radio button for the currently visible slide
      const activeRadio = carouselEl.querySelector(
        &quot;.carousel__activator:checked&quot;
      );
      if (!activeRadio) return;

      // Extract the slide index from the active radio&apos;s ID
      const idParts = activeRadio.id.split(&quot;-&quot;);
      const videoIndex = idParts[idParts.length - 1];

      // Construct the corresponding video&apos;s ID and get the element
      const activeVideo = document.getElementById(
        `video-${carouselNumber}-${videoIndex}`
      );

      if (activeVideo) {
        // Set the muted property of the active video based on the new state
        activeVideo.muted = carouselEl.muted;
      }
    });

    // --- Slide Change and Intersection Observer Logic ---
    const videos = carouselEl.querySelectorAll(&quot;.video-carousel-0&quot;);

    if (&quot;IntersectionObserver&quot; in window) {
      for (const video of videos) {
        const observer = new IntersectionObserver(
          (entries) =&gt; {
            entries.forEach((entry) =&gt; {
              if (entry.isIntersecting) {
                video.currentTime = 0; // restart when scrolled back into view
                video.muted = carouselEl.muted;
              } else {
                video.muted = true;
              }
            });
          },
          {
            root: null,
            rootMargin: &quot;0px&quot;,
            threshold: 0.1,
          }
        );

        observer.observe(video);
      }
    }
  });
&lt;/script&gt;

&lt;h2 id=&quot;interfaces-for-live-music-models&quot;&gt;Interfaces for Live Music Models&lt;/h2&gt;

&lt;p&gt;One of the things we are most excited about with live music models is the number of novel interfaces they make possible by mapping human actions to musical controls. This harkens back to our earlier work with &lt;a href=&quot;http://g.co/magenta/js&quot;&gt;Magenta.js&lt;/a&gt; and the large number of &lt;a href=&quot;http://g.co/magenta/demos&quot;&gt;applications&lt;/a&gt; it and other earlier Magenta technologies spawned. We hope the Lyria RealTime API will empower even more creativity by developers.&lt;/p&gt;

&lt;p&gt;Live music models introduce a different interaction paradigm from text-to-song generators, which have impressive capabilities but lack the instantaneous feedback loops available to players of traditional instruments. The goal of models like Lyria RealTime is to put the human more deeply in the loop, centering the experience on the joy of the process rather than the final product. The higher-bandwidth channel of communication and control often results in outputs that are more unique and personal, as every action the player takes (or doesn’t) has an effect.&lt;/p&gt;

&lt;p&gt;In Lyria RealTime, the ability to adjust prompt mixtures and quickly hear the results allows players to efficiently explore the sonic landscape and find novel textures and loops. Real-time interactivity also makes this latent exploration a potential musical performance in its own right: interpolation through the latent space, combined with the anchoring of the audio context, produces a structure similar to a DJ set or an improvisation session. Beyond performance, it can also provide interactive soundscapes for physical spaces like art installations or virtual spaces like video games.&lt;/p&gt;

&lt;p&gt;Our first public experiment with Lyria RealTime was &lt;a href=&quot;http://labs.google/musicfx&quot;&gt;MusicFX DJ&lt;/a&gt;, which we developed last year in collaboration with Google Labs. MusicFX DJ allows you to create and conduct a continuous flow of music, and we worked with producers and artists to make the tool more inspiring and useful to musicians and amateurs alike.&lt;/p&gt;

&lt;p&gt;At this year’s I/O, &lt;a href=&quot;https://toroymoi.com/&quot;&gt;Toro y Moi&lt;/a&gt; (Chaz Bear) took Lyria RealTime for a spin &lt;a href=&quot;https://www.youtube.com/watch?v=thAhd82XnMc&quot;&gt;on stage before the keynote&lt;/a&gt;, using a &lt;a href=&quot;https://aistudio.google.com/app/apps/bundled/promptdj-midi&quot;&gt;different interface&lt;/a&gt; that he operated via a physical MIDI controller. Chaz’s performance leaned deeply into the live nature of the model, improvising with it to lead the crowd on a sonic journey full of surprises for himself and the audience.&lt;/p&gt;

&lt;figure&gt;
  &lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/thAhd82XnMc?si=8G1sl0tA90J4reFW&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&quot; referrerpolicy=&quot;strict-origin-when-cross-origin&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
  &lt;figcaption&gt;Chaz Bear&apos;s performance at Google I/O 2025.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;how-it-works&quot;&gt;How it Works&lt;/h2&gt;

&lt;p&gt;Live generative music is particularly difficult because it requires real-time generation (a real-time factor greater than 1, e.g. generating 2 seconds of audio in less than 2 seconds), causal streaming (i.e. online generation), and low-latency controllability.&lt;/p&gt;

&lt;figure&gt;
  &lt;video width=&quot;100%&quot; src=&quot;/assets/lyria_realtime/lyria_realtime_diagram.mp4&quot; autoplay=&quot;&quot; muted=&quot;&quot; loop=&quot;&quot;&gt;&lt;/video&gt;
  &lt;figcaption&gt;Lyria RealTime diagram&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Lyria RealTime overcomes these challenges by adapting the &lt;a href=&quot;https://google-research.github.io/seanet/musiclm/examples/&quot;&gt;MusicLM&lt;/a&gt; architecture to perform block autoregression. The model generates a continuous stream of music in sequential chunks, each steered by the previous audio output and a style embedding for the next chunk. By manipulating the style embedding (a weighted average of text or audio prompt embeddings), players can shape and morph the music in real time, mixing together different styles, instruments, and musical attributes.&lt;/p&gt;
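
&lt;p&gt;To make the steering concrete, here is a minimal sketch (ours, not the production implementation) of the mixing math: the style vector for the next chunk is a weight-normalized average of the active prompt embeddings.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-javascript&quot;&gt;
// prompts: [{ embedding: number[], weight: number }, ...]
function styleEmbedding(prompts) {
  const dim = prompts[0].embedding.length;
  const total = prompts.reduce((sum, p) =&gt; sum + p.weight, 0);
  const style = new Array(dim).fill(0);
  for (const p of prompts) {
    for (let i = 0; i &lt; dim; i++) {
      style[i] += (p.weight / total) * p.embedding[i];
    }
  }
  return style;
}

// Block autoregression, schematically: each chunk is conditioned on the
// previous audio plus the current style vector, so moving a slider is
// heard within a couple of seconds.
//   while (playing) {
//     chunk = model.generate(previousAudio, styleEmbedding(prompts));
//     stream.push(chunk);
//     previousAudio = chunk;
//   }
&lt;/code&gt;&lt;/pre&gt;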

&lt;h2 id=&quot;future-work&quot;&gt;Future Work&lt;/h2&gt;

&lt;p&gt;We are currently working on the next generation of real-time models with higher quality, lower latency, more interactivity, and on-device operability, to create truly playable instruments and live accompaniment. Stay tuned as we continue working with communities of musicians and developers on these technologies.&lt;/p&gt;
</description>
        <pubDate>Thu, 12 Jun 2025 07:00:01 -0700</pubDate>
        <link>https://magenta.withgoogle.com/lyria-realtime</link>
        <guid isPermaLink="true">https://magenta.withgoogle.com/lyria-realtime</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Magenta Studio 2.0</title>
        <description>&lt;p&gt;TL;DR: &lt;a href=&quot;https://magenta.withgoogle.com/studio&quot;&gt;Magenta Studio&lt;/a&gt;, first released
in 2019, has been updated to integrate more seamlessly with Ableton Live. No
functionality has changed; there are only UI changes and internal fixes. Please
download and enjoy!&lt;/p&gt;

&lt;p&gt;If you’re new to Magenta Studio, please read our &lt;a href=&quot;https://magenta.withgoogle.com/studio-announce&quot;&gt;previous post&lt;/a&gt; about what it is and how it works.&lt;/p&gt;

&lt;h2 id=&quot;whats-new&quot;&gt;What’s New&lt;/h2&gt;

&lt;p&gt;In the previous version of Magenta Studio, the &lt;a href=&quot;https://www.ableton.com/en/live/max-for-live/&quot;&gt;Max for Live (M4L)&lt;/a&gt;
plugin launched a separate, operating-system-specific application for each of
the tools. Unfortunately, as operating systems were upgraded, the applications
sometimes stopped working. We therefore decided to integrate the tools directly
into the Max for Live environment to ensure longer-term stability. The machine
learning models remain embedded in the M4L plugin and do not require Internet
access to use.&lt;/p&gt;

&lt;h2 id=&quot;upgrading&quot;&gt;Upgrading&lt;/h2&gt;

&lt;p&gt;To upgrade from the &lt;a href=&quot;https://magenta.withgoogle.com/v1/studio&quot;&gt;previous version of Magenta Studio&lt;/a&gt;,
you can download the latest version and drop it into Live directly in place of
the old plugin. Only the interface and integration have changed, so it works in
exactly the same way.&lt;/p&gt;

&lt;h2 id=&quot;documentation&quot;&gt;Documentation&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://magenta.withgoogle.com/studio&quot;&gt;The documentation&lt;/a&gt; has been updated to
reflect the new interface. The tool-specific videos have not been updated with
the new interface, but the functionality is identical.&lt;/p&gt;

&lt;h2 id=&quot;support&quot;&gt;Support&lt;/h2&gt;

&lt;p&gt;Please report any issues to the &lt;a href=&quot;https://github.com/magenta/magenta-studio&quot;&gt;GitHub repository&lt;/a&gt;. Thanks for using Magenta Studio!&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Magenta Studio is based on work by members of the Google DeepMind team’s Magenta project along with contributors to the Magenta and Magenta.js libraries. The plug-ins were implemented by &lt;a href=&quot;https://yotammann.info/&quot;&gt;Yotam Mann&lt;/a&gt; and extended by Cassie Tarakajian.&lt;/p&gt;
</description>
        <pubDate>Thu, 24 Aug 2023 07:00:01 -0700</pubDate>
        <link>https://magenta.withgoogle.com/studio-announce-2</link>
        <guid isPermaLink="true">https://magenta.withgoogle.com/studio-announce-2</guid>
        
        <category>studio</category>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>The 2023 I/O Preshow  – Composed by Dan Deacon (with some help from MusicLM)</title>
        <description>&lt;p&gt;TL;DR: Dan Deacon worked with Google’s latest music AI models to compose the preshow music.
Check out the MusicLM demo in the &lt;a href=&quot;https://g.co/aitestkitchen&quot;&gt;AI Test Kitchen app&lt;/a&gt;.
Read on for more details about our collaboration with Dan Deacon.&lt;/p&gt;

&lt;h1 id=&quot;dan-deacons-io-performance&quot;&gt;Dan Deacon’s I/O Performance&lt;/h1&gt;

&lt;p&gt;On several occasions, we have had the pleasure of working with musicians that perform at Google I/O.
This is an opportunity for us to bring our latest creative machine learning tools out of the lab and into the hands of the musicians.
In previous years, we have worked with &lt;a href=&quot;https://magenta.withgoogle.com/chain-tripping&quot;&gt;YACHT&lt;/a&gt; and The &lt;a href=&quot;https://magenta.withgoogle.com/fruitgenie&quot;&gt;Flaming&lt;/a&gt; &lt;a href=&quot;https://blog.google/technology/ai/behind-magenta-tech-rocked-io/&quot;&gt;Lips&lt;/a&gt;.
With YACHT we explored custom symbolic music generation models tailored to the band, and with The Flaming Lips we explored an interaction to bridge the audience and performers.&lt;/p&gt;

&lt;p&gt;This year’s I/O pre-show was performed by electronic musician and composer Dan Deacon.
With Dan we explored how artists might interact with generative models of music audio and incorporate them into their artistic process.
Check out his performance in the video below and read on to learn more about his process using Google’s latest music AI tools:&lt;/p&gt;

&lt;figure&gt;
  &lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/K_8N8w5CaOs&quot; frameborder=&quot;0&quot; allow=&quot;autoplay; encrypted-media&quot; style=&quot;max-width:100%&quot; allowfullscreen=&quot;&quot;&gt;
  &lt;/iframe&gt;
  &lt;figcaption&gt;Dan Deacon&apos;s performance at Google I/O 2023.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Dan used two of our new generative models in his performance: &lt;a href=&quot;https://google-research.github.io/seanet/musiclm/examples/&quot;&gt;MusicLM&lt;/a&gt; (&lt;a href=&quot;https://arxiv.org/abs/2301.11325&quot;&gt;paper&lt;/a&gt;, &lt;a href=&quot;https://g.co/aitestkitchen&quot;&gt;demo&lt;/a&gt;), which produces music based on a text-based input prompt, and &lt;a href=&quot;https://g.co/magenta/singsong&quot;&gt;SingSong&lt;/a&gt; (&lt;a href=&quot;https://arxiv.org/abs/2301.12662&quot;&gt;paper&lt;/a&gt;), which generates an accompaniment track for a sung audio input.
Both of these models are part of the &lt;a href=&quot;https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html&quot;&gt;AudioLM&lt;/a&gt; (&lt;a href=&quot;https://arxiv.org/abs/2209.03143&quot;&gt;paper&lt;/a&gt;) family, and they directly produce audio based on the input conditioning (i.e., text or singing) by autoregressively predicting &lt;a href=&quot;https://ai.googleblog.com/2021/08/soundstream-end-to-end-neural-audio.html&quot;&gt;SoundStream&lt;/a&gt; (&lt;a href=&quot;https://arxiv.org/abs/2107.03312&quot;&gt;paper&lt;/a&gt;) tokens with one or more Transformer language models.
SoundStream tokens can then be converted back to raw audio that can be used in conjunction with other audio editing software.&lt;/p&gt;
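
&lt;p&gt;Schematically (this is illustrative pseudocode, not the published implementations; &lt;code&gt;lm&lt;/code&gt; and &lt;code&gt;codec&lt;/code&gt; are hypothetical stand-ins for the Transformer language model and SoundStream), the shared recipe looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-javascript&quot;&gt;
// Conditioning (text or singing) -&gt; discrete audio tokens -&gt; waveform.
function generateAudio(conditioning, lm, codec, numTokens) {
  // Tokens derived from the conditioning input (in MusicLM, text
  // conditioning actually enters via embeddings rather than tokens).
  const tokens = codec.encode(conditioning);
  for (let i = 0; i &lt; numTokens; i++) {
    // Autoregressively predict the next SoundStream token.
    tokens.push(lm.sampleNextToken(tokens));
  }
  // Decode the tokens back to raw audio for use in other software.
  return codec.decode(tokens);
}
&lt;/code&gt;&lt;/pre&gt;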

&lt;p&gt;For his performance, Dan used MusicLM to create the chill, relaxing piano groove that’s heard behind his two meditations starring the Duck with Lips.
Additionally, Dan used both MusicLM and SingSong to create the Chiptune song.
Most excitingly, Dan didn’t just &lt;em&gt;use&lt;/em&gt; both SingSong and MusicLM, but actually &lt;em&gt;extended&lt;/em&gt; their capabilities to put his performance together.
We’ll discuss more of how Dan shaped the tools–and why it’s important that he did so–in the next section.&lt;/p&gt;

&lt;h1 id=&quot;working-with-dan&quot;&gt;Working with Dan&lt;/h1&gt;

&lt;p&gt;As Dan discusses at around 7 minutes into his performance, he has always been excited by the promise that new technologies bring to the compositional process.
Technology has a long and intertwined history with the art of making music.
We might not think of things like flutes, violins, or trombones in the same way we think of computers now, but these were revolutionary new technologies when they were first introduced!
They can also often seem disruptive at first–at one point in history, &lt;a href=&quot;https://journals.sagepub.com/doi/abs/10.1177/016344386008003002?journalCode=mcsa&quot;&gt;microphones caused quite a stir&lt;/a&gt; because they let vocalists sing much more softly (as opposed to singing so loudly that they could be heard over the band).
Yet in retrospect, microphones changed our relationship to music in many positive ways, enabling us to create, represent, and distribute music in ways that would have been inconceivable beforehand.
Importantly, each new technological development expanded the creative palette of musicians, bringing with them new textures, new techniques, and sometimes new conceptions of music itself.&lt;/p&gt;

&lt;p&gt;We view our new models as a continuation of music technology’s evolution.
We’re incredibly inspired by the opportunity for these new tools to bring new creative capabilities to humanity, while remaining conscious of–and working hard to mitigate–their potential negative consequences.
Our goal is and always has been to empower artists and musicians; a crucial piece of empowering musicians is understanding how these new tools situate themselves in different artists’ creative processes.
With that in mind, collaborating with Dan was a great opportunity for us to work towards embodying our goals of empowering musicians in the era of generative modeling.&lt;/p&gt;

&lt;figure&gt;
  &lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/2yMBycveWHk&quot; frameborder=&quot;0&quot; allow=&quot;autoplay; encrypted-media&quot; style=&quot;max-width:100%&quot; allowfullscreen=&quot;&quot;&gt;
  &lt;/iframe&gt;
  &lt;figcaption&gt;A glimpse of our in-person workshop where we showed our new tools to Dan Deacon.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;About a month before I/O, we had a workshop with Dan where we introduced him to MusicLM and SingSong.
Initially, Dan found many interesting text prompts for MusicLM, such as “a 600ft trombone.”
He started to push the tools past their limits by, for example, playing his synthesizer into SingSong, ignoring that the system was trained on only singing inputs.
These initial experiments turned out to be really fun and promising!&lt;/p&gt;

&lt;p&gt;As we kept working with Dan, he surprised us by pushing these tools even further.
Inspired by “&lt;a href=&quot;https://en.wikipedia.org/wiki/I_Am_Sitting_in_a_Room&quot;&gt;I Am Sitting in a Room&lt;/a&gt;” (&lt;a href=&quot;https://www.youtube.com/watch?v=fAxHlLK3Oyk&quot;&gt;click here to listen&lt;/a&gt;), he fed the output of the SingSong model back into itself… over and over and over.
Again, Dan moved beyond the model’s design of accepting singing input; by feeding its own output back into itself, the input audio was outside the distribution the model had seen during training, and we weren’t sure whether this would work at all.
Yet not only did it work, the feedback loop tended to produce music that still accompanied the input, keeping the same key, tempo, and style.
This was the interaction that Dan designed to compose the Chiptune song, above.&lt;/p&gt;

&lt;p&gt;Dan began with a handful of text prompts to MusicLM, used the generated audio as input to SingSong, and then fed that output back through SingSong for numerous iterations.
He was able to create hundreds of audio clips that complemented each other.
From these, he handpicked his favorite clips, edited them slightly, and performed them.&lt;/p&gt;
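
&lt;p&gt;Sketched as a loop (with hypothetical function names standing in for the models), the interaction Dan designed is remarkably simple:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-javascript&quot;&gt;
// Seed the loop with text-to-music output (hypothetical prompt)...
let audio = musicLM.generate(&apos;chiptune melody&apos;);
for (let i = 0; i &lt; numIterations; i++) {
  // ...then repeatedly feed SingSong its own accompaniment. The input is
  // deliberately out of distribution (SingSong was trained on singing),
  // yet each pass tends to keep the same key, tempo, and style.
  audio = singSong.accompany(audio);
}
&lt;/code&gt;&lt;/pre&gt;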

&lt;p&gt;We’re very proud to have been a part of Dan’s amazing performance.
We’re extremely excited for the direction that this research is headed, and we’re always looking for ways to give musicians new tools to interact with.
Check out the &lt;a href=&quot;https://blog.google/technology/ai/musiclm-google-ai-test-kitchen/&quot;&gt;Google Keyword blog post&lt;/a&gt; to learn more about MusicLM and you can try it yourself by &lt;a href=&quot;https://g.co/aitestkitchen&quot;&gt;signing up via the AI Test Kitchen app&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;This year’s I/O pre-show was a huge collaborative effort. We would like to thank everyone involved in making the performance a success (in no particular order): Josh Christman, Daniel Chandler, Meghan Reinhardt, Carolyne De Bellefeuille, Adi Goodrich, Jon Barron, Irina Blok, Spencer Sterling, Ruben Beddeleem, Ben Poole, Cadie Desbiens-Desmeules, Chris Donahue, Jorge Gonzalez Mendez, Noah Constant, Jesse Engel, Timo Denk, Andrea Agostinelli, Neil Zeghidour, Christian Frank, Mauricio Zuluaga, Hema Manickavasagam, Tom Hume, and Lynn Cherry.&lt;/em&gt;&lt;/p&gt;

</description>
        <pubDate>Wed, 21 Jun 2023 13:00:00 -0700</pubDate>
        <link>https://magenta.withgoogle.com/dandeacon-io-preshow</link>
        <guid isPermaLink="true">https://magenta.withgoogle.com/dandeacon-io-preshow</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>The Wordcraft Writers Workshop: Creative Co-Writing with AI</title>
        <description>&lt;p&gt;A core piece of Magenta’s mission is to empower creativity using AI and machine learning. In order to evaluate how well this goal is being achieved, it is important to put tools in the hands of creators, encouraging them to share honest and critical feedback. This feedback can help researchers to thoughtfully develop the next generations of ML-powered creative tools. Most of our prior efforts to engage with creators have been in the domain of music (for example, &lt;a href=&quot;https://magenta.withgoogle.com/studio&quot;&gt;Magenta Studio&lt;/a&gt; and &lt;a href=&quot;https://nsynthsuper.withgoogle.com/&quot;&gt;NSynth&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;However, human creativity encompasses far more than just music: visual artists paint, draw, and sculpt, and writers craft stories and poetry. In recent years, we’ve seen huge advancements in machine learning techniques that can facilitate creativity in these other modalities. Creative writing is an especially interesting domain because it is so challenging for AI to get right. Even short stories commonly have narrative arcs that span paragraphs or longer, multiple characters with diverging points of view, and a careful balance of familiar archetypes and novel storytelling–all difficult traits for state-of-the-art AI to replicate. At the same time, the omnipresent writer’s block is not a problem at all for neural language models like &lt;a href=&quot;https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html&quot;&gt;LaMDA&lt;/a&gt;, which can effortlessly generate as many words as you ask them for.&lt;/p&gt;

&lt;p&gt;Earlier this year, we invited a cohort of 13 professional creative writers to try their hands at writing stories using &lt;a href=&quot;https://g.co/research/wordcraft&quot;&gt;Wordcraft&lt;/a&gt;, an AI-augmented text editor with a wide range of generative capabilities targeted at creative writing assistance. Wordcraft can suggest story ideas, rewrite text according to user-provided instructions, and elaborate on what has already been written. It also has a chatbot interface where users can engage with LaMDA, Google’s dialog-based language model, about their stories.&lt;/p&gt;

&lt;figure&gt;
  &lt;a href=&quot;https://g.co/research/wordcraft&quot; target=&quot;_blank&quot;&gt;
    &lt;video src=&quot;/assets/wordcraft/wordcraft.webm&quot; autoplay=&quot;&quot; loop=&quot;&quot;&gt;&lt;/video&gt;
  &lt;/a&gt;
  &lt;figcaption&gt;A demo of the Wordcraft web application&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;As in generative music, AI-assisted story writing can be a mixed bag. At its best, Wordcraft made suggestions that were inspiring and surrealistic, and writers applauded its usefulness for ideation and overcoming writer’s block. However, it also had a tendency to rehash tired tropes, and it could take wading through many dull suggestions before finding an interesting one.&lt;/p&gt;

&lt;p&gt;All of the writers’ stories are available in the Wordcraft Writers Workshop’s &lt;a href=&quot;https://g.co/research/wordcraft&quot;&gt;digital literary magazine&lt;/a&gt;, and a detailed writeup of what we learned about the role machine learning can play in creative writing can be found &lt;a href=&quot;https://arxiv.org/abs/2211.05030&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We hope you enjoy perusing through the stories, and we are excited to hear your ideas about how AI can create valuable creative writing tools.&lt;/p&gt;
</description>
        <pubDate>Thu, 01 Dec 2022 08:00:00 -0800</pubDate>
        <link>https://magenta.withgoogle.com/wordcraft-writers-workshop</link>
        <guid isPermaLink="true">https://magenta.withgoogle.com/wordcraft-writers-workshop</guid>
        
        <category>wordcraft</category>
        
        <category>lamda</category>
        
        <category>writing</category>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>The Chamber Ensemble Generator and CocoChorales Dataset</title>
        <description>&lt;style&gt;
  table tr.wrap {
    display: flex;
    flex-direction: row;
    flex-wrap: wrap;
  }
  table tr.wrap &gt; td {
    display: block;
    flex: 1;
  }
  td {text-align: center !important}
  .from {background-color: #d3d3d3;}
  img.inline {
    vertical-align: middle;
    display: inline-block;
    max-height: 16px;
    width: auto !important;
    margin-right: 6px;
  }
  img.centered {
    max-width: 90%;
    margin: auto;
  }
&lt;/style&gt;

&lt;figure style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;assets/cocochorales/logos.png&quot; style=&quot;width: 90%; height: auto; margin: auto&quot; alt=&quot;Logos for the Chamber Ensemble Generator and CocoChorales Dataset.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;In this post, we’re excited to introduce the &lt;strong&gt;Chamber Ensemble Generator&lt;/strong&gt;, a system for generating realistic chamber ensemble performances, and the corresponding &lt;strong&gt;CocoChorales Dataset&lt;/strong&gt;, which contains over 1,400 hours of audio mixes with corresponding source data and MIDI, multi-f&lt;sub&gt;0&lt;/sub&gt;, and per-note performance annotations.&lt;/p&gt;

&lt;table align=&quot;center&quot; class=&quot;overview&quot;&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;td&gt;🎵&lt;a href=&quot;https://lukewys.github.io/cocochorales/&quot;&gt;Audio Examples&lt;/a&gt;&lt;/td&gt;
    &lt;td&gt;📝&lt;a href=&quot;https://arxiv.org/abs/2209.14458&quot;&gt;arXiv Paper&lt;/a&gt;&lt;/td&gt;
    &lt;td&gt;📂&lt;a href=&quot;https://magenta.withgoogle.com/datasets/cocochorales&quot;&gt;Dataset Download Instructions&lt;/a&gt;&lt;/td&gt;
    &lt;td&gt;&lt;img alt=&quot;&quot; src=&quot;/assets/ddsp/github.png&quot; class=&quot;inline&quot; /&gt;&lt;a href=&quot;https://github.com/lukewys/chamber-ensemble-generator&quot;&gt;Github Code&lt;/a&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;

&lt;p&gt;Data is the bedrock that all machine learning systems are built upon. Historically, researchers applying machine learning to music have not had access to the same scale of data that other fields have. Whereas image and language machine learning researchers measure their datasets by the millions or billions of examples, music researchers feel extremely lucky if they can scrape together a few thousand examples for a given task.&lt;/p&gt;

&lt;p&gt;Modern machine learning systems require large quantities of &lt;em&gt;annotated&lt;/em&gt; data. With music systems, getting annotations for some tasks–like transcription or f&lt;sub&gt;0&lt;/sub&gt; estimation–requires tedious work by expert musicians. When annotating a single example correctly is difficult, how can we annotate hundreds of thousands of examples to make enough data to train a machine learning system?&lt;/p&gt;

&lt;p&gt;In this post, we introduce a new approach to solving these problems by using generative models to create large amounts of realistic-sounding, finely annotated, freely available music data. We combined two structured generative models–a note generation model, &lt;a href=&quot;https://magenta.withgoogle.com/coconet&quot;&gt;Coconet&lt;/a&gt;, and a notes-to-audio generative synthesis model, &lt;a href=&quot;https://magenta.withgoogle.com/midi-ddsp&quot;&gt;MIDI-DDSP&lt;/a&gt;–into a system we call the &lt;strong&gt;Chamber Ensemble Generator&lt;/strong&gt;. As its name suggests, the Chamber Ensemble Generator (or CEG) can generate performances of chamber ensembles playing in the style of four-part Bach chorales. Listen to the following examples performed by the CEG:&lt;/p&gt;

&lt;table align=&quot;center&quot;&gt;
  &lt;tbody&gt;
  &lt;tr class=&quot;wrap&quot;&gt;
    &lt;td colspan=&quot;2&quot;&gt;String Ensemble Mixture:&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr class=&quot;wrap&quot;&gt;
    &lt;td colspan=&quot;2&quot;&gt;&lt;audio controls=&quot;&quot;&gt; &lt;source src=&quot;assets/cocochorales/audio/strings/mix.wav?raw=true&quot; /&gt; &lt;/audio&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr class=&quot;wrap&quot;&gt;
    &lt;td colspan=&quot;1&quot;&gt;Soprano: Violin 1&lt;/td&gt;
    &lt;td colspan=&quot;1&quot;&gt;Alto: Violin 2&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr class=&quot;wrap&quot;&gt;
  &lt;td colspan=&quot;1&quot;&gt;&lt;audio controls=&quot;&quot;&gt; &lt;source src=&quot;assets/cocochorales/audio/strings/1_violin.wav?raw=true&quot; /&gt; &lt;/audio&gt;&lt;/td&gt;
  &lt;td colspan=&quot;1&quot;&gt;&lt;audio controls=&quot;&quot;&gt; &lt;source src=&quot;assets/cocochorales/audio/strings/2_violin.wav?raw=true&quot; /&gt; &lt;/audio&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr class=&quot;wrap&quot;&gt;
    &lt;td colspan=&quot;1&quot;&gt;Tenor: Viola&lt;/td&gt;
    &lt;td colspan=&quot;1&quot;&gt;Bass: Cello&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr class=&quot;wrap&quot;&gt;
  &lt;td colspan=&quot;1&quot;&gt;&lt;audio controls=&quot;&quot;&gt; &lt;source src=&quot;assets/cocochorales/audio/strings/3_viola.wav?raw=true&quot; /&gt; &lt;/audio&gt;&lt;/td&gt;
  &lt;td colspan=&quot;1&quot;&gt;&lt;audio controls=&quot;&quot;&gt; &lt;source src=&quot;assets/cocochorales/audio/strings/4_cello.wav?raw=true&quot; /&gt; &lt;/audio&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;table align=&quot;center&quot;&gt;
  &lt;tbody&gt;
  &lt;tr class=&quot;wrap&quot;&gt;
    &lt;td colspan=&quot;2&quot;&gt;Woodwind Ensemble Mixture:&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr class=&quot;wrap&quot;&gt;
    &lt;td colspan=&quot;2&quot;&gt;&lt;audio controls=&quot;&quot;&gt; &lt;source src=&quot;assets/cocochorales/audio/woodwind/mix.wav?raw=true&quot; /&gt; &lt;/audio&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr class=&quot;wrap&quot;&gt;
    &lt;td colspan=&quot;1&quot;&gt;Soprano: Flute&lt;/td&gt;
    &lt;td colspan=&quot;1&quot;&gt;Alto: Oboe&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr class=&quot;wrap&quot;&gt;
  &lt;td colspan=&quot;1&quot;&gt;&lt;audio controls=&quot;&quot;&gt; &lt;source src=&quot;assets/cocochorales/audio/woodwind/1_flute.wav?raw=true&quot; /&gt; &lt;/audio&gt;&lt;/td&gt;
  &lt;td colspan=&quot;1&quot;&gt;&lt;audio controls=&quot;&quot;&gt; &lt;source src=&quot;assets/cocochorales/audio/woodwind/2_oboe.wav?raw=true&quot; /&gt; &lt;/audio&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr class=&quot;wrap&quot;&gt;
    &lt;td colspan=&quot;1&quot;&gt;Tenor: Clarinet&lt;/td&gt;
    &lt;td colspan=&quot;1&quot;&gt;Bass: Bassoon&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr class=&quot;wrap&quot;&gt;
  &lt;td colspan=&quot;1&quot;&gt;&lt;audio controls=&quot;&quot;&gt; &lt;source src=&quot;assets/cocochorales/audio/woodwind/3_clarinet.wav?raw=true&quot; /&gt; &lt;/audio&gt;&lt;/td&gt;
  &lt;td colspan=&quot;1&quot;&gt;&lt;audio controls=&quot;&quot;&gt; &lt;source src=&quot;assets/cocochorales/audio/woodwind/4_bassoon.wav?raw=true&quot; /&gt; &lt;/audio&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;We then used the CEG to create a massive music dataset for machine learning systems. We call this dataset &lt;strong&gt;CocoChorales&lt;/strong&gt;. What’s exciting about the CEG is that it uses a set of structured generative models which provide annotations for many music machine learning applications like automatic music transcription, multi-f&lt;sub&gt;0&lt;/sub&gt; estimation, source separation, performance analysis, and more.&lt;/p&gt;

&lt;p&gt;Below, we dig deeper into each of these projects.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h1 id=&quot;the-chamber-ensemble-generator&quot;&gt;The Chamber Ensemble Generator&lt;/h1&gt;

&lt;figure style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;assets/cocochorales/hero_diagram.png&quot; style=&quot;width: 100%; height: auto; margin: auto&quot; alt=&quot;Overview image of the Chamber Ensemble Generator.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;As we mentioned, the Chamber Ensemble Generator (CEG) is a set of two structured generative models that work together to create new chamber ensemble performances of four-part &lt;a href=&quot;https://en.wikipedia.org/wiki/Chorale&quot;&gt;chorales&lt;/a&gt; in the style of &lt;a href=&quot;https://en.wikipedia.org/wiki/List_of_chorale_harmonisations_by_Johann_Sebastian_Bach&quot;&gt;J.S. Bach&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As seen in the figure above, the constituent models in the CEG are two previous Magenta models: &lt;a href=&quot;https://magenta.withgoogle.com/coconet&quot;&gt;Coconet&lt;/a&gt; and &lt;a href=&quot;https://magenta.withgoogle.com/midi-ddsp&quot;&gt;MIDI-DDSP&lt;/a&gt;. Coconet is a generative model of notes, creating a set of four-instrument music pieces (“note sequences”), harmonized in the style of a Bach Chorale. Each of these four note sequences is then individually synthesized by MIDI-DDSP. MIDI-DDSP is a generative synthesis model that uses &lt;a href=&quot;https://magenta.withgoogle.com/ddsp&quot;&gt;Differentiable Digital Signal Processing (DDSP)&lt;/a&gt; to turn note sequences into realistic audio that can sound like a number of different instruments (e.g., violin, bassoon, or french horn).&lt;/p&gt;
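
&lt;p&gt;Conceptually (our sketch with hypothetical names, not the actual library calls), the CEG chains these two stages together:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-javascript&quot;&gt;
// Stage 1: Coconet samples a four-part chorale as note sequences.
const parts = coconet.sample(); // [soprano, alto, tenor, bass]

// Stage 2: MIDI-DDSP renders each part with a chosen instrument.
const instruments = [&apos;violin&apos;, &apos;violin&apos;, &apos;viola&apos;, &apos;cello&apos;];
const stems = parts.map((notes, i) =&gt;
  midiDdsp.synthesize(notes, instruments[i]));

// The intermediate representations (MIDI, f0 curves, per-note expression)
// double as annotations; summing the stems gives the audio mixture.
const mix = sumWaveforms(stems);
&lt;/code&gt;&lt;/pre&gt;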

&lt;p&gt;It’s important to note that the CEG is built on &lt;em&gt;structured&lt;/em&gt; generative models, i.e., models that have interpretable intermediate representations. On the one hand, this structure leads to a very opinionated view of music: the CEG is limited in ways that other generative music models are not. It cannot generate all styles of music (a rock-and-roll ensemble, for example); it can only generate chorales. On the other hand, many generative music models are notoriously “black boxes” whose internal structures are difficult to interpret. Because the CEG is built on a modular set of structured models, its internals are easy to understand and modify. This also allows us to create a dataset with many types of annotations that would be tedious or impossible to acquire with other kinds of generative models (such as annotations of the velocity and vibrato applied to each individual note in a performance). In the next section, we will showcase how these interpretable structures can be used to mitigate the biases of these generative models.&lt;/p&gt;

&lt;h1 id=&quot;the-cocochorales-dataset&quot;&gt;The CocoChorales Dataset&lt;/h1&gt;

&lt;p&gt;CocoChorales is a dataset of 240,000 examples totaling over 1,400 hours of mixture data. We created CocoChorales by sampling from the CEG’s two constituent generative models, Coconet and MIDI-DDSP. Using the CEG in this way is an example of dataset “amplification,” whereby a generative model trained on a small dataset is used to produce a much larger dataset. In this case, we are amplifying two very small datasets: Coconet is trained on the &lt;a href=&quot;https://github.com/czhuang/JSB-Chorales-dataset&quot;&gt;J.S. Bach Chorales Dataset&lt;/a&gt;, which contains 382 examples, and MIDI-DDSP is trained on &lt;a href=&quot;https://labsites.rochester.edu/air/projects/URMP.html&quot;&gt;URMP&lt;/a&gt;, which contains only 44 examples. But using the CEG, we were able to generate 240,000 examples!&lt;/p&gt;
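
&lt;p&gt;In code terms, amplification is just repeated sampling. Continuing the hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;generate_example&lt;/code&gt; sketch from above (and assuming a placeholder &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;save_example&lt;/code&gt; writer), the loop might look like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical amplification loop: two small training sets (Bach chorales
# and URMP) are turned into a much larger set of fully annotated examples
# simply by sampling the CEG many times.
NUM_EXAMPLES = 240_000

for i in range(NUM_EXAMPLES):
    mixture, stems, note_sequences = generate_example(coconet, midi_ddsp)
    save_example(i, mixture, stems, note_sequences)  # placeholder writer
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;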

&lt;p&gt;CocoChorales has examples performed by 13 different instruments (violin, viola, cello, double bass, flute, oboe, clarinet, bassoon, saxophone, trumpet, French horn, trombone, and tuba) organized into 4 different types of ensembles: a string ensemble, a brass ensemble, a woodwind ensemble, and a random ensemble (see the &lt;a href=&quot;https://magenta.withgoogle.com/datasets/cocochorales&quot;&gt;CocoChorales dataset page&lt;/a&gt; for more info). Each example contains an audio mixture, audio for each source, aligned MIDI, instrument labels, fundamental frequency (f&lt;sub&gt;0&lt;/sub&gt;) for each instrument, notewise performance characteristics (e.g., the vibrato, loudness, and brightness of each note), and raw synthesis parameters.&lt;/p&gt;
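
&lt;p&gt;As a rough illustration of how these annotations fit together, the snippet below sketches what a single example contains. The field names are descriptive placeholders rather than the actual file layout; the dataset page and the data pipeline instructions linked below document the real formats.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Illustrative (not literal) contents of one CocoChorales example from
# the woodwind ensemble; field names are placeholders for orientation.
example = {
    'mix': 'mixture audio (wav)',
    'stems': {
        'flute': 'soprano stem audio (wav)',
        'oboe': 'alto stem audio (wav)',
        'clarinet': 'tenor stem audio (wav)',
        'bassoon': 'bass stem audio (wav)',
    },
    'midi': 'note-aligned MIDI for all four parts',
    'f0': 'per-instrument fundamental-frequency curves',
    'note_expression': 'notewise vibrato, loudness, brightness, ...',
    'synthesis_params': 'raw DDSP synthesis parameters',
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;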

&lt;figure style=&quot;text-align: center;&quot;&gt;
  &lt;img src=&quot;assets/cocochorales/f0_distributions.png&quot; style=&quot;width: 50%; height: auto; margin: auto&quot; alt=&quot;Fundamental frequencies (f0&apos;s) histograms showing that we are able to correct for a bias in the model.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;What’s cool about using the structured models in the CEG is that, because the system is modular, it is easy to interpret the output of each intermediate step. For example, the MIDI-DDSP model we used tended to produce performances that were out of tune and skewed sharp (i.e., the frequency of a note being played was often slightly higher than the “proper” tuned frequency of that note, according to a &lt;a href=&quot;https://en.wikipedia.org/wiki/12_equal_temperament&quot;&gt;12-TET scale&lt;/a&gt;). This is visualized by the orange histogram in the image above (labeled “w/o pitch aug”), which shows how far each note is from being in tune, measured once every 4 ms (here, 0.0 means perfectly “in tune”). We were able to correct for this systematic bias by directly adjusting the f&lt;sub&gt;0&lt;/sub&gt; curves output by the synthesis generation module of MIDI-DDSP, as shown by the blue histogram (labeled “w/ pitch aug”), whose distribution is much more tightly centered on 0.0. This level of control is hard to achieve with black-box generative models, and it is a big reason why we’re excited about building the CEG from structured models.&lt;/p&gt;
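
&lt;p&gt;As a rough sketch of what this kind of correction can look like, the snippet below measures each f&lt;sub&gt;0&lt;/sub&gt; sample’s deviation from the nearest 12-TET pitch and shifts the whole curve so that the mean deviation is zero. This is a simplified illustration of the idea, not the exact pitch-augmentation code used to build CocoChorales.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import numpy as np

A4_HZ = 440.0

def cents_from_12tet(f0_hz):
    # Deviation (in cents) of each voiced f0 sample from the nearest
    # 12-tone-equal-temperament pitch, relative to A4 = 440 Hz.
    semitones = 12.0 * np.log2(f0_hz / A4_HZ)
    return 100.0 * (semitones - np.round(semitones))

def recenter_f0(f0_hz):
    # Shift the whole curve so its mean deviation from 12-TET is zero,
    # counteracting a systematic sharp bias in the synthesized pitch.
    # Assumes f0_hz contains only voiced (nonzero) frequency values.
    mean_cents = np.mean(cents_from_12tet(f0_hz))
    return f0_hz * 2.0 ** (-mean_cents / 1200.0)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;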

&lt;h1 id=&quot;downloading-the-dataset&quot;&gt;Downloading the Dataset&lt;/h1&gt;

&lt;p&gt;We’re really excited to see what the research community can do with the CocoChorales dataset. Further details on the dataset can be found &lt;a href=&quot;https://magenta.withgoogle.com/datasets/cocochorales&quot;&gt;here&lt;/a&gt;, and instructions for downloading it are at &lt;a href=&quot;https://github.com/lukewys/chamber-ensemble-generator#dataset-download&quot;&gt;this GitHub link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to learn more about either project, please see our &lt;a href=&quot;https://arxiv.org/abs/2209.14458&quot;&gt;arXiv paper&lt;/a&gt;. The code for the Chamber Ensemble Generator is available &lt;a href=&quot;https://github.com/lukewys/chamber-ensemble-generator&quot;&gt;here&lt;/a&gt;, and usage instructions are &lt;a href=&quot;https://github.com/lukewys/chamber-ensemble-generator/blob/master/data_pipeline.md&quot;&gt;here&lt;/a&gt;. If you use the Chamber Ensemble Generator or CocoChorales in a research publication, we kindly ask that you cite it with the following BibTeX entry:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@article{wu2022chamber,
  title = {The Chamber Ensemble Generator: Limitless High-Quality MIR Data via Generative Modeling},
  author = {Wu, Yusong and Gardner, Josh and Manilow, Ethan and Simon, Ian and Hawthorne, Curtis and Engel, Jesse},
  journal = {arXiv preprint arXiv:2209.14458},
  year = {2022},
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Fri, 30 Sep 2022 09:00:00 -0700</pubDate>
        <link>https://magenta.withgoogle.com/ceg-and-cocochorales</link>
        <guid isPermaLink="true">https://magenta.withgoogle.com/ceg-and-cocochorales</guid>
        
        <category>chamber-ensemble-generator</category>
        
        <category>cocochorales</category>
        
        <category>coconet</category>
        
        <category>midi-ddsp</category>
        
        <category>dataset</category>
        
        
        <category>blog</category>
        
      </item>
    
  </channel>
</rss>
