Takeaways
Just as Meta Segment Anything Model (SAM) revolutionized computer vision by enabling people to segment any object in images and videos, today we’re excited to share a first-of-its-kind model for segmenting sound. We’re introducing SAM Audio, a state-of-the-art unified model that transforms audio processing by making it easy to isolate any sound from complex audio mixtures using natural, multimodal prompts — whether through text, visual cues, or marking time segments. This intuitive approach mirrors how people naturally engage with sound, making audio separation more accessible and useful than ever before.
At the heart of SAM Audio is Perception Encoder Audiovisual (PE-AV), the technical engine that helps drive its state-of-the-art performance. Built on the open source Perception Encoder model we shared earlier this year, PE-AV enables more advanced computer vision systems that can assist people in everyday tasks, including sound detection. Think of PE-AV as “the ears” that help SAM Audio function as “the brain” to complete audio segmentation tasks. Together, these models enable many exciting use cases. Imagine a video recording of a band performance: one click on the guitar is all it takes to isolate its audio. SAM Audio can also separate audio with text prompts, such as filtering out loud traffic noise from a video filmed outside. Additionally, our industry-first span prompts help people fix audio issues all at once, such as removing the noise of a barking dog from an entire podcast recording.
At Meta, we’re using these advancements to help build the next generation of creative media tools. We see so many potential use cases, including audio clean-up, background noise removal, and other tools to help people enhance their creativity. Today, we’re sharing SAM Audio and PE-AV with the community, along with two research papers offering technical depth on each model. We’re also sharing SAM Audio-Bench, the first in-the-wild audio separation benchmark, and SAM Audio Judge, the first automatic judge model for audio separation.
We’re bringing all of this work together in the Segment Anything Playground, our new platform that lets anyone try our latest models. Starting today, people can select from our collection of audio and video assets or upload their own to explore the capabilities of SAM Audio. As always, we look forward to continuing the conversations we’ve been having about SAM — and for the first time ever, hearing what people create with these groundbreaking new models.
A Unified, Multimodal Prompting Model For Segmenting Audio
Until now, audio segmentation and editing has been a fragmented space, with a variety of tools designed for single-purpose use cases. As a unified model, SAM Audio is the first to support multiple interaction modalities that match how people naturally think about audio, achieving state-of-the-art performance on instrument, speech, and general sound separation for both text-prompted and visual-prompted tasks.
SAM Audio performs reliably across diverse, real-world scenarios — using text, visual, and temporal cues. This approach gives people precise and intuitive control over how audio is separated.
We present three methods for segmenting audio that can be used alone or in any combination to achieve a desired outcome.
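To make the three prompting modes concrete, here is a minimal Python sketch of how they might be invoked; the `SAMAudio` class, the `separate` method, and the prompt argument names are illustrative assumptions, not the released API.

```python
# Hypothetical usage sketch -- class and argument names are illustrative,
# not the actual SAM Audio interface.
from dataclasses import dataclass

@dataclass
class SpanPrompt:
    start_s: float  # start of the time span, in seconds
    end_s: float    # end of the time span, in seconds

class SAMAudio:
    """Stand-in for a prompted audio separation model."""

    def separate(self, mixture, text=None, visual_mask=None, span=None):
        # A real model would encode the mixture and prompts into a shared
        # representation, then generate the target and residual tracks.
        raise NotImplementedError

model = SAMAudio()

# 1. Text prompt: describe the sound you want to isolate.
# guitar = model.separate(mixture, text="electric guitar")

# 2. Visual prompt: click or mask the sounding object in the video frame.
# guitar = model.separate(mixture, visual_mask=click_mask)

# 3. Span prompt: mark a time range where the target sound occurs.
# barking = model.separate(mixture, span=SpanPrompt(start_s=12.0, end_s=14.5))
```

The three prompt types can also be passed together, which mirrors the mixed-modality prompting discussed in the results below.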
Model Architecture
At its core, SAM Audio leverages a generative modeling framework built on a flow-matching diffusion transformer. This architecture takes an audio mixture and one or more prompts, encodes them into a shared representation, and generates the target and residual audio tracks. In tandem with the generative modeling framework, we developed a comprehensive data engine for SAM Audio that addresses the challenge of obtaining large-scale, high-quality separation data. This engine combines advanced audio mixing, automated multimodal prompt generation, and a robust pseudo-labeling pipeline to produce realistic training data for real-world scenarios.
The model is trained on this diverse dataset, which includes real and synthetic mixtures spanning speech, music, and general sound events. Advanced audio data synthesis strategies further enhance the model’s robustness, ensuring reliable performance in a wide range of environments.
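For readers who want a concrete picture of the generative framework, the sketch below shows a minimal conditional flow-matching training step in PyTorch; the tiny backbone, tensor shapes, and latent encoding are simplified assumptions, not the actual SAM Audio implementation.

```python
# Minimal conditional flow-matching sketch (simplified; not the SAM Audio code).
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Stand-in for the diffusion-transformer backbone: predicts a velocity
    field given the noisy latent, the timestep, and a prompt/mixture embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, target, cond):
    """One training step of rectified-flow style flow matching."""
    noise = torch.randn_like(target)    # x_0: pure noise
    t = torch.rand(target.shape[0], 1)  # random timestep in [0, 1]
    x_t = (1 - t) * noise + t * target  # point on the straight path noise -> target
    v_target = target - noise           # constant velocity along that path
    v_pred = model(x_t, t, cond)
    return torch.mean((v_pred - v_target) ** 2)

model = TinyVelocityNet(dim=64)
target_latents = torch.randn(8, 64)  # stand-in for target-audio latents
prompt_embed = torch.randn(8, 64)    # stand-in for encoded mixture + prompts
loss = flow_matching_loss(model, target_latents, prompt_embed)
loss.backward()
```

At inference time, such a model starts from noise and integrates the predicted velocity field, conditioned on the mixture and prompts, to produce the separated track.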
Our second model, Perception Encoder Audiovisual, is the engine behind SAM Audio’s results. It powers core components such as the primary captioning model and SAM Audio Judge, our automatic judge model for audio separation. Built on Meta Perception Encoder — an open source model we released in April — PE-AV extends advanced computer vision capabilities to audio. Just as we adapted the model for object detection in SAM 3, we expanded its framework to encode sounds for SAM Audio, enabling the system to separate complex audio mixtures and adapt to real-world scenarios where visual context is important.

By extracting frame-level video features and aligning them with audio representations, the system combines and timestamps audiovisual information. This design allows SAM Audio to accurately separate sources that are visually grounded, such as on-screen speakers or instruments, and to infer off-screen events from scene context.
PE-AV provides robust, semantically rich features by aligning video frames and audio at precise moments in time. This temporal alignment is necessary for matching what’s seen with what’s heard, supporting high-precision multimodal audio separation. Without it, the model would lack the fine-grained visual understanding needed for flexible and perceptually accurate audio segmentation.
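As a rough illustration of frame-level alignment, the snippet below scores cosine similarity between per-frame visual embeddings and time-aligned audio embeddings to find when a sound is visually grounded; the shared time grid and embedding sizes are assumptions for illustration, not PE-AV internals.

```python
# Illustrative temporal-alignment sketch (assumed shapes; not PE-AV internals).
import torch
import torch.nn.functional as F

# Assume both streams have been encoded into one embedding per time step,
# on a shared grid of T time steps (e.g., one step per video frame).
T, D = 50, 512
frame_embeds = F.normalize(torch.randn(T, D), dim=-1)  # visual features per frame
audio_embeds = F.normalize(torch.randn(T, D), dim=-1)  # audio features per frame

# Per-time-step cosine similarity: high values suggest the visible object
# is producing the sound heard at that moment.
alignment = (frame_embeds * audio_embeds).sum(dim=-1)  # shape: (T,)

best_frame = int(alignment.argmax())
print(f"Strongest audiovisual agreement at frame {best_frame} "
      f"(similarity {alignment[best_frame].item():.3f})")
```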

Technically, PE-AV integrates several open source components and research advances. Alongside Meta’s Perception Encoder, it uses PyTorchVideo for efficient video processing and FAISS for large-scale semantic search, and it leverages contrastive learning frameworks. The model is trained on over 100 million videos using large-scale multimodal contrastive learning, with data from open datasets and synthetic captioning pipelines to ensure broad coverage and strong generalization. Together, these elements create a flexible, high-performance backbone that supports text, visual, and temporal prompting for a wide range of audio separation and understanding tasks.
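To make the contrastive training objective concrete, here is a minimal CLIP-style symmetric InfoNCE sketch over paired video and audio clip embeddings; the batch size, embedding dimension, and temperature are placeholder values rather than PE-AV's training configuration.

```python
# Minimal CLIP-style contrastive loss over paired video/audio clips
# (placeholder dimensions and temperature; not the PE-AV training code).
import torch
import torch.nn.functional as F

def audiovisual_contrastive_loss(video_embeds, audio_embeds, temperature=0.07):
    """Symmetric InfoNCE: matching video/audio pairs sit on the diagonal."""
    v = F.normalize(video_embeds, dim=-1)
    a = F.normalize(audio_embeds, dim=-1)
    logits = v @ a.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(v.shape[0])  # i-th video matches i-th audio
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)

batch_video = torch.randn(32, 512)  # stand-in for pooled video-clip embeddings
batch_audio = torch.randn(32, 512)  # stand-in for pooled audio-clip embeddings
loss = audiovisual_contrastive_loss(batch_video, batch_audio)
```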
SAM Audio Judge
We’re also sharing SAM Audio Judge, a novel evaluation framework and model designed to assess the quality of audio segmentation in a way that closely mirrors human perception. Unlike traditional metrics that rely on comparing segmented audio to reference tracks — often missing the nuances of how people actually hear and judge sound — SAM Audio Judge provides a reference-free, objective metric that evaluates segmented audio based on perceptual criteria. This makes it especially useful for applications where reference signals are unavailable and for benchmarking models in a way that better reflects actual listening experiences.
People can use SAM Audio Judge to benchmark and compare audio separation models across music, speech, and sound effects — gaining insights into output quality and the intrinsic difficulty of audio separation tasks. Building SAM Audio Judge began with the definition of nine perceptual dimensions, including recall, precision, faithfulness, and overall quality. Human ratings were then collected using a detailed annotation guideline and a five-point scale. SAM Audio Judge leverages advanced audio and text encoders, a transformer backbone, and a unique pre-training strategy that improves its ability to judge whether outputs match text prompts. The combination of perceptually aligned criteria, robust data collection, and innovative model architecture enables us to advance the field of audio separation.
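As a rough sketch of what a reference-free judge can look like, the snippet below maps fused audio and text-prompt features to scores on a 1-to-5 scale for a few of the perceptual dimensions named above; the head architecture, feature dimensions, and scoring setup are assumptions, not the SAM Audio Judge design.

```python
# Sketch of a reference-free judge head (architecture is an assumption for
# illustration, not the SAM Audio Judge model).
import torch
import torch.nn as nn

DIMENSIONS = ["recall", "precision", "faithfulness", "overall_quality"]  # 4 of the 9

class JudgeHead(nn.Module):
    """Maps fused audio + text-prompt features to a 1-5 score per dimension."""
    def __init__(self, feat_dim=768):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(feat_dim * 2, 256), nn.GELU(),
                                    nn.Linear(256, len(DIMENSIONS)))

    def forward(self, audio_feat, text_feat):
        fused = torch.cat([audio_feat, text_feat], dim=-1)
        # Squash raw outputs into the 1-5 rating range used by human annotators.
        return 1.0 + 4.0 * torch.sigmoid(self.scorer(fused))

judge = JudgeHead()
scores = judge(torch.randn(1, 768), torch.randn(1, 768))
print(dict(zip(DIMENSIONS, scores.squeeze(0).tolist())))
```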
SAM Audio-Bench
To ensure consistent and meaningful evaluation of model performance across varying tasks, we created SAM Audio-Bench, a comprehensive audio separation benchmark that covers all major audio domains — speech, music, and general sound effects — and text, visual, and span prompt types. This benchmark enables a fair assessment of separation models, ensuring that progress in the field is measurable and relevant to everyday use cases.
Unlike earlier datasets that use synthetic audio mixes or only cover a narrow set of sounds, SAM Audio-Bench is built using audio and video from a variety of high-quality sources. Each 10-second sample comes with rich, multimodal prompts — like human-drawn visual masks, time markers, and clear text descriptions — enabling controlled evaluation across text, visual, or timing cues. This approach ensures SAM Audio-Bench is more realistic and flexible, supporting everything from speech and music separation to instrument and general sound extraction — all within a single, unified framework.
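For illustration, a single benchmark sample might be represented along these lines; the field names and types below are assumptions made for the sake of the example, not the released schema.

```python
# Hypothetical shape of one SAM Audio-Bench sample (field names are assumptions
# for illustration; the released benchmark may use a different schema).
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class BenchSample:
    audio_path: str                               # 10-second audio clip
    video_path: Optional[str]                     # paired video, when available
    text_prompt: str                              # clear description of the target sound
    visual_mask_path: Optional[str] = None        # human-drawn mask of the sounding object
    span_s: Optional[Tuple[float, float]] = None  # start/end time markers for a span prompt

sample = BenchSample(
    audio_path="clip_0001.wav",
    video_path="clip_0001.mp4",
    text_prompt="acoustic guitar strumming",
    visual_mask_path="clip_0001_guitar_mask.png",
    span_s=(2.5, 6.0),
)
```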

SAM Audio-Bench also pioneers reference-free evaluation, making it possible to evaluate audio separation without needing isolated reference tracks. It combines human listening tests with the SAM Audio Judge model, delivering reliable results — even when the original audio stems aren’t available. By bringing together real-world audio, multimodal prompts, and coverage across different sound domains, SAM Audio-Bench sets a new standard for testing audio separation systems in ways that better reflect how they’ll actually be used outside the lab.
Results
SAM Audio represents a significant advancement in audio separation technology, outperforming previous state-of-the-art models across a wide range of benchmarks and tasks. The model not only significantly surpasses prior work in universal audio separation but also matches the performance of the best domain-specific models across all audio categories, including speech, music, and general sounds. Its multimodal prompting, which supports text, visual, and click-based inputs, enables flexible and open-domain segmentation, making it suitable for both in-the-wild and professional audio scenarios.

Performance evaluations show that SAM Audio achieves state-of-the-art results on modality-specific tasks, with mixed-modality prompting (such as combining text and span inputs) delivering even stronger outcomes than single-modality approaches. Notably, the model runs faster than real time (RTF ≈ 0.7), processing audio efficiently at model scales from 500M to 3B parameters.
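For context, the real-time factor (RTF) is the ratio of processing time to the duration of the audio being processed, so values below 1 mean the audio is separated faster than it plays back:

```python
# Real-time factor: processing time divided by audio duration.
processing_seconds = 7.0  # hypothetical time to separate a clip
audio_seconds = 10.0      # length of the clip
rtf = processing_seconds / audio_seconds
print(f"RTF = {rtf:.1f}")  # 0.7 -> the clip is processed faster than it plays
```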
While SAM Audio sets a new standard for quality and efficiency, it does have some limitations. Audio as a prompt is not supported, and complete audio separation without prompting is outside its scope. Additionally, separating highly similar audio events, such as isolating a single singer from a chorus or one instrument from an orchestra, remains a challenge.
Looking Forward: The Future of Audio AI
We’re excited to bring audio to the Segment Anything collection of models and believe SAM Audio is the all-around best audio separation model available today. Our unified approach also enables new possibilities for understanding complex acoustic environments and responding to natural prompts across diverse modalities.
By making professional-grade audio separation available through intuitive, natural prompts, we aim to empower creators, researchers, and developers to explore new forms of expression and build applications that were previously out of reach. In addition to exploring future integrations of SAM Audio in our products, we’ve partnered with Starkey, the largest manufacturer of hearing aids in the US, and 2gether-International, a leading startup accelerator for disabled founders. Both partners are exploring how models like SAM Audio can further advance accessibility.
This democratization of audio tools is a step toward more accessible, creative, and inclusive AI. The future of audio-aware AI is just beginning, and we’re excited to support the innovations and discoveries that lie ahead.