FluidAudio is a Swift SDK for fully local, low-latency audio AI on Apple platforms (iOS 17+, macOS 14+). The SDK provides four core capabilities: Automatic Speech Recognition (ASR), Speaker Diarization, Voice Activity Detection (VAD), and Text-to-Speech (TTS). All inference operations are offloaded to the Apple Neural Engine (ANE) to minimize CPU usage and power consumption while maximizing performance.
This document provides a high-level introduction to the SDK's architecture, design philosophy, and core components. Detailed information about each subsystem is covered in its dedicated page.
Sources: README.md:1-14, README.md:34-43, FluidAudio.podspec:1-10
FluidAudio provides four primary audio processing capabilities, each implemented as a separate subsystem with dedicated manager classes:
| Capability | Primary Manager Classes | Model Variants | Use Cases |
|---|---|---|---|
| Automatic Speech Recognition | AsrManager (batch), StreamingAsrManager (real-time) | Parakeet TDT v3 (25 languages), Parakeet TDT v2 (English, high recall), Parakeet EOU (streaming) | Transcription of audio files, real-time dictation, meeting transcripts |
| Speaker Diarization | DiarizerManager (online), OfflineDiarizerManager (batch) | Pyannote pipeline (segmentation + WeSpeaker + VBx), Sortformer (end-to-end neural) | Speaker identification, meeting analytics, multi-speaker audio processing |
| Voice Activity Detection | VadManager | Silero VAD | Speech segmentation, audio preprocessing, silence removal |
| Text-to-Speech | PocketTtsManager, KokoroTtsManager | PocketTTS (streaming, voice cloning), Kokoro (SSML, 48 voices) | Voice synthesis, audio content generation, accessibility features |
All models are open-source with permissive licenses (MIT/Apache 2.0) and automatically download from HuggingFace on first use. Processing operates entirely on-device with no network requests during inference.
Sources: README.md:34-43, README.md:246-252, README.md:286-290, README.md:374-380, README.md:484-496
The following diagram illustrates the core architecture and how major code entities relate to each other:
Diagram: FluidAudio Core Architecture with Code Entities
The architecture follows a layered design where user code interacts with manager classes, which delegate to model wrappers, which in turn utilize the model registry and download infrastructure. All CoreML models are configured via MLModelConfiguration to target CPU and Neural Engine compute units (.cpuAndNeuralEngine), explicitly avoiding GPU to prevent iOS background restrictions.
Sources: Sources/FluidAudio/ASR/AsrManager.swift:49-377, Sources/FluidAudio/DiarizerManager.swift, Sources/FluidAudio/VAD/VadManager.swift, Sources/FluidAudio/ModelNames.swift:1-346, Sources/FluidAudio/DownloadUtils.swift:60-295
FluidAudio implements a zero-configuration model management system that automatically downloads and caches models on first use:
Diagram: Model Auto-Download Flow with Code Entities
The system supports four configuration methods:
| Method | Code Entity | Use Case | Example |
|---|---|---|---|
| Default | ModelRegistry.baseURL = "https://huggingface.co" | Standard usage | Automatic |
| Programmatic | ModelRegistry.baseURL = "https://custom.url" | App-embedded config | Set before manager init |
| Environment Variable | REGISTRY_URL or MODEL_REGISTRY_URL | CLI/testing/CI | export REGISTRY_URL=https://mirror.corp |
| Proxy | https_proxy | Corporate firewalls | export https_proxy=http://proxy:8080 |
All downloads occur through DownloadUtils static methods, which handle URLSession configuration, retry logic, and corruption detection. Models are cached per-repository under ~/.cache/fluidaudio/ with automatic re-download if validation fails.
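The environment-variable fallback described in the table above can be sketched as a small resolution function. The function name and the precedence between the two variables are assumptions for illustration; only the variable names and the default registry come from the table:

```swift
import Foundation

// Illustrative sketch: resolve the model registry URL from the environment,
// falling back to the default HuggingFace endpoint. REGISTRY_URL and
// MODEL_REGISTRY_URL are the variables named in the table; checking
// REGISTRY_URL first is an assumption, not confirmed behavior.
func resolveRegistryURL(
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> String {
    if let url = environment["REGISTRY_URL"] { return url }
    if let url = environment["MODEL_REGISTRY_URL"] { return url }
    return "https://huggingface.co"  // default registry
}
```

A corporate mirror would then be selected with `export REGISTRY_URL=https://mirror.corp` before launching the process, with no code changes required.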
Sources: Sources/FluidAudio/ModelNames.swift:1-346, Sources/FluidAudio/DownloadUtils.swift:60-295, README.md:137-185
FluidAudio is designed specifically for Apple Silicon and optimizes for the Apple Neural Engine (ANE) throughout the processing pipeline:
Diagram: Apple Neural Engine Optimization Architecture
All CoreML models in FluidAudio explicitly configure compute units via MLModelConfiguration. This keeps inference on the Neural Engine with CPU fallback and deliberately excludes the GPU, avoiding iOS background execution restrictions.
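A minimal sketch of this configuration (the `MLModelConfiguration` and `MLComputeUnits` APIs are standard CoreML; how FluidAudio names and threads the configuration through model loading is not shown here):

```swift
import CoreML

// Compute-unit configuration as described above: ANE preferred, CPU fallback,
// GPU excluded. FluidAudio applies a configuration like this when loading
// each of its CoreML models.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// A model loaded with this configuration never dispatches to the GPU:
// let model = try MLModel(contentsOf: compiledModelURL, configuration: config)
```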
The CocoaPod specification enforces an arm64-only architecture by excluding x86_64 builds.
This exclusion prevents accidental builds on x86_64 simulators or Intel Macs, where ANE is unavailable and performance would degrade significantly.
Sources: Sources/FluidAudio/ASR/AsrModels.swift:61-71, FluidAudio.podspec:20-38, README.md:10-14
FluidAudio includes two native wrappers for operations where Swift alone is insufficient:
FastClusterWrapper implements the VBx clustering algorithm for offline speaker diarization. Located in Sources/FastClusterWrapper/, this wrapper:

- Is compiled without ARC (requires_arc = false)
- Exposes the performVBxClustering entry point consumed by OfflineDiarizerManager
- Defines its module interface in Sources/FastClusterWrapper/include/FastClusterWrapper.h
MachTaskSelfWrapper provides system-level APIs for memory monitoring and profiling. Located in Sources/MachTaskSelfWrapper/, this wrapper:

- Wraps Mach kernel APIs (task_info, mach_task_self)
- Defines its module map in Sources/MachTaskSelfWrapper/include/module.modulemap

Both wrappers are required dependencies for the Core subspec in the CocoaPods distribution.
Sources: FluidAudio.podspec:40-56, Sources/MachTaskSelfWrapper/include/module.modulemap:1-5
FluidAudio follows five core design principles:
Models auto-download on first use with automatic corruption detection and recovery. Users never manually download model files or configure paths. The Repo enum and ModelNames enum define all model locations and required files declaratively.
ASR processes audio in stateless chunks (~14.96 seconds with 2-second overlap). Each chunk is transcribed independently without context carryover, which keeps memory usage bounded and allows chunks to be processed or retried independently.
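The chunking arithmetic above can be sketched as follows. The helper name and return type are illustrative assumptions; only the chunk length (~14.96 s), the 2-second overlap, and the statelessness come from the text:

```swift
// Illustrative sketch: split a buffer of audio samples into fixed-length,
// overlapping windows that can each be transcribed independently.
func chunkRanges(
    sampleCount: Int,
    sampleRate: Int = 16_000,           // assumed sample rate for illustration
    chunkSeconds: Double = 14.96,
    overlapSeconds: Double = 2.0
) -> [Range<Int>] {
    let chunk = Int((chunkSeconds * Double(sampleRate)).rounded())
    let hop = chunk - Int((overlapSeconds * Double(sampleRate)).rounded())
    var ranges: [Range<Int>] = []
    var start = 0
    while start < sampleCount {
        ranges.append(start..<min(start + chunk, sampleCount))
        if start + chunk >= sampleCount { break }
        start += hop  // advance by chunk minus overlap
    }
    return ranges
}
```

For 30 seconds of 16 kHz audio this yields three windows, with each pair of adjacent windows sharing 2 seconds (32,000 samples) of context.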
All manager classes use Swift's actor model for thread safety. The codebase strictly prohibits @unchecked Sendable (see CLAUDE.md:11-15). Thread safety is achieved through:

- Actor-isolated mutable state in manager classes
- @MainActor for UI-related operations

All neural operations target the ANE explicitly via MLModelConfiguration.computeUnits = .cpuAndNeuralEngine. GPU is deliberately avoided because iOS restricts GPU access for backgrounded apps, which would interrupt inference.
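The actor-based thread-safety pattern above can be illustrated with a minimal sketch. The type here is hypothetical, not a FluidAudio class; it only demonstrates the pattern of confining mutable state to an actor instead of reaching for @unchecked Sendable:

```swift
import Foundation

// Hypothetical illustration of the pattern: mutable state isolated inside an
// actor, so concurrent callers get compiler-checked serialization.
actor SegmentStore {
    private var segments: [String] = []

    func add(_ segment: String) {
        segments.append(segment)  // only ever mutated on the actor
    }

    var count: Int { segments.count }
}
```

Callers interact via `await store.add(...)`, and the compiler rejects any unsynchronized access to the actor's state.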
The system automatically handles transient failures: failed downloads are retried, and corrupted model files are detected during validation and re-downloaded.
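A retry loop of the kind described can be sketched as follows. The function name, attempt count, and the omission of backoff delays are illustrative assumptions, not FluidAudio's actual policy:

```swift
// Illustrative sketch: retry a failable operation a bounded number of times,
// rethrowing the last error if every attempt fails.
func withRetries<T>(maxAttempts: Int = 3, _ operation: (Int) throws -> T) throws -> T {
    var lastError: Error?
    for attempt in 1...maxAttempts {
        do {
            return try operation(attempt)
        } catch {
            lastError = error
            // A real implementation would sleep with exponential backoff here.
        }
    }
    throw lastError!  // safe: the loop ran at least once
}
```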
Sources: CLAUDE.md:11-15, Sources/FluidAudio/ASR/AsrManager.swift:49-377, Sources/FluidAudio/DownloadUtils.swift:60-295
| Requirement | Value | Rationale |
|---|---|---|
| Minimum iOS | 17.0 | CoreML ANE API availability |
| Minimum macOS | 14.0 | Matching ANE support |
| Architecture | arm64 only | ANE exclusive to Apple Silicon |
| Swift Version | 5.10+ (6.0+ for swift-format) | Swift concurrency and modern features |
| C++ Standard | C++17 | FastClusterWrapper requirements |
FluidAudio includes a macOS command-line interface (fluidaudiocli) with benchmarking and dataset management commands. The CLI provides development tools including:

- transcribe: Batch and streaming ASR
- process: Speaker diarization
- asr-benchmark, vad-benchmark, diarization-benchmark: Accuracy validation
- download: Pre-fetch datasets (LibriSpeech, AMI, MUSAN, VOiCES)

Sources: FluidAudio.podspec:16-17, README.md:87-107, README.md:550-552
```
FluidAudio/
├── Sources/
│   ├── FluidAudio/              # Core library
│   │   ├── ASR/                 # AsrManager, StreamingAsrManager, TDT decoder
│   │   ├── Diarizer/            # DiarizerManager, OfflineDiarizerManager
│   │   ├── VAD/                 # VadManager, streaming VAD
│   │   ├── TTS/                 # PocketTtsManager, KokoroTtsManager
│   │   ├── Shared/              # AudioConverter, ANEOptimizer, DownloadUtils
│   │   ├── ModelNames.swift     # Repo enum, ModelNames enum
│   │   └── DownloadUtils.swift  # Model download infrastructure
│   ├── FastClusterWrapper/      # C++17 VBx clustering
│   ├── MachTaskSelfWrapper/     # C system APIs
│   └── FluidAudioCLI/           # macOS CLI application
├── Tests/FluidAudioTests/       # Unit and integration tests
├── Documentation/               # API reference, guides
├── Scripts/                     # Python benchmark utilities
└── Package.swift                # Swift Package Manager manifest
```
The codebase is organized into subsystem-specific directories with shared infrastructure under Shared/. Each major capability (ASR, Diarization, VAD, TTS) has dedicated manager classes that follow consistent initialization patterns and API conventions.
Sources: Package.swift, README.md:112-127 (from CLAUDE.md)
FluidAudio achieves high throughput through ANE optimization:
| Subsystem | Performance Metric | Example (M4 Pro) |
|---|---|---|
| ASR Batch | Real-time factor | ~190x (1 hour audio in ~19 seconds) |
| ASR Streaming | Latency | Real-time with <100ms delay |
| Diarization Online | DER on AMI | ~17.7% |
| Diarization Offline | DER on AMI | Lower (VBx clustering) |
| VAD | Frame rate | 256ms hop, <5ms inference |
Performance scales with Apple Silicon generation (M1 < M2 < M3 < M4) and benefits from increased ANE compute units in Pro/Max/Ultra variants.
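The real-time factor in the table is simply audio duration divided by processing time; at ~190x, one hour of audio finishes in roughly 19 seconds. A trivial sketch of the calculation:

```swift
// Real-time factor: how many seconds of audio are processed per second of
// wall-clock time. Values over 1.0 mean faster than real time.
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}
```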
For detailed performance analysis and benchmark methodology, see Performance and Benchmarks.
Sources: README.md:249-251, README.md:286-290