FluidAudio is a Swift SDK for fully local, low-latency audio AI on Apple platforms (iOS 17+, macOS 14+). The SDK provides four core capabilities: Automatic Speech Recognition (ASR), Speaker Diarization, Voice Activity Detection (VAD), and Text-to-Speech (TTS). All inference operations are offloaded to the Apple Neural Engine (ANE) to minimize CPU usage and power consumption while maximizing performance.
This document provides a high-level introduction to the SDK's architecture, design philosophy, and core components. Detailed information about each subsystem is covered in its dedicated page.
Sources: README.md:1-14, README.md:34-43, FluidAudio.podspec:1-10
FluidAudio provides four primary audio processing capabilities, each implemented as a separate subsystem with dedicated manager classes:
| Capability | Primary Manager Classes | Model Variants | Use Cases |
|---|---|---|---|
| Automatic Speech Recognition | AsrManager (batch), StreamingAsrManager (real-time) | Parakeet TDT v3 (25 languages), Parakeet TDT v2 (English, high recall), Parakeet EOU (streaming) | Transcription of audio files, real-time dictation, meeting transcripts |
| Speaker Diarization | DiarizerManager (online), OfflineDiarizerManager (batch) | Pyannote pipeline (segmentation + WeSpeaker + VBx), Sortformer (end-to-end neural) | Speaker identification, meeting analytics, multi-speaker audio processing |
| Voice Activity Detection | VadManager | Silero VAD | Speech segmentation, audio preprocessing, silence removal |
| Text-to-Speech | PocketTtsManager, KokoroTtsManager | PocketTTS (streaming, voice cloning), Kokoro (SSML, 48 voices) | Voice synthesis, audio content generation, accessibility features |
All models are open-source with permissive licenses (MIT/Apache 2.0) and automatically download from HuggingFace on first use. Processing operates entirely on-device with no network requests during inference.
Sources: README.md:34-43, README.md:246-252, README.md:286-290, README.md:374-380, README.md:484-496
The following diagram illustrates the core architecture and how major code entities relate to each other:
Diagram: FluidAudio Core Architecture with Code Entities
The architecture follows a layered design where user code interacts with manager classes, which delegate to model wrappers, which in turn utilize the model registry and download infrastructure. All CoreML models are configured via MLModelConfiguration to target CPU and Neural Engine compute units (.cpuAndNeuralEngine), explicitly avoiding GPU to prevent iOS background restrictions.
Sources: Sources/FluidAudio/ASR/AsrManager.swift:49-377, Sources/FluidAudio/DiarizerManager.swift, Sources/FluidAudio/VAD/VadManager.swift, Sources/FluidAudio/ModelNames.swift:1-346, Sources/FluidAudio/DownloadUtils.swift:60-295
FluidAudio implements a zero-configuration model management system that automatically downloads and caches models on first use:
Diagram: Model Auto-Download Flow with Code Entities
The system supports four configuration methods:
| Method | Code Entity | Use Case | Example |
|---|---|---|---|
| Default | ModelRegistry.baseURL = "https://huggingface.co" | Standard usage | Automatic |
| Programmatic | ModelRegistry.baseURL = "https://custom.url" | App-embedded config | Set before manager init |
| Environment Variable | REGISTRY_URL or MODEL_REGISTRY_URL | CLI/testing/CI | export REGISTRY_URL=https://mirror.corp |
| Proxy | https_proxy | Corporate firewalls | export https_proxy=http://proxy:8080 |
All downloads occur through DownloadUtils static methods, which handle URLSession configuration, retry logic, and corruption detection. Models are cached per-repository under ~/.cache/fluidaudio/ with automatic re-download if validation fails.
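The environment-variable fallback described in the table above can be sketched as a small resolution function. The function name and the precedence between the two variables are assumptions for illustration; only the variable names and the default registry come from the table:

```swift
import Foundation

// Illustrative sketch: resolve the model registry URL from the environment,
// falling back to the default HuggingFace endpoint. REGISTRY_URL and
// MODEL_REGISTRY_URL are the variables named in the table; checking
// REGISTRY_URL first is an assumption, not confirmed behavior.
func resolveRegistryURL(
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> String {
    if let url = environment["REGISTRY_URL"] { return url }
    if let url = environment["MODEL_REGISTRY_URL"] { return url }
    return "https://huggingface.co"  // default registry
}
```

A corporate mirror would then be selected with `export REGISTRY_URL=https://mirror.corp` before launching the process, with no code changes required.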
Sources: Sources/FluidAudio/ModelNames.swift:1-346, Sources/FluidAudio/DownloadUtils.swift:60-295, README.md:137-185
FluidAudio is designed specifically for Apple Silicon and optimizes for the Apple Neural Engine (ANE) throughout the processing pipeline:
Diagram: Apple Neural Engine Optimization Architecture
All CoreML models in FluidAudio explicitly configure compute units via MLModelConfiguration. This keeps inference on the Neural Engine with CPU fallback and deliberately excludes the GPU, avoiding iOS background execution restrictions.
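A minimal sketch of this configuration (the `MLModelConfiguration` and `MLComputeUnits` APIs are standard CoreML; how FluidAudio names and threads the configuration through model loading is not shown here):

```swift
import CoreML

// Compute-unit configuration as described above: ANE preferred, CPU fallback,
// GPU excluded. FluidAudio applies a configuration like this when loading
// each of its CoreML models.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// A model loaded with this configuration never dispatches to the GPU:
// let model = try MLModel(contentsOf: compiledModelURL, configuration: config)
```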
The CocoaPod specification enforces an arm64-only architecture by excluding x86_64 builds.
This exclusion prevents accidental builds on x86_64 simulators or Intel Macs, where ANE is unavailable and performance would degrade significantly.
Sources: Sources/FluidAudio/ASR/AsrModels.swift:61-71, FluidAudio.podspec:20-38, README.md:10-14
FluidAudio includes two native wrappers for operations where Swift alone is insufficient:
FastClusterWrapper implements the VBx clustering algorithm for offline speaker diarization. Located in Sources/FastClusterWrapper/, this wrapper:

- Is compiled without ARC (requires_arc = false)
- Exposes the performVBxClustering entry point consumed by OfflineDiarizerManager
- Defines its module interface in Sources/FastClusterWrapper/include/FastClusterWrapper.h
MachTaskSelfWrapper provides system-level APIs for memory monitoring and profiling. Located in Sources/MachTaskSelfWrapper/, this wrapper:

- Wraps Mach kernel APIs (task_info, mach_task_self)
- Defines its module map in Sources/MachTaskSelfWrapper/include/module.modulemap

Both wrappers are required dependencies for the Core subspec in the CocoaPods distribution.
Sources: FluidAudio.podspec:40-56, Sources/MachTaskSelfWrapper/include/module.modulemap:1-5
FluidAudio follows five core design principles:
Models auto-download on first use with automatic corruption detection and recovery. Users never manually download model files or configure paths. The Repo enum and ModelNames enum define all model locations and required files declaratively.
ASR processes audio in stateless chunks (~14.96 seconds with 2-second overlap). Each chunk is transcribed independently without context carryover, which keeps memory usage bounded and allows chunks to be processed or retried independently.
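The chunking arithmetic above can be sketched as follows. The helper name and return type are illustrative assumptions; only the chunk length (~14.96 s), the 2-second overlap, and the statelessness come from the text:

```swift
// Illustrative sketch: split a buffer of audio samples into fixed-length,
// overlapping windows that can each be transcribed independently.
func chunkRanges(
    sampleCount: Int,
    sampleRate: Int = 16_000,           // assumed sample rate for illustration
    chunkSeconds: Double = 14.96,
    overlapSeconds: Double = 2.0
) -> [Range<Int>] {
    let chunk = Int((chunkSeconds * Double(sampleRate)).rounded())
    let hop = chunk - Int((overlapSeconds * Double(sampleRate)).rounded())
    var ranges: [Range<Int>] = []
    var start = 0
    while start < sampleCount {
        ranges.append(start..<min(start + chunk, sampleCount))
        if start + chunk >= sampleCount { break }
        start += hop  // advance by chunk minus overlap
    }
    return ranges
}
```

For 30 seconds of 16 kHz audio this yields three windows, with each pair of adjacent windows sharing 2 seconds (32,000 samples) of context.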
All manager classes use Swift's actor model for thread safety. The codebase strictly prohibits @unchecked Sendable (see CLAUDE.md:11-15). Thread safety is achieved through:

- Actor-isolated mutable state in manager classes
- @MainActor for UI-related operations

All neural operations target the ANE explicitly via MLModelConfiguration.computeUnits = .cpuAndNeuralEngine. GPU is deliberately avoided because iOS restricts GPU access for backgrounded apps, which would interrupt inference.
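The actor-based thread-safety pattern above can be illustrated with a minimal sketch. The type here is hypothetical, not a FluidAudio class; it only demonstrates the pattern of confining mutable state to an actor instead of reaching for @unchecked Sendable:

```swift
import Foundation

// Hypothetical illustration of the pattern: mutable state isolated inside an
// actor, so concurrent callers get compiler-checked serialization.
actor SegmentStore {
    private var segments: [String] = []

    func add(_ segment: String) {
        segments.append(segment)  // only ever mutated on the actor
    }

    var count: Int { segments.count }
}
```

Callers interact via `await store.add(...)`, and the compiler rejects any unsynchronized access to the actor's state.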
The system automatically handles transient failures: failed downloads are retried, and corrupted model files are detected during validation and re-downloaded.
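A retry loop of the kind described can be sketched as follows. The function name, attempt count, and the omission of backoff delays are illustrative assumptions, not FluidAudio's actual policy:

```swift
// Illustrative sketch: retry a failable operation a bounded number of times,
// rethrowing the last error if every attempt fails.
func withRetries<T>(maxAttempts: Int = 3, _ operation: (Int) throws -> T) throws -> T {
    var lastError: Error?
    for attempt in 1...maxAttempts {
        do {
            return try operation(attempt)
        } catch {
            lastError = error
            // A real implementation would sleep with exponential backoff here.
        }
    }
    throw lastError!  // safe: the loop ran at least once
}
```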
Sources: CLAUDE.md:11-15, Sources/FluidAudio/ASR/AsrManager.swift:49-377, Sources/FluidAudio/DownloadUtils.swift:60-295
| Requirement | Value | Rationale |
|---|---|---|
| Minimum iOS | 17.0 | CoreML ANE API availability |
| Minimum macOS | 14.0 | Matching ANE support |
| Architecture | arm64 only | ANE exclusive to Apple Silicon |
| Swift Version | 5.10+ (6.0+ for swift-format) | Swift concurrency and modern features |
| C++ Standard | C++17 | FastClusterWrapper requirements |
FluidAudio includes a macOS command-line interface (fluidaudiocli) with benchmarking and dataset management commands. The CLI provides development tools including:

- transcribe: Batch and streaming ASR
- process: Speaker diarization
- asr-benchmark, vad-benchmark, diarization-benchmark: Accuracy validation
- download: Pre-fetch datasets (LibriSpeech, AMI, MUSAN, VOiCES)

Sources: FluidAudio.podspec:16-17, README.md:87-107, README.md:550-552
```
FluidAudio/
├── Sources/
│   ├── FluidAudio/              # Core library
│   │   ├── ASR/                 # AsrManager, StreamingAsrManager, TDT decoder
│   │   ├── Diarizer/            # DiarizerManager, OfflineDiarizerManager
│   │   ├── VAD/                 # VadManager, streaming VAD
│   │   ├── TTS/                 # PocketTtsManager, KokoroTtsManager
│   │   ├── Shared/              # AudioConverter, ANEOptimizer, DownloadUtils
│   │   ├── ModelNames.swift     # Repo enum, ModelNames enum
│   │   └── DownloadUtils.swift  # Model download infrastructure
│   ├── FastClusterWrapper/      # C++17 VBx clustering
│   ├── MachTaskSelfWrapper/     # C system APIs
│   └── FluidAudioCLI/           # macOS CLI application
├── Tests/FluidAudioTests/       # Unit and integration tests
├── Documentation/               # API reference, guides
├── Scripts/                     # Python benchmark utilities
└── Package.swift                # Swift Package Manager manifest
```
The codebase is organized into subsystem-specific directories with shared infrastructure under Shared/. Each major capability (ASR, Diarization, VAD, TTS) has dedicated manager classes that follow consistent initialization patterns and API conventions.
Sources: Package.swift, README.md:112-127 (from CLAUDE.md)
FluidAudio achieves high throughput through ANE optimization:
| Subsystem | Performance Metric | Example (M4 Pro) |
|---|---|---|
| ASR Batch | Real-time factor | ~190x (1 hour audio in ~19 seconds) |
| ASR Streaming | Latency | Real-time with <100ms delay |
| Diarization Online | DER on AMI | ~17.7% |
| Diarization Offline | DER on AMI | Lower (VBx clustering) |
| VAD | Frame rate | 256ms hop, <5ms inference |
Performance scales with Apple Silicon generation (M1 < M2 < M3 < M4) and benefits from increased ANE compute units in Pro/Max/Ultra variants.
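The real-time factor in the table is simply audio duration divided by processing time; at ~190x, one hour of audio finishes in roughly 19 seconds. A trivial sketch of the calculation:

```swift
// Real-time factor: how many seconds of audio are processed per second of
// wall-clock time. Values over 1.0 mean faster than real time.
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}
```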
For detailed performance analysis and benchmark methodology, see Performance and Benchmarks.
Sources: README.md:249-251, README.md:286-290