- **Single-Stream Transformer** – A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.
- **Exceptional Human-Centric Quality** – Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.
- **Multilingual** – Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French.
- **Blazing-Fast Inference** – Generates a 5-second 256p video in 2 seconds and a 5-second 1080p video in 38 seconds on a single H100 GPU.
- **State-of-the-Art Results** – Achieves an 80.0% win rate vs Ovi 1.1 and 60.9% vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons.
- **Fully Open Source** – We release the complete model stack: base model, distilled model, super-resolution model, and inference code.
Demo videos: video_1.mp4 – video_7.mp4
daVinci-MagiHuman uses a single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, and jointly denoises the video and audio within a unified token sequence.
Key design choices:
| Component | Description |
|---|---|
| Sandwich Architecture | First and last 4 layers use modality-specific projections; middle 32 layers share parameters across modalities |
| Timestep-Free Denoising | No explicit timestep embeddings; the model infers the denoising state directly from input latents |
| Per-Head Gating | Learned scalar gates with sigmoid activation on each attention head for training stability |
| Unified Conditioning | Denoising and reference signals handled through a minimal unified interface; no dedicated conditioning branches |
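The per-head gating row in the table above can be illustrated with a minimal, framework-agnostic sketch. This is an assumption-laden toy in numpy, not the model's actual implementation: the function name, shapes, and the exact placement of the gate (scaling each head's attention output) are chosen here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(q, k, v, gate_logits):
    """Multi-head attention where each head's output is scaled by a learned
    scalar gate passed through a sigmoid, so a head can be smoothly
    down-weighted during training for stability.

    q, k, v: (heads, seq, dim) arrays; gate_logits: (heads,) learned scalars.
    """
    heads, seq, dim = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)  # (heads, seq, seq)
    out = softmax(scores) @ v                         # (heads, seq, dim)
    gates = 1.0 / (1.0 + np.exp(-gate_logits))        # sigmoid -> (0, 1)
    return out * gates[:, None, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16, 64)) for _ in range(3))
out = gated_attention(q, k, v, np.zeros(8))  # sigmoid(0) = 0.5 at init
print(out.shape)  # (8, 16, 64)
```

Because the gate is a single scalar per head, it adds a negligible parameter count while giving the optimizer a smooth knob to attenuate unstable heads.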
| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ |
|---|---|---|---|---|
| Ovi 1.1 | 4.73 | 4.10 | 4.41 | 40.45% |
| LTX 2.3 | 4.76 | 4.12 | 4.56 | 19.23% |
| daVinci-MagiHuman | 4.80 | 4.18 | 4.52 | 14.60% |
| Matchup | daVinci-MagiHuman Win | Tie | Opponent Win |
|---|---|---|---|
| vs Ovi 1.1 | 80.0% | 8.2% | 11.8% |
| vs LTX 2.3 | 60.9% | 17.2% | 21.9% |
| Resolution | Base (s) | Super-Res (s) | Decode (s) | Total (s) |
|---|---|---|---|---|
| 256p | 1.6 | – | 0.4 | 2.0 |
| 540p | 1.6 | 5.1 | 1.3 | 8.0 |
| 1080p | 1.6 | 31.0 | 5.8 | 38.4 |
- **Latent-Space Super-Resolution** – Two-stage pipeline: generate at low resolution, then refine in latent space (not pixel space), avoiding an extra VAE decode-encode round trip.
- **Turbo VAE Decoder** – A lightweight re-trained decoder that substantially reduces decoding overhead.
- **Full-Graph Compilation** – MagiCompiler fuses operators across Transformer layers for a ~1.2x speedup.
- **Distillation** – DMD-2 distillation enables generation in only 8 denoising steps (no CFG), without sacrificing quality.
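A few-step, CFG-free sampler of the kind that distillation enables can be sketched schematically. The loop below is a generic toy in numpy; the update rule, the linear noise schedule, and the `denoise` stand-in are assumptions for illustration, not the released sampler.

```python
import numpy as np

def sample(denoise, shape, steps=8, seed=0):
    """Schematic few-step sampler: start from pure noise and, at each step,
    blend the current state toward the model's clean-latent prediction.
    No classifier-free guidance, so each step is a single forward pass."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)           # t = 1: pure noise
    ts = np.linspace(1.0, 0.0, steps + 1)    # noise levels from 1 down to 0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x0_pred = denoise(x, t_cur)          # stand-in for the distilled model
        r = t_next / t_cur                   # fraction of the old state to keep
        x = r * x + (1.0 - r) * x0_pred
    return x

# Toy check: a "denoiser" that always predicts zero latents drives x to zero.
latents = sample(lambda x, t: np.zeros_like(x), shape=(4, 8))
print(np.allclose(latents, 0.0))  # True
```

The key cost property is visible in the loop: 8 iterations, one model call each, versus tens of calls (doubled by CFG) for an undistilled sampler.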
```bash
# Recommended: use the prebuilt MagiHuman image (supports the full pipeline, including SR 1080p)
docker pull sandai/magi-human:latest
docker run -it --gpus all --network host --ipc host \
  -v /path/to/repos:/workspace \
  -v /path/to/checkpoints:/models \
  --name my-magi-human \
  sandai/magi-human:latest \
  bash
```
```bash
# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
```

If you prefer manual setup, follow Option 2 (Conda) below.
```bash
# Create environment
conda create -n davinci python=3.12
conda activate davinci

# Install PyTorch
pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0

# Install Flash Attention (Hopper)
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper && python setup.py install && cd ../..

# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone and install daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
pip install -r requirements.txt
pip install --no-deps -r requirements-nodeps.txt

# Optional (only for sr-1080p): install MagiAttention
git clone --recursive https://github.com/SandAI-org/MagiAttention.git
cd MagiAttention
git checkout v1.0.5
git submodule update --init --recursive
pip install -r requirements.txt
pip install --no-build-isolation .
```

Download the complete model stack from HuggingFace and update the paths in the config files under `example/`.
You will also need the following external models:
| Model | Source |
|---|---|
| Text Encoder | t5gemma-9b-9b-ul2 |
| Audio Model | stable-audio-open-1.0 |
| VAE | Wan2.2-TI2V-5B |
Before running, update the checkpoint paths in the config files (example/*/config.json) to point to your local model directory.
Note: The first run will be slower due to model compilation and cache warmup. Subsequent runs will match the reported inference speeds.
**Base Model (256p)**

```bash
bash example/base/run.sh
```

**Distilled Model (256p, 8 steps, no CFG)**

```bash
bash example/distill/run.sh
```

**Super-Resolution to 540p**

```bash
bash example/sr_540p/run.sh
```

**Super-Resolution to 1080p**

```bash
bash example/sr_1080p/run.sh
```

daVinci-MagiHuman uses an Enhanced Prompt system that rewrites user inputs into detailed performance directions optimized for avatar-style video generation. For the full system prompt specification, see prompts/enhanced_prompt_design.md.
Below is a quick reference for writing effective prompts.
Every enhanced prompt has three parts:

- **Main Body** (150–200 words) – A clinical, chronological description of the character's appearance, facial dynamics, vocal delivery, and static cinematography. Written in English regardless of dialogue language.
- **Dialogue** – Repeats all spoken lines in a structured format:
  `Dialogue: <character description, language>: "Line content"`
- **Background Sound** – Specifies the most prominent ambient sound:
  `Background Sound: <Description of the background sound>`
  Use `<No prominent background sound>` if none.
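The three parts above can also be assembled programmatically. The helper below is illustrative only (the function name and the single-newline joining are assumptions; the authoritative format is in prompts/enhanced_prompt_design.md):

```python
def build_enhanced_prompt(main_body, dialogue, background_sound=None):
    """Assemble a three-part enhanced prompt: main body, dialogue lines,
    and background sound.

    dialogue: list of (speaker descriptor including language, line) pairs.
    """
    parts = [main_body.strip()]
    for speaker, line in dialogue:
        parts.append(f'Dialogue: <{speaker}>: "{line}"')
    sound = background_sound or "No prominent background sound"
    parts.append(f"Background Sound: <{sound}>")
    return "\n".join(parts)

prompt = build_enhanced_prompt(
    "A young man with short dark hair, wearing a bright yellow polo shirt, "
    "sits stationary...",
    [("Young man in yellow polo, Mandarin", "...")],
)
print(prompt.splitlines()[-1])  # Background Sound: <No prominent background sound>
```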
User input: A man in a yellow shirt says "ζηδΊΊε¨δΈθ΅·ηζ΄»δΈθΎεοΌθΏεΈ¦ηει’ε ·ε’"
Enhanced prompt (abbreviated):
A young man with short dark hair, wearing a bright yellow polo shirt, sits stationary. His disposition is earnest and slightly agitated... He speaks with a rapid, emphatic tone, his mouth opening wide as he says, "ζ η δΊΊ ε¨ δΈ θ΅· η ζ΄» δΈ θΎ εοΌθΏ εΈ¦ η ε ι’ ε · ε’..." His brow furrows, lip muscles showing distinct dynamics...
Dialogue: <Young man in yellow polo, Mandarin>: "ζ η δΊΊ ε¨ δΈ θ΅· η ζ΄» δΈ θΎ εοΌθΏ εΈ¦ η ε ι’ ε · ε’..."
Background Sound: <No prominent background sound>
We thank the open-source community, and in particular Wan2.2 and Turbo-VAED, for their valuable contributions.
This project is released under the Apache License 2.0.
```bibtex
@misc{davinci-magihuman-2026,
  title  = {Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model},
  author = {SII-GAIR and Sand.ai},
  year   = {2026},
  url    = {https://github.com/GAIR-NLP/daVinci-MagiHuman}
}
```
