- **Single-Stream Transformer** – A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.
- **Exceptional Human-Centric Quality** – Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.
- **Multilingual** – Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French.
- **Blazing-Fast Inference** – Generates a 5-second 256p video in 2 seconds and a 5-second 1080p video in 38 seconds on a single H100 GPU.
- **State-of-the-Art Results** – Achieves an 80.0% win rate vs Ovi 1.1 and 60.9% vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons.
- **Fully Open Source** – We release the complete model stack: base model, distilled model, super-resolution model, and inference code.
Demo videos: video_1.mp4 – video_7.mp4
daVinci-MagiHuman uses a single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, and jointly denoises the video and audio within a unified token sequence.
Key design choices:
| Component | Description |
|---|---|
| Sandwich Architecture | First and last 4 layers use modality-specific projections; middle 32 layers share parameters across modalities |
| Timestep-Free Denoising | No explicit timestep embeddings; the model infers the denoising state directly from input latents |
| Per-Head Gating | Learned scalar gates with sigmoid activation on each attention head for training stability |
| Unified Conditioning | Denoising and reference signals handled through a minimal unified interface; no dedicated conditioning branches |
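The per-head gating row in the table above can be illustrated with a minimal, framework-agnostic sketch. This is an assumption-laden toy in numpy, not the model's actual implementation: the function name, shapes, and the exact placement of the gate (scaling each head's attention output) are chosen here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(q, k, v, gate_logits):
    """Multi-head attention where each head's output is scaled by a learned
    scalar gate passed through a sigmoid, so a head can be smoothly
    down-weighted during training for stability.

    q, k, v: (heads, seq, dim) arrays; gate_logits: (heads,) learned scalars.
    """
    heads, seq, dim = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)  # (heads, seq, seq)
    out = softmax(scores) @ v                         # (heads, seq, dim)
    gates = 1.0 / (1.0 + np.exp(-gate_logits))        # sigmoid -> (0, 1)
    return out * gates[:, None, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16, 64)) for _ in range(3))
out = gated_attention(q, k, v, np.zeros(8))  # sigmoid(0) = 0.5 at init
print(out.shape)  # (8, 16, 64)
```

Because the gate is a single scalar per head, it adds a negligible parameter count while giving the optimizer a smooth knob to attenuate unstable heads.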
| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ |
|---|---|---|---|---|
| Ovi 1.1 | 4.73 | 4.10 | 4.41 | 40.45% |
| LTX 2.3 | 4.76 | 4.12 | 4.56 | 19.23% |
| daVinci-MagiHuman | 4.80 | 4.18 | 4.52 | 14.60% |
| Matchup | daVinci-MagiHuman Win | Tie | Opponent Win |
|---|---|---|---|
| vs Ovi 1.1 | 80.0% | 8.2% | 11.8% |
| vs LTX 2.3 | 60.9% | 17.2% | 21.9% |
| Resolution | Base (s) | Super-Res (s) | Decode (s) | Total (s) |
|---|---|---|---|---|
| 256p | 1.6 | – | 0.4 | 2.0 |
| 540p | 1.6 | 5.1 | 1.3 | 8.0 |
| 1080p | 1.6 | 31.0 | 5.8 | 38.4 |
- **Latent-Space Super-Resolution** – Two-stage pipeline: generate at low resolution, then refine in latent space (not pixel space), avoiding an extra VAE decode-encode round trip.
- **Turbo VAE Decoder** – A lightweight re-trained decoder that substantially reduces decoding overhead.
- **Full-Graph Compilation** – MagiCompiler fuses operators across Transformer layers for a ~1.2x speedup.
- **Distillation** – DMD-2 distillation enables generation in only 8 denoising steps (no CFG), without sacrificing quality.
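A few-step, CFG-free sampler of the kind that distillation enables can be sketched schematically. The loop below is a generic toy in numpy; the update rule, the linear noise schedule, and the `denoise` stand-in are assumptions for illustration, not the released sampler.

```python
import numpy as np

def sample(denoise, shape, steps=8, seed=0):
    """Schematic few-step sampler: start from pure noise and, at each step,
    blend the current state toward the model's clean-latent prediction.
    No classifier-free guidance, so each step is a single forward pass."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)           # t = 1: pure noise
    ts = np.linspace(1.0, 0.0, steps + 1)    # noise levels from 1 down to 0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x0_pred = denoise(x, t_cur)          # stand-in for the distilled model
        r = t_next / t_cur                   # fraction of the old state to keep
        x = r * x + (1.0 - r) * x0_pred
    return x

# Toy check: a "denoiser" that always predicts zero latents drives x to zero.
latents = sample(lambda x, t: np.zeros_like(x), shape=(4, 8))
print(np.allclose(latents, 0.0))  # True
```

The key cost property is visible in the loop: 8 iterations, one model call each, versus tens of calls (doubled by CFG) for an undistilled sampler.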
```bash
# Recommended: use the prebuilt MagiHuman image (supports the full pipeline, including SR 1080p)
docker pull sandai/magi-human:latest
docker run -it --gpus all --network host --ipc host \
  -v /path/to/repos:/workspace \
  -v /path/to/checkpoints:/models \
  --name my-magi-human \
  sandai/magi-human:latest \
  bash
```
```bash
# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
```

If you prefer manual setup, follow Option 2 (Conda) below.
```bash
# Create environment
conda create -n davinci python=3.12
conda activate davinci

# Install PyTorch
pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0

# Install Flash Attention (Hopper)
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper && python setup.py install && cd ../..

# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone and install daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
pip install -r requirements.txt
pip install --no-deps -r requirements-nodeps.txt

# Optional (only for sr-1080p): install MagiAttention
git clone --recursive https://github.com/SandAI-org/MagiAttention.git
cd MagiAttention
git checkout v1.0.5
git submodule update --init --recursive
pip install -r requirements.txt
pip install --no-build-isolation .
```

Download the complete model stack from HuggingFace and update the paths in the config files under `example/`.
You will also need the following external models:
| Model | Source |
|---|---|
| Text Encoder | t5gemma-9b-9b-ul2 |
| Audio Model | stable-audio-open-1.0 |
| VAE | Wan2.2-TI2V-5B |
Before running, update the checkpoint paths in the config files (example/*/config.json) to point to your local model directory.
Note: The first run will be slower due to model compilation and cache warmup. Subsequent runs will match the reported inference speeds.
**Base Model (256p)**

```bash
bash example/base/run.sh
```

**Distilled Model (256p, 8 steps, no CFG)**

```bash
bash example/distill/run.sh
```

**Super-Resolution to 540p**

```bash
bash example/sr_540p/run.sh
```

**Super-Resolution to 1080p**

```bash
bash example/sr_1080p/run.sh
```

daVinci-MagiHuman uses an Enhanced Prompt system that rewrites user inputs into detailed performance directions optimized for avatar-style video generation. For the full system prompt specification, see prompts/enhanced_prompt_design.md.
Below is a quick reference for writing effective prompts.
Every enhanced prompt has three parts:

- **Main Body** (150–200 words) – A clinical, chronological description of the character's appearance, facial dynamics, vocal delivery, and static cinematography. Written in English regardless of dialogue language.
- **Dialogue** – Repeats all spoken lines in a structured format:
  `Dialogue: <character description, language>: "Line content"`
- **Background Sound** – Specifies the most prominent ambient sound:
  `Background Sound: <Description of the background sound>`
  Use `<No prominent background sound>` if none.
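The three parts above can also be assembled programmatically. The helper below is illustrative only (the function name and the single-newline joining are assumptions; the authoritative format is in prompts/enhanced_prompt_design.md):

```python
def build_enhanced_prompt(main_body, dialogue, background_sound=None):
    """Assemble a three-part enhanced prompt: main body, dialogue lines,
    and background sound.

    dialogue: list of (speaker descriptor including language, line) pairs.
    """
    parts = [main_body.strip()]
    for speaker, line in dialogue:
        parts.append(f'Dialogue: <{speaker}>: "{line}"')
    sound = background_sound or "No prominent background sound"
    parts.append(f"Background Sound: <{sound}>")
    return "\n".join(parts)

prompt = build_enhanced_prompt(
    "A young man with short dark hair, wearing a bright yellow polo shirt, "
    "sits stationary...",
    [("Young man in yellow polo, Mandarin", "...")],
)
print(prompt.splitlines()[-1])  # Background Sound: <No prominent background sound>
```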
User input: A man in a yellow shirt says "ζηδΊΊε¨δΈθ΅·ηζ΄»δΈθΎεοΌθΏεΈ¦ηει’ε ·ε’"
Enhanced prompt (abbreviated):
A young man with short dark hair, wearing a bright yellow polo shirt, sits stationary. His disposition is earnest and slightly agitated... He speaks with a rapid, emphatic tone, his mouth opening wide as he says, "ζ η δΊΊ ε¨ δΈ θ΅· η ζ΄» δΈ θΎ εοΌθΏ εΈ¦ η ε ι’ ε · ε’..." His brow furrows, lip muscles showing distinct dynamics...
Dialogue: <Young man in yellow polo, Mandarin>: "ζ η δΊΊ ε¨ δΈ θ΅· η ζ΄» δΈ θΎ εοΌθΏ εΈ¦ η ε ι’ ε · ε’..."
Background Sound: <No prominent background sound>
We thank the open-source community, and in particular Wan2.2 and Turbo-VAED, for their valuable contributions.
This project is released under the Apache License 2.0.
```bibtex
@misc{davinci-magihuman-2026,
  title  = {Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model},
  author = {SII-GAIR and Sand.ai},
  year   = {2026},
  url    = {https://github.com/GAIR-NLP/daVinci-MagiHuman}
}
```
