Jisu Nam1 · Soowon Son1 · Zhan Xu2 · Jing Shi2 · Difan Liu2 · Feng Liu2 · Aashish Misra3 · Seungryong Kim1 · Yang Zhou2
1KAIST AI 2Adobe Research 3Adobe
CVPR 2025
Visual Persona is a foundation model for 🏄 Full-Body Human Customization. Given a reference image of a person, our model generates diverse, customized images while faithfully preserving the full-body appearance — including face, clothing, body shape, and accessories.
- Full-body fidelity: Preserves identity across face, torso, legs, and shoes simultaneously
- Versatile applications: A single model supports multiple downstream tasks via plug-in adapters
- Flexible control: Supports both pose-guided and text-guided generation
| Task | Description |
|---|---|
| Pose-Guided Human Customization | Generate the reference person in arbitrary poses |
| Story Generation | Create consistent multi-scene narratives with the same identity |
| Text-Guided Virtual Try-On | Change clothing while preserving the person's appearance |
| Anime Character Customization | Transfer identity to stylized, non-photorealistic characters |
We recommend using a conda environment with Python 3.10.

```bash
conda create -n visual_persona python=3.10 -y
conda activate visual_persona
pip install -r requirements.txt
```

Step 1. Download the DINOv2 ViT-G/14 backbone:

```bash
mkdir pretrained_models
cd pretrained_models
wget https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth
```

Step 2. Download our Visual Persona checkpoints from Google Drive into `./pretrained_models/`.
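Large checkpoint downloads occasionally truncate; a quick way to sanity-check the backbone file is to hash it and compare against the digest published with the DINOv2 release (not reproduced here). The following is a minimal sketch using only the standard library; the checkpoint path matches the layout above.

```python
import hashlib
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA-256 of a file by streaming it in 1 MiB chunks,
    so large checkpoints are never loaded into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare this digest against the one published by the DINOv2 repo.
ckpt = Path("pretrained_models/dinov2_vitg14_pretrain.pth")
if ckpt.exists():
    print(ckpt.name, sha256sum(ckpt))
```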
After both steps, the directory should look like:

```
pretrained_models/
├── dinov2_vitg14_pretrain.pth
├── diffusion_pytorch_model.safetensors
└── wieght.bin
```
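A missing checkpoint is the most common setup failure, so it can help to verify the layout before running inference. This is a small sketch (the helper function is ours, not part of the repo) that checks for the three files listed above:

```python
from pathlib import Path

# Filenames from the expected layout above.
EXPECTED = [
    "dinov2_vitg14_pretrain.pth",
    "diffusion_pytorch_model.safetensors",
    "wieght.bin",
]

def missing_checkpoints(model_dir):
    """Return the expected checkpoint files absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_checkpoints("pretrained_models")
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```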
Run the script corresponding to your desired application:

```bash
# Base pose-guided generation
python inference.py

# Pose-guided generation with ControlNet
python inference_controlnet.py

# Multi-scene story generation
python inference_controlnet_story.py

# Text-guided virtual try-on
python inference_tryon.py

# Anime / character customization
python inference_anime.py
```

Tip: Each script contains configurable arguments at the top of the file (input image path, prompt, output directory, etc.).
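If you switch between tasks often, a thin dispatcher can save typos. This is a hypothetical convenience wrapper, not part of the repo: the task names and the `run_task` helper are our own, while the script filenames are the ones listed above.

```python
import subprocess
import sys

# Task name → inference script, as listed in the README.
SCRIPTS = {
    "pose": "inference.py",
    "pose_controlnet": "inference_controlnet.py",
    "story": "inference_controlnet_story.py",
    "tryon": "inference_tryon.py",
    "anime": "inference_anime.py",
}

def run_task(task, dry_run=False):
    """Build (and optionally execute) the command for the given task.
    Returns the argv list so callers can inspect or log it."""
    if task not in SCRIPTS:
        raise ValueError(f"unknown task {task!r}; choose from {sorted(SCRIPTS)}")
    cmd = [sys.executable, SCRIPTS[task]]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

For example, `run_task("story", dry_run=True)` returns the command for `inference_controlnet_story.py` without executing it.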
We use SCHP (Self-Correction Human Parsing) to parse input images into five body regions: full-body, face, torso, legs, and shoes. Any other state-of-the-art human parsing method can be substituted.
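The five-region decomposition amounts to grouping the parser's per-pixel part labels. The sketch below illustrates one plausible grouping: the label names are those commonly associated with the SCHP ATR label set and the grouping itself is our assumption, so adjust both to match your parsing checkpoint.

```python
# Hypothetical grouping of human-parsing part labels into the five body
# regions used by Visual Persona (label names and grouping are assumptions;
# check them against your SCHP checkpoint's label set).
REGIONS = {
    "face":  {"hat", "hair", "sunglasses", "face"},
    "torso": {"upper-clothes", "dress", "left-arm", "right-arm", "scarf"},
    "legs":  {"skirt", "pants", "left-leg", "right-leg"},
    "shoes": {"left-shoe", "right-shoe"},
}
# Full body is the union of the four part regions.
REGIONS["full-body"] = set().union(*REGIONS.values())

def region_mask(label_map, region):
    """Binary mask (nested lists) selecting pixels whose label name
    belongs to the requested region; everything else (e.g. background)
    maps to 0."""
    wanted = REGIONS[region]
    return [[1 if lbl in wanted else 0 for lbl in row] for row in label_map]
```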
Download the SCHP ATR checkpoint into ./pretrained_models/:
Then run:

```bash
# Parse a single image
python parsing.py --input_path /path/to/image.jpg

# Parse all images in a directory
python parsing.py --input_path /path/to/images/ --output_dir ./parsing
```

If you find this work useful, please consider citing:
```bibtex
@inproceedings{nam2025visual,
  title     = {Visual Persona: Foundation Model for Full-Body Human Customization},
  author    = {Nam, Jisu and Son, Soowon and Xu, Zhan and Shi, Jing and Liu, Difan and Liu, Feng and Misra, Aashish and Kim, Seungryong and Zhou, Yang},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages     = {18630--18641},
  year      = {2025}
}
```

This project builds on IP-Adapter. We thank the authors for their excellent work.
