Open-Sora: Democratizing Efficient Video Production for All

Here we provide an efficient MindSpore implementation of OpenSora, an open-source project that aims to foster innovation, creativity, and inclusivity within the field of content creation.

This repository is built on the models and code released by HPC-AI Tech. We are grateful for their exceptional work and generous contribution to open source.

Open-Sora is still at an early stage and under active development.

📰 News & States

| Official News from HPC-AI Tech | MindSpore Support |
| --- | --- |
| [2025.03.12] 🔥 We released Open-Sora 2.0 (11B). 🎬 11B model achieves on-par performance with 11B HunyuanVideo & 30B Step-Video on 📐VBench & 📊Human Preference. 🛠️ Fully open-source: checkpoints and training codes for training with only $200K. [report] | Inference |
| [2024.06.17] 🔥 HPC-AI released Open-Sora 1.2, which includes 3D-VAE, rectified flow, and score condition. The video quality is greatly improved. [checkpoints] [report] [blog] | Text-to-Video |
| [2024.04.25] 🤗 HPC-AI Tech released the Gradio demo for Open-Sora on Hugging Face Spaces. | N.A. |
| [2024.04.25] 🔥 HPC-AI Tech released Open-Sora 1.1, which supports 2s~15s, 144p to 720p, any aspect ratio text-to-image, text-to-video, image-to-video, video-to-video, infinite time generation. In addition, a full video processing pipeline is released. [checkpoints] [report] | Image/Video-to-Video; Infinite time generation; Variable resolutions, aspect ratios, durations |
| [2024.03.18] HPC-AI Tech released Open-Sora 1.0, a fully open-source project for video generation. | ✅ VAE + STDiT training and inference |
| [2024.03.04] HPC-AI Tech Open-Sora provides training with 46% cost reduction [blog] | ✅ Parallel training on Ascend devices |

📦 Requirements

| mindspore | ascend driver | cann |
| --- | --- | --- |
| >=2.5.0 | >=24.0.0 | >=8.0.0.beta1 |

🎥 Demo

The following videos are generated based on MindSpore and Ascend Atlas 800T A2 machines.

OpenSora 2.0 Demo

3s 576×1024 5s 576×1024
00001.mp4
00005.mp4
Caption: A playful dog in a pink coat with a red leash dashes across a muddy field with sparse crops. The camera tracks its energetic movement from right to left against a backdrop of trees and distant power lines under an overcast sky. The realistic, medium shot captures a candid, lively moment in soft, diffused light.
Caption: A coastal landscape painting with a prominent archway is displayed on an easel in a bright studio. A camera pan reveals a table cluttered with art supplies and a potted plant, enhancing the artistic vibe. Large windows and soft natural lighting create a cozy, creative atmosphere.
00000.mp4
00004.mp4
Caption: Two women sit on a beige couch in a cozy, warmly lit room with a brick wall backdrop. They engage in a cheerful conversation, smiling and toasting red wine in an intimate medium shot.
Caption: A drone camera circles a historic church on a rocky outcrop along the Amalfi Coast, highlighting its stunning architecture, tiered patios, and the dramatic coastal views with waves crashing below and people enjoying the scene in the warm afternoon light.

Tip

To generate better-looking videos, you can try generating in two stages: Text-to-Image and then Image-to-Video.

OpenSora 1.2 Demo

Demo
4s 720×1280 4s 720×1280 4s 720×1280
000-A-Japanese-tram-glides-through-the-snowy-streets-of-a.mp4
006-a-close-up-shot-of-a-woman-standing-in-a-dimly.mp4
015-a-cozy-living-room-scene-with-a-christmas-tree-in.mp4

Tip

To generate better-looking videos, you can try generating in two stages: Text-to-Image and then Image-to-Video.

OpenSora 1.1 Demo

Demo

Text-to-Video

16x256x720 16x640x360
Snow falling over multiple houses and trees on winter landscape against night sky. christmas festivity and celebration concept Snow falling over multiple houses and trees on winter landscape against night sky. christmas festivity and celebration concept
Time-Lapse Milky Way above the Mountain Time-Lapse Milky Way above the Mountain
Time Lapse of the rising sun over a tree in an open rural landscape, with clouds in the blue sky beautifully playing with the rays of light A large orange octopus is seen resting on the bottom of the ocean floor, blending in with the sandy and rocky terrain. Its tentacles are spread out around its body, and its eyes are closed. The octopus is unaware of a king crab that is crawling towards it from behind a rock, its claws raised and ready to attack. The crab is brown and spiny, with long legs and antennae. The scene is captured from a wide angle, showing the vastness and depth of the ocean. The water is clear and blue, with rays of sunlight filtering through. The shot is sharp and crisp, with a high dynamic range. The octopus and the crab are in focus, while the background is slightly blurred, creating a depth of field effect.
This close-up shot of a Victoria crowned pigeon showcases its striking blue plumage and red chest. Its crest is made of delicate, lacy feathers, while its eye is a striking red color. The bird’s head is tilted slightly to the side, giving the impression of it looking regal and majestic. The background is blurred, drawing attention to the bird’s striking appearance.

Image-to-Video

Input Output


a brown bear in the water with a fish in its mouth


a group of statues on the side of a building, camera pans right

Frame Interpolation

Start Frame End Frame Caption Output
A breathtaking sunrise scene.

Video Editing

Input Output


a snowy forest

Text-to-Image

Caption Output
Bright scene, aerial view,ancient city, fantasy, gorgeous light, mirror reflection, high detail, wide angle lens.
A small cactus with a happy face in the Sahara desert.

OpenSora 1.0 Demo

Demo
2s 512×512 2s 512×512 2s 512×512
009-A-serene-night-scene-in-a-forested-area -The-first 000-A-soaring-drone-footage-captures-the-majestic-beauty-of-a 001-A-majestic-beauty-of-a-waterfall-cascading-down-a-cliff
A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. A soaring drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff. The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall.
006-A-bustling-city-street-at-night,-filled-with-the-glow 002-A-vibrant-scene-of-a-snowy-mountain-landscape -The-sky 004-A-serene-underwater-scene-featuring-a-sea-turtle-swimming-through
A bustling city street at night, filled with the glow of car headlights and the ambient light of streetlights. [...] The vibrant beauty of a sunflower field. The sunflowers are arranged in neat rows, creating a sense of order and symmetry. [...] A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...]

Videos are downsampled to .gif for display. Click for original videos. Prompts are trimmed for display, see here for full prompts.

🔆 Features

  • 📍 Open-Sora 2.0 released. Model weights are available here. See the report for more details.

    • ✅ New backbone that is based on Flux.
    • ✅ Uses reduced patch size of 1 for better training stability and finer details in video generation.
    • ✅ Employs full attention and 3D RoPE.
    • ✅ Uses Deep Compression Autoencoder (DC-AE) for increased spatial compression of 32x with an increased number of 128 latent channels.
    • ✅ Employs two text encoders: T5, which captures complex textual semantics, and CLIP-Large, which improves alignment between text and visual concepts.
  • 📍 Open-Sora 1.2 released. Model weights are available here. See report 1.2 for more details.

    • ✅ Support rectified flow scheduling.
    • ✅ Support more conditioning including fps, aesthetic score, motion strength and camera motion.
    • ✅ Trained our 3D-VAE for temporal dimension compression.
  • 📍 Open-Sora 1.1 with the following features

    • ✅ Improved ST-DiT architecture includes Rotary Position Embedding (RoPE), QK Normalization, longer text length, etc.
    • ✅ Support image and video conditioning and video editing, and thus support animating images, connecting videos, etc.
    • ✅ Support training with any resolution, aspect ratio, and duration.
  • 📍 Open-Sora 1.0 with the following features

    • ✅ Text-to-video generation in 256x256 or 512x512 resolution and up to 64 frames.
    • ✅ Three-stage training: i) 16x256x256 video pretraining, ii) 16x512x512 video fine-tuning, and iii) 64x512x512 video fine-tuning.
    • ✅ Optimized training recipes for MindSpore+Ascend framework (see configs/opensora/train/xxx_ms.yaml)
    • ✅ Acceleration methods: flash attention, recompute (gradient checkpointing), data sink, mixed precision, and graph compilation.
    • ✅ Data parallelism + optimizer parallelism, allowing training on 300x512x512 videos.
View more
  • ✅ Following the findings in OpenSora, we also adopt the VAE from Stable Diffusion for video latent encoding.
  • ✅ We pick the STDiT model as our video diffusion transformer following the best practice in OpenSora.
  • ✅ Support T5 text conditioning.
View more
  • Evaluation pipeline.
  • Complete the data processing pipeline (including dense optical flow, aesthetics scores, text-image similarity, etc.).

Contents

Other useful documents and links are listed below.

Installation

  1. Please install MindSpore 2.7.0 according to the MindSpore official website and install CANN 8.1.RC1 as recommended by the official installation website.

  2. Install requirements

cd examples/opensora_hpcai
pip install -r requirements.txt

If the decord package is not available, try pip install eva-decord. On EulerOS, install ffmpeg and decord as follows.

Details
1. install ffmpeg 4, referring to https://ffmpeg.org/releases
    wget https://ffmpeg.org/releases/ffmpeg-4.0.1.tar.bz2 --no-check-certificate
    tar -xvf ffmpeg-4.0.1.tar.bz2
    mv ffmpeg-4.0.1 ffmpeg
    cd ffmpeg
    ./configure --enable-shared         # --enable-shared is needed for sharing libavcodec with decord
    make -j 64
    make install
2. install decord, referring to https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source
    git clone --recursive https://github.com/dmlc/decord
    cd decord
    rm -rf build && mkdir build && cd build
    cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
    make -j 64
    make install
    cd ../python
    python3 setup.py install --user

Model Weights

Open-Sora 2.0 Model Weights

| Model | Model Size | URL |
| --- | --- | --- |
| MMDiT | 11B | Download |
| HunyuanVideo VAE | 246M | Download |
| MMDiT with DC-AE | 11B | Download |
| DC-AE | 459M | Download |
| FLUX.1 [dev] | 11B | Download |
| FLUX.1 [dev] AE | 83M | Download |
| T5-XXL | 11B | Download |
| CLIP-Large | 428M | Download |

You can download the above models automatically using the following command:

python tools/download_convert_st.py "hpcai-tech/Open-Sora-v2"

If you encounter a certificate verify failed error, you can set --disable_ssl_verify to True.

Open-Sora 1.2 Model Weights

Instructions
| Model | Model size | Data | URL |
| --- | --- | --- | --- |
| STDiT3 (Diffusion) | 1.1B | 30M | Download |
| VAE | 384M | 3M | Download |

The weights above are automatically downloaded from Hugging Face during execution. Local .safetensors weights can also be used. Alternatively, you can use the following commands to convert the model weights to the MindSpore format.

Convert to the MindSpore format
  • Convert STDiT3 to MS checkpoint:
python tools/convert_pt2ms.py --src /path/to/OpenSora-STDiT-v3/model.safetensors --target models/opensora_stdit_v3.ckpt
  • Convert VAE to MS checkpoint:
python tools/convert_vae_3d.py --src /path/to/OpenSora-VAE-v1.2/model.safetensors --target models/OpenSora-VAE-v1.2/model.ckpt
  • The T5 model is identical to OpenSora 1.0 and can be downloaded and converted using the links below.

Open-Sora 1.1 Model Weights

Instructions
  • STDiT:

| Stage | Resolution | Model Size | Data | #iterations | URL |
| --- | --- | --- | --- | --- | --- |
| 2 | mainly 144p & 240p | 700M | 10M videos + 2M images | 100k | Download |
| 3 | 144p to 720p | 700M | 500K HQ videos + 1M images | 4k | Download |

The weights above are automatically downloaded from Hugging Face during execution. Local .safetensors weights can also be used. Alternatively, you can use the following command to convert the model weights to the MindSpore format.

python tools/convert_pt2ms.py --src /path/to/OpenSora-STDiT-v2-stage3/model.safetensors --target models/opensora_v1.1_stage3.ckpt
  • T5 and VAE models are identical to OpenSora 1.0 and can be downloaded and converted using the links below.

Open-Sora 1.0 Model Weights

Instructions

Please prepare the model checkpoints of T5, VAE, and STDiT and put them under models/ folder as follows.

  • T5: You can download and convert the T5 model automatically by running the following command:

    python tools/download_convert_st.py "DeepFloyd/t5-v1_1-xxl"

    If you encounter a certificate verify failed error, you can set --disable_ssl_verify to True.

  • VAE: The model weights are automatically downloaded from Hugging Face during execution.
    Local .safetensors weights can also be used.
    Alternatively, you can use the following command to convert the model weights to the MindSpore format. First, download the .safetensor checkpoint from here. Then execute the following command:

    python tools/convert_vae.py --src /path/to/sd-vae-ft-ema/diffusion_pytorch_model.safetensors --target models/sd-vae-ft-ema.ckpt
  • STDiT: Download OpenSora-v1-16x256x256.pth / OpenSora-v1-HQ-16x256x256.pth / OpenSora-v1-HQ-16x512x512.pth from here

    Convert to ms checkpoint:

    python tools/convert_pt2ms.py --src /path/to/OpenSora-v1-16x256x256.pth --target models/OpenSora-v1-16x256x256.ckpt
    

    Training orders: 16x256x256 $\rightarrow$ 16x256x256 HQ $\rightarrow$ 16x512x512 HQ.

    These model weights are partially initialized from PixArt-α. The number of parameters is 724M. More information about training can be found in HPC-AI Tech's report. More about the dataset can be found in datasets.md from HPC-AI Tech. HQ means high quality.

  • PixArt-α: Download the pth checkpoint from here (for training only)

    Convert to ms checkpoint:

    python tools/convert_pt2ms.py --src /path/to/PixArt-XL-2-512x512.pth --target models/PixArt-XL-2-512x512.ckpt
    

Inference

Open-Sora 2.0 Command Line Inference

Text-to-Video Generation

First, you will need to generate text embeddings with:

# T5
TRANSFORMERS_OFFLINE=1 python scripts/v2.0/text_embedding.py \
--model.from_pretrained="DeepFloyd/t5-v1_1-xxl" \
--model.max_length=512 \
--prompts_file=YOUR_PROMPTS.txt \
--output_path=assets/texts/t5_512
# CLIP-Large
TRANSFORMERS_OFFLINE=1 python scripts/v2.0/text_embedding.py \
--model.from_pretrained="openai/clip-vit-large-patch14" \
--model.max_length=77 \
--prompts_file=YOUR_PROMPTS.txt \
--output_path=assets/texts/clip_77

Repeat the same for negative prompts.

Then, you can generate videos by running the following command:

python scripts/v2.0/inference_v2.py --config=configs/opensora-v2-0/inference/256px.yaml \
text_emb.t5_dir=assets/texts/t5_512 \
text_emb.neg_t5_dir=assets/texts/t5_512_neg \
text_emb.clip_dir=assets/texts/clip_77 \
text_emb.neg_clip_dir=assets/texts/clip_77_neg

Inference Performance

We evaluate the inference performance of text-to-video generation by measuring the average sampling time per step and the total sampling time of a video.

Experiments are conducted on Ascend Atlas 800T A2 machines with MindSpore >=2.6.0 in PyNative mode.

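| Model Name | Stage | Cards | Batch Size | Resolution | Precision | Step | s/image | s/video | Recipe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLUX.1 [dev] | T2I | 1 | 1 | 576 x 1024 | bf16 | 50 | 14.7s | - | yaml |
| OpenSora 2.0 | T/I2V | 1 | 1 | 129 x 192 x 336 | bf16 | 50 | - | 156s | yaml |
| OpenSora 2.0 | T/I2V | 1 | 1 | 77 x 576 x 1024 | bf16 | 50 | - | 1453s | yaml |
| OpenSora 2.0 | T/I2V | 1 | 1 | 129 x 576 x 1024 | bf16 | 50 | - | 4973s | yaml |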

Open-Sora 1.2 and 1.1 Command Line Inference

Instructions

Image/Video-to-Video Generation (supports text guidance)

# OSv1.2
python scripts/inference.py --config configs/opensora-v1-2/inference/sample_iv2v.yaml --ckpt_path /path/to/your/opensora-v1-2.ckpt
# OSv1.1
python scripts/inference.py --config configs/opensora-v1-1/inference/sample_iv2v.yaml --ckpt_path /path/to/your/opensora-v1-1.ckpt

For parallel inference, use mpirun or msrun and append --use_parallel=True to the inference command; see scripts/run/run_infer_os_v1.1_t2v_parallel.sh for reference.

In the sample_iv2v.yaml, provide such information as loop, condition_frame_length, captions, mask_strategy, and reference_path. See here for more details.

For inference with sequence parallelism using multiple NPUs in Open-Sora 1.2, please use msrun and append --use_parallel True and --enable_sequence_parallelism True to the inference script, referring to scripts/run/run_infer_sequence_parallel.sh. To further accelerate the inference speed, you can use DSP by appending --dsp True, referring to scripts/run/run_infer_sequence_parallel_dsp.sh.

Text-to-Video Generation

To generate a video from text, you can use sample_t2v.yaml or set --reference_path to an empty string '' when using sample_iv2v.yaml.

python scripts/inference.py --config configs/opensora-v1-1/inference/sample_t2v.yaml --ckpt_path /path/to/your/opensora-v1-1.ckpt

Inference Performance

We evaluate the inference performance of text-to-video generation by measuring the average sampling time per step and the total sampling time of a video.

All experiments are tested on Ascend Atlas 800T A2 machines with mindspore 2.7.0 graph mode.

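| model name | cards | batch size | resolution | jit level | precision | scheduler | step | graph compile | s/step | s/video | recipe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| STDiT2-XL/2 | 1 | 1 | 16x640x360 | O0 | bf16 | DDPM | 100 | 1~2 mins | 1.56 | 156.0 | yaml |
| STDiT3-XL/2 | 1 | 1 | 51x720x1280 | O0 | bf16 | RFlow | 30 | 1~2 mins | 4.83 | 155.4 | yaml |
| STDiT3-XL/2 | 1 | 1 | 102x720x1280 | O0 | bf16 | RFlow | 30 | 1~2 mins | 8.81 | 286.9 | yaml |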
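
The s/step and s/video columns can be related with simple arithmetic. The sketch below multiplies the step count by the time per step and reports what is left of the total video time. The helper name `sampling_breakdown` is hypothetical, and what the remainder covers (for example text encoding or VAE decoding) is an assumption that this README does not state.

```python
# (model, resolution, steps, seconds per step, seconds per video), copied from the table above.
ROWS = [
    ("STDiT2-XL/2", "16x640x360", 100, 1.56, 156.0),
    ("STDiT3-XL/2", "51x720x1280", 30, 4.83, 155.4),
    ("STDiT3-XL/2", "102x720x1280", 30, 8.81, 286.9),
]


def sampling_breakdown(steps, s_per_step, s_per_video):
    """Return (time spent in sampling steps, remaining time), rounded to 0.1 s."""
    sampling = round(steps * s_per_step, 1)
    remainder = round(s_per_video - sampling, 1)
    return sampling, remainder


if __name__ == "__main__":
    for model, resolution, steps, s_step, s_video in ROWS:
        sampling, remainder = sampling_breakdown(steps, s_step, s_video)
        print(f"{model} {resolution}: {sampling}s in steps, {remainder}s elsewhere")
```

For the DDPM row the two columns agree exactly, while the RFlow rows leave roughly 10 s and 23 s unaccounted for by the sampling steps alone.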

Open-Sora 1.0 Command Line Inference

Instructions

You can run text-to-video inference via the script scripts/inference.py as follows.

# Sample 16x256x256 videos
python scripts/inference.py --config configs/opensora/inference/stdit_256x256x16.yaml --ckpt_path models/OpenSora-v1-HQ-16x256x256.ckpt --prompt_path /path/to/prompt.txt

# Sample 16x512x512 videos
python scripts/inference.py --config configs/opensora/inference/stdit_512x512x16.yaml --ckpt_path models/OpenSora-v1-HQ-16x512x512.ckpt --prompt_path /path/to/prompt.txt

# Sample 64x512x512 videos
python scripts/inference.py --config configs/opensora/inference/stdit_512x512x64.yaml --ckpt_path /path/to/your/opensora-v1.ckpt --prompt_path /path/to/prompt.txt

For parallel inference, use mpirun or msrun and append --use_parallel=True to the inference command; see scripts/run/run_infer_t2v_parallel.sh for reference.

We also provide a three-stage sampling script run_sole_3stages.sh to reduce memory usage, which decomposes the whole pipeline into text embedding, text-to-video latent sampling, and VAE decoding.

For more usage on the inference script, please run python scripts/inference.py -h

Inference Performance

We evaluate the inference performance of text-to-video generation by measuring the average sampling time per step and the total sampling time of a video.

All experiments are tested on Ascend Atlas 800T A2 machines with mindspore 2.3.1 graph mode.

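| model name | cards | batch size | resolution | jit level | precision | scheduler | step | graph compile | s/step | s/video | recipe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| STDiT-XL/2 | 1 | 4 | 16x256x256 | O0 | fp32 | DDPM | 100 | 2~3 mins | 0.39 | 39.22 | yaml |
| STDiT-XL/2 | 1 | 1 | 16x512x512 | O0 | fp32 | DDPM | 100 | 2~3 mins | 1.85 | 185.00 | yaml |
| STDiT-XL/2 | 1 | 1 | 64x512x512 | O0 | bf16 | DDPM | 100 | 2~3 mins | 2.78 | 278.45 | yaml |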

⚠️ Note: When running parallel inference scripts under scripts/run/ on ModelArts, please unset RANK_TABLE_FILE before the inference starts.
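
If the parallel inference script is launched from Python rather than from a shell, the same effect as unsetting the variable can be had by passing a cleaned environment to the child process. This is a minimal sketch; the helper name `env_without_rank_table` is hypothetical and not part of this repository.

```python
import os


def env_without_rank_table(base_env=None):
    """Return a copy of the environment with RANK_TABLE_FILE removed."""
    env = dict(os.environ if base_env is None else base_env)
    env.pop("RANK_TABLE_FILE", None)  # no error if the variable is not set
    return env


if __name__ == "__main__":
    # The returned dict can be passed as `env=` to subprocess.run when
    # launching one of the scripts under scripts/run/.
    print("RANK_TABLE_FILE" in env_without_rank_table())
```

The original environment is left untouched, so other processes started from the same session still see the variable.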

Data Processing

Currently, we are developing the complete pipeline for data processing from raw videos to high-quality text-video pairs. We provide the data processing tools as follows.

View more

The text-video pair data should be organized as follows, for example.

.
├── video_caption.csv
├── video_folder
│   ├── part01
│   │   ├── vid001.mp4
│   │   ├── vid002.mp4
│   │   └── ...
│   └── part02
│       ├── vid001.mp4
│       ├── vid002.mp4
│       └── ...

The video_folder contains all the video files. The csv file video_caption.csv records the relative video path and its text caption in each line, as follows.

video,caption
video_folder/part01/vid001.mp4,a cartoon character is walking through
video_folder/part01/vid002.mp4,a red and white ball with an angry look on its face
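
Before caching embeddings or training, it can help to confirm that the CSV parses and that every listed video exists. The sketch below uses only the standard library. The function names are hypothetical, and `root` is assumed to be the directory that the relative paths in the `video` column are resolved against.

```python
import csv
from pathlib import Path


def load_video_captions(csv_path):
    """Read (video, caption) rows from a CSV with a `video,caption` header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [(row["video"], row["caption"]) for row in reader]


def find_missing_videos(csv_path, root):
    """Return the relative video paths in the CSV that do not exist under `root`."""
    root = Path(root)
    return [
        video
        for video, _caption in load_video_captions(csv_path)
        if not (root / video).is_file()
    ]


if __name__ == "__main__":
    # Hypothetical paths; replace them with your own dataset location.
    missing = find_missing_videos("video_caption.csv", ".")
    print(f"{len(missing)} video(s) listed in the CSV were not found")
```

Using `csv.DictReader` also handles captions that contain commas, provided they are quoted in the file.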

Cache Text Embeddings

For acceleration, we pre-compute the T5 embeddings before training STDiT.

python scripts/infer_t5.py \
    --csv_path /path/to/video_caption.csv \
    --output_path /path/to/text_embed_folder \
    --model_max_length 300     # 300 for OpenSora v1.2, 200 for OpenSora v1.1, 120 for OpenSora 1.0

OpenSora v1.0 uses a text embedding sequence length of 120 (by default). To generate text embeddings for OpenSora v1.1 or v1.2, change model_max_length to 200 or 300, respectively.

After running, the text embedding for each caption is saved as an npz file in output_path. Please change csv_path to point to your video-caption annotation file.
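
Since the sequence length depends on the Open-Sora version, one way to avoid mixing them up is to derive the command from a small lookup table. The sketch below only assembles the argument list using the flags shown above. The helper name `build_infer_t5_command` and the version keys are hypothetical.

```python
import sys

# Sequence lengths stated in this README for each Open-Sora version.
MODEL_MAX_LENGTH = {"1.0": 120, "1.1": 200, "1.2": 300}


def build_infer_t5_command(version, csv_path, output_path):
    """Assemble the scripts/infer_t5.py command line for a given version."""
    if version not in MODEL_MAX_LENGTH:
        raise ValueError(f"unsupported Open-Sora version: {version!r}")
    return [
        sys.executable, "scripts/infer_t5.py",
        "--csv_path", str(csv_path),
        "--output_path", str(output_path),
        "--model_max_length", str(MODEL_MAX_LENGTH[version]),
    ]


if __name__ == "__main__":
    cmd = build_infer_t5_command("1.2", "/path/to/video_caption.csv", "/path/to/text_embed_folder")
    print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the `examples/opensora_hpcai` directory.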

Cache Video Embedding (Optional)

If the storage budget is sufficient, you may also cache the video embedding by

python scripts/infer_vae.py \
    --csv_path /path/to/video_caption.csv  \
    --video_folder /path/to/video_folder  \
    --output_path /path/to/video_embed_folder  \
    --vae_checkpoint models/sd-vae-ft-ema.ckpt \
    --image_size 512

For parallel running, please refer to scripts/run/run_infer_vae_parallel.sh.

For more usage, please check python scripts/infer_vae.py -h

After running, the VAE latents for each video are saved as an npz file in output_path.

Finally, the training data should be organized as follows.

.
├── video_caption.csv
├── video_folder
│   ├── part01
│   │   ├── vid001.mp4
│   │   ├── vid002.mp4
│   │   └── ...
│   └── part02
│       ├── vid001.mp4
│       ├── vid002.mp4
│       └── ...
├── text_embed_folder
│   ├── part01
│   │   ├── vid001.npz
│   │   ├── vid002.npz
│   │   └── ...
│   └── part02
│       ├── vid001.npz
│       ├── vid002.npz
│       └── ...
├── video_embed_folder  # optional
│   ├── part01
│   │   ├── vid001.npz
│   │   ├── vid002.npz
│   │   └── ...
│   └── part02
│       ├── vid001.npz
│       ├── vid002.npz
│       └── ...
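
In the layout above, each embedding file mirrors the sub-path of its video with the extension changed to .npz. The sketch below expresses that mapping as a function. It is inferred from the directory tree only, so the lookup logic in the training scripts may differ, and the helper name `embedding_path` is hypothetical.

```python
from pathlib import PurePosixPath


def embedding_path(video_rel_path, embed_folder, video_folder="video_folder"):
    """Map video_folder/part01/vid001.mp4 to <embed_folder>/part01/vid001.npz."""
    rel = PurePosixPath(video_rel_path)
    # Drop the leading video folder component if the CSV path includes it.
    if rel.parts and rel.parts[0] == video_folder:
        rel = PurePosixPath(*rel.parts[1:])
    return str(PurePosixPath(embed_folder) / rel.with_suffix(".npz"))


if __name__ == "__main__":
    print(embedding_path("video_folder/part01/vid001.mp4", "text_embed_folder"))
    print(embedding_path("video_folder/part02/vid002.mp4", "video_embed_folder"))
```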

Each npz file contains data for the following keys:

  • latent_mean: mean of vae latent distribution
  • latent_std: std of vae latent distribution
  • fps: video fps
  • ori_size: original size (h, w) of the video

After caching the VAE latents, you can use them for STDiT training by passing --vae_latent_folder=/path/to/video_embed_folder to the training script python train.py.

Cache VAE for multi-resolutions (for OpenSora 1.1)

If there are multiple folders named in latent_{h}x{w} format under the --vae_latent_folder folder (which is passed to train.py), one of the resolutions will be selected randomly during training. For example:

video_embed_folder
   ├── latent_576x1024
   │   ├── vid001.npz
   │   ├── vid002.npz
   │   └── ...
   └── latent_1024x576
       ├── vid001.npz
       ├── vid002.npz
       └── ...
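
The folder naming convention can be illustrated with a short script that discovers the `latent_{h}x{w}` sub-folders and picks one at random. This is a sketch of the idea, not the logic used by train.py. The function names are hypothetical, and the uniform choice is an assumption.

```python
import random
import re
from pathlib import Path

LATENT_DIR_PATTERN = re.compile(r"^latent_(\d+)x(\d+)$")


def list_latent_resolutions(vae_latent_folder):
    """Return sorted (h, w) pairs for sub-folders named latent_{h}x{w}."""
    resolutions = []
    for child in Path(vae_latent_folder).iterdir():
        match = LATENT_DIR_PATTERN.match(child.name)
        if child.is_dir() and match:
            resolutions.append((int(match.group(1)), int(match.group(2))))
    return sorted(resolutions)


def pick_latent_folder(vae_latent_folder, rng=random):
    """Pick one resolution folder uniformly at random (the distribution is an assumption)."""
    h, w = rng.choice(list_latent_resolutions(vae_latent_folder))
    return Path(vae_latent_folder) / f"latent_{h}x{w}"


if __name__ == "__main__":
    # Hypothetical path; point it at your own cached latents.
    print(pick_latent_folder("/path/to/video_embed_folder"))
```

Folders that do not match the pattern are ignored, so unrelated files can sit next to the latent folders without affecting the selection.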

Training

Open-Sora 1.2

Once you prepare the data in a csv file, you may run the following commands to launch training on a single card.

# standalone training for stage 2
export MS_DEV_ENABLE_KERNEL_PACKET=on

python scripts/train.py --config configs/opensora-v1-2/train/train_stage2.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder

text_embed_folder is required and used to speed up the training. You can find the instructions on how to generate T5 embeddings here.

For parallel training, use msrun along with --use_parallel=True:

# distributed training for stage 2
export MS_DEV_ENABLE_KERNEL_PACKET=on

msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir  \
    python scripts/train.py --config configs/opensora-v1-2/train/train_stage2.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder \
    --use_parallel True

You can modify the training configuration, including hyper-parameters and data settings, in the yaml file specified by the --config argument.

Multi-Resolution Training

OpenSora v1.2 supports training with multiple resolutions, aspect ratios, and frames based on the bucket method.

To enable dynamic training for STDiT3, first set the bucket_config to fit your datasets and tasks. An example (from configs/opensora-v1-2/train/train_stage2.yaml) is

bucket_config:
  # Structure: "resolution": { num_frames: [ keep_prob, batch_size ] }
  "144p": { 1: [ 1.0, 475 ], 51: [ 1.0, 51 ], 102: [ [ 1.0, 0.33 ], 27 ], 204: [ [ 1.0, 0.1 ], 13 ], 408: [ [ 1.0, 0.1 ], 6 ] }
  "256": { 1: [ 0.4, 297 ], 51: [ 0.5, 20 ], 102: [ [ 0.5, 0.33 ], 10 ], 204: [ [ 0.5, 1.0 ], 5 ], 408: [ [ 0.5, 1.0 ], 2 ] }
  "240p": { 1: [ 0.3, 297 ], 51: [ 0.4, 20 ], 102: [ [ 0.4, 0.33 ], 10 ], 204: [ [ 0.4, 1.0 ], 5 ], 408: [ [ 0.4, 1.0 ], 2 ] }
  "360p": { 1: [ 0.5, 141 ], 51: [ 0.15, 8 ], 102: [ [ 0.3, 0.5 ], 4 ], 204: [ [ 0.3, 1.0 ], 2 ], 408: [ [ 0.5, 0.5 ], 1 ] }
  "512": { 1: [ 0.4, 141 ], 51: [ 0.15, 8 ], 102: [ [ 0.2, 0.4 ], 4 ], 204: [ [ 0.2, 1.0 ], 2 ], 408: [ [ 0.4, 0.5 ], 1 ] }
  "480p": { 1: [ 0.5, 89 ], 51: [ 0.2, 5 ], 102: [ 0.2, 2 ], 204: [ 0.1, 1 ] }
  "720p": { 1: [ 0.1, 36 ], 51: [ 0.03, 1 ] }
  "1024": { 1: [ 0.1, 36 ], 51: [ 0.02, 1 ] }
  "1080p": { 1: [ 0.01, 5 ] }
  "2048": { 1: [ 0.01, 5 ] }

Because the optimal bucket config can vary from device to device, we have tuned and provided bucket configs that are better balanced for Ascend + MindSpore in configs/opensora-v1-2/train/{stage}_ms.yaml. You may use them for better training performance.

More details on the bucket configuration can be found in Multi-resolution Training with Buckets.
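
To get a feel for how the batch sizes in a bucket config are balanced, the sketch below multiplies num_frames by batch_size for each bucket. It only reads the documented `{ num_frames: [ keep_prob, batch_size ] }` structure and deliberately ignores keep_prob, including the paired form such as `[ 1.0, 0.33 ]`, whose meaning is covered in the linked bucket documentation. The helper name `frames_per_batch` is hypothetical.

```python
# Two entries copied from the bucket_config example above.
BUCKET_CONFIG = {
    "144p": {1: [1.0, 475], 51: [1.0, 51], 102: [[1.0, 0.33], 27],
             204: [[1.0, 0.1], 13], 408: [[1.0, 0.1], 6]},
    "720p": {1: [0.1, 36], 51: [0.03, 1]},
}


def frames_per_batch(bucket_config):
    """Return {resolution: {num_frames: num_frames * batch_size}}."""
    return {
        resolution: {
            num_frames: num_frames * batch_size
            for num_frames, (_keep_prob, batch_size) in buckets.items()
        }
        for resolution, buckets in bucket_config.items()
    }


if __name__ == "__main__":
    for resolution, totals in frames_per_batch(BUCKET_CONFIG).items():
        print(resolution, totals)
```

For the 144p video buckets the products fall between 2448 and 2754 frames per batch, which suggests that batch sizes shrink as clips get longer so that the per-step workload stays in a similar range.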

The instructions for launching the dynamic training task are similar to those in the previous section. An example running script is scripts/run/run_train_os1.2_stage2.sh.

Open-Sora 1.1

Instructions

Once you prepare the data in a csv file, you may run the following commands to launch training on a single card.

# standalone training for stage 1
python scripts/train.py --config configs/opensora-v1-1/train/train_stage1.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder \
    --vae_latent_folder /path/to/video_embed_folder

text_embed_folder and vae_latent_folder are optional and used to speed up the training. You can find more in T5 text embeddings and VAE Video Embeddings.

For parallel training, use msrun along with --use_parallel=True:

# distributed training for stage 1
msrun --master_port=8200 --worker_num=8 --local_worker_num=8 --log_dir=$output_dir  \
    python scripts/train.py --config configs/opensora-v1-1/train/train_stage1.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder \
    --vae_latent_folder /path/to/video_embed_folder \
    --use_parallel True

Multi-Resolution Training

OpenSora v1.1 supports training with multiple resolutions, aspect ratios, and a variable number of frames. This can be enabled in one of two ways:

  1. Provide variable sized VAE embeddings with the --vae_latent_folder option.
  2. Use bucket_config for training with videos in their original format. More on the bucket configuration can be found in Multi-resolution Training with Buckets.

For the detailed running command, refer to scripts/run/run_train_os_v1.1_stage2.sh.

Open-Sora 1.0 Training

Instructions

Once the training data including the T5 text embeddings is prepared, you can run the following commands to launch training.

# standalone training, 16x256x256
python scripts/train.py --config configs/opensora/train/stdit_256x256x16_ms.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder

To use the cached video embedding, please replace --video_folder with --video_embed_folder and pass the path to the video embedding folder.

For parallel training, please use msrun and pass --use_parallel=True

# 8 NPUs, 64x512x512
msrun --master_port=8200 --worker_num=8 --local_worker_num=8 --log_dir=$output_dir  \
    python scripts/train.py --config configs/opensora/train/stdit_512x512x64_ms.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder \
    --use_parallel True

To train in bfloat16 precision, please pass --global_bf16=True.

For more usage, please check python scripts/train.py -h. You may also see the example shell scripts in scripts/run for quick reference.

Evaluation

Open-Sora 1.2

Open-Sora 1.2 on MindSpore and Ascend Atlas 800T A2 machines supports video generation of 0s~16s, at 144p to 720p, with various aspect ratios. The supported configurations are listed below.

image 2s 4s 8s 16s
240p
360p
480p 🆗
720p 🆗 🆗

Here ✅ means that the data is seen during training, and 🆗 means that, although not trained, the model can run inference at that config. Inference for 🆗 requires sequence parallelism.

Training Performance

We evaluate the training performance of Open-Sora v1.2 on the MixKit dataset with high-resolution videos (1080P, duration 12s to 100s).

All experiments are tested on Ascend Atlas 800T A2 machines with mindspore 2.7.0 graph mode.

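| model name | cards | batch size | resolution | precision | jit level | graph compile | s/step | recipe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| STDiT3-XL/2 | 8 | 1 | 51x720x1280 | bf16 | O1 | 100 s | 11.24 | yaml |
| STDiT3-XL/2 | 8 | dynamic | stage 1 | bf16 | O1 | 14 mins | 13.17 | yaml |
| STDiT3-XL/2 | 8 | dynamic | stage 2 | bf16 | O1 | 14 mins | 26.04 | yaml |
| STDiT3-XL/2 | 8 | dynamic | stage 3 | bf16 | O1 | 14 mins | 27.83 | yaml |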

Note that the step time of dynamic training can be influenced by the resolution and duration distribution of the source videos.

To reproduce the above performance, you may refer to scripts/run/run_train_os1.2_720x1280x51.sh and scripts/run/run_train_os1.2_stage2.sh.

Below are some generation results after fine-tuning STDiT3 with Stage 2 bucket config on a mixkit subset, which contains 100 text-video pairs. The training set contains 80 1080P videos consisting of natural scenes, flowers, and pets. Here we show the text-to-video generation results on the test set.

480x854x204 480x854x204
019-The-video-begins-with-a-completely-black-screen.-which-quickly.mp4
009-The-video-features-a-person-in-a-white-lace-wedding.mp4
480x854x204 480x854x204
005-The-video-showcases-a-small-dog-with-a-light-brown.mp4
001-The-video-showcases-a-black-and-white-dog-engaging-in.mp4

Open-Sora 1.1

View more

Training Performance

We evaluate the training performance of Open-Sora v1.1 on a subset of the MixKit dataset.

All experiments are tested on Ascend Atlas 800T A2 machines with mindspore 2.3.1 graph mode.

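| model name | cards | batch size | resolution | vae cache | precision | sink | jit level | graph compile | s/step | recipe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| STDiT3-XL/2 | 8 | 1 | 16x512x512 | OFF | bf16 | OFF | O1 | 13 mins | 2.28 | yaml |
| STDiT3-XL/2 | 8 | 1 | 64x512x512 | OFF | bf16 | OFF | O1 | 13 mins | 8.57 | yaml |
| STDiT3-XL/2 | 8 | 1 | 24x576x1024 | OFF | bf16 | OFF | O1 | 13 mins | 8.55 | yaml |
| STDiT3-XL/2 | 8 | 1 | 64x576x1024 | ON | bf16 | OFF | O1 | 13 mins | 18.94 | yaml |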

vae cache: whether vae embedding is pre-computed and cached before training.

Note that T5 text embedding is pre-computed before training.

Here are some generation results after fine-tuning STDiT2 on a mixkit subset.

576x1024x48 576x1024x48
000-a-breathtaking-aerial-view-of-a-vast-landscape.-The-foreground.mp4
001-a-close-up-view-of-a-tree-branch-adorned-with-vibrant.mp4
576x1024x48 576x1024x48
005-a-serene-landscape.-bathed-in-the-soft-glow-of-daylight.mp4
003-a-vibrant-scene-dominated-by-a-cluster-of-pink-bougainvillea.mp4

Open-Sora 1.0

Training Performance

All experiments are tested on Ascend Atlas 800T A2 machines with mindspore 2.3.1 graph mode.

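| model name | cards | batch size | resolution | stage | precision | sink | jit level | graph compile | s/step | recipe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| STDiT-XL/2 | 8 | 3 | 16x256x256 | 1 | fp16 | ON | O1 | 5~6 mins | 1.53 | yaml |
| STDiT-XL/2 | 8 | 1 | 16x512x512 | 2 | fp16 | ON | O1 | 5~6 mins | 2.47 | yaml |
| STDiT-XL/2 | 8 | 1 | 64x512x512 | 3 | bf16 | ON | O1 | 5~6 mins | 8.52 | yaml |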

Loss Curves

Training loss curves

16x256x256 Pretraining Loss Curve: train_loss_256x256x16

16x512x512 HQ Training Loss Curve: train_loss_512x512x16

64x512x512 HQ Training Loss Curve: train_loss_512x512x64

Text-to-Video Generation after Fine-tuning

Here are some generation results after fine-tuning STDiT on a subset of WebVid dataset.

512x512x64 512x512x64 512x512x64
001-Cloudy-moscow-kremlin-time-lapse.mp4
003-The-girl-received-flowers-as-a-gift.-a-gift-for.mp4
004-A-baker-turns-freshly-baked-loaves-of-sourdough-bread.mp4

Quality Evaluation

For video generation quality evaluation, please refer to the original HPC-AI Tech evaluation doc.

VAE Training & Evaluation

A 3D-VAE pipeline consisting of a spatial VAE followed by a temporal VAE is trained in OpenSora v1.2. For more details, refer to the VAE Documentation.

Prepare Pretrained Weights

  • Download the pretrained VAE-2D checkpoint from PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers if you aim to train VAE-3D from spatial VAE initialization.

    Convert to ms checkpoint:

    python tools/convert_vae1.2.py --src /path/to/pixart_sigma_sdxlvae_T5_diffusers/vae/diffusion_pytorch_model.safetensors --target models/sdxl_vae.ckpt --from_vae2d
    
  • Download the pretrained VAE-3D checkpoint from hpcai-tech/OpenSora-VAE-v1.2 if you aim to train VAE-3D from the VAE-3D model pre-trained with 3 stages.

    Convert to ms checkpoint:

    python tools/convert_vae1.2.py --src /path/OpenSora-VAE-v1.2/models.safetensors --target models/OpenSora-VAE-v1.2/sdxl_vae.ckpt
    
  • Download lpips mindspore checkpoint from here and put it under 'models/'

Data Preprocess

Before VAE-3D training, prepare a csv annotation file for the training videos. The csv file lists the path of each video relative to the root video_folder. For example:

video
dance/vid001.mp4
dance/vid002.mp4
...

Taking UCF-101 for example, please download the UCF-101 dataset and extract it to datasets/UCF-101 folder. You can generate the csv annotation by running python tools/annotate_vae_ucf101.py. It will result in two csv files, datasets/ucf101_train.csv and datasets/ucf101_test.csv, for training and testing respectively.
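
For a custom dataset, an annotation file in the same one-column format can be produced with a few lines of Python. The sketch below is illustrative only: `write_video_csv` is a hypothetical helper, not part of this repository, and it assumes the videos are `.mp4` or `.avi` files located anywhere under `video_folder`.

```python
import csv
from pathlib import Path


def write_video_csv(video_folder, csv_path, extensions=(".mp4", ".avi")):
    """Write a one-column csv (header: video) of paths relative to video_folder.

    Hypothetical helper for illustration; not part of this repository.
    """
    root = Path(video_folder)
    # Collect video files recursively and sort them for a reproducible order.
    rel_paths = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video"])
        writer.writerows([rel] for rel in rel_paths)
    return rel_paths
```

For example, `write_video_csv("datasets/my_videos", "datasets/my_videos_train.csv")` would list every video under `datasets/my_videos`; the folder and file names here are placeholders.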

Training

# stage 1 training, 8 NPUs
msrun --worker_num=8 --local_worker_num=8 \
python scripts/train_vae.py --config configs/vae/train/stage1.yaml --use_parallel=True --csv_path datasets/ucf101_train.csv --video_folder datasets/UCF-101

# stage 2 training, 8 NPUs
msrun --worker_num=8 --local_worker_num=8 \
python scripts/train_vae.py --config configs/vae/train/stage2.yaml --use_parallel=True --csv_path datasets/ucf101_train.csv --video_folder datasets/UCF-101

# stage 3 training, 8 NPUs
msrun --worker_num=8 --local_worker_num=8 \
python scripts/train_vae.py --config configs/vae/train/stage3.yaml --use_parallel=True --csv_path datasets/ucf101_train.csv --video_folder datasets/UCF-101

You can change the csv_path and video_folder to train on your own data.

Performance Evaluation

To evaluate the VAE performance, you need to run VAE inference first to generate the videos, then calculate scores on the generated videos:

# video generation and evaluation
python scripts/inference_vae.py --ckpt_path /path/to/your_vae_ckpt --image_size 256 --num_frames=17 --csv_path datasets/ucf101_test.csv --video_folder datasets/UCF-101

You can change the csv_path and video_folder to evaluate on your own data.

Here, we report the training performance and evaluation results on the UCF-101 dataset.

All experiments are tested on Ascend Atlas 800T A2 machines with mindspore 2.3.1 graph mode.

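| model name | cards | batch size | resolution | precision | jit level | graph compile | s/step | PSNR | SSIM | recipe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VAE-3D | 8 | 1 | 17x256x256 | bf16 | O1 | 5 mins | 1.09 | 29.02 | 0.87 | yaml |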
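
For readers unfamiliar with the PSNR column, the standard definition is PSNR = 10 · log10(MAX² / MSE), where MSE is the mean squared error between the original and reconstructed pixels. The following dependency-free sketch computes it for two flat pixel sequences. It is only an illustration of the formula: `psnr` is a hypothetical helper, and how `scripts/inference_vae.py` computes and averages the score over frames and videos is not shown here and may differ.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Hypothetical helper for illustration; max_value is the largest possible
    pixel value (255 for 8-bit images, 1.0 for images scaled to [0, 1]).
    """
    if len(reference) == 0 or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the reconstruction is closer to the input.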

Note that for stage 3 we train with a mixed video and image strategy, i.e. --mixed_strategy=mixed_video_image, instead of a random number of frames (mixed_video_random). Random frame training will be supported in the future.

Long sequence training and inference (sequence parallel)

Training

We support training with the OpenSora v1.2 model using SP (Sequence Parallel) and DSP (Dynamic Sequence Parallel), handling up to 408 frames (~16 seconds) on 4 NPU* cards. Additionally, we have optimized the training speed by implementing micro-batch parallelism in the VAE’s spatial and temporal domains, achieving approximately a 20% speed boost. We evaluate the training performance using the MixKit dataset, which includes high-resolution videos (1080P, duration 12s to 100s). The training performance results are reported below.
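
As a rough intuition for sequence parallelism, each device holds only a slice of the sequence dimension instead of the full sequence. The sketch below splits a sequence length into contiguous, near-even index ranges, one per device. It is a conceptual illustration under simplifying assumptions: `shard_sequence` is a hypothetical helper, the actual implementation shards latent token tensors rather than raw frame indices, and it needs collective communication between devices that is not shown here.

```python
def shard_sequence(seq_len, num_ranks):
    """Return one (start, end) index range per rank, covering seq_len items.

    Contiguous, near-even split: the first seq_len % num_ranks ranks
    receive one extra item. Hypothetical helper for illustration.
    """
    if num_ranks <= 0:
        raise ValueError("num_ranks must be positive")
    base, extra = divmod(seq_len, num_ranks)
    ranges = []
    start = 0
    for rank in range(num_ranks):
        length = base + (1 if rank < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges
```

With a sequence of 408 items on 4 devices, every device would hold a slice of 102 items.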

All experiments are tested on Ascend Atlas 800T A2 machines with mindspore 2.4.0 graph mode.

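| model name | cards | batch size | resolution | sink | precision | jit level | graph compile | s/step | recipe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| STDiT3-XL/2 | 4 | 1 | 408x720x1280 | OFF | bf16 | O1 | 12 mins | 48.30 | script |
| STDiT3-XL/2 | 4 | 1 | 408x720x1280 | OFF | bf16 | O1 | 12 mins | 47.00 | script |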

To prevent the system from running out of memory, ensure you launch the training job on a server with sufficient memory. For 4P training, at least 400GB of memory is required.

Inference

We evaluate the inference performance of text-to-video generation by measuring the average sampling time per step and the total sampling time of a video.
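
The two quantities can be measured with a simple wall-clock timer around the sampling loop. The sketch below shows the idea. `time_sampling` and the `sample_step` callable are hypothetical names used for illustration, not part of the repository's scripts. The sketch also assumes that each call blocks until the device has finished the step and that graph compilation has already happened in a warm-up run, since compile time is reported separately.

```python
import time


def time_sampling(sample_step, num_steps):
    """Run sample_step(i) for num_steps steps and time the loop.

    Returns (average seconds per step, total seconds). Hypothetical helper;
    sample_step is assumed to block until the step has finished on the device.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    start = time.perf_counter()
    for i in range(num_steps):
        sample_step(i)
    total = time.perf_counter() - start
    return total / num_steps, total
```

With a 30-step scheduler, the total is 30 times the per-step average by construction, plus whatever is timed outside the loop, such as VAE decoding.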

All experiments are tested on Ascend Atlas 800T A2 machines with mindspore 2.4.0 graph mode.

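| model name | cards | batch size | resolution | precision | scheduler | steps | jit level | graph compile | s/step | s/video | recipe |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| STDiT3-XL/2 | 2 | 1 | 408x720x1280 | bf16 | RFlow | 30 | O0 | 1~2 mins | 26.03 | 780.00 | script |
| STDiT3-XL/2 | 2 | 1 | 408x720x1280 | bf16 | RFlow | 30 | O0 | 1~2 mins | 22.03 | 660.00 | script |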

Training and Inference Using the FiT-Like Pipeline

⚠️WARNING: This feature is experimental. The official version is under development.

We provide support for training Open-Sora 1.1 using the FiT-Like pipeline as an alternative solution for handling multi-resolution videos, in contrast to the bucketing strategy.

FiT-Like Training

To begin, we need to prepare the VAE (Variational Autoencoder) latents from multi-resolution videos. For instance, if you intend to train at a resolution of up to 512x512 pixels, please run

python scripts/infer_vae.py \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --output_path /path/to/video_embed_folder \
    --vae_checkpoint models/sd-vae-ft-ema.ckpt \
    --image_size 512 \
    --resize_by_max_value True \
    --vae-micro-batch-size 1 \
    --mode 1

The extracted VAE latent will be saved in the video embedding folder.

Then, to launch a distributed training with eight NPU cards, please run

msrun --worker_num=8 --local_worker_num=8  \
    scripts/train.py --config configs/opensora-v1-1/train/train_stage1_fit.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder \
    --vae_latent_folder /path/to/video_embed_folder \
    --use_parallel True \
    --max_image_size 512

We evaluated the training performance on MindSpore and Ascend NPUs. The results are as follows.

| Model | Context | Precision | BS | NPUs | Max. Size | Train T. (s/step) |
| --- | --- | --- | --- | --- | --- | --- |
| STDiT2-XL/2 | D910*-MS2.3_master | BF16 | 1 | 4 | 16x512x512 | 2.3 |

FiT-Like Inference

To sample a video at a resolution of 384x672 using the trained checkpoint, run

python scripts/inference_i2v.py --config configs/opensora-v1-1/inference/t2v_fit.yaml \
    --ckpt_path /path/to/your/opensora-v1-1.ckpt \
    --prompt_path /path/to/prompt.txt \
    --image_size 384 672 \
    --max_image_size 512

Make sure that the max_image_size parameter remains consistent between your training and inference commands.
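
The command above samples at 384x672 with max_image_size set to 512, so the limit is evidently not a cap on each side. One plausible reading, assumed here and not confirmed by this README, is that it bounds the total pixel count, i.e. height × width ≤ max_image_size². Under that assumption, a target resolution can be checked before launching inference with a hypothetical helper such as:

```python
def fits_max_image_size(height, width, max_image_size):
    """True if height x width stays within a max_image_size**2 pixel budget.

    Hypothetical helper; the area-based interpretation is an assumption.
    """
    return height * width <= max_image_size * max_image_size
```

Under this reading, 384x672 (258,048 pixels) fits within the 512x512 budget (262,144 pixels), whereas 720x1280 does not.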

Here are some generation results after fine-tuning STDiT on a small dataset:

384x672x16 672x384x16
001-a-breathtaking-view-of-a-mountainous-landscape.-From-a-high.mp4
000-a-close-up-view-of-a-branch-laden-with-white-flowers.mp4

Contribution

Thanks go to the MindSpore team for their support and to the OpenSora project for its open-source contributions.

If you wish to contribute to this project, you can refer to the Contribution Guideline.

Acknowledgement

  • ColossalAI: A powerful large model parallel acceleration and optimization system.
  • DiT: Scalable Diffusion Models with Transformers.
  • OpenDiT: An acceleration for DiT training. We adopt valuable acceleration strategies for training progress from OpenDiT.
  • PixArt: An open-source DiT-based text-to-image model.
  • Flux: A powerful text-to-image generation model.
  • Latte: An attempt to efficiently train DiT for video.
  • HunyuanVideo: Open-Source text-to-video model.
  • StabilityAI VAE: A powerful image VAE model.
  • DC-AE: Deep Compression AutoEncoder for image compression.
  • CLIP: A powerful text-image embedding model.
  • T5: A powerful text encoder.
  • LLaVA: A powerful image captioning model based on Mistral-7B and Yi-34B.
  • PLLaVA: A powerful video captioning model.
  • DSP: Dynamic Sequence Parallel introduced by NUS HPC AI Lab.
  • MiraData: A large-scale video dataset with long durations and structured caption.
@article{opensora,
  title={Open-sora: Democratizing efficient video production for all},
  author={Zheng, Zangwei and Peng, Xiangyu and Yang, Tianji and Shen, Chenhui and Li, Shenggui and Liu, Hongxin and Zhou, Yukun and Li, Tianyi and You, Yang},
  journal={arXiv preprint arXiv:2412.20404},
  year={2024}
}

@article{opensora2,
    title={Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k},
    author={Xiangyu Peng and Zangwei Zheng and Chenhui Shen and Tom Young and Xinying Guo and Binluo Wang and Hang Xu and Hongxin Liu and Mingyan Jiang and Wenjun Li and Yuhui Wang and Anbang Ye and Gang Ren and Qianran Ma and Wanying Liang and Xiang Lian and Xiwen Wu and Yuting Zhong and Zhuangyan Li and Chaoyu Gong and Guojun Lei and Leijun Cheng and Limin Zhang and Minghao Li and Ruijie Zhang and Silan Hu and Shijie Huang and Xiaokang Wang and Yuanheng Zhao and Yuqi Wang and Ziang Wei and Yang You},
    year={2025},
    journal={arXiv preprint arXiv:2503.09642},
}

We are grateful for their exceptional work and generous contribution to open source.