Here we provide an efficient MindSpore implementation of OpenSora, an open-source project that aims to foster innovation, creativity, and inclusivity within the field of content creation.
This repository is built on the models and code released by HPC-AI Tech. We are grateful for their exceptional work and generous contribution to open source.
| Official News from HPC-AI Tech | MindSpore Support |
|---|---|
| [2025.03.12] 🔥 We released Open-Sora 2.0 (11B). 🎬 11B model achieves on-par performance with 11B HunyuanVideo & 30B Step-Video on 📐VBench & 📊Human Preference. 🛠️ Fully open-source: checkpoints and training codes for training with only $200K. [report] | Inference |
| [2024.06.17] 🔥 HPC-AI released Open-Sora 1.2, which includes 3D-VAE, rectified flow, and score condition. The video quality is greatly improved. [checkpoints] [report] [blog] | Text-to-Video |
| [2024.04.25] 🤗 HPC-AI Tech released the Gradio demo for Open-Sora on Hugging Face Spaces. | N.A. |
| [2024.04.25] 🔥 HPC-AI Tech released Open-Sora 1.1, which supports 2s~15s videos, 144p to 720p, any aspect ratio, text-to-image, text-to-video, image-to-video, video-to-video, and infinite-time generation. In addition, a full video processing pipeline is released. [checkpoints] [report] | Image/Video-to-Video; Infinite time generation; Variable resolutions, aspect ratios, durations |
| [2024.03.18] HPC-AI Tech released Open-Sora 1.0, a fully open-source project for video generation. | ✅ VAE + STDiT training and inference |
| [2024.03.04] HPC-AI Tech Open-Sora provides training with 46% cost reduction [blog] | ✅ Parallel training on Ascend devices |
| mindspore | ascend driver | cann |
|---|---|---|
| >=2.5.0 | >=24.0.0 | >=8.0.0.beta1 |
The following videos were generated with MindSpore on Ascend Atlas 800T A2 machines.
| 3s 576×1024 | 5s 576×1024 |
|---|---|
| 00001.mp4 | 00005.mp4 |
| Caption: A playful dog in a pink coat with a red leash dashes across a muddy field with sparse crops. The camera tracks its energetic movement from right to left against a backdrop of trees and distant power lines under an overcast sky. The realistic, medium shot captures a candid, lively moment in soft, diffused light. | Caption: A coastal landscape painting with a prominent archway is displayed on an easel in a bright studio. A camera pan reveals a table cluttered with art supplies and a potted plant, enhancing the artistic vibe. Large windows and soft natural lighting create a cozy, creative atmosphere. |
| 00000.mp4 | 00004.mp4 |
| Caption: Two women sit on a beige couch in a cozy, warmly lit room with a brick wall backdrop. They engage in a cheerful conversation, smiling and toasting red wine in an intimate medium shot. | Caption: A drone camera circles a historic church on a rocky outcrop along the Amalfi Coast, highlighting its stunning architecture, tiered patios, and the dramatic coastal views with waves crashing below and people enjoying the scene in the warm afternoon light. |
> [!TIP]
> To generate better-looking videos, you can try generating in two stages: Text-to-Image and then Image-to-Video.
Demo
| 4s 720×1280 | 4s 720×1280 | 4s 720×1280 |
|---|---|---|
| 000-A-Japanese-tram-glides-through-the-snowy-streets-of-a.mp4 | 006-a-close-up-shot-of-a-woman-standing-in-a-dimly.mp4 | 015-a-cozy-living-room-scene-with-a-christmas-tree-in.mp4 |
> [!TIP]
> To generate better-looking videos, you can try generating in two stages: Text-to-Image and then Image-to-Video.
Demo
| Input | Output |
|---|---|
| ![]() |  |
| ![]() |  |

| Start Frame | End Frame | Caption | Output |
|---|---|---|---|
| ![]() | ![]() | A breathtaking sunrise scene. | ![]() |

| Input | Output |
|---|---|
| ![]() |  |

| Caption | Output |
|---|---|
| Bright scene, aerial view, ancient city, fantasy, gorgeous light, mirror reflection, high detail, wide angle lens. | ![]() |
| A small cactus with a happy face in the Sahara desert. |  |
Demo
Videos are downsampled to .gif for display. Click for the original videos. Prompts are trimmed for display; see here for full prompts.
- 📍 Open-Sora 2.0 released. Model weights are available here. See the report for more details.
  - ✅ New backbone based on Flux.
  - ✅ Reduced patch size of 1 for better training stability and finer details in video generation.
  - ✅ Full attention and 3D RoPE.
  - ✅ Deep Compression Autoencoder (DC-AE) for a higher spatial compression of 32x with 128 latent channels.
  - ✅ Two text encoders: T5, which captures complex textual semantics, and CLIP-Large, which improves alignment between text and visual concepts.
- 📍 Open-Sora 1.2 released. Model weights are available here. See report 1.2 for more details.
  - ✅ Supports rectified flow scheduling.
  - ✅ Supports more conditioning, including fps, aesthetic score, motion strength, and camera motion.
  - ✅ Trained a 3D-VAE for temporal dimension compression.
- 📍 Open-Sora 1.1 released with the following features:
  - ✅ Improved ST-DiT architecture with rotary position embedding (RoPE), QK normalization, longer text length, etc.
  - ✅ Supports image and video conditioning and video editing, enabling image animation, video connection, etc.
  - ✅ Supports training with any resolution, aspect ratio, and duration.
- 📍 Open-Sora 1.0 released with the following features:
  - ✅ Text-to-video generation in 256x256 or 512x512 resolution with up to 64 frames.
  - ✅ Three-stage training: i) 16x256x256 video pretraining, ii) 16x512x512 video fine-tuning, and iii) 64x512x512 video fine-tuning.
  - ✅ Optimized training recipes for the MindSpore+Ascend framework (see `configs/opensora/train/xxx_ms.yaml`).
  - ✅ Acceleration methods: flash attention, recompute (gradient checkpointing), data sink, mixed precision, and graph compilation.
  - ✅ Data parallelism + optimizer parallelism, allowing training on 300x512x512 videos.
View more
- ✅ Following the findings in OpenSora, we also adopt the VAE from Stable Diffusion for video latent encoding.
- ✅ We adopt the STDiT model as our video diffusion transformer, following the best practice in OpenSora.
- ✅ Supports T5 text conditioning.
View more
- Evaluation pipeline.
- Complete the data processing pipeline (including dense optical flow, aesthetics scores, text-image similarity, etc.).
- Installation
- Model Weights
- Inference
- Data Processing
- Training
- Evaluation
- VAE Training & Evaluation
- Long sequence training and inference (sequence parallel)
- Contribution
- Acknowledgement
Other useful documents and links are listed below.
- Repo structure: structure.md
- Please install MindSpore 2.7.0 according to the MindSpore official website and install CANN 8.1.RC1 as recommended by the official installation website.

- Install the requirements:

  ```shell
  cd examples/opensora_hpcai
  pip install -r requirements.txt
  ```

  In case the decord package is not available, try `pip install eva-decord`.
For EulerOS, instructions on ffmpeg and decord installation are as follows.
Details
1. Install ffmpeg 4, referring to https://ffmpeg.org/releases

   ```shell
   wget https://ffmpeg.org/releases/ffmpeg-4.0.1.tar.bz2 --no-check-certificate
   tar -xvf ffmpeg-4.0.1.tar.bz2
   mv ffmpeg-4.0.1 ffmpeg
   cd ffmpeg
   ./configure --enable-shared  # --enable-shared is needed for sharing libavcodec with decord
   make -j 64
   make install
   ```

2. Install decord, referring to https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source

   ```shell
   git clone --recursive https://github.com/dmlc/decord
   cd decord
   rm -rf build && mkdir build && cd build
   cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
   make -j 64
   make install
   cd ../python
   python3 setup.py install --user
   ```
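Optionally, you can verify the installation with MindSpore's built-in check (a generic sanity check, not part of this repo):

```shell
python -c "import mindspore; mindspore.run_check()"
```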
| Model | Model Size | URL |
|---|---|---|
| MMDiT | 11B | Download |
| HunyuanVideo VAE | 246M | Download |
| MMDiT with DC-AE | 11B | Download |
| DC-AE | 459M | Download |
| FLUX.1 [dev] | 11B | Download |
| FLUX.1 [dev] AE | 83M | Download |
| T5-XXL | 11B | Download |
| CLIP-Large | 428M | Download |
You can download the above models automatically using the following command:
```shell
python tools/download_convert_st.py "hpcai-tech/Open-Sora-v2"
```

If you encounter a `certificate verify failed` error, you can set `--disable_ssl_verify` to `True`.
Instructions
| Model | Model size | Data | URL |
|---|---|---|---|
| STDiT3 (Diffusion) | 1.1B | 30M | Download |
| VAE | 384M | 3M | Download |
The weights above are automatically downloaded from Hugging Face during execution.
Local .safetensors weights can also be used.
Alternatively, you can use the following commands to convert the model weights to the MindSpore format.
Convert to the MindSpore format
- Convert STDiT3 to a MindSpore checkpoint:

  ```shell
  python tools/convert_pt2ms.py --src /path/to/OpenSora-STDiT-v3/model.safetensors --target models/opensora_stdit_v3.ckpt
  ```

- Convert VAE to a MindSpore checkpoint:

  ```shell
  python tools/convert_vae_3d.py --src /path/to/OpenSora-VAE-v1.2/model.safetensors --target models/OpenSora-VAE-v1.2/model.ckpt
  ```

- The T5 model is identical to OpenSora 1.0 and can be downloaded and converted using the links below.
Instructions
- STDiT:
| Stage | Resolution | Model Size | Data | #iterations | URL |
|---|---|---|---|---|---|
| 2 | mainly 144p & 240p | 700M | 10M videos + 2M images | 100k | Download |
| 3 | 144p to 720p | 700M | 500K HQ videos + 1M images | 4k | Download |
The weights above are automatically downloaded from Hugging Face during execution.
Local .safetensors weights can also be used.
Alternatively, you can use the following command to convert the model weights to the MindSpore format.
```shell
python tools/convert_pt2ms.py --src /path/to/OpenSora-STDiT-v2-stage3/model.safetensors --target models/opensora_v1.1_stage3.ckpt
```

- T5 and VAE models are identical to OpenSora 1.0 and can be downloaded and converted using the links below.
Instructions
Please prepare the model checkpoints of T5, VAE, and STDiT and put them under the `models/` folder as follows.
- T5: You can download and convert the T5 model automatically by running the following command:

  ```shell
  python tools/download_convert_st.py "DeepFloyd/t5-v1_1-xxl"
  ```

  If you encounter a `certificate verify failed` error, you can set `--disable_ssl_verify` to `True`.

- VAE: The model weights are automatically downloaded from Hugging Face during execution. Local `.safetensors` weights can also be used.

  Alternatively, you can convert the model weights to the MindSpore format. First, download the `.safetensors` checkpoint from here, then run:

  ```shell
  python tools/convert_vae.py --src /path/to/sd-vae-ft-ema/diffusion_pytorch_model.safetensors --target models/sd-vae-ft-ema.ckpt
  ```

- STDiT: Download `OpenSora-v1-16x256x256.pth` / `OpenSora-v1-HQ-16x256x256.pth` / `OpenSora-v1-HQ-16x512x512.pth` from here, then convert to a MindSpore checkpoint:

  ```shell
  python tools/convert_pt2ms.py --src /path/to/OpenSora-v1-16x256x256.pth --target models/OpenSora-v1-16x256x256.ckpt
  ```

  Training order: 16x256x256 $\rightarrow$ 16x256x256 HQ $\rightarrow$ 16x512x512 HQ. These model weights are partially initialized from PixArt-α. The number of parameters is 724M. More information about training can be found in HPC-AI Tech's report. More about the dataset can be found in datasets.md from HPC-AI Tech. HQ means high quality.

- PixArt-α: Download the pth checkpoint from here (for training only) and convert to a MindSpore checkpoint:

  ```shell
  python tools/convert_pt2ms.py --src /path/to/PixArt-XL-2-512x512.pth --target models/PixArt-XL-2-512x512.ckpt
  ```
First, you will need to generate text embeddings with:
```shell
# T5
TRANSFORMERS_OFFLINE=1 python scripts/v2.0/text_embedding.py \
  --model.from_pretrained="DeepFloyd/t5-v1_1-xxl" \
  --model.max_length=512 \
  --prompts_file=YOUR_PROMPTS.txt \
  --output_path=assets/texts/t5_512

# CLIP-Large
TRANSFORMERS_OFFLINE=1 python scripts/v2.0/text_embedding.py \
  --model.from_pretrained="openai/clip-vit-large-patch14" \
  --model.max_length=77 \
  --prompts_file=YOUR_PROMPTS.txt \
  --output_path=assets/texts/clip_77
```

Repeat the same for negative prompts.
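For reference, the negative-prompt pass could look like the sketch below so that the `*_neg` embedding folders used by the inference command exist. `YOUR_NEG_PROMPTS.txt` is a hypothetical file containing your negative prompts.

```shell
# T5 embeddings for the negative prompts
TRANSFORMERS_OFFLINE=1 python scripts/v2.0/text_embedding.py \
  --model.from_pretrained="DeepFloyd/t5-v1_1-xxl" \
  --model.max_length=512 \
  --prompts_file=YOUR_NEG_PROMPTS.txt \
  --output_path=assets/texts/t5_512_neg

# CLIP-Large embeddings for the negative prompts
TRANSFORMERS_OFFLINE=1 python scripts/v2.0/text_embedding.py \
  --model.from_pretrained="openai/clip-vit-large-patch14" \
  --model.max_length=77 \
  --prompts_file=YOUR_NEG_PROMPTS.txt \
  --output_path=assets/texts/clip_77_neg
```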
Then, you can generate videos by running the following command:
```shell
python scripts/v2.0/inference_v2.py --config=configs/opensora-v2-0/inference/256px.yaml \
  text_emb.t5_dir=assets/texts/t5_512 \
  text_emb.neg_t5_dir=assets/texts/t5_512_neg \
  text_emb.clip_dir=assets/texts/clip_77 \
  text_emb.neg_clip_dir=assets/texts/clip_77_neg
```

We evaluate the inference performance of text-to-video generation by measuring the average sampling time per step and the total sampling time of a video.
Experiments are conducted on Ascend Atlas 800T A2 machines with MindSpore >=2.6.0 in PyNative mode.
| Model Name | Stage | Cards | Batch Size | Resolution | Precision | Step | s/image | s/video | Recipe |
|---|---|---|---|---|---|---|---|---|---|
| FLUX.1 [dev] | T2I | 1 | 1 | 576 x 1024 | bf16 | 50 | 14.7s | - | yaml |
| OpenSora 2.0 | T/I2V | 1 | 1 | 129 x 192 x 336 | bf16 | 50 | - | 156s | yaml |
| OpenSora 2.0 | T/I2V | 1 | 1 | 77 x 576 x 1024 | bf16 | 50 | - | 1453s | yaml |
| OpenSora 2.0 | T/I2V | 1 | 1 | 129 x 576 x 1024 | bf16 | 50 | - | 4973s | yaml |
Instructions
```shell
# OSv1.2
python scripts/inference.py --config configs/opensora-v1-2/inference/sample_iv2v.yaml --ckpt_path /path/to/your/opensora-v1-2.ckpt

# OSv1.1
python scripts/inference.py --config configs/opensora-v1-1/inference/sample_iv2v.yaml --ckpt_path /path/to/your/opensora-v1-1.ckpt
```

For parallel inference, please use `mpirun` or `msrun` and append `--use_parallel=True` to the inference script, referring to `scripts/run/run_infer_os_v1.1_t2v_parallel.sh`. A sketch is shown below.
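For example, an 8-card launch with `msrun` might look like this (a sketch; the checkpoint path and log directory are placeholders, and `scripts/run/run_infer_os_v1.1_t2v_parallel.sh` remains the maintained reference):

```shell
msrun --worker_num=8 --local_worker_num=8 --log_dir=logs/parallel_infer \
    python scripts/inference.py --config configs/opensora-v1-2/inference/sample_iv2v.yaml \
    --ckpt_path /path/to/your/opensora-v1-2.ckpt \
    --use_parallel=True
```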
In `sample_iv2v.yaml`, provide information such as `loop`, `condition_frame_length`, `captions`, `mask_strategy`,
and `reference_path`.
See here for more details.
For inference with sequence parallelism using multiple NPUs in Open-Sora 1.2, please use `msrun` and append `--use_parallel True` and `--enable_sequence_parallelism True` to the inference script, referring to `scripts/run/run_infer_sequence_parallel.sh`. To further accelerate inference, you can use DSP by appending `--dsp True`, referring to `scripts/run/run_infer_sequence_parallel_dsp.sh`.
To generate a video from text, you can use sample_t2v.yaml or set --reference_path to an empty string ''
when using sample_iv2v.yaml.
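The second option, reusing the I2V config with an empty reference, could look like this (a sketch; the checkpoint path is a placeholder):

```shell
python scripts/inference.py --config configs/opensora-v1-1/inference/sample_iv2v.yaml \
    --ckpt_path /path/to/your/opensora-v1-1.ckpt \
    --reference_path ''
```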
With `sample_t2v.yaml`:

```shell
python scripts/inference.py --config configs/opensora-v1-1/inference/sample_t2v.yaml --ckpt_path /path/to/your/opensora-v1-1.ckpt
```

We evaluate the inference performance of text-to-video generation by measuring the average sampling time per step and the total sampling time of a video.
All experiments are tested on Ascend Atlas 800T A2 machines with MindSpore 2.7.0 in graph mode.
| model name | cards | batch size | resolution | jit level | precision | scheduler | step | graph compile | s/step | s/video | recipe |
|---|---|---|---|---|---|---|---|---|---|---|---|
| STDiT2-XL/2 | 1 | 1 | 16x640x360 | O0 | bf16 | DDPM | 100 | 1~2 mins | 1.56 | 156.0 | yaml |
| STDiT3-XL/2 | 1 | 1 | 51x720x1280 | O0 | bf16 | RFlow | 30 | 1~2 mins | 4.83 | 155.4 | yaml |
| STDiT3-XL/2 | 1 | 1 | 102x720x1280 | O0 | bf16 | RFlow | 30 | 1~2 mins | 8.81 | 286.9 | yaml |
Instructions
You can run text-to-video inference via the script scripts/inference.py as follows.
```shell
# Sample 16x256x256 videos
python scripts/inference.py --config configs/opensora/inference/stdit_256x256x16.yaml --ckpt_path models/OpenSora-v1-HQ-16x256x256.ckpt --prompt_path /path/to/prompt.txt

# Sample 16x512x512 videos
python scripts/inference.py --config configs/opensora/inference/stdit_512x512x16.yaml --ckpt_path models/OpenSora-v1-HQ-16x512x512.ckpt --prompt_path /path/to/prompt.txt

# Sample 64x512x512 videos
python scripts/inference.py --config configs/opensora/inference/stdit_512x512x64.yaml --ckpt_path /path/to/your/opensora-v1.ckpt --prompt_path /path/to/prompt.txt
```

For parallel inference, please use `mpirun` or `msrun` and append `--use_parallel=True` to the inference script, referring to `scripts/run/run_infer_t2v_parallel.sh`.
We also provide a three-stage sampling script, `run_sole_3stages.sh`, to reduce memory usage; it decomposes the whole pipeline into text embedding, text-to-video latent sampling, and VAE decoding.
For more usage on the inference script, please run python scripts/inference.py -h
We evaluate the inference performance of text-to-video generation by measuring the average sampling time per step and the total sampling time of a video.
All experiments are tested on Ascend Atlas 800T A2 machines with MindSpore 2.3.1 in graph mode.
| model name | cards | batch size | resolution | jit level | precision | scheduler | step | graph compile | s/step | s/video | recipe |
|---|---|---|---|---|---|---|---|---|---|---|---|
| STDiT-XL/2 | 1 | 4 | 16x256x256 | O0 | fp32 | DDPM | 100 | 2~3 mins | 0.39 | 39.22 | yaml |
| STDiT-XL/2 | 1 | 1 | 16x512x512 | O0 | fp32 | DDPM | 100 | 2~3 mins | 1.85 | 185.00 | yaml |
| STDiT-XL/2 | 1 | 1 | 64x512x512 | O0 | bf16 | DDPM | 100 | 2~3 mins | 2.78 | 278.45 | yaml |
⚠️ Note: When running parallel inference scripts under `scripts/run/` on ModelArts, please `unset RANK_TABLE_FILE` before the inference starts.
Currently, we are developing the complete pipeline for data processing from raw videos to high-quality text-video pairs. We provide the data processing tools as follows.
View more
The text-video pair data should be organized as follows, for example.
.
├── video_caption.csv
├── video_folder
│ ├── part01
│ │ ├── vid001.mp4
│ │ ├── vid002.mp4
│ │ └── ...
│ └── part02
│ ├── vid001.mp4
│ ├── vid002.mp4
│ └── ...
The `video_folder` contains all the video files. The csv file `video_caption.csv` records the relative video path and its text caption on each line, as follows.
video,caption
video_folder/part01/vid001.mp4,a cartoon character is walking through
video_folder/part01/vid002.mp4,a red and white ball with an angry look on its face
For acceleration, we pre-compute the T5 embeddings before training STDiT.
```shell
python scripts/infer_t5.py \
    --csv_path /path/to/video_caption.csv \
    --output_path /path/to/text_embed_folder \
    --model_max_length 300  # 300 for OpenSora v1.2, 200 for OpenSora v1.1, 120 for OpenSora v1.0
```

OpenSora v1.0 uses a text embedding sequence length of 120 by default.
If you want to generate text embeddings for OpenSora v1.1, please change `model_max_length` to 200.
After running, the text embeddings, saved as an npz file per caption, will be in `output_path`. Please change `csv_path` to your video-caption annotation file accordingly.
If the storage budget is sufficient, you may also cache the video embeddings by running:

```shell
python scripts/infer_vae.py \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --output_path /path/to/video_embed_folder \
    --vae_checkpoint models/sd-vae-ft-ema.ckpt \
    --image_size 512
```

For parallel running, please refer to `scripts/run/run_infer_vae_parallel.sh`.
For more usage, please check python scripts/infer_vae.py -h
After running, the VAE latents, saved as an npz file per video, will be in `output_path`.
Finally, the training data should look as follows.
.
├── video_caption.csv
├── video_folder
│ ├── part01
│ │ ├── vid001.mp4
│ │ ├── vid002.mp4
│ │ └── ...
│ └── part02
│ ├── vid001.mp4
│ ├── vid002.mp4
│ └── ...
├── text_embed_folder
│ ├── part01
│ │ ├── vid001.npz
│ │ ├── vid002.npz
│ │ └── ...
│ └── part02
│ ├── vid001.npz
│ ├── vid002.npz
│ └── ...
├── video_embed_folder # optional
│ ├── part01
│ │ ├── vid001.npz
│ │ ├── vid002.npz
│ │ └── ...
│ └── part02
│ ├── vid001.npz
│ ├── vid002.npz
│ └── ...
Each npz file contains data for the following keys:
- `latent_mean`: mean of the VAE latent distribution
- `latent_std`: std of the VAE latent distribution
- `fps`: video fps
- `ori_size`: original size (h, w) of the video
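As a quick sanity check, you can inspect one cached file and draw a latent from the stored distribution. This is a minimal sketch; the file path is hypothetical, and the reparameterization only illustrates how the stored mean/std pair can be used.

```shell
python - <<'EOF'
import numpy as np

# hypothetical path to one cached VAE latent produced by scripts/infer_vae.py
d = np.load("video_embed_folder/part01/vid001.npz")
print(d.files)  # expected keys: latent_mean, latent_std, fps, ori_size
mean, std = d["latent_mean"], d["latent_std"]
latent = mean + std * np.random.randn(*mean.shape).astype(mean.dtype)  # sample via reparameterization
print(latent.shape, d["fps"], d["ori_size"])
EOF
```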
After caching the VAE latents, you can use them for STDiT training by passing `--vae_latent_folder=/path/to/video_embed_folder` to the training script `python train.py`.
If there are multiple folders named in the `latent_{h}x{w}` format under the `--vae_latent_folder` directory (passed to `train.py`), one of the resolutions will be selected randomly during training. For example:
video_embed_folder
├── latent_576x1024
│ ├── vid001.npz
│ ├── vid002.npz
│ └── ...
└── latent_1024x576
├── vid001.npz
├── vid002.npz
└── ...
Once you prepare the data in a csv file, you may run the following commands to launch training on a single card.
```shell
# standalone training for stage 2
export MS_DEV_ENABLE_KERNEL_PACKET=on
python scripts/train.py --config configs/opensora-v1-2/train/train_stage2.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder
```

`text_embed_folder` is required and is used to speed up the training. You can find the instructions on how to generate T5 embeddings here.
For parallel training, use `msrun` along with `--use_parallel=True`:
```shell
# distributed training for stage 2
export MS_DEV_ENABLE_KERNEL_PACKET=on
msrun --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
    python scripts/train.py --config configs/opensora-v1-2/train/train_stage2.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder \
    --use_parallel True
```

You can modify the training configuration, including hyper-parameters and data settings, in the yaml file specified by the `--config` argument.
OpenSora v1.2 supports training with multiple resolutions, aspect ratios, and numbers of frames based on the bucket method.
To enable dynamic training for STDiT3, first set `bucket_config` to fit your datasets and tasks. An example (from `configs/opensora-v1-2/train/train_stage2.yaml`):
```yaml
bucket_config:
  # Structure: "resolution": { num_frames: [ keep_prob, batch_size ] }
  "144p": { 1: [ 1.0, 475 ], 51: [ 1.0, 51 ], 102: [ [ 1.0, 0.33 ], 27 ], 204: [ [ 1.0, 0.1 ], 13 ], 408: [ [ 1.0, 0.1 ], 6 ] }
  "256": { 1: [ 0.4, 297 ], 51: [ 0.5, 20 ], 102: [ [ 0.5, 0.33 ], 10 ], 204: [ [ 0.5, 1.0 ], 5 ], 408: [ [ 0.5, 1.0 ], 2 ] }
  "240p": { 1: [ 0.3, 297 ], 51: [ 0.4, 20 ], 102: [ [ 0.4, 0.33 ], 10 ], 204: [ [ 0.4, 1.0 ], 5 ], 408: [ [ 0.4, 1.0 ], 2 ] }
  "360p": { 1: [ 0.5, 141 ], 51: [ 0.15, 8 ], 102: [ [ 0.3, 0.5 ], 4 ], 204: [ [ 0.3, 1.0 ], 2 ], 408: [ [ 0.5, 0.5 ], 1 ] }
  "512": { 1: [ 0.4, 141 ], 51: [ 0.15, 8 ], 102: [ [ 0.2, 0.4 ], 4 ], 204: [ [ 0.2, 1.0 ], 2 ], 408: [ [ 0.4, 0.5 ], 1 ] }
  "480p": { 1: [ 0.5, 89 ], 51: [ 0.2, 5 ], 102: [ 0.2, 2 ], 204: [ 0.1, 1 ] }
  "720p": { 1: [ 0.1, 36 ], 51: [ 0.03, 1 ] }
  "1024": { 1: [ 0.1, 36 ], 51: [ 0.02, 1 ] }
  "1080p": { 1: [ 0.01, 5 ] }
  "2048": { 1: [ 0.01, 5 ] }
```

For example, `"240p": { 51: [ 0.4, 20 ] }` means that a 51-frame 240p sample is kept with probability 0.4 and trained with a batch size of 20.

Since the optimal bucket config can vary from device to device, we have tuned and provided bucket configs that are more balanced on Ascend + MindSpore in `configs/opensora-v1-2/train/{stage}_ms.yaml`. You may use them for better training performance.
More details on the bucket configuration can be found in Multi-resolution Training with Buckets.
The instructions for launching the dynamic training task are similar to those in the previous section. An example running script is `scripts/run/run_train_os1.2_stage2.sh`.
Instructions
Once you prepare the data in a csv file, you may run the following commands to launch training on a single card.
```shell
# standalone training for stage 1
python scripts/train.py --config configs/opensora-v1-1/train/train_stage1.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder \
    --vae_latent_folder /path/to/video_embed_folder
```

`text_embed_folder` and `vae_latent_folder` are optional and used to speed up the training.
You can find more in T5 text embeddings and VAE Video Embeddings.
For parallel training, use `msrun` along with `--use_parallel=True`:
```shell
# distributed training for stage 1
msrun --master_port=8200 --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
    python scripts/train.py --config configs/opensora-v1-1/train/train_stage1.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder \
    --vae_latent_folder /path/to/video_embed_folder \
    --use_parallel True
```

OpenSora v1.1 supports training with multiple resolutions, aspect ratios, and a variable number of frames. This can be enabled in one of two ways:

- Provide variable-sized VAE embeddings with the `--vae_latent_folder` option.
- Use `bucket_config` for training with videos in their original format. More on the bucket configuration can be found in Multi-resolution Training with Buckets.
A detailed running command can be found in `scripts/run/run_train_os_v1.1_stage2.sh`.
Instructions
Once the training data including the T5 text embeddings is prepared, you can run the following commands to launch training.
```shell
# standalone training, 16x256x256
python scripts/train.py --config configs/opensora/train/stdit_256x256x16_ms.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder
```

To use the cached video embeddings, replace `--video_folder` with `--video_embed_folder` and pass the path to the video embedding folder, as in the sketch below.
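A minimal sketch of the cached-embedding variant (paths are placeholders):

```shell
# standalone training, 16x256x256, with cached VAE video embeddings
python scripts/train.py --config configs/opensora/train/stdit_256x256x16_ms.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_embed_folder /path/to/video_embed_folder \
    --text_embed_folder /path/to/text_embed_folder
```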
For parallel training, please use `msrun` and pass `--use_parallel=True`:

```shell
# 8 NPUs, 64x512x512
msrun --master_port=8200 --worker_num=8 --local_worker_num=8 --log_dir=$output_dir \
    python scripts/train.py --config configs/opensora/train/stdit_512x512x64_ms.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder \
    --use_parallel True
```

To train in bfloat16 precision, please pass `--global_bf16=True`.
For more usage, please check python scripts/train.py -h.
You may also see the example shell scripts in scripts/run for quick reference.
Open-Sora 1.2, based on MindSpore and Ascend Atlas 800T A2 machines, supports video generation from 0s to 16s, 144p to 720p, and various aspect ratios. The supported configurations are listed below.
| image | 2s | 4s | 8s | 16s | |
|---|---|---|---|---|---|
| 240p | ✅ | ✅ | ✅ | ✅ | ✅ |
| 360p | ✅ | ✅ | ✅ | ✅ | ✅ |
| 480p | ✅ | ✅ | ✅ | ✅ | 🆗 |
| 720p | ✅ | ✅ | ✅ | 🆗 | 🆗 |
Here ✅ means that the configuration is seen during training, and 🆗 means that, although not trained, the model can run inference at that configuration. Inference for 🆗 requires sequence parallelism.
We evaluate the training performance of Open-Sora v1.2 on the MixKit dataset with high-resolution videos (1080P, duration 12s to 100s).
All experiments are tested on Ascend Atlas 800T A2 machines with MindSpore 2.7.0 in graph mode.
| model name | cards | batch size | resolution | precision | jit level | graph compile | s/step | recipe |
|---|---|---|---|---|---|---|---|---|
| STDiT3-XL/2 | 8 | 1 | 51x720x1280 | bf16 | O1 | 100 s | 11.24 | yaml |
| STDiT3-XL/2 | 8 | dynamic | stage 1 | bf16 | O1 | 14 mins | 13.17 | yaml |
| STDiT3-XL/2 | 8 | dynamic | stage 2 | bf16 | O1 | 14 mins | 26.04 | yaml |
| STDiT3-XL/2 | 8 | dynamic | stage 3 | bf16 | O1 | 14 mins | 27.83 | yaml |
Note that the step time of dynamic training can be influenced by the resolution and duration distribution of the source videos.
To reproduce the above performance, you may refer to scripts/run/run_train_os1.2_720x1280x51.sh and scripts/run/run_train_os1.2_stage2.sh.
Below are some generation results after fine-tuning STDiT3 with the stage 2 bucket config on a MixKit subset containing 100 text-video pairs. The training set contains 80 1080P videos of natural scenes, flowers, and pets. Here we show the text-to-video generation results on the test set.
| 480x854x204 | 480x854x204 |
|---|---|
| 019-The-video-begins-with-a-completely-black-screen.-which-quickly.mp4 | 009-The-video-features-a-person-in-a-white-lace-wedding.mp4 |
| 005-The-video-showcases-a-small-dog-with-a-light-brown.mp4 | 001-The-video-showcases-a-black-and-white-dog-engaging-in.mp4 |
View more
We evaluate the training performance of Open-Sora v1.1 on a subset of the MixKit dataset.
All experiments are tested on Ascend Atlas 800T A2 machines with MindSpore 2.3.1 in graph mode.
| model name | cards | batch size | resolution | vae cache | precision | sink | jit level | graph compile | s/step | recipe |
|---|---|---|---|---|---|---|---|---|---|---|
| STDiT3-XL/2 | 8 | 1 | 16x512x512 | OFF | bf16 | OFF | O1 | 13 mins | 2.28 | yaml |
| STDiT3-XL/2 | 8 | 1 | 64x512x512 | OFF | bf16 | OFF | O1 | 13 mins | 8.57 | yaml |
| STDiT3-XL/2 | 8 | 1 | 24x576x1024 | OFF | bf16 | OFF | O1 | 13 mins | 8.55 | yaml |
| STDiT3-XL/2 | 8 | 1 | 64x576x1024 | ON | bf16 | OFF | O1 | 13 mins | 18.94 | yaml |
vae cache: whether the VAE embeddings are pre-computed and cached before training.
Note that T5 text embedding is pre-computed before training.
Here are some generation results after fine-tuning STDiT2 on a MixKit subset.
| 576x1024x48 | 576x1024x48 |
|---|---|
| 000-a-breathtaking-aerial-view-of-a-vast-landscape.-The-foreground.mp4 | 001-a-close-up-view-of-a-tree-branch-adorned-with-vibrant.mp4 |
| 005-a-serene-landscape.-bathed-in-the-soft-glow-of-daylight.mp4 | 003-a-vibrant-scene-dominated-by-a-cluster-of-pink-bougainvillea.mp4 |
View more
All experiments are tested on Ascend Atlas 800T A2 machines with MindSpore 2.3.1 in graph mode.
| model name | cards | batch size | resolution | stage | precision | sink | jit level | graph compile | s/step | recipe |
|---|---|---|---|---|---|---|---|---|---|---|
| STDiT-XL/2 | 8 | 3 | 16x256x256 | 1 | fp16 | ON | O1 | 5~6 mins | 1.53 | yaml |
| STDiT-XL/2 | 8 | 1 | 16x512x512 | 2 | fp16 | ON | O1 | 5~6 mins | 2.47 | yaml |
| STDiT-XL/2 | 8 | 1 | 64x512x512 | 3 | bf16 | ON | O1 | 5~6 mins | 8.52 | yaml |
Here are some generation results after fine-tuning STDiT on a subset of the WebVid dataset.
| 512x512x64 | 512x512x64 | 512x512x64 |
|---|---|---|
| 001-Cloudy-moscow-kremlin-time-lapse.mp4 | 003-The-girl-received-flowers-as-a-gift.-a-gift-for.mp4 | 004-A-baker-turns-freshly-baked-loaves-of-sourdough-bread.mp4 |
For quality evaluation, please refer to the original HPC-AI Tech evaluation doc for video generation quality evaluation.
A 3D-VAE pipeline consisting of a spatial VAE followed by a temporal VAE is trained in OpenSora v1.2. For more details, refer to VAE Documentation.
- Download the pretrained VAE-2D checkpoint from PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers if you aim to train VAE-3D from a spatial VAE initialization.

  Convert to a MindSpore checkpoint:

  ```shell
  python tools/convert_vae1.2.py --src /path/to/pixart_sigma_sdxlvae_T5_diffusers/vae/diffusion_pytorch_model.safetensors --target models/sdxl_vae.ckpt --from_vae2d
  ```

- Download the pretrained VAE-3D checkpoint from hpcai-tech/OpenSora-VAE-v1.2 if you aim to train VAE-3D from the VAE-3D model pre-trained with 3 stages.

  Convert to a MindSpore checkpoint:

  ```shell
  python tools/convert_vae1.2.py --src /path/OpenSora-VAE-v1.2/models.safetensors --target models/OpenSora-VAE-v1.2/sdxl_vae.ckpt
  ```

- Download the LPIPS MindSpore checkpoint from here and put it under `models/`.
Before VAE-3D training, we need to prepare a csv annotation file for the training videos. The csv file lists the path to each video relative to the root `video_folder`. An example:
video
dance/vid001.mp4
dance/vid002.mp4
...
Taking UCF-101 as an example, please download the UCF-101 dataset and extract it to the `datasets/UCF-101` folder. You can generate the csv annotations by running `python tools/annotate_vae_ucf101.py`. It will produce two csv files, `datasets/ucf101_train.csv` and `datasets/ucf101_test.csv`, for training and testing respectively.
```shell
# stage 1 training, 8 NPUs
msrun --worker_num=8 --local_worker_num=8 \
    python scripts/train_vae.py --config configs/vae/train/stage1.yaml --use_parallel=True --csv_path datasets/ucf101_train.csv --video_folder datasets/UCF-101

# stage 2 training, 8 NPUs
msrun --worker_num=8 --local_worker_num=8 \
    python scripts/train_vae.py --config configs/vae/train/stage2.yaml --use_parallel=True --csv_path datasets/ucf101_train.csv --video_folder datasets/UCF-101

# stage 3 training, 8 NPUs
msrun --worker_num=8 --local_worker_num=8 \
    python scripts/train_vae.py --config configs/vae/train/stage3.yaml --use_parallel=True --csv_path datasets/ucf101_train.csv --video_folder datasets/UCF-101
```

You can change `csv_path` and `video_folder` to train on your own data.
To evaluate the VAE performance, you need to run VAE inference first to generate the videos, then calculate scores on the generated videos:
```shell
# video generation and evaluation
python scripts/inference_vae.py --ckpt_path /path/to/your_vae_ckpt --image_size 256 --num_frames=17 --csv_path datasets/ucf101_test.csv --video_folder datasets/UCF-101
```

You can change `csv_path` and `video_folder` to evaluate on your own data.
Here, we report the training performance and evaluation results on the UCF-101 dataset.
All experiments are tested on Ascend Atlas 800T A2 machines with MindSpore 2.3.1 in graph mode.
| model name | cards | batch size | resolution | precision | jit level | graph compile | s/step | PSNR | SSIM | recipe |
|---|---|---|---|---|---|---|---|---|---|---|
| VAE-3D | 8 | 1 | 17x256x256 | bf16 | O1 | 5 mins | 1.09 | 29.02 | 0.87 | yaml |
Note that we train with the mixed video and image strategy, i.e. `--mixed_strategy=mixed_video_image`, for stage 3 instead of a random number of frames (`mixed_video_random`). Random-frame training will be supported in the future. A sketch of the corresponding launch command is shown below.
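With that strategy made explicit, the stage-3 command above could be launched as follows (paths follow the UCF-101 example; adapt them to your data):

```shell
# stage 3 training with the mixed video-image strategy, 8 NPUs
msrun --worker_num=8 --local_worker_num=8 \
    python scripts/train_vae.py --config configs/vae/train/stage3.yaml \
    --use_parallel=True \
    --mixed_strategy=mixed_video_image \
    --csv_path datasets/ucf101_train.csv \
    --video_folder datasets/UCF-101
```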
We support training the OpenSora v1.2 model with SP (sequence parallelism) and DSP (dynamic sequence parallelism), handling up to 408 frames (~16 seconds) on 4 NPU cards. Additionally, we have optimized the training speed by implementing micro-batch parallelism in the VAE's spatial and temporal domains, achieving approximately a 20% speed boost. We evaluate the training performance using the MixKit dataset, which includes high-resolution videos (1080P, duration 12s to 100s). The training performance results are reported below.
All experiments are tested on Ascend Atlas 800T A2 machines with MindSpore 2.4.0 in graph mode.
| model name | cards | batch size | resolution | sink | precision | jit level | graph compile | s/step | recipe |
|---|---|---|---|---|---|---|---|---|---|
| STDiT3-XL/2 | 4 | 1 | 408x720x1280 | OFF | bf16 | O1 | 12 mins | 48.30 | script |
| STDiT3-XL/2 | 4 | 1 | 408x720x1280 | OFF | bf16 | O1 | 12 mins | 47.00 | script |
To prevent the system from running out of memory, ensure you launch the training job on a server with sufficient memory. For 4P training, at least 400GB of memory is required.
We evaluate the inference performance of text-to-video generation by measuring the average sampling time per step and the total sampling time of a video.
All experiments are tested on Ascend Atlas 800T A2 machines with MindSpore 2.4.0 in graph mode.
| model name | cards | batch size | resolution | precision | scheduler | steps | jit level | graph compile | s/step | s/video | recipe |
|---|---|---|---|---|---|---|---|---|---|---|---|
| STDiT3-XL/2 | 2 | 1 | 408x720x1280 | bf16 | RFlow | 30 | O0 | 1~2 mins | 26.03 | 780.00 | script |
| STDiT3-XL/2 | 2 | 1 | 408x720x1280 | bf16 | RFlow | 30 | O0 | 1~2 mins | 22.03 | 660.00 | script |
View more
⚠️ WARNING: This feature is experimental. The official version is under development.
We provide support for training Open-Sora 1.1 using the FiT-Like pipeline as an alternative solution for handling multi-resolution videos, in contrast to the bucketing strategy.
To begin, we need to prepare the VAE (Variational Autoencoder) latents from multi-resolution videos. For instance, if you intend to train at a resolution of up to 512x512 pixels, please run
```shell
python scripts/infer_vae.py \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --output_path /path/to/video_embed_folder \
    --vae_checkpoint models/sd-vae-ft-ema.ckpt \
    --image_size 512 \
    --resize_by_max_value True \
    --vae-micro-batch-size 1 \
    --mode 1
```

The extracted VAE latents will be saved in the video embedding folder.
Then, to launch distributed training on eight NPU cards, please run:
```shell
msrun --worker_num=8 --local_worker_num=8 \
    scripts/train.py --config configs/opensora-v1-1/train/train_stage1_fit.yaml \
    --csv_path /path/to/video_caption.csv \
    --video_folder /path/to/video_folder \
    --text_embed_folder /path/to/text_embed_folder \
    --vae_latent_folder /path/to/video_embed_folder \
    --use_parallel True \
    --max_image_size 512
```

We evaluated the training performance on MindSpore and Ascend NPUs. The results are as follows.
| Model | Context | Precision | BS | NPUs | Max. Size | Train T. (s/step) |
|---|---|---|---|---|---|---|
| STDiT2-XL/2 | D910*-MS2.3_master | BF16 | 1 | 4 | 16x512x512 | 2.3 |
To sample a video with a resolution of 384x672 using the trained checkpoint, you can run:
```shell
python scripts/inference_i2v.py --config configs/opensora-v1-1/inference/t2v_fit.yaml \
    --ckpt_path /path/to/your/opensora-v1-1.ckpt \
    --prompt_path /path/to/prompt.txt \
    --image_size 384 672 \
    --max_image_size 512
```

Make sure that the `max_image_size` parameter remains consistent between your training and inference commands.
Here are some generation results after fine-tuning STDiT on a small dataset:
| 384x672x16 | 672x384x16 |
|---|---|
| 001-a-breathtaking-view-of-a-mountainous-landscape.-From-a-high.mp4 | 000-a-close-up-view-of-a-branch-laden-with-white-flowers.mp4 |
Thanks go to the support from the MindSpore team and the open-source contributions from the OpenSora project.
If you wish to contribute to this project, you can refer to the Contribution Guideline.
- ColossalAI: A powerful large model parallel acceleration and optimization system.
- DiT: Scalable Diffusion Models with Transformers.
- OpenDiT: An acceleration for DiT training. We adopt valuable acceleration strategies for training progress from OpenDiT.
- PixArt: An open-source DiT-based text-to-image model.
- Flux: A powerful text-to-image generation model.
- Latte: An attempt to efficiently train DiT for video.
- HunyuanVideo: Open-Source text-to-video model.
- StabilityAI VAE: A powerful image VAE model.
- DC-AE: Deep Compression AutoEncoder for image compression.
- CLIP: A powerful text-image embedding model.
- T5: A powerful text encoder.
- LLaVA: A powerful image captioning model based on Mistral-7B and Yi-34B.
- PLLaVA: A powerful video captioning model.
- DSP: Dynamic Sequence Parallel introduced by NUS HPC AI Lab.
- MiraData: A large-scale video dataset with long durations and structured caption.
```bibtex
@article{opensora,
  title={Open-sora: Democratizing efficient video production for all},
  author={Zheng, Zangwei and Peng, Xiangyu and Yang, Tianji and Shen, Chenhui and Li, Shenggui and Liu, Hongxin and Zhou, Yukun and Li, Tianyi and You, Yang},
  journal={arXiv preprint arXiv:2412.20404},
  year={2024}
}

@article{opensora2,
  title={Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k},
  author={Xiangyu Peng and Zangwei Zheng and Chenhui Shen and Tom Young and Xinying Guo and Binluo Wang and Hang Xu and Hongxin Liu and Mingyan Jiang and Wenjun Li and Yuhui Wang and Anbang Ye and Gang Ren and Qianran Ma and Wanying Liang and Xiang Lian and Xiwen Wu and Yuting Zhong and Zhuangyan Li and Chaoyu Gong and Guojun Lei and Leijun Cheng and Limin Zhang and Minghao Li and Ruijie Zhang and Silan Hu and Shijie Huang and Xiaokang Wang and Yuanheng Zhao and Yuqi Wang and Ziang Wei and Yang You},
  journal={arXiv preprint arXiv:2503.09642},
  year={2025}
}
```

We are grateful for their exceptional work and generous contribution to open source.


























