Lipsync models

Sync offers a family of lipsync models for different quality and speed needs. sync-3 is our most powerful model with 4K native output and built-in obstruction detection. lipsync-2-pro delivers diffusion-based super resolution for fine facial detail. lipsync-2 balances quality and speed for general use. Pricing ranges from $0.04/sec (lipsync-2) to $0.133/sec (sync-3) at 25 fps. Explore and compare the capabilities of the different models below.

| Feature | lipsync-2 | lipsync-2-pro | sync-3 |
| --- | --- | --- | --- |
| Description | Our most natural lipsyncing model yet. The first model that can preserve the unique speaking style of every speaker. | Our highest quality lipsyncing model with diffusion-based super resolution. Enhanced detail preservation for beards, teeth, and facial features. | Our most powerful lipsync model. Processes the full shot at once with built-in obstruction detection, 4K native output, and support for extreme angles and partial faces. |
| Pricing @ 25 fps | $0.04 to $0.05/sec | $0.067 to $0.083/sec | $0.107 to $0.133/sec |
| Accuracy | | | |
| Speed | | | |
| Style | Lip movements in the unique style of the speaker | Lip movements in the unique style of the speaker with enhanced detail and fidelity | Full speaker style and emotion preservation with wider spatial understanding |
| Identity Preservation | | | |
| Teeth | | | |
| Face Detection | | | |
| Face Blending | | | |
| Pose Robustness | | | |
| Beard | | | |
| Face Resolution | 512×512 | 512×512 with enhanced detail preservation | 4K native output with built-in super resolution |
| Best for | Best lipsync for majority of the videos | Better than the best. lipsync-2 with premium quality, highly recommended for professional needs. Seamlessly generates facial details with beards, wrinkles, and teeth. | Production-grade lipsync. Handles close-ups, profile shots, obstructions, and complex scenes that other models struggle with. |

All models are available in both Studio and API.
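The comparison above can be read as a rough decision rule. The sketch below is only an illustration of that reading: `ShotTraits` and `suggest_model` are hypothetical names, not part of any Sync SDK, and the rule is a simplification of the "Best for" guidance rather than an official recommendation.

```python
from dataclasses import dataclass


@dataclass
class ShotTraits:
    """Hypothetical description of a shot, used only for this illustration."""
    has_obstructions: bool = False    # hands, microphones, scarves over the face
    has_extreme_angles: bool = False  # profile or over-the-shoulder shots
    needs_4k: bool = False            # 4K output is required
    needs_fine_detail: bool = False   # beards, teeth, wrinkles must hold up


def suggest_model(traits: ShotTraits) -> str:
    """Pick a model name using a simplified reading of the comparison table."""
    # sync-3 is described as handling obstructions, extreme angles, and 4K natively.
    if traits.has_obstructions or traits.has_extreme_angles or traits.needs_4k:
        return "sync-3"
    # lipsync-2-pro is described as the option for fine facial detail.
    if traits.needs_fine_detail:
        return "lipsync-2-pro"
    # lipsync-2 is described as the general-purpose default.
    return "lipsync-2"


print(suggest_model(ShotTraits(needs_fine_detail=True)))  # lipsync-2-pro
```

Budget and turnaround time also matter in practice, so treat the suggestion as a starting point rather than a final choice.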

Advanced Options

Not all advanced options are available on every model. The table below shows which options you can configure for each lipsync model.

| Option | lipsync-2 | lipsync-2-pro | sync-3 |
| --- | --- | --- | --- |
| temperature | Yes | Yes | — (managed natively) |
| occlusion_detection_enabled | Yes | Yes | — (automatic) |
| reasoning_enabled | — | Yes | — (built-in) |
| active_speaker_detection | Yes | Yes | Yes |

Options marked with “—” are either not supported or handled automatically by the model. If you include an unsupported option in your request, it will be ignored.

  • Temperature: Controls expressiveness of lip movements (0–1, default 0.5). Available on lipsync-2 and lipsync-2-pro. sync-3 manages expressiveness natively.
  • Obstruction Detection (occlusion_detection_enabled): Improves face detection in scenes where faces are partially hidden by objects, hands, or other elements. Available on lipsync-2 and lipsync-2-pro, at the cost of slower generation. sync-3 handles obstructions automatically, with no configuration needed.
  • Reasoning: Enhanced frame analysis for artifacts and edge cases. Available on lipsync-2-pro. sync-3 includes this capability natively.
  • Active Speaker Detection: Automatically detects and applies lipsync only to active speakers in multi-person videos. Available on all lipsync models.
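Because unsupported options are ignored rather than rejected, it can be useful to filter them on the client so that the request reflects what the model will actually use. The following is a minimal sketch under stated assumptions: `SUPPORTED_OPTIONS` is transcribed from the descriptions above, `build_options` is a hypothetical helper (not a Sync SDK function), and the value types passed in (for example a boolean for `occlusion_detection_enabled`) are assumed from the option names, not confirmed here.

```python
from typing import Any

# Which options each model accepts, transcribed from the list above.
SUPPORTED_OPTIONS: dict[str, set[str]] = {
    "lipsync-2": {
        "temperature",
        "occlusion_detection_enabled",
        "active_speaker_detection",
    },
    "lipsync-2-pro": {
        "temperature",
        "occlusion_detection_enabled",
        "reasoning_enabled",
        "active_speaker_detection",
    },
    # sync-3 handles temperature, obstructions, and reasoning on its own.
    "sync-3": {"active_speaker_detection"},
}


def build_options(model: str, requested: dict[str, Any]) -> dict[str, Any]:
    """Return only the requested options that the given model supports."""
    if model not in SUPPORTED_OPTIONS:
        raise ValueError(f"unknown model: {model!r}")
    options = {k: v for k, v in requested.items() if k in SUPPORTED_OPTIONS[model]}
    # temperature is documented as a value between 0 and 1 (default 0.5).
    temperature = options.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    return options


requested = {"temperature": 0.7, "occlusion_detection_enabled": True}
print(build_options("lipsync-2", requested))  # both options kept
print(build_options("sync-3", requested))     # {}
```

Filtering up front also makes it obvious in logs when a setting such as `temperature` has no effect because the job runs on sync-3.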

Caveats

  • Still Frame Limitation: lipsync-2 and lipsync-2-pro need natural speaking motion in the input video. They run inference on independent 2-second chunks and rely on the speaker’s visible speaking style to generate lip movements, so segments with still frames (where the speaker is not moving or speaking) will not be lipsynced, even if audio is present. sync-3 can open silent lips to match audio, though the result is generic rather than matched to the speaker’s style. Recommendation: For best results with lipsync-2 and lipsync-2-pro, make sure the speaker is actively talking throughout the portion you want to lipsync.

Legacy Models

lipsync-1.9.0-beta is our fast legacy lipsync model for simple videos. It uses standard generic lip movements and is the fastest option available, at $0.02 to $0.025/sec at 25 fps. It is still supported, but we recommend lipsync-2 or a newer model for better accuracy and style preservation.

FAQs

The character in the input video needs to look like they are talking. Our models learn to mimic the speaking style in the input video. If the character is completely static, the model might not generate lips that move either. sync-3 can open silent lips, but for the most natural results the input should show some speaking motion.

Solution: When creating your AI-generated video, add the text prompt “person is speaking naturally” to your generation. This will create characters with lips that are already moving, which will work much better with our platform.

Absolutely, our latest model is your best bet for this. For best results, be sure to isolate and upload the vocals track, as the instrumental sounds can sometimes interfere with the lipsync quality.

You can lipsync human-like faces, but our models don’t currently support animals or non-humanoid characters.

Please check if the problematic segments have:

  • Multiple speakers in the frame
  • Faces that are too small or in profile view
  • Segments where the speaker in the input video is not speaking
  • Faces that are partially obstructed by objects, hands, or other elements

For multiple people, try masking or cropping out some faces using external tools. For obstruction issues, sync-3 handles obstructions automatically. On lipsync-2 and lipsync-2-pro, enable the occlusion_detection_enabled option in your generation request for better face detection in complex scenes (though it will slow down processing).

Extreme profile view faces can lead to sub-par results on lipsync-2 and lipsync-2-pro. sync-3 natively supports extreme face angles including profiles, over-the-shoulder shots, and non-frontal lip positions. If you’re working with challenging angles, sync-3 is your best option.

lipsync-2 and lipsync-2-pro generate faces at 512×512 resolution, which is sufficient for most 1080p videos. If the face in your input video is quite large, you may notice a resolution difference between the generated face and the rest of the frame. lipsync-2-pro offers enhanced detail preservation for beards, teeth, and fine facial features. sync-3 generates at 4K native resolution with built-in super resolution, so resolution loss is not an issue.

lipsync-2-pro uses advanced diffusion-based super resolution technology instead of traditional GAN-based approaches. This results in:

  • Enhanced beard resolution: Better handling of facial hair without blurring
  • Improved teeth generation: More consistent and natural-looking teeth across frames
  • Superior detail preservation: Enhanced quality around the mouth region and facial features
  • Better face size handling: Can process larger face regions (up to 350×350 pixels) without quality degradation

The trade-off is slower processing time (1.5-2x slower than lipsync-2) and higher cost, making it ideal for premium quality applications where the highest fidelity is required.

sync-3 is a fundamentally different architecture. Instead of processing video in small independent chunks, sync-3 builds a global understanding of the person across the entire shot and generates all frames at once. This gives it:

  • Long-range consistency: No chunk boundary artifacts or flickering
  • Built-in obstruction detection: Automatically handles hands, microphones, scarves, and other objects blocking the face
  • Extreme angle support: Profile shots, over-the-shoulder, and non-frontal angles work natively
  • 4K native output: Built-in super resolution without quality loss
  • Emotion and style preservation: Preserves the speaker’s cadence, expression, and emotional performance

sync-3 delivers significantly higher quality at a higher price ($0.107 to $0.133/sec at 25 fps), reflecting its advanced processing pipeline.

The maximum video duration depends on your subscription plan — ranging from 1 minute on Hobbyist up to 30 minutes on Scale+ plans. See the pricing page for your plan’s limit.

Each lipsync model has its own per-second rate, quoted at 25 fps:

| Model | Per-second rate (at 25 fps) |
| --- | --- |
| lipsync-1.9.0-beta | $0.02 to $0.025/sec |
| lipsync-2 | $0.04 to $0.05/sec |
| lipsync-2-pro | $0.067 to $0.083/sec |
| sync-3 | $0.107 to $0.133/sec |

lipsync-2 costs roughly 2x more than lipsync-1.9.0-beta, lipsync-2-pro costs roughly 1.7x more than lipsync-2, and sync-3 costs roughly 1.6x more than lipsync-2-pro. Choose based on your quality and budget requirements. See the Billing page for full details.
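As a worked example of the arithmetic, the sketch below multiplies a video's duration by the low and high ends of each model's per-second range. It assumes 25 fps footage and takes its rates from the figures quoted on this page. `estimate_cost` is a hypothetical helper, and which end of the range applies to a given account is not stated here, so treat the result as a bracket and check the Billing page for the actual charge.

```python
# Per-second rate ranges in USD at 25 fps, as quoted on this page.
RATES_USD_PER_SEC: dict[str, tuple[float, float]] = {
    "lipsync-1.9.0-beta": (0.02, 0.025),
    "lipsync-2": (0.04, 0.05),
    "lipsync-2-pro": (0.067, 0.083),
    "sync-3": (0.107, 0.133),
}


def estimate_cost(model: str, duration_sec: float) -> tuple[float, float]:
    """Return the (low, high) estimated cost in USD for a 25 fps video."""
    if model not in RATES_USD_PER_SEC:
        raise ValueError(f"unknown model: {model!r}")
    if duration_sec < 0:
        raise ValueError("duration_sec must be non-negative")
    low_rate, high_rate = RATES_USD_PER_SEC[model]
    # Round to cents so floating-point noise does not leak into the result.
    return (round(low_rate * duration_sec, 2), round(high_rate * duration_sec, 2))


# A one-minute clip on each model:
for name in RATES_USD_PER_SEC:
    low, high = estimate_cost(name, 60)
    print(f"{name}: ${low:.2f} to ${high:.2f}")
```

For a 60-second clip this gives $2.40 to $3.00 on lipsync-2 and $6.42 to $7.98 on sync-3.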

For lipsync-2 and lipsync-2-pro, long videos are automatically divided into 30-40 second chunks for processing. Generations can time out and fail if your video has too many scene changes within these chunks or if many scenes don’t contain detectable faces.

This happens because the processing pipeline needs to detect and track faces across the video. Rapid scene changes or scenes without faces create additional complexity that can exceed processing time limits.

To fix this:

  • Reduce the number of scene cuts in your video before uploading
  • Ensure faces are visible in most frames of the video
  • Break very long videos with many scenes into shorter, more manageable segments

sync-3 processes shots differently and may handle some of these scenarios better, though very long videos with frequent scene changes can still be challenging.
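If you decide to break a long video into shorter segments, cutting at existing scene boundaries avoids introducing extra cuts. The sketch below plans segment boundaries only. It assumes you already have scene-cut timestamps from some other tool, and `plan_segments` and the `max_segment_sec` value are hypothetical choices for illustration, not limits defined by Sync. The actual trimming would be done with a video editing tool of your choice.

```python
def plan_segments(
    duration_sec: float,
    scene_cuts: list[float],
    max_segment_sec: float,
) -> list[tuple[float, float]]:
    """Split [0, duration_sec] into (start, end) segments no longer than
    max_segment_sec, preferring to cut at known scene boundaries."""
    if duration_sec <= 0 or max_segment_sec <= 0:
        raise ValueError("duration_sec and max_segment_sec must be positive")
    # Keep only cuts that fall strictly inside the video, in ascending order.
    cuts = sorted(c for c in set(scene_cuts) if 0 < c < duration_sec)
    segments: list[tuple[float, float]] = []
    start = 0.0
    while duration_sec - start > max_segment_sec:
        limit = start + max_segment_sec
        # Latest scene cut that still keeps this segment within the limit.
        candidates = [c for c in cuts if start < c <= limit]
        # Fall back to a hard cut at the limit when no scene boundary fits.
        end = candidates[-1] if candidates else limit
        segments.append((start, end))
        start = end
    segments.append((start, duration_sec))
    return segments


# A 100-second video with five scene cuts, split into segments of at most 30 s.
print(plan_segments(100, [12, 28, 45, 70, 95], 30))
```

Each planned segment can then be submitted as its own generation and the results joined afterwards. This is a greedy plan, so it keeps segments as long as the limit allows rather than balancing their lengths.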