ByteDance

LatentSync

LatentSync is an end-to-end lip-synchronization model developed by ByteDance. It takes a source talking-head video and a target audio track as inputs and produces a new video whose mouth movements are aligned to the provided speech. It is built on audio-conditioned latent diffusion: it operates directly in latent space rather than relying on 3D meshes or 2D facial landmarks. The model supports multiple languages and accents, works with diverse speakers and recording conditions, and handles both real people and stylized characters such as anime.

A key technical feature of LatentSync is Temporal REPresentation Alignment (TREPA), a technique designed to reduce flicker, jitter, and frame-to-frame artifacts, keeping head pose and lip motion stable across longer sequences. The model is recommended for use with high-resolution source videos (720p, 1080p, or 4K) paired with clean audio recordings. It is well suited to lip-syncing, audio dubbing, digital human creation, and video-audio alignment workflows.

August 2025 | 50,000 context | N/A output

Lip Synchronization | Audio Input Processing | Video Input Processing | Temporal Consistency (TREPA) | High-Resolution Output

Model Overview

Provider (the entity that provides this model): ByteDance

Input Context Window (the number of tokens supported by the input context window): 50,000 tokens

Maximum Output Tokens (the number of tokens that can be generated by the model in a single request): N/A

Open Source (whether the model's code is available for public use): not listed

Release Date (when the model was first released): August 2025

Knowledge Cut-off Date (when the model's knowledge was last updated): August 2025

API Providers (the providers that offer this model; this is not an exhaustive list): ByteDance

Modalities (types of data this model can process): Video, Audio

Capabilities

What LatentSync supports

Lip Synchronization

Aligns mouth movements in a source video to a target audio track end-to-end, without relying on 3D meshes or 2D facial landmarks.

Audio Input Processing

Accepts an audio URL as a conditioning input, supporting multiple languages and accents across diverse speakers and recording conditions.

Video Input Processing

Takes a source talking-head video URL as input and preserves the subject's identity, pose, background, and scene structure in the output.

Temporal Consistency (TREPA)

Uses Temporal REPresentation Alignment to reduce flicker, jitter, and frame-to-frame artifacts, keeping motion stable over long sequences.

High-Resolution Output

Supports source videos at 720p, 1080p, or 4K resolution, delivering sharp facial detail for both real people and stylized characters.
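
The recommended source resolutions can be checked before a clip is submitted. The sketch below is only an illustration and not part of LatentSync: the function name `recommended_label` is hypothetical, and the mapping from labels to pixel heights (720, 1080, 2160) is an assumption, since this page names the labels but not exact dimensions.

```python
from typing import Optional

# Assumed pixel heights for the labels the page recommends; the page gives
# only the labels, not exact dimensions.
RECOMMENDED_HEIGHTS = {720: "720p", 1080: "1080p", 2160: "4K"}


def recommended_label(width: int, height: int) -> Optional[str]:
    """Return the matching recommended label, or None if there is none."""
    # Compare the shorter side so portrait clips (e.g. 1080x1920) also match.
    return RECOMMENDED_HEIGHTS.get(min(width, height))


for size in [(1280, 720), (1080, 1920), (3840, 2160), (640, 480)]:
    print(size, recommended_label(*size))
```

A `None` result only means the clip falls outside the three named resolutions. Whether such a clip is rejected or merely produces softer facial detail is not stated on this page.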

Pricing for LatentSync

Input tokens: N/A (per million tokens)

Output tokens: N/A (per million tokens)

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

ByteDance

Configuration & Parameters

The configurable options currently documented for this model.

Audio (Audio URL): the audio to be synchronized.

Video (Video URL): the video to be synchronized.
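
Because both parameters are URLs, a request body can be assembled and sanity-checked before it is sent. The following is a minimal sketch under stated assumptions: the JSON key names `video_url` and `audio_url` and the helper `build_lipsync_request` are hypothetical, and the example URLs are placeholders. The provider's API reference defines the real field names and endpoint.

```python
import json
from urllib.parse import urlparse


def build_lipsync_request(video_url: str, audio_url: str) -> str:
    """Return a JSON body holding the two documented inputs.

    The key names "video_url" and "audio_url" are hypothetical; check the
    provider's API reference for the real field names.
    """
    for label, url in (("video", video_url), ("audio", audio_url)):
        parts = urlparse(url)
        # Both inputs are passed as URLs, so require an http(s) scheme and a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"{label} input must be an http(s) URL, got {url!r}")
    return json.dumps({"video_url": video_url, "audio_url": audio_url})


body = build_lipsync_request(
    "https://example.com/talking-head.mp4",
    "https://example.com/dubbed-speech.wav",
)
print(body)
```

Validating the URLs locally catches a missing or malformed input early, without waiting on a remote job that cannot fetch its media.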

Supported Request Parameters

Parameters currently listed by OpenRouter or the local catalog for this model.

Audio, Video

Resources & Documentation

Official model cards, release notes, docs, and other references synced from the source page.

Model Page (Other)

API Reference (Documentation)

Digital Human Tutorial (Documentation)

LatentSync Paper, arXiv (Research)

LatentSync GitHub Repository (Open Source)

Community discussion

What people think about LatentSync

LatentSync discussions are most active in r/StableDiffusion, r/comfyui, and r/latentsync. Top Reddit threads cluster around benchmarks and model comparisons.

The strongest match in this snapshot has 840 upvotes and 277 comments.

r/StableDiffusion | 58 upvotes | 52 comments | May 28, 2025

Looking for Lip Sync Models - Anything Better Than LatentSync?

Hi everyone,

I’ve been experimenting with lip sync models for a project where I need to sync lip movements in a video to a given audio file.

I’ve tried **Wav2Lip** and **LatentSync** — I found LatentSync to perform better, but the results are still far from accurate.

Does anyone have recommendations for other models I can try? Preferably **open source** with **fast runtimes**.

Thanks in advance!

r/StableDiffusion | 3 upvotes | 13 comments | August 8, 2025

Has anyone successfully used LatentSync to lip sync video clips (not images)? If you didn't use LatentSync which other open source alternative did you use?

r/StableDiffusion | 146 upvotes | 32 comments | December 12, 2025

Mixing IndexTTS2 + Fast Whisper + LatentSync gives you an open source alternative to Heygen translation

r/comfyui | 255 upvotes | 43 comments | March 18, 2025

Best Lip Sync - LatentSync update to 1.5

LatentSync update to 1.5

Quality and Stability Improvement

workflow:

https://github.com/ShmuelRonen/ComfyUI-LatentSyncWrapper/blob/main/workflow/latentsync1.5_comfyui_basic.json

online run:

https://www.comfyonline.app/explore/f6a36d51-ee68-429b-87db-a3314b8a2513

r/StableDiffusion | 51 upvotes | 54 comments | April 19, 2025

A Demo for F5 and Latentsync 1.5 - English voice dubbing for foreign movies and videos

Workflow can be downloaded here:

https://filebin.net/f4boko99u9g99vay

This workflow generates English audio from European films and videos and lip-syncs it to the actor using Latentsync 1.5. The generated voice retains the accent and emotional expression of the source voice. For optimal results, use a voice file containing at least five seconds of speech. (This has only been tested with French, German, Italian, and Spanish; I'm not sure about other languages.)

1. Make sure that the fps is the same for all the nodes!

2. Connect the "Background sound" output to the "Stack Audio" node if you want to add the background/ambient sound back to the generated audio.

3. Enable the "Convolution Reverb" node if you want reverb in the generated audio. Read this page for more info: https://github.com/c0ffymachyne/ComfyUI_SignalProcessing

4. Try the E2 model as well.

5. The audio generation is fast; it's Latentsync that is time-consuming. An efficient method is to disconnect the audio output from the Latentsync Sampler, then keep regenerating the audio until you get the result you want. After that, fix the seed and reconnect the audio output to Latentsync.

6. Sometimes the generated voice sounds like low-bitrate audio with a metallic sound, and you need to upscale it to improve the quality. There are a few free online options (including Adobe) for AI audio upscaling. I am surprised that there are so many image upscaling models available for ComfyUI, but not a single one for audio. Otherwise, I would have included it as the final post-processing step for this workflow. If you are proficient with digital audio workstation (DAW) software, you can also enhance the sound quality using specialized audio tools.

FAQ

Common questions about LatentSync

What inputs does LatentSync require?

LatentSync requires two inputs: a video URL pointing to a source talking-head video and an audio URL pointing to the target audio track you want the subject to appear to speak.

What video resolution is recommended for best results?

The model documentation recommends using source videos at 720p, 1080p, or 4K resolution, paired with clean, dry audio recordings for optimal output quality.

Does LatentSync support languages other than English?

Yes, LatentSync is described as multilingual and supports multiple languages and accents, adapting to diverse speakers and recording conditions.

What is the context window for this model?

The model's metadata lists an input context window of 50,000 tokens.

Does LatentSync use facial landmark detection or 3D meshes internally?

No. LatentSync operates via audio-conditioned latent diffusion and does not rely on 3D meshes or 2D facial landmarks to generate lip-synced output.

What is the training data cutoff for LatentSync?

According to the model metadata, the knowledge cut-off date is listed as August 2025.
