Lip Synchronization
Aligns mouth movements in a source video to a target audio track end-to-end, without relying on 3D meshes or 2D facial landmarks.
LatentSync is an end-to-end lip-synchronization model developed by ByteDance that takes a source talking-head video and a target audio track as inputs and produces a new video with mouth movements precisely aligned to the provided speech. It is built on audio-conditioned latent diffusion, meaning it operates directly in the latent space rather than relying on 3D meshes or 2D facial landmarks. The model supports multiple languages and accents, works with diverse speakers and recording conditions, and handles both real people and stylized characters such as anime.
A key technical feature of LatentSync is Temporal REPresentation Alignment (TREPA), a technique designed to reduce flicker, jitter, and frame-to-frame artifacts, keeping head pose and lip motion stable across longer sequences. The model is recommended for use with high-resolution source videos (720p, 1080p, or 4K) paired with clean audio recordings. It is well suited for workflows involving lip-syncing, audio dubbing, digital human creation, and video-audio alignment.
Accepts an audio URL as a conditioning input, supporting multiple languages and accents across diverse speakers and recording conditions.
Takes a source talking-head video URL as input and preserves the subject's identity, pose, background, and scene structure in the output.
Uses Temporal REPresentation Alignment to reduce flicker, jitter, and frame-to-frame artifacts, keeping motion stable over long sequences.
Supports source videos at 720p, 1080p, or 4K resolution, delivering sharp facial detail for both real people and stylized characters.
The configurable options currently documented for this model.
Audio to be synchronized.
Video to be synchronized.
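The two options above can be sketched as a small request payload. This is a minimal illustration, not a documented API: the field names `video_url` and `audio_url`, the helper names, and the example URLs are all assumptions, since the page lists only the parameter descriptions. The sketch checks that both inputs look like http(s) URLs before a request would be sent.

```python
import json
from urllib.parse import urlparse


def _require_http_url(value: str, field: str) -> str:
    """Return value unchanged if it looks like an http(s) URL, else raise ValueError."""
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"{field} must be an http(s) URL, got {value!r}")
    return value


def build_lipsync_request(video_url: str, audio_url: str) -> dict:
    """Build a payload with the two inputs the model needs.

    NOTE: "video_url" and "audio_url" are assumed field names chosen for
    illustration; check the provider's own schema before using them.
    """
    return {
        "video_url": _require_http_url(video_url, "video_url"),
        "audio_url": _require_http_url(audio_url, "audio_url"),
    }


if __name__ == "__main__":
    # Placeholder URLs, not real assets.
    payload = build_lipsync_request(
        "https://example.com/talking_head.mp4",
        "https://example.com/dubbed_speech.wav",
    )
    print(json.dumps(payload, indent=2))
```

Validating the inputs locally keeps obviously malformed requests from reaching a slow, billed inference call.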
LatentSync discussions are most active in r/StableDiffusion, r/comfyui, and r/latentsync. Top Reddit threads cluster around benchmarks and model comparisons.
The strongest match in this snapshot has 840 upvotes and 277 comments.
Hi everyone,
I’ve been experimenting with lip sync models for a project where I need to sync lip movements in a video to a given audio file.
I’ve tried **Wav2Lip** and **LatentSync** — I found LatentSync to perform better, but the results are still far from accurate.
Does anyone have recommendations for other models I can try? Preferably **open source** with **fast runtimes**.
Thanks in advance!
LatentSync update to 1.5
Quality and Stability Improvement
workflow:
[https://github.com/ShmuelRonen/ComfyUI-LatentSyncWrapper/blob/main/workflow/latentsync1.5\_comfyui\_basic.json](https://github.com/ShmuelRonen/ComfyUI-LatentSyncWrapper/blob/main/workflow/latentsync1.5_comfyui_basic.json)
online run:
[https://www.comfyonline.app/explore/f6a36d51-ee68-429b-87db-a3314b8a2513](https://www.comfyonline.app/explore/f6a36d51-ee68-429b-87db-a3314b8a2513)
Workflow can be downloaded here:
https://filebin.net/f4boko99u9g99vay
This workflow generates English audio from European films and videos and lip-syncs it to the actor using LatentSync 1.5. The generated voice retains the accent and emotional expression of the source voice. For optimal results, use a voice file containing at least five seconds of speech. (This has only been tested with French, German, Italian, and Spanish; I'm not sure about other languages.)
1. Make sure the fps is the same for all the nodes!
2. Connect the "Background sound" output to the "Stack Audio" node if you want to add the background/ambient sound back to the generated audio.
3. Enable the "Convolution Reverb" node if you want reverb in the generated audio. Read this page for more info: https://github.com/c0ffymachyne/ComfyUI_SignalProcessing
4. Try the E2 model as well.
5. The audio generation is fast; it's LatentSync that is time-consuming. An efficient method is to disconnect the audio output from the LatentSync Sampler, then keep regenerating the audio until you get the result you want. After that, fix the seed and reconnect the audio output to LatentSync.
6. Sometimes the generated voice sounds like low-bitrate audio with a metallic sound, and you need to upscale it to improve the quality. There are a few free online options (including Adobe) for AI audio upscaling. I am surprised that there are so many image upscaling models available for ComfyUI but not a single one for audio. Otherwise, I would have included it as the final post-processing step for this workflow. If you are proficient with digital audio workstation (DAW) software, you can also enhance the sound quality using specialized audio tools.
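The first tip in the list above (matching fps across nodes) is easy to check before starting a long LatentSync run. The sketch below is a hypothetical pre-flight check, not part of ComfyUI or the LatentSync wrapper: it assumes you have copied each node's fps setting into a plain dictionary by hand, and the node names shown are made up for illustration.

```python
import math


def check_fps_consistency(node_fps: dict, tol: float = 1e-6) -> float:
    """Return the shared fps if every node agrees, otherwise raise ValueError.

    node_fps maps a node label (any string you choose) to its fps setting.
    """
    if not node_fps:
        raise ValueError("no nodes given")
    items = list(node_fps.items())
    ref_name, ref_fps = items[0]
    # Collect every node whose fps differs from the first node's value.
    mismatched = {
        name: fps
        for name, fps in items[1:]
        if not math.isclose(fps, ref_fps, abs_tol=tol)
    }
    if mismatched:
        raise ValueError(
            f"fps mismatch: {ref_name}={ref_fps} but found {mismatched}"
        )
    return ref_fps


if __name__ == "__main__":
    # Hypothetical node labels; use whatever names appear in your workflow.
    settings = {"Load Video": 25.0, "Lip Sync Sampler": 25.0, "Video Combine": 25.0}
    print("all nodes run at", check_fps_consistency(settings), "fps")
```

Catching a mismatch up front matters because, as the post notes, the lip-sync step is the slow part of the workflow.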
LatentSync requires two inputs: a video URL pointing to a source talking-head video and an audio URL pointing to the target audio track you want the subject to appear to speak.
The model documentation recommends using source videos at 720p, 1080p, or 4K resolution, paired with clean, dry audio recordings for optimal output quality.
LatentSync is described as multilingual: it supports multiple languages and accents and adapts to diverse speakers and recording conditions.
The model has a context window of 50,000 tokens as listed in its metadata.
LatentSync does not require 3D meshes or 2D facial landmarks; it generates lip-synced output via audio-conditioned latent diffusion.
According to the model metadata, the training date is listed as August 2025.
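The resolution recommendation above (720p, 1080p, or 4K) can be turned into a quick check on a source video's dimensions. This is a sketch under stated assumptions: it treats the three recommended sizes as minimum thresholds on the frame's shorter side (720, 1080, and 2160 pixels, the usual 16:9 values), which is an interpretation rather than something the model documentation spells out. The function name `recommended_tier` is hypothetical.

```python
from typing import Optional

# Assumed thresholds on the shorter side of the frame, highest tier first.
RECOMMENDED_TIERS = [("4K", 2160), ("1080p", 1080), ("720p", 720)]


def recommended_tier(width: int, height: int) -> Optional[str]:
    """Return the highest recommended tier the frame reaches, or None if below 720p.

    Using the shorter side lets portrait videos (e.g. 1080x1920) be
    classified the same way as landscape ones.
    """
    short_side = min(width, height)
    for label, min_short_side in RECOMMENDED_TIERS:
        if short_side >= min_short_side:
            return label
    return None


if __name__ == "__main__":
    for size in [(1920, 1080), (1080, 1920), (3840, 2160), (640, 480)]:
        tier = recommended_tier(*size)
        status = tier if tier else "below the recommended range"
        print(f"{size[0]}x{size[1]}: {status}")
```

With these thresholds, 1920x1080 and 1080x1920 both map to "1080p", 3840x2160 maps to "4K", and 640x480 falls below the recommended range.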