AI Models Directory

B

ByteDance

2 models

LatentSync

Release date unavailable

LatentSync is an end-to-end lip-synchronization model developed by ByteDance that takes a source talking-head video and a target audio track as inputs and produces a new video with mouth movements aligned to the provided speech. It is built on audio-conditioned latent diffusion, meaning it operates directly in the latent space rather than relying on 3D meshes or 2D facial landmarks. The model supports multiple languages and accents, works with diverse speakers and recording conditions, and handles both real people and stylized characters such as anime. A key technical feature of LatentSync is Temporal REPresentation Alignment (TREPA), a technique designed to reduce flicker, jitter, and frame-to-frame artifacts, keeping head pose and lip motion stable across longer sequences. The model is recommended for use with high-resolution source videos (720p, 1080p, or 4K) paired with clean audio recordings. It is well suited to lip-syncing, audio dubbing, digital human creation, and video-audio alignment workflows.

Context: 50,000 Output: N/A

Input: N/A Output: N/A

Lip Sync

OmniHuman 1.5

Release date unavailable

OmniHuman 1.5 is an avatar animation model developed by ByteDance that converts still images into fully animated digital humans using audio input. It generates synchronized lip movements, facial expressions, and body language by combining audio signals with semantic understanding from Multimodal Large Language Models. The model is built on a dual-system cognitive architecture inspired by System 1 and System 2 theory, enabling both fast reactive animations and deliberate, context-aware responses. It supports a context window of 50,000 tokens and was trained through September 2025. The model works across a wide range of visual styles, including realistic photographs, anime characters, illustrated portraits, and stylized artwork, as well as non-human subjects like animals and anthropomorphic figures. It can produce videos exceeding one minute in length with dynamic motion, camera movement, and multi-character interactions. OmniHuman 1.5 is suited for use cases such as virtual persona creation, NPC animation in games, AI spokesperson production, virtual instructor development, and video content creation without large production teams. It accepts image URLs and audio URLs as inputs.

Context: 50,000 Output: N/A

Input: N/A Output: N/A

K

Kling

1 model

Lip Sync

AI Avatar Standard

Release date unavailable

Kling AI Avatar Standard is an audio-driven talking-head model developed by Kling that animates a single still portrait image into a synchronized speaking video. It accepts a portrait photo and an audio track as inputs, then generates a video with phoneme-aligned lip movements, natural eye blinks, and subtle head motion while preserving the subject's identity throughout. The model supports both real voice recordings and text-to-speech generated audio, and an optional text prompt can influence background style or framing. Output duration is variable and determined by the length of the provided audio, up to a maximum of 10 minutes. Kling AI Avatar Standard is designed for everyday production workflows where reliable, clean avatar video is needed at scale. Typical use cases include explainer videos, customer support avatars, internal training materials, and product demonstrations. For best results, the model expects a clear, front-facing portrait with even lighting and at least 512px resolution, paired with a clean voice recording sampled at 16–48 kHz. It is available via API through WaveSpeed and is accessible on MindStudio without requiring separate API key management.
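
The input recommendations above (a portrait of at least 512px, audio sampled at 16–48 kHz, and at most 10 minutes of audio) can be checked before submitting a job. The following is a minimal sketch, not part of any Kling or WaveSpeed SDK: the names `AvatarInputs` and `check_avatar_inputs` are hypothetical, and reading "at least 512px" as applying to the shorter image side is an assumption.

```python
from dataclasses import dataclass

# Limits taken from the description above. Applying the 512px minimum
# to the shorter image side is an assumption made for this sketch.
MIN_SIDE_PX = 512
MIN_SAMPLE_RATE_HZ = 16_000
MAX_SAMPLE_RATE_HZ = 48_000
MAX_DURATION_S = 10 * 60


@dataclass
class AvatarInputs:
    """Hypothetical container for the properties being checked."""
    image_width: int
    image_height: int
    audio_sample_rate_hz: int
    audio_duration_s: float


def check_avatar_inputs(inputs: AvatarInputs) -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if min(inputs.image_width, inputs.image_height) < MIN_SIDE_PX:
        problems.append(f"portrait shorter side is below {MIN_SIDE_PX}px")
    if not MIN_SAMPLE_RATE_HZ <= inputs.audio_sample_rate_hz <= MAX_SAMPLE_RATE_HZ:
        problems.append("audio sample rate is outside 16-48 kHz")
    if inputs.audio_duration_s > MAX_DURATION_S:
        problems.append("audio is longer than 10 minutes")
    return problems


if __name__ == "__main__":
    sample = AvatarInputs(1024, 1024, 44_100, 90.0)
    print(check_avatar_inputs(sample) or "inputs look fine")
```

Returning a list of problems rather than raising on the first failure lets a caller report every issue with an upload at once.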

Context: 50,000 Output: N/A

Input: N/A Output: N/A

H

HeyGen

1 model

Lip Sync

HeyGen Video Translate

Release date unavailable

HeyGen Video Translate is an AI-powered video localization tool developed by HeyGen that automatically translates spoken content in videos and synchronizes the speaker's lip movements to match the translated audio. The tool accepts a video input and a target language selection, then produces a dubbed version of the video without requiring any technical expertise from the user. It supports a broad range of languages, making it accessible to content creators, marketers, and educators who need to reach international audiences. What distinguishes HeyGen Video Translate from basic subtitle or dubbing tools is its lip-sync technology, which adjusts visible mouth movements to align with the translated audio track, producing a more natural viewing experience. The model has been in continuous development since September 2023 and is designed for non-technical users through a straightforward interface. It is best suited for localizing marketing videos, educational content, corporate communications, and social media clips into multiple languages.

Context: 50,000 Output: N/A

Input: N/A Output: N/A

M

MeiGen

1 model

Lip Sync

InfiniteTalk

Release date unavailable

InfiniteTalk is an audio-driven avatar generation model developed by MeiGen-AI and hosted on WaveSpeedAI. It takes a single portrait photo or silent video paired with an audio track and produces an animated talking or singing video with synchronized lip movements, head poses, facial expressions, and body posture. Built on the Wan 2.1 video diffusion foundation, it uses a sparse-frame processing approach and a rolling 81-frame context window to maintain visual consistency across extended sequences. The model supports output videos up to 10 minutes long and offers both 480p and 720p resolution options. InfiniteTalk is designed for content creators, marketers, educators, and developers who need to produce realistic talking-head videos at scale. It supports any language for lip synchronization and includes a two-person dialogue mode for animating back-and-forth conversations between two speakers. Common use cases include multilingual dubbing and localization, corporate training videos, virtual presenters, podcast visualization, and music video production. Its extended duration support makes it particularly suited for long-form educational content and digital human applications.
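
To illustrate what a rolling 81-frame context window means in general, the sketch below splits a long frame sequence into overlapping 81-frame ranges. It is a generic sliding-window example, not InfiniteTalk's actual implementation: the function name `rolling_windows` and the 9-frame overlap are assumptions chosen for illustration, and the description above does not state how many frames consecutive windows share.

```python
from typing import Iterator


def rolling_windows(
    total_frames: int, window: int = 81, overlap: int = 9
) -> Iterator[tuple[int, int]]:
    """Yield half-open (start, end) frame ranges that cover the sequence.

    Consecutive ranges share `overlap` frames so that each new window
    starts from frames already seen. The overlap value is an assumption.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    start = 0
    while start < total_frames:
        end = min(start + window, total_frames)
        yield start, end
        if end == total_frames:
            break  # the final (possibly shorter) window reached the end
        start += step


if __name__ == "__main__":
    for start, end in rolling_windows(200):
        print(start, end)
```

With 200 frames and the assumed overlap, this yields the ranges (0, 81), (72, 153), and (144, 200); the last window is shorter because the sequence ends.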

Context: 50,000 Output: N/A

Input: N/A Output: N/A
