Lip Sync Animation
Generates frame-accurate lip movements synchronized to an audio input URL, aligning phoneme timing with spoken content.
OmniHuman 1.5 is an avatar animation model developed by ByteDance that converts still images into fully animated digital humans using audio input. It generates synchronized lip movements, facial expressions, and body language by combining audio signals with semantic understanding from multimodal large language models. The model is built on a dual-system cognitive architecture inspired by System 1 and System 2 theory, enabling both fast reactive animations and deliberate, context-aware responses. It supports a context window of 50,000 tokens and was trained on data through September 2025.
The model works across a wide range of visual styles, including realistic photographs, anime characters, illustrated portraits, and stylized artwork, as well as non-human subjects such as animals and anthropomorphic figures. It can produce videos exceeding one minute in length with dynamic motion, camera movement, and multi-character interactions. OmniHuman 1.5 suits use cases such as virtual persona creation, NPC animation in games, AI spokesperson production, virtual instructor development, and video content creation without large production teams. It accepts image URLs and audio URLs as inputs.
Produces micro-expressions and eye movements that reflect the emotional and semantic content of the speech, as interpreted by a multimodal LLM.
Animates a static image, supplied as a URL, into a video. Supported inputs include realistic photos, anime, illustrated portraits, and stylized artwork.
Generates videos longer than one minute with dynamic motion, camera movement, and support for multi-character interactions.
Handles humans, animals, anthropomorphic figures, and cartoon characters, making it usable across diverse visual styles and subject types.
Uses an architecture inspired by System 1 and System 2 theory to simulate both fast, intuitive reactions and deliberate, context-aware body language.
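Frame-accurate lip sync implies that each phoneme interval in the audio has to land on specific video frames. The Python sketch below shows only that timing arithmetic and is not a description of how OmniHuman 1.5 works internally. The function name `phoneme_frame_spans`, the `(label, start, end)` tuple layout, and the default frame rate of 25 fps are assumptions made for illustration.

```python
import math


def phoneme_frame_spans(phonemes, fps=25.0):
    """Map (label, start_s, end_s) phoneme intervals to frame index ranges.

    Returns (label, first_frame, last_frame) tuples, inclusive on both ends.
    The fps default is an assumed value, not a documented model setting.
    """
    spans = []
    for label, start_s, end_s in phonemes:
        if end_s <= start_s:
            raise ValueError(f"empty or reversed interval for {label!r}")
        # Frame that contains the phoneme's first instant.
        first = math.floor(start_s * fps)
        # Frame that contains the last instant before end_s.
        last = max(first, math.ceil(end_s * fps) - 1)
        spans.append((label, first, last))
    return spans


if __name__ == "__main__":
    # Hypothetical phoneme timings in seconds for a one-second clip.
    timings = [("HH", 0.0, 0.25), ("AH", 0.25, 0.5), ("L", 0.5, 1.0)]
    print(phoneme_frame_spans(timings, fps=24.0))
    # -> [('HH', 0, 5), ('AH', 6, 11), ('L', 12, 23)]
```

Adjacent phonemes map to adjacent, non-overlapping frame ranges, which is the property a frame-level alignment needs.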
The configurable options currently documented for this model.
Image: the source image whose subject will be lip-synced.
Audio: the audio track that drives the lip sync.
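As a rough sketch of how the two inputs might be assembled before calling a provider, the Python below checks that both values are http(s) URLs and packs them into a dictionary. The function names and the `image_url` and `audio_url` keys are assumptions for illustration, not documented parameter names. The real request shape depends on the provider's API reference.

```python
from urllib.parse import urlparse


def is_http_url(value: str) -> bool:
    """Return True if value parses as an absolute http(s) URL."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def build_lipsync_inputs(image_url: str, audio_url: str) -> dict:
    """Assemble the image and audio inputs into a plain dict.

    The keys "image_url" and "audio_url" are hypothetical placeholders.
    """
    for label, value in (("image", image_url), ("audio", audio_url)):
        if not is_http_url(value):
            raise ValueError(f"{label} input must be an http(s) URL, got {value!r}")
    return {"image_url": image_url, "audio_url": audio_url}


if __name__ == "__main__":
    # Placeholder URLs; replace with publicly reachable files.
    payload = build_lipsync_inputs(
        "https://example.com/portrait.png",
        "https://example.com/speech.mp3",
    )
    print(payload)
```

Validating locally catches a common mistake, passing a local file path where a URL is expected, before any request is sent.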
OmniHuman 1.5 discussions are most active in r/Freepik_AI. The strongest match in this snapshot has 1 upvote and 3 comments.
I have a complaint about Freepik’s mobile app, and I’m wondering if anyone has info or if Freepik has shared anything official.
On desktop, models like Kling 3.0 Motion Control and Omni Human 1.5 are great—they let us upload videos (and audio for Omni Human) for Motion Control or Lip Sync tasks. But on the mobile app, we can’t upload videos for these models at all. It’s limiting when working on the go.
Has Freepik said when they’ll add video/audio upload support for these models on the Android app? Any timeline or roadmap? Would love to hear if anyone knows more!
Where do I set the length for Omni Human 1.5 outputs? My audio is 15 seconds long, but it only generates 10 seconds, no matter what. Please fix!
OmniHuman 1.5 accepts two input types: an image URL (the source portrait or character image) and an audio URL (the speech or sound that drives the animation).
OmniHuman 1.5 has a context window of 50,000 tokens.
The model supports realistic photographs, anime characters, illustrated portraits, stylized artwork, animals, anthropomorphic figures, and cartoons — not just human faces.
OmniHuman 1.5 can produce videos over one minute in length, with dynamic motion, camera movement, and multi-character interactions.
OmniHuman 1.5 was developed by ByteDance and was trained on data through September 2025.