LLM Model Directory

Explore frontier AI models by provider, pricing, and context

Browse the synced model catalog by provider, release, pricing, and core capabilities.

32 models, 11 providers in view. Current filter: All providers. Type: Video.

OpenAI

2 models

Video

Sora 2 Pro

Release date unavailable

Sora 2 Pro is the premium tier of OpenAI's second-generation video generation model, available to ChatGPT Pro subscribers. It generates videos up to 25 seconds in length at resolutions up to 1080p and frame rates between 24 and 60 fps, with synchronized dialogue, sound effects, and ambient audio produced alongside the video. The model also includes a Cameo feature that lets users inject a consistent character — based on an uploaded video of a person, pet, or object — into any generated scene. Sora 2 Pro is designed for filmmakers, content creators, marketers, and storytellers who require longer, higher-fidelity AI-generated video with professional-grade audio. The model handles complex, multi-part prompts and maintains character and scene continuity across multiple shots within a single clip. It models physical phenomena such as gravity, collisions, and fluid dynamics, and scored approximately 8.5 out of 10 in independent physics evaluations. The model's training data has a cutoff of September 2025.

Video
Context: 5,000 Output: N/A
Input: N/A Output: N/A
Video

Sora 2

Release date unavailable

Sora 2 is OpenAI's video generation model, announced on September 30, 2025. It generates videos up to 10 seconds long with 4K-like detail from text prompts, supporting visual styles including cinematic, photorealistic, and anime. The model integrates audio generation — dialogue, sound effects, and ambient sound — synchronized to on-screen action, including lip-synced character speech, which distinguishes it from its predecessor. Sora 2 is designed for content creators, filmmakers, marketers, and storytellers who want to produce video content from text descriptions. It supports multi-shot scene control, allowing users to maintain character and world continuity across multiple shots within a single video with control over camera angles, lighting, and transitions. A Cameo feature lets users upload a short video of themselves to inject their likeness and voice into generated scenes.

Video
Context: 5,000 Output: N/A
Input: N/A Output: N/A

Google

5 models

Video

Veo 3 Fast

Release date unavailable

Veo 3 Fast is a video generation model developed by Google, available through Google AI Studio under the identifier veo-3.0-fast-generate-001. It is the speed-optimized variant of the Veo 3 model family, designed to produce AI-generated video from text prompts with faster turnaround times than the standard Veo 3 model. It accepts text inputs of up to 480 tokens and outputs video content based on those descriptions. This model is well-suited for workflows where generation speed is a priority, such as rapid prototyping of video concepts, content pipeline automation, and exploratory creative work. Developers and content creators who need to iterate quickly or generate video at scale will find the fast variant a practical choice. It supports image URL inputs alongside text, allowing for additional visual context to guide generation.

Video
Context: 480 Output: N/A
Input: N/A Output: N/A
Video

Veo 3

Release date unavailable

Veo 3 (model ID: veo-3.0-generate-001) is Google's generally available video generation model, accessible through the Vertex AI platform. It is the stable, production-ready successor to the earlier veo-3.0-generate-preview endpoint, which Google has officially deprecated. The model accepts text prompts, image inputs, and configuration parameters to synthesize video content, and it supports a context window of up to 5,000 tokens. Veo 3 is part of Google's broader Veo 3.0 model family, which also includes a fast-generation variant. Google designated this release as the recommended migration target for teams that had been using the preview endpoint, signaling its readiness for large-scale, enterprise workloads. It is best suited for developers and organizations building production applications that require reliable, API-driven video generation through Google Cloud infrastructure.

Video
Context: 5,000 Output: N/A
Input: N/A Output: N/A
Video

Veo 3.1

Release date unavailable

Veo 3.1 is Google's generally available video generation model, released under the identifier veo-3.1-generate-001 and accessible through Google's Vertex AI platform. It is the production-ready successor to the veo-3.1-generate-preview endpoint, making it the recommended migration target for developers who built on the preview version. The model generates video content from text prompts and supports image-based inputs, enabling a range of media creation workflows. It is part of Google's broader Veo model family, which includes multiple generation variants. Veo 3.1 is designed for developers and businesses that need a reliable, supported API for AI video generation at scale. Its stable endpoint status means it carries production-grade support commitments, distinguishing it from preview or experimental releases. The model accepts text prompts, individual image URLs, and image arrays as inputs, and supports a seed parameter for reproducible outputs. It is well suited for applications such as marketing content pipelines, automated media production, and any workflow requiring consistent, repeatable video generation.

Video
Context: 5,000 Output: N/A
Input: N/A Output: N/A
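The preview-to-stable migration described for Veo 3 and Veo 3.1 can be sketched as a simple lookup. This is a minimal illustration, not part of any Google SDK: the model IDs come from the descriptions above, while `STABLE_VEO_ID` and `migrate_model_id` are hypothetical helper names. The Veo 3.1 Fast preview ID is not named in this listing, so it is left out.

```python
# Preview-to-stable endpoint IDs, copied from the catalog descriptions above.
# The dictionary and function names are illustrative, not part of any SDK.
STABLE_VEO_ID = {
    "veo-3.0-generate-preview": "veo-3.0-generate-001",
    "veo-3.1-generate-preview": "veo-3.1-generate-001",
}


def migrate_model_id(model_id: str) -> str:
    """Return the stable ID for a known preview ID, else the input unchanged."""
    return STABLE_VEO_ID.get(model_id, model_id)


if __name__ == "__main__":
    for model_id in ("veo-3.0-generate-preview", "veo-3.1-generate-001"):
        print(model_id, "->", migrate_model_id(model_id))
```

Returning unknown IDs unchanged keeps the helper safe to apply across a whole configuration file that mixes Veo and non-Veo models.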
Video

Veo 3.1 Fast

Release date unavailable

Veo 3.1 Fast is a video generation model developed by Google, part of the Veo 3.1 model family. It is optimized for speed, making it suitable for workflows that require rapid video output at scale. The model is available through both the Gemini API and Vertex AI, giving developers two integration paths for production use. The stable endpoint identifier is veo-3.1-fast-generate-001, which replaced an earlier preview endpoint. Veo 3.1 Fast accepts text prompts as well as image inputs, including single images and image arrays, allowing for both text-to-video and image-to-video generation workflows. It supports configuration options such as aspect ratio, duration, and resolution through toggle-based parameters. The model is best suited for developers and creators who need to generate video content quickly at scale and require a stable, production-grade API endpoint for integration into their pipelines.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
Video

Veo 2

Release date unavailable

Veo 2 is Google's production-ready video generation model, released in April 2025 via the Gemini API under the model ID veo-2.0-generate-001. It accepts both text prompts and reference images as input, generating high-definition video output at resolutions up to 4K. The model includes physics-aware rendering that handles fluid dynamics, lighting, and object interactions, and it embeds SynthID watermarking in all generated videos to identify AI-created content. Veo 2 is available through both the Gemini API and Google's Vertex AI platform, making it accessible to developers via standard API calls without specialized infrastructure. It supports cinematic prompt controls such as aerial shots, panning, and time-lapses, and maintains consistent character appearance across scenes. The model is suited for developers, marketers, creative professionals, and educators who need to generate video content programmatically for use cases like product demos, ad campaigns, and educational visualizations.

Video
Context: 5,000 Output: N/A
Input: N/A Output: N/A

X.ai

1 model


ByteDance

5 models

Video

DreamActor V2

Release date unavailable

DreamActor V2 is a video generation model developed by ByteDance that animates static images by transferring motion from a reference driving video onto a target character. It is the second generation of ByteDance's DreamActor series and was made available in February 2026. Rather than relying on skeleton extraction or pose estimation pipelines, it uses a spatiotemporal in-context learning framework that reads motion directly from raw video pixels, which allows it to handle character types that traditional pose-based methods struggle with, including animals, cartoon mascots, fantasy creatures, and 3D renders. DreamActor V2 accepts two inputs — a character image and a driving video — and produces animated video outputs up to 15 seconds at 720p resolution across a range of aspect ratios. It transfers facial expressions, head orientation, eye direction, lip movement, hand gestures, and full-body motion while maintaining the structural consistency of the source character across frames. This makes it applicable to use cases such as social media content creation, brand animation, virtual avatar production, game asset prototyping, and educational video generation.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
Video

Seedance 1.5 Pro

Release date unavailable

Seedance 1.5 Pro is an image-to-video generation model developed by ByteDance that transforms static images into cinematic video clips at up to 1080p resolution. It uses a dual-branch Diffusion-Transformer (DB-DiT) architecture to generate video and audio simultaneously in a single pass, producing millisecond-level lip-sync and environmental audio without requiring post-production editing. Videos can range from 5 to 10 seconds in duration and support aspect ratios including 16:9, 9:16, and 21:9. What distinguishes Seedance 1.5 Pro is its native audio-visual synthesis, which generates speech, sound effects, and ambient audio in sync with the video rather than layering them separately afterward. It supports multilingual lip-sync across six languages and offers over 15 controllable camera movements — such as dolly zooms, tracking shots, and orbits — specified through text prompts. The model is well-suited for content creators, marketers, and developers working on dialogue-driven content, social media clips, and multilingual voiceover projects where visual consistency and synchronized audio are required.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
Video

Seedance 2.0

Release date unavailable

Seedance 2.0 generates Hollywood-grade cinematic videos from text prompts with native audio-visual synchronization, director-level camera and lighting control, and exceptional motion stability. Built on Seed's unified multimodal architecture, it leads on instruction adherence, motion quality, and visual aesthetics.

Video
Context: 50,000 Output: 10,000 tokens
Input: N/A Output: N/A
Video

Seedance 2.0 Fast

Release date unavailable

Seedance 2.0 generates Hollywood-grade cinematic videos from text prompts with native audio-visual synchronization, director-level camera and lighting control, and exceptional motion stability. Built on Seed's unified multimodal architecture, it leads on instruction adherence, motion quality, and visual aesthetics.

Video
Context: 50,000 Output: 10,000 tokens
Input: N/A Output: N/A
Video

Seedance 2.0 Fast Turbo

Release date unavailable

Seedance 2.0 generates Hollywood-grade cinematic videos from text prompts with native audio-visual synchronization, director-level camera and lighting control, and exceptional motion stability. Built on Seed's unified multimodal architecture, it leads on instruction adherence, motion quality, and visual aesthetics.

Video
Context: 50,000 Output: 10,000 tokens
Input: N/A Output: N/A

Kling

7 models

Video

Kling 3.0

Release date unavailable

Kling 3.0 is a video generation model in Kuaishou's Kling family, listed with a training date of February 2026. It supports both text-to-video and image-to-video workflows, accepting text prompts, image URLs, and multiple configuration options as inputs. The model is identified by the ID kling-video-v3.0-std and is available on MindStudio. Kling 3.0 is suited for creators and developers who need to generate video content from written descriptions or existing images. Its dual input support makes it flexible for use cases ranging from concept visualization to animating static imagery. The model accepts a context window of up to 10,000 tokens, giving users room to provide detailed prompts and configuration parameters.

Video
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Video

Kling 2.6

Release date unavailable

Kling 2.6 is a video generation model in Kuaishou's Kling family, capable of producing videos from text prompts or input images. It supports both text-to-video and image-to-video workflows, accepting text descriptions, image URLs, and selection-based inputs to guide the generation process. The model was added to MindStudio in March 2026 and carries a training date of December 2025. Kling 2.6 is suited for creators and developers who need to generate video content programmatically without managing their own infrastructure. Its dual input modality (text and image) makes it applicable to a range of use cases including content creation, storyboarding, and visual prototyping. The model operates under the identifier kling-video-v2.6-std and is accessible through MindStudio without requiring separate API key configuration.

Video
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Video

Kling 2.6 Pro Motion Control

Release date unavailable

Kling V2.6 Pro Motion Control is an AI video generation model developed by Kuaishou Technology that animates static character images by extracting and transferring motion from real reference video clips. Rather than generating movement from text descriptions alone, it uses a 3D face and body reconstruction system built on deep learning-based 3D modeling to map human faces and body movements from 2D inputs, then applies those motion paths frame-by-frame to a subject image. The model runs on a Diffusion Transformer Architecture and produces output at 30 frames per second with coherent motion transitions throughout the generated clip. The model accepts reference videos between 3 and 30 seconds in length and supports a wide range of movement types, including dance routines, martial arts, walking cycles, and subtle gestures. It preserves the subject's appearance consistently across all frames without identity drift, and it supports optional text prompts to adjust scene styling, lighting, and atmosphere while keeping the motion transfer intact. Kling V2.6 Pro Motion Control is well suited for social media character animation, brand mascot animation, film pre-production prototyping, digital human content creation, and educational demonstrations.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
Video

Kling 3.0 Motion Control

Release date unavailable

Kling 3.0 Motion Control is a video generation model in Kuaishou's Kling family that specializes in motion transfer. It takes a reference video and a source still image as inputs, then animates the still image by applying the motion patterns extracted from the reference video. This makes it distinct from standard text-to-video or image-to-video models, as the motion itself is explicitly guided by an existing video clip rather than inferred from a prompt alone. The model is well suited for workflows where consistent, repeatable motion is required across different subjects or scenes, for example applying a specific walking cycle, gesture, or camera movement to a new character or background image. It accepts image URLs, video URLs, text, and configuration inputs, giving users control over how the motion transfer is applied. With a context window of 1,000 tokens, it is designed for focused, single-generation tasks rather than extended multi-turn interactions.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
Video

Kling 3.0 Pro

Release date unavailable

Kling 3.0 Pro is a video generation model in Kuaishou's Kling family, designed to produce video content from both text prompts and image inputs. It represents the 3.0 Pro tier of Kling's video model lineup, with a training cutoff of February 2026 and availability on MindStudio starting March 2026. The model accepts text descriptions, image URLs, and configurable selection parameters to control output characteristics. Kling 3.0 Pro is suited for workflows that require generating video from written descriptions or existing images, making it applicable to content creation, prototyping, and visual storytelling tasks. Its support for both text-to-video and image-to-video modalities gives it flexibility across different starting points for video production. The model operates with a context window of 10,000 tokens, accommodating detailed prompts for more precise video generation.

Video
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Video

Kling O1

Release date unavailable

Kling Video O1 is an AI video generation model developed by Kuaishou Technology, built on a Multimodal Visual Language (MVL) framework that accepts text, images, and video as inputs within a single unified system. The model supports three distinct operating modes — Reference Images, Reference Video, and Video Editing — allowing creators to animate static visuals, generate or extend footage from a reference video, or modify specific elements within an existing clip while leaving the rest of the scene intact. A defining feature of Kling Video O1 is its Elements system, which lets users upload up to four images of a character or object from different angles to give the model a near-3D understanding of the subject. This enables consistent identity preservation across multiple shots and dynamic camera movements, addressing a common challenge in AI video generation. The model is well suited for use cases in film production, advertising, and social media content creation where reference-driven control and shot-to-shot consistency are required.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
Video

Kling O3

Release date unavailable

Kling Video O3, also known as Kling 3.0 Omni, is a video generation model developed by Kuaishou and launched in February 2026. It is the premium tier of the Kling 3.0 model family, designed specifically for structured, multi-shot storytelling rather than single isolated clips. The model accepts text, images, and video as inputs, and uses Multimodal Visual Language (MVL) technology to reason about scene composition, spatial relationships, and motion in a unified pass. It supports clip lengths of up to 15 seconds across up to six distinct shots generated in a single request. Kling Video O3 is built for workflows where visual consistency is critical — such as brand marketing, recurring character content, and cinematic pre-production. It preserves a subject's exact appearance, including facial features, clothing, logos, and on-screen text, across shots and scene transitions when a reference image or video is provided. The model also generates synchronized audio natively alongside video, covering ambient sound, dialogue, and multilingual lip-sync without requiring separate post-production. It is best suited for production scenarios where a character, product, or campaign identity has already been defined and consistent output at scale is the goal.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
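The multi-shot limits quoted for Kling Video O3 (up to six shots, up to 15 seconds) can be checked before a request is submitted. The sketch below is an assumption-laden illustration: the `Shot` type and `validate_shot_plan` function are hypothetical and not part of any Kling API, and it assumes the 15-second limit applies to the combined length of all shots.

```python
from dataclasses import dataclass

MAX_SHOTS = 6            # "up to six distinct shots" per request
MAX_TOTAL_SECONDS = 15   # assumed to cap the combined length of all shots


@dataclass
class Shot:
    """Hypothetical description of one shot in a multi-shot request."""
    prompt: str
    seconds: float


def validate_shot_plan(shots: list[Shot]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the plan fits."""
    problems = []
    if not shots:
        problems.append("plan has no shots")
    if len(shots) > MAX_SHOTS:
        problems.append(f"{len(shots)} shots exceeds the limit of {MAX_SHOTS}")
    if any(shot.seconds <= 0 for shot in shots):
        problems.append("every shot needs a positive duration")
    total = sum(shot.seconds for shot in shots)
    if total > MAX_TOTAL_SECONDS:
        problems.append(f"total {total:g}s exceeds the limit of {MAX_TOTAL_SECONDS}s")
    return problems


if __name__ == "__main__":
    plan = [Shot("wide establishing shot", 5), Shot("close-up on the product", 4)]
    print(validate_shot_plan(plan) or "plan fits the stated limits")
```

Collecting every problem rather than raising on the first one lets a caller show all issues with a storyboard at once.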

Wan

4 models

Video

Wan 2.6

Release date unavailable

Wan 2.6 is a video generation model developed by Alibaba that produces 1080p video at 24 frames per second for clips up to 15 seconds in length. It accepts text, image, or video as input and generates complete video output — including synchronized audio, dialogue, sound effects, and lip movements — in a single generation pass, without requiring a separate audio pipeline. The model was trained with a cutoff of December 2025 and is available as an open-source release. Wan 2.6 is designed for creators, marketers, and developers who need publish-ready video content without extensive post-production work. Its distinguishing features include multi-shot narrative handling across a single clip, character consistency when using reference figures, physics simulation for realistic motion, and style transfer from reference videos. These capabilities make it suited for use cases such as social media content, product demonstrations, commercials, and short narrative sequences.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
Video

Wan 2.2

Jul 28, 2025

Wan 2.2 is a multimodal video generation model developed by Alibaba's Tongyi Laboratory and released in July 2025 under the Apache 2.0 license. Alibaba describes it as the first open-source video diffusion model to apply a Mixture-of-Experts (MoE) architecture, which splits processing between high-noise expert networks that handle overall layout and composition and low-noise expert networks that refine fine details. The model supports both text-to-video and image-to-video generation, with native bilingual prompting in English and Chinese. It is available in a 5B parameter variant suited for consumer hardware and a 14B parameter variant for higher-quality output. Wan 2.2 was trained on a dataset expanded significantly from its predecessor, with image data increasing by 65.6% and video data by 83.2%. It includes a dedicated aesthetic fine-tuning stage informed by film industry standards, further refined through reinforcement learning to align with human visual preferences. Specialized modules (Wan-Animate and Wan-Move) allow users to animate a character from a single image or transfer motion from one video to another subject. The model is natively supported by ComfyUI and accepts LoRA adapters and source images as inputs alongside text prompts.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
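The high-noise / low-noise expert split described for Wan 2.2 can be illustrated with a toy router. This is a conceptual sketch only, not Wan 2.2's actual code: it assumes each denoising step is sent to exactly one expert based on a noise level in [0, 1], and the `boundary` value of 0.5 and the stand-in expert functions are hypothetical.

```python
from typing import Callable, List

# A denoiser takes a latent and the current noise level and returns a new latent.
Denoiser = Callable[[List[float], float], List[float]]


def make_two_expert_denoiser(
    high_noise: Denoiser, low_noise: Denoiser, boundary: float = 0.5
) -> Denoiser:
    """Route each step to one expert by noise level (1.0 = pure noise).

    The boundary is a hypothetical value chosen for illustration.
    """
    def denoise(latent: List[float], noise_level: float) -> List[float]:
        expert = high_noise if noise_level >= boundary else low_noise
        return expert(latent, noise_level)

    return denoise


# Toy stand-ins for the two expert networks.
def layout_expert(latent: List[float], noise_level: float) -> List[float]:
    return [x * 0.5 for x in latent]   # coarse changes early in sampling


def detail_expert(latent: List[float], noise_level: float) -> List[float]:
    return [x * 0.9 for x in latent]   # small refinements late in sampling


if __name__ == "__main__":
    denoise = make_two_expert_denoiser(layout_expert, detail_expert)
    latent = [1.0, -2.0]
    for level in (1.0, 0.75, 0.25, 0.0):   # noise decreases over the schedule
        latent = denoise(latent, level)
    print(latent)
```

Because only one expert runs per step, a design like this can raise total parameter count without raising the per-step compute, which is the usual motivation for MoE.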
Video

Wan 2.5

Release date unavailable

Wan 2.5 is an open-source AI video generation model developed by Alibaba's DAMO Academy. It generates videos up to 10 seconds long at resolutions ranging from 480p to 1080p HD, with native 4K available in preview, all rendered at 24 frames per second. The model's defining characteristic is its ability to generate audio and video simultaneously in a single step — producing character dialogue with lip-sync, environmental ambient sounds, and background music directly from a text or image prompt, without requiring separate post-production audio work. It supports multiple input modes including text-to-video, image-to-video, audio-to-video, and video-to-video refinement. Wan 2.5 is designed for content creators, filmmakers, advertisers, and developers who need production-ready video with synchronized audio. It supports cinematic camera controls such as dolly, tracking, and crane movements, as well as lighting styles, depth of field, and particle effects like rain and fire. The model handles photorealistic, anime, illustrated, and stylized visual aesthetics, and processes prompts in at least 8 languages with matching audio generation. Its open-source nature makes it accessible for local deployment and integration into custom pipelines.

Video
Context: 2,000 Output: N/A
Input: N/A Output: N/A
Video

Wan 2.7

Release date unavailable

Description unavailable

Video
Context: 50,000 Output: 10,000 tokens
Input: N/A Output: N/A

Luma Labs

2 models

Video

Ray 2

Release date unavailable

Ray 2 is a large-scale AI video generation model developed by Luma Labs, released in January 2025. It runs on approximately 10 times the compute of its predecessor, Ray 1.6, and is built on a multi-modal architecture trained directly on video sequences rather than individual frames. This training approach gives the model an understanding of natural motion, lighting behavior, and physical object interactions. Ray 2 accepts text prompts, images, or video as input and generates clips ranging from 5 to 9 seconds, extendable up to 30 seconds, at resolutions up to 1080p with optional 4K upscaling. Ray 2 supports multiple aspect ratios including 16:9, 9:16, 1:1, and 21:9, and includes keyframe control so users can define start frames, end frames, or both for precise scene direction. A speed-optimized variant called Ray 2 Flash delivers comparable visual quality in roughly one-third the render time, making it suitable for rapid iteration. The model is available through Luma AI's own platform and via Amazon Bedrock, where AWS serves as the exclusive cloud provider for fully managed access. It is used across industries including advertising, entertainment, architecture, fashion, film, and music production.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
Video

Ray Flash 2

Release date unavailable

Ray Flash 2 is a video generation model developed by Luma AI, designed to produce AI-generated videos from text prompts and images. It is the speed-optimized variant within the Ray 2 model family, prioritizing throughput and fast iteration while maintaining visual quality. As part of Luma's broader model lineup, it sits alongside Ray 2, Ray 1.6, and the image model Photon Flash 1, giving users options across different speed and quality trade-offs. Ray Flash 2 is well suited for workflows that require rapid video generation or high-volume production, such as creative prototyping or iterative content development. It is accessible through multiple platforms, including ComfyUI's Partner Nodes system for visual AI workflows and the Codenteam Intersect platform for developer integrations. The model accepts both text and image inputs, supporting flexible generation modes for a range of creative and technical use cases.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A

Lightricks

3 models

Video

LTX-2 19b

Release date unavailable

LTX-2 19B is an open-source video generation model developed by Lightricks and released on January 6, 2026. It uses an asymmetric dual-stream Diffusion Transformer architecture to generate video and synchronized audio together in a single unified process, rather than producing silent video and adding audio as a separate step. The model accepts text prompts, reference images, or existing video clips as input and outputs native 4K video with flexible frame-rate control and support for extended clip durations. What distinguishes LTX-2 19B is its simultaneous audiovisual output, where ambient sound, environmental effects, and speech synchronization are generated alongside the video frames. The model supports LoRA fine-tuning for camera motion control and custom stylization, and offers NVFP4 and FP8 quantization formats that reduce VRAM usage by up to 60% and accelerate generation up to 3x. A distilled 8-step fast generation mode runs 5–6 times faster than the full model, and on an RTX 4090 with NVFP4 quantization an 8-second 720p clip can be produced in approximately 25 seconds. It is well suited for film-style storytelling, advertising production, and any workflow requiring tight audiovisual coherence.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
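The performance figures quoted for LTX-2 19B can be turned into rough planning numbers. The arithmetic below uses only the values stated in the description (25 seconds to render an 8-second clip, VRAM reduced by up to 60%). The 40 GB baseline is a hypothetical placeholder, not a measured requirement, and both function names are illustrative.

```python
def render_seconds_per_video_second(render_seconds: float, clip_seconds: float) -> float:
    """How many seconds of compute are spent per second of finished video."""
    return render_seconds / clip_seconds


def vram_after_reduction(baseline_gb: float, reduction: float) -> float:
    """VRAM left after a fractional reduction (0.60 means 60% less)."""
    return baseline_gb * (1.0 - reduction)


if __name__ == "__main__":
    # Quoted figure: an 8-second 720p clip in approximately 25 seconds.
    print(render_seconds_per_video_second(25, 8))   # 3.125

    # Hypothetical 40 GB baseline with the quoted "up to 60%" reduction.
    print(round(vram_after_reduction(40.0, 0.60), 2))
```

Since the description states the 60% figure as an upper bound, the second result is a best case rather than an expected value.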
Video

LTX-2.3

Release date unavailable

LTX-2.3 is a multimodal video generation model developed by Lightricks and released in March 2026. Built on a Diffusion Transformer architecture with 22 billion parameters, it generates synchronized audio and video in a single forward pass at resolutions up to 4K at 50 frames per second, for clips up to 20 seconds long. It is available as open-source software with open weights under a permissive license, and can be run locally, accessed via API, or deployed on-premises. The model introduces several architectural updates over its predecessor, including a rebuilt variational autoencoder for sharper texture and edge detail, a gated attention text connector for improved prompt adherence, and an upgraded vocoder trained on filtered audio data for cleaner output. It supports native portrait-mode output at 1080×1920 and ships in four checkpoint variants — dev, distilled, fast, and pro — with the distilled variant completing generation in as few as 8 denoising steps. LTX-2.3 is aimed at independent creators, small studios, and developers who need a production-ready open-source foundation for video creation without licensing fees.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
Video

LTX-2.3 LoRA

Release date unavailable

LTX-2.3 LoRA is a Low-Rank Adaptation fine-tuning system built on top of Lightricks' LTX-2.3 video generation model, released in January 2026. Rather than retraining the full model, LoRA adapters allow users to teach the base model new characters, visual styles, or motion behaviors at a fraction of the computational cost. The system supports both text-to-video and image-to-video generation workflows, and LoRAs trained on the earlier LTX-2.0 model are reported to retain compatibility with the 2.3 update. LTX-2.3 LoRA is designed for creators and developers who need stylistically consistent output across AI-generated video sequences, such as animation, storytelling, or visual effects production. It supports multi-character generation with consistent appearance across frames, style transfer, and community-developed camera movement controls including dolly in and out. The model runs locally using open-source tooling and has gained traction in the Stable Diffusion community for its character and style fidelity in generated video content.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A

MiniMax

1 model


PixVerse

1 model


Runway

1 model
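Anyone consuming this listing programmatically will likely need to normalize the "Context: … Output: …" lines into numbers. The following is a minimal sketch under stated assumptions: `parse_token_count` and `parse_context_line` are hypothetical helpers, not part of the catalog's own tooling, and they accept common spellings of a token count (plain digits, thousands separators, a K or M suffix, an optional "tokens" unit, or N/A).

```python
import re
from typing import Dict, Optional


def parse_token_count(text: str) -> Optional[int]:
    """Convert a token-count string to an int; 'N/A' or empty becomes None."""
    value = text.strip().upper().replace(",", "")
    if value in ("", "N/A"):
        return None
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(K|M)?(?:\s*TOKENS)?", value)
    if match is None:
        raise ValueError(f"unrecognized token count: {text!r}")
    number, suffix = match.groups()
    scale = {None: 1, "K": 1_000, "M": 1_000_000}[suffix]
    return int(float(number) * scale)


def parse_context_line(line: str) -> Dict[str, Optional[int]]:
    """Split a 'Context: X Output: Y' line into numeric fields."""
    match = re.fullmatch(r"Context:\s*(.+?)\s+Output:\s*(.+)", line.strip())
    if match is None:
        raise ValueError(f"not a context line: {line!r}")
    return {
        "context": parse_token_count(match.group(1)),
        "output": parse_token_count(match.group(2)),
    }


if __name__ == "__main__":
    print(parse_context_line("Context: 5,000 Output: N/A"))
    print(parse_context_line("Context: 10,000 Output: 10,000 tokens"))
```

Mapping N/A to `None` rather than zero keeps "unknown" distinguishable from a genuine limit of zero when the values are sorted or filtered.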