Unified AV Generation
Generates video and scene-aware audio simultaneously in one pass using a dual-stream Diffusion Transformer, eliminating the sync issues common in separate audio-video pipelines.
LTX-2 19B is an open-source video generation model developed by Lightricks and released on January 6, 2026. It uses an asymmetric dual-stream Diffusion Transformer architecture to generate video and synchronized audio together in a single unified process, rather than producing silent video and adding audio as a separate step. The model accepts text prompts, reference images, or existing video clips as input and outputs native 4K video with flexible frame-rate control and support for extended clip durations. What distinguishes LTX-2 19B is its simultaneous audiovisual output, where ambient sound, environmental effects, and speech synchronization are generated alongside the video frames. The model supports LoRA fine-tuning for camera motion control and custom stylization, and offers NVFP4 and FP8 quantization formats that reduce VRAM usage by up to 60% and accelerate generation up to 3x. A distilled 8-step fast generation mode runs 5–6 times faster than the full model, and on an RTX 4090 with NVFP4 quantization an 8-second 720p clip can be produced in approximately 25 seconds. It is well suited for film-style storytelling, advertising production, and any workflow requiring tight audiovisual coherence.
Produces video at native 4K resolution with flexible frame-rate control and support for extended clip durations beyond standard short-form outputs.
Accepts a reference image URL as input and animates it into a video clip, preserving visual content from the source image across generated frames.
Supports Low-Rank Adaptation (LoRA) modules for precise camera motion control, enabling film-style cinematography directions such as pans, zooms, and tracking shots.
Supports NVFP4 and FP8 quantization formats that reduce VRAM usage by up to 60% and accelerate generation up to 3x compared to full-precision inference.
Offers an 8-step distilled generation mode that runs 5–6x faster than the full model, producing an 8-second 720p clip in approximately 25 seconds on an RTX 4090 with NVFP4.
Generates video directly from text prompts, translating scene descriptions into temporally stable video clips with synchronized audio.
Accepts a manual seed value as input, allowing reproducible generation runs and controlled variation across outputs.
Configurable options currently documented for this model:
LoRAs: up to 3.
Seed: a specific value used to guide the randomness of the generation.
LTX-2 19B discussions are most active in r/StableDiffusion, r/comfyui, and r/LocalLLaMA. Top Reddit threads cluster around benchmarks and model comparisons.
The strongest match in this snapshot has 984 upvotes and 216 comments.
[https://civitai.com/models/2304098](https://civitai.com/models/2304098)
The examples shown in the preview video are a mix of 1280x720 and 848x480, with a few 640x640 thrown in. I really just wanted to showcase what the model can do and the fact it can run well. Feel free to mess with some of the settings to get what you want. Most of the nodes that you need to mess with if you want to tweak are still open. The ones that are all closed and grouped up can be ignored unless you want to modify more. For most people just set it and forget it!
These are two workflows that I've been using for my setup.
I have 12GB VRAM and 48GB system RAM and I can run these easily.
The T2V is set for 1280x720 and I usually get a 5s video in a little under 5 minutes. You can absolutely lessen that. I was making videos in 848x480 in about 2 minutes. So, it can FLY!
This does not use any fancy nodes (one node from Kijai KJNodes pack to load audio VAE and of course the GGUF node to load the GGUF model), no special optimization. It's just a standard workflow so you don't need anything like Sage, Flash Attention, that one thing that goes "PING!"... not needed.
I2V is set for a resolution of 640x640 but I have left a note in the spot where you can define your own resolution. I would stick to the 480-640 range (adjust for widescreen etc.); the higher the res, the better. You CAN absolutely do 1280x720 videos in I2V as well but they will take FOREVER. Talking like 3-5 minutes on the upscale PER ITERATION!! But, the results are much much better!
Links to the models used are right next to the models section, notes on what you need also there.
This is the native comfy workflow that has been altered to include the GGUF, separated VAE, clip connector, and a few other things. Should be just plug and play. Load in the workflow, download and set your models, test.
I have left a nice little prompt to use for T2V. For I2V I'll include the prompt and provide the image used.
Drop a note if this helps anyone out there. I just want everyone to enjoy this new model because it is a lot of fun. It's not perfect but it is a meme factory for sure.
If I missed anything, or you have any questions, comments, anything at all, just drop a line and I'll do my best to respond. Hopefully if you have a question I have an answer!
I’m comparing LTX-2 outputs with the same setup and found something interesting.
Setup:
* LTX-2 IC-LoRA (Pose) I2V
* Sampler: Euler Simple
* Steps: 8
* (+ refine 3 steps)
Models tested:
1. `ltx-2-19b-distilled-fp8`
2. `ltx-2-19b-dev-fp8.safetensors` \+ `ltx-2-19b-distilled-lora-384` (strength **1.0**)
3. `ltx-2-19b-dev-fp8.safetensors` \+ `ltx-2-19b-distilled-lora-384` (strength **0.6**)
workflow + other results:
* [https://scrapbox.io/work4ai/ltx-2-19b-distilled\_vs\_ltx-2-19b-distilled-lora](https://scrapbox.io/work4ai/ltx-2-19b-distilled_vs_ltx-2-19b-distilled-lora)
As you can see, `ltx-2-19b-distilled` and the dev model with `ltx-2-19b-distilled-lora` at **strength 1.0** end up producing almost the same result in my tests. That consistency is nice, but both also tend to share the same downside: the output often looks “overcooked” in an AI-ish way (plastic skin, burn-out / blown highlights, etc.).
With the recommended **LoRA strength 0.6**, the result looks a lot more natural and the harsh artifacts are noticeably reduced.
I started looking into this because the distilled LoRA is huge (\~7.67GB), so I wanted to replace it with the distilled checkpoint to save space. But for my setup, the distilled checkpoint basically behaves like “LoRA = 1.0”, and I can’t get the nicer look I’m getting at 0.6 even after trying a few sampling tweaks.
If you’re seeing similar plastic/burn-out artifacts with `ltx-2-19b-distilled(-fp8)`, I’d suggest using the LoRA instead — at least with the LoRA you can adjust the strength.
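The strength comparison above can be pictured with the usual LoRA merge rule, where a low-rank update is scaled before being added to the base weights. The following is a minimal sketch on tiny plain-Python matrices, not the actual ComfyUI or LTX-2 loading code. The function name `apply_lora` and the toy values are hypothetical, and real loaders may also apply an alpha/rank scaling factor.

```python
def apply_lora(base, up, down, strength):
    """Return base + strength * (up @ down) for small nested-list matrices.

    base: out x in, up: out x rank, down: rank x in.
    Illustrative only; names and shapes are assumptions, not LTX-2 internals.
    """
    rank = len(down)
    merged = []
    for i, row in enumerate(base):
        new_row = []
        for j, weight in enumerate(row):
            # Low-rank update for this weight: sum over the rank dimension.
            delta = sum(up[i][k] * down[k][j] for k in range(rank))
            new_row.append(weight + strength * delta)
        merged.append(new_row)
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]   # toy "dev" weights
up = [[1.0], [2.0]]               # toy rank-1 LoRA factors
down = [[0.5, 0.5]]

full = apply_lora(base, up, down, 1.0)     # update applied at full strength
softer = apply_lora(base, up, down, 0.6)   # same update, scaled down
print(full)
print(softer)
```

Under this picture, a checkpoint with the update already merged at full strength leaves no strength value to turn down, which would be consistent with the "LoRA = 1.0" behaviour reported above.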
New version of Workflow (v2):
[https://github.com/RageCat73/RCWorkflows/blob/main/011426-LTX2-AudioSync-i2v-Ver2.json](https://github.com/RageCat73/RCWorkflows/blob/main/011426-LTX2-AudioSync-i2v-Ver2.json)
This is a follow-up to my previous post - please read it for more information and context:
[https://www.reddit.com/r/StableDiffusion/comments/1qcc81m/ltx2\_audio\_synced\_to\_added\_mp3\_i2v\_6\_examples\_3/?utm\_source=share&utm\_medium=web3x&utm\_name=web3xcss&utm\_term=1&utm\_content=share\_button](https://www.reddit.com/r/StableDiffusion/comments/1qcc81m/ltx2_audio_synced_to_added_mp3_i2v_6_examples_3/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)
Thanks to user u/foxdit for pointing out that the strength of the LTX Distill Lora 384 can greatly affect the quality of realistic people. This new workflow sets it to .6
Credit MUST go to Kijai for introducing the first workflows that have the Mel-Band model that makes this possible. I hear he doesn't have much time to devote to refining workflows so it's up to the community to take what he gives us and build on them.
There is also an optional detail lora in the upscale group/node. It's disabled in my new workflow by default to save memory, but setting it to .3 is another recommendation. You can see the results for yourself in the video.
Bear in mind the video is going to get compressed by Reddit's servers but you'll still be able to see a significant difference. If you want to see the original 110 mb video, let me know and I'll send a Google drive link to it. I'd rather not open up my Google drive to everyone publicly.
The new workflow is also friendlier to beginners: it has better notes and literally has areas and nodes labelled Steps 1-7. It moves the Load Audio node closer to the Load image and trim audio nodes as well. Overall, it's minor improvements. If you already have the other one, it may not be worth it unless you're curious.
The new workflow has ALL the download links to the models and LORAs, but I'll also paste them below. I'll try to answer questions if I can, but there may be a delay of a day or 2 depending on your timezone and my free time.
Based on this new testing, I really can't recommend the distilled only model (8step model) because the distilled workflows don't have any way to alter the strength of the LORA that is baked inherently into the model. Some people may be limited to that model due to hardware constraints.
**IMPORTANT NOTE ABOUT PROMPT (updated 1/16/26): FOR BEST RESULTS, add the lyrics of the song or a transcript of the words being spoken in the prompt. In further experiments, this helps a lot.**
**The woman sings the words: "My Tea's gone cold I'm wondering why got out of bed at all..." will help to trigger the lip sync. Sometimes you only need the first few words of the lyric, but it may be best to include as many of the words as possible for a good lip sync. Also add emotions and expressions to the prompt as well or go with: the woman sings with passion and emotion if you want to be generic.**
**IMPORTANT NOTE ABOUT RESOLUTION: My workflow is set to 480x832 (portrait) as a STARTING resolution. Change that to what you think your system can handle. You MUST change that to 832x480 or higher if you use a widescreen image; otherwise, you're going to get a VERY small video. Look at the Preview node for what the final resolution of the image will be. Remember, it must be divisible by 32, but the resize node in Step 2 handles that. Please read the notes in the workflow if you're a beginner.**
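For anyone setting sizes by hand instead of relying on the resize node, the divisible-by-32 rule can be sketched as below. This is only an illustration of the arithmetic, not what the workflow's resize node actually does. The helper names `snap_to_multiple` and `snap_resolution` are made up for this example.

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round value to the nearest multiple (ties round up), at least one multiple."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


def snap_resolution(width: int, height: int, multiple: int = 32):
    """Snap both dimensions independently; aspect ratio may shift slightly."""
    return snap_to_multiple(width, multiple), snap_to_multiple(height, multiple)


print(snap_resolution(480, 832))    # already valid, unchanged
print(snap_resolution(1920, 1080))  # 1080 is not a multiple of 32
```

The second call yields a height of 1088, which matches the 1920x1088 size another post on this page uses for "1080p" output.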
\*\*\*\*\* If you notice the lipsync is kinda wonky in this video, it's because I slapped the video together in a rush. I only noticed after I rendered it in Resolve and by then I was rushed to do something else so I didn't bother to go back and fix it. Since I only cared about showing the quality and I've already posted, I'm not going to go back and fix it even though it bothers my OCD a little.
Some other stats. I'm very fortunate to have a 4090 (24 GB VRAM) and 64 GB of system RAM (purchased over a year ago, before the price craziness). A 768 x 1088, 20-second video (481 frames at 24fps) takes 6-10 minutes depending on the LoRAs I set, at 25 steps using Euler. Your mileage will vary.
\*\*\*update to post: I'm using a VERY simple prompt. My goal wasn't to test prompt adherence but to mess with quality and lipsync. Here is the embarrassingly short prompt that I sometimes vary with 1-2 words about expressions or eye contact. This is driving nearly ALL of my singing videos:
**"A video of a woman singing. She sings with subtle and fluid movements and a happy expression. She sings with emotion and passion. static camera."**
Crazy, right?
Models and Lora List
\*\*checkpoints\*\*
\- \[ltx-2-19b-dev-fp8.safetensors\]
[https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-dev-fp8.safetensors](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-dev-fp8.safetensors)
\*\*text\_encoders\*\* - Quantized Gemma
\- \[gemma\_3\_12B\_it\_fp8\_e4m3fn.safetensors\]
[https://huggingface.co/GitMylo/LTX-2-comfy\_gemma\_fp8\_e4m3fn/resolve/main/gemma\_3\_12B\_it\_fp8\_e4m3fn.safetensors?download=true](https://huggingface.co/GitMylo/LTX-2-comfy_gemma_fp8_e4m3fn/resolve/main/gemma_3_12B_it_fp8_e4m3fn.safetensors?download=true)
\*\*loras\*\*
\- \[LTX-2-19b-LoRA-Camera-Control-Static\]
[https://huggingface.co/Lightricks/LTX-2-19b-LoRA-Camera-Control-Static/resolve/main/ltx-2-19b-lora-camera-control-static.safetensors?download=true](https://huggingface.co/Lightricks/LTX-2-19b-LoRA-Camera-Control-Static/resolve/main/ltx-2-19b-lora-camera-control-static.safetensors?download=true)
\- \[ltx-2-19b-distilled-lora-384.safetensors\]
[https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-distilled-lora-384.safetensors?download=true](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-distilled-lora-384.safetensors?download=true)
\*\*latent\_upscale\_models\*\*
\- \[ltx-2-spatial-upscaler-x2-1.0.safetensors\]
[https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-spatial-upscaler-x2-1.0.safetensors](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-spatial-upscaler-x2-1.0.safetensors)
Mel-Band RoFormer Model - For Audio
\- \[MelBandRoformer\_fp32.safetensors\]
[https://huggingface.co/Kijai/MelBandRoFormer\_comfy/resolve/main/MelBandRoformer\_fp32.safetensors?download=true](https://huggingface.co/Kijai/MelBandRoFormer_comfy/resolve/main/MelBandRoformer_fp32.safetensors?download=true)
I've been running LTX-2 (the 19B distilled model) on an NVIDIA Jetson AGX Thor and built an open-source pipeline around it. It generates 1080p video (1920x1088) at 24fps with audio, and supports camera control LoRAs and batch rendering. Figured I'd share since there's almost nothing out there about running big video models on Jetson.
\*\*GitHub:\*\* [github.com/divhanthelion/ltx2](http://github.com/divhanthelion/ltx2)
\## What it generates
https://reddit.com/link/1r03u80/video/ep0gbzpsxgig1/player
1920x1088, 161 frames (\~6.7s), 24fps with synchronized audio. About 15 min diffusion + 2 min VAE decode per clip on the Thor.
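The frame counts and durations quoted on this page follow from dividing frames by frame rate, which the sketch below reproduces. The `frames_for_duration` helper additionally assumes frame counts of the form 8k + 1. That is an assumption inferred from the two counts quoted here (161 and 481), not something the posts state, so check the model documentation before relying on it.

```python
def clip_seconds(num_frames: int, fps: float) -> float:
    """Playback duration of a clip in seconds."""
    return num_frames / fps


def frames_for_duration(seconds: float, fps: float, step: int = 8) -> int:
    """Nearest frame count of the form step * k + 1.

    ASSUMPTION: the 8k + 1 pattern is inferred from 161 and 481 frames;
    it is not confirmed by the source posts.
    """
    k = max(1, round((seconds * fps - 1) / step))
    return step * k + 1


print(round(clip_seconds(161, 24), 2))  # the ~6.7 s clip from the Jetson post
print(frames_for_duration(20, 24))      # frame count for a 20 s clip at 24 fps
```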
\## The interesting part: unified memory
The Jetson Thor has 128GB of RAM shared between CPU and GPU. This sounds great until you realize it breaks every standard memory optimization:
\- \*\*\`enable\_model\_cpu\_offload()\` is useless\*\* — CPU and GPU are the same memory. Moving tensors to CPU frees nothing. Worse, the offload hooks create reference paths that prevent model deletion, and removing them later leaves models in an inconsistent state that segfaults during VAE decode.
\- \*\*\`tensor.to("cpu")\` is a no-op\*\* — same physical RAM. You have to actually \`del\` the object and run \`gc.collect()\` + \`torch.cuda.empty\_cache()\` (twice — second pass catches objects freed by the first).
\- \*\*Page cache will kill you\*\* — safetensors loads weights via mmap. Even after \`.to("cuda")\`, the original pages may still be backed by page cache. If you call \`drop\_caches\` while models are alive, the kernel evicts the weight pages and your next forward pass segfaults.
\- \*\*You MUST use \`torch.no\_grad()\` for VAE decode\*\* — without it, PyTorch builds autograd graphs across all 15+ spatial tiles during tiled decode. On unified memory, this doesn't OOM cleanly — it segfaults. I lost about 4 hours to this one.
The pipeline does manual memory lifecycle: load everything → diffuse → delete transformer/text encoder/scheduler/connectors → decode audio → delete audio components → VAE decode under \`no\_grad()\` → delete everything → flush page cache → encode video. Every stage has explicit cleanup and memory reporting.
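The cleanup pattern described above (drop references, collect twice, decode without autograd) might look roughly like the sketch below. This is not the code from the repository. The names `release`, `decode_without_autograd`, and the `components` dict are hypothetical, and the optional `torch` import is only there so the sketch also runs on machines without PyTorch installed.

```python
import gc

try:
    import torch
except ImportError:  # assumption: allow the sketch to run without PyTorch
    torch = None


def release(components: dict, names) -> None:
    """Drop references held in `components`, then collect and flush twice.

    Memory is only reclaimed if nothing else still references the objects.
    """
    for name in names:
        components.pop(name, None)
    for _ in range(2):  # second pass catches objects freed by the first
        gc.collect()
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()


def decode_without_autograd(decode_fn, latents):
    """Call decode_fn(latents) with gradient tracking disabled when torch exists."""
    if torch is None:
        return decode_fn(latents)
    with torch.no_grad():
        return decode_fn(latents)


# Stand-in objects instead of real models, to show the call order only.
components = {"transformer": object(), "text_encoder": object(), "vae": object()}
release(components, ["transformer", "text_encoder"])
print(sorted(components))
```

Keeping the components in one dict is a design choice for the sketch: it makes it harder to leave a stray variable holding a model alive after it was supposed to be freed.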
\## What's in the repo
\- \`generate.py\` — the main pipeline with all the memory management
\- \`decode\_latents.py\` — standalone decoder for recovering from failed runs (latents are auto-saved)
\- Batch rendering scripts with progress tracking and ETA
\- Camera control LoRA support (dolly in/out/left/right, jib up/down, static)
\- Optional FP8 quantization (cuts transformer memory roughly in half)
\- Post-processing pipeline for RIFE frame interpolation + Real-ESRGAN upscaling (also Dockerized)
Everything runs in Docker so you don't touch your system Python. The NGC PyTorch base image has the right CUDA 13 / sm\_110 build.
\## Limitations (being honest)
\- \*\*Distilled model only does 8 inference steps\*\* — motion is decent but not buttery smooth. Frame interpolation in post helps.
\- \*\*Negative prompts don't work\*\* — the distilled model uses CFG=1.0, which mathematically eliminates the negative prompt term. It accepts the flag silently but does nothing.
\- \*\*1080p is the ceiling for quality\*\* — you can generate higher res but the model was trained at 1080p. Above that you get spatial tiling seams and coherence loss. Better to generate at 1080p and upscale.
\- \*\*\~15 min per clip\*\* — this is a 19B model on an edge device. It's not fast. But it's fully local and offline.
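The negative-prompt point above follows from the standard classifier-free guidance formula, `guided = uncond + scale * (cond - uncond)`, where the negative prompt feeds the `uncond` branch. The sketch below uses plain Python lists in place of model predictions. It assumes the pipeline uses this common formulation, which the post implies but does not show, and `cfg_combine` is a name invented for this example.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance mix of two predictions."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]              # stand-in prediction for the positive prompt
negative_a = [0.0, 0.0]        # stand-in prediction for one negative prompt
negative_b = [5.0, -3.0]       # stand-in prediction for a different one

# At scale 1.0 the uncond terms cancel, so both results match cond.
print(cfg_combine(cond, negative_a, 1.0))
print(cfg_combine(cond, negative_b, 1.0))

# At any other scale the negative branch changes the output.
print(cfg_combine(cond, negative_b, 3.0))
```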
\## Hardware
NVIDIA Jetson AGX Thor, JetPack 7.0, CUDA 13.0. 128GB unified memory. The pipeline needs at least 128GB — at 64GB you'd need FP8 + pre-computed text embeddings to fit, and it would be very tight.
If anyone else is running video gen models on Jetson hardware, I'd love to compare notes. The unified memory gotchas are real and basically undocumented.
LTX-2 19B has a context window of 1,000 tokens, as specified in the model metadata.
Yes, LTX-2 19B is fully open source. It can be deployed locally without any cloud dependency, and model files are available on Hugging Face. It is also compatible with ComfyUI via community integrations.
The model supports NVFP4 and FP8 quantization, which reduce VRAM requirements by up to 60%. With NVFP4 quantization on an RTX 4090, an 8-second 720p clip can be generated in approximately 25 seconds. Exact minimum VRAM requirements depend on the quantization format and output resolution chosen.
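As a rough back-of-the-envelope check, weight storage scales with bits per parameter. The sketch below assumes 19 billion parameters (taken from the model name), a 16-bit baseline, and that every weight is quantized. It ignores activations, the text encoder, the VAE, and quantization scale factors, so real savings will differ, and the figures are estimates rather than measured requirements.

```python
def weight_gigabytes(num_params: float, bits_per_param: float) -> float:
    """Weights-only storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 19e9  # ASSUMPTION: read off the "19B" in the model name
BASELINE_BITS = 16  # ASSUMPTION: 16-bit weights as the unquantized baseline

baseline = weight_gigabytes(NUM_PARAMS, BASELINE_BITS)
for label, bits in [("16-bit", 16), ("FP8", 8), ("4-bit (NVFP4-like)", 4)]:
    size = weight_gigabytes(NUM_PARAMS, bits)
    saving = 100 * (1 - size / baseline)
    print(f"{label}: {size:.1f} GB weights, {saving:.0f}% smaller than baseline")
```

The FP8 line lands near the "roughly in half" figure reported for the transformer in the Jetson post above.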
Yes. LTX-2 19B generates video and synchronized audio together in a single unified process. The audio output includes ambient sound, environmental effects, and speech synchronization that correspond to the on-screen action.
The model accepts text prompts, reference image URLs, and existing video clips as inputs. It also supports LoRA configuration, numeric parameters, toggle group settings, and a manual seed value for reproducibility.
LTX-2 19B was developed by Lightricks and released on January 6, 2026. It was added to MindStudio on January 13, 2026.