Qwen Image

Qwen-Image is an image generation and editing model developed by Alibaba's Qwen team. It accepts text prompts and source images as input and supports both text-to-image generation and a wide range of image editing tasks, including style transfer, object addition and removal, background changes, and pose manipulation. The model uses a dual-encoding architecture that processes images through both Qwen2.5-VL for semantic understanding and a VAE encoder for visual fidelity, feeding into an MMDiT backbone.

What distinguishes Qwen-Image from many other generation models is its ability to render complex text accurately within images, including multi-line layouts and logographic scripts such as Chinese characters. This capability is built using a curriculum learning strategy that progressively scales from simple to complex text rendering tasks during training. The model has been evaluated on benchmarks covering image generation, image editing, and text rendering, including GenEval, DPG, GEdit, LongText-Bench, ChineseWord, and CVTG-2K. It is well-suited for workflows that require accurate in-image typography, multilingual text, or detailed image editing from a source image.

Released: Aug 04, 2025 | Context: 10,000 tokens | Output: N/A
Tags: Text-to-Image Generation, Image Editing, Complex Text Rendering, LoRA Support, Seed Control, Image Understanding Tasks

Model Overview

| Field | Value | Description |
| --- | --- | --- |
| Provider | Qwen | The entity that provides this model. |
| Input Context Window | 10,000 tokens | The number of tokens supported by the input context window. |
| Maximum Output Tokens | N/A | The number of tokens that can be generated by the model in a single request. |
| Open Source | No | Whether the model's code is available for public use. |
| Release Date | Aug 04, 2025 | When the model was first released. |
| Knowledge Cut-off Date | August 2025 | When the model's knowledge was last updated. |
| API Providers | Hugging Face | The providers that offer this model. This is not an exhaustive list. |
| Modalities | Image, Text, Code | Types of data this model can process. |

Capabilities

What Qwen Image supports

Text-to-Image Generation

Generates images from text prompts across a wide range of artistic styles, evaluated on benchmarks including GenEval and DPG.

Image Editing

Edits a source image supplied through the imageUrl input. Supported edits include style transfer, background changes, pose manipulation, and object addition, removal, and replacement.

Complex Text Rendering

Renders multi-line, paragraph-level, and logographic text (including Chinese characters) within generated images, benchmarked on LongText-Bench, ChineseWord, and CVTG-2K.

LoRA Support

Accepts LoRA adapters as an input parameter, allowing fine-tuned style or subject customization to be applied at inference time.
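As a rough illustration of what applying a LoRA adapter means, the sketch below merges a low-rank update into a weight matrix using the general LoRA formulation, W' = W + scale * (B x A). It is a toy in pure Python with made-up 2x2 values. How this model or any particular provider actually loads and applies adapters is not documented on this page, so the function names and the `scale` parameter here are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix, scale: float = 1.0) -> Matrix:
    """Return weight + scale * (lora_b @ lora_a), the standard low-rank update."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 base weight with a rank-1 adapter (B is 2x1, A is 1x2).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, -0.5]]

merged = apply_lora(weight, lora_b, lora_a, scale=0.5)
print(merged)
```

Because the adapter only stores the two small factors, several adapters can be combined by adding their scaled updates to the same base weight, which is the general idea behind stacking more than one LoRA.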

Seed Control

Accepts a numeric seed input to enable reproducible image outputs across generation runs.
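The idea behind seed control can be shown with a toy stand-in for the random starting noise of a generation run. The sketch below uses Python's standard `random` module, not the model's actual sampler, and `sample_noise` is a hypothetical helper. It only demonstrates why a fixed seed makes a run repeatable.

```python
import random
from typing import List


def sample_noise(seed: int, n: int = 4) -> List[float]:
    """Draw n Gaussian values from a generator initialised with the given seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


same_a = sample_noise(1234)
same_b = sample_noise(1234)
other = sample_noise(5678)

print(same_a == same_b)  # identical seed, identical starting noise
print(same_a == other)   # different seed, different starting noise
```

In practice, reproducing an image also requires keeping the prompt, dimensions, and LoRA settings unchanged, since the seed fixes only the random component.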

Image Understanding Tasks

Supports detection, segmentation, depth estimation, novel view synthesis, and super-resolution as part of its unified architecture.

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

Hugging Face

Configuration & Parameters

The configurable options currently documented for this model.

| Parameter | Type | Details |
| --- | --- | --- |
| Width | Number | Default: 1024. Range: 256 to 1536. |
| Height | Number | Default: 1024. Range: 256 to 1536. |
| LoRAs | LoRA | Up to 3 LoRAs. |
| Seed | Seed | A specific value that is used to guide the "randomness" of the generation. |
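The documented limits can be checked on the client before a request is sent. The following is a minimal sketch, assuming a JSON-style payload. The key names (`prompt`, `width`, `height`, `loras`, `seed`) and the `build_request` helper are hypothetical, and only `imageUrl`, the width and height range, the 1024 default, and the three-LoRA limit come from this page. Adapt the field names to whatever your provider actually expects.

```python
from typing import Any, Dict, List, Optional

SIZE_MIN, SIZE_MAX = 256, 1536  # documented range for width and height
SIZE_DEFAULT = 1024             # documented default for width and height
MAX_LORAS = 3                   # documented limit on LoRA adapters


def build_request(
    prompt: str,
    width: int = SIZE_DEFAULT,
    height: int = SIZE_DEFAULT,
    loras: Optional[List[str]] = None,
    seed: Optional[int] = None,
    image_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate the documented limits and return a request payload.

    The payload key names are assumptions made for illustration.
    """
    for name, value in (("width", width), ("height", height)):
        if not SIZE_MIN <= value <= SIZE_MAX:
            raise ValueError(
                f"{name} must be between {SIZE_MIN} and {SIZE_MAX}, got {value}"
            )
    loras = list(loras or [])
    if len(loras) > MAX_LORAS:
        raise ValueError(f"at most {MAX_LORAS} LoRAs are supported, got {len(loras)}")

    payload: Dict[str, Any] = {"prompt": prompt, "width": width, "height": height}
    if loras:
        payload["loras"] = loras
    if seed is not None:
        payload["seed"] = seed
    if image_url is not None:
        payload["imageUrl"] = image_url  # source image for editing requests
    return payload


request = build_request(
    "A storefront sign that reads 'OPEN 24 HOURS'", width=1280, height=768, seed=42
)
print(request)
```

Optional fields are omitted from the payload when unset, so the provider's own defaults apply rather than values guessed by the client.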

Supported Request Parameters

Parameters currently listed by OpenRouter or the local catalog for this model.

Width, Height, LoRAs, Seed

Community discussion

What people think about Qwen Image

Qwen Image discussions are most active in r/StableDiffusion, r/comfyui, and r/LocalLLaMA.

Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, and coding workflow discussions. The strongest match in this snapshot has 1550 upvotes and 159 comments.

I'm always getting slightly plasticky and airbrushed results from Qwen Image Edit; the teeth and eyes don't look very natural, especially if it's not a face portrait. I see Nano Banana, Grok Imagine, and GPT Image doing such great work, and it makes me wonder whether any image-to-image ComfyUI workflow with locally hosted models can ever come close. I would love to see others share their thoughts or workflows if you have any. Thanks!

r/comfyui 178 upvotes 44 comments May 11, 2026
The combination of qwen image + Z image

I've created an agent for generating Japanese film-style image cues. The images produced using this combination are of very high quality. I've also tried using these cues to create images in MyJet, and the results are quite good. There are some noticeable differences in the results; which one do you prefer? If there's a lot of interest, I'll open-source this agent.

I've uploaded a ComfyUI workflow for local use; you can download it directly from this link: https://drive.google.com/file/d/1pLz52RDPdyQMgwS5LVeMrQ2GVFrhLy78/view?usp=drive_link

However, I strongly recommend swapping the qwen3 node used for image-based prompts for a larger language model such as Gemini or GPT for better results.

Therefore, I've also prepared two cloud-based workflows for your convenience. If you want to use the ComfyUI cloud platform, the workflow is here: https://www.runninghub.cn/post/2053673047776866305/?inviteCode=rh-v1317

If you prefer to use MJ, you can use it through TapNow; the workflow is here: https://app.tapnow.ai/tapflow/view/2e3b1d50

r/StableDiffusion 133 upvotes 74 comments January 6, 2026
Comparison: Trained the same character LoRAs on Z-Image Turbo vs Qwen 2512

I’ve compared some character LoRAs that I trained myself on both Z-Image Turbo (ZIT) and Qwen Image 2512. Every character LoRA in this comparison was trained using the exact same dataset on both ZIT and Qwen.

All comparisons above were done in ComfyUI using 12 steps, 1 CFG, multiple resolutions. I intentionally bumped up the steps higher than the defaults (8 for ZIT, 4 for Qwen Lightning) hoping to get maximum results.

As you can see in the images, ZIT is still better in terms of realism compared to Qwen.
Even though I used the res_2s sampler and bong_tangent scheduler for Qwen (because the realism drops without them), the skin texture still looks a bit plastic. ZIT is clearly superior in terms of realism. Some of the prompt tests above also used references from the dataset.

For distant shots, Qwen LoRAs often require FaceDetailer (as I did on the Dua Lipa concert image above) to make the likeness look better. ZIT sometimes needs FaceDetailer too, but not as often as Qwen.

ZIT is also better in terms of prompt adherence (as we all expected). Maybe it’s due to the Reinforcement Learning method they use.

As for concept bleeding / semantic leakage (I honestly don't understand this deeply, and I don't even know if I'm using the right term), maybe one of you can explain it better? I just noticed a tendency for diffusion models to be hypersensitive to certain words.

This is where ZIT has a flaw that I find a bit annoying: the concept bleeding on ZIT is worse than on Qwen (maybe because of the smaller parameter count or the distilled model?). For example, take the prompt "a passport photo of [subject]". Both models tend to generate Asian faces with this prompt, but the association with Asian faces is much stronger on ZIT. I had to explicitly mention the subject's traits for non-Asian character LoRAs. Because the concept bleeding is so strong on ZIT, I haven't been able to get a good likeness on the "Thor" prompt like the one in the image above.

And it’s already known that another downside of ZIT is using multiple LoRAs at once. So far, I haven't successfully used 3 LoRAs simultaneously. 2 is still okay.

Although I'm still struggling to make LoRAs involving specific acts that work well when combined with a character LoRA, I've trained some that work fine in combination. You can check those out at: https://civitai.com/user/markindang

All of these LoRAs were trained using ostris/ai-toolkit. Big thanks to him!

Qwen2512+FaceDetailer: https://drive.google.com/file/d/17jIBf3B15uDIEHiBbxVgyrD3IQiCy2x2/view?usp=drive_link
ZIT+FaceDetailer: https://drive.google.com/file/d/1e2jAufj6_XU9XA2_PAbCNgfO5lvW0kIl/view?usp=drive_link

FAQ

Common questions about Qwen Image

What is the context window for Qwen-Image?

The model has a context window of 10,000 tokens, as listed in the model metadata.

What input types does Qwen-Image accept?

Qwen-Image accepts a text prompt, an optional image URL (the source image for editing), numeric width and height values, LoRA adapter configurations, and a seed value for reproducibility.

What makes Qwen-Image's text rendering distinct?

The model uses a curriculum learning strategy that trains progressively from simple to complex text tasks, enabling accurate rendering of multi-line text and logographic scripts like Chinese characters within generated images.
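To make the idea of a curriculum concrete, here is a toy schedule that unlocks harder text-rendering tasks as training progresses. It is a generic sketch of curriculum learning, not the Qwen team's training recipe. The stage names, the equal-share unlocking rule, and the `active_stages` helper are all assumptions chosen for illustration.

```python
from typing import List

# Hypothetical task stages, ordered from simplest to most complex.
STAGES = ["single_word", "single_line", "multi_line", "paragraph"]


def active_stages(step: int, total_steps: int) -> List[str]:
    """Return the stages available at a training step.

    One additional stage is unlocked each time an equal share of the
    total training steps has elapsed, so early steps see only easy tasks.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    unlocked = 1 + (step * len(STAGES)) // total_steps
    return STAGES[:unlocked]


for step in (0, 250, 500, 999):
    print(step, active_stages(step, total_steps=1000))
```

A training loop built on this would sample each batch only from the currently active stages, so the simpler tasks stay in the mix while the harder ones are phased in.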

What benchmarks has Qwen-Image been evaluated on?

Qwen-Image has been evaluated on GenEval and DPG for image generation; GEdit, ImgEdit, and GSO for image editing; and LongText-Bench, ChineseWord, and CVTG-2K for text rendering.

What is the training data cutoff for Qwen-Image?

The model metadata lists a knowledge cut-off date of August 2025.
