Text-to-Image Generation
Generates images from natural language text prompts using a Multimodal Diffusion Transformer architecture. Supports output at resolutions including 1024×1024 pixels.
Stable Diffusion 3 (SD3) is a text-to-image generation model developed by Stability AI and released in June 2024. It introduces a Multimodal Diffusion Transformer (MMDiT) architecture that maintains separate weight sets for image and language representations, which improves the model's ability to interpret complex, detailed prompts. The model is available in multiple size variants ranging from 800 million to 8 billion parameters, making it deployable across a range of hardware configurations. One of SD3's most notable characteristics is its ability to render legible text within generated images, a task that has historically been difficult for diffusion-based models. The 8B parameter variant fits within 24GB of VRAM and generates a 1024×1024 image in approximately 34 seconds using 50 sampling steps. SD3 is well suited for creative professionals, developers, and researchers who require high-fidelity image generation with strong alignment to nuanced text prompts.
Renders legible, accurate text within generated images, a capability that diffusion models have historically struggled with. Achieved through the MMDiT architecture's improved language understanding.
Follows detailed and nuanced text prompts closely, including multi-subject scenes and complex compositional instructions. Separate image and language weight sets in MMDiT contribute to this behavior.
Accepts a user-defined seed value to produce reproducible image outputs. Useful for iterating on a composition while holding other variables constant.
Exposes configurable select inputs for controlling generation parameters such as aspect ratio, style, and output format. Multiple select fields are available in the input schema.
Available in variants from 800M to 8B parameters to accommodate different hardware constraints. The largest variant requires approximately 24GB of VRAM.
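The seed behavior described above can be illustrated with a minimal sketch. It assumes that generation starts from pseudo-random noise drawn from a seeded generator; Python's standard `random` module stands in for whatever generator the model's sampler actually uses, and `initial_noise` is a hypothetical helper, not part of any SD3 API.

```python
import random


def initial_noise(seed: int, size: int) -> list[float]:
    """Draw `size` Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


same_a = initial_noise(seed=1234, size=4)
same_b = initial_noise(seed=1234, size=4)
other = initial_noise(seed=1235, size=4)

# The same seed reproduces the same starting noise, so any change in the
# output can be attributed to the prompt or settings that were edited.
print(same_a == same_b)
# A different seed gives a different starting point.
print(same_a == other)
```

Holding the seed fixed while editing only the prompt is what makes side-by-side comparisons of a composition meaningful.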
The configurable options currently documented for this model:
Negative prompt: A short text describing what you do not want to see in the output image.
Seed: A value that controls the randomness of the generation. Omit this parameter or pass 0 to use a random seed.
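As a rough sketch of how these two options might be assembled into a request, the helper below builds a parameter dictionary. The field names `prompt`, `negative_prompt`, and `seed`, and the function `build_params` itself, are assumptions made for illustration; check the provider's input schema for the actual names before relying on them.

```python
def build_params(prompt, negative_prompt=None, seed=None):
    """Build a request parameter dict (hypothetical field names)."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt must not be empty")
    if seed is not None and seed < 0:
        raise ValueError("seed must be zero or a positive integer")

    params = {"prompt": prompt}
    if negative_prompt:
        # Describes what should NOT appear in the image.
        params["negative_prompt"] = negative_prompt
    if seed:
        # A seed of None or 0 is left out, which requests a random seed.
        params["seed"] = seed
    return params


random_run = build_params("a lighthouse at dusk", seed=0)
fixed_run = build_params(
    "a lighthouse at dusk", negative_prompt="blurry, low contrast", seed=42
)
print(random_run)
print(fixed_run)
```

Leaving the seed out for exploratory runs and pinning it once a promising result appears keeps the two modes explicit in the calling code.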
Stable Diffusion 3 discussions are most active in r/StableDiffusion, r/singularity, and r/LocalLLaMA. The top Reddit threads cluster around benchmarks and model comparisons, followed by safety and censorship questions.
The strongest match in this snapshot has 2601 upvotes and 280 comments.
[https://civitai.com/articles/5732](https://civitai.com/articles/5732)
Stable Diffusion 3 has a context window of 10,000 tokens as listed in its metadata, which governs how much text prompt information the model can process at once.
According to the model metadata, Stable Diffusion 3 has a training date of June 2024.
SD3 uses a Multimodal Diffusion Transformer (MMDiT) architecture, which keeps separate sets of weights for image and language representations. Earlier Stable Diffusion versions used a UNet-based architecture instead.
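The idea of separate weights per modality can be sketched in a few lines. The toy example below is not SD3's implementation: it uses tiny random matrices in place of learned parameters, a single attention head, and no normalization or feed-forward layers. It only shows the pattern in which text and image tokens are projected with their own weights and then attend over one joined sequence. All function names are hypothetical.

```python
import math
import random


def random_matrix(rng, rows, cols):
    """Small random weight matrix standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def project(tokens, weight):
    """Multiply each token vector by a (dim_in x dim_out) weight matrix."""
    cols = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(token)) for j in range(cols)]
        for token in tokens
    ]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_stream_weights(rng, dim):
    """One independent set of query/key/value weights per modality."""
    return {name: random_matrix(rng, dim, dim) for name in ("q", "k", "v")}


def joint_attention(text_tokens, image_tokens, text_w, image_w):
    """Project each modality with its own weights, then attend over both."""
    q = project(text_tokens, text_w["q"]) + project(image_tokens, image_w["q"])
    k = project(text_tokens, text_w["k"]) + project(image_tokens, image_w["k"])
    v = project(text_tokens, text_w["v"]) + project(image_tokens, image_w["v"])
    dim = len(q[0])
    mixed = []
    for query in q:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in k]
        attn = softmax(scores)
        mixed.append([sum(w * val[d] for w, val in zip(attn, v)) for d in range(dim)])
    n_text = len(text_tokens)
    return mixed[:n_text], mixed[n_text:]  # split back into the two streams


rng = random.Random(0)
dim = 4
text_w = make_stream_weights(rng, dim)   # weights used only for text tokens
image_w = make_stream_weights(rng, dim)  # weights used only for image tokens
text_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
image_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

text_out, image_out = joint_attention(text_tokens, image_tokens, text_w, image_w)
print(len(text_out), len(image_out))  # one output vector per input token
```

Because every query attends over keys from both modalities, a change in the text tokens alters the image-side outputs, which is one way to picture how the prompt steers the image representation.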
The 8B parameter variant of Stable Diffusion 3 fits within 24GB of VRAM, such as that found on an NVIDIA RTX 4090, and generates a 1024×1024 image in approximately 34 seconds using 50 sampling steps.
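The figures above can be turned into a quick back-of-the-envelope check. The per-step time follows directly from the stated numbers. The memory estimate additionally assumes that weights are stored at 16-bit precision (2 bytes per parameter), which the source does not state, and it ignores activations, text encoders, and other runtime overhead.

```python
def seconds_per_step(total_seconds: float, steps: int) -> float:
    """Average wall-clock time per sampling step."""
    return total_seconds / steps


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


per_step = seconds_per_step(34, 50)   # 34 s spread over 50 steps
weights_gb = weight_memory_gb(8e9, 2)  # ASSUMPTION: 16-bit weights
headroom_gb = 24 - weights_gb          # left for activations and overhead

print(f"{per_step:.2f} s per step")
print(f"{weights_gb:.1f} GB for weights, {headroom_gb:.1f} GB headroom on a 24 GB card")
```

Under these assumptions the weights alone take roughly two thirds of a 24GB card, which is consistent with the statement that the 8B variant fits but leaves limited room to spare.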
The model accepts text input for the prompt, along with multiple select fields for configuration options such as style or format, and a seed input for reproducible generation.
Stable Diffusion 3 is published by Stability AI. It can be accessed via the Stability AI platform API or used directly through MindStudio without requiring separate API key management.