LLM Model Directory

Explore frontier AI models by provider, pricing, and context

Browse the synced model catalog by provider, release, pricing, and core capabilities.

55 models, 12 providers in view. Current filter: All providers. Type: Image.

OpenAI

4 models

Image

GPT Image 1.5

Release date unavailable

GPT Image 1.5 is an image generation model developed by OpenAI and released in December 2025. It serves as the engine behind the ChatGPT Images experience and is also available to developers via the API using the model ID gpt-image-1.5. The model supports both text-to-image generation and image editing, with output resolutions up to 1536×1024 and 1024×1536 pixels. It ranks first in the Image Edit category on the Chatbot Arena leaderboard. The model is designed to follow nuanced editing instructions, changing only the specified elements of an image while preserving lighting, composition, and overall context. It maintains consistent facial likeness across inputs, outputs, and sequential edits, making it well-suited for photo retouching, virtual try-ons, stylistic transformations, and multi-image compositing. Text rendering accuracy is notably improved compared to its predecessor, and generation speed is up to 4x faster. It is accessible to all ChatGPT users as well as developers building applications in e-commerce, marketing, and creative tooling.

Image
Context: 4,000 Output: N/A
Input: $5.00 Output: $10.00
Image

GPT Image 1

Release date unavailable

GPT Image 1 is OpenAI's flagship image generation model, released in April 2025, designed to convert text descriptions into images and make targeted edits to existing photos. It is built on a unified neural network architecture that processes both text and images together, which allows it to interpret complex, multi-part prompts and produce outputs that closely match the specified intent. The model supports readable text rendering within images, making it practical for use cases like marketing materials, infographics, and product labels. Output formats include square (1024×1024), portrait (1024×1536), and landscape (1536×1024) resolutions, with three quality tiers available. GPT Image 1 is particularly suited for creative professionals, marketers, and developers who need consistent, production-ready visuals. Its region-aware editing capability allows changes to specific parts of an image — such as a background or a single object — without altering unrelated elements like faces, lighting, or logos. The model accepts image inputs alongside text prompts, enabling workflows that involve editing or building upon existing photos. It is accessible via the OpenAI API and is integrated into MindStudio for use without requiring direct API key management.
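The three output formats listed above can be treated as a small lookup table. The sketch below picks the listed format whose aspect ratio is closest to a requested canvas. The function name `closest_size` and the log-ratio distance are illustrative assumptions, not part of any OpenAI API, and no API call is made.

```python
import math

# Output formats quoted in the catalog entry: name -> (width, height) in pixels.
SUPPORTED_SIZES = {
    "square": (1024, 1024),
    "portrait": (1024, 1536),
    "landscape": (1536, 1024),
}


def closest_size(width: int, height: int) -> str:
    """Return the name of the listed format closest in aspect ratio.

    Distance is measured on the log of the aspect ratio so that
    2:1 and 1:2 are treated as equally far from 1:1 (a design choice).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)

    def distance(name: str) -> float:
        w, h = SUPPORTED_SIZES[name]
        return abs(math.log(w / h) - target)

    return min(SUPPORTED_SIZES, key=distance)


if __name__ == "__main__":
    for canvas in [(1920, 1080), (1080, 1920), (500, 500)]:
        name = closest_size(*canvas)
        print(canvas, "->", name, SUPPORTED_SIZES[name])
```

A 1920×1080 canvas maps to the landscape format, 1080×1920 to portrait, and any square canvas to the square format.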

Image
Context: 4,000 Output: N/A
Input: $5.00 Output: $40.00
Image

GPT Image 2

Release date unavailable

OpenAI's latest image generation and editing model, offering state-of-the-art visual quality, precise instruction following, and support for large-scale batch processing.

Image
Context: N/A Output: N/A
Input: N/A Output: N/A
Image

GPT Image Latest

Release date unavailable

GPT Image Latest is OpenAI's current image generation and editing model, available via the API under the identifier chatgpt-image-latest. This alias always resolves to the most recent version of OpenAI's image generation technology, meaning applications built on it automatically use the latest underlying model without requiring code changes. It powers image creation and editing within ChatGPT and is also directly accessible through the OpenAI API for developers. The model supports both text-to-image generation and targeted editing of existing images, with notable accuracy when rendering text within generated visuals — a historically difficult task for image generation models. It also supports large-scale asynchronous batch processing through OpenAI's Batch API, allowing up to 50,000 image generation or editing jobs to be submitted with separate rate limits. These characteristics make it well-suited for creative professionals, developers building visual applications, and teams that need to process images at scale.
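The 50,000-job figure above implies that larger workloads have to be split before submission. The sketch below shows only that splitting step, assuming the figure is a per-batch ceiling. The function name `chunk_jobs` and the job records are hypothetical, and no OpenAI API call is made.

```python
from typing import Iterator, List, Sequence

# Limit quoted in the catalog entry; treated here as a per-batch ceiling (an assumption).
MAX_JOBS_PER_BATCH = 50_000


def chunk_jobs(jobs: Sequence[dict], limit: int = MAX_JOBS_PER_BATCH) -> Iterator[List[dict]]:
    """Yield consecutive slices of `jobs`, each holding at most `limit` items."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    for start in range(0, len(jobs), limit):
        yield list(jobs[start:start + limit])


if __name__ == "__main__":
    # Hypothetical job records; the field name is illustrative only.
    jobs = [{"prompt": f"product photo #{i}"} for i in range(120_000)]
    sizes = [len(batch) for batch in chunk_jobs(jobs)]
    print(sizes)  # [50000, 50000, 20000]
```

Each yielded list could then be serialized and submitted as its own batch, in whatever request format the batch endpoint expects.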

Image
Context: 4,000 Output: N/A
Input: $5.00 Output: $10.00

Google

7 models

Image

Gemini 3.1 Flash Image

Feb 26, 2026

Gemini 3.1 Flash Image Preview, internally codenamed "Nano Banana 2," is Google DeepMind's Flash-tier image generation and editing model released in February 2026. It accepts image URL arrays alongside configurable input parameters, making it suitable for both generation and editing workflows. The model holds the number one ranking on the Artificial Analysis Image Arena and the Arena leaderboard for text-to-image generation, and every image it produces is invisibly watermarked via SynthID for provenance tracking. The model is distinguished by its ability to maintain visual coherence across up to five characters and fourteen objects in a single image, its text rendering within generated images with multilingual support, and its use of real-time web search to inform image outputs. It supports flexible aspect ratios and upscaling up to 4K resolution. Gemini 3.1 Flash Image is available through the Gemini API, Google AI Studio, and Vertex AI, and is well suited for developers, designers, and businesses that need high-volume, high-quality image generation and editing at scale.

Text Image Structured Output
Context: 131,072 Output: 32,768 tokens
Input: $0.50 Output: $3.00
Image

Gemini 3 Pro Image

Nov 20, 2025

Gemini 3 Pro Image is Google's flagship image generation and editing model, built on the Gemini 3 Pro architecture. It supports both image and text prompts as inputs and is designed for high-fidelity visual creation tasks such as product visualization, storyboarding, infographic design, and complex multi-element compositions. The model includes a tunable media_resolution control that lets developers balance speed, precision, and detail depending on the task at hand. It also supports real-time grounding via Search integration, enabling context-rich visual outputs. The model is notable for its text rendering capabilities within images, including long passages and multilingual layouts, as well as identity preservation across up to five subjects in multi-image blending scenarios. It supports high-resolution output at 2K and 4K resolutions with flexible aspect ratios, along with fine-grained controls for localized edits, lighting adjustments, focus changes, and camera transformations. Gemini 3 Pro Image is best suited for designers, developers, and creative professionals who require reliable, high-quality image generation with detailed control over visual outputs. It was added to MindStudio on November 20, 2025, with a training data cutoff of November 2025.

Text Image Structured Output
Context: 65,536 Output: 32,768 tokens
Input: $2.00 Output: $12.00
Image

Gemini 2.5 Flash Image

Oct 07, 2025

Gemini 2.5 Flash Image is Google's image generation and editing model, launched in August 2025 via the Gemini API, Google AI Studio, and Vertex AI. It is internally nicknamed "nano-banana" and builds on the native image generation capabilities introduced in Gemini 2.0 Flash. The model accepts arrays of image URLs as input, enabling workflows that involve multiple source images in a single request. It supports a context window of 1,048,576 tokens, allowing for richly detailed prompts alongside image inputs. The model is designed for use cases that require combining natural language instructions with visual content, including targeted image editing, multi-image blending, and maintaining consistent characters across a series of images. It integrates Gemini's broad world knowledge into the generation process, which helps produce contextually accurate visual outputs from descriptive text prompts. Developers and enterprises building creative tools, storytelling applications, or product visualization pipelines are the primary intended audience. It is accessible through both the Gemini API and Vertex AI, making it available for consumer and enterprise deployments.

Text Image Structured Output
Context: 1,048,576 Output: 32,768 tokens
Input: $0.30 Output: $2.50
Image

Imagen 4 Fast

Release date unavailable

Imagen 4 Fast is a text-to-image generation model developed by Google, available through the fal.ai platform as a fast inference variant. It accepts natural language text prompts and produces images across a range of visual styles, including photorealistic scenes, illustrations, concept art, and painterly aesthetics. The model is designed to interpret complex, multi-element prompts and render fine details such as textures, fabrics, and lighting with precision. It was trained with data through early 2025 and is available for commercial use via fal.ai. Imagen 4 Fast is optimized for workflows that require rapid image generation without reducing visual fidelity. It supports a context window of up to 10,000 tokens, allowing for detailed and descriptive prompts. The model is well-suited for creative professionals, developers, and businesses building applications around product imagery, storytelling visuals, or concept development. Input types include text prompts, selection parameters, and a seed value for reproducible outputs.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

Imagen 3

Release date unavailable

Imagen 3 is a text-to-image generation model developed by Google, available through fal.ai, that produces photorealistic images from natural language prompts. It supports a range of visual styles from photorealism to animation and maintains consistent visual composition across five aspect ratios. A notable technical characteristic is its ability to accurately render readable text, signage, and typography within generated images, which has historically been a challenge for image generation models. The model accepts conversational prompts without requiring specialized syntax, and a seed parameter enables reproducible outputs for iterative workflows. Imagen 3 is well suited for use cases that require high visual fidelity and reliable in-image text, including marketing asset creation, product visualization, and concept art development. It supports batch generation of up to four images per request and outputs across aspect ratios including 1:1, 16:9, 9:16, 3:4, and 4:3. The model was trained through late 2024 and accepts text, select, and seed as input types. A companion variant, Imagen 3 Fast, is available for workflows where generation speed takes priority over maximum image quality.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

Imagen 3 Fast

Release date unavailable

Imagen 3 Fast is a text-to-image model developed by Google, built on the Imagen 3 architecture and optimized for lower-latency generation. It accepts text prompts and produces images across a range of visual styles, from photorealistic scenes to illustrated and artistic outputs. A notable characteristic of the model is its ability to render legible text within generated images, which is a common challenge for image generation systems. It supports five aspect ratios — 1:1, 16:9, 9:16, 3:4, and 4:3 — and can generate up to four images per request. Imagen 3 Fast is available through the fal.ai platform with full API access, making it accessible to developers building content creation pipelines, prototyping tools, or real-time applications where generation speed is a priority. The model supports seed-based inputs for reproducible outputs, giving developers control over generation consistency. It accepts up to 10,000 context tokens for prompt input and was trained on data through late 2024. It is well suited for teams that need scalable, fast image generation without manual configuration of complex parameters.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

Imagen 4 Ultra

Release date unavailable

Imagen 4 Ultra is Google's flagship image generation model and the top tier of the Imagen 4 family, trained through early 2025. It accepts text prompts of up to 10,000 tokens and is designed to handle complex, multi-element descriptions including specific art styles, multi-scene compositions, and nuanced visual storytelling. The model supports image URL arrays as input, allowing users to reference existing images alongside text prompts. It is licensed for commercial use, making it available to businesses and creative professionals working on production-grade projects. Imagen 4 Ultra is best suited for use cases where image fidelity and detail are priorities, such as professional design work, advertising, and high-resolution visual content creation. It covers a wide range of output styles, from photorealistic portraits and landscapes to stylized illustrations and pixel art. According to community benchmarking discussions, Imagen 4 Ultra has achieved competitive Elo ratings in image arenas, including a reported tie with GPT-Image-1 in the Image Arena as of mid-2025. The model is accessible via the Google Gemini API as well as third-party inference platforms such as fal.ai.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A

X.ai

2 models

Image

Grok Imagine

Release date unavailable

Grok Imagine (grok-imagine-image) is a text-to-image generation model developed by xAI, the AI company founded by Elon Musk. It is part of the Grok Imagine family and allows users and developers to generate high-resolution images from plain-language text descriptions. The model was unveiled in early August 2025 and expanded from a subscriber-only feature to a broadly available tool accessible via xAI's developer API. It sits alongside the more premium grok-imagine-image-pro variant, serving as the standard, faster option in the family. Grok Imagine supports up to 300 requests per minute, making it suited for applications that require image generation at volume. It supports a 131,072-token context window and is accessible through xAI's API for integration into apps, tools, and workflows. The model is best suited for developers and creators who need reliable, high-throughput image generation for use cases such as prototyping, content creation, or building image-powered products. No artistic expertise is required to use it; a text description is sufficient to produce a detailed image.

Image
Context: 131,072 Output: 500k
Input: $1.25 Output: $2.50
Image

Grok Imagine Pro

Release date unavailable

Grok Imagine Pro is xAI's advanced text-to-image generation model, sitting at the top of xAI's image generation lineup above the standard grok-imagine-image. Published under the X brand, it accepts text prompts along with image URL inputs and selection parameters to produce detailed visual outputs. The "pro" designation reflects its position as the higher-quality tier within xAI's image generation offerings. Grok Imagine Pro is well-suited for developers and creators who require high-fidelity AI-generated imagery within production pipelines or creative workflows. It supports a context window of 131,072 tokens, allowing for detailed and nuanced text prompts. Use cases include content generation, creative projects, and any application where prompt adherence and image detail are priorities.

Image
Context: 131,072 Output: N/A
Input: $1.25 Output: $2.50

Black Forest Labs

12 models

Image

FLUX 1.1 [pro]

Release date unavailable

FLUX 1.1 [pro] is a text-to-image generation model developed by Black Forest Labs, the team behind the FLUX model family. It accepts detailed text prompts and produces images at resolutions up to 2K and 4 megapixels, with support for aspect ratios including 1:1, 16:9, 9:16, 4:3, and 3:4. The model represents an upgrade over its predecessor, FLUX.1 [pro], delivering generation speeds approximately six times faster and improved adherence to the content and structure described in user prompts. FLUX 1.1 [pro] is designed for use cases that require high visual fidelity from text descriptions, including illustrations, advertising visuals, concept art, portraits, and photorealistic scenes. It is accessible via API, making it suitable for developers integrating image generation into applications, as well as for graphic designers and visual marketers working in professional workflows. A 4-megapixel image can be generated in approximately 10 seconds.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

FLUX 1.1 [pro] Ultra

Release date unavailable

FLUX 1.1 [pro] Ultra is a text-to-image generation model developed by Black Forest Labs. It produces images at resolutions up to 4 megapixels and is designed to closely follow text prompt instructions while maintaining image quality at high resolutions. The model offers two generation modes: Ultra mode, which prioritizes strict prompt adherence, and Raw mode, which produces a more naturalistic rendering with fewer synthetic-looking artifacts. FLUX 1.1 [pro] Ultra is suited for professional and creative applications that require high-resolution output, such as concept art, print materials, and social media visuals. It is accessible via the Black Forest Labs API, making it straightforward to integrate into existing workflows and platforms. The model accepts a seed input for reproducible outputs alongside configurable generation settings.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

FLUX.1 [dev] LoRA

Release date unavailable

FLUX.1 [dev] LoRA is an image generation model built on FLUX.1 [dev], a 12-billion parameter rectified flow transformer developed by Black Forest Labs and released in August 2024. It extends the base FLUX.1 [dev] model with LoRA (Low-Rank Adaptation) support, allowing users to load pre-trained style and character adapters to shape the visual output without retraining the underlying model. The model is served through WaveSpeed AI's inference platform, which provides a REST API with no cold starts and consistent availability. It supports both text-to-image and image-to-image workflows, with output resolutions ranging from 256×256 up to 1536×1536 pixels. This model is well suited for developers and creators who need stylistically flexible image generation at scale. By swapping LoRA adapters — such as community options like Flux-Super-Realism-LoRA or yarn_art_Flux_LoRA — users can shift between hyper-realistic photography, painterly aesthetics, and character-driven art within the same base model. A prompt enhancer input is also available to refine natural language prompts before generation. Common use cases include product visualization, character design, creative exploration, and content production workflows.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

FLUX.1 [dev] Ultra-Fast

Release date unavailable

FLUX.1 [dev] Ultra-Fast is a variant of FLUX.1 [dev] optimized for ultra-fast inference, making it suited for iterative creative workflows, rapid prototyping, and applications where generation latency matters. Its input schema includes dual image URL fields, LoRA configuration, selection parameters, numeric controls, and a seed value, giving developers precise control over output dimensions, style, and reproducibility. It is well suited for graphic designers, developers building image generation pipelines, and creators who need consistent, customizable visual outputs at scale.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

FLUX.1 [schnell] LoRA

Release date unavailable

FLUX.1 [schnell] LoRA is an image generation model developed by Black Forest Labs that combines the schnell (fast) variant of the FLUX.1 architecture with LoRA (Low-Rank Adaptation) support, enabling fine-tuned style and subject customization on top of base image generation. It accepts text prompts alongside LoRA weights and source images as inputs, allowing users to steer outputs toward specific visual styles, characters, or aesthetics without retraining the full model. The model supports a context window of 10,000 tokens for prompt input and accepts configuration parameters including seed values and selectable generation options. This model is well-suited for workflows that require repeatable or stylistically consistent image outputs, such as brand asset creation, character design, and concept art iteration. By accepting LoRA inputs directly, it gives developers and designers a way to apply custom-trained adaptations at inference time rather than relying solely on prompt engineering. It is available on MindStudio without requiring separate API key configuration, making it accessible for integration into AI-powered applications.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

FLUX.1 Kontext [max]

Release date unavailable

FLUX.1 Kontext [max] is an image generation and editing model developed by Black Forest Labs. It accepts image URLs, configuration selects, and seed inputs to support both generating new images from text prompts and editing existing images while preserving their original context and composition. A notable characteristic of the model is its ability to accurately render text within generated images, which has historically been a difficult task for image generation systems. It also supports multi-image input, allowing multiple reference images to guide the generation process. FLUX.1 Kontext [max] is the highest-tier variant in the Kontext model family from Black Forest Labs, positioned for workflows that require precise contextual understanding and high-fidelity output. It is suited for creative professionals, designers, and developers who need reliable image editing and generation within production pipelines. The model integrates with tools such as ComfyUI and MCP-compatible servers, and it carries a context window of 10,000 tokens. Its REMIX tag indicates support for remixing and transforming existing visual content.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

FLUX.1 Kontext [pro]

Release date unavailable

FLUX.1 Kontext [pro] is an image generation and editing model developed by Black Forest Labs, released in May 2025. It is designed to accept an existing image alongside a text prompt and return a modified version of that image, making targeted changes while preserving the overall composition and structure. This context-aware approach distinguishes it from text-to-image-only models, as it is built specifically for in-place editing workflows rather than generating images from scratch alone. The model accepts image URLs, configurable settings, and a seed value as inputs, giving users control over reproducibility and output variation. It is well suited for workflows that require consistent visual identity across edits, such as changing materials, lighting, or stylistic elements in product renders, architectural visualizations, or creative compositions. With a context window of 10,000 tokens, it can process detailed natural language instructions for precise prompt following.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

FLUX.2 [dev] LoRA

Release date unavailable

FLUX.2 [dev] LoRA is a text-to-image model published by Black Forest Labs, built on a 32-billion parameter diffusion transformer architecture. It extends the FLUX.2 [dev] base model with Low-Rank Adaptation (LoRA) support, enabling users to inject custom styles, characters, or brand identities into image outputs without retraining the full model. The model uses a Mistral Small 3.1 text encoder for prompt processing and runs on WaveSpeedAI's infrastructure with no cold starts. It was made available in November 2025. The model supports stacking up to four LoRA adapters simultaneously in a single generation request, with independently adjustable strength per adapter. This makes it well-suited for brand-consistent marketing, character-consistent content creation, product visualization, and design iteration workflows. Custom LoRAs can be trained on as few as 15 to 30 images, lowering the barrier for teams that need fine-grained visual control. The model also supports batch generation of one to four images per request, useful for producing consistent campaign sets or A/B variants.
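The limits described above (at most four stacked adapters, each with its own strength, and one to four images per request) can be checked before a request is sent. The sketch below is a minimal client-side validator. The `LoraAdapter` class, its field names, and `validate_request` are hypothetical and do not reflect the actual WaveSpeedAI or Black Forest Labs request schema.

```python
from dataclasses import dataclass
from typing import List

MAX_LORAS = 4                  # "up to four LoRA adapters" per the entry
MIN_IMAGES, MAX_IMAGES = 1, 4  # "one to four images per request" per the entry


@dataclass(frozen=True)
class LoraAdapter:
    name: str        # hypothetical adapter identifier
    strength: float  # per-adapter strength; the valid range is not given in the source


def validate_request(adapters: List[LoraAdapter], num_images: int) -> None:
    """Raise ValueError if the request exceeds the limits quoted in the entry."""
    if len(adapters) > MAX_LORAS:
        raise ValueError(f"at most {MAX_LORAS} LoRA adapters, got {len(adapters)}")
    if not MIN_IMAGES <= num_images <= MAX_IMAGES:
        raise ValueError(
            f"num_images must be between {MIN_IMAGES} and {MAX_IMAGES}, got {num_images}"
        )


if __name__ == "__main__":
    stack = [LoraAdapter("brand-style", 0.8), LoraAdapter("mascot", 0.6)]
    validate_request(stack, num_images=4)  # passes silently
    print("request accepted:", [a.name for a in stack])
```

Keeping strength on each adapter, rather than as one global setting, mirrors the entry's description of independently adjustable strength.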

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

FLUX.2 [klein] 9B

Release date unavailable

FLUX.2 [klein] 9B is a 9-billion-parameter text-to-image model with LoRA support, offering enhanced realism, crisper text rendering, and fast LoRA customization. It is served through a ready-to-use REST inference API with no cold starts and is marketed as high-performance and affordably priced.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

FLUX.2 [max]

Release date unavailable

FLUX.2 [max] is the flagship model in Black Forest Labs' FLUX.2 family, released in December 2025. It is designed for image generation and editing tasks that require high fidelity, accurate prompt following, and visual consistency across complex edits. The model accepts image URL arrays and seed inputs, enabling reference-based generation workflows where source imagery guides the output. It supports a context window of 46,864 tokens, which is notably large for an image generation model. A distinguishing feature of FLUX.2 [max] is its Grounded Generation capability, which allows the model to retrieve real-time information from the web during generation — enabling visuals tied to current events, live data, or recent trends without manual reference uploads. The model also supports character consistency across scenes and styles, multi-reference image editing, and product photography workflows. These characteristics make it suited for professional use cases such as brand work, e-commerce imagery, storytelling, and cinematic visual production.

Image
Context: 46,864 Output: N/A
Input: N/A Output: N/A
Image

FLUX.2 [pro]

Release date unavailable

FLUX.2 [pro] is a production-grade image generation model developed by Black Forest Labs, released in late 2025. It uses a rectified-flow transformer backbone paired with a Mistral-class vision-language model to handle both image generation and editing within a single unified architecture. The model supports a 32,000-token context window, enabling detailed, multi-part prompts with compositional and spatial constraints. Outputs can reach up to 4 megapixels, with fine detail in faces, hands, and textures suited for commercial use. A defining feature of FLUX.2 [pro] is its ability to accept up to 8–10 reference images simultaneously, maintaining consistent character, product, and style identity across generated scenes. It also supports hex color matching, reliable typography rendering, structured JSON prompts, and pose guidance, making it well-suited for brand-controlled workflows. Built-in C2PA cryptographic metadata provides content provenance, and layered safety filtering blocks IP-infringing and explicit content at inference time. The model is designed for use cases such as e-commerce product imagery, advertising campaigns, and any workflow requiring consistent visual identity across multiple assets.

Image
Context: 32,000 Output: N/A
Input: N/A Output: N/A
Image

FLUX.2 [turbo]

Release date unavailable

FLUX.2 [turbo] is an image generation model developed by Black Forest Labs, designed to convert text descriptions into images across a wide range of styles including photorealistic scenes, illustrations, concept art, and character design. The model supports resolutions up to 2K and 4MP output, with a context window of 10,000 tokens for prompt input. It accepts select and seed inputs, giving users control over style options and reproducibility of results. The model is positioned for workflows where generation speed is a priority, producing images in approximately 10 seconds at 4MP resolution. It supports multiple aspect ratios including 1:1, 16:9, 9:16, 4:3, and 3:4, making it adaptable for different creative and commercial formats. FLUX.2 [turbo] is well-suited for graphic designers, visual marketers, and developers integrating image generation into applications via API.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A

ByteDance

3 models

Image

Seedream 4.5

Release date unavailable

Seedream 4.5 is an image generation and editing model developed by ByteDance, built on a unified architecture that handles both creating images from text prompts and modifying existing images within a single system. It accepts up to 10 reference images in one request, enabling multi-source compositing workflows such as product swaps and element transfers between images while preserving depth, perspective, and lighting consistency. The model supports output resolutions up to 2048×2048 pixels across multiple aspect ratios including 1:1, 16:9, and 9:16. One of Seedream 4.5's most documented characteristics is its text rendering accuracy — it can produce correctly spelled, legible text in various font styles and non-Latin scripts integrated naturally into scenes like signs, packaging, and posters. The model ranks 10th globally on the LM Arena leaderboard with a score of 1147 and was trained through December 2025. It is well suited for designers, marketers, and e-commerce teams who need production-ready visuals driven by natural language prompts, without requiring manual selection tools or layer-based editing workflows.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

Seedream 4.0

Release date unavailable

Seedream 4.0 is an image generation model developed by ByteDance, designed to produce images from text prompts and source image inputs. It supports a context window of 10,000 tokens and accepts image URL arrays alongside numerical parameters, enabling flexible control over generation behavior. The model is part of ByteDance's Seedream series and is available through MindStudio's model catalog. Seedream 4.0 is best suited for workflows that require image generation guided by reference images, making it useful for tasks like style transfer, image variation, and visually consistent content creation. Its support for source image inputs distinguishes it from purely text-to-image pipelines, allowing users to anchor outputs to existing visual references. Developers can integrate it into MindStudio applications without managing separate API keys.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
Image

Seedream 5.0 Lite

Release date unavailable

Seedream 5.0 Lite is a lightweight image editing model developed by ByteDance, released in February 2026 as part of the Seedream 5.0 family from their Seed team. It is designed for creative workflows that require fast, high-quality image transformations driven by natural language prompts and reference images. The model supports combining elements from up to 14 reference images in a single edit, making it suited for compositing and character fusion tasks. It is positioned as a more efficient alternative to the full Seedream 5.0 suite while retaining the core editing capabilities of the family. A notable characteristic of Seedream 5.0 Lite is its focus on identity and face preservation, maintaining facial features such as eyes, jawline, proportions, and skin tone through style transformations. ByteDance specifically improved small-face rendering and skin texture restoration in this release compared to Seedream 4.5 Edit. The model also keeps non-edited regions stable, reducing unintended changes to areas outside the edit target. It is well suited for use cases including style transfer, e-commerce product visualization, social media content creation, and rapid concept art prototyping.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A

Kling

2 models

Image

Kling Image O1

Release date unavailable

Kling Image O1, formally known as Kling Omni Image O1, is an image generation model developed by Kuaishou Technology, the company behind the Kling AI ecosystem. It is built on a Multimodal Visual Language (MVL) framework that combines natural language understanding with multi-reference image processing, allowing it to accept between 1 and 10 reference images simultaneously and extract consistent visual features across all outputs. The model was trained through December 2025 and supports a context window of 10,000 tokens. The model is designed to address a common challenge in AI image generation: maintaining consistent character identity, style, and visual detail across multiple generated images. It is particularly suited for workflows such as IP character design, comic and manga creation, brand merchandise imagery, and serialized visual content where cross-image consistency is a requirement. Inputs include image URL arrays alongside select and toggle controls, giving users structured options for guiding generation behavior.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Kling Image O3

Release date unavailable

Kling Image O3 is the first image generation model released by Kling AI, designed to produce high-quality visuals from text prompts or reference images. It is notable for its ability to accurately render text within generated images, a capability that many image generation models handle poorly, making it well-suited for designs involving typography, signage, or branded content. The model supports resolutions up to 4K across a wide range of aspect ratios, including landscape dimensions up to approximately 6256×2681 pixels and portrait dimensions up to 3548×4730 pixels. Kling Image O3 accepts both text prompts and image inputs, allowing users to guide generation from an existing reference image as well as from a written description. Its combination of high-resolution output, compositional awareness, and in-image text rendering makes it particularly relevant for professional use cases such as game asset creation, marketing materials, and editorial illustration. The model is available through MindStudio without requiring separate API key management.

Image
Context: 2,500 Output: N/A
Input: N/A Output: N/A
View model →
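
As a rough check on the "up to 4K" claim in the Kling Image O3 description, the two maximum sizes quoted there can be converted to megapixels and aspect ratios. This is only a sketch over the listed numbers; the helper names are illustrative and not part of any Kling API.

```python
# Maximum output sizes quoted in the Kling Image O3 description (width, height).
MAX_SIZES = {
    "landscape": (6256, 2681),
    "portrait": (3548, 4730),
}


def megapixels(width: int, height: int) -> float:
    """Return the pixel count in millions of pixels."""
    return width * height / 1_000_000


def aspect_ratio(width: int, height: int) -> float:
    """Return width divided by height."""
    return width / height


for name, (w, h) in MAX_SIZES.items():
    print(f"{name}: {w}x{h} = {megapixels(w, h):.2f} MP, ratio {aspect_ratio(w, h):.2f}:1")
```

Both sizes come to roughly 16.8 megapixels, about the pixel count of a 4096×4096 image, with ratios close to 21:9 and 3:4.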
W

Wan

5 models

Image

Wan 2.6

Release date unavailable

Wan 2.6 is a multimodal AI generation model developed by Alibaba Cloud and released in December 2025. It uses a Mixture-of-Experts architecture with 14 billion total parameters, activating roughly 20% of them during inference. The model supports text-to-video, image-to-video, reference-to-video, and image generation modes, and accepts prompts in both English and Chinese. Video outputs can reach up to 15 seconds at 1080p resolution and 24 frames per second. What distinguishes Wan 2.6 from many generation models is its native audio output — synchronized dialogue, sound effects, and lip-sync are generated alongside video without requiring separate post-production tools. The model also supports multi-shot storytelling from a single prompt, maintaining character consistency across scenes with automatic camera transitions. It is well suited for content creators, marketers, and developers who need high-fidelity video and image output, particularly those aiming to produce publish-ready content with minimal manual editing.

Image
Context: 2,000 Output: N/A
Input: N/A Output: N/A
View model →
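
The Wan 2.6 figures above imply two derived numbers that are easy to work out: the approximate count of parameters active per inference pass, and the frame count of a maximum-length clip. The sketch below uses only the values stated in the description; "roughly 20%" is treated as exactly 0.20 for illustration.

```python
# Values taken from the Wan 2.6 description above.
TOTAL_PARAMS = 14_000_000_000   # 14 billion total parameters
ACTIVE_FRACTION = 0.20          # "roughly 20%" active, assumed exact here
MAX_SECONDS = 15                # maximum clip length
FPS = 24                        # frames per second

active_params = TOTAL_PARAMS * ACTIVE_FRACTION
max_frames = MAX_SECONDS * FPS

print(f"Active parameters per pass: about {active_params / 1e9:.1f} billion")
print(f"Frames in a maximum-length clip: {max_frames}")
```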
Image

Wan 2.5

Release date unavailable

Wan 2.5 is an open-source AI video generation model developed by Alibaba's DAMO Academy. It produces video clips up to 10 seconds long at resolutions up to 1080p, and generates synchronized audio — including dialogue with lip-sync, ambient sound effects, and background music — alongside the visuals in a single generation step. The model accepts text prompts, still images, audio tracks, or existing video clips as input, and supports cinematic controls such as camera movement types, lighting styles, and depth of field specified directly in the prompt. Wan 2.5 is designed for content creators, filmmakers, advertisers, and developers who need video output with accompanying audio without separate post-production workflows. It supports prompts and generated dialogue in at least 8 languages, and offers 480p, 720p, and 1080p as standard output resolutions with native 4K available in preview. Compared to its predecessor Wan 2.2, this version doubles the maximum video duration from 5 to 10 seconds, raises the standard resolution from 720p to 1080p, and introduces the audio generation system as an entirely new feature.

Image
Context: 2,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Wan 2.2 (Deprecated)

Release date unavailable

Structured model profile with pricing, context, and capability details.

Image
Context: N/A Output: N/A
Input: N/A Output: N/A
View model →
Image

Wan 2.7

Release date unavailable

Alibaba's powerful multimodal AI model that generates cinematic 1080p video with native audio synchronization, multi-shot storytelling, and advanced image creation.

Image
Context: 2K Output: N/A
Input: N/A Output: N/A
View model →
Image

Wan 2.7 Pro

Release date unavailable

Alibaba's powerful multimodal AI model that generates cinematic 1080p video with native audio synchronization, multi-shot storytelling, and advanced image creation.

Image
Context: 2K Output: N/A
Input: N/A Output: N/A
View model →
I

Ideogram

7 models

Image

Ideogram Upscale

Release date unavailable

Ideogram Upscale is an AI-powered image enhancement tool developed by Ideogram AI, first launched in June 2024. It takes lower-resolution images and scales them up to 8K resolution while preserving and sharpening fine detail. A notable aspect of the tool is its integration with Topaz Labs technology, which is incorporated directly into the Ideogram platform to improve output quality. The model accepts both images generated within Ideogram and externally sourced images brought in for enhancement. Ideogram Upscale is designed for designers, marketers, and creators who need production-ready assets at print or large-format quality. Common use cases include preparing graphics for merchandise, advertising materials, logos, and high-resolution digital displays. The tool is available both through the Ideogram web platform and via the Ideogram API, allowing developers to integrate upscaling into automated pipelines and custom workflows.

Image
Context: 10K Output: N/A
Input: N/A Output: N/A
View model →
Image

Ideogram V1

Release date unavailable

Ideogram V1 is an AI image generation model developed by Ideogram AI that creates visuals from text prompts. It was added to MindStudio in August 2024 and accepts inputs including text descriptions and configurable parameters. The model supports a range of artistic styles, from photorealistic imagery to illustrations and graphic design aesthetics, and accepts prompts of up to 10,000 tokens. What distinguishes Ideogram V1 from many other image generation models is its ability to render legible, accurately spelled text directly within generated images. This makes it particularly useful for designers, marketers, and content creators who need to produce assets like posters, banners, social media graphics, and branded materials where typography is part of the composition. Its strong prompt adherence also allows it to translate detailed descriptions into coherent visual outputs.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Ideogram V1 Remix

Release date unavailable

Ideogram V1 Remix is an image generation model developed by Ideogram AI that takes an existing image and a text prompt as inputs to produce a transformed version of that image. It is designed to reinterpret visual content by applying new styles, moods, or artistic directions while preserving the underlying compositional structure of the source image. The model builds on Ideogram's V1 image generation foundation and adds image-guided creation as a core workflow. Ideogram V1 Remix is particularly suited for creative professionals, designers, and artists who need to iterate on visual concepts or explore stylistic variations from a starting reference. One of its notable characteristics is its text rendering accuracy, a trait carried over from the broader Ideogram model family. Users can control outputs through parameters including style selection, aspect ratio, and a seed value for reproducibility, making it useful for both exploratory and production-oriented creative work.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Ideogram V2

Release date unavailable

Ideogram V2 is a text-to-image model developed by Ideogram and released in August 2024 as the second generation of their image generation platform. It accepts text prompts along with style, aspect ratio, and other configuration inputs to produce images, and includes a Magic Prompt feature that automatically expands simple prompts into more detailed descriptions before generation. The model supports a prompt input up to 10,000 characters and accepts a seed value for reproducible outputs. Ideogram V2 is particularly suited for use cases that require legible, well-styled text embedded directly within generated images, such as social media graphics, posters, banners, product labels, and logo concepts. It is used by designers, marketers, and content creators who need images where typography accuracy is a priority. The model offers style control options and is available through the Ideogram platform as well as via API for integration into third-party applications.

Image
Context: 10K Output: N/A
Input: N/A Output: N/A
View model →
Image

Ideogram V2 Remix

Release date unavailable

Ideogram V2 Remix is an image-to-image generation model developed by Ideogram. Rather than generating images from scratch, it takes an uploaded source image and a text prompt, then produces a transformed version that blends the written creative direction with the original composition and visual elements. It accepts a range of common image formats including jpg, jpeg, png, webp, gif, and avif. The model is designed for designers, artists, and content creators who want to iterate on existing visuals rather than start from a blank canvas. It supports stylistic and thematic transformations guided by natural language, making it useful for exploring concept variations, adapting imagery to new aesthetics, or generating multiple creative directions from a single reference image. A seed input is available for reproducibility, and multiple selection parameters allow control over style, aspect ratio, and other output characteristics.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Ideogram V3

Release date unavailable

Ideogram V3, also referred to as Ideogram 3.0, is a text-to-image generation model released by Ideogram in March 2025. It accepts text prompts alongside optional inputs such as style reference images, aspect ratio selections, and rendering mode preferences to produce photorealistic images. One of its defining technical characteristics is its ability to render legible, accurate typography directly within generated images — a capability that has historically been a challenge for image generation models. It also supports a Reframe variant that enables outpainting and multi-aspect-ratio adaptation. Ideogram V3 is available in three rendering tiers — Turbo, Balanced, and Quality — allowing users to trade off generation speed against output fidelity depending on their workflow. The model is particularly suited for use cases where visual accuracy and readable text within images are both required, such as advertising assets, e-commerce photography, branded content, UX mockups, and editorial design. Its style reference control feature allows a reference image to guide color grading, texture, and compositional style across a set of generated outputs. The model accepts a seed input, enabling reproducible results when the same prompt and settings are reused.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Ideogram V3 Remix

Release date unavailable

Ideogram V3 Remix is an image editing model developed by Ideogram, a company founded by former Google Brain researchers. It extends the base Ideogram V3 image generation model with a remixing system that allows users to transform existing images using text prompts, with a 0–100 strength slider that controls how much the output deviates from the source image. Users can supply their own images or work with images previously generated in Ideogram, and the model will automatically generate a descriptive prompt when an external image is uploaded. The model accepts up to three style reference images to guide color palette, texture, and mood, and supports reusable Style Codes for maintaining brand consistency across outputs. Ideogram is particularly noted for its ability to render legible, correctly spelled text within generated images, making it well-suited for posters, packaging, logos, and marketing materials. It is designed for designers, marketers, and creative professionals who need to iterate on visual concepts without rebuilding them from scratch.

Image
Context: 10K Output: N/A
Input: N/A Output: N/A
View model →
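
Context values in a catalog like this can appear in several display forms, such as 10K, 10,000, or N/A. A consumer of the listing might normalize them before comparing models. The helper below is a minimal sketch; `parse_context` is a hypothetical name and not part of any catalog API.

```python
from typing import Optional


def parse_context(value: str) -> Optional[int]:
    """Convert a display string such as '10K', '10,000', or 'N/A' to an int.

    Returns None when the value is missing or marked N/A.
    """
    text = value.strip().upper().replace(",", "")
    if text in ("", "N/A"):
        return None
    if text.endswith("K"):
        # '10K' -> 10000, '2.5K' -> 2500
        return int(float(text[:-1]) * 1000)
    return int(text)


for raw in ["10K", "10,000", "2K", "77", "N/A"]:
    print(f"{raw!r} -> {parse_context(raw)}")
```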
Q

Qwen

5 models

Image

Qwen Image

Aug 04, 2025

Qwen-Image is an image generation and editing model developed by Alibaba's Qwen team. It accepts text prompts and source images as input and supports both text-to-image generation and a wide range of image editing tasks, including style transfer, object addition and removal, background changes, and pose manipulation. The model uses a dual-encoding architecture that processes images through both Qwen2.5-VL for semantic understanding and a VAE encoder for visual fidelity, feeding into an MMDiT backbone. What distinguishes Qwen-Image from many other generation models is its ability to render complex text accurately within images, including multi-line layouts and logographic scripts such as Chinese characters. This capability is built using a curriculum learning strategy that progressively scales from simple to complex text rendering tasks during training. The model has been evaluated on benchmarks covering image generation, image editing, and text rendering, including GenEval, DPG, GEdit, LongText-Bench, ChineseWord, and CVTG-2K. It is well-suited for workflows that require accurate in-image typography, multilingual text, or detailed image editing from a source image.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Qwen 2 Pro

Release date unavailable

Qwen Image 2.0 Pro is an image generation and editing model developed by Alibaba's Qwen team and released in February 2026. It uses an 8B Qwen3-VL encoder paired with a 7B diffusion decoder to produce images natively at 2048×2048 resolution. A single model handles both text-to-image generation and image editing tasks, and it accepts prompts up to 1,000 tokens for detailed scene descriptions. It holds the number one position on AI Arena's blind human evaluation leaderboard for both text-to-image generation and image editing. One of the model's defining characteristics is its ability to render accurately spelled, properly positioned text within generated images, making it suitable for infographics, presentation slides, movie posters, comics, and bilingual Chinese and English content. Its 7 billion parameter footprint is smaller than its predecessor, which used 20 billion parameters, enabling faster inference. The model is well suited for marketing teams, content creators, and designers who need production-ready visuals where accurate text rendering, high native resolution, or iterative editing workflows are priorities.

Image
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Qwen Image Edit Plus

Release date unavailable

Qwen Image Edit Plus is an image generation and editing model developed by Qwen, released in early 2026. It supports text-to-image generation, image-to-image editing, and ControlNet pose conditioning, making it suited for workflows that require precise control over output composition. The model accepts image URL arrays, numeric parameters, and seed values as inputs, enabling reproducible results across generation runs. The model is designed for tasks that involve modifying existing images based on text prompts as well as generating new images from scratch. Its ControlNet pose support allows users to guide human figure layouts using reference poses, which is useful for character-focused creative work. With a context window of 50,000 tokens, it can process detailed prompt instructions alongside image inputs.

Image
Context: 50,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Z Image Turbo

Release date unavailable

Z Image Turbo is a text-to-image generation model developed by Alibaba's Tongyi-MAI lab, built on a single-stream diffusion transformer architecture with 6 billion parameters. It is the distilled, few-step variant of the Z-Image foundation model, designed to produce high-quality images at faster inference speeds without significant quality degradation. The model incorporates a Reinforcement Learning from Human Feedback (RLHF) pipeline using DPO and GRPO stages to align outputs with human aesthetic preferences, and includes a built-in prompt enhancer with a reasoning chain to improve results from short or simple prompts. Z Image Turbo accepts text prompts, source images, LoRA weights, and a seed value as inputs, making it suitable for both text-to-image and image editing workflows. Its training data infrastructure includes a Data Profiling Engine, Cross-modal Vector Engine, and a multi-level image captioning system covering OCR, world knowledge, and editing difference captions. The model is well-suited for creative professionals, developers building image generation pipelines, and researchers working with efficient diffusion transformer architectures.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Z Image Turbo Controlnet

Release date unavailable

Z Image Turbo Controlnet is an image generation model developed by Alibaba's Tongyi-MAI lab, built on a single-stream diffusion transformer architecture with 6 billion parameters. It uses a few-step distillation approach (the Turbo variant) to accelerate inference while preserving output quality, and incorporates ControlNet to allow structural guidance from a source image. The model was trained with a multi-level captioning system and a data infrastructure that includes a Cross-modal Vector Engine and World Knowledge Topological Graph to improve semantic alignment between prompts and outputs. This model is well-suited for workflows that require both speed and structural control over generated images, such as guided creative generation, image editing pipelines, and rapid prototyping. It accepts image URLs as source inputs alongside configurable parameters including seed values for reproducibility. An RLHF alignment pipeline using DPO and GRPO stages was applied to bring outputs closer to human aesthetic preferences, and a built-in prompt enhancer with reasoning chain helps produce better results from short or underspecified prompts.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
W

Wavespeed

1 model

S

Stability

5 models

Image

SDXL LoRA

Jul 04, 2023

SDXL LoRA is a text-to-image generative AI model developed by Stability AI, built as a successor to Stable Diffusion. It runs on a 3.5 billion parameter architecture and generates images natively at 1024×1024 resolution, using dual text encoders — OpenCLIP-ViT/G and CLIP-ViT/L — to interpret complex prompts with reported 89% prompt adherence in benchmark testing. The model also supports an optional refiner stage that applies an ensemble-of-experts approach to add fine detail to generated outputs. What distinguishes SDXL LoRA from the base SDXL model is its built-in support for Low-Rank Adaptation (LoRA), a technique that enables efficient style and subject customization without full model retraining. Users can apply up to five LoRA adapters simultaneously, making it practical for tasks like consistent character design, brand-specific imagery, and specialized artistic styles. It is well-suited for digital artists, marketing teams, game developers, and product designers who need repeatable, customizable visual output at scale.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

SDXL

Release date unavailable

SDXL (Stable Diffusion XL) is an open-source image generation model developed by Stability AI and released in July 2023. It accepts text prompts and optional image inputs to produce images, and supports workflows including text-to-image generation, image-to-image editing, and inpainting. The model is available in two configurations: SDXL 1.0, optimized for out-of-the-box use, and SDXL 1.0 Open, which allows fine-tuning with custom data and inference code. Both variants are deployable via Amazon SageMaker and Amazon Bedrock. SDXL is designed for designers, creative professionals, and developers who need generative imagery at scale. Its open model variant supports customization through fine-tuning, making it usable for specialized image pipelines beyond general-purpose prompting. Inputs include image URLs, numeric parameters for dimensions, and a seed value for reproducible outputs. The model is tagged as open source and multi-modal, reflecting its support for both text and image inputs.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Stable Diffusion 3

Release date unavailable

Stable Diffusion 3 (SD3) is a text-to-image generation model developed by Stability AI and released in June 2024. It introduces a Multimodal Diffusion Transformer (MMDiT) architecture that maintains separate weight sets for image and language representations, which improves the model's ability to interpret complex, detailed prompts. The model is available in multiple size variants ranging from 800 million to 8 billion parameters, making it deployable across a range of hardware configurations. One of SD3's most notable characteristics is its ability to render legible text within generated images, a task that has historically been difficult for diffusion-based models. The 8B parameter variant fits within 24GB of VRAM and generates a 1024×1024 image in approximately 34 seconds using 50 sampling steps. SD3 is well suited for creative professionals, developers, and researchers who require high-fidelity image generation with strong alignment to nuanced text prompts.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
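
The Stable Diffusion 3 timing quoted above (about 34 seconds for 50 sampling steps on the 8B variant) works out to a per-step cost that can be used for rough estimates at other step counts. The sketch below assumes generation time scales linearly with the number of steps, which is an assumption for illustration and not a claim from the listing.

```python
# Figures quoted in the Stable Diffusion 3 description above.
SECONDS_PER_IMAGE = 34.0
SAMPLING_STEPS = 50

seconds_per_step = SECONDS_PER_IMAGE / SAMPLING_STEPS


def estimate_seconds(steps: int) -> float:
    """Estimate generation time for a step count, assuming linear scaling."""
    return steps * seconds_per_step


print(f"Approximate cost per step: {seconds_per_step:.2f} s")
for steps in (20, 30, 50):
    print(f"{steps} steps: about {estimate_seconds(steps):.1f} s")
```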
Image

Stable Image Core

Release date unavailable

Stable Image Core is a text-to-image generation model developed by Stability AI, designed to convert natural language descriptions into detailed visual imagery. It accepts text prompts of up to 77 tokens and produces images across a range of styles and subjects. The model is available through both the Stability AI API and Amazon Bedrock, giving developers flexibility in how they integrate it into their workflows. Stable Image Core is well suited for use cases such as creative content generation, marketing visuals, concept art, and rapid visual prototyping. Its availability on Amazon Bedrock means it can be incorporated into cloud-based applications without managing underlying infrastructure. The model serves as an accessible entry point into Stability AI's image generation ecosystem, balancing output quality with ease of deployment.

Image
Context: 77 Output: N/A
Input: N/A Output: N/A
View model →
Image

Stable Image Ultra

Release date unavailable

Stable Image Ultra is Stability AI's flagship text-to-image generation model, designed to produce high-quality, photorealistic images from natural language text prompts. It sits at the top of Stability AI's image generation lineup and accepts concise text descriptions (up to a 77-token context window) to generate detailed visuals with strong coherence and fidelity. The model supports configurable inputs including text prompts, selection parameters, and a seed value for reproducible outputs. Stable Image Ultra is well-suited for applications such as marketing visuals, concept art, product visualization, and editorial illustration. It is available through Stability AI's own API and via Amazon Bedrock, making it accessible for production-scale deployments without requiring infrastructure management. Developers and enterprises can integrate it directly into existing workflows through these managed cloud platforms.

Image
Context: 77 Output: N/A
Input: N/A Output: N/A
View model →
L

Luma Labs

2 models

Image

Photon 1

Release date unavailable

Photon 1 is a text-to-image generation model developed by Luma Labs, the company also known for its Ray video generation models. Released in early 2025, it is built on a proprietary Universal Transformer architecture rather than the diffusion-based approach used by many image generators, which enables it to produce 1920×1080 resolution outputs with accurate lighting, shadows, and textures. It accepts natural language prompts without requiring specialized prompt engineering, and supports up to four reference images to guide style, composition, or character appearance. Photon 1 is available in two configurations: the standard variant targets maximum quality for professional and print-ready use cases, while Photon 1 Flash is a lighter variant optimized for speed, with generation times as low as 100–500 milliseconds. A beta character consistency feature allows the model to maintain a character's appearance across multiple generations using a single reference image. The model is well suited for designers, marketers, and content creators who need production-ready imagery for applications such as product photography, marketing campaigns, and character-driven visual projects.

Image
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Photon 1 Flash

Release date unavailable

Photon 1 Flash is an image generation model developed by Luma Labs, designed with an emphasis on speed and low-latency inference. It is the "flash" variant in the Photon model family, meaning it is optimized for faster generation cycles compared to more computationally intensive counterparts. The model accepts prompts of up to 1,000 tokens and is accessible through AI platforms and aggregators including MindStudio. Photon 1 Flash is best suited for production environments and real-time applications where response time is a priority alongside image quality. Its primary audience is developers and businesses building high-throughput workflows such as rapid prototyping tools, content pipelines, or interactive applications. Its design trades some output fidelity for generation speed, making it a practical choice when latency is a key deployment constraint.

Image
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →