LLM Model Directory

Explore frontier AI models by provider, pricing, and context

Browse the synced model catalog by provider, release, pricing, and core capabilities.

253 models · 29 providers in view · Current filter: All providers

OpenAI

45 models

Text

GPT 5.5

Apr 24, 2026

GPT-5.5 is OpenAI’s frontier model designed for complex professional workloads, building on GPT-5.4 with stronger reasoning, higher reliability, and improved token efficiency on hard tasks. It features a 1M+ token...

Text Image File
Context: 1050K Output: 128,000 tokens
Input: $5.00 Output: $30.00
Text

GPT 5.4

Mar 05, 2026

GPT-5.4 is a text generation model developed by OpenAI, released in March 2026 as their flagship model for professional and enterprise use. It is available in three variants — standard, Thinking, and Pro — and features a context window of 1 million tokens, the largest OpenAI has offered. The model is designed not only to plan complex tasks but to complete them reliably, with built-in computer use capabilities for orchestrating multi-step agentic workflows. GPT-5.4 is best suited for enterprise teams running AI in production environments, including customer support automation, document drafting, data analysis, and developer workflows. It recorded an 83% score on GDPval for knowledge work tasks and ranked second out of 116 models on the Artificial Analysis Intelligence Index. The Pro variant adds multi-path reasoning evaluation for scenarios where analytical depth is prioritized over speed, such as scientific research and complex decision-making.

Text Image File
Context: 1050K Output: 128,000 tokens
Input: $2.50 Output: $15.00
Text

GPT 5.4 Pro

Mar 05, 2026

GPT-5.4 Pro is a text generation model developed by OpenAI, released in March 2026 as part of the GPT-5.4 family. It is one of three variants in that family, alongside the standard GPT-5.4 and GPT-5.4 Thinking, and is specifically optimized for deep analytical work through multi-path reasoning evaluation. The model supports a context window of 1 million tokens, the largest OpenAI has offered, enabling it to process extensive documents, codebases, and multi-step workflows within a single session. GPT-5.4 Pro is designed for professional and enterprise use cases where thoroughness takes priority over speed, including scientific research, complex decision-making, legal analysis, and financial modeling. The broader GPT-5.4 family includes built-in computer use capabilities for agentic workflows, produces 33% fewer factual errors than GPT-5.2, and ranked #2 out of 116 models on the Artificial Analysis Intelligence Index. It was also evaluated on OSWorld-Verified and WebArena Verified, and recorded an 83% score on GDPval for knowledge work tasks.

Text Image File
Context: 1050K Output: 128,000 tokens
Input: $30.00 Output: $180.00
Text

GPT-5.2 Pro

Dec 10, 2025

GPT-5.2 Pro is a text generation model developed by OpenAI, added to MindStudio in December 2025. It supports a 400,000-token context window and is trained on data through December 2025; it was OpenAI's most recent flagship release at the time it was added. The model is tagged for reasoning, tool use, and MCP (Model Context Protocol) support, reflecting its design for complex, multi-step tasks. GPT-5.2 Pro is built for professional knowledge work across a wide range of domains. According to OpenAI, it was evaluated on GDPval, a benchmark spanning 44 occupations, where it performed at or above the level of industry professionals on well-specified tasks. It is best suited for workflows that require deep reasoning, tool integration, and handling large documents or long-context inputs.

Text Image File
Context: 400,000 Output: 256,000 tokens
Input: $21.00 Output: $168.00
Image

GPT Image 1.5

Release date unavailable

GPT Image 1.5 is an image generation model developed by OpenAI and released in December 2025. It serves as the engine behind the ChatGPT Images experience and is also available to developers via the API using the model ID gpt-image-1.5. The model supports both text-to-image generation and image editing, with output resolutions up to 1536×1024 and 1024×1536 pixels. It ranks first in the Image Edit category on the Chatbot Arena leaderboard. The model is designed to follow nuanced editing instructions, changing only the specified elements of an image while preserving lighting, composition, and overall context. It maintains consistent facial likeness across inputs, outputs, and sequential edits, making it well-suited for photo retouching, virtual try-ons, stylistic transformations, and multi-image compositing. Text rendering accuracy is notably improved compared to its predecessor, and generation speed is up to 4x faster. It is accessible to all ChatGPT users as well as developers building applications in e-commerce, marketing, and creative tooling.

Image
Context: 4,000 Output: N/A
Input: $5.00 Output: $10.00
Text

GPT-5.1

Nov 13, 2025

GPT-5.1 is a text generation model developed by OpenAI, positioned as the flagship option for coding and agentic workflows. It supports a 400,000-token context window and features configurable reasoning effort, allowing users to toggle between reasoning and non-reasoning modes depending on the task at hand. Its training data extends through November 2025. The model is designed with tool use and agent orchestration in mind, accepting inputs that include tool definitions and MCP server configurations alongside standard text prompts. This makes it well-suited for multi-step tasks, automated pipelines, and code generation scenarios where structured decision-making and external integrations are required.

Text Image File
Context: 400,000 Output: 128,000 tokens
Input: $1.25 Output: $10.00
Video

Sora 2 Pro

Release date unavailable

Sora 2 Pro is the premium tier of OpenAI's second-generation video generation model, available to ChatGPT Pro subscribers. It generates videos up to 25 seconds in length at resolutions up to 1080p and frame rates between 24 and 60 fps, with synchronized dialogue, sound effects, and ambient audio produced alongside the video. The model also includes a Cameo feature that lets users inject a consistent character — based on an uploaded video of a person, pet, or object — into any generated scene. Sora 2 Pro is designed for filmmakers, content creators, marketers, and storytellers who require longer, higher-fidelity AI-generated video with professional-grade audio. The model handles complex, multi-part prompts and maintains character and scene continuity across multiple shots within a single clip. It models physical phenomena such as gravity, collisions, and fluid dynamics, and scored approximately 8.5 out of 10 in independent physics evaluations. The model's training data has a cutoff of September 2025.

Video
Context: 5,000 Output: N/A
Input: N/A Output: N/A
Text

GPT-5

Aug 07, 2025

GPT-5 is OpenAI's flagship text generation model, designed with a focus on coding, reasoning, and agentic tasks across a wide range of domains. It supports a 400,000-token context window and has a training data cutoff of September 2024. The model is tagged for reasoning, tool use, and MCP (Model Context Protocol) support, reflecting its orientation toward complex, multi-step workflows. GPT-5 is best suited for developers and teams building agentic applications, automated pipelines, and code-heavy workflows. It accepts tool definitions and MCP server configurations as inputs, making it well-suited for orchestration scenarios where the model needs to call external functions or services. It is available via the OpenAI API and accessible on MindStudio without requiring separate API key management.

Text Image File
Context: 400,000 Output: 128,000 tokens
Input: $1.25 Output: $10.00
Text

GPT-5 Chat

Aug 07, 2025

GPT-5 Chat is a text generation model developed by OpenAI and serves as the snapshot of GPT-5 currently deployed in ChatGPT. It has a 400,000-token context window and a training data cutoff of September 2024. The model supports tool use and MCP (Model Context Protocol) servers as input types, making it suitable for agentic workflows and integrations. GPT-5 Chat is designed for high-intelligence tasks that benefit from a large context window, such as long-document analysis, multi-step reasoning, and complex instruction following. Its support for tools and MCP servers means it can be connected to external services and data sources within automated pipelines. Developers accessing it via the API receive the same model version that powers the ChatGPT interface, keeping behavior consistent across both surfaces.

Text Image File
Context: 400,000 Output: 16,384 tokens
Input: $1.25 Output: $10.00
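
As a rough illustration of how the two limits on the GPT-5 Chat card interact, the sketch below computes how many prompt tokens remain once room for a reply has been reserved. It assumes the context window is shared between input and output, which the catalog does not state. `max_prompt_tokens` is a hypothetical helper, not part of any SDK.

```python
def max_prompt_tokens(context_window: int, reserved_output: int) -> int:
    """Tokens left for the prompt after reserving space for the reply.

    Assumption: input and output share a single context window.
    """
    if reserved_output > context_window:
        raise ValueError("reserved output exceeds the context window")
    return context_window - reserved_output


# Figures from the GPT-5 Chat card: 400,000-token context, 16,384-token output cap.
budget = max_prompt_tokens(400_000, 16_384)
print(budget)
```

The same helper applies to any card in the listing by swapping in its Context and Output values.
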
Text

GPT-5 mini

Aug 07, 2025

GPT-5 mini is a text generation model developed by OpenAI, designed as a faster and more cost-efficient variant of GPT-5. It supports a 400,000-token context window and has a training data cutoff of May 2024. The model is tagged as a latest release and supports tool use and MCP (Model Context Protocol) server integrations. GPT-5 mini is best suited for well-defined tasks where precise prompting is used and response speed or cost efficiency is a priority. It accepts structured inputs including tool calls and MCP server configurations, making it a practical choice for agentic workflows and automation pipelines. Developers working on tasks with clear, bounded requirements are the primary intended audience for this model.

Text Image File
Context: 400,000 Output: 128,000 tokens
Input: $0.25 Output: $2.00
Text

GPT-5 nano

Aug 07, 2025

GPT-5 Nano is a text generation model developed by OpenAI and released as part of the GPT-5 model family. It is designed to be the fastest and most cost-efficient variant in that family, making it accessible for high-volume or latency-sensitive applications. The model supports a 400,000-token context window and has a training data cutoff of May 2024. It accepts structured inputs including tool calls and MCP server configurations. GPT-5 Nano is particularly well-suited for summarization and classification tasks, where speed and throughput matter more than extended reasoning depth. Its large context window allows it to process long documents in a single pass, which is useful for document triage, content labeling, and similar workflows. Developers can integrate it with external tools and MCP servers, extending its utility beyond pure text generation into agentic and multi-step task scenarios.

Text Image File
Context: 400,000 Output: 128,000 tokens
Input: $0.05 Output: $0.40
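
To make the price gap between the three GPT-5 tiers concrete, here is a minimal cost sketch. The catalog does not state the pricing unit; the code assumes each figure is USD per 1,000,000 tokens. `estimate_cost` and the workload size are illustrative, not taken from the source.

```python
# Prices copied from the GPT-5, GPT-5 mini, and GPT-5 nano cards.
# Assumption: each figure is USD per 1,000,000 tokens (the catalog omits the unit).
PRICES = {
    "GPT-5": (1.25, 10.00),
    "GPT-5 mini": (0.25, 2.00),
    "GPT-5 nano": (0.05, 0.40),
}
TOKENS_PER_PRICE_UNIT = 1_000_000


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated request cost in USD under the per-million-token assumption."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / TOKENS_PER_PRICE_UNIT


for name in PRICES:
    # Hypothetical workload: a 200,000-token document summarized into 5,000 tokens.
    print(f"{name}: ${estimate_cost(name, 200_000, 5_000):.3f}")
```

Under that assumption the three tiers differ by a constant factor on both input and output, so the ranking does not depend on the input/output mix.
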
Text

GPT OSS 120B

Aug 05, 2025

GPT OSS 120B is OpenAI's largest open-weight model, released in August 2025 under the Apache 2.0 license. It has approximately 116.8 billion total parameters and uses a Mixture-of-Experts (MoE) architecture that activates only around 5.1 billion parameters per token, enabling efficient inference on a single H100 GPU. The model is part of the GPT OSS family and is designed for commercial and private deployments without licensing restrictions. The model is built for coding, mathematical reasoning, scientific analysis, and agentic workflows. It supports a 128,000-token context window, adjustable reasoning levels (low, medium, and high), and native tool use including web browsing, Python code execution, and custom developer-defined functions. Architecturally, it uses 36 transformer layers with 128 experts per MoE layer (top 4 active per token), Grouped Query Attention, Rotary Position Embeddings, and an alternating local/dense attention pattern, and it is available for local inference via Hugging Face Transformers, llama.cpp, and vLLM.

Text Tools Structured Output
Context: 131.1K Output: 32,768 tokens
Input: $0.15 Output: $0.00
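
The "128 experts, top 4 active per token" routing in the GPT OSS 120B description can be sketched as follows. This is a toy illustration of one common gating scheme (keep the top-k router scores, then softmax over them). The source does not confirm it as the model's actual router, and the random logits stand in for a learned one.

```python
import math
import random


def route_top_k(router_logits, k=4):
    """Select the k highest-scoring experts and softmax-normalize their scores.

    Returns a list of (expert_index, gate_weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(router_logits)), key=router_logits.__getitem__, reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # subtract peak for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


NUM_EXPERTS = 128  # experts per MoE layer, from the description
rng = random.Random(0)  # fixed seed so the toy logits are reproducible
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in for a learned router

routing = route_top_k(logits, k=4)
for expert, weight in routing:
    print(f"expert {expert:3d} gate weight {weight:.3f}")
```

Only the four selected experts run for a given token, which is why the active parameter count is far below the total.
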
Text

GPT OSS 20B

Aug 05, 2025

GPT OSS 20B is an open-weight text generation model released by OpenAI in August 2025, representing the company's first open-weight release since GPT-2 in 2019. It uses a Mixture-of-Experts (MoE) architecture with 21 billion total parameters, activating approximately 3.6 billion parameters per token across 4 of 32 experts in 24 layers. Combined with MXFP4 4-bit quantization, the model runs within 16GB of memory, making it suitable for consumer hardware and on-device deployment. It is licensed under Apache 2.0, allowing local hosting, firewall-protected deployment, and fine-tuning for custom use cases. GPT OSS 20B supports a 128,000-token context window and includes adjustable reasoning levels — low, medium, and high — with chain-of-thought traces. Its documented strengths include coding, mathematical reasoning, and scientific analysis, along with tool use and agentic workflow support. The model also produces structured outputs for predictable, schema-conforming responses. It is available through Hugging Face, Amazon SageMaker, Amazon Bedrock, and NVIDIA NIM, and is well-suited for developers and organizations that require a self-hosted, customizable AI model without relying on cloud infrastructure.

Text Tools Structured Output
Context: 128,000 Output: 32,768 tokens
Input: $0.10 Output: $0.00
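
A back-of-the-envelope check of the 16GB figure in the GPT OSS 20B description: the sketch below estimates weight storage only. It assumes every parameter is stored in MXFP4 at 4 bits plus one shared 8-bit scale per 32-value block; in practice some weights may stay in higher precision, and activations and the KV cache add to the total.

```python
def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 21e9       # total parameters, from the description
MXFP4_BITS = 4 + 8 / 32   # assumption: 4-bit values + one 8-bit scale per 32-value block
BF16_BITS = 16            # unquantized 16-bit baseline for comparison

mxfp4_gb = weight_memory_gb(TOTAL_PARAMS, MXFP4_BITS)
bf16_gb = weight_memory_gb(TOTAL_PARAMS, BF16_BITS)
print(f"MXFP4 weights: ~{mxfp4_gb:.1f} GB, 16-bit weights: ~{bf16_gb:.1f} GB")
```

Under these assumptions the quantized weights leave a few gigabytes of headroom inside 16GB, while the 16-bit baseline would not fit.
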
Text

o3-pro

Jun 10, 2025

o3-pro is a text generation model developed by OpenAI, released on June 10, 2025. It is built around a reasoning-first architecture that performs iterative self-reflection before producing a response, simulating multiple solution paths and evaluating potential flaws rather than generating a single-pass output. The model accepts both text and image inputs and supports a 200,000-token context window. It also includes autonomous tool use, allowing it to independently invoke capabilities like Python execution, file analysis, and web retrieval. o3-pro is designed for tasks that require sustained, multi-step reasoning — including mathematics, software engineering, scientific research, and legal analysis. It supports structured outputs and function calling, making it suitable for integration into developer pipelines and agentic workflows. Access to the model via API requires identity verification (KYC) from OpenAI. It is best suited for developers, researchers, and enterprises that need reliable, deeply reasoned outputs on complex problems.

Text Image File
Context: 200,000 Output: 100,000 tokens
Input: $20.00 Output: $80.00
Text

o3

Apr 16, 2025

OpenAI o3 is the flagship model in OpenAI's o-series of reasoning models, released in April 2025. It is designed to spend more time thinking through problems before responding, using large-scale reinforcement learning to work through complex, multi-step tasks. The model supports a 200,000-token context window and can process both text and images as inputs. According to OpenAI, o3 makes 20% fewer major errors than OpenAI o1 on difficult real-world tasks, with particular strength in programming, business consulting, and creative ideation. A notable feature of o3 is its ability to integrate images directly into its reasoning process: it does not just interpret them but actively uses them as part of problem-solving, including handling blurry, reversed, or low-quality visuals. The model can also autonomously combine tools such as web search, Python-based data analysis, and image generation to address multi-faceted questions. It is best suited for users who need rigorous analytical reasoning across domains like biology, mathematics, engineering, and software development, particularly when tasks require combining visual and textual information.

Text Image File
Context: 200,000 Output: 100,000 tokens
Input: $2.00 Output: $8.00
Text

o4-mini

Apr 16, 2025

o4-mini is a compact text generation model developed by OpenAI and released in April 2025 alongside the larger o3 model. It uses a chain-of-thought reasoning approach, thinking through problems step by step before producing a response, which makes it well-suited for structured problem-solving in math, coding, science, and visual tasks. The model supports a 200,000-token context window, allowing it to process and analyze lengthy documents in a single session. What distinguishes o4-mini from earlier reasoning models is its native ability to incorporate images directly into its reasoning process — not just interpreting them, but actively using them as part of its chain of thought, including handling low-quality or rotated images. It is also trained for agentic tool use, meaning it can decide when to invoke tools like web search, Python execution, or file analysis to complete multi-step tasks. Its design prioritizes high throughput, making it a practical choice for developers and applications that require large volumes of reasoning-intensive requests.

Text Image File
Context: 200,000 Output: 100,000 tokens
Input: $1.10 Output: $4.40
Text

GPT-4.1

Apr 14, 2025

GPT-4.1 is a text generation model developed by OpenAI and released in April 2025. It is positioned as OpenAI's flagship model for handling complex, multi-domain tasks and is available to developers via the OpenAI API. The model supports a 200,000-token context window, enabling it to process and reason over long documents, codebases, and extended conversations in a single request. Its training data has a knowledge cutoff of May 31, 2024. GPT-4.1 is designed for problem solving across a wide range of domains, including coding, analysis, instruction following, and structured output generation. It is an API-only model, meaning it is accessible through the OpenAI platform rather than through ChatGPT's consumer interface. Developers building agents, pipelines, or applications that require handling large amounts of context or complex multi-step instructions are the primary intended audience for this model.

Text Image File
Context: 200,000 Output: 32,768 tokens
Input: $2.00 Output: $8.00
Text

GPT-4.1 Mini

Apr 14, 2025

GPT-4.1 Mini is a text generation model developed by OpenAI, released as part of the GPT-4.1 model family in April 2025. It is designed to occupy a middle ground between the full GPT-4.1 model and lighter-weight options, offering a context window of over one million tokens — specifically 1,047,576 tokens. The model has a training data cutoff of May 31, 2024, and is accessible via the OpenAI API. GPT-4.1 Mini is positioned for use cases where developers need a capable text generation model without the latency or cost profile of larger models. Its large context window makes it suitable for tasks involving long documents, extended conversations, or multi-step instructions. It fits well into applications that require a balance of response quality, throughput, and cost efficiency.

Text Image File
Context: 1,047,576 Output: 32,768 tokens
Input: $0.40 Output: $1.60
Text

GPT-4.1 Nano

Apr 14, 2025

GPT-4.1 Nano is a text generation model developed by OpenAI and released in April 2025. It is the smallest and most cost-efficient model in the GPT-4.1 family, designed for latency-sensitive and high-throughput applications. It supports a context window of over one million tokens (1,047,576 tokens), making it capable of processing very long documents or conversation histories in a single request. Its training data has a knowledge cutoff of May 31, 2024. GPT-4.1 Nano is best suited for tasks where speed and cost efficiency are priorities, such as classification, summarization, autocomplete, and lightweight instruction-following. Because it sits at the smaller end of the GPT-4.1 family, it trades some capability headroom for significantly lower latency and cost per token. Developers building applications that require frequent, rapid model calls — such as real-time assistants, tagging pipelines, or high-volume data processing — are the primary target audience for this model.

Text Image File
Context: 1,047,576 Output: 32,768 tokens
Input: $0.10 Output: $0.40
Text

o1-pro

Mar 19, 2025

o1-pro is a text generation model developed by OpenAI and released in December 2024. It is built on the same foundation as the o1 model family but allocates significantly more compute and longer reflection time per query, which allows it to work through multi-step problems more carefully before producing a response. It supports a 200,000-token context window and can generate up to 100,000 tokens in a single output, and it accepts both text and image inputs. The model is designed for tasks where accuracy on difficult problems takes priority over response speed. It performs well on advanced mathematics, scientific reasoning, and complex coding challenges, with benchmark scores including 94.8% on MATH, 92.4% on HumanEval, and 77.3% on GPQA. o1-pro was initially available exclusively through the ChatGPT Pro subscription plan before becoming accessible via the OpenAI API in March 2025.

Text Image File
Context: 200,000 Output: 100,000 tokens
Input: $150.00 Output: $600.00
Text

o3-mini

Jan 31, 2025

o3-mini is a text generation model developed by OpenAI and released in January 2025. It belongs to OpenAI's o-series, a family of models trained to reason through problems step by step before producing a response. The model is designed to balance reasoning quality with speed and cost efficiency, making it practical for high-volume deployments where deliberate thinking is needed without long wait times. o3-mini is particularly well-suited for tasks involving mathematical reasoning, programming challenges, and scientific questions. It operates with a 200,000-token context window, allowing it to process long documents, extended codebases, or multi-turn conversations in a single session. The model generates output at approximately 137 tokens per second and uses an internal reasoning process rather than responding immediately, which contributes to its accuracy on structured, logic-intensive tasks.

Text File Tools
Context: 200,000 Output: 100,000 tokens
Input: $1.10 Output: $4.40
Text

o1

Dec 17, 2024

OpenAI o1 is a large language model developed by OpenAI and trained using reinforcement learning to perform complex, multi-step reasoning. Unlike standard language models that respond immediately, o1 generates an internal chain of thought before producing its final answer, allowing it to work through difficult problems more systematically. It supports a 200,000-token context window, tool use, and Structured Outputs via the API. The model is designed for tasks in coding, mathematics, and science where careful reasoning is more important than broad general knowledge. It has demonstrated notable benchmark results, including ranking in the 89th percentile on Codeforces competitive programming questions, placing among the top 500 students in the US on the AIME math qualifier, and exceeding human PhD-level accuracy on the GPQA benchmark covering physics, biology, and chemistry. It is well-suited for developers and researchers who need a model that can handle technically demanding problems within a large context.

Text Image File
Context: 200,000 Output: 100,000 tokens
Input: $15.00 Output: $60.00
Text to Speech

TTS

Release date unavailable

TTS (tts-1) is OpenAI's text-to-speech model designed for speed and responsiveness. It converts written text into natural-sounding audio and is optimized to minimize the delay between text input and audio output. The model accepts up to 4,096 characters of input per request and is accessible through the OpenAI API, making it straightforward to integrate into existing applications and workflows. TTS is well-suited for use cases where timely audio delivery matters, such as interactive voice assistants, customer service systems, educational tools, and entertainment applications. OpenAI also offers a sibling model, tts-1-hd, which prioritizes audio fidelity over speed. Developers who need the fastest possible voice response times will find tts-1 the appropriate choice, while those who can tolerate slightly higher latency in exchange for higher audio quality may opt for tts-1-hd.

Context: N/A Output: N/A
Input: $15.00 Output: N/A
Text

GPT-4o Mini

Jul 18, 2024

GPT-4o Mini is a text generation model developed by OpenAI and released in July 2024. It is designed to deliver low-cost, low-latency responses across a wide range of tasks, making it suitable for applications that require fast throughput or high request volumes. The model supports a 128,000-token context window and is compatible with the same range of languages as GPT-4o. GPT-4o Mini is positioned for use cases such as real-time customer interactions, processing large volumes of context, and multimodal reasoning tasks. On academic benchmarks covering both textual intelligence and multimodal reasoning, it outscores GPT-3.5 Turbo and other small models. Its combination of speed and affordability makes it a practical choice for developers building cost-sensitive production applications.

Text Image File
Context: 128,000 Output: 16,384 tokens
Input: $0.15 Output: $0.60
Text

GPT-4o

May 13, 2024

GPT-4o is a multimodal language model developed by OpenAI, released in May 2024. The "o" stands for "omni," reflecting its ability to accept any combination of text, audio, and image as input and generate any combination of those same modalities as output. It has a 128,000-token context window and a training data cutoff of October 2023. One of GPT-4o's defining characteristics is its audio response latency, which can be as low as 232 milliseconds and averages around 320 milliseconds — comparable to human conversational response times. It is well-suited for applications requiring fast, multimodal interaction, such as voice assistants, image analysis pipelines, and multilingual text processing. OpenAI has noted it offers improved performance on non-English text compared to GPT-4 Turbo, while also being available at a lower API cost.

Text Image File
Context: 128,000 Output: 16,384 tokens
Input: $2.50 Output: $10.00
Text

GPT-4 Turbo

Apr 09, 2024

GPT-4 Turbo is a variant of OpenAI's GPT-4 model, released to provide faster response times while retaining the language understanding and generation capabilities of the base GPT-4. It supports a 128,000-token context window, allowing it to process and reason over long documents, extended conversations, or large blocks of text in a single request. The model has a training data cutoff of December 2023 and is available through OpenAI's API. GPT-4 Turbo is designed for use cases where both response quality and speed matter, such as interactive chatbots, real-time content generation, and applications that need to handle lengthy inputs. Its large context window makes it well-suited for tasks like document summarization, multi-turn dialogue, and code generation across large codebases. Developers building latency-sensitive applications often choose this variant over the base GPT-4 for its improved throughput.

Text Image Tools
Context: 128,000 Output: 4,096 tokens
Input: $10.00 Output: $30.00
Vision

GPT-4 Turbo Vision

Release date unavailable

GPT-4 Turbo Vision is a multimodal language model developed by OpenAI that accepts both text and image inputs, allowing it to analyze visual content and answer questions about it. It is built on GPT-4 Turbo and extends the traditional text-only language model paradigm by incorporating vision capabilities, with a context window of 128,000 tokens. The model's training data has a cutoff of December 2023. GPT-4 Turbo Vision is well suited for tasks that require reasoning over images alongside text, such as document analysis, visual question answering, interpreting diagrams, and describing image content. The large context window allows users to include substantial amounts of text alongside image inputs in a single request. It is available through OpenAI's API and is accessible on MindStudio without requiring separate API key management.

Context: 128,000 Output: 4,096 tokens
Input: $10.00 Output: N/A
Instruct

GPT-3.5 Instruct (Deprecated)

Sep 28, 2023

This model is a variant of GPT-3.5 Turbo tuned for instructional prompts and omitting chat-related optimizations. Training data: up to Sep 2021.

Text Structured Output
Context: 4.1K Output: 2,000 tokens
Input: $1.50 Output: $2.00
Text

GPT-3.5 (Deprecated)

May 28, 2023

GPT-3.5 Turbo was positioned as OpenAI's fastest model at launch. It understands and generates natural language and code, and is optimized for chat and traditional completion tasks. Training data: up to Sep 2021.

Text Tools Structured Output
Context: 16.4K Output: 4,000 tokens
Input: $0.50 Output: $1.50
Text

GPT-4

May 28, 2023

GPT-4 is a large language model developed by OpenAI and released in March 2023 as part of the Generative Pre-trained Transformer series. It accepts text input and produces text output, with a context window of 8,192 tokens, and its training data has a knowledge cutoff of September 2021. GPT-4 was designed to improve on earlier GPT models in areas such as instruction following, contextual understanding, and factual accuracy across a wide range of topics. GPT-4 is well suited for tasks that require sustained coherence over longer passages, such as drafting documents, answering detailed questions, summarizing content, and writing or reviewing code. It is available through the OpenAI API and has been integrated into products including ChatGPT and Microsoft Copilot. Developers and organizations commonly use it for applications that involve natural language understanding, content generation, and conversational interfaces.

Text Tools Structured Output
Context: 8,192 Output: 5,000 tokens
Input: $30.00 Output: $60.00
Image

GPT Image 1

Release date unavailable

GPT Image 1 is OpenAI's flagship image generation model, released in April 2025, designed to convert text descriptions into images and make targeted edits to existing photos. It is built on a unified neural network architecture that processes both text and images together, which allows it to interpret complex, multi-part prompts and produce outputs that closely match the specified intent. The model supports readable text rendering within images, making it practical for use cases like marketing materials, infographics, and product labels. Output formats include square (1024×1024), portrait (1024×1536), and landscape (1536×1024) resolutions, with three quality tiers available. GPT Image 1 is particularly suited for creative professionals, marketers, and developers who need consistent, production-ready visuals. Its region-aware editing capability allows changes to specific parts of an image — such as a background or a single object — without altering unrelated elements like faces, lighting, or logos. The model accepts image inputs alongside text prompts, enabling workflows that involve editing or building upon existing photos. It is accessible via the OpenAI API and is integrated into MindStudio for use without requiring direct API key management.

Image
Context: 4,000 Output: N/A
Input: $5.00 Output: $40.00
Image

GPT Image 2

Release date unavailable

GPT Image 2 is OpenAI's latest image generation and editing model, offering state-of-the-art visual quality, precise instruction following, and support for large-scale batch processing.

Image
Context: N/A Output: N/A
Input: N/A Output: N/A
Image

GPT Image Latest

Release date unavailable

GPT Image Latest is OpenAI's current image generation and editing model, available via the API under the identifier chatgpt-image-latest. This alias always resolves to the most recent version of OpenAI's image generation technology, meaning applications built on it automatically use the latest underlying model without requiring code changes. It powers image creation and editing within ChatGPT and is also directly accessible through the OpenAI API for developers. The model supports both text-to-image generation and targeted editing of existing images, with notable accuracy when rendering text within generated visuals, a historically difficult task for image generation models. It also supports large-scale asynchronous batch processing through OpenAI's Batch API, which accepts up to 50,000 image generation or editing jobs per batch under rate limits separate from those of synchronous requests. These characteristics make it well-suited for creative professionals, developers building visual applications, and teams that need to process images at scale.

Image
Context: 4,000 Output: N/A
Input: $5.00 Output: $10.00
View model →
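
The 50,000-job Batch API ceiling mentioned in the GPT Image Latest entry implies that larger workloads have to be split before submission. The following is a minimal sketch of that splitting step only, assuming the figure applies per batch. The names `MAX_JOBS_PER_BATCH` and `split_into_batches` and the sample prompts are hypothetical, and no OpenAI API call is made.

```python
from typing import Iterator, List

# Ceiling quoted in the catalog entry; treating it as a hard per-batch
# limit is an assumption of this sketch.
MAX_JOBS_PER_BATCH = 50_000


def split_into_batches(jobs: List[str], limit: int = MAX_JOBS_PER_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of `jobs`, each holding at most `limit` items."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    for start in range(0, len(jobs), limit):
        yield jobs[start:start + limit]


# Hypothetical workload: 120,000 image prompts.
prompts = [f"product photo, variant {i}" for i in range(120_000)]
batch_sizes = [len(batch) for batch in split_into_batches(prompts)]
print(batch_sizes)  # [50000, 50000, 20000]
```

Each resulting slice could then be written to its own batch input file; the file format and submission call are outside the scope of this sketch.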
Instruct

GPT-3 Deprecated

Release date unavailable

Enhanced language understanding and generation for detailed, context-relevant responses.

Context: N/A Output: 2,500 tokens
Input: N/A Output: N/A
View model →
Text

GPT-4.5 Deprecated

Release date unavailable

Increased capacity and nuance compared to predecessors, offering more accurate text generation.

Text
Context: N/A Output: 8,000 tokens
Input: N/A Output: N/A
View model →
Transcription

GPT-4o mini Transcribe

Release date unavailable

GPT-4o mini Transcribe is a speech-to-text model developed by OpenAI that uses the GPT-4o mini architecture to convert spoken audio into written text. It is designed to deliver improved word error rates and more accurate language recognition compared to the original Whisper-based transcription models. The model is part of OpenAI's transcription API offerings and became available in 2025. This model is well-suited for applications that require accurate transcripts from audio input, such as meeting notes, voice interfaces, and content captioning. Its use of the GPT-4o mini backbone allows it to handle a range of languages with improved recognition accuracy. Developers looking for a cost-efficient transcription option within the OpenAI ecosystem can use this model via the API.

Context: 16,000 Output: 2,000 tokens
Input: $1.25 Output: $5.00
View model →
Vision

GPT-4o Mini Vision

Release date unavailable

GPT-4o Mini Vision is a multimodal language model developed by OpenAI, released in mid-2024. It is a smaller, more cost-efficient variant of the GPT-4o family, designed to process both text and images within a single context window of 128,000 tokens. The model supports the same range of languages as GPT-4o and is optimized for low latency, making it suitable for high-throughput or real-time applications. The model is well-suited for tasks that require fast responses at scale, such as customer-facing chat interfaces, document analysis with visual content, and pipelines where cost per token is a primary constraint. Its multimodal reasoning capability allows it to interpret images alongside text in the same request. Developers working with large volumes of context or needing to process mixed text-and-image inputs at reduced cost are the primary intended audience.

Context: 128,000 Output: 16,383 tokens
Input: $0.15 Output: N/A
View model →
Transcription

GPT-4o Transcribe

Release date unavailable

GPT-4o Transcribe is a speech-to-text model developed by OpenAI that uses the GPT-4o model architecture to convert spoken audio into written text. It is part of OpenAI's audio model lineup and was introduced as an improvement over the original Whisper-based transcription models, offering a lower word error rate and more accurate language recognition across a broader range of languages. The model is designed for use cases where transcription accuracy is a priority, such as meeting notes, voice interfaces, medical dictation, and multilingual content. Because it builds on GPT-4o rather than the earlier Whisper architecture, it brings stronger language understanding to the transcription task, which can help with difficult audio conditions, accented speech, and domain-specific vocabulary.

Context: 16,000 Output: 2,000 tokens
Input: $2.50 Output: $10.00
View model →
Vision

GPT-4o Vision

Release date unavailable

GPT-4o Vision is a variant of OpenAI's GPT-4o model that accepts both text and image inputs, allowing it to analyze visual content and respond to questions about it. Developed by OpenAI and added to MindStudio in June 2024, it supports a 128,000-token context window and has a training data cutoff of October 2023. The model addresses a historical limitation of language models, which traditionally processed only text, by enabling multimodal input handling within a single system. GPT-4o Vision is well suited for tasks that require interpreting images alongside text, such as describing visual content, answering questions about photographs or diagrams, extracting information from images, and supporting workflows where visual and textual data appear together. Because it shares the GPT-4o architecture, it handles natural language tasks in addition to vision tasks without requiring a separate model. Developers building applications that involve document analysis, image-based Q&A, or mixed-media content can use this model through the OpenAI API.

Context: 128,000 Output: 4,096 tokens
Input: $2.50 Output: N/A
View model →
Text to Speech

GPT-4o-mini TTS

Release date unavailable

GPT-4o-mini TTS is a text-to-speech model developed by OpenAI that converts written text into natural-sounding spoken audio. It belongs to the GPT-4o mini family, which is designed to deliver capable output at a smaller computational footprint than full-scale variants. The model is accessible to developers through the OpenAI API and is intended for programmatic speech generation across a range of applications. It accepts a text input of up to 2,000 characters and returns audio output in a synthesized voice. GPT-4o-mini TTS is part of OpenAI's broader suite of audio models, which also includes transcription and speech-to-speech capabilities. Its focus is specifically on the text-to-speech task, producing clear and expressive spoken output from plain text. The model is well-suited for voice-enabled applications, accessibility tools, content narration, and any product that requires reliable, scalable audio generation without requiring a larger model. Developers can configure voice selection and other parameters through the API.

Context: N/A Output: N/A
Input: $0.60 Output: $12.00
View model →
Text

o1-mini Deprecated

Release date unavailable

Faster, cheaper version of o1, adept at coding, math, and science tasks that do not require extensive general knowledge.

Text
Context: 128,000 Output: 65,536 tokens
Input: $1.10 Output: $4.40
View model →
Text

o1-preview Deprecated

Release date unavailable

Early preview model using broad general knowledge to reason about hard problems.

Text
Context: 128,000 Output: 32,768 tokens
Input: $15.00 Output: $60.00
View model →
Video

Sora 2

Release date unavailable

Sora 2 is OpenAI's video generation model, announced on September 30, 2025. It generates videos up to 10 seconds long with 4K-like detail from text prompts, supporting visual styles including cinematic, photorealistic, and anime. The model integrates audio generation — dialogue, sound effects, and ambient sound — synchronized to on-screen action, including lip-synced character speech, which distinguishes it from its predecessor. Sora 2 is designed for content creators, filmmakers, marketers, and storytellers who want to produce video content from text descriptions. It supports multi-shot scene control, allowing users to maintain character and world continuity across multiple shots within a single video with control over camera angles, lighting, and transitions. A Cameo feature lets users upload a short video of themselves to inject their likeness and voice into generated scenes.

Video
Context: 5,000 Output: N/A
Input: N/A Output: N/A
View model →
Text to Speech

TTS HD

Release date unavailable

TTS HD (model ID: tts-1-hd) is a text-to-speech model developed by OpenAI that converts written text into natural-sounding spoken audio. It accepts a text input of up to 4,096 characters and produces audio output in a variety of supported voices. It is the quality-optimized variant in OpenAI's TTS model family, designed to produce higher-fidelity audio than the standard tts-1 offering. The model is well-suited for applications that require clear, natural-sounding voice output, such as voice assistants, audiobook narration, accessibility tools, and content creation workflows. It supports multiple built-in voices and can output audio in formats including MP3, Opus, AAC, and FLAC. Developers access the model through OpenAI's API, and it is available on MindStudio without requiring separate API key management.

Context: N/A Output: N/A
Input: $30.00 Output: N/A
View model →
Transcription

Whisper

Release date unavailable

Whisper is a general-purpose speech recognition model developed by OpenAI and made available via the OpenAI API under the model ID whisper-1. It was trained on a large dataset of diverse audio, enabling it to handle a wide range of accents, background noise conditions, and technical vocabulary. What distinguishes Whisper is its multitask design: it can perform not only speech-to-text transcription but also speech translation into English and automatic language identification within a single model. Whisper is well suited for developers building transcription pipelines, subtitle generation tools, voice interfaces, or any application that requires converting spoken audio into structured text. It supports multilingual input, making it useful for global applications where audio may arrive in different languages. The model accepts common audio formats and returns transcriptions or translations as plain text or with optional timestamps.

Context: N/A Output: N/A
Input: $0.01 Output: N/A
View model →

Google

35 models

Text

Gemini 1.0 Pro Deprecated

Apr 27, 2026

This model always redirects to the latest model in the Google Gemini Pro family.

Text Image File
Context: 1.0M Output: 2,048 tokens
Input: $2.00 Output: $12.00
View model →
Image

Gemini 3.1 Flash Image

Feb 26, 2026

Gemini 3.1 Flash Image Preview, internally codenamed "Nano Banana 2," is Google DeepMind's Flash-tier image generation and editing model released in February 2026. It accepts image URL arrays alongside configurable input parameters, making it suitable for both generation and editing workflows. The model holds the number one ranking on the Artificial Analysis Image Arena and the Arena leaderboard for text-to-image generation, and every image it produces is invisibly watermarked via SynthID for provenance tracking. The model is distinguished by its ability to maintain visual coherence across up to five characters and fourteen objects in a single image, its text rendering within generated images with multilingual support, and its use of real-time web search to inform image outputs. It supports flexible aspect ratios and upscaling up to 4K resolution. Gemini 3.1 Flash Image is available through the Gemini API, Google AI Studio, and Vertex AI, and is well suited for developers, designers, and businesses that need high-volume, high-quality image generation and editing at scale.

Text Image Structured Output
Context: 131,072 Output: 32,768 tokens
Input: $0.50 Output: $3.00
View model →
Text

Gemini 3.1 Pro

Feb 19, 2026

Gemini 3.1 Pro is a frontier reasoning model developed by Google, released in February 2026 as a major upgrade to the Gemini 3 series. It supports multimodal inputs — including text, images, video, audio, and code — within a single model, and offers a context window of 1,048,576 tokens, equivalent to roughly 1,500 A4 pages. The model scores 77.1% on the ARC-AGI-2 benchmark and introduces a medium thinking level designed to balance cost, speed, and reasoning depth. Gemini 3.1 Pro is built for developers, enterprises, and researchers working on demanding, multi-step workflows. It is particularly suited to agentic coding, structured planning, financial modeling, multimodal analysis, and workflow automation. The model is accessible through the Gemini API, Google AI Studio, Vertex AI, Gemini CLI, Android Studio, and the Gemini app for Pro and Ultra subscribers.

Text Image File
Context: 1,048,576 Output: 65,536 tokens
Input: $2.00 Output: $12.00
View model →
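
The page equivalence quoted in the Gemini 3.1 Pro entry can be turned into a rough sizing rule. The sketch below derives the tokens-per-page ratio implied by the two figures in the entry (1,048,576 tokens, roughly 1,500 A4 pages) and applies it to another token count. The helper name `pages_for` is hypothetical, and real token density depends on language, layout, and tokenizer.

```python
CONTEXT_TOKENS = 1_048_576  # context window quoted in the entry (2**20)
APPROX_A4_PAGES = 1_500     # page equivalent quoted in the entry

# Ratio implied by the two quoted figures; an approximation, not a spec.
tokens_per_page = CONTEXT_TOKENS / APPROX_A4_PAGES


def pages_for(tokens: int) -> float:
    """Estimate how many A4 pages a token count corresponds to."""
    return tokens / tokens_per_page


print(round(tokens_per_page))    # 699
print(round(pages_for(65_536)))  # 94, the listed maximum output expressed in pages
```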
Text

Gemini 3 Flash

Dec 17, 2025

Gemini 3 Flash is a text generation model developed by Google, released in December 2025 as part of the Gemini 3 family. It is designed to deliver near-frontier reasoning performance at lower latency than full-scale models, making it suitable for interactive and production-grade applications. The model accepts multimodal inputs including text, images, audio, video, and PDFs, and produces text output. A configurable reasoning system allows users to select thinking levels — minimal, low, medium, or high — to balance response speed against reasoning depth. The model supports a context window of up to 1,048,576 tokens, enabling it to process very long documents, codebases, and extended conversation histories in a single pass. It includes built-in support for tool use, structured output, and automatic context caching, which makes it well-suited for agentic workflows and multi-step pipelines. Developers working on coding assistants, automated agents, and multi-turn chat applications are the primary intended audience. It is available via the Gemini API and through third-party providers such as OpenRouter.

Text Image File
Context: 1,048,576 Output: 65,535 tokens
Input: $0.50 Output: $3.00
View model →
Image

Gemini 3 Pro Image

Nov 20, 2025

Gemini 3 Pro Image is Google's flagship image generation and editing model, built on the Gemini 3 Pro architecture. It supports both image and text prompts as inputs and is designed for high-fidelity visual creation tasks such as product visualization, storyboarding, infographic design, and complex multi-element compositions. The model includes a tunable media_resolution control that lets developers balance speed, precision, and detail depending on the task at hand. It also supports real-time grounding via Search integration, enabling context-rich visual outputs. The model is notable for its text rendering capabilities within images, including long passages and multilingual layouts, as well as identity preservation across up to five subjects in multi-image blending scenarios. It supports high-resolution output at 2K and 4K resolutions with flexible aspect ratios, along with fine-grained controls for localized edits, lighting adjustments, focus changes, and camera transformations. Gemini 3 Pro Image is best suited for designers, developers, and creative professionals who require reliable, high-quality image generation with detailed control over visual outputs. It was added to MindStudio on November 20, 2025, with a training data cutoff of November 2025.

Text Image Structured Output
Context: 65,536 Output: 32,768 tokens
Input: $2.00 Output: $12.00
View model →
Text

Gemini 3 Deprecated

Nov 18, 2025

Gemini 3 Pro is a multimodal text generation model developed by Google, released in November 2025. It supports a context window of 1,048,576 tokens and is designed to handle complex reasoning tasks, nuanced instruction following, and agentic workflows. The model is available to developers through Google AI Studio and Vertex AI, and is also integrated into Google Search and the Gemini app. Gemini 3 Pro is built for tasks that require understanding context and intent with minimal prompting, including multi-step problem solving, code generation, and multimodal input processing. It is positioned as Google's primary model for agentic development, including use within the Google Antigravity platform. The model accepts tool inputs alongside text and numeric parameters, making it suited for applications that require dynamic tool use and structured interactions.

Text
Context: 1,048,576 Output: 65,536 tokens
Input: $2.00 Output: $12.00
View model →
Image

Gemini 2.5 Flash Image

Oct 07, 2025

Gemini 2.5 Flash Image is Google's image generation and editing model, launched in August 2025 via the Gemini API, Google AI Studio, and Vertex AI. It is internally nicknamed "nano-banana" and builds on the native image generation capabilities introduced in Gemini 2.0 Flash. The model accepts arrays of image URLs as input, enabling workflows that involve multiple source images in a single request. It supports a context window of 1,048,576 tokens, allowing for richly detailed prompts alongside image inputs. The model is designed for use cases that require combining natural language instructions with visual content, including targeted image editing, multi-image blending, and maintaining consistent characters across a series of images. It integrates Gemini's broad world knowledge into the generation process, which helps produce contextually accurate visual outputs from descriptive text prompts. Developers and enterprises building creative tools, storytelling applications, or product visualization pipelines are the primary intended audience. It is accessible through both the Gemini API and Vertex AI, making it available for consumer and enterprise deployments.

Text Image Structured Output
Context: 1,048,576 Output: 32,768 tokens
Input: $0.30 Output: $2.50
View model →
Video

Veo 3 Fast

Release date unavailable

Veo 3 Fast is a video generation model developed by Google, available through Google AI Studio under the identifier veo-3.0-fast-generate-001. It is the speed-optimized variant of the Veo 3 model family, designed to produce AI-generated video from text prompts with faster turnaround times than the standard Veo 3 model. It accepts text inputs of up to 480 tokens and outputs video content based on those descriptions. This model is well-suited for workflows where generation speed is a priority, such as rapid prototyping of video concepts, content pipeline automation, and exploratory creative work. Developers and content creators who need to iterate quickly or generate video at scale will find the fast variant a practical choice. It supports image URL inputs alongside text, allowing for additional visual context to guide generation.

Video
Context: 480 Output: N/A
Input: N/A Output: N/A
View model →
Text

Gemini 2.5 Flash Lite

Jul 22, 2025

Gemini 2.5 Flash Lite is Google's most cost-efficient model in the Gemini 2.5 family, designed for high-volume, latency-sensitive workloads. It supports a 1 million-token context window and includes optional reasoning capabilities that can be toggled on or off via controllable thinking budgets, allowing developers to balance speed and depth depending on the task. The model also supports Grounding with Google Search, Code Execution, and URL Context as built-in features. Gemini 2.5 Flash Lite is well-suited for production applications that require processing large numbers of requests efficiently, such as document classification, real-time translation, content moderation, and coding assistance. Its multimodal input support and broad benchmark coverage across coding, math, science, and reasoning tasks make it a practical choice for developers building scalable AI pipelines where cost and throughput are primary constraints.

Text Image File
Context: 1.0M Output: 65,535 tokens
Input: $0.10 Output: $0.40
View model →
Video

Veo 3

Release date unavailable

Veo 3 (model ID: veo-3.0-generate-001) is Google's generally available video generation model, accessible through the Vertex AI platform. It is the stable, production-ready successor to the earlier veo-3.0-generate-preview endpoint, which Google has officially deprecated. The model accepts text prompts, image inputs, and configuration parameters to synthesize video content, and it supports a context window of up to 5,000 tokens. Veo 3 is part of Google's broader Veo 3.0 model family, which also includes a fast-generation variant. Google designated this release as the recommended migration target for teams that had been using the preview endpoint, signaling its readiness for large-scale, enterprise workloads. It is best suited for developers and organizations building production applications that require reliable, API-driven video generation through Google Cloud infrastructure.

Video
Context: 5,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Veo 3.1

Release date unavailable

Veo 3.1 is Google's generally available video generation model, released under the identifier veo-3.1-generate-001 and accessible through Google's Vertex AI platform. It is the production-ready successor to the veo-3.1-generate-preview endpoint, making it the recommended migration target for developers who built on the preview version. The model generates video content from text prompts and supports image-based inputs, enabling a range of media creation workflows. It is part of Google's broader Veo model family, which includes multiple generation variants. Veo 3.1 is designed for developers and businesses that need a reliable, supported API for AI video generation at scale. Its stable endpoint status means it carries production-grade support commitments, distinguishing it from preview or experimental releases. The model accepts text prompts, individual image URLs, and image arrays as inputs, and supports a seed parameter for reproducible outputs. It is well suited for applications such as marketing content pipelines, automated media production, and any workflow requiring consistent, repeatable video generation.

Video
Context: 5,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Veo 3.1 Fast

Release date unavailable

Veo 3.1 Fast is a video generation model developed by Google, part of the Veo 3.1 model family. It is optimized for speed, making it suitable for workflows that require rapid video output at scale. The model is available through both the Gemini API and Vertex AI, giving developers two integration paths for production use. The stable endpoint identifier is veo-3.1-fast-generate-001, which replaced an earlier preview endpoint. Veo 3.1 Fast accepts text prompts as well as image inputs, including single images and image arrays, allowing for both text-to-video and image-to-video generation workflows. It supports configuration options such as aspect ratio, duration, and resolution through toggle-based parameters. The model is best suited for developers and creators who need to generate video content quickly at scale and require a stable, production-grade API endpoint for integration into their pipelines.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Text

Gemini 2.5 Flash

Jun 17, 2025

Gemini 2.5 Flash is a text generation model developed by Google, designed to balance performance and cost efficiency. It is a thinking model, meaning it applies internal reasoning steps before producing a response, which supports more deliberate outputs across a range of tasks. The model supports a context window of 1,048,576 tokens, making it suitable for processing long documents, extended conversations, and large codebases in a single request. Gemini 2.5 Flash is well-suited for tasks that require both speed and reasoning, such as summarization, question answering, tool use, and multi-step instruction following. It supports tool integrations, allowing it to be used in agentic workflows where external functions or APIs need to be called. The model reached general availability with a training data cutoff of June 2025, and is accessible through Google's Vertex AI platform.

Text Image File
Context: 1,048,576 Output: 65,535 tokens
Input: $0.30 Output: $2.50
View model →
Vision

Gemini 2.5 Flash Vision

Jun 17, 2025

Gemini 2.5 Flash Vision is a multimodal vision model developed by Google, designed to process and reason over visual inputs alongside text. It is part of the Gemini 2.5 Flash family, which is built around balancing cost efficiency with broad capability coverage. The model supports a context window of 1,048,576 tokens, making it suitable for tasks that require processing large amounts of information in a single request. It was trained with a knowledge cutoff of June 2025. This model is positioned for use cases where real-time or low-latency responses are important, such as visual question answering, document analysis with images, and applications that combine vision with extended context. The "thinking" architecture underlying the Gemini 2.5 Flash series enables the model to apply multi-step reasoning before producing a response. Developers looking for a vision-capable model that can handle long documents, images, and mixed-modality inputs without incurring the cost of larger models will find this a practical option.

Context: 1,048,576 Output: 65,535 tokens
Input: $0.30 Output: N/A
View model →
Text

Gemini 2.5 Pro

Jun 17, 2025

Gemini 2.5 Pro is a thinking model developed by Google DeepMind, designed to reason through complex problems rather than simply predict outputs. It is built to analyze information, draw logical conclusions, and incorporate contextual nuance across tasks in code, mathematics, and STEM. The model supports native multimodality, meaning it can process text, images, audio, video, and code repositories within a single context. The model features a 1,048,576-token context window, making it suited for tasks that require processing large documents, entire codebases, or extended conversations. It scored 63.8% on the SWE-Bench Verified coding evaluation and is available through the Gemini API and Google AI Studio. It is best suited for developers and researchers working on complex reasoning tasks, long-document analysis, and advanced code generation.

Text Image File
Context: 1,048,576 Output: 65,536 tokens
Input: $1.25 Output: $10.00
View model →
Vision

Gemini 2.5 Pro Vision

Jun 17, 2025

Gemini 2.5 Pro Vision is a multimodal AI model developed by Google DeepMind, designed to reason through complex problems by analyzing text, images, audio, video, and code. It operates as a "thinking model," meaning it works through logical steps before producing a response rather than generating output directly. The model supports a context window of 1,048,576 tokens, enabling it to process large documents, codebases, and extended conversations in a single request. The model is particularly suited for tasks that require combining visual understanding with structured reasoning, such as interpreting diagrams, analyzing image-based data, and generating code from visual inputs. It has demonstrated strong benchmark performance in math, science, and software engineering tasks, including a 63.8% score on the SWE-Bench Verified evaluation. Gemini 2.5 Pro Vision is available through Google AI Studio and via the Gemini API, making it accessible for developers building applications that require both vision and reasoning capabilities.

Context: 1,048,576 Output: 65,536 tokens
Input: $1.25 Output: N/A
View model →
Text

Gemma 3.2

Release date unavailable

Gemma 3 27B is an open-weight multimodal language model developed by Google DeepMind as the flagship model in the Gemma 3 family. It accepts both image and text inputs and generates text outputs, supporting over 140 languages and a context window of 128,000 tokens — sixteen times larger than the previous Gemma 2 generation. The model is built on the same research foundation as Google's Gemini models and was released in March 2025. Gemma 3 27B is designed to run in resource-constrained environments, including on a single consumer GPU with 24GB of VRAM, as well as on laptops, desktops, and cloud infrastructure. It is well-suited for tasks such as visual question answering, document analysis, multilingual text generation, summarization, coding assistance, and logical reasoning. Its combination of multimodal input support, large context handling, and open-weight availability makes it a practical choice for developers building applications that require flexible deployment options.

Text
Context: 128,000 Output: 8,000 tokens
Input: $0.10 Output: N/A
View model →
Text

Gemini 2.0 Flash Lite

Feb 25, 2025

Gemini 2.0 Flash Lite is a multimodal text generation model developed by Google, released in early 2025 as part of the Gemini 2.0 model family. It is designed specifically for high-volume, cost-sensitive applications, offering a balance between response speed and output quality. The model supports a context window of over one million tokens (1,048,576), making it suitable for processing long documents or extended conversations in a single request. Gemini 2.0 Flash Lite is best suited for developers and organizations that need to run large numbers of inference requests without incurring high costs. Its architecture prioritizes throughput and efficiency, making it a practical choice for tasks like summarization, classification, translation, and content generation at scale. The model's training data has a cutoff of June 2024, and it is accessible through Google's Vertex AI platform.

Text Image File
Context: 1,048,576 Output: 8,192 tokens
Input: $0.08 Output: $0.30
View model →
Vision

Gemini 2.0 Flash-Lite Vision

Feb 25, 2025

Gemini 2.0 Flash-Lite Vision is a multimodal model developed by Google, designed to process both visual and textual inputs. It belongs to the Gemini 2.0 Flash family and is positioned as the fastest and most cost-efficient option within that lineup. The model supports a context window of over one million tokens, making it suitable for tasks that require processing large amounts of information in a single request. It was trained on data up to June 2024. This model is intended as an upgrade path for users of Gemini 1.5 Flash who want improved output quality without changes to cost or latency. Its vision capabilities allow it to handle image understanding tasks alongside text-based workflows. The combination of speed, large context support, and multimodal input handling makes it well-suited for applications such as document analysis, image captioning, and high-throughput pipelines where cost efficiency is a priority.

Context: 1,048,576 Output: 8,192 tokens
Input: $0.08 Output: N/A
View model →
Text

Gemini 2.0 Flash

Feb 05, 2025

Gemini 2.0 Flash is a text generation model developed by Google, released as part of the Gemini 2.0 model family. It features a context window of 1,048,576 tokens and is designed to handle a broad range of everyday tasks with real-time response latency. The model's training data has a cutoff of June 2024. Gemini 2.0 Flash is positioned as an upgrade for users of the 1.5 Flash model who want meaningfully improved output quality, and for users of the 1.5 Pro model who want comparable or slightly improved quality at lower latency and cost. It is well-suited for applications that require processing long documents, maintaining extended conversations, or running high-throughput workloads where response speed matters.

Text Image File
Context: 1,048,576 Output: 8,192 tokens
Input: $0.15 Output: $0.40
View model →
Vision

Gemini 2.0 Flash Vision

Feb 05, 2025

Gemini 2.0 Flash Vision is a multimodal language model developed by Google, designed to process and reason over text, images, and other input types within a single context window of up to 1,048,576 tokens. It is part of the Gemini 2.0 Flash family, which emphasizes speed and efficiency alongside broad capability coverage including built-in tool use and multimodal generation. The model's training data has a cutoff of June 2024. Gemini 2.0 Flash Vision is well-suited for tasks that require understanding visual content alongside large volumes of text, such as document analysis, image-based question answering, and long-context reasoning. Its large context window makes it practical for workflows involving lengthy documents or multi-turn conversations that incorporate both images and text. The model is accessible through Google's Vertex AI platform and is intended for developers building applications that need fast, multimodal processing at scale.

Context: 1,048,576 Output: 8,192 tokens
Input: $0.15 Output: N/A
View model →
Image

Imagen 4 Fast

Release date unavailable

Imagen 4 Fast is a text-to-image generation model developed by Google, available through the fal.ai platform as a fast inference variant. It accepts natural language text prompts and produces images across a range of visual styles, including photorealistic scenes, illustrations, concept art, and painterly aesthetics. The model is designed to interpret complex, multi-element prompts and render fine details such as textures, fabrics, and lighting with precision. It was trained with data through early 2025 and is available for commercial use via fal.ai. Imagen 4 Fast is optimized for workflows that require rapid image generation without reducing visual fidelity. It supports a context window of up to 10,000 tokens, allowing for detailed and descriptive prompts. The model is well-suited for creative professionals, developers, and businesses building applications around product imagery, storytelling visuals, or concept development. Input types include text prompts, selection parameters, and a seed value for reproducible outputs.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Vision

Gemini 1.0 Pro Vision Deprecated

Release date unavailable

Handles both text and image inputs for content generation and problem-solving.

Context: N/A Output: 1,024 tokens
Input: N/A Output: N/A
View model →
Text

Gemini 1.5 Flash Deprecated

Release date unavailable

Speedy, cost-effective multimodal model for high-volume applications without compromising quality.

Text
Context: N/A Output: 8,192 tokens
Input: N/A Output: N/A
View model →
Vision

Gemini 1.5 Flash Vision Deprecated

Release date unavailable

Fast, cost-effective multimodal model for quality applications at high volume.

Context: N/A Output: 8,192 tokens
Input: N/A Output: N/A
View model →
Text

Gemini 1.5 Pro Deprecated

Release date unavailable

Proficient at multimodal tasks and content creation from image, audio, and video inputs.

Text
Context: N/A Output: 8,192 tokens
Input: N/A Output: N/A
View model →
Vision

Gemini 1.5 Pro Vision Deprecated

Release date unavailable

Adept at processing visual and text inputs for multimodal tasks and content creation.

Context: N/A Output: 8,192 tokens
Input: N/A Output: N/A
View model →
Text

Gemini 2.0 Flash Thinking Deprecated

Release date unavailable

Combining speed and performance, 2.0 Flash Thinking Experimental excels in science and math, showing its thinking to solve complex problems.

Text
Context: 128K Output: 8,192 tokens
Input: N/A Output: N/A
View model →
Text

Gemini 2.0 Pro Deprecated

Release date unavailable

An experimental update to Gemini 2.0 for coding and complex prompts.

Text
Context: 128K Output: 8,192 tokens
Input: N/A Output: N/A
View model →
Text to Speech

Gemini 3.1 Flash TTS

Release date unavailable

The Gemini 3.1 Flash TTS Preview model provides powerful, low-latency speech generation with natural outputs, steerable prompts, and new expressive audio tags for precise narration control.

Context: N/A Output: 16,384 tokens
Input: $1.00 Output: N/A
View model →
Image

Imagen 3

Release date unavailable

Imagen 3 is a text-to-image generation model developed by Google, available through fal.ai, that produces photorealistic images from natural language prompts. It supports a range of visual styles from photorealism to animation and maintains consistent visual composition across five aspect ratios. A notable technical characteristic is its ability to accurately render readable text, signage, and typography within generated images, which has historically been a challenge for image generation models. The model accepts conversational prompts without requiring specialized syntax, and a seed parameter enables reproducible outputs for iterative workflows. Imagen 3 is well suited for use cases that require high visual fidelity and reliable in-image text, including marketing asset creation, product visualization, and concept art development. It supports batch generation of up to four images per request and outputs across aspect ratios including 1:1, 16:9, 9:16, 3:4, and 4:3. The model was trained through late 2024 and accepts text, select, and seed as input types. A companion variant, Imagen 3 Fast, is available for workflows where generation speed takes priority over maximum image quality.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Imagen 3 Fast

Release date unavailable

Imagen 3 Fast is a text-to-image model developed by Google, built on the Imagen 3 architecture and optimized for lower-latency generation. It accepts text prompts and produces images across a range of visual styles, from photorealistic scenes to illustrated and artistic outputs. A notable characteristic of the model is its ability to render legible text within generated images, which is a common challenge for image generation systems. It supports five aspect ratios — 1:1, 16:9, 9:16, 3:4, and 4:3 — and can generate up to four images per request. Imagen 3 Fast is available through the fal.ai platform with full API access, making it accessible to developers building content creation pipelines, prototyping tools, or real-time applications where generation speed is a priority. The model supports seed-based inputs for reproducible outputs, giving developers control over generation consistency. It accepts up to 10,000 context tokens for prompt input and was trained on data through late 2024. It is well suited for teams that need scalable, fast image generation without manual configuration of complex parameters.

Image
Context: 10K Output: N/A
Input: N/A Output: N/A
View model →
Image

Imagen 4 Ultra

Release date unavailable

Imagen 4 Ultra is Google's flagship image generation model and the top tier of the Imagen 4 family, trained through early 2025. It accepts text prompts of up to 10,000 tokens and is designed to handle complex, multi-element descriptions including specific art styles, multi-scene compositions, and nuanced visual storytelling. The model supports image URL arrays as input, allowing users to reference existing images alongside text prompts. It is licensed for commercial use, making it available to businesses and creative professionals working on production-grade projects. Imagen 4 Ultra is best suited for use cases where image fidelity and detail are priorities, such as professional design work, advertising, and high-resolution visual content creation. It covers a wide range of output styles, from photorealistic portraits and landscapes to stylized illustrations and pixel art. According to community benchmarking discussions, Imagen 4 Ultra has achieved competitive Elo ratings in image arenas, including a reported tie with GPT-Image-1 in the Image Arena as of mid-2025. The model is accessible via the Google Gemini API as well as third-party inference platforms such as fal.ai.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Text

PaLM 2 Deprecated

Release date unavailable

Advanced language model with high efficiency and accuracy for complex language tasks and creative content generation.

Text
Context: N/A Output: 1,024 tokens
Input: N/A Output: N/A
View model →
Video

Veo 2

Release date unavailable

Veo 2 is Google's production-ready video generation model, released in April 2025 via the Gemini API under the model ID veo-2.0-generate-001. It accepts both text prompts and reference images as input, generating high-definition video output at resolutions up to 4K. The model includes physics-aware rendering that handles fluid dynamics, lighting, and object interactions, and it embeds SynthID watermarking in all generated videos to identify AI-created content. Veo 2 is available through both the Gemini API and Google's Vertex AI platform, making it accessible to developers via standard API calls without specialized infrastructure. It supports cinematic prompt controls such as aerial shots, panning, and time-lapses, and maintains consistent character appearance across scenes. The model is suited for developers, marketers, creative professionals, and educators who need to generate video content programmatically for use cases like product demos, ad campaigns, and educational visualizations.

Video
Context: 5,000 Output: N/A
Input: N/A Output: N/A
View model →
M

Mistral

17 models

Text

Mistral Large 3

Dec 02, 2025

Open source

Mistral Large 3 is a 675-billion-parameter mixture-of-experts (MoE) text generation model developed by Mistral. It is the first MoE model Mistral has released since the Mixtral series, and was trained from scratch on 3,000 NVIDIA H200 GPUs. The model is released under a permissive open-weight license, making the weights publicly available for download and self-hosting. Mistral Large 3 supports a 256,000-token context window and includes image understanding alongside text generation. It is particularly noted for multilingual conversation handling, with Mistral highlighting non-English and non-Chinese language performance as a focus area. The model is well-suited for tasks requiring long-context reasoning, multilingual text processing, and instruction following across general-purpose prompts.

Text
Context: 256,000 Output: 16,000 tokens
Input: $0.50 Output: $1.50
View model →
Text

Mistral Medium 3

May 07, 2025

Mistral Medium 3 is a text generation model released on May 7, 2025 by Mistral, a French AI company. It is designed to balance performance with cost efficiency, priced at $0.40 per million input tokens and $2.00 per million output tokens. The model supports a 128,000-token context window and was trained on data through early 2025. It is available through Mistral La Plateforme and Amazon SageMaker, with additional platform support planned. Mistral Medium 3 is built with enterprise deployment in mind, supporting self-hosted setups with a minimum of four GPUs as well as any cloud environment. It can be customized through continuous pre-training, fine-tuning, and integration with enterprise knowledge bases, making it applicable to domain-specific workflows in sectors such as financial services, energy, and healthcare. The model is noted for its strengths in coding tasks and multimodal understanding, and is suited for use cases including customer service automation, business process personalization, and complex dataset analysis.

Text Image File
Context: 128,000 Output: 16,000 tokens
Input: $0.40 Output: $2.00
View model →
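
The per-million-token prices listed for Mistral Medium 3 can be turned into a per-request estimate with simple arithmetic. The sketch below is illustrative only: the helper name `estimate_cost` and the example token counts are assumptions, not part of any Mistral tooling, and the prices are copied from the listing above.

```python
from decimal import Decimal

# Prices quoted in the listing above, in USD per one million tokens.
INPUT_PRICE_PER_MTOK = Decimal("0.40")
OUTPUT_PRICE_PER_MTOK = Decimal("2.00")
TOKENS_PER_MTOK = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    input_cost = Decimal(input_tokens) / TOKENS_PER_MTOK * INPUT_PRICE_PER_MTOK
    output_cost = Decimal(output_tokens) / TOKENS_PER_MTOK * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


if __name__ == "__main__":
    # Example request: 100,000 prompt tokens and 4,000 generated tokens.
    print(estimate_cost(100_000, 4_000))
```

With these example counts the estimate comes to a little under five cents (0.04 for input plus 0.008 for output). Actual billing may differ by platform.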
Text

Mistral Nemo

Jul 19, 2024

Mistral NeMo is a text generation model developed by Mistral, a French AI company. It features a 128,000-token context window and is trained with function calling support, making it suitable for agentic and tool-use workflows. The model has particular strength across eleven languages: English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. Mistral NeMo is a 12-billion parameter model built in collaboration with NVIDIA, which is reflected in the "NeMo" name referencing NVIDIA's NeMo framework. It is designed for developers and organizations building multilingual applications where broad language coverage and a large context window are priorities. The model's combination of function calling capability, multilingual training, and long-context handling makes it a practical choice for global deployment scenarios.

Text Tools Structured Output
Context: 128,000 Output: 64,000 tokens
Input: $0.15 Output: $0.04
View model →
Text

Mixtral 8x22B Instruct Deprecated

Apr 17, 2024

Mistral's official instruct fine-tuned version of Mixtral 8x22B. It uses 39B active parameters out of 141B, offering unparalleled cost efficiency for its size. Its strengths include: - strong math, coding,...

Text File Tools
Context: 65.5K Output: 64,000 tokens
Input: $2.00 Output: $6.00
View model →
Text

Mistral 7B Instruct

Oct 10, 2023

Mistral 7B Instruct is a 7-billion-parameter language model developed by Mistral AI and released in September 2023. It is the instruction-tuned variant of the base Mistral 7B model, fine-tuned to follow user instructions and produce clear, direct responses. The model uses grouped-query attention (GQA) and sliding window attention (SWA) techniques, which allow it to handle sequences efficiently within its 4,096-token context window. This model is well-suited for instruction-following tasks such as conversational AI, content summarization, and task-oriented dialogue. Because it is optimized to adhere closely to user-provided instructions, it performs consistently in structured workflows where predictable output format matters. It is available through Amazon Bedrock and is also openly accessible on Hugging Face, making it usable in a range of deployment environments.

Text
Context: 4,096 Output: 2,500 tokens
Input: $0.15 Output: N/A
View model →
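
The sliding window attention mentioned in the Mistral 7B Instruct description can be pictured as a causal mask in which each position sees only a fixed number of recent positions. The following is a minimal sketch of that general idea, not Mistral's implementation. The function name and the tiny window size are assumptions chosen for readability.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Build a causal sliding-window mask (hypothetical helper).

    mask[i][j] is True when query position i may attend to key position j,
    i.e. when j is at most i and lies within the last `window` positions.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # Toy sizes for illustration; real models use far larger windows.
    for row in sliding_window_mask(seq_len=6, window=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Because each row has at most `window` allowed positions, the attention cost per token stays bounded as the sequence grows.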
Text

Mistral 7B Instruct Deprecated

Oct 10, 2023

Focused on instruction-based tasks, providing clear, concise responses adhering to user instructions.

Text
Context: N/A Output: 2,500 tokens
Input: N/A Output: N/A
View model →
Text

Ministral 3 14B

Release date unavailable

Ministral 3 14B is the largest model in the Ministral 3 family, developed by Mistral AI. It is an open-source text generation model with a 256,000-token context window, designed to handle long-form inputs and extended conversations. The model is released under an open license, making it available for local deployment and self-hosted use cases. The model is optimized for running on diverse hardware configurations, including consumer-grade local setups, which makes it suitable for developers and researchers who prefer on-device inference. Its 14 billion parameter count positions it as the largest variant in the Ministral 3 series. Common use cases include text generation, summarization, instruction following, and tasks that benefit from a large context window without requiring cloud-based infrastructure.

Text
Context: 256,000 Output: 16,000 tokens
Input: $0.20 Output: N/A
View model →
Text

Ministral 3 3B

Release date unavailable

Ministral 3 3B is a 3-billion-parameter language model developed by Mistral AI as part of the Ministral 3 family. It is the smallest model in that family and is released as open-weight, meaning the model weights are publicly available for download and local use. The model supports a 256,000-token context window and includes both language and vision capabilities in a compact form factor. Ministral 3 3B is designed specifically for edge deployment, making it suitable for running on local hardware, embedded systems, and resource-constrained environments. Its small parameter count allows it to operate efficiently across a wide range of hardware configurations without requiring cloud infrastructure. It is well-suited for developers building on-device applications, offline workflows, or latency-sensitive pipelines where a smaller footprint is a requirement.

Text
Context: 256,000 Output: 16,000 tokens
Input: $0.10 Output: N/A
View model →
Text

Ministral 3 8B

Release date unavailable

Ministral 3 8B is a text generation model developed by Mistral AI, part of the Ministral 3 model family. It is open source and designed with edge deployment in mind, meaning it is optimized to run efficiently across a range of hardware configurations, including local setups without cloud infrastructure. The model supports a 256,000-token context window, enabling it to process and reason over long documents in a single pass. Ministral 3 8B is well-suited for developers and organizations that need a capable language model deployable on-device or in resource-constrained environments. Its 8-billion parameter size makes it practical for local inference while still handling a broad range of text generation tasks. The open-source availability means it can be downloaded, fine-tuned, and self-hosted without requiring API access.

Text
Context: 256,000 Output: 16,000 tokens
Input: $0.15 Output: N/A
View model →
Text

Mixtral 8x7B Deprecated

Release date unavailable

Mixtral 8x7B is a high-performance mixture-of-experts language model from Mistral AI, offering a 32K token context window with efficient, fast inference.

Text
Context: N/A Output: 8,192 tokens
Input: N/A Output: N/A
View model →
Text

Mistral Codestral

Release date unavailable

Mistral Codestral is an open-weight generative AI model built by Mistral and designed specifically for code generation tasks. It operates through a shared instruction and completion API endpoint, allowing developers to both write new code and interact with existing codebases. The model is trained on a dataset spanning more than 80 programming languages, including Python, Java, C, C++, JavaScript, Bash, Swift, and Fortran. Codestral is intended for developers building AI-assisted coding tools and applications, as it handles both code and English fluently. Its broad language coverage makes it applicable across a wide range of development environments and project types. Because it is open-weight, it can be deployed and integrated in ways that closed models typically do not permit.

Text
Context: 32,000 Output: 16,000 tokens
Input: $0.20 Output: N/A
View model →
Text

Mistral Large 24.02

Release date unavailable

Mistral Large 24.02 is a text generation model developed by Mistral, built around 123 billion parameters and designed to run on a single node for large-throughput inference. It features a 128,000-token context window, making it suited for long-document processing and extended conversational tasks. The model supports dozens of natural languages, including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean. Beyond natural language, Mistral Large 24.02 supports over 80 programming languages, including Python, Java, C, C++, JavaScript, and Bash, making it applicable to code generation and analysis tasks. Its single-node inference design means it can deliver high throughput without requiring distributed infrastructure. This combination of broad language coverage, large context capacity, and coding support makes it well-suited for multilingual applications, long-context document workflows, and software development assistance.

Text
Context: 128,000 Output: 16,000 tokens
Input: $4.00 Output: N/A
View model →
Text

Mistral Large 24.07

Release date unavailable

Mistral Large 24.07 is a text generation model developed by Mistral, released in July 2024 as the second iteration of their Large series. It features 123 billion parameters and a 128,000-token context window, making it suitable for long-document processing and extended conversational tasks within a single inference node. The model supports dozens of natural languages, including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean. One of the model's defining characteristics is its design for single-node inference, meaning the full 123B parameter model can run at high throughput without requiring multi-node infrastructure. It also supports over 80 coding languages, including Python, Java, C, C++, JavaScript, and Bash, making it applicable to software development workflows. On MindStudio, it is available through Amazon Bedrock under the identifier mistral-large-24.07-bedrock.

Text
Context: 128,000 Output: 16,000 tokens
Input: $2.00 Output: N/A
View model →
Text

Mistral Small 24.02

Release date unavailable

Mistral Small 24.02 is a text generation model developed by Mistral, designed to run on a single node while supporting a 128,000-token context window. It covers dozens of natural languages including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean, as well as over 80 coding languages such as Python, Java, C, C++, JavaScript, and Bash. The model has 123 billion parameters, which enables high-throughput inference without requiring multi-node infrastructure. This model is well-suited for long-context applications where fitting large documents or extended conversations into a single prompt is necessary. Its broad language coverage makes it applicable to multilingual workflows, while its coding language support makes it useful for code generation and analysis tasks. The single-node inference design is a practical consideration for teams managing deployment costs and infrastructure complexity.

Text
Context: 128,000 Output: 16,000 tokens
Input: $1.00 Output: N/A
View model →
Text

Mistral Small 3.1 (25.03)

Release date unavailable

Mistral Small 3.1 (25.03) is a text generation model developed by Mistral, released in March 2025. It features a 128,000-token context window, multimodal understanding, and support for dozens of spoken languages alongside more than 80 coding languages. The model is designed to run on a single node, making it practical for deployment without distributed infrastructure. This version introduces improved text performance and expanded context handling compared to earlier Mistral Small releases. At an inference speed of approximately 150 tokens per second, it is suited for tasks that require both throughput and long-context processing, such as document analysis, multilingual applications, and code generation. Its combination of broad language coverage and single-node efficiency makes it a practical choice for developers building production applications with constrained compute budgets.

Text
Context: 128,000 Output: 16,000 tokens
Input: $0.10 Output: N/A
View model →
Text

Mixtral 8x7B Instruct

Release date unavailable

Mixtral 8x7B Instruct is a sparse mixture-of-experts (SMoE) language model developed by Mistral AI and released under the Apache 2.0 license. It uses a routing mechanism that activates only a subset of its expert networks per token, allowing it to draw on a large total parameter count while keeping active computation lower than a dense model of equivalent size. The instruct variant has been fine-tuned to follow instructions and engage in conversational tasks. The model has a context window of 4,096 tokens and was trained on data through September 2023. Its open-weight, permissive license makes it suitable for commercial and research use cases where model access and reproducibility matter. It is well-suited for tasks such as text generation, summarization, question answering, and general instruction following.

Text
Context: 4,096 Output: 2,500 tokens
Input: $0.45 Output: N/A
View model →
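
The routing idea in the Mixtral 8x7B Instruct description, where only a subset of experts runs for each token, can be sketched in a few lines. This is a toy illustration under stated assumptions, not Mistral's code: the expert count, the choice of two experts per token, the hidden size, the random weights, and all function names are hypothetical.

```python
import math
import random

NUM_EXPERTS = 8  # illustrative; mirrors the "8x" in the model name
TOP_K = 2        # assumption for this sketch: experts activated per token
DIM = 4          # toy hidden size


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def route_token(token, gate, experts, top_k=TOP_K):
    """Send one token vector through only its top_k highest-scoring experts."""
    logits = matvec(gate, token)  # one score per expert
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the chosen experts only, so their weights sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    output = [0.0] * len(token)
    for i in chosen:
        weight = exps[i] / total
        expert_out = matvec(experts[i], token)  # unchosen experts are never evaluated
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen


rng = random.Random(0)
gate = make_matrix(NUM_EXPERTS, DIM, rng)
experts = [make_matrix(DIM, DIM, rng) for _ in range(NUM_EXPERTS)]

if __name__ == "__main__":
    output, chosen = route_token([0.5, -0.2, 0.1, 0.9], gate, experts)
    print("experts used:", chosen)
    print("output:", output)
```

All eight experts hold parameters, but each token pays the compute cost of only the chosen ones, which is the trade-off the description refers to.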
Text

Mixtral 8x7B Instruct Deprecated

Release date unavailable

High-quality, efficient sparse model outperforming larger models in speed and benchmarks.

Text
Context: N/A Output: 2,500 tokens
Input: N/A Output: N/A
View model →
X

X.ai

17 models

Text

Grok 4.3

Apr 30, 2026

Grok 4.3 is a reasoning model from xAI. It accepts text and image inputs with text output, and is suited for agentic workflows, instruction-following tasks, and applications requiring high factual...

Text Image Tools
Context: 1M Output: 2,000,000 tokens
Input: $1.25 Output: $2.50
View model →
Text

Grok 4.20

Mar 31, 2026

Grok 4.20 is a text generation model developed by xAI, the AI division of X. This variant is specifically configured with reasoning disabled, meaning it skips the extended chain-of-thought process to deliver faster, lower-latency responses while still operating on the full Grok 4.20 architecture. It supports a context window of up to 2 million tokens, allowing it to ingest very long documents, large codebases, or extended conversation histories in a single pass. The model was made available via API in March 2026 as part of the Grok 4.20 Beta family, which also includes reasoning-enabled and multi-agent-tuned variants. This model is designed for agentic and tool-centric workflows where response speed is a priority over deep step-by-step reasoning. It is well-suited for automated pipelines, coding agents, data-processing tasks, and any application where the model needs to call external tools rapidly and reliably. Its instruction-following behavior is tuned for consistency, making outputs predictable across repeated or templated prompts. Developers building low-latency AI systems or integrating LLM capabilities into production pipelines are the primary intended audience.

Text Image File
Context: 2M Output: 2,000,000 tokens
Input: $2.00 Output: $2.50
View model →
Text

Grok 4.20 Reasoning

Release date unavailable

Grok 4.20 Reasoning is an experimental, reasoning-focused text generation model developed by xAI, the AI division of X. It is part of the Grok 4.20 beta series and is specifically designed to work through problems using deliberate, multi-step thinking before producing a response. This approach improves accuracy on tasks where a direct answer is likely to fall short, such as mathematical problem-solving, logical analysis, and scientific reasoning. The model supports a context window of 2,000,000 tokens, allowing it to process and reason over very long documents or extended conversation histories in a single pass. It is accessible through the xAI inference provider via the Inworld Router or Realtime API, making it straightforward to integrate into developer applications. Use cases where it is particularly well-suited include research assistance, code debugging, nuanced question answering, and any workflow that benefits from structured, step-by-step analysis.

Text
Context: 2,000,000 Output: 2,000,000 tokens
Input: $2.00 Output: N/A
View model →
Text

Grok 4.1 Fast

Release date unavailable

Grok 4.1 Fast is a speed-optimized text generation model developed by xAI, the AI division of X. It is the non-reasoning variant in the Grok 4.1 Fast line, meaning it skips the extended chain-of-thought processing used in its reasoning counterpart and instead delivers near-instant, pattern-matched responses. This design makes it well-suited for applications where low latency matters more than deliberative step-by-step analysis. The model supports a 2 million token context window, multimodal input (text and images), tool use, structured outputs, and implicit caching. Grok 4.1 Fast is built for real-time and high-throughput workloads such as customer support automation, finance workflows, and agentic pipelines that require rapid sequential tool calls. Its large context window allows it to process extensive documents, long conversation histories, or complex multi-step task instructions in a single pass. The model shares weights with its reasoning counterpart but trades deliberative reasoning for response speed, making it a practical choice when throughput and latency are the primary constraints.

Text
Context: N/A Output: 2,000,000 tokens
Input: $0.20 Output: $2.50
View model →
Text

Grok 4.1 Fast Reasoning

Release date unavailable

Grok 4.1 Fast Reasoning is a text generation model developed by xAI, the AI division of X. It is designed specifically for agentic and tool-calling workflows, trained through reinforcement learning in simulated environments across dozens of tool-use domains. The model supports a 2-million-token context window, accepts both text and image inputs, and produces text outputs with chain-of-thought reasoning enabled. The model is best suited for developers building autonomous agents, enterprise automation pipelines, and multi-step research or customer support applications. It supports structured outputs, function calling, and a range of tool integrations including web search, X search, code execution, file retrieval, and MCP tool integrations via the Agent Tools API. Its training cutoff is November 2025, and it is available through the xAI API as well as third-party cloud providers such as Oracle Cloud.

Text
Context: N/A Output: 2,000,000 tokens
Input: $0.20 Output: $2.50
View model →
Text

Grok 4 Fast

Release date unavailable

Grok 4 Fast is a text generation model developed by xAI, the AI division of X. It is built on learnings from Grok 4 and is designed to deliver high-quality reasoning at lower computational cost, using approximately 40% fewer thinking tokens on average compared to its full counterpart. The model features a 2 million token context window and supports both reasoning and non-reasoning modes within a single unified architecture. Grok 4 Fast is trained end-to-end with tool-use reinforcement learning, enabling it to handle agentic tasks such as web browsing, code execution, and real-time information synthesis. It accepts both text and image inputs and produces text output. The model is well-suited for developers and enterprises that need multi-step reasoning, long-context document processing, and real-time web research without the computational overhead of a full frontier model.

Text
Context: N/A Output: 2,000,000 tokens
Input: $0.20 Output: $2.50
View model →
Text

Grok 4 Fast Reasoning

Release date unavailable

Grok 4 Fast Reasoning is a text generation model developed by xAI, released in September 2025 as a cost-efficient counterpart to their flagship Grok 4 model. It is built using large-scale reinforcement learning and uses approximately 40% fewer thinking tokens on average compared to Grok 4, while achieving comparable benchmark results. The model supports a 2 million token context window, making it suitable for processing large documents, multi-file codebases, and extended conversations. The model accepts both text and image inputs and outputs text, with a unified architecture that blends chain-of-thought reasoning with faster response modes depending on task complexity. It is trained end-to-end with tool-use reinforcement learning, enabling agentic web search, browsing X (Twitter), and real-time information synthesis. Grok 4 Fast Reasoning is well-suited for developers and users working on research, coding assistance, agentic workflows, and complex question answering where efficiency and speed are priorities.

Text
Context: N/A Output: 2,000,000 tokens
Input: $0.20 Output: $2.50
View model →
Video

Grok Imagine

Release date unavailable

Grok Imagine Video is a video generation model developed by X.ai, capable of converting text prompts or static images into short video clips with synchronized audio. It launched in August 2025 and reached a major 1.0 release in February 2026. The model runs on X.ai's proprietary Aurora autoregressive engine, trained on 110,000 NVIDIA GB200 GPUs, and generates 720p video at 24 fps with clip lengths between 6 and 15 seconds. What sets Grok Imagine Video apart is its built-in audio generation, which produces character dialogue, background music, and sound effects alongside the visuals without requiring separate post-production. It supports seven aspect ratios — including 16:9, 9:16, and 1:1 — and offers three creative modes: Normal, Fun, and Spicy. Generation typically completes in around 30 seconds, making it well suited for social media creators, marketers, and content teams that need fast turnaround on short-form video.

Video
Context: 5,000 Output: 500k
Input: $1.25 Output: $2.50
View model →
Text

Grok 4

Jul 09, 2025

Grok 4 is a text generation model developed by xAI, released on July 9, 2025, and trained using reinforcement learning on xAI's 200,000-GPU Colossus cluster. It features a 256,000-token context window and was built with a 6x improvement in compute efficiency over its predecessor, with verifiable training data expanded well beyond mathematics and coding. The model is designed for tasks requiring deep reasoning, including expert-level problems in science, mathematics, and software development. What distinguishes Grok 4 is its native tool use — it was trained to autonomously operate a code interpreter and web browser, selecting its own search queries to produce thorough answers. It also integrates real-time web search and X (Twitter) search, including keyword, semantic, and media search. A variant called Grok 4 Heavy runs multiple reasoning agents in parallel at inference time to handle the most demanding problems, and it was the first model to score above 50% on the Humanity's Last Exam benchmark. Grok 4 is available to SuperGrok and Premium+ subscribers on grok.com and through the xAI API.

Text
Context: 256,000 Output: 256,000 tokens
Input: $3.00 Output: $15.00
View model →
Text

Grok 3 Mini Fast

Release date unavailable

Grok 3 Mini Fast Beta is a compact text generation model developed by xAI, the AI division of X. It belongs to the Grok 3 model family and is designed to deliver faster response times compared to the full Grok 3 models, making it suitable for latency-sensitive applications. The model supports extended thinking, function calling, and real-time web search, and operates with a 131,072-token context window. Grok 3 Mini Fast Beta is well-suited for developers and businesses building high-throughput applications that require reasoning capability without the overhead of a larger model. Practical use cases include question answering, document summarization, data extraction, and tool-augmented agentic workflows. Its combination of speed, extended context, and tool integration makes it a practical option for production environments where response time is a priority.

Text
Context: 131,072 Output: 8,192 tokens
Input: $0.60 Output: N/A
View model →
Vision

Grok 2 Vision

Release date unavailable

Grok 2 Vision (grok-2-vision-1212) is a multimodal language model developed by xAI and released in December 2024. It accepts combined image and text inputs and is designed to understand, analyze, and respond to visual content alongside natural language. The model supports images up to 20MiB in JPG, JPEG, or PNG format and can process inputs in any order. It also includes multilingual support and improved instruction-following compared to earlier Grok vision releases. Grok 2 Vision is suited for production use cases that require visual comprehension, such as image captioning, visual question answering, chart and document analysis, and building AI assistants that respond to visual inputs. It supports tool calling and structured outputs, making it straightforward to integrate into developer workflows. With a 32,768-token context window, it can handle moderately long conversations that mix text and image content.

Context: 32,768 Output: 1M
Input: $2.00 Output: $2.50
View model →
Text

Grok 3

Release date unavailable

Grok 3 is the flagship large language model from xAI, developed and released in February 2025. It was built from the ground up in approximately one year and is designed to handle demanding tasks including advanced reasoning, coding, and creative writing. The model is available via API under the identifier grok-3-latest and supports a context window of 131,072 tokens. It includes a dedicated Thinking mode that enables multi-step reasoning on complex problems. Grok 3 is well-suited for tasks that require structured, multi-step problem solving, such as scientific research, advanced mathematics, and complex software development. It scored 96% on AIME, a challenging mathematics competition benchmark, and 85% on GPQA, a graduate-level science reasoning benchmark. The model also supports image understanding, function calling, and structured output generation, making it usable across a range of developer and research workflows. It ranked first in creative writing evaluations at the time of its release.

Text
Context: 131,072 Output: 8,192 tokens
Input: $3.00 Output: $2.50
View model →
Text

Grok 3 Fast

Release date unavailable

Grok 3 Fast is a performance-optimized variant of xAI's Grok 3 model, released in April 2025 as part of the Grok 3 family. It is designed to deliver faster response times compared to the standard Grok 3 Beta while retaining the same core language understanding, function calling, and web search capabilities. The model supports a 131,072-token context window, making it capable of handling long documents and extended multi-turn conversations. Grok 3 Fast is best suited for applications where response latency matters, such as real-time chat interfaces, high-throughput processing pipelines, and interactive AI assistants. Its support for function calling allows developers to integrate external tools and APIs, enabling agentic workflows that can act on live information. The model exposes an OpenAI-compatible API, which simplifies adoption for developers already working within that ecosystem.

Text
Context: 131,072 Output: 8,192 tokens
Input: $5.00 Output: N/A
View model →
Text

Grok 3 Mini

Release date unavailable

Grok 3 Mini Beta is a compact text generation model developed by xAI. It is designed as a thinking model, meaning it reasons through problems step by step before producing a final answer, and it exposes that reasoning trace so users can follow the model's logic in full. The model supports adjustable reasoning effort, defaulting to a lower setting for speed but allowing a high-effort mode for more demanding problems. It has a 131,072-token context window and was trained with data up to April 2025. Grok 3 Mini is best suited for tasks that rely heavily on structured reasoning rather than broad world knowledge, including math problems, logic puzzles, coding challenges, and quantitative analysis. According to xAI's published benchmarks, it scores 95.8% on AIME 2024 and 80.4% on LiveCodeBench. It also supports function calling and web search, making it usable in agentic workflows. Epoch AI has noted that with high reasoning effort, Grok 3 Mini outperforms the larger Grok 3 model on math benchmarks.

Text
Context: 131,072 Output: 8,192 tokens
Input: $0.30 Output: N/A
View model →
Vision

Grok 4.3 Vision

Release date unavailable

Structured model profile with pricing, context, and capability details.

Context: N/A Output: 2,000,000 tokens
Input: $1.25 Output: N/A
View model →
Image

Grok Imagine

Release date unavailable

Grok Imagine (grok-imagine-image) is a text-to-image generation model developed by xAI, the AI company founded by Elon Musk. It is part of the Grok Imagine family and allows users and developers to generate high-resolution images from plain-language text descriptions. The model was unveiled in early August 2025 and expanded from a subscriber-only feature to a broadly available tool accessible via xAI's developer API. It sits alongside the more premium grok-imagine-image-pro variant, serving as the standard, faster option in the family. Grok Imagine supports up to 300 requests per minute, making it suited for applications that require image generation at volume. It accepts a 131,072-token context window and is accessible through xAI's API for integration into apps, tools, and workflows. The model is best suited for developers and creators who need reliable, high-throughput image generation for use cases such as prototyping, content creation, or building image-powered products. No artistic expertise is required to use it — a text description is sufficient to produce a detailed image.

Image
Context: 131,072 Output: 500k
Input: $1.25 Output: $2.50
View model →
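The 300 requests-per-minute figure quoted for Grok Imagine implies a minimum spacing between calls if a client wants to stay under the limit. The sketch below is one conservative way to pace requests, assuming the limit is a plain per-minute ceiling; how the limit is actually enforced is not described in the listing, and the function names are hypothetical.

```python
from typing import List


def min_interval_seconds(requests_per_minute: int) -> float:
    """Smallest even spacing that keeps a client at or under the limit."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return 60.0 / requests_per_minute


def schedule_offsets(num_requests: int, requests_per_minute: int = 300) -> List[float]:
    """Start offsets (in seconds) for evenly spaced requests."""
    interval = min_interval_seconds(requests_per_minute)
    return [i * interval for i in range(num_requests)]


interval = min_interval_seconds(300)  # 0.2 seconds between requests
offsets = schedule_offsets(3)         # offsets for the first three requests
```

Even spacing trades burst capacity for predictability; a token-bucket limiter would be the usual alternative when short bursts are acceptable.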
Image

Grok Imagine Pro

Release date unavailable

Grok Imagine Pro is xAI's advanced text-to-image generation model, sitting at the top of xAI's image generation lineup above the standard grok-imagine-image. Published under the X brand, it accepts text prompts along with image URL inputs and selection parameters to produce detailed visual outputs. The "pro" designation reflects its position as the higher-quality tier within xAI's image generation offerings. Grok Imagine Pro is well-suited for developers and creators who require high-fidelity AI-generated imagery within production pipelines or creative workflows. It supports a context window of 131,072 tokens, allowing for detailed and nuanced text prompts. Use cases include content generation, creative projects, and any application where prompt adherence and image detail are priorities.

Image
Context: 131,072 Output: N/A
Input: $1.25 Output: $2.50
View model →
A

Anthropic

12 models

Text

Claude 4.7 Opus

Apr 16, 2026

Opus 4.7 is the next generation of Anthropic's Opus family, built for long-running, asynchronous agents. Building on the coding and agentic strengths of Opus 4.6, it delivers stronger performance on...

Text Image File
Context: 1M Output: 128,000 tokens
Input: $5.00 Output: $25.00
View model →
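The Context and Output fields in this directory mix notations such as "1M", "200K", "500k", "131,072", and "N/A". A minimal sketch of normalizing them to integers follows, assuming K means 1,000 and M means 1,000,000 (the listing does not say whether K could mean 1,024); the helper name is hypothetical.

```python
import re
from typing import Optional

# Assumption: decimal multipliers, not powers of two.
_SUFFIX = {"K": 1_000, "M": 1_000_000}


def parse_token_count(text: str) -> Optional[int]:
    """Convert a value such as '1M', '200K', '500k' or '131,072' to an int.

    Returns None for 'N/A' or anything that does not look like a count.
    """
    match = re.fullmatch(r"\s*(\d[\d,]*(?:\.\d+)?)\s*([KkMm]?)\s*", text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    multiplier = _SUFFIX.get(match.group(2).upper(), 1)
    return int(round(number * multiplier))


for raw in ["1M", "200K", "131,072", "N/A"]:
    print(raw, "->", parse_token_count(raw))
```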
Text

Claude 4.6 Sonnet

Feb 17, 2026

Claude Sonnet 4.6 is a text generation model developed by Anthropic, released in February 2026 as an upgrade to the Sonnet line of mid-tier models. It features a 1 million token context window in beta, allowing it to process entire codebases, lengthy legal documents, or large collections of research papers within a single request. The model is designed for coding, agentic workflows, computer use, and professional knowledge work at scale. Sonnet 4.6 is particularly suited for developers and enterprises running high-volume workloads that require consistent instruction following, accurate tool selection, and reliable error correction across long sessions. It includes improved computer use capabilities, enabling it to navigate browsers, fill multi-step web forms, and automate desktop workflows. Anthropic's safety evaluations found it to be as safe as or safer than other recent Claude models, with noted resistance to prompt injection attacks.

Text Image File
Context: 1M Output: 128,000 tokens
Input: $3.00 Output: $15.00
View model →
Text

Claude 4.6 Opus

Feb 04, 2026

Claude Opus 4.6 is Anthropic's most capable text generation model, released on February 5, 2026. It is designed for long-horizon agentic tasks, complex reasoning, and professional knowledge work across domains such as software development, finance, and legal analysis. A defining feature of this release is its 1 million token context window, available in beta, which allows the model to process and reason over very large volumes of information within a single session. It also introduces adaptive thinking, which automatically calibrates the depth of reasoning applied based on the complexity of the task at hand. Opus 4.6 is built to handle demanding, real-world workloads with minimal human oversight. It can orchestrate teams of subagents, parallelize work across tools, and sustain long-running tasks across the full software development lifecycle from architecture through deployment. The model supports tool use and MCP server integration, making it suitable for enterprise workflows and autonomous agent pipelines. It is best suited for senior engineers, analysts, and organizations that need to delegate complex, multi-step challenges to an AI system.

Text Image File
Context: 1M Output: 128,000 tokens
Input: $5.00 Output: $25.00
View model →
Text

Claude 4.5 Opus

Nov 24, 2025

Claude 4.5 Opus is Anthropic's top-tier large language model, released on November 24, 2025. It is designed for demanding tasks including software engineering, long-horizon autonomous workflows, and complex reasoning, with a 200,000-token context window that supports multi-file operations and extended document analysis. The model includes an "effort" parameter that gives developers control over reasoning depth, allowing optimization for either speed or accuracy depending on the task at hand. Claude 4.5 Opus is particularly suited for enterprises and developers working on large-scale software engineering, autonomous agent orchestration, financial modeling, legal analysis, and deep research workflows. It features enhanced computer use capabilities, including a zoom tool for detailed screen inspection, enabling UI-based automation. Early users reported that the model handles ambiguous, multi-system problems with minimal guidance, and some reported token usage reductions of up to 65% compared to earlier models when solving equivalent problems.

Text Image File
Context: 200K Output: 64,000 tokens
Input: $5.00 Output: $25.00
View model →
Text

Claude 4.5 Haiku

Oct 15, 2025

Claude 4.5 Haiku is a lightweight text generation model developed by Anthropic, released in October 2025. It is designed to deliver high throughput and low latency while maintaining strong performance on coding and reasoning tasks. The model supports a 200,000-token context window and can generate up to 64,000 tokens in a single response, making it capable of handling long documents and complex multi-turn conversations. It accepts text, images, and PDFs as input and is available through Anthropic's API, AWS Bedrock, and Google Cloud Vertex AI. Claude 4.5 Haiku is built for production applications where speed and cost efficiency are priorities, such as customer support systems, real-time coding assistants, document processing pipelines, and autonomous AI agents. It supports tool calling, reasoning, and multi-step workflow automation, enabling agentic use cases without requiring a heavier model. Its knowledge cutoff is February 2025. Developers looking to build high-volume applications will find it suited to scenarios where response time and per-token cost are key constraints.

Text Image File
Context: 200,000 Output: 64,000 tokens
Input: $1.00 Output: $5.00
View model →
Text

Claude 4.5 Sonnet

Sep 29, 2025

Claude 4.5 Sonnet is a text generation model developed by Anthropic, released in September 2025. It is designed for software development, autonomous agent workflows, and direct computer interaction, supporting a 200,000-token context window. The model is trained with a knowledge cutoff of September 2025 and is available through Anthropic's API as well as Amazon Bedrock. The model is built to handle extended, multi-step tasks — including executing commands, editing files, and running tests — with sustained coherence over long sessions. It scores 61.4% on OSWorld, a benchmark for real-world computer task completion, and ranks at the top of the SWE-bench Verified leaderboard for software engineering tasks. Claude 4.5 Sonnet integrates with tools like Claude Code, the Claude Agent SDK, and MCP servers, making it well-suited for building production AI agents and developer tooling.

Text Image File
Context: 200,000 Output: 64,000 tokens
Input: $3.00 Output: $15.00
View model →
Text

Claude 4.1 Opus

Aug 05, 2025

Claude Opus 4.1 is Anthropic's flagship text generation model, released on August 5, 2025 as an upgrade to Claude Opus 4. It is designed for demanding workflows that require sustained reasoning across long, multi-step tasks, with particular strength in software development, autonomous research, and agentic problem solving. The model supports a 200,000-token context window, up to 32,000 output tokens, and accepts both text and image inputs. It is multilingual, with documented support for French, Arabic, Mandarin, Japanese, Korean, Spanish, and Hindi. On the SWE-bench Verified benchmark for real-world software bug fixing, Claude Opus 4.1 scores 74.5%, and it delivers a one standard deviation improvement over Opus 4 on Windsurf's junior developer benchmark for autonomous coding tasks. It supports extended thinking with up to 64,000 reasoning tokens, enabling deeper deliberation on complex problems. The model is available through the Anthropic API, Claude Code, Amazon Bedrock, and Google Cloud Vertex AI, making it suited for developers, researchers, and enterprises running complex multi-file code refactoring, long-horizon agent workflows, and in-depth research synthesis.

Text Image File
Context: 200,000 Output: 32,000 tokens
Input: $15.00 Output: $75.00
View model →
Text

Claude 4 Opus

May 22, 2025

Claude Opus 4 is a text generation model released by Anthropic on May 22, 2025. It is a hybrid model that supports both near-instant responses and extended thinking, allowing it to alternate between multi-step reasoning and tool use — such as web search — within a single workflow. The model carries a 200,000-token context window and supports vision, function calling, prompt caching, and structured outputs. On release, it scored 72.5% on SWE-bench Verified, 79.6% on GPQA Diamond, and 75.5% on AIME 2025. Claude Opus 4 is designed for tasks that require sustained, complex reasoning across long contexts, including refactoring large codebases, synthesizing research across many documents, and coordinating multi-step agentic workflows. Anthropic has classified it under ASL-3 safety measures — the first Claude model to receive that designation — which applies restrictions related to potential misuse in sensitive domains. It is well-suited for developer and enterprise applications that involve autonomous task execution, long-horizon planning, or processing large volumes of text and image data in a single session.

Text Image File
Context: 200,000 Output: 32,000 tokens
Input: $15.00 Output: $75.00
View model →
Text

Claude 4 Sonnet

May 22, 2025

Claude Sonnet 4 (claude-sonnet-4-20250514) is a text generation model developed by Anthropic and released on May 22, 2025. It sits in the mid-tier of Anthropic's Claude 4 model family, designed to balance capability with computational efficiency for production use. The model supports a 200,000-token context window and accepts text, images, and PDFs as input. It includes an optional extended thinking mode that allows the model to perform step-by-step reasoning when tasks require greater depth. Claude Sonnet 4 is built for high-volume workloads where consistent performance and reliability matter. It scores 72.7% on SWE-bench, reflecting strong performance on software engineering tasks such as code generation, debugging, and codebase navigation. The model also supports agentic tool use, making it suitable for multi-step workflows and integration with external APIs. Common use cases include code review, customer support automation, data analysis, and long-document processing.

Text Image File
Context: 200,000 Output: 64,000 tokens
Input: $3.00 Output: $15.00
View model →
Text

Claude 3 Haiku

Mar 13, 2024

Claude 3 Haiku is a text generation model developed by Anthropic, positioned as the fastest and most affordable model in the Claude 3 family. It features a 200,000-token context window and vision capabilities, making it suitable for tasks that require processing large documents or analyzing images alongside text. The model's training data has a cutoff of August 2023. Haiku is designed for enterprise use cases where throughput and cost efficiency matter, such as customer support, real-time chat, and batch processing of large datasets. It is capable of processing approximately 21,000 tokens — roughly 30 pages — per second for prompts under 32,000 tokens, which makes it well-suited for latency-sensitive applications and workloads that involve running many smaller tasks in parallel.

Text Image Tools
Context: 200,000 Output: 4,096 tokens
Input: $0.25 Output: $1.25
View model →
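The throughput figure in the Claude 3 Haiku description (about 21,000 tokens, or roughly 30 pages, per second for prompts under 32,000 tokens) can be worked through as simple arithmetic. The sketch below only restates those quoted numbers; the constant and function names are illustrative, and real latency depends on factors the listing does not cover.

```python
TOKENS_PER_SECOND = 21_000  # rate quoted in the description
PAGES_PER_SECOND = 30       # "roughly 30 pages" per second
MAX_PROMPT_TOKENS = 32_000  # the quoted rate applies below this size


def prompt_read_seconds(prompt_tokens: int) -> float:
    """Time to process a prompt at the quoted rate."""
    if not 0 <= prompt_tokens < MAX_PROMPT_TOKENS:
        raise ValueError("quoted rate only covers prompts under 32,000 tokens")
    return prompt_tokens / TOKENS_PER_SECOND


tokens_per_page = TOKENS_PER_SECOND / PAGES_PER_SECOND  # 700.0 tokens per page
half_second_prompt = prompt_read_seconds(10_500)        # 0.5 seconds
```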
Text

Claude 3 Sonnet

Release date unavailable

Claude 3 Sonnet is a large language model developed by Anthropic, released as part of the Claude 3 model family in early 2024. It is designed to occupy a middle position within that family, offering a balance between response quality and processing speed suited to high-volume, enterprise-scale deployments. The model supports a 200,000-token context window, enabling it to process and reason over long documents, codebases, and extended conversations in a single pass. Claude 3 Sonnet is particularly well-suited for organizations running large-scale AI workloads where throughput and cost efficiency are priorities alongside output quality. Its training data has a cutoff of August 2023, and it is available through Anthropic's API as well as cloud providers including Amazon Web Services via Bedrock. The model handles tasks such as summarization, question answering, content drafting, and code assistance across a wide range of professional contexts.

Text
Context: 200,000 Output: 4,096 tokens
Input: $3.00 Output: N/A
View model →
Text

Claude Instant (Deprecated)

Release date unavailable

Structured model profile with pricing, context, and capability details.

Text
Context: N/A Output: 4,096 tokens
Input: N/A Output: N/A
View model →
B

Black Forest Labs

12 models

Image

FLUX 1.1 [pro]

Release date unavailable

FLUX 1.1 [pro] is a text-to-image generation model developed by Black Forest Labs, the team behind the FLUX model family. It accepts detailed text prompts and produces images at resolutions up to 2K and 4 megapixels, with support for aspect ratios including 1:1, 16:9, 9:16, 4:3, and 3:4. The model represents an upgrade over FLUX 1.0 [pro], delivering generation speeds approximately six times faster and improved adherence to the content and structure described in user prompts. FLUX 1.1 [pro] is designed for use cases that require high visual fidelity from text descriptions, including illustrations, advertising visuals, concept art, portraits, and photorealistic scenes. It is accessible via API, making it suitable for developers integrating image generation into applications, as well as for graphic designers and visual marketers working in professional workflows. A 4-megapixel image can be generated in approximately 10 seconds.

Image
Context: 10K Output: N/A
Input: N/A Output: N/A
View model →
Image

FLUX 1.1 [pro] Ultra

Release date unavailable

FLUX 1.1 [pro] Ultra is a text-to-image generation model developed by Black Forest Labs. It produces images at resolutions up to 4 megapixels and is designed to closely follow text prompt instructions while maintaining image quality at high resolutions. The model offers two generation modes: Ultra mode, which prioritizes strict prompt adherence, and Raw mode, which produces a more naturalistic rendering with fewer synthetic-looking artifacts. FLUX 1.1 [pro] Ultra is suited for professional and creative applications that require high-resolution output, such as concept art, print materials, and social media visuals. It is accessible via the Black Forest Labs API, making it straightforward to integrate into existing workflows and platforms. The model accepts a seed input for reproducible outputs alongside configurable generation settings.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

FLUX.1 [dev] LoRA

Release date unavailable

FLUX.1 [dev] LoRA is an image generation model built on FLUX.1 [dev], a 12-billion parameter rectified flow transformer developed by Black Forest Labs and released in August 2024. It extends the base FLUX.1 [dev] model with LoRA (Low-Rank Adaptation) support, allowing users to load pre-trained style and character adapters to shape the visual output without retraining the underlying model. The model is served through WaveSpeed AI's inference platform, which provides a REST API with no cold starts and consistent availability. It supports both text-to-image and image-to-image workflows, with output resolutions ranging from 256×256 up to 1536×1536 pixels. This model is well suited for developers and creators who need stylistically flexible image generation at scale. By swapping LoRA adapters — such as community options like Flux-Super-Realism-LoRA or yarn_art_Flux_LoRA — users can shift between hyper-realistic photography, painterly aesthetics, and character-driven art within the same base model. A prompt enhancer input is also available to refine natural language prompts before generation. Common use cases include product visualization, character design, creative exploration, and content production workflows.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

FLUX.1 [dev] Ultra-Fast

Release date unavailable

FLUX.1 [dev] Ultra-Fast is a variant of FLUX.1 [dev] optimized for ultra-fast inference, making it suited for iterative creative workflows, rapid prototyping, and applications where generation latency matters. Its input schema includes two image URL fields, LoRA configuration, selection parameters, numeric controls, and a seed value, giving developers control over output dimensions, style, and reproducibility. It is well-suited for graphic designers, developers building image generation pipelines, and creators who need consistent, customizable visual outputs at scale.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

FLUX.1 [schnell] LoRA

Release date unavailable

FLUX.1 [schnell] LoRA is an image generation model developed by Black Forest Labs that combines the schnell (fast) variant of the FLUX.1 architecture with LoRA (Low-Rank Adaptation) support, enabling fine-tuned style and subject customization on top of base image generation. It accepts text prompts alongside LoRA weights and source images as inputs, allowing users to steer outputs toward specific visual styles, characters, or aesthetics without retraining the full model. The model supports a context window of 10,000 tokens for prompt input and accepts configuration parameters including seed values and selectable generation options. This model is well-suited for workflows that require repeatable or stylistically consistent image outputs, such as brand asset creation, character design, and concept art iteration. By accepting LoRA inputs directly, it gives developers and designers a way to apply custom-trained adaptations at inference time rather than relying solely on prompt engineering. It is available on MindStudio without requiring separate API key configuration, making it accessible for integration into AI-powered applications.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

FLUX.1 Kontext [max]

Release date unavailable

FLUX.1 Kontext [max] is an image generation and editing model developed by Black Forest Labs. It accepts image URLs, selection parameters, and seed inputs to support both generating new images from text prompts and editing existing images while preserving their original context and composition. A notable characteristic of the model is its ability to accurately render text within generated images, which has historically been a difficult task for image generation systems. It also supports multi-image input, allowing multiple reference images to guide the generation process. FLUX.1 Kontext [max] is the highest-tier variant in the Kontext model family from Black Forest Labs, positioned for workflows that require precise contextual understanding and high-fidelity output. It is suited for creative professionals, designers, and developers who need reliable image editing and generation within production pipelines. The model integrates with tools such as ComfyUI and MCP-compatible servers, and it carries a context window of 10,000 tokens. Its REMIX tag indicates support for remixing and transforming existing visual content.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

FLUX.1 Kontext [pro]

Release date unavailable

FLUX.1 Kontext [pro] is an image generation and editing model developed by Black Forest Labs, released in May 2025. It is designed to accept an existing image alongside a text prompt and return a modified version of that image, making targeted changes while preserving the overall composition and structure. This context-aware approach distinguishes it from text-to-image-only models, as it is built specifically for in-place editing workflows rather than generating images from scratch alone. The model accepts image URLs, configurable settings, and a seed value as inputs, giving users control over reproducibility and output variation. It is well suited for workflows that require consistent visual identity across edits, such as changing materials, lighting, or stylistic elements in product renders, architectural visualizations, or creative compositions. With a context window of 10,000 tokens, it can process detailed natural language instructions for precise prompt following.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

FLUX.2 [dev] LoRA

Release date unavailable

FLUX.2 [dev] LoRA is a text-to-image model published by Black Forest Labs, built on a 32-billion parameter diffusion transformer architecture. It extends the FLUX.2 [dev] base model with Low-Rank Adaptation (LoRA) support, enabling users to inject custom styles, characters, or brand identities into image outputs without retraining the full model. The model uses a Mistral Small 3.1 text encoder for prompt processing and runs on WaveSpeedAI's infrastructure with no cold starts. It was made available in November 2025. The model supports stacking up to four LoRA adapters simultaneously in a single generation request, with independently adjustable strength per adapter. This makes it well-suited for brand-consistent marketing, character-consistent content creation, product visualization, and design iteration workflows. Custom LoRAs can be trained on as few as 15 to 30 images, lowering the barrier for teams that need fine-grained visual control. The model also supports batch generation of one to four images per request, useful for producing consistent campaign sets or A/B variants.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
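The limits quoted for FLUX.2 [dev] LoRA (up to four stacked adapters, each with its own strength, and one to four images per request) can be checked client-side before a request is sent. The sketch below is hypothetical throughout: the class and field names are invented for illustration and do not reflect the actual request schema, which the listing does not show.

```python
from dataclasses import dataclass, field
from typing import List

MAX_LORAS = 4                   # "up to four LoRA adapters"
MIN_IMAGES, MAX_IMAGES = 1, 4   # "one to four images per request"


@dataclass
class LoraAdapter:
    path: str              # hypothetical: location of the adapter weights
    strength: float = 1.0  # adjustable independently per adapter


@dataclass
class GenerationRequest:
    prompt: str
    loras: List[LoraAdapter] = field(default_factory=list)
    num_images: int = 1

    def validate(self) -> None:
        """Raise ValueError if the request exceeds the quoted limits."""
        if len(self.loras) > MAX_LORAS:
            raise ValueError(f"at most {MAX_LORAS} LoRA adapters per request")
        if not MIN_IMAGES <= self.num_images <= MAX_IMAGES:
            raise ValueError(f"num_images must be {MIN_IMAGES} to {MAX_IMAGES}")


request = GenerationRequest(
    prompt="product shot on a marble table",
    loras=[LoraAdapter("brand-style", 0.8), LoraAdapter("mascot", 0.6)],
    num_images=4,
)
request.validate()  # passes: two adapters, four images
```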
Image

FLUX.2 [klein] 9B

Release date unavailable

FLUX.2 [klein] 9B is a 9-billion-parameter text-to-image model with LoRA support, offering enhanced realism, crisper text rendering, and fast LoRA customization. It is served through a ready-to-use REST inference API with no cold starts.

Image
Context: 10K Output: N/A
Input: N/A Output: N/A
View model →
Image

FLUX.2 [max]

Release date unavailable

FLUX.2 [max] is the flagship model in Black Forest Labs' FLUX.2 family, released in December 2025. It is designed for image generation and editing tasks that require high fidelity, accurate prompt following, and visual consistency across complex edits. The model accepts image URL arrays and seed inputs, enabling reference-based generation workflows where source imagery guides the output. It supports a context window of 46,864 tokens, which is notably large for an image generation model. A distinguishing feature of FLUX.2 [max] is its Grounded Generation capability, which allows the model to retrieve real-time information from the web during generation — enabling visuals tied to current events, live data, or recent trends without manual reference uploads. The model also supports character consistency across scenes and styles, multi-reference image editing, and product photography workflows. These characteristics make it suited for professional use cases such as brand work, e-commerce imagery, storytelling, and cinematic visual production.

Image
Context: 46,864 Output: N/A
Input: N/A Output: N/A
View model →
Image

FLUX.2 [pro]

Release date unavailable

FLUX.2 [pro] is a production-grade image generation model developed by Black Forest Labs, released in late 2025. It uses a rectified-flow transformer backbone paired with a Mistral-class vision-language model to handle both image generation and editing within a single unified architecture. The model supports a 32,000-token context window, enabling detailed, multi-part prompts with compositional and spatial constraints. Outputs can reach up to 4 megapixels, with fine detail in faces, hands, and textures suited for commercial use. A defining feature of FLUX.2 [pro] is its ability to accept up to 8–10 reference images simultaneously, maintaining consistent character, product, and style identity across generated scenes. It also supports hex color matching, reliable typography rendering, structured JSON prompts, and pose guidance, making it well-suited for brand-controlled workflows. Built-in C2PA cryptographic metadata provides content provenance, and layered safety filtering blocks IP-infringing and explicit content at inference time. The model is designed for use cases such as e-commerce product imagery, advertising campaigns, and any workflow requiring consistent visual identity across multiple assets.

Image
Context: 32K Output: N/A
Input: N/A Output: N/A
View model →
Image

FLUX.2 [turbo]

Release date unavailable

FLUX.2 [turbo] is an image generation model developed by Black Forest Labs, designed to convert text descriptions into images across a wide range of styles including photorealistic scenes, illustrations, concept art, and character design. The model supports resolutions up to 2K and 4MP output, with a context window of 10,000 tokens for prompt input. It accepts selection parameters and a seed input, giving users control over style options and reproducibility of results. The model is positioned for workflows where generation speed is a priority, producing images in approximately 10 seconds at 4MP resolution. It supports multiple aspect ratios including 1:1, 16:9, 9:16, 4:3, and 3:4, making it adaptable for different creative and commercial formats. FLUX.2 [turbo] is well-suited for graphic designers, visual marketers, and developers integrating image generation into applications via API.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
B

ByteDance

10 models

Image

Seedream 4.5

Release date unavailable

Seedream 4.5 is an image generation and editing model developed by ByteDance, built on a unified architecture that handles both creating images from text prompts and modifying existing images within a single system. It accepts up to 10 reference images in one request, enabling multi-source compositing workflows such as product swaps and element transfers between images while preserving depth, perspective, and lighting consistency. The model supports output resolutions up to 2048×2048 pixels across multiple aspect ratios including 1:1, 16:9, and 9:16. One of Seedream 4.5's most documented characteristics is its text rendering accuracy — it can produce correctly spelled, legible text in various font styles and non-Latin scripts integrated naturally into scenes like signs, packaging, and posters. The model ranks 10th globally on the LM Arena leaderboard with a score of 1147 and was trained through December 2025. It is well suited for designers, marketers, and e-commerce teams who need production-ready visuals driven by natural language prompts, without requiring manual selection tools or layer-based editing workflows.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Lip Sync

LatentSync

Release date unavailable

LatentSync is an end-to-end lip-synchronization model developed by ByteDance that takes a source talking-head video and a target audio track as inputs and produces a new video with mouth movements precisely aligned to the provided speech. It is built on audio-conditioned latent diffusion, meaning it operates directly in the latent space rather than relying on 3D meshes or 2D facial landmarks. The model supports multiple languages and accents, works with diverse speakers and recording conditions, and handles both real people and stylized characters such as anime. A key technical feature of LatentSync is Temporal REPresentation Alignment (TREPA), a technique designed to reduce flicker, jitter, and frame-to-frame artifacts, keeping head pose and lip motion stable across longer sequences. The model is recommended for use with high-resolution source videos — 720p, 1080p, or 4K — paired with clean audio recordings. It is well suited for workflows involving lip-syncing, audio dubbing, digital human creation, and video-audio alignment.

Context: 50,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

DreamActor V2

Release date unavailable

DreamActor V2 is a video generation model developed by ByteDance that animates static images by transferring motion from a reference driving video onto a target character. It is the second generation of ByteDance's DreamActor series and was made available in February 2026. Rather than relying on skeleton extraction or pose estimation pipelines, it uses a spatiotemporal in-context learning framework that reads motion directly from raw video pixels, which allows it to handle character types that traditional pose-based methods struggle with, including animals, cartoon mascots, fantasy creatures, and 3D renders. DreamActor V2 accepts two inputs — a character image and a driving video — and produces animated video outputs up to 15 seconds at 720p resolution across a range of aspect ratios. It transfers facial expressions, head orientation, eye direction, lip movement, hand gestures, and full-body motion while maintaining the structural consistency of the source character across frames. This makes it applicable to use cases such as social media content creation, brand animation, virtual avatar production, game asset prototyping, and educational video generation.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Lip Sync

OmniHuman 1.5

Release date unavailable

OmniHuman 1.5 is an avatar animation model developed by ByteDance that converts still images into fully animated digital humans using audio input. It generates synchronized lip movements, facial expressions, and body language by combining audio signals with semantic understanding from Multimodal Large Language Models. The model is built on a dual-system cognitive architecture inspired by System 1 and System 2 theory, enabling both fast reactive animations and deliberate, context-aware responses. It supports a context window of 50,000 tokens and was trained through September 2025. The model works across a wide range of visual styles, including realistic photographs, anime characters, illustrated portraits, and stylized artwork, as well as non-human subjects like animals and anthropomorphic figures. It can produce videos exceeding one minute in length with dynamic motion, camera movement, and multi-character interactions. OmniHuman 1.5 is suited for use cases such as virtual persona creation, NPC animation in games, AI spokesperson production, virtual instructor development, and video content creation without large production teams. It accepts image URLs and audio URLs as inputs.

Context: 50,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Seedance 1.5 Pro

Release date unavailable

Seedance 1.5 Pro is an image-to-video generation model developed by ByteDance that transforms static images into cinematic video clips at up to 1080p resolution. It uses a dual-branch Diffusion-Transformer (DB-DiT) architecture to generate video and audio simultaneously in a single pass, producing millisecond-level lip-sync and environmental audio without requiring post-production editing. Videos can range from 5 to 10 seconds in duration and support aspect ratios including 16:9, 9:16, and 21:9. What distinguishes Seedance 1.5 Pro is its native audio-visual synthesis, which generates speech, sound effects, and ambient audio in sync with the video rather than layering them separately afterward. It supports multilingual lip-sync across six languages and offers over 15 controllable camera movements — such as dolly zooms, tracking shots, and orbits — specified through text prompts. The model is well-suited for content creators, marketers, and developers working on dialogue-driven content, social media clips, and multilingual voiceover projects where visual consistency and synchronized audio are required.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Seedance 2.0

Release date unavailable

Seedance 2.0 generates Hollywood-grade cinematic videos from text prompts with native audio-visual synchronization, director-level camera and lighting control, and exceptional motion stability. Built on Seed's unified multimodal architecture, it leads on instruction adherence, motion quality, and visual aesthetics.

Video
Context: 50K Output: 10,000 tokens
Input: N/A Output: N/A
View model →
Video

Seedance 2.0 Fast

Release date unavailable

Seedance 2.0 generates Hollywood-grade cinematic videos from text prompts with native audio-visual synchronization, director-level camera and lighting control, and exceptional motion stability. Built on Seed's unified multimodal architecture, it leads on instruction adherence, motion quality, and visual aesthetics.

Video
Context: 50K Output: 10,000 tokens
Input: N/A Output: N/A
View model →
Video

Seedance 2.0 Fast Turbo

Release date unavailable

Seedance 2.0 generates Hollywood-grade cinematic videos from text prompts with native audio-visual synchronization, director-level camera and lighting control, and exceptional motion stability. Built on Seed's unified multimodal architecture, it leads on instruction adherence, motion quality, and visual aesthetics.

Video
Context: 50K Output: 10,000 tokens
Input: N/A Output: N/A
View model →
Image

Seedream 4.0

Release date unavailable

Seedream 4.0 is an image generation model developed by ByteDance, designed to produce images from text prompts and source image inputs. It supports a context window of 10,000 tokens and accepts image URL arrays alongside numerical parameters, enabling flexible control over generation behavior. The model is part of ByteDance's Seedream series and is available through MindStudio's model catalog. Seedream 4.0 is best suited for workflows that require image generation guided by reference images, making it useful for tasks like style transfer, image variation, and visually consistent content creation. Its support for source image inputs distinguishes it from purely text-to-image pipelines, allowing users to anchor outputs to existing visual references. Developers can integrate it into MindStudio applications without managing separate API keys.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Seedream 5.0 Lite

Release date unavailable

Seedream 5.0 Lite is a lightweight image editing model developed by ByteDance, released in February 2026 as part of the Seedream 5.0 family from their Seed team. It is designed for creative workflows that require fast, high-quality image transformations driven by natural language prompts and reference images. The model supports combining elements from up to 14 reference images in a single edit, making it suited for compositing and character fusion tasks. It is positioned as a more efficient alternative to the full Seedream 5.0 suite while retaining the core editing capabilities of the family. A notable characteristic of Seedream 5.0 Lite is its focus on identity and face preservation, maintaining facial features such as eyes, jawline, proportions, and skin tone through style transformations. ByteDance specifically improved small-face rendering and skin texture restoration in this release compared to Seedream 4.5 Edit. The model also keeps non-edited regions stable, reducing unintended changes to areas outside the edit target. It is well suited for use cases including style transfer, e-commerce product visualization, social media content creation, and rapid concept art prototyping.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
D

DeepSeek

10 models

Text

DeepSeek V4 Flash

Apr 24, 2026

Open source

DeepSeek V4 Flash is an efficiency-optimized Mixture-of-Experts model from DeepSeek with 284B total parameters and 13B activated parameters, supporting a 1M-token context window. It is designed for fast inference and...

Text Tools Structured Output
Context: 1.0M Output: 384,000 tokens
Input: $0.14 Output: $0.00
View model →
Text

DeepSeek V4 Pro

Apr 24, 2026

Open source

DeepSeek V4 Pro is a large-scale Mixture-of-Experts model from DeepSeek with 1.6T total parameters and 49B activated parameters, supporting a 1M-token context window. It is designed for advanced reasoning, coding,...

Text Tools Structured Output
Context: 1.0M Output: 384,000 tokens
Input: $1.74 Output: $0.87
View model →
Text

Kimi K2.6

Apr 21, 2026

Open source

Kimi K2.6 is Moonshot AI's next-generation multimodal model, designed for long-horizon coding, coding-driven UI/UX generation, and multi-agent orchestration. It handles complex end-to-end coding tasks across Python, Rust, and Go, and...

Text Image Tools
Context: 262.1K Output: 16,384 tokens
Input: $0.75 Output: $4.00
View model →
Text

Kimi K2.5

Jan 27, 2026

Kimi K2.5 is an open-source multimodal model developed by Moonshot AI and released in January 2026. It uses a Mixture-of-Experts architecture with 1 trillion total parameters and approximately 32 billion active at inference time, trained on roughly 15 trillion mixed visual and text tokens. Unlike models that add vision as a secondary capability, Kimi K2.5 was trained natively on both image and text data, enabling integrated understanding of charts, documents, video, and code. The model supports two operating modes — Instant Mode for direct responses and Thinking Mode for step-by-step reasoning on complex problems — within a 256,000-token context window. It introduces an Agent Swarm paradigm that can coordinate up to 100 parallel sub-agents, reducing execution time by 4.5x on parallelizable tasks. Kimi K2.5 is released under a modified MIT license, making it available for local deployment, fine-tuning, and commercial use, and is particularly suited for visual programming, document analysis, automated research, and multi-step agentic workflows.

Text Image Tools
Context: 262,144 Output: 16,384 tokens
Input: $0.45 Output: $1.90
View model →
Text

DeepSeek V3.2

Dec 01, 2025

DeepSeek-V3.2 is an open-weight large language model developed by DeepSeek and released on December 1, 2025. It uses a Mixture-of-Experts architecture combined with a sparse attention mechanism called DeepSeek Sparse Attention (DSA), which reduces attention cost on long-context tasks from quadratic in the sequence length L to near-linear, O(kL), where k is the number of tokens each query attends to. The model supports a 160,000-token context window and is available under the MIT License on Hugging Face. DeepSeek-V3.2 introduces three notable technical advances: a scalable reinforcement learning training framework, a large-scale agentic task synthesis pipeline covering over 1,800 environments and 85,000+ complex instructions, and native support for Thinking in Tool-Use, the ability to reason while invoking external tools in both thinking and non-thinking modes. It is best suited for complex multi-step reasoning, agentic workflows involving search and code execution, long-context document processing, and developers building AI applications that require integrated reasoning and tool use.

Text Tools Structured Output
Context: 160,000 Output: 8,000 tokens
Input: $0.26 Output: $0.38
View model →
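
To give a rough sense of what the O(kL) figure in the DeepSeek V3.2 card means at the listed 160,000-token context, the sketch below counts query-key pairs for dense attention versus a sparse scheme where each query attends to at most k tokens. The value of k is a hypothetical placeholder (the card does not state it), and the count ignores causal masking and any cost of choosing which tokens to attend to.

```python
def dense_attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored when every token attends to every token: L * L."""
    return seq_len * seq_len


def sparse_attention_pairs(seq_len: int, k: int) -> int:
    """Query-key pairs scored when each token attends to at most k tokens: k * L."""
    return seq_len * min(k, seq_len)


CONTEXT_TOKENS = 160_000  # context window listed on the DeepSeek V3.2 card
K_SELECTED = 2_048        # ASSUMPTION: illustrative k, not taken from the card

dense = dense_attention_pairs(CONTEXT_TOKENS)
sparse = sparse_attention_pairs(CONTEXT_TOKENS, K_SELECTED)
ratio = dense / sparse

print(f"dense pairs:  {dense:,}")
print(f"sparse pairs: {sparse:,}")
print(f"dense / sparse: {ratio:.1f}x")
```

Because the sparse count grows with L rather than L squared, the gap widens as the context gets longer; with a different k the ratio changes proportionally.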
Text

DeepSeek V3.1

Aug 21, 2025

DeepSeek-V3.1 is a 671-billion parameter large language model developed by DeepSeek, using a Mixture-of-Experts (MoE) architecture that activates 37 billion parameters at any given time. It supports a 128,000-token context window and was trained through August 2025, with an enhanced base model built using a two-phase long-context extension process that included 630 billion tokens at the 32K phase and 209 billion tokens at the 128K phase. The model accepts text input and produces text output across a wide range of general-purpose tasks. What distinguishes DeepSeek-V3.1 from earlier versions is its hybrid thinking design: a single model that can operate in a fast conversational mode or a slower step-by-step reasoning mode, selectable through prompting rather than requiring a separate model. Post-training improvements have also focused on tool use and agentic workflows, including multi-step API calls, web search, and code execution. This makes it well-suited for coding, mathematical reasoning, long-document analysis, and complex multi-turn agent tasks.

Text Tools Structured Output
Context: 128,000 Output: 8,000 tokens
Input: $0.27 Output: $0.79
View model →
Text

DeepSeek-R1

Jan 22, 2025

DeepSeek-R1 is a text generation model developed by DeepSeek, a Chinese AI company. It is a reasoning-focused model that generates a Chain of Thought (CoT) before producing a final answer, a technique designed to improve accuracy on multi-step problems. The model was trained through late 2024 and supports a context window of 64,000 tokens. DeepSeek released the model weights publicly, making it available for local deployment and research use. DeepSeek-R1 is well suited for tasks that benefit from structured reasoning, such as mathematics, logic puzzles, coding challenges, and scientific problem-solving. Because the model externalizes its reasoning steps before answering, users can inspect the thought process that led to a given response. DeepSeek also released a series of distilled versions of R1 based on smaller base models, broadening its accessibility across different hardware configurations.

Text
Context: 64,000 Output: 8,000 tokens
Input: $0.55 Output: N/A
View model →
Text

DeepSeek-V3

Dec 26, 2024

DeepSeek-V3 is a large language model developed by DeepSeek, a Chinese AI company. It is a general-purpose text generation model designed to handle a wide range of tasks including coding, reasoning, summarization, and open-ended conversation. The model supports a 128,000-token context window and was trained on data through late 2024. It is identified by the model ID deepseek-chat and is available via API. DeepSeek-V3 uses a Mixture-of-Experts (MoE) architecture with 671 billion total parameters, activating 37 billion per forward pass, which allows it to maintain efficiency at scale. The model was trained using an optimized pipeline that includes multi-token prediction and FP8 mixed-precision training. It is well-suited for tasks that require long-context understanding, instruction following, and multi-step reasoning across technical and general domains.

Text Tools Structured Output
Context: 128,000 Output: 8,000 tokens
Input: $0.27 Output: $0.89
View model →
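
Several cards in this provider group quote both total and activated parameter counts for Mixture-of-Experts models. As a small worked comparison, the sketch below computes the share of parameters that is active, using only the figures quoted in the cards above (the Kimi K2.5 active count is listed as approximate). The function and dictionary names are illustrative.

```python
# (total parameters, activated parameters) as quoted in the cards above.
LISTED_MOE_SIZES = {
    "DeepSeek V4 Flash": (284e9, 13e9),
    "DeepSeek V4 Pro": (1.6e12, 49e9),
    "Kimi K2.5": (1e12, 32e9),  # card says "approximately 32 billion"
    "DeepSeek-V3.1": (671e9, 37e9),
    "DeepSeek-V3": (671e9, 37e9),
}


def active_share(total_params: float, active_params: float) -> float:
    """Fraction of the model's parameters that is activated."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return active_params / total_params


for name, (total, active) in LISTED_MOE_SIZES.items():
    print(f"{name}: {active_share(total, active):.1%} of parameters active")
```

The share is only a size ratio; it says nothing by itself about speed or quality, which depend on factors the cards do not list.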
Text

DeepSeek R1 Turbo

Release date unavailable

DeepSeek R1 Turbo is a text generation model developed by DeepSeek, designed as an accelerated variant of the R1 reasoning model family. It retains the chain-of-thought reasoning capabilities of the base R1 model while incorporating architectural and inference optimizations aimed at reducing latency. The model supports a 128,000-token context window and was trained on data through late 2024. It accepts text input and produces text output across a wide range of analytical and generative tasks. DeepSeek R1 Turbo is particularly well-suited for applications where multi-step reasoning is required but response time is a practical constraint. Common use cases include coding assistance, mathematical problem-solving, logical deduction, and structured analytical workflows. Developers building interactive tools or real-time applications that depend on reasoning-intensive outputs are the primary intended audience for this model.

Text
Context: 128,000 Output: 8,000 tokens
Input: $1.00 Output: N/A
View model →
Text

DeepSeek-V3 (Deprecated)

Release date unavailable

General-purpose LLM from Chinese AI company DeepSeek.

Text
Context: N/A Output: 8,000 tokens
Input: N/A Output: N/A
View model →
K

Kling

10 models

Video

Kling 3.0

Release date unavailable

Kling 3.0 is a video generation model developed by Kling, with a listed training date of February 2026. It supports both text-to-video and image-to-video workflows, accepting text prompts, image URLs, and multiple configuration options as inputs. The model is identified by the ID kling-video-v3.0-std and is available on MindStudio as part of the Kling model family. Kling 3.0 is suited for creators and developers who need to generate video content from written descriptions or existing images. Its dual input support makes it flexible for use cases ranging from concept visualization to animating static imagery. The model supports a context window of up to 10,000 tokens, giving users room to provide detailed prompts and configuration parameters.

Video
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Kling 2.6

Release date unavailable

Kling 2.6 is a video generation model developed by Kling, capable of producing videos from text prompts or input images. It supports both text-to-video and image-to-video workflows, accepting text descriptions, image URLs, and selection-based inputs to guide the generation process. The model was added to MindStudio in March 2026 and carries a training date of December 2025. Kling 2.6 is suited for creators and developers who need to generate video content programmatically without managing their own infrastructure. Its dual input modality — text and image — makes it applicable to a range of use cases including content creation, storyboarding, and visual prototyping. The model operates under the identifier kling-video-v2.6-std and is accessible through MindStudio without requiring separate API key configuration.

Video
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Kling Image O1

Release date unavailable

Kling Image O1, formally known as Kling Omni Image O1, is an image generation model developed by Kuaishou Technology, the company behind the Kling AI ecosystem. It is built on a Multimodal Visual Language (MVL) framework that combines natural language understanding with multi-reference image processing, allowing it to accept between 1 and 10 reference images simultaneously and extract consistent visual features across all outputs. The model was trained through December 2025 and supports a context window of 10,000 tokens. The model is designed to address a common challenge in AI image generation: maintaining consistent character identity, style, and visual detail across multiple generated images. It is particularly suited for workflows such as IP character design, comic and manga creation, brand merchandise imagery, and serialized visual content where cross-image consistency is a requirement. Inputs include image URL arrays alongside select and toggle controls, giving users structured options for guiding generation behavior.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Lip Sync

AI Avatar Standard

Release date unavailable

Kling AI Avatar Standard is an audio-driven talking-head model developed by Kling that animates a single still portrait image into a synchronized speaking video. It accepts a portrait photo and an audio track as inputs, then generates a video with phoneme-aligned lip movements, natural eye blinks, and subtle head motion while preserving the subject's identity throughout. The model supports both real voice recordings and text-to-speech generated audio, and an optional text prompt can influence background style or framing. Output duration is variable and determined by the length of the provided audio, up to a maximum of 10 minutes. Kling AI Avatar Standard is designed for everyday production workflows where reliable, clean avatar video is needed at scale. Typical use cases include explainer videos, customer support avatars, internal training materials, and product demonstrations. For best results, the model expects a clear, front-facing portrait with even lighting and at least 512px resolution, paired with a clean voice recording sampled at 16–48 kHz. It is available via API through WaveSpeed and is accessible on MindStudio without requiring separate API key management.

Context: 50,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Kling 2.6 Pro Motion Control

Release date unavailable

Kling V2.6 Pro Motion Control is an AI video generation model developed by Kuaishou Technology that animates static character images by extracting and transferring motion from real reference video clips. Rather than generating movement from text descriptions alone, it uses a 3D face and body reconstruction system built on deep learning-based 3D modeling to map human faces and body movements from 2D inputs, then applies those motion paths frame-by-frame to a subject image. The model runs on a Diffusion Transformer Architecture and produces output at 30 frames per second with coherent motion transitions throughout the generated clip. The model accepts reference videos between 3 and 30 seconds in length and supports a wide range of movement types, including dance routines, martial arts, walking cycles, and subtle gestures. It preserves the subject's appearance consistently across all frames without identity drift, and it supports optional text prompts to adjust scene styling, lighting, and atmosphere while keeping the motion transfer intact. Kling V2.6 Pro Motion Control is well suited for social media character animation, brand mascot animation, film pre-production prototyping, digital human content creation, and educational demonstrations.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Kling 3.0 Motion Control

Release date unavailable

Kling 3.0 Motion Control is a video generation model developed by Kling that specializes in motion transfer. It takes a reference video and a source still image as inputs, then animates the still image by applying the motion patterns extracted from the reference video. This makes it distinct from standard text-to-video or image-to-video models, as the motion itself is explicitly guided by an existing video clip rather than inferred from a prompt alone. The model is well-suited for workflows where consistent, repeatable motion is required across different subjects or scenes — for example, applying a specific walking cycle, gesture, or camera movement to a new character or background image. It accepts image URLs, video URLs, text, and configuration inputs, giving users control over how the motion transfer is applied. With a context window of 1000 tokens, it is designed for focused, single-generation tasks rather than extended multi-turn interactions.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Kling 3.0 Pro

Release date unavailable

Kling 3.0 Pro is a video generation model developed by Kling, designed to produce video content from both text prompts and image inputs. It represents the 3.0 Pro tier of Kling's video model lineup, with a training cutoff of February 2026 and availability on MindStudio starting March 2026. The model accepts text descriptions, image URLs, and configurable selection parameters to control output characteristics. Kling 3.0 Pro is suited for workflows that require generating video from written descriptions or existing images, making it applicable to content creation, prototyping, and visual storytelling tasks. Its support for both text-to-video and image-to-video modalities gives it flexibility across different starting points for video production. The model operates with a context window of 10,000 tokens, accommodating detailed prompts for more precise video generation.

Video
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Kling Image O3

Release date unavailable

Kling Image O3 is the first image generation model released by Kling AI, designed to produce high-quality visuals from text prompts or reference images. It is notable for its ability to accurately render text within generated images, a capability that many image generation models handle poorly, making it well-suited for designs involving typography, signage, or branded content. The model supports resolutions up to 4K across a wide range of aspect ratios, including landscape dimensions up to approximately 6256×2681 pixels and portrait dimensions up to 3548×4730 pixels. Kling Image O3 accepts both text prompts and image inputs, allowing users to guide generation from an existing reference image as well as from a written description. Its combination of high-resolution output, compositional awareness, and in-image text rendering makes it particularly relevant for professional use cases such as game asset creation, marketing materials, and editorial illustration. The model is available through MindStudio without requiring separate API key management.

Image
Context: 2,500 Output: N/A
Input: N/A Output: N/A
View model →
Video

Kling O1

Release date unavailable

Kling Video O1 is an AI video generation model developed by Kuaishou Technology, built on a Multimodal Visual Language (MVL) framework that accepts text, images, and video as inputs within a single unified system. The model supports three distinct operating modes — Reference Images, Reference Video, and Video Editing — allowing creators to animate static visuals, generate or extend footage from a reference video, or modify specific elements within an existing clip while leaving the rest of the scene intact. A defining feature of Kling Video O1 is its Elements system, which lets users upload up to four images of a character or object from different angles to give the model a near-3D understanding of the subject. This enables consistent identity preservation across multiple shots and dynamic camera movements, addressing a common challenge in AI video generation. The model is well suited for use cases in film production, advertising, and social media content creation where reference-driven control and shot-to-shot consistency are required.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Kling O3

Release date unavailable

Kling Video O3, also known as Kling 3.0 Omni, is a video generation model developed by Kuaishou and launched in February 2026. It is the premium tier of the Kling 3.0 model family, designed specifically for structured, multi-shot storytelling rather than single isolated clips. The model accepts text, images, and video as inputs, and uses Multimodal Visual Language (MVL) technology to reason about scene composition, spatial relationships, and motion in a unified pass. It supports clip lengths of up to 15 seconds across up to six distinct shots generated in a single request. Kling Video O3 is built for workflows where visual consistency is critical — such as brand marketing, recurring character content, and cinematic pre-production. It preserves a subject's exact appearance, including facial features, clothing, logos, and on-screen text, across shots and scene transitions when a reference image or video is provided. The model also generates synchronized audio natively alongside video, covering ambient sound, dialogue, and multilingual lip-sync without requiring separate post-production. It is best suited for production scenarios where a character, product, or campaign identity has already been defined and consistent output at scale is the goal.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
P

Perplexity

9 models

Text

Sonar Deep Research

Mar 07, 2025

Sonar Deep Research is a text generation model developed by Perplexity AI, released in February 2025. It is designed specifically for complex, multi-step research tasks that require gathering and synthesizing information from a large number of web sources. Rather than returning a single retrieved answer, it autonomously plans a research strategy, conducts dozens of iterative web searches, evaluates the results, and refines its approach before producing a detailed, citation-backed report. It operates with a 128,000-token context window, allowing it to handle substantial volumes of text and references within a single session. Sonar Deep Research is best suited for tasks where thoroughness and accuracy take priority over response speed, such as academic research, market analysis, competitive intelligence, and due diligence investigations. It includes a dedicated reasoning phase in which the model thinks through gathered material before generating its final output, which helps produce more nuanced and accurate responses. The model does not use customer queries or outputs for training purposes. It is well-suited for professionals, researchers, and developers working in domains like finance, technology, healthcare, and current events who need reliable, well-sourced reports.

Text Reasoning
Context: 128,000 Output: 8,000 tokens
Input: $2.00 Output: $8.00
View model →
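
The Context, Output, and Input/Output price fields on each card can be combined into a quick per-request estimate. The sketch below does this for the Sonar Deep Research listing above. It rests on two assumptions that this page does not state: that prices are in USD per 1 million tokens, and that input plus output tokens must fit within the listed context window. The `Listing` class and `estimate_cost` function are hypothetical names for illustration only.

```python
from dataclasses import dataclass

TOKENS_PER_PRICE_UNIT = 1_000_000  # ASSUMPTION: listed prices are USD per 1M tokens


@dataclass(frozen=True)
class Listing:
    name: str
    context_tokens: int
    max_output_tokens: int
    input_price_usd: float
    output_price_usd: float


def estimate_cost(listing: Listing, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request, rejecting requests over the listed limits."""
    if output_tokens > listing.max_output_tokens:
        raise ValueError("output exceeds the listed output limit")
    # ASSUMPTION: input and output share the listed context window.
    if input_tokens + output_tokens > listing.context_tokens:
        raise ValueError("request exceeds the listed context window")
    total = (input_tokens * listing.input_price_usd
             + output_tokens * listing.output_price_usd)
    return total / TOKENS_PER_PRICE_UNIT


# Figures copied from the Sonar Deep Research card.
sonar_deep_research = Listing("Sonar Deep Research", 128_000, 8_000, 2.00, 8.00)

cost = estimate_cost(sonar_deep_research, input_tokens=50_000, output_tokens=4_000)
print(f"{sonar_deep_research.name}: ${cost:.3f} per request")
```

Cards that show N/A for a price or limit cannot be estimated this way, and search-augmented models may carry charges beyond token prices that this page does not list.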
Text

Sonar Pro

Mar 07, 2025

Sonar Pro is a search-augmented text generation model developed by Perplexity, designed to handle complex research queries that require thorough source attribution and multi-step reasoning. It operates with a 200,000-token context window, allowing it to process large volumes of information within a single session. The model supports both text and image inputs and can produce up to 8,192 output tokens per response. It also includes function calling, structured output generation, and a reasoning mode for analytical tasks. Sonar Pro is Perplexity's premium tier offering within the Sonar model family, delivering approximately twice the citations and search results compared to the standard Sonar model. This makes it particularly well-suited for enterprise applications, professional research workflows, and use cases that demand comprehensive source coverage and reliable multi-step query handling. The model's training data extends through March 2025, and its live web search integration means responses can draw on current information beyond that date. It is available via API for developers building research-intensive or knowledge-heavy applications.

Text Image
Context: 200K Output: 8,000 tokens
Input: $3.00 Output: $15.00
View model →
Text

Sonar Reasoning Pro

Mar 07, 2025

Sonar Reasoning Pro is a text generation model developed by Perplexity AI, built on top of DeepSeek R1 and augmented with Perplexity's proprietary real-time web search capabilities. It uses Chain-of-Thought reasoning to work through problems step by step before producing a final answer, making it distinct from models that rely solely on static training data. The model supports a 128,000-token context window and multiple languages, and was made available in February 2025. Sonar Reasoning Pro is designed for tasks where accuracy, source transparency, and up-to-date information are important. Because it actively queries the web during inference, it can surface current information and provide citations alongside its responses. It is best suited for in-depth research, complex multi-step analytical questions, and scenarios where users need a well-reasoned explanation grounded in verifiable, recent sources.

Text Image Reasoning
Context: 128,000 Output: 8,000 tokens
Input: $2.00 Output: $8.00
View model →
Text

Sonar

Jan 27, 2025

Sonar is Perplexity AI's in-house text generation model, built on Meta's Llama 3.3 70B and optimized for web-grounded question answering. Released in January 2025, it retrieves live internet data at query time rather than relying solely on static training knowledge, and every response includes inline source citations for transparency. It supports a 128,000-token context window and runs at approximately 121 tokens per second using Cerebras wafer-scale inference. Sonar is designed for developers and businesses that need to embed fast, factual, and source-backed search capabilities into their own applications. It offers three search depth modes — High, Medium, and Low — allowing teams to balance thoroughness against response speed depending on their use case. On the SimpleQA benchmark, Sonar achieved an F-score of 0.773, reflecting its focus on factual accuracy. It is particularly well-suited for high-volume applications such as sales research tools, medical information platforms, and real-time in-meeting search features.

Text Image
Context: 128,000 Output: 32,768 tokens
Input: $1.00 Output: $1.00
View model →
Text

Sonar Large Chat Deprecated

Release date unavailable

Part of a Perplexity model family that, at release, surpassed earlier versions in cost-efficiency, speed, and performance.

Text
Context: N/A Output: 32,768 tokens
Input: N/A Output: N/A
View model →
Text

Sonar Large Online Deprecated

Release date unavailable

Part of a Perplexity model family that, at release, surpassed earlier versions in cost-efficiency, speed, and performance.

Text
Context: N/A Output: 28,000 tokens
Input: N/A Output: N/A
View model →
Text

Sonar Reasoning Deprecated

Release date unavailable

Lightweight reasoning offering powered by reasoning models trained with DeepSeek R1.

Text
Context: N/A Output: 32,768 tokens
Input: N/A Output: N/A
View model →
Text

Sonar Small Chat Deprecated

Release date unavailable

Part of a Perplexity model family that, at release, surpassed earlier versions in cost-efficiency, speed, and performance.

Text
Context: N/A Output: 32,768 tokens
Input: N/A Output: N/A
View model →
Text

Sonar Small Online Deprecated

Release date unavailable

Part of a Perplexity model family that, at release, surpassed earlier versions in cost-efficiency, speed, and performance.

Text
Context: N/A Output: 28,000 tokens
Input: N/A Output: N/A
View model →
W

Wan

9 models

Video

Wan 2.6

Release date unavailable

Wan 2.6 is a video generation model developed by Alibaba that produces 1080p video at 24 frames per second for clips up to 15 seconds in length. It accepts text, image, or video as input and generates complete video output — including synchronized audio, dialogue, sound effects, and lip movements — in a single generation pass, without requiring a separate audio pipeline. The model was trained with a cutoff of December 2025 and is available as an open-source release. Wan 2.6 is designed for creators, marketers, and developers who need publish-ready video content without extensive post-production work. Its distinguishing features include multi-shot narrative handling across a single clip, character consistency when using reference figures, physics simulation for realistic motion, and style transfer from reference videos. These capabilities make it suited for use cases such as social media content, product demonstrations, commercials, and short narrative sequences.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Wan 2.6

Release date unavailable

Wan 2.6 is a multimodal AI generation model developed by Alibaba Cloud and released in December 2025. It uses a Mixture-of-Experts architecture with 14 billion total parameters, activating roughly 20% of them during inference. The model supports text-to-video, image-to-video, reference-to-video, and image generation modes, and accepts prompts in both English and Chinese. Video outputs can reach up to 15 seconds at 1080p resolution and 24 frames per second. What distinguishes Wan 2.6 from many generation models is its native audio output — synchronized dialogue, sound effects, and lip-sync are generated alongside video without requiring separate post-production tools. The model also supports multi-shot storytelling from a single prompt, maintaining character consistency across scenes with automatic camera transitions. It is well suited for content creators, marketers, and developers who need high-fidelity video and image output, particularly those aiming to produce publish-ready content with minimal manual editing.

Image
Context: 2,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Wan 2.5

Release date unavailable

Wan 2.5 is an open-source AI video generation model developed by Alibaba's DAMO Academy. It produces video clips up to 10 seconds long at resolutions up to 1080p, and generates synchronized audio — including dialogue with lip-sync, ambient sound effects, and background music — alongside the visuals in a single generation step. The model accepts text prompts, still images, audio tracks, or existing video clips as input, and supports cinematic controls such as camera movement types, lighting styles, and depth of field specified directly in the prompt. Wan 2.5 is designed for content creators, filmmakers, advertisers, and developers who need video output with accompanying audio without separate post-production workflows. It supports prompts and generated dialogue in at least 8 languages, and offers 480p, 720p, and 1080p as standard output resolutions with native 4K available in preview. Compared to its predecessor Wan 2.2, this version doubles the maximum video duration from 5 to 10 seconds, raises the standard resolution from 720p to 1080p, and introduces the audio generation system as an entirely new feature.

Image
Context: 2,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Wan 2.2

Jul 28, 2025

Wan 2.2 is a multimodal video generation model developed by Alibaba's Tongyi Laboratory and released in July 2025 under the Apache 2.0 license. It is the first video diffusion model to apply a Mixture-of-Experts (MoE) architecture, which splits processing between high-noise expert networks that handle overall layout and composition and low-noise expert networks that refine fine details. The model supports both text-to-video and image-to-video generation, with native bilingual prompting in English and Chinese. It is available in a 5B parameter variant suited for consumer hardware and a 14B parameter variant for higher-quality output. Wan 2.2 was trained on a dataset expanded significantly from its predecessor, with image data increasing by 65.6% and video data by 83.2%. It includes a dedicated aesthetic fine-tuning stage informed by film industry standards, further refined through reinforcement learning to align with human visual preferences. Specialized modules — Wan-Animate and Wan-Move — allow users to animate a character from a single image or transfer motion from one video to another subject. The model is natively supported by ComfyUI and accepts LoRA adapters and source images as inputs alongside text prompts.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
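The Wan 2.2 entry above describes high-noise experts that handle layout and low-noise experts that refine detail. The split can be pictured as a router that picks an expert from the current noise level. This is a rough sketch only: the linear noise schedule, the `boundary` value of 0.5, and the names `pick_expert` and `plan_denoising` are assumptions for illustration, not details taken from the model.

```python
def pick_expert(noise_level, boundary=0.5):
    """Choose an expert for one denoising step.

    noise_level: 1.0 is pure noise, 0.0 is a clean sample.
    boundary: hypothetical switch point between the two experts.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be within [0, 1]")
    return "high_noise" if noise_level >= boundary else "low_noise"


def plan_denoising(num_steps, boundary=0.5):
    """Return (step, noise_level, expert) for a linear schedule (assumed)."""
    plan = []
    for step in range(num_steps):
        noise_level = 1.0 - step / num_steps
        plan.append((step, noise_level, pick_expert(noise_level, boundary)))
    return plan


for step, noise_level, expert in plan_denoising(4):
    print(f"step {step}: noise={noise_level:.2f} -> {expert} expert")
```

Early steps, where the sample is mostly noise, go to the layout-oriented expert; later steps go to the detail-oriented one, so only one expert runs at any given step.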
Image

Wan 2.2 Deprecated

Release date unavailable

Image
Context: N/A Output: N/A
Input: N/A Output: N/A
View model →
Video

Wan 2.5

Release date unavailable

Wan 2.5 is an open-source AI video generation model developed by Alibaba's DAMO Academy. It generates videos up to 10 seconds long at resolutions ranging from 480p to 1080p HD, with native 4K available in preview, all rendered at 24 frames per second. The model's defining characteristic is its ability to generate audio and video simultaneously in a single step — producing character dialogue with lip-sync, environmental ambient sounds, and background music directly from a text or image prompt, without requiring separate post-production audio work. It supports multiple input modes including text-to-video, image-to-video, audio-to-video, and video-to-video refinement. Wan 2.5 is designed for content creators, filmmakers, advertisers, and developers who need production-ready video with synchronized audio. It supports cinematic camera controls such as dolly, tracking, and crane movements, as well as lighting styles, depth of field, and particle effects like rain and fire. The model handles photorealistic, anime, illustrated, and stylized visual aesthetics, and processes prompts in at least 8 languages with matching audio generation. Its open-source nature makes it accessible for local deployment and integration into custom pipelines.

Video
Context: 2,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Wan 2.7

Release date unavailable

Alibaba's powerful multimodal AI model that generates cinematic 1080p video with native audio synchronization, multi-shot storytelling, and advanced image creation.

Video
Context: 50K Output: 10,000 tokens
Input: N/A Output: N/A
View model →
Image

Wan 2.7

Release date unavailable

Alibaba's powerful multimodal AI model that generates cinematic 1080p video with native audio synchronization, multi-shot storytelling, and advanced image creation.

Image
Context: 2K Output: N/A
Input: N/A Output: N/A
View model →
Image

Wan 2.7 Pro

Release date unavailable

Alibaba's powerful multimodal AI model that generates cinematic 1080p video with native audio synchronization, multi-shot storytelling, and advanced image creation.

Image
Context: 2K Output: N/A
Input: N/A Output: N/A
View model →
I

Ideogram

8 models

Image

Ideogram Upscale

Release date unavailable

Ideogram Upscale is an AI-powered image enhancement tool developed by Ideogram AI, first launched in June 2024. It takes lower-resolution images and scales them up to 8K resolution while preserving and sharpening fine detail. A notable aspect of the tool is its integration with Topaz Labs technology, which is incorporated directly into the Ideogram platform to improve output quality. The model accepts both images generated within Ideogram and externally sourced images brought in for enhancement. Ideogram Upscale is designed for designers, marketers, and creators who need production-ready assets at print or large-format quality. Common use cases include preparing graphics for merchandise, advertising materials, logos, and high-resolution digital displays. The tool is available both through the Ideogram web platform and via the Ideogram API, allowing developers to integrate upscaling into automated pipelines and custom workflows.

Image
Context: 10K Output: N/A
Input: N/A Output: N/A
View model →
Image

Ideogram V1

Release date unavailable

Ideogram v1 is an AI image generation model developed by Ideogram AI that creates visuals from text prompts. It was added to MindStudio in August 2024 and accepts inputs including text descriptions and configurable parameters. The model supports a range of artistic styles, from photorealistic imagery to illustrations and graphic design aesthetics, with a context window of 10,000 tokens for prompt input. What distinguishes Ideogram v1 from many other image generation models is its ability to render legible, accurately spelled text directly within generated images. This makes it particularly useful for designers, marketers, and content creators who need to produce assets like posters, banners, social media graphics, and branded materials where typography is part of the composition. Its strong prompt adherence also allows it to translate detailed descriptions into coherent visual outputs.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Ideogram V1 Remix

Release date unavailable

Ideogram v1 Remix is an image generation model developed by Ideogram AI that takes an existing image and a text prompt as inputs to produce a transformed version of that image. It is designed to reinterpret visual content by applying new styles, moods, or artistic directions while preserving the underlying compositional structure of the source image. The model builds on Ideogram's v1 image generation foundation and adds image-guided creation as a core workflow. Ideogram v1 Remix is particularly suited for creative professionals, designers, and artists who need to iterate on visual concepts or explore stylistic variations from a starting reference. One of its notable characteristics is its text rendering accuracy, a trait carried over from the broader Ideogram model family. Users can control outputs through parameters including style selection, aspect ratio, and a seed value for reproducibility, making it useful for both exploratory and production-oriented creative work.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Ideogram V2

Release date unavailable

Ideogram V2 is a text-to-image model developed by Ideogram and released in August 2024 as the second generation of their image generation platform. It accepts text prompts along with style, aspect ratio, and other configuration inputs to produce images, and includes a Magic Prompt feature that automatically expands simple prompts into more detailed descriptions before generation. The model supports a prompt input up to 10,000 characters and accepts a seed value for reproducible outputs. Ideogram V2 is particularly suited for use cases that require legible, well-styled text embedded directly within generated images, such as social media graphics, posters, banners, product labels, and logo concepts. It is used by designers, marketers, and content creators who need images where typography accuracy is a priority. The model offers style control options and is available through the Ideogram platform as well as via API for integration into third-party applications.

Image
Context: 10K Output: N/A
Input: N/A Output: N/A
View model →
Image

Ideogram V2 Remix

Release date unavailable

Ideogram V2 Remix is an image-to-image generation model developed by Ideogram. Rather than generating images from scratch, it takes an uploaded source image and a text prompt, then produces a transformed version that blends the written creative direction with the original composition and visual elements. It accepts a range of common image formats including jpg, jpeg, png, webp, gif, and avif. The model is designed for designers, artists, and content creators who want to iterate on existing visuals rather than start from a blank canvas. It supports stylistic and thematic transformations guided by natural language, making it useful for exploring concept variations, adapting imagery to new aesthetics, or generating multiple creative directions from a single reference image. A seed input is available for reproducibility, and multiple selection parameters allow control over style, aspect ratio, and other output characteristics.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Ideogram V3

Release date unavailable

Ideogram V3, also referred to as Ideogram 3.0, is a text-to-image generation model released by Ideogram in March 2025. It accepts text prompts alongside optional inputs such as style reference images, aspect ratio selections, and rendering mode preferences to produce photorealistic images. One of its defining technical characteristics is its ability to render legible, accurate typography directly within generated images — a capability that has historically been a challenge for image generation models. It also supports a Reframe variant that enables outpainting and multi-aspect-ratio adaptation. Ideogram V3 is available in three rendering tiers — Turbo, Balanced, and Quality — allowing users to trade off generation speed against output fidelity depending on their workflow. The model is particularly suited for use cases where visual accuracy and readable text within images are both required, such as advertising assets, e-commerce photography, branded content, UX mockups, and editorial design. Its style reference control feature allows a reference image to guide color grading, texture, and compositional style across a set of generated outputs. The model accepts a seed input, enabling reproducible results when the same prompt and settings are reused.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
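The Ideogram V3 entry above notes that a seed input gives reproducible results when the same prompt and settings are reused. The idea can be illustrated with a seeded pseudo-random generator. In this sketch `fake_generate` is a hypothetical stand-in that returns a few byte values instead of an image; it is not the Ideogram API.

```python
import random


def fake_generate(prompt, seed, size=4):
    """Hypothetical stand-in for an image generator.

    Returns `size` pseudo-random byte values derived from the prompt and seed.
    """
    # A dedicated generator seeded from both inputs keeps runs independent
    # of any other use of the random module.
    rng = random.Random(f"{seed}:{prompt}")
    return [rng.randrange(256) for _ in range(size)]


first = fake_generate("poster with the word OPEN", seed=42)
second = fake_generate("poster with the word OPEN", seed=42)
print(first == second)  # same prompt and seed give the same output
```

Changing either the prompt or the seed changes the generator state, which is why both have to be kept fixed to reproduce a result.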
Image

Ideogram V3 Remix

Release date unavailable

Ideogram V3 Remix is an image editing model developed by Ideogram, a company founded by former Google Brain researchers. It extends the base Ideogram V3 image generation model with a remixing system that allows users to transform existing images using text prompts, with a 0–100 strength slider that controls how much the output deviates from the source image. Users can supply their own images or work with images previously generated in Ideogram, and the model will automatically generate a descriptive prompt when an external image is uploaded. The model accepts up to three style reference images to guide color palette, texture, and mood, and supports reusable Style Codes for maintaining brand consistency across outputs. Ideogram is particularly noted for its ability to render legible, correctly spelled text within generated images, making it well-suited for posters, packaging, logos, and marketing materials. It is designed for designers, marketers, and creative professionals who need to iterate on visual concepts without rebuilding them from scratch.

Image
Context: 10K Output: N/A
Input: N/A Output: N/A
View model →
Vision

Ideogram Vision

Release date unavailable

Ideogram Vision is a multimodal AI model developed by Ideogram that combines image understanding with natural language processing. It is designed to analyze and interpret images in conjunction with text prompts, enabling tasks such as visual question answering, image description, and vision-language reasoning. The model extends Ideogram's AI platform beyond image generation into visual comprehension. It supports a context window of 32,000 tokens, allowing for detailed and extended interactions involving both image and text inputs. Ideogram Vision is best suited for applications that require understanding the content of an image and responding to queries about it in natural language. This includes use cases such as extracting information from visual content, describing scenes or objects, and combining visual context with text-based reasoning tasks. The model is accessible through the MindStudio platform without requiring separate API key management. It is particularly relevant for developers and teams building workflows that involve image analysis as a core component.

Context: 32,000 Output: N/A
Input: $0.01 Output: N/A
View model →
M

Meta

8 models

Text

Llama 4 Maverick

Apr 05, 2025

Open source

Llama 4 Maverick is a multimodal mixture-of-experts model developed by Meta, released in April 2025. It has 17 billion active parameters drawn from a pool of 400 billion total parameters across 128 experts, and supports both text and image inputs. The model handles 12 languages and offers a 130,000-token context window, making it suited for long-document and multilingual tasks. Maverick is designed for general assistant and chat use cases, with particular strengths in image understanding and creative writing. It uses a sparse MoE architecture, meaning only a subset of parameters is activated per inference pass, which lets the model deliver broad capability at a lower compute cost. Developers building applications that require cross-language support, visual reasoning, or extended context handling are the primary target audience for this model.

Text Image Structured Output
Context: 130,000 Output: 60,000 tokens
Input: $0.20 Output: $0.60
View model →
Text

Llama 4 Scout

Apr 05, 2025

Llama 4 Scout is a multimodal AI model developed by Meta, released in early 2025 as part of the Llama 4 model family. It uses a Mixture of Experts (MoE) architecture with 17 billion active parameters, 16 experts, and 109 billion total parameters, processing both text and image inputs through a unified model backbone. The model supports a 130,000-token context window and is available under Meta's Llama 4 Community License. Llama 4 Scout is designed for developers and enterprises building applications that require multimodal understanding across text and vision. Its MoE design activates only a subset of parameters per token, making inference more compute-efficient relative to dense models of comparable total parameter count. It is well-suited for tasks such as document analysis, image-grounded question answering, and long-context text generation.

Text Image Tools
Context: 130,000 Output: 60,000 tokens
Input: $0.10 Output: $0.30
View model →
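The Maverick and Scout entries above both list 17 billion active parameters but very different totals. A short calculation makes the difference concrete. The parameter counts come from the entries; the helper name `active_fraction` is illustrative.

```python
def active_fraction(active_billions, total_billions):
    """Share of a sparse MoE model's parameters used per token."""
    if total_billions <= 0 or not 0 <= active_billions <= total_billions:
        raise ValueError("expected 0 <= active <= total and total > 0")
    return active_billions / total_billions


# (active, total) parameter counts in billions, as listed in the entries.
llama4_models = {
    "Llama 4 Maverick": (17, 400),
    "Llama 4 Scout": (17, 109),
}

for name, (active, total) in llama4_models.items():
    share = active_fraction(active, total)
    print(f"{name}: {active}B of {total}B parameters active ({share:.2%})")
```

Both models do a similar amount of work per token, but Maverick spreads its capacity over a much larger pool of experts, so a smaller share of its weights is touched on any one token.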
Text

Llama-2 13B Chat Deprecated

Jul 18, 2023

Balanced model for detailed language processing, offering advanced understanding and generation.

Text
Context: N/A Output: 2,500 tokens
Input: N/A Output: N/A
View model →
Text

Llama-2 70B Chat Deprecated

Jul 18, 2023

Provides depth and complexity in language understanding for sophisticated content creation.

Text
Context: N/A Output: 2,500 tokens
Input: N/A Output: N/A
View model →
Text

Llama 4 Scout

Apr 05, 2025

Llama 4 Scout is a multimodal AI model developed by Meta, released in early 2025 as part of the Llama 4 model family. It uses a Mixture of Experts (MoE) architecture with 17 billion active parameters, 16 experts, and 109 billion total parameters, meaning only a subset of parameters is activated per token during inference. The model processes both text and image inputs within a unified backbone and supports a 130,000-token context window. Llama 4 Scout is designed for developers and enterprises building applications that require combined text and vision understanding. Its MoE design makes it more compute-efficient during training and inference compared to dense models of similar total parameter counts. On MindStudio, it is served via Groq, which provides low-latency inference for the instruct-tuned variant.

Text
Context: 130,000 Output: 8,192 tokens
Input: $0.11 Output: N/A
View model →
Text

Code Llama Deprecated

Release date unavailable

Tailored for code comprehension, generation, and debugging with an instructive design.

Text
Context: N/A Output: 2,500 tokens
Input: N/A Output: N/A
View model →
Text

Llama 3 70B Deprecated

Release date unavailable

Text
Context: N/A Output: 8,192 tokens
Input: N/A Output: N/A
View model →
Text

Llama 3 8B Deprecated

Release date unavailable

Text
Context: N/A Output: 8,192 tokens
Input: N/A Output: N/A
View model →
Q

Qwen

7 models

Text

Qwen3.6-35B-A3B

Apr 27, 2026

Qwen3.6-35B-A3B is an open-weight multimodal model from Alibaba Cloud with 35 billion total parameters and 3 billion active parameters per token. It uses a hybrid sparse mixture-of-experts architecture combining Gated...

Text Image Video
Context: 262.1K Output: 262,144 tokens
Input: $0.20 Output: $1.00
View model →
Image

Qwen Image

Aug 04, 2025

Qwen-Image is an image generation and editing model developed by Alibaba's Qwen team. It accepts text prompts and source images as input and supports both text-to-image generation and a wide range of image editing tasks, including style transfer, object addition and removal, background changes, and pose manipulation. The model uses a dual-encoding architecture that processes images through both Qwen2.5-VL for semantic understanding and a VAE encoder for visual fidelity, feeding into an MMDiT backbone. What distinguishes Qwen-Image from many other generation models is its ability to render complex text accurately within images, including multi-line layouts and logographic scripts such as Chinese characters. This capability is built using a curriculum learning strategy that progressively scales from simple to complex text rendering tasks during training. The model has been evaluated on benchmarks covering image generation, image editing, and text rendering, including GenEval, DPG, GEdit, LongText-Bench, ChineseWord, and CVTG-2K. It is well-suited for workflows that require accurate in-image typography, multilingual text, or detailed image editing from a source image.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Text

Qwen3 235B

Apr 28, 2025

Qwen3 235B is an instruction-tuned large language model developed by Alibaba's Qwen team, built on a Mixture-of-Experts (MoE) architecture with 235 billion total parameters. During inference, only 22 billion parameters are activated at a time, which reduces computational cost relative to the model's full parameter count. The model supports a native context window of 262,144 tokens and is released under the Apache 2.0 license, permitting commercial use. This release, versioned as Qwen3-235B-A22B-Instruct-2507, is the non-thinking instruct variant, meaning it produces direct responses without exposing an internal chain-of-thought. It is designed for instruction following, agentic workflows, tool use, multilingual tasks, complex question answering, and coding. The model scores 51.8% on LiveCodeBench v6, 70.3% on AIME25, and 77.5% on GPQA, reflecting its range across coding, mathematical reasoning, and knowledge-intensive tasks.

Text Tools Structured Output
Context: 262,144 Output: 262,144 tokens
Input: $0.15 Output: $1.82
View model →
Image

Qwen 2 Pro

Release date unavailable

Qwen Image 2.0 Pro is an image generation and editing model developed by Alibaba's Qwen team and released in February 2026. It uses an 8B Qwen3-VL encoder paired with a 7B diffusion decoder to produce images natively at 2048×2048 resolution. A single model handles both text-to-image generation and image editing tasks, and it accepts prompts up to 1,000 tokens for detailed scene descriptions. It holds the number one position on AI Arena's blind human evaluation leaderboard for both text-to-image generation and image editing. One of the model's defining characteristics is its ability to render accurately spelled, properly positioned text within generated images, making it suitable for infographics, presentation slides, movie posters, comics, and bilingual Chinese and English content. Its 7 billion parameter footprint is smaller than its predecessor, which used 20 billion parameters, enabling faster inference. The model is well suited for marketing teams, content creators, and designers who need production-ready visuals where accurate text rendering, high native resolution, or iterative editing workflows are priorities.

Image
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Qwen Image Edit Plus

Release date unavailable

Qwen Image Edit Plus is an image generation and editing model developed by Qwen, released in early 2026. It supports text-to-image generation, image-to-image editing, and ControlNet pose conditioning, making it suited for workflows that require precise control over output composition. The model accepts image URL arrays, numeric parameters, and seed values as inputs, enabling reproducible results across generation runs. The model is designed for tasks that involve modifying existing images based on text prompts as well as generating new images from scratch. Its ControlNet pose support allows users to guide human figure layouts using reference poses, which is useful for character-focused creative work. With a context window of 50,000 tokens, it can process detailed prompt instructions alongside image inputs.

Image
Context: 50,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Z Image Turbo

Release date unavailable

Z Image Turbo is a text-to-image generation model developed by Alibaba's Tongyi-MAI lab, built on a single-stream diffusion transformer architecture with 6 billion parameters. It is the distilled, few-step variant of the Z-Image foundation model, designed to produce high-quality images at faster inference speeds without significant quality degradation. The model incorporates a Reinforcement Learning from Human Feedback (RLHF) pipeline using DPO and GRPO stages to align outputs with human aesthetic preferences, and includes a built-in prompt enhancer with a reasoning chain to improve results from short or simple prompts. Z Image Turbo accepts text prompts, source images, LoRA weights, and a seed value as inputs, making it suitable for both text-to-image and image editing workflows. Its training data infrastructure includes a Data Profiling Engine, Cross-modal Vector Engine, and a multi-level image captioning system covering OCR, world knowledge, and editing difference captions. The model is well-suited for creative professionals, developers building image generation pipelines, and researchers working with efficient diffusion transformer architectures.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Z Image Turbo Controlnet

Release date unavailable

Z Image Turbo Controlnet is an image generation model developed by Alibaba's Tongyi-MAI lab, built on a single-stream diffusion transformer architecture with 6 billion parameters. It uses a few-step distillation approach (the Turbo variant) to accelerate inference while preserving output quality, and incorporates ControlNet to allow structural guidance from a source image. The model was trained with a multi-level captioning system and a data infrastructure that includes a Cross-modal Vector Engine and World Knowledge Topological Graph to improve semantic alignment between prompts and outputs. This model is well-suited for workflows that require both speed and structural control over generated images, such as guided creative generation, image editing pipelines, and rapid prototyping. It accepts image URLs as source inputs alongside configurable parameters including seed values for reproducibility. An RLHF alignment pipeline using DPO and GRPO stages was applied to bring outputs closer to human aesthetic preferences, and a built-in prompt enhancer with reasoning chain helps produce better results from short or underspecified prompts.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
W

Wavespeed

6 models

Image

Chroma

Release date unavailable

Chroma is an 8.9 billion parameter text-to-image model developed by WaveSpeed AI, built on the FLUX.1-schnell architecture. It was trained using over 105,000 hours of NVIDIA H100 GPU time, with a dataset curated from 5 million selected images. The model is designed around a philosophy of unrestricted creative expression, removing the content filters found on many mainstream image generation platforms. It supports image output up to 1536×1536 pixels and is noted for clean renders, natural lighting, strong color harmony, and anatomical accuracy in human figures, hands, and faces. Chroma is well-suited for commercial photography, digital illustration, character design, concept art, and medical or educational illustration where content restrictions would otherwise be a barrier. It handles complex, multi-element scenes involving people, props, and environments with strong prompt adherence. The model responds particularly well to structured prompts organized around subject, context, style, lighting, camera, and mood. It is available through WaveSpeed AI and is optimized for both single-shot and batch generation workflows.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
3D Generation

Hunyuan3D V2 Multi-View

Release date unavailable

Generate a 3D model from front, back, and side reference images with optional textured output.

Context: N/A Output: 10,000 tokens
Input: N/A Output: N/A
View model →
3D Generation

Hunyuan3D V3

Release date unavailable

Context: N/A Output: 10,000 tokens
Input: N/A Output: N/A
View model →
3D Generation

Meshy 6

Release date unavailable

Structured model profile with pricing, context, and capability details.

Context: N/A Output: 10,000 tokens
Input: N/A Output: N/A
View model →
3D Generation

SAM 3D Objects

Release date unavailable

Structured model profile with pricing, context, and capability details.

Context: N/A Output: 10,000 tokens
Input: N/A Output: N/A
View model →
3D Generation

Tripo3D v2.5

Release date unavailable

Structured model profile with pricing, context, and capability details.

Context: N/A Output: 10,000 tokens
Input: N/A Output: N/A
View model →
S

Stability

5 models

Image

SDXL LoRA

Jul 04, 2023

SDXL LoRA is a text-to-image generative AI model developed by Stability AI, built as a successor to Stable Diffusion. It runs on a 3.5 billion parameter architecture and generates images natively at 1024×1024 resolution, using dual text encoders — OpenCLIP-ViT/G and CLIP-ViT/L — to interpret complex prompts with reported 89% prompt adherence in benchmark testing. The model also supports an optional refiner stage that applies an ensemble-of-experts approach to add fine detail to generated outputs. What distinguishes SDXL LoRA from the base SDXL model is its built-in support for Low-Rank Adaptation (LoRA), a technique that enables efficient style and subject customization without full model retraining. Users can apply up to five LoRA adapters simultaneously, making it practical for tasks like consistent character design, brand-specific imagery, and specialized artistic styles. It is well-suited for digital artists, marketing teams, game developers, and product designers who need repeatable, customizable visual output at scale.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
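The efficiency claim behind LoRA can be illustrated with simple parameter arithmetic. The sketch below is not taken from the SDXL LoRA card: it assumes the standard LoRA formulation, in which a full update to a weight matrix of shape d_out × d_in is replaced by two low-rank factors of rank r, and the layer dimensions and rank used here are hypothetical values chosen only for illustration.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full_update_params, lora_params) for a single weight matrix.

    A full fine-tune updates every entry of the d_out x d_in matrix.
    LoRA instead trains B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Hypothetical layer size and rank, not values published for SDXL LoRA.
full, lora = lora_param_counts(d_out=1280, d_in=1280, rank=8)
ratio = lora / full

print(f"full update: {full:,} trainable parameters")
print(f"LoRA update: {lora:,} trainable parameters ({ratio:.2%} of full)")
```

Under these assumed dimensions the adapter trains about 1.25% of the parameters a full update of the same matrix would, which is why several adapters can be stored and combined cheaply.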
Image

SDXL

Release date unavailable

SDXL (Stable Diffusion XL) is an open-source image generation model developed by Stability AI and released in July 2023. It accepts text prompts and optional image inputs to produce images, and supports workflows including text-to-image generation, image-to-image editing, and inpainting. The model is available in two configurations: SDXL 1.0, optimized for out-of-the-box use, and SDXL 1.0 Open, which allows fine-tuning with custom data and inference code. Both variants are deployable via AWS SageMaker and Amazon Bedrock. SDXL is designed for designers, creative professionals, and developers who need generative imagery at scale. Its open model variant supports customization through fine-tuning, making it usable for specialized image pipelines beyond general-purpose prompting. Inputs include image URLs, numeric parameters for dimensions, and a seed value for reproducible outputs. The model is tagged as open source and multi-modal, reflecting its support for both text and image inputs.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Stable Diffusion 3

Release date unavailable

Stable Diffusion 3 (SD3) is a text-to-image generation model developed by Stability AI and released in June 2024. It introduces a Multimodal Diffusion Transformer (MMDiT) architecture that maintains separate weight sets for image and language representations, which improves the model's ability to interpret complex, detailed prompts. The model is available in multiple size variants ranging from 800 million to 8 billion parameters, making it deployable across a range of hardware configurations. One of SD3's most notable characteristics is its ability to render legible text within generated images, a task that has historically been difficult for diffusion-based models. The 8B parameter variant fits within 24GB of VRAM and generates a 1024×1024 image in approximately 34 seconds using 50 sampling steps. SD3 is well suited for creative professionals, developers, and researchers who require high-fidelity image generation with strong alignment to nuanced text prompts.

Image
Context: 10,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Stable Image Core

Release date unavailable

Stable Image Core is a text-to-image generation model developed by Stability AI, designed to convert natural language descriptions into detailed visual imagery. It accepts text prompts of up to 77 tokens and produces images across a range of styles and subjects. The model is available through both the Stability AI API and Amazon Bedrock, giving developers flexibility in how they integrate it into their workflows. Stable Image Core is well suited for use cases such as creative content generation, marketing visuals, concept art, and rapid visual prototyping. Its availability on Amazon Bedrock means it can be incorporated into cloud-based applications without managing underlying infrastructure. The model serves as an accessible entry point into Stability AI's image generation ecosystem, balancing output quality with ease of deployment.

Image
Context: 77 Output: N/A
Input: N/A Output: N/A
View model →
Image

Stable Image Ultra

Release date unavailable

Stable Image Ultra is Stability AI's flagship text-to-image generation model, designed to produce high-quality, photorealistic images from natural language text prompts. It sits at the top of Stability AI's image generation lineup and accepts concise text descriptions (up to a 77-token context window) to generate detailed visuals with strong coherence and fidelity. The model supports configurable inputs including text prompts, selection parameters, and a seed value for reproducible outputs. Stable Image Ultra is well-suited for applications such as marketing visuals, concept art, product visualization, and editorial illustration. It is available through Stability AI's own API and via Amazon Bedrock, making it accessible for production-scale deployments without requiring infrastructure management. Developers and enterprises can integrate it directly into existing workflows through these managed cloud platforms.

Image
Context: 77 Output: N/A
Input: N/A Output: N/A
View model →
Z

Z.ai

5 models

Text

GLM 5.1

Apr 07, 2026

Open source

GLM-5.1 delivers a major leap in coding capability, with particularly significant gains in handling long-horizon tasks. Unlike previous models built around minute-level interactions, GLM-5.1 can work independently and continuously on...

Text Tools Structured Output
Context: 202.8K Output: 16,384 tokens
Input: $1.40 Output: $3.08
View model →
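The Input and Output prices on each card can be turned into a rough per-request estimate. The sketch below assumes the listed figures are USD per one million tokens; the catalog does not state the unit, so treat that as an assumption and adjust the constant if the unit differs. The helper name and the token counts are hypothetical.

```python
from decimal import Decimal

# Assumption: catalog prices are quoted in USD per 1,000,000 tokens.
TOKENS_PER_PRICE_UNIT = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price: str, output_price: str) -> Decimal:
    """Estimate the USD cost of one request from the listed prices."""
    total = (Decimal(input_tokens) * Decimal(input_price)
             + Decimal(output_tokens) * Decimal(output_price))
    return total / TOKENS_PER_PRICE_UNIT


# GLM 5.1 card above: Input $1.40, Output $3.08.
# Hypothetical request: 50,000 prompt tokens and 2,000 generated tokens.
cost = estimate_cost(50_000, 2_000, "1.40", "3.08")
print(f"estimated cost: ${cost}")
```

Prices are passed as strings and handled with `Decimal` so that small per-token amounts are not distorted by binary floating-point rounding.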
Text

GLM 5

Feb 11, 2026

GLM-5 is a 744-billion-parameter Mixture-of-Experts language model developed by Z.ai (formerly Zhipu AI), released in February 2026 under the MIT license. It activates 40 billion parameters per token and supports a 200,000-token context window, making it suited for tasks that require processing large volumes of text in a single pass. The model was pre-trained on 28.5 trillion tokens and incorporates DeepSeek Sparse Attention to reduce inference costs while maintaining long-context performance. GLM-5 is designed primarily for agentic workflows, autonomous software engineering, tool use, and long-horizon planning tasks. A notable aspect of its development is that it was trained entirely on Huawei Ascend chips using the MindSpore framework, with no dependency on NVIDIA hardware. It also introduces an asynchronous reinforcement learning training system called slime, which improves training throughput and enables more fine-grained post-training alignment. The model is freely available for both research and commercial use under its MIT license.

Text Tools Structured Output
Context: 202.8K Output: 16,384 tokens
Input: $0.80 Output: $1.92
View model →
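The Mixture-of-Experts figures quoted for GLM-5 (744 billion total parameters, 40 billion active per token) can be read as an active-parameter share. The short sketch below only does that arithmetic; the function name is hypothetical, and the result says nothing about actual latency or memory use, which depend on the serving setup.

```python
def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of a model's parameters used per token (both in billions)."""
    if total_params_b <= 0:
        raise ValueError("total parameter count must be positive")
    return active_params_b / total_params_b


# Figures from the GLM 5 description above.
glm5_share = active_fraction(40, 744)
print(f"GLM-5 active share per token: {glm5_share:.1%}")
```

With these figures roughly one parameter in nineteen takes part in each forward pass, which is the basis for the reduced inference cost the description mentions.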
Text

GLM 4.7

Dec 22, 2025

GLM-4.7 is a 358-billion-parameter large language model developed by Z.ai (formerly Zhipu AI/THUDM) and released in December 2025. It is designed specifically for agentic workflows, multi-step coding tasks, terminal automation, and complex mathematical and scientific reasoning. The model is available under an MIT license, making it usable for both commercial and non-commercial applications. It supports a 131,072-token context window, allowing it to handle long documents and extended coding sessions. What distinguishes GLM-4.7 from earlier GLM releases is a set of three reasoning mechanisms: Interleaved Thinking, which applies reasoning before every response and tool call; Preserved Thinking, which retains reasoning context across conversation turns to maintain consistency; and Turn-level Thinking, which lets developers toggle reasoning depth on or off per turn. On benchmarks, the model scores 73.8% on SWE-bench Verified, 95.7% on AIME 2025, and 87.4% on τ²-Bench. It is best suited for developers and researchers building agent pipelines, automated coding tools, or applications requiring reliable multi-step planning.

Text Tools Structured Output
Context: 131,072 Output: 16,384 tokens
Input: $0.40 Output: $1.75
View model →
Text

GLM 4.6V

Dec 08, 2025

GLM-4.6V is a large-scale multimodal foundation model developed by Z.ai, available in two variants: the full 106B parameter version designed for cloud and high-performance cluster deployments, and a lightweight 9B Flash version optimized for local and low-latency use. The model supports a 128K token context window, allowing it to process long documents, multi-page files, and complex mixed-media inputs natively without converting content to plain text first. It was trained with a data cutoff of December 2025. What distinguishes GLM-4.6V is its native integration of tool-use capabilities within a visual model — it can accept images, screenshots, and document pages directly as inputs to function calls, connecting visual perception to executable actions in agent workflows. The model also supports interleaved image-text generation, frontend replication from UI screenshots, and joint understanding of text, layout, charts, tables, and figures. It is best suited for enterprise and agent-based applications such as document analysis pipelines, multimodal AI assistants, UI automation, and content generation workflows.

Text Image Video
Context: 131,072 Output: 16,384 tokens
Input: $0.30 Output: $0.90
View model →
Text

GLM 4.6

Sep 30, 2025

GLM-4.6 is a large language model developed by Zhipu AI (Z.ai), built on a Mixture-of-Experts architecture with approximately 357 billion parameters. It supports both English and Chinese, carries a 200,000-token context window, and is released under the MIT license, making it available for commercial and personal use without restrictions. The model was released in late 2025 and represents Zhipu AI's flagship offering in the GLM series. GLM-4.6 is designed for tasks that require extended context handling, multi-step reasoning, and agentic workflows. A notable characteristic is its ability to invoke tools during the reasoning process itself — not only after completing a chain of thought — which enables more dynamic problem-solving in agent-based applications. It is well suited for developers and researchers working on complex coding tasks, long-document analysis, bilingual applications, and automated multi-step pipelines.

Text Tools Structured Output
Context: 200,000 Output: 16,384 tokens
Input: $0.43 Output: $1.74
View model →
E

ElevenLabs

4 models

Text to Speech

ElevenLabs TTS

Release date unavailable

ElevenLabs TTS is a text-to-speech platform developed by ElevenLabs that converts written text into natural-sounding audio across 70+ languages. The platform includes multiple speech models — Eleven v3, Eleven Multilingual v2, and Eleven Flash v2.5 — each designed for different use cases, from expressive long-form narration to ultra-low-latency real-time applications. It also supports voice cloning, allowing users to create digital replicas of voices that retain their characteristics across all supported languages. ElevenLabs TTS is well-suited for media companies, audiobook producers, game developers, publishers, and content creators who need scalable multilingual audio output. The platform's conversational AI component supports sub-100ms latency and can integrate with CRMs, payment systems, and telephony platforms, making it applicable for customer-facing voice agent deployments. The context window supports up to 10,000 tokens per request, and the platform accepts voice selection and configuration inputs through its API.

Context: 10,000 Output: N/A
Input: $240.00 Output: N/A
View model →
Music

ElevenLabs Music

Release date unavailable

ElevenLabs Music (music_v1) is a music generation model developed by ElevenLabs that produces original audio tracks from text descriptions. Users can specify a genre, mood, style, or use case in natural language, and the model generates a corresponding track with or without vocals. It supports multiple languages for vocal content, making it usable across a range of regional and stylistic contexts. The model is designed for creators who need customized background music, scored content, or vocal tracks without requiring traditional music production tools. It accepts text as its sole input type, meaning the entire creative direction is communicated through descriptive prompts. ElevenLabs Music is well-suited for video producers, game developers, content creators, and anyone needing original audio generated quickly from a written description.

Context: N/A Output: N/A
Input: N/A Output: N/A
View model →
Transcription

Scribe v1

Release date unavailable

Scribe v1 is ElevenLabs' original speech-to-text model, designed to convert spoken audio into written transcripts. Built as the foundation of ElevenLabs' transcription offering, it enables developers and creators to automatically transcribe audio and video content through the ElevenLabs API. The model supports transcription across multiple languages, making it usable in multilingual workflows and automation pipelines. Scribe v1 has been deployed in use cases ranging from voice note capture to content production tooling. It has since been succeeded by Scribe v2, which adds features such as support for 90+ languages, speaker diarization for up to 32 speakers, word-level timestamps, and entity detection. Developers starting new projects are directed by ElevenLabs to use Scribe v2, while Scribe v1 remains available for existing integrations.

Context: N/A Output: N/A
Input: N/A Output: N/A
View model →
Transcription

Scribe v2

Release date unavailable

Scribe v2 is ElevenLabs' flagship speech-to-text model, built to transcribe audio accurately across more than 90 languages with automatic language detection. It supports speaker diarization for up to 32 speakers, word-level timestamps, and entity detection across 56 named entity types, making it one of the more feature-rich transcription models available through an API. Developers can also supply up to 100 custom keyterms to improve recognition of domain-specific vocabulary, names, or technical jargon. Scribe v2 is well suited for applications where transcription accuracy and rich metadata matter — such as meeting summarization, podcast indexing, media subtitling, and legal or medical documentation workflows. Its dynamic audio tagging feature automatically labels non-speech events, which adds context beyond spoken words. The combination of precise timing data and speaker attribution makes it a practical choice for any pipeline where knowing who said what and when is a requirement.

Context: N/A Output: N/A
Input: N/A Output: N/A
View model →
L

Luma Labs

4 models

Image

Photon 1

Release date unavailable

Photon 1 is a text-to-image generation model developed by Luma Labs, the company also known for its Ray video generation models. Released in early 2025, it is built on a proprietary Universal Transformer architecture rather than the diffusion-based approach used by many image generators, which enables it to produce 1920×1080 resolution outputs with accurate lighting, shadows, and textures. It accepts natural language prompts without requiring specialized prompt engineering, and supports up to four reference images to guide style, composition, or character appearance. Photon 1 is available in two configurations: the standard variant targets maximum quality for professional and print-ready use cases, while Photon 1 Flash is a lighter variant optimized for speed, with generation times as low as 100–500 milliseconds. A beta character consistency feature allows the model to maintain a character's appearance across multiple generations using a single reference image. The model is well suited for designers, marketers, and content creators who need production-ready imagery for applications such as product photography, marketing campaigns, and character-driven visual projects.

Image
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Image

Photon 1 Flash

Release date unavailable

Photon 1 Flash is an image generation model developed by Luma Labs, designed with an emphasis on speed and low-latency inference. It is the "flash" variant in the Photon model family, optimized for faster generation than its more computationally intensive counterparts. The model accepts a context window of up to 1,000 tokens and is accessible through AI platforms and aggregators including MindStudio. Photon 1 Flash is best suited for production environments and real-time applications where response time is a priority alongside image quality. Its primary audience is developers and businesses building high-throughput workflows such as rapid prototyping tools, content pipelines, or interactive applications. Its design trades some output fidelity for generation speed, making it a practical choice when latency is a key deployment constraint.

Image
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Ray 2

Release date unavailable

Ray 2 is a large-scale AI video generation model developed by Luma Labs, released in January 2025. It runs on approximately 10 times the compute of its predecessor, Ray 1.6, and is built on a multi-modal architecture trained directly on video sequences rather than individual frames. This training approach gives the model an understanding of natural motion, lighting behavior, and physical object interactions. Ray 2 accepts text prompts, images, or video as input and generates clips ranging from 5 to 9 seconds, extendable up to 30 seconds, at resolutions up to 1080p with optional 4K upscaling. Ray 2 supports multiple aspect ratios including 16:9, 9:16, 1:1, and 21:9, and includes keyframe control so users can define start frames, end frames, or both for precise scene direction. A speed-optimized variant called Ray 2 Flash delivers comparable visual quality in roughly one-third the render time, making it suitable for rapid iteration. The model is available through Luma AI's own platform and via Amazon Bedrock, where AWS serves as the exclusive cloud provider for fully managed access. It is used across industries including advertising, entertainment, architecture, fashion, film, and music production.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

Ray Flash 2

Release date unavailable

Ray Flash 2 is a video generation model developed by Luma AI, designed to produce AI-generated videos from text prompts and images. It is the speed-optimized variant within the Ray 2 model family, prioritizing throughput and fast iteration while maintaining visual quality. Within Luma's broader lineup, it sits alongside Ray 2, Ray 1.6, and Photon 1 Flash, giving users options across different speed and quality trade-offs. Ray Flash 2 is well-suited for workflows that require rapid video generation or high-volume production, such as creative prototyping or iterative content development. It is accessible through multiple platforms, including ComfyUI's Partner Nodes system for visual AI workflows and the Codenteam Intersect platform for developer integrations. The model accepts both text and image inputs, supporting flexible generation modes for a range of creative and technical use cases.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
A

Amazon

3 models

Text

Amazon Nova Lite

Dec 05, 2024

Amazon Nova Lite is a multimodal foundation model developed by Amazon and made available through Amazon Bedrock. It accepts image, video, and text inputs and is designed to process them at low latency and low cost. The model was released in December 2024 as part of the Amazon Nova family, which includes three understanding models — Nova Micro, Nova Lite, and Nova Pro — and two creative content generation models. Nova Lite occupies the middle tier of the Nova understanding lineup, sitting between the text-only Nova Micro and the more capable Nova Pro. It supports a 300,000-token context window, making it suitable for tasks that involve long documents or extended conversations. The model also supports fine-tuning on Amazon Bedrock, allowing developers to adapt it for specific use cases. It is well-suited for applications that require multimodal input processing at scale where cost efficiency and speed are priorities.

Text Image Tools
Context: 300,000 Output: 5,000 tokens
Input: $0.06 Output: $0.24
View model →
Text

Amazon Nova Micro

Dec 05, 2024

Amazon Nova Micro is a text-only foundation model developed by Amazon and made available through Amazon Bedrock. It is part of the Amazon Nova family, which includes understanding models (Nova Pro, Nova Lite, and Nova Micro) as well as creative content generation models. Nova Micro is specifically designed to deliver the lowest latency responses within the Nova lineup at very low cost, making it a practical choice for applications where speed and cost efficiency are priorities. Because Nova Micro handles text input and output exclusively, it is well suited for tasks such as summarization, classification, question answering, and other text-based workflows where multimodal capabilities are not required. The model supports a 128,000-token context window, allowing it to process long documents or extended conversations in a single request. It can also be fine-tuned on Amazon Bedrock, enabling developers to adapt it to specific domains or use cases.

Text Tools
Context: 128,000 Output: 5,000 tokens
Input: $0.04 Output: $0.14
View model →
Text

Amazon Nova Pro

Dec 05, 2024

Amazon Nova Pro is a multimodal foundation model developed by Amazon and made available through Amazon Bedrock. It accepts text and vision inputs and is designed to handle a wide range of tasks where accuracy, response speed, and cost-efficiency all need to be balanced together. It is part of the Amazon Nova family, which also includes Nova Lite and Nova Micro, each targeting different points on the capability-cost spectrum. Nova Pro was released in December 2024 and supports a 300,000-token context window. Nova Pro is particularly suited for agentic workflows and UI actuation, meaning it can be used to build systems that take sequences of actions or interact with interfaces. It supports fine-tuning on Amazon Bedrock, allowing developers to customize the model for specific domains or cost targets. Within the Nova family, Pro occupies the highest capability tier among the understanding models, making it the appropriate choice when tasks require processing both text and images at scale.

Text Image Tools
Context: 300,000 Output: 5,000 tokens
Input: $0.80 Output: $3.20
View model →
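The three Nova cards describe a capability and cost ladder, and choosing among them can be sketched as a simple filter. The example below uses only the context sizes, input tags, and prices shown on the cards above; the `ModelCard` type and `cheapest_fit` helper are hypothetical, and ranking by input price alone is a simplifying assumption rather than a recommendation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelCard:
    name: str
    context_tokens: int
    accepts_images: bool   # taken from the "Image" tag on the card
    input_price: float     # as listed on the card
    output_price: float    # as listed on the card


NOVA_CARDS = [
    ModelCard("Amazon Nova Lite", 300_000, True, 0.06, 0.24),
    ModelCard("Amazon Nova Micro", 128_000, False, 0.04, 0.14),
    ModelCard("Amazon Nova Pro", 300_000, True, 0.80, 3.20),
]


def cheapest_fit(cards: Sequence[ModelCard], prompt_tokens: int,
                 needs_images: bool) -> Optional[ModelCard]:
    """Return the lowest input-price card that fits the prompt, or None."""
    candidates = [
        card for card in cards
        if card.context_tokens >= prompt_tokens
        and (card.accepts_images or not needs_images)
    ]
    return min(candidates, key=lambda card: card.input_price, default=None)


text_pick = cheapest_fit(NOVA_CARDS, 100_000, needs_images=False)
long_pick = cheapest_fit(NOVA_CARDS, 200_000, needs_images=False)
image_pick = cheapest_fit(NOVA_CARDS, 10_000, needs_images=True)
too_long = cheapest_fit(NOVA_CARDS, 400_000, needs_images=False)

for label, pick in [("text", text_pick), ("long text", long_pick),
                    ("image", image_pick), ("oversized", too_long)]:
    print(label, "->", pick.name if pick else "no model fits")
```

A real selection would also weigh output price, answer quality, and latency, none of which this filter considers.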
L

Lightricks

3 models

Video

LTX-2 19b

Release date unavailable

LTX-2 19B is an open-source video generation model developed by Lightricks and released on January 6, 2026. It uses an asymmetric dual-stream Diffusion Transformer architecture to generate video and synchronized audio together in a single unified process, rather than producing silent video and adding audio as a separate step. The model accepts text prompts, reference images, or existing video clips as input and outputs native 4K video with flexible frame-rate control and support for extended clip durations. What distinguishes LTX-2 19B is its simultaneous audiovisual output, where ambient sound, environmental effects, and speech synchronization are generated alongside the video frames. The model supports LoRA fine-tuning for camera motion control and custom stylization, and offers NVFP4 and FP8 quantization formats that reduce VRAM usage by up to 60% and accelerate generation up to 3x. A distilled 8-step fast generation mode runs 5–6 times faster than the full model, and on an RTX 4090 with NVFP4 quantization an 8-second 720p clip can be produced in approximately 25 seconds. It is well suited for film-style storytelling, advertising production, and any workflow requiring tight audiovisual coherence.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

LTX-2.3

Release date unavailable

LTX-2.3 is a multimodal video generation model developed by Lightricks and released in March 2026. Built on a Diffusion Transformer architecture with 22 billion parameters, it generates synchronized audio and video in a single forward pass at resolutions up to 4K at 50 frames per second, for clips up to 20 seconds long. It is available as open-source software with open weights under a permissive license, and can be run locally, accessed via API, or deployed on-premises. The model introduces several architectural updates over its predecessor, including a rebuilt variational autoencoder for sharper texture and edge detail, a gated attention text connector for improved prompt adherence, and an upgraded vocoder trained on filtered audio data for cleaner output. It supports native portrait-mode output at 1080×1920 and ships in four checkpoint variants — dev, distilled, fast, and pro — with the distilled variant completing generation in as few as 8 denoising steps. LTX-2.3 is aimed at independent creators, small studios, and developers who need a production-ready open-source foundation for video creation without licensing fees.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
Video

LTX-2.3 LoRA

Release date unavailable

LTX-2.3 LoRA is a Low-Rank Adaptation fine-tuning system built on top of Lightricks' LTX-2.3 video generation model, released in January 2026. Rather than retraining the full model, LoRA adapters allow users to teach the base model new characters, visual styles, or motion behaviors at a fraction of the computational cost. The system supports both text-to-video and image-to-video generation workflows, and LoRAs trained on the earlier LTX-2.0 model are reported to retain compatibility with the 2.3 update. LTX-2.3 LoRA is designed for creators and developers who need stylistically consistent output across AI-generated video sequences, such as animation, storytelling, or visual effects production. It supports multi-character generation with consistent appearance across frames, style transfer, and community-developed camera movement controls including dolly in and out. The model runs locally using open-source tooling and has gained traction in the Stable Diffusion community for its character and style fidelity in generated video content.

Video
Context: 1,000 Output: N/A
Input: N/A Output: N/A
View model →
M

MiniMax

3 models

Text to Speech

MiniMax Speech 2.8 HD

Release date unavailable

MiniMax Speech 2.8 HD is a high-definition text-to-speech model developed by MiniMax, built on an autoregressive Transformer architecture with a Flow-VAE decoder. Instead of using traditional mel-spectrogram vocoders, it models speech in a learned latent space, which produces audio with natural cadence, proper intonation, and emotional depth. The model accepts up to 50,000 tokens of input text and was trained through January 2026. The model offers 17 or more expressive voice presets spanning different genders, ages, and speaking styles, along with support for natural interjections such as laughs, sighs, and gasps embedded directly in text. Users can control emotion, speed, volume, pitch, sample rate, bitrate, channel configuration, and output format. These features make it well suited for audiobook production, video voiceovers, podcast creation, e-learning narration, accessibility applications, and game development.

Context: 50K Output: N/A
Input: N/A Output: N/A
View model →
Video

Hailuo 2.3 Pro

Release date unavailable

Hailuo 2.3 Pro is a video generation model developed by MiniMax, capable of producing 1080p video output from text prompts or image inputs. It is designed with physics-aware scene rendering, meaning it attempts to simulate realistic physical interactions and motion within generated video content. The model was trained with a cutoff of October 2025 and accepts up to 2,000 context tokens for prompt input. Hailuo 2.3 Pro supports both text-to-video and image-to-video generation workflows, making it applicable to creative production, prototyping, and visual storytelling tasks. Its image URL input type allows users to anchor video generation to a specific starting frame, while toggle group inputs provide control over generation parameters. The model is suited for use cases that require high-resolution output with coherent motion and scene physics.

Video
Context: 2,000 Output: N/A
Input: N/A Output: N/A
View model →
Music

MiniMax Music 2.5

Release date unavailable

MiniMax Music 2.5 is an AI music generation model developed by MiniMax and released in January 2025. It is designed to address two longstanding challenges in AI-generated music: precise structural control over song arrangement and high-fidelity audio output that closely resembles professionally recorded sound. The model supports 14 distinct structural tags — including Intro, Bridge, Interlude, Build-up, and Hook — giving users paragraph-level control over how a song is organized from start to finish. MiniMax Music 2.5 is well-suited for musicians, content creators, filmmakers, and developers who need complete, structurally defined songs without access to a recording studio. Users can generate full tracks by providing text prompts and selecting structural and stylistic options, making the workflow accessible to creators at varying levels of music production experience. The model is available through MiniMax's platform and is accessible on MindStudio without requiring separate API key setup.

Context: 50,000 Output: N/A
Input: N/A Output: N/A
View model →
R

Reka

3 models

C

Cohere

2 models

Text

Command R

Aug 30, 2024

Command R is an instruction-following conversational model developed by Cohere, designed for enterprise language tasks with a focus on reliability and scalability. It is available through Amazon Bedrock and carries a knowledge cutoff of March 2024. The model is purpose-built for retrieval-augmented generation (RAG) and tool use, making it well-suited for workflows that require grounding responses in external data sources or integrating with external APIs and functions. One of Command R's defining characteristics is its 128,000-token context window, which allows it to process long documents, extended multi-turn conversations, and complex inputs in a single pass. It also supports multilingual tasks and is tagged for low-latency performance, making it a practical choice for organizations building scalable AI applications where response speed and contextual accuracy matter. It is best suited for enterprise use cases such as document analysis, agentic pipelines, and knowledge-grounded question answering.

Text Tools Structured Output
Context: 128,000 Output: 4,000 tokens
Input: $0.50 Output: $0.60
View model →
Text

Command R+

Aug 30, 2024

Command R+ is a large language model developed by Cohere, positioned as the company's flagship text generation model for enterprise use. It is available through Amazon Bedrock, allowing organizations to deploy it within AWS's managed cloud infrastructure. The model supports a 128,000-token context window and was trained on data up to January 2023. It is designed specifically for demanding enterprise workloads that require high accuracy and reliability. What distinguishes Command R+ is its purpose-built support for retrieval-augmented generation, enabling it to ground responses in external knowledge sources rather than relying solely on parametric memory. It also supports multi-step tool use and agentic workflows, allowing it to interact with APIs, databases, and other external systems. The model handles multiple languages, making it applicable for global deployments. It is best suited for production applications such as intelligent search, document summarization, customer support automation, and complex data analysis pipelines.

Text Tools Structured Output
Context: 128,000 Output: 4,000 tokens
Input: $3.00 Output: $10.00
View model →

Nvidia

2 models

Text

Nemotron 3 Super 120B

Mar 11, 2026

Open source

Nemotron 3 Super 120B is an open-weight large language model released by NVIDIA in March 2026. It uses a hybrid LatentMoE architecture that combines Mamba-2, Mixture-of-Experts, and Attention layers, activating only 12 billion of its 120 billion total parameters per token. This design allows the model to handle demanding tasks while using significantly less compute than a dense model of comparable parameter count. The model is built for agentic workflows, long-context reasoning, and high-throughput deployments. It supports a context window of up to 1 million tokens and achieves a RULER-100 retrieval score of 91.75 at that length. Nemotron 3 Super 120B also includes a configurable thinking mode for step-by-step reasoning and supports seven languages: English, French, German, Italian, Japanese, Spanish, and Chinese. As an open-weight model, it is suitable for both cloud API and self-hosted use.

Text Tools Structured Output
Context: 1M Output: 16,384 tokens
Input: $0.10 Output: $0.00
View model →
Text

Nemotron 3 Nano 30B

Dec 14, 2025

Nemotron 3 Nano 30B is an open-weight text generation model released by NVIDIA in December 2025 as part of the Nemotron 3 family. It uses a hybrid architecture combining 23 Mamba-2 layers, 23 Mixture-of-Experts (MoE) layers, and 6 Attention layers, with 30B total parameters but only 3.5B active per token. This design allows the model to handle complex tasks while using significantly less compute than a comparable dense model. It supports six languages: English, German, Spanish, French, Italian, and Japanese. The model supports a context window of up to 1 million tokens, making it well-suited for long-document processing, retrieval-augmented generation (RAG), and agentic workflows. On math benchmarks it scores 89.1% on AIME25 without tools and 99.2% with tools, and it achieves 68.3% on LiveCodeBench and 38.8% on SWE-Bench for coding tasks. Its combination of low active-parameter count and long-context capability makes it a practical choice for high-volume or cost-sensitive deployments, edge agents, and instruction-following applications where compute efficiency matters.

Text Tools Structured Output
Context: 262.1K Output: 16,384 tokens
Input: $0.05 Output: $0.20
View model →

HeyGen

1 model

MeiGen

1 model

PixVerse

1 model

Runway

1 model