Explore frontier AI models by provider, pricing, and context

›

GPT-4o Mini

Jul 18, 2024

GPT-4o Mini is a text generation model developed by OpenAI and released in July 2024. It is designed to deliver low-cost, low-latency responses across a wide range of tasks, making it suitable for applications that require fast throughput or high request volumes. The model supports a 128,000-token context window and is compatible with the same range of languages as GPT-4o. GPT-4o Mini is positioned for use cases such as real-time customer interactions, processing large volumes of context, and multimodal reasoning tasks. It performs on academic benchmarks across both textual intelligence and multimodal reasoning, outscoring GPT-3.5 Turbo and other small models in those evaluations. Its combination of speed and affordability makes it a practical choice for developers building cost-sensitive production applications.

Context: 128,000 Output: 16,383 tokens

Input: $0.15 Output: $0.60

›

GPT-4o

May 13, 2024

GPT-4o is a multimodal language model developed by OpenAI, released in May 2024. The "o" stands for "omni," reflecting its ability to accept any combination of text, audio, and image as input and generate any combination of those same modalities as output. It has a 128,000-token context window and a training data cutoff of October 2023. One of GPT-4o's defining characteristics is its audio response latency, which can be as low as 232 milliseconds and averages around 320 milliseconds — comparable to human conversational response times. It is well-suited for applications requiring fast, multimodal interaction, such as voice assistants, image analysis pipelines, and multilingual text processing. OpenAI has noted it offers improved performance on non-English text compared to GPT-4 Turbo, while also being available at a lower API cost.

Context: 128,000 Output: 16,384 tokens

Input: $2.50 Output: $10.00

›

GPT-4 Turbo

Apr 09, 2024

GPT-4 Turbo is a variant of OpenAI's GPT-4 model, released to provide faster response times while retaining the language understanding and generation capabilities of the base GPT-4. It supports a 128,000-token context window, allowing it to process and reason over long documents, extended conversations, or large blocks of text in a single request. The model has a training data cutoff of December 2023 and is available through OpenAI's API. GPT-4 Turbo is designed for use cases where both response quality and speed matter, such as interactive chatbots, real-time content generation, and applications that need to handle lengthy inputs. Its large context window makes it well-suited for tasks like document summarization, multi-turn dialogue, and code generation across large codebases. Developers building latency-sensitive applications often choose this variant over the base GPT-4 for its improved throughput.

Text Image Tools

Context: 128,000 Output: 4,096 tokens

Input: $10.00 Output: $30.00

›

Release date unavailable

Enhanced language understanding and generation for detailed, context-relevant responses.

Context: N/A Output: 2,500 tokens

›

GPT-4.5 Deprecated

Release date unavailable

Increased capacity and nuance compared to predecessors, offering more accurate text generation.

Context: N/A Output: 8,000 tokens

›

Transcription

GPT-4o mini Transcribe

Release date unavailable

GPT-4o mini Transcribe is a speech-to-text model developed by OpenAI that uses the GPT-4o mini architecture to convert spoken audio into written text. It is designed to deliver improved word error rates and more accurate language recognition compared to the original Whisper-based transcription models. The model is part of OpenAI's transcription API offerings and became available in 2025. This model is well-suited for applications that require accurate transcripts from audio input, such as meeting notes, voice interfaces, and content captioning. Its use of the GPT-4o mini backbone allows it to handle a range of languages with improved recognition accuracy. Developers looking for a cost-efficient transcription option within the OpenAI ecosystem can use this model via the API.

Context: 16,000 Output: 2,000

Input: $1.25 Output: $5.00

›

GPT-4o Mini Vision

Release date unavailable

GPT-4o Mini Vision is a multimodal language model developed by OpenAI, released in mid-2024. It is a smaller, more cost-efficient variant of the GPT-4o family, designed to process both text and images within a single context window of 128,000 tokens. The model supports the same range of languages as GPT-4o and is optimized for low latency, making it suitable for high-throughput or real-time applications. The model is well-suited for tasks that require fast responses at scale, such as customer-facing chat interfaces, document analysis with visual content, and pipelines where cost per token is a primary constraint. Its multimodal reasoning capability allows it to interpret images alongside text in the same request. Developers working with large volumes of context or needing to process mixed text-and-image inputs at reduced cost are the primary intended audience.

Context: 128,000 Output: 16,383 tokens

›

Transcription

GPT-4o Transcribe

Release date unavailable

GPT-4o Transcribe is a speech-to-text model developed by OpenAI that uses the GPT-4o model architecture to convert spoken audio into written text. It is part of OpenAI's audio model lineup and was introduced as an improvement over the original Whisper-based transcription models, offering a lower word error rate and more accurate language recognition across a broader range of languages. The model is designed for use cases where transcription accuracy is a priority, such as meeting notes, voice interfaces, medical dictation, and multilingual content. Because it builds on GPT-4o rather than the earlier Whisper architecture, it brings stronger language understanding to the transcription task, which can help with difficult audio conditions, accented speech, and domain-specific vocabulary.

Context: 16,000 Output: 2,000

Input: $2.50 Output: $10.00

›

GPT-4o Vision

Release date unavailable

GPT-4o Vision is a variant of OpenAI's GPT-4o model that accepts both text and image inputs, allowing it to analyze visual content and respond to questions about it. Developed by OpenAI and added to MindStudio in June 2024, it supports a 128,000-token context window and has a training data cutoff of October 2023. The model addresses a historical limitation of language models, which traditionally processed only text, by enabling multimodal input handling within a single system. GPT-4o Vision is well suited for tasks that require interpreting images alongside text, such as describing visual content, answering questions about photographs or diagrams, extracting information from images, and supporting workflows where visual and textual data appear together. Because it shares the GPT-4o architecture, it handles natural language tasks in addition to vision tasks without requiring a separate model. Developers building applications that involve document analysis, image-based Q&A, or mixed-media content can use this model through the OpenAI API.

Context: 128,000 Output: 4,096 tokens

Input: $2.50 Output: N/A

›

Text to Speech

Release date unavailable

TTS HD (model ID: tts-1-hd) is a text-to-speech model developed by OpenAI that converts written text into natural-sounding spoken audio. It accepts a text input of up to 4096 tokens and produces audio output in a variety of supported voices. TTS-1-HD is the quality-optimized variant in OpenAI's TTS model family, designed to produce higher-fidelity audio compared to the standard TTS-1 offering. The model is well-suited for applications that require clear, natural-sounding voice output, such as voice assistants, audiobook narration, accessibility tools, and content creation workflows. It supports multiple built-in voices and can output audio in formats including MP3, Opus, AAC, and FLAC. Developers access the model through OpenAI's API, and it is available on MindStudio without requiring separate API key management.

Context: N/A Output: N/A

Input: $30.00 Output: N/A

›

Transcription

Whisper

Release date unavailable

Jun 17, 2025

Gemini 2.5 Flash Vision is a multimodal vision model developed by Google, designed to process and reason over visual inputs alongside text. It is part of the Gemini 2.5 Flash family, which is built around balancing cost efficiency with broad capability coverage. The model supports a context window of 1,048,576 tokens, making it suitable for tasks that require processing large amounts of information in a single request. It was trained with a knowledge cutoff of June 2025. This model is positioned for use cases where real-time or low-latency responses are important, such as visual question answering, document analysis with images, and applications that combine vision with extended context. The "thinking" architecture underlying the Gemini 2.5 Flash series enables the model to apply multi-step reasoning before producing a response. Developers looking for a vision-capable model that can handle long documents, images, and mixed-modality inputs without incurring the cost of larger models will find this a practical option.

Context: 1,048,576 Output: 65,535 tokens

Input: $0.30 Output: N/A

›

Gemini 2.5 Pro

Jun 17, 2025

Gemini 2.5 Pro is a thinking model developed by Google DeepMind, designed to reason through complex problems rather than simply predict outputs. It is built to analyze information, draw logical conclusions, and incorporate contextual nuance across tasks in code, mathematics, and STEM. The model supports native multimodality, meaning it can process text, images, audio, video, and code repositories within a single context. The model features a 1,048,576-token context window, making it suited for tasks that require processing large documents, entire codebases, or extended conversations. It scored 63.8% on the SWE-Bench Verified coding evaluation and is available through the Gemini API and Google AI Studio. It is best suited for developers and researchers working on complex reasoning tasks, long-document analysis, and advanced code generation.

Context: 1,048,576 Output: 65,536 tokens

Input: $1.25 Output: $10.00

›

Feb 25, 2025

Gemini 2.0 Flash-Lite Vision is a multimodal model developed by Google, designed to process both visual and textual inputs. It belongs to the Gemini 2.0 Flash family and is positioned as the fastest and most cost-efficient option within that lineup. The model supports a context window of over one million tokens, making it suitable for tasks that require processing large amounts of information in a single request. It was trained on data up to June 2024. This model is intended as an upgrade path for users of Gemini 1.5 Flash who want improved output quality without changes to cost or latency. Its vision capabilities allow it to handle image understanding tasks alongside text-based workflows. The combination of speed, large context support, and multimodal input handling makes it well-suited for applications such as document analysis, image captioning, and high-throughput pipelines where cost efficiency is a priority.

Context: 1,048,576 Output: 8,192 tokens

Input: $0.08 Output: N/A

›

Gemini 2.0 Flash

Feb 05, 2025

Gemini 2.0 Flash is a text generation model developed by Google, released as part of the Gemini 2.0 model family. It features a context window of 1,048,576 tokens and is designed to handle a broad range of everyday tasks with real-time response latency. The model's training data has a cutoff of June 2024. Gemini 2.0 Flash is positioned as an upgrade for users of the 1.5 Flash model who want meaningfully improved output quality, and for users of the 1.5 Pro model who want comparable or slightly improved quality at lower latency and cost. It is well-suited for applications that require processing long documents, maintaining extended conversations, or running high-throughput workloads where response speed matters.

Context: 1,048,576 Output: 8,192 tokens

Input: $0.15 Output: $0.40

›

Gemini 2.0 Flash Vision

Feb 05, 2025

Gemini 2.0 Flash Vision is a multimodal language model developed by Google, designed to process and reason over text, images, and other input types within a single context window of up to 1,048,576 tokens. It is part of the Gemini 2.0 Flash family, which emphasizes speed and efficiency alongside broad capability coverage including built-in tool use and multimodal generation. The model's training data has a cutoff of June 2024. Gemini 2.0 Flash Vision is well-suited for tasks that require understanding visual content alongside large volumes of text, such as document analysis, image-based question answering, and long-context reasoning. Its large context window makes it practical for workflows involving lengthy documents or multi-turn conversations that incorporate both images and text. The model is accessible through Google's Vertex AI platform and is intended for developers building applications that need fast, multimodal processing at scale.

Context: 1,048,576 Output: 8,192 tokens

›

Imagen 4 Fast

Release date unavailable

Imagen 4 Fast is a text-to-image generation model developed by Google, available through the fal.ai platform as a fast inference variant. It accepts natural language text prompts and produces images across a range of visual styles, including photorealistic scenes, illustrations, concept art, and painterly aesthetics. The model is designed to interpret complex, multi-element prompts and render fine details such as textures, fabrics, and lighting with precision. It was trained with data through early 2025 and is available for commercial use via fal.ai. Imagen 4 Fast is optimized for workflows that require rapid image generation without reducing visual fidelity. It supports a context window of up to 10,000 tokens, allowing for detailed and descriptive prompts. The model is well-suited for creative professionals, developers, and businesses building applications around product imagery, storytelling visuals, or concept development. Input types include text prompts, selection parameters, and a seed value for reproducible outputs.

›

Gemini 1.0 Pro Vision Deprecated

Release date unavailable

Handles both text and image inputs for content generation and problem-solving.

Context: N/A Output: 1,024 tokens

›

Gemini 1.5 Flash Deprecated

Release date unavailable

Speedy, cost-effective multimodal model for high-volume applications without compromising quality.

›

Gemini 1.5 Flash Vision Deprecated

Release date unavailable

Fast, cost-effective multimodal model for quality applications at high volume.

›

Gemini 1.5 Pro Deprecated

Release date unavailable

Proficient at multimodal tasks and content creation from image, audio, and video inputs.

›

Release date unavailable

Veo 2 is Google's production-ready video generation model, released in April 2025 via the Gemini API under the model ID veo-2.0-generate-001. It accepts both text prompts and reference images as input, generating high-definition video output at resolutions up to 4K. The model includes physics-aware rendering that handles fluid dynamics, lighting, and object interactions, and it embeds SynthID watermarking in all generated videos to identify AI-created content. Veo 2 is available through both the Gemini API and Google's Vertex AI platform, making it accessible to developers via standard API calls without specialized infrastructure. It supports cinematic prompt controls such as aerial shots, panning, and time-lapses, and maintains consistent character appearance across scenes. The model is suited for developers, marketers, creative professionals, and educators who need to generate video content programmatically for use cases like product demos, ad campaigns, and educational visualizations.

Context: 5,000 Output: N/A

Mistral

17 models

›

Mistral Large 3

Dec 02, 2025

Open source

Mistral Large 3 is a 675-billion-parameter mixture-of-experts (MoE) text generation model developed by Mistral. It is the first MoE model Mistral has released since the Mixtral series, and was trained from scratch on 3,000 NVIDIA H200 GPUs. The model is released under a permissive open-weight license, making the weights publicly available for download and self-hosting. Mistral Large 3 supports a 256,000-token context window and includes image understanding alongside text generation. It is particularly noted for multilingual conversation handling, with Mistral highlighting non-English and non-Chinese language performance as a focus area. The model is well-suited for tasks requiring long-context reasoning, multilingual text processing, and instruction following across general-purpose prompts.

Input: $0.50 Output: $1.50

›

Mistral Medium 3

May 07, 2025

Mistral Medium 3 is a text generation model released on May 7, 2025 by Mistral, a French AI company. It is designed to balance performance with cost efficiency, priced at $0.40 per million input tokens and $2.00 per million output tokens. The model supports a 128,000-token context window and was trained on data through early 2025. It is available through Mistral La Plateforme and Amazon SageMaker, with additional platform support planned. Mistral Medium 3 is built with enterprise deployment in mind, supporting self-hosted setups with a minimum of four GPUs as well as any cloud environment. It can be customized through continuous pre-training, fine-tuning, and integration with enterprise knowledge bases, making it applicable to domain-specific workflows in sectors such as financial services, energy, and healthcare. The model is noted for its strengths in coding tasks and multimodal understanding, and is suited for use cases including customer service automation, business process personalization, and complex dataset analysis.

Input: $0.40 Output: $2.00

›

Mistral Nemo

Jul 19, 2024

Mistral NeMo is a text generation model developed by Mistral, a French AI company. It features a 128,000-token context window and is trained with function calling support, making it suitable for agentic and tool-use workflows. The model has particular strength across eleven languages: English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. Mistral NeMo is a 12-billion parameter model built in collaboration with NVIDIA, which is reflected in the "NeMo" name referencing NVIDIA's NeMo framework. It is designed for developers and organizations building multilingual applications where broad language coverage and a large context window are priorities. The model's combination of function calling capability, multilingual training, and long-context handling makes it a practical choice for global deployment scenarios.

Context: 128,000 Output: 64,000 tokens

Input: $0.15 Output: $0.04

›

Mixtral 8x22B Instruct Deprecated

Apr 17, 2024

Mistral's official instruct fine-tuned version of [Mixtral 8x22B](/models/mistralai/mixtral-8x22b). It uses 39B active parameters out of 141B, offering unparalleled cost efficiency for its size. Its strengths include: - strong math, coding,...

Text File Tools

Context: 65.5K Output: 64,000 tokens

Input: $2.00 Output: $6.00

›

Mistral 7B Instruct

Oct 10, 2023

Mistral 7B Instruct is a 7-billion-parameter language model developed by Mistral AI and released in September 2023. It is the instruction-tuned variant of the base Mistral 7B model, fine-tuned to follow user instructions and produce clear, direct responses. The model uses grouped-query attention (GQA) and sliding window attention (SWA) techniques, which allow it to handle sequences efficiently within its 4,096-token context window. This model is well-suited for instruction-following tasks such as conversational AI, content summarization, and task-oriented dialogue. Because it is optimized to adhere closely to user-provided instructions, it performs consistently in structured workflows where predictable output format matters. It is available through Amazon Bedrock and is also openly accessible on Hugging Face, making it usable in a range of deployment environments.

Context: 4,096 Output: 2,500 tokens

›

Mistral 7B Instruct Deprecated

Oct 10, 2023

Focused on instruction-based tasks, providing clear, concise responses adhering to user instructions.

Context: N/A Output: 2,500 tokens

›

Ministral 3 14B

Release date unavailable

Ministral 3 14B is the largest model in the Ministral 3 family, developed by Mistral AI. It is an open-source text generation model with a 256,000-token context window, designed to handle long-form inputs and extended conversations. The model is released under an open license, making it available for local deployment and self-hosted use cases. The model is optimized for running on diverse hardware configurations, including consumer-grade local setups, which makes it suitable for developers and researchers who prefer on-device inference. Its 14 billion parameter count positions it as the largest variant in the Ministral 3 series. Common use cases include text generation, summarization, instruction following, and tasks that benefit from a large context window without requiring cloud-based infrastructure.

Input: $0.20 Output: N/A

›

Ministral 3 3B

Release date unavailable

Ministral 3 3B is a 3-billion-parameter language model developed by Mistral AI as part of the Ministral 3 family. It is the smallest model in that family and is released as open-weight, meaning the model weights are publicly available for download and local use. The model supports a 256,000-token context window and includes both language and vision capabilities in a compact form factor. Ministral 3 3B is designed specifically for edge deployment, making it suitable for running on local hardware, embedded systems, and resource-constrained environments. Its small parameter count allows it to operate efficiently across a wide range of hardware configurations without requiring cloud infrastructure. It is well-suited for developers building on-device applications, offline workflows, or latency-sensitive pipelines where a smaller footprint is a requirement.

Input: $0.10 Output: N/A

›

Ministral 3 8B

Release date unavailable

Ministral 3 8B is a text generation model developed by Mistral AI, part of the Ministral 3 model family. It is open source and designed with edge deployment in mind, meaning it is optimized to run efficiently across a range of hardware configurations, including local setups without cloud infrastructure. The model supports a 256,000-token context window, enabling it to process and reason over long documents in a single pass. Ministral 3 8B is well-suited for developers and organizations that need a capable language model deployable on-device or in resource-constrained environments. Its 8-billion parameter size makes it practical for local inference while still handling a broad range of text generation tasks. The open-source availability means it can be downloaded, fine-tuned, and self-hosted without requiring API access.

›

Mistral 8x7b Deprecated

Release date unavailable

Mixtral 8x7B is a high-performance mixture-of-experts language model from Mistral AI, offering a 32K token context window with efficient, fast inference.

›

Mistral Codestral

Release date unavailable

Mistral Codestral is an open-weight generative AI model built by Mistral and designed specifically for code generation tasks. It operates through a shared instruction and completion API endpoint, allowing developers to both write new code and interact with existing codebases. The model is trained on a dataset spanning more than 80 programming languages, including Python, Java, C, C++, JavaScript, Bash, Swift, and Fortran. Codestral is intended for developers building AI-assisted coding tools and applications, as it handles both code and English fluently. Its broad language coverage makes it applicable across a wide range of development environments and project types. Because it is open-weight, it can be deployed and integrated in ways that closed models typically do not permit.

Context: 32,000 Output: 16,000 tokens

Input: $0.20 Output: N/A

›

Mistral Large 24.02

Release date unavailable

Mistral Large 24.02 is a text generation model developed by Mistral, built around 123 billion parameters and designed to run on a single node for large-throughput inference. It features a 128,000-token context window, making it suited for long-document processing and extended conversational tasks. The model supports dozens of natural languages, including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean. Beyond natural language, Mistral Large 24.02 supports over 80 programming languages, including Python, Java, C, C++, JavaScript, and Bash, making it applicable to code generation and analysis tasks. Its single-node inference design means it can deliver high throughput without requiring distributed infrastructure. This combination of broad language coverage, large context capacity, and coding support makes it well-suited for multilingual applications, long-context document workflows, and software development assistance.

Input: $4.00 Output: N/A

›

Mistral Large 24.07

Release date unavailable

Mistral Large 24.07 is a text generation model developed by Mistral, released in July 2024 as the second iteration of their Large series. It features 123 billion parameters and a 128,000-token context window, making it suitable for long-document processing and extended conversational tasks within a single inference node. The model supports dozens of natural languages, including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean. One of the model's defining characteristics is its design for single-node inference, meaning the full 123B parameter model can run at high throughput without requiring multi-node infrastructure. It also supports over 80 coding languages, including Python, Java, C, C++, JavaScript, and Bash, making it applicable to software development workflows. On MindStudio, it is available through Amazon Bedrock under the identifier mistral-large-24.07-bedrock.

Input: $2.00 Output: N/A

›

Mistral Small 24.02

Release date unavailable

Mistral Small 24.02 is a text generation model developed by Mistral, designed to run on a single node while supporting a 128,000-token context window. It covers dozens of natural languages including French, German, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean, as well as over 80 coding languages such as Python, Java, C, C++, JavaScript, and Bash. The model has 123 billion parameters, which enables high-throughput inference without requiring multi-node infrastructure. This model is well-suited for long-context applications where fitting large documents or extended conversations into a single prompt is necessary. Its broad language coverage makes it applicable to multilingual workflows, while its coding language support makes it useful for code generation and analysis tasks. The single-node inference design is a practical consideration for teams managing deployment costs and infrastructure complexity.

Input: $1.00 Output: N/A

›

Mistral Small 3.1 (25.03)

Release date unavailable

Mistral Small 3.1 (25.03) is a text generation model developed by Mistral, released in March 2025. It features a 128,000-token context window, multimodal understanding, and support for dozens of spoken languages alongside more than 80 coding languages. The model is designed to run on a single node, making it practical for deployment without distributed infrastructure. This version introduces improved text performance and expanded context handling compared to earlier Mistral Small releases. At an inference speed of approximately 150 tokens per second, it is suited for tasks that require both throughput and long-context processing, such as document analysis, multilingual applications, and code generation. Its combination of broad language coverage and single-node efficiency makes it a practical choice for developers building production applications with constrained compute budgets.

Input: $0.10 Output: N/A

›

Mixtral 8x7B Instruct

Release date unavailable

Mixtral 8x7B Instruct is a sparse mixture-of-experts (SMoE) language model developed by Mistral AI and released under the Apache 2.0 license. It uses a routing mechanism that activates only a subset of its expert networks per token, allowing it to draw on a large total parameter count while keeping active computation lower than a dense model of equivalent size. The instruct variant has been fine-tuned to follow instructions and engage in conversational tasks. The model has a context window of 4,096 tokens and was trained on data through September 2023. Its open-weight, permissive license makes it suitable for commercial and research use cases where model access and reproducibility matter. It is well-suited for tasks such as text generation, summarization, question answering, and general instruction following.

Context: 4,096 Output: 2,500 tokens

Input: $0.45 Output: N/A

›

Mixtral 8x7B Instruct Deprecated

Release date unavailable

High-quality, efficient sparse model outperforming larger models in speed and benchmarks.

Release date unavailable

Grok 2 Vision (grok-2-vision-1212) is a multimodal language model developed by xAI and released in December 2024. It accepts combined image and text inputs and is designed to understand, analyze, and respond to visual content alongside natural language. The model supports images up to 20MiB in JPG, JPEG, or PNG format and can process inputs in any order. It also includes multilingual support and improved instruction-following compared to earlier Grok vision releases. Grok 2 Vision is suited for production use cases that require visual comprehension, such as image captioning, visual question answering, chart and document analysis, and building AI assistants that respond to visual inputs. It supports tool calling and structured outputs, making it straightforward to integrate into developer workflows. With a 32,768-token context window, it can handle moderately long conversations that mix text and image content.

Context: 32,768 Output: 1M

Input: $2.00 Output: $2.50

›

Grok 3

Release date unavailable

Grok 3 is the flagship large language model from xAI, developed and released in February 2025. It was built from the ground up in approximately one year and is designed to handle demanding tasks including advanced reasoning, coding, and creative writing. The model is available via API under the identifier grok-3-latest and supports a context window of 131,072 tokens. It includes a dedicated Thinking mode that enables multi-step reasoning on complex problems. Grok 3 is well-suited for tasks that require structured, multi-step problem solving, such as scientific research, advanced mathematics, and complex software development. It scored 96% on AIME, a challenging mathematics competition benchmark, and 85% on GPQA, a graduate-level science reasoning benchmark. The model also supports image understanding, function calling, and structured output generation, making it usable across a range of developer and research workflows. It ranked first in creative writing evaluations at the time of its release.

Context: 131,072 Output: 8,192 tokens

Input: $3.00 Output: $2.50

›

Grok 3 Fast

Release date unavailable

Grok 3 Fast is a performance-optimized variant of xAI's Grok 3 model, released in April 2025 as part of the Grok 3 family. It is designed to deliver faster response times compared to the standard Grok 3 Beta while retaining the same core language understanding, function calling, and web search capabilities. The model supports a 131,072-token context window, making it capable of handling long documents and extended multi-turn conversations. Grok 3 Fast is best suited for applications where response latency matters, such as real-time chat interfaces, high-throughput processing pipelines, and interactive AI assistants. Its support for function calling allows developers to integrate external tools and APIs, enabling agentic workflows that can act on live information. The model exposes an OpenAI-compatible API, which simplifies adoption for developers already working within that ecosystem.

Context: 131,072 Output: 8,192 tokens

Input: $5.00 Output: N/A

›

Grok 3 Mini

Release date unavailable

Grok 3 Mini Beta is a compact text generation model developed by xAI, the AI division of X. It is designed as a thinking model, meaning it reasons through problems step by step before producing a final answer, and it exposes that reasoning trace so users can follow the model's logic in full. The model supports adjustable reasoning effort, defaulting to a lower setting for speed but allowing a high-effort mode for more demanding problems. It has a 131,072-token context window and was trained with data up to April 2025. Grok 3 Mini is best suited for tasks that rely heavily on structured reasoning rather than broad world knowledge — including math problems, logic puzzles, coding challenges, and quantitative analysis. According to xAI's published benchmarks, it scores 95.8% on AIME 2024 and 80.4% on LiveCodeBench. It also supports function calling and web search, making it usable in agentic workflows. Epoch AI has noted that with high reasoning effort, Grok 3 Mini outperforms the larger Grok 3 model on math benchmarks.

Context: 131,072 Output: 8,192 tokens

Input: $0.30 Output: N/A

›

Grok 4.3 Vision

Release date unavailable

Structured model profile with pricing, context, and capability details.

Context: N/A Output: 2,000,000 tokens

Input: $1.25 Output: N/A

›

Grok Imagine

Release date unavailable

Grok Imagine (grok-imagine-image) is a text-to-image generation model developed by xAI, the AI company founded by Elon Musk. It is part of the Grok Imagine family and allows users and developers to generate high-resolution images from plain-language text descriptions. The model was unveiled in early August 2025 and expanded from a subscriber-only feature to a broadly available tool accessible via xAI's developer API. It sits alongside the more premium grok-imagine-image-pro variant, serving as the standard, faster option in the family. Grok Imagine supports up to 300 requests per minute, making it suited for applications that require image generation at volume. It accepts a 131,072-token context window and is accessible through xAI's API for integration into apps, tools, and workflows. The model is best suited for developers and creators who need reliable, high-throughput image generation for use cases such as prototyping, content creation, or building image-powered products. No artistic expertise is required to use it — a text description is sufficient to produce a detailed image.

Context: 131,072 Output: 500k

Input: $1.25 Output: $2.50

›

Grok Imagine Pro

Release date unavailable

Grok Imagine Pro is xAI's advanced text-to-image generation model, sitting at the top of xAI's image generation lineup above the standard grok-imagine-image. Published under the X brand, it accepts text prompts along with image URL inputs and selection parameters to produce detailed visual outputs. The "pro" designation reflects its position as the higher-quality tier within xAI's image generation offerings. Grok Imagine Pro is well-suited for developers and creators who require high-fidelity AI-generated imagery within production pipelines or creative workflows. It supports a context window of 131,072 tokens, allowing for detailed and nuanced text prompts. Use cases include content generation, creative projects, and any application where prompt adherence and image detail are priorities.

Release date unavailable

Structured model profile with pricing, context, and capability details.

Release date unavailable

FLUX.2 [turbo] is an image generation model developed by Black Forest Labs, designed to convert text descriptions into images across a wide range of styles including photorealistic scenes, illustrations, concept art, and character design. The model supports resolutions up to 2K and 4MP output, with a context window of 10,000 tokens for prompt input. It accepts select and seed inputs, giving users control over style options and reproducibility of results. The model is positioned for workflows where generation speed is a priority, producing images in approximately 10 seconds at 4MP resolution. It supports multiple aspect ratios including 1:1, 16:9, 9:16, 4:3, and 3:4, making it adaptable for different creative and commercial formats. FLUX.2 [turbo] is well-suited for graphic designers, visual marketers, and developers integrating image generation into applications via API.

ByteDance

10 models

›

Seedream 4.5

Release date unavailable

Seedream 4.5 is an image generation and editing model developed by ByteDance, built on a unified architecture that handles both creating images from text prompts and modifying existing images within a single system. It accepts up to 10 reference images in one request, enabling multi-source compositing workflows such as product swaps and element transfers between images while preserving depth, perspective, and lighting consistency. The model supports output resolutions up to 2048×2048 pixels across multiple aspect ratios including 1:1, 16:9, and 9:16. One of Seedream 4.5's most documented characteristics is its text rendering accuracy — it can produce correctly spelled, legible text in various font styles and non-Latin scripts integrated naturally into scenes like signs, packaging, and posters. The model ranks 10th globally on the LM Arena leaderboard with a score of 1147 and was trained through December 2025. It is well suited for designers, marketers, and e-commerce teams who need production-ready visuals driven by natural language prompts, without requiring manual selection tools or layer-based editing workflows.

›

Lip Sync

LatentSync

Release date unavailable

LatentSync is an end-to-end lip-synchronization model developed by ByteDance that takes a source talking-head video and a target audio track as inputs and produces a new video with mouth movements precisely aligned to the provided speech. It is built on audio-conditioned latent diffusion, meaning it operates directly in the latent space rather than relying on 3D meshes or 2D facial landmarks. The model supports multiple languages and accents, works with diverse speakers and recording conditions, and handles both real people and stylized characters such as anime. A key technical feature of LatentSync is Temporal REPresentation Alignment (TREPA), a technique designed to reduce flicker, jitter, and frame-to-frame artifacts, keeping head pose and lip motion stable across longer sequences. The model is recommended for use with high-resolution source videos — 720p, 1080p, or 4K — paired with clean audio recordings. It is well suited for workflows involving lip-syncing, audio dubbing, digital human creation, and video-audio alignment.

Context: 50,000 Output: N/A

›

DreamActor V2

Release date unavailable

DreamActor V2 is a video generation model developed by ByteDance that animates static images by transferring motion from a reference driving video onto a target character. It is the second generation of ByteDance's DreamActor series and was made available in February 2026. Rather than relying on skeleton extraction or pose estimation pipelines, it uses a spatiotemporal in-context learning framework that reads motion directly from raw video pixels, which allows it to handle character types that traditional pose-based methods struggle with, including animals, cartoon mascots, fantasy creatures, and 3D renders. DreamActor V2 accepts two inputs — a character image and a driving video — and produces animated video outputs up to 15 seconds at 720p resolution across a range of aspect ratios. It transfers facial expressions, head orientation, eye direction, lip movement, hand gestures, and full-body motion while maintaining the structural consistency of the source character across frames. This makes it applicable to use cases such as social media content creation, brand animation, virtual avatar production, game asset prototyping, and educational video generation.

Context: 1000 Output: N/A

›

Lip Sync

Release date unavailable

Context: 50K Output: 10,000 tokens

›

Seedance 2.0 Fast Turbo

Release date unavailable

Context: 50K Output: 10,000 tokens

›

Seedream 4.0

Release date unavailable

Seedream 4.0 is an image generation model developed by ByteDance, designed to produce images from text prompts and source image inputs. It supports a context window of 10,000 tokens and accepts image URL arrays alongside numerical parameters, enabling flexible control over generation behavior. The model is part of ByteDance's Seedream series and is available through MindStudio's model catalog. Seedream 4.0 is best suited for workflows that require image generation guided by reference images, making it useful for tasks like style transfer, image variation, and visually consistent content creation. Its support for source image inputs distinguishes it from purely text-to-image pipelines, allowing users to anchor outputs to existing visual references. Developers can integrate it into MindStudio applications without managing separate API keys.

›

Seedream 5.0 Lite

Release date unavailable

Jan 27, 2026

Kimi K2.5 is an open-source multimodal model developed by Moonshot AI and released in January 2026. It uses a Mixture-of-Experts architecture with 1 trillion total parameters and approximately 32 billion active at inference time, trained on roughly 15 trillion mixed visual and text tokens. Unlike models that add vision as a secondary capability, Kimi K2.5 was trained natively on both image and text data, enabling integrated understanding of charts, documents, video, and code. The model supports two operating modes — Instant Mode for direct responses and Thinking Mode for step-by-step reasoning on complex problems — within a 256,000-token context window. It introduces an Agent Swarm paradigm that can coordinate up to 100 parallel sub-agents, reducing execution time by 4.5x on parallelizable tasks. Kimi K2.5 is released under a modified MIT license, making it available for local deployment, fine-tuning, and commercial use, and is particularly suited for visual programming, document analysis, automated research, and multi-step agentic workflows.

Text Image Tools

Context: 262,144 Output: 16,384 tokens

Input: $0.45 Output: $1.90

›

DeepSeek V3.2

Dec 01, 2025

DeepSeek-V3.2 is an open-weight large language model developed by DeepSeek and released on December 1, 2025. It uses a Mixture-of-Experts architecture combined with a novel sparse attention mechanism called DeepSeek Sparse Attention (DSA), which reduces computational complexity to near-linear scale (O(kL)) for long-context tasks. The model supports a 160,000-token context window and is available under the MIT License on Hugging Face. DeepSeek-V3.2 introduces three notable technical advances: a scalable reinforcement learning training framework, a large-scale agentic task synthesis pipeline covering over 1,800 environments and 85,000+ complex instructions, and native support for Thinking in Tool-Use — the ability to reason while invoking external tools in both thinking and non-thinking modes. It is best suited for complex multi-step reasoning, agentic workflows involving search and code execution, long-context document processing, and developers building AI applications that require integrated reasoning and tool use.

Context: 160,000 Output: 8,000 tokens

Input: $0.26 Output: $0.38

›

DeepSeek V3.1

Aug 21, 2025

DeepSeek-V3.1 is a 671-billion parameter large language model developed by DeepSeek, using a Mixture-of-Experts (MoE) architecture that activates 37 billion parameters at any given time. It supports a 128,000-token context window and was trained through August 2025, with an enhanced base model built using a two-phase long-context extension process that included 630 billion tokens at the 32K phase and 209 billion tokens at the 128K phase. The model accepts text input and produces text output across a wide range of general-purpose tasks. What distinguishes DeepSeek-V3.1 from earlier versions is its hybrid thinking design: a single model that can operate in a fast conversational mode or a slower step-by-step reasoning mode, selectable through prompting rather than requiring a separate model. Post-training improvements have also focused on tool use and agentic workflows, including multi-step API calls, web search, and code execution. This makes it well-suited for coding, mathematical reasoning, long-document analysis, and complex multi-turn agent tasks.

Context: 128,000 Output: 8,000 tokens

Input: $0.27 Output: $0.79

›

DeepSeek-R1

Jan 22, 2025

DeepSeek-R1 is a text generation model developed by DeepSeek, a Chinese AI company. It is a reasoning-focused model that generates a Chain of Thought (CoT) before producing a final answer, a technique designed to improve accuracy on multi-step problems. The model was trained through late 2024 and supports a context window of 64,000 tokens. DeepSeek released the model weights publicly, making it available for local deployment and research use. DeepSeek-R1 is well suited for tasks that benefit from structured reasoning, such as mathematics, logic puzzles, coding challenges, and scientific problem-solving. Because the model externalizes its reasoning steps before answering, users can inspect the thought process that led to a given response. DeepSeek also released a series of distilled versions of R1 based on smaller base models, broadening its accessibility across different hardware configurations.

Context: 64,000 Output: 8,000 tokens

Input: $0.55 Output: N/A

›

DeepSeek-V3

Dec 26, 2024

DeepSeek-V3 is a large language model developed by DeepSeek, a Chinese AI company. It is a general-purpose text generation model designed to handle a wide range of tasks including coding, reasoning, summarization, and open-ended conversation. The model supports a 128,000-token context window and was trained on data through late 2024. It is identified by the model ID deepseek-chat and is available via API. DeepSeek-V3 uses a Mixture-of-Experts (MoE) architecture with 671 billion total parameters, activating 37 billion per forward pass, which allows it to maintain efficiency at scale. The model was trained using an optimized pipeline that includes multi-token prediction and FP8 mixed-precision training. It is well-suited for tasks that require long-context understanding, instruction following, and multi-step reasoning across technical and general domains.

Context: 128,000 Output: 8,000 tokens

Input: $0.27 Output: $0.89

›

DeepSeek R1 Turbo

Release date unavailable

DeepSeek R1 Turbo is a text generation model developed by DeepSeek, designed as an accelerated variant of the R1 reasoning model family. It retains the chain-of-thought reasoning capabilities of the base R1 model while incorporating architectural and inference optimizations aimed at reducing latency. The model supports a 128,000-token context window and was trained on data through late 2024. It accepts text input and produces text output across a wide range of analytical and generative tasks. DeepSeek R1 Turbo is particularly well-suited for applications where multi-step reasoning is required but response time is a practical constraint. Common use cases include coding assistance, mathematical problem-solving, logical deduction, and structured analytical workflows. Developers building interactive tools or real-time applications that depend on reasoning-intensive outputs are the primary intended audience for this model.

Context: 128,000 Output: 8,000 tokens

Input: $1.00 Output: N/A

›

DeepSeek-V3 Deprecated

Release date unavailable

General-purpose LLM from Chinese AI company DeepSeek.

Release date unavailable

Kling Video O3, also known as Kling 3.0 Omni, is a video generation model developed by Kuaishou and launched in February 2026. It is the premium tier of the Kling 3.0 model family, designed specifically for structured, multi-shot storytelling rather than single isolated clips. The model accepts text, images, and video as inputs, and uses Multimodal Visual Language (MVL) technology to reason about scene composition, spatial relationships, and motion in a unified pass. It supports clip lengths of up to 15 seconds across up to six distinct shots generated in a single request. Kling Video O3 is built for workflows where visual consistency is critical — such as brand marketing, recurring character content, and cinematic pre-production. It preserves a subject's exact appearance, including facial features, clothing, logos, and on-screen text, across shots and scene transitions when a reference image or video is provided. The model also generates synchronized audio natively alongside video, covering ambient sound, dialogue, and multilingual lip-sync without requiring separate post-production. It is best suited for production scenarios where a character, product, or campaign identity has already been defined and consistent output at scale is the goal.

Release date unavailable

Perplexity's latest model family surpassing earlier versions in cost-efficiency, speed, and performance.

Release date unavailable

Context: 50K Output: 10,000 tokens

›

Wan 2.7

Release date unavailable

Alibaba's powerful multimodal AI model that generates cinematic 1080p video with native audio synchronization, multi-shot storytelling, and advanced image creation.

Context: 2K Output: N/A

›

Wan 2.7 Pro

Release date unavailable

Alibaba's powerful multimodal AI model that generates cinematic 1080p video with native audio synchronization, multi-shot storytelling, and advanced image creation.

Release date unavailable

Release date unavailable

Z Image Turbo Controlnet is an image generation model developed by Alibaba's Tongyi-MAI lab, built on a single-stream diffusion transformer architecture with 6 billion parameters. It uses a few-step distillation approach (the Turbo variant) to accelerate inference while preserving output quality, and incorporates ControlNet to allow structural guidance from a source image. The model was trained with a multi-level captioning system and a data infrastructure that includes a Cross-modal Vector Engine and World Knowledge Topological Graph to improve semantic alignment between prompts and outputs. This model is well-suited for workflows that require both speed and structural control over generated images, such as guided creative generation, image editing pipelines, and rapid prototyping. It accepts image URLs as source inputs alongside configurable parameters including seed values for reproducibility. An RLHF alignment pipeline using DPO and GRPO stages was applied to bring outputs closer to human aesthetic preferences, and a built-in prompt enhancer with reasoning chain helps produce better results from short or underspecified prompts.

Wavespeed

6 models

›

Chroma

Release date unavailable

Chroma is an 8.9 billion parameter text-to-image model developed by WaveSpeed AI, built on the FLUX.1-schnell architecture. It was trained using over 105,000 hours of NVIDIA H100 GPU time, with a dataset curated from 5 million selected images. The model is designed around a philosophy of unrestricted creative expression, removing the content filters found on many mainstream image generation platforms. It supports image output up to 1536×1536 pixels and is noted for clean renders, natural lighting, strong color harmony, and anatomical accuracy in human figures, hands, and faces. Chroma is well-suited for commercial photography, digital illustration, character design, concept art, and medical or educational illustration where content restrictions would otherwise be a barrier. It handles complex, multi-element scenes involving people, props, and environments with strong prompt adherence. The model responds particularly well to structured prompts organized around subject, context, style, lighting, camera, and mood. It is available through WaveSpeed AI and is optimized for both single-shot and batch generation workflows.

›

Hunyuan3D V2 Multi-View

Release date unavailable

Generate a 3D model from front, back, and side reference images with optional textured output.

›

Hunyuan3D V3

Release date unavailable

Structured model profile with pricing, context, and capability details.

›

Meshy 6

Release date unavailable

Structured model profile with pricing, context, and capability details.

›

SAM 3D Objects

Release date unavailable

Structured model profile with pricing, context, and capability details.

›

Tripo3D v2.5

Release date unavailable

Release date unavailable

Stable Image Ultra is Stability AI's flagship text-to-image generation model, designed to produce high-quality, photorealistic images from natural language text prompts. It sits at the top of Stability AI's image generation lineup and accepts concise text descriptions — up to a 77-token context window — to generate detailed visuals with strong coherence and fidelity. The model supports configurable inputs including text prompts, selection parameters, and a seed value for reproducible outputs. Stable Image Ultra is well-suited for applications such as marketing visuals, concept art, product visualization, and editorial illustration. It is available through Stability AI's own API and via AWS Bedrock, making it accessible for production-scale deployments without requiring infrastructure management. Developers and enterprises can integrate it directly into existing workflows through these managed cloud platforms.

Context: 77 Output: N/A

Z.ai

5 models

›

GLM 5.1

Apr 07, 2026

Open source

GLM-5.1 delivers a major leap in coding capability, with particularly significant gains in handling long-horizon tasks. Unlike previous models built around minute-level interactions, GLM-5.1 can work independently and continuously on...

Context: 202.8K Output: 16,384 tokens

Input: $1.40 Output: $3.08

›

GLM 5

Feb 11, 2026

GLM-5 is a 744-billion-parameter Mixture-of-Experts language model developed by Z.ai (formerly Zhipu AI), released in February 2026 under the MIT license. It activates 40 billion parameters per token and supports a 200,000-token context window, making it suited for tasks that require processing large volumes of text in a single pass. The model was pre-trained on 28.5 trillion tokens and incorporates DeepSeek Sparse Attention to reduce inference costs while maintaining long-context performance. GLM-5 is designed primarily for agentic workflows, autonomous software engineering, tool use, and long-horizon planning tasks. A notable aspect of its development is that it was trained entirely on Huawei Ascend chips using the MindSpore framework, with no dependency on NVIDIA hardware. It also introduces an asynchronous reinforcement learning training system called slime, which improves training throughput and enables more fine-grained post-training alignment. The model is freely available for both research and commercial use under its MIT license.

Context: 202.8K Output: 16,384 tokens

Input: $0.80 Output: $1.92

›

GLM 4.7

Dec 22, 2025

GLM-4.7 is a 358-billion-parameter large language model developed by Z.ai (formerly Zhipu AI/THUDM) and released in December 2025. It is designed specifically for agentic workflows, multi-step coding tasks, terminal automation, and complex mathematical and scientific reasoning. The model is available under an MIT license, making it usable for both commercial and non-commercial applications. It supports a 131,072-token context window, allowing it to handle long documents and extended coding sessions. What distinguishes GLM-4.7 from earlier GLM releases is a set of three reasoning mechanisms: Interleaved Thinking, which applies reasoning before every response and tool call; Preserved Thinking, which retains reasoning context across conversation turns to maintain consistency; and Turn-level Thinking, which lets developers toggle reasoning depth on or off per turn. On benchmarks, the model scores 73.8% on SWE-bench Verified, 95.7% on AIME 2025, and 87.4% on τ²-Bench. It is best suited for developers and researchers building agent pipelines, automated coding tools, or applications requiring reliable multi-step planning.

Context: 131,072 Output: 16,384 tokens

Input: $0.40 Output: $1.75

›

GLM 4.6V

Dec 08, 2025

GLM-4.6V is a large-scale multimodal foundation model developed by Z.ai, available in two variants: the full 106B parameter version designed for cloud and high-performance cluster deployments, and a lightweight 9B Flash version optimized for local and low-latency use. The model supports a 128K token context window, allowing it to process long documents, multi-page files, and complex mixed-media inputs natively without converting content to plain text first. It was trained with a data cutoff of December 2025. What distinguishes GLM-4.6V is its native integration of tool-use capabilities within a visual model — it can accept images, screenshots, and document pages directly as inputs to function calls, connecting visual perception to executable actions in agent workflows. The model also supports interleaved image-text generation, frontend replication from UI screenshots, and joint understanding of text, layout, charts, tables, and figures. It is best suited for enterprise and agent-based applications such as document analysis pipelines, multimodal AI assistants, UI automation, and content generation workflows.

Text Image Video

Context: 131,072 Output: 16,384 tokens

Input: $0.30 Output: $0.90

›

GLM 4.6

Sep 30, 2025

GLM-4.6 is a large language model developed by Zhipu AI (Z.ai), built on a Mixture-of-Experts architecture with approximately 357 billion parameters. It supports both English and Chinese, carries a 200,000-token context window, and is released under the MIT license, making it available for commercial and personal use without restrictions. The model was released in late 2025 and represents Zhipu AI's flagship offering in the GLM series. GLM-4.6 is designed for tasks that require extended context handling, multi-step reasoning, and agentic workflows. A notable characteristic is its ability to invoke tools during the reasoning process itself — not only after completing a chain of thought — which enables more dynamic problem-solving in agent-based applications. It is well suited for developers and researchers working on complex coding tasks, long-document analysis, bilingual applications, and automated multi-step pipelines.

Release date unavailable

Release date unavailable

Ray Flash 2 is a video generation model developed by Luma AI, designed to produce AI-generated videos from text prompts and images. It is the speed-optimized variant within the Ray 2 model family, prioritizing throughput and fast iteration while maintaining visual quality. As part of the broader Ray lineup, it sits alongside Ray 2, Ray 1.6, and Photo Flash 1, giving users options across different speed and quality trade-offs. Ray Flash 2 is well-suited for workflows that require rapid video generation or high-volume production, such as creative prototyping or iterative content development. It is accessible through multiple platforms, including ComfyUI's Partner Nodes system for visual AI workflows and the Codenteam Intersect platform for developer integrations. The model accepts both text and image inputs, supporting flexible generation modes for a range of creative and technical use cases.

Dec 05, 2024

Amazon Nova Pro is a multimodal foundation model developed by Amazon and made available through Amazon Bedrock. It accepts text and vision inputs and is designed to handle a wide range of tasks where accuracy, response speed, and cost-efficiency all need to be balanced together. It is part of the Amazon Nova family, which also includes Nova Lite and Nova Micro, each targeting different points on the capability-cost spectrum. Nova Pro was released in December 2024 and supports a 300,000-token context window. Nova Pro is particularly suited for agentic workflows and UI actuation, meaning it can be used to build systems that take sequences of actions or interact with interfaces. It supports fine-tuning on Amazon Bedrock, allowing developers to customize the model for specific domains or cost targets. Within the Nova family, Pro occupies the highest capability tier among the understanding models, making it the appropriate choice when tasks require processing both text and images at scale.

Release date unavailable

LTX-2.3 LoRA is a Low-Rank Adaptation fine-tuning system built on top of Lightricks' LTX-2.3 video generation model, released in January 2026. Rather than retraining the full model, LoRA adapters allow users to teach the base model new characters, visual styles, or motion behaviors at a fraction of the computational cost. The system supports both text-to-video and image-to-video generation workflows, and LoRAs trained on the earlier LTX-2.0 model are reported to retain compatibility with the 2.3 update. LTX-2.3 LoRA is designed for creators and developers who need stylistically consistent output across AI-generated video sequences, such as animation, storytelling, or visual effects production. It supports multi-character generation with consistent appearance across frames, style transfer, and community-developed camera movement controls including dolly in and out. The model runs locally using open-source tooling and has gained traction in the Stable Diffusion community for its character and style fidelity in generated video content.

Context: 1000 Output: N/A

MiniMax

3 models

›

Text to Speech

Minimax Speech 2.8 HD

Release date unavailable

MiniMax Speech 2.8 HD is a high-definition text-to-speech model developed by MiniMax, built on an autoregressive Transformer architecture with a Flow-VAE decoder. Instead of using traditional mel-spectrogram vocoders, it models speech in a learned latent space, which produces audio with natural cadence, proper intonation, and emotional depth. The model accepts up to 50,000 tokens of input text and was trained through January 2026. The model offers 17 or more expressive voice presets spanning different genders, ages, and speaking styles, along with support for natural interjections such as laughs, sighs, and gasps embedded directly in text. Users can control emotion, speed, volume, pitch, sample rate, bitrate, channel configuration, and output format. These features make it well suited for audiobook production, video voiceovers, podcast creation, e-learning narration, accessibility applications, and game development.

Context: 50K Output: N/A

›

Hailuo 2.3 Pro

Release date unavailable

Hailuo 2.3 Pro is a video generation model developed by MiniMax, capable of producing ultra-clear 1080P video output from text prompts or image inputs. It is designed with physics-aware scene rendering, meaning it attempts to simulate realistic physical interactions and motion within generated video content. The model was trained with a cutoff of October 2025 and accepts up to 2000 context tokens for prompt input. Hailuo 2.3 Pro supports both text-to-video and image-to-video generation workflows, making it applicable to creative production, prototyping, and visual storytelling tasks. Its image URL input type allows users to anchor video generation to a specific starting frame, while toggle group inputs provide control over generation parameters. The model is suited for use cases that require high-resolution output with coherent motion and scene physics.

Context: 2000 Output: N/A

›

Music

Minimax Music 2.5

Release date unavailable

Release date unavailable

Fast and capable 21B model outperforming larger models while delivering outsized value.

Context: N/A Output: 128,000 tokens

Cohere

2 models

›

Command R

Aug 30, 2024

Command R is an instruction-following conversational model developed by Cohere, designed for enterprise language tasks with a focus on reliability and scalability. It is available through Amazon Bedrock and carries a knowledge cutoff of March 2024. The model is purpose-built for retrieval-augmented generation (RAG) and tool use, making it well-suited for workflows that require grounding responses in external data sources or integrating with external APIs and functions. One of Command R's defining characteristics is its 128,000-token context window, which allows it to process long documents, extended multi-turn conversations, and complex inputs in a single pass. It also supports multilingual tasks and is tagged for low-latency performance, making it a practical choice for organizations building scalable AI applications where response speed and contextual accuracy matter. It is best suited for enterprise use cases such as document analysis, agentic pipelines, and knowledge-grounded question answering.

Context: 128,000 Output: 4,000 tokens

Input: $0.50 Output: $0.60

›

Command R+

Aug 30, 2024

Command R+ is a large language model developed by Cohere, positioned as the company's flagship text generation model for enterprise use. It is available through Amazon Bedrock, allowing organizations to deploy it within AWS's managed cloud infrastructure. The model supports a 128,000-token context window and was trained on data up to January 2023. It is designed specifically for demanding enterprise workloads that require high accuracy and reliability. What distinguishes Command R+ is its purpose-built support for retrieval-augmented generation, enabling it to ground responses in external knowledge sources rather than relying solely on parametric memory. It also supports multi-step tool use and agentic workflows, allowing it to interact with APIs, databases, and other external systems. The model handles multiple languages, making it applicable for global deployments. It is best suited for production applications such as intelligent search, document summarization, customer support automation, and complex data analysis pipelines.

Context: 128,000 Output: 4,000 tokens

Input: $3.00 Output: $10.00

Nvidia

2 models

›

Nemotron 3 Super 120B

Mar 11, 2026

Open source

Nemotron 3 Super 120B is an open-weight large language model released by NVIDIA in March 2026. It uses a hybrid LatentMoE architecture that combines Mamba-2, Mixture-of-Experts, and Attention layers, activating only 12 billion of its 120 billion total parameters per token. This design allows the model to handle demanding tasks while using significantly less compute than a dense model of comparable parameter count. The model is built for agentic workflows, long-context reasoning, and high-throughput deployments. It supports a context window of up to 1 million tokens and achieves a RULER-100 retrieval score of 91.75 at that length. Nemotron 3 Super 120B also includes a configurable thinking mode for step-by-step reasoning, supports seven languages including English, French, German, Italian, Japanese, Spanish, and Chinese, and is available as an open-weight model suitable for both cloud API and self-hosted use.

Context: 1M Output: 16,384 tokens

Input: $0.10 Output: $0.00

›

Nemotron 3 Nano 30B

Dec 14, 2025

Release date unavailable

PixVerse V5.6 is an AI video generation model developed by Aishi Technology and released in January 2026. It generates videos from text prompts and images, supporting resolutions from 360p up to native 4K output, video lengths between 5 and 15 seconds, and aspect ratios for YouTube, TikTok, and Instagram. The model uses a hybrid diffusion-transformer architecture and is designed to reduce visual artifacts compared to prior versions, with improved physics simulation for elements like water, fabric, and character motion. What distinguishes PixVerse V5.6 is its end frame control feature, which allows users to define both the starting and ending images of a video and have the model generate all motion in between — with support for chaining up to 7 keyframes in a single video. It also supports multi-character consistency using reference photos for up to three distinct characters, preserving facial features, clothing, and body proportions across frames. Integrated audio generation produces background music, sound effects, and dialogue synchronized to the on-screen action. The model is well suited for content creators, marketers, and filmmakers producing product demos, branded content, character animations, or social media clips.

Context: 1,000 Output: N/A

Runway

1 models

›

Gen-4 Turbo

Release date unavailable

Gen-4 Turbo is a video generation model developed by Runway, part of the company's Gen-4 model family. It accepts text prompts and reference images as input and outputs up to 10 seconds of AI-generated video per generation. The model is designed with speed and efficiency as primary considerations, making it suited for workflows that require rapid or high-volume video output. It is accessible via Runway's API with official Python and Node.js SDKs for integration into production pipelines. Runway's API ecosystem, which includes Gen-4 Turbo, is used by enterprise customers in media and entertainment, including studios and production companies. The platform holds SOC 2 Type II, GDPR, and CCPA compliance certifications for professional deployments. Gen-4 Turbo sits alongside other models in Runway's broader multi-modal platform, which also supports image and audio generation. It is particularly well-suited for iterative creative workflows where fast turnaround on video output is a priority.

Context: 1,000 Output: N/A