LLM Model Directory

Explore frontier AI models by provider, pricing, and context

Browse the synced model catalog by provider, release, pricing, and core capabilities.

Models by type: Text (132), Vision (13), Image (55), Video (32), Transcription (5), Text to Speech (6), Music (2), Lip Sync (5), 3D (5)

13 models from 4 providers in view. Current filter: All providers. Type: Vision.

OpenAI

3 models

Vision

GPT-4 Turbo Vision

Release date unavailable

GPT-4 Turbo Vision is a multimodal language model developed by OpenAI that accepts both text and image inputs, allowing it to analyze visual content and answer questions about it. Built on GPT-4 Turbo, it extends the traditionally text-only language model with vision capabilities and has a context window of 128,000 tokens. The model's training data has a cutoff of December 2023. GPT-4 Turbo Vision is well suited to tasks that require reasoning over images alongside text, such as document analysis, visual question answering, interpreting diagrams, and describing image content. The large context window lets users include substantial amounts of text alongside image inputs in a single request. It is available through OpenAI's API and is accessible on MindStudio without requiring separate API key management.

Context: 128,000 Output: 4,096 tokens

Input: $10.00 Output: N/A

Vision

GPT-4o Mini Vision

Release date unavailable

GPT-4o Mini Vision is a multimodal language model developed by OpenAI, released in mid-2024. It is a smaller, more cost-efficient variant of the GPT-4o family, designed to process both text and images within a single context window of 128,000 tokens. The model supports the same range of languages as GPT-4o and is optimized for low latency, making it suitable for high-throughput or real-time applications. It is well suited to tasks that require fast responses at scale, such as customer-facing chat interfaces, document analysis with visual content, and pipelines where cost per token is a primary constraint. Its multimodal reasoning capability allows it to interpret images alongside text in the same request. It is aimed primarily at developers who work with large volumes of context or need to process mixed text-and-image inputs at reduced cost.

Context: 128,000 Output: 16,383 tokens

Input: $0.15 Output: N/A

Vision

GPT-4o Vision

Release date unavailable

GPT-4o Vision is a variant of OpenAI's GPT-4o model that accepts both text and image inputs, allowing it to analyze visual content and respond to questions about it. Developed by OpenAI and added to MindStudio in June 2024, it supports a 128,000-token context window and has a training data cutoff of October 2023. The model addresses a historical limitation of language models, which traditionally processed only text, by enabling multimodal input handling within a single system. GPT-4o Vision is well suited for tasks that require interpreting images alongside text, such as describing visual content, answering questions about photographs or diagrams, extracting information from images, and supporting workflows where visual and textual data appear together. Because it shares the GPT-4o architecture, it handles natural language tasks in addition to vision tasks without requiring a separate model. Developers building applications that involve document analysis, image-based Q&A, or mixed-media content can use this model through the OpenAI API.

Context: 128,000 Output: 4,096 tokens

Input: $2.50 Output: N/A
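
To compare the three OpenAI input prices listed above, the following Python sketch estimates the input cost of a single request. It assumes the listed prices are in USD per 1,000,000 input tokens, which this page does not state. The names INPUT_PRICE_PER_MILLION and estimate_input_cost are hypothetical and are not part of any provider API. Output pricing is shown as N/A in the listings, so it is left out.

```python
from decimal import Decimal

# Input prices copied from the OpenAI listings above.
# ASSUMPTION: each price is USD per 1,000,000 input tokens (the page gives no unit).
INPUT_PRICE_PER_MILLION = {
    "GPT-4 Turbo Vision": Decimal("10.00"),
    "GPT-4o Mini Vision": Decimal("0.15"),
    "GPT-4o Vision": Decimal("2.50"),
}

TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_input_cost(model: str, input_tokens: int) -> Decimal:
    """Estimate the input-side cost in USD for one request (hypothetical helper)."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    price = INPUT_PRICE_PER_MILLION[model]  # raises KeyError for unlisted models
    return price * Decimal(input_tokens) / TOKENS_PER_MILLION


if __name__ == "__main__":
    # Cost of filling the full 128,000-token context window with input.
    for name in INPUT_PRICE_PER_MILLION:
        print(f"{name}: ${estimate_input_cost(name, 128_000)}")
```

Decimal is used instead of float so that small per-token amounts are not distorted by binary rounding. If the catalog's prices turn out to use a different unit, only TOKENS_PER_MILLION needs to change.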

Google

7 models

Vision

Gemini 2.5 Flash Vision

Jun 17, 2025

Gemini 2.5 Flash Vision is a multimodal vision model developed by Google, designed to process and reason over visual inputs alongside text. It is part of the Gemini 2.5 Flash family, which is built around balancing cost efficiency with broad capability coverage. The model supports a context window of 1,048,576 tokens, making it suitable for tasks that require processing large amounts of information in a single request. It was trained with a knowledge cutoff of June 2025. This model is positioned for use cases where real-time or low-latency responses are important, such as visual question answering, document analysis with images, and applications that combine vision with extended context. The "thinking" architecture underlying the Gemini 2.5 Flash series enables the model to apply multi-step reasoning before producing a response. Developers looking for a vision-capable model that can handle long documents, images, and mixed-modality inputs without incurring the cost of larger models will find this a practical option.

Context: 1,048,576 Output: 65,535 tokens

Input: $0.30 Output: N/A

Vision

Gemini 2.5 Pro Vision

Jun 17, 2025

Gemini 2.5 Pro Vision is a multimodal AI model developed by Google DeepMind, designed to reason through complex problems by analyzing text, images, audio, video, and code. It operates as a "thinking model," meaning it works through logical steps before producing a response rather than generating output directly. The model supports a context window of 1,048,576 tokens, enabling it to process large documents, codebases, and extended conversations in a single request. The model is particularly suited for tasks that require combining visual understanding with structured reasoning, such as interpreting diagrams, analyzing image-based data, and generating code from visual inputs. It has demonstrated strong benchmark performance in math, science, and software engineering tasks, including a 63.8% score on the SWE-Bench Verified evaluation. Gemini 2.5 Pro Vision is available through Google AI Studio and via the Gemini API, making it accessible for developers building applications that require both vision and reasoning capabilities.

Context: 1,048,576 Output: 65,536 tokens

Input: $1.25 Output: N/A

Vision

Gemini 2.0 Flash-Lite Vision

Feb 25, 2025

Gemini 2.0 Flash-Lite Vision is a multimodal model developed by Google, designed to process both visual and textual inputs. It belongs to the Gemini 2.0 Flash family and is positioned as the fastest and most cost-efficient option within that lineup. The model supports a context window of over one million tokens, making it suitable for tasks that require processing large amounts of information in a single request. It was trained on data up to June 2024. This model is intended as an upgrade path for users of Gemini 1.5 Flash who want improved output quality without changes to cost or latency. Its vision capabilities allow it to handle image understanding tasks alongside text-based workflows. The combination of speed, large context support, and multimodal input handling makes it well-suited for applications such as document analysis, image captioning, and high-throughput pipelines where cost efficiency is a priority.

Context: 1,048,576 Output: 8,192 tokens

Input: $0.08 Output: N/A

Vision

Gemini 2.0 Flash Vision

Feb 05, 2025

Gemini 2.0 Flash Vision is a multimodal language model developed by Google, designed to process and reason over text, images, and other input types within a single context window of up to 1,048,576 tokens. It is part of the Gemini 2.0 Flash family, which emphasizes speed and efficiency alongside broad capability coverage including built-in tool use and multimodal generation. The model's training data has a cutoff of June 2024. Gemini 2.0 Flash Vision is well-suited for tasks that require understanding visual content alongside large volumes of text, such as document analysis, image-based question answering, and long-context reasoning. Its large context window makes it practical for workflows involving lengthy documents or multi-turn conversations that incorporate both images and text. The model is accessible through Google's Vertex AI platform and is intended for developers building applications that need fast, multimodal processing at scale.

Context: 1,048,576 Output: 8,192 tokens

Input: $0.15 Output: N/A

Vision

Gemini 1.0 Pro Vision (Deprecated)

Release date unavailable

Handles both text and image inputs for content generation and problem-solving.

Context: N/A Output: 1,024 tokens

Input: N/A Output: N/A

Vision

Gemini 1.5 Flash Vision (Deprecated)

Release date unavailable

Fast, cost-effective multimodal model for quality applications at high volume.

Context: N/A Output: 8,192 tokens

Input: N/A Output: N/A

Vision

Gemini 1.5 Pro Vision (Deprecated)

Release date unavailable

Adept at processing visual and text inputs for multimodal tasks and content creation.

Context: N/A Output: 8,192 tokens

Input: N/A Output: N/A

X.ai

2 models

Vision

Grok 2 Vision

Release date unavailable

Grok 2 Vision (grok-2-vision-1212) is a multimodal language model developed by xAI and released in December 2024. It accepts combined image and text inputs and is designed to understand, analyze, and respond to visual content alongside natural language. The model supports images up to 20MiB in JPG, JPEG, or PNG format and can process image and text inputs in any order. It also includes multilingual support and improved instruction-following compared to earlier Grok vision releases. Grok 2 Vision is suited to production use cases that require visual comprehension, such as image captioning, visual question answering, chart and document analysis, and building AI assistants that respond to visual inputs. It supports tool calling and structured outputs, making it straightforward to integrate into developer workflows. With a 32,768-token context window, it can handle moderately long conversations that mix text and image content.

Context: 32,768 Output: 1M

Input: $2.00 Output: $2.50
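
The image limits described for Grok 2 Vision (up to 20MiB, in JPG, JPEG, or PNG format) can be checked before a request is sent. The Python sketch below is one way to do that using only the standard library. The function names are hypothetical and are not part of the xAI API, treating exactly 20MiB as allowed is an assumption, and the format check looks only at the file extension rather than the file contents.

```python
from pathlib import Path

# Limits taken from the Grok 2 Vision description above.
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # 20 MiB; ASSUMPTION: the limit is inclusive
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_image(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the image passes (hypothetical helper)."""
    problems = []
    suffix = Path(name).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        # Extension check only; the file contents are not inspected.
        problems.append(f"unsupported format: {suffix or '(no extension)'}")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append(f"too large: {size_bytes} bytes exceeds {MAX_IMAGE_BYTES}")
    return problems


def check_image_file(path: str) -> list[str]:
    """Apply the same checks to a file on disk, reading its size from the filesystem."""
    file_path = Path(path)
    return check_image(file_path.name, file_path.stat().st_size)


if __name__ == "__main__":
    print(check_image("chart.PNG", 512_000))             # passes both checks
    print(check_image("scan.gif", 512_000))              # fails the format check
    print(check_image("photo.jpg", MAX_IMAGE_BYTES + 1)) # fails the size check
```

Separating check_image from check_image_file keeps the rule itself independent of the filesystem, so it can also be applied to uploads whose size is known before the bytes are written to disk.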

Vision

Grok 4.3 Vision

Release date unavailable

Structured model profile with pricing, context, and capability details.

Context: N/A Output: 2,000,000 tokens

Input: $1.25 Output: N/A

Ideogram

1 model

Vision

Ideogram Vision

Release date unavailable

Ideogram Vision is a multimodal AI model developed by Ideogram that combines image understanding with natural language processing. It is designed to analyze and interpret images in conjunction with text prompts, enabling tasks such as visual question answering, image description, and vision-language reasoning. The model extends Ideogram's AI platform beyond image generation into visual comprehension. It supports a context window of 32,000 tokens, allowing for detailed and extended interactions involving both image and text inputs. Ideogram Vision is best suited for applications that require understanding the content of an image and responding to queries about it in natural language. This includes use cases such as extracting information from visual content, describing scenes or objects, and combining visual context with text-based reasoning tasks. The model is accessible through the MindStudio platform without requiring separate API key management. It is particularly relevant for developers and teams building workflows that involve image analysis as a core component.

Context: 32,000 Output: N/A

Input: $0.01 Output: N/A
