Large Context Window
Supports up to 1,048,576 tokens in a single context, enabling processing of long documents, extended conversations, or large batches of visual and textual content.
Gemini 2.5 Flash Vision is a multimodal vision model developed by Google, designed to process and reason over visual inputs alongside text. It is part of the Gemini 2.5 Flash family, which is built around balancing cost efficiency with broad capability coverage. The model supports a context window of 1,048,576 tokens, making it suitable for tasks that require processing large amounts of information in a single request. It was trained with a knowledge cutoff of June 2025. This model is positioned for use cases where real-time or low-latency responses are important, such as visual question answering, document analysis with images, and applications that combine vision with extended context. The "thinking" architecture underlying the Gemini 2.5 Flash series enables the model to apply multi-step reasoning before producing a response. Developers looking for a vision-capable model that can handle long documents, images, and mixed-modality inputs without incurring the cost of larger models will find this a practical option.
High-signal model metadata in a structured two-column overview table.
The entity that provides this model.
The number of tokens supported by the input context window.
The number of tokens that can be generated by the model in a single request.
Whether the model's code is available for public use.
When the model was first released.
When the model's knowledge was last updated.
The providers that offer this model. This is not an exhaustive list.
Types of data this model can process.
A fuller summary of positioning, capabilities, and source-specific details for Gemini 2.5 Flash Vision.
Gemini 2.5 Flash Vision is a multimodal vision model developed by Google, designed to process and reason over visual inputs alongside text. It is part of the Gemini 2.5 Flash family, which is built around balancing cost efficiency with broad capability coverage. The model supports a context window of 1,048,576 tokens, making it suitable for tasks that require processing large amounts of information in a single request. It was trained with a knowledge cutoff of June 2025.
This model is positioned for use cases where real-time or low-latency responses are important, such as visual question answering, document analysis with images, and applications that combine vision with extended context. The "thinking" architecture underlying the Gemini 2.5 Flash series enables the model to apply multi-step reasoning before producing a response. Developers looking for a vision-capable model that can handle long documents, images, and mixed-modality inputs without incurring the cost of larger models will find this a practical option.
Supports up to 1,048,576 tokens in a single context, enabling processing of long documents, extended conversations, or large batches of visual and textual content.
Optimized for low-latency responses, making it suitable for interactive applications and real-time visual analysis workflows.
Processes image inputs alongside text to answer questions, describe scenes, extract information, or reason over visual content.
Applies multi-step thinking across both visual and textual inputs, supporting tasks like document comprehension that combine images and text.
Can return responses in structured formats, useful for extracting data from images or documents into machine-readable outputs.
Primary API pricing shown in the same “quick compare” spirit as the reference page.
Additional usage-cost dimensions synced into the project for this model.
Places where this model is available, based on the synced detail-page metadata.
The configurable options currently documented for this model.
Parameters currently listed by OpenRouter or the local catalog for this model.
Benchmark scores synced from the current model source and normalized into the local catalog.
| Benchmark | Score |
|---|---|
|
AIME 2024
American math olympiad problems
|
|
|
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
|
|
|
HLE
Questions that challenge frontier models across many domains
|
|
|
LiveCodeBench
Real-world coding tasks from recent competitions
|
|
|
MATH-500
Undergraduate and competition-level math problems
|
|
|
MMLU-Pro
Expert knowledge across 14 academic disciplines
|
|
|
SciCode
Scientific research coding and numerical methods
|
Official model cards, release notes, docs, and other references synced from the source page.
Jump straight into the most relevant side-by-side comparison pages for this model.
Compare Gemini 1.0 Pro Vision Deprecated and Gemini 2.5 Flash Vision across pricing, context window, capabilities, benchmarks, and API access to choose the better fit for general-purpose AI workloads versus long-context workloads.
Compare Gemini 1.0 Pro Deprecated and Gemini 2.5 Flash Vision across pricing, context window, capabilities, benchmarks, and API access to choose the better fit for long-context workloads versus long-context workloads.
Compare Gemini 3 Pro Image and Gemini 2.5 Flash Vision across pricing, context window, capabilities, benchmarks, and API access to choose the better fit for reasoning-heavy tasks versus long-context workloads.
Compare Gemini 3 Deprecated and Gemini 2.5 Flash Vision across pricing, context window, capabilities, benchmarks, and API access to choose the better fit for long-context workloads versus long-context workloads.
Compare Gemini 3 Flash and Gemini 2.5 Flash Vision across pricing, context window, capabilities, benchmarks, and API access to choose the better fit for long-context workloads versus long-context workloads.
Compare Gemini 3.5 Flash and Gemini 2.5 Flash Vision across pricing, context window, capabilities, benchmarks, and API access to choose the better fit for reasoning-heavy tasks versus long-context workloads.
Gemini 2.5 Flash Vision supports a context window of 1,048,576 tokens, allowing very large amounts of text and visual content to be processed in a single request.
The model has a training data cutoff of June 2025, as indicated in the model metadata.
The model is classified as a Vision type and is designed to accept image inputs alongside text, enabling multimodal tasks.
Yes. Gemini 2.5 Flash Vision is tagged for real-time latency, meaning it is optimized to return responses quickly, which is relevant for interactive or production applications.
The model is published by Google and is part of the Gemini 2.5 Flash model family, available through Google's AI infrastructure including Vertex AI.
Continue browsing adjacent models from the same provider.