Google

Gemini 2.0 Flash Vision

Released Feb 05, 2025 · 1,048,576-token context · 8,192-token maximum output

Image Understanding · Long Context Window · Multimodal Generation · Built-in Tool Use · Fast Inference · Structured Output

Model Overview

High-signal model metadata in a structured two-column overview table.

Provider

The entity that provides this model.

Google

Input Context Window

The number of tokens supported by the input context window.

1,048,576 tokens

Maximum Output Tokens

The number of tokens that can be generated by the model in a single request.

8,192 tokens

Open Source

Whether the model's code is available for public use.

Release Date

When the model was first released.

Feb 05, 2025

Knowledge Cut-off Date

When the model's knowledge was last updated.

June 2024

API Providers

The providers that offer this model. This is not an exhaustive list.

Google, Vertex AI

Modalities

Types of data this model can process.

Text, Image

What is Gemini 2.0 Flash Vision

A fuller summary of positioning, capabilities, and source-specific details for Gemini 2.0 Flash Vision.

Gemini 2.0 Flash Vision is a multimodal language model developed by Google, designed to process and reason over text, images, and other input types within a single context window of up to 1,048,576 tokens. It is part of the Gemini 2.0 Flash family, which emphasizes speed and efficiency alongside broad capability coverage including built-in tool use and multimodal generation. The model's training data has a cutoff of June 2024.

Gemini 2.0 Flash Vision is well-suited for tasks that require understanding visual content alongside large volumes of text, such as document analysis, image-based question answering, and long-context reasoning. Its large context window makes it practical for workflows involving lengthy documents or multi-turn conversations that incorporate both images and text. The model is accessible through Google's Vertex AI platform and is intended for developers building applications that need fast, multimodal processing at scale.
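When a document set is larger than one context window, the inputs have to be split across several requests. The following is a minimal sketch of one way to do that, assuming per-document token counts have already been measured elsewhere; the function name `pack_into_requests` and the `reserved_tokens` parameter are hypothetical and not part of any Google API.

```python
from typing import List

# Input context window listed on this page.
CONTEXT_WINDOW_TOKENS = 1_048_576


def pack_into_requests(
    doc_token_counts: List[int],
    reserved_tokens: int = 0,
    limit: int = CONTEXT_WINDOW_TOKENS,
) -> List[List[int]]:
    """Greedily group document indices so each group fits in one context window.

    doc_token_counts: token count per document, measured elsewhere (assumption).
    reserved_tokens: tokens set aside per request for instructions or images
        (hypothetical parameter, for illustration only).
    """
    budget = limit - reserved_tokens
    if budget <= 0:
        raise ValueError("reserved_tokens leaves no room for documents")

    groups: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, count in enumerate(doc_token_counts):
        if count > budget:
            raise ValueError(f"document {index} alone exceeds the budget")
        # Start a new request when the next document would overflow this one.
        if used + count > budget:
            groups.append(current)
            current, used = [], 0
        current.append(index)
        used += count
    if current:
        groups.append(current)
    return groups


groups = pack_into_requests([600_000, 400_000, 100_000, 900_000])
print(groups)
```

The greedy pass preserves document order, which keeps related pages together, at the cost of sometimes using more requests than an optimal packing would.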

Capabilities

What Gemini 2.0 Flash Vision supports

Image Understanding

Analyzes and reasons over image inputs alongside text, enabling tasks like visual question answering and image-based document analysis.

Long Context Window

Supports up to 1,048,576 tokens in a single context, allowing processing of lengthy documents, multi-image inputs, or extended conversations.

Multimodal Generation

Generates responses that draw on multiple input modalities, combining text and visual understanding in a single inference pass.

Built-in Tool Use

Supports native tool-calling capabilities, enabling the model to invoke external functions or APIs as part of its response generation.

Fast Inference

Optimized for low-latency responses within the Gemini 2.0 Flash family, making it suitable for real-time or high-throughput applications.

Structured Output

Can return responses in structured formats, supporting downstream data extraction and integration workflows.
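Structured output is only useful downstream if the reply is checked before it is used. The sketch below assumes the model was asked to return a JSON object and that the reply text is already available as a string; the helper `parse_structured_reply` and the invoice fields are hypothetical examples, not a documented response format.

```python
import json
from typing import Any, Dict, Iterable


def parse_structured_reply(raw_text: str, required_keys: Iterable[str]) -> Dict[str, Any]:
    """Parse a JSON object from model output and check that expected keys exist."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


# Hypothetical reply for an invoice-extraction prompt.
reply = '{"vendor": "Acme", "total": 129.5, "currency": "USD"}'
invoice = parse_structured_reply(reply, ["vendor", "total"])
print(invoice["vendor"], invoice["total"])
```

Raising on malformed or incomplete replies lets the caller decide whether to retry the request or fall back to manual handling.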

Pricing for Gemini 2.0 Flash Vision

Primary API pricing, quoted per million tokens.

Input tokens: $0.15 per million tokens

Output tokens: N/A per million tokens
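As a rough illustration of the listed input rate, the sketch below converts a token count into an estimated input cost. It covers input tokens only, because no output price is listed on this page; the function name `estimate_input_cost_usd` is hypothetical.

```python
from decimal import Decimal

# Input price listed on this page, in USD per million tokens.
INPUT_PRICE_PER_MILLION_USD = Decimal("0.15")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_input_cost_usd(input_tokens: int) -> Decimal:
    """Estimate the input-side cost of a request; output cost is not modeled."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return INPUT_PRICE_PER_MILLION_USD * Decimal(input_tokens) / TOKENS_PER_MILLION


# Cost of filling the full 1,048,576-token context window once.
full_window_cost = estimate_input_cost_usd(1_048_576)
print(full_window_cost)
```

Decimal arithmetic avoids the floating-point rounding noise that can accumulate when many small per-request costs are summed.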

Price Comparison

Additional usage-cost dimensions synced into the project for this model.

maxTemperature: 2

maxResponseSize: 8,192 tokens

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

Google Vertex AI

Configuration & Parameters

The configurable options currently documented for this model.

Temperature

Number

Default: 1; Range: 0 to 2 (step 0.1)

Max Response Tokens

Number

Default: 4096; Range: 1 to 8192 (step 1)
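The documented defaults and ranges can be enforced on the client before a request is sent. This is a minimal sketch, assuming the two parameters above are the only ones being set; the class name `GenerationSettings` and its field names are hypothetical and do not correspond to any particular SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Client-side holder for the two parameters documented on this page."""

    temperature: float = 1.0          # documented default: 1, range: 0 to 2
    max_response_tokens: int = 4096   # documented default: 4096, range: 1 to 8192

    def __post_init__(self) -> None:
        if not 0 <= self.temperature <= 2:
            raise ValueError("temperature must be between 0 and 2")
        if not 1 <= self.max_response_tokens <= 8192:
            raise ValueError("max_response_tokens must be between 1 and 8192")


defaults = GenerationSettings()
precise = GenerationSettings(temperature=0.2, max_response_tokens=8192)
print(defaults, precise)
```

Validating at construction time means an out-of-range value fails immediately in the calling code rather than as a rejected request.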

Supported Request Parameters

Parameters currently listed by OpenRouter or the local catalog for this model.

Temperature, Max Response Tokens

Model Performance

Benchmark scores synced from the current model source and normalized into the local catalog.

Benchmark	Description	Score
AIME 2024	American Invitational Mathematics Examination problems	33.0%
GPQA Diamond	PhD-level science questions (biology, physics, chemistry)	62.3%
HLE	Questions that challenge frontier models across many domains	5.3%
LiveCodeBench	Real-world coding tasks from recent competitions	33.4%
MATH-500	Undergraduate and competition-level math problems	93.0%
MMLU-Pro	Expert knowledge across 14 academic disciplines	77.9%
SciCode	Scientific research coding and numerical methods	33.3%

Resources & Documentation

Official model cards, release notes, docs, and other references synced from the source page.

Official Website (Other)

Documentation (Documentation)

Google AI Studio (Playground)

Gemini API Reference (Documentation)

Gemini 2.0 Flash Announcement (Announcements)