X.ai

Grok 2 Vision

Grok 2 Vision (grok-2-vision-1212) is a multimodal language model developed by xAI and released in December 2024. It accepts combined image and text inputs and is designed to understand, analyze, and respond to visual content alongside natural language. The model supports images up to 20MiB in JPG, JPEG, or PNG format and can process inputs in any order. It also includes multilingual support and improved instruction-following compared to earlier Grok vision releases. Grok 2 Vision is suited for production use cases that require visual comprehension, such as image captioning, visual question answering, chart and document analysis, and building AI assistants that respond to visual inputs. It supports tool calling and structured outputs, making it straightforward to integrate into developer workflows. With a 32,768-token context window, it can handle moderately long conversations that mix text and image content.

Unknown 32,768 context 1M output
Image Understanding Multimodal Input Multilingual Support Instruction Following Tool Calling Structured Outputs

Model Overview

High-signal model metadata in a structured two-column overview table.

Provider

The entity that provides this model.

X.ai

Input Context Window

The number of tokens supported by the input context window.

32,768 tokens

Maximum Output Tokens

The number of tokens that can be generated by the model in a single request.

1M tokens

Open Source

Whether the model's code is available for public use.

No

Release Date

When the model was first released.

Unknown

Knowledge Cut-off Date

When the model's knowledge was last updated.

Unknown

API Providers

The providers that offer this model. This is not an exhaustive list.

X.ai

Modalities

Types of data this model can process.

Text Image

What is Grok 2 Vision

A fuller summary of positioning, capabilities, and source-specific details for Grok 2 Vision.

Grok 2 Vision (grok-2-vision-1212) is a multimodal language model developed by xAI and released in December 2024. It accepts combined image and text inputs and is designed to understand, analyze, and respond to visual content alongside natural language. The model supports images up to 20MiB in JPG, JPEG, or PNG format and can process inputs in any order. It also includes multilingual support and improved instruction-following compared to earlier Grok vision releases.

Grok 2 Vision is suited for production use cases that require visual comprehension, such as image captioning, visual question answering, chart and document analysis, and building AI assistants that respond to visual inputs. It supports tool calling and structured outputs, making it straightforward to integrate into developer workflows. With a 32,768-token context window, it can handle moderately long conversations that mix text and image content.

Capabilities

What Grok 2 Vision supports

IMG

Image Understanding

Analyzes image content including objects, styles, charts, and documents. Accepts JPG, JPEG, or PNG files up to 20MiB per image.

MM

Multimodal Input

Accepts interleaved text and image inputs in any order within a single request, enabling flexible prompt construction.

AI

Multilingual Support

Processes and generates responses in multiple languages, making it usable for internationally facing applications.

AI

Instruction Following

Follows complex and nuanced prompts with improved steerability introduced in the December 2024 release.

TL

Tool Calling

Supports function calling so developers can connect the model to external tools and APIs within their pipelines.

JSON

Structured Outputs

Returns structured data formats and supports temperature control for predictable, integration-ready responses.

AI

Visual Question Answering

Answers natural language questions about image content, including charts, diagrams, and scanned documents.

CTX

Long Context Window

Supports up to 32,768 tokens per request, accommodating extended conversations that mix text and image inputs.

Pricing for Grok 2 Vision

Primary API pricing shown in the same “quick compare” spirit as the reference page.

Price Comparison

Additional usage-cost dimensions synced into the project for this model.

maxTemperature 1

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

X.ai

Model Performance

Benchmark scores synced from the current model source and normalized into the local catalog.

Benchmark Score
AIME 2024
American math olympiad problems
13.3%
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
51.0%
HLE
Questions that challenge frontier models across many domains
3.8%
LiveCodeBench
Real-world coding tasks from recent competitions
26.7%
MATH-500
Undergraduate and competition-level math problems
77.8%
MMLU-Pro
Expert knowledge across 14 academic disciplines
70.9%
SciCode
Scientific research coding and numerical methods
28.5%

Resources & Documentation

Official model cards, release notes, docs, and other references synced from the source page.

FAQ

Common questions about Grok 2 Vision

What is the context window for Grok 2 Vision?

Grok 2 Vision supports a context window of 32,768 tokens per request.

What image formats does Grok 2 Vision accept?

The model accepts JPG, JPEG, and PNG image formats, with a maximum file size of 20MiB per image.

When was Grok 2 Vision released and what is its training cutoff?

Grok 2 Vision was released in December 2024, with a training date listed as December 2024.

Does Grok 2 Vision support tool calling?

Yes, Grok 2 Vision supports function calling and structured outputs, allowing integration with external tools and APIs.

Who publishes Grok 2 Vision and where can I access it via API?

Grok 2 Vision is published by xAI (the AI division of X). It is accessible through the xAI API and is also listed on OpenRouter under the model ID grok-2-vision-1212.

More models from X.ai

Continue browsing adjacent models from the same provider.

← All AI Models