OpenAI

GPT-4o Mini Vision

GPT-4o Mini Vision is a multimodal language model developed by OpenAI, released in mid-2024. It is a smaller, more cost-efficient variant of the GPT-4o family, designed to process both text and images within a single context window of 128,000 tokens. The model supports the same range of languages as GPT-4o and is optimized for low latency, making it suitable for high-throughput or real-time applications. The model is well-suited for tasks that require fast responses at scale, such as customer-facing chat interfaces, document analysis with visual content, and pipelines where cost per token is a primary constraint. Its multimodal reasoning capability allows it to interpret images alongside text in the same request. Developers working with large volumes of context or needing to process mixed text-and-image inputs at reduced cost are the primary intended audience.

Unknown 128,000 context 16,383 tokens output
Image Understanding Large Context Window Low Latency Responses Cost-Efficient Inference Multilingual Text Processing Structured Output

Model Overview

High-signal model metadata in a structured two-column overview table.

Provider

The entity that provides this model.

OpenAI

Input Context Window

The number of tokens supported by the input context window.

128,000 tokens

Maximum Output Tokens

The number of tokens that can be generated by the model in a single request.

16,383 tokens tokens

Open Source

Whether the model's code is available for public use.

No

Release Date

When the model was first released.

Unknown

Knowledge Cut-off Date

When the model's knowledge was last updated.

Unknown

API Providers

The providers that offer this model. This is not an exhaustive list.

OpenAI API

Modalities

Types of data this model can process.

Text Image

What is GPT-4o Mini Vision

A fuller summary of positioning, capabilities, and source-specific details for GPT-4o Mini Vision.

GPT-4o Mini Vision is a multimodal language model developed by OpenAI, released in mid-2024. It is a smaller, more cost-efficient variant of the GPT-4o family, designed to process both text and images within a single context window of 128,000 tokens. The model supports the same range of languages as GPT-4o and is optimized for low latency, making it suitable for high-throughput or real-time applications.

The model is well-suited for tasks that require fast responses at scale, such as customer-facing chat interfaces, document analysis with visual content, and pipelines where cost per token is a primary constraint. Its multimodal reasoning capability allows it to interpret images alongside text in the same request. Developers working with large volumes of context or needing to process mixed text-and-image inputs at reduced cost are the primary intended audience.

Capabilities

What GPT-4o Mini Vision supports

IMG

Image Understanding

Accepts image inputs alongside text in a single request, enabling the model to describe, analyze, or answer questions about visual content.

CTX

Large Context Window

Supports up to 128,000 tokens of context per request, allowing long documents, conversation histories, or multiple images to be passed in one call.

AI

Low Latency Responses

Optimized for fast inference, making it suitable for real-time applications such as customer chat interfaces or interactive tools.

AI

Cost-Efficient Inference

Priced significantly lower per token than larger GPT-4o variants, enabling high-volume deployments without proportional cost increases.

AI

Multilingual Text Processing

Supports the same broad set of languages as GPT-4o, covering text generation, comprehension, and reasoning across multiple languages.

JSON

Structured Output

Can return responses in structured formats such as JSON, useful for downstream data processing or API integrations.

Pricing for GPT-4o Mini Vision

Primary API pricing shown in the same “quick compare” spirit as the reference page.

Price Comparison

Additional usage-cost dimensions synced into the project for this model.

maxTemperature 2
maxResponseSize 16,383 tokens

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

OpenAI API

Configuration & Parameters

The configurable options currently documented for this model.

Temperature

Number
Default: 1 Range: 0 - 2 (step 0.1)

Max Response Tokens

Number
Default: 8191 Range: 1 - 16383 (step 1)

Supported Request Parameters

Parameters currently listed by OpenRouter or the local catalog for this model.

Temperature Max Response Tokens

Model Performance

Benchmark scores synced from the current model source and normalized into the local catalog.

Benchmark Score
AIME 2024
American math olympiad problems
15.0%
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
54.3%
HLE
Questions that challenge frontier models across many domains
3.3%
LiveCodeBench
Real-world coding tasks from recent competitions
30.9%
MATH-500
Undergraduate and competition-level math problems
75.9%
MMLU-Pro
Expert knowledge across 14 academic disciplines
74.8%
SciCode
Scientific research coding and numerical methods
33.3%

Resources & Documentation

Official model cards, release notes, docs, and other references synced from the source page.

Community discussion

What people think about GPT-4o Mini Vision

GPT-4o Mini Vision discussions are most active in r/OpenAI, r/OpenAIDev, r/arduino. Top Reddit threads cluster around benchmark and model-comparison threads, coding workflow discussions.

The strongest match in this snapshot has 69 upvotes and 34 comments.

r/OpenAI 69 upvotes 34 comments July 19, 2024
GPT-4o mini vision pricing is odd

Sorry if someone's posted this before but I couldn't see anything.

I find it a bit strange that OpenAI have made their GPT-4o mini functionally the same as the non-mini model for vision, by making each "image tile" more tokens in the mini vs the original 4o model.

[https://openai.com/api/pricing/](https://openai.com/api/pricing/)

GPT-4o:
150 x 150px image = 255 tokens (155 + 85 base tokens)
255 tokens = US$0.001275

GPT-4o mini:
150 x 150px image = 8500 tokens (5667 + 2833 base tokens)
8500 tokens = US$0.001275

I had a bit of a fun project in mind which would compare images, so I was super excited about a really cheap model (especially with their batch 50% discount) but it's a bit dissapointing that the discount doesn't carry over to images.

In contrast, Anthropic just use the formula \`tokens = (width px \* height px)/750\` and charge you the corresponding model's rate for the tokens, and for now Haiku is nearly 10x cheaper per image than 4o mini.

Note:
I did test that this isn't an error on their page, I compared two small images and got the following response. `CompletionUsage(completion_tokens=13, prompt_tokens=17128, total_tokens=17141)`

Edit:
Seems like it's official, there's a tweet from OpenAI acknowledging it
[https://x.com/romainhuet/status/1814054938986885550?t=AMFK4svMvCluYqAXUqRDMQ&s=19](https://x.com/romainhuet/status/1814054938986885550?t=AMFK4svMvCluYqAXUqRDMQ&s=19)

Open Reddit thread
r/OpenAIDev 3 upvotes 1 comments December 26, 2024
Sudden 88% drop in GPT-4o mini vision API token usage - what's going on?

Hey everyone! I'm seeing some strange behavior with my GPT-4o mini vision API usage and hoping someone can shed some light on this.

My Setup:
- I have an app that uses GPT-4o mini vision to extract data from images
- Images are sent as base64 directly in the prompt
- No recent changes made to the application

What Changed:
- Average token usage dropped from 137k to 16k tokens (88% decrease)
- Error rate increased from 1.9% to 2.9%

This happened suddenly without any changes on my end. Has anyone else experienced something similar? Were there any recent pricing changes or updates to the API that might explain this?
Any insights would be greatly appreciated!

Open Reddit thread
r/arduino 1 comments October 19, 2024
Chat gpt vision ai with gpt 4o mini

I am making a project using chat gpt's vision api with an esp32cam. Works for first loop (first picture it takes and sends to chat gpt), but the esp32 has "connection error" with chat gpt when i try to take another picture. Need help. Here is my code so far: (I have used chat gpt to try and fix the code but didn't work)

#include "esp_camera.h"
#include "FS.h"
#include "SD.h"
#include "SPI.h"
#include "mbedtls/base64.h"  // For Base64 encoding
#include "WiFi.h"            // Include Wi-Fi library
#include "wifi_credentials.h"  // Include the file with Wi-Fi credentials

#define CAMERA_MODEL_XIAO_ESP32S3 // Has PSRAM

#include "camera_pins.h"

int imageCount = 1;                // File Counter
bool camera_sign = false;          // Check camera status
bool sd_sign = false;              // Check sd status
int button = 0;    
const int buttonPin = 3;           // Pin where the button is connected  

// Function to delete all files in the root directory
void deleteAllFiles(fs::FS &fs) {
    File root = fs.open("/");
    File file = root.openNextFile();
    while (file) {
        fs.remove(file.name());  // Delete each file
        file = root.openNextFile();
    }
    Serial.println("All files deleted from SD card.");
}

// Function to create necessary folders
void createFolders(fs::FS &fs) {
    if (!fs.exists("/pictures")) {
        fs.mkdir("/pictures");
        Serial.println("Created folder: /pictures");
    }
    if (!fs.exists("/encoded")) {
        fs.mkdir("/encoded");
        Serial.println("Created folder: /encoded");
    }
}

// Save pictures to SD card in /pictures folder
void photo_save(const char * fileName) {
    // Take a photo
    camera_fb_t *fb = esp_camera_fb_get();
    if (!fb) {
        Serial.println("Failed to get camera frame buffer");
        return;
    }
    // Save photo to file in the /pictures directory
    writeFile(SD, fileName, fb->buf, fb->len);
 
    // Base64 encode and save the image
    encodeBase64AndSave(fb->buf, fb->len);

    // Release image buffer
    esp_camera_fb_return(fb);

    Serial.println("Photo saved to file and encoded.");
}

// SD card write file
void writeFile(fs::FS &fs, const char * path, uint8_t * data, size_t len){
    Serial.printf("Writing file: %s\r\n", path);

    File file = fs.open(path, FILE_WRITE);
    if(!file){
        Serial.println("Failed to open file for writing");
        return;
    }
    if(file.write(data, len) == len){
        Serial.println("File written");
    } else {
        Serial.println("Write failed");
    }
    file.close();
}

// Function to Base64 encode the image and save it to the encoded folder
void encodeBase64AndSave(uint8_t *imageData, size_t len) {
    // Calculate the output buffer size for Base64 encoded data
    size_t encodedLen = (len * 4 / 3) + 4;  // Base64 increases size by ~33%
    char *encodedData = (char*) malloc(encodedLen);  // Allocate memory for encoded data

    if (encodedData == NULL) {
        Serial.println("Failed to allocate memory for Base64 encoding");
        return;
    }

    // Perform Base64 encoding
    size_t outputLen;
    int ret = mbedtls_base64_encode((unsigned char*)encodedData, encodedLen, &outputLen, imageData, len);

    if (ret != 0) {
        Serial.println("Failed to encode image to Base64");
        free(encodedData);
        return;
    }

    // Create the filename for the encoded file in the /encoded folder
    char encodedFileName[64];
    sprintf(encodedFileName, "/encoded/image%d.txt", imageCount);  // Save Base64 data as a .txt file

    // Save the encoded data to the SD card
    writeFile(SD, encodedFileName, (uint8_t*)encodedData, outputLen);

    free(encodedData);  // Free allocated memory after encoding
}

// Function to connect to Wi-Fi
void connectToWiFi() {
    WiFi.begin(WIFI_SSID, WIFI_PASSWORD);
    Serial.print("Connecting to Wi-Fi");

    // Wait until the ESP32 connects to the Wi-Fi
    while (WiFi.status() != WL_CONNECTED) {
        delay(500);
        Serial.print(".");
    }

    Serial.println("");
    Serial.println("Wi-Fi connected.");
    Serial.print("IP address: ");
    Serial.println(WiFi.localIP());
}

void setup() {
    Serial.begin(115200);
    while(!Serial); // When the serial monitor is turned on, the program starts to execute

    // Connect to Wi-Fi
    connectToWiFi();

    camera_config_t config;
    config.ledc_channel = LEDC_CHANNEL_0;
    config.ledc_timer = LEDC_TIMER_0;
    config.pin_d0 = Y2_GPIO_NUM;
    config.pin_d1 = Y3_GPIO_NUM;
    config.pin_d2 = Y4_GPIO_NUM;
    config.pin_d3 = Y5_GPIO_NUM;
    config.pin_d4 = Y6_GPIO_NUM;
    config.pin_d5 = Y7_GPIO_NUM;
    config.pin_d6 = Y8_GPIO_NUM;
    config.pin_d7 = Y9_GPIO_NUM;
    config.pin_xclk = XCLK_GPIO_NUM;
    config.pin_pclk = PCLK_GPIO_NUM;
    config.pin_vsync = VSYNC_GPIO_NUM;
    config.pin_href = HREF_GPIO_NUM;
    config.pin_sscb_sda = SIOD_GPIO_NUM;
    config.pin_sscb_scl = SIOC_GPIO_NUM;
    config.pin_pwdn = PWDN_GPIO_NUM;
    config.pin_reset = RESET_GPIO_NUM;
    config.xclk_freq_hz = 20000000;
    config.frame_size = FRAMESIZE_UXGA;
    config.pixel_format = PIXFORMAT_JPEG; // for streaming
    config.grab_mode = CAMERA_GRAB_WHEN_EMPTY;
    config.fb_location = CAMERA_FB_IN_PSRAM;
    config.jpeg_quality = 12;
    config.fb_count = 1;
   
    // if PSRAM IC present, init with UXGA resolution and higher JPEG quality
    if(config.pixel_format == PIXFORMAT_JPEG){
        if(psramFound()){
            config.jpeg_quality = 10;
            config.fb_count = 2;
            config.grab_mode = CAMERA_GRAB_LATEST;
        } else {
            // Limit the frame size when PSRAM is not available
            config.frame_size = FRAMESIZE_SVGA;
            config.fb_location = CAMERA_FB_IN_DRAM;
        }
    } else {
        // Best option for face detection/recognition
        config.frame_size = FRAMESIZE_240X240;
    #if CONFIG_IDF_TARGET_ESP32S3
        config.fb_count = 2;
    #endif
    }

    // camera init
    esp_err_t err = esp_camera_init(&config);
    if (err != ESP_OK) {
        Serial.printf("Camera init failed with error 0x%x", err);
        return;
    }
   
    camera_sign = true; // Camera initialization check passes

    // Initialize SD card
    if(!SD.begin(21)){
        Serial.println("Card Mount Failed");
        return;
    }
    uint8_t cardType = SD.cardType();

    // Determine if the type of SD card is available
    if(cardType == CARD_NONE){
        Serial.println("No SD card attached");
        return;
    }

    Serial.print("SD Card Type: ");
    if(cardType == CARD_MMC){
        Serial.println("MMC");
    } else if(cardType == CARD_SD){
        Serial.println("SDSC");
    } else if(cardType == CARD_SDHC){
        Serial.println("SDHC");
    } else {
        Serial.println("UNKNOWN");
    }

    sd_sign = true; // SD initialization check passes

    // Delete all files and create folders
    deleteAllFiles(SD);      // Delete all files on boot
    createFolders(SD);       // Create "pictures" and "encoded" folders

    Serial.println("Photos will begin in one minute, please be ready.");
}

void loop() {
    if (touchRead(4) <= 25000) {
        button = 0;
    }  
 
    if (touchRead(4) >= 25000 && button == 0) {  
        delay(500);
        if (touchRead(4) >= 25000 && button == 0) {
            char filename[64];
            sprintf(filename, "/pictures/image%d.jpg", imageCount);  // Save to the pictures folder
            photo_save(filename);
            Serial.printf("Saved picture: %s\r\n", filename);
            imageCount++;
            button = 1;
        }
    }
    delay(50);
}

#include "esp_camera.h"
#include "FS.h"
#include "SD.h"
#include "SPI.h"
#include "WiFi.h"
#include <WiFiClientSecure.h>
#include <ArduinoJson.h>
#include "Base64.h"
#include "ChatGPT.hpp"
#include "credentials.h" // WiFi credentials and OpenAI API key

#define CAMERA_MODEL_XIAO_ESP32S3 // Has PSRAM

#include "camera_pins.h"

int imageCount = 1;                // File Counter
bool camera_sign = false;          // Check camera status
bool sd_sign = false;              // Check sd status
int button = 0;    
const int buttonPin = 3;           // Pin where the button is connected  

WiFiClientSecure client;  // WiFiClientSecure for HTTPS connection
ChatGPT<WiFiClientSecure> chatGPT_Client(&client, "v1", openai_api_key, 60000);  // Use WiFiClientSecure for HTTPS

void connectToWiFi() {
    WiFi.begin(ssid, password);
    Serial.println("Connecting to WiFi...");
   
    // Wait until the device is connected to WiFi
    while (WiFi.status() != WL_CONNECTED) {
        delay(500);
        Serial.print(".");
    }
    Serial.println();
    Serial.print("Connected! IP address: ");
    Serial.println(WiFi.localIP());
}

// Function to delete all files in the root directory
void deleteAllFiles(fs::FS &fs) {
    File root = fs.open("/");
    File file = root.openNextFile();
    while (file) {
        fs.remove(file.name());  // Delete each file
        file = root.openNextFile();
    }
    Serial.println("All files deleted from SD card.");
}

// Function to create necessary folders
void createFolders(fs::FS &fs) {
    if (!fs.exists("/pictures")) {
        fs.mkdir("/pictures");
        Serial.println("Created folder: /pictures");
    }
    if (!fs.exists("/encoded")) {
        fs.mkdir("/encoded");
        Serial.println("Created folder: /encoded");
    }
}

// SD card write file
void writeFile(fs::FS &fs, const char * path, uint8_t * data, size_t len){
    Serial.printf("Writing file: %s\r\n", path);

    File file = fs.open(path, FILE_WRITE);
    if(!file){
        Serial.println("Failed to open file for writing");
        return;
    }
    if(file.write(data, len) == len){
        Serial.println("File written");
    } else {
        Serial.println("Write failed");
    }
    file.close();
}

// Save pictures to SD card and send to GPT-4o Mini Vision API
void photo_save_and_analyze(const char * fileName) {
    // Take a photo
    camera_fb_t *fb = esp_camera_fb_get();
    if (!fb) {
        Serial.println("Failed to get camera frame buffer");
        return;
    }

    // Encode image to Base64
    String encodedImage = base64::encode(fb->buf, fb->len);
   
    // Print the Base64-encoded image (optional, can comment this line to reduce log size)
    Serial.println("Base64 Encoded Image:");
    Serial.println(encodedImage);

    // Save photo to file in the /pictures directory
    writeFile(SD, fileName, fb->buf, fb->len);
 
    // Release image buffer
    esp_camera_fb_return(fb);

    Serial.println("Photo saved to file");

    // Prepare the data URL for the API request
    if (encodedImage.length() > 0) {
        String base64Image = "data:image/jpeg;base64," + encodedImage;
        String result;
        Serial.println("\n\n[ChatGPT] - Asking a Vision Question");

        // Send to the API
        if (chatGPT_Client.vision_question("gpt-4o", "user", "text", "What’s in this image?", "image_url", base64Image.c_str(), "auto", 5000, true, result)) {
            Serial.print("[ChatGPT] Response: ");
            Serial.println(result);
            encodedImage = "";
        } else {
            Serial.print("[ChatGPT] Error: ");
            Serial.println(result);
        }

        // Clear the Base64 encoded image
        encodedImage = ""; // Clear the base64 string after the API request
    } else {
        Serial.println("Encoded image is empty!");
    }
}

void setup() {
    Serial.begin(115200);
    while(!Serial); // When the serial monitor is turned on, the program starts to execute

    camera_config_t config;
    config.ledc_channel = LEDC_CHANNEL_0;
    config.ledc_timer = LEDC_TIMER_0;
    config.pin_d0 = Y2_GPIO_NUM;
    config.pin_d1 = Y3_GPIO_NUM;
    config.pin_d2 = Y4_GPIO_NUM;
    config.pin_d3 = Y5_GPIO_NUM;
    config.pin_d4 = Y6_GPIO_NUM;
    config.pin_d5 = Y7_GPIO_NUM;
    config.pin_d6 = Y8_GPIO_NUM;
    config.pin_d7 = Y9_GPIO_NUM;
    config.pin_xclk = XCLK_GPIO_NUM;
    config.pin_pclk = PCLK_GPIO_NUM;
    config.pin_vsync = VSYNC_GPIO_NUM;
    config.pin_href = HREF_GPIO_NUM;
    config.pin_sscb_sda = SIOD_GPIO_NUM;
    config.pin_sscb_scl = SIOC_GPIO_NUM;
    config.pin_pwdn = PWDN_GPIO_NUM;
    config.pin_reset = RESET_GPIO_NUM;
    config.xclk_freq_hz = 20000000;
    config.frame_size = FRAMESIZE_UXGA;
    config.pixel_format = PIXFORMAT_JPEG; // for streaming
    config.grab_mode = CAMERA_GRAB_WHEN_EMPTY;
    config.fb_location = CAMERA_FB_IN_PSRAM;
    config.jpeg_quality = 12;
    config.fb_count = 1;
   
    // if PSRAM IC present, init with UXGA resolution and higher JPEG quality
    if(config.pixel_format == PIXFORMAT_JPEG){
        if(psramFound()){
            config.jpeg_quality = 10;
            config.fb_count = 2;
            config.grab_mode = CAMERA_GRAB_LATEST;
        } else {
            // Limit the frame size when PSRAM is not available
            config.frame_size = FRAMESIZE_SVGA;
            config.fb_location = CAMERA_FB_IN_DRAM;
        }
    } else {
        // Best option for face detection/recognition
        config.frame_size = FRAMESIZE_240X240;
    #if CONFIG_IDF_TARGET_ESP32S3
        config.fb_count = 2;
    #endif
    }

    // camera init
    esp_err_t err = esp_camera_init(&config);
    if (err != ESP_OK) {
        Serial.printf("Camera init failed with error 0x%x", err);
        return;
    }
   
    camera_sign = true; // Camera initialization check passes

    // Initialize SD card
    if(!SD.begin(21)){
        Serial.println("Card Mount Failed");
        return;
    }
    uint8_t cardType = SD.cardType();

    // Determine if the type of SD card is available
    if(cardType == CARD_NONE){
        Serial.println("No SD card attached");
        return;
    }

    Serial.print("SD Card Type: ");
    if(cardType == CARD_MMC){
        Serial.println("MMC");
    } else if(cardType == CARD_SD){
        Serial.println("SDSC");
    } else if(cardType == CARD_SDHC){
        Serial.println("SDHC");
    } else {
        Serial.println("UNKNOWN");
    }

    sd_sign = true; // SD initialization check passes

    // Delete all files and create folders
    deleteAllFiles(SD);      // Delete all files on boot
    createFolders(SD);       // Create "pictures" and "encoded" folders

    Serial.println("Photos will begin in one minute, please be ready.");

    // Connect to WiFi
    connectToWiFi();
}

void loop() {
    if (touchRead(4) <= 25000) {
        button = 0;
    }  
 
    // If it has been more than 1 minute since the last shot, take a picture, save it to the SD card, and analyze it with GPT-4o Mini Vision API
    if (touchRead(4) >= 25000 && button == 0) {  
        delay(500);
        if (touchRead(4) >= 25000 && button == 0) {
            char filename[64];
            sprintf(filename, "/pictures/image%d.jpg", imageCount);  // Save to the pictures folder only
            photo_save_and_analyze(filename);
            Serial.printf("Saved and analyzed picture: %s\r\n", filename);
            imageCount++;
            button = 1;
        }
    }
    delay(50);
}

Open Reddit thread
View more discussions →
FAQ

Common questions about GPT-4o Mini Vision

What is the context window size for GPT-4o Mini Vision?

GPT-4o Mini Vision supports a context window of 128,000 tokens, allowing large amounts of text and image content to be included in a single request.

What is the knowledge cutoff date for this model?

The training data cutoff for GPT-4o Mini Vision is October 2024, meaning it does not have knowledge of events that occurred after that date.

Does this model support image inputs?

Yes, GPT-4o Mini Vision is a multimodal model that accepts both text and image inputs within the same request, enabling visual question answering and image-based reasoning.

How does the pricing of GPT-4o Mini compare to other OpenAI models?

GPT-4o Mini is positioned as a low-cost model in OpenAI's lineup. For exact current pricing, refer to the OpenAI pricing page at platform.openai.com/docs/models.

What languages does GPT-4o Mini Vision support?

GPT-4o Mini Vision supports the same range of languages as GPT-4o, making it suitable for multilingual applications.

More models from OpenAI

Continue browsing adjacent models from the same provider.

← All AI Models