Image Understanding
Accepts image inputs alongside text in a single request, enabling the model to describe, analyze, or answer questions about visual content.
GPT-4o Mini Vision is a multimodal language model developed by OpenAI, released in mid-2024. It is a smaller, more cost-efficient variant of the GPT-4o family, designed to process both text and images within a single context window of 128,000 tokens. The model supports the same range of languages as GPT-4o and is optimized for low latency, making it suitable for high-throughput or real-time applications. The model is well-suited for tasks that require fast responses at scale, such as customer-facing chat interfaces, document analysis with visual content, and pipelines where cost per token is a primary constraint. Its multimodal reasoning capability allows it to interpret images alongside text in the same request. Developers working with large volumes of context or needing to process mixed text-and-image inputs at reduced cost are the primary intended audience.
High-signal model metadata in a structured two-column overview table.
The entity that provides this model.
The number of tokens supported by the input context window.
The number of tokens that can be generated by the model in a single request.
Whether the model's code is available for public use.
When the model was first released.
When the model's knowledge was last updated.
The providers that offer this model. This is not an exhaustive list.
Types of data this model can process.
A fuller summary of positioning, capabilities, and source-specific details for GPT-4o Mini Vision.
GPT-4o Mini Vision is a multimodal language model developed by OpenAI, released in mid-2024. It is a smaller, more cost-efficient variant of the GPT-4o family, designed to process both text and images within a single context window of 128,000 tokens. The model supports the same range of languages as GPT-4o and is optimized for low latency, making it suitable for high-throughput or real-time applications.
The model is well-suited for tasks that require fast responses at scale, such as customer-facing chat interfaces, document analysis with visual content, and pipelines where cost per token is a primary constraint. Its multimodal reasoning capability allows it to interpret images alongside text in the same request. Developers working with large volumes of context or needing to process mixed text-and-image inputs at reduced cost are the primary intended audience.
Accepts image inputs alongside text in a single request, enabling the model to describe, analyze, or answer questions about visual content.
Supports up to 128,000 tokens of context per request, allowing long documents, conversation histories, or multiple images to be passed in one call.
Optimized for fast inference, making it suitable for real-time applications such as customer chat interfaces or interactive tools.
Priced significantly lower per token than larger GPT-4o variants, enabling high-volume deployments without proportional cost increases.
Supports the same broad set of languages as GPT-4o, covering text generation, comprehension, and reasoning across multiple languages.
Can return responses in structured formats such as JSON, useful for downstream data processing or API integrations.
Primary API pricing shown in the same “quick compare” spirit as the reference page.
Additional usage-cost dimensions synced into the project for this model.
Places where this model is available, based on the synced detail-page metadata.
The configurable options currently documented for this model.
Parameters currently listed by OpenRouter or the local catalog for this model.
Benchmark scores synced from the current model source and normalized into the local catalog.
| Benchmark | Score |
|---|---|
|
AIME 2024
American math olympiad problems
|
|
|
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
|
|
|
HLE
Questions that challenge frontier models across many domains
|
|
|
LiveCodeBench
Real-world coding tasks from recent competitions
|
|
|
MATH-500
Undergraduate and competition-level math problems
|
|
|
MMLU-Pro
Expert knowledge across 14 academic disciplines
|
|
|
SciCode
Scientific research coding and numerical methods
|
Official model cards, release notes, docs, and other references synced from the source page.
GPT-4o Mini Vision discussions are most active in r/OpenAI, r/OpenAIDev, r/arduino. Top Reddit threads cluster around benchmark and model-comparison threads, coding workflow discussions.
The strongest match in this snapshot has 69 upvotes and 34 comments.
Sorry if someone's posted this before but I couldn't see anything.
I find it a bit strange that OpenAI have made their GPT-4o mini functionally the same as the non-mini model for vision, by making each "image tile" more tokens in the mini vs the original 4o model.
[https://openai.com/api/pricing/](https://openai.com/api/pricing/)
GPT-4o:
150 x 150px image = 255 tokens (155 + 85 base tokens)
255 tokens = US$0.001275
GPT-4o mini:
150 x 150px image = 8500 tokens (5667 + 2833 base tokens)
8500 tokens = US$0.001275
I had a bit of a fun project in mind which would compare images, so I was super excited about a really cheap model (especially with their batch 50% discount) but it's a bit dissapointing that the discount doesn't carry over to images.
In contrast, Anthropic just use the formula \`tokens = (width px \* height px)/750\` and charge you the corresponding model's rate for the tokens, and for now Haiku is nearly 10x cheaper per image than 4o mini.
Note:
I did test that this isn't an error on their page, I compared two small images and got the following response. `CompletionUsage(completion_tokens=13, prompt_tokens=17128, total_tokens=17141)`
Edit:
Seems like it's official, there's a tweet from OpenAI acknowledging it
[https://x.com/romainhuet/status/1814054938986885550?t=AMFK4svMvCluYqAXUqRDMQ&s=19](https://x.com/romainhuet/status/1814054938986885550?t=AMFK4svMvCluYqAXUqRDMQ&s=19)
Hey everyone! I'm seeing some strange behavior with my GPT-4o mini vision API usage and hoping someone can shed some light on this.
My Setup:
- I have an app that uses GPT-4o mini vision to extract data from images
- Images are sent as base64 directly in the prompt
- No recent changes made to the application
What Changed:
- Average token usage dropped from 137k to 16k tokens (88% decrease)
- Error rate increased from 1.9% to 2.9%
This happened suddenly without any changes on my end. Has anyone else experienced something similar? Were there any recent pricing changes or updates to the API that might explain this?
Any insights would be greatly appreciated!
I am making a project using chat gpt's vision api with an esp32cam. Works for first loop (first picture it takes and sends to chat gpt), but the esp32 has "connection error" with chat gpt when i try to take another picture. Need help. Here is my code so far: (I have used chat gpt to try and fix the code but didn't work)
#include "esp_camera.h"
#include "FS.h"
#include "SD.h"
#include "SPI.h"
#include "mbedtls/base64.h" // For Base64 encoding
#include "WiFi.h" // Include Wi-Fi library
#include "wifi_credentials.h" // Include the file with Wi-Fi credentials
#define CAMERA_MODEL_XIAO_ESP32S3 // Has PSRAM
#include "camera_pins.h"
int imageCount = 1; // File Counter
bool camera_sign = false; // Check camera status
bool sd_sign = false; // Check sd status
int button = 0;
const int buttonPin = 3; // Pin where the button is connected
// Function to delete all files in the root directory
void deleteAllFiles(fs::FS &fs) {
File root = fs.open("/");
File file = root.openNextFile();
while (file) {
fs.remove(file.name()); // Delete each file
file = root.openNextFile();
}
Serial.println("All files deleted from SD card.");
}
// Function to create necessary folders
void createFolders(fs::FS &fs) {
if (!fs.exists("/pictures")) {
fs.mkdir("/pictures");
Serial.println("Created folder: /pictures");
}
if (!fs.exists("/encoded")) {
fs.mkdir("/encoded");
Serial.println("Created folder: /encoded");
}
}
// Save pictures to SD card in /pictures folder
void photo_save(const char * fileName) {
// Take a photo
camera_fb_t *fb = esp_camera_fb_get();
if (!fb) {
Serial.println("Failed to get camera frame buffer");
return;
}
// Save photo to file in the /pictures directory
writeFile(SD, fileName, fb->buf, fb->len);
// Base64 encode and save the image
encodeBase64AndSave(fb->buf, fb->len);
// Release image buffer
esp_camera_fb_return(fb);
Serial.println("Photo saved to file and encoded.");
}
// SD card write file
void writeFile(fs::FS &fs, const char * path, uint8_t * data, size_t len){
Serial.printf("Writing file: %s\r\n", path);
File file = fs.open(path, FILE_WRITE);
if(!file){
Serial.println("Failed to open file for writing");
return;
}
if(file.write(data, len) == len){
Serial.println("File written");
} else {
Serial.println("Write failed");
}
file.close();
}
// Function to Base64 encode the image and save it to the encoded folder
void encodeBase64AndSave(uint8_t *imageData, size_t len) {
// Calculate the output buffer size for Base64 encoded data
size_t encodedLen = (len * 4 / 3) + 4; // Base64 increases size by ~33%
char *encodedData = (char*) malloc(encodedLen); // Allocate memory for encoded data
if (encodedData == NULL) {
Serial.println("Failed to allocate memory for Base64 encoding");
return;
}
// Perform Base64 encoding
size_t outputLen;
int ret = mbedtls_base64_encode((unsigned char*)encodedData, encodedLen, &outputLen, imageData, len);
if (ret != 0) {
Serial.println("Failed to encode image to Base64");
free(encodedData);
return;
}
// Create the filename for the encoded file in the /encoded folder
char encodedFileName[64];
sprintf(encodedFileName, "/encoded/image%d.txt", imageCount); // Save Base64 data as a .txt file
// Save the encoded data to the SD card
writeFile(SD, encodedFileName, (uint8_t*)encodedData, outputLen);
free(encodedData); // Free allocated memory after encoding
}
// Function to connect to Wi-Fi
void connectToWiFi() {
WiFi.begin(WIFI_SSID, WIFI_PASSWORD);
Serial.print("Connecting to Wi-Fi");
// Wait until the ESP32 connects to the Wi-Fi
while (WiFi.status() != WL_CONNECTED) {
delay(500);
Serial.print(".");
}
Serial.println("");
Serial.println("Wi-Fi connected.");
Serial.print("IP address: ");
Serial.println(WiFi.localIP());
}
void setup() {
Serial.begin(115200);
while(!Serial); // When the serial monitor is turned on, the program starts to execute
// Connect to Wi-Fi
connectToWiFi();
camera_config_t config;
config.ledc_channel = LEDC_CHANNEL_0;
config.ledc_timer = LEDC_TIMER_0;
config.pin_d0 = Y2_GPIO_NUM;
config.pin_d1 = Y3_GPIO_NUM;
config.pin_d2 = Y4_GPIO_NUM;
config.pin_d3 = Y5_GPIO_NUM;
config.pin_d4 = Y6_GPIO_NUM;
config.pin_d5 = Y7_GPIO_NUM;
config.pin_d6 = Y8_GPIO_NUM;
config.pin_d7 = Y9_GPIO_NUM;
config.pin_xclk = XCLK_GPIO_NUM;
config.pin_pclk = PCLK_GPIO_NUM;
config.pin_vsync = VSYNC_GPIO_NUM;
config.pin_href = HREF_GPIO_NUM;
config.pin_sscb_sda = SIOD_GPIO_NUM;
config.pin_sscb_scl = SIOC_GPIO_NUM;
config.pin_pwdn = PWDN_GPIO_NUM;
config.pin_reset = RESET_GPIO_NUM;
config.xclk_freq_hz = 20000000;
config.frame_size = FRAMESIZE_UXGA;
config.pixel_format = PIXFORMAT_JPEG; // for streaming
config.grab_mode = CAMERA_GRAB_WHEN_EMPTY;
config.fb_location = CAMERA_FB_IN_PSRAM;
config.jpeg_quality = 12;
config.fb_count = 1;
// if PSRAM IC present, init with UXGA resolution and higher JPEG quality
if(config.pixel_format == PIXFORMAT_JPEG){
if(psramFound()){
config.jpeg_quality = 10;
config.fb_count = 2;
config.grab_mode = CAMERA_GRAB_LATEST;
} else {
// Limit the frame size when PSRAM is not available
config.frame_size = FRAMESIZE_SVGA;
config.fb_location = CAMERA_FB_IN_DRAM;
}
} else {
// Best option for face detection/recognition
config.frame_size = FRAMESIZE_240X240;
#if CONFIG_IDF_TARGET_ESP32S3
config.fb_count = 2;
#endif
}
// camera init
esp_err_t err = esp_camera_init(&config);
if (err != ESP_OK) {
Serial.printf("Camera init failed with error 0x%x", err);
return;
}
camera_sign = true; // Camera initialization check passes
// Initialize SD card
if(!SD.begin(21)){
Serial.println("Card Mount Failed");
return;
}
uint8_t cardType = SD.cardType();
// Determine if the type of SD card is available
if(cardType == CARD_NONE){
Serial.println("No SD card attached");
return;
}
Serial.print("SD Card Type: ");
if(cardType == CARD_MMC){
Serial.println("MMC");
} else if(cardType == CARD_SD){
Serial.println("SDSC");
} else if(cardType == CARD_SDHC){
Serial.println("SDHC");
} else {
Serial.println("UNKNOWN");
}
sd_sign = true; // SD initialization check passes
// Delete all files and create folders
deleteAllFiles(SD); // Delete all files on boot
createFolders(SD); // Create "pictures" and "encoded" folders
Serial.println("Photos will begin in one minute, please be ready.");
}
void loop() {
if (touchRead(4) <= 25000) {
button = 0;
}
if (touchRead(4) >= 25000 && button == 0) {
delay(500);
if (touchRead(4) >= 25000 && button == 0) {
char filename[64];
sprintf(filename, "/pictures/image%d.jpg", imageCount); // Save to the pictures folder
photo_save(filename);
Serial.printf("Saved picture: %s\r\n", filename);
imageCount++;
button = 1;
}
}
delay(50);
}
#include "esp_camera.h"
#include "FS.h"
#include "SD.h"
#include "SPI.h"
#include "WiFi.h"
#include <WiFiClientSecure.h>
#include <ArduinoJson.h>
#include "Base64.h"
#include "ChatGPT.hpp"
#include "credentials.h" // WiFi credentials and OpenAI API key
#define CAMERA_MODEL_XIAO_ESP32S3 // Has PSRAM
#include "camera_pins.h"
int imageCount = 1; // File Counter
bool camera_sign = false; // Check camera status
bool sd_sign = false; // Check sd status
int button = 0;
const int buttonPin = 3; // Pin where the button is connected
WiFiClientSecure client; // WiFiClientSecure for HTTPS connection
ChatGPT<WiFiClientSecure> chatGPT_Client(&client, "v1", openai_api_key, 60000); // Use WiFiClientSecure for HTTPS
void connectToWiFi() {
WiFi.begin(ssid, password);
Serial.println("Connecting to WiFi...");
// Wait until the device is connected to WiFi
while (WiFi.status() != WL_CONNECTED) {
delay(500);
Serial.print(".");
}
Serial.println();
Serial.print("Connected! IP address: ");
Serial.println(WiFi.localIP());
}
// Function to delete all files in the root directory
void deleteAllFiles(fs::FS &fs) {
File root = fs.open("/");
File file = root.openNextFile();
while (file) {
fs.remove(file.name()); // Delete each file
file = root.openNextFile();
}
Serial.println("All files deleted from SD card.");
}
// Function to create necessary folders
void createFolders(fs::FS &fs) {
if (!fs.exists("/pictures")) {
fs.mkdir("/pictures");
Serial.println("Created folder: /pictures");
}
if (!fs.exists("/encoded")) {
fs.mkdir("/encoded");
Serial.println("Created folder: /encoded");
}
}
// SD card write file
void writeFile(fs::FS &fs, const char * path, uint8_t * data, size_t len){
Serial.printf("Writing file: %s\r\n", path);
File file = fs.open(path, FILE_WRITE);
if(!file){
Serial.println("Failed to open file for writing");
return;
}
if(file.write(data, len) == len){
Serial.println("File written");
} else {
Serial.println("Write failed");
}
file.close();
}
// Save pictures to SD card and send to GPT-4o Mini Vision API
void photo_save_and_analyze(const char * fileName) {
// Take a photo
camera_fb_t *fb = esp_camera_fb_get();
if (!fb) {
Serial.println("Failed to get camera frame buffer");
return;
}
// Encode image to Base64
String encodedImage = base64::encode(fb->buf, fb->len);
// Print the Base64-encoded image (optional, can comment this line to reduce log size)
Serial.println("Base64 Encoded Image:");
Serial.println(encodedImage);
// Save photo to file in the /pictures directory
writeFile(SD, fileName, fb->buf, fb->len);
// Release image buffer
esp_camera_fb_return(fb);
Serial.println("Photo saved to file");
// Prepare the data URL for the API request
if (encodedImage.length() > 0) {
String base64Image = "data:image/jpeg;base64," + encodedImage;
String result;
Serial.println("\n\n[ChatGPT] - Asking a Vision Question");
// Send to the API
if (chatGPT_Client.vision_question("gpt-4o", "user", "text", "What’s in this image?", "image_url", base64Image.c_str(), "auto", 5000, true, result)) {
Serial.print("[ChatGPT] Response: ");
Serial.println(result);
encodedImage = "";
} else {
Serial.print("[ChatGPT] Error: ");
Serial.println(result);
}
// Clear the Base64 encoded image
encodedImage = ""; // Clear the base64 string after the API request
} else {
Serial.println("Encoded image is empty!");
}
}
void setup() {
Serial.begin(115200);
while(!Serial); // When the serial monitor is turned on, the program starts to execute
camera_config_t config;
config.ledc_channel = LEDC_CHANNEL_0;
config.ledc_timer = LEDC_TIMER_0;
config.pin_d0 = Y2_GPIO_NUM;
config.pin_d1 = Y3_GPIO_NUM;
config.pin_d2 = Y4_GPIO_NUM;
config.pin_d3 = Y5_GPIO_NUM;
config.pin_d4 = Y6_GPIO_NUM;
config.pin_d5 = Y7_GPIO_NUM;
config.pin_d6 = Y8_GPIO_NUM;
config.pin_d7 = Y9_GPIO_NUM;
config.pin_xclk = XCLK_GPIO_NUM;
config.pin_pclk = PCLK_GPIO_NUM;
config.pin_vsync = VSYNC_GPIO_NUM;
config.pin_href = HREF_GPIO_NUM;
config.pin_sscb_sda = SIOD_GPIO_NUM;
config.pin_sscb_scl = SIOC_GPIO_NUM;
config.pin_pwdn = PWDN_GPIO_NUM;
config.pin_reset = RESET_GPIO_NUM;
config.xclk_freq_hz = 20000000;
config.frame_size = FRAMESIZE_UXGA;
config.pixel_format = PIXFORMAT_JPEG; // for streaming
config.grab_mode = CAMERA_GRAB_WHEN_EMPTY;
config.fb_location = CAMERA_FB_IN_PSRAM;
config.jpeg_quality = 12;
config.fb_count = 1;
// if PSRAM IC present, init with UXGA resolution and higher JPEG quality
if(config.pixel_format == PIXFORMAT_JPEG){
if(psramFound()){
config.jpeg_quality = 10;
config.fb_count = 2;
config.grab_mode = CAMERA_GRAB_LATEST;
} else {
// Limit the frame size when PSRAM is not available
config.frame_size = FRAMESIZE_SVGA;
config.fb_location = CAMERA_FB_IN_DRAM;
}
} else {
// Best option for face detection/recognition
config.frame_size = FRAMESIZE_240X240;
#if CONFIG_IDF_TARGET_ESP32S3
config.fb_count = 2;
#endif
}
// camera init
esp_err_t err = esp_camera_init(&config);
if (err != ESP_OK) {
Serial.printf("Camera init failed with error 0x%x", err);
return;
}
camera_sign = true; // Camera initialization check passes
// Initialize SD card
if(!SD.begin(21)){
Serial.println("Card Mount Failed");
return;
}
uint8_t cardType = SD.cardType();
// Determine if the type of SD card is available
if(cardType == CARD_NONE){
Serial.println("No SD card attached");
return;
}
Serial.print("SD Card Type: ");
if(cardType == CARD_MMC){
Serial.println("MMC");
} else if(cardType == CARD_SD){
Serial.println("SDSC");
} else if(cardType == CARD_SDHC){
Serial.println("SDHC");
} else {
Serial.println("UNKNOWN");
}
sd_sign = true; // SD initialization check passes
// Delete all files and create folders
deleteAllFiles(SD); // Delete all files on boot
createFolders(SD); // Create "pictures" and "encoded" folders
Serial.println("Photos will begin in one minute, please be ready.");
// Connect to WiFi
connectToWiFi();
}
void loop() {
if (touchRead(4) <= 25000) {
button = 0;
}
// If it has been more than 1 minute since the last shot, take a picture, save it to the SD card, and analyze it with GPT-4o Mini Vision API
if (touchRead(4) >= 25000 && button == 0) {
delay(500);
if (touchRead(4) >= 25000 && button == 0) {
char filename[64];
sprintf(filename, "/pictures/image%d.jpg", imageCount); // Save to the pictures folder only
photo_save_and_analyze(filename);
Serial.printf("Saved and analyzed picture: %s\r\n", filename);
imageCount++;
button = 1;
}
}
delay(50);
}
GPT-4o Mini Vision supports a context window of 128,000 tokens, allowing large amounts of text and image content to be included in a single request.
The training data cutoff for GPT-4o Mini Vision is October 2024, meaning it does not have knowledge of events that occurred after that date.
Yes, GPT-4o Mini Vision is a multimodal model that accepts both text and image inputs within the same request, enabling visual question answering and image-based reasoning.
GPT-4o Mini is positioned as a low-cost model in OpenAI's lineup. For exact current pricing, refer to the OpenAI pricing page at platform.openai.com/docs/models.
GPT-4o Mini Vision supports the same range of languages as GPT-4o, making it suitable for multilingual applications.
Continue browsing adjacent models from the same provider.