OpenAI

GPT-4o mini Transcribe

GPT-4o mini Transcribe is a speech-to-text model developed by OpenAI that uses the GPT-4o mini architecture to convert spoken audio into written text. It is designed to deliver improved word error rates and more accurate language recognition compared to the original Whisper-based transcription models. The model is part of OpenAI's transcription API offerings and became available in 2025. This model is well-suited for applications that require accurate transcripts from audio input, such as meeting notes, voice interfaces, and content captioning. Its use of the GPT-4o mini backbone allows it to handle a range of languages with improved recognition accuracy. Developers looking for a cost-efficient transcription option within the OpenAI ecosystem can use this model via the API.

Unknown 16,000 context 2,000 output
Audio Transcription Multi-Language Recognition Low Word Error Rate API Integration

Model Overview

|Field|Description|Value|
|:-|:-|:-|
|Provider|The entity that provides this model.|OpenAI|
|Input Context Window|The number of tokens supported by the input context window.|16,000 tokens|
|Maximum Output Tokens|The number of tokens that can be generated by the model in a single request.|2,000 tokens|
|Open Source|Whether the model's code is available for public use.|No|
|Release Date|When the model was first released.|Unknown|
|Knowledge Cut-off Date|When the model's knowledge was last updated.|Unknown|
|API Providers|The providers that offer this model. This is not an exhaustive list.|OpenAI API|
|Modalities|Types of data this model can process.|Text, Audio|

Capabilities

What GPT-4o mini Transcribe supports

* **Audio Transcription**: Converts spoken audio into written text using the GPT-4o mini model, with improved word error rates compared to original Whisper models.
* **Multi-Language Recognition**: Recognizes and transcribes speech across multiple languages with improved accuracy over earlier Whisper-based models.
* **Low Word Error Rate**: Optimized to reduce transcription mistakes, producing cleaner output text suitable for downstream processing or direct use.
* **API Integration**: Accessible via the OpenAI API, allowing developers to submit audio files and receive transcription results programmatically.
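As a rough illustration of that programmatic flow, here is a minimal sketch. It assumes the official `openai` Python SDK and its `client.audio.transcriptions.create` call; the helper name `transcribe_file` and the example file name are hypothetical, and the optional `language` hint (an ISO-639-1 code such as `"hi"`) is shown only as one way to steer language detection.

```python
from pathlib import Path

# Model identifier as it appears in the community threads on this page.
MODEL = "gpt-4o-mini-transcribe"


def transcribe_file(client, audio_path, language=None):
    """Send one audio file for transcription and return the transcript text.

    `client` is assumed to behave like `openai.OpenAI()`, i.e. to expose
    `client.audio.transcriptions.create(...)`. `language` is an optional
    ISO-639-1 hint; leave it as None to let the service detect the language.
    """
    kwargs = {"model": MODEL}
    if language:
        kwargs["language"] = language
    # Open in binary mode; the SDK uploads the file object as multipart data.
    with Path(audio_path).open("rb") as audio:
        result = client.audio.transcriptions.create(file=audio, **kwargs)
    return result.text


# Hypothetical usage (needs `pip install openai` and OPENAI_API_KEY set):
#
#   from openai import OpenAI
#   print(transcribe_file(OpenAI(), "meeting.wav"))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub, without network access or an API key.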

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

OpenAI API

Community discussion

What people think about GPT-4o mini Transcribe

GPT-4o mini Transcribe discussions are most active in r/LocalLLaMA, r/OpenAI, and r/speechtotext. The top Reddit threads are mostly benchmarks and model comparisons.

The strongest match in this snapshot has 157 upvotes and 45 comments.

r/OpenAI 157 upvotes 45 comments April 9, 2025
GPT-4o-transcribe outperforms Whisper-large

I just found out that OpenAI released two new closed-source speech-to-text models three weeks ago (gpt-4o-transcribe and gpt-4o-mini-transcribe). Since I hadn't heard of them, I suspect this might be news for some of you too.

The main takeaways:

* According to their own benchmarks, they outperform Whisper V3 across most languages. Independent testing from Artificial Analysis confirms this.
* gpt-4o-mini-transcribe is priced at half the cost of the Whisper API endpoint.
* Apart from the improved accuracy, the API remains quite limited though (max. file size of 25MB, no speaker diarization, no word-level timestamps). Since it’s a closed-source model, the community cannot really address these issues, apart from applying some “hacks” like batching inputs and aligning with a separate PyAnnote pipeline.
* Some users experience significant [latency issues and unstable transcription](https://community.openai.com/t/gpt-4o-mini-transcribe-and-gpt-4o-transcribe-not-as-good-as-whisper/1153905) results with the new API, leading some to revert to Whisper.
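The "batching inputs" workaround for the 25MB limit can be sketched as a chunk planner. This is only an illustration under assumptions: the function name `plan_chunks` is hypothetical, the limit is taken as 25 * 1024 * 1024 bytes, the audio is assumed to have a constant byte rate (as uncompressed PCM WAV does), and a safety margin leaves room for file headers. Actually cutting the audio at these offsets is left to whatever audio tool you already use.

```python
# Assumed interpretation of the "25MB" upload limit mentioned above.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def plan_chunks(duration_s, bytes_per_second,
                max_bytes=MAX_UPLOAD_BYTES, safety=0.95, overlap_s=0.0):
    """Return (start_s, end_s) offsets so every chunk fits under the limit.

    `overlap_s` optionally repeats a little audio between neighbouring
    chunks so that words cut at a boundary appear whole in one of them.
    """
    if duration_s <= 0:
        return []
    # Longest chunk, in seconds, that stays below the (reduced) byte budget.
    max_len = (max_bytes * safety) / bytes_per_second
    if overlap_s >= max_len:
        raise ValueError("overlap must be shorter than one chunk")
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start += max_len - overlap_s
    return chunks


# Example: one hour of 16 kHz, 16-bit mono PCM is 32,000 bytes per second.
one_hour = plan_chunks(duration_s=3600, bytes_per_second=32_000)
```

Each planned chunk would then be transcribed separately and the texts concatenated; speaker labels would still have to come from a separate diarization pipeline, as the post notes.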

If you’d like to learn more: I wrote a short [blog post](https://scribewave.com/blog/openai-launches-gpt-4o-transcribe-a-powerful-yet-limited-transcription-model) about it. I tried it out and it passes my “vibe check” but I’ll make sure to evaluate it more thoroughly in the coming days.

r/speechtotext 1 upvote November 20, 2025
Transcription by gpt-4o-mini-transcribe

I have access to recordings from Indian users who want to talk to an AI best friend or girlfriend. They recorded on their own devices, in Hindi, Bangla, Gujarati, or Punjabi. The transcription turns their noisy, quiet speech into Urdu text. We can't fix their devices to have better mics, and we can't move to a more accurate model because we need low latency and low cost. Is there any model better than gpt-4o-mini-transcribe? If anyone else has had the same problem, can you tell me how you solved it?

\#transcription #gptmodel

r/OpenWebUI 17 upvotes 25 comments May 10, 2026
What are you using for STT and TTS (voice chats)?

Hi all,

My STT & TTS setup is poor in Open WebUI.
What I really want is [chatgpt.com](http://chatgpt.com) levels of voice support (STT and TTS).

Things I have tried:

**STT:** OpenAI API directly (`whisper-1` model)
**TTS:** ElevenLabs (any model)
**Verdict:** Working but very slow, feels like 20s between each message. Also non-English support is poor.

Then I tried using models from OpenRouter like `openai/gpt-audio`, but this isn't supported:
[https://github.com/open-webui/open-webui/discussions/20855](https://github.com/open-webui/open-webui/discussions/20855)

I tried using OpenRouter STT & TTS models directly:

**STT:** OpenRouter (https://openrouter.ai/api/v1), openai/gpt-4o-mini-transcribe model
**TTS:** OpenRouter (https://openrouter.ai/api/v1), openai/gpt-4o-mini-tts-2025-12-15 model
**Verdict:** Doesn't work. I get this error when trying to use a voice model with the above config: `Error transcribing chunk: External: No number after minus sign in JSON at position 1 (line 1 column 2)`

Has anyone managed to make these models work with Open WebUI?

Ultimately, I am wondering what people are using for STT/TTS?

What I want is:

* Fast or reasonable response time
* English & non-English support (I want Persian language specifically which can be hit or miss)
* Open WebUI support for it

Happy to use API based billing on cloud (ideally via OpenRouter), or local models, or whatever else.


It looks like OpenAI is preparing for a massive push into affordable **Voice Agents.**

**New models** have just appeared in the API dropdown (noticed by developers):

**gpt-realtime-mini-2025-12-15**

**gpt-4o-mini-tts-2025-12-15**

**gpt-4o-mini-transcribe-2025-12-15**

Until now, the **Realtime API** (which allows for human-like interruptions and emotion) was extremely expensive. Releasing a **"Mini"** version implies they have successfully distilled the audio capabilities into a smaller, cheaper model.

This likely opens the floodgates for **"Voice Mode"** capabilities in third-party apps that couldn't afford the main model.

**Does this mean we are getting a free tier for "Advanced Voice Mode" in ChatGPT soon? Usually, API drops precede consumer rollouts.**


**TL;DR:** I updated my medical speech-to-text benchmark to **42 models** (up from 31 in v3) and added a new metric: **Medical WER (M-WER)**.

Standard WER treats every word equally. In medical audio, that makes little sense — **“yeah” and “amoxicillin” do not carry the same importance**.

So for v4 I re-scored the benchmark using only **clinically relevant words**: drugs, conditions, symptoms, anatomy, and clinical procedures. I also broke out **Drug M-WER** separately, since medication names are where patient-safety risk gets real.

That change reshuffled the leaderboard hard.

A few notable results:

* **VibeVoice-ASR 9B** ranks **#3** on M-WER and beats Microsoft’s own new closed **MAI-Transcribe-1**, which lands at **#11**
* **Parakeet TDT 0.6B v3** drops from a strong overall-WER position to **#31** on M-WER because of weak drug-name performance
* **Qwen3-ASR 1.7B** is the most interesting small local model this round: **4.40% M-WER** and about **7s/file on A10**
* Cloud APIs were stronger than I expected: **Soniox, AssemblyAI Universal-3 Pro, and Deepgram Nova-3 Medical** all ended up genuinely competitive

All code, transcripts, per-file metrics, and the full leaderboard are open-source on GitHub.

**Previous posts**: [v1](https://www.reddit.com/r/LocalLLaMA/comments/1md1fka/) · [v2](https://www.reddit.com/r/LocalLLaMA/comments/1pzmwzh/) · [v3](https://www.reddit.com/r/LocalLLaMA/comments/1s4z18o/)

# What changed since v3

# 1. New headline metric: Medical WER (M-WER)

Standard WER is still useful, but in a doctor-patient conversation it overweights the wrong things. A missed filler word and a missed medication name both count as one error, even though only one is likely to matter clinically.

So for v4 I added:

* **M-WER** = WER computed only over medically relevant reference tokens
* **Drug M-WER** = same idea, but restricted to drug names only

The current vocabulary covers **179 terms** across 5 categories:

* drugs
* conditions
* symptoms
* anatomy
* clinical procedures

The reshuffle is real. **Parakeet TDT 0.6B v3** looked great on normal WER in v3, but on M-WER it falls to **#31**, with **22% Drug M-WER**. Great at conversational glue, much weaker on the words that actually carry clinical meaning.

# 2. 11 new models added (31 → 42)

This round added a bunch of new serious contenders:

* **Soniox stt-async-v4** → **#4** on M-WER
* **AssemblyAI Universal-3 Pro** (`domain: medical-v1`) → **#7**
* **Deepgram Nova-3 Medical** → **#9**
* **Microsoft MAI-Transcribe-1** → **#11**
* **Qwen3-ASR 1.7B** → **#8**, best small open-source model this round
* **Cohere Transcribe (Mar 2026)** → **#18**, extremely fast
* **Parakeet TDT 1.1B** → **#15**
* **Facebook MMS-1B-all** → **#42 dead last** on this dataset

Also added a separate **multi-speaker track** with **Multitalker Parakeet 0.6B** using **cpWER**, since joint ASR + diarization is a different evaluation problem.

# Top 20 by Medical WER

Dataset: **PriMock57** — 55 doctor-patient consultations, \~80K words of British English medical dialogue.

|\#|Model|WER|M-WER|Drug M-WER|Speed|Host|
|:-|:-|:-|:-|:-|:-|:-|
|1|Google Gemini 3 Pro Preview|8.35%|2.65%|3.1%|64.5s|API|
|2|Google Gemini 2.5 Pro|8.15%|2.97%|4.1%|56.4s|API|
|3|**VibeVoice-ASR 9B (Microsoft, open-source)**|8.34%|**3.16%**|5.6%|96.7s|H100|
|4|Soniox stt-async-v4|9.18%|3.32%|7.1%|46.2s|API|
|5|Google Gemini 3 Flash Preview|11.33%|3.64%|5.2%|51.5s|API|
|6|ElevenLabs Scribe v2|9.72%|3.86%|4.3%|43.5s|API|
|7|AssemblyAI Universal-3 Pro (medical-v1)|9.55%|4.02%|6.5%|37.3s|API|
|8|**Qwen3 ASR 1.7B (open-source)**|9.00%|**4.40%**|8.6%|6.8s|A10|
|9|Deepgram Nova-3 Medical|9.05%|4.53%|9.7%|12.9s|API|
|10|OpenAI GPT-4o Mini Transcribe (Dec '25)|11.18%|4.85%|10.6%|40.4s|API|
|11|**Microsoft MAI-Transcribe-1**|11.52%|**4.85%**|11.2%|21.8s|API|
|12|ElevenLabs Scribe v1|10.87%|4.88%|7.5%|36.3s|API|
|13|Google Gemini 2.5 Flash|9.45%|5.01%|10.3%|20.2s|API|
|14|Voxtral Mini Transcribe V1|11.85%|5.17%|11.0%|22.4s|API|
|15|Parakeet TDT 1.1B|9.03%|5.20%|15.5%|12.3s|T4|
|16|Voxtral Mini Transcribe V2|11.64%|5.36%|12.1%|18.4s|API|
|17|Voxtral Mini 4B Realtime|11.89%|5.39%|11.8%|270.9s|A10|
|18|Cohere Transcribe (Mar 2026)|11.81%|5.59%|16.6%|3.9s|A10|
|19|OpenAI Whisper-1|13.20%|5.62%|10.3%|104.3s|API|
|20|Groq Whisper Large v3 Turbo|12.14%|5.75%|14.4%|8.0s|API|

Full 42-model leaderboard on [GitHub](https://github.com/Omi-Health/medical-STT-eval).
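To see how the choice of metric reorders models, here is a small sketch that ranks a handful of rows copied from the table above, first by standard WER and then by M-WER. The selection of rows and the helper name `ranked` are only for illustration; the numbers are the ones reported in the table.

```python
# (model, WER %, M-WER %) copied from the Top 20 table above.
ROWS = [
    ("Google Gemini 3 Pro Preview", 8.35, 2.65),
    ("Google Gemini 2.5 Pro", 8.15, 2.97),
    ("Qwen3 ASR 1.7B", 9.00, 4.40),
    ("OpenAI GPT-4o Mini Transcribe (Dec '25)", 11.18, 4.85),
    ("Parakeet TDT 1.1B", 9.03, 5.20),
]


def ranked(rows, metric):
    """Return model names ordered from best (lowest) to worst on `metric`."""
    column = {"wer": 1, "m_wer": 2}[metric]
    return [row[0] for row in sorted(rows, key=lambda row: row[column])]


by_wer = ranked(ROWS, "wer")
by_m_wer = ranked(ROWS, "m_wer")

for position, (a, b) in enumerate(zip(by_wer, by_m_wer), start=1):
    print(f"{position}. by WER: {a:<42} by M-WER: {b}")
```

Within this subset, Parakeet TDT 1.1B sits ahead of GPT-4o Mini Transcribe on standard WER but behind it on M-WER, which is the kind of reshuffle the post describes.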

# The funny part: Microsoft vs Microsoft

Microsoft now has two visible STT offerings in this benchmark:

* **VibeVoice-ASR 9B** — open-source, from Microsoft Research
* **MAI-Transcribe-1** — closed, newly shipped by Microsoft's new SuperIntelligence team available through Azure Foundry.

And on the metric that actually matters for medical voice, the open model wins clearly:

* **VibeVoice-ASR 9B** → **#3**, **3.16% M-WER**
* **MAI-Transcribe-1** → **#11**, **4.85% M-WER**

So Microsoft’s own open-source release beats Microsoft’s flagship closed STT product by:

* **1.7 absolute points of M-WER**
* **5.6 absolute points of Drug M-WER**

VibeVoice is very good, but it is also heavy: **9B params**, long inference, and we ran it on **H100 96GB**. So it wins on contextual medical accuracy, but not on deployability.

# Best small open-source model: Qwen3-ASR 1.7B

This is probably the most practically interesting open-source result in the whole board.

**Qwen3-ASR 1.7B** lands at:

* **9.00% WER**
* **4.40% M-WER**
* **8.6% Drug M-WER**
* about **6.8s/file on A10**

That is a strong accuracy-to-cost tradeoff.

It is much faster than VibeVoice, much smaller, and still good enough on medical terms that I think a lot of people building local or semi-local clinical voice stacks will care more about this result than the #1 spot.

One important deployment caveat: **Qwen3-ASR does not play nicely with T4**. The model path wants newer attention support and ships in **bf16**, so **A10 or better** is the realistic target.

There was also a nasty long-audio bug in the default vLLM setup: Qwen3 would silently hang on longer files. The practical fix was:

`max_num_batched_tokens=16384`

That one-line change fixed it for us. Full notes are in the repo’s `AGENTS.md`.
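For readers unsure where that setting goes, here is a hedged sketch. It assumes the model is loaded through vLLM's Python entry point (`vllm.LLM`), which forwards engine arguments such as `max_num_batched_tokens` and `dtype`; the helper name is hypothetical and the model identifier is deliberately left as a parameter, since the post does not state it. The repo's `AGENTS.md` remains the authoritative description of the actual setup.

```python
def qwen3_asr_engine_kwargs(model_id):
    """Collect the engine arguments described in the post.

    `model_id` is whatever checkpoint path or hub id you actually use;
    it is not specified in the post, so it is not guessed here.
    """
    return {
        "model": model_id,
        # The one-line fix reported for the silent hang on long audio.
        "max_num_batched_tokens": 16384,
        # The post says the model ships in bf16, hence A10 or newer.
        "dtype": "bfloat16",
    }


# Hypothetical usage (requires vLLM and a supported GPU):
#
#   from vllm import LLM
#   llm = LLM(**qwen3_asr_engine_kwargs("<your-model-id>"))
```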

# Cloud APIs got serious this round

v3 was still mostly a Google / ElevenLabs / OpenAI / Mistral story.

v4 broadened that a lot:

* **Soniox (#4)** — impressive for a universal model without explicit medical specialization
* **AssemblyAI Universal-3 Pro (#7)** — very solid, especially with `medical-v1`
* **Deepgram Nova-3 Medical (#9)** — fastest serious cloud API in the top group
* **Microsoft MAI-Transcribe-1 (#11)** — weaker than I expected, but still competitive

Google still dominates the very top, but the broader takeaway is different:

**the gap between strong cloud APIs and strong open-source models is now small enough that deployment constraints matter more than ever.**

# How M-WER is computed

The implementation is simple on purpose:

1. Tag medically relevant words in the **reference transcript**
2. Run normal WER alignment between reference and hypothesis
3. Count substitutions / deletions / insertions only on those tagged medical tokens
4. Compute:
   * **M-WER** over all medical tokens
   * **Drug M-WER** over the drug subset only

Current vocab:

* **179 medical terms**
* **5 categories**
* **464 drug-term occurrences** in PriMock57

The vocabulary file is in `evaluate/medical_terms_list.py` and is easy to extend.
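The steps above can be sketched in a few lines. This is not the repository's code, just a minimal illustration under stated assumptions: tokens are lowercased and split on whitespace, the alignment is a plain Levenshtein alignment, and an insertion is counted only when the inserted hypothesis word is itself in the vocabulary (the post does not say how insertions are attributed). The function names and the one-word vocabulary are hypothetical.

```python
def align(ref, hyp):
    """Levenshtein-align two token lists; return (op, ref_idx, hyp_idx) tuples."""
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match or substitution
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # backtrace from the bottom-right corner
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if same else 1):
            ops.append(("ok" if same else "sub", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", i - 1, None))
            i -= 1
        else:
            ops.append(("ins", None, j - 1))
            j -= 1
    return ops[::-1]


def wer(reference, hypothesis):
    """Standard WER: every reference word counts equally."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    errors = sum(1 for op, _, _ in align(ref, hyp) if op != "ok")
    return errors / len(ref)


def medical_wer(reference, hypothesis, vocab):
    """WER restricted to reference tokens that appear in `vocab`."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    tagged = sum(1 for word in ref if word in vocab)
    if tagged == 0:
        return 0.0
    errors = 0
    for op, ri, hj in align(ref, hyp):
        if op in ("sub", "del") and ref[ri] in vocab:
            errors += 1
        elif op == "ins" and hyp[hj] in vocab:  # assumption, see lead-in
            errors += 1
    return errors / tagged


VOCAB = {"amoxicillin"}  # illustrative; the real list has 179 terms
REF = "take amoxicillin daily yeah"
```

With this reference, a hypothesis that only drops the filler word ("take amoxicillin daily") scores 25% WER but 0% M-WER, while one that also misspells the drug name ("take amoxicilin daily") scores 50% WER and 100% M-WER. Restricting `vocab` to drug names gives the Drug M-WER variant.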

# Links

* **GitHub**: [https://github.com/Omi-Health/medical-STT-eval](https://github.com/Omi-Health/medical-STT-eval)
* Full 42-model leaderboard, evaluation code, per-file transcripts, and per-file metrics are all open-source
* Qwen3 long-audio debugging notes are documented in `AGENTS.md`

Happy to take questions, criticism on the metric design, or suggestions for v5.

FAQ

Common questions about GPT-4o mini Transcribe

What is GPT-4o mini Transcribe?

GPT-4o mini Transcribe is a speech-to-text model from OpenAI that uses the GPT-4o mini architecture to transcribe audio. It offers improved word error rates and better language recognition compared to the original Whisper models.

How does this model differ from Whisper?

According to OpenAI's overview, GPT-4o mini Transcribe offers improvements to word error rate and better language recognition and accuracy compared to the original Whisper models, as it is powered by the GPT-4o mini architecture rather than the Whisper architecture.

Does GPT-4o mini Transcribe have a context window?

The metadata on this page lists a 16,000-token input context window and a maximum of 2,000 output tokens. Note that this is a speech-to-text model rather than a text generation model.

What audio formats or input types does this model accept?

The model accepts audio input for transcription. Specific supported audio formats are defined by the OpenAI API documentation; refer to OpenAI's official API reference for the full list of accepted file types.

When was GPT-4o mini Transcribe released?

GPT-4o mini Transcribe was added to MindStudio on June 4, 2025. OpenAI also released updated versions of their transcription models in December 2025 according to community reports.
