Audio Transcription
Converts spoken audio into written text using the GPT-4o mini model, with improved word error rates compared to original Whisper models.
GPT-4o mini Transcribe is a speech-to-text model developed by OpenAI that uses the GPT-4o mini architecture to convert spoken audio into written text. It is designed to deliver improved word error rates and more accurate language recognition compared to the original Whisper-based transcription models. The model is part of OpenAI's transcription API offerings and became available in 2025.
This model is well-suited for applications that require accurate transcripts from audio input, such as meeting notes, voice interfaces, and content captioning. Its use of the GPT-4o mini backbone allows it to handle a range of languages with improved recognition accuracy. Developers looking for a cost-efficient transcription option within the OpenAI ecosystem can use this model via the API.
Recognizes and transcribes speech across multiple languages with improved accuracy over earlier Whisper-based models.
Optimized to reduce transcription mistakes, producing cleaner output text suitable for downstream processing or direct use.
Accessible via the OpenAI API, allowing developers to submit audio files and receive transcription results programmatically.
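As an illustration of that programmatic flow, here is a minimal sketch that assumes the official `openai` Python SDK and its `audio.transcriptions.create` call. The helper name `transcribe_file` and the file name in the usage comment are hypothetical, and the client is passed in so the helper can be exercised without network access.

```python
from pathlib import Path

MODEL = "gpt-4o-mini-transcribe"


def transcribe_file(client, audio_path, model=MODEL):
    """Send one audio file to the transcription endpoint and return its text.

    `client` is assumed to behave like `openai.OpenAI()`, i.e. to expose
    `client.audio.transcriptions.create(model=..., file=...)`.
    """
    with Path(audio_path).open("rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text


# Usage sketch (needs `pip install openai` and OPENAI_API_KEY set):
#   from openai import OpenAI
#   print(transcribe_file(OpenAI(), "meeting.mp3"))  # hypothetical file name
```

Passing the client in, rather than creating it inside the helper, keeps credentials and retry policy in the caller's hands.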
GPT-4o mini Transcribe discussions are most active in r/LocalLLaMA, r/OpenAI, and r/speechtotext. The top Reddit threads cluster around benchmarks and model comparisons.
The strongest match in this snapshot has 157 upvotes and 45 comments.
I just found out that OpenAI released two new closed-source speech-to-text models three weeks ago (gpt-4o-transcribe and gpt-4o-mini-transcribe). Since I hadn't heard of them, I suspect this might be news for some of you too.
The main takeaways:
* According to their own benchmarks, they outperform Whisper V3 across most languages. Independent testing from Artificial Analysis confirms this.
* gpt-4o-mini-transcribe costs half as much as the Whisper API endpoint
* Apart from the improved accuracy, the API remains quite limited though (max. file size of 25MB, no speaker diarization, no word-level timestamps). Since it’s a closed-source model, the community cannot really address these issues, apart from applying some “hacks” like batching inputs and aligning with a separate PyAnnote pipeline.
* Some users experience significant [latency issues and unstable transcription results](https://community.openai.com/t/gpt-4o-mini-transcribe-and-gpt-4o-transcribe-not-as-good-as-whisper/1153905) with the new API, leading some to revert to Whisper
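One way to approach the batching workaround for the 25 MB limit is to plan time windows before cutting the audio. The sketch below is only illustrative: it assumes a roughly constant bitrate, the names `plan_chunks` and `safety` are hypothetical, and the actual cutting (for example with ffmpeg) is left out.

```python
import math

# 25 MB limit mentioned in the post; treating it as 25 * 1024 * 1024 bytes
# is an assumption.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def plan_chunks(file_size_bytes, duration_s, max_bytes=MAX_UPLOAD_BYTES, safety=0.9):
    """Return (start_s, end_s) windows so each chunk stays under the upload limit.

    Assumes bytes scale linearly with time (constant bitrate). `safety`
    leaves headroom for container overhead in each cut file.
    """
    if file_size_bytes <= 0 or duration_s <= 0:
        raise ValueError("file size and duration must be positive")
    budget = max_bytes * safety
    n_chunks = max(1, math.ceil(file_size_bytes / budget))
    chunk_s = duration_s / n_chunks
    return [(i * chunk_s, min(duration_s, (i + 1) * chunk_s)) for i in range(n_chunks)]
```

For a 60 MB, one-hour file this yields three 20-minute windows. Cutting at fixed times can split a word across two chunks, so a real pipeline would probably add a small overlap or cut at silences.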
If you’d like to learn more: I wrote a short [blog post](https://scribewave.com/blog/openai-launches-gpt-4o-transcribe-a-powerful-yet-limited-transcription-model) about it. I tried it out and it passes my “vibe check” but I’ll make sure to evaluate it more thoroughly in the coming days.
I have access to recordings from Indian users who want to talk to an AI best friend or girlfriend. They recorded on their own devices, in Hindi, Bangla, Gujarati, or Punjabi. The transcription turns their noisy, quiet speech into Urdu text. We can't fix their devices to give them better mics, and we can't switch to a more accurate model because we need low latency and low cost. Is there any model better than gpt-4o-mini-transcribe? If anyone else has had the same problem, can you tell me how you solved it?
\#transcription #gptmodel
Hi all,
My STT & TTS setup in Open WebUI is poor.
What I really want is [chatgpt.com](http://chatgpt.com) levels of voice support (STT and TTS).
Things I have tried:
**STT:** OpenAI API directly (`whisper-1` model)
**TTS:** ElevenLabs (any model)
**Verdict:** Working but very slow, feels like 20s between each message. Also non-English support is poor.
Then I tried using models from OpenRouter like `openai/gpt-audio`, but this isn't supported:
[https://github.com/open-webui/open-webui/discussions/20855](https://github.com/open-webui/open-webui/discussions/20855)
I tried using OpenRouter STT & TTS models directly:
**STT:** OpenRouter (https://openrouter.ai/api/v1), openai/gpt-4o-mini-transcribe model
**TTS:** OpenRouter (https://openrouter.ai/api/v1), openai/gpt-4o-mini-tts-2025-12-15 model
**Verdict:** Doesn't work. I get this error when trying to use the voice model with the above config: `Error transcribing chunk: External: No number after minus sign in JSON at position 1 (line 1 column 2)`
Has anyone managed to make these models work with Open WebUI?
Ultimately, I am wondering what people are using for STT/TTS?
What I want is:
* Fast or reasonable response time
* English & non-English support (I want Persian language specifically which can be hit or miss)
* Open WebUI support for it
Happy to use API based billing on cloud (ideally via OpenRouter), or local models, or whatever else.
It looks like OpenAI is preparing for a massive push into affordable **Voice Agents.**
**New models** have just appeared in the API dropdown (noticed by Developers):
**gpt-realtime-mini-2025-12-15**
**gpt-4o-mini-tts-2025-12-15**
**gpt-4o-mini-transcribe-2025-12-15**
Until now, the **Realtime API** (which allows for human-like interruptions and emotion) was extremely expensive. Releasing a **"Mini"** version implies they have successfully distilled the audio capabilities into a smaller, cheaper model.
This likely opens the floodgates for **"Voice Mode"** capabilities in third-party apps that couldn't afford the main model.
**Does this mean we are getting a free tier for "Advanced Voice Mode" in ChatGPT soon? Usually, API drops precede consumer rollouts.**
**TL;DR:** I updated my medical speech-to-text benchmark to **42 models** (up from 31 in v3) and added a new metric: **Medical WER (M-WER)**.
Standard WER treats every word equally. In medical audio, that makes little sense — **“yeah” and “amoxicillin” do not carry the same importance**.
So for v4 I re-scored the benchmark using only **clinically relevant words**: drugs, conditions, symptoms, anatomy, and clinical procedures. I also broke out **Drug M-WER** separately, since medication names are where patient-safety risk gets real.
That change reshuffled the leaderboard hard.
A few notable results:
* **VibeVoice-ASR 9B** ranks **#3** on M-WER and beats Microsoft’s own new closed **MAI-Transcribe-1**, which lands at **#11**
* **Parakeet TDT 0.6B v3** drops from a strong overall-WER position to **#31** on M-WER because of weak drug-name performance
* **Qwen3-ASR 1.7B** is the most interesting small local model this round: **4.40% M-WER** and about **7s/file on A10**
* Cloud APIs were stronger than I expected: **Soniox, AssemblyAI Universal-3 Pro, and Deepgram Nova-3 Medical** all ended up genuinely competitive
All code, transcripts, per-file metrics, and the full leaderboard are open-source on GitHub.
**Previous posts**: [v1](https://www.reddit.com/r/LocalLLaMA/comments/1md1fka/) · [v2](https://www.reddit.com/r/LocalLLaMA/comments/1pzmwzh/) · [v3](https://www.reddit.com/r/LocalLLaMA/comments/1s4z18o/)
# What changed since v3
# 1. New headline metric: Medical WER (M-WER)
Standard WER is still useful, but in a doctor-patient conversation it overweights the wrong things. A missed filler word and a missed medication name both count as one error, even though only one is likely to matter clinically.
So for v4 I added:
* **M-WER** = WER computed only over medically relevant reference tokens
* **Drug M-WER** = same idea, but restricted to drug names only
The current vocabulary covers **179 terms** across 5 categories:
* drugs
* conditions
* symptoms
* anatomy
* clinical procedures
The reshuffle is real. **Parakeet TDT 0.6B v3** looked great on normal WER in v3, but on M-WER it falls to **#31**, with **22% Drug M-WER**. Great at conversational glue, much weaker on the words that actually carry clinical meaning.
# 2. 11 new models added (31 → 42)
This round added a bunch of new serious contenders:
* **Soniox stt-async-v4** → **#4** on M-WER
* **AssemblyAI Universal-3 Pro** (`domain: medical-v1`) → **#7**
* **Deepgram Nova-3 Medical** → **#9**
* **Microsoft MAI-Transcribe-1** → **#11**
* **Qwen3-ASR 1.7B** → **#8**, best small open-source model this round
* **Cohere Transcribe (Mar 2026)** → **#18**, extremely fast
* **Parakeet TDT 1.1B** → **#15**
* **Facebook MMS-1B-all** → **#42 dead last** on this dataset
Also added a separate **multi-speaker track** with **Multitalker Parakeet 0.6B** using **cpWER**, since joint ASR + diarization is a different evaluation problem.
# Top 20 by Medical WER
Dataset: **PriMock57** — 55 doctor-patient consultations, \~80K words of British English medical dialogue.
|\#|Model|WER|M-WER|Drug M-WER|Speed|Host|
|:-|:-|:-|:-|:-|:-|:-|
|1|Google Gemini 3 Pro Preview|8.35%|2.65%|3.1%|64.5s|API|
|2|Google Gemini 2.5 Pro|8.15%|2.97%|4.1%|56.4s|API|
|3|**VibeVoice-ASR 9B (Microsoft, open-source)**|8.34%|**3.16%**|5.6%|96.7s|H100|
|4|Soniox stt-async-v4|9.18%|3.32%|7.1%|46.2s|API|
|5|Google Gemini 3 Flash Preview|11.33%|3.64%|5.2%|51.5s|API|
|6|ElevenLabs Scribe v2|9.72%|3.86%|4.3%|43.5s|API|
|7|AssemblyAI Universal-3 Pro (medical-v1)|9.55%|4.02%|6.5%|37.3s|API|
|8|**Qwen3 ASR 1.7B (open-source)**|9.00%|**4.40%**|8.6%|6.8s|A10|
|9|Deepgram Nova-3 Medical|9.05%|4.53%|9.7%|12.9s|API|
|10|OpenAI GPT-4o Mini Transcribe (Dec '25)|11.18%|4.85%|10.6%|40.4s|API|
|11|**Microsoft MAI-Transcribe-1**|11.52%|**4.85%**|11.2%|21.8s|API|
|12|ElevenLabs Scribe v1|10.87%|4.88%|7.5%|36.3s|API|
|13|Google Gemini 2.5 Flash|9.45%|5.01%|10.3%|20.2s|API|
|14|Voxtral Mini Transcribe V1|11.85%|5.17%|11.0%|22.4s|API|
|15|Parakeet TDT 1.1B|9.03%|5.20%|15.5%|12.3s|T4|
|16|Voxtral Mini Transcribe V2|11.64%|5.36%|12.1%|18.4s|API|
|17|Voxtral Mini 4B Realtime|11.89%|5.39%|11.8%|270.9s|A10|
|18|Cohere Transcribe (Mar 2026)|11.81%|5.59%|16.6%|3.9s|A10|
|19|OpenAI Whisper-1|13.20%|5.62%|10.3%|104.3s|API|
|20|Groq Whisper Large v3 Turbo|12.14%|5.75%|14.4%|8.0s|API|
Full 42-model leaderboard on [GitHub](https://github.com/Omi-Health/medical-STT-eval).
# The funny part: Microsoft vs Microsoft
Microsoft now has two visible STT offerings in this benchmark:
* **VibeVoice-ASR 9B** — open-source, from Microsoft Research
* **MAI-Transcribe-1** — closed, newly shipped by Microsoft's new SuperIntelligence team available through Azure Foundry.
And on the metric that actually matters for medical voice, the open model wins clearly:
* **VibeVoice-ASR 9B** → **#3**, **3.16% M-WER**
* **MAI-Transcribe-1** → **#11**, **4.85% M-WER**
So Microsoft’s own open-source release beats Microsoft’s flagship closed STT product by:
* **1.7 absolute points of M-WER**
* **5.6 absolute points of Drug M-WER**
VibeVoice is very good, but it is also heavy: **9B params**, long inference, and we ran it on **H100 96GB**. So it wins on contextual medical accuracy, but not on deployability.
# Best small open-source model: Qwen3-ASR 1.7B
This is probably the most practically interesting open-source result in the whole board.
**Qwen3-ASR 1.7B** lands at:
* **9.00% WER**
* **4.40% M-WER**
* **8.6% Drug M-WER**
* about **6.8s/file on A10**
That is a strong accuracy-to-cost tradeoff.
It is much faster than VibeVoice, much smaller, and still good enough on medical terms that I think a lot of people building local or semi-local clinical voice stacks will care more about this result than the #1 spot.
One important deployment caveat: **Qwen3-ASR does not play nicely with T4**. The model path wants newer attention support and ships in **bf16**, so **A10 or better** is the realistic target.
There was also a nasty long-audio bug in the default vLLM setup: Qwen3 would silently hang on longer files. The practical fix was:
`max_num_batched_tokens=16384`
That one-line change fixed it for us. Full notes are in the repo’s `AGENTS.md`.
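For readers who start vLLM from Python, the setting could be passed roughly as below. This is a sketch rather than the repo's actual launch code: it assumes vLLM's `LLM` class, which accepts `max_num_batched_tokens` as an engine argument, and the helper names and the model identifier argument are hypothetical placeholders.

```python
def build_engine_kwargs(model_id, max_num_batched_tokens=16384):
    """Collect vLLM engine arguments, including the long-audio workaround."""
    return {
        "model": model_id,
        # Value reported in the post as fixing the silent hang on long files.
        "max_num_batched_tokens": max_num_batched_tokens,
    }


def load_engine(model_id):
    """Create a vLLM engine; needs `pip install vllm` and a supported GPU."""
    from vllm import LLM  # imported lazily so the helper above works without vLLM

    return LLM(**build_engine_kwargs(model_id))
```

Keeping the arguments in a plain dict makes the workaround easy to spot and to override per deployment.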
# Cloud APIs got serious this round
v3 was still mostly a Google / ElevenLabs / OpenAI / Mistral story.
v4 broadened that a lot:
* **Soniox (#4)** — impressive for a universal model without explicit medical specialization
* **AssemblyAI Universal-3 Pro (#7)** — very solid, especially with `medical-v1`
* **Deepgram Nova-3 Medical (#9)** — fastest serious cloud API in the top group
* **Microsoft MAI-Transcribe-1 (#11)** — weaker than I expected, but still competitive
Google still dominates the very top, but the broader takeaway is different:
**the gap between strong cloud APIs and strong open-source models is now small enough that deployment constraints matter more than ever.**
# How M-WER is computed
The implementation is simple on purpose:
1. Tag medically relevant words in the **reference transcript**
2. Run normal WER alignment between reference and hypothesis
3. Count substitutions / deletions / insertions only on those tagged medical tokens
4. Compute:
* **M-WER** over all medical tokens
* **Drug M-WER** over the drug subset only
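The steps above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the code from the repo: it assumes whitespace tokenization, lowercase matching, single-word vocabulary terms, and that an insertion counts as a medical error only when the inserted word is itself in the vocabulary. The function names are hypothetical.

```python
def align(ref, hyp):
    """Levenshtein alignment; returns a list of (op, ref_token, hyp_token)."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)
    # Walk back from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def medical_wer(reference, hypothesis, medical_terms):
    """Errors on medical tokens divided by medical tokens in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    n_medical = sum(tok in medical_terms for tok in ref)
    if n_medical == 0:
        return 0.0  # nothing to score; an assumption for the empty case
    errors = 0
    for op, ref_tok, hyp_tok in align(ref, hyp):
        if op in ("sub", "del") and ref_tok in medical_terms:
            errors += 1
        elif op == "ins" and hyp_tok in medical_terms:
            errors += 1
    return errors / n_medical
```

Drug M-WER would follow from the same function by passing only the drug subset of the vocabulary as `medical_terms`.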
Current vocab:
* **179 medical terms**
* **5 categories**
* **464 drug-term occurrences** in PriMock57
The vocabulary file is in `evaluate/medical_terms_list.py` and is easy to extend.
# Links
* **GitHub**: [https://github.com/Omi-Health/medical-STT-eval](https://github.com/Omi-Health/medical-STT-eval)
* Full 42-model leaderboard, evaluation code, per-file transcripts, and per-file metrics are all open-source
* Qwen3 long-audio debugging notes are documented in `AGENTS.md`
Happy to take questions, criticism on the metric design, or suggestions for v5.
GPT-4o mini Transcribe is a speech-to-text model from OpenAI that uses the GPT-4o mini architecture to transcribe audio. It offers improved word error rates and better language recognition compared to the original Whisper models.
According to OpenAI's overview, GPT-4o mini Transcribe offers improvements to word error rate and better language recognition and accuracy compared to the original Whisper models, as it is powered by the GPT-4o mini architecture rather than the Whisper architecture.
No context window size is specified in the available metadata for this model, as it is a speech-to-text model rather than a text generation model.
The model accepts audio input for transcription. Specific supported audio formats are defined by the OpenAI API documentation; refer to OpenAI's official API reference for the full list of accepted file types.
GPT-4o mini Transcribe was added to MindStudio on June 4, 2025. OpenAI also released updated versions of their transcription models in December 2025 according to community reports.