OpenAI

GPT-4o-mini TTS

GPT-4o-mini TTS is a text-to-speech model developed by OpenAI that converts written text into natural-sounding spoken audio. It belongs to the GPT-4o mini family, which is designed to deliver capable output at a smaller computational footprint than full-scale variants. The model is accessible to developers through the OpenAI API and is intended for programmatic speech generation across a range of applications. It accepts a text input of up to 2,000 characters and returns audio output in a synthesized voice.

GPT-4o-mini TTS is part of OpenAI's broader suite of audio models, which also includes transcription and speech-to-speech capabilities. Its focus is specifically on the text-to-speech task, producing clear and expressive spoken output from plain text. The model is well-suited for voice-enabled applications, accessibility tools, content narration, and any product that needs reliable, scalable audio generation without a larger model. Developers can configure voice selection and other parameters through the API.

Text to Speech, Voice Selection, Expressive Audio Output, API Integration, Configurable Parameters

Model Overview

| Field | Description | Value |
| --- | --- | --- |
| Provider | The entity that provides this model. | OpenAI |
| Input Context Window | The number of tokens supported by the input context window. | N/A |
| Maximum Output Tokens | The number of tokens that can be generated by the model in a single request. | N/A |
| Open Source | Whether the model's code is available for public use. | No |
| Release Date | When the model was first released. | Unknown |
| Knowledge Cut-off Date | When the model's knowledge was last updated. | Unknown |
| API Providers | The providers that offer this model. This is not an exhaustive list. | OpenAI API |
| Modalities | Types of data this model can process. | Text, Audio |


Capabilities

What GPT-4o-mini TTS supports

Text to Speech

Converts written text input into synthesized spoken audio output. Accepts up to 2,000 characters of text per request.

Voice Selection

Lets developers choose among the available voices through a select-type voice parameter in the API request.

Expressive Audio Output

Produces natural-sounding speech with expressive qualities, suitable for narration and voice-enabled interfaces.

API Integration

Available through the OpenAI API, enabling developers to integrate speech generation directly into applications via standard API calls.

Configurable Parameters

Supports numeric and select input types, allowing developers to adjust settings such as speed or format alongside voice and prompt inputs.

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

OpenAI API

Configuration & Parameters

The configurable options currently documented for this model.

| Parameter | Type | Description | Default | Allowed values |
| --- | --- | --- | --- | --- |
| Voice | Select | Voice to use in TTS | alloy | Alloy, Ash, Coral, Echo, Fable, Onyx, Nova, Sage, Shimmer |
| Instructions | Prompt | Control the voice of your generated audio with additional instructions. | | |
| Speed | Number | The speed of the generated audio. | 1 | 0.25 to 4 (step 0.05) |
| Output Format | Select | | mp3 | MP3, WAV |
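The options above can be checked on the client before a request is sent. The following is a minimal sketch, not an official client: `build_speech_payload` is a hypothetical helper, the field names (`input`, `voice`, `instructions`, `speed`, `response_format`) are assumed to match the OpenAI speech endpoint and should be verified against the official reference, and the 2,000-character limit is taken from this page.

```python
VOICES = ("alloy", "ash", "coral", "echo", "fable", "onyx", "nova", "sage", "shimmer")
FORMATS = ("mp3", "wav")
MAX_INPUT_CHARS = 2000  # per-request limit as stated on this page (assumption)
SPEED_STEP = 0.05       # step size listed for the Speed parameter


def build_speech_payload(text, voice="alloy", instructions=None,
                         speed=1.0, output_format="mp3"):
    """Validate the documented options and return a request body as a dict.

    Hypothetical helper: the key names are assumed, not confirmed by this page.
    """
    if not text:
        raise ValueError("text must not be empty")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"text exceeds {MAX_INPUT_CHARS} characters")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if output_format not in FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4")

    # Snap the speed to the nearest 0.05 step listed in the table.
    speed = round(round(speed / SPEED_STEP) * SPEED_STEP, 2)

    payload = {
        "model": "gpt-4o-mini-tts",
        "input": text,
        "voice": voice,
        "speed": speed,
        "response_format": output_format,
    }
    # Instructions are optional, so the key is only sent when provided.
    if instructions:
        payload["instructions"] = instructions
    return payload
```

Validating locally keeps obviously invalid requests from reaching the API; the server remains the final authority on which values it accepts.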

Supported Request Parameters

Parameters currently listed by OpenRouter or the local catalog for this model.

Voice, Instructions, Speed, Output Format

Community discussion

What people think about GPT-4o-mini TTS

GPT-4o-mini TTS discussions are most active in r/TextToSpeech, r/OpenAI, and r/n8n. Top Reddit threads cluster around benchmarks, model comparisons, and coding workflows.

The strongest match in this snapshot has 153 upvotes and 6 comments.

r/Anki 6 upvotes 9 comments June 27, 2025
HyperTTS 2.6.0: OpenAI gpt-4o-mini-tts Model For TTS

The OpenAI `gpt-4o-mini-tts` model is now supported in [HyperTTS](https://www.vocab.ai/hypertts). This is a feature contributed by Claus (thank you!). You have to select it in the voice options (the default is still `tts-1-hd` for OpenAI). This voice model accepts an optional `instruction` field. You can use it to instruct the model to speak in a certain way, or indicate that the source text is in a particular language. For more details, you can consult the [OpenAI reference](https://platform.openai.com/docs/guides/text-to-speech#text-to-speech-models). As with neural and LLM models, the actual output will really vary with the situation, so you'll have to experiment. Claus' feedback is: *GPT-4o mini TTS is the first OpenAI TTS model that provides usable output for my Greek flashcards.* The common feedback with OpenAI (and also ElevenLabs) is that non-English output was not that good and suffered from an American accent. Hopefully this new model improves that.

So what's next? People have been asking for Google Gemini TTS. I've been working on this for [HyperTTS](https://www.vocab.ai/hypertts) but there's a serious limitation: Google limits requests to 10 per minute, even on Tier 1 paid accounts. This means mass-generating Gemini audio will be a tedious process, and HyperTTS will need to implement some retry logic, which will be welcome anyway, to handle the occasional timeout. Besides that, here are the [issues](https://github.com/Vocab-Apps/anki-hyper-tts/issues?q=is%3Aissue%20state%3Aopen%20label%3Apriority) that I will tackle in the coming weeks in HyperTTS.

Besides that, in the coming months, I'd like to make progress on an idea I started before Christmas: generating long-play audio files with Anki flashcard sounds, so that you can review your deck while walking. Kind of like a podcast. I have a working prototype but I need to finish it. I'm also actively thinking about how [Language Tools](https://www.vocab.ai/language-tools) can use LLMs. This is honestly an overdue feature given how AI chatbots have become amazing at translation, but also transliteration (for Chinese, you can confidently ask gpt-4 to convert to Pinyin).

[https://www.vocab.ai/updates](https://www.vocab.ai/updates)

Open Reddit thread

Boson AI has recently open-sourced the Higgs Audio V2 model.
[https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base](https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base)

The model demonstrates strong performance in automatic prosody adjustment and generating natural multi-speaker dialogues across languages.

Notably, it achieved a 75.7% win rate over GPT-4o-mini-tts in emotional expression on the EmergentTTS-Eval benchmark. The total parameter count for this model is approximately 5.8 billion (3.6B for the LLM and 2.2B for the Audio Dual FFN).

Open Reddit thread

r/ChatGPTcomplaints 19 upvotes 33 comments April 22, 2026
OpenAI API Model Deprecation:

Over the next three to six months, we will deprecate support for the older OpenAI model set to improve reliability and simplify model selection. Newer alternatives will be recommended for each model.

Access to these models in the API will be closed on the following dates:

End of Service July 23, 2026

preview-computer-use-2025-03-11

gpt-4o-audio-preview-2024-12-17

gpt-4o-mini-audio-preview-2024-12-17

gpt-4o-mini-realtime-preview-2024-12-17

gpt-4o-mini-search-preview-2025-03-11

gpt-4o-mini-tts-2025-03-20

gpt-4o-search-preview-2025-03-11

gpt-5-chat-latest

gpt-5-codex

gpt-5.1-chat-latest

gpt-5.1-codex

gpt-5.1-codex-max

gpt-5.2-codex

gpt-5.1-codex-mini

gpt-audio-mini-2025-10-06

gpt-realtime-mini-2025-10-06

o3-deep-research-2025-06-26

o4-mini-deep-research-2025-06-26

Termination of service on October 23, 2026

text-insert-3-small

gpt-3.5-turbo-0125

gpt-4-0613

gpt-4-1106-preview

gpt-4-turbo

gpt-4.1-nano

gpt-4o-2024-05-13

gpt-image-1

o1-2024-12-17

o1-pro-2025-03-19

o3-mini-2025-01-31

o4-mini-2025-04-16

Improved versions of models that are outdated:

GPT-3.5 Turbo Tuned Versions

Gpt-4.1-nano-2025-04-1

Babbage-002 Tuned Versions

Davinci-002 Tuned Versions

o4-mini-2025-04-16

Tuned GPT-4 Version

We encourage you to explore our latest flagship models, presented here, which offer improved performance and broader feature support.

Thank you for building with OpenAI. If you have any questions about model migration, please feel free to reach out to the OpenAI Developer Forum.

Open Reddit thread

Hello, I am having the darndest time with this.

Recently OpenAI announced that they are retiring `gpt-4o-mini-tts-2025-03-20`, which in my experience is the most flexible TTS model I’ve ever used. My main concern is with voice acting and adherence to directions. That model lets you prompt *how* something should be spoken, and it actually follows those instructions extremely well.

I have a chatbot setup where the LLM generates both the text response and a short set of TTS instructions describing how it should sound. So if the reply is angry, it will also instruct the TTS to speak in an angry tone. If I ask it to do a character impression (like a Batman-style voice), it will describe a low, gravelly, menacing delivery and the TTS actually performs it that way for that turn. Same with accents, whispering, yelling, sounding bored, excited, sarcastic, etc. It’s not just emotion labels, it’s more like full “acting direction,” and it works shockingly well.

Another big reason I’ve been using it is speed. It usually returns first audio in about 600–800ms, which is fast enough for real-time use.

The problem is that the newer OpenAI TTS model that’s supposed to replace it doesn’t behave the same way. In my testing it mostly ignores those kinds of instructions and delivers everything in a very flat, monotone style. So I’ve been trying to find something else that can match that combination of flexibility and speed, ideally something I can run locally on an RTX 5080, but I’m open to cloud options if they’re fast enough.

So far, I haven’t had much luck. Google Gemini 3.1 TTS is actually very solid in terms of following instructions and doing expressive delivery, but it’s far too slow for my use case, around 5 seconds to first audio.

FasterQwen TTS is extremely fast (under 500ms for me), but it doesn’t really follow instructions well. You can kind of work around it by generating voices with pre-baked emotional tones and then cloning from those, but that’s nowhere near as flexible as just telling the model what to do on the fly. I also haven’t been able to get it to produce accents at all.

Fish Audio S2 looked promising from a quality standpoint, but I couldn’t get it to run on my hardware at any reasonable speed.

ElevenLabs I haven’t explored deeply yet, mostly because the pricing looked high for the amount of experimentation I’m doing.

Cartesia, Resemble, and a few others I tried didn’t seem to have the same level of control either, especially for things like accents or more specific vocal qualities like a raspy or strained voice.

At this point it feels like most systems force a tradeoff between having a consistent voice and having strong “acting” or prosody control. The older OpenAI model was the only one I’ve used that really handled both at the same time.

Is there anything out there right now, local or cloud, that can match that combination of low latency and strong instruction-following for delivery? Or is this still kind of an unsolved problem outside of that specific OpenAI model?

Open Reddit thread

FAQ

Common questions about GPT-4o-mini TTS

What is the input limit for GPT-4o-mini TTS?

The model accepts up to 2,000 characters of text per request. This is the maximum amount of text that can be submitted in a single call; longer text has to be split across several requests.
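Text longer than the limit has to be divided before it is sent. The sketch below shows one possible approach, assuming the 2,000-character figure stated on this page: it packs whole sentences into chunks and hard-splits only sentences that are themselves too long. `split_for_tts` is a hypothetical helper, not part of any SDK.

```python
import re

MAX_CHARS = 2000  # per-request limit as stated on this page (assumption)


def split_for_tts(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters.

    Chunks break at sentence boundaries where possible; a single sentence
    longer than `limit` is cut at the limit as a fallback.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Fallback: cut an oversized sentence into limit-sized pieces.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # The sentence does not fit, so start a new chunk with it.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting audio files concatenated. Breaking at sentence boundaries tends to avoid audible cuts in the middle of a phrase.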

How is GPT-4o-mini TTS accessed?

The model is available through the OpenAI API using the model ID gpt-4o-mini-tts. Developers can integrate it into applications by making API calls with a text prompt and optional configuration parameters.
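As a rough illustration of such a call, the sketch below builds the HTTP request with the Python standard library. The endpoint URL, the JSON field names, and the `OPENAI_API_KEY` environment variable are assumptions based on OpenAI's public API conventions and are not confirmed by this page; `make_speech_request` and `synthesize` are hypothetical helpers.

```python
import json
import os
import urllib.request

# Assumed endpoint for speech generation; verify against the official reference.
API_URL = "https://api.openai.com/v1/audio/speech"


def make_speech_request(text, api_key, voice="alloy"):
    """Build (but do not send) a POST request for speech synthesis."""
    body = json.dumps({
        "model": "gpt-4o-mini-tts",
        "input": text,
        "voice": voice,
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def synthesize(text, out_path, api_key=None):
    """Send the request and write the returned audio bytes to out_path."""
    api_key = api_key or os.environ["OPENAI_API_KEY"]
    request = make_speech_request(text, api_key)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

Calling `synthesize("Hello there.", "hello.mp3")` would perform the network call and requires a valid key. Keeping request construction in its own function makes it possible to inspect or test the request without contacting the API.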

What input types does GPT-4o-mini TTS accept?

The model accepts a text prompt, select inputs for options such as voice choice, and a numeric input for parameters like speed, based on the available input types listed in the metadata.

Does GPT-4o-mini TTS have a training data cutoff date?

No training data cutoff date is provided for this model. As a text-to-speech model, it does not rely on a knowledge cutoff in the same way that language models do, since it converts text to audio rather than generating factual content.

What types of applications is GPT-4o-mini TTS suited for?

The model is designed for use cases that require programmatic speech generation, including voice-enabled applications, accessibility tools, content narration systems, and other products that need scalable audio output from text input.
