Text to Speech
Converts written text input into synthesized spoken audio output. Accepts up to 2,000 characters of text per request.
A fuller summary of positioning, capabilities, and source-specific details for GPT-4o-mini TTS.
GPT-4o-mini TTS is a text-to-speech model developed by OpenAI that converts written text into natural-sounding spoken audio. It belongs to the GPT-4o mini family, which is designed to deliver capable output at a smaller computational footprint than full-scale variants. The model is accessible to developers through the OpenAI API and is intended for programmatic speech generation across a range of applications. It accepts a text input of up to 2,000 characters and returns audio output in a synthesized voice.
GPT-4o-mini TTS is part of OpenAI's broader suite of audio models, which also includes transcription and speech-to-speech capabilities. Its focus is specifically on the text-to-speech task, producing clear and expressive spoken output from plain text. The model is well-suited for voice-enabled applications, accessibility tools, content narration, and any product that requires reliable, scalable audio generation without requiring a larger model. Developers can configure voice selection and other parameters through the API.
Allows developers to choose from available voice options via a select input parameter in the API request.
Produces natural-sounding speech with expressive qualities, suitable for narration and voice-enabled interfaces.
Available through the OpenAI API, enabling developers to integrate speech generation directly into applications via standard API calls.
Supports numeric and select input types, allowing developers to adjust settings such as speed or format alongside voice and prompt inputs.
The configurable options currently documented for this model.
Voice to use in TTS
Control the voice of your generated audio with additional instructions.
The speed of the generated audio. (Default is 1.0)
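The three options above map naturally onto a single speech request. The following is a minimal sketch, assuming the OpenAI `/v1/audio/speech` endpoint with `voice`, `instructions`, and `speed` fields. The helper name `build_speech_request`, the example voice `alloy`, and the 0.25 to 4.0 speed range are assumptions not stated on this page. The function only builds the request; it does not send it.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/audio/speech"  # assumed endpoint
MAX_CHARS = 2000  # per-request text limit stated on this page


def build_speech_request(text, api_key, voice="alloy", instructions=None, speed=1.0):
    """Build (but do not send) a POST request for gpt-4o-mini-tts."""
    if not text or len(text) > MAX_CHARS:
        raise ValueError(f"text must be 1 to {MAX_CHARS} characters")
    if not 0.25 <= speed <= 4.0:  # assumed valid range, not confirmed here
        raise ValueError("speed must be between 0.25 and 4.0")

    payload = {
        "model": "gpt-4o-mini-tts",
        "input": text,
        "voice": voice,   # "alloy" is only an example voice name
        "speed": speed,   # 1.0 is the documented default
    }
    if instructions:
        # Free-text delivery directions, e.g. "Speak calmly."
        payload["instructions"] = instructions

    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` should return the audio bytes, provided the endpoint and field names match the current API reference.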
GPT-4o-mini TTS discussions are most active in r/TextToSpeech, r/OpenAI, and r/n8n. The top Reddit threads cluster around benchmarks, model comparisons, and coding workflows.
The strongest match in this snapshot has 153 upvotes and 6 comments.
The OpenAI `gpt-4o-mini-tts` model is now supported in [HyperTTS](https://www.vocab.ai/hypertts). This is a feature contributed by Claus (thank you!). You have to select it in the voice options (the default is still `tts-1-hd` for OpenAI). This voice model accepts an optional `instruction` field. You can use it to instruct the model to speak in a certain way, or indicate that the source text is in a particular language. For more details, you can consult the [OpenAI reference](https://platform.openai.com/docs/guides/text-to-speech#text-to-speech-models). As with neural and LLM models, the actual output will really vary with the situation, so you'll have to experiment. Claus' feedback is: *GPT-4o mini TTS is the first OpenAI TTS model that provides usable output for my Greek flashcards.* The common feedback with OpenAI (and also ElevenLabs) is that non-English output was not that good and suffered from an American accent. Hopefully this new model improves that.
So what's next? People have been asking for Google Gemini TTS. I've been working on this for [HyperTTS](https://www.vocab.ai/hypertts), but there's a serious limitation: Google limits requests to 10 per minute, even on Tier 1 paid accounts. This means mass-generating Gemini audio will be a tedious process, and HyperTTS will need to implement some retry logic, which will be welcome anyway to handle the occasional timeout. Besides that, here are the [issues](https://github.com/Vocab-Apps/anki-hyper-tts/issues?q=is%3Aissue%20state%3Aopen%20label%3Apriority) that I will tackle in the coming weeks in HyperTTS.
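The retry logic mentioned above could look roughly like the sketch below. It is not HyperTTS's actual code: `RateLimitError` and `call_with_retry` are hypothetical names, and the 6-second base delay is simply 60 seconds divided by the 10 requests per minute quoted above.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the provider rejects a rate-limited request."""


def call_with_retry(fn, max_attempts=5, base_delay=6.0, sleep=time.sleep):
    """Call fn(); on RateLimitError, wait and retry with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Waits 6s, 12s, 24s, ... with the default base_delay
            sleep(base_delay * (2 ** attempt))
```

The `sleep` argument is injectable so the backoff schedule can be checked in tests without actually waiting.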
Besides that, in the coming months, I'd like to make progress on an idea I started before Christmas: generating long-play audio files with Anki flashcard sounds, so that you can review your deck while walking. Kind of like a podcast. I have a working prototype but I need to finish it. I'm also actively thinking about how [Language Tools](https://www.vocab.ai/language-tools) can use LLMs. This is honestly an overdue feature given how AI chatbots have become amazing at translation, but also transliteration (for Chinese, you can confidently ask gpt-4 to convert to Pinyin).
[https://www.vocab.ai/updates](https://www.vocab.ai/updates)
I'm really confused, as lots of posts say it's per 1M characters but the docs say 1M tokens, which would be an incredibly competitive rate, almost 8x cheaper than ElevenLabs.
Boson AI has recently open-sourced the Higgs Audio V2 model.
[https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base](https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base)
The model demonstrates strong performance in automatic prosody adjustment and in generating natural multi-speaker dialogues across languages.
Notably, it achieved a 75.7% win rate over GPT-4o-mini-tts in emotional expression on the EmergentTTS-Eval benchmark. The total parameter count for this model is approximately 5.8 billion (3.6B for the LLM and 2.2B for the Audio Dual FFN).
Over the next three to six months, we will deprecate a set of older OpenAI models to improve reliability and simplify model selection. A newer alternative will be recommended for each model.
Access to these models in the API will be closed on the following dates:
End of Service July 23, 2026
preview-computer-use-2025-03-11
gpt-4o-audio-preview-2024-12-17
gpt-4o-mini-audio-preview-2024-12-17
gpt-4o-mini-realtime-preview-2024-12-17
gpt-4o-mini-search-preview-2025-03-11
gpt-4o-mini-tts-2025-03-20
gpt-4o-search-preview-2025-03-11
gpt-5-chat-latest
gpt-5-codex
gpt-5.1-chat-latest
gpt-5.1-codex
gpt-5.1-codex-max
gpt-5.2-codex
gpt-5.1-codex-mini
gpt-audio-mini-2025-10-06
gpt-realtime-mini-2025-10-06
o3-deep-research-2025-06-26
o4-mini-deep-research-2025-06-26
End of Service October 23, 2026
text-insert-3-small
gpt-3.5-turbo-0125
gpt-4-0613
gpt-4-1106-preview
gpt-4-turbo
gpt-4.1-nano
gpt-4o-2024-05-13
gpt-image-1
o1-2024-12-17
o1-pro-2025-03-19
o3-mini-2025-01-31
o4-mini-2025-04-16
Fine-tuned versions of deprecated models:
GPT-3.5 Turbo Tuned Versions
Gpt-4.1-nano-2025-04-1
Babbage-002 Tuned Versions
Davinci-002 Tuned Versions
o4-mini-2025-04-16
Tuned GPT-4 Version
We encourage you to explore our latest flagship models, presented here, which offer improved performance and broader feature support.
Thank you for building with OpenAI. If you have any questions about model migration, please feel free to reach out to the OpenAI Developer Forum.
Hello, I am having the darndest time with this.
Recently OpenAI announced that they are retiring `gpt-4o-mini-tts-2025-03-20`, which in my experience is the most flexible TTS model I’ve ever used. My main concern is with voice acting and adherence to directions. That model lets you prompt *how* something should be spoken, and it actually follows those instructions extremely well.
I have a chatbot setup where the LLM generates both the text response and a short set of TTS instructions describing how it should sound. So if the reply is angry, it will also instruct the TTS to speak in an angry tone. If I ask it to do a character impression (like a Batman-style voice), it will describe a low, gravelly, menacing delivery and the TTS actually performs it that way for that turn. Same with accents, whispering, yelling, sounding bored, excited, sarcastic, etc. It’s not just emotion labels, it’s more like full “acting direction,” and it works shockingly well.
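A setup like the one described above can be sketched as follows. This is an illustration, not the poster's code: it assumes the chat model is asked to reply in JSON, and the key names `text` and `tts_instructions`, the helper `split_reply`, and the default instruction string are all hypothetical.

```python
import json

# Hypothetical fallback used when the model supplies no acting direction
DEFAULT_INSTRUCTIONS = "Speak in a neutral, conversational tone."


def split_reply(llm_output):
    """Return (text, tts_instructions) parsed from the chat model's reply."""
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # Model ignored the JSON format: speak the raw output neutrally
        return llm_output.strip(), DEFAULT_INSTRUCTIONS

    text = str(data.get("text", "")).strip()
    instructions = data.get("tts_instructions") or DEFAULT_INSTRUCTIONS
    return text, instructions
```

The returned pair would then be sent as the input text and the per-turn delivery instructions of the speech request.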
Another big reason I’ve been using it is speed. It usually returns first audio in about 600–800ms, which is fast enough for real-time use.
The problem is that the newer OpenAI TTS model that’s supposed to replace it doesn’t behave the same way. In my testing it mostly ignores those kinds of instructions and delivers everything in a very flat, monotone style. So I’ve been trying to find something else that can match that combination of flexibility and speed, ideally something I can run locally on an RTX 5080, but I’m open to cloud options if they’re fast enough.
So far, I haven’t had much luck. Google Gemini 3.1 TTS is actually very solid in terms of following instructions and doing expressive delivery, but it’s far too slow for my use case, around 5 seconds to first audio.
FasterQwen TTS is extremely fast (under 500ms for me), but it doesn’t really follow instructions well. You can kind of work around it by generating voices with pre-baked emotional tones and then cloning from those, but that’s nowhere near as flexible as just telling the model what to do on the fly. I also haven’t been able to get it to produce accents at all.
Fish Audio S2 looked promising from a quality standpoint, but I couldn’t get it to run on my hardware at any reasonable speed.
ElevenLabs I haven’t explored deeply yet, mostly because the pricing looked high for the amount of experimentation I’m doing.
Cartesia, Resemble, and a few others I tried didn’t seem to have the same level of control either, especially for things like accents or more specific vocal qualities like a raspy or strained voice.
At this point it feels like most systems force a tradeoff between having a consistent voice and having strong “acting” or prosody control. The older OpenAI model was the only one I’ve used that really handled both at the same time.
Is there anything out there right now, local or cloud, that can match that combination of low latency and strong instruction-following for delivery? Or is this still kind of an unsolved problem outside of that specific OpenAI model?
The model supports a context window of 2,000 characters, which represents the maximum amount of text that can be submitted in a single request.
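Text longer than this limit has to be split across several requests. The sketch below shows one possible way to do that, preferring sentence boundaries and hard-splitting only when a single sentence exceeds the limit. The function name `chunk_text` and the sentence-splitting rule are assumptions for illustration, not part of any documented API.

```python
import re

MAX_CHARS = 2000  # per-request limit stated above


def chunk_text(text, limit=MAX_CHARS):
    """Split text into pieces of at most `limit` characters."""
    # Break after ., ! or ? followed by whitespace (a rough sentence rule)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can be sent as its own request and the resulting audio segments concatenated in order.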
The model is available through the OpenAI API using the model ID gpt-4o-mini-tts. Developers can integrate it into applications by making API calls with a text prompt and optional configuration parameters.
The model accepts a text prompt, select inputs for options such as voice choice, and a numeric input for parameters like speed, based on the available input types listed in the metadata.
A specific training date is not provided for this model. As a text-to-speech model, it does not rely on a knowledge cutoff in the same way that language models do, since it converts text to audio rather than generating factual content.
The model is designed for use cases that require programmatic speech generation, including voice-enabled applications, accessibility tools, content narration systems, and other products that need scalable audio output from text input.