Low-Latency Speech
Generates audio from text with minimal delay, making it suitable for near real-time voice applications like interactive assistants.
TTS (tts-1) is OpenAI's text-to-speech model designed for speed and responsiveness. It converts written text into natural-sounding audio and is optimized to minimize the delay between text input and audio output. The model accepts up to 4096 characters of input per request (the catalog lists this as a 4096-token context window) and is accessible through the OpenAI API, making it straightforward to integrate into existing applications and workflows.
TTS is well suited to use cases where timely audio delivery matters, such as interactive voice assistants, customer service systems, educational tools, and entertainment applications. OpenAI also offers a sibling model, tts-1-hd, which prioritizes audio fidelity over speed. Developers who need the fastest possible voice response times will find tts-1 the appropriate choice, while those who can tolerate slightly higher latency in exchange for higher audio quality may opt for tts-1-hd.
Produces fluid, human-like speech from written text across a range of supported voices including alloy, echo, fable, onyx, nova, and shimmer.
Outputs audio in several formats including MP3, Opus, AAC, and FLAC, allowing developers to choose the format that fits their delivery requirements.
Accepts plain text input of up to 4096 characters per request (listed in the catalog as 4096 tokens) and converts it to spoken audio in a single API call.
Available via the OpenAI REST API, enabling scalable voice output that can be embedded into products, pipelines, and third-party platforms.
Supports a configurable speech speed parameter ranging from 0.25x to 4.0x, giving developers control over the pacing of generated audio.
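The options listed above can be sketched as a single request body. This is a minimal illustration, not an official client: it assumes the `/v1/audio/speech` endpoint and the JSON fields `model`, `input`, `voice`, `response_format`, and `speed` from the OpenAI API reference. The helper names `build_speech_payload` and `synthesize` are hypothetical, and the voice and format sets contain only the values named on this page.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/audio/speech"  # assumed endpoint
MODELS = {"tts-1", "tts-1-hd"}
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
FORMATS = {"mp3", "opus", "aac", "flac"}  # only the formats named on this page
MAX_INPUT_CHARS = 4096


def build_speech_payload(text, voice="alloy", response_format="mp3",
                         speed=1.0, model="tts-1"):
    """Validate the documented options and return the request body as a dict."""
    if not text or len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input must be 1 to {MAX_INPUT_CHARS} characters")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if response_format not in FORMATS:
        raise ValueError(f"unknown format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }


def synthesize(payload, api_key):
    """POST the payload and return the raw audio bytes (makes a network call)."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Nothing is sent until `synthesize` is called, so the payload can be built and checked offline. Validating locally is a design choice for the sketch: it surfaces an out-of-range speed or an unknown voice before a request is spent on it.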
The configurable options currently documented for this model.
Voice to use in TTS
TTS discussions are most active in r/Grimdank, r/LocalLLaMA, and r/Rainbow6. The top Reddit threads are mostly benchmarks and model comparisons. The strongest match in this snapshot has 19242 upvotes and 442 comments.
I am only doing this for private hobby projects, but I haven’t kept up to date with the best TTS. Which one is it?
I mean the ones that can show all types of emotions, including grunts, anger, screams, sadness, etc.
Heya guys and gals,
Around a year ago I released and posted about Persona Engine as a fun side project, trying to get the whole ASR -> LLM -> TTS pipeline going fully locally while having a realtime avatar that is lip-synced (think VTuber). I was able to achieve this and was super happy with the result, but the TTS for me was definitely lacking, since I was using Sesame at the time as reference. After that I took a long break.
A week or two ago, I thought to give the project a refresh, and also wanted to see how far we have come with local models, and boy was I pleasantly surprised with Qwen3 TTS. During my initial tests it was lacking, especially the version published by the Qwen team themselves, but after digging around and experimenting a lot I was able to:
1. Make streaming with the model work reliably. The architecture of the model is perfect for this, since the decoder uses a sliding window, which means if you stream the LLM response, that's completely fine and the TTS will keep coherent prosody, pitch, and intonation.
2. Get the model working with llama.cpp, because I am using C# and speed is important, so also quantized it.
3. The model was lacking word-level timings and phonemes which Kokoro (the previous, more robotic sounding TTS) had. So I had to implement CTC word-level alignment to be able to know when certain words are spoken (important for subtitles + getting phonemes to have the lips move correctly).
Once this was all done, I also decided to finetune my own Qwen3-TTS voice. The cloning capabilities are really cool, but they are very lacking in contextual understanding and struggle with pronunciation. Additionally, the custom trained voices provided by the Qwen team didn't have any female native speakers, and I didn't want to create a new Live2D model.
In the end, the finetune blew me away, and I will probably continue improving it.
GitHub is here: [https://github.com/fagenorn/handcrafted-persona-engine](https://github.com/fagenorn/handcrafted-persona-engine)
Check it out, have fun, and let me know whatever crazy stuff you decide to do with it.
Hey, I'm trying to see how much synthetic data helps with training an ASR model. What is the best TTS? I'm looking for something that sounds natural and not robotic. It would be really nice if the TTS could mimic English accents (American, British, French, etc.). Thanks for the help.
I've tried several options but am not satisfied with any of them:
1- chatterbox: too slow
2- Alltalk: never worked
3- system: bad quality
4- Kokoro: currently using but not impressed
- Which TTS do you recommend?
- If you mention ElevenLabs, is the price worth it? I did the calculation and it's 30 minutes per 5 dollars.
- Edge: I know it's a privacy nightmare, but is it worth it? I use OpenRouter anyway.
- I heard about Kitten TTS and GPT-SoVITS v3, but nobody has shown a tutorial on how to use them with SillyTavern.
- Should I just wait for OpenRouter to offer a reasonably priced TTS API?
If you have low VRAM - qwen 3 tts is good
If you need something unique go for - tada 3b, but it needs 28 GB VRAM
If you want the best TTS right now + need commercial use allowed then go for - moss tts 8b, it's literally the best model out there
Literally voice clone is sooooooo powerful 😍
(Don't go for fish audio, it's not for commercial use, but for fun it's veryyyy good)
Edit: I found longcat DiT 3.5b and it's totally mind-boggling. It is even better than MOSS 8b, and the best at cloning voices.
The model accepts up to 4096 characters of input per request (the catalog lists this as a 4096-token context window), which is the maximum amount of text that can be converted to speech in a single API call.
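Text longer than the per-request limit has to be split before it is sent. The sketch below shows one possible approach, assuming the 4096 limit is measured in characters (as the OpenAI API reference states for the `input` field). The function name `split_for_tts` and the sentence-boundary heuristic are illustrative choices, not part of the API.

```python
import re

MAX_INPUT_CHARS = 4096  # assumed per-request input limit, in characters


def split_for_tts(text, limit=MAX_INPUT_CHARS):
    """Split text into chunks of at most `limit` characters.

    Chunks break at sentence ends where possible; a single sentence
    longer than `limit` is hard-wrapped.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap a sentence that cannot fit in one request on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio segments played or concatenated in order. Breaking at sentence ends is a heuristic meant to keep each segment's intonation natural; it is not a documented requirement.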
OpenAI prices tts-1 based on the number of characters in the input text. Current pricing details are available on the OpenAI pricing page at platform.openai.com/pricing.
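Because billing follows input length, a cost estimate is simple arithmetic: characters multiplied by the per-character rate. The sketch below assumes the rate is quoted per one million characters. The function name `estimate_tts_cost` is illustrative, and the rate is a caller-supplied value to be taken from the pricing page, not a figure stated here.

```python
def estimate_tts_cost(text, usd_per_million_chars):
    """Estimate the USD cost of synthesizing `text` at a caller-supplied rate."""
    if usd_per_million_chars < 0:
        raise ValueError("rate must be non-negative")
    # Billing is assumed to count every character of the input text.
    return len(text) * usd_per_million_chars / 1_000_000
```

For example, a 2,000-character script at a placeholder rate of 10 USD per million characters would come to 2000 × 10 / 1,000,000 = 0.02 USD.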
tts-1 supports six built-in voices: alloy, echo, fable, onyx, nova, and shimmer. Each voice has a distinct tone and style, but no custom voice cloning is supported natively through this model.
The model can output audio in MP3, Opus, AAC, and FLAC formats. MP3 is the default format returned by the API.
tts-1 is optimized for low latency and faster audio delivery, while tts-1-hd trades some speed for higher audio quality. Both models share the same voices and input format.
According to the provided metadata, the model's training date is listed as November 2024.