OpenAI

TTS

TTS (tts-1) is OpenAI's text-to-speech model designed for speed and responsiveness. It converts written text into natural-sounding audio and is optimized to minimize the delay between text input and audio output. The model accepts up to 4096 characters of input per request and is accessible through the OpenAI API, making it straightforward to integrate into existing applications and workflows.

TTS is well-suited for use cases where timely audio delivery matters, such as interactive voice assistants, customer service systems, educational tools, and entertainment applications. OpenAI also offers a sibling model, tts-1-hd, which prioritizes audio fidelity over speed. Developers who need the fastest possible voice response times will find tts-1 the appropriate choice, while those who can tolerate slightly higher latency in exchange for higher audio quality may opt for tts-1-hd.

November 2024 | N/A context | N/A output
Low-Latency Speech, Natural Voice Output, Multiple Audio Formats, Text Input Processing, API Integration, Speed Control

Model Overview

High-signal model metadata in a structured two-column overview table.

Provider

The entity that provides this model.

OpenAI

Input Context Window

The number of tokens supported by the input context window.

N/A tokens

Maximum Output Tokens

The number of tokens that can be generated by the model in a single request.

N/A tokens

Open Source

Whether the model's code is available for public use.

No

Release Date

When the model was first released.

November 2024

Knowledge Cut-off Date

When the model's knowledge was last updated.

November 2024

API Providers

The providers that offer this model. This is not an exhaustive list.

OpenAI API

Modalities

Types of data this model can process.

Text (input), Audio (output)

Capabilities

What TTS supports

Low-Latency Speech

Generates audio from text with minimal delay, making it suitable for near real-time voice applications like interactive assistants.

Natural Voice Output

Produces fluid, human-like speech from written text across a range of supported voices including alloy, echo, fable, onyx, nova, and shimmer.

Multiple Audio Formats

Outputs audio in several formats including MP3, Opus, AAC, and FLAC, allowing developers to choose the format that fits their delivery requirements.
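
As a rough illustration of choosing a format, the sketch below maps a requested format to an output file name. The `response_format` values mirror the formats listed above; the file extensions and the helper name `output_filename` are assumptions made for this example, not part of the OpenAI API.

```python
# Assumed mapping from a requested audio format to a file extension.
# The keys mirror the formats named above; the extensions are a guess
# at sensible defaults, not something the API prescribes.
AUDIO_EXTENSIONS = {
    "mp3": ".mp3",
    "opus": ".opus",
    "aac": ".aac",
    "flac": ".flac",
}


def output_filename(stem: str, response_format: str = "mp3") -> str:
    """Return a file name for the generated audio (hypothetical helper)."""
    key = response_format.lower()
    if key not in AUDIO_EXTENSIONS:
        raise ValueError(f"unsupported format: {response_format!r}")
    return stem + AUDIO_EXTENSIONS[key]
```

Calling `output_filename("greeting", "flac")` yields a name ending in `.flac`; leaving the format out falls back to MP3.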

Text Input Processing

Accepts plain text input of up to 4096 characters per request and converts it to spoken audio in a single API call.
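
Longer texts have to be split across several requests. The following is a minimal sketch of one way to do that, assuming the per-request limit is counted in characters (as OpenAI's API reference describes it). The function name `split_for_tts` and the sentence-boundary heuristic are illustrative choices, not an official recommendation.

```python
import re

MAX_CHARS = 4096  # assumed per-request input limit, counted in characters


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    preferring to break at sentence boundaries (hypothetical helper)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request and the resulting audio segments concatenated. Splitting at sentence boundaries is meant to keep each segment's intonation self-contained.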

API Integration

Available via the OpenAI REST API, enabling scalable voice output that can be embedded into products, pipelines, and third-party platforms.
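
To make the REST integration concrete, here is a minimal sketch that uses only the Python standard library. It assumes the speech endpoint is `https://api.openai.com/v1/audio/speech` and that the JSON body takes `model`, `input`, and `voice` fields, which matches OpenAI's public API reference at the time of writing; check the current documentation before relying on it. The helper names `build_speech_request` and `synthesize_to_file` are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint; verify against the current OpenAI API reference.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text: str, api_key: str, model: str = "tts-1",
                         voice: str = "alloy") -> urllib.request.Request:
    """Build (but do not send) a POST request for speech synthesis."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text: str, api_key: str, path: str) -> None:
    """Send the request and write the returned audio bytes to `path`.
    Requires network access and a valid API key."""
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without touching the network.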

Speed Control

Supports a configurable speech speed parameter ranging from 0.25x to 4.0x, giving developers control over the pacing of generated audio.
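
A small sketch of guarding the speed value before sending a request follows. The 0.25 to 4.0 bounds come from the description above; the helper names, and the assumption that playback length scales roughly inversely with speed, are illustrative only.

```python
MIN_SPEED = 0.25
MAX_SPEED = 4.0


def clamp_speed(speed: float) -> float:
    """Clamp a requested speed into the documented 0.25x to 4.0x range."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))


def approx_duration_seconds(base_seconds: float, speed: float) -> float:
    """Rough playback length at `speed`, assuming duration scales as
    1 / speed (an approximation, not a documented guarantee)."""
    return base_seconds / clamp_speed(speed)
```

Clamping silently is one design choice; raising an error on out-of-range values may be preferable when the value comes from user input.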

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

OpenAI API

Configuration & Parameters

The configurable options currently documented for this model.

Voice

Type: Select

Voice to use in TTS

Default: alloy
Options: Alloy, Echo, Fable, Onyx, Nova, Shimmer
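
The voice option can be validated client-side before a request is made. This is a minimal sketch, assuming the API expects the lowercase voice names; `normalize_voice` is a hypothetical helper, and the voice list is the one shown on this page.

```python
VOICES = ("alloy", "echo", "fable", "onyx", "nova", "shimmer")
DEFAULT_VOICE = "alloy"


def normalize_voice(name=None):
    """Return a lowercase voice name, falling back to the default
    when nothing is given and rejecting unknown names."""
    if name is None or not name.strip():
        return DEFAULT_VOICE
    voice = name.strip().lower()
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {name!r}")
    return voice
```

For instance, a UI label such as "Nova" is mapped to `nova`, and an empty selection falls back to `alloy`.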

Supported Request Parameters

Parameters currently listed by OpenRouter or the local catalog for this model.

Voice

Community discussion

What people think about TTS

TTS discussions are most active in r/Grimdank, r/LocalLLaMA, and r/Rainbow6. Top Reddit threads cluster around benchmarks and model comparisons. The strongest match in this snapshot has 19,242 upvotes and 442 comments.

Heya guys and gals,

Around a year ago I released and posted about Persona Engine as a fun side project, trying to get the whole ASR -> LLM -> TTS pipeline going fully locally while having a realtime avatar that is lip-synced (think VTuber). I was able to achieve this and was super happy with the result, but the TTS for me was definitely lacking, since I was using Sesame at the time as reference. After that I took a long break.

A week or two ago, I thought to give the project a refresh, and also wanted to see how far we have come with local models, and boy was I pleasantly surprised with Qwen3 TTS. During my initial tests it was lacking, especially the version published by the Qwen team themselves, but after digging around and experimenting a lot I was able to:

1. Make streaming with the model work reliably. The architecture of the model is perfect for this, since the decoder uses a sliding window, which means if you stream the LLM response, that's completely fine and the TTS will keep coherent prosody, pitch, and intonation.
2. Get the model working with llama.cpp, because I am using C# and speed is important, so also quantized it.
3. The model was lacking word-level timings and phonemes which Kokoro (the previous, more robotic sounding TTS) had. So I had to implement CTC word-level alignment to be able to know when certain words are spoken (important for subtitles + getting phonemes to have the lips move correctly).

Once this was all done, I also decided to finetune my own Qwen3-TTS voice. The cloning capabilities are really cool, but they lack contextual understanding and struggle with pronunciation. Additionally, the custom trained voices provided by the Qwen team didn't have any female native speakers, and I didn't want to create a new Live2D model.

In the end, the finetune blew me away, and I will probably continue improving it.

GitHub is here: https://github.com/fagenorn/handcrafted-persona-engine

Check it out, have fun, and let me know whatever crazy stuff you decide to do with it.

r/LocalLLaMA 6 upvotes 12 comments April 7, 2026
What's the best open source/free TTS?

Hey, I'm trying to see how much synthetic data helps with training an ASR model. What is the best TTS? I'm looking for something that sounds natural and not robotic. It would be really nice if the TTS could mimic English accents (American, British, French, etc.). Thanks for the help.

r/SillyTavernAI 7 upvotes 34 comments February 28, 2026
What do you use for TTS?

I've tried several options but I'm not satisfied with any of them:

1. Chatterbox: too slow
2. AllTalk: never worked
3. System: bad quality
4. Kokoro: currently using it, but not impressed

- What TTS setup do you recommend?
- If you mention ElevenLabs, is the price worth it? I did the calculation and it's $5 per 30 minutes.
- Edge: I know it's a privacy nightmare, but is it worth it? I use OpenRouter anyway.
- I heard about Kitten TTS and GPT-SoVITS v3, but nobody has shown a tutorial on how to use them with SillyTavern.
- Should I just wait for OpenRouter to offer a reasonably priced TTS API?

If you have low VRAM, Qwen3 TTS is good.

If you need something unique, go for TADA 3B, but it needs 28 GB of VRAM.

If you want the best TTS right now and need commercial use to be allowed, go for MOSS TTS 8B. It's literally the best model out there.

Voice cloning is literally so powerful 😍

(Don't go for Fish Audio; it's not licensed for commercial use, but for fun it's very good.)

Edit: I found LongCat DiT 3.5B and it's totally mind-boggling. It is even better than MOSS 8B, and the best at cloning voices.

FAQ

Common questions about TTS

What is the maximum input length for tts-1?

The model accepts up to 4096 characters of input per request, which is the maximum amount of text that can be converted to speech in a single API call.

How is tts-1 priced?

OpenAI prices tts-1 based on the number of characters in the input text. Current pricing details are available on the OpenAI pricing page at platform.openai.com/pricing.
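
Because billing is per input character, a cost estimate is simple arithmetic. The sketch below assumes a price quoted per one million characters; the rate is deliberately left as a required argument because this page does not state one, and `estimate_cost_usd` is a hypothetical helper.

```python
def estimate_cost_usd(text: str, usd_per_million_chars: float) -> float:
    """Estimate the cost of synthesizing `text` at a per-character rate.
    The rate must come from the current OpenAI pricing page; no
    default is assumed here."""
    if usd_per_million_chars < 0:
        raise ValueError("rate must be non-negative")
    return len(text) * usd_per_million_chars / 1_000_000
```

With a placeholder rate of 10.0 USD per million characters, a 1,000-character script would come to 0.01 USD.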

What voices are available with tts-1?

tts-1 supports six built-in voices: alloy, echo, fable, onyx, nova, and shimmer. Each voice has a distinct tone and style, but no custom voice cloning is supported natively through this model.

What audio formats does tts-1 output?

The model can output audio in MP3, Opus, AAC, and FLAC formats. MP3 is the default format returned by the API.

What is the difference between tts-1 and tts-1-hd?

tts-1 is optimized for low latency and faster audio delivery, while tts-1-hd trades some speed for higher audio quality. Both models share the same voices and input format.

What is the training data cutoff for tts-1?

The catalog metadata for this model lists its knowledge cut-off date as November 2024.
