LLM Model Directory

Explore frontier AI models by provider, pricing, and context

Browse the synced model catalog by provider, release, pricing, and core capabilities.

All Text 132 Vision 13 Image 55 Video 32 Transcription 5 Text to Speech 6 Music 2 Lip Sync 5 3D 5

6 models 4 providers in view Current filter: All providers Type: Text to Speech

OpenAI

3 models

›

Text to Speech

TTS

Release date unavailable

TTS (tts-1) is OpenAI's text-to-speech model designed for speed and responsiveness. It converts written text into natural-sounding audio and is optimized to minimize the delay between text input and audio output. The model supports a 4096-token context window and is accessible through the OpenAI API, making it straightforward to integrate into existing applications and workflows. TTS is well-suited for use cases where timely audio delivery matters, such as interactive voice assistants, customer service systems, educational tools, and entertainment applications. OpenAI also offers a sibling model, tts-1-hd, which prioritizes audio fidelity over speed. Developers who need the fastest possible voice response times will find tts-1 the appropriate choice, while those who can tolerate slightly higher latency in exchange for higher audio quality may opt for tts-1-hd.

Context: N/A Output: N/A

Input: $15.00 Output: N/A

View model →

›

Text to Speech

GPT-4o-mini TTS

Release date unavailable

GPT-4o-mini TTS is a text-to-speech model developed by OpenAI that converts written text into natural-sounding spoken audio. It belongs to the GPT-4o mini family, which is designed to deliver capable output at a smaller computational footprint than full-scale variants. The model is accessible to developers through the OpenAI API and is intended for programmatic speech generation across a range of applications. It accepts a text input of up to 2,000 characters and returns audio output in a synthesized voice. GPT-4o-mini TTS is part of OpenAI's broader suite of audio models, which also includes transcription and speech-to-speech capabilities. Its focus is specifically on the text-to-speech task, producing clear and expressive spoken output from plain text. The model is well-suited for voice-enabled applications, accessibility tools, content narration, and any product that requires reliable, scalable audio generation without requiring a larger model. Developers can configure voice selection and other parameters through the API.

Context: N/A Output: N/A

Input: $0.60 Output: $12.00

View model →

›

Text to Speech

TTS HD

Release date unavailable

TTS HD (model ID: tts-1-hd) is a text-to-speech model developed by OpenAI that converts written text into natural-sounding spoken audio. It accepts a text input of up to 4096 tokens and produces audio output in a variety of supported voices. TTS-1-HD is the quality-optimized variant in OpenAI's TTS model family, designed to produce higher-fidelity audio compared to the standard TTS-1 offering. The model is well-suited for applications that require clear, natural-sounding voice output, such as voice assistants, audiobook narration, accessibility tools, and content creation workflows. It supports multiple built-in voices and can output audio in formats including MP3, Opus, AAC, and FLAC. Developers access the model through OpenAI's API, and it is available on MindStudio without requiring separate API key management.

Context: N/A Output: N/A

Input: $30.00 Output: N/A

View model →

Google

1 models

›

Text to Speech

Gemini 3.1 Flash TTS

Release date unavailable

The Gemini 3.1 Flash TTS Preview model provides powerful, low-latency speech generation with natural outputs, steerable prompts, and new expressive audio tags for precise narration control.

Context: N/A Output: 16,384 tokens

Input: $1.00 Output: N/A

View model →

ElevenLabs

1 models

›

Text to Speech

ElevenLabs TTS

Release date unavailable

ElevenLabs TTS is a text-to-speech platform developed by ElevenLabs that converts written text into natural-sounding audio across 70+ languages. The platform includes multiple speech models — Eleven v3, Eleven Multilingual v2, and Eleven Flash v2.5 — each designed for different use cases, from expressive long-form narration to ultra-low-latency real-time applications. It also supports voice cloning, allowing users to create digital replicas of voices that retain their characteristics across all supported languages. ElevenLabs TTS is well-suited for media companies, audiobook producers, game developers, publishers, and content creators who need scalable multilingual audio output. The platform's conversational AI component supports sub-100ms latency and can integrate with CRMs, payment systems, and telephony platforms, making it applicable for customer-facing voice agent deployments. The context window supports up to 10,000 tokens per request, and the platform accepts voice selection and configuration inputs through its API.

Context: 10,000 Output: N/A

Input: $240.00 Output: N/A

View model →

MiniMax

1 models

›

Text to Speech

Minimax Speech 2.8 HD

Release date unavailable

MiniMax Speech 2.8 HD is a high-definition text-to-speech model developed by MiniMax, built on an autoregressive Transformer architecture with a Flow-VAE decoder. Instead of using traditional mel-spectrogram vocoders, it models speech in a learned latent space, which produces audio with natural cadence, proper intonation, and emotional depth. The model accepts up to 50,000 tokens of input text and was trained through January 2026. The model offers 17 or more expressive voice presets spanning different genders, ages, and speaking styles, along with support for natural interjections such as laughs, sighs, and gasps embedded directly in text. Users can control emotion, speed, volume, pitch, sample rate, bitrate, channel configuration, and output format. These features make it well suited for audiobook production, video voiceovers, podcast creation, e-learning narration, accessibility applications, and game development.

Context: 50K Output: N/A

Input: N/A Output: N/A

View model →