Text to Speech
Converts text into emotionally expressive audio across 70+ languages, with support for multi-speaker dialogue and long-form content up to 10,000 tokens.
ElevenLabs TTS is a text-to-speech platform developed by ElevenLabs that converts written text into natural-sounding audio across 70+ languages. The platform includes multiple speech models — Eleven v3, Eleven Multilingual v2, and Eleven Flash v2.5 — each designed for different use cases, from expressive long-form narration to ultra-low-latency real-time applications. It also supports voice cloning, allowing users to create digital replicas of voices that retain their characteristics across all supported languages. ElevenLabs TTS is well-suited for media companies, audiobook producers, game developers, publishers, and content creators who need scalable multilingual audio output. The platform's conversational AI component supports sub-100ms latency and can integrate with CRMs, payment systems, and telephony platforms, making it applicable for customer-facing voice agent deployments. The context window supports up to 10,000 tokens per request, and the platform accepts voice selection and configuration inputs through its API.
Converts text into emotionally expressive audio across 70+ languages, with support for multi-speaker dialogue and long-form content up to 10,000 tokens.
Creates digital voice replicas from audio samples in both instant and professional-grade modes, preserving voice characteristics across all supported languages.
Generates speech in 70+ languages using a single model, enabling consistent voice identity across different language outputs.
Deploys voice and chat agents with sub-100ms latency, with integration support for CRMs, telephony platforms, and payment systems.
Eleven Flash v2.5 is optimized for low-latency streaming applications, making it suitable for live conversational and interactive use cases.
The Scribe v2 model transcribes audio in 90+ languages with speaker diarization, word-level timestamps, and real-time transcription support.
Generates studio-grade music from natural language prompts with control over genre, style, vocals, and song structure.
Accessible via REST API with voice selection and configuration inputs, supporting programmatic audio generation at scale.
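As a rough illustration of the API access described above, the following sketch builds a text-to-speech request without sending it. The base URL, the `xi-api-key` header, the `model_id` value, and the `voice_settings` fields reflect the publicly documented ElevenLabs API as best understood here; treat them as assumptions and confirm them against the official API reference. `build_tts_request` is a hypothetical helper name, not part of any SDK.

```python
import json
import urllib.request

# Assumed base URL; verify against the official API reference.
API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(api_key, voice_id, text,
                      model_id="eleven_multilingual_v2",
                      stability=0.5, similarity_boost=0.75):
    """Build (but do not send) a text-to-speech HTTP request.

    The model_id default and the voice_settings keys are assumptions
    made for illustration.
    """
    payload = {
        "text": text,
        "model_id": model_id,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
        },
    }
    return urllib.request.Request(
        url=f"{API_BASE}/text-to-speech/{voice_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_tts_request("YOUR_API_KEY", "YOUR_VOICE_ID", "Hello from the API.")
    print(req.get_method(), req.full_url)
    # To synthesize audio, send the request and save the returned bytes:
    # with urllib.request.urlopen(req) as resp, open("out.mp3", "wb") as f:
    #     f.write(resp.read())
```

Separating request construction from sending keeps the voice and model choices easy to inspect and test before any usage is billed.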
ElevenLabs TTS discussions are most active in r/homeassistant, r/TextToSpeech, r/ElevenLabs. Top Reddit threads cluster around benchmark and model-comparison threads.
The strongest match in this snapshot has 190 upvotes and 42 comments.
Hi everyone, I've been using st for a couple of years, and I think I've finally reached a point in my RP where I'm pretty pleased with the results (for now lol), so I'd like to share my setup.
**LLM - Claude Sonnet 4.6 / GLM 4.7 Flash (Openrouter)**
* The model I use really depends on how long the RP is (if it's super long then my wallet can NOT afford Sonnet), whether I like the responses a model is giving me, and whether it adheres to the image and TTS formatting I use. I change my main model A LOT, so I just listed two of my most used ones.
* Also for image captioning I use a separate model, usually just grok4.1-fast.
**IMAGE GEN - ComfyUI + ComfyInject**
* ComfyInject is a plugin that is a GODSEND to those wanting images for every message, consistent image prompting, specific povs based on context, consistent clothes and accessories in images, etc. Totally customizable too, huge shoutout to u/momentobru who originally posted about it here in the subreddit. Github link: [https://github.com/Spadic21/ComfyInject](https://github.com/Spadic21/ComfyInject) . I will say that originally I had issues with the plugin communicating with the comfyui server after a few images, but this on the git page fixed it for me: [https://github.com/Spadic21/ComfyInject/issues/7](https://github.com/Spadic21/ComfyInject/issues/7) .
* I like to use divingIllustriousFlat\_v60VAE.safetensors, because it gives a really good anime-looking style which imo beats base hassakuxl or illustrious. I have a 5060ti and it usually takes about 12 seconds to generate an image with 30 steps at (most of the time) 832px x 1216px.
**TTS - Elevenlabs V3**
* I feel like this part is pretty self-explanatory, it's simply just an amazing model, and I went ahead and got the membership so I usually clone the voices of fictional characters (mainly anime characters lol) to use, and it ends up really well.
* A feature I absolutely love is the emotion / sfx generation potential that's included with the V3 model in elevenlabs. When something in brackets "\[\]" is sent to the server to generate audio, the model uses the words inside the brackets to change the tone of the sentence that follows, produce almost any sound effect, or affect the timing and rhythm of the generated audio.
* To utilize this I just add a couple of sentences to the prompt explaining how to make use of it, like this: "FOR ALL DIALOGUE, (Text inside quotes), follow the following rules without exception no matter what: Constantly add tags in brackets "\[\]" to enhance the dialogue which is processed through TTS. Tags such as actions "\[falling against wooden floor\]", "\[stuttering\]", and pretty much any sound effect. Tags such as emotions "\[Seducingly\]", "\[Angrily\]", "\[Sad\]". Tags such as pacing / rhythm "\[pauses\]", "\[stammers\]", "\[rushed\]". Tags such as tone "\[yelling\]", "\[british accent\]", "\[shouts\]", "\[whispers\]". UTILIZE THOSE TAGS TO MAKE AN IMMERSIVE AND REALISTIC TEXT TO SPEECH EXPERIENCE."
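To preview how a tagged line like the ones described above splits into cues and spoken words, a small local helper can pull the bracketed tags out before the text is sent for synthesis. This is only a sketch for inspecting LLM output on the client side; it does not reflect how the TTS service itself parses tags, and `extract_tags` and `spoken_text` are hypothetical names.

```python
import re

# Matches one bracketed tag such as [whispers]; nested brackets are not handled.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(line):
    """Return the bracketed tags in a line, in order, without the brackets."""
    return [tag.strip() for tag in TAG_PATTERN.findall(line)]


def spoken_text(line):
    """Return the line with tags removed and whitespace collapsed."""
    return " ".join(TAG_PATTERN.sub(" ", line).split())


if __name__ == "__main__":
    sample = "[whispers] I think someone is here. [pauses] Stay quiet."
    print(extract_tags(sample))
    print(spoken_text(sample))
```

A check like this makes it easy to spot replies where the model forgot the tags entirely or wrapped ordinary dialogue in brackets by mistake.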
Any suggestions or comments are appreciated❤.
Hey all,
I’m Praney, a solo dev. I’m partially dyslexic, so text-to-speech is not just a “nice to have” for me. I use it to read, write, review, and turn long scripts into audio.
I got tired of Elevenlabs TTS tools charging by usage and sending my scripts to someone else’s servers, so I built Vois.so: a local voice AI studio for desktop.
The basic idea is simple:
Write a script → assign voices → generate speech locally → arrange it on a timeline → master/export the final audio.
It started as my personal local ElevenLabs-style alternative, but it has turned into a full production workflow.
What it does:
\- Runs locally on desktop
\- Generates voice audio without uploading scripts to a cloud TTS API
\- Has multiple voice engines for fast, expressive, multilingual, and Omni-style generation
\- Includes a voice library with narrator, host, character, announcer, storyteller, and game-style voices
\- Supports voice cloning from a short sample
\- Lets you build multi-speaker scripts
\- Has a multi-track timeline with crossfades and arrangement tools
\- Includes mastering presets for things like audiobooks, podcasts, YouTube, and general audio
\- Exports finished audio files
The part that may be more relevant to this subreddit:
Vois also has a CLI, so Claude Code, Codex, Cursor, Gemini, etc. can control the app directly.
That means an agent can help with things like:
\- Drafting a podcast script
\- Splitting it into speakers
\- Assigning voices
\- Generating the narration
\- Exporting a finished audio file
\- Building audiobook chapters from longer text
I’m currently using Claude + Vois to build audiobooks and podcasts. Claude helps me structure and edit the scripts, then Vois turns them into finished audio locally.
The animated GIF shows the app in action.
It’s free for personal use to download and use on desktop. I’m not posting pricing here because that’s not really the point of this post.
I’m mainly curious:
If you had a local voice studio that Claude/Codex could control, what would you automate with it?
Audiobooks? Podcast drafts? Game dialogue? Voiceovers for docs/tutorials? Something else?
Full disclosure: I built this myself, so I’m happy to answer questions about the product, the agent workflow, or the local TTS side.
**Fix:** The ACPX plugin was disabled. ACPX provides the `reply_dispatch` hook that routes agent responses through the ACP dispatch path — which is where TTS processing actually happens. Without it, responses go through the non-ACP fallback path and TTS tags get silently ignored.
In `openclaw.json`, add `"acpx"` to `plugins.allow` and add `"acpx": { "enabled": true }` to `plugins.entries`, then restart the gateway.
Also heads up: there's a 10-character minimum on TTS text content, so short phrases like "Hi!" will silently skip synthesis even with the fix in place.
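The config change described above can be sketched as a small patch function. This assumes `openclaw.json` is plain JSON, that `plugins.allow` is a list, and that `plugins.entries` is an object keyed by plugin name, which is how the fix reads; the real file layout may differ. The length check is likewise a guess at how the 10-character minimum is counted. Both function names are hypothetical.

```python
import json


def enable_acpx(config):
    """Return a copy of the config with the ACPX plugin allowed and enabled."""
    patched = json.loads(json.dumps(config))  # deep copy of JSON-compatible data
    plugins = patched.setdefault("plugins", {})
    allow = plugins.setdefault("allow", [])
    if "acpx" not in allow:
        allow.append("acpx")
    entry = plugins.setdefault("entries", {}).setdefault("acpx", {})
    entry["enabled"] = True
    return patched


def long_enough_for_tts(text, minimum=10):
    """Assumed check: trimmed text must reach the minimum character count."""
    return len(text.strip()) >= minimum


if __name__ == "__main__":
    # "example-plugin" is a placeholder, not a real plugin name.
    before = {"plugins": {"allow": ["example-plugin"], "entries": {}}}
    print(json.dumps(enable_acpx(before), indent=2))
    print(long_enough_for_tts("Hi!"))
```

To apply it for real, load the file with `json.load`, pass the result through `enable_acpx`, write it back, and restart the gateway as the fix describes. Keeping a backup of the original file first is a sensible precaution.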
\---
Since the update before the latest (was on 2026.4.8, now on 2026.4.11), I get "No response generated. Please try again." whenever my Telegram OpenClaw bot tries to send a voice message via ElevenLabs TTS.
Gateway logs show only sendMessage ok, no sendVoice ever appears. The TTS config looks correct (messages.tts.provider = elevenlabs, auto = tagged, API key present). Claude Code has been investigating for hours, current theory is the ElevenLabs capability plugin isn't loading correctly in the gateway's runtime context despite appearing to work in isolation.
Anyone hit this after a recent update?
https://reddit.com/link/1t5cjgb/video/hl5biaf21jzg1/player
Hey all,
I’m Praney, a solo dev. I’m partially dyslexic, so text-to-speech is not just a “nice to have” for me. I use it to read, write, review, and turn long scripts into audio.
I got tired of cloud TTS tools charging by usage and sending my scripts to someone else’s servers, so I built Vois: a local voice AI studio for desktop.
The basic idea is simple:
Write a script → assign voices → generate speech locally → arrange it on a timeline → master/export the final audio.
It started as my personal local ElevenLabs-style alternative, but it has turned into a full production workflow.
What it does:
\- Runs locally on desktop
\- Generates voice audio without uploading scripts to a cloud TTS API
\- Has multiple voice engines for fast, expressive, multilingual, and Omni-style generation
\- Includes a voice library with narrator, host, character, announcer, storyteller, and game-style voices
\- Supports voice cloning from a short sample
\- Lets you build multi-speaker scripts
\- Has a multi-track timeline with crossfades and arrangement tools
\- Includes mastering presets for things like audiobooks, podcasts, YouTube, and general audio
\- Exports finished audio files
The part that may be more relevant to this subreddit:
Vois also has a CLI, so Claude Code, Codex, Cursor, Gemini, etc. can control the app directly.
That means an agent can help with things like:
\- Drafting a podcast script
\- Splitting it into speakers
\- Assigning voices
\- Generating the narration
\- Exporting a finished audio file
\- Building audiobook chapters from longer text
I’m currently using Claude/Codex + Vois to build audiobooks and podcasts. Claude or Codex helps me structure and edit the scripts, then Vois turns them into finished audio locally.
The animated GIF shows the app in action.
It’s free for personal use to download and use on desktop. I’m not posting pricing here because that’s not really the point of this post.
If you'd like to subscribe, you can get $90 OFF our yearly sub with the code "VOISNFRIENDS90OFF" (sorry, only 50 codes available).
I’m mainly curious:
If you had a local voice studio that Claude/Codex could control, what would you automate with it?
Audiobooks? Podcast drafts? Game dialogue? Voiceovers for docs/tutorials? Something else?
Full disclosure: I built this myself, so I’m happy to answer questions about the product, the agent workflow, or the local TTS side.
My LinkedIn: [https://www.linkedin.com/in/praney-behl-b9129313](https://www.linkedin.com/in/praney-behl-b9129313/)
Website: [vois.so](https://vois.so)
Tired of generic weather data? 🥱 I wanted my smart home to give me actually useful weather insights, inspired by how the Samsung Weather app tells you specific things like "rain stopping in 2 hours" or "snow likely to continue" ❄️
Instead of just showing "it's snowing, -2°C," my setup now:
- 🔍 Scrapes detailed weather insights from Weather.com
- 🗣️ Announces changes through Sonos speakers using Elevenlabs TTS for natural voice
- 📟 Displays current insight on my Awtrix Matrix
- ⏰ Only announces between 6:30 AM - 8:00 PM when motion is detected
- ⏱️ Has a 5-minute cooldown between announcements
"Snow likely for the next several hours" notification + Awtrix display
Configuration in comments! 👇 Let me know if you'd like me to share the multiscrape config or automation yaml.
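The announcement rules listed in the post above (a 6:30 AM to 8:00 PM window, motion required, and a 5-minute cooldown) can be expressed as one gate function. This is a plain-Python sketch of the logic only, not the poster's Home Assistant YAML; the function name is hypothetical and treating the window boundaries as inclusive is an assumption.

```python
from datetime import datetime, time, timedelta

WINDOW_START = time(6, 30)       # 6:30 AM
WINDOW_END = time(20, 0)         # 8:00 PM
COOLDOWN = timedelta(minutes=5)  # minimum gap between announcements


def should_announce(now, motion_detected, last_announced=None):
    """Return True only when all three conditions from the post hold."""
    if not motion_detected:
        return False
    if not (WINDOW_START <= now.time() <= WINDOW_END):
        return False
    if last_announced is not None and now - last_announced < COOLDOWN:
        return False
    return True


if __name__ == "__main__":
    now = datetime(2025, 1, 15, 7, 0)
    print(should_announce(now, motion_detected=True))
    print(should_announce(now, True, last_announced=now - timedelta(minutes=2)))
```

In Home Assistant the same checks would typically live in an automation's conditions, with the cooldown derived from when the automation last triggered.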
ElevenLabs TTS supports a context window of up to 10,000 tokens per request.
The text-to-speech models support 70+ languages. The Scribe v2 speech-to-text model extends transcription support to 90+ languages.
ElevenLabs offers several models including Eleven v3 for expressive storytelling, Eleven Multilingual v2 for broad language coverage, and Eleven Flash v2.5 for ultra-low-latency real-time applications.
API pricing details are available on the ElevenLabs API pricing page at elevenlabs.io/pricing/api.
According to the available metadata, the training date for ElevenLabs TTS is listed as January 2023.
ElevenLabs TTS supports voice cloning in both instant and professional-grade modes, and a cloned voice keeps its characteristics across all supported languages.