MiniMax Speech 2.8 HD is a high-definition text-to-speech model developed by MiniMax, built on an autoregressive Transformer architecture with a Flow-VAE decoder. Instead of using traditional mel-spectrogram vocoders, it models speech in a learned latent space, which produces audio with natural cadence, proper intonation, and emotional depth. The model accepts up to 50,000 tokens of input text and was trained through January 2026.
The model offers 17 or more expressive voice presets spanning different genders, ages, and speaking styles, along with support for natural interjections such as laughs, sighs, and gasps embedded directly in text. Users can control emotion, speed, volume, pitch, sample rate, bitrate, channel configuration, and output format. These features make it well suited for audiobook production, video voiceovers, podcast creation, e-learning narration, accessibility applications, and game development.
Provides 17 or more built-in voice options spanning different genders, ages, and speaking styles, selectable via a dropdown input.
Allows setting the emotional tone of synthesized speech, such as happy or calm, to match the content.
Supports embedding more than 20 human sounds, such as (laughs), (sighs), and (gasps), directly in input text for lifelike delivery.
Exposes configurable parameters for sample rate, bitrate, channel configuration, and output format through dedicated select inputs.
Accepts numeric inputs to adjust playback speed, volume level, and pitch independently for fine-grained audio tuning.
Supports a custom pronunciation dictionary to handle brand names, acronyms, and specialized terminology with precise phonetic control.
Accepts up to 50,000 tokens of input text in a single request, enabling long-form content like full audiobook chapters.
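For text longer than the 50,000-token limit noted above, one option is to split it at paragraph boundaries and send each piece as its own request. The sketch below is illustrative only: this page does not say how the model counts tokens, so `estimate_tokens` uses a whitespace word count as a rough stand-in, and both function names are hypothetical rather than part of any provider API.

```python
from typing import List

MAX_INPUT_TOKENS = 50_000  # input limit stated on this page


def estimate_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words.

    Assumption: the real tokenizer is not documented here, so treat this
    as an approximation and leave headroom by passing a lower budget.
    """
    return len(text.split())


def chunk_paragraphs(text: str, budget: int = MAX_INPUT_TOKENS) -> List[str]:
    """Greedily pack paragraphs into chunks whose estimated size fits the budget.

    A single paragraph larger than the budget is emitted as its own
    (oversized) chunk; splitting it further is left to the caller.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        cost = estimate_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and used + cost > budget:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting at paragraph boundaries keeps sentences intact, which matters for speech because a break in the middle of a sentence would interrupt intonation.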
The configurable options currently documented for this model.
Voice preset to use for speech synthesis.
Speech speed multiplier.
Volume level.
Pitch adjustment.
Emotional tone of the speech delivery.
Audio sample rate in Hz.
Audio bitrate in bits per second.
Audio channel configuration.
Output audio format.
Boost recognition for a specific language.
Improves number-reading performance in English text (dates, currencies, etc.).
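The options above can be gathered into a single settings object before a request is built. The following is a minimal sketch under stated assumptions: the page describes what each option does but not its field name, allowed values, or default, so every field name here is hypothetical and no ranges are enforced.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict, Optional


@dataclass
class SpeechOptions:
    """Hypothetical container mirroring the options described above.

    Field names are illustrative only; check the provider's schema for
    the real names, allowed values, and defaults.
    """
    text: str
    voice: Optional[str] = None                   # voice preset
    speed: Optional[float] = None                 # speech speed multiplier
    volume: Optional[float] = None                # volume level
    pitch: Optional[float] = None                 # pitch adjustment
    emotion: Optional[str] = None                 # e.g. "happy" or "calm"
    sample_rate: Optional[int] = None             # Hz
    bitrate: Optional[int] = None                 # bits per second
    channel: Optional[str] = None                 # "mono" or "stereo"
    audio_format: Optional[str] = None            # output audio format
    language_boost: Optional[str] = None          # language to boost
    english_normalization: Optional[bool] = None  # number reading in English

    def to_payload(self) -> Dict[str, Any]:
        """Return only the options that were explicitly set."""
        return {k: v for k, v in asdict(self).items() if v is not None}


opts = SpeechOptions(text="Hello there.", emotion="calm", speed=1.0, channel="mono")
payload = opts.to_payload()
```

Leaving unset options out of the payload lets the provider apply its own defaults instead of values guessed on the client side.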
MiniMax Speech 2.8 HD discussions are most active in r/HiggsfieldAI. The strongest match in this snapshot has 1 upvote and 0 comments.
The Ultimate AI Text-to-Speech, Voice Swap, and Video Translation Tool: Higgsfield Audio
Imagine this: you have spent hours crafting the perfect AI visual. The lighting is right, the motion is smooth, the style is exactly how you envisioned it. Everything seems almost too perfect - except the audio sounds off.
Our latest breakthrough update brings Higgsfield Audio, which turns Higgsfield AI into a full-cycle AI content production platform. With three powerful new functions, Voiceover, Change Voice, and Translate, you no longer need to leave the platform to give your content a voice.
🔊 **Create your custom voice:** upload an audio file or record on the spot.
🔊 **Voiceover:** text-to-speech that supports 70+ languages with 21 voice presets as well as your own custom voices. Powered by Eleven v3 and MiniMax Speech 2.8 HD.
🔊 **Change voice:** reshape how any video sounds. 21 voice presets + your custom voices.
🔊 **Translate:** localise your videos into 10 supported world languages:
🇺🇸 English
🇨🇳 Chinese (Mandarin)
🇫🇷 French
🇮🇳 Hindi
🇮🇹 Italian
🇯🇵 Japanese
🇰🇷 Korean
🇧🇷 Portuguese
🇷🇺 Russian
🇹🇷 Turkish
With more coming soon! Any guesses which ones are joining the list? Drop a comment 😗
Check Higgsfield Audio in action 👉 [https://higgsfield.ai/cinema-studio](https://higgsfield.ai/cinema-studio)
The model supports a context window of 50,000 tokens, which allows for long-form content such as full chapters or extended scripts in a single request.
Users can configure sample rate, bitrate, channel (mono or stereo), and output format through dedicated select inputs, giving full control over the final audio file.
Voice output can be customized. In addition to choosing from 17 or more voice presets, you can adjust speed, volume, pitch, and emotional tone, and embed natural interjections like (laughs) or (sighs) directly in the input text.
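Because interjections are written inline as parenthesized words, a script can be checked for them before synthesis. The sketch below is an assumption-laden illustration: only (laughs), (sighs), and (gasps) are named on this page, the full set of supported sounds is not listed here, and the helper names are hypothetical.

```python
import re
from typing import List

# Only these three tags are named on this page; the full set is not listed here.
KNOWN_TAGS = {"laughs", "sighs", "gasps"}

# Matches a single lowercase word wrapped in parentheses, e.g. "(sighs)".
TAG_PATTERN = re.compile(r"\(([a-z]+)\)")


def find_tags(text: str) -> List[str]:
    """Return every parenthesized lowercase word, in order of appearance."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text: str) -> List[str]:
    """Return tags not in KNOWN_TAGS so they can be reviewed by hand."""
    return [tag for tag in find_tags(text) if tag not in KNOWN_TAGS]


# "(whistles)" is a made-up example tag, not confirmed as supported.
script = "Well (sighs) that was close. (laughs) Try again (whistles)."
found = find_tags(script)
to_review = unknown_tags(script)
```

An ordinary aside such as "(optional)" would also be flagged for review, which is intended: a parenthesized word the model does not treat as a sound may be read aloud literally.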
The model's training date is listed as January 2026.
The model is designed for use cases that require high-fidelity, human-sounding audio, including audiobook production, video voiceovers, podcast creation, e-learning narration, accessibility tools, and game development.