MiniMax Speech 2.8 HD

MiniMax Speech 2.8 HD is a high-definition text-to-speech model developed by MiniMax, built on an autoregressive Transformer architecture with a Flow-VAE decoder. Rather than relying on a traditional mel-spectrogram vocoder, it models speech in a learned latent space, which yields audio with natural cadence, intonation, and emotional depth. The model accepts up to 50,000 tokens of input text per request and was trained through January 2026.

The model offers at least 17 expressive voice presets spanning different genders, ages, and speaking styles, and it supports natural interjections such as (laughs), (sighs), and (gasps) embedded directly in the input text. Users can control emotion, speed, volume, pitch, sample rate, bitrate, channel configuration, and output format. These features make it well suited to audiobook production, video voiceovers, podcast creation, e-learning narration, accessibility applications, and game development.

Model Overview

| Field | Value | Description |
| --- | --- | --- |
| Provider | MiniMax | The entity that provides this model. |
| Input Context Window | 50K tokens | The number of tokens supported by the input context window. |
| Maximum Output Tokens | N/A | The number of tokens that can be generated by the model in a single request. |
| Open Source | No | Whether the model's code is available for public use. |
| Release Date | January 2026 | When the model was first released. |
| Knowledge Cut-off Date | January 2026 | When the model's knowledge was last updated. |
| API Providers | MiniMax | The providers that offer this model. This is not an exhaustive list. |
| Modalities | Text, Video, Audio, Code | Types of data this model can process. |

Capabilities

What MiniMax Speech 2.8 HD supports

Voice Presets

Provides 17 or more built-in voice options spanning different genders, ages, and speaking styles, selectable via a dropdown input.

Emotion Control

Sets the emotional tone of synthesized speech, such as happy, sad, or neutral, to match the intended content context.

Natural Interjections

Supports embedding over 20 human sounds like (laughs), (sighs), and (gasps) directly in input text for lifelike delivery.
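Because interjections are written inline as parenthesized tags, a quick pre-flight check can catch misspelled tags before a script is synthesized. The sketch below is only illustrative: `KNOWN_INTERJECTIONS` contains just the three tags named above (the full set of 20+ sounds is not listed here), and `find_unknown_tags` is a hypothetical helper, not part of any MiniMax SDK.

```python
import re

# Only the three tags named in the text above; the complete list of supported
# sounds is not given here, so extend this set from the provider's documentation.
KNOWN_INTERJECTIONS = {"laughs", "sighs", "gasps"}

# Matches a single lowercase word in parentheses, e.g. "(laughs)".
_TAG_PATTERN = re.compile(r"\(([a-z]+)\)")


def find_unknown_tags(text: str) -> list[str]:
    """Return parenthesized one-word tags that are not in KNOWN_INTERJECTIONS."""
    return [tag for tag in _TAG_PATTERN.findall(text) if tag not in KNOWN_INTERJECTIONS]


script = "I can't believe it worked (laughs). Well (sighs), back to the next chapter."
print(find_unknown_tags(script))                     # []
print(find_unknown_tags("That was close (gulps)."))  # ['gulps']
```

Note that an ordinary one-word parenthetical in the script would also be reported, so the result is best treated as a list of spots to review rather than a hard error.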

Audio Format Control

Exposes configurable parameters for sample rate, bitrate, channel configuration, and output format through dedicated select inputs.

Speech Rate & Pitch

Accepts numeric inputs to adjust playback speed, volume level, and pitch independently for fine-grained audio tuning.

Custom Pronunciation

Supports a custom pronunciation dictionary to handle brand names, acronyms, and specialized terminology with precise phonetic control.

Large Text Input

Accepts up to 50,000 tokens of input text in a single request, enabling long-form content like full audiobook chapters.
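For manuscripts longer than the 50,000-token limit, the text has to be split across several requests. The following is a minimal sketch of one way to do that, assuming a rough heuristic of four characters per token; the model's actual tokenizer is not documented here, so `CHARS_PER_TOKEN`, `estimate_tokens`, and `split_for_tts` are illustrative names and the budget should be set conservatively.

```python
MAX_INPUT_TOKENS = 50_000  # limit stated above
CHARS_PER_TOKEN = 4        # assumption: rough heuristic, not the model's tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ceiling of len(text) / CHARS_PER_TOKEN."""
    return -(-len(text) // CHARS_PER_TOKEN)


def split_for_tts(text: str, max_tokens: int = MAX_INPUT_TOKENS) -> list[str]:
    """Greedily pack paragraphs into chunks whose summed estimates fit max_tokens.

    Paragraph separators are not counted, and a single paragraph larger than
    the budget is kept as its own chunk; it would need further splitting
    (for example by sentence) before being sent.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        cost = estimate_tokens(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Small demonstration with an artificially low budget.
chapter = "\n\n".join(f"Paragraph {i}. " + "word " * 50 for i in range(10))
parts = split_for_tts(chapter, max_tokens=200)
print(len(parts), [estimate_tokens(p) for p in parts])
```

Splitting on paragraph boundaries keeps each request ending at a natural pause, which should make the generated audio segments easier to join.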

API Access & Providers

Providers through which this model is available.

MiniMax

Configuration & Parameters

The configurable options currently documented for this model.

Voice

Select

Voice preset to use for speech synthesis.

Default: Friendly_Person
Options: Wise Woman, Friendly Person, Inspirational Girl, Deep Voice Man, Calm Woman, Casual Guy, Lively Girl, Patient Man, Young Knight, Determined Man, Lovely Girl, Decent Boy, Imposing Manner, Elegant Man, Abbess, Sweet Girl 2, Exuberant Girl

Speed

Number

Speech speed multiplier.

Default: 1

Volume

Number

Volume level.

Default: 1

Pitch

Number

Pitch adjustment.

Emotion

Select

Emotional tone of the speech delivery.

Options: Happy, Sad, Angry, Fearful, Disgusted, Surprised, Neutral

Sample Rate

Select

Audio sample rate in Hz.

Default: 44100
Options: 16,000 Hz, 24,000 Hz, 32,000 Hz, 44,100 Hz (default)

Bitrate

Select

Audio bitrate in bits per second.

Default: 128000
Options: 32,000, 64,000, 128,000 (default), 256,000

Channel

Select

Audio channel configuration.

Options: Mono, Stereo

Format

Select

Output audio format.

Options: MP3, WAV, FLAC, OGG, PCM

Language Boost

Select

Boost recognition for a specific language.

Options: Auto, Afrikaans, Arabic, Bulgarian, Catalan, Chinese, Chinese (Yue), Croatian, Czech, Danish, Dutch, English, Filipino, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malay, Norwegian, Nynorsk, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Vietnamese

English Normalization

Toggle Group

Improves number-reading performance in English text (dates, currencies, etc.).
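The options above can be collected into a small settings object that rejects unsupported values before a request is made. This is a sketch under stated assumptions: the field names, the lowercase option spellings, and the positive-speed check are illustrative and not taken from the provider's API reference; only the option sets and defaults come from the parameter list above.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Option sets copied from the parameter list above.
SAMPLE_RATES = {16000, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "neutral"}
CHANNELS = {"mono", "stereo"}
FORMATS = {"mp3", "wav", "flac", "ogg", "pcm"}


@dataclass
class SpeechSettings:
    # Field names and lowercase option spellings are assumptions.
    voice: str = "Friendly_Person"  # documented default
    speed: float = 1.0              # documented default
    volume: float = 1.0             # documented default
    pitch: Optional[float] = None   # no default documented
    emotion: Optional[str] = None
    sample_rate: int = 44100        # documented default
    bitrate: int = 128000           # documented default
    channel: Optional[str] = None
    format: Optional[str] = None

    def validate(self) -> None:
        """Raise ValueError if a select-type option is outside its documented set."""
        checks = [
            ("sample_rate", self.sample_rate, SAMPLE_RATES),
            ("bitrate", self.bitrate, BITRATES),
            ("emotion", self.emotion, EMOTIONS),
            ("channel", self.channel, CHANNELS),
            ("format", self.format, FORMATS),
        ]
        for name, value, allowed in checks:
            if value is not None and value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
        # Assumption: a speed multiplier only makes sense when positive.
        if self.speed <= 0:
            raise ValueError("speed must be a positive multiplier")

    def to_payload(self) -> dict:
        """Validate, then return only the options that were explicitly set or defaulted."""
        self.validate()
        return {key: value for key, value in asdict(self).items() if value is not None}


settings = SpeechSettings(emotion="happy", format="mp3", channel="mono")
print(settings.to_payload())
```

Voice names are deliberately not validated here, since only the default identifier (`Friendly_Person`) is given in identifier form above.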

Supported Request Parameters

Parameters currently listed by OpenRouter or the local catalog for this model.

Voice, Speed, Volume, Pitch, Emotion, Sample Rate, Bitrate, Channel, Format, Language Boost, English Normalization

Community discussion

What people think about MiniMax Speech 2.8 HD

MiniMax Speech 2.8 HD discussions are most active in r/HiggsfieldAI. The strongest match in this snapshot has 1 upvote and 0 comments.

The Ultimate AI Text-to-Speech, Voice Swap, and Video Translation Tool: Higgsfield Audio

Imagine this: you have spent hours crafting the perfect AI visual. The lighting is right, the motion is smooth, the style is exactly how you envisioned it. Everything seems almost too perfect - except the audio sounds off.

Our latest, breakthrough update brings Higgsfield Audio that turns Higgsfield AI into a full-cycle AI content production platform. With three powerful new functions, Voiceover, Change Voice, and Translate, you no longer need to leave the platform to give your content a voice.

🔊 **Create your custom voice:** upload an audio file or record on the spot.
🔊 **Voiceover:** text-to-speech that supports 70+ languages with 21 voice presets as well as your own custom voices. Powered by Eleven v3 and MiniMax Speech 2.8 HD.
🔊 **Change voice:** reshape how any video sounds. 21 voice presets + your custom voices.
🔊 **Translate:** localise your videos into 10 supported world languages:

🇺🇸 English
🇨🇳 Chinese (Mandarin)
🇫🇷 French
🇮🇳 Hindi
🇮🇹 Italian
🇯🇵 Japanese
🇰🇷 Korean
🇧🇷 Portuguese
🇷🇺 Russian
🇹🇷 Turkish

With more coming soon! Any guesses which ones are joining the list? Drop a comment 😗

Check Higgsfield Audio in action 👉 [https://higgsfield.ai/cinema-studio](https://higgsfield.ai/cinema-studio)

FAQ

Common questions about MiniMax Speech 2.8 HD

What is the maximum input length for MiniMax Speech 2.8 HD?

The model supports a context window of 50,000 tokens, which allows for long-form content such as full chapters or extended scripts in a single request.

What audio output formats and quality settings are available?

Users can configure sample rate, bitrate, channel (mono or stereo), and output format through dedicated select inputs, giving full control over the final audio file.
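To see how these settings affect file size, the arithmetic can be sketched as follows. The estimate assumes constant-bitrate encoding and ignores container overhead for compressed formats; for raw PCM it assumes 16-bit samples, which is not stated on this page. Both function names are illustrative.

```python
def compressed_size_bytes(bitrate_bps: int, seconds: float) -> int:
    """Approximate size of constant-bitrate audio: bits per second * seconds / 8."""
    return int(bitrate_bps * seconds / 8)


def pcm_size_bytes(sample_rate_hz: int, channels: int, seconds: float,
                   bytes_per_sample: int = 2) -> int:
    """Size of raw PCM audio; bytes_per_sample=2 assumes 16-bit samples."""
    return int(sample_rate_hz * channels * bytes_per_sample * seconds)


# One minute at the default bitrate of 128,000 bits per second.
print(compressed_size_bytes(128_000, 60))  # 960000

# One minute of mono PCM at the default 44,100 Hz sample rate.
print(pcm_size_bytes(44_100, 1, 60))       # 5292000
```

Under these assumptions, a minute of audio at the default bitrate is a little under 1 MB, while the same minute as uncompressed mono PCM is roughly 5.3 MB.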

Can I control how the voice sounds beyond just selecting a preset?

Yes. In addition to choosing from 17 or more voice presets, you can adjust speed, volume, pitch, and emotional tone, and embed natural interjections like (laughs) or (sighs) directly in the input text.

What is the training data cutoff for this model?

The model's knowledge cut-off is listed as January 2026, the same month as its release.

What types of applications is MiniMax Speech 2.8 HD best suited for?

The model is designed for use cases that require high-fidelity, human-sounding audio, including audiobook production, video voiceovers, podcast creation, e-learning narration, accessibility tools, and game development.
