Mistral

Mistral Nemo

Mistral NeMo is a text generation model developed by Mistral, a French AI company. It features a 128,000-token context window and is trained with function calling support, making it suitable for agentic and tool-use workflows. The model has particular strength across eleven languages: English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. Mistral NeMo is a 12-billion parameter model built in collaboration with NVIDIA, which is reflected in the "NeMo" name referencing NVIDIA's NeMo framework. It is designed for developers and organizations building multilingual applications where broad language coverage and a large context window are priorities. The model's combination of function calling capability, multilingual training, and long-context handling makes it a practical choice for global deployment scenarios.

Jul 19, 2024 128,000 context 64,000 tokens output
Long Context Window Multilingual Generation Function Calling Code Generation Reasoning & World Knowledge Structured Output

Model Overview

High-signal model metadata in a structured two-column overview table.

Provider

The entity that provides this model.

Mistral

Model ID

The routed model identifier exposed by upstream providers.

mistralai/mistral-nemo

Input Context Window

The number of tokens supported by the input context window.

128,000 tokens

Maximum Output Tokens

The number of tokens that can be generated by the model in a single request.

64,000 tokens tokens

Open Source

Whether the model's code is available for public use.

No

Release Date

When the model was first released.

Jul 19, 2024 1 year ago

Knowledge Cut-off Date

When the model's knowledge was last updated.

2024-04-30

API Providers

The providers that offer this model. This is not an exhaustive list.

DekaLLM, DeepInfra, Novita, Mistral

Modalities

Types of data this model can process.

Text

What is Mistral Nemo

A fuller summary of positioning, capabilities, and source-specific details for Mistral Nemo.

Mistral NeMo is a text generation model developed by Mistral, a French AI company. It features a 128,000-token context window and is trained with function calling support, making it suitable for agentic and tool-use workflows. The model has particular strength across eleven languages: English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.

Mistral NeMo is a 12-billion parameter model built in collaboration with NVIDIA, which is reflected in the "NeMo" name referencing NVIDIA's NeMo framework. It is designed for developers and organizations building multilingual applications where broad language coverage and a large context window are priorities. The model's combination of function calling capability, multilingual training, and long-context handling makes it a practical choice for global deployment scenarios.

Capabilities

What Mistral Nemo supports

CTX

Long Context Window

Processes up to 128,000 tokens in a single request, enabling analysis of long documents, codebases, or extended conversations without truncation.

AI

Multilingual Generation

Generates and understands text in eleven languages including English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.

AI

Function Calling

Supports structured function calling, allowing the model to invoke external tools and APIs as part of agentic or automated workflows.

</>

Code Generation

Produces and reasons about code across common programming languages, with coding accuracy noted as a focus area for its parameter size.

RN

Reasoning & World Knowledge

Applies multi-step reasoning and broad factual knowledge to answer questions, summarize content, and solve problems in text form.

JSON

Structured Output

Can return responses in structured formats such as JSON, useful for downstream data processing and integration tasks.

Pricing for Mistral Nemo

Primary API pricing shown in the same “quick compare” spirit as the reference page.

Price Comparison

Additional usage-cost dimensions synced into the project for this model.

maxTemperature 1
maxResponseSize 64,000 tokens

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

DekaLLM DeepInfra Novita Mistral

Provider Endpoints

Endpoint-level provider data currently available for this model.

DekaLLM

1d uptime: 97.8% Supported params: 8 Implicit caching: No

DeepInfra

Max output: 16,384 1d uptime: 99.7% Supported params: 12 Implicit caching: No

Novita

Max output: 16,000 1d uptime: 78.1% Supported params: 11 Implicit caching: No

Mistral

1d uptime: 99.8% Supported params: 11 Implicit caching: No

Resources & Documentation

Official model cards, release notes, docs, and other references synced from the source page.

Community discussion

What people think about Mistral Nemo

Mistral Nemo discussions are most active in r/LocalLLaMA, r/SillyTavernAI, r/MistralAI.

Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions. The strongest match in this snapshot has 894 upvotes and 2088 comments.

r/LocalLLaMA 363 upvotes 113 comments August 15, 2024
Mistral Nemo appreciation post

We all complained for months that there were no new models in the ~13B size range, after all the good Llama-2-13B finetunes came out.

Just wanna say thank you to those genius french fucks over at Mistral for Nemo. 12B parameters and 128k context is a very useful combination. It’s enough of a size improvement over 7B to feel a little more “solid” when talking to it, and it runs circles around Llama-2-13B, with 32x the context length.

Thank you mistral!

Open Reddit thread
r/LocalLLaMA 23 upvotes 30 comments March 11, 2026
Mistral NEMO upscale, but kinda weird

**March, 2026**. I wanted to **upscale**, I wanted to **prune**. So why not have both? And why's the fish fat anyway? And is this even coherent at this point?

It's coherent, follows instructions, knows new stuff, and new languages.

# The model is available here:

[https://huggingface.co/SicariusSicariiStuff/Fat\_Fish](https://huggingface.co/SicariusSicariiStuff/Fat_Fish)

It started as a normal Mistral **Nemo**, then it ate about **3B tokens**, and absolutely unhinged modifications were made to it, making it thiccer at all the right(?) places.

Basically, this is a highly experimental **proper upscale** of [mistralai/Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407).

About 1,000$ went into this little project, not that bad of an investment for a worthwhile upscale experiment done to a Mistral-based model.

**IMPORTANT:** This is an intermediate step of what I have in mind; this model, while (surprisingly) coherent, needs more work. I decided to release it publicly 'as is' in its current form, because multiple people expressed enthusiasm in wanting to tune it (based unhinged curiosity, to be honest).

# But WHY?!

Because I think that:

1. Mistral Nemo is excellent
2. We likely won't get many more dense models, because MOE master race

Both points hold more gravitas than people realize. While Mistral released newer versions of dense models at a similar size (14B, for example), their old Nemo, in many people's opinion, was generally better. How do I know? Simple, look how many tunes (post 2025, and even 2026) Nemo got, vs the newer bases. Also, the benchmarks suggest that the old Nemo knows more stuff and is very tuning-friendly.

For the second point, while 'here and there' the open source community gets a new dense base, they are few and far between, since the meteoric rise of (mostly giant) moes.

Basically, I went "If I can't get a new base model, I'll make one myself", sort of.

# "Proper" upscale AND a prune

Why do I say "proper"? Aren't there countless upscales of various models in the wild? Not really. Most of the "upscales" are just **stack merges** made with mergekit, and often **down\_proj** is zeroed out, because slapping duplicated layers in random segments usually makes the model output ascii chars and some random words. **No layers were zeroed out during the feeding of this fish**.

This is **both an upscale AND a prune**, truly naughty stuff was made to the beloved little Nemo.

Here are the main architecture changes I made:

|Parameter|Base Nemo|Fat\_Fish|
|:-|:-|:-|
|Hidden Size|5120|5120|
|Intermediate Size|14336|**12608**|
|Layers|32|**56**|
|Attention Heads|32|**48**|
|Key/Value Heads|8|**12 (because why not)**|

* **Why 12 KV heads instead of 16?** While I know **12 isn’t a neat divisor**, I wanted to see how it behaves in practice. Theoretically, increasing KV heads should improve **context representation and attention fidelity**, but jumping all the way to **16 would introduce a noticeably larger memory and compute overhead** during both training and inference. I experimented with **12 as a middle ground**, and it ended up working surprisingly well — stable during tuning, no issues during inference, and it also behaved nicely under **quantization**. So despite being a slightly “awkward” number architecturally, in practice it turned out to be a **very workable compromise between efficiency and capacity**.

# Suggestions on how to use it

This model is **NOT** made for human consumption 'as is', but rather as a base to build upon. You don't just eat raw dough now, do you? (actually, I'm sure that somewhere someone is 🥟👨‍🍳)

While noise was injected into various places to encourage the model and duplicated tensors in specific places to be noisy enough, so they can learn new stuff, surprisingly, after the massive CPT, some of them began to converge to nearly the same patterns. Hence, I recommend:

* Running layer similarity analysis
* Target the layers with the most similarity for full finetuning while keeping the rest frozen

# What new data was added

|Data Source / Type|Percentage|Notes|
|:-|:-|:-|
|Fandom / Lore Knowledge|**20%**|Heavy emphasis on *Morrowind*, *Fallout*, and *Kenshi* Knowledge and lore|
|Human Written Content|**50%**|General internet writing, essays, blogs, discussions, and natural dialogue|
|Synthetic Instruct Data|**4%**|Instruction-style prompts|
|Hebrew Text Corpus|**16%**|Modern Hebrew web text, forums, documentation, and conversational data|
|Other Mixed Sources|**10%**|Miscellaneous datasets and balancing material|

# SAFETY

* Not very safe. Neither are knives; it's a dangerous world out there.

For the paper lovers, here's some more reading material about the subject:

* [Compact Language Models via Pruning and Knowledge Distillation](https://arxiv.org/abs/2407.14679)
* [LLM Pruning and Distillation in Practice: The Minitron Approach](https://arxiv.org/abs/2408.11796)

Open Reddit thread
r/LocalLLaMA 142 upvotes 83 comments July 26, 2024
Mistral Nemo 12B Instruct is a killer for eRP - Storytelling

I’ve been playing with Nemo for a few days now, and it blows me away at how coherent it is. It’s slightly ‘less creative and more repetitive’ than Llama 3 8B fine-tunes… But it feels ‘more coherent and has better instruction capabilities’.

If Nemo Instruct is really good on its own, I can only imagine what fine-tunes will come out of it.

P.S. There’s also an upscaled version of Nemo, 21B.
I’ve been mainly using the 21B version at 6_K @ 16K context on my 4090.

I don’t know of there’s a difference yet between 12B and 21B… 🤔
I have to experiment with both of them a bit more.

But 21B Nemo is very impressive.

———

/u/TheLocalDrummer you should give Nemo Instruct a look into.
We would all love to see “Moistal Nemolicious”

___

Update: Here is the 21B version of Nemo.
https://huggingface.co/TheSkullery/NeMoria-21b

Open Reddit thread
r/LocalLLaMA 194 upvotes 63 comments July 19, 2024
Mistral NeMo 60% less VRAM fits in 12GB + 4bit BnB + 3 bug / issues

Hey r/LocalLLaMA! Sorry took a bit longer than usual since I found **3 issues / bugs** in Mistral NeMo which made finetuning / inference runs break - should be all fixed in [Unsloth](https://github.com/unslothai/unsloth) [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth), and I collabed with the wonderful Hugging Face on 1 issue, and waiting to get more clarity from the Mistral on another!

Anyways finetuning Mistral NeMo 12b fits in **12GB of VRAM is 2x faster and uses 60% less VRAM**, with no accuracy degradation and works for free in a Google Colab, which you can try in this [notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing). I also have a [Kaggle notebook](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-nemo-12b-unsloth-notebook) which provides 30 hours for free per week of GPUs!

I uploaded 4bit bitandbytes quants for finetuning and inference as well to [https://huggingface.co/unsloth/Mistral-Nemo-Base-2407-bnb-4bit](https://huggingface.co/unsloth/Mistral-Nemo-Base-2407-bnb-4bit) for the base model and [https://huggingface.co/unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit](https://huggingface.co/unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit) for the instruct model.

**3 issues / bugs I found during implementing Mistral NeMo:**

https://preview.redd.it/4qmk7i099idd1.png?width=2054&format=png&auto=webp&s=56510e0e26a32fa2d1ab8b657c58593d10c9c638

1. <**/s> EOS token is untrained** in the base model but trained in instruct - confirming with Mistral if this is a feature or a bug - could make finetunes break with NaNs and infinities. Mistral 7b does not have this issue.
2. **EOS token is auto appended**. This can break finetuning and inference - collabed with HF to fix this quickly :)
3. **Not 5120 for Wq but 4096** - HF transformers main branch already has a fix for this - please update transformers! Unsloth auto patches, so no need to update!
4. More details in our blog: [https://unsloth.ai/blog/mistral-nemo](https://unsloth.ai/blog/mistral-nemo)

Also just made new documentation for Unsloth as well! [https://docs.unsloth.ai/](https://docs.unsloth.ai/) If you don't know what Unsloth is, it's a free open source package to make finetuning LLMs like Llama-3, Phi-3, Gemma-2 and now Mistral NeMO 2x faster, use 70% less memory with no degradation in accuracy. We use OpenAI's Triton language to write all kernels, derive backprop steps and reduce FLOPs by some maths tricks!

* We also now support RoPE scaling in CodeGemma, Gemma, Gemma-2, Qwen as well!
* And added training on completions / outputs!

To update Unsloth in a local machine (or install it), please use (no need for Colab / Kaggle)

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

More details in a [Github release](https://github.com/unslothai/unsloth/releases/tag/July-Mistral-2024) and try out the free finetuning Colab notebook for Mistral NeMo 12b: [https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing) Thanks!

Open Reddit thread
View more discussions →
FAQ

Common questions about Mistral Nemo

What is the context window size for Mistral NeMo?

Mistral NeMo supports a context window of up to 128,000 tokens, allowing it to process long documents or extended conversations in a single request.

Which languages does Mistral NeMo support?

The model is trained with particular strength in English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.

Does Mistral NeMo support function calling?

Yes, Mistral NeMo is trained on function calling, making it suitable for tool-use and agentic application workflows.

Who developed Mistral NeMo?

Mistral NeMo was developed by Mistral in collaboration with NVIDIA. The "NeMo" designation reflects the NVIDIA NeMo framework partnership.

What is the knowledge cutoff date for Mistral NeMo?

The metadata provided does not specify a training cutoff date. For the most accurate information, consult Mistral's official documentation.

More models from Mistral

Continue browsing adjacent models from the same provider.

← All AI Models