Long Context Window
Processes up to 128,000 tokens in a single request, enabling analysis of long documents, codebases, or extended conversations without truncation.
Mistral NeMo is a text generation model developed by Mistral, a French AI company. It features a 128,000-token context window and is trained with function calling support, making it suitable for agentic and tool-use workflows. The model has particular strength across eleven languages: English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi. Mistral NeMo is a 12-billion parameter model built in collaboration with NVIDIA, which is reflected in the "NeMo" name referencing NVIDIA's NeMo framework. It is designed for developers and organizations building multilingual applications where broad language coverage and a large context window are priorities. The model's combination of function calling capability, multilingual training, and long-context handling makes it a practical choice for global deployment scenarios.
Generates and understands text in eleven languages including English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.
Supports structured function calling, allowing the model to invoke external tools and APIs as part of agentic or automated workflows.
Produces and reasons about code across common programming languages, with coding accuracy noted as a focus area for its parameter size.
Applies multi-step reasoning and broad factual knowledge to answer questions, summarize content, and solve problems in text form.
Can return responses in structured formats such as JSON, useful for downstream data processing and integration tasks.
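As a rough illustration of the function-calling and structured-output capabilities listed above, the sketch below defines a tool in JSON-schema style and dispatches a tool call of the kind a model might return. The `get_weather` tool, the field names in `TOOLS`, and the sample model reply are all hypothetical; they follow a common chat-API convention and are not taken from Mistral's documentation.

```python
import json

# Hypothetical tool the application exposes to the model.
def get_weather(city: str) -> dict:
    # A real implementation would call a weather service; this stub is static.
    return {"city": city, "forecast": "sunny"}

# Tool description in JSON-schema style (field names are an assumption).
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the forecast for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Registry mapping tool names to local Python callables.
REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    """Parse a tool call emitted by the model and run the matching function."""
    call = json.loads(tool_call_json)
    func = REGISTRY[call["name"]]
    result = func(**call["arguments"])
    # Return JSON so the result can be passed back to the model as text.
    return json.dumps(result, sort_keys=True)

# Made-up example of what a model's tool call could look like.
sample_call = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(dispatch(sample_call))
```

In a real integration, `TOOLS` would be sent with the request and the string passed to `dispatch` would come from the model's response rather than a literal.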
Mistral Nemo discussions are most active in r/LocalLLaMA, r/SillyTavernAI, and r/MistralAI.
Top Reddit threads cluster around benchmarks and model comparisons, safety and censorship questions, and coding workflows. The strongest match in this snapshot has 894 upvotes and 2088 comments.
We all complained for months that there were no new models in the ~13B size range, after all the good Llama-2-13B finetunes came out.
Just wanna say thank you to those genius French fucks over at Mistral for Nemo. 12B parameters and 128k context is a very useful combination. It’s enough of a size improvement over 7B to feel a little more “solid” when talking to it, and it runs circles around Llama-2-13B, with 32x the context length.
Thank you, Mistral!
**March, 2026**. I wanted to **upscale**, I wanted to **prune**. So why not have both? And why's the fish fat anyway? And is this even coherent at this point?
It's coherent, follows instructions, knows new stuff, and new languages.
# The model is available here:
[https://huggingface.co/SicariusSicariiStuff/Fat\_Fish](https://huggingface.co/SicariusSicariiStuff/Fat_Fish)
It started as a normal Mistral **Nemo**, then it ate about **3B tokens**, and absolutely unhinged modifications were made to it, making it thiccer at all the right(?) places.
Basically, this is a highly experimental **proper upscale** of [mistralai/Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407).
About $1,000 went into this little project, not a bad investment for a worthwhile upscale experiment done to a Mistral-based model.
**IMPORTANT:** This is an intermediate step of what I have in mind; this model, while (surprisingly) coherent, needs more work. I decided to release it publicly 'as is' in its current form, because multiple people expressed enthusiasm about tuning it (based on unhinged curiosity, to be honest).
# But WHY?!
Because I think that:
1. Mistral Nemo is excellent
2. We likely won't get many more dense models, because MOE master race
Both points hold more gravitas than people realize. While Mistral released newer dense models at a similar size (14B, for example), their old Nemo was, in many people's opinion, generally better. How do I know? Simple: look at how many tunes Nemo got (post 2025, and even in 2026) versus the newer bases. Also, the benchmarks suggest that the old Nemo knows more stuff and is very tuning-friendly.
For the second point: the open source community still gets a new dense base here and there, but they have been few and far between since the meteoric rise of (mostly giant) MoEs.
Basically, I went "If I can't get a new base model, I'll make one myself", sort of.
# "Proper" upscale AND a prune
Why do I say "proper"? Aren't there countless upscales of various models in the wild? Not really. Most of the "upscales" are just **stack merges** made with mergekit, and often **down\_proj** is zeroed out, because slapping duplicated layers into random segments usually makes the model output ASCII characters and some random words. **No layers were zeroed out during the feeding of this fish**.
This is **both an upscale AND a prune**; truly naughty stuff was done to the beloved little Nemo.
Here are the main architecture changes I made:
|Parameter|Base Nemo|Fat\_Fish|
|:-|:-|:-|
|Hidden Size|5120|5120|
|Intermediate Size|14336|**12608**|
|Layers|40|**56**|
|Attention Heads|32|**48**|
|Key/Value Heads|8|**12 (because why not)**|
* **Why 12 KV heads instead of 16?** While I know **12 isn’t a neat divisor**, I wanted to see how it behaves in practice. Theoretically, increasing KV heads should improve **context representation and attention fidelity**, but jumping all the way to **16 would introduce a noticeably larger memory and compute overhead** during both training and inference. I experimented with **12 as a middle ground**, and it ended up working surprisingly well — stable during tuning, no issues during inference, and it also behaved nicely under **quantization**. So despite being a slightly “awkward” number architecturally, in practice it turned out to be a **very workable compromise between efficiency and capacity**.
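To put rough numbers on the memory side of the KV-head trade-off described above, the sketch below estimates KV-cache size per token for 8, 12, and 16 KV heads at the 56-layer depth from the table. It uses the standard estimate (keys plus values, per layer, per KV head). The head dimension of 128 and the 2-byte (fp16/bf16) cache entries are assumptions, not figures stated in the post, and quantized caches would be smaller.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int = 128,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache stored per token: one key and one value vector
    per KV head in every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

LAYERS = 56  # Fat_Fish layer count from the table above

for kv_heads in (8, 12, 16):
    per_token = kv_cache_bytes_per_token(LAYERS, kv_heads)
    # Cache size for a 16K-token context, in GiB (assumed context length).
    gib_at_16k = per_token * 16_384 / 2**30
    print(f"{kv_heads} KV heads: {per_token} bytes/token, "
          f"{gib_at_16k:.2f} GiB at 16K context")
```

Under these assumptions the cache scales linearly with the number of KV heads, so 12 heads costs 1.5x the cache of 8 heads and 16 heads costs 2x.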
# Suggestions on how to use it
This model is **NOT** made for human consumption 'as is', but rather as a base to build upon. You don't just eat raw dough now, do you? (actually, I'm sure that somewhere someone is 🥟👨🍳)
Noise was injected in various places so that the model, and the duplicated tensors in particular, would be noisy enough to learn new stuff. Surprisingly, after the massive CPT (continued pretraining), some of them began to converge to nearly the same patterns. Hence, I recommend:
* Running a layer similarity analysis
* Targeting the layers with the most similarity for full finetuning while keeping the rest frozen
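One simple way to approach the layer similarity analysis recommended above is to compare flattened weights of corresponding tensors across layers with cosine similarity and rank the pairs. The sketch below does this on toy data in pure Python. The metric, the helper names, and the toy values are illustrative assumptions, not the method the author used; on a real checkpoint the vectors would come from the model's weight tensors.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar_layers(layer_weights, top_k=2):
    """Rank all layer pairs by cosine similarity of their flattened weights."""
    pairs = [
        (cosine(layer_weights[i], layer_weights[j]), i, j)
        for i, j in combinations(range(len(layer_weights)), 2)
    ]
    pairs.sort(reverse=True)  # highest similarity first
    return pairs[:top_k]

# Toy stand-in for flattened per-layer weights (hypothetical values).
toy_layers = [
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.5],   # nearly a copy of layer 0
    [-0.3, 1.0, 0.2],
    [0.0, -1.0, 0.4],
]

for score, i, j in most_similar_layers(toy_layers):
    print(f"layers {i} and {j}: cosine similarity {score:.3f}")
```

The highest-ranked pairs are the candidates for full finetuning, with the remaining layers kept frozen.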
# What new data was added
|Data Source / Type|Percentage|Notes|
|:-|:-|:-|
|Fandom / Lore Knowledge|**20%**|Heavy emphasis on *Morrowind*, *Fallout*, and *Kenshi* knowledge and lore|
|Human Written Content|**50%**|General internet writing, essays, blogs, discussions, and natural dialogue|
|Synthetic Instruct Data|**4%**|Instruction-style prompts|
|Hebrew Text Corpus|**16%**|Modern Hebrew web text, forums, documentation, and conversational data|
|Other Mixed Sources|**10%**|Miscellaneous datasets and balancing material|
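For a sense of scale, the percentages above can be turned into approximate token counts. The sketch below assumes the shares are measured in tokens and treats the "about 3B tokens" figure from the post as exactly 3 billion; both are assumptions, so the results are ballpark values only.

```python
TOTAL_TOKENS = 3_000_000_000  # "about 3B tokens" per the post; treated as exact here

# Percentages copied from the table above.
MIX_PERCENT = {
    "Fandom / Lore Knowledge": 20,
    "Human Written Content": 50,
    "Synthetic Instruct Data": 4,
    "Hebrew Text Corpus": 16,
    "Other Mixed Sources": 10,
}

# Sanity check: the listed shares cover the whole mix.
assert sum(MIX_PERCENT.values()) == 100

tokens_by_source = {
    name: TOTAL_TOKENS * pct // 100 for name, pct in MIX_PERCENT.items()
}

for name, tokens in tokens_by_source.items():
    print(f"{name}: ~{tokens / 1e6:.0f}M tokens")
```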
# SAFETY
* Not very safe. Neither are knives; it's a dangerous world out there.
For the paper lovers, here's some more reading material about the subject:
* [Compact Language Models via Pruning and Knowledge Distillation](https://arxiv.org/abs/2407.14679)
* [LLM Pruning and Distillation in Practice: The Minitron Approach](https://arxiv.org/abs/2408.11796)
I’ve been playing with Nemo for a few days now, and it blows me away at how coherent it is. It’s slightly ‘less creative and more repetitive’ than Llama 3 8B fine-tunes… But it feels ‘more coherent and has better instruction capabilities’.
If Nemo Instruct is really good on its own, I can only imagine what fine-tunes will come out of it.
P.S. There’s also an upscaled version of Nemo, 21B.
I’ve been mainly using the 21B version at 6_K @ 16K context on my 4090.
I don’t know if there’s a difference yet between 12B and 21B… 🤔
I have to experiment with both of them a bit more.
But 21B Nemo is very impressive.
———
/u/TheLocalDrummer you should give Nemo Instruct a look.
We would all love to see “Moistal Nemolicious”
___
Update: Here is the 21B version of Nemo.
https://huggingface.co/TheSkullery/NeMoria-21b
Hey r/LocalLLaMA! Sorry this took a bit longer than usual, since I found **3 issues / bugs** in Mistral NeMo which made finetuning / inference runs break - should be all fixed in [Unsloth](https://github.com/unslothai/unsloth), and I collabed with the wonderful Hugging Face on 1 issue, and am waiting to get more clarity from Mistral on another!
Anyways, finetuning Mistral NeMo 12b **fits in 12GB of VRAM, is 2x faster, and uses 60% less VRAM**, with no accuracy degradation, and works for free in a Google Colab, which you can try in this [notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing). I also have a [Kaggle notebook](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-nemo-12b-unsloth-notebook), which provides 30 free GPU hours per week!
I uploaded 4-bit bitsandbytes quants for finetuning and inference as well to [https://huggingface.co/unsloth/Mistral-Nemo-Base-2407-bnb-4bit](https://huggingface.co/unsloth/Mistral-Nemo-Base-2407-bnb-4bit) for the base model and [https://huggingface.co/unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit](https://huggingface.co/unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit) for the instruct model.
**3 issues / bugs I found while implementing Mistral NeMo:**
https://preview.redd.it/4qmk7i099idd1.png?width=2054&format=png&auto=webp&s=56510e0e26a32fa2d1ab8b657c58593d10c9c638
1. **`</s>` EOS token is untrained** in the base model but trained in instruct - confirming with Mistral if this is a feature or a bug - could make finetunes break with NaNs and infinities. Mistral 7b does not have this issue.
2. **EOS token is auto appended**. This can break finetuning and inference - collabed with HF to fix this quickly :)
3. **Not 5120 for Wq but 4096** - HF transformers main branch already has a fix for this - please update transformers! Unsloth auto patches, so no need to update!

More details in our blog: [https://unsloth.ai/blog/mistral-nemo](https://unsloth.ai/blog/mistral-nemo)
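The third issue (Wq being 4096 wide rather than 5120) is what one would expect if the model uses an explicit head dimension instead of deriving it as hidden size divided by head count. The sketch below works through that arithmetic. The head dimension of 128 is an assumption inferred from 4096 / 32, and the head counts are the base Nemo values quoted elsewhere on this page; nothing here is taken from the actual patch.

```python
HIDDEN_SIZE = 5120
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128  # assumption: explicit head dim, so 32 * 128 = 4096

def projection_shapes(hidden: int, heads: int, kv_heads: int, head_dim: int) -> dict:
    """Return (in_features, out_features) for each attention projection."""
    return {
        "q_proj": (hidden, heads * head_dim),
        "k_proj": (hidden, kv_heads * head_dim),
        "v_proj": (hidden, kv_heads * head_dim),
        "o_proj": (heads * head_dim, hidden),
    }

shapes = projection_shapes(HIDDEN_SIZE, NUM_HEADS, NUM_KV_HEADS, HEAD_DIM)

# Code that derives the head dim from the hidden size gets a different value,
# and therefore builds a 5120-wide query projection that will not match the weights.
naive_head_dim = HIDDEN_SIZE // NUM_HEADS
naive_shapes = projection_shapes(HIDDEN_SIZE, NUM_HEADS, NUM_KV_HEADS, naive_head_dim)

print("explicit head_dim:", HEAD_DIM, "-> q_proj", shapes["q_proj"])
print("derived head_dim:", naive_head_dim, "-> q_proj", naive_shapes["q_proj"])
```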
Also just made new documentation for Unsloth as well! [https://docs.unsloth.ai/](https://docs.unsloth.ai/) If you don't know what Unsloth is, it's a free open source package that makes finetuning LLMs like Llama-3, Phi-3, Gemma-2 and now Mistral NeMo 2x faster and use 70% less memory, with no degradation in accuracy. We use OpenAI's Triton language to write all kernels, derive backprop steps, and reduce FLOPs with some maths tricks!
* We also now support RoPE scaling in CodeGemma, Gemma, Gemma-2, Qwen as well!
* And added training on completions / outputs!
To update (or install) Unsloth on a local machine, run the following (not needed on Colab / Kaggle):

    pip uninstall unsloth -y
    pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

More details are in a [GitHub release](https://github.com/unslothai/unsloth/releases/tag/July-Mistral-2024), and you can try out the free finetuning Colab notebook for Mistral NeMo 12b: [https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing) Thanks!
Mistral NeMo supports a context window of up to 128,000 tokens, allowing it to process long documents or extended conversations in a single request.
The model is trained with particular strength in English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.
Mistral NeMo is trained on function calling, making it suitable for tool-use and agentic application workflows.
Mistral NeMo was developed by Mistral in collaboration with NVIDIA. The "NeMo" designation reflects the NVIDIA NeMo framework partnership.
The metadata provided does not specify a training cutoff date. For the most accurate information, consult Mistral's official documentation.