Multimodal Input
Processes both text and image inputs within a single unified model backbone, enabling tasks that combine visual and language understanding.
Llama 4 Scout is a multimodal AI model developed by Meta, released in early 2025 as part of the Llama 4 model family. It uses a Mixture of Experts (MoE) architecture with 17 billion active parameters, 16 experts, and 109 billion total parameters, meaning only a subset of parameters is activated per token during inference. The model processes both text and image inputs within a unified backbone and supports a 130,000-token context window. Llama 4 Scout is designed for developers and enterprises building applications that require combined text and vision understanding. Its MoE design makes it more compute-efficient during training and inference compared to dense models of similar total parameter counts. On MindStudio, it is served via Groq, which provides low-latency inference for the instruct-tuned variant.
High-signal model metadata in a structured two-column overview table.
The entity that provides this model.
The number of tokens supported by the input context window.
The number of tokens that can be generated by the model in a single request.
Whether the model's code is available for public use.
When the model was first released.
When the model's knowledge was last updated.
The providers that offer this model. This is not an exhaustive list.
Types of data this model can process.
A fuller summary of positioning, capabilities, and source-specific details for Llama 4 Scout.
Llama 4 Scout is a multimodal AI model developed by Meta, released in early 2025 as part of the Llama 4 model family. It uses a Mixture of Experts (MoE) architecture with 17 billion active parameters, 16 experts, and 109 billion total parameters, meaning only a subset of parameters is activated per token during inference. The model processes both text and image inputs within a unified backbone and supports a 130,000-token context window.
Llama 4 Scout is designed for developers and enterprises building applications that require combined text and vision understanding. Its MoE design makes it more compute-efficient during training and inference compared to dense models of similar total parameter counts. On MindStudio, it is served via Groq, which provides low-latency inference for the instruct-tuned variant.
Processes both text and image inputs within a single unified model backbone, enabling tasks that combine visual and language understanding.
Supports up to 130,000 tokens of context, allowing it to handle long documents, extended conversations, or large code files in a single request.
Uses a 16-expert MoE architecture with 109 billion total parameters, activating only 17 billion per token to reduce compute cost while maintaining output quality.
Fine-tuned as an instruct model, enabling it to follow natural language instructions for tasks like summarization, Q&A, and structured generation.
Served on Groq's LPU infrastructure, which is designed to deliver low-latency token generation for real-time applications.
Capable of generating, explaining, and debugging code across common programming languages as part of its general instruction-following training.
Primary API pricing shown in the same “quick compare” spirit as the reference page.
Additional usage-cost dimensions synced into the project for this model.
Places where this model is available, based on the synced detail-page metadata.
Endpoint-level provider data currently available for this model.
Benchmark scores synced from the current model source and normalized into the local catalog.
| Benchmark | Score |
|---|---|
|
AIME 2024
American math olympiad problems
|
|
|
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
|
|
|
HLE
Questions that challenge frontier models across many domains
|
|
|
LiveCodeBench
Real-world coding tasks from recent competitions
|
|
|
MATH-500
Undergraduate and competition-level math problems
|
|
|
MMLU-Pro
Expert knowledge across 14 academic disciplines
|
|
|
SciCode
Scientific research coding and numerical methods
|
Official model cards, release notes, docs, and other references synced from the source page.
Llama 4 Scout discussions are most active in r/LocalLLaMA, r/AIToolsPerformance, r/unsloth.
Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions. The strongest match in this snapshot has 2185 upvotes and 193 comments.
I just watched Llamacon this morning and did some quick research while reading comments, and it seems like the vast majority of people aren't happy with the new Llama 4 Scout and Maverick models. Can someone explain why? I've finetuned some 3.1 models before, and I was wondering if it's even worth switching to 4. Any thoughts?
Is Qwen 3.5 a direct replacement to Llama 4 in your opinion? Seems too much of a coincidence
Edit: 3.5 Plus and not Max
Llama 4 Scout and Maverick left me really disappointed. It might explain why Joelle Pineau, Meta’s AI research lead, just got fired. Why are these models so underwhelming? My armchair analyst intuition suggests it’s partly the tiny expert size in their mixture-of-experts setup. 17B parameters? Feels small these days.
Meta’s struggle proves that having all the GPUs and Data in the world doesn’t mean much if the ideas aren’t fresh. Companies like DeepSeek, OpenAI etc. show real innovation is what pushes AI forward. You can’t just throw resources at a problem and hope for magic. Guess that’s the tricky part of AI, it’s not just about brute force, but brainpower too.
I just finished my KCORES LLM Arena tests, adding Llama-4-Scout & Llama-4-Maverick to the mix.
My conclusion is that they completely surpassed my expectations... in a negative direction.
Llama-4-Maverick, the 402B parameter model, performs roughly on par with Qwen-QwQ-32B in terms of coding ability. Meanwhile, Llama-4-Scout is comparable to something like Grok-2 or Ernie 4.5...
You can just look at the "20 bouncing balls" test... the results are frankly terrible / abysmal.
Considering Llama-4-Maverick is a massive 402B parameters, why wouldn't I just use DeepSeek-V3-0324? Or even Qwen-QwQ-32B would be preferable – while its performance is similar, it's only 32B.
And as for Llama-4-Scout... well... let's just leave it at that / use it if it makes you happy, I guess... Meta, have you truly given up on the coding domain? Did you really just release vaporware?
Of course, its multimodal and long-context capabilities are currently unknown, as this review focuses solely on coding. I'd advise looking at other reviews or forming your own opinion based on actual usage for those aspects. In summary: I strongly advise against using Llama 4 for coding. Perhaps it might be worth trying for long text translation or multimodal tasks.
Hey r/LocalLLaMA! I'm super excited to announce our new revamped 2.0 version of our Dynamic quants which outperform leading quantization methods on 5-shot MMLU and KL Divergence!
* For accurate benchmarking, we built an evaluation framework to match the reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full-precision vs. Dynamic v2.0, **QAT** and **standard imatrix GGUF** quants. See benchmark details below or check our Docs for full analysis: [https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs](https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs).
* For dynamic 2.0 GGUFs, we report **KL Divergence** and Disk Space change. Our Gemma 3 Q3\_K\_XL quant for example reduces the KL Divergence by 7.5% whilst increasing in only 2% of disk space!
https://preview.redd.it/d2upyhrp5uwe1.png?width=1714&format=png&auto=webp&s=7972946d6a21bd516022779337d6b3b70a13a77d
* According to the paper "Accuracy is Not All You Need" [https://arxiv.org/abs/2407.09141](https://arxiv.org/abs/2407.09141), the authors showcase how **perplexity is a bad metric since it's a geometric mean, and so output tokens can cancel out**. It's best to directly report "Flips", which is how answers change from being incorrect to correct and vice versa.
https://preview.redd.it/x1dcukp76uwe1.png?width=1991&format=png&auto=webp&s=39c6a92749133cf53ad5b88824ca023347c40036
* In fact I was having some issues with Gemma 3 - layer pruning methods and old methods did not seem to work at all with Gemma 3 (my guess is it's due to the 4 layernorms). The paper shows if you prune layers, the "flips" increase dramatically. **They also show KL Divergence to be around 98% correlated with "flips"**, so my goal is to reduce it!
* Also I found current standard imatrix quants overfit on Wikitext - the perplexity is always lower when using these datasets, and I decided to instead use **conversational style datasets sourced from high quality outputs from LLMs with 100% manual inspection (took me many days!!)**
* Going forward, all GGUF uploads will leverage Dynamic 2.0 along with our hand curated **300K–1.5M token calibration dataset** to improve conversational chat performance. Safetensors 4-bit BnB uploads might also be updated later.
* Gemma 3 27B details on KLD below:
|Quant type|KLD old|Old GB|KLD New|New GB|
|:-|:-|:-|:-|:-|
|IQ1\_S|1.035688|5.83|0.972932|6.06|
|IQ1\_M|0.832252|6.33|0.800049|6.51|
|IQ2\_XXS|0.535764|7.16|0.521039|7.31|
|IQ2\_M|0.26554|8.84|0.258192|8.96|
|Q2\_K\_XL|0.229671|9.78|0.220937|9.95|
|Q3\_K\_XL|0.087845|12.51|0.080617|12.76|
|Q4\_K\_XL|0.024916|15.41|0.023701|15.64|
# We also helped and fixed a few Llama 4 bugs:
Llama 4 Scout changed the RoPE Scaling configuration in their official repo. We helped resolve issues in llama.cpp to enable this [change here](https://github.com/ggml-org/llama.cpp/pull/12889)
https://preview.redd.it/g8et5pp67uwe1.png?width=2091&format=png&auto=webp&s=4a30f52ee76504d889f44f2c3950a4e8027686d6
Llama 4's QK Norm's epsilon for both Scout and Maverick should be from the config file - this means using 1e-05 and not 1e-06. We helped resolve these in [llama.cpp](https://github.com/ggml-org/llama.cpp/pull/12889) and [transformers](https://github.com/huggingface/transformers/pull/37418)
The Llama 4 team and vLLM also independently fixed an issue with QK Norm being shared across all heads (should not be so) [here](https://github.com/vllm-project/vllm/pull/16311). MMLU Pro increased from 68.58% to 71.53% accuracy.
[Wolfram Ravenwolf](https://x.com/WolframRvnwlf/status/1909735579564331016) showcased how our GGUFs via llama.cpp attain much higher accuracy than third party inference providers - this was most likely a combination of improper implementation and issues explained above.
**Dynamic v2.0 GGUFs** (you can also view [all GGUFs here](https://huggingface.co/collections/unsloth/unsloth-dynamic-v20-quants-68060d147e9b9231112823e6)):
|DeepSeek: [R1](https://huggingface.co/unsloth/DeepSeek-R1-GGUF-UD) • [V3-0324](https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD)|**Llama:** [4 (Scout)](https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF) • [3.1 (8B)](https://huggingface.co/unsloth/Llama-3.1-8B-Instruct-GGUF)|
|:-|:-|
|**Gemma 3:** [4B](https://huggingface.co/unsloth/gemma-3-4b-it-GGUF) • [12B](https://huggingface.co/unsloth/gemma-3-12b-it-GGUF) • [27B](https://huggingface.co/unsloth/gemma-3-27b-it-GGUF)|**Mistral:** [Small-3.1-2503](https://huggingface.co/unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF)|
## MMLU 5 shot Benchmarks for Gemma 3 27B betweeen QAT and normal:
**TLDR - Our dynamic 4bit quant gets +1% in MMLU vs QAT whilst being 2GB smaller!**
More details here: [https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs](https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs)
| Model | Unsloth | Unsloth + QAT | Disk Size | Efficiency |
|-------------|---------|----------------|-----------|------------|
| IQ1_S | 41.87 | 43.37 | 6.06 | 3.03 |
| IQ1_M | 48.10 | 47.23 | 6.51 | 3.42 |
| Q2_K_XL | 68.70 | 67.77 | 9.95 | 4.30 |
| Q3_K_XL | 70.87 | 69.50 | 12.76 | 3.49 |
| **Q4_K_XL** | **71.47** | **71.07** | **15.64** | **2.94** |
| Q5_K_M | 71.77 | 71.23 | 17.95 | 2.58 |
| Q6_K | 71.87 | 71.60 | 20.64 | 2.26 |
| Q8_0 | 71.60 | 71.53 | 26.74 | 1.74 |
| **Google QAT** | | **70.64** | **17.2** | **2.65** |
Llama 4 Scout supports a context window of 130,000 tokens, which allows for long documents, extended conversations, or large inputs to be processed in a single request.
Llama 4 Scout has 109 billion total parameters, but uses a Mixture of Experts architecture that activates only 17 billion parameters per token during inference.
Yes. Llama 4 Scout is a multimodal model that can process both text and image inputs within a unified model backbone.
According to the model metadata, Llama 4 Scout's training data has a cutoff in early 2025.
Llama 4 Scout is developed and published by Meta. On MindStudio, it is served via Groq using the llama-4-scout-17b-16e-instruct model variant.
Continue browsing adjacent models from the same provider.