Meta

Llama 4 Scout

Llama 4 Scout is a multimodal AI model developed by Meta, released in early 2025 as part of the Llama 4 model family. It uses a Mixture of Experts (MoE) architecture with 17 billion active parameters, 16 experts, and 109 billion total parameters, meaning only a subset of parameters is activated per token during inference. The model processes both text and image inputs within a unified backbone and supports a 130,000-token context window. Llama 4 Scout is designed for developers and enterprises building applications that require combined text and vision understanding. Its MoE design makes it more compute-efficient during training and inference compared to dense models of similar total parameter counts. On MindStudio, it is served via Groq, which provides low-latency inference for the instruct-tuned variant.

Apr 11, 2022 130,000 context 8,192 tokens output
Multimodal Input Long Context Window Mixture of Experts Instruction Following Fast Inference via Groq Code Generation

Model Overview

High-signal model metadata in a structured two-column overview table.

Provider

The entity that provides this model.

Meta

Input Context Window

The number of tokens supported by the input context window.

130,000 tokens

Maximum Output Tokens

The number of tokens that can be generated by the model in a single request.

8,192 tokens tokens

Open Source

Whether the model's code is available for public use.

No

Release Date

When the model was first released.

Apr 11, 2022 4 years ago

Knowledge Cut-off Date

When the model's knowledge was last updated.

Unknown

API Providers

The providers that offer this model. This is not an exhaustive list.

Hugging Face, DeepInfra, Groq, Novita, Google

Modalities

Types of data this model can process.

Text Image

What is Llama 4 Scout

A fuller summary of positioning, capabilities, and source-specific details for Llama 4 Scout.

Llama 4 Scout is a multimodal AI model developed by Meta, released in early 2025 as part of the Llama 4 model family. It uses a Mixture of Experts (MoE) architecture with 17 billion active parameters, 16 experts, and 109 billion total parameters, meaning only a subset of parameters is activated per token during inference. The model processes both text and image inputs within a unified backbone and supports a 130,000-token context window.

Llama 4 Scout is designed for developers and enterprises building applications that require combined text and vision understanding. Its MoE design makes it more compute-efficient during training and inference compared to dense models of similar total parameter counts. On MindStudio, it is served via Groq, which provides low-latency inference for the instruct-tuned variant.

Capabilities

What Llama 4 Scout supports

MM

Multimodal Input

Processes both text and image inputs within a single unified model backbone, enabling tasks that combine visual and language understanding.

CTX

Long Context Window

Supports up to 130,000 tokens of context, allowing it to handle long documents, extended conversations, or large code files in a single request.

AI

Mixture of Experts

Uses a 16-expert MoE architecture with 109 billion total parameters, activating only 17 billion per token to reduce compute cost while maintaining output quality.

AI

Instruction Following

Fine-tuned as an instruct model, enabling it to follow natural language instructions for tasks like summarization, Q&A, and structured generation.

AI

Fast Inference via Groq

Served on Groq's LPU infrastructure, which is designed to deliver low-latency token generation for real-time applications.

</>

Code Generation

Capable of generating, explaining, and debugging code across common programming languages as part of its general instruction-following training.

Pricing for Llama 4 Scout

Primary API pricing shown in the same “quick compare” spirit as the reference page.

Price Comparison

Additional usage-cost dimensions synced into the project for this model.

maxTemperature 1
maxResponseSize 8,192 tokens

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

Hugging Face DeepInfra Groq Novita Google

Provider Endpoints

Endpoint-level provider data currently available for this model.

DeepInfra

Max output: 16,384 1d uptime: 100.0% Supported params: 13 Implicit caching: No

Groq

Max output: 8,192 1d uptime: 99.7% Supported params: 8 Implicit caching: No

Novita

Max output: 131,072 1d uptime: 99.9% Supported params: 9 Implicit caching: No

Google

Max output: 8,192 1d uptime: 99.8% Supported params: 12 Implicit caching: No

Model Performance

Benchmark scores synced from the current model source and normalized into the local catalog.

Benchmark Score
AIME 2024
American math olympiad problems
28.3%
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
58.7%
HLE
Questions that challenge frontier models across many domains
4.3%
LiveCodeBench
Real-world coding tasks from recent competitions
29.9%
MATH-500
Undergraduate and competition-level math problems
84.4%
MMLU-Pro
Expert knowledge across 14 academic disciplines
75.2%
SciCode
Scientific research coding and numerical methods
17.0%

Resources & Documentation

Official model cards, release notes, docs, and other references synced from the source page.

Community discussion

What people think about Llama 4 Scout

Llama 4 Scout discussions are most active in r/LocalLLaMA, r/AIToolsPerformance, r/unsloth.

Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions. The strongest match in this snapshot has 2185 upvotes and 193 comments.

r/LocalLLaMA 8 upvotes 33 comments April 29, 2025
Why is Llama 4 considered bad?

I just watched Llamacon this morning and did some quick research while reading comments, and it seems like the vast majority of people aren't happy with the new Llama 4 Scout and Maverick models. Can someone explain why? I've finetuned some 3.1 models before, and I was wondering if it's even worth switching to 4. Any thoughts?

Open Reddit thread
r/LocalLLaMA 2,185 upvotes 193 comments April 6, 2025
Meta's Llama 4 Fell Short

Llama 4 Scout and Maverick left me really disappointed. It might explain why Joelle Pineau, Meta’s AI research lead, just got fired. Why are these models so underwhelming? My armchair analyst intuition suggests it’s partly the tiny expert size in their mixture-of-experts setup. 17B parameters? Feels small these days.

Meta’s struggle proves that having all the GPUs and Data in the world doesn’t mean much if the ideas aren’t fresh. Companies like DeepSeek, OpenAI etc. show real innovation is what pushes AI forward. You can’t just throw resources at a problem and hope for magic. Guess that’s the tricky part of AI, it’s not just about brute force, but brainpower too.

Open Reddit thread
r/LocalLLaMA 538 upvotes 247 comments April 6, 2025
I'm incredibly disappointed with Llama-4

I just finished my KCORES LLM Arena tests, adding Llama-4-Scout & Llama-4-Maverick to the mix.
My conclusion is that they completely surpassed my expectations... in a negative direction.

Llama-4-Maverick, the 402B parameter model, performs roughly on par with Qwen-QwQ-32B in terms of coding ability. Meanwhile, Llama-4-Scout is comparable to something like Grok-2 or Ernie 4.5...

You can just look at the "20 bouncing balls" test... the results are frankly terrible / abysmal.

Considering Llama-4-Maverick is a massive 402B parameters, why wouldn't I just use DeepSeek-V3-0324? Or even Qwen-QwQ-32B would be preferable – while its performance is similar, it's only 32B.

And as for Llama-4-Scout... well... let's just leave it at that / use it if it makes you happy, I guess... Meta, have you truly given up on the coding domain? Did you really just release vaporware?

Of course, its multimodal and long-context capabilities are currently unknown, as this review focuses solely on coding. I'd advise looking at other reviews or forming your own opinion based on actual usage for those aspects. In summary: I strongly advise against using Llama 4 for coding. Perhaps it might be worth trying for long text translation or multimodal tasks.

Open Reddit thread
r/LocalLLaMA 303 upvotes 176 comments April 24, 2025
Unsloth Dynamic v2.0 GGUFs + Llama 4 Bug Fixes + KL Divergence

Hey r/LocalLLaMA! I'm super excited to announce our new revamped 2.0 version of our Dynamic quants which outperform leading quantization methods on 5-shot MMLU and KL Divergence!

* For accurate benchmarking, we built an evaluation framework to match the reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full-precision vs. Dynamic v2.0, **QAT** and **standard imatrix GGUF** quants. See benchmark details below or check our Docs for full analysis: [https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs](https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs).
* For dynamic 2.0 GGUFs, we report **KL Divergence** and Disk Space change. Our Gemma 3 Q3\_K\_XL quant for example reduces the KL Divergence by 7.5% whilst increasing in only 2% of disk space!

https://preview.redd.it/d2upyhrp5uwe1.png?width=1714&format=png&auto=webp&s=7972946d6a21bd516022779337d6b3b70a13a77d

* According to the paper "Accuracy is Not All You Need" [https://arxiv.org/abs/2407.09141](https://arxiv.org/abs/2407.09141), the authors showcase how **perplexity is a bad metric since it's a geometric mean, and so output tokens can cancel out**. It's best to directly report "Flips", which is how answers change from being incorrect to correct and vice versa.

https://preview.redd.it/x1dcukp76uwe1.png?width=1991&format=png&auto=webp&s=39c6a92749133cf53ad5b88824ca023347c40036

* In fact I was having some issues with Gemma 3 - layer pruning methods and old methods did not seem to work at all with Gemma 3 (my guess is it's due to the 4 layernorms). The paper shows if you prune layers, the "flips" increase dramatically. **They also show KL Divergence to be around 98% correlated with "flips"**, so my goal is to reduce it!
* Also I found current standard imatrix quants overfit on Wikitext - the perplexity is always lower when using these datasets, and I decided to instead use **conversational style datasets sourced from high quality outputs from LLMs with 100% manual inspection (took me many days!!)**
* Going forward, all GGUF uploads will leverage Dynamic 2.0 along with our hand curated **300K–1.5M token calibration dataset** to improve conversational chat performance. Safetensors 4-bit BnB uploads might also be updated later.
* Gemma 3 27B details on KLD below:

|Quant type|KLD old|Old GB|KLD New|New GB|
|:-|:-|:-|:-|:-|
|IQ1\_S|1.035688|5.83|0.972932|6.06|
|IQ1\_M|0.832252|6.33|0.800049|6.51|
|IQ2\_XXS|0.535764|7.16|0.521039|7.31|
|IQ2\_M|0.26554|8.84|0.258192|8.96|
|Q2\_K\_XL|0.229671|9.78|0.220937|9.95|
|Q3\_K\_XL|0.087845|12.51|0.080617|12.76|
|Q4\_K\_XL|0.024916|15.41|0.023701|15.64|

# We also helped and fixed a few Llama 4 bugs:

Llama 4 Scout changed the RoPE Scaling configuration in their official repo. We helped resolve issues in llama.cpp to enable this [change here](https://github.com/ggml-org/llama.cpp/pull/12889)

https://preview.redd.it/g8et5pp67uwe1.png?width=2091&format=png&auto=webp&s=4a30f52ee76504d889f44f2c3950a4e8027686d6

Llama 4's QK Norm's epsilon for both Scout and Maverick should be from the config file - this means using 1e-05 and not 1e-06. We helped resolve these in [llama.cpp](https://github.com/ggml-org/llama.cpp/pull/12889) and [transformers](https://github.com/huggingface/transformers/pull/37418)

The Llama 4 team and vLLM also independently fixed an issue with QK Norm being shared across all heads (should not be so) [here](https://github.com/vllm-project/vllm/pull/16311). MMLU Pro increased from 68.58% to 71.53% accuracy.

[Wolfram Ravenwolf](https://x.com/WolframRvnwlf/status/1909735579564331016) showcased how our GGUFs via llama.cpp attain much higher accuracy than third party inference providers - this was most likely a combination of improper implementation and issues explained above.

**Dynamic v2.0 GGUFs** (you can also view [all GGUFs here](https://huggingface.co/collections/unsloth/unsloth-dynamic-v20-quants-68060d147e9b9231112823e6)):

|DeepSeek: [R1](https://huggingface.co/unsloth/DeepSeek-R1-GGUF-UD) • [V3-0324](https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD)|**Llama:** [4 (Scout)](https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF) • [3.1 (8B)](https://huggingface.co/unsloth/Llama-3.1-8B-Instruct-GGUF)|
|:-|:-|
|**Gemma 3:** [4B](https://huggingface.co/unsloth/gemma-3-4b-it-GGUF) • [12B](https://huggingface.co/unsloth/gemma-3-12b-it-GGUF) • [27B](https://huggingface.co/unsloth/gemma-3-27b-it-GGUF)|**Mistral:** [Small-3.1-2503](https://huggingface.co/unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF)|

## MMLU 5 shot Benchmarks for Gemma 3 27B betweeen QAT and normal:

**TLDR - Our dynamic 4bit quant gets +1% in MMLU vs QAT whilst being 2GB smaller!**

More details here: [https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs](https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs)

| Model | Unsloth | Unsloth + QAT | Disk Size | Efficiency |
|-------------|---------|----------------|-----------|------------|
| IQ1_S | 41.87 | 43.37 | 6.06 | 3.03 |
| IQ1_M | 48.10 | 47.23 | 6.51 | 3.42 |
| Q2_K_XL | 68.70 | 67.77 | 9.95 | 4.30 |
| Q3_K_XL | 70.87 | 69.50 | 12.76 | 3.49 |
| **Q4_K_XL** | **71.47** | **71.07** | **15.64** | **2.94** |
| Q5_K_M | 71.77 | 71.23 | 17.95 | 2.58 |
| Q6_K | 71.87 | 71.60 | 20.64 | 2.26 |
| Q8_0 | 71.60 | 71.53 | 26.74 | 1.74 |
| **Google QAT** | | **70.64** | **17.2** | **2.65** |

Open Reddit thread
View more discussions →
FAQ

Common questions about Llama 4 Scout

What is the context window for Llama 4 Scout?

Llama 4 Scout supports a context window of 130,000 tokens, which allows for long documents, extended conversations, or large inputs to be processed in a single request.

How many parameters does Llama 4 Scout have?

Llama 4 Scout has 109 billion total parameters, but uses a Mixture of Experts architecture that activates only 17 billion parameters per token during inference.

Does Llama 4 Scout support image inputs?

Yes. Llama 4 Scout is a multimodal model that can process both text and image inputs within a unified model backbone.

When was Llama 4 Scout trained?

According to the model metadata, Llama 4 Scout's training data has a cutoff in early 2025.

Who publishes Llama 4 Scout and where is it hosted on MindStudio?

Llama 4 Scout is developed and published by Meta. On MindStudio, it is served via Groq using the llama-4-scout-17b-16e-instruct model variant.

More models from Meta

Continue browsing adjacent models from the same provider.

← All AI Models