Llama 4 Scout

Llama 4 Scout is a multimodal AI model developed by Meta, released in early 2025 as part of the Llama 4 model family. It uses a Mixture of Experts (MoE) architecture with 17 billion active parameters, 16 experts, and 109 billion total parameters, meaning only a subset of parameters is activated per token during inference. The model processes both text and image inputs within a unified backbone and supports a 130,000-token context window. Llama 4 Scout is designed for developers and enterprises building applications that require combined text and vision understanding. Its MoE design makes it more compute-efficient during training and inference compared to dense models of similar total parameter counts. On MindStudio, it is served via Groq, which provides low-latency inference for the instruct-tuned variant.

Apr 11, 2022 130,000 context 8,192 tokens output

Multimodal Input Long Context Window Mixture of Experts Instruction Following Fast Inference via Groq Code Generation

Overview ↓ About ↓ Capabilities ↓ Pricing ↓ Price Comparison ↓ Providers ↓ Benchmarks ↓ Tools ↓ Daily ↓ Resources ↓ Community ↓ FAQ ↓

Model Overview

High-signal model metadata in a structured two-column overview table.

Provider

The entity that provides this model.

Meta

Input Context Window

The number of tokens supported by the input context window.

130,000 tokens

Maximum Output Tokens

The number of tokens that can be generated by the model in a single request.

8,192 tokens tokens

Open Source

Whether the model's code is available for public use.

Release Date

When the model was first released.

Apr 11, 2022 4 years ago

Knowledge Cut-off Date

When the model's knowledge was last updated.

Unknown

API Providers

The providers that offer this model. This is not an exhaustive list.

Hugging Face, DeepInfra, Groq, Novita, Google

Modalities

Types of data this model can process.

Text Image

What is Llama 4 Scout

A fuller summary of positioning, capabilities, and source-specific details for Llama 4 Scout.

Llama 4 Scout is a multimodal AI model developed by Meta, released in early 2025 as part of the Llama 4 model family. It uses a Mixture of Experts (MoE) architecture with 17 billion active parameters, 16 experts, and 109 billion total parameters, meaning only a subset of parameters is activated per token during inference. The model processes both text and image inputs within a unified backbone and supports a 130,000-token context window.

Llama 4 Scout is designed for developers and enterprises building applications that require combined text and vision understanding. Its MoE design makes it more compute-efficient during training and inference compared to dense models of similar total parameter counts. On MindStudio, it is served via Groq, which provides low-latency inference for the instruct-tuned variant.

Capabilities

What Llama 4 Scout supports

Multimodal Input

Processes both text and image inputs within a single unified model backbone, enabling tasks that combine visual and language understanding.

CTX

Long Context Window

Supports up to 130,000 tokens of context, allowing it to handle long documents, extended conversations, or large code files in a single request.

Mixture of Experts

Uses a 16-expert MoE architecture with 109 billion total parameters, activating only 17 billion per token to reduce compute cost while maintaining output quality.

Instruction Following

Fine-tuned as an instruct model, enabling it to follow natural language instructions for tasks like summarization, Q&A, and structured generation.

Fast Inference via Groq

Served on Groq's LPU infrastructure, which is designed to deliver low-latency token generation for real-time applications.

</>

Code Generation

Capable of generating, explaining, and debugging code across common programming languages as part of its general instruction-following training.

Pricing for Llama 4 Scout

Primary API pricing shown in the same “quick compare” spirit as the reference page.

Input tokens $0.11 Per million tokens

Output tokens N/A Per million tokens

Price Comparison

Additional usage-cost dimensions synced into the project for this model.

maxTemperature 1

maxResponseSize 8,192 tokens

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

Hugging Face DeepInfra Groq Novita Google

Provider Endpoints

Endpoint-level provider data currently available for this model.

DeepInfra

Max output: 16,384 1d uptime: 99.9% Supported params: 13 Implicit caching: No

Groq

Max output: 8,192 1d uptime: 99.8% Supported params: 9 Implicit caching: No

Novita

Max output: 131,072 1d uptime: 99.9% Supported params: 9 Implicit caching: No

Google

Max output: 8,192 1d uptime: 99.9% Supported params: 12 Implicit caching: No

Model Performance

Benchmark scores synced from the current model source and normalized into the local catalog.

Benchmark	Score
AIME 2024 American math olympiad problems	28.3%
GPQA Diamond PhD-level science questions (biology, physics, chemistry)	58.7%
HLE Questions that challenge frontier models across many domains	4.3%
LiveCodeBench Real-world coding tasks from recent competitions	29.9%
MATH-500 Undergraduate and competition-level math problems	84.4%
MMLU-Pro Expert knowledge across 14 academic disciplines	75.2%
SciCode Scientific research coding and numerical methods	17.0%

Resources & Documentation

Official model cards, release notes, docs, and other references synced from the source page.

Product Announcement Announcements

→

Documentation Documentation

→

Meta Llama 4 Official Blog Announcements

→

Hugging Face Model Card Open Source

→

Meta Llama GitHub Open Source

→

Official Website

→

Technical Specifications

→

Research Paper

→

Responsible Use Guide

→

Usage License

→

AI tools related to Llama 4 Scout

These tools are strongly connected to Llama 4 Scout through direct product references, provider mentions, or explicit model mappings.

AI Assistant

Viinyx AI

Viinyx AI is an all-in-one browser extension that provides access to multiple AI models, including ChatGPT, Claude, Meta AI, and Gemini, directly on any website. Key features include page and video summarization, multi-PDF chat, chat history, AI writing assistance, and image generation. The extension operates within your browser session and supports Bring Your Own Key (BYOK) functionality for upgraded accounts.

Free 0 visits

AI Marketing

Hashmeta AI

Hashmeta AI is a Singapore-based AI agency focused on AI transformation and marketing. By integrating marketing expertise with AI agents, they assist businesses in achieving significant growth. Their service offerings include AI-powered SEO writing, lead response, and customer engagement, designed to provide high-level agency results at a more accessible price point. The team plans, builds, and executes tailored AI-driven marketing campaigns to ensure quality, speed, and scalability.

Free 27 visits 9 saves

AI Image Generator

Imagine with Meta AI

Imagine with Meta AI is a standalone tool that enables creative hobbyists to generate images using Emu, Meta's image foundation model. Users provide text descriptions, and the AI generates corresponding images. Please note that these AI-generated images may occasionally be inaccurate or inappropriate.

Free 0 visits 3 saves

AI Chatbot

Galactica

Galactica is an AI model trained on scientific literature, developed by Meta AI and Papers with Code as a research project to help users access and process scientific information. While initially released as a demo for research feedback, it was later removed from public access due to concerns regarding the generation of inaccurate information.

Free 0 visits 8 saves

Related Daily Briefs

Recent daily stories tied to Llama 4 Scout through direct model mentions or provider-level coverage.

Open Source Infra

Hugging Face and Meta Signal a Broader Shift Around Continuous Linear Integration

Hugging Face and Meta are becoming more practical to evaluate and deploy.

2026-06-29 AI Models AI API

Multimodal Creative

Offline reinforcement learning (RL); STRIVE-D Framework Improves Autonomous; Meta Confirms Instagram AI Chatbot

Meta are becoming more practical to evaluate and deploy.

2026-06-08 Text to Video Policy

Multimodal Creative

Sign-Gated On-Policy Distillation via; Reviving the Voice of Endangered Nüshu; Virtual-point-based Solutions to Handle

Cognition and Meta are becoming more practical to evaluate and deploy.

2026-06-08 Text to Video Policy

Research Benchmarks

RAM Neural Model Accelerates Robotic Reachability as Dri-MED Optimizes Bandits and Optics Refine AI

Meta and MiniMax are pushing more practical AI product shifts.

2026-06-08 AI Models AI API

Community discussion

What people think about Llama 4 Scout

Llama 4 Scout discussions are most active in r/LocalLLaMA, r/AIToolsPerformance, r/unsloth.

Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions. The strongest match in this snapshot has 2185 upvotes and 193 comments.

r/LocalLLaMA 8 upvotes 33 comments April 29, 2025

Why is Llama 4 considered bad?

I just watched Llamacon this morning and did some quick research while reading comments, and it seems like the vast majority of people aren't happy with the new Llama 4 Scout and Maverick models. Can someone explain why? I've finetuned some 3.1 models before, and I was wondering if it's even worth switching to 4. Any thoughts?

Open Reddit thread

r/LocalLLaMA 120 upvotes 40 comments February 17, 2026

Qwen 3.5, replacement to Llama 4 Scout?

Is Qwen 3.5 a direct replacement to Llama 4 in your opinion? Seems too much of a coincidence

Edit: 3.5 Plus and not Max

Open Reddit thread

r/LocalLLaMA 2,185 upvotes 193 comments April 6, 2025

Meta's Llama 4 Fell Short

Llama 4 Scout and Maverick left me really disappointed. It might explain why Joelle Pineau, Meta’s AI research lead, just got fired. Why are these models so underwhelming? My armchair analyst intuition suggests it’s partly the tiny expert size in their mixture-of-experts setup. 17B parameters? Feels small these days.

Meta’s struggle proves that having all the GPUs and Data in the world doesn’t mean much if the ideas aren’t fresh. Companies like DeepSeek, OpenAI etc. show real innovation is what pushes AI forward. You can’t just throw resources at a problem and hope for magic. Guess that’s the tricky part of AI, it’s not just about brute force, but brainpower too.

Open Reddit thread

r/LocalLLaMA 538 upvotes 247 comments April 6, 2025

I'm incredibly disappointed with Llama-4

I just finished my KCORES LLM Arena tests, adding Llama-4-Scout & Llama-4-Maverick to the mix.
My conclusion is that they completely surpassed my expectations... in a negative direction.

Llama-4-Maverick, the 402B parameter model, performs roughly on par with Qwen-QwQ-32B in terms of coding ability. Meanwhile, Llama-4-Scout is comparable to something like Grok-2 or Ernie 4.5...

You can just look at the "20 bouncing balls" test... the results are frankly terrible / abysmal.

Considering Llama-4-Maverick is a massive 402B parameters, why wouldn't I just use DeepSeek-V3-0324? Or even Qwen-QwQ-32B would be preferable – while its performance is similar, it's only 32B.

And as for Llama-4-Scout... well... let's just leave it at that / use it if it makes you happy, I guess... Meta, have you truly given up on the coding domain? Did you really just release vaporware?

Of course, its multimodal and long-context capabilities are currently unknown, as this review focuses solely on coding. I'd advise looking at other reviews or forming your own opinion based on actual usage for those aspects. In summary: I strongly advise against using Llama 4 for coding. Perhaps it might be worth trying for long text translation or multimodal tasks.

Open Reddit thread

r/LocalLLaMA 303 upvotes 176 comments April 24, 2025

Unsloth Dynamic v2.0 GGUFs + Llama 4 Bug Fixes + KL Divergence

Hey r/LocalLLaMA! I'm super excited to announce our new revamped 2.0 version of our Dynamic quants which outperform leading quantization methods on 5-shot MMLU and KL Divergence!

* For accurate benchmarking, we built an evaluation framework to match the reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full-precision vs. Dynamic v2.0, **QAT** and **standard imatrix GGUF** quants. See benchmark details below or check our Docs for full analysis: [https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs](https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs).
* For dynamic 2.0 GGUFs, we report **KL Divergence** and Disk Space change. Our Gemma 3 Q3\_K\_XL quant for example reduces the KL Divergence by 7.5% whilst increasing in only 2% of disk space!

https://preview.redd.it/d2upyhrp5uwe1.png?width=1714&format=png&auto=webp&s=7972946d6a21bd516022779337d6b3b70a13a77d

* According to the paper "Accuracy is Not All You Need" [https://arxiv.org/abs/2407.09141](https://arxiv.org/abs/2407.09141), the authors showcase how **perplexity is a bad metric since it's a geometric mean, and so output tokens can cancel out**. It's best to directly report "Flips", which is how answers change from being incorrect to correct and vice versa.

https://preview.redd.it/x1dcukp76uwe1.png?width=1991&format=png&auto=webp&s=39c6a92749133cf53ad5b88824ca023347c40036

* In fact I was having some issues with Gemma 3 - layer pruning methods and old methods did not seem to work at all with Gemma 3 (my guess is it's due to the 4 layernorms). The paper shows if you prune layers, the "flips" increase dramatically. **They also show KL Divergence to be around 98% correlated with "flips"**, so my goal is to reduce it!
* Also I found current standard imatrix quants overfit on Wikitext - the perplexity is always lower when using these datasets, and I decided to instead use **conversational style datasets sourced from high quality outputs from LLMs with 100% manual inspection (took me many days!!)**
* Going forward, all GGUF uploads will leverage Dynamic 2.0 along with our hand curated **300K–1.5M token calibration dataset** to improve conversational chat performance. Safetensors 4-bit BnB uploads might also be updated later.
* Gemma 3 27B details on KLD below:

|Quant type|KLD old|Old GB|KLD New|New GB|
|:-|:-|:-|:-|:-|
|IQ1\_S|1.035688|5.83|0.972932|6.06|
|IQ1\_M|0.832252|6.33|0.800049|6.51|
|IQ2\_XXS|0.535764|7.16|0.521039|7.31|
|IQ2\_M|0.26554|8.84|0.258192|8.96|
|Q2\_K\_XL|0.229671|9.78|0.220937|9.95|
|Q3\_K\_XL|0.087845|12.51|0.080617|12.76|
|Q4\_K\_XL|0.024916|15.41|0.023701|15.64|

# We also helped and fixed a few Llama 4 bugs:

Llama 4 Scout changed the RoPE Scaling configuration in their official repo. We helped resolve issues in llama.cpp to enable this [change here](https://github.com/ggml-org/llama.cpp/pull/12889)

https://preview.redd.it/g8et5pp67uwe1.png?width=2091&format=png&auto=webp&s=4a30f52ee76504d889f44f2c3950a4e8027686d6

Llama 4's QK Norm's epsilon for both Scout and Maverick should be from the config file - this means using 1e-05 and not 1e-06. We helped resolve these in [llama.cpp](https://github.com/ggml-org/llama.cpp/pull/12889) and [transformers](https://github.com/huggingface/transformers/pull/37418)

The Llama 4 team and vLLM also independently fixed an issue with QK Norm being shared across all heads (should not be so) [here](https://github.com/vllm-project/vllm/pull/16311). MMLU Pro increased from 68.58% to 71.53% accuracy.

[Wolfram Ravenwolf](https://x.com/WolframRvnwlf/status/1909735579564331016) showcased how our GGUFs via llama.cpp attain much higher accuracy than third party inference providers - this was most likely a combination of improper implementation and issues explained above.

**Dynamic v2.0 GGUFs** (you can also view [all GGUFs here](https://huggingface.co/collections/unsloth/unsloth-dynamic-v20-quants-68060d147e9b9231112823e6)):

|DeepSeek: [R1](https://huggingface.co/unsloth/DeepSeek-R1-GGUF-UD) • [V3-0324](https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD)|**Llama:** [4 (Scout)](https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF) • [3.1 (8B)](https://huggingface.co/unsloth/Llama-3.1-8B-Instruct-GGUF)|
|:-|:-|
|**Gemma 3:** [4B](https://huggingface.co/unsloth/gemma-3-4b-it-GGUF) • [12B](https://huggingface.co/unsloth/gemma-3-12b-it-GGUF) • [27B](https://huggingface.co/unsloth/gemma-3-27b-it-GGUF)|**Mistral:** [Small-3.1-2503](https://huggingface.co/unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF)|

## MMLU 5 shot Benchmarks for Gemma 3 27B betweeen QAT and normal:

**TLDR - Our dynamic 4bit quant gets +1% in MMLU vs QAT whilst being 2GB smaller!**

More details here: [https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs](https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs)

| Model | Unsloth | Unsloth + QAT | Disk Size | Efficiency |
|-------------|---------|----------------|-----------|------------|
| IQ1_S | 41.87 | 43.37 | 6.06 | 3.03 |
| IQ1_M | 48.10 | 47.23 | 6.51 | 3.42 |
| Q2_K_XL | 68.70 | 67.77 | 9.95 | 4.30 |
| Q3_K_XL | 70.87 | 69.50 | 12.76 | 3.49 |
| **Q4_K_XL** | **71.47** | **71.07** | **15.64** | **2.94** |
| Q5_K_M | 71.77 | 71.23 | 17.95 | 2.58 |
| Q6_K | 71.87 | 71.60 | 20.64 | 2.26 |
| Q8_0 | 71.60 | 71.53 | 26.74 | 1.74 |
| **Google QAT** | | **70.64** | **17.2** | **2.65** |

Open Reddit thread

View more discussions →

FAQ

Common questions about Llama 4 Scout

What is the context window for Llama 4 Scout?

Llama 4 Scout supports a context window of 130,000 tokens, which allows for long documents, extended conversations, or large inputs to be processed in a single request.

How many parameters does Llama 4 Scout have?

Llama 4 Scout has 109 billion total parameters, but uses a Mixture of Experts architecture that activates only 17 billion parameters per token during inference.

Does Llama 4 Scout support image inputs?

Yes. Llama 4 Scout is a multimodal model that can process both text and image inputs within a unified model backbone.

When was Llama 4 Scout trained?

According to the model metadata, Llama 4 Scout's training data has a cutoff in early 2025.

Who publishes Llama 4 Scout and where is it hosted on MindStudio?

Llama 4 Scout is developed and published by Meta. On MindStudio, it is served via Groq using the llama-4-scout-17b-16e-instruct model variant.

More models from Meta

Continue browsing adjacent models from the same provider.

← All AI Models