OpenAI

GPT OSS 20B

GPT OSS 20B is an open-weight text generation model released by OpenAI in August 2025, representing the company's first open-weight release since GPT-2 in 2019. It uses a Mixture-of-Experts (MoE) architecture with 21 billion total parameters, activating approximately 3.6 billion parameters per token across 4 of 32 experts in 24 layers. Combined with MXFP4 4-bit quantization, the model runs within 16GB of memory, making it suitable for consumer hardware and on-device deployment. It is licensed under Apache 2.0, allowing local hosting, firewall-protected deployment, and fine-tuning for custom use cases. GPT OSS 20B supports a 128,000-token context window and includes adjustable reasoning levels — low, medium, and high — with chain-of-thought traces. Its documented strengths include coding, mathematical reasoning, and scientific analysis, along with tool use and agentic workflow support. The model also produces structured outputs for predictable, schema-conforming responses. It is available through Hugging Face, Amazon SageMaker, Amazon Bedrock, and NVIDIA NIM, and is well-suited for developers and organizations that require a self-hosted, customizable AI model without relying on cloud infrastructure.

Aug 05, 2025 128,000 context 32,768 tokens output
Adjustable Reasoning Long Context Window Coding and Math Tool Use and Agents Structured Outputs On-Device Efficiency

Model Overview

High-signal model metadata in a structured two-column overview table.

Provider

The entity that provides this model.

OpenAI

Model ID

The routed model identifier exposed by upstream providers.

openai/gpt-oss-20b:free

Input Context Window

The number of tokens supported by the input context window.

128,000 tokens

Maximum Output Tokens

The number of tokens that can be generated by the model in a single request.

32,768 tokens tokens

Open Source

Whether the model's code is available for public use.

No

Release Date

When the model was first released.

Aug 05, 2025 10 months ago

Knowledge Cut-off Date

When the model's knowledge was last updated.

August 2025

API Providers

The providers that offer this model. This is not an exhaustive list.

OpenInference

Modalities

Types of data this model can process.

Text

What is GPT OSS 20B

A fuller summary of positioning, capabilities, and source-specific details for GPT OSS 20B.

GPT OSS 20B is an open-weight text generation model released by OpenAI in August 2025, representing the company's first open-weight release since GPT-2 in 2019. It uses a Mixture-of-Experts (MoE) architecture with 21 billion total parameters, activating approximately 3.6 billion parameters per token across 4 of 32 experts in 24 layers. Combined with MXFP4 4-bit quantization, the model runs within 16GB of memory, making it suitable for consumer hardware and on-device deployment. It is licensed under Apache 2.0, allowing local hosting, firewall-protected deployment, and fine-tuning for custom use cases.

GPT OSS 20B supports a 128,000-token context window and includes adjustable reasoning levels — low, medium, and high — with chain-of-thought traces. Its documented strengths include coding, mathematical reasoning, and scientific analysis, along with tool use and agentic workflow support. The model also produces structured outputs for predictable, schema-conforming responses. It is available through Hugging Face, Amazon SageMaker, Amazon Bedrock, and NVIDIA NIM, and is well-suited for developers and organizations that require a self-hosted, customizable AI model without relying on cloud infrastructure.

Capabilities

What GPT OSS 20B supports

RN

Adjustable Reasoning

Supports low, medium, and high reasoning levels with chain-of-thought traces, giving developers control over the depth of reasoning applied per request.

CTX

Long Context Window

Handles up to 128,000 tokens per request, enabling processing of long documents, codebases, or extended multi-turn conversations in a single pass.

</>

Coding and Math

Documented core strengths include code generation, mathematical reasoning, and scientific analysis tasks.

AG

Tool Use and Agents

Supports tool calling and agentic workflows, allowing the model to interact with external functions and APIs as part of multi-step tasks.

JSON

Structured Outputs

Produces structured, schema-conforming responses for use cases that require predictable output formats such as JSON.

AI

On-Device Efficiency

Uses MXFP4 4-bit quantization and a MoE architecture to run within 16GB of memory, making local deployment on consumer hardware feasible.

AI

Open-Weight License

Released under Apache 2.0, allowing self-hosting, fine-tuning, and deployment behind firewalls without usage restrictions from the license.

AI

Fine-Tuning Support

Supports fine-tuning for custom use cases, with documented integration via Hugging Face libraries on Amazon SageMaker.

Pricing for GPT OSS 20B

Primary API pricing shown in the same “quick compare” spirit as the reference page.

Price Comparison

Additional usage-cost dimensions synced into the project for this model.

maxTemperature 2
maxResponseSize 32,768 tokens

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

OpenInference

Provider Endpoints

Endpoint-level provider data currently available for this model.

OpenInference

Max output: 8,192 1d uptime: 99.6% Supported params: 8 Implicit caching: No

Model Performance

Benchmark scores synced from the current model source and normalized into the local catalog.

Benchmark Score
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
68.8%
HLE
Questions that challenge frontier models across many domains
9.8%
LiveCodeBench
Real-world coding tasks from recent competitions
77.7%
MMLU-Pro
Expert knowledge across 14 academic disciplines
74.8%
SciCode
Scientific research coding and numerical methods
34.4%

Resources & Documentation

Official model cards, release notes, docs, and other references synced from the source page.

Community discussion

What people think about GPT OSS 20B

GPT OSS 20B discussions are most active in r/LocalLLaMA, r/LocalLLM, r/MiniPCs. Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions.

The strongest match in this snapshot has 1501 upvotes and 260 comments.

r/selfhosted 1,501 upvotes 260 comments August 6, 2025
You can now run OpenAI's gpt-oss model on your local device! (14GB RAM)

Hello everyone! OpenAI just released their first open-source models in 5 years, and now, you can have your own GPT-4o and o3 model at home! They're called 'gpt-oss'.

There's two models, a smaller 20B parameter model and a 120B one that rivals o4-mini. **Both** models outperform GPT-4o in various tasks, including reasoning, coding, math, health and agentic tasks.

To run the models locally (laptop, Mac, desktop etc), we at [**Unsloth**](https://docs.unsloth.ai/) converted these models and also **fixed bugs** to increase the model's output quality. Our GitHub repo: [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)

Optimal setup:

* The 20B model runs at >10 tokens/s in **full precision**, with **14GB RAM**/unified memory. Smaller versions use 12GB RAM.
* The 120B model runs in full precision at >40 token/s with \~64GB RAM/unified mem.

There is no minimum requirement to run the models as they run even if you only have a 6GB CPU, but it will be slower inference.

Thus, **no is GPU required**, especially for the 20B model, but having one significantly boosts inference speeds (\~80 tokens/s). With something like an H100 you can get 140 tokens/s throughput which is way faster than the ChatGPT app.

You can run our uploads with bug fixes via llama.cpp or Unsloth Studio for the best performance. If the 120B model is too slow, try the smaller 20B version - it’s super fast and performs as well as o3-mini.

* Links to the model GGUFs to run: [gpt-oss-20B-GGUF](https://huggingface.co/unsloth/gpt-oss-20b-GGUF) and [gpt-oss-120B-GGUF](https://huggingface.co/unsloth/gpt-oss-120b-GGUF)
* Our **step-by-step guide** which we'd recommend you guys to read as it pretty much covers everything: [https://docs.unsloth.ai/basics/gpt-oss](https://docs.unsloth.ai/basics/gpt-oss)

Thanks so much once again for reading! I'll be replying to **every person** btw so feel free to ask any questions!

Open Reddit thread

The creator of heretic p-e-w opened a pull request #211 with a new method called Arbitrary-Rank Ablation (ARA)

[the creator of the project explanation](https://preview.redd.it/oxx4oi0c8ong1.png?width=726&format=png&auto=webp&s=eedfc3c10e1e841ee0dc56ce3bb5442a463a0f25)

For comparison, the previous best was

[eww](https://preview.redd.it/tnd9wchd8ong1.png?width=453&format=png&auto=webp&s=d737894d591f7c443d99ccaa92b0588818a4c48e)

74 refusals even after heretic, which is pretty ridiculous. It still refuses almost all the same things as the base model since OpenAI lobotomized it so heavily, but now with the new method, ARA has finally defeated GPT-OSS (no system messages even needed to get results like this one)

[rest of output not shown for obvious reasons but go download it yourself if you wanna see](https://preview.redd.it/1l5dji7f8ong1.png?width=962&format=png&auto=webp&s=d55aadccf01adf2917e67ceb6a5fbcc1b41abea1)

This means the future of open source AI is actually open and actually free, not even OpenAI's ultra sophisticated lobotomization can defeat what the open source community can do!

[https://huggingface.co/p-e-w/gpt-oss-20b-heretic-ara-v3](https://huggingface.co/p-e-w/gpt-oss-20b-heretic-ara-v3)

This is still experimental, so most heretic models you see online for the time being will probably not use this method. It's only in an unreleased version of Heretic for now, make sure you get ones that say they use MPOA+SOMA for now, but if you can once this becomes available in the full Heretic release, there will be more that use ARA, so almost always use those if available.

Open Reddit thread
r/openrouter 1 upvotes 1 comments August 22, 2025
gpt-oss-20b:free gone from OpenRouter?

Looks like openai/gpt-oss-20b:free is gone from OpenRouter? was it free only for limited time?

https://preview.redd.it/hebhyooh4lkf1.png?width=2046&format=png&auto=webp&s=93f3e49af9ae2b6d5a1c531f16b95a65ff085d2b

Open Reddit thread
r/LocalLLaMA 433 upvotes 58 comments February 10, 2026
Train MoE models 12x faster with 30% less memory! (<15GB VRAM)

Hey r/LocalLlama! We’re excited to introduce \~12x faster Mixture of Experts (MoE) training with **>35% less VRAM** and **\~6x longer context** via our new custom Triton kernels and math optimizations (no accuracy loss). Unsloth repo: [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)

* Unsloth now supports fast training for MoE architectures including gpt-oss, Qwen3 (30B, 235B, VL, Coder), DeepSeek R1/V3 and GLM (4.5-Air, 4.7, Flash).
* gpt-oss-20b fine-tunes in **12.8GB VRAM**. Qwen3-30B-A3B (16-bit LoRA) uses 63GB.
* Our kernels work on both data-center (B200, H100), **consumer** and older GPUs (e.g., RTX 3090), and FFT, LoRA and QLoRA.
* The larger the model and more context you use, **the more pronounced the memory savings from our Unsloth kernels will be** (efficiency will scale exponentially).
* We previously introduced Unsloth Flex Attention for gpt-oss, and these optimizations should make it even more efficient.

In collaboration with Hugging Face, we made all MoE training runs standardized with PyTorch’s new `torch._grouped_mm` function. Transformers v5 was recently optimized with \~6x faster MoE than v4 and Unsloth pushes this even further with custom Triton grouped‑GEMM + LoRA kernels for an **additional** \~2x speedup, >35% VRAM reduction and >6x longer context (12-30x overall speedup vs v4).

You can read our educational blogpost for detailed analysis, benchmarks and more: [https://unsloth.ai/docs/new/faster-moe](https://unsloth.ai/docs/new/faster-moe)

We also released support for embedding model fine-tuning recently. You can use our free MoE fine-tuning notebooks:

|[**gpt-oss (20b)**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb) **(free)**|[gpt-oss (500K context)](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt_oss_(20B)_500K_Context_Fine_tuning.ipynb)|[GLM-4.7-Flash](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/GLM_Flash_A100(80GB).ipynb) (A100)|
|:-|:-|:-|
|[gpt-oss-120b](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(120B)_A100-Fine-tuning.ipynb) (A100)|[Qwen3-30B-A3B](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_MoE.ipynb) (A100)|[TinyQwen3 MoE T4](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/TinyQwen3_MoE.ipynb) (free)|

To update Unsloth to auto make training faster, update our Docker or:

pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo

Thanks for reading and hope y'all have a lovely week. We hear it'll be a busy week! :)

Open Reddit thread
r/LocalLLaMA 395 upvotes 55 comments September 26, 2025
Gpt-oss Reinforcement Learning - Fastest inference now in Unsloth! (<15GB VRAM)

Hey guys we've got lots of updates for Reinforcement Learning (RL)! We’re excited to introduce gpt-oss, Vision, and even better RL in Unsloth. Our new gpt-oss RL inference also achieves the fastest token/s vs. any other implementation. Our GitHub: [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)

1. Inference is crucial in RL training. Since gpt-oss RL isn’t vLLM compatible, we rewrote Transformers inference for 3× faster speeds (\~21 tok/s). For BF16, Unsloth also delivers the fastest inference (\~30 tok/s), especially relative to VRAM use vs. any other implementation.
2. We made a free & completely new custom notebook showing how RL can automatically create faster matrix multiplication kernels: [gpt-oss-20b GSPO Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb). We also show you how to counteract reward-hacking which is one of RL's biggest challenges.
3. Unsloth also uses the least VRAM (50% less) and supports the most context length (8x more). gpt-oss-20b RL fits in 15GB VRAM.
4. As usual, there is no accuracy degradation.
5. We released [Vision RL](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl), allowing you to train Gemma 3, Qwen2.5-VL with GRPO free in our Colab notebooks.
6. We also previously introduced more [memory efficient RL](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide/memory-efficient-rl) with Standby and extra kernels and algorithms. Unsloth RL now uses 90% less VRAM, and enables 16× longer context lengths than any setup.
7. ⚠️ Reminder to NOT use Flash Attention 3 for gpt-oss as it'll make your training loss wrong.
8. We released [DeepSeek-V3.1-Terminus](https://docs.unsloth.ai/models/deepseek-v3.1-how-to-run-locally) Dynamic GGUFs. We showcased how 3-bit V3.1 scores 75.6% on Aider Polyglot, beating Claude-4-Opus (thinking).

For our new gpt-oss RL release, would recommend you guys to read our blog/guide which details our entire findings and bugs etc.: [https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning)

Thanks guys for reading and hope you all have a lovely Friday and weekend! 🦥

Open Reddit thread
View more discussions →
FAQ

Common questions about GPT OSS 20B

What is the context window for GPT OSS 20B?

GPT OSS 20B supports a context window of 128,000 tokens, allowing it to process long documents, extended conversations, or large codebases in a single request.

What are the hardware requirements to run GPT OSS 20B locally?

Due to its MoE architecture and MXFP4 4-bit quantization, GPT OSS 20B can run within 16GB of memory, making it compatible with consumer-grade hardware without requiring a high-end GPU.

What license does GPT OSS 20B use?

GPT OSS 20B is released under the Apache 2.0 license, which permits local deployment, fine-tuning, and use behind firewalls for both personal and commercial purposes.

What is the training data cutoff for GPT OSS 20B?

The model was released in August 2025. A specific training data cutoff date is not listed in the available metadata; refer to the official model card on Hugging Face for the most accurate information.

What platforms can I use to deploy GPT OSS 20B?

GPT OSS 20B is available through Hugging Face, Amazon SageMaker, Amazon Bedrock, and NVIDIA NIM, in addition to local self-hosting using the open-weight model files.

Does GPT OSS 20B support fine-tuning?

Yes, GPT OSS 20B supports fine-tuning. AWS has published a guide for fine-tuning the model on Amazon SageMaker using Hugging Face libraries, and the Apache 2.0 license permits custom fine-tuning workflows.

More models from OpenAI

Continue browsing adjacent models from the same provider.

← All AI Models