Adjustable Reasoning
Supports low, medium, and high reasoning levels with chain-of-thought traces, giving developers control over the depth of reasoning applied per request.
GPT OSS 20B is an open-weight text generation model released by OpenAI in August 2025, representing the company's first open-weight release since GPT-2 in 2019. It uses a Mixture-of-Experts (MoE) architecture with 21 billion total parameters, activating approximately 3.6 billion parameters per token across 4 of 32 experts in 24 layers. Combined with MXFP4 4-bit quantization, the model runs within 16GB of memory, making it suitable for consumer hardware and on-device deployment. It is licensed under Apache 2.0, allowing local hosting, firewall-protected deployment, and fine-tuning for custom use cases. GPT OSS 20B supports a 128,000-token context window and includes adjustable reasoning levels — low, medium, and high — with chain-of-thought traces. Its documented strengths include coding, mathematical reasoning, and scientific analysis, along with tool use and agentic workflow support. The model also produces structured outputs for predictable, schema-conforming responses. It is available through Hugging Face, Amazon SageMaker, Amazon Bedrock, and NVIDIA NIM, and is well-suited for developers and organizations that require a self-hosted, customizable AI model without relying on cloud infrastructure.
High-signal model metadata in a structured two-column overview table.
The entity that provides this model.
The routed model identifier exposed by upstream providers.
The number of tokens supported by the input context window.
The number of tokens that can be generated by the model in a single request.
Whether the model's code is available for public use.
When the model was first released.
When the model's knowledge was last updated.
The providers that offer this model. This is not an exhaustive list.
Types of data this model can process.
A fuller summary of positioning, capabilities, and source-specific details for GPT OSS 20B.
GPT OSS 20B is an open-weight text generation model released by OpenAI in August 2025, representing the company's first open-weight release since GPT-2 in 2019. It uses a Mixture-of-Experts (MoE) architecture with 21 billion total parameters, activating approximately 3.6 billion parameters per token across 4 of 32 experts in 24 layers. Combined with MXFP4 4-bit quantization, the model runs within 16GB of memory, making it suitable for consumer hardware and on-device deployment. It is licensed under Apache 2.0, allowing local hosting, firewall-protected deployment, and fine-tuning for custom use cases.
GPT OSS 20B supports a 128,000-token context window and includes adjustable reasoning levels — low, medium, and high — with chain-of-thought traces. Its documented strengths include coding, mathematical reasoning, and scientific analysis, along with tool use and agentic workflow support. The model also produces structured outputs for predictable, schema-conforming responses. It is available through Hugging Face, Amazon SageMaker, Amazon Bedrock, and NVIDIA NIM, and is well-suited for developers and organizations that require a self-hosted, customizable AI model without relying on cloud infrastructure.
Supports low, medium, and high reasoning levels with chain-of-thought traces, giving developers control over the depth of reasoning applied per request.
Handles up to 128,000 tokens per request, enabling processing of long documents, codebases, or extended multi-turn conversations in a single pass.
Documented core strengths include code generation, mathematical reasoning, and scientific analysis tasks.
Supports tool calling and agentic workflows, allowing the model to interact with external functions and APIs as part of multi-step tasks.
Produces structured, schema-conforming responses for use cases that require predictable output formats such as JSON.
Uses MXFP4 4-bit quantization and a MoE architecture to run within 16GB of memory, making local deployment on consumer hardware feasible.
Released under Apache 2.0, allowing self-hosting, fine-tuning, and deployment behind firewalls without usage restrictions from the license.
Supports fine-tuning for custom use cases, with documented integration via Hugging Face libraries on Amazon SageMaker.
Primary API pricing shown in the same “quick compare” spirit as the reference page.
Additional usage-cost dimensions synced into the project for this model.
Places where this model is available, based on the synced detail-page metadata.
Endpoint-level provider data currently available for this model.
Benchmark scores synced from the current model source and normalized into the local catalog.
| Benchmark | Score |
|---|---|
|
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
|
|
|
HLE
Questions that challenge frontier models across many domains
|
|
|
LiveCodeBench
Real-world coding tasks from recent competitions
|
|
|
MMLU-Pro
Expert knowledge across 14 academic disciplines
|
|
|
SciCode
Scientific research coding and numerical methods
|
Official model cards, release notes, docs, and other references synced from the source page.
GPT OSS 20B discussions are most active in r/LocalLLaMA, r/LocalLLM, r/MiniPCs. Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions.
The strongest match in this snapshot has 1501 upvotes and 260 comments.
Hello everyone! OpenAI just released their first open-source models in 5 years, and now, you can have your own GPT-4o and o3 model at home! They're called 'gpt-oss'.
There's two models, a smaller 20B parameter model and a 120B one that rivals o4-mini. **Both** models outperform GPT-4o in various tasks, including reasoning, coding, math, health and agentic tasks.
To run the models locally (laptop, Mac, desktop etc), we at [**Unsloth**](https://docs.unsloth.ai/) converted these models and also **fixed bugs** to increase the model's output quality. Our GitHub repo: [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)
Optimal setup:
* The 20B model runs at >10 tokens/s in **full precision**, with **14GB RAM**/unified memory. Smaller versions use 12GB RAM.
* The 120B model runs in full precision at >40 token/s with \~64GB RAM/unified mem.
There is no minimum requirement to run the models as they run even if you only have a 6GB CPU, but it will be slower inference.
Thus, **no is GPU required**, especially for the 20B model, but having one significantly boosts inference speeds (\~80 tokens/s). With something like an H100 you can get 140 tokens/s throughput which is way faster than the ChatGPT app.
You can run our uploads with bug fixes via llama.cpp or Unsloth Studio for the best performance. If the 120B model is too slow, try the smaller 20B version - it’s super fast and performs as well as o3-mini.
* Links to the model GGUFs to run: [gpt-oss-20B-GGUF](https://huggingface.co/unsloth/gpt-oss-20b-GGUF) and [gpt-oss-120B-GGUF](https://huggingface.co/unsloth/gpt-oss-120b-GGUF)
* Our **step-by-step guide** which we'd recommend you guys to read as it pretty much covers everything: [https://docs.unsloth.ai/basics/gpt-oss](https://docs.unsloth.ai/basics/gpt-oss)
Thanks so much once again for reading! I'll be replying to **every person** btw so feel free to ask any questions!
The creator of heretic p-e-w opened a pull request #211 with a new method called Arbitrary-Rank Ablation (ARA)
[the creator of the project explanation](https://preview.redd.it/oxx4oi0c8ong1.png?width=726&format=png&auto=webp&s=eedfc3c10e1e841ee0dc56ce3bb5442a463a0f25)
For comparison, the previous best was
[eww](https://preview.redd.it/tnd9wchd8ong1.png?width=453&format=png&auto=webp&s=d737894d591f7c443d99ccaa92b0588818a4c48e)
74 refusals even after heretic, which is pretty ridiculous. It still refuses almost all the same things as the base model since OpenAI lobotomized it so heavily, but now with the new method, ARA has finally defeated GPT-OSS (no system messages even needed to get results like this one)
[rest of output not shown for obvious reasons but go download it yourself if you wanna see](https://preview.redd.it/1l5dji7f8ong1.png?width=962&format=png&auto=webp&s=d55aadccf01adf2917e67ceb6a5fbcc1b41abea1)
This means the future of open source AI is actually open and actually free, not even OpenAI's ultra sophisticated lobotomization can defeat what the open source community can do!
[https://huggingface.co/p-e-w/gpt-oss-20b-heretic-ara-v3](https://huggingface.co/p-e-w/gpt-oss-20b-heretic-ara-v3)
This is still experimental, so most heretic models you see online for the time being will probably not use this method. It's only in an unreleased version of Heretic for now, make sure you get ones that say they use MPOA+SOMA for now, but if you can once this becomes available in the full Heretic release, there will be more that use ARA, so almost always use those if available.
Looks like openai/gpt-oss-20b:free is gone from OpenRouter? was it free only for limited time?
https://preview.redd.it/hebhyooh4lkf1.png?width=2046&format=png&auto=webp&s=93f3e49af9ae2b6d5a1c531f16b95a65ff085d2b
Hey r/LocalLlama! We’re excited to introduce \~12x faster Mixture of Experts (MoE) training with **>35% less VRAM** and **\~6x longer context** via our new custom Triton kernels and math optimizations (no accuracy loss). Unsloth repo: [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)
* Unsloth now supports fast training for MoE architectures including gpt-oss, Qwen3 (30B, 235B, VL, Coder), DeepSeek R1/V3 and GLM (4.5-Air, 4.7, Flash).
* gpt-oss-20b fine-tunes in **12.8GB VRAM**. Qwen3-30B-A3B (16-bit LoRA) uses 63GB.
* Our kernels work on both data-center (B200, H100), **consumer** and older GPUs (e.g., RTX 3090), and FFT, LoRA and QLoRA.
* The larger the model and more context you use, **the more pronounced the memory savings from our Unsloth kernels will be** (efficiency will scale exponentially).
* We previously introduced Unsloth Flex Attention for gpt-oss, and these optimizations should make it even more efficient.
In collaboration with Hugging Face, we made all MoE training runs standardized with PyTorch’s new `torch._grouped_mm` function. Transformers v5 was recently optimized with \~6x faster MoE than v4 and Unsloth pushes this even further with custom Triton grouped‑GEMM + LoRA kernels for an **additional** \~2x speedup, >35% VRAM reduction and >6x longer context (12-30x overall speedup vs v4).
You can read our educational blogpost for detailed analysis, benchmarks and more: [https://unsloth.ai/docs/new/faster-moe](https://unsloth.ai/docs/new/faster-moe)
We also released support for embedding model fine-tuning recently. You can use our free MoE fine-tuning notebooks:
|[**gpt-oss (20b)**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb) **(free)**|[gpt-oss (500K context)](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt_oss_(20B)_500K_Context_Fine_tuning.ipynb)|[GLM-4.7-Flash](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/GLM_Flash_A100(80GB).ipynb) (A100)|
|:-|:-|:-|
|[gpt-oss-120b](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(120B)_A100-Fine-tuning.ipynb) (A100)|[Qwen3-30B-A3B](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_MoE.ipynb) (A100)|[TinyQwen3 MoE T4](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/TinyQwen3_MoE.ipynb) (free)|
To update Unsloth to auto make training faster, update our Docker or:
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo
Thanks for reading and hope y'all have a lovely week. We hear it'll be a busy week! :)
Hey guys we've got lots of updates for Reinforcement Learning (RL)! We’re excited to introduce gpt-oss, Vision, and even better RL in Unsloth. Our new gpt-oss RL inference also achieves the fastest token/s vs. any other implementation. Our GitHub: [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)
1. Inference is crucial in RL training. Since gpt-oss RL isn’t vLLM compatible, we rewrote Transformers inference for 3× faster speeds (\~21 tok/s). For BF16, Unsloth also delivers the fastest inference (\~30 tok/s), especially relative to VRAM use vs. any other implementation.
2. We made a free & completely new custom notebook showing how RL can automatically create faster matrix multiplication kernels: [gpt-oss-20b GSPO Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb). We also show you how to counteract reward-hacking which is one of RL's biggest challenges.
3. Unsloth also uses the least VRAM (50% less) and supports the most context length (8x more). gpt-oss-20b RL fits in 15GB VRAM.
4. As usual, there is no accuracy degradation.
5. We released [Vision RL](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl), allowing you to train Gemma 3, Qwen2.5-VL with GRPO free in our Colab notebooks.
6. We also previously introduced more [memory efficient RL](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide/memory-efficient-rl) with Standby and extra kernels and algorithms. Unsloth RL now uses 90% less VRAM, and enables 16× longer context lengths than any setup.
7. ⚠️ Reminder to NOT use Flash Attention 3 for gpt-oss as it'll make your training loss wrong.
8. We released [DeepSeek-V3.1-Terminus](https://docs.unsloth.ai/models/deepseek-v3.1-how-to-run-locally) Dynamic GGUFs. We showcased how 3-bit V3.1 scores 75.6% on Aider Polyglot, beating Claude-4-Opus (thinking).
For our new gpt-oss RL release, would recommend you guys to read our blog/guide which details our entire findings and bugs etc.: [https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning)
Thanks guys for reading and hope you all have a lovely Friday and weekend! 🦥
GPT OSS 20B supports a context window of 128,000 tokens, allowing it to process long documents, extended conversations, or large codebases in a single request.
Due to its MoE architecture and MXFP4 4-bit quantization, GPT OSS 20B can run within 16GB of memory, making it compatible with consumer-grade hardware without requiring a high-end GPU.
GPT OSS 20B is released under the Apache 2.0 license, which permits local deployment, fine-tuning, and use behind firewalls for both personal and commercial purposes.
The model was released in August 2025. A specific training data cutoff date is not listed in the available metadata; refer to the official model card on Hugging Face for the most accurate information.
GPT OSS 20B is available through Hugging Face, Amazon SageMaker, Amazon Bedrock, and NVIDIA NIM, in addition to local self-hosting using the open-weight model files.
Yes, GPT OSS 20B supports fine-tuning. AWS has published a guide for fine-tuning the model on Amazon SageMaker using Hugging Face libraries, and the Apache 2.0 license permits custom fine-tuning workflows.
Continue browsing adjacent models from the same provider.