Mixture-of-Experts Architecture
Uses a MoE design with 128 experts per layer, activating only ~5.1 billion of 116.8 billion total parameters per token for efficient inference.
GPT OSS 120B is OpenAI's largest open-weight model, released in August 2025 under the Apache 2.0 license. It has approximately 116.8 billion total parameters and uses a Mixture-of-Experts (MoE) architecture that activates only around 5.1 billion parameters per token, enabling efficient inference on a single H100 GPU. The model is part of the GPT OSS family and is designed for commercial and private deployments without licensing restrictions. The model is built for coding, mathematical reasoning, scientific analysis, and agentic workflows. It supports a 128,000-token context window, adjustable reasoning levels (low, medium, and high), and native tool use including web browsing, Python code execution, and custom developer-defined functions. Architecturally, it uses 36 transformer layers with 128 experts per MoE layer (top 4 active per token), Grouped Query Attention, Rotary Position Embeddings, and an alternating local/dense attention pattern, and it is available for local inference via Hugging Face Transformers, llama.cpp, and vLLM.
High-signal model metadata in a structured two-column overview table.
The entity that provides this model.
The routed model identifier exposed by upstream providers.
The number of tokens supported by the input context window.
The number of tokens that can be generated by the model in a single request.
Whether the model's code is available for public use.
When the model was first released.
When the model's knowledge was last updated.
The providers that offer this model. This is not an exhaustive list.
Types of data this model can process.
A fuller summary of positioning, capabilities, and source-specific details for GPT OSS 120B.
GPT OSS 120B is OpenAI's largest open-weight model, released in August 2025 under the Apache 2.0 license. It has approximately 116.8 billion total parameters and uses a Mixture-of-Experts (MoE) architecture that activates only around 5.1 billion parameters per token, enabling efficient inference on a single H100 GPU. The model is part of the GPT OSS family and is designed for commercial and private deployments without licensing restrictions.
The model is built for coding, mathematical reasoning, scientific analysis, and agentic workflows. It supports a 128,000-token context window, adjustable reasoning levels (low, medium, and high), and native tool use including web browsing, Python code execution, and custom developer-defined functions. Architecturally, it uses 36 transformer layers with 128 experts per MoE layer (top 4 active per token), Grouped Query Attention, Rotary Position Embeddings, and an alternating local/dense attention pattern, and it is available for local inference via Hugging Face Transformers, llama.cpp, and vLLM.
Uses a MoE design with 128 experts per layer, activating only ~5.1 billion of 116.8 billion total parameters per token for efficient inference.
Supports low, medium, and high reasoning levels, allowing developers to tune the trade-off between response speed and reasoning depth.
Handles up to 128,000 tokens per request, equivalent to roughly 100,000 words of text in a single prompt.
Designed for software development, mathematical reasoning, and scientific analysis tasks requiring multi-step problem solving.
Natively supports web browsing, Python code execution, and custom developer-defined functions as callable tools.
Built for multi-step agentic tasks and integrates with agent frameworks, supporting complex sequences of tool calls and decisions.
Released under the Apache 2.0 license, permitting commercial use, fine-tuning, and private deployment without royalty obligations.
Tagged as very fast; the MoE architecture keeps active parameter count low, and the model fits on a single H100 GPU for local deployment.
Supports fine-tuning workflows, allowing developers to adapt the base model to domain-specific tasks using standard training pipelines.
Primary API pricing shown in the same “quick compare” spirit as the reference page.
Additional usage-cost dimensions synced into the project for this model.
Places where this model is available, based on the synced detail-page metadata.
Endpoint-level provider data currently available for this model.
Benchmark scores synced from the current model source and normalized into the local catalog.
| Benchmark | Score |
|---|---|
|
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
|
|
|
HLE
Questions that challenge frontier models across many domains
|
|
|
LiveCodeBench
Real-world coding tasks from recent competitions
|
|
|
MMLU-Pro
Expert knowledge across 14 academic disciplines
|
|
|
SciCode
Scientific research coding and numerical methods
|
Official model cards, release notes, docs, and other references synced from the source page.
GPT OSS 120B discussions are most active in r/LocalLLaMA, r/LLMDevs, r/AIToolsPerformance.
Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions. The strongest match in this snapshot has 1501 upvotes and 260 comments.
Looking at current pricing, the economics of local inference are getting harder to justify for pure capability:
- **Qwen: Qwen3 Coder 480B A35B** - free with 262,000 context
- **OpenAI: gpt-oss-120b** - $0.04/M with 131,072 context
- **Z.ai: GLM 4 32B** - $0.10/M with 128,000 context
- **Qwen: Qwen3 235B A22B Thinking 2507** - $0.15/M with 131,072 context
Even **Arcee AI: Maestro Reasoning** at $0.90/M for a dedicated reasoning model with 131K context is competitive against the electricity cost of running a 48GB+ VRAM rig at full load.
The local inference crowd has historically argued three pillars: cost, privacy, and latency. But when a 480B-parameter coder model is free with 262K context, the cost argument weakens significantly. Apple's work on self-distillation for code generation suggests models will keep getting more efficient on the API side too.
That said, the DGX Spark situation - NVFP4 support still missing after 6 months - shows the hardware side moves slower. And the "Signals" paper on trajectory sampling for agentic interactions hints that complex agent workflows may still benefit from local control.
So honest question: for those of you still running local inference in April 2026, is it purely privacy/compliance driving that choice, or are there workloads where local still beats these API prices on quality?
i am on opencode and i wanted to try the gpt-oss-120b (free), but i get as error:
\[OpenInference\] no healthy upstream
is that normal?
Hello everyone! OpenAI just released their first open-source models in 5 years, and now, you can have your own GPT-4o and o3 model at home! They're called 'gpt-oss'.
There's two models, a smaller 20B parameter model and a 120B one that rivals o4-mini. **Both** models outperform GPT-4o in various tasks, including reasoning, coding, math, health and agentic tasks.
To run the models locally (laptop, Mac, desktop etc), we at [**Unsloth**](https://docs.unsloth.ai/) converted these models and also **fixed bugs** to increase the model's output quality. Our GitHub repo: [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)
Optimal setup:
* The 20B model runs at >10 tokens/s in **full precision**, with **14GB RAM**/unified memory. Smaller versions use 12GB RAM.
* The 120B model runs in full precision at >40 token/s with \~64GB RAM/unified mem.
There is no minimum requirement to run the models as they run even if you only have a 6GB CPU, but it will be slower inference.
Thus, **no is GPU required**, especially for the 20B model, but having one significantly boosts inference speeds (\~80 tokens/s). With something like an H100 you can get 140 tokens/s throughput which is way faster than the ChatGPT app.
You can run our uploads with bug fixes via llama.cpp or Unsloth Studio for the best performance. If the 120B model is too slow, try the smaller 20B version - it’s super fast and performs as well as o3-mini.
* Links to the model GGUFs to run: [gpt-oss-20B-GGUF](https://huggingface.co/unsloth/gpt-oss-20b-GGUF) and [gpt-oss-120B-GGUF](https://huggingface.co/unsloth/gpt-oss-120b-GGUF)
* Our **step-by-step guide** which we'd recommend you guys to read as it pretty much covers everything: [https://docs.unsloth.ai/basics/gpt-oss](https://docs.unsloth.ai/basics/gpt-oss)
Thanks so much once again for reading! I'll be replying to **every person** btw so feel free to ask any questions!
https://preview.redd.it/2us0qrfxqehf1.png?width=630&format=png&auto=webp&s=1bfeee4f5c507cb78b493d80d227de8f1ce1c402
https://preview.redd.it/aatui1dxqehf1.png?width=635&format=png&auto=webp&s=0a1e46362a0db0d5c301c19814e317defc5c60af
kilo code: [Signup](https://kilocode.ai/users/sign_up?referral-code=36b1ea02-7746-4fa9-a660-e199cefdbe29)
**1. Get Your API Key:** Visit [https://build.nvidia.com/settings/api-keys](https://build.nvidia.com/settings/api-keys) to generate your free NVIDIA API key.
**2. Configure Kilo Code**
* Open Kilo Code Settings → Providers
* Set **API Provider**: "OpenAI Compatible"
* **Base URL**: [`https://integrate.api.nvidia.com/v1`](https://integrate.api.nvidia.com/v1)
* **API Key**: Paste your NVIDIA API key
* **Model**: `openai/gpt-oss-120b`
**3. Enable Key Features**
* ✅ **Image Support** \- Model handles visual inputs
* ✅ **Prompt Caching** \- Faster responses for repeated prompts
* ✅ **Enable R1 model parameters** \- Optimized reasoning
* Set **Context Window**: 128000 tokens
* **Model Reasoning Effort**: High
**4. Save & Start Coding** Click "Save" and you're ready to use this powerful 120B parameter model for free coding assistance with image understanding capabilities!
The model offers enterprise-grade performance with multimodal support, perfect for complex coding tasks that require both text and visual understanding.
I was tired of hopping between NVIDIA NIM endpoints trying to find one that actually responds (and doing that while wasting my paid Claude/Codex/Gemini quotas).
So I built free-coding-models: a TUI that pings coding-focused LLMs in parallel, ranks them by latency + uptime, and then lets you launch OpenCode / configure OpenClaw with the best one in a keypress.
`npm i -g free-coding-models`
**What it does**
* Monitors **134 coding models** across **17 providers** (NVIDIA NIM, Groq, Cerebras, SambaNova, OpenRouter, HuggingFace, Replicate, DeepInfra, Fireworks, Codestral, Hyperbolic, Scaleway, Google AI, Together, Cloudflare Workers AI, Perplexity…)
* **Parallel pings + continuous monitoring** (latency updates live + rolling averages + uptime %)
* Built-in **provider key management** (press P) + optional --no-telemetry
* For OpenClaw: it can also **patch the allowlist** so you can use *all* NVIDIA models without “model not allowed” errors
**If you don’t know what NVIDIA NIM is:**
NVIDIA NIM is capped at 40 RPM which is honestly huge for a free tier, and plenty for day-to-day vibe coding ! You just have to make an account and set the API Key.
NIM = **NVIDIA Inference Microservices** (hosted APIs / containers for running foundation models on NVIDIA infra). NVIDIA advertises **free access for NVIDIA Developer Program members** (intended for dev/testing/prototyping).
Repo: [https://github.com/vava-nessa/free-coding-models](https://github.com/vava-nessa/free-coding-models) Please star it ;)
**Feedback wanted:** which tool should I support next after OpenCode/OpenClaw ?
(Cursor? Claude Code via proxy? KiloCode?)
GPT OSS 120B supports a 128,000-token context window, which is roughly equivalent to 100,000 words of text in a single request.
The model is released under the Apache 2.0 license, which permits commercial use, modification, fine-tuning, and private deployment.
Based on the available metadata, the model was released in August 2025. A specific training data cutoff date is not stated in the provided metadata.
The model has approximately 116.8 billion total parameters, but its Mixture-of-Experts architecture activates only around 5.1 billion parameters per token during inference, reducing compute requirements compared to a dense model of the same total size.
The model is available on AWS via Amazon Bedrock and SageMaker JumpStart, on NVIDIA NIM, and locally through Hugging Face Transformers, llama.cpp, and vLLM. It fits on a single H100 GPU for local inference.
Yes. The model natively supports web browsing, Python code execution, and custom developer-defined functions, and it is designed for multi-step agentic workflows and integration with agent frameworks.
Continue browsing adjacent models from the same provider.