OpenAI

GPT OSS 120B

GPT OSS 120B is OpenAI's largest open-weight model, released in August 2025 under the Apache 2.0 license. It has approximately 116.8 billion total parameters and uses a Mixture-of-Experts (MoE) architecture that activates only around 5.1 billion parameters per token, enabling efficient inference on a single H100 GPU. The model is part of the GPT OSS family and is designed for commercial and private deployments without licensing restrictions. The model is built for coding, mathematical reasoning, scientific analysis, and agentic workflows. It supports a 128,000-token context window, adjustable reasoning levels (low, medium, and high), and native tool use including web browsing, Python code execution, and custom developer-defined functions. Architecturally, it uses 36 transformer layers with 128 experts per MoE layer (top 4 active per token), Grouped Query Attention, Rotary Position Embeddings, and an alternating local/dense attention pattern, and it is available for local inference via Hugging Face Transformers, llama.cpp, and vLLM.

Aug 05, 2025 131.1K context 32,768 tokens output
Mixture-of-Experts Architecture Adjustable Reasoning Long Context Window Coding and Math Tool Use Agentic Workflows

Model Overview

High-signal model metadata in a structured two-column overview table.

Provider

The entity that provides this model.

OpenAI

Model ID

The routed model identifier exposed by upstream providers.

openai/gpt-oss-120b:free

Input Context Window

The number of tokens supported by the input context window.

131.1K tokens

Maximum Output Tokens

The number of tokens that can be generated by the model in a single request.

32,768 tokens tokens

Open Source

Whether the model's code is available for public use.

No

Release Date

When the model was first released.

Aug 05, 2025 10 months ago

Knowledge Cut-off Date

When the model's knowledge was last updated.

August 2025

API Providers

The providers that offer this model. This is not an exhaustive list.

OpenInference

Modalities

Types of data this model can process.

Text Code

What is GPT OSS 120B

A fuller summary of positioning, capabilities, and source-specific details for GPT OSS 120B.

GPT OSS 120B is OpenAI's largest open-weight model, released in August 2025 under the Apache 2.0 license. It has approximately 116.8 billion total parameters and uses a Mixture-of-Experts (MoE) architecture that activates only around 5.1 billion parameters per token, enabling efficient inference on a single H100 GPU. The model is part of the GPT OSS family and is designed for commercial and private deployments without licensing restrictions.

The model is built for coding, mathematical reasoning, scientific analysis, and agentic workflows. It supports a 128,000-token context window, adjustable reasoning levels (low, medium, and high), and native tool use including web browsing, Python code execution, and custom developer-defined functions. Architecturally, it uses 36 transformer layers with 128 experts per MoE layer (top 4 active per token), Grouped Query Attention, Rotary Position Embeddings, and an alternating local/dense attention pattern, and it is available for local inference via Hugging Face Transformers, llama.cpp, and vLLM.

Capabilities

What GPT OSS 120B supports

AI

Mixture-of-Experts Architecture

Uses a MoE design with 128 experts per layer, activating only ~5.1 billion of 116.8 billion total parameters per token for efficient inference.

RN

Adjustable Reasoning

Supports low, medium, and high reasoning levels, allowing developers to tune the trade-off between response speed and reasoning depth.

CTX

Long Context Window

Handles up to 128,000 tokens per request, equivalent to roughly 100,000 words of text in a single prompt.

</>

Coding and Math

Designed for software development, mathematical reasoning, and scientific analysis tasks requiring multi-step problem solving.

TL

Tool Use

Natively supports web browsing, Python code execution, and custom developer-defined functions as callable tools.

AG

Agentic Workflows

Built for multi-step agentic tasks and integrates with agent frameworks, supporting complex sequences of tool calls and decisions.

AI

Open Source License

Released under the Apache 2.0 license, permitting commercial use, fine-tuning, and private deployment without royalty obligations.

AI

Fast Inference

Tagged as very fast; the MoE architecture keeps active parameter count low, and the model fits on a single H100 GPU for local deployment.

AI

Fine-Tuning Support

Supports fine-tuning workflows, allowing developers to adapt the base model to domain-specific tasks using standard training pipelines.

Pricing for GPT OSS 120B

Primary API pricing shown in the same “quick compare” spirit as the reference page.

Price Comparison

Additional usage-cost dimensions synced into the project for this model.

maxTemperature 2
maxResponseSize 32,768 tokens

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

OpenInference

Provider Endpoints

Endpoint-level provider data currently available for this model.

OpenInference

Max output: 131,072 1d uptime: 99.9% Supported params: 8 Implicit caching: No

Model Performance

Benchmark scores synced from the current model source and normalized into the local catalog.

Benchmark Score
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
78.2%
HLE
Questions that challenge frontier models across many domains
18.5%
LiveCodeBench
Real-world coding tasks from recent competitions
87.8%
MMLU-Pro
Expert knowledge across 14 academic disciplines
80.8%
SciCode
Scientific research coding and numerical methods
38.9%

Resources & Documentation

Official model cards, release notes, docs, and other references synced from the source page.

Community discussion

What people think about GPT OSS 120B

GPT OSS 120B discussions are most active in r/LocalLLaMA, r/LLMDevs, r/AIToolsPerformance.

Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions. The strongest match in this snapshot has 1501 upvotes and 260 comments.

Looking at current pricing, the economics of local inference are getting harder to justify for pure capability:

- **Qwen: Qwen3 Coder 480B A35B** - free with 262,000 context
- **OpenAI: gpt-oss-120b** - $0.04/M with 131,072 context
- **Z.ai: GLM 4 32B** - $0.10/M with 128,000 context
- **Qwen: Qwen3 235B A22B Thinking 2507** - $0.15/M with 131,072 context

Even **Arcee AI: Maestro Reasoning** at $0.90/M for a dedicated reasoning model with 131K context is competitive against the electricity cost of running a 48GB+ VRAM rig at full load.

The local inference crowd has historically argued three pillars: cost, privacy, and latency. But when a 480B-parameter coder model is free with 262K context, the cost argument weakens significantly. Apple's work on self-distillation for code generation suggests models will keep getting more efficient on the API side too.

That said, the DGX Spark situation - NVFP4 support still missing after 6 months - shows the hardware side moves slower. And the "Signals" paper on trajectory sampling for agentic interactions hints that complex agent workflows may still benefit from local control.

So honest question: for those of you still running local inference in April 2026, is it purely privacy/compliance driving that choice, or are there workloads where local still beats these API prices on quality?

Open Reddit thread
r/selfhosted 1,501 upvotes 260 comments August 6, 2025
You can now run OpenAI's gpt-oss model on your local device! (14GB RAM)

Hello everyone! OpenAI just released their first open-source models in 5 years, and now, you can have your own GPT-4o and o3 model at home! They're called 'gpt-oss'.

There's two models, a smaller 20B parameter model and a 120B one that rivals o4-mini. **Both** models outperform GPT-4o in various tasks, including reasoning, coding, math, health and agentic tasks.

To run the models locally (laptop, Mac, desktop etc), we at [**Unsloth**](https://docs.unsloth.ai/) converted these models and also **fixed bugs** to increase the model's output quality. Our GitHub repo: [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)

Optimal setup:

* The 20B model runs at >10 tokens/s in **full precision**, with **14GB RAM**/unified memory. Smaller versions use 12GB RAM.
* The 120B model runs in full precision at >40 token/s with \~64GB RAM/unified mem.

There is no minimum requirement to run the models as they run even if you only have a 6GB CPU, but it will be slower inference.

Thus, **no is GPU required**, especially for the 20B model, but having one significantly boosts inference speeds (\~80 tokens/s). With something like an H100 you can get 140 tokens/s throughput which is way faster than the ChatGPT app.

You can run our uploads with bug fixes via llama.cpp or Unsloth Studio for the best performance. If the 120B model is too slow, try the smaller 20B version - it’s super fast and performs as well as o3-mini.

* Links to the model GGUFs to run: [gpt-oss-20B-GGUF](https://huggingface.co/unsloth/gpt-oss-20b-GGUF) and [gpt-oss-120B-GGUF](https://huggingface.co/unsloth/gpt-oss-120b-GGUF)
* Our **step-by-step guide** which we'd recommend you guys to read as it pretty much covers everything: [https://docs.unsloth.ai/basics/gpt-oss](https://docs.unsloth.ai/basics/gpt-oss)

Thanks so much once again for reading! I'll be replying to **every person** btw so feel free to ask any questions!

Open Reddit thread
r/n8n_on_server 60 upvotes 33 comments August 6, 2025
Setup GPT-OSS-120B in Kilo Code [ COMPLETELY FREE]

https://preview.redd.it/2us0qrfxqehf1.png?width=630&format=png&auto=webp&s=1bfeee4f5c507cb78b493d80d227de8f1ce1c402

https://preview.redd.it/aatui1dxqehf1.png?width=635&format=png&auto=webp&s=0a1e46362a0db0d5c301c19814e317defc5c60af

kilo code: [Signup](https://kilocode.ai/users/sign_up?referral-code=36b1ea02-7746-4fa9-a660-e199cefdbe29)

**1. Get Your API Key:** Visit [https://build.nvidia.com/settings/api-keys](https://build.nvidia.com/settings/api-keys) to generate your free NVIDIA API key.

**2. Configure Kilo Code**

* Open Kilo Code Settings → Providers
* Set **API Provider**: "OpenAI Compatible"
* **Base URL**: [`https://integrate.api.nvidia.com/v1`](https://integrate.api.nvidia.com/v1)
* **API Key**: Paste your NVIDIA API key
* **Model**: `openai/gpt-oss-120b`

**3. Enable Key Features**

* ✅ **Image Support** \- Model handles visual inputs
* ✅ **Prompt Caching** \- Faster responses for repeated prompts
* ✅ **Enable R1 model parameters** \- Optimized reasoning
* Set **Context Window**: 128000 tokens
* **Model Reasoning Effort**: High

**4. Save & Start Coding** Click "Save" and you're ready to use this powerful 120B parameter model for free coding assistance with image understanding capabilities!

The model offers enterprise-grade performance with multimodal support, perfect for complex coding tasks that require both text and visual understanding.

Open Reddit thread

I was tired of hopping between NVIDIA NIM endpoints trying to find one that actually responds (and doing that while wasting my paid Claude/Codex/Gemini quotas).

So I built free-coding-models: a TUI that pings coding-focused LLMs in parallel, ranks them by latency + uptime, and then lets you launch OpenCode / configure OpenClaw with the best one in a keypress.

`npm i -g free-coding-models`

**What it does**

* Monitors **134 coding models** across **17 providers** (NVIDIA NIM, Groq, Cerebras, SambaNova, OpenRouter, HuggingFace, Replicate, DeepInfra, Fireworks, Codestral, Hyperbolic, Scaleway, Google AI, Together, Cloudflare Workers AI, Perplexity…) 
* **Parallel pings + continuous monitoring** (latency updates live + rolling averages + uptime %) 
* Built-in **provider key management** (press P) + optional --no-telemetry 
* For OpenClaw: it can also **patch the allowlist** so you can use *all* NVIDIA models without “model not allowed” errors 

**If you don’t know what NVIDIA NIM is:**

NVIDIA NIM is capped at 40 RPM which is honestly huge for a free tier, and plenty for day-to-day vibe coding ! You just have to make an account and set the API Key.

NIM = **NVIDIA Inference Microservices** (hosted APIs / containers for running foundation models on NVIDIA infra). NVIDIA advertises **free access for NVIDIA Developer Program members** (intended for dev/testing/prototyping). 

Repo: [https://github.com/vava-nessa/free-coding-models](https://github.com/vava-nessa/free-coding-models)  Please star it ;)

**Feedback wanted:** which tool should I support next after OpenCode/OpenClaw ?

(Cursor? Claude Code via proxy? KiloCode?)

Open Reddit thread
View more discussions →
FAQ

Common questions about GPT OSS 120B

What is the context window for GPT OSS 120B?

GPT OSS 120B supports a 128,000-token context window, which is roughly equivalent to 100,000 words of text in a single request.

What license does GPT OSS 120B use?

The model is released under the Apache 2.0 license, which permits commercial use, modification, fine-tuning, and private deployment.

What is the training data cutoff for GPT OSS 120B?

Based on the available metadata, the model was released in August 2025. A specific training data cutoff date is not stated in the provided metadata.

How many parameters does GPT OSS 120B have, and how does the MoE architecture affect inference?

The model has approximately 116.8 billion total parameters, but its Mixture-of-Experts architecture activates only around 5.1 billion parameters per token during inference, reducing compute requirements compared to a dense model of the same total size.

Where can GPT OSS 120B be deployed?

The model is available on AWS via Amazon Bedrock and SageMaker JumpStart, on NVIDIA NIM, and locally through Hugging Face Transformers, llama.cpp, and vLLM. It fits on a single H100 GPU for local inference.

Does GPT OSS 120B support tool use and agentic tasks?

Yes. The model natively supports web browsing, Python code execution, and custom developer-defined functions, and it is designed for multi-step agentic workflows and integration with agent frameworks.

More models from OpenAI

Continue browsing adjacent models from the same provider.

← All AI Models