Hybrid Thinking Mode
Switches between fast conversational responses and deep step-by-step reasoning within a single model, controlled by how the model is prompted rather than by selecting a separate endpoint.
DeepSeek-V3.1 is a 671-billion parameter large language model developed by DeepSeek, using a Mixture-of-Experts (MoE) architecture that activates 37 billion parameters at any given time. It supports a 128,000-token context window and was trained through August 2025, with an enhanced base model built using a two-phase long-context extension process that included 630 billion tokens at the 32K phase and 209 billion tokens at the 128K phase. The model accepts text input and produces text output across a wide range of general-purpose tasks. What distinguishes DeepSeek-V3.1 from earlier versions is its hybrid thinking design: a single model that can operate in a fast conversational mode or a slower step-by-step reasoning mode, selectable through prompting rather than requiring a separate model. Post-training improvements have also focused on tool use and agentic workflows, including multi-step API calls, web search, and code execution. This makes it well-suited for coding, mathematical reasoning, long-document analysis, and complex multi-turn agent tasks.
High-signal model metadata in a structured two-column overview table.
The entity that provides this model.
The routed model identifier exposed by upstream providers.
The number of tokens supported by the input context window.
The number of tokens that can be generated by the model in a single request.
Whether the model's code is available for public use.
When the model was first released.
When the model's knowledge was last updated.
The providers that offer this model. This is not an exhaustive list.
Types of data this model can process.
A fuller summary of positioning, capabilities, and source-specific details for DeepSeek V3.1.
DeepSeek-V3.1 is a 671-billion parameter large language model developed by DeepSeek, using a Mixture-of-Experts (MoE) architecture that activates 37 billion parameters at any given time. It supports a 128,000-token context window and was trained through August 2025, with an enhanced base model built using a two-phase long-context extension process that included 630 billion tokens at the 32K phase and 209 billion tokens at the 128K phase. The model accepts text input and produces text output across a wide range of general-purpose tasks.
What distinguishes DeepSeek-V3.1 from earlier versions is its hybrid thinking design: a single model that can operate in a fast conversational mode or a slower step-by-step reasoning mode, selectable through prompting rather than requiring a separate model. Post-training improvements have also focused on tool use and agentic workflows, including multi-step API calls, web search, and code execution. This makes it well-suited for coding, mathematical reasoning, long-document analysis, and complex multi-turn agent tasks.
Switches between fast conversational responses and deep step-by-step reasoning within a single model, controlled by how the model is prompted rather than by selecting a separate endpoint.
Supports up to 128,000 tokens of context, enabling analysis of long documents, extended codebases, or multi-turn conversations without truncation.
Handles multi-step agentic workflows including external API calls, web search, and code execution, with post-training improvements specifically targeting tool-calling reliability.
Generates, explains, and debugs code across multiple programming languages, with the option to invoke thinking mode for complex algorithmic problems.
Solves multi-step math problems using the model's thinking mode, which produces intermediate reasoning steps before arriving at a final answer.
Uses a MoE design with 671 billion total parameters but only 37 billion activated per forward pass, allowing large model capacity with more efficient inference.
Primary API pricing shown in the same “quick compare” spirit as the reference page.
Additional usage-cost dimensions synced into the project for this model.
Places where this model is available, based on the synced detail-page metadata.
Endpoint-level provider data currently available for this model.
Benchmark scores synced from the current model source and normalized into the local catalog.
| Benchmark | Score |
|---|---|
|
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
|
|
|
HLE
Questions that challenge frontier models across many domains
|
|
|
LiveCodeBench
Real-world coding tasks from recent competitions
|
|
|
MMLU-Pro
Expert knowledge across 14 academic disciplines
|
|
|
SciCode
Scientific research coding and numerical methods
|
Official model cards, release notes, docs, and other references synced from the source page.
DeepSeek V3.1 discussions are most active in r/LocalLLaMA, r/SillyTavernAI, r/DeepSeek.
Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions. The strongest match in this snapshot has 641 upvotes and 46 comments.
No unnecessary hate but ChatGPTs will oftern provide you with scraps and have some kind of limit when generating lengthy code. DeepSeek did this in one shot.
Prompt: write a p5.js program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically
Hey r/LocalLLaMA ! It took a bit longer than expected, but we made dynamic imatrix GGUFs for DeepSeek V3.1 at [https://huggingface.co/unsloth/DeepSeek-V3.1-GGUF](https://huggingface.co/unsloth/DeepSeek-V3.1-GGUF) There is also a TQ1\_0 (for naming only) version (**170GB**) which is 1 file for Ollama compatibility and works via `ollama run` [`hf.co/unsloth/DeepSeek-V3.1-GGUF:TQ1_0`](http://hf.co/unsloth/DeepSeek-V3.1-GGUF:TQ1_0)
All dynamic quants use higher bits (6-8bit) for very important layers, and unimportant layers are quantized down. We used over 2-3 million tokens of high quality calibration data for the imatrix phase.
* You must use `--jinja` to enable the correct chat template. You can also use `enable_thinking = True` / `thinking = True`
* You will get the following error when using other quants: `terminate called after throwing an instance of 'std::runtime_error' what(): split method must have between 1 and 1 positional arguments and between 0 and 0 keyword arguments at row 3, column 1908` We fixed it in all our quants!
* The official recommended settings are `--temp 0.6 --top_p 0.95`
* Use `-ot ".ffn_.*_exps.=CPU"` to offload MoE layers to RAM!
* Use KV Cache quantization to enable longer contexts. Try `--cache-type-k q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1` and for V quantization, you have to compile llama.cpp with Flash Attention support.
More docs on how to run it and other stuff at [https://docs.unsloth.ai/basics/deepseek-v3.1](https://docs.unsloth.ai/basics/deepseek-v3.1) I normally recommend using the Q2\_K\_XL or Q3\_K\_XL quants - they work very well!
We evaluated Deepseek v3.1 chat using a minimal agent (no tools other than bash, common-sense prompts, main agent class implemented in some 100 lines of python) and get 53.8% on SWE-bench verified (if you want to reproduce it, you can install [https://github.com/SWE-agent/mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent) and it's a one-liner to evaluate on SWE-bench).
https://preview.redd.it/d1dmlmo78gkf1.png?width=780&format=png&auto=webp&s=449eca28d86413e9259d33e66c7df67036c317a5
It currently gets on 2nd place among open source models on our leaderboard (SWE-bench bash-only, where we compare all models with this exact setup, see [https://www.swebench.com/](https://www.swebench.com/) ).
Still working on adding some more models, in particular open source ones. We haven't evaluated DeepSeek v3.1 reasoning so far (it doesn't have tool calls, so it's probably going to be less used for agents).
One of the interesting things is that Deepseek v3.1 chat maxes out later with respect to the number of steps taken by the agent, especially compared to the GPT models. To squeeze out the maximum performance you might have to run for 150 steps.
https://preview.redd.it/ok2y7rta8gkf1.png?width=2157&format=png&auto=webp&s=add6cf27c09da63de3a0169e76a577a038eaa9d2
As a result of the high step numbers, I'd say the effective cost is somewhere near that of GPT-5 mini if you use the official API (the next plot basically shows different cost to performance points depending on how high you set the step limit of the agent — agents succeed fast, but fail very slowly, so you can spend a lot of money without getting a higher resolve rate).
https://preview.redd.it/8dfgx8cc8gkf1.png?width=720&format=png&auto=webp&s=ff3667c6de5ebb0deafc5b4f7c7a031d70af833b
(sorry that the cost/step plots still mostly show proprietary models, we'll have a more complete plot soon).
(note: xpost from https://www.reddit.com/r/DeepSeek/comments/1mwp8ji/evaluating\_deepseek\_v31\_chat\_with\_a\_minimal\_agent/)
It's like Christmas for me when a new model drops.
DeepSeek-V3.1 supports a context window of 128,000 tokens, suitable for long documents, large codebases, and extended multi-turn conversations.
The model has 671 billion total parameters. Due to its Mixture-of-Experts architecture, only 37 billion parameters are activated during any single forward pass.
Based on the provided metadata, DeepSeek-V3.1's training date is listed as August 2025, which represents the approximate knowledge cutoff for the model.
DeepSeek-V3.1 can operate in a fast non-thinking conversational mode or a slower step-by-step reasoning mode. The mode is selected through prompting rather than by choosing a different model or endpoint.
The model weights for both DeepSeek-V3.1 and DeepSeek-V3.1-Base are available on Hugging Face, making local or self-hosted deployment possible for those with sufficient hardware resources.
Continue browsing adjacent models from the same provider.