Qwen

Qwen3.6-35B-A3B

Qwen3.6-35B-A3B is an open-weight multimodal model from Alibaba Cloud with 35 billion total parameters and 3 billion active parameters per token. It uses a hybrid sparse mixture-of-experts architecture combining Gated...

Apr 27, 2026 262.1K context 262,144 tokens output
Text Image Video Tools Structured Output Reasoning

Model Overview

High-signal model metadata in a structured two-column overview table.

Provider

The entity that provides this model.

Qwen

Model ID

The routed model identifier exposed by upstream providers.

qwen/qwen3.6-35b-a3b

Input Context Window

The number of tokens supported by the input context window.

262.1K tokens

Maximum Output Tokens

The number of tokens that can be generated by the model in a single request.

262,144 tokens tokens

Open Source

Whether the model's code is available for public use.

No

Release Date

When the model was first released.

Apr 27, 2026 1 month ago

Knowledge Cut-off Date

When the model's knowledge was last updated.

Unknown

API Providers

The providers that offer this model. This is not an exhaustive list.

Io Net, Ambient, Parasail, AtlasCloud, AkashML, SiliconFlow, WandB

Modalities

Types of data this model can process.

Text Image Video

What is Qwen3.6-35B-A3B

A fuller summary of positioning, capabilities, and source-specific details for Qwen3.6-35B-A3B.

Qwen3.6-35B-A3B is an open-weight multimodal model from Alibaba Cloud with 35 billion total parameters and 3 billion active parameters per token. It uses a hybrid sparse mixture-of-experts architecture combining Gated...

Capabilities

What Qwen3.6-35B-A3B supports

RN

Reasoning Controls

OpenRouter lists GPT-5.5 with reasoning support and explicit reasoning-related request parameters.

JSON

Structured Outputs

Structured output settings are exposed through OpenRouter for schema-driven or format-controlled responses.

TL

Tool Calling

Tool invocation and tool selection are supported in the routed OpenRouter interface for this model.

MM

Multimodal I/O

This model accepts text input, image input, video input and returns text output.

CTX

Large Context Window

OpenRouter currently lists a context window of 262.1K with up to 262,144 tokens maximum output tokens.

Pricing for Qwen3.6-35B-A3B

Primary API pricing shown in the same “quick compare” spirit as the reference page.

Price Comparison

Additional usage-cost dimensions synced into the project for this model.

Cache read $0.05
maxTemperature 1
maxResponseSize 262,144 tokens

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

Io Net Ambient Parasail AtlasCloud AkashML SiliconFlow WandB

Provider Endpoints

Endpoint-level provider data currently available for this model.

Io Net

Max output: 262,140 1d uptime: 85.6% Supported params: 11 Implicit caching: No

Ambient

Max output: 262,144 1d uptime: 96.1% Supported params: 19 Implicit caching: No

Parasail

Max output: 262,144 1d uptime: 99.1% Supported params: 16 Implicit caching: No

AtlasCloud

Max output: 65,536 1d uptime: 99.7% Supported params: 15 Implicit caching: No

AkashML

Max output: 262,144 1d uptime: 99.2% Supported params: 15 Implicit caching: No

SiliconFlow

Max output: 262,144 1d uptime: 97.6% Supported params: 11 Implicit caching: No

WandB

Max output: 262,144 1d uptime: 99.7% Supported params: 13 Implicit caching: No

Configuration & Parameters

The configurable options currently documented for this model.

Top P

Number

Nucleus sampling. Considers only tokens whose cumulative probability exceeds this threshold.

Default: 0.95 Range: 0 - 1 (step 0.01)

Top K

Number

Limits sampling to the K most likely tokens at each step. Set to 0 to disable.

Default: 20 Range: 0 - 100

Min P

Number

Minimum probability threshold relative to the most likely token. Filters out unlikely tokens.

Range: 0 - 1 (step 0.01)

Presence Penalty

Number

Penalizes tokens that have already appeared in the output, encouraging new topics.

Repetition Penalty

Number

Penalizes repeated tokens. Values above 1 discourage repetition; values below 1 encourage it.

Default: 1 Range: 0 - 2 (step 0.01)

Frequency Penalty

Number

Penalizes tokens based on how often they have already appeared, reducing verbatim repetition.

Supported Request Parameters

Parameters currently listed by OpenRouter or the local catalog for this model.

Top P Top K Min P Presence Penalty Repetition Penalty Frequency Penalty

Resources & Documentation

Official model cards, release notes, docs, and other references synced from the source page.

Community discussion

What people think about Qwen3.6-35B-A3B

Qwen3.6-35B-A3B discussions are most active in r/LocalLLaMA, r/LocalLLM, r/unsloth. Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions.

The strongest match in this snapshot has 2297 upvotes and 709 comments.

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec with 80%+ draft acceptance rate on the benchmark found here: [https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py](https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py)

Here's my PC specs:

OS: CachyOS (HIGHLY recommended)
CPU: AMD Ryzen 7 9700X
RAM: 48GB DDR5-6000 EXPO I
GPU: RTX 4070 Super 12GB

Results with other hardware may vary.

To run llama.cpp with MTP support, you need to build it from source and add a draft PR that hasn't yet been merged with the master branch. You can find a very nice guide on how to do that here and also download the Qwen3.6 MTP GGUF: [https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF](https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF) \- Thanks u/havenoammo!

llama.cpp command:

llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
-fitt 1536 \
-c 131072 \
-n 32768 \
-fa on \
-np 1 \
-ctk q8_0 \
-ctv q8_0 \
-ctkd q8_0 \
-ctvd q8_0 \
-ctxcp 64 \
--no-mmap \
--mlock \
--no-warmup \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0

The most important parameter here is -fitt 1536. Since part of the model is offloaded to CPU because of its size and , this tells llama.cpp to properly balance the load on the GPU/CPU to get the best possible performance, and leaves 1536 MB of free memory for the MTP draft model and KV cache. Since I'm running my dGPU as a secondary GPU (monitor plugged in the iGPU), I can use all the available 12GB VRAM for inference. 1536 might be too small if you use your dGPU as your primary GPU, so test it out first.

You can also try different values for -spec-draft-n-max. I got slightly better tok/sec with 3, but a much better acceptance rate with 2, so the trade off was not worth it. With MTP, you want to maximize speed AND acceptance, so you need to find the best balance between both.

Benchmark results:

mtp-bench.py

code_python        pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=80.8
code_cpp           pred=  58 draft=  40 acc=  37 rate=0.925 tok/s=81.8
explain_concept    pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=70.0
summarize          pred=  53 draft=  40 acc=  32 rate=0.800 tok/s=75.4
qa_factual         pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=77.8
translation        pred=  22 draft=  16 acc=  13 rate=0.812 tok/s=81.9
creative_short     pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=69.2
stepwise_math      pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=76.5
long_code_review   pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=73.2

If you have any questions, feel free to ask :)

Cheers.

Open Reddit thread

Spent an evening dialing in Qwen3.6-35B-A3B on consumer hardware. Fun side note: I had **Claude Opus 4.7 (just the $20 sub)** build the config, launch the servers in the background, run the benchmarks, read the VRAM splits from the llama.cpp logs, and iterate on the tuning — basically did the whole thing autonomously. I just told it what hardware I have and what I wanted to run.

Sharing because the common `--cpu-moe` advice is leaving **54% of your speed on the table** on 16GB GPUs.

# Hardware

* **GPU:** RTX 5070 Ti (16GB GDDR7, Blackwell)
* **CPU:** Ryzen 9800X3D (96MB L3 V-Cache)
* **RAM:** 32GB DDR5
* **Stack:** llama.cpp b8829 (CUDA 13.1, Windows x64)
* **Model:** `unsloth/Qwen3.6-35B-A3B-GGUF` — `UD-Q4_K_M` (22.1 GB)

# The finding — --cpu-moe vs --n-cpu-moe N

Everyone’s using `--cpu-moe` which pushes ALL MoE experts to CPU. On a 16GB GPU with a 22GB MoE model that means **only \~1.9 GB of your VRAM gets used** — the other \~12 GB sits idle.

`--n-cpu-moe N` keeps experts of the first N layers on CPU and puts the rest on GPU. With `N=20` on a 40-layer model, the split uses VRAM properly.

# Benchmarks (300-token generation, Q4_K_M)

|Config|Gen t/s|Prompt t/s|VRAM used|
|:-|:-|:-|:-|
|`--cpu-moe` (baseline)|51.2|87.9|3.5 GB|
|`--n-cpu-moe 20`|**78.7**|**100.6**|12.7 GB|
|`--n-cpu-moe 20` \+ `-np 1` \+ 128K ctx|**79.3**|**135.8**|13.2 GB|

**+54% generation speed, +54% prompt speed** vs. naive `--cpu-moe`. Jumping to 128K context is essentially free thanks to `-np 1` dropping recurrent-state memory.

# Startup command that works

llama-server.exe ^
-m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
--n-cpu-moe 20 ^
-ngl 99 ^
-np 1 ^
-fa on ^
-ctk q8_0 -ctv q8_0 ^
-c 131072 ^
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 ^
--presence-penalty 0.0 --repeat-penalty 1.0 ^
--reasoning-budget -1 ^
--host 0.0.0.0 --port 8080

That’s Unsloth’s “Precise Coding” sampling preset. For general use: `--temp 1.0 --presence-penalty 1.5`.

# Gotchas I hit (well, that Opus hit and fixed)

* `-np` **defaults to auto=4 slots.** Wastes memory on recurrent state (\~190 MB). Set `-np 1` for single-user setups (OpenCode etc.).
* `--fit-target` **doesn’t help here** — `-ngl 99` \+ `--n-cpu-moe N` already gives you deterministic control.
* `-ctk q8_0 -ctv q8_0` is nearly lossless and halves your KV cache vs fp16. 128K ctx only costs 1.36 GB VRAM.
* **Qwen3.6 is a hybrid architecture** — only 10 layers are standard attention, the other 40 are Gated Delta Net (recurrent). That’s why KV memory is so small.

# How to tune N for your GPU

Each MoE layer on GPU costs \~530 MB VRAM. Non-MoE weights are \~1.9 GB fixed. For a 40-layer model:

|GPU VRAM|Recommended `N`|
|:-|:-|
|8 GB|stay with `--cpu-moe`|
|12 GB|`N=26`|
|16 GB|`N=20` (sweet spot)|
|24 GB|`N=8` (fits almost everything)|

Start conservative, watch VRAM during a long-context generation, then step `N` down by 2-3 until you have \~2 GB headroom.

# TL;DR

Replace `--cpu-moe` with `--n-cpu-moe 20`, add `-np 1`, and you get **79 t/s + 128K context** on a 5070 Ti. The 9800X3D’s V-Cache carries the CPU side effortlessly.

And Claude Opus 4.7 on the $20 Pro sub is genuinely good enough now to run this kind of hardware-tuning loop end-to-end — launch servers in background, parse logs, iterate — without hand-holding. Kind of wild.

Happy to test other configs if anyone wants comparisons.

**\*\*\*\*\*\*\*\*\*\*\*\*\*EDIT — Thanks to some great comments, the setup got better. Updated findings:**

**1.** `--fit on --fit-ctx 128000 --fit-target 512` **> manual** `--n-cpu-moe 20`

Shoutout to the commenter who recommended the “fit-triple”. It auto-probes VRAM, picks N for you (landed on N=19 here), and adapts if drivers steal VRAM. Slightly faster than my hand-tuned N=20 and zero brain power to maintain. **Caveat:** bare `--fit on` silently drops ctx to 4K — always pair it with `--fit-ctx`.

**2. My original prefill numbers were way too low**

A commenter correctly flagged that \~135 t/s prefill is nonsense for a 5070 Ti. They were right — that was server-side timing including first-token latency. Re-ran with `llama-bench` (3 reps, same config):

|Test|t/s|
|:-|:-|
|pp512|1182|
|pp2048|1644|
|tg128|91.5|

So real prefill is **\~1.2–1.6k t/s**, not 135.

**Final “best command” for 16 GB VRAM + 32 GB RAM :**

llama-server.exe ^
-m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
--fit on ^
--fit-ctx 128000 ^
--fit-target 512 ^
-np 1 ^
-fa on ^
-ctk q8_0 ^
-ctv q8_0 ^
--temp 0.6 ^
--top-p 0.95 ^
--top-k 20 ^
--min-p 0.0 ^
--presence-penalty 0.0 ^
--repeat-penalty 1.0 ^
--reasoning-budget -1 ^
--host 0.0.0.0 ^
--port 8033

Keep the comments coming, every round makes this faster. :D

\*\*\*\*\*

**EDIT 2 — Another commenter’s tip got me one more layer on the GPU:**

Dropping `--fit-target` from 512 → 256 squeezes **one extra MoE layer onto the GPU** (N=18 instead of 19). The commenter also suggested adding `--mlock` alongside `--no-mmap` to lock RAM pages against swap.

Benched both changes vs. the previous EDIT’s config (fit-target 512 + no-mmap):

|Config|pp512|pp2048|tg128|
|:-|:-|:-|:-|
|fit-target 512 + no-mmap|2769|2729|91.5|
|**fit-target 256 + no-mmap + mlock**|**2743**|**2724**|**96.3**|

**+7% generation**, prefill unchanged. Costs nothing — just a smaller VRAM headroom and explicit RAM locking.

**Updated final command:**

llama-server.exe ^
-m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
--fit on ^
--fit-ctx 128000 ^
--fit-target 256 ^
-np 1 ^
-fa on ^
--no-mmap ^
--mlock ^
-ctk q8_0 ^
-ctv q8_0 ^
--temp 0.6 ^
--top-p 0.95 ^
--top-k 20 ^
--min-p 0.0 ^
--presence-penalty 0.0 ^
--repeat-penalty 1.0 ^
--reasoning-budget -1 ^
--host 0.0.0.0 ^
--port 8033

**\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\***

**EDIT 3 — Two more community tips landed big wins:**

**1.** `-ub 2048` **(ubatch size) = +59% prompt-processing at 2K tokens**

Default `-ub` is 512. Bumping it to 2048 (and matching `-b 2048`) lets the GPU process more tokens in parallel per prefill step. Benched (5 reps each):

|ubatch|pp512|pp2048|pp4096|tg128|
|:-|:-|:-|:-|:-|
|512 (default)|2739|2778|—|98.7|
|1024|2689|3689|—|100.5|
|**2048**|2771|**4453**|4417|98.4|
|4096|2736|4427|4866|100.4|

**2048 is the sweet spot** — 59% faster at 2K-prompts, gen untouched. 4096 only helps beyond 2K-prompts (compute buffer saturates otherwise) and eats more VRAM.

**2.** `--chat-template-kwargs "{\"preserve_thinking\": true}"` **for agentic workflows**

Qwen3.6-specific chat template parameter. Default only keeps the latest user turn’s thinking; `preserve_thinking: true` carries thinking traces from all historical messages forward. Turns out Qwen3.6 was specifically trained for this behavior. Benefits:

* Better decision consistency across tool-calling turns
* Fewer redundant re-reasonings → lower token consumption in long agent sessions
* Better KV-cache reuse across turns

**Final final command:**

llama-server.exe ^
-m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
--fit on ^
--fit-ctx 128000 ^
--fit-target 256 ^
-np 1 ^
-fa on ^
--no-mmap ^
--mlock ^
-b 2048 ^
-ub 2048 ^
-ctk q8_0 ^
-ctv q8_0 ^
--temp 0.6 ^
--top-p 0.95 ^
--top-k 20 ^
--min-p 0.0 ^
--presence-penalty 0.0 ^
--repeat-penalty 1.0 ^
--reasoning-budget -1 ^
--chat-template-kwargs "{\"preserve_thinking\": true}" ^
--host 0.0.0.0 ^
--port 8033

**Total benched throughput on 5070 Ti 16 GB + 9800X3D + 32 GB DDR5-6000:**

* **pp512 \~2771 t/s**
* **pp2048 \~4453 t/s**
* **pp4096 \~4417 t/s** (bump `-ub` to 4096 for +10% here if you do long prompts)
* **tg128 \~98 t/s**
* **Context: 128K**

This community keeps delivering. Thank you.

Open Reddit thread
r/LocalLLaMA 106 upvotes 56 comments April 22, 2026
Running Qwen3.6-35B-A3B Locally for Coding Agent: My Setup & Working Config

# Hardware

|Component|Details|
|:-|:-|
|**Machine**|MacBook Pro (Mac14,6)|
|**Chip**|Apple M2 Max — 12-core CPU (8P + 4E)|
|**Memory**|64 GB unified memory|
|**Storage**|512 GB SSD|
|**OS**|macOS 15.7 (Sequoia)|

# AI Agent Setup

I'm using the [**pi coding agent**](https://github.com/badlogic/pi-mono/tree/main/packages/coding-agent) as my primary development assistant. It's a local-first AI coding agent that connects to local models via llama.cpp.

**Model:** `Qwen3.6-35B-A3B` (running via llama.cpp)

# How pi Connects to llama-server

The pi agent communicates with llama-server via the OpenAI-compatible API. Configuration lives in `~/.pi/agent/models.json`:

{
"providers": {
"llama-cpp": {
"baseUrl": "http://127.0.0.1:8080/v1",
"api": "openai-completions",
"apiKey": "ignored",
"models": [{ "id": "Qwen3.6-35B-A3B", "contextWindow": 131072, "maxTokens": 32768 }]
}
}
}

# The Command

llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q5_K_XL \
-c 131072 \
-n 32768 \
--no-context-shift \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--repeat-penalty 1.00 \
--presence-penalty 0.00 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--batch-size 4096 \
--ubatch-size 4096

# Parameter Breakdown

|Flag|Value|Why|
|:-|:-|:-|
|`-hf`|`unsloth/...:UD-Q5_K_XL`|HuggingFace model repo with unsloth's custom UD quantization — good quality/size tradeoff (\~29 GB)|
|`-c 131072`|128K context|This model supports a massive context window — set it high for long documents or extended conversations|
|`-n 32768`|32K output tokens|Allows long single-turn generations without hitting the generation limit|
|`--no-context-shift`|Off|Prevents context shifting during generation — keeps long responses coherent|
|`--chat-template-kwargs`|`preserve_thinking: true`|Keeps the model's reasoning/thinking blocks intact in the output|
|`--batch-size 4096`|4096|Logical batch size — higher = faster prompt processing, needs more memory|
|`--ubatch-size 4096`|4096|Physical batch size — kept equal to logical batch for consistency|

# Sampling Parameters

The sampling parameters (`--temp`, `--top-p`, `--top-k`, `--repeat-penalty`, `--presence-penalty`) are taken directly from [unsloth's recommended config for Qwen3.6](https://unsloth.ai/docs/models/qwen3.6). I use these as-is since they're the official recommendations from the model's creators and produce good results out of the box.

Open Reddit thread

This module is fast and smart can someone do some benchmarks?
It's seems to be real smart.

[https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF](https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF)

Open Reddit thread

So maybe this is a no-brainer to many experienced local LLM users but it was not obvious for me.

I am running a 3070 8gb + 64gb DDR4. Pretty lightweight setup so I chose the smallest Q4 unsloth model **Qwen3.6-35B-A3B-UD-IQ4\_XS.gguf** \- which is \~18gb. It does run ok, and with some optimizations in llama.cpp I got about 25-30 tokens/s with a 32k context window.

I did have some problems with looping during thinking so I tried a bigger Q4 model **Qwen3.6-35B-A3B-UD-Q4\_K\_XL.gguf -** \~23gb. To my surprise, this is much faster! With a 128k context window, I am seeing 32 tokens/s.

I ended up using Q5\_K\_S for best quality/speed balance - about 30 tokens/s. Oh, and I'm also using 128k context window. The speed does go down with long context. It's still over 25 at 50k context though! (haven't tested higher yet)

Bottom line - for MoE models like this, experiment with bigger quants than you'd expect to be able to use!

Open Reddit thread
View more discussions →

More models from Qwen

Continue browsing adjacent models from the same provider.

← All AI Models