Gemini 3.1 Pro vs Gemini 2.5 Flash Lite
Compare Gemini 3.1 Pro and Gemini 2.5 Flash Lite across pricing, context window, capabilities, benchmarks, and API access to choose the better fit for your long-context workloads.
Overview Comparison
Structured side-by-side differences for the highest-signal model metadata.
Provider
The entity that currently provides this model.
Model ID
The routed model identifier exposed by upstream providers.
Input Context Window
The number of tokens supported by the input context window.
Maximum Output Tokens
The number of tokens that can be generated by the model in a single request.
Open Source
Whether the model's code is available for public use.
Release Date
When the model was first released.
Knowledge Cut-off Date
When the model's knowledge was last updated.
API Providers
The providers that currently expose the model through an API.
Modalities
Types of data each model can process or return.
Pricing Comparison
Compare current token pricing before you choose the cheaper or more scalable API option.
Capabilities Comparison
See where each model overlaps, where they differ, and which one supports more of the features you care about.
Benchmark Comparison
Shared benchmark rows make it easier to compare performance where both models have published scores.
| Benchmark | Gemini 3.1 Pro | Gemini 2.5 Flash Lite |
|---|---|---|
| AIME 2024: American math olympiad problems | | |
| ARC-AGI-2: Novel abstract reasoning and pattern recognition | | |
| BrowseComp: Complex web browsing and information retrieval | | |
| GPQA Diamond: PhD-level science questions (biology, physics, chemistry) | | |
| HLE: Questions that challenge frontier models across many domains | | |
| LiveCodeBench: Real-world coding tasks from recent competitions | | |
| MATH-500: Undergraduate and competition-level math problems | | |
| MCP-Atlas Tool Use: Structured tool use via Model Context Protocol | | |
| MMLU-Pro: Expert knowledge across 14 academic disciplines | | |
| MMMLU: Multilingual and multimodal understanding | | |
| SciCode: Scientific research coding and numerical methods | | |
| SWE-bench Pro: Challenging real-world software engineering tasks | | |
| SWE-bench Verified: Real GitHub issues requiring multi-file code fixes | | |
| Terminal-Bench 2.0: Agentic coding and terminal command tasks | | |
| τ²-bench Retail: Agentic tool use in retail scenarios | | |
| τ²-bench Telecom: Agentic tool use in telecom scenarios | | |
What Reddit discussions say about Gemini 3.1 Pro vs Gemini 2.5 Flash Lite
Gemini 3.1 Pro and Gemini 2.5 Flash Lite are both surfacing live Reddit discussions, giving this comparison a community layer beyond specs and benchmarks.
The most visible threads right now are clustered in r/GeminiAI, r/Bard, r/singularity.
Hi all,
I'm new to posting on this sub but I have gotten a lot of positive feedback on my build and have been asked to provide a guide.
**Notes:**
* AIOStreams is awesome but it can be challenging/intimidating to set up for beginners. I hope this guide is helpful regardless of your experience level.
* I sometimes say "required" or "optional" but technically everything here is optional. When I say "optional" here, I mean that it doesn't really take too much away from the main aspects of the build to omit it. You could probably figure out ways to replicate much of the build without some of the "required" things but I won't offer guidance on every possible combination/scenario in this guide. Feel free to ask in the comments though.
* All prices are in USD and are current as of posting.
**Key features of my build:**
1. Optimized: Fewer points of failure and increased redundancy without sacrificing performance.
2. Minimalist: Put all of the "heavy lifting" in the background so that I can keep the UX & UI as simple and clean as possible.
3. Aggressive language filtering/sorting for higher probability of getting correct audio & subtitles.
* Note that my build prioritizes English since it is my native language. I provide instructions for changing this.
4. All addons are within AIOStreams to keep everything fully customizable.
5. New approaches I have not found on this sub.
At the core of this build is AIOStreams. To have all of the addons in my build, I use [Midnight's instance](https://aiostreamsfortheweebsstable.midnightignite.me/stremio/configure). This will not be an all-encompassing guide to AIOStreams, just how to replicate my build. If you are unfamiliar with AIOStreams or just getting started, you can find great guides by following that link. However, my hope is that even a beginner could replicate this build using this guide (but may not fully understand AIOStreams in the end).
# Prerequisites
* Required - a willingness to accept that this probably isn't the perfect setup for you and you'll probably want to tweak it.
* Required - Stremio installed and running.
* Required - at least one debrid service.
* I recommend having two for redundancy.
* If it's just for you, I would recommend getting Real-Debrid and/or TorBox.
* If sharing with family/friends, I would recommend TorBox and/or Premiumize as they allow for concurrent streams from different IPs (Real-Debrid does not). This is what I have.
* Required - [TMDB API Key](https://developer.themoviedb.org/docs/getting-started) (free)
* Required - [TVDB API Key](https://www.thetvdb.com/api-information) (free)
* Required - [RPDB API Key](https://ratingposterdb.com/api-key/) (free)
* Required - [Trakt](https://trakt.tv) Account (free)
* Optional - [Debridio](https://debridio.com)
* A great scraper (good backup to Torrentio) and has other features.
* The price is $10/yr but I think it's worth it for most.
* Optional - [Google AI Studio](http://aistudio.google.com) (Gemini) API Key
* It's free (with rate limits) so why not.
* I went ahead and upgraded to Paid Tier 1 so I don't get rate-limited with multiple family members. It's dirt cheap and you get $300 credit for first 90 days (I've used $0.16 this month lol).
Pro tip: have all your API keys easily accessible as you're setting everything up (e.g., in your notes app).
# Getting Started
Head over to Midnight's instance of AIOStreams: [https://aiostreamsfortheweebsstable.midnightignite.me/stremio/configure](https://aiostreamsfortheweebsstable.midnightignite.me/stremio/configure)
Once there, make sure you select "Advanced" setup mode and familiarize yourself with the home page if this is your first time using AIOStreams.
Each section will now follow the tabs on the left (desktop) or top (mobile) of your screen on the AIOStreams website.
# Services
**Step 1:**
Click on the services tab (cloud icon) and select the debrid services you use. For Real-Debrid, TorBox, and Premiumize, this is as simple as pasting your API key found on the respective debrid's website. Here, I select TorBox and Premiumize but you can choose what you like (won't really make a difference).
**Step 2:**
Enter your RPDB, TMDB, and TVDB API keys at the bottom of the page.
# Addons
**Step 1:**
On the services screen, you can select "Next" or click the addons tab which has a puzzle icon to move forward to the addons section.
**Step 2:**
To the right of "Installed" click "Marketplace" so that we can install the addons we want.
**Step 3:**
In no particular order, you can search & install the following scraper addons:
1. Required - Torrentio
* Free - keep default settings.
* This is a popular scraper for torrents (files) to stream and will likely be the main source for files unless it's down.
* I include the other scrapers below for redundancy if Torrentio is down or if there is a niche title. Most are free so why not have more options.
2. Required - Comet
* Free - keep default settings.
3. Required - Jackettio
* Free - keep default settings.
4. Required - TorrentGalaxy
* Free - keep default settings.
5. Required - TorrentsDB
* Free - keep default settings.
6. Required - StremThru Torz
* Free - keep default settings.
7. Optional - TorBox Search
* Paid - Requires TorBox API key entered in the "Services" section previously. This is included with all TorBox plans so "free" if you already have the service.
* Good scraper, backs up the others.
* Keep default settings.
8. Optional - Debridio Scraper
* Paid - Requires that you enter your Debridio API Key. Debridio is a paid service (see details in prereqs above).
* Good scraper, backs up the others.
* Paste API key, keep default settings.
Note that you can include a free popular scraper MediaFusion but I've had problems with it in this build. With how many scrapers I've already included, it doesn't really add much in my opinion.
**Step 4:**
In the same AIOStreams Marketplace from Step 3, search & install the following list/miscellaneous addons. These are all kinda optional and just really provide lists for the homepage. If you already have your own lists setup, feel free to substitute (also see step 5 if you can't find them in the marketplace). In no particular order:
1. REMOVED - AI Companion (can use Rotten Tomatoes instead maybe, config [here](https://7a82163c306e-rottentomatoes.baby-beamup.club/configure))
* EDIT - I can no longer recommend this addon as it seems like it’s down permanently. I will keep the instructions here in case it comes back online though.
* LLM Provider: select Gemini (OpenAI Compatible)
* LLM Provider API Key: paste your [Google aistudio](http://aistudio.google.com) api key here.
* Preferred search language: your language here (I put English).
* Model name: gemini-2.5-flash-lite (highest rate limits and fast).
* Maximum results: 10 (adjust to your liking)
* Keep default for everything else.
2. RPDB Catalogs
* Keep default.
3. Streaming Catalogs
* Select the services you want. Keep default for everything else.
4. USA TV
* Free - Keep defaults.
5. AI Search
* Paste AI studio API key
* If on a paid AI studio tier, turn off AI Response Caching. Otherwise, probably better to keep checked to avoid hitting rate limits on free tier.
* Paste RPDB api key.
* Language: yours here.
* Gemini Model Name: gemini-flash-latest
* Number of Recommendations: 20 (adjust to your liking)
6. Debridio TV
* Paid
* Paste your debridio api key and select what channels you want.
* Keep defaults for others.
**Step 5:**
The AIOStreams addon marketplace doesn't have all Stremio addons. However, you can add your own Stremio addons by going to the same Marketplace section from steps 3 & 4, scrolling all the way down, and selecting configure under custom. Then, you paste the manifest url for the addon here (I just keep defaults). Below are the custom addons we'll configure in no particular order:
1. AIOMetadata
* Configure at: [https://aiometadatafortheweebs.midnightignite.me/configure/](https://aiometadatafortheweebs.midnightignite.me/configure/)
* The configuration is pretty straightforward. Add any of the API keys you have and configure the lists/catalogs to your liking.
* Here, I like to include the Gemini API key and integrate my trakt account for nice recs.
* Copy/paste manifest url at the end into the AIOStreams as instructed above.
2. AIOLists
* Configure at: [https://aiolistsfortheweebs.midnightignite.me](https://aiolistsfortheweebs.midnightignite.me)
* Same as AIOMetadata above but this one is easier.
3. IMDB Catalogs
* Configure at: [https://1fe84bc728af-imdb-catalogs.baby-beamup.club/configure](https://1fe84bc728af-imdb-catalogs.baby-beamup.club/configure)
* Just paste your RPDB api key on config site and then paste manifest url into AIOStreams.
**Step 6:**
Sort the lists/catalogs how you prefer. You can toggle individual lists off to hide them from home & discover pages in Stremio.
**Step 7:**
Go to "Installed" and at the bottom of the page, go to Addon Fetching Strategy. Select Dynamic and paste one of the below versions (change the language if non-English):
Version 2.0 (thanks to u/Razzmatazz1414 & u/HeyIntrovert):
This is the most recently updated one, best for most people. It may take slightly longer than V1 on more niche titles (no noticeable difference on new titles).
`((count(cached(regexMatched(resolution(language(quality(totalStreams, 'Bluray REMUX', 'Bluray', 'WEB-DL') 'English') '2160p')))) >= 3 and (count(cached(regexMatched(resolution(totalStreams, '2160p')))) >= 5 or count(cached(regexMatched(resolution(totalStreams, '1080p')))) >= 5) and count(cached(regexMatched(quality(totalStreams, 'Bluray REMUX', 'Bluray', 'WEB-DL', 'WEBRip')))) >= 5) or count(cached(totalStreams)) >= 3 and totalTimeTaken > 7000) or totalTimeTaken > 10000`
Version 2.1:
Use this one if your language (non-English, or even English) is not commonly available and you want to search for it even more aggressively. It will exhaustively search for your language, meaning if a stream exists with the language, it will find at least one (may not be high quality/resolution though). However, if a stream with your language does not exist, it will keep searching until the timeout condition, which means it will take a while. I plan on optimizing this further and making a separate post for our non-English community but I hope this works in the meantime. MAKE SURE TO CHANGE LANGUAGE IF DESIRED.
`(((count(cached(regexMatched(resolution(language(quality(totalStreams, 'Bluray REMUX', 'Bluray', 'WEB-DL') 'English') '2160p')))) >= 3 and (count(cached(regexMatched(resolution(totalStreams, '2160p')))) >= 5 or count(cached(regexMatched(resolution(totalStreams, '1080p')))) >= 5) and count(cached(regexMatched(quality(totalStreams, 'Bluray REMUX', 'Bluray', 'WEB-DL', 'WEBRip')))) >= 5) or count(cached(totalStreams)) >= 3 and totalTimeTaken > 7000) and count(cached(language(totalStreams,'English'))) > 0) or totalTimeTaken > 10000`
Version 1.0:
My original condition. Use this if the above does not work.
`(count(cached(resolution(language(quality(totalStreams, 'Bluray REMUX', 'Bluray', 'WEB-DL', 'WEBRip') 'English') '2160p'))) >= 3 and (count(cached(resolution(totalStreams, '2160p'))) >= 5 or (count(cached(resolution(totalStreams, '2160p'))) > 0 and count(cached(resolution(totalStreams, '1080p'))) >= 5)) and count(cached(quality(totalStreams, 'Bluray REMUX', 'Bluray', 'WEB-DL', 'WEBRip'))) >= 5 and count(cached(language(totalStreams,'English'))) >= 2) or totalTimeTaken > 7000`
This will fire all of the torrent scrapers at once (in parallel) then as soon as there are "enough" files that are "high quality" then all of the searching stops. Often, this just grabs torrentio files and exits immediately. In the end, this makes sure that torrent search is super fast while also being redundant and gets quality streams.
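To make the exit logic easier to follow, here is a rough Python sketch of how I read the Version 2.0 condition. It is not AIOStreams code: the `Stream` class, its field names, and `should_stop` are made up for illustration, and I'm assuming `cached(...)` and `regexMatched(...)` simply keep the streams that are cached / matched a preferred regex.

```python
from dataclasses import dataclass

TOP_QUALITIES = {"Bluray REMUX", "Bluray", "WEB-DL"}
GOOD_QUALITIES = TOP_QUALITIES | {"WEBRip"}


@dataclass
class Stream:
    # Hypothetical stand-in for a scraped result, not an AIOStreams type.
    resolution: str      # e.g. "2160p", "1080p"
    quality: str         # e.g. "Bluray REMUX", "WEB-DL"
    language: str        # e.g. "English"
    cached: bool         # already cached on the debrid service
    regex_matched: bool  # matched one of the preferred regex patterns


def should_stop(streams, elapsed_ms, language="English"):
    """Return True when the scrapers can stop searching (Version 2.0 logic)."""
    cached = [s for s in streams if s.cached]
    good = [s for s in cached if s.regex_matched]

    top_4k_in_language = sum(
        s.resolution == "2160p" and s.language == language and s.quality in TOP_QUALITIES
        for s in good
    )
    n_4k = sum(s.resolution == "2160p" for s in good)
    n_1080p = sum(s.resolution == "1080p" for s in good)
    n_good_quality = sum(s.quality in GOOD_QUALITIES for s in good)

    # Branch 1: enough high-quality streams, exit right away.
    enough_quality = (
        top_4k_in_language >= 3
        and (n_4k >= 5 or n_1080p >= 5)
        and n_good_quality >= 5
    )
    # Branch 2: after 7 seconds, settle for any 3 cached streams.
    slow_fallback = len(cached) >= 3 and elapsed_ms > 7000
    # Branch 3: after 10 seconds, stop no matter what.
    hard_timeout = elapsed_ms > 10000
    return enough_quality or slow_fallback or hard_timeout
```

Note that `and` binds tighter than `or`, so the 7000 ms check only applies to the "any 3 cached streams" branch.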
# Filters
These next few sections are the "meat" of the build. Filters is where we tell AIOStreams which streams/files we want to keep/show after searching.
**Step 1:**
Now we move onto the next tab which is filters (funnel icon).
**Step 2:**
In Cache subsection, I like to exclude uncached (this is like excluding RD download). This makes sure I'm just streaming cached files from debrid and I don't have to wait for them to download to debrid.
**Step 3:**
Go to Resolution subsection. I require 2160p through 480p (nothing else will show up).
Select all resolutions in "Preferred Resolutions" then sort to your liking (I do 2160p first to Unknown last).
**Step 4:**
Quality subsection. I exclude CAM, TS, TC, SCR, Unknown.
I set up preferred qualities in the following order: BluRay REMUX, BluRay, WEB-DL, WEBRip, HDRip, HDTV, DVDRip, HC HD-Rip.
**Step 5:**
Encode subsection. I exclude XviD & DivX. I have the preference sorted: AVC, HEVC, AV1, Unknown.
**Step 6:**
Visual tags. Exclude 3D. My preference order: HDR+DV, DV Only, DV, HDR10+, HDR10, HDR Only, HDR, 10bit, IMAX, SDR, Unknown.
**Step 7:**
Audio tags. My preference order: Atmos, DD+, DD, DTS, DTS-ES, DTS-HD, DTS-HD MA, TrueHD.
**Step 8:**
Language. Adjust this to your liking. My preference order is: English, Multi, Dual Audio, Dubbed, Unknown.
**Step 9:**
Stream Expression. My preference in order is (change language if non-English):
`language(resolution(cached(streams), '2160p'), 'English', 'Multi')`
`language(resolution(cached(streams), '1440p', '1080p'), 'English', 'Multi')`
This lets me put, for example, 1080p content with "for sure" English over 4K content with unknown/other language. This is aggressive and you may want to omit it entirely (or change language, of course).
**Step 10:**
Regex. Here I just import Vidhin's regexes as stated on this page. Just go to the bottom of preferred regex patterns, click import, and paste this url: [https://raw.githubusercontent.com/Vidhin05/Releases-Regex/main/merged-anime-regexes.json](https://raw.githubusercontent.com/Vidhin05/Releases-Regex/main/merged-anime-regexes.json)
**Step 11:**
Size. I like to globally cap at 30GB because I find I get buffering over that. Adjust to your liking or omit.
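If it helps to see the exclusion rules so far in one place, here is a small Python sketch of them. This is only an illustration, not how AIOStreams implements filters: the dict keys, the exact list of resolutions between 2160p and 480p, and treating 30GB as decimal gigabytes are all my assumptions.

```python
# Assumed set of resolutions between 2160p and 480p.
ALLOWED_RESOLUTIONS = {"2160p", "1440p", "1080p", "720p", "576p", "480p"}
EXCLUDED_QUALITIES = {"CAM", "TS", "TC", "SCR", "Unknown"}
EXCLUDED_ENCODES = {"XviD", "DivX"}
MAX_SIZE_BYTES = 30 * 1000**3  # 30GB cap, assuming decimal gigabytes


def keep(stream: dict) -> bool:
    """Return True if a stream survives the cache, resolution, quality,
    encode, visual tag, and size filters described in the steps above."""
    return (
        stream["cached"]                                  # exclude uncached
        and stream["resolution"] in ALLOWED_RESOLUTIONS   # required resolutions
        and stream["quality"] not in EXCLUDED_QUALITIES   # no CAM/TS/TC/SCR/Unknown
        and stream["encode"] not in EXCLUDED_ENCODES      # no XviD/DivX
        and "3D" not in stream["visual_tags"]             # exclude 3D
        and stream["size"] <= MAX_SIZE_BYTES              # global size cap
    )
```

For example, a cached 12GB 1080p WEB-DL HEVC stream passes, while the same stream tagged CAM or weighing 31GB is dropped.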
**Step 12:**
Result Limits. I set global limits to 9 and resolution limit to 3. Then I get, for example, 3 4K streams, 3 1080p streams, and 3 720p streams (assuming all exist). This is plenty for me as I've done a lot of work on filtering and sorting and keeps my stream list minimal and simple. Adjust to your liking or omit.
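Here is a quick Python sketch of how I picture the two limits working together. The `apply_limits` function and the dict-based streams are hypothetical, and I'm assuming the limits are applied to the list after it has already been sorted.

```python
from collections import Counter


def apply_limits(sorted_streams, global_limit=9, per_resolution=3):
    """Keep at most `per_resolution` streams of each resolution and at most
    `global_limit` streams overall, preserving the incoming sort order."""
    kept = []
    seen = Counter()
    for stream in sorted_streams:
        if len(kept) >= global_limit:
            break  # global limit reached
        if seen[stream["resolution"]] >= per_resolution:
            continue  # this resolution already has its share
        seen[stream["resolution"]] += 1
        kept.append(stream)
    return kept
```

With 5 streams each of 2160p, 1080p, 720p, and 480p coming in (in that order), this keeps 3 + 3 + 3 and never reaches the 480p ones.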
**Step 13:**
Deduplicator. Enable this.
I keep the rest of the settings in the filters section as default.
# Sorting
Here is where we tell AIOStreams how to sort the streams/files found after filtering. This is the order in which they'll be displayed in stremio.
Set sort order type to global and include the following sort criteria: Library, Cached, Stream Expression Matched, Resolution, Language, Quality, Regex Patterns, Visual Tag, Encode, Size, Seeders.
I sort in the order above. This is aggressive with respect to language. Feel free to move language a bit lower if you care less. I found this is a good order for me.
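A multi-criteria sort like this behaves like sorting on a tuple, where earlier criteria win ties before later ones are even looked at. Below is a simplified Python sketch covering only a few of the criteria (Cached, Stream Expression Matched, Resolution, Language, Quality). The field names and helper functions are invented for illustration, and `expression_rank` is my reading of the two stream expressions from the Filters section.

```python
RESOLUTION_ORDER = ["2160p", "1440p", "1080p", "720p", "576p", "480p"]
LANGUAGE_ORDER = ["English", "Multi", "Dual Audio", "Dubbed", "Unknown"]
QUALITY_ORDER = ["BluRay REMUX", "BluRay", "WEB-DL", "WEBRip",
                 "HDRip", "HDTV", "DVDRip", "HC HD-Rip"]


def rank(value, order):
    """Position in the preference list; unknown values sort last."""
    return order.index(value) if value in order else len(order)


def expression_rank(stream):
    """0 = matches the first stream expression, 1 = the second, 2 = neither."""
    if not stream["cached"] or stream["language"] not in ("English", "Multi"):
        return 2
    if stream["resolution"] == "2160p":
        return 0
    if stream["resolution"] in ("1440p", "1080p"):
        return 1
    return 2


def sort_key(stream):
    # Lower tuples sort first; criteria are compared left to right.
    return (
        not stream["cached"],
        expression_rank(stream),
        rank(stream["resolution"], RESOLUTION_ORDER),
        rank(stream["language"], LANGUAGE_ORDER),
        rank(stream["quality"], QUALITY_ORDER),
    )
```

Using `sorted(streams, key=sort_key)`, a cached 1080p English stream lands above a cached 2160p stream with unknown language, because Stream Expression Matched is compared before Resolution.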
# Formatter
Under Formatter Selection, select Custom. Then, paste this into name template:
`{stream.resolution::exists["{stream.resolution::replace('2160p','4K')}"||"NA"]}{service.cached::isfalse[" Download"||""]}`
Then for description template:
`{stream.seasonEpisode::exists["{stream.seasonEpisode::join('')}{tools.newLine}"||""]}{service.shortName}{service.cached::isfalse[" | ⬇️ {stream.seeders}"||""]}{stream.size::>0[" | {stream.size::bytes}"||""]}{tools.newLine}{stream.languages::exists["{stream.languages::join(', ')}"||"Language Unknown"]}{tools.newLine}{stream.resolution::=2160p::or::stream.resolution::=4K["★★★"||""]}{stream.resolution::=1080p["★★"||""]}{stream.resolution::=720p["★"||""]}{stream.resolution::=2160p::or::stream.resolution::=4K::or::stream.resolution::=1080p::or::stream.resolution::=720p[""||"★"]}{stream.quality::=WEB-DL::or::stream.quality::=BluRay::or::stream.quality::~REMUX["★"||""]}{stream.uLanguageCodes::~EN::or::stream.languageCodes::~EN["★"||""]}`
Here is an example of what it looks like:
https://preview.redd.it/l84vnht3s0bg1.png?width=2868&format=png&auto=webp&s=da9626fa8c4fff3d0557074fa5d9fec0b5da8aa7
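The star part of the description template is the hardest bit to read, so here is the same logic as a small Python sketch. This is just my paraphrase of the template, not formatter code: `star_rating` is a made-up name, and I'm assuming `~` means "contains" and that the language codes come in as a list.

```python
def star_rating(resolution, quality, language_codes):
    """Rebuild the star string from the description template."""
    # Resolution: 4K gets three stars, 1080p two, 720p and anything else one.
    if resolution in ("2160p", "4K"):
        stars = 3
    elif resolution == "1080p":
        stars = 2
    else:
        stars = 1

    # One bonus star for WEB-DL, BluRay, or anything containing REMUX.
    quality = quality or ""
    if quality in ("WEB-DL", "BluRay") or "REMUX" in quality:
        stars += 1

    # One bonus star if English is among the language codes.
    if "EN" in (language_codes or []):
        stars += 1

    return "★" * stars
```

So the maximum is five stars (4K, good quality, English) and the minimum is one.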
I have also been experimenting with replacing the language with quality. Here is the description template for that:
`{stream.seasonEpisode::exists["{stream.seasonEpisode::join('')}{tools.newLine}"||""]}{service.shortName}{service.cached::isfalse[" | ⬇️ {stream.seeders}"||""]}{stream.size::>0[" | {stream.size::bytes}"||""]}{tools.newLine}{stream.quality::exists["{stream.quality}"||""]}{tools.newLine}{stream.resolution::=2160p::or::stream.resolution::=4K["★★★"||""]}{stream.resolution::=1080p["★★"||""]}{stream.resolution::=720p["★"||""]}{stream.resolution::=2160p::or::stream.resolution::=4K::or::stream.resolution::=1080p::or::stream.resolution::=720p[""||"★"]}{stream.quality::=WEB-DL::or::stream.quality::=BluRay::or::stream.quality::~REMUX["★"||""]}{stream.uLanguageCodes::~EN::or::stream.languageCodes::~EN["★"||""]}`
# Proxy
I leave everything as default here.
# Miscellaneous
I just enable pre-cache next episode (just a safety measure) and auto play. Keep everything else as default.
# Save & Install
Create a password and write it down (seriously). Click create and write down your UUID (very seriously). The only way to access/tweak this configuration in the future is via this UUID and Password combo.
Click install and import into Stremio as you normally do with addons!
# Final Notes
Under this build, the only addons I have in Stremio are Cinemeta, Local Files, Trakt Integration, OpenSubtitles Pro, and AIOStreams (that we just configured). I personally delete the other addons and also use [this Addon Manager](https://stremio-addon-manager.pages.dev) to remove the popular Cinemeta lists (removes from search and home page) and also remove the Trakt lists (we have these elsewhere).
This guide was requested by u/Fwhy_ u/DrZakarySmith u/[Equivalent\_Hawk\_9769](/user/Equivalent_Hawk_9769/) u/[BilgeMongoose](/user/BilgeMongoose/) and others!
Edit: Forgot to add my template to the post, dang! I couldn’t figure out how to get AIOStreams to accept the URL so unfortunately you have to download manually to use it (or copy/paste the json into a text editor for safety). Also idk if it fully works but you can always read the json file. Please let me know if there are problems. [https://drive.proton.me/urls/YYBWZGNXP0#QccY8og0POBf](https://drive.proton.me/urls/YYBWZGNXP0#QccY8og0POBf)
Edit 2: thank you for the amazing feedback, support, and awards! You all are truly who make this community what it is. I’m trying my hardest to respond to everyone’s questions! If I miss you on accident, feel free to DM me!
It is crazy that Qwen3.6 27B now matches Sonnet 4.6 on AA's Agentic Index, overtaking Gemini 3.1 Pro Preview, GPT 5.2 and 5.3 as well as MiniMax 2.7. It made gains across all three indices but the way the Coding Index works, I don't think the gains are as apparent as they should be. The Coding Index only uses Terminal Bench Hard and SciCode which are both strange choices. Clearly the training on the 3.6 models out now has focused on agentic use for OpenClaw/Hermes but it's interesting how close to frontier models such a small model can get. Qwen3.6 122B might be epic. . .
**Deepseek V4** will probably release this week. Since I've already posted quite a lot about it here and I'm very hyped about V4, **I've summarized all the leaks. Everything is just leaked, unconfirmed**! Of course, everything could be different. If you have any new information or updates, please post them here! If you have different views or a different opinion, write them down too.
# DeepSeek V4 - Release
The release was originally expected for mid-February, alongside Gemini 3.1 Pro. However, DeepSeek has been delayed – this is not unusual and has happened multiple times before. The new release strongly points to **March 3rd** (Lantern Festival / 元宵节), but it could also be later in the week. The Financial Times reported on February 28th that V4 is coming "next week," timed to coincide with China's "Two Sessions" (两会) starting March 4th. DeepSeek's release pattern shows that new models often drop on **Tuesdays**. A short technical report is expected to be published simultaneously, with a full engineering report following about a month later.
# DeepSeek Delay History
DeepSeek delays regularly. Here's the pattern:
|Model|Originally Expected|Actual Release|Delay|
|:-|:-|:-|:-|
|DeepSeek-R1|Lite Preview Nov 2024, Full Version Dec 2024|January 20, 2025|\~4-8 weeks|
|DeepSeek-R2|May 2025 (according to reports)|Never released – replaced by R1-0528 update|Cancelled|
|DeepSeek-V3.1|Early Summer 2025 (expected)|August 21, 2025|Several months|
|DeepSeek-V3.2|Fall 2025 (expected)|December 1, 2025 (V3.2-Exp: Sep 29)|Weeks|
|DeepSeek-V4|\~February 17, 2026|\~March 3, 2026?|\~2 weeks|
# Architecture & Specifications – What Can We Expect?
**All unconfirmed! Much of this has been leaked but could turn out differently!**
# V4 Flagship – Main Model
|Specification|DeepSeek V3/V3.2|DeepSeek V4 (Leaks)|
|:-|:-|:-|
|Total Parameters|671B–685B MoE|\~1 Trillion (1T) MoE|
|Active Parameters/Token|\~37B|\~32B (fewer despite a larger model!)|
|Context Window|128K (since Feb '26: 1M)|1 Million Tokens (native)|
|Architecture|MoE + MLA|MoE + MLA + Engram Memory + mHC + DSA Lightning|
|Multimodal|No (text only)|Yes – Text, Image, Video, Audio (native)|
|Expert Routing|Top-2/Top-4 from 256 experts|16 experts active per token (from hundreds)|
|Hardware Optimization|Nvidia H800/H20 (CUDA)|Huawei Ascend + Cambricon (Nvidia secondary!)|
|Training|14.8T Tokens, H800 GPUs|Trained on Nvidia, inference optimized for Huawei|
|License|\-|\-|
|Input Modalities|Text|Text, Image, Video, Audio|
|Output Modalities|Text|Text (Image/Video generation unclear)|
|Estimated Input Price|$0.28/M Tokens|\~$0.14/M Tokens|
|Estimated Output Price|$0.42/M Tokens|\~$0.28/M Tokens|
# New Architecture Features (all backed by papers)
* **Engram Conditional Memory** (Paper: arXiv:2601.07372, Jan 13, 2026): O(1) hash lookup for static knowledge directly in DRAM. Saves GPU computation. 75% dynamic reasoning / 25% static lookups. Needle-in-a-Haystack: 97% vs. 84.2% with standard architectures
* **Manifold-Constrained Hyper-Connections (mHC)**: Solves training stability at 1T+ parameters. Separate paper published in January 2026
* **DSA Lightning Indexer**: Builds on V3.2-Exp's DeepSeek Sparse Attention. Fast preprocessing for 1M-token contexts, \~50% less compute
# DeepSeek V4 Lite (Codename: "sealion-lite")
A lighter variant has leaked alongside the flagship. At least one inference provider is testing the model under strict NDA.
|Specification|V4 Lite (Leak)|
|:-|:-|
|Parameters|\~200 Billion|
|Context Window|1M Tokens (native)|
|Multimodal|Yes (native)|
|Engram Memory|No (according to 36kr, not integrated)|
|vs. V3.2|"Significantly better" than current Web/App|
|Non-Thinking vs. V3.2 Thinking|Non-Thinking mode surpasses V3.2 Thinking mode|
|Status|NDA testing at inference providers|
# SVG Code Leak Examples
* **Xbox Controller**: 54 lines of SVG – highly detailed and efficient
* **Pelican on a Bicycle**: 42 lines of SVG – multi-element scene
According to internal evaluations: V4 Lite outperforms DeepSeek V3.2, Claude Opus 4.6 AND Gemini 3.1 in code optimization and visual accuracy.
# Leaked Benchmarks (NOT verified!)
**⚠️ IMPORTANT: All benchmark numbers come from internal leaks. The "83.7% SWE-bench" graphic circulating on X has been confirmed as FAKE (denied by the Epoch AI/FrontierMath team). The numbers below are the more conservative, more frequently cited leaks.**
|Benchmark|V4 (Leak)|V3.2|V3.2-Exp|Claude Opus 4.6|GPT-5.3 Codex|Qwen 3.5|
|:-|:-|:-|:-|:-|:-|:-|
|HumanEval (Code Gen)|\~90%|–|–|\~88%|**\~93%**|–|
|SWE-bench Verified|**>80%**|\~73.1%|67.8%|80.8%|80.0%|76.4%|
|Needle-in-a-Haystack|97% (Engram)|–|–|–|–|–|
|MMLU-Pro|TBD|85.0|–|85.8|–|–|
|GPQA Diamond|TBD|82.4|–|91.3|–|–|
|AIME 2025|TBD|93.1|–|87.2|–|–|
|Codeforces Rating|TBD|2386|–|2100|–|–|
|BrowseComp|TBD|51.4-67.6|40.1|84.0|–|–|
# Huawei & Hardware – The Geopolitical Dimension
* **Reuters (Feb 25)**: DeepSeek deliberately denied Nvidia and AMD access to the V4 model
* **Huawei Ascend + Cambricon** have early access for inference optimization
* Training was done on Nvidia hardware (H800), but **inference** is optimized for Chinese chips
* For the open-source community on Nvidia GPUs: performance could be **suboptimal** at launch
* This is an unprecedented hardware bet for a frontier model
# Price Comparison (estimated)
|Model|Input/1M Tokens|Output/1M Tokens|
|:-|:-|:-|
|DeepSeek V4 (estimated)|**\~$0.14**|**\~$0.28**|
|DeepSeek V3.2|$0.28|$0.42|
|Kimi K2.5|$0.60|$3.00|
|Gemini 3.1 Pro|$2.00|$12.00|
|Claude Opus 4.6|$5.00|$25.00|
If correct: V4 would be **36x cheaper** than Claude Opus 4.6 on input and **89x cheaper** on output.
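For anyone who wants to check the multipliers, here is a short Python snippet that recomputes them from the table above. The prices are the ones listed (the V4 ones are still only estimates), and `times_cheaper` is just a helper name I made up.

```python
# (input, output) price in USD per 1M tokens, copied from the table above.
PRICES = {
    "DeepSeek V4 (estimated)": (0.14, 0.28),
    "DeepSeek V3.2": (0.28, 0.42),
    "Kimi K2.5": (0.60, 3.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Claude Opus 4.6": (5.00, 25.00),
}


def times_cheaper(cheap, expensive):
    """Return (input_ratio, output_ratio) of the expensive model vs the cheap one."""
    cheap_in, cheap_out = PRICES[cheap]
    exp_in, exp_out = PRICES[expensive]
    return exp_in / cheap_in, exp_out / cheap_out


input_ratio, output_ratio = times_cheaper("DeepSeek V4 (estimated)", "Claude Opus 4.6")
print(f"{input_ratio:.1f}x cheaper on input, {output_ratio:.1f}x cheaper on output")
```

That gives 5.00 / 0.14 ≈ 35.7 and 25.00 / 0.28 ≈ 89.3, which round to the 36x and 89x quoted above.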
# Open Questions
* Does V4 actually generate images/videos or just understand them?
* Will Nvidia GPU users get an optimized version?
* When will the open-source weights be released?
**Sources**: Financial Times, Reuters, CNBC, awesomeagents.ai, nxcode.io, FlashMLA GitHub, r/LocalLLaMA, Geeky Gadgets, 36kr
**Edit 03.03.2026**
The chance that the model will be released this week is relatively high, but probably not today. If it is not published within the next 5 hours, the release will deviate from the usual pattern (in terms of time of day), and it is assumed that DeepSeek will then release in the next few days, between March 3 and 5.
**Edit 03.03.2026 Part 2**
The situation is becoming increasingly heated and tense, with an extremely large number of leaks and sources currently emerging. Collecting them all and verifying their credibility would take a very long time. However, a release is expected this week, with Wednesday or Thursday being the most likely dates.
**Edit 03.03.2026 Part 3 – Evening Update**
March 3rd (Lantern Festival) has passed without a release. However, in Beijing it is currently the early morning of March 4th, meaning the Chinese workday hasn't even started yet. A release on March 4th is still very much possible, especially since China's "Two Sessions" (两会) begin today.
What happened today:
1. **V4 Lite is being silently updated in production.** AIBase reported today that DeepSeek quietly pushed a new V4 Lite version tagged "0302". Community testers report a massive quality jump in logic, code generation, and aesthetics – now reportedly on par with Claude Sonnet 4.6. This strongly suggests DeepSeek is actively fine-tuning V4 models right before the official launch. (Source: AIBase)
2. **36kr published a new article** titled "The Entire Village Anticipates DeepSeek to Join for Dinner" – confirming the entire Chinese tech industry is waiting for V4. (Source: 36kr)
**Edit 04.03.2026 – Why not today, why Thursday is THE day**
March 4 passed without a release – and that makes strategic sense.
**Why not today:**
* CPPCC opening day = all Chinese media focused on politics, V4 would've been buried
* Shanghai Composite dropped 0.98% to 4,082 (4-week low) – bad sentiment to release into
* Beijing evening release window (8-10 PM BJT) has passed
**Why Thursday March 5 is the perfect storm:**
* **NPC opens tomorrow morning** – Premier Li Qiang delivers Government Work Report with AI & tech as centerpiece of the new Five-Year Plan. Morning: politics declares AI a national priority → Evening: DeepSeek delivers the proof
* **BYD "disruptive technology" event same day** – DiPilot 5.0, Blade 2.0, DM 6.0 reveal. Global headline: "China showcases two AI breakthroughs in one day"
* **Market timing** – Shanghai closes 3 PM BJT, evening release gives markets overnight to digest, Friday opens with V4 hype
* **Developer weekend** – Thursday drop = Fri + Sat + Sun to test & benchmark
**Expected release window:**
|Release|Beijing Time|UTC|
|:-|:-|:-|
|R1 (Jan 2025)|\~10-11 PM|\~2-3 PM|
|V3.2 (Nov 2025)|\~12 AM|\~4 PM|
|**V4 (expected)**|**8-11 PM**|**12-3 PM**|
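The UTC column is just Beijing time minus 8 hours (China has no daylight saving time). A small Python sketch to convert the windows yourself; the helper name is made up and the date is only a placeholder for the expected Thursday:

```python
from datetime import datetime, timedelta, timezone

# Beijing Time is a fixed UTC+8 offset.
BJT = timezone(timedelta(hours=8), "BJT")


def bjt_to_utc_hour(hour_bjt: int) -> int:
    """Convert an hour of the day in Beijing time to the hour in UTC."""
    local = datetime(2026, 3, 5, hour_bjt, tzinfo=BJT)  # placeholder date
    return local.astimezone(timezone.utc).hour


for hour in (20, 23):
    print(f"{hour}:00 BJT = {bjt_to_utc_hour(hour)}:00 UTC")
```

So 8 PM to 11 PM BJT is 12:00 to 15:00 UTC, and midnight BJT is 16:00 UTC on the previous calendar day.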
**If Thursday doesn't happen?**
* Friday = bad release day (weekend kills momentum, DeepSeek has never released on a Friday)
* Next window: Monday/Tuesday March 9-10
* But: silent V4 Lite "0302" production update + 36kr's "The Entire Village Anticipates DeepSeek" article suggest we're in final hours, not days
**Edit 05.03.2026**
It has to happen today. Deepseek Web was down for 40 minutes, even though it had not gone down once in the previous 30 days, and the same thing happened before the big launches of V3 and R1. In addition, today is the event of BYD, a Deepseek partner. It will happen in the next few hours, and if not, then Deepseek has missed the best window of opportunity they could ever have had.
**Edit 05.03.2026 Part 2**
**The model will not be released this week, and probably not next week either. DeepSeek V4 has been ready for a long time and there were really only a few minor issues left, so the model should have been released last week or this week. There seems to be a major delay due to the government: at the last minute, it reportedly said that DeepSeek is not allowed to release the model as long as it does not run on Chinese hardware. The model was trained on Nvidia and the new technology in V4 was built entirely for Nvidia rather than Huawei, so such a restructuring naturally takes time. And I think we all still know what happened with R2...**
**Edit 07.03.2026**
When will Deepseek be released? After all the leaks, news, and crisis status, Deepseek V4 will and must come and cannot end like R2. The Chinese government has gone too far with its AI and told the US that it no longer needs it, whereupon Trump, in order not to appear weak, wants to impose a ban that will allow him to control all chip trade (meaning no more chips to China).
However, BYD and China have praised Deepseek too much in recent days. If V4 ended up like R2 and didn't come out at all, China would look extremely foolish, which the government would never allow.
That's why I suspect that Deepseek will receive help from the Chinese government (in recent years, Deepseek's CEO has been in frequent talks with the government and has received support from it) and will no longer adhere to any release pattern, as Deepseek has already missed three good release windows. My guess is that they will release it when it is least expected, which could be this weekend (V3.2 was released on a Sunday), in order to weaken and expose Nvidia and the entire US market with new AI technology.
The idea that Deepseek is waiting until Claude or other providers are ready is incorrect and highly unlikely. Deepseek has problems and needs to fix them before release. V4 is already 90% complete (Lite has been corrected several times and is said to be just as intelligent as Sonnet 4.6). We also know that Deepseek's CEO is a perfectionist and would never release a half-finished product or leave it unfinished, as was the case with the GLM-5 release.
**🚨 UPDATE 11.03.2026 – 22:00 CET – V4 WEIGHTS SPOTTED**
Major development: Chinese quantization expert u/bdsqlsz (青龍聖者) on X was spotted uploading **DeepSeek-V4-INT8** model shards to HuggingFace with the caption "it is coming." The upload shows multiple `model-0...` shards, a `.gitattributes`, and a `README.md`, indicating a full model repo creation.
**Why this is significant:**
* u/bdsqlsz is a verified, well-known quantization specialist — not a random account
* INT8 quantization requires access to the **full original weights** first
* Historically, community quants appear **within hours** of official weight releases (V3: same day, R1: same day, V3.2: within 24h)
* This means the official FP8/BF16 weights either already exist on HuggingFace (possibly private/unlisted) or u/bdsqlsz has NDA access
**Full leaked specs now confirmed:**
* \~1 Trillion parameters (MoE), \~32B active per token
* 1M native context window
* Multimodal: text + vision + audio
* Huawei Ascend 910C optimized
* MIT License
**Previous delays explained:** Huawei Ascend inference optimization (only 80% Nvidia efficiency), Blackwell chip fingerprint removal, and CEO Liang Wenfeng's perfectionism. The 40-min web outage on March 5 was likely a deployment test.
**My prediction: Official release within 24-72 hours.** The weights exist. The upload is happening. Keep your monitors running.
⚠️ UPDATE 11.03 – Unverified leak: u/bdsqlsz posted V4-INT8 weight uploads on X. r/LocalLLaMA is split – top comment (193 upvotes) questions authenticity. The file structure looks technically correct and INT8 aligns with Huawei optimization rumors, but previous V4 benchmark leaks in February were confirmed fake. Treat with caution until official deepseek-ai repo appears on HuggingFace.
Will update when it drops. 🚀
The CL-40 was nerfed in Season 3 — complaints dropped. Then it was buffed in Season 4 — complaints **tripled**. They calmed down. Then it was buffed *again* in Season 6 — and complaints tripled *again*. A perfect buff→backlash→calm→buff→backlash cycle, visible across 247,453 Steam reviews. The Sword? Complaints have *doubled* since Season 3 despite multiple nerfs — and 27 players independently suggested the same fix: "just remove it." One 247-hour veteran couldn't take it anymore: *"GET THE LIGHT SWORD OUT OF THE GAME!"* (I feel his pain). Meanwhile, 135,000 players called this game "fun and addictive" — the most praised aspect by a landslide.
I downloaded every single Steam review for THE FINALS (247,453 total, 15 languages, 9 seasons), fed them through a two-stage AI pipeline, and built a [15-page interactive dashboard](https://aryzhkin.github.io/the-finals/) to let you explore it all yourself. Buff cycles, hidden patterns, 440K specific complaints and praise — the entire AI analysis that uncovered all of this cost **$9.30**. Here's what 247K players are actually saying — not 10 Reddit posts, but a quarter million data points.
---
### What I did
- Scraped all 247K reviews via the Steam API (15 languages, Seasons 0 through 9)
- **Stage 1**: AI classified each review into 42 categories (30 negative, 12 positive) — cost: $3.30
- **Stage 2**: AI extracted 440,481 specific complaints, suggestions, and praise — cost: $6.00
- Normalized everything against a database of game entities (weapons, gadgets, abilities) from THE FINALS Wiki
- Parsed 106 patch notes (470 balance changes) from THE FINALS Wiki and mapped them to player complaints
- Built a [15-page interactive dashboard](https://aryzhkin.github.io/the-finals/) — [completely open source](https://github.com/aryzhkin/the-finals)
**Total cost of the entire AI analysis: $9.30.**
---
### Community's Top Pain Points
What do 247K players actually complain about? Here's the all-time ranking alongside a comparison of the last two "three-season windows" (S4–S6 vs S7–S9), normalized per 1,000 reviews:
| # | Issue | Total | S4–S6 /1K | S7–S9 /1K | Trend |
|---|-------|-------|-----------|-----------|-------|
| 1 | Cheating / hackers | 8,327 | 18.9 | 15.0 | ↓ 21% |
| 2 | Matchmaking (skill disparity) | 5,472 | 32.1 | **36.2** | **↑ 13%** |
| 3 | Server crashes | 2,656 | 6.1 | 5.7 | — |
| 4 | Light class: overpowered | 2,249 | 9.5 | 10.1 | ↑ 6% |
| 5 | Server latency / lag | 1,850 | 7.0 | 8.2 | ↑ 17% |
| 6 | Heavy class: overpowered | 1,587 | 3.2 | 3.2 | — |
| 7 | Server disconnects | 1,333 | 1.9 | 3.7 | ↑ 95% |
| 8 | Game design: unbalanced | 1,180 | 4.4 | 4.3 | — |
| 9 | Cloaking Device: overpowered | 1,071 | 0.9 | 0.0 | fixed |
| 10 | Anti-cheat: ineffective | 955 | 2.5 | 1.6 | ↓ 36% |
The all-time ranking is misleading — cheating dominated at launch (5,395 complaints in S1 alone!), but it's down 21% in S7–S9. Cloaking Device — fixed. But **matchmaking keeps climbing** (+13%) and is now the clear #1 issue by a wide margin. Server disconnects have nearly doubled, lag is up too — network infrastructure is losing ground.
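For anyone who wants to check the Trend column: it is the relative change between the two per-1,000 rates. A minimal sketch of that arithmetic; `per_1k` and `trend_pct` are hypothetical helper names, and the review totals passed to `per_1k` are made up for illustration because the per-window totals are not quoted here:

```python
def per_1k(mentions, total_reviews):
    """Normalize a raw mention count to mentions per 1,000 reviews."""
    return 1000 * mentions / total_reviews

def trend_pct(old_rate, new_rate):
    """Relative change between two rates, rounded to a whole percent."""
    return round(100 * (new_rate - old_rate) / old_rate)

# Illustrative only: 150 mentions in a hypothetical window of 10,000 reviews.
print(per_1k(150, 10_000))

# (S4-S6 rate, S7-S9 rate) pairs copied from the table above.
rates = {
    "Cheating / hackers": (18.9, 15.0),
    "Matchmaking (skill disparity)": (32.1, 36.2),
    "Server disconnects": (1.9, 3.7),
    "Anti-cheat: ineffective": (2.5, 1.6),
}
for issue, (old, new) in rates.items():
    print(f"{issue}: {trend_pct(old, new):+d}%")
```

Normalizing per 1,000 reviews matters because review volume differs a lot between seasons, so raw counts alone would mostly track how many people were reviewing.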
---
### Community's Top Requests
| # | Request | Mentions |
|---|---------|----------|
| 1 | More game modes | 1,352 |
| 2 | Region lock | 946 |
| 3 | More maps | 833 |
| 4 | More weapons | 514 |
| 5 | Text chat | 385 |
| 6 | Russian localization | 319 |
Trends: region lock requests **tripled** in recent seasons (S4–S6 → S7–S9), text chat appeared out of nowhere. "More game modes" and "more maps" are declining — and credit to Embark here: TDM was added in S5, maps are updated regularly, and the data shows players noticed. Some requests also shift dramatically depending on playtime — more on that below.
About region lock: if you read the actual reviews, this isn't an abstract request. The vast majority ask for region lock because of cheating on Asian servers. The main voices come from Korean, Japanese, and Thai players. In S1 the request was massive (6.7 per 1,000 reviews), then died down (0.3 in S4), and in S7–S9 it climbed back up — which may indicate a new wave of problems in the region. And an important point: if cheaters are rampant on Asian servers, the anti-cheat vulnerability exists — and other regions are at risk too. This is a systemic problem, not a regional one.
---
### What Players Love (yes, there's a LOT to love)
Before you think this is a hate post — the positive data is massive:
| # | Praise | Mentions |
|---|--------|----------|
| 1 | Fun & addictive gameplay | 135,000+ |
| 2 | Destruction physics | 13,702 |
| 3 | Free-to-play model | 8,298 |
| 4 | Graphics & visuals | 8,033 |
| 5 | Movement system | 6,416 |
| 6 | Gunplay feel | 3,414 |
135K players called this game fun. And what matters: praise is **rock solid** — fun, destruction, and movement didn't budge between S4–S6 and S7–S9. F2P and gunplay even grew (+24% and +17%). The core gameplay loop — destruction, movement, gunplay — is what keeps people coming back. This is the foundation Embark should never touch.
Some of my favorite actual reviews from the dataset:
> *"I was too fat and slow to get to the top of the building to steal the vault, so I just brought the building down to me. 10/10 by far."*
> *"Please, I can't sleep... I can hear someone is stealing my cashout. There is invisible light, Heavy is coming..."* — a 358-hour veteran, probably with PTSD
> *"Just played with my boys for 3 hours straight. Didn't win a damn thing. Had a great time anyway."*
---
### The Juicy Part: Patch Notes vs. Player Complaints
This is where it gets really interesting. I parsed all 106 patch notes from THE FINALS Wiki (470 balance changes across 9 seasons) and mapped them to the actual complaint data. Some patterns are striking:
**CL-40 Grenade Launcher — The Buff-Backlash Cycle**
- S3: nerfed (damage 110→93). Complaints: 7.9 per 1,000 reviews.
- S4: **buffed** (damage 93→117, blast radius 9→30cm). Complaints: **21.0** (+166%).
- S5: no changes. Complaints calmed to 5.7.
- S6: **buffed again** (radius 30→60cm). Complaints: back up to 20.7 by S7.
- A textbook buff→backlash→calm→buff→backlash cycle.
**Sword — Buffs, Nerfs, Rework, and Complaints That Keep Climbing**
- S4: initial buff (lunge ~5m→~6m). Mid-S4 and S6: two nerfs (lunge shortened, secondary damage 140→105)
- S7: major rework — primary damage 74→88, lunge range to 7m, lunge speed +17%
- Despite nerfs, complaints climbed from 29.6/1000 (S3) to **60.7** (S9) — an all-time high
- The S7 rework appears to have accelerated the trend: 32.5 (S6) → 50.8 (S7) → 60.7 (S9)
**Cloaking Device — How a Rework Can Backfire**
- S1–S4: complaints were steadily declining (45.9 → 18.6 per 1000)
- S5: rework — fire and poison no longer break invisibility (previously the main way to reveal a cloaked player) → complaints **surged** to 31.0 (+67%)
- S5–S6: a series of nerfs (duration 133s→27s, increased visibility, added activation delay) → back down to 17.5
- A classic case of "removed counterplay → invisibility became unstoppable → had to roll it back"
**Important disclaimer**: correlation ≠ causation. Complaint changes can also reflect meta shifts, player count changes, or attention shifting to new issues. But when a buff lines up perfectly with a complaint spike, and a nerf lines up with a drop... the pattern is hard to ignore.
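Here is a rough sketch of how the patch-to-complaint alignment can be reproduced for the CL-40, using only the rates quoted above. The S6 rate is not quoted in this post, so it is left out rather than guessed; `rate_changes` is a hypothetical helper, not the dashboard's actual code:

```python
# CL-40 complaint rates per 1,000 reviews, as quoted in the post.
# S6 is omitted because its rate is not quoted.
cl40_rates = {"S3": 7.9, "S4": 21.0, "S5": 5.7, "S7": 20.7}

# Balance change associated with each season, per the post.
cl40_patches = {
    "S3": "nerf",
    "S4": "buff",
    "S5": "no change",
    "S7": "buff (applied in S6)",
}

def rate_changes(rates):
    """Return (season, previous season, percent change) for consecutive entries."""
    seasons = list(rates)
    changes = []
    for prev, cur in zip(seasons, seasons[1:]):
        pct = 100 * (rates[cur] - rates[prev]) / rates[prev]
        changes.append((cur, prev, round(pct)))
    return changes

for cur, prev, pct in rate_changes(cl40_rates):
    print(f"{prev} -> {cur}: {pct:+d}%  (patch: {cl40_patches[cur]})")
```

This only lines up the two series side by side. It does not test for causation, which is exactly the caveat in the disclaimer.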
You can explore every entity's timeline with patch markers on the dashboard — it's the Patch Notes page.
---
### Newcomers vs. Veterans: Two Different Games
One of the most interesting findings: **what "the community" wants depends entirely on who you ask**.
The dashboard has playtime filters — you can see the data through the eyes of newcomers (0–10h), regulars (50–100h), or hardcore players (500h+). The rankings shift dramatically:
- **Veterans** focus on: balance issues, anti-cheat quality, matchmaking fairness
- **Newcomers** focus on: content variety, server stability, basic accessibility
- The **cohort heatmap** on the dashboard shows approval varying by 10-15 percentage points across playtime brackets
Neither perspective is "wrong." But lumping them together hides the nuance. Retention starts with newcomers — if they quit due to cheaters or confusing UI, they never become veterans. But endgame quality is what keeps veterans engaged.
**"More game modes" — the top request... but which modes exactly?**
"More game modes" is the top request overall (1,352 mentions), but 80% come from players with under 50 hours. Filter to veterans and it drops sharply. And if you read the actual reviews, the picture becomes clear: newcomers come from COD/CS2/Valorant and expect Team Deathmatch. Instead, they find objective-based modes with cashouts and mandatory trios. Typical quotes: *"Why can't I just go team deathmatch and not worry about the money?"*, *"Not friendly to solo players — teammates quit on you"*. They haven't "failed to learn" the modes — they want a **different type of game** inside THE FINALS.
And here's the interesting part: Embark actually did it — **TDM was introduced as an LTM in S5, then made permanent in S6**. What does the data show?
- Requests specifically for "Add: Team Deathmatch" — S1: 54, S3: 12, S5: 9 (some reviews from before the LTM launched). After S5 — **zero**. TDM requests completely disappeared.
- But the general "more modes" request lives on: per 1,000 reviews — S4: 2.7, S5: 2.4, S6: 1.9, S7: 1.8, S8: 1.6, S9: 1.5.
- The downward trend **started long before TDM** (S1: 7.9 → S4: 2.7) — natural filtering: those who didn't accept the game's formula simply left.
Conclusion: TDM solved the specific problem — TDM requests dropped to zero. But "I want more modes" keeps coming, and after TDM was added it's unclear what people actually want — no specifics in the reviews, just a general "more variety." Personally, I think the game has plenty of modes and they're great — but the data says not everyone agrees. If you have ideas about what modes the game actually needs — drop them in the comments, I'm curious to hear.
---
### If I Were Advising Embark (Based on the Data)
**1. Anti-cheat is the #1 priority across ALL player segments.**
8,327 cheating complaints + 955 "anti-cheat ineffective" mentions. It's the top issue for newcomers AND veterans. No amount of new content matters if players feel the matches aren't fair.
**2. Don't touch the holy trinity: destruction, movement, gunplay.**
These three mechanics account for 23,500+ praise mentions. They're the reason 135K people called this game fun. Protect them at all costs.
**3. The Sword keeps getting stronger — and complaints keep climbing.**
60.7 complaints per 1,000 reviews in S9, up from 29.6 in S3. Embark has tried nerfs (S4, S6), but the S7 rework (7m lunge, higher damage) pushed complaints to record highs. The current iteration is the most complained-about version yet.
**4. Servers — a bigger problem than it looks.**
Crashes (2,656) + lag (1,850) + disconnects (1,333) = 5,839 complaints combined. In the table these are three separate rows, but they're really one systemic issue — and it's **bigger** than matchmaking (5,472). Server stability is especially critical for retaining newcomers: if the game crashes in the first few hours, there won't be a second chance.
And a general note on working with data: **listen to different player cohorts separately.** "More game modes" being a top request masks the fact that it's almost exclusively a newcomer ask. Veterans want balance and competitive integrity. Both matter, but they require different solutions.
---
### How It Was Done (for the curious)
- Started with regex-based classification → too many edge cases → switched to AI
- Model: Gemini 2.5 Flash Lite via PayPerQ ($0.07/M input tokens)
- Two-stage pipeline: categorize → extract specific issues
- Game entity data from [THE FINALS Wiki](https://www.thefinals.wiki/wiki/Main_Page)
- ~5 days of work total — 3 days building the pipeline and dashboard, then 2 more days of data quality audits, bug fixes, and polishing (fixing data integrity issues, adding disclaimers, verifying every number)
- **Completely open source** — link below
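For the curious, the two-stage shape (categorize, then extract) can be sketched like this. It is a simplified illustration and not the code from the repo: `Review`, `run_pipeline`, `fake_classify`, and `fake_extract` are hypothetical names, the keyword stubs stand in for the real model calls so the sketch runs offline, and running stage 2 only on reviews that got a category is an assumption:

```python
from dataclasses import dataclass, field

@dataclass
class Review:
    text: str
    categories: list = field(default_factory=list)
    issues: list = field(default_factory=list)

def run_pipeline(reviews, classify, extract):
    """Stage 1 labels each review; stage 2 extracts specifics for labeled reviews."""
    for review in reviews:
        review.categories = classify(review.text)
        if review.categories:  # assumption: skip stage 2 when nothing was labeled
            review.issues = extract(review.text, review.categories)
    return reviews

# Hypothetical stand-ins for the two model calls.
def fake_classify(text):
    return ["cheating"] if "cheater" in text.lower() else []

def fake_extract(text, categories):
    return [f"{category}: {text}" for category in categories]

reviews = run_pipeline(
    [Review("Too many cheaters in ranked"), Review("gg")],
    fake_classify,
    fake_extract,
)

# Cost arithmetic from the post: $3.30 + $6.00 spread over 247,453 reviews.
total_cost = 3.30 + 6.00
cost_per_review = total_cost / 247_453
print(f"${total_cost:.2f} total, ${cost_per_review:.6f} per review")
```

Splitting the work in two keeps the first pass cheap (a fixed label set per review) and spends the longer extraction prompt only where there is something to extract.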
Fair warning: the dashboard UI isn't perfect — I know there's room for improvement on the design side. But this was a side project that already took way more time than I planned, and honestly I think it turned out pretty decent for a first attempt. The data and the analysis are what matter most here.
The data is current as of the scrape date but I haven't decided yet whether I'll keep it updated going forward. If there's enough interest — I'll set up regular updates and keep the dashboard fresh with new seasons and patches.
---
### What's Inside the Dashboard (15 Pages)
Here's a quick tour so you know what you're clicking into:
1. **Overview** — top-level metrics (247K reviews, approval rate, volume trends), top negative/positive categories with season & playtime filters, review volume timeline
2. **Community Insights** — the granular AI extraction: specific complaints, suggestions, and praise (440K data points), filterable by season & playtime, with optional vote-weighting
3. **Season Health** — approval rate and review volume per season, daily sentiment charts, top complaints/praise per season, recurring cross-season problems
4. **Player Journey** — how sentiment shifts with playtime (0–10h newcomers vs. 500h+ veterans), category heatmaps by playtime bracket, cohort × season approval matrix
5. **Praise vs Complaints** — same game aspects get both love and hate — paired categories show the contrast, plus cohort approval trends across seasons
6. **Entity Tracker** — search any weapon, gadget, or ability and see its mention timeline across seasons with complaint/praise ratio
7. **Category Deep-Dive** — pick any of the 42 categories and see its season trend + playtime distribution + related specific issues
8. **Language Analysis** — approval rates and complaint profiles by review language (15 languages), with deviation-from-global-average charts
9. **Top Reviews** — most helpful and most funny reviews, filterable by season, with a "Random Funny Review" button
10. **Review Explorer** — drill down from any category/issue to read actual player reviews, stratified by playtime bracket
11. **Word Cloud** — visual tag cloud of all categories sized by frequency, colored by sentiment
12. **Review Bombing** — daily/weekly spike detection for negative review surges, worst days table, patch date overlays
13. **Patch Notes** — all 106 patches (470 balance changes) mapped to complaint timelines — the buff→backlash analysis lives here
14. **Methodology** — full transparency on the AI pipeline, model parameters, confidence metrics, and all caveats
15. **About** — data sources, tech stack, credits
---
### Links
- **[Interactive Dashboard](https://aryzhkin.github.io/the-finals/)** — 15 pages of charts, filters, and drill-downs
- **[Source Code](https://github.com/aryzhkin/the-finals)** — scraper, AI pipeline, dashboard, everything
---
**If you were Embark — what would you prioritize first? And if you dig into the dashboard — share what you find, I'm curious what you'll uncover.**
https://preview.redd.it/vfmxgtb46vxg1.png?width=1915&format=png&auto=webp&s=9b7cedec52f05eefaf604699dca8246a259cf713
So my last post blew up, turns out a lot of people hit the same Claude blind-spots problem. Going deeper this time.
Quick recap. Been on the 20x Claude plan running Opus 4.6 / 4.7 exclusively for a while. Last week I tried Codex 5.5 and was shocked by how much Opus had been missing. Pairing them felt like the piece I'd been waiting for.
A week later I'm way past two agents. Current setup, all in tmux:
* 3x Codex CLI, each on a separate ChatGPT Plus account so reset windows don't collide
* Gemini 3.1 Pro Preview
* Kimi K2.6 + DeepSeek V4 Pro, both via OpenCode Go (way cheaper than API keys, and 3x limits on Kimi)
Built a `/work` command in Claude that handles four shapes: plan, implement, major bug, minor bug. For each one it builds a context pack, sends it to 3 reviewers in parallel, waits for consensus.
The thing that actually matters here is *lineage diversity*. Reviewers are picked as 1 Codex + 1 Gemini + 1 OpenCode. Same-family models share blind spots, so three Codex sessions reviewing the same code is mostly an echo chamber. All three lineages need to agree before the gate opens. If they don't, Claude revises and runs it again.
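The gate logic is simple enough to sketch. This is not the actual `/work` orchestrator, just a minimal illustration under a few assumptions: the reviewer labels are made up, `review` and `revise` are callables you supply, reviews run one after another here instead of in parallel, and the `max_rounds` cap is my own addition:

```python
import random

# Reviewer pool grouped by lineage, following the setup described above.
# The strings are illustrative labels, not real CLI or model identifiers.
POOL = {
    "codex": ["codex-1", "codex-2", "codex-3"],
    "gemini": ["gemini-3.1-pro-preview"],
    "opencode": ["kimi-k2.6", "deepseek-v4-pro"],
}

def pick_reviewers(pool, rng=random):
    """Pick exactly one reviewer from each lineage."""
    return {lineage: rng.choice(members) for lineage, members in pool.items()}

def gate_open(verdicts, lineages):
    """Open the gate only if every lineage returned an approval."""
    return all(verdicts.get(lineage) == "approve" for lineage in lineages)

def review_loop(draft, review, revise, pool=POOL, max_rounds=3):
    """Revise until all lineages approve or the round budget runs out."""
    for _ in range(max_rounds):
        reviewers = pick_reviewers(pool)
        verdicts = {lin: review(name, draft) for lin, name in reviewers.items()}
        if gate_open(verdicts, pool):
            return draft, True
        draft = revise(draft, verdicts)
    return draft, False
```

Keying verdicts by lineage instead of by reviewer is the point: a missing lineage counts as no approval, so two approvals from the same family can never open the gate.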
Before any merge, Claude fills out a 4-question checklist (coding principles, architecture drift, tests pass, reviewer consensus) and I pick merge / fix first / override with reason. Catches a lot of *"I think it's done"* moments.
Cost so far is basically $0 on top of the subscriptions I already had.
The thing I keep noticing: Opus by itself is great until it isn't, and the failures are silent. Code looks reasonable, tests pass, but there's a subtle bug or design drift that only shows up later. Having a different model family read the same code fresh catches a startling amount of it.
Happy to share the `/work` prompt and orchestrator if anyone wants to make it their own, let me know.
Edit: Check out [Chorus.codes](http://Chorus.codes) for the latest version of this repo.
Hey everyone, Kazuma here.
V7 is out. Go grab it: **GitHub:** [https://github.com/Arif-salah/Megumin-Suite](https://github.com/Arif-salah/Megumin-Suite)
Before I get into the fun stuff, I need to be real with you guys for a second.
# Real Talk First
I really don't like having to do this. I really don't. But Megumin Suite is a free project, it always has been, and it always will be, and it has been taking up a *huge* chunk of my time. Like, huge. I've got an LTC address at the bottom of every post and every readme, and after all this time... basically nobody has ever donated. I am absolutely not guilt-tripping anybody, I promise. I get it. But I thought it was worth being at least a little bit upfront about the situation rather than just pretending it didn't matter.
Even if you genuinely can't afford to contribute financially, and that is absolutely okay, there is one thing that would help me out immensely: **an API key.** Running tests against multiple models is probably one of the biggest roadblocks that I currently face. Right now, I only have complete access to Gemini 3.1. I can't test against Claude, I can't test against GPT, and sometimes I can't even test against DeepSeek because the server is swamped. If you have a key that you're not fully using and you'd be willing to let me use it, that would genuinely help more than you know. DM me on Discord if you're open to it.
Okay. That's out of the way. Let's talk about the actual big update.
EDIT: Sorry, PayPal hates me, so no Ko-fi link 😞. It's OK, just hear me rant about it.
**Crypto (LTC):** `LSjf1DczHxs3GEbkoMmi1UWH2GikmXDtis`
# The V7 Engine: It Doesn't Feel Like AI Anymore
V7 is a ground-up rewrite. Not a tweak, not a patch. The entire ruleset was rebuilt with one goal: **make the AI stop acting like an assistant.**
If you've used any RP preset before, including my older ones, you've felt that invisible hand. The AI being too helpful. NPCs agreeing too fast. Conflicts resolving themselves in one turn because the model's base training says "be useful, wrap things up, make the user happy." V7 was designed to remove that instinct.
Here's what it actually does:
**Anti-Assistant Bias.** The whole engine is built around the premise that the world does not give a fuck about you. NPCs fight back. They misinterpret you. They hold grudges. They get tired of you halfway through the conversation and just... leave. A good deed does not reset the relationship. An apology does not wipe out what you did. Forgiveness is a process which requires scenes, not words.
**The Knowledge Firewall.** This one is big. All NPCs are in information quarantine. They can only react to physical things they *see* and *hear*, not your internal narration and italicized thoughts. The PC's internal landscape is completely closed. If you write *"I feel pathetic"* as narration but do not show it externally, nobody notices. The output will always depend on your model: less smart means more errors.
**Cultural Anchoring.** The AI uses actual culture: actual artists' names, actual brands, actual platforms, actual news headlines, actual memes. No more "the popular social media app" or "a famous pop song." If the setting is 2025, the story should have a real headline on a TV in the background and some character humming a real song. It appears in the text like seasoning, here and there without being forced, not as a list of references but as the texture of a real world. *I used AI for the last sentence, sue me.*
**Narrative Drive.** The AI does not stop and wait for you. It will always try to drive the story forward if it starts to feel dry.
**Moral Complexity.** There are no archetypes. There is no good or bad. People are grey.
This engine was designed with **DeepSeek V4** in mind, but it runs beautifully on Gemini 3.1 Pro/3 Flash and should work great on Claude and similar high-capability models. There are three variants:
* **V7 Core:** The sweet spot in between. Cinematic, grounded, patient.
* **V7 Reality:** Complete realism. No plot protection. The consequences have teeth. I personally like this one.
* **V7 Gentle:** More subdued, more emotional. For stories that deserve their space.
# Memory Core: Save 75% of Your Tokens
This is probably the most practically useful function of all. And honestly, it's insanely easy to use.
The issue here is that you're 400 messages deep in an RP, your context window is overloaded, and the AI starts hallucinating because it's drowning in old text that it can barely make sense of. Or you're throwing away money by sending 120k tokens per message to Claude because you don't want to lose continuity.
Well, Memory Core solves both of those problems. It is a 3-level system for managing your context:
* **Level 1 (Working Memory):** Recent messages. Standard stuff.
* **Level 2 (Short-term):** Old messages are automatically grouped into 10-message chunks, and AI-generated summaries of them are created in the background. No work on your part.
* **Level 3 (Long-term Vault):** The oldest messages are moved to a Vector Database. When it's time to bring them back, like when you mention a place you visited 250 messages ago, it does so quietly.
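To make the three levels concrete, here is a rough sketch of how a chat history could be split into them. It is not the extension's actual code: `partition_history` is a hypothetical function, and apart from the 10-message chunk size, the defaults (`working_size`, `short_term_chunks`) are made-up values for illustration:

```python
def partition_history(messages, working_size=20, chunk_size=10, short_term_chunks=5):
    """Split a chat history (oldest first) into vault, short-term chunks, and working memory."""
    # Level 1: the most recent messages stay in the prompt verbatim.
    working = messages[-working_size:] if working_size else []
    older = messages[:len(messages) - len(working)]

    # Group everything older into fixed-size chunks, oldest first.
    chunks = [older[i:i + chunk_size] for i in range(0, len(older), chunk_size)]

    # Level 2: the newest chunks would be replaced by summaries.
    short_term = chunks[-short_term_chunks:] if short_term_chunks else []
    # Level 3: whatever is older than that goes to the vault for retrieval.
    vault = chunks[:len(chunks) - len(short_term)]
    return vault, short_term, working

vault, short_term, working = partition_history(list(range(100)))
print(len(vault), "vault chunks,", len(short_term), "summary chunks,",
      len(working), "working messages")
```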
The magical component: **Prompt Interceptor**. This physically deletes the old messages from the prompt payload through SillyTavern's native mechanism. Old messages become grayed out in your chat interface. You can still visit them and read them, but they *won’t be sent to the API*. You aren’t paying for those messages. Your AI won’t have anything confusing to process. But the data is preserved: it’s stored in the vault, waiting to be retrieved if necessary.
There's also a built-in **Regex Cleaner** that automatically strips useless tokens from the chat before they even hit the summary pipeline, so you're not wasting storage or context on formatting garbage, HTML artifacts, or other noise. One less thing to worry about.
Two types of search engines: **TF-IDF Keyword Matching**, which is fast and easy to set up, or **Semantic Embeddings** that leverage SillyTavern's native LanceDB integration.
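If you have never seen TF-IDF keyword matching, here is a tiny self-contained sketch of the idea: score each stored chunk by how often it contains the query's words, weighted so that rare words count more. This is a generic textbook version with smoothed IDF, not the extension's implementation, and `tokenize` and `tfidf_search` are hypothetical names:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tfidf_search(query, documents, top_k=1):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    doc_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    query_terms = set(tokenize(query))
    scores = []
    for idx, counts in enumerate(doc_counts):
        total = sum(counts.values()) or 1
        score = 0.0
        for term in query_terms:
            df = sum(1 for c in doc_counts if term in c)
            if df == 0:
                continue  # term appears nowhere, so it cannot help ranking
            idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
            score += (counts[term] / total) * idf
        scores.append((score, idx))
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [documents[i] for score, i in scores[:top_k] if score > 0]

vault = [
    "They met at the lighthouse on the northern cliffs.",
    "Dinner at the noodle shop went badly.",
    "The train to the capital was delayed.",
]
print(tfidf_search("Remember the lighthouse?", vault))
```

Keyword matching like this only finds chunks that share literal words with the query, which is why the embedding option exists for paraphrased or indirect references.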
The interface couldn’t be simpler. Head to Tab 10, turn on the switch, and hit "Apply & Extract Pending". That's all there is to it. Even a dummy could do it. And I mean that in the most affectionate way possible.
# NPC Bank: Your Characters Have Faces Now
This one's just cool.
The NPC Bank automatically recognizes when the AI introduces a new significant character. It generates their description, including name, age, appearance, backstory, personality, secret motivations, their close circle, etc., and stores a comprehensive dossier of them in a persistent database.
From then on, each time that NPC becomes relevant to your story, the system will seamlessly inject their dossier into the prompt. No need to keep track of who's who. The AI will simply *remember* the character because the system provides it with the right information at the right time.
And before you ask: no, it won't spam dossiers just because an NPC's name shows up in the World State block. There's a **Regex filter** specifically designed to prevent false positives, so the system only injects a dossier when the NPC is *actually relevant to the active scene*, not just because their name got mentioned in a status tracker somewhere.
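One way such a filter can work, sketched under loud assumptions: the status block is taken to be a collapsible `<details>` element, and a name counts as relevant only if it appears as a whole word outside that block. `relevant_npcs` is a hypothetical function, and the real filter in the suite may look quite different:

```python
import re

# Assumption: the World State block is wrapped in a <details> element.
STATUS_BLOCK_RE = re.compile(r"<details\b[^>]*>.*?</details>", re.DOTALL | re.IGNORECASE)

def relevant_npcs(message, npc_names):
    """Return NPC names mentioned in the scene text, ignoring the status block."""
    scene_text = STATUS_BLOCK_RE.sub("", message)
    return [
        name for name in npc_names
        if re.search(rf"\b{re.escape(name)}\b", scene_text, re.IGNORECASE)
    ]

message = "Aiko slams the door.<details>Ren: plotting at the docks</details>"
print(relevant_npcs(message, ["Aiko", "Ren"]))
```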
And here's the best part: **AI Portrait.** With just one click, ComfyUI creates a portrait of that NPC entirely based on the AI's physical description of them. Your characters have faces now. And the whole process is fully automated, so you don't have to do anything. Also, if you use a multimodal model, the system can send the portrait back to the AI as a visual reference.
# Gemini Thinking: Stop the Bleed
The Gemini Thinking toggle injects triple `<think>` tags that bypass Google's strict reasoning refusal filters. Clean separation: the thinking stays in the thinking block, the prose stays in the prose.
>⚠️ **Important:** If you enable this, go to **AI Response Formatting → Reasoning**, activate **Auto-Parse**, and set the Prefix to `<think>` and Suffix to `</think>`. Otherwise SillyTavern won't know where the thinking ends.
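To see why the prefix and suffix matter, here is a minimal sketch of tag-based splitting. It is not SillyTavern's parser, just an illustration of the idea; `split_reasoning` is a hypothetical function:

```python
import re

# Non-greedy match so several think blocks in one response are handled separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(response):
    """Return (reasoning, prose) with the think blocks removed from the prose."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(response))
    prose = THINK_RE.sub("", response).strip()
    return reasoning, prose

print(split_reasoning("<think>She is lying.</think>\nMira smiles."))
# Without a closing tag nothing matches, so the thinking leaks into the prose.
print(split_reasoning("<think>She is lying. Mira smiles."))
```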
# World State Tracker: The Infoblock, But Better
Remember the old `info_block`? It's been completely rebuilt into a proper status dashboard. It now tracks:
* Current date, time, and weather
* PC's physical state (energy, injuries, mood indicators)
* NPC agendas and secrets
* Off-screen activity (what NPCs are doing when you're not looking)
* Unresolved narrative threads
* Current scene phase
It outputs as a collapsible HTML block at the end of each response. Which brings me to...
# NPC Inner Chatter: The Spoiler Block
New block. Following every response, the AI generates an additional block that contains the *unfiltered thoughts* of all NPCs involved in the scene. Their true feelings behind the dialogue. The information they're hiding from you. The observations they made but didn't mention.
**My advice: don't look into it.** Both World State Tracker and the Inner Chatter blocks were designed to be read by the AI only, not you. These blocks may contain spoilers, NPC secrets, future story seeds. Once you take a look at them, you will know what's going to happen next and it will spoil the experience for you. Keep them collapsed and let the AI do its job.
However, if you want to, I can't stop you.
# Other Cool Features
* **V7 Chain of Thought:** A hardcore 5-step reasoning audit from Ground Truth to Plot Engine, Scene Design, Active Draft, and Correction Loop. The AI needs to justify its actions before even beginning to write anything down. There's also a Lite mode that uses fewer tokens.
* **Engine Behavior Toggle:** Disable particular V7 behaviors one-by-one (OOC Protocol, Cultural Anchoring, Scene Choreography, and more) while retaining the overall logical consistency.
* **Dynamic Ban List:** Simply click the "Analyze Chat" button and the AI will automatically detect sloppy phrases it used in the previous 50 messages and ban them for future generations.
* **Story Planner:** Automatically generates at least 10 plot milestones and inserts them into context for the AI to work towards achieving rather than simply responding to your latest message.
* **Prompt Preview:** View the exact text being sent to the API for debugging or just to know what your prompt payload actually looks like.
# Get It
Installation and everything else is on the GitHub. Watch the install video if you need it.
**GitHub:** [https://github.com/Arif-salah/Megumin-Suite](https://github.com/Arif-salah/Megumin-Suite)
**Install Video:** [https://www.youtube.com/watch?v=Q-iaz9mBFrA](https://www.youtube.com/watch?v=Q-iaz9mBFrA)
**Discord:** [https://discord.gg/HkxgN8r3jx](https://discord.gg/HkxgN8r3jx) DM: kazumaoniisan
If you're coming from V6, your profiles should migrate. If something breaks, hit me up on the Discord.
But seriously, if this tool helped you save some time, improved your rp sessions, or even impressed you at least a little bit, please consider donating even just one dollar to the Ko-fi. Or donate an API key. Or just star the repository and share the link somewhere. All of it helps. I will keep working on this project regardless, but it would be nice to know that I am not shouting into the void.
**Crypto (LTC):** `LSjf1DczHxs3GEbkoMmi1UWH2GikmXDtis`
Now if you'll excuse me, I'm going to sleep for approximately 47 hours.
AI tools related to Gemini 3.1 Pro vs Gemini 2.5 Flash Lite
These tools are closely connected to one or both models in this comparison and can help you evaluate real-world fit.
BeautyPlus
BeautyPlus is an AI-powered online platform offering a comprehensive suite of image and video editing tools. It features an AI Image Enhancer to improve photo quality, resolution, color, and contrast, and includes advanced functionalities like blurry photo correction, noise reduction, and blemish minimization. Additionally, it integrates Nano Banana Pro, an AI image generator and editor powered by Google Gemini 3 Pro, enabling users to generate images from text, edit existing images with prompts, and combine elements from multiple images. The platform also provides various other tools such as background removers, object removers, AI filters, video enhancers, and more, catering to both professional and casual users for diverse creative needs.
googlegemini.co
googlegemini.co is a free tool for interacting with text and images, powered by the Google Gemini Pro API. It allows you to use Gemini easily without managing your own server or API configurations. Google Gemini is a multimodal AI developed by DeepMind capable of processing text, audio, images, and more. It is optimized for various devices, performs well on AI benchmarks, and is built with a focus on safety and responsible AI practices.
GeminiGoogle.cc
GeminiGoogle.cc is a platform dedicated to showcasing Google's most advanced AI model, Gemini. Built for native multimodality, Gemini reasons across text, images, video, audio, and code. It is available in three versions—Ultra, Pro, and Nano—to support tasks ranging from complex reasoning to on-device efficiency. The site highlights Gemini's performance, including its MMLU benchmarks, and provides examples of its capabilities in image generation, problem-solving, and multimodal analysis.
Summarize and Translate Web Pages - Chrome Extension
The Summarize and Translate Web Pages Chrome extension enables you to summarize and translate web content with a single click. Powered by Google's Gemini AI, this tool provides high-quality summaries and translations for web pages, selected text, YouTube video captions, images, and PDF files.
Which model should you choose?
Use the summary below to decide which model better fits your workflow, budget, and feature requirements.
Gemini 3.1 Pro
Gemini 3.1 Pro is listed as a strong fit for long-context workloads, reasoning-heavy tasks, and tool-augmented workflows.
Gemini 2.5 Flash Lite
Gemini 2.5 Flash Lite is listed as a strong fit for long-context workloads, reasoning-heavy tasks, and tool-augmented workflows.
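Both models are listed as fits for long-context workloads, reasoning-heavy tasks, and tool-augmented workflows, so those labels alone will not separate them. Choose Gemini 2.5 Flash Lite if input cost matters most ($0.10 versus $2.00 per 1M input tokens). Consider Gemini 3.1 Pro if the shared benchmark rows show an advantage on your tasks that justifies the higher price.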
Common questions about Gemini 3.1 Pro vs Gemini 2.5 Flash Lite
What is the main difference between Gemini 3.1 Pro and Gemini 2.5 Flash Lite?
Both models are described as suited to long-context workloads, reasoning-heavy tasks, and tool-augmented workflows, so the listed use cases do not distinguish them. The clearest listed difference is input pricing: $0.10 per 1M input tokens for Gemini 2.5 Flash Lite versus $2.00 for Gemini 3.1 Pro.
Which model is cheaper: Gemini 3.1 Pro or Gemini 2.5 Flash Lite?
Gemini 2.5 Flash Lite starts lower on input pricing at $0.10 per 1M input tokens, compared with $2.00 for Gemini 3.1 Pro.
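As a rough illustration of what that gap means in practice, the sketch below computes input-token cost for both models from the two listed prices. The monthly volume of 50M input tokens is a made-up assumption, and output-token pricing is not covered because it is not stated here.

```python
# Listed input prices in USD per 1M input tokens (from the comparison above).
PRICE_PER_M_INPUT = {
    "Gemini 3.1 Pro": 2.00,
    "Gemini 2.5 Flash Lite": 0.10,
}

def input_cost(model: str, input_tokens: int) -> float:
    """Input-side cost in USD; output tokens are deliberately not included."""
    return PRICE_PER_M_INPUT[model] * input_tokens / 1_000_000

tokens = 50_000_000  # hypothetical monthly input volume
costs = {model: input_cost(model, tokens) for model in PRICE_PER_M_INPUT}
ratio = costs["Gemini 3.1 Pro"] / costs["Gemini 2.5 Flash Lite"]

for model, usd in costs.items():
    print(f"{model}: ${usd:,.2f}")
print(f"Pro costs about {ratio:.0f}x more on input tokens")
```

At the listed prices the input-side ratio stays at roughly 20x regardless of volume. Total cost also depends on output pricing and on how many tokens each model generates.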
Which model has the larger context window: Gemini 3.1 Pro or Gemini 2.5 Flash Lite?
Neither is listed with a meaningfully larger window: Gemini 3.1 Pro is listed with a context window of 1,048,576 tokens, and Gemini 2.5 Flash Lite is listed with 1.0M, which is effectively the same size.
How should I evaluate Gemini 3.1 Pro vs Gemini 2.5 Flash Lite for my use case?
Start with the 16 shared benchmark rows in this comparison, which cover evaluations both models appear in, then weigh them against pricing and the capabilities your workflow needs.