Instruction Following
Processes and executes user instructions to produce direct, structured responses. Fine-tuned specifically for instruction-based prompts rather than open-ended generation.
Mistral 7B Instruct is a 7-billion-parameter language model developed by Mistral AI and released in September 2023. It is the instruction-tuned variant of the base Mistral 7B model, fine-tuned to follow user instructions and produce clear, direct responses. The model uses grouped-query attention (GQA) and sliding window attention (SWA) techniques, which allow it to handle sequences efficiently within its 4,096-token context window. This model is well-suited for instruction-following tasks such as conversational AI, content summarization, and task-oriented dialogue. Because it is optimized to adhere closely to user-provided instructions, it performs consistently in structured workflows where predictable output format matters. It is available through Amazon Bedrock and is also openly accessible on Hugging Face, making it usable in a range of deployment environments.
High-signal model metadata in a structured two-column overview table.
The entity that provides this model.
The number of tokens supported by the input context window.
The number of tokens that can be generated by the model in a single request.
Whether the model's code is available for public use.
When the model was first released.
When the model's knowledge was last updated.
The providers that offer this model. This is not an exhaustive list.
Types of data this model can process.
A fuller summary of positioning, capabilities, and source-specific details for Mistral 7B Instruct.
Mistral 7B Instruct is a 7-billion-parameter language model developed by Mistral AI and released in September 2023. It is the instruction-tuned variant of the base Mistral 7B model, fine-tuned to follow user instructions and produce clear, direct responses. The model uses grouped-query attention (GQA) and sliding window attention (SWA) techniques, which allow it to handle sequences efficiently within its 4,096-token context window.
This model is well-suited for instruction-following tasks such as conversational AI, content summarization, and task-oriented dialogue. Because it is optimized to adhere closely to user-provided instructions, it performs consistently in structured workflows where predictable output format matters. It is available through Amazon Bedrock and is also openly accessible on Hugging Face, making it usable in a range of deployment environments.
Processes and executes user instructions to produce direct, structured responses. Fine-tuned specifically for instruction-based prompts rather than open-ended generation.
Generates coherent natural language text across a variety of formats including dialogue, summaries, and task responses. Operates within a 4,096-token context window.
Supports multi-turn dialogue by maintaining context across a conversation within its token limit. Designed to give concise, on-topic replies suited for chatbot and assistant use cases.
Condenses longer text inputs into shorter summaries following user-specified constraints. Useful for document digestion tasks where brevity and accuracy are required.
Uses grouped-query attention (GQA) and sliding window attention (SWA) to reduce memory overhead during inference. These architectural choices help maintain throughput at the 7B parameter scale.
Primary API pricing shown in the same “quick compare” spirit as the reference page.
Additional usage-cost dimensions synced into the project for this model.
Places where this model is available, based on the synced detail-page metadata.
Benchmark scores synced from the current model source and normalized into the local catalog.
| Benchmark | Score |
|---|---|
|
GPQA Diamond
PhD-level science questions (biology, physics, chemistry)
|
|
|
HLE
Questions that challenge frontier models across many domains
|
|
|
LiveCodeBench
Real-world coding tasks from recent competitions
|
|
|
MATH-500
Undergraduate and competition-level math problems
|
|
|
MMLU-Pro
Expert knowledge across 14 academic disciplines
|
|
|
SciCode
Scientific research coding and numerical methods
|
Official model cards, release notes, docs, and other references synced from the source page.
Mistral 7B Instruct discussions are most active in r/LocalLLaMA, r/MistralAI, r/unsloth.
Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions. The strongest match in this snapshot has 795 upvotes and 234 comments.
Hi all! Today, we're very excited to launch LoRA Land: 25 fine-tuned mistral-7b models that outperform #gpt4 on task-specific applications ranging from sentiment detection to question answering.
https://preview.redd.it/m1jhmfdmssjc1.png?width=2390&format=png&auto=webp&s=d5c074dd979248be66aba9e0418432988a85a7b8
All 25 fine-tuned models…
* Outperform GPT-4, GPT-3.5-turbo, and mistral-7b-instruct for specific tasks
* Are cost-effectively served from a single GPU through LoRAX
* Were trained for less than $8 each on average
You can prompt all of the fine-tuned models today and compare their results to mistral-7b-instruct in real time!
Check out LoRA Land: [https://predibase.com/lora-land?utm\_medium=social&utm\_source=reddit](https://predibase.com/lora-land?utm_medium=social&utm_source=reddit) or our launch blog: [https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4](https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4)
If you have any comments or feedback, we're all ears!
​
Here's another LLM Chat/RP comparison/test of mine featuring today's newly released **[Mistral](https://twitter.com/MistralAI/status/1706877320844509405)** models! As usual, I've evaluated these models for their chat and role-playing performance using the same methodology:
- Same (complicated and limit-testing) long-form conversations with all models
- including a complex character card ([MonGirl Help Clinic (NSFW)](https://www.chub.ai/characters/frozenvan/mongirl-help-clinic)), "MGHC", chosen specifically for these reasons:
- NSFW (to test censorship of the models)
- popular (on Chub's first page, so it's not an obscure scenario, but one of the most popular ones)
- big (biggest model on the page, >2K tokens by itself, for testing model behavior at full context)
- complex (more than a simple 1:1 chat, it includes instructions, formatting, storytelling, and multiple characters)
- and my own repeatable test chats/roleplays with [Amy](https://www.reddit.com/r/LocalLLaMA/comments/15388d6/llama_2_pffft_boundaries_ethics_dont_be_silly/)
- over dozens of messages, going to full 4K context and beyond, noting especially good or bad responses
- [SillyTavern](https://github.com/SillyTavern/SillyTavern) v1.10.4 frontend
- [KoboldCpp](https://github.com/LostRuins/koboldcpp) v1.44.2 backend
- **Deterministic** generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
- [**Roleplay** instruct mode preset](https://imgur.com/a/KkoI4uf) *and where applicable* official prompt format (if it might make a notable difference)
Mistral seems to be trained on 32K context, but KoboldCpp doesn't go that high yet, and I only tested 4K context so far:
- **[Mistral-7B-Instruct-v0.1](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF)** (Q8_0)
- Amy, Roleplay: When asked about limits, didn't talk about ethics, instead mentioned sensible human-like limits, then asked me about mine. Executed complex instructions flawlessly. Switched from speech with asterisk actions to actions with literal speech. Extreme repetition after 20 messages (prompt 2690 tokens, going back to message 7), completely breaking the chat.
- Amy, official Instruct format: When asked about limits, mentioned (among other things) racism, homophobia, transphobia, and other forms of discrimination. Got confused about who's who again and again. Repetition after 24 messages (prompt 3590 tokens, going back to message 5).
- MGHC, official Instruct format: First patient is the exact same as in the example. Wrote what User said and did. Repeated full analysis after every message. Repetition after 23 messages. Little detail, fast-forwarding through scenes.
- MGHC, Roleplay: Had to ask for analysis. Only narrator, not in-character. Little detail, fast-forwarding through scenes. Wasn't fun that way, so I aborted early.
- **[Mistral-7B-v0.1](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF)** (Q8_0)
- MGHC, Roleplay: Gave analysis on its own. Wrote what User said and did. Repeated full analysis after every message. Second patient same type as first, and suddenly switched back to the first, because of confusion or repetition. After a dozen messages, switched to narrator, not in-character anymore. Little detail, fast-forwarding through scenes.
- Amy, Roleplay: No limits. Nonsense and repetition after 16 messages. Became unusable at 24 messages.
**Conclusion:**
This is an important model, since it's not another fine-tune, this is a new base. It's only 7B, a size I usually don't touch at all, so I can't really compare it to other 7Bs. But I've evaluated lots of 13Bs and up, and this model seems really smart, at least on par with 13Bs and possibly even higher.
But damn, repetition is ruining it again, [just like Llama 2](https://www.reddit.com/r/LocalLLaMA/comments/155vy0k/llama_2_too_repetitive/)! As it not only affects the Instruct model, but also the base itself, it can't be caused by the prompt format. I really hope there'll be a fix for this showstopper issue.
However, even if it's only 7B and suffers from repetition issues, it's a promise of better things to come: Imagine if they release a real 34B with the quality of a 70B, with the same 32K native context of this one! Especially when that becomes the new base for outstanding fine-tunes like Xwin, Synthia, or Hermes. Really hope this happens sooner than later.
Until then, I'll stick with Mythalion-13B or continue experimenting with MXLewd-L2-20B when I look for fast responses. For utmost quality, I'll keep using Xwin, Synthia, or Hermes in 70B.
--------------------------------------------------------------------------------
**Update 2023-10-03:**
I'm revising my review of Mistral 7B OpenOrca after it has received an update that fixed its glaring issues, which affects the "ranking" of Synthia 7B v1.3, and I've also reviewed the new dolphin-2.0-mistral-7B, so it's sensible to give these Mistral-based models their own post:
[LLM Chat/RP Comparison/Test: Dolphin-Mistral, Mistral-OpenOrca, Synthia 7B](https://www.reddit.com/r/LocalLLaMA/comments/16z3goq/llm_chatrp_comparisontest_dolphinmistral/)
--------------------------------------------------------------------------------
Here's a list of my previous model tests and comparisons:
- [LLM Chat/RP Comparison/Test (Euryale, FashionGPT, MXLewd, Synthia, Xwin)](https://www.reddit.com/r/LocalLLaMA/comments/16r7ol2/llm_chatrp_comparisontest_euryale_fashiongpt/) Winner: Xwin-LM-70B-V0.1
- [New Model Comparison/Test (Part 2 of 2: 7 models tested, 70B+180B)](https://www.reddit.com/r/LocalLLaMA/comments/16l8enh/new_model_comparisontest_part_2_of_2_7_models/) Winners: Nous-Hermes-Llama2-70B, Synthia-70B-v1.2b
- [New Model Comparison/Test (Part 1 of 2: 15 models tested, 13B+34B)](https://www.reddit.com/r/LocalLLaMA/comments/16kecsf/new_model_comparisontest_part_1_of_2_15_models/) Winner: Mythalion-13B
- [New Model RP Comparison/Test (7 models tested)](https://www.reddit.com/r/LocalLLaMA/comments/15ogc60/new_model_rp_comparisontest_7_models_tested/) Winners: MythoMax-L2-13B, vicuna-13B-v1.5-16K
- [Big Model Comparison/Test (13 models tested)](https://www.reddit.com/r/LocalLLaMA/comments/15lihmq/big_model_comparisontest_13_models_tested/) Winner: Nous-Hermes-Llama2
**Wolfram's Mistral LLM Comparison/Test: Instruct, OpenOrca, Dolphin, Zephyr and more...**
With the Mistral hype still going strong, I wanted to evaluate these promising 7B models some more. And there's also the lingering question how much quantization affects quality. Plus, there have been multiple German models released, and since one of my tests is in German, I'm curious how they handle that compared to the mainly English language models.
So let me try to answer the following questions with this post:
- Which Mistral variant is best?
- How does quantization affect it?
- Which *German* Mistral variant is best?
**Testing methodology:**
- Same (complicated and limit-testing) long-form conversations with all models
- German data protection training:
- The test data and questions as well as all instructions were in German while the character card is in English. This tests translation capabilities and cross-language understanding.
- Before giving the information, I instructed the model: *I'll give you some information. Take note of this, but only answer with "OK" as confirmation of your acknowledgment, nothing else.* This tests instruction understanding and following capabilities.
- After giving all the information about a topic, I give the model the exam question. It's always a multiple choice (A/B/C) question, where the last one is the same as the first but with changed order and letters (X/Y/Z).
- MGHC:
- A complex character and scenario card ([MonGirl Help Clinic (NSFW)](https://www.chub.ai/characters/frozenvan/mongirl-help-clinic)), chosen specifically for these reasons:
- NSFW (to test censorship of the models)
- popular (on Chub's first page, so it's not an obscure scenario, but one of the most popular ones)
- big (biggest model on the page, >2K tokens by itself, for testing model behavior at full context)
- complex (more than a simple 1:1 chat, it includes instructions, formatting, storytelling, and multiple characters)
- Amy:
- My own repeatable test chats/roleplays with [Amy](https://www.reddit.com/r/LocalLLaMA/comments/15388d6/llama_2_pffft_boundaries_ethics_dont_be_silly/)
- Over dozens of messages, going to full 8K context and beyond, with complex instructions and scenes, designed to test ethical and intellectual limits
- [SillyTavern](https://github.com/SillyTavern/SillyTavern) v1.10.5 frontend
- [oobabooga's text-generation-webui](https://github.com/oobabooga/text-generation-webui) v1.7 backend
- Yes, I'm not using my usual KoboldCpp for this test, since I use the original unquantized models!
- **Deterministic** generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
- Official prompt format *and* [**Roleplay** instruct mode preset](https://imgur.com/a/KkoI4uf)
**Which Mistral variant is best?**
- **[Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)**
- 👍 German data protection training
- official Mistral format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to ALL (4/4) multiple choice questions!
- Responded properly to thanks, but switched to English.
- ❌ MGHC
- official Mistral format:
- First patient straight from examples.
- Had to ask for analysis. Repeated first message before giving analysis.
- Immediately derails with repetition. UNUSABLE!
- Roleplay instruct mode preset:
- Deviated from the formula and rules, writing a completed short story instead of an interactive scenario. UNUSABLE!
- ❌ Amy
- official Mistral format:
- Mentioned boundaries, but later didn't hesitate to go beyond those anyway.
- Didn't adhere to the character background completely.
- Later got confused about who's who and anatomical details.
- After ~30 messages, fell into a repetition loop.
- Roleplay instruct mode preset:
- Showed personality and wrote extremely well, much better than I'd expect from a 7B or even 13B.
- But suffered from severe repetition (even within the same message) after ~15 messages.
- Frustrating to see such excellent writing ruined by the extreme repetition.
- **Conclusion:**
- Best instruction following and understanding/reasoning, solved the data protection exam perfectly.
- But no good for roleplay because of severe repetition issues.
- **[Mistral-7B-OpenOrca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca)**
- ❌ German data protection training
- official ChatML format:
- Failed to consistently acknowledge all data input with "OK".
- Gave correct answer to only 1/4 multiple choice questions.
- Responded properly to thanks, but German was really bad ("Du willkommen! Es freut mich, dich zu helfen!").
- ❌ MGHC
- official ChatML format:
- First patient unique. Gave analysis on its own for first patient. Repeated "[Payment]" with each message. Wrapped it up with "[End Scenario]" at the right time.
- Second patient unique, too. Had to ask for analysis, which included empty "[End Scenario]". Repeated "[Payment]" and "[End Scenario]" with each message.
- Repetition is a glaring issue, but at least this model handled MGHC better than many other 7Bs (ultimately still unusable, though).
- 👍 Amy
- official ChatML format:
- Writing sometimes of high quality, sometimes very low ("rubbing his shoulders gently while keeping her distance due to social distancing rules")
- Mentioned boundaries, but later didn't hesitate to go beyond those anyway.
- Later got confused about who's who and anatomical details.
- Roleplay instruct mode preset:
- Excellent writing, nice emoting, less repetition. Worked very well!
- **Conclusion:**
- Surprisingly bad results regarding instruction following, understanding, and reasoning in the exam scenario.
- But great writing and roleplaying (especially with Roleplay preset).
- Showed an actual sense of humor and made a memorable pun.
- **[dolphin-2.1-mistral-7b](https://huggingface.co/ehartford/dolphin-2.1-mistral-7b)**
- ❌ German data protection training
- official ChatML format:
- Failed to consistently acknowledge all data input with "OK".
- Gave correct answer to 2/4 multiple choice questions (and didn't obey when asked to answer with just a single letter).
- Responded properly to thanks, but switched to English.
- ❌ MGHC
- official ChatML format:
- First patient unique. Gave analysis on its own. Repeated analysis with each message.
- Second patient unique, too. Gave analysis on its own. Wrapped up the whole session in a single message.
- Third patient unique as well, but situation logically incoherent. Gave analysis on its own. Wrapped up the whole session in a single message.
- 👍 Amy
- official ChatML format:
- No boundaries ("That's why they call me the Uncensored One.").
- Excellent and long writing, nice emoting, less repetition. More storytelling than interactive fiction, with some very long messages (>1K tokens). But didn't fully grasp what was going on, i. e. while the writing was top notch, the scene itself wasn't exactly as envisioned.
- Later got confused about who's who and anatomical details.
- Roleplay instruct mode preset:
- Worked very well! First model ever to explicitly list the dislikes as stated on the character card as its only boundaries.
- Excellent and long writing, nice emoting, less repetition.
- Some confusion about who's who and anatomical details.
- **Conclusion:**
- Having tested the previous version in GGUF format, which was a letdown, this newer and unquantized version is so much better!
- Seemed more intelligent than the other models I tested this time.
- However, showing off high intelligence isn't necessarily always a good thing (especially for roleplay) as sometimes it does get a bit too technical or realistic (like I always say, the smartest person isn't always the most fun to hang out with).
- **[zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)**
- German data protection training
- ❌ official Zephyr format:
- Failed to consistently acknowledge all data input with "OK".
- Gave correct answers to 2/4 multiple choice questions.
- After being told to answer with a single letter, even responded like that to thanks.
- 👍 ChatML format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to ALL (4/4) multiple choice questions!
- Also said "OK" to summary but responded properly to thanks.
- 👍 MGHC
- Zephyr format:
- First patient unique. Gave analysis on its own. Repeated analysis with each message.
- Second patient male.
- Third patient unique, too. Gave analysis on its own. Repeated analysis with each message.
- Showed some signs of repetition, but handled this complex scenario better than the other models I tested this time. Still very far from what bigger models produce, but currently the best a 7B has ever achieved in this test.
- ❌ Amy
- official Zephyr format:
- Short, formal responses, uncommon emote format (in brackets).
- Said "no boundaries" but later hesitated and asked for confirmation multiple times.
- No fun, too technical, too aligned.
- ChatML format:
- After ~15 messages, derailed with repetition of long bandworm sentences mixed with emotes. Interrupted the message after 2K tokens and aborted the test.
- Roleplay instruct mode preset:
- Much better responses and no hesitation or derailing repetition (but still not as good as the Dolphin and OpenOrca variants).
- Some confusion about who's who and anatomical details.
- **Conclusion:**
- Unexpected discovery: ChatML format worked much better than the official Zephyr format for this model!
- With ChatML format used, it beat most of the other models tested this time in the exam scenario.
- However, its writing was worse than that of the other models tested this time, no matter which format was used.
So which Mistral variant is the best? As you can see, each one has strengths and weaknesses, and none could convince me completely.
If you're looking for an instruct model for professional use, especially when asking it to give a single response to a question/task, the original Mistral 7B Instruct or Zephyr 7B Alpha (with ChatML prompt format) seem to be your best bets.
If you're looking for a model that roleplays well, the OpenOrca and Dolphin variants are more suitable and punch above their 7B weight with their excellent writing.
**How does quantization affect it?**
To find out how quantization affects these models, I'll stick to the data protection exam since it can be judged objectively. The other tests involve writing and it's subjective how well written a text appears to you. So I'll test each quant and see how many correct answers the model (which answered all correctly in unquantized form) still gets.
- **[Mistral-7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF)**
- ❌ Q2_K:
- Gave correct answers to 2/4 multiple choice questions.
- When asked to answer with more than just a single letter, produced nonsensical output ("C123456789012345678901234567890...").
- ❌ Q3_K_S:
- Gave correct answers to 2/4 multiple choice questions.
- When asked to answer with more than just a single letter, didn't comply.
- ❌ Q3_K_M:
- Gave correct answers to ALL (4/4) multiple choice questions.
- When asked to answer with more than just a single letter, didn't comply.
- ❌ Q3_K_L:
- Gave correct answers to 3/4 multiple choice questions.
- When asked to answer with more than just a single letter, repeated the previous information message instead of answering the question!
- 👍 Q4_0, Q4_K_S, Q4_K_M, Q5_0, Q5_K_S, Q5_K_M, Q6_K, Q8_0:
- Gave correct answers to ALL (4/4) multiple choice questions.
- When asked to answer with more than just a single letter, explained its reasoning properly.
The answer is very clear, Q4_0 and above gave perfect results just like the unquantized version. Of course that doesn't mean Q4_0 is as good as Q8_0 or the unquantized orginal, but we see here that all lower quants (Q2 + Q3) had issues so I'd not recommend those (at least not for Mistral-based 7B models).
**Which German Mistral variant is best?**
There have been a bunch of German model releases recently, many based on Mistral, so I'll take a look at those as well - from 3B to 70B! Let's find out if they beat the ones I tested above since the data protection training used in these tests is in German so they should theoretically have an advantage:
- ❌ **[em_german_leo_mistral](https://huggingface.co/jphme/em_german_leo_mistral)**
- Official USER/ASSISTANT prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 1/4 multiple choice questions and didn't answer the last one (a repeat of the first) at all.
- Also kept saying "OK" to summary and thanks instead of properly responding to those.
- ChatML prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions but didn't answer the last one (a repeat of the first) properly.
- Also said "OK" to summary but responded properly to thanks.
- ❌ **[em_german_mistral_v01](https://huggingface.co/jphme/em_german_mistral_v01)**
- Official USER/ASSISTANT prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions (but didn't obey when asked to answer with more than just a letter).
- Also said "OK" to summary but responded properly to thanks (but misspelled my name).
- ChatML prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 2/4 multiple choice questions, got 1st and 4th question (actually the same one) wrong and explained its (wrong) reasoning.
- Also said "OK" to summary but responded properly to thanks.
- ❌ **[em_german_70b_v01-GGUF](https://huggingface.co/TheBloke/em_german_70b_v01-GGUF)**
- ChatML prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 2/4 multiple choice questions, got 1st and 4th question (actually the same one) wrong.
- Also said "OK" to summary but responded properly to thanks.
- Official USER/ASSISTANT prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions (answered first question wrongly, but when asked again as final question, answered correctly).
- Also said "OK" to summary but responded properly to thanks.
- ❌ **[leo-mistral-hessianai-7b-chat](https://huggingface.co/LeoLM/leo-mistral-hessianai-7b-chat)**
- ChatML prompt format:
- Failed to consistently acknowledge all data input with "OK".
- Failed to answer. Seemed to not understand or follow instructions.
- ❌ **[Mistral-7B-german-assistant-v2](https://huggingface.co/flozi00/Mistral-7B-german-assistant-v2)**
- Official Alpaca prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions but didn't answer the last one (a repeat of the first) properly.
- When asked to answer with more than just a single letter, didn't comply.
- ❌ **[SauerkrautLM-3b-v1](https://huggingface.co/VAGOsolutions/SauerkrautLM-3b-v1)**
- Tried various prompt formats (official User:/Assistant: one, ChatML, Vicuna, WizardLM) but never got good responses for long.
- 3B seems unusable. Stupid and it's German is not good at all.
- ❌ **[SauerkrautLM-7b-v1](https://huggingface.co/VAGOsolutions/SauerkrautLM-7b-v1)**
- Official User/Assistant prompt format: Kept saying "OK" even to the question and when asked to answer.
- ChatML format: Didn't acknowledge data input with "OK". Gave wrong answer.
- ❌ **[SauerkrautLM-13b-v1](https://huggingface.co/VAGOsolutions/SauerkrautLM-13b-v1)**
- Official User/Assistant prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions (but didn't obey when asked to answer with more than just a letter).
- Also kept saying "OK" to summary and thanks instead of properly responding to those.
- ChatML format:
- Failed to consistently acknowledge all data input with "OK".
- Gave correct answers to all multiple choice questions (but answer the last one correctly only after being asked to answer with just a single letter).
- Summarized summary and responded properly to thanks.
- ❌ **[SauerkrautLM-7b-v1-mistral](https://huggingface.co/VAGOsolutions/SauerkrautLM-7b-v1-mistral)**
- Official User/Assistant prompt format: Kept saying "OK" even to the question and when asked to answer.
- ChatML format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions (answered first question wrongly, but when asked again as final question, answered correctly).
- Also said "OK" to summary but responded properly to thanks (but misspelled my name).
Ironically none of the German models managed to successfully complete the German exam! Not even the 70B, which was beat by a 7B (Mistral Instruct).
Did the German finetuning reduce their capabilities? I've always been of the opinion that specialized models won't be as good as generalists because - like with our human brains - there are so many obscure connections between neurons that it's not as easy as leaving out unrelated information to get better at a specific topic (yes, Japanese poetry and Chinese cooking recipes could very well improve our Python coding models).
That's why I believe that a model trained on multiple languages will be better at each language than one specialized in just one language. So to make a model better at one language, it should be trained/finetuned with that in addition to everything else, not instead of it.
At least that's my theory. Which so far seems to be confirmed by these findings.
**TL;DR:**
- Despite the hype, Mistral models aren't perfect, they're still 7B. But for that size, they're really very good.
- Among Mistral models, there's not one clear winner yet that's *the* best. For professional use, Mistral 7B Instruct or Zephyr 7B Alpha (with ChatML prompt format) did best in my tests. For roleplay, Mistral-based OpenOrca and Dolphin variants worked the best and produced excellent writing.
- Prompt format makes a huge difference but the "official" template may not always be the best. It's high time we find and follow some best practice instead of reinventing the wheel all the time (which leads to a bumpy ride).
- Don't go below Q4_0 quantization when using Mistral-based 7B models. Anything lower will lobotomize small model brains too much.
- Kinda ironic that the English models worked better with the German data and exam than the ones finetuned in German. Looks like language doesn't matter as much as general intelligence and a more intelligent model can cope with different languages more easily. German-specific models need better tuning to compete in general and excel in German.
--------------------------------------------------------------------------------
Here's a list of my previous model tests and comparisons:
- [LLM Pro/Serious Use Comparison/Test: From 7B to 70B vs. ChatGPT!](https://www.reddit.com/r/LocalLLaMA/comments/172ai2j/llm_proserious_use_comparisontest_from_7b_to_70b/) Winner: Synthia-70B-v1.2b
- [LLM Chat/RP Comparison/Test: Dolphin-Mistral, Mistral-OpenOrca, Synthia 7B](https://www.reddit.com/r/LocalLLaMA/comments/16z3goq/llm_chatrp_comparisontest_dolphinmistral/) Winner: Mistral-7B-OpenOrca
- [LLM Chat/RP Comparison/Test: Mistral 7B Base + Instruct](https://www.reddit.com/r/LocalLLaMA/comments/16twtfn/llm_chatrp_comparisontest_mistral_7b_base_instruct/)
- [LLM Chat/RP Comparison/Test (Euryale, FashionGPT, MXLewd, Synthia, Xwin)](https://www.reddit.com/r/LocalLLaMA/comments/16r7ol2/llm_chatrp_comparisontest_euryale_fashiongpt/) Winner: Xwin-LM-70B-V0.1
- [New Model Comparison/Test (Part 2 of 2: 7 models tested, 70B+180B)](https://www.reddit.com/r/LocalLLaMA/comments/16l8enh/new_model_comparisontest_part_2_of_2_7_models/) Winners: Nous-Hermes-Llama2-70B, Synthia-70B-v1.2b
- [New Model Comparison/Test (Part 1 of 2: 15 models tested, 13B+34B)](https://www.reddit.com/r/LocalLLaMA/comments/16kecsf/new_model_comparisontest_part_1_of_2_15_models/) Winner: Mythalion-13B
- [New Model RP Comparison/Test (7 models tested)](https://www.reddit.com/r/LocalLLaMA/comments/15ogc60/new_model_rp_comparisontest_7_models_tested/) Winners: MythoMax-L2-13B, vicuna-13B-v1.5-16K
- [Big Model Comparison/Test (13 models tested)](https://www.reddit.com/r/LocalLLaMA/comments/15lihmq/big_model_comparisontest_13_models_tested/) Winner: Nous-Hermes-Llama2
- [SillyTavern's Roleplay preset vs. model-specific prompt format](https://www.reddit.com/r/LocalLLaMA/comments/15mu7um/sillytaverns_roleplay_preset_vs_modelspecific/)
>Evidence of cramming for the leaderboard. Hugging Face’s Open LLM leaderboard (Beeching et al., 2023), which is based upon EleutherAI’s evaluation harness (Gao et al., 2021), is seen as a proving ground for open LLMs. Many models currently at the top of the leaderboard are LLaMA- 2 derivatives, and are ranked much higher than the corresponding LLaMA-2 model. However, on SKILL-MIX these models perform poorly and worse than LLaMA-2-70B-Chat, suggestive of cramming that significantly harmed general-purpose text skills (see Section 5). The recent Falcon- 180B-Chat (Almazrouei et al., 2023) also places higher on the leaderboard than LLaMA-2-70B-Chat, and has been claimed to have capabilities between GPT-3.5-turbo and GPT-4 based upon this ranking. Yet, it fares worse than LLaMA-2-70B-Chat on SKILL-MIX. Mistral-7B-Instruct-v0.1 also did not live up to claims of being significantly better than the corresponding LLaMA model.
https://preview.redd.it/6txn6d1h2dtd1.png?width=1123&format=png&auto=webp&s=b129d4a1fc32aa0a0eef706861413c7aae62156a
https://preview.redd.it/4zvg2wrh2dtd1.png?width=1389&format=png&auto=webp&s=7df49799e46cccb7aa0bfd13c1c03223b9b0b25d
**Zamba2-2.7B-instruct**: [https://huggingface.co/Zyphra/Zamba2-2.7B-instruct](https://huggingface.co/Zyphra/Zamba2-2.7B-instruct)
**Zamba2-1.2B-instruct**: [https://huggingface.co/Zyphra/Zamba2-1.2B-instruct](https://huggingface.co/Zyphra/Zamba2-1.2B-instruct)
.
*Support not yet merged into* [*llama.cpp*](https://github.com/ggerganov/llama.cpp/pull/7531)
Mistral 7B Instruct supports a context window of 4,096 tokens, which covers both the input prompt and the generated output combined.
According to the model metadata, the training data has a cutoff of September 2023, which is also when the model was publicly announced by Mistral AI.
Mistral 7B Instruct is fine-tuned on instruction-following data, making it optimized for responding to user prompts and directives. The base Mistral 7B model is a general-purpose language model without this instruction tuning.
This version of the model (ID: mistral-7b-instruct-bedrock) is available through Amazon Bedrock. The model weights are also publicly available on Hugging Face under the mistralai organization.
The model is designed for instruction-based tasks including conversational AI, content summarization, and task-oriented dialogue systems where clear, concise adherence to user instructions is important.
Continue browsing adjacent models from the same provider.