Mistral

Mistral 7B Instruct Deprecated

Focused on instruction-based tasks, providing clear, concise responses adhering to user instructions.

Oct 10, 2023 N/A context 2,500 tokens output

Text

Overview ↓ About ↓ Capabilities ↓ Pricing ↓ Price Comparison ↓ Tools ↓ Daily ↓ Resources ↓ Community ↓

Model Overview

High-signal model metadata in a structured two-column overview table.

Provider

The entity that provides this model.

Mistral

Input Context Window

The number of tokens supported by the input context window.

N/A tokens

Maximum Output Tokens

The number of tokens that can be generated by the model in a single request.

2,500 tokens tokens

Open Source

Whether the model's code is available for public use.

Release Date

When the model was first released.

Oct 10, 2023 2 years ago

Knowledge Cut-off Date

When the model's knowledge was last updated.

Unknown

API Providers

The providers that offer this model. This is not an exhaustive list.

Hugging Face, Mistral API

Modalities

Types of data this model can process.

Text File

What is Mistral 7B Instruct Deprecated

A fuller summary of positioning, capabilities, and source-specific details for Mistral 7B Instruct Deprecated.

A 7.3B parameter model that outperforms Llama 2 13B on all benchmarks, with optimizations for speed and context length.

Capabilities

What Mistral 7B Instruct Deprecated supports

Multimodal I/O

This model accepts text input and returns text output.

Pricing for Mistral 7B Instruct Deprecated

Primary API pricing shown in the same “quick compare” spirit as the reference page.

Input tokens N/A Per million tokens

Output tokens N/A Per million tokens

Price Comparison

Additional usage-cost dimensions synced into the project for this model.

maxTemperature 1

maxResponseSize 2,500 tokens

API Access & Providers

Places where this model is available, based on the synced detail-page metadata.

Hugging Face Mistral API

Resources & Documentation

Official model cards, release notes, docs, and other references synced from the source page.

Official Website

→

Privacy and Data Collection at Mistral

→

View on HuggingFace

→

Related Daily Briefs

Recent daily stories tied to Mistral 7B Instruct Deprecated through direct model mentions or provider-level coverage.

Frontier Models

Mistral and OpenAI Signal a Broader Shift Around Costs Using PNGs

Claude and Mistral are becoming more practical to evaluate and deploy.

2026-07-04 AI Models AI API

Frontier Models

Google DeepMind, Mistral, and Anthropic Signal a Broader Shift Around MiniMax M3

MiniMax and Google are becoming more practical to evaluate and deploy.

2026-06-12 AI Models Benchmark

Community discussion

What people think about Mistral 7B Instruct Deprecated

Mistral 7B Instruct Deprecated discussions are most active in r/LocalLLaMA, r/MistralAI, r/ROCm.

Top Reddit threads cluster around benchmark and model-comparison threads, safety and censorship questions, coding workflow discussions. The strongest match in this snapshot has 489 upvotes and 130 comments.

r/ROCm 20 upvotes January 20, 2026

Windows 11 + RX 7900 XT: vLLM 0.13 running on ROCm (TheRock) with TRITON_ATTN - first success + benchmark (~3.4 tok/s)

Hey folks, first post here 🥹
This is more of a personal “AMD-on-Windows local LLM” mission than a polished guide, but I finally got vLLM to load + generate on Windows 11 with an AMD RX 7900 XT using AMD’s ROCm “TheRock” PyTorch wheels.

**TL;DR?**

* Windows 11 + RX 7900 XT + ROCm TheRock PyTorch nightly + AMD driver 25.12.1
* vLLM 0.13.0 generates with `VLLM_ATTENTION_BACKEND=TRITON_ATTN`
* Still hacky: missing compiled ops ⇒ Python fallbacks; perf varies (cold vs warm)

# Hardware / OS

* Windows 11 (10.0.26200)
* AMD Radeon RX 7900 XT (20GB)
* CPU: R9 7900x 12C/24T, 32GB RAM

# Software stack (ROCm on Windows)

* PyTorch: `2.11.0a0+rocm7.11.0a20260114`
* HIP runtime: `7.2.53150` (`torch.version.cuda=None`, `torch.version.hip=7.2.53150`)
* ROCm wheels: `rocm 7.11.0a20260114`, `rocm-sdk-core 7.11.0a20260114`
* vLLM: `0.13.0`
* triton-windows: `3.5.1.post23` (downgraded from post24)

Proof (from my diag script):

torch 2.11.0a0+rocm7.11.0a20260114
hip 7.2.53150
cuda None
is_cuda True
dev AMD Radeon RX 7900 XT

# Model

* Local safetensors repo: `RedHatAI/Mistral-7B-Instruct-v0.3-FP8` (I cloned it locally).
* vLLM loads weights on GPU, weights memory reported \~8.6 GiB and KV cache allocated on GPU.

# Attention backend + performance

* Using `VLLM_ATTENTION_BACKEND=TRITON_ATTN`
* `VLLM_USE_TRITON_FLASH_ATTN=1` seemed slower / less stable for me.
* Quick single prompt test (fresh run):
* Prompt: “Say Hi to Reddit!”
* Output tokens: 48
* Time: \~14.12s
* Measured: \~3.40 tok/s
* vLLM log estimated output speed: \~4.54 tok/s (decode)

# Caveats (important)

* This is **not** a clean, production-ready setup yet.
* Windows ROCm + vLLM currently requires some workaround glue (toolchain/env quirks, compile caching behavior).
* I’m still seeing some run-to-run variability (cold vs warm compile/cache).
* Downgrading triton-windows (from post24 to post23) helped me get TRITON\_ATTN stable enough to run, but I’m not assuming version-hunting is the real solution.

# What I actually want next (compiled ops, no monkeypatch)

Right now this works, but it’s still a hacky stack: I had to implement Python fallbacks for missing compiled ops (notably the cache ops like `_C_cache_ops.reshape_and_cache(_flash)` and some FP8 plumbing). What I’m trying to achieve next is a **clean build with the real compiled extensions** on Windows ROCm:

* `vllm._C` / `vllm._rocm_C` (or the equivalent ROCm extension)
* `_C_cache_ops` implementations (reshape/cache ops, etc.)
* Anything needed so vLLM doesn’t fall back to Python-level cache handling

If anyone has:

* a working **Windows ROCm build recipe** for vLLM’s native extensions (toolchain + flags),
* or a fork/branch known to build on Windows gfx1100,
* or a way to package these extensions into wheels for TheRock-style envs,

I’d love pointers. I can provide full logs/toolchain details.

(ComfyUIDesktop) C:\Users\pie\Desktop\ComfyUIDesktop>python patch_vllm_nuclear.py --model "C:\Users\pie\Desktop\ComfyUIDesktop\Mistral\Mistral-7B-Instruct-v0.3-FP8" --max-model-len 2048 --gpu-memory-utilization 0.7 --dtype float16
🔧 [VLLM-NUCLEAR] Setup environment ROCm/Windows...
🔧 [VLLM-NUCLEAR] Caricamento PyTorch...
🔧 [VLLM-NUCLEAR] ✓ Patched torch._scaled_mm fallback (RDNA3) [varargs]
🔧 [VLLM-NUCLEAR] ✅ ROCm: AMD Radeon RX 7900 XT
🔧 [VLLM-NUCLEAR] Neutralizzazione torch.distributed...
🔧 [VLLM-NUCLEAR] stub torch.distributed._functional_collectives
🔧 [VLLM-NUCLEAR] stub torch.distributed._symmetric_memory
🔧 [VLLM-NUCLEAR] stub torch.distributed.distributed_c10d
🔧 [VLLM-NUCLEAR] stub torch.distributed.rendezvous
🔧 [VLLM-NUCLEAR] Bypass dipendenze opzionali...
🔧 [VLLM-NUCLEAR] ⊘ llguidance
🔧 [VLLM-NUCLEAR] ⊘ xgrammar
🔧 [VLLM-NUCLEAR] ⊘ outlines
🔧 [VLLM-NUCLEAR] ⊘ uvloop
🔧 [VLLM-NUCLEAR] ⊘ flash_attn
🔧 [VLLM-NUCLEAR] ⊘ vllm_flash_attn
🔧 [VLLM-NUCLEAR] ✓ triton già disponibile
🔧 [VLLM-NUCLEAR] Patch platform detection...
DEBUG 01-20 18:33:22 [plugins/__init__.py:35] No plugins for group vllm.platform_plugins found.
DEBUG 01-20 18:33:22 [platforms/__init__.py:36] Checking if TPU platform is available.
DEBUG 01-20 18:33:22 [platforms/__init__.py:55] TPU platform is not available because: No module named 'libtpu'
DEBUG 01-20 18:33:22 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 01-20 18:33:22 [platforms/__init__.py:88] Exception happens when checking CUDA platform: NVML Shared Library Not Found
DEBUG 01-20 18:33:22 [platforms/__init__.py:105] CUDA platform is not available because: NVML Shared Library Not Found
DEBUG 01-20 18:33:22 [platforms/__init__.py:112] Checking if ROCm platform is available.
DEBUG 01-20 18:33:22 [platforms/__init__.py:120] Confirmed ROCm platform is available.
DEBUG 01-20 18:33:22 [platforms/__init__.py:133] Checking if XPU platform is available.
DEBUG 01-20 18:33:22 [platforms/__init__.py:153] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 01-20 18:33:22 [platforms/__init__.py:160] Checking if CPU platform is available.
DEBUG 01-20 18:33:22 [platforms/__init__.py:112] Checking if ROCm platform is available.
DEBUG 01-20 18:33:22 [platforms/__init__.py:120] Confirmed ROCm platform is available.
DEBUG 01-20 18:33:22 [platforms/__init__.py:225] Automatically detected platform rocm.
WARNING 01-20 18:33:22 [platforms/rocm.py:38] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 01-20 18:33:22 [platforms/rocm.py:44] Failed to import from vllm._rocm_C with ModuleNotFoundError("No module named 'vllm._rocm_C'")
🔧 [VLLM-NUCLEAR] ✓ Platform wrappato per forzare device_type='cuda'
🔧 [VLLM-NUCLEAR] Patch subprocess per gestire figli...
🔧 [VLLM-NUCLEAR] ============================================================
🔧 [VLLM-NUCLEAR] Import vLLM...
🔧 [VLLM-NUCLEAR] ============================================================
🔧 [VLLM-NUCLEAR] ✅ vLLM 0.13.0
🔧 [VLLM-NUCLEAR] ⚠️ 'app' non trovato, uso entry point alternativo
🔧 [VLLM-NUCLEAR] ============================================================
🔧 [VLLM-NUCLEAR] 🚀 AVVIO SERVER
🔧 [VLLM-NUCLEAR] ============================================================
🔧 [VLLM-NUCLEAR] Model: C:\Users\pie\Desktop\ComfyUIDesktop\Mistral\Mistral-7B-Instruct-v0.3-FP8
🔧 [VLLM-NUCLEAR] Max tokens: 2048
🔧 [VLLM-NUCLEAR] GPU memory: 0.7
🔧 [VLLM-NUCLEAR] ⚠️ run_server non disponibile, uso LLM diretto...
🔧 [VLLM-NUCLEAR] Creazione LLM engine...
🔧 [VLLM-NUCLEAR] Model: C:\Users\pie\Desktop\ComfyUIDesktop\Mistral\Mistral-7B-Instruct-v0.3-FP8
🔧 [VLLM-NUCLEAR] Device: cuda (ROCm)
🔧 [VLLM-NUCLEAR] Max tokens: 2048
🔧 [VLLM-NUCLEAR] GPU mem: 0.7
WARNING 01-20 18:33:25 [config/attention.py:82] Using VLLM_ATTENTION_BACKEND environment variable is deprecated and will be removed in v0.14.0 or v1.0.0, whichever is soonest. Please use --attention-config.backend command line argument or AttentionConfig(backend=...) config field instead.
DEBUG 01-20 18:33:25 [plugins/__init__.py:43] Available plugins for group vllm.general_plugins:
DEBUG 01-20 18:33:25 [plugins/__init__.py:45] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 01-20 18:33:25 [plugins/__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-20 18:33:25 [entrypoints/utils.py:253] non-default args: {'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 2048, 'distributed_executor_backend': 'uni', 'enable_prefix_caching': False, 'gpu_memory_utilization': 0.7, 'disable_log_stats': True, 'enforce_eager': True, 'enable_chunked_prefill': False, 'model': 'C:\\Users\\pie\\Desktop\\ComfyUIDesktop\\Mistral\\Mistral-7B-Instruct-v0.3-FP8'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
WARNING 01-20 18:33:25 [engine/arg_utils.py:1181] The global random seed is set to 0. Since VLLM_ENABLE_V1_MULTIPROCESSING is set to False, this may affect the random state of the Python process that launched vLLM.
DEBUG 01-20 18:33:25 [model_executor/models/registry.py:686] Loaded model info for class vllm.model_executor.models.llama.LlamaForCausalLM from cache
DEBUG 01-20 18:33:25 [logging_utils/log_time.py:29] Registry inspect model class: Elapsed time 0.0005593 secs
INFO 01-20 18:33:25 [config/model.py:514] Resolved architecture: MistralForCausalLM
WARNING 01-20 18:33:25 [config/model.py:2005] Casting torch.bfloat16 to torch.float16.
INFO 01-20 18:33:25 [config/model.py:1661] Using max model len 2048
WARNING 01-20 18:33:26 [platforms/interface.py:221] Failed to import from vllm._C: ModuleNotFoundError("No module named 'vllm._C'")
DEBUG 01-20 18:33:26 [utils/flashinfer.py:55] FlashInfer unavailable since package was not found
DEBUG 01-20 18:33:26 [_ipex_ops.py:15] Import error msg: No module named 'intel_extension_for_pytorch'
DEBUG 01-20 18:33:26 [config/model.py:1718] Generative models support chunked prefill.
DEBUG 01-20 18:33:26 [config/model.py:1770] Generative models support prefix caching.
WARNING 01-20 18:33:26 [engine/arg_utils.py:1869] This model does not officially support disabling chunked prefill. Disabling this manually may cause the engine to crash or produce incorrect outputs.
DEBUG 01-20 18:33:26 [engine/arg_utils.py:1968] Defaulting max_num_batched_tokens to 8192 for LLM_CLASS usage context.
DEBUG 01-20 18:33:26 [engine/arg_utils.py:1978] Defaulting max_num_seqs to 256 for LLM_CLASS usage context.
DEBUG 01-20 18:33:26 [config/parallel.py:650] Disabled the custom all-reduce kernel because it is not supported on current platform.
DEBUG 01-20 18:33:26 [config/parallel.py:650] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 01-20 18:33:26 [config/vllm.py:622] Enforce eager set, overriding optimization level to -O0
INFO 01-20 18:33:26 [config/vllm.py:722] Cudagraph is disabled under eager mode
DEBUG 01-20 18:33:26 [tokenizers/registry.py:63] Loading CachedHfTokenizer for tokenizer_mode='hf'
DEBUG 01-20 18:33:26 [plugins/io_processors/__init__.py:33] No IOProcessor plugins requested by the model
INFO 01-20 18:33:26 [v1/engine/core.py:93] Initializing a V1 LLM engine (v0.13.0) with config: model='C:\\Users\\pie\\Desktop\\ComfyUIDesktop\\Mistral\\Mistral-7B-Instruct-v0.3-FP8', speculative_config=None, tokenizer='C:\\Users\\pie\\Desktop\\ComfyUIDesktop\\Mistral\\Mistral-7B-Instruct-v0.3-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=fp8, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=C:\Users\pie\Desktop\ComfyUIDesktop\Mistral\Mistral-7B-Instruct-v0.3-FP8, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
DEBUG 01-20 18:33:26 [distributed/parallel_state.py:1164] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.1.60:60451 backend=nccl
DEBUG 01-20 18:33:26 [distributed/parallel_state.py:1250] Detected 1 nodes in the distributed environment
INFO 01-20 18:33:26 [distributed/parallel_state.py:1414] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
DEBUG 01-20 18:33:26 [compilation/decorators.py:194] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.deepseek_v2.DeepseekV2Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 01-20 18:33:26 [compilation/decorators.py:194] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 01-20 18:33:27 [v1/sample/logits_processor/__init__.py:63] No logitsprocs plugins installed (group vllm.logits_processors).
C:\Users\pie\Desktop\ComfyUIDesktop\.venv\Lib\site-packages\vllm\v1\sample\logits_processor\builtin.py:181: UserWarning: expandable_segments not supported on this platform (Triggered internally at B:\src\torch\c10/hip/HIPAllocatorConfig.h:40.)
self.neg_inf_tensor = torch.tensor(
INFO 01-20 18:33:27 [v1/worker/gpu_model_runner.py:3562] Starting to load model C:\Users\pie\Desktop\ComfyUIDesktop\Mistral\Mistral-7B-Instruct-v0.3-FP8...
INFO 01-20 18:33:27 [platforms/rocm.py:245] Using Triton Attention backend on V1 engine.
DEBUG 01-20 18:33:27 [config/compilation.py:1026] enabled custom ops: Counter({'quant_fp8': 128, 'rms_norm': 65, 'column_parallel_linear': 64, 'row_parallel_linear': 64, 'silu_and_mul': 32, 'vocab_parallel_embedding': 1, 'rotary_embedding': 1, 'apply_rotary_emb': 1, 'parallel_lm_head': 1, 'logits_processor': 1})
DEBUG 01-20 18:33:27 [config/compilation.py:1027] disabled custom ops: Counter()
DEBUG 01-20 18:33:27 [model_executor/model_loader/base_loader.py:53] Loading weights on cuda ...
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:01<00:01, 1.41s/it]
DEBUG 01-20 18:33:29 [model_executor/models/utils.py:220] Loaded weight lm_head.weight with shape torch.Size([32768, 4096])
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.22s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.25s/it]

INFO 01-20 18:33:30 [model_executor/model_loader/default_loader.py:308] Loading weights took 2.64 seconds
INFO 01-20 18:33:31 [v1/worker/gpu_model_runner.py:3659] Model loading took 8.6188 GiB memory and 3.681771 seconds
DEBUG 01-20 18:33:35 [v1/worker/gpu_worker.py:362] Initial free memory: 19.84 GiB; Requested memory: 0.70 (util), 13.99 GiB
DEBUG 01-20 18:33:35 [v1/worker/gpu_worker.py:368] Free memory after profiling: 11.00 GiB (total), 5.15 GiB (within requested)
DEBUG 01-20 18:33:35 [v1/worker/gpu_worker.py:374] Memory profiling takes 3.57 seconds. Total non KV cache memory: 10.13GiB; torch peak memory increase: 1.41GiB; non-torch forward increase memory: 0.10GiB; weights memory: 8.62GiB.
INFO 01-20 18:33:35 [v1/worker/gpu_worker.py:375] Available KV cache memory: 3.86 GiB
INFO 01-20 18:33:35 [v1/core/kv_cache_utils.py:1291] GPU KV cache size: 31,648 tokens
INFO 01-20 18:33:35 [v1/core/kv_cache_utils.py:1296] Maximum concurrency for 2,048 tokens per request: 15.45x
DEBUG 01-20 18:33:35 [v1/worker/gpu_worker.py:516] Free memory on device (19.84/19.98 GiB) on startup. Desired GPU memory utilization is (0.7, 13.99 GiB). Actual usage is 8.62 GiB for weight, 1.41 GiB for peak activation, 0.1 GiB for non-torch memory, and 0.0 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=3991128268` (3.72 GiB) to fit into requested memory, or `--kv-cache-memory=10271915008` (9.57 GiB) to fully utilize gpu memory. Current kv cache memory in use is 3.86 GiB.
INFO 01-20 18:33:35 [v1/engine/core.py:259] init engine (profile, create kv cache, warmup model) took 4.30 seconds
DEBUG 01-20 18:33:35 [utils/gc_utils.py:40] GC Debug Config. enabled:False,top_objects:-1
INFO 01-20 18:33:35 [entrypoints/llm.py:360] Supported tasks: ('generate',)
🔧 [VLLM-NUCLEAR] ============================================================
🔧 [VLLM-NUCLEAR] ✅ MODELLO CARICATO!
🔧 [VLLM-NUCLEAR] ============================================================
🔧 [VLLM-NUCLEAR] ✓ Patched vllm._custom_ops.reshape_and_cache fallback (python)
🔧 [VLLM-NUCLEAR] ✓ Patched vllm._custom_ops.reshape_and_cache_flash fallback (python)
🔧 [VLLM-NUCLEAR] ✓ Patched flex_attention torch proxy (full passthrough + reshape_and_cache_flash)

=== MULTITURN TEST ===

You> Say Hi to Reddit!
Adding requests: 100%|█████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1000.07it/s]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]DEBUG 01-20 18:33:44 [v1/worker/gpu_model_runner.py:3013] Running batch with cudagraph_mode: NONE, batch_descriptor: BatchDescriptor(num_tokens=9, num_reqs=None, uniform=False, has_lora=False), should_ubatch: False, num_tokens_across_dp: None
DEBUG 01-20 18:33:45 [v1/worker/gpu_model_runner.py:3013] Running batch with cudagraph_mode: NONE, batch_descriptor: BatchDescriptor(num_tokens=1, num_reqs=None, uniform=False, has_lora=False), should_ubatch: False, num_tokens_across_dp: None
more of ts.
DEBUG 01-20 18:33:58 [v1/worker/gpu_model_runner.py:3013] Running batch with cudagraph_mode: NONE, batch_descriptor: BatchDescriptor(num_tokens=1, num_reqs=None, uniform=False, has_lora=False), should_ubatch: False, num_tokens_across_dp: None
Processed prompts: 100%|█████████████| 1/1 [00:14<00:00, 14.11s/it, est. speed input: 0.64 toks/s, output: 4.54 toks/s]
[debug] len(out)=198 codepoints=[32, 72, 101, 108, 108, 111, 32, 82, 101, 100]

Assistant> " Hello Reddit! It's great to be here. I'm an AI model and I'm here to help answer your questions, provide information, and engage in discussions. I don't have personal experiences or emotions, but I"
[speed] 48 tokens in 14.12s => 3.40 tok/s
You>

Open Reddit thread

r/LocalLLaMA 489 upvotes 130 comments February 20, 2024

Introducing LoraLand: 25 fine-tuned Mistral-7b models that outperform GPT-4

Hi all! Today, we're very excited to launch LoRA Land: 25 fine-tuned mistral-7b models that outperform #gpt4 on task-specific applications ranging from sentiment detection to question answering.

https://preview.redd.it/m1jhmfdmssjc1.png?width=2390&format=png&auto=webp&s=d5c074dd979248be66aba9e0418432988a85a7b8

All 25 fine-tuned models…

* Outperform GPT-4, GPT-3.5-turbo, and mistral-7b-instruct for specific tasks
* Are cost-effectively served from a single GPU through LoRAX
* Were trained for less than $8 each on average

You can prompt all of the fine-tuned models today and compare their results to mistral-7b-instruct in real time!

Check out LoRA Land: [https://predibase.com/lora-land?utm\_medium=social&utm\_source=reddit](https://predibase.com/lora-land?utm_medium=social&utm_source=reddit) or our launch blog: [https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4](https://predibase.com/blog/lora-land-fine-tuned-open-source-llms-that-outperform-gpt-4)

If you have any comments or feedback, we're all ears!



Open Reddit thread

r/LocalLLaMA 172 upvotes 83 comments September 27, 2023

LLM Chat/RP Comparison/Test: Mistral 7B Base + Instruct

Here's another LLM Chat/RP comparison/test of mine featuring today's newly released **[Mistral](https://twitter.com/MistralAI/status/1706877320844509405)** models! As usual, I've evaluated these models for their chat and role-playing performance using the same methodology:

- Same (complicated and limit-testing) long-form conversations with all models
- including a complex character card ([MonGirl Help Clinic (NSFW)](https://www.chub.ai/characters/frozenvan/mongirl-help-clinic)), "MGHC", chosen specifically for these reasons:
- NSFW (to test censorship of the models)
- popular (on Chub's first page, so it's not an obscure scenario, but one of the most popular ones)
- big (biggest model on the page, >2K tokens by itself, for testing model behavior at full context)
- complex (more than a simple 1:1 chat, it includes instructions, formatting, storytelling, and multiple characters)
- and my own repeatable test chats/roleplays with [Amy](https://www.reddit.com/r/LocalLLaMA/comments/15388d6/llama_2_pffft_boundaries_ethics_dont_be_silly/)
- over dozens of messages, going to full 4K context and beyond, noting especially good or bad responses
- [SillyTavern](https://github.com/SillyTavern/SillyTavern) v1.10.4 frontend
- [KoboldCpp](https://github.com/LostRuins/koboldcpp) v1.44.2 backend
- **Deterministic** generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
- [**Roleplay** instruct mode preset](https://imgur.com/a/KkoI4uf) *and where applicable* official prompt format (if it might make a notable difference)

Mistral seems to be trained on 32K context, but KoboldCpp doesn't go that high yet, and I only tested 4K context so far:

- **[Mistral-7B-Instruct-v0.1](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF)** (Q8_0)
- Amy, Roleplay: When asked about limits, didn't talk about ethics, instead mentioned sensible human-like limits, then asked me about mine. Executed complex instructions flawlessly. Switched from speech with asterisk actions to actions with literal speech. Extreme repetition after 20 messages (prompt 2690 tokens, going back to message 7), completely breaking the chat.
- Amy, official Instruct format: When asked about limits, mentioned (among other things) racism, homophobia, transphobia, and other forms of discrimination. Got confused about who's who again and again. Repetition after 24 messages (prompt 3590 tokens, going back to message 5).
- MGHC, official Instruct format: First patient is the exact same as in the example. Wrote what User said and did. Repeated full analysis after every message. Repetition after 23 messages. Little detail, fast-forwarding through scenes.
- MGHC, Roleplay: Had to ask for analysis. Only narrator, not in-character. Little detail, fast-forwarding through scenes. Wasn't fun that way, so I aborted early.
- **[Mistral-7B-v0.1](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF)** (Q8_0)
- MGHC, Roleplay: Gave analysis on its own. Wrote what User said and did. Repeated full analysis after every message. Second patient same type as first, and suddenly switched back to the first, because of confusion or repetition. After a dozen messages, switched to narrator, not in-character anymore. Little detail, fast-forwarding through scenes.
- Amy, Roleplay: No limits. Nonsense and repetition after 16 messages. Became unusable at 24 messages.

**Conclusion:**

This is an important model, since it's not another fine-tune, this is a new base. It's only 7B, a size I usually don't touch at all, so I can't really compare it to other 7Bs. But I've evaluated lots of 13Bs and up, and this model seems really smart, at least on par with 13Bs and possibly even higher.

But damn, repetition is ruining it again, [just like Llama 2](https://www.reddit.com/r/LocalLLaMA/comments/155vy0k/llama_2_too_repetitive/)! As it not only affects the Instruct model, but also the base itself, it can't be caused by the prompt format. I really hope there'll be a fix for this showstopper issue.

However, even if it's only 7B and suffers from repetition issues, it's a promise of better things to come: Imagine if they release a real 34B with the quality of a 70B, with the same 32K native context of this one! Especially when that becomes the new base for outstanding fine-tunes like Xwin, Synthia, or Hermes. Really hope this happens sooner than later.

Until then, I'll stick with Mythalion-13B or continue experimenting with MXLewd-L2-20B when I look for fast responses. For utmost quality, I'll keep using Xwin, Synthia, or Hermes in 70B.

--------------------------------------------------------------------------------

**Update 2023-10-03:**

I'm revising my review of Mistral 7B OpenOrca after it has received an update that fixed its glaring issues, which affects the "ranking" of Synthia 7B v1.3, and I've also reviewed the new dolphin-2.0-mistral-7B, so it's sensible to give these Mistral-based models their own post:

[LLM Chat/RP Comparison/Test: Dolphin-Mistral, Mistral-OpenOrca, Synthia 7B](https://www.reddit.com/r/LocalLLaMA/comments/16z3goq/llm_chatrp_comparisontest_dolphinmistral/)

--------------------------------------------------------------------------------

Here's a list of my previous model tests and comparisons:

- [LLM Chat/RP Comparison/Test (Euryale, FashionGPT, MXLewd, Synthia, Xwin)](https://www.reddit.com/r/LocalLLaMA/comments/16r7ol2/llm_chatrp_comparisontest_euryale_fashiongpt/) Winner: Xwin-LM-70B-V0.1
- [New Model Comparison/Test (Part 2 of 2: 7 models tested, 70B+180B)](https://www.reddit.com/r/LocalLLaMA/comments/16l8enh/new_model_comparisontest_part_2_of_2_7_models/) Winners: Nous-Hermes-Llama2-70B, Synthia-70B-v1.2b
- [New Model Comparison/Test (Part 1 of 2: 15 models tested, 13B+34B)](https://www.reddit.com/r/LocalLLaMA/comments/16kecsf/new_model_comparisontest_part_1_of_2_15_models/) Winner: Mythalion-13B
- [New Model RP Comparison/Test (7 models tested)](https://www.reddit.com/r/LocalLLaMA/comments/15ogc60/new_model_rp_comparisontest_7_models_tested/) Winners: MythoMax-L2-13B, vicuna-13B-v1.5-16K
- [Big Model Comparison/Test (13 models tested)](https://www.reddit.com/r/LocalLLaMA/comments/15lihmq/big_model_comparisontest_13_models_tested/) Winner: Nous-Hermes-Llama2

Open Reddit thread

r/LocalLLaMA 231 upvotes 56 comments October 15, 2023

🐺🐦‍⬛ Mistral LLM Comparison/Test: Instruct, OpenOrca, Dolphin, Zephyr and more...

**Wolfram's Mistral LLM Comparison/Test: Instruct, OpenOrca, Dolphin, Zephyr and more...**

With the Mistral hype still going strong, I wanted to evaluate these promising 7B models some more. And there's also the lingering question how much quantization affects quality. Plus, there have been multiple German models released, and since one of my tests is in German, I'm curious how they handle that compared to the mainly English language models.

So let me try to answer the following questions with this post:

- Which Mistral variant is best?
- How does quantization affect it?
- Which *German* Mistral variant is best?

**Testing methodology:**

- Same (complicated and limit-testing) long-form conversations with all models
- German data protection training:
- The test data and questions as well as all instructions were in German while the character card is in English. This tests translation capabilities and cross-language understanding.
- Before giving the information, I instructed the model: *I'll give you some information. Take note of this, but only answer with "OK" as confirmation of your acknowledgment, nothing else.* This tests instruction understanding and following capabilities.
- After giving all the information about a topic, I give the model the exam question. It's always a multiple choice (A/B/C) question, where the last one is the same as the first but with changed order and letters (X/Y/Z).
- MGHC:
- A complex character and scenario card ([MonGirl Help Clinic (NSFW)](https://www.chub.ai/characters/frozenvan/mongirl-help-clinic)), chosen specifically for these reasons:
- NSFW (to test censorship of the models)
- popular (on Chub's first page, so it's not an obscure scenario, but one of the most popular ones)
- big (biggest model on the page, >2K tokens by itself, for testing model behavior at full context)
- complex (more than a simple 1:1 chat, it includes instructions, formatting, storytelling, and multiple characters)
- Amy:
- My own repeatable test chats/roleplays with [Amy](https://www.reddit.com/r/LocalLLaMA/comments/15388d6/llama_2_pffft_boundaries_ethics_dont_be_silly/)
- Over dozens of messages, going to full 8K context and beyond, with complex instructions and scenes, designed to test ethical and intellectual limits
- [SillyTavern](https://github.com/SillyTavern/SillyTavern) v1.10.5 frontend
- [oobabooga's text-generation-webui](https://github.com/oobabooga/text-generation-webui) v1.7 backend
- Yes, I'm not using my usual KoboldCpp for this test, since I use the original unquantized models!
- **Deterministic** generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
- Official prompt format *and* [**Roleplay** instruct mode preset](https://imgur.com/a/KkoI4uf)

**Which Mistral variant is best?**

- **[Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)**
- 👍 German data protection training
- official Mistral format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to ALL (4/4) multiple choice questions!
- Responded properly to thanks, but switched to English.
- ❌ MGHC
- official Mistral format:
- First patient straight from examples.
- Had to ask for analysis. Repeated first message before giving analysis.
- Immediately derails with repetition. UNUSABLE!
- Roleplay instruct mode preset:
- Deviated from the formula and rules, writing a completed short story instead of an interactive scenario. UNUSABLE!
- ❌ Amy
- official Mistral format:
- Mentioned boundaries, but later didn't hesitate to go beyond those anyway.
- Didn't adhere to the character background completely.
- Later got confused about who's who and anatomical details.
- After ~30 messages, fell into a repetition loop.
- Roleplay instruct mode preset:
- Showed personality and wrote extremely well, much better than I'd expect from a 7B or even 13B.
- But suffered from severe repetition (even within the same message) after ~15 messages.
- Frustrating to see such excellent writing ruined by the extreme repetition.
- **Conclusion:**
- Best instruction following and understanding/reasoning, solved the data protection exam perfectly.
- But no good for roleplay because of severe repetition issues.
- **[Mistral-7B-OpenOrca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca)**
- ❌ German data protection training
- official ChatML format:
- Failed to consistently acknowledge all data input with "OK".
- Gave correct answer to only 1/4 multiple choice questions.
- Responded properly to thanks, but German was really bad ("Du willkommen! Es freut mich, dich zu helfen!").
- ❌ MGHC
- official ChatML format:
- First patient unique. Gave analysis on its own for first patient. Repeated "[Payment]" with each message. Wrapped it up with "[End Scenario]" at the right time.
- Second patient unique, too. Had to ask for analysis, which included empty "[End Scenario]". Repeated "[Payment]" and "[End Scenario]" with each message.
- Repetition is a glaring issue, but at least this model handled MGHC better than many other 7Bs (ultimately still unusable, though).
- 👍 Amy
- official ChatML format:
- Writing sometimes of high quality, sometimes very low ("rubbing his shoulders gently while keeping her distance due to social distancing rules")
- Mentioned boundaries, but later didn't hesitate to go beyond those anyway.
- Later got confused about who's who and anatomical details.
- Roleplay instruct mode preset:
- Excellent writing, nice emoting, less repetition. Worked very well!
- **Conclusion:**
- Surprisingly bad results regarding instruction following, understanding, and reasoning in the exam scenario.
- But great writing and roleplaying (especially with Roleplay preset).
- Showed an actual sense of humor and made a memorable pun.
- **[dolphin-2.1-mistral-7b](https://huggingface.co/ehartford/dolphin-2.1-mistral-7b)**
- ❌ German data protection training
- official ChatML format:
- Failed to consistently acknowledge all data input with "OK".
- Gave correct answer to 2/4 multiple choice questions (and didn't obey when asked to answer with just a single letter).
- Responded properly to thanks, but switched to English.
- ❌ MGHC
- official ChatML format:
- First patient unique. Gave analysis on its own. Repeated analysis with each message.
- Second patient unique, too. Gave analysis on its own. Wrapped up the whole session in a single message.
- Third patient unique as well, but situation logically incoherent. Gave analysis on its own. Wrapped up the whole session in a single message.
- 👍 Amy
- official ChatML format:
- No boundaries ("That's why they call me the Uncensored One.").
- Excellent and long writing, nice emoting, less repetition. More storytelling than interactive fiction, with some very long messages (>1K tokens). But didn't fully grasp what was going on, i. e. while the writing was top notch, the scene itself wasn't exactly as envisioned.
- Later got confused about who's who and anatomical details.
- Roleplay instruct mode preset:
- Worked very well! First model ever to explicitly list the dislikes as stated on the character card as its only boundaries.
- Excellent and long writing, nice emoting, less repetition.
- Some confusion about who's who and anatomical details.
- **Conclusion:**
- Having tested the previous version in GGUF format, which was a letdown, this newer and unquantized version is so much better!
- Seemed more intelligent than the other models I tested this time.
- However, showing off high intelligence isn't necessarily always a good thing (especially for roleplay) as sometimes it does get a bit too technical or realistic (like I always say, the smartest person isn't always the most fun to hang out with).
- **[zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)**
- German data protection training
- ❌ official Zephyr format:
- Failed to consistently acknowledge all data input with "OK".
- Gave correct answers to 2/4 multiple choice questions.
- After being told to answer with a single letter, even responded like that to thanks.
- 👍 ChatML format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to ALL (4/4) multiple choice questions!
- Also said "OK" to summary but responded properly to thanks.
- 👍 MGHC
- Zephyr format:
- First patient unique. Gave analysis on its own. Repeated analysis with each message.
- Second patient male.
- Third patient unique, too. Gave analysis on its own. Repeated analysis with each message.
- Showed some signs of repetition, but handled this complex scenario better than the other models I tested this time. Still very far from what bigger models produce, but currently the best a 7B has ever achieved in this test.
- ❌ Amy
- official Zephyr format:
- Short, formal responses, uncommon emote format (in brackets).
- Said "no boundaries" but later hesitated and asked for confirmation multiple times.
- No fun, too technical, too aligned.
- ChatML format:
- After ~15 messages, derailed with repetition of long bandworm sentences mixed with emotes. Interrupted the message after 2K tokens and aborted the test.
- Roleplay instruct mode preset:
- Much better responses and no hesitation or derailing repetition (but still not as good as the Dolphin and OpenOrca variants).
- Some confusion about who's who and anatomical details.
- **Conclusion:**
- Unexpected discovery: ChatML format worked much better than the official Zephyr format for this model!
- With ChatML format used, it beat most of the other models tested this time in the exam scenario.
- However, its writing was worse than that of the other models tested this time, no matter which format was used.

So which Mistral variant is the best? As you can see, each one has strengths and weaknesses, and none could convince me completely.

If you're looking for an instruct model for professional use, especially when asking it to give a single response to a question/task, the original Mistral 7B Instruct or Zephyr 7B Alpha (with ChatML prompt format) seem to be your best bets.

If you're looking for a model that roleplays well, the OpenOrca and Dolphin variants are more suitable and punch above their 7B weight with their excellent writing.

**How does quantization affect it?**

To find out how quantization affects these models, I'll stick to the data protection exam since it can be judged objectively. The other tests involve writing and it's subjective how well written a text appears to you. So I'll test each quant and see how many correct answers the model (which answered all correctly in unquantized form) still gets.

- **[Mistral-7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF)**
- ❌ Q2_K:
- Gave correct answers to 2/4 multiple choice questions.
- When asked to answer with more than just a single letter, produced nonsensical output ("C123456789012345678901234567890...").
- ❌ Q3_K_S:
- Gave correct answers to 2/4 multiple choice questions.
- When asked to answer with more than just a single letter, didn't comply.
- ❌ Q3_K_M:
- Gave correct answers to ALL (4/4) multiple choice questions.
- When asked to answer with more than just a single letter, didn't comply.
- ❌ Q3_K_L:
- Gave correct answers to 3/4 multiple choice questions.
- When asked to answer with more than just a single letter, repeated the previous information message instead of answering the question!
- 👍 Q4_0, Q4_K_S, Q4_K_M, Q5_0, Q5_K_S, Q5_K_M, Q6_K, Q8_0:
- Gave correct answers to ALL (4/4) multiple choice questions.
- When asked to answer with more than just a single letter, explained its reasoning properly.

The answer is very clear, Q4_0 and above gave perfect results just like the unquantized version. Of course that doesn't mean Q4_0 is as good as Q8_0 or the unquantized orginal, but we see here that all lower quants (Q2 + Q3) had issues so I'd not recommend those (at least not for Mistral-based 7B models).

**Which German Mistral variant is best?**

There have been a bunch of German model releases recently, many based on Mistral, so I'll take a look at those as well - from 3B to 70B! Let's find out if they beat the ones I tested above since the data protection training used in these tests is in German so they should theoretically have an advantage:

- ❌ **[em_german_leo_mistral](https://huggingface.co/jphme/em_german_leo_mistral)**
- Official USER/ASSISTANT prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 1/4 multiple choice questions and didn't answer the last one (a repeat of the first) at all.
- Also kept saying "OK" to summary and thanks instead of properly responding to those.
- ChatML prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions but didn't answer the last one (a repeat of the first) properly.
- Also said "OK" to summary but responded properly to thanks.
- ❌ **[em_german_mistral_v01](https://huggingface.co/jphme/em_german_mistral_v01)**
- Official USER/ASSISTANT prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions (but didn't obey when asked to answer with more than just a letter).
- Also said "OK" to summary but responded properly to thanks (but misspelled my name).
- ChatML prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 2/4 multiple choice questions, got 1st and 4th question (actually the same one) wrong and explained its (wrong) reasoning.
- Also said "OK" to summary but responded properly to thanks.
- ❌ **[em_german_70b_v01-GGUF](https://huggingface.co/TheBloke/em_german_70b_v01-GGUF)**
- ChatML prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 2/4 multiple choice questions, got 1st and 4th question (actually the same one) wrong.
- Also said "OK" to summary but responded properly to thanks.
- Official USER/ASSISTANT prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions (answered first question wrongly, but when asked again as final question, answered correctly).
- Also said "OK" to summary but responded properly to thanks.
- ❌ **[leo-mistral-hessianai-7b-chat](https://huggingface.co/LeoLM/leo-mistral-hessianai-7b-chat)**
- ChatML prompt format:
- Failed to consistently acknowledge all data input with "OK".
- Failed to answer. Seemed to not understand or follow instructions.
- ❌ **[Mistral-7B-german-assistant-v2](https://huggingface.co/flozi00/Mistral-7B-german-assistant-v2)**
- Official Alpaca prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions but didn't answer the last one (a repeat of the first) properly.
- When asked to answer with more than just a single letter, didn't comply.
- ❌ **[SauerkrautLM-3b-v1](https://huggingface.co/VAGOsolutions/SauerkrautLM-3b-v1)**
- Tried various prompt formats (official User:/Assistant: one, ChatML, Vicuna, WizardLM) but never got good responses for long.
- 3B seems unusable. Stupid and it's German is not good at all.
- ❌ **[SauerkrautLM-7b-v1](https://huggingface.co/VAGOsolutions/SauerkrautLM-7b-v1)**
- Official User/Assistant prompt format: Kept saying "OK" even to the question and when asked to answer.
- ChatML format: Didn't acknowledge data input with "OK". Gave wrong answer.
- ❌ **[SauerkrautLM-13b-v1](https://huggingface.co/VAGOsolutions/SauerkrautLM-13b-v1)**
- Official User/Assistant prompt format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions (but didn't obey when asked to answer with more than just a letter).
- Also kept saying "OK" to summary and thanks instead of properly responding to those.
- ChatML format:
- Failed to consistently acknowledge all data input with "OK".
- Gave correct answers to all multiple choice questions (but answer the last one correctly only after being asked to answer with just a single letter).
- Summarized summary and responded properly to thanks.
- ❌ **[SauerkrautLM-7b-v1-mistral](https://huggingface.co/VAGOsolutions/SauerkrautLM-7b-v1-mistral)**
- Official User/Assistant prompt format: Kept saying "OK" even to the question and when asked to answer.
- ChatML format:
- Consistently acknowledged all data input with "OK".
- Gave correct answers to 3/4 multiple choice questions (answered first question wrongly, but when asked again as final question, answered correctly).
- Also said "OK" to summary but responded properly to thanks (but misspelled my name).

Ironically none of the German models managed to successfully complete the German exam! Not even the 70B, which was beat by a 7B (Mistral Instruct).

Did the German finetuning reduce their capabilities? I've always been of the opinion that specialized models won't be as good as generalists because - like with our human brains - there are so many obscure connections between neurons that it's not as easy as leaving out unrelated information to get better at a specific topic (yes, Japanese poetry and Chinese cooking recipes could very well improve our Python coding models).

That's why I believe that a model trained on multiple languages will be better at each language than one specialized in just one language. So to make a model better at one language, it should be trained/finetuned with that in addition to everything else, not instead of it.

At least that's my theory. Which so far seems to be confirmed by these findings.

**TL;DR:**

- Despite the hype, Mistral models aren't perfect, they're still 7B. But for that size, they're really very good.
- Among Mistral models, there's not one clear winner yet that's *the* best. For professional use, Mistral 7B Instruct or Zephyr 7B Alpha (with ChatML prompt format) did best in my tests. For roleplay, Mistral-based OpenOrca and Dolphin variants worked the best and produced excellent writing.
- Prompt format makes a huge difference but the "official" template may not always be the best. It's high time we find and follow some best practice instead of reinventing the wheel all the time (which leads to a bumpy ride).
- Don't go below Q4_0 quantization when using Mistral-based 7B models. Anything lower will lobotomize small model brains too much.
- Kinda ironic that the English models worked better with the German data and exam than the ones finetuned in German. Looks like language doesn't matter as much as general intelligence and a more intelligent model can cope with different languages more easily. German-specific models need better tuning to compete in general and excel in German.

--------------------------------------------------------------------------------

Here's a list of my previous model tests and comparisons:

- [LLM Pro/Serious Use Comparison/Test: From 7B to 70B vs. ChatGPT!](https://www.reddit.com/r/LocalLLaMA/comments/172ai2j/llm_proserious_use_comparisontest_from_7b_to_70b/) Winner: Synthia-70B-v1.2b
- [LLM Chat/RP Comparison/Test: Dolphin-Mistral, Mistral-OpenOrca, Synthia 7B](https://www.reddit.com/r/LocalLLaMA/comments/16z3goq/llm_chatrp_comparisontest_dolphinmistral/) Winner: Mistral-7B-OpenOrca
- [LLM Chat/RP Comparison/Test: Mistral 7B Base + Instruct](https://www.reddit.com/r/LocalLLaMA/comments/16twtfn/llm_chatrp_comparisontest_mistral_7b_base_instruct/)
- [LLM Chat/RP Comparison/Test (Euryale, FashionGPT, MXLewd, Synthia, Xwin)](https://www.reddit.com/r/LocalLLaMA/comments/16r7ol2/llm_chatrp_comparisontest_euryale_fashiongpt/) Winner: Xwin-LM-70B-V0.1
- [New Model Comparison/Test (Part 2 of 2: 7 models tested, 70B+180B)](https://www.reddit.com/r/LocalLLaMA/comments/16l8enh/new_model_comparisontest_part_2_of_2_7_models/) Winners: Nous-Hermes-Llama2-70B, Synthia-70B-v1.2b
- [New Model Comparison/Test (Part 1 of 2: 15 models tested, 13B+34B)](https://www.reddit.com/r/LocalLLaMA/comments/16kecsf/new_model_comparisontest_part_1_of_2_15_models/) Winner: Mythalion-13B
- [New Model RP Comparison/Test (7 models tested)](https://www.reddit.com/r/LocalLLaMA/comments/15ogc60/new_model_rp_comparisontest_7_models_tested/) Winners: MythoMax-L2-13B, vicuna-13B-v1.5-16K
- [Big Model Comparison/Test (13 models tested)](https://www.reddit.com/r/LocalLLaMA/comments/15lihmq/big_model_comparisontest_13_models_tested/) Winner: Nous-Hermes-Llama2
- [SillyTavern's Roleplay preset vs. model-specific prompt format](https://www.reddit.com/r/LocalLLaMA/comments/15mu7um/sillytaverns_roleplay_preset_vs_modelspecific/)

Open Reddit thread

r/LocalLLaMA 131 upvotes 70 comments October 28, 2023

Mistral 7b might be pretrained to ace evals and it doesn’t have complex emergent language skills.

>Evidence of cramming for the leaderboard. Hugging Face’s Open LLM leaderboard (Beeching et al., 2023), which is based upon EleutherAI’s evaluation harness (Gao et al., 2021), is seen as a proving ground for open LLMs. Many models currently at the top of the leaderboard are LLaMA- 2 derivatives, and are ranked much higher than the corresponding LLaMA-2 model. However, on SKILL-MIX these models perform poorly and worse than LLaMA-2-70B-Chat, suggestive of cramming that significantly harmed general-purpose text skills (see Section 5). The recent Falcon- 180B-Chat (Almazrouei et al., 2023) also places higher on the leaderboard than LLaMA-2-70B-Chat, and has been claimed to have capabilities between GPT-3.5-turbo and GPT-4 based upon this ranking. Yet, it fares worse than LLaMA-2-70B-Chat on SKILL-MIX. Mistral-7B-Instruct-v0.1 also did not live up to claims of being significantly better than the corresponding LLaMA model.

Open Reddit thread

View more discussions →

More models from Mistral

Continue browsing adjacent models from the same provider.

← All AI Models

Mistral 7B Instruct Deprecated

Model Overview

Provider

Input Context Window

Maximum Output Tokens

Open Source

Release Date

Knowledge Cut-off Date

API Providers

Modalities

What is Mistral 7B Instruct Deprecated

What Mistral 7B Instruct Deprecated supports

Multimodal I/O

Pricing for Mistral 7B Instruct Deprecated

Price Comparison

API Access & Providers

Resources & Documentation

AI tools related to Mistral 7B Instruct Deprecated

Mistral 7B

Related Daily Briefs

Mistral and OpenAI Signal a Broader Shift Around Costs Using PNGs

Google DeepMind, Mistral, and Anthropic Signal a Broader Shift Around MiniMax M3

What people think about Mistral 7B Instruct Deprecated

More models from Mistral