VRAM Requirements for Popular LLMs
Complete reference for VRAM needs across Llama 3, Mistral, Mixtral, Qwen, and other popular models at various quantization levels.
1. Understanding VRAM Requirements
VRAM usage for LLMs depends on three factors: parameter count, quantization level, and context length (KV cache). This guide lists baseline requirements assuming a 4K context. Longer contexts add roughly 1-2GB per additional 4K tokens for a 7B model, and proportionally more for larger models.
Approximate formula: VRAM (GB) ≈ parameters (billions) × bits per weight / 8 + KV cache + runtime overhead
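As a quick sanity check, here is a minimal Python sketch of this formula. The effective bits-per-weight values, the KV-cache scaling, and the overhead term are illustrative assumptions rather than measured values, so expect a few GB of deviation from the tables below.

```python
# Rough VRAM estimator for dense models, following the formula above.
# All constants here are illustrative assumptions, not measured values.

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8": 8.5,       # Q8_0 stores per-block scales, so slightly above 8 bits
    "Q4_K_M": 4.85,  # mixed 4/6-bit blocks average out near 4.8-4.9 bits
}

def estimate_vram_gb(params_b: float, quant: str = "Q4_K_M",
                     context_tokens: int = 4096,
                     overhead_gb: float = 1.5) -> float:
    """Estimate total VRAM in GB for a dense model.

    params_b       -- parameter count in billions
    quant          -- a key in BITS_PER_WEIGHT
    context_tokens -- context length used for the KV-cache term
    overhead_gb    -- CUDA context, activation buffers, runtime overhead
    """
    weights_gb = params_b * BITS_PER_WEIGHT[quant] / 8
    # Very rough KV-cache term: ~1.5GB per 4K tokens for a 7B model,
    # scaling roughly linearly with model size (per the note above).
    kv_cache_gb = 1.5 * (context_tokens / 4096) * (params_b / 7)
    return weights_gb + kv_cache_gb + overhead_gb

if __name__ == "__main__":
    for name, params in [("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
        for quant in ("FP16", "Q8", "Q4_K_M"):
            print(f"{name} @ {quant}: ~{estimate_vram_gb(params, quant):.1f} GB")
```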
2. Llama 3 Family
Llama 3 8B: FP16: 16GB | Q8: 9GB | Q4_K_M: 5GB
Llama 3 70B: FP16: 140GB | Q8: 75GB | Q4_K_M: 40GB
Llama 3.1 405B: FP16: 810GB | Q8: 430GB | Q4_K_M: 230GB (multi-node required)
3. Mistral & Mixtral
Mistral 7B: FP16: 14GB | Q8: 8GB | Q4_K_M: 4.5GB
Mixtral 8x7B: FP16: 90GB | Q8: 48GB | Q4_K_M: 26GB (MoE architecture; see the note after this list)
Mixtral 8x22B: FP16: 280GB | Q8: 150GB | Q4_K_M: 80GB
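A note on MoE sizing, with a minimal sketch below: although Mixtral routes each token through only 2 of 8 experts (roughly 13B active parameters for 8x7B and 39B for 8x22B), every expert must stay resident in VRAM, so memory scales with the total parameter count (~47B and ~141B respectively). The bits-per-weight value is the same assumption used in the estimator above.

```python
# MoE models: VRAM scales with TOTAL parameters (all experts resident),
# not with the parameters active per token. Figures are approximate.
BITS_Q4_K_M = 4.85  # assumed effective bits per weight, as in the estimator above

MOE_MODELS = {
    # name: (total params in billions, active params per token in billions)
    "Mixtral 8x7B": (46.7, 12.9),
    "Mixtral 8x22B": (141.0, 39.0),
}

for name, (total_b, active_b) in MOE_MODELS.items():
    weights_gb = total_b * BITS_Q4_K_M / 8
    print(f"{name}: ~{weights_gb:.0f} GB of weights at Q4_K_M "
          f"(only ~{active_b:.0f}B parameters active per token)")
```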
4. Qwen Family
Qwen 2 7B: FP16: 14GB | Q8: 8GB | Q4_K_M: 4.5GB
Qwen 2 72B: FP16: 145GB | Q8: 77GB | Q4_K_M: 42GB
Qwen 2.5 Coder 32B: FP16: 64GB | Q8: 34GB | Q4_K_M: 19GB
5. Other Popular Models
Phi-3 Mini (3.8B): FP16: 8GB | Q4: 2.5GB (great for limited hardware)
CodeLlama 34B: FP16: 68GB | Q4_K_M: 20GB
DeepSeek Coder 33B: FP16: 66GB | Q4_K_M: 19GB
Yi 34B: FP16: 68GB | Q4_K_M: 20GB
Command R+ (104B): FP16: 208GB | Q4_K_M: 60GB
6. GPU Recommendations by Model Size
7B models: RTX 3060 12GB, RTX 4060 Ti 16GB, or any 8GB+ GPU
13B models: RTX 3090/4090 (24GB) or RTX 4080 (16GB with Q4)
34B models: RTX 4090 (24GB) with aggressive quantization, or dual GPUs
70B models: Dual RTX 4090 (2×24GB), 48GB+ workstation cards (RTX A6000 / RTX 6000 Ada), or A100/H100 80GB; a single 40GB card requires dropping below 4-bit quantization
100B+ models: Multi-GPU setups, A100 80GB, H100, or cloud instances
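To turn these recommendations into a quick check, here is a minimal sketch that reports which quantization levels of a given model size fit on a given card, keeping a few GB of headroom for the KV cache and runtime overhead. The bits-per-weight values and the headroom figure are the same assumptions as above.

```python
# Quick fit check: which quantization of a model fits on a given GPU?
# Bits-per-weight values and headroom are assumptions, not measurements.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8": 8.5, "Q4_K_M": 4.85}

def fits(params_b: float, vram_gb: float, headroom_gb: float = 3.0) -> list[str]:
    """Return the quantization levels whose weights fit in vram_gb,
    keeping headroom_gb free for KV cache and runtime overhead."""
    usable = vram_gb - headroom_gb
    return [q for q, bits in BITS_PER_WEIGHT.items()
            if params_b * bits / 8 <= usable]

# Example: a 24GB card (RTX 3090/4090) against a few model sizes.
for params in (8, 34, 70):
    print(f"{params}B on a 24GB card: "
          f"{fits(params, 24) or 'needs multi-GPU or sub-4-bit quantization'}")
```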