VRAM Calculator for Local LLMs
Estimate the VRAM needed for model weights plus KV cache. The KV cache is the part most calculators skip, and the reason a model that loads fine at low context can still run out of memory once you ask it something long.
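Find the layer count, KV head count, and head dimension in the model's config.json on Hugging Face
(num_hidden_layers, num_key_value_heads, head_dim). If head_dim is not listed, it is usually hidden_size ÷ num_attention_heads.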
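As a rough illustration of pulling those three values out of a config.json, here is a minimal Python sketch. The function name `kv_shape_from_config` and the sample config are hypothetical, and the two fallbacks (all heads used as KV heads, head dimension derived from hidden_size) are assumptions that hold for many but not all architectures.

```python
import json

def kv_shape_from_config(config: dict) -> tuple:
    """Return (layers, kv_heads, head_dim) from a parsed config.json dict."""
    layers = config["num_hidden_layers"]
    attn_heads = config.get("num_attention_heads")
    # Assumption: without num_key_value_heads, every attention head has its own K/V.
    kv_heads = config.get("num_key_value_heads") or attn_heads
    # Assumption: without head_dim, it equals hidden_size / num_attention_heads.
    head_dim = config.get("head_dim") or config["hidden_size"] // attn_heads
    return layers, kv_heads, head_dim

# Hypothetical config fragment, not copied from any specific model.
sample = json.loads("""
{"num_hidden_layers": 32, "num_attention_heads": 32,
 "num_key_value_heads": 8, "hidden_size": 4096}
""")
print(kv_shape_from_config(sample))  # (32, 8, 128)
```

When a config does list head_dim explicitly, that value takes priority over the derived one.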
Estimated VRAM usage
- Model weights
- KV cache
- Overhead (CUDA/framework)
- Total
- Max context length that fits your VRAM
How this is calculated
Total VRAM is the sum of three components, calculated separately rather than estimated as a single multiplier:
Model weights
Parameter count × effective bytes-per-parameter for the chosen quantization. GGUF K-quants use mixed precision and block-wise scale factors, so this isn't a flat bit-count division — the values used here are calibrated against real GGUF file sizes.
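To make the weights term concrete, here is a small Python sketch. It is not the calculator's actual lookup table: the only bytes-per-parameter value it uses is Q8_0, derived from that format's block layout (32 int8 weights plus one 2-byte scale), and the 14.8 billion parameter count is an assumed figure for a 14B-class model. Sizes are in decimal GB.

```python
def q8_0_bytes_per_param() -> float:
    # Q8_0 stores 32 int8 weights plus one 2-byte FP16 scale per block,
    # so each block of 32 weights occupies 34 bytes.
    return (32 * 1 + 2) / 32

def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight size in decimal GB: parameter count x effective bytes per parameter."""
    return n_params * bytes_per_param / 1e9

# Assumed parameter count for illustration: 14.8 billion.
print(round(weights_gb(14.8e9, q8_0_bytes_per_param()), 1))  # 15.7
```

For K-quants the per-parameter figure cannot be derived this simply, which is why dividing a real GGUF file size by the parameter count is the more reliable way to obtain it.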
KV cache
2 (one tensor each for keys and values) × layers × KV heads × head dimension × context length × 2 bytes (FP16) × batch size. This is the only term that grows with context length, and it is usually why a model that loads fine at short context runs out of memory on long conversations.
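The formula translates directly into code. The sketch below is one way to write it in Python; the function name and the example shape (40 layers, 8 KV heads, head dimension 128) are illustrative assumptions rather than values taken from the calculator.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """KV cache size in bytes; the leading 2 covers the key and value tensors."""
    return (2 * layers * kv_heads * head_dim
            * context_len * bytes_per_value * batch_size)

# Illustrative shape, FP16 cache, 8192-token context, batch size 1.
size = kv_cache_bytes(layers=40, kv_heads=8, head_dim=128, context_len=8192)
print(f"{size / 1e9:.2f} GB")  # 1.34 GB
```

Note that the count uses KV heads, not attention heads, so models with grouped-query attention need far less cache than their attention head count suggests.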
Overhead
A fixed ~0.6GB buffer for CUDA context initialization and inference framework overhead. Actual overhead varies slightly by backend (llama.cpp, vLLM, etc.).
Verified against a real Qwen3-14B Q8_0 load on an RTX 3060 12GB: predicted weight size alone (≈15.7GB) correctly exceeds 12GB, matching the observed out-of-memory result, while Q4_K_M (≈9.2GB) correctly fits, leaving modest headroom for context.
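The "max context that fits" figure can be obtained by inverting the same sum: subtract weights and overhead from the available VRAM, then divide what remains by the KV cache cost per token. The Python sketch below assumes decimal GB throughout, the fixed 0.6GB overhead described above, and an illustrative model shape (40 layers, 8 KV heads, head dimension 128); `max_context` is a hypothetical name, not the calculator's own function.

```python
def max_context(vram_gb: float, weights_gb: float,
                layers: int, kv_heads: int, head_dim: int,
                batch_size: int = 1, bytes_per_value: int = 2,
                overhead_gb: float = 0.6) -> int:
    """Largest context length whose KV cache fits in the leftover VRAM."""
    budget_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    if budget_bytes <= 0:
        return 0  # weights plus overhead already exceed the card
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value * batch_size
    return int(budget_bytes // per_token)

# 12GB card with the two weight sizes quoted above; model shape is assumed.
shape = dict(layers=40, kv_heads=8, head_dim=128)
print(max_context(12, 9.2, **shape))   # Q4_K_M-sized weights
print(max_context(12, 15.7, **shape))  # Q8_0-sized weights
```

Under these assumptions the 9.2GB case leaves room for roughly 13K tokens of FP16 KV cache, while the 15.7GB case returns 0 because the weights alone do not fit.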
FAQ
Why does this calculator give different numbers than other VRAM calculators?
Most calculators only account for model weight size and ignore KV cache, which scales with context length. A model that fits comfortably at 2K context can run out of memory at 32K context purely from KV cache growth. This calculator separates both terms so you can see exactly where your VRAM is going.
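A quick way to see the 2K versus 32K gap is to compute the per-token cache cost once and scale it. This Python sketch uses an assumed model shape (40 layers, 8 KV heads, head dimension 128) and an FP16 cache; the helper name is hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added by each token of context (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

per_token = kv_bytes_per_token(layers=40, kv_heads=8, head_dim=128)
for context_len in (2_048, 32_768):
    gb = per_token * context_len / 1e9
    print(f"{context_len:>6} tokens: {gb:.2f} GB")
```

With this shape the cache grows from about 0.34GB at 2K to about 5.4GB at 32K, a sixteenfold increase while the weight size stays fixed.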
What's the difference between Q4_K_M and Q8_0?
Q4_K_M uses roughly 0.6 bytes per parameter with mixed-precision blocks, producing smaller files with a modest quality tradeoff. Q8_0 uses slightly more than 1 byte per parameter (8-bit weights plus a per-block scale) and is close to lossless compared to full FP16 precision. Q4_K_M is the practical default for most consumer GPUs; Q8_0 is worth it if you have VRAM to spare.
Does KV cache quantization change these numbers?
This calculator assumes KV cache stored at FP16 (2 bytes per value), which is the common default in llama.cpp and Ollama. Some setups support quantized KV cache (e.g. Q8 or Q4 KV cache) to reduce this further, which would lower the KV cache portion of the estimate proportionally.
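To show how a quantized KV cache would scale the estimate, the Python sketch below swaps the 2 bytes per value for the block-format sizes of Q8_0 and Q4_0 (32 values plus a 2-byte scale per block). The model shape and the `kv_cache_gb` helper are assumptions for illustration, and whether a given backend supports these cache types is not something this sketch checks.

```python
# Bytes per cached value for each cache type.
KV_BYTES_PER_VALUE = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one 2-byte scale per block
    "q4_0": 18 / 32,  # 32 4-bit values (16 bytes) + one 2-byte scale
}

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, kv_type: str = "f16",
                batch_size: int = 1) -> float:
    """KV cache size in decimal GB for the chosen cache type."""
    values = 2 * layers * kv_heads * head_dim * context_len * batch_size
    return values * KV_BYTES_PER_VALUE[kv_type] / 1e9

# Assumed shape: 40 layers, 8 KV heads, head_dim 128, 32K context.
for kv_type in KV_BYTES_PER_VALUE:
    print(kv_type, round(kv_cache_gb(40, 8, 128, 32_768, kv_type), 2))
```

Because of the per-block scale, Q8_0 cuts the cache to a little over half of FP16 rather than exactly half, and Q4_0 to a little over a quarter.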