Question 1

Why does this calculator give different numbers than other VRAM calculators?

Accepted Answer

Most calculators account only for model weight size and ignore the KV cache, which grows linearly with context length. A model that fits comfortably at 2K context can run out of memory at 32K purely from KV cache growth. This calculator reports the two terms separately so you can see exactly where your VRAM is going.
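
As a rough sketch of why context length matters: the KV cache holds one key vector and one value vector per layer for every token in the context. The model shape below (32 layers, 8 KV heads, head dimension 128) and the function name are illustrative assumptions, not values taken from this calculator.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size: keys and values for every layer and token."""
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Hypothetical model shape, chosen for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for context_len in (2048, 32768):
    gib = kv_cache_bytes(context_len=context_len, **shape) / 2**30
    print(f"{context_len:>6} tokens: {gib:.2f} GiB of KV cache")
```

With these assumed dimensions the cache grows from 0.25 GiB at 2K context to 4 GiB at 32K, while the weights stay the same size.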

Question 2

What's the difference between Q4_K_M and Q8_0?

Accepted Answer

Q4_K_M uses roughly 0.55 bytes per parameter with mixed-precision blocks, producing smaller files with a modest quality tradeoff. Q8_0 uses roughly 1 byte per parameter and is close to lossless compared to full FP16 precision. Q4_K_M is the practical default for most consumer GPUs; Q8_0 is worth it if you have VRAM to spare.
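
To make the tradeoff concrete, here is a minimal sketch that turns the approximate bytes-per-parameter figures quoted above into weight sizes. The 8B parameter count and the helper name are hypothetical, and real files vary somewhat with architecture and metadata.

```python
# Approximate bytes per parameter, as quoted in the answer above.
BYTES_PER_PARAM = {"Q4_K_M": 0.55, "Q8_0": 1.0, "FP16": 2.0}

def weight_gb(n_params, quant):
    """Estimated weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[quant] / 1e9

n_params = 8_000_000_000  # hypothetical 8B-parameter model
for quant in BYTES_PER_PARAM:
    print(f"{quant:>7}: {weight_gb(n_params, quant):.1f} GB")
```

Under these assumptions the same model needs about 4.4 GB at Q4_K_M, 8 GB at Q8_0, and 16 GB at FP16 for weights alone.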

Question 3

Does KV cache quantization change these numbers?

Accepted Answer

This calculator assumes the KV cache is stored at FP16 (2 bytes per value), which is the common default in llama.cpp and Ollama. Some setups support a quantized KV cache (e.g. Q8 or Q4), which lowers the KV cache portion of the estimate roughly in proportion to the bytes per value: about half at Q8 and about a quarter at Q4.
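
A small sketch of that proportional adjustment, for rescaling an FP16 estimate by hand. The Q8 and Q4 byte counts are nominal assumptions (real quantized formats add a little per-block overhead), and the function name is hypothetical.

```python
# Nominal bytes per cached value. The Q8 and Q4 figures are simplifying
# assumptions that ignore per-block scale overhead in real quantized formats.
KV_BYTES_PER_VALUE = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}

def scale_kv_estimate(fp16_kv_gib, kv_type):
    """Rescale an FP16 KV cache estimate to another cache precision."""
    return fp16_kv_gib * KV_BYTES_PER_VALUE[kv_type] / KV_BYTES_PER_VALUE["FP16"]

fp16_kv_gib = 4.0  # hypothetical FP16 KV cache estimate from the calculator
for kv_type in KV_BYTES_PER_VALUE:
    print(f"{kv_type:>4}: {scale_kv_estimate(fp16_kv_gib, kv_type):.1f} GiB")
```

Only the KV cache term changes; the model weights and overhead portions of the estimate are unaffected.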

VRAM Calculator for Local LLMs

Estimated VRAM usage

How this is calculated

Model weights

KV cache

Overhead
