How Much VRAM & Hardware You Need for Local LLMs

Local LLM Hardware Requirements: VRAM, RAM & GPU

Before you install anything, you need to know whether your hardware can run the models worth running. The answer is almost always yes — but the model size, inference speed, and quantisation level you can use depend entirely on your VRAM, RAM, and CPU.

[IMAGE: Chart comparing GPU VRAM tiers — 8 GB, 16 GB, 24 GB — and which local LLM models they can run]


How much VRAM do you need for a local LLM?

For GPU inference, you need enough VRAM to hold the entire model in memory. A 7B parameter model in Q4 quantisation requires approximately 4–5 GB of VRAM. A 7B model at half precision (FP16) requires approximately 14 GB. Most developers run quantised models: the quality loss from Q4 quantisation is minor, while the VRAM savings are substantial. A GPU with 6–8 GB of VRAM handles any Q4-quantised 7B model comfortably; common cards such as the RTX 3070 (8 GB) and RTX 3060 (12 GB) clear that bar.
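The figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below estimates the memory for the weights alone. The function name is illustrative, and the 4.8 bits per weight used for Q4 is an assumed approximation (4-bit formats also store scaling data, so the effective figure sits a little above 4). The context cache and runtime overhead come on top, which is why a real 7B Q4 model lands in the 4–5 GB range rather than exactly on this number.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# FP16 stores each weight in 16 bits.
fp16_gb = weight_memory_gb(7, 16)

# Assumption: roughly 4.8 effective bits per weight for a Q4-style format.
q4_gb = weight_memory_gb(7, 4.8)

print(f"7B at FP16: ~{fp16_gb:.1f} GB, 7B at Q4: ~{q4_gb:.1f} GB (weights only)")
```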

All VRAM figures below are estimates based on standard quantisation levels. Exact values vary by model architecture and quantisation method. Verify against model cards and current community benchmarks before purchasing hardware.


VRAM requirements by model

| Model | Parameters | Min VRAM (Q4) | Recommended VRAM | Notes |
| --- | --- | --- | --- | --- |
| Phi-3 Mini | 3.8B | ~3 GB | 4 GB | Best for very low VRAM |
| Gemma 2B | 2B | ~2 GB | 3 GB | Lightweight; lower quality |
| Mistral 7B | 7B | ~5 GB | 6–8 GB | Apache 2.0; fast inference |
| CodeLlama 7B | 7B | ~5 GB | 6–8 GB | Coding-focused |
| Gemma 7B | 7B | ~5 GB | 6–8 GB | Strong all-rounder |
| Llama 3 8B | 8B | ~5 GB | 6–8 GB | Meta’s best small model |
| CodeLlama 13B | 13B | ~9 GB | 10–12 GB | Better coding quality |
| Llama 3 70B | 70B | ~40 GB | 48 GB+ | Multi-GPU or high-end workstation |
| Mixtral 8x7B | ~46.7B total | ~26 GB | 32 GB+ | MoE architecture (12.9B active per token) |
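
To turn the table into a quick check, here is a small sketch that lists which of these models fit a given amount of VRAM. The minimum figures are copied from the "Min VRAM (Q4)" column above, so they carry the same caveat: they are estimates. The function and variable names are illustrative.

```python
# (model name, minimum VRAM in GB at Q4), taken from the table above.
MODELS = [
    ("Gemma 2B", 2),
    ("Phi-3 Mini", 3),
    ("Mistral 7B", 5),
    ("CodeLlama 7B", 5),
    ("Gemma 7B", 5),
    ("Llama 3 8B", 5),
    ("CodeLlama 13B", 9),
    ("Mixtral 8x7B", 26),
    ("Llama 3 70B", 40),
]


def models_that_fit(vram_gb: float) -> list[str]:
    """Return the models whose estimated Q4 minimum fits in vram_gb."""
    return [name for name, min_vram in MODELS if min_vram <= vram_gb]


print(models_that_fit(8))
```

Fitting the minimum is not the same as running comfortably; compare against the "Recommended VRAM" column before buying.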

Quantisation tiers explained:
Q4_K_M: 4-bit k-quant, medium variant; best balance of quality and size; the default for most Ollama pulls
Q5_K_M: 5-bit k-quant, medium variant; higher quality, more VRAM; use if your GPU has headroom
Q8_0: 8-bit; near full-precision quality; roughly double the VRAM of Q4
FP16: unquantised 16-bit half precision; maximum quality; a 7B model needs about 14 GB, so only practical on GPUs with 16 GB+ VRAM


How much RAM do you need for local LLMs?

RAM matters most when running in CPU-only mode, where the model loads into system RAM instead of VRAM. The rule of thumb:

For GPU inference: System RAM matters less — the bottleneck is VRAM. 16 GB of system RAM is sufficient for most GPU-accelerated setups.

For CPU-only inference: System RAM must hold the entire model. A Q4-quantised 7B model needs approximately 6–8 GB of RAM; plan for 16 GB total to leave headroom for your OS and other processes.

RAM minimums by scenario:

| Scenario | Minimum RAM | Recommended RAM |
| --- | --- | --- |
| GPU inference, 7B model | 8 GB | 16 GB |
| CPU-only, 7B Q4 model | 12 GB | 16 GB |
| CPU-only, 13B Q4 model | 20 GB | 32 GB |
| Multi-model or RAG pipelines | 16 GB | 32–64 GB |
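
As a rough illustration, the RAM table can be expressed as a lookup that tells you where a machine stands for each scenario. The scenario keys and function name are made up for this sketch, and the recommended value for multi-model or RAG pipelines uses the lower end of the 32–64 GB range.

```python
# scenario key -> (minimum GB, recommended GB), taken from the table above.
RAM_GUIDE = {
    "gpu_7b": (8, 16),
    "cpu_7b_q4": (12, 16),
    "cpu_13b_q4": (20, 32),
    "multi_model_or_rag": (16, 32),  # recommended range is 32-64 GB
}


def ram_status(scenario: str, installed_gb: float) -> str:
    """Classify installed system RAM against the table for one scenario."""
    minimum, recommended = RAM_GUIDE[scenario]
    if installed_gb < minimum:
        return "below minimum"
    if installed_gb < recommended:
        return "meets minimum"
    return "meets recommended"


print(ram_status("cpu_13b_q4", 24))
```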

What hardware do I need for local LLMs?

Three tiers cover the range from getting started to running large models comfortably:

Entry tier — getting started

Target: Developers without a dedicated GPU, or with an older GPU

  • CPU: Any modern multi-core (Intel i5/i7 or AMD Ryzen 5/7 from the last 4–5 years)
  • RAM: 16 GB
  • GPU: None required (or integrated graphics)
  • Storage: 20 GB free SSD space
  • What you can run: 7B Q4 models in CPU-only mode; 2–3B models at reasonable speed
  • Speed expectation: 2–6 tokens per second on CPU; functional but slow for interactive chat

Mid tier — the developer workstation sweet spot

Target: Developers with a mid-range discrete GPU

  • CPU: Modern multi-core
  • RAM: 16–32 GB
  • GPU: NVIDIA RTX 3060 (12 GB), RTX 3070 (8 GB), RTX 4060 (8 GB), or Apple M-series with 16 GB unified memory
  • Storage: 50 GB free SSD space
  • What you can run: Any 7B model (Q4 or Q5); 13B models with Q4 quantisation
  • Speed expectation: 30–60+ tokens per second on GPU; fully interactive

Power tier — serious local inference

Target: Teams needing larger models or dedicated inference servers

  • CPU: High-core-count workstation CPU
  • RAM: 32–64 GB
  • GPU: NVIDIA RTX 4090 (24 GB), A100 (40/80 GB), multiple GPUs with NVLink, or high-VRAM consumer cards
  • Storage: 100+ GB NVMe SSD
  • What you can run: 13B models at Q8, 34B models at Q4, 70B models with multi-GPU setups
  • Speed expectation: Varies by model; 13B at Q4 on an RTX 4090 runs at 70–100+ tokens per second
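
If you want to place a machine in one of these tiers programmatically, a minimal sketch looks like the following. The thresholds are assumptions read off the example cards in each tier (8 GB and up for the mid tier, 24 GB and up for the power tier), not hard boundaries, and the function name is illustrative.

```python
def hardware_tier(vram_gb: float) -> str:
    """Map dedicated (or unified) GPU memory to one of the three tiers.

    Assumed thresholds: 24 GB+ is "power", 8 GB+ is "mid", anything
    less (including no GPU at all, i.e. 0) is "entry".
    """
    if vram_gb >= 24:
        return "power"
    if vram_gb >= 8:
        return "mid"
    return "entry"


for vram in (0, 8, 12, 24):
    print(vram, "GB ->", hardware_tier(vram))
```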

[IMAGE: Developer workstation with dedicated GPU set up for local LLM inference and model testing]


Can you run a local LLM without a GPU?

Yes. CPU-only inference works, and for many use cases it’s entirely practical. The trade-off is speed: GPU inference runs 10–30× faster than CPU inference for equivalent models. On a modern 8-core CPU, a Q4-quantised 7B model typically runs at 2–8 tokens per second — slow for interactive chat but fine for batch processing, one-shot prompts, and automation pipelines.
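
To get a feel for what those speeds mean in practice, the arithmetic is simply response length divided by throughput. The sketch below uses a hypothetical 500-token answer; the throughput values are example points within the ranges quoted in this article, not measurements.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady tokens_per_second rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical 500-token answer at example CPU and GPU speeds.
print("CPU at 5 tok/s:", generation_seconds(500, 5), "seconds")
print("GPU at 50 tok/s:", generation_seconds(500, 50), "seconds")
```

A wait of over a minute and a half is tolerable for a batch job but not for back-and-forth chat, which is the trade-off described above.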

Best runtimes for CPU-only inference:
llama.cpp — the underlying engine; maximum CPU performance with manual thread tuning
Ollama — uses llama.cpp internally; runs in CPU mode automatically when no GPU is detected
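
To measure what your own CPU actually delivers, one option is to ask Ollama for a non-streamed response and read the timing fields it returns. The sketch below assumes Ollama is running locally on its default port (11434) and that the model has already been pulled; the helper names are illustrative. It relies on the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields of the `/api/generate` response.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's token count and nanosecond duration to tokens/s."""
    return eval_count / (eval_duration_ns / 1e9)


def measure(model: str, prompt: str) -> float:
    """Run one non-streamed generation and return the measured tokens/s."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        stats = json.load(response)
    return tokens_per_second(stats["eval_count"], stats["eval_duration"])
```

With the server running, a call such as `measure("codellama", "Write a haiku about RAM.")` returns a single number you can compare against the ranges in this article. One short prompt is a noisy sample, so repeat it a few times.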

Best models for CPU-only inference:
– Phi-3 Mini (3.8B Q4) — fastest; surprising quality for its size
– Gemma 2B (Q4) — lightweight; good for simple tasks
– Any 7B model at Q4_K_M quantisation — slower but higher quality

For the full CPU-only setup, including llama.cpp configuration and which quantisation levels work best on CPU, see the dedicated guide to running a local LLM without a GPU.


How quantisation lowers hardware requirements

Quantisation reduces the numerical precision of the model’s weights, making the model file smaller and reducing the memory needed to run it. A 7B model stored at FP16 (16-bit float) uses approximately 14 GB. The same model at Q4 (4-bit integer) uses approximately 4–5 GB — a 65–70% reduction with acceptable quality loss for most tasks.

The practical implication: Quantisation is why a GPU with 6 GB VRAM can run a useful 7B model, even though the “full” model would need far more. This is not a workaround or a compromise — it’s the standard way developers run local LLMs in 2026.

GGUF format: Most quantised models are distributed in GGUF format (the format used by llama.cpp and Ollama). When you run ollama pull codellama, Ollama downloads the model in GGUF form. You can also download GGUF files directly from Hugging Face and load them into llama.cpp manually.
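
If you download model files by hand, a quick sanity check is to read the start of the file. A GGUF file begins with the four ASCII bytes `GGUF`, followed by a little-endian version number and two counts. The sketch below assumes the version 2 or later header layout (64-bit counts); the function name is illustrative and it reads only the fixed header, not the metadata itself.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header: magic, version, and two counts.

    Assumes the v2+ layout: 4-byte magic, uint32 version, uint64 tensor
    count, uint64 metadata key/value count, all little-endian.
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

A `ValueError` here usually means a truncated download or a file in a different format, such as safetensors.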

Choosing a quantisation level:
– Use Q4_K_M as your default — best quality-to-size ratio
– Move to Q5_K_M if you have VRAM headroom and want better output quality
– Use Q8_0 only if your GPU has abundant VRAM (16+ GB) and quality is paramount
– Avoid Q2 quantisation — quality degradation is too severe for most tasks

Once you know what your hardware can run, use the guide to set up your local AI stack and pull the right models for your tier.

To see which coding models fit your hardware in a side-by-side format, see the coding LLM roundup.


Frequently asked questions

How much VRAM do I need for a local LLM?

For a useful 7B model in Q4 quantisation, you need 5–6 GB of VRAM at minimum; 8 GB is comfortable. A GPU with 8 GB VRAM (RTX 3070, RTX 4060) runs any 7B model at Q4 and many 13B models at lower-bit quantisation. For 13B models at Q4, target 10–12 GB VRAM, for example a 12 GB card such as the RX 6700 XT or RTX 3060.

How much RAM do I need for a local LLM?

If you’re running with a GPU, 16 GB of system RAM is sufficient for most setups — the model loads into VRAM, not RAM. If you’re running CPU-only, you need enough RAM to hold the entire model. A Q4-quantised 7B model needs approximately 6–8 GB of RAM, so plan for 16 GB total to maintain headroom for your OS and other processes.

Can I run a local LLM without a GPU?

Yes. Using CPU-only inference with a quantised model (GGUF Q4 format) via llama.cpp or Ollama, you can run a 7B model on any modern machine with 16 GB of RAM. Inference will be significantly slower than GPU-accelerated inference, but it is functional. Apple Silicon Macs are a notable exception: their integrated GPU shares unified memory with the CPU, so inference is GPU-accelerated without a discrete card.


Last updated: 2026. Verify VRAM figures and token-per-second estimates against current model cards and community benchmarks before purchasing hardware.
