Local LLM Hardware Requirements: VRAM, RAM & GPU
Before you install anything, you need to know whether your hardware can run the models worth running. The answer is almost always yes — but the model size, inference speed, and quantisation level you can use depend entirely on your VRAM, RAM, and CPU.
[IMAGE: Chart comparing GPU VRAM tiers — 8 GB, 16 GB, 24 GB — and which local LLM models they can run]
How much VRAM do you need for a local LLM?
For GPU inference, you need enough VRAM to hold the entire model in memory. A 7B parameter model in Q4 quantisation requires approximately 4–5 GB of VRAM. A 7B model at full precision (FP16) requires approximately 14 GB. Most developers run quantised models: the quality loss from Q4 quantisation is minor, while the VRAM savings are substantial. A GPU with 6–8 GB of VRAM, such as the 8 GB RTX 3070, handles any Q4-quantised 7B model comfortably; the 12 GB RTX 3060 leaves extra headroom.
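The 14 GB and 4–5 GB figures above follow from simple arithmetic: weight memory is roughly the parameter count multiplied by the bits stored per weight. Here is a minimal sketch, assuming about 4.5 effective bits per weight for a Q4-style quantisation (an illustrative value, since real quantised files also store per-block scale data) and counting weights only:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight


# FP16 stores 16 bits per weight.
fp16_gb = weight_memory_gb(7, 16)
# Assumption: ~4.5 effective bits per weight for a Q4-style quantisation.
q4_gb = weight_memory_gb(7, 4.5)

print(f"7B at FP16: ~{fp16_gb:.1f} GB")
print(f"7B at Q4:   ~{q4_gb:.1f} GB (weights only)")
```

The context cache and runtime buffers come on top of the weights, which is why the practical Q4 figure lands nearer 4–5 GB.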
All VRAM figures below are estimates based on standard quantisation levels. Exact values vary by model architecture and quantisation method. Verify against model cards and current community benchmarks before purchasing hardware.
VRAM requirements by model
| Model | Parameters | Min VRAM (Q4) | Recommended VRAM | Notes |
|---|---|---|---|---|
| Phi-3 Mini | 3.8B | ~3 GB | 4 GB | Best for very low VRAM |
| Gemma 2B | 2B | ~2 GB | 3 GB | Lightweight; lower quality |
| Mistral 7B | 7B | ~5 GB | 6–8 GB | Apache 2.0; fast inference |
| CodeLlama 7B | 7B | ~5 GB | 6–8 GB | Coding-focused |
| Gemma 7B | 7B | ~5 GB | 6–8 GB | Strong all-rounder |
| Llama 3 8B | 8B | ~5 GB | 6–8 GB | Meta’s best small model |
| CodeLlama 13B | 13B | ~9 GB | 10–12 GB | Better coding quality |
| Llama 3 70B | 70B | ~40 GB | 48 GB+ | Multi-GPU or high-end workstation |
| Mixtral 8x7B | ~46.7B total | ~26 GB | 32 GB+ | MoE architecture (12.9B active per token) |
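If you want to check a specific card against the table, a small lookup is enough. The sketch below copies the "Min VRAM (Q4)" column into a dictionary; the function name is hypothetical, and the numbers are the same estimates as in the table, not measurements:

```python
# Minimum Q4 VRAM figures copied from the table above (approximate, in GB).
MIN_VRAM_Q4_GB = {
    "Gemma 2B": 2,
    "Phi-3 Mini": 3,
    "Mistral 7B": 5,
    "CodeLlama 7B": 5,
    "Gemma 7B": 5,
    "Llama 3 8B": 5,
    "CodeLlama 13B": 9,
    "Mixtral 8x7B": 26,
    "Llama 3 70B": 40,
}


def models_that_fit(vram_gb: float) -> list[str]:
    """Return the table's models whose minimum Q4 VRAM is within the budget."""
    return [name for name, need in MIN_VRAM_Q4_GB.items() if need <= vram_gb]


print(models_that_fit(8))
```

Treat a match as "it should load", not "it will be comfortable": the Recommended VRAM column is the better target if you use long prompts.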
Quantisation tiers explained:
– Q4_K_M: 4-bit k-quant, medium variant; best balance of quality and size; the default for most Ollama pulls
– Q5_K_M: 5-bit k-quant, medium variant; higher quality, more VRAM; use if your GPU has headroom
– Q8_0: 8-bit; near full-precision quality; roughly double the VRAM of Q4
– FP16: unquantised 16-bit half precision; maximum quality; for a 7B model, only practical on GPUs with 16 GB+ VRAM
How much RAM do you need for local LLMs?
RAM matters most when running in CPU-only mode, where the model loads into system RAM instead of VRAM. The rule of thumb:
For GPU inference: System RAM matters less — the bottleneck is VRAM. 16 GB of system RAM is sufficient for most GPU-accelerated setups.
For CPU-only inference: System RAM must hold the entire model. A Q4-quantised 7B model needs approximately 6–8 GB of RAM; plan for 16 GB total to leave headroom for your OS and other processes.
RAM minimums by scenario:
| Scenario | Minimum RAM | Recommended RAM |
|---|---|---|
| GPU inference, 7B model | 8 GB | 16 GB |
| CPU-only, 7B Q4 model | 12 GB | 16 GB |
| CPU-only, 13B Q4 model | 20 GB | 32 GB |
| Multi-model or RAG pipelines | 16 GB | 32–64 GB |
What hardware do I need for local LLMs?
Three tiers cover the range from getting started to running large models comfortably:
Entry tier — getting started
Target: Developers without a dedicated GPU, or with an older GPU
- CPU: Any modern multi-core (Intel i5/i7 or AMD Ryzen 5/7 from the last 4–5 years)
- RAM: 16 GB
- GPU: None required (or integrated graphics)
- Storage: 20 GB free SSD space
- What you can run: 7B Q4 models in CPU-only mode; 2–3B models at reasonable speed
- Speed expectation: 2–6 tokens per second on CPU; functional but slow for interactive chat
Mid tier — the developer workstation sweet spot
Target: Developers with a mid-range discrete GPU
- CPU: Modern multi-core
- RAM: 16–32 GB
- GPU: NVIDIA RTX 3060 (12 GB), RTX 3070 (8 GB), RTX 4060 (8 GB), or Apple M-series with 16 GB unified memory
- Storage: 50 GB free SSD space
- What you can run: Any 7B model (Q4 or Q5); 13B models at Q4 on 12 GB cards (8 GB cards need a lower quantisation level or partial CPU offload)
- Speed expectation: 30–60+ tokens per second on GPU; fully interactive
Power tier — serious local inference
Target: Teams needing larger models or dedicated inference servers
- CPU: High-core-count workstation CPU
- RAM: 32–64 GB
- GPU: NVIDIA RTX 4090 (24 GB), A100 (40/80 GB), multiple GPUs with NVLink, or high-VRAM consumer cards
- Storage: 100+ GB NVMe SSD
- What you can run: 13B models at Q8, 34B models at Q4, 70B models with multi-GPU setups
- Speed expectation: Varies by model; 13B at Q4 on an RTX 4090 runs at 70–100+ tokens per second
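To place a machine in one of the three tiers, you can encode the cut-offs as a small function. This is a sketch only: the thresholds (8 GB and 24 GB of VRAM, 16 GB and 32 GB of RAM) are assumptions read off the tier descriptions above, and Apple unified memory would need separate handling.

```python
def hardware_tier(vram_gb: float, ram_gb: float) -> str:
    """Map a machine to one of the three tiers (cut-offs are assumptions)."""
    if vram_gb >= 24 and ram_gb >= 32:
        return "power"
    if vram_gb >= 8 and ram_gb >= 16:
        return "mid"
    if ram_gb >= 16:
        return "entry"  # CPU-only or small-GPU machines
    return "below entry"


print(hardware_tier(vram_gb=12, ram_gb=32))
print(hardware_tier(vram_gb=0, ram_gb=16))
```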
[IMAGE: Developer workstation with dedicated GPU set up for local LLM inference and model testing]
Can you run a local LLM without a GPU?
Yes. CPU-only inference works, and for many use cases it’s entirely practical. The trade-off is speed: GPU inference runs 10–30× faster than CPU inference for equivalent models. On a modern 8-core CPU, a Q4-quantised 7B model typically runs at 2–8 tokens per second — slow for interactive chat but fine for batch processing, one-shot prompts, and automation pipelines.
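To see what those speeds mean in practice, convert tokens per second into waiting time. The sketch below assumes a hypothetical 300-token reply and picks one rate from inside each range quoted in this guide (4 tokens per second on CPU, 40 on GPU):

```python
def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply of the given length at a steady rate."""
    return tokens / tokens_per_second


reply_tokens = 300  # hypothetical reply length
cpu_seconds = seconds_to_generate(reply_tokens, 4)   # assumed CPU rate
gpu_seconds = seconds_to_generate(reply_tokens, 40)  # assumed GPU rate

print(f"CPU: {cpu_seconds:.1f} s, GPU: {gpu_seconds:.1f} s")
```

A wait of over a minute is tolerable for a batch job but not for back-and-forth chat, which is the distinction drawn above.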
Best runtimes for CPU-only inference:
– llama.cpp — the underlying engine; maximum CPU performance with manual thread tuning
– Ollama — uses llama.cpp internally; runs in CPU mode automatically when no GPU is detected
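As an illustration of manual thread tuning, the sketch below builds (without running) a CPU-only llama.cpp command line. The flag names are those of recent llama-cli builds, so confirm them with llama-cli --help on your version. The model path is hypothetical, and "half the logical CPUs" is only a starting heuristic that assumes two hardware threads per core.

```python
import os
import shlex


def suggested_threads() -> int:
    """Heuristic: half the logical CPUs, assuming 2 hardware threads per core."""
    logical = os.cpu_count() or 2
    return max(1, logical // 2)


def llama_cli_command(model_path: str, prompt: str, threads: int) -> list[str]:
    """Build (but do not run) a CPU-only llama-cli invocation."""
    return [
        "llama-cli",
        "--model", model_path,
        "--threads", str(threads),
        "--n-gpu-layers", "0",  # keep every layer on the CPU
        "--n-predict", "128",   # cap the reply length
        "--prompt", prompt,
    ]


# Hypothetical model path, for illustration only.
cmd = llama_cli_command("models/example-7b.Q4_K_M.gguf", "Hello", suggested_threads())
print(shlex.join(cmd))
```

Benchmark a few thread counts around the suggested value; the best setting depends on your CPU and memory bandwidth.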
Best models for CPU-only inference:
– Phi-3 Mini (3.8B Q4) — fastest; surprising quality for its size
– Gemma 2B (Q4) — lightweight; good for simple tasks
– Any 7B model at Q4_K_M quantisation — slower but higher quality
For the full CPU-only setup, including llama.cpp configuration and which quantisation levels work best on CPU, see the dedicated guide to running a local LLM without a GPU.
How quantisation lowers hardware requirements
Quantisation reduces the numerical precision of the model’s weights, making the model file smaller and reducing the memory needed to run it. A 7B model stored at FP16 (16-bit float) uses approximately 14 GB. The same model at Q4 (4-bit integer) uses approximately 4–5 GB — a 65–70% reduction with acceptable quality loss for most tasks.
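To make "reducing numerical precision" concrete, here is a toy example that squeezes four weights into signed 4-bit integers plus one shared scale, then restores them. It is a deliberately simplified absmax scheme, not the K-quant layout that GGUF files actually use, and the weight values are made up.

```python
def quantise_block(weights: list[float], bits: int = 4) -> tuple[float, list[int]]:
    """Symmetric absmax quantisation of one block to signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax else 1.0
    return scale, [round(w / scale) for w in weights]


def dequantise_block(scale: float, quantised: list[int]) -> list[float]:
    """Recover approximate weights from the integers and the shared scale."""
    return [scale * q for q in quantised]


block = [0.70, -0.33, 0.12, 0.02]  # made-up weights
scale, quantised = quantise_block(block)
restored = dequantise_block(scale, quantised)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(quantised)
print([round(w, 2) for w in restored], f"max error {max_error:.2f}")
```

Each weight now costs 4 bits instead of 16, at the price of a rounding error of at most half the scale. Production schemes use finer-grained blocks and scales to keep that error small.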
The practical implication: Quantisation is why a GPU with 6 GB VRAM can run a useful 7B model, even though the “full” model would need far more. This is not a workaround or a compromise — it’s the standard way developers run local LLMs in 2026.
GGUF format: Most quantised models are distributed in GGUF, the file format used by llama.cpp and Ollama. When you run ollama pull codellama, Ollama downloads the GGUF file. You can also download GGUF files directly from Hugging Face and load them into llama.cpp manually.
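If you download files by hand, a quick sanity check is to read the header: a GGUF file starts with the four ASCII bytes GGUF followed by a 32-bit version number. The sketch below assumes the common little-endian layout, and the helper name is hypothetical. The demo writes a fake 8-byte header to a temporary file so that no real model is needed.

```python
import os
import struct
import tempfile
from typing import Optional


def read_gguf_version(path: str) -> Optional[int]:
    """Return the GGUF version if the file starts with the GGUF magic, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    (version,) = struct.unpack("<I", header[4:8])  # assumes little-endian
    return version


# Demo with a fake header rather than a real multi-gigabyte model file.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<I", 3))
demo_version = read_gguf_version(tmp.name)
os.remove(tmp.name)

print(demo_version)
```

A result of None on a downloaded file usually means the download was truncated or the file is in a different format.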
Choosing a quantisation level:
– Use Q4_K_M as your default — best quality-to-size ratio
– Move to Q5_K_M if you have VRAM headroom and want better output quality
– Use Q8_0 only if your GPU has abundant VRAM (16+ GB) and quality is paramount
– Avoid Q2 quantisation — quality degradation is too severe for most tasks
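The same guidance can be sketched as a function that returns the highest-precision level that fits your VRAM. The bits-per-weight values and the 1 GB allowance for context and buffers are rough assumptions for illustration; check real file sizes on the model card.

```python
from typing import Optional

# Approximate bits per weight, highest precision first (assumed values).
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85}


def pick_quantisation(
    params_billion: float, vram_gb: float, overhead_gb: float = 1.0
) -> Optional[str]:
    """Return the highest-precision level whose estimated footprint fits."""
    for name, bits in BITS_PER_WEIGHT.items():  # dicts keep insertion order
        needed_gb = params_billion * bits / 8 + overhead_gb
        if needed_gb <= vram_gb:
            return name
    return None  # nothing fits: use a smaller model or CPU offload


for vram in (6, 8, 16):
    print(f"7B model, {vram} GB VRAM -> {pick_quantisation(7, vram)}")
```

A None result corresponds to the case where you would otherwise be tempted by Q2; a smaller model at Q4 is usually the better trade.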
Once you know what your hardware can run, use the guide to set up your local AI stack and pull the right models for your tier.
To see which coding models fit your hardware in a side-by-side format, see the coding LLM roundup.
Frequently asked questions
How much VRAM do I need for a local LLM?
For a useful 7B model in Q4 quantisation, you need 5–6 GB of VRAM at minimum; 8 GB is comfortable. A GPU with 8 GB VRAM (RTX 3070, RTX 4060) runs any Q4-quantised 7B model, and many 13B models at a lower quantisation level. For 13B models at Q4, target 10–12 GB VRAM.
How much RAM do I need for a local LLM?
If you’re running with a GPU, 16 GB of system RAM is sufficient for most setups — the model loads into VRAM, not RAM. If you’re running CPU-only, you need enough RAM to hold the entire model. A Q4-quantised 7B model needs approximately 6–8 GB of RAM, so plan for 16 GB total to maintain headroom for your OS and other processes.
Can I run a local LLM without a GPU?
Yes. Using CPU-only inference with a quantised model (GGUF Q4 format) via llama.cpp or Ollama, you can run a 7B model on any modern machine with 16 GB of RAM. Inference will be significantly slower than GPU-accelerated inference, but it is functional. Apple Silicon Macs are a notable exception — the unified memory architecture allows fast GPU-like inference without a discrete GPU.
Last updated: 2026. Verify VRAM figures and token-per-second estimates against current model cards and community benchmarks before purchasing hardware.