How to Run an LLM Locally (Step-by-Step Guide)
Running an LLM locally means the model lives and executes on your own hardware — no API calls, no data leaving your network, no usage costs per token. The good news: you almost certainly already have hardware capable enough to get started. This guide walks you through the full process from nothing to a working local LLM in four steps.
[IMAGE: Ollama runtime installed and running a local LLM on an existing developer workstation]
How to run an LLM locally — quick start (5 steps)
For developers who want to move fast before reading the detail:
- Check your hardware: you need at least 8 GB of RAM; a GPU with 6+ GB VRAM is recommended but not required
- Install a runtime: use Ollama as your local LLM runtime (easiest) or llama.cpp (more control)
- Pull a model: ollama pull llama3, or ollama pull codellama for coding tasks
- Run a prompt: ollama run llama3
- Connect a UI: point Open WebUI or a similar chat interface at http://localhost:11434
That’s the short version. The sections below expand each step with options, troubleshooting, and recommendations for real-world developer workstations.
What you need before you start
Before installing anything, verify your setup against these minimum requirements.
Hardware minimums:
– RAM: 8 GB minimum; 16 GB recommended. The model must fit in memory.
– Storage: 5–50 GB free space depending on the model size (a 7B model in Q4 quantisation is roughly 4–5 GB)
– GPU: Optional but strongly recommended. A GPU with 6–8 GB VRAM (NVIDIA RTX 3060 class or better) enables GPU-accelerated inference that is substantially faster than CPU-only mode.
Software prerequisites:
– A supported OS: Linux, macOS, or Windows 10/11
– For NVIDIA GPU acceleration: CUDA drivers installed (check with nvidia-smi)
– For AMD GPU acceleration: ROCm (Linux only)
– Git (for llama.cpp compilation path)
Not sure if your hardware qualifies? The local LLM hardware requirements guide breaks down VRAM requirements model by model, including which models run acceptably on CPU-only machines.
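The minimums above can be checked with a short script. This is a minimal sketch using only the Python standard library; the thresholds are copied from this guide, the function names are hypothetical, and the RAM lookup relies on os.sysconf, which is assumed to be unavailable on Windows (the function returns None there).

```python
import os
import shutil

MIN_RAM_GB = 8         # minimum RAM stated in this guide
MIN_FREE_DISK_GB = 5   # low end of the 5-50 GB range stated in this guide


def total_ram_gb():
    """Return installed RAM in GB, or None when it cannot be read."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. Windows, where os.sysconf does not exist
    if page_size <= 0 or page_count <= 0:
        return None
    return page_size * page_count / 1024**3


def free_disk_gb(path="."):
    """Return free space in GB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_nvidia_driver():
    """True when the nvidia-smi tool is on PATH (a hint, not a guarantee)."""
    return shutil.which("nvidia-smi") is not None


def check_minimums(ram_gb, disk_gb):
    """Return a list of human-readable problems; empty means the minimums are met."""
    problems = []
    if ram_gb is not None and ram_gb < MIN_RAM_GB:
        problems.append(f"RAM {ram_gb:.1f} GB is below the {MIN_RAM_GB} GB minimum")
    if disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"Free disk {disk_gb:.1f} GB is below {MIN_FREE_DISK_GB} GB")
    return problems


if __name__ == "__main__":
    ram = total_ram_gb()
    disk = free_disk_gb(os.path.expanduser("~"))
    ram_text = "unknown" if ram is None else f"{ram:.1f} GB"
    print(f"RAM: {ram_text}, free disk: {disk:.1f} GB, nvidia-smi found: {has_nvidia_driver()}")
    for problem in check_minimums(ram, disk):
        print("WARNING:", problem)
```

The script checks free space in your home directory as a stand-in; if your models live on another drive, pass that path to free_disk_gb instead.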
How to run an LLM locally, step by step
[IMAGE: Diagram of local LLM infrastructure stack showing runtime, model layer, and chat UI on existing hardware]
Step 1 — Choose and install a runtime (Ollama or llama.cpp)
A runtime is the software that loads the model weights into memory and handles inference. You have two primary options:
Ollama (recommended for most developers)
– Wraps llama.cpp under the hood with a clean CLI and REST API
– One-command install on Linux, macOS, and Windows
– Manages model downloads, storage, and versioning for you
– Exposes a local API at http://localhost:11434 compatible with the OpenAI API schema
Install Ollama:
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download the installer from https://ollama.com/download
llama.cpp (recommended when you need maximum control)
– Raw C++ inference library — no abstraction layer
– More configuration options (threads, batch size, GPU layer offloading)
– Can be compiled from source on Linux, macOS, and Windows; pre-built binaries are also published on the project’s GitHub releases page
– Better for embedding in custom pipelines or running in low-overhead environments
For most developers setting up a local LLM for the first time, Ollama is the right choice. It abstracts away the complexity without hiding the important controls.
Step 2 — Pull a model
Once your runtime is installed, pull a model from the registry:
# General purpose (good starting point)
ollama pull llama3
# Coding-focused
ollama pull codellama
# Lightweight — useful on lower-VRAM machines
ollama pull gemma:2b
Ollama downloads the model in GGUF quantised format and stores it locally. You can list the models you have already downloaded with ollama list.
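If you want the same list from a script rather than the CLI, the Ollama server also reports downloaded models over HTTP. The sketch below assumes the server is running on the default address and that GET /api/tags returns a JSON object with a "models" array whose entries carry a "name" field; the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used in this guide


def parse_model_names(tags_json):
    """Extract model names from the JSON body returned by GET /api/tags."""
    data = json.loads(tags_json)
    return [entry["name"] for entry in data.get("models", [])]


def list_local_models(base_url=OLLAMA_URL):
    """Ask a running Ollama server which models are stored locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(resp.read())


# Usage (requires a running Ollama server):
#   print(list_local_models())
```

Keeping the parsing separate from the network call makes it easy to test the script without a server running.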
Model size guidance:
– 7B models — best balance of quality and speed on consumer hardware
– 13B models — noticeably better reasoning; need 10–12 GB VRAM or 16+ GB RAM for CPU
– 3B/2B models — fast and light; quality drops noticeably for complex tasks
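The size figures in this guide follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below assumes an average of about 4.5 bits per weight for Q4-style quantisation, which is an illustrative figure rather than an exact one; real files vary by quantisation variant, and the runtime needs extra memory for context on top of the weights.

```python
def estimate_weights_gb(params_billion, bits_per_weight=4.5):
    """Rough size of the model weights in GB (10^9 bytes).

    bits_per_weight=4.5 is an assumed average for Q4-style quantisation;
    use 16 for unquantised half-precision weights.
    """
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    for size in (2, 7, 13):
        print(f"{size}B at ~4.5 bits/weight: about {estimate_weights_gb(size):.1f} GB")
    print(f"7B at 16 bits/weight: about {estimate_weights_gb(7, 16):.1f} GB")
```

Under these assumptions a 7B model comes out just under 4 GB of weights, consistent with the 4–5 GB figure quoted earlier once file overhead is included, while the same model unquantised would need about 14 GB.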
Step 3 — Run your first prompt
Start an interactive session:
ollama run codellama
You’ll see a >>> prompt. Type a coding question or a natural language instruction and press Enter to submit. Type /bye to exit the session.
To run a one-shot prompt non-interactively:
echo "Write a Python function that reverses a string" | ollama run codellama
To use the REST API directly (useful for scripting and tool integration):
curl http://localhost:11434/api/generate -d '{
"model": "codellama",
"prompt": "Write a Python function that reverses a string",
"stream": false
}'
The API response is JSON. The response field contains the model’s output.
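For scripting, the same request can be made from Python without extra dependencies. This is a minimal sketch that mirrors the curl call above using the standard library; the helper names are hypothetical, and the long timeout is an assumption to allow for slow CPU-only generation.

```python
import json
import urllib.request


def build_generate_request(model, prompt, base_url="http://localhost:11434"):
    """Build the POST request for /api/generate with streaming disabled."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(body):
    """Pull the model output out of the JSON response body."""
    return json.loads(body)["response"]


def generate(model, prompt, timeout=300):
    """Send one prompt to a running Ollama server and return the text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_text(resp.read())


# Usage (requires a running Ollama server with the model pulled):
#   print(generate("codellama", "Write a Python function that reverses a string"))
```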
Step 4 — Connect a UI or API
The CLI is useful for testing, but most developers want a chat interface or IDE integration for regular use.
Chat UI options:
- Open WebUI: the most popular self-hosted chat UI; runs as a Docker container and connects directly to Ollama. Visit http://localhost:3000 once running.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data --name open-webui \
--restart always ghcr.io/open-webui/open-webui:main
- Chatbot UI: a lightweight React-based alternative
- Msty: desktop app with a clean native UI, no Docker required
IDE integration:
- Continue.dev — VS Code and JetBrains extension; connects to your local Ollama API for inline completions and chat
- Tabby — self-hosted AI coding assistant with Ollama backend support
Configure both by pointing them at http://localhost:11434 in their settings.
Running a local LLM on an existing workstation
Most developer workstations built in the last 3–4 years are capable of running a useful 7B model. The typical successful setup:
- CPU-only (no discrete GPU): Use a Q4-quantised 7B model via Ollama. Expect slower inference than GPU mode — often workable for batch tasks, one-shot prompts, and light chat, but usually too slow for IDE-style completions.
- iGPU (integrated graphics): Some Ollama builds support iGPU acceleration on Apple Silicon and AMD integrated graphics. Apple Silicon Macs (M1/M2/M3) in particular handle 7B models well, because the GPU works directly out of the same unified memory as the CPU.
- Discrete GPU (NVIDIA): The most common performant setup. An RTX 3060 (12 GB) can run a 7B model entirely in VRAM for fast inference. An RTX 3080 (10 GB) is tighter — use quantised models.
- Multi-GPU: Ollama supports GPU layer splitting across multiple cards, useful if you have two older GPUs with combined VRAM.
Setting up a private local LLM for internal tools
If your goal is to build internal developer tooling — a private code review bot, an internal docs assistant, or a CI/CD helper — the local LLM stack adds one more layer: an API integration.
The Ollama API is intentionally compatible with the OpenAI API schema. This means any tool built against the OpenAI SDK can be pointed at your local Ollama instance with minimal changes:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the library; value is ignored by Ollama
)
response = client.chat.completions.create(
    model="codellama",
    messages=[{"role": "user", "content": "Explain this function: def foo(x): return x * 2"}],
)
print(response.choices[0].message.content)
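As one illustration of internal tooling, here is a rough sketch of a code review helper that calls the same OpenAI-compatible endpoint using only the standard library. The system prompt, function names, and the 8000-character truncation limit are all hypothetical choices, not part of Ollama or the OpenAI SDK.

```python
import json
import urllib.request

# Hypothetical instruction text; tune it for your own team.
SYSTEM_PROMPT = "You are a concise code reviewer. Point out bugs and risky changes."


def build_review_messages(diff_text, max_chars=8000):
    """Build a chat message list, truncating very large diffs (assumed limit)."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Review this diff:\n\n" + diff_text[:max_chars]},
    ]


def review_diff(diff_text, model="codellama", base_url="http://localhost:11434/v1"):
    """Send a diff to the local model and return the review text."""
    body = json.dumps(
        {"model": model, "messages": build_review_messages(diff_text)}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json", "Authorization": "Bearer ollama"},
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        reply = json.loads(resp.read())
    return reply["choices"][0]["message"]["content"]


# Usage (requires a running Ollama server with the model pulled):
#   print(review_diff(open("change.diff").read()))
```

Because the diff never leaves localhost, a helper like this can run in CI against private repositories without sending source code to a third party.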
For the privacy and data control reasoning behind this approach, see the guide on why developers self-host LLMs for privacy.
Common setup problems and how to fix them
“Model not found” or download fails
– Check your internet connection; Ollama downloads from the Ollama registry
– Try ollama pull <model> manually to see the error output
– Verify disk space — a 7B model needs ~4–5 GB
Out of memory / model crashes on load
– Your VRAM or RAM is too small for the model variant you pulled
– Switch to a smaller quantisation: ollama pull codellama:7b-code-q4_K_M
– Or switch to a smaller model: ollama pull gemma:2b
Slow inference (< 1 token/sec)
– Ollama is running in CPU-only mode when you expected GPU acceleration
– Check: ollama ps — if it shows 100% CPU, the GPU is not being used
– Verify your CUDA drivers are installed and up to date: nvidia-smi
– On Linux, ensure Ollama was installed after CUDA — reinstall if needed
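The same check can be scripted. The sketch below assumes the server exposes GET /api/ps and that each running model entry includes "size" and "size_vram" byte counts, so the ratio approximates how much of the model sits in GPU memory; treat the field names as assumptions to verify against your Ollama version. The function names are hypothetical.

```python
import json
import urllib.request


def gpu_fraction(model_entry):
    """Fraction of a loaded model held in VRAM (0.0 means CPU-only)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def summarize_running(ps_json):
    """Map each running model name to its GPU fraction."""
    data = json.loads(ps_json)
    return {entry["name"]: gpu_fraction(entry) for entry in data.get("models", [])}


def running_models(base_url="http://localhost:11434"):
    """Query a running Ollama server for loaded models and GPU usage."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return summarize_running(resp.read())


# Usage (requires a running Ollama server with a model loaded):
#   for name, fraction in running_models().items():
#       print(f"{name}: {fraction:.0%} in VRAM")
```

A fraction of 0 while a model is loaded suggests the GPU is not being used at all; a value between 0 and 1 suggests the model was only partly offloaded because it did not fit in VRAM.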
Port 11434 already in use
– Another Ollama instance is running: pkill ollama then restart
– Or change the port: set OLLAMA_HOST=127.0.0.1:11435 before starting Ollama (binding to 0.0.0.0 instead exposes the API to other machines on your network)
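Before killing processes, it can help to confirm that something is actually listening on the port. This is a small sketch with the standard socket module; the function name is hypothetical, and it only tells you that the port accepts TCP connections, not which program owns it.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for candidate in (11434, 11435):
        state = "in use" if port_in_use(candidate) else "free"
        print(f"Port {candidate} is {state}")
```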
Open WebUI can’t connect to Ollama
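– On Linux, use --network=host instead of --add-host and set OLLAMA_BASE_URL=http://127.0.0.1:11434; Open WebUI is then served on port 8080 rather than 3000
– On macOS or Windows with Docker Desktop, use host.docker.internal as the Ollama host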
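To narrow down where the connection fails, you can probe each candidate address from the environment that needs to reach Ollama. The sketch below assumes the server answers a plain GET on its base URL with HTTP 200 when it is up; the function names are hypothetical, and the two addresses are the ones mentioned in this guide.

```python
import http.client
import urllib.error
import urllib.request

CANDIDATE_URLS = [
    "http://localhost:11434",             # Ollama on the same host
    "http://host.docker.internal:11434",  # Ollama on the Docker host
]


def ollama_reachable(base_url, timeout=3):
    """Return True if a GET on base_url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Refused connection, timeout, DNS failure, or malformed URL.
        return False


def first_reachable(urls, timeout=3):
    """Return the first URL that responds, or None if none do."""
    for url in urls:
        if ollama_reachable(url, timeout=timeout):
            return url
    return None


# Usage (run it inside the container to test from Open WebUI's point of view):
#   print(first_reachable(CANDIDATE_URLS) or "No Ollama endpoint reachable")
```

If localhost works on the host but nothing works inside the container, the problem is container networking rather than Ollama itself.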
Frequently asked questions
How do you run an LLM locally?
Install a runtime (Ollama is the easiest option), pull a model with ollama pull <model-name>, and run it with ollama run <model-name>. The entire process takes under 10 minutes on a machine with a stable internet connection and adequate disk space. No cloud account or API key required.
What is the easiest way to run a local LLM?
Ollama is the easiest way to run a local LLM in 2026. It takes a single installation command, a single ollama pull to download a model, and a single ollama run to start it. The CLI is clean, the REST API is well-documented, and the model library covers the major open-weights models, including Llama 3, CodeLlama, Gemma, and Mistral.
Can I run a local LLM on an existing workstation?
Yes — as long as your workstation has at least 8 GB of RAM and adequate free disk space. A GPU with 6+ GB VRAM will give you much faster inference, but CPU-only inference is possible using quantised models. Most recent developer workstations with 16 GB RAM or a modest GPU are capable of running a useful small local model.
Last updated: 2026.