How to Run Gemma Locally (Step-by-Step for Devs)
Gemma is Google DeepMind’s open-weights model family — and one of the best all-round local LLMs available for developer workflows in 2026. It’s compact enough to run on consumer hardware, strong enough to handle real coding tasks, and genuinely capable at the mixed natural-language-plus-code work that fills a developer’s day. This guide gets you from nothing to a running Gemma instance in under 10 minutes.
[IMAGE: Gemma model running locally via Ollama command in a developer terminal window]
How to run Gemma locally — quick start
For developers who want Gemma running immediately:
- Install Ollama: `curl -fsSL https://ollama.com/install.sh | sh` on Linux, or download the installer from ollama.com on macOS and Windows
- Pull Gemma: `ollama pull gemma`
- Run it: `ollama run gemma`
- Test it: Ask a coding question at the `>>>` prompt
- Integrate: Point your IDE extension or chat UI at `http://localhost:11434`
That’s the complete path. The sections below expand each step and cover workflow integration.
Prerequisites: hardware and software for Gemma
Hardware minimums:
– GPU inference (recommended): GPU with 6 GB VRAM minimum for Gemma 7B. RTX 3060, RTX 3070, AMD RX 6700, or Apple Silicon M-series all work well.
– CPU-only inference: 16 GB system RAM; Gemma 7B at Q4 quantisation uses ~5 GB RAM for the model weights
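The ~5 GB figure can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below assumes about 8.5 billion parameters for Gemma 7B (embeddings included) and about 4.5 effective bits per weight for a Q4 quantisation. Both numbers are assumptions for illustration, and real usage adds KV cache and runtime overhead on top.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumed figures, not official specs: ~8.5B parameters, ~4.5 bits per weight.
q4_size = estimate_weight_gb(8.5, 4.5)
print(f"Gemma 7B at Q4: ~{q4_size:.1f} GB of weights")
```

The result lands just under 5 GB, which is why a 6 GB card is workable but tight once context grows.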
Software:
– Ollama (easiest path — handles runtime, model management, and API)
– Or llama.cpp (for lower-level control and custom quantisation)
– Docker (optional, for running Open WebUI as a chat interface)
Which Gemma variant to pull:
– ollama pull gemma — Gemma 7B (default); best quality for most tasks
– ollama pull gemma:2b — Gemma 2B; faster, lower quality; useful on machines with < 4 GB VRAM
– ollama pull gemma2 — Gemma 2 family, available in 2B, 9B, and 27B sizes; improved architecture over the original Gemma family
Running Gemma step by step
Step 1 — Install Ollama (or llama.cpp)
Ollama (recommended):

    # Linux
    curl -fsSL https://ollama.com/install.sh | sh
    # macOS / Windows
    # Download from https://ollama.com/download and run the installer

Verify Ollama is running:
ollama --version
You should see a version string. If you get “command not found”, check that /usr/local/bin is in your PATH on Linux/macOS.
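The CLI check confirms the binary is installed. To confirm the background server is also answering, you can query it over HTTP. This is a minimal sketch that assumes the default address `http://localhost:11434` and Ollama's `/api/version` endpoint; the helper names are made up for this example.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def parse_version(body: bytes) -> Optional[str]:
    """Pull the version string out of a JSON body like {"version": "0.5.1"}."""
    try:
        return json.loads(body).get("version")
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return None


def ollama_version(base_url: str = "http://localhost:11434",
                   timeout: float = 2.0) -> Optional[str]:
    """Return the server's version, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError):
        return None
```

Calling `ollama_version()` returns a version string when the server is up and `None` otherwise, which makes it easy to use as a guard at the top of a script.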
llama.cpp (alternative):
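If you prefer to use llama.cpp directly, clone and build it. Recent versions build with CMake (the plain `make` build is deprecated) and name the CLI binary `llama-cli`; on older checkouts the equivalent binary is `./main`.

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build
    cmake --build build --config Release

Then download a Gemma GGUF file from Hugging Face and run:

    ./build/bin/llama-cli -m gemma-7b-q4_k_m.gguf -p "Write a Python function that parses JSON" -n 200
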
---
### Step 2 — Pull the Gemma model
ollama pull gemma
Ollama downloads Gemma 7B in a 4-bit quantisation by default (~5 GB). The download runs once; subsequent runs load from your local disk.
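To inspect the model you just pulled (parameter count, quantisation, context length, licence):

    ollama show gemma

To see which models are already on disk, run `ollama list`. The available Gemma tags are listed on the model's page in the Ollama library.
To pull a smaller variant:

    # Smaller, faster
    ollama pull gemma:2b
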
The default `gemma` model in Ollama is instruction-tuned and works for code prompts out of the box. If you want the newer Gemma 2 family, use `ollama pull gemma2` and choose a size that fits your hardware.
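Once a pull finishes, you can confirm from a script that the model is on disk. The sketch below assumes Ollama's `/api/tags` endpoint, which lists local models as a JSON object with a `models` array. The filtering helper and the sample payload are illustrative, not real output.

```python
import json
import urllib.request


def local_gemma_models(tags: dict) -> list:
    """Return the names of locally stored models whose name starts with 'gemma'."""
    names = [m.get("name", "") for m in tags.get("models", [])]
    return sorted(n for n in names if n.startswith("gemma"))


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the local model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Hypothetical payload in the shape /api/tags returns, for offline illustration.
sample = {"models": [{"name": "gemma:2b"}, {"name": "mistral:latest"}, {"name": "gemma2:9b"}]}
print(local_gemma_models(sample))
```

Against a live server, `local_gemma_models(fetch_tags())` gives the same kind of list for whatever you have actually pulled.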
---
### Step 3 — Run and test Gemma
Start an interactive session:
ollama run gemma
Test with a coding prompt:
Write a Python function that reads a CSV file and returns a list of dictionaries
Gemma should return a working function with the imports it needs, often with a docstring. Output varies between runs, so review it before use.
Test with a code explanation task:
Explain what this code does:

    def memoize(fn):
        cache = {}
        def wrapper(*args):
            if args not in cache:
                cache[args] = fn(*args)
            return cache[args]
        return wrapper

This kind of mixed reasoning + code task is where Gemma tends to perform well compared to models fine-tuned exclusively on code.
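To judge Gemma's explanation, it helps to know what the right answer looks like. A quick check is to run the decorator yourself and count how often the wrapped function actually executes. The `slow_square` function and the call counter are hypothetical additions for this demonstration.

```python
def memoize(fn):
    cache = {}

    def wrapper(*args):
        # Compute once per distinct argument tuple, then serve from the cache.
        if args not in cache:
            cache[args] = fn(*args)
        return cache[args]

    return wrapper


calls = {"count": 0}  # tracks how many times the real function body runs


@memoize
def slow_square(n):
    calls["count"] += 1
    return n * n


first = slow_square(4)
second = slow_square(4)  # same argument: answered from the cache
print(first, second, calls["count"])  # prints: 16 16 1
```

A good explanation from the model should mention the cache keyed on arguments and the fact that repeated calls skip recomputation.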
---
### Step 4 — Integrate Gemma into your dev workflow
**IDE integration via Ollama API:**
Ollama serves its native REST API at `http://localhost:11434/api` and an OpenAI-compatible API at `http://localhost:11434/v1`. Tools that can use it include:
- **Continue.dev** (VS Code / JetBrains): In the Continue config, add a model entry with provider `ollama` and model `gemma`. You get inline code completions and a sidebar chat powered by your local Gemma instance.
- **Tabby**: Self-hosted coding assistant; configure the model backend to point at your Ollama endpoint.
**Chat UI:**

    # Run Open WebUI pointing at your local Ollama
    docker run -d \
      -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data \
      --name open-webui --restart always \
      ghcr.io/open-webui/open-webui:main

Navigate to `http://localhost:3000`. Select Gemma from the model dropdown.
**Script integration:**
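```python
import requests

# Ask the local Ollama server for a single, non-streamed completion.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma",
        "prompt": "Write a Python script to watch a directory for file changes",
        "stream": False,
    },
    timeout=120,
)
response.raise_for_status()  # fail loudly on HTTP errors
print(response.json()["response"])
```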
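The script above waits for the full answer. With `"stream": True`, Ollama instead sends the answer as newline-delimited JSON chunks. Here is a minimal sketch of reassembling those chunks, assuming each line carries a `response` fragment and the final line sets `done` to true. The sample lines are invented for illustration.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the 'response' fragments from a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Invented chunks in the shape the streaming API uses.
sample_lines = [
    b'{"response": "import ", "done": false}',
    b'{"response": "time", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(sample_lines))
```

With `requests`, pass `stream=True` to `requests.post` and feed `response.iter_lines()` to `collect_stream`, or print each fragment as it arrives for a live typing effect.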
For a broader roundup of the best local LLMs for coding, which compares Gemma against CodeLlama, Mistral, and others, see the full comparison guide.
Gemma vs CodeLlama: which is better for coding?
This question comes up often enough to deserve its own section. The short answer: CodeLlama for pure code generation; Gemma for mixed developer workflows.
| Dimension | Gemma 7B | CodeLlama 7B |
|---|---|---|
| Code completion | ★★★★☆ | ★★★★★ |
| Code explanation | ★★★★★ | ★★★☆☆ |
| Instruction following | ★★★★★ | ★★★★☆ |
| Fill-in-the-middle | Limited | ✅ Native FIM support |
| General reasoning | ★★★★★ | ★★★☆☆ |
| Min VRAM (Q4) | ~5 GB | ~5 GB |
| Licence | Gemma ToU | Llama 2 Community |
Ratings are qualitative. Verify against current benchmarks and your own coding tasks before treating them as definitive.
Use Gemma when: You need the model to explain code, write documentation, answer technical questions, reason about architecture decisions, or handle conversations that blend code with natural language context.
Use CodeLlama when: Your primary use case is raw code generation, fill-in-the-middle completions inside existing functions, or specialised Python work (using the CodeLlama-Python variant).
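If you keep both models pulled, the rule of thumb above can be encoded as a small router. This is a sketch only: the task labels are hypothetical, and `gemma` and `codellama` are assumed to be the model tags you have pulled in Ollama.

```python
# Task labels are hypothetical; adjust them to whatever your tooling emits.
CODE_FIRST_TASKS = {"completion", "fill_in_the_middle", "raw_generation"}


def pick_model(task: str) -> str:
    """Route code-first tasks to CodeLlama and everything else to Gemma."""
    return "codellama" if task in CODE_FIRST_TASKS else "gemma"


for task in ("fill_in_the_middle", "explain", "docstring"):
    print(task, "->", pick_model(task))
```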
For the full head-to-head, see the Gemma vs CodeLlama coding comparison.
[IMAGE: Side-by-side comparison of Gemma vs CodeLlama output for a local coding task]
Using Gemma in a local development workflow
Daily development tasks Gemma handles well:
- Code review assistant: Paste a diff or a function and ask Gemma to review it for bugs, style issues, or security concerns
- Docstring generation: Pass a function body and ask for a Google-style or NumPy-style docstring
- Test generation: Describe the expected behaviour of a function; ask Gemma to write unit tests
- Refactoring suggestions: Ask Gemma to suggest improvements to a specific function without rewriting it entirely
- Understanding unfamiliar code: Paste a function you inherited and ask for a plain-English explanation
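These tasks differ mainly in the instruction wrapped around the code you paste. One way to keep that consistent is a small prompt builder. The template wording and task keys below are assumptions for illustration, and the payload follows the shape of the `/api/generate` request used earlier in this guide.

```python
# Illustrative templates; tune the wording for your codebase.
TEMPLATES = {
    "review": "Review the following code for bugs, style issues, and security concerns:\n\n{code}",
    "docstring": "Write a Google-style docstring for this function:\n\n{code}",
    "tests": "Write unit tests for this function:\n\n{code}",
    "explain": "Explain in plain English what this code does:\n\n{code}",
}


def build_payload(task: str, code: str, model: str = "gemma") -> dict:
    """Build a non-streaming /api/generate request body for a given task."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    return {"model": model, "prompt": TEMPLATES[task].format(code=code), "stream": False}


payload = build_payload("docstring", "def add(a, b):\n    return a + b")
print(payload["prompt"])
```

The returned dictionary can be passed directly as the `json=` argument of a `requests.post` call to the local server.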
Running Gemma through an Ollama pipeline:
Ollama’s API compatibility with the OpenAI SDK means you can drop Gemma into any tool that supports OpenAI-compatible endpoints. This includes LangChain, LlamaIndex, and most LLM application frameworks:

    from openai import OpenAI

    # The SDK requires an API key, but Ollama ignores its value.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    response = client.chat.completions.create(
        model="gemma",
        messages=[{"role": "user", "content": "Review this Python function for potential bugs."}],
    )
    print(response.choices[0].message.content)

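Under the SDK, this is a plain HTTP POST to `/v1/chat/completions`. If you would rather not add the `openai` dependency, a standard-library sketch looks like the following. It assumes the server returns the usual OpenAI-style `choices[0].message.content` structure, and the helper names are made up for this example.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "gemma",
                       base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(completion: dict) -> str:
    """Return the assistant text from a chat completion response."""
    return completion["choices"][0]["message"]["content"]


def ask(prompt: str) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        return extract_reply(json.load(resp))
```

Splitting request building from response parsing keeps both halves testable without a running server; only `ask` touches the network.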
Gemma for documentation pipelines:
Gemma’s natural-language strength makes it a good fit for documentation generation at scale. Feed it a module’s function signatures and docstrings, and ask it to generate a README section or API reference. For these tasks its prose tends to read better than output from code-first models, though generated reference material should still be checked against the code.
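A sketch of the first half of that pipeline: extracting function names, parameters, and docstring summaries with Python's `ast` module and folding them into a prompt. The prompt wording and the sample module are invented for illustration; sending the prompt works the same way as the API calls shown earlier.

```python
import ast


def summarise_functions(source: str) -> list:
    """Return one 'name(params): summary' line per top-level function."""
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = ", ".join(arg.arg for arg in node.args.args)
            doc = ast.get_docstring(node) or "no docstring"
            lines.append(f"{node.name}({params}): {doc.splitlines()[0]}")
    return lines


def build_readme_prompt(module_name: str, source: str) -> str:
    """Wrap the extracted summaries in an instruction for the model."""
    listing = "\n".join(f"- {line}" for line in summarise_functions(source))
    return f"Write a README usage section for the module `{module_name}`.\nFunctions:\n{listing}"


# Invented module source for demonstration.
sample_source = '''
def load(path, encoding):
    """Read a file and return its text."""

def save(path, text):
    pass
'''
print(build_readme_prompt("fileutil", sample_source))
```

Sending only signatures and summaries rather than whole files keeps the prompt short, which matters on a 7B model with a limited context window.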
Frequently asked questions
How do I run Gemma locally?
Install Ollama (`curl -fsSL https://ollama.com/install.sh | sh` on Linux, or the installer from ollama.com on macOS and Windows), then run `ollama pull gemma` to download the model and `ollama run gemma` to start an interactive session. The entire setup takes under 10 minutes on a machine with a stable internet connection and adequate disk space (~5 GB for the model).
How much VRAM does Gemma need?
Gemma 7B in Q4 quantisation requires approximately 5 GB of VRAM for GPU inference, so a 6 GB card is the practical minimum and leaves little room for long contexts. Cards such as the RTX 3060 12 GB or GTX 1070 Ti (8 GB) handle it comfortably. The Gemma 2B variant runs on as little as 2 to 3 GB of VRAM. If running CPU-only, plan for 16 GB of system RAM to hold the 7B model with headroom.
Is Gemma good for coding?
Yes, with nuance. Gemma 7B performs well on code generation, code explanation, and mixed coding-plus-reasoning tasks. It’s particularly strong at explaining code and generating documentation — tasks where natural language quality matters as much as syntactic correctness. For specialised code completion and fill-in-the-middle tasks, CodeLlama is the better choice. Many developers use Gemma for their daily coding workflow and reserve CodeLlama for IDE-level inline completion.
Last updated: 2026.