How to Run Gemma Locally (Step-by-Step for Devs)

Gemma is Google DeepMind’s open-weights model family — and one of the best all-round local LLMs available for developer workflows in 2026. It’s compact enough to run on consumer hardware, strong enough to handle real coding tasks, and genuinely capable at the mixed natural-language-plus-code work that fills a developer’s day. This guide gets you from nothing to a running Gemma instance in under 10 minutes.

[IMAGE: Gemma model running locally via Ollama command in a developer terminal window]

How to run Gemma locally — quick start

For developers who want Gemma running immediately:

1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh (Linux/macOS) or download from ollama.com on Windows
2. Pull Gemma: ollama pull gemma
3. Run it: ollama run gemma
4. Test it: Ask a coding question at the >>> prompt
5. Integrate: Point your IDE extension or chat UI at http://localhost:11434

That’s the complete path. The sections below expand each step and cover workflow integration.

Prerequisites: hardware and software for Gemma

Hardware minimums:
– GPU inference (recommended): GPU with 6 GB VRAM minimum for Gemma 7B. RTX 3060, RTX 3070, AMD RX 6700, or Apple Silicon M-series all work well.
– CPU-only inference: 16 GB system RAM; Gemma 7B at Q4 quantisation uses ~5 GB RAM for the model weights
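
The ~5 GB figure comes from simple arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by eight. Here is a minimal sketch of that estimate. The parameter count (about 8.5 billion for Gemma 7B, counting embeddings) and the ~4.5 effective bits per weight for a Q4-style quantisation are assumptions, and the function name is illustrative. It ignores the KV cache and runtime overhead, which is why some headroom above the raw figure is needed.

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed figures, for illustration only.
GEMMA_7B_PARAMS = 8.5e9
Q4_BITS = 4.5      # rough effective bits per weight for a Q4-style quantisation
FP16_BITS = 16

print(f"Q4:   {estimate_weight_gb(GEMMA_7B_PARAMS, Q4_BITS):.1f} GB")    # Q4:   4.8 GB
print(f"FP16: {estimate_weight_gb(GEMMA_7B_PARAMS, FP16_BITS):.1f} GB")  # FP16: 17.0 GB
```

The same function shows why the unquantised model is out of reach for most consumer GPUs.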

Software:
– Ollama (easiest path — handles runtime, model management, and API)
– Or llama.cpp (for lower-level control and custom quantisation)
– Docker (optional, for running Open WebUI as a chat interface)

Which Gemma variant to pull:
– ollama pull gemma — Gemma 7B (default); best quality for most tasks
– ollama pull gemma:2b — Gemma 2B; faster, lower quality; useful on machines with < 4 GB VRAM
– ollama pull gemma2 — Gemma 2 family, available in 2B, 9B, and 27B sizes; improved architecture over the original Gemma family

Running Gemma step by step

Step 1 — Install Ollama (or llama.cpp)

Ollama (recommended):

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download and run the installer

Verify the installation:

ollama --version

You should see a version string. If you get “command not found”, check that /usr/local/bin is in your PATH on Linux/macOS.
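
A version string only confirms that the CLI is installed; the background server is a separate process. As a rough check that the server is answering, the sketch below queries Ollama's /api/version endpoint using only the standard library. The function name and the default timeout are illustrative choices, and the default port is assumed.

```python
import json
import urllib.error
import urllib.request

def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a reply that is not valid JSON
        return None

if __name__ == "__main__":
    version = ollama_version()
    print(f"Ollama {version} is running" if version else "Ollama server not reachable")
```

If this reports that the server is not reachable, starting it with ollama serve (or launching the desktop app) is the usual fix.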

llama.cpp (alternative):
If you prefer llama.cpp directly:

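git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Recent versions build with CMake; the old make target is no longer supported
cmake -B build
cmake --build build --config Release

Then download a Gemma GGUF file from Hugging Face and run:

./build/bin/llama-cli -m gemma-7b-q4_k_m.gguf -p "Write a Python function that parses JSON" -n 200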

---

### Step 2 — Pull the Gemma model

ollama pull gemma

Ollama downloads Gemma 7B in a 4-bit quantisation by default (~5 GB). The download runs once; later runs load the model from local disk.

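The available variants are listed in the Ollama model library at ollama.com. To inspect the model you pulled (parameter count, quantisation, prompt template):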

ollama show gemma

To pull a smaller variant:

# Smaller, faster
ollama pull gemma:2b

The default `gemma` model in Ollama is instruction-tuned and works for code prompts out of the box. If you want the newer Gemma 2 family, use `ollama pull gemma2` and choose a size that fits your hardware.
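
Before wiring Gemma into a script, it can help to confirm the pull succeeded. Ollama lists installed models at GET /api/tags, returning a "models" array whose entries carry a "name" such as gemma:latest. The helper below is a small sketch that checks such a payload; the function name and the sample values are hypothetical.

```python
def has_model(tags_payload: dict, name: str) -> bool:
    """True if `name` matches an installed model, by exact name:tag or by base name."""
    for entry in tags_payload.get("models", []):
        installed = entry.get("name", "")
        if installed == name or installed.split(":", 1)[0] == name:
            return True
    return False

# Example payload in the shape returned by GET /api/tags (values are made up).
sample = {"models": [{"name": "gemma:latest"}, {"name": "gemma2:9b"}]}
print(has_model(sample, "gemma"))     # True
print(has_model(sample, "gemma:2b"))  # False
```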

---

### Step 3 — Run and test Gemma

Start an interactive session:

ollama run gemma

Test with a coding prompt:

Write a Python function that reads a CSV file and returns a list of dictionaries

Gemma should return a short function, typically built on csv.DictReader, with the import and a docstring. Run it against a sample file before relying on it.
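
For comparison, here is a hand-written reference for the kind of function this prompt asks for. It is not Gemma output, and the function name is illustrative; it assumes the file has a header row and is UTF-8 encoded.

```python
import csv

def read_csv_as_dicts(path: str) -> list[dict[str, str]]:
    """Read a CSV file with a header row and return one dict per data row."""
    # newline="" lets the csv module handle line endings itself
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

If the model's answer differs in substance (for example, splitting lines on commas by hand), that is worth questioning, since manual splitting breaks on quoted fields.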

Test with a code explanation task:

Explain what this code does:
def memoize(fn):
    cache = {}
    def wrapper(*args):
        if args not in cache:
            cache[args] = fn(*args)
        return cache[args]
    return wrapper

This kind of mixed reasoning + code task is where Gemma tends to perform well compared to models fine-tuned exclusively on code.

---

### Step 4 — Integrate Gemma into your dev workflow

**IDE integration via Ollama API:**

Ollama serves its native API at `http://localhost:11434` and an OpenAI-compatible API under `http://localhost:11434/v1`. Use it with:

- **Continue.dev** (VS Code / JetBrains): In the Continue config, add a model entry with provider `ollama` and model `gemma`. You get inline code completions and a sidebar chat powered by your local Gemma instance.
- **Tabby**: Self-hosted coding assistant; configure the model backend to point at your Ollama endpoint.

**Chat UI:**

# Run Open WebUI pointing at your local Ollama
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

Navigate to `http://localhost:3000`. Select Gemma from the model dropdown.

**Script integration:**
```python
import requests

# Non-streaming call to Ollama's native generate endpoint
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma",
    "prompt": "Write a Python script to watch a directory for file changes",
    "stream": False
})
response.raise_for_status()
print(response.json()["response"])
```
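
For longer generations you may prefer streaming, which is Ollama's default when "stream" is omitted. The endpoint then returns newline-delimited JSON objects, each carrying a "response" fragment and a "done" flag. The helper below is a minimal sketch of how those chunks can be stitched together; join_stream is a hypothetical name and the sample chunks are made up.

```python
import json

def join_stream(lines):
    """Concatenate the "response" fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With requests, the same function can consume a live stream:
#   r = requests.post(url, json={"model": "gemma", "prompt": "...", "stream": True}, stream=True)
#   text = join_stream(r.iter_lines())
sample = [b'{"response": "Hel", "done": false}', b'{"response": "lo", "done": true}']
print(join_stream(sample))  # Hello
```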

For a broader comparison of Gemma against CodeLlama, Mistral, and others, see the best local LLMs for coding roundup.

Gemma vs CodeLlama: which is better for coding?

This question comes up often enough to deserve its own section. The short answer: CodeLlama for pure code generation; Gemma for mixed developer workflows.

| Dimension | Gemma 7B | CodeLlama 7B |
|---|---|---|
| Code completion | ★★★★☆ | ★★★★★ |
| Code explanation | ★★★★★ | ★★★☆☆ |
| Instruction following | ★★★★★ | ★★★★☆ |
| Fill-in-the-middle | Limited | ✅ Native FIM support |
| General reasoning | ★★★★★ | ★★★☆☆ |
| Min VRAM (Q4) | ~5 GB | ~5 GB |
| Licence | Gemma Terms of Use | Llama 2 Community License |

Ratings are qualitative. Verify against current benchmarks and your own coding tasks before treating them as definitive.

Use Gemma when: You need the model to explain code, write documentation, answer technical questions, reason about architecture decisions, or handle conversations that blend code with natural language context.

Use CodeLlama when: Your primary use case is raw code generation, fill-in-the-middle completions inside existing functions, or specialised Python work (using the CodeLlama-Python variant).

For the full head-to-head, see the Gemma vs CodeLlama coding comparison.

[IMAGE: Side-by-side comparison of Gemma vs CodeLlama output for a local coding task]

Using Gemma in a local development workflow

Daily development tasks Gemma handles well:

– Code review assistant: Paste a diff or a function and ask Gemma to review it for bugs, style issues, or security concerns
– Docstring generation: Pass a function body and ask for a Google-style or NumPy-style docstring
– Test generation: Describe the expected behaviour of a function; ask Gemma to write unit tests
– Refactoring suggestions: Ask Gemma to suggest improvements to a specific function without rewriting it entirely
– Understanding unfamiliar code: Paste a function you inherited and ask for a plain-English explanation
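
Tasks like these are easy to script. As one possible shape, the sketch below builds a request body for Ollama's POST /api/chat endpoint that asks for a review of a diff. The function name and the instruction wording are illustrative, and everything is placed in a single user message to keep the example simple.

```python
def build_review_request(diff: str, model: str = "gemma") -> dict:
    """Build a JSON body for Ollama's POST /api/chat that asks for a code review."""
    instructions = (
        "Review the following diff. List possible bugs, style issues, and "
        "security concerns as short bullet points. Do not rewrite the code."
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": f"{instructions}\n\n{diff}"}],
        "stream": False,
    }

payload = build_review_request("-    return a / b\n+    return a // b")
# Send with: requests.post("http://localhost:11434/api/chat", json=payload)
# The reply text is under response.json()["message"]["content"].
```

The same pattern covers docstring and test generation by swapping the instruction text.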

Running Gemma through an Ollama pipeline:

Ollama’s API compatibility with the OpenAI SDK means you can drop Gemma into any tool that supports OpenAI-compatible endpoints. This includes LangChain, LlamaIndex, and most LLM application frameworks:

from openai import OpenAI

# Ollama ignores the API key, but the SDK requires a non-empty value
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="gemma",
    messages=[{"role": "user", "content": "Review this Python function for potential bugs."}]
)
print(response.choices[0].message.content)

Gemma for documentation pipelines:

Gemma’s strong natural language capability makes it effective for documentation generation at scale. Feed it a module’s function signatures and docstrings, and ask it to generate a README section or API reference. For documentation tasks, the output tends to read better than what code-first models produce, though this is a qualitative impression worth checking on your own codebase.
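
Collecting signatures and docstrings can be automated with the standard inspect module. The sketch below shows one way to turn a module into prompt text; describe_module and build_docs_prompt are hypothetical helpers, and the prompt wording is only a starting point.

```python
import inspect
import types

def describe_module(module: types.ModuleType) -> str:
    """List each public function's signature and docstring as plain text."""
    sections = []
    for name, fn in inspect.getmembers(module, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private helpers
        try:
            signature = str(inspect.signature(fn))
        except (TypeError, ValueError):
            signature = "(...)"  # signature not introspectable
        doc = inspect.getdoc(fn) or "No docstring."
        sections.append(f"{name}{signature}\n    {doc}")
    return "\n\n".join(sections)

def build_docs_prompt(module: types.ModuleType) -> str:
    """Wrap the function listing in an instruction for the model."""
    return (
        f"Write a README section documenting the module `{module.__name__}`.\n"
        "Use only the functions listed below.\n\n" + describe_module(module)
    )

if __name__ == "__main__":
    import json
    print(build_docs_prompt(json))
```

The resulting string can be sent as the prompt in any of the API calls shown earlier.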

Frequently asked questions

How do I run Gemma locally?

Install Ollama (curl -fsSL https://ollama.com/install.sh | sh on Linux/macOS or download from ollama.com on Windows), then run ollama pull gemma to download the model and ollama run gemma to start an interactive session. The entire setup takes under 10 minutes on a machine with a stable internet connection and adequate disk space (~5 GB for the model).

How much VRAM does Gemma need?

Gemma 7B in Q4 quantisation requires approximately 5 GB of VRAM for GPU inference. A GPU with 6 GB of VRAM or more handles it comfortably; the 8 GB GTX 1070 Ti and the 12 GB desktop RTX 3060 both clear that bar. The Gemma 2B variant runs on as little as 2–3 GB VRAM. If running CPU-only, plan for 16 GB system RAM to hold the 7B model with headroom.

Is Gemma good for coding?

Yes, with nuance. Gemma 7B performs well on code generation, code explanation, and mixed coding-plus-reasoning tasks. It’s particularly strong at explaining code and generating documentation — tasks where natural language quality matters as much as syntactic correctness. For specialised code completion and fill-in-the-middle tasks, CodeLlama is the better choice. Many developers use Gemma for their daily coding workflow and reserve CodeLlama for IDE-level inline completion.

Last updated: 2026.
