Ollama for Local AI: Best Models for Coding & Setup
Ollama is one of the fastest ways to get a large language model running on your own machine. One command installs it, one command downloads a model, and one command starts an interactive session. This guide covers what Ollama is actually used for, the best models for coding, and how to install it on Linux, Windows, and Mac.
[IMAGE: Ollama model list command showing best coding models available including CodeLlama, Gemma, and Mistral]
What is Ollama used for?
Ollama is a local LLM runtime that lets you download, manage, and run open-weights language models entirely on your own hardware. It wraps the llama.cpp inference engine with a clean CLI and a REST API that includes OpenAI-compatible endpoints, making it easy to run models like CodeLlama, Gemma, Llama 3, and Mistral without cloud accounts, API keys, or data leaving your machine.
Common uses include: coding assistance and code completion in an IDE, internal documentation Q&A, private chat assistants, scripting and automation tools, and prototyping LLM-powered applications.
How to run Ollama locally — quick start
This block gets you to a running model in under 5 minutes:
- Install Ollama for your OS (see install sections below)
- Pull a model:
ollama pull codellama
- Run it:
ollama run codellama
- Type a prompt at the >>> prompt and press Enter
- Exit with /bye or Ctrl+D
The Ollama API is available at http://localhost:11434 once the service is running. Use it to integrate with IDEs, custom scripts, or chat UIs.
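As a minimal sketch of calling that API from a script, the snippet below posts a prompt to Ollama's native /api/generate endpoint using only the Python standard library. The helper names (build_generate_request, generate) and the 120-second timeout are illustrative assumptions, and it presumes the codellama model has already been pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address from this guide


def build_generate_request(prompt, model="codellama", base_url=OLLAMA_URL):
    """Build (but do not send) a POST request for the /api/generate endpoint."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}  # one JSON reply, no streaming
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="codellama", timeout=120.0):
    """Send the prompt to a running Ollama service and return the generated text."""
    req = build_generate_request(prompt, model)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

With the service running, generate("Write a Python function that reverses a string") should return the model's reply as a plain string. Building the request separately from sending it keeps the payload easy to inspect or log.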
Install Ollama on Linux
curl -fsSL https://ollama.com/install.sh | sh
This script detects your distribution, installs the Ollama binary, and sets up a systemd service so Ollama starts automatically on boot.
Verify the installation:
ollama --version
ollama list
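If you would rather verify the install from a script, one option is to ask the local service which models it has, which is roughly the API counterpart of ollama list. This is a sketch under the assumption that the service is listening on the default port; model_names and list_local_models are hypothetical helper names.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434"  # default local address


def model_names(tags_response: dict) -> list:
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> Optional[list]:
    """Return downloaded model names, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except OSError:  # covers connection refused, timeouts, and urllib's URLError
        return None
```

A return value of None suggests the service is not running; an empty list means it is running but no models have been pulled yet.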
NVIDIA GPU support on Linux: Ollama auto-detects CUDA if your drivers are installed. Run nvidia-smi to confirm your GPU is visible before starting Ollama. If CUDA isn’t detected, reinstall Ollama after installing the NVIDIA drivers.
AMD GPU support on Linux: Ollama supports AMD GPUs via ROCm. The install script detects ROCm if it’s present.
Install Ollama on Windows
- Download the Windows installer from ollama.com/download
- Run the .exe installer (no admin rights required)
- Ollama installs as a background service and runs in the system tray
- Open a command prompt or PowerShell and run:
ollama run llama3
Windows GPU support: NVIDIA CUDA is supported. Ensure you have the latest NVIDIA drivers installed before running Ollama. AMD GPU support on Windows is limited — check the Ollama GitHub releases page for current status.
Does Ollama work on Mac?
Yes. Ollama has native macOS support, and the Mac is one of the best platforms for local LLM inference, particularly on Apple Silicon (M1, M2, M3, M4).
- Download the macOS app from ollama.com/download
- Move it to your Applications folder and launch it
- Ollama runs in the menu bar; open Terminal and run
ollama pull gemma
Apple Silicon performance: Ollama uses Metal GPU acceleration on Apple Silicon Macs. The unified memory architecture means the GPU and CPU share RAM — an M2 MacBook Pro with 16 GB unified memory can run a 7B model with fast inference. This is one of the most performant CPU/GPU setups for local LLMs relative to cost.
Intel Mac: Ollama runs in CPU-only mode on Intel Macs. Performance is slower but functional for 7B models with Q4 quantisation.
[IMAGE: Ollama running a local LLM on macOS for coding and development automation tasks]
Best Ollama models for coding
These are the models available through Ollama that deliver the best results for coding tasks in 2026:
| Model | Pull command | Min VRAM | Coding focus | Licence |
|---|---|---|---|---|
| CodeLlama 7B | ollama pull codellama | ~6 GB | ★★★★★: code completion, FIM | Llama 2 Community |
| DeepSeek Coder 7B | ollama pull deepseek-coder | ~6 GB | ★★★★☆: code-first training | DeepSeek |
| Gemma 7B | ollama pull gemma | ~6 GB | ★★★★☆: coding + reasoning | Gemma ToU |
| Mistral 7B | ollama pull mistral | ~6 GB | ★★★☆☆: general + scripting | Apache 2.0 |
| Phi-3 Mini | ollama pull phi3 | ~4 GB | ★★★☆☆: lightweight coding | MIT |
| Llama 3 8B | ollama pull llama3 | ~6 GB | ★★★☆☆: general; good at code | Llama 3 Community |
Coding focus ratings are qualitative. Verify against current community benchmarks and your own workload before treating them as definitive.
Recommendation by use case:
– Inline code completion / IDE assistant: codellama (7B or 13B)
– Code + technical Q&A mixed workflow: gemma or llama3
– Maximum licence permissiveness: mistral (Apache 2.0)
– Limited hardware (< 6 GB VRAM): phi3
For a broader comparison of the best local LLMs for coding, including models beyond the Ollama library, see the full roundup.
Ollama model comparison for developers
Understanding the key differences between Ollama models helps you make the right choice for your workflow:
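Context window: A larger context window means the model can hold more of your codebase in “memory” during a session. Limits vary by model family and version, and the context Ollama actually allocates at run time (the num_ctx parameter) can be smaller than the model’s maximum, so check with ollama show <model> rather than assuming.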
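To make the context budget concrete, here is a rough sketch that estimates whether a piece of source text fits in a given window and builds a request payload that asks Ollama for that window via the num_ctx option. The 4-characters-per-token ratio and the 512-token reply reserve are crude assumptions, not properties of any particular tokenizer, and the function names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers vary by model and language."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, num_ctx: int, reserve_for_reply: int = 512) -> bool:
    """Check that the prompt plus room for a reply stays within num_ctx tokens."""
    return estimate_tokens(text) + reserve_for_reply <= num_ctx


def generate_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Payload for /api/generate that requests a specific context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # runtime context window, in tokens
    }
```

Asking for a larger num_ctx increases memory use, so it is worth checking the estimate before sending a whole file.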
Quantisation levels: Most models in the Ollama library are published at several quantisation levels. The default pull gives you a balanced quantisation (typically a 4-bit variant such as Q4_K_M). For lower VRAM usage at reduced quality, append a tag:
ollama pull codellama:7b-code-q4_0
For higher quality at the cost of more VRAM:
ollama pull codellama:13b-code-q5_K_M
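The trade-off between the two pulls above can be approximated with simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below assumes about 4.5 bits per weight for Q4_0 and about 5.7 for Q5_K_M; these are approximate figures, and real memory use is higher once the context cache and runtime overhead are added.

```python
# Approximate effective bits per weight (assumed values, not exact for every model).
APPROX_BITS_PER_WEIGHT = {
    "q4_0": 4.5,
    "q5_K_M": 5.7,
}


def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate weight storage in GB: billions of parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


small = approx_weight_gb(7, APPROX_BITS_PER_WEIGHT["q4_0"])     # 7B at Q4_0
large = approx_weight_gb(13, APPROX_BITS_PER_WEIGHT["q5_K_M"])  # 13B at Q5_K_M
print(f"7B q4_0 weights:    ~{small:.1f} GB")
print(f"13B q5_K_M weights: ~{large:.1f} GB")
```

Under these assumptions the 13B Q5_K_M weights need more than twice the memory of the 7B Q4_0 weights, which is why the larger pull calls for a bigger GPU or more unified memory.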
Model tags: Run ollama show <model> to see the model’s parameters, quantisation, and context window. Run ollama list to see all downloaded models.
Using Ollama for internal automation
Ollama’s REST API makes it practical to build internal automation tools that use a local LLM as a reasoning layer. Because Ollama exposes OpenAI-compatible endpoints under /v1, existing code written against the OpenAI SDK often needs only a base_url change:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by library; ignored by Ollama
)

response = client.chat.completions.create(
    model="codellama",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Review this function and suggest improvements: def add(a,b): return a+b"},
    ],
)
print(response.choices[0].message.content)
Practical automation use cases:
– Automated code review comments — pipe a diff to the model and get structured feedback
– Commit message generation — send a git diff and ask for a conventional commit message
– Documentation generation — pass a function signature and docstring skeleton; get it filled in
– Internal Slack bot — route questions to the local Ollama API instead of the OpenAI API
– Pre-commit hooks — lint code against a style guide using the model as a reasoning layer
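As one illustration of the list above, commit message generation can be sketched with the standard library and Ollama's native /api/chat endpoint. Everything named here (build_commit_messages, staged_diff, suggest_commit_message, the 12,000-character diff cap, and the prompt wording) is a hypothetical choice for the example rather than a recommended setup.

```python
import json
import subprocess
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"
MAX_DIFF_CHARS = 12_000  # assumed cap so large diffs do not overflow the context


def build_commit_messages(diff: str) -> list:
    """Build the chat messages asking for a conventional commit message."""
    trimmed = diff[:MAX_DIFF_CHARS]
    return [
        {"role": "system",
         "content": "You write concise conventional commit messages. Reply with the message only."},
        {"role": "user",
         "content": f"Write a commit message for this diff:\n\n{trimmed}"},
    ]


def staged_diff() -> str:
    """Return the staged changes of the current git repository."""
    result = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    )
    return result.stdout


def suggest_commit_message(diff: str, model: str = "codellama", timeout: float = 120.0) -> str:
    """Send the diff to a running Ollama service and return its suggestion."""
    payload = {"model": model, "messages": build_commit_messages(diff), "stream": False}
    req = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"].strip()
```

Inside a repository with staged changes, print(suggest_commit_message(staged_diff())) would show a suggestion to review before committing. Treat the output as a draft: the model can misread a diff, so a human check is still worthwhile.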
For the full local AI infrastructure setup guide covering UI setup, Docker deployment, and multi-user configurations, see the infrastructure guide.
For setting up Gemma specifically through Ollama, see the guide to run Gemma through Ollama for your dev workflow.
Frequently asked questions
What is Ollama used for?
Ollama is used to run open-weights LLMs locally on your own hardware. Developers use it for coding assistance, internal chat assistants, automation pipelines, and LLM application prototyping. It provides a CLI and a REST API, making it easy to integrate with IDEs, scripts, and applications without sending data to cloud providers.
What models does Ollama support?
Ollama supports most major open-weights models including Llama 3 (Meta), CodeLlama (Meta), Gemma (Google), Mistral, Phi-3 (Microsoft), DeepSeek Coder, and dozens more. Run ollama list to see locally downloaded models, or browse the full registry at ollama.com/library.
Does Ollama work on Mac and Windows?
Yes to both. Ollama has a native macOS application with Metal GPU acceleration, which is particularly fast on Apple Silicon. On Windows, Ollama installs as a background service with NVIDIA CUDA support. Both platforms let you run 7B models at interactive speeds on consumer hardware.
Last updated: 2026.