Best Ollama Models for Coding + How to Use Ollama

Ollama is one of the fastest ways to get a large language model running on your own machine. On Linux, one command installs it; on any platform, one command downloads a model and one command starts an interactive session. This guide covers what Ollama is used for, the best models for coding, and how to install it on Linux, Windows, and Mac.

[IMAGE: Ollama model list command showing best coding models available including CodeLlama, Gemma, and Mistral]


What is Ollama used for?

Ollama is a local LLM runtime that lets you download, manage, and run open-weights language models entirely on your own hardware. It wraps the llama.cpp inference engine with a clean CLI, a native REST API, and an endpoint compatible with the OpenAI API schema, making it easy to run models like CodeLlama, Gemma, Llama 3, and Mistral without cloud accounts, API keys, or data leaving your machine.

Common uses include: coding assistance and code completion in an IDE, internal documentation Q&A, private chat assistants, scripting and automation tools, and prototyping LLM-powered applications.


How to run Ollama locally — quick start

These steps get you to a running model in about 5 minutes on a fast connection (the model download is the slow part):

  1. Install Ollama for your OS (see install sections below)
  2. Pull a model: ollama pull codellama
  3. Run it: ollama run codellama
  4. Type a prompt at the >>> prompt and press Enter
  5. Exit with /bye or Ctrl+D

The Ollama API is available at http://localhost:11434 once the service is running. Use it to integrate with IDEs, custom scripts, or chat UIs.
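
As a rough illustration of calling that endpoint from a script, the sketch below uses only the Python standard library to POST to Ollama's native /api/generate route with streaming turned off. The helper names (build_generate_request, extract_text, generate) are invented for this example, and it assumes the service is on the default address and that the model you name has already been pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address mentioned above


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a POST request for Ollama's native /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(raw_body: bytes) -> str:
    """Pull the generated text out of a non-streaming JSON reply."""
    return json.loads(raw_body)["response"]


def generate(model: str, prompt: str) -> str:
    """Send a prompt to a running Ollama service and return its reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=120) as reply:
        return extract_text(reply.read())
```

With the service running, something like print(generate("codellama", "Write a one-line Python hello world.")) should print the model's answer. Splitting request building from the network call keeps the payload easy to inspect and test without a live server.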


Install Ollama on Linux

curl -fsSL https://ollama.com/install.sh | sh

This script detects your distribution, installs the Ollama binary, and sets up a systemd service so Ollama starts automatically on boot.

Verify the installation:

ollama --version
ollama list
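
If you want to run the same check from a setup script, a minimal sketch might look like the following. The function name ollama_version is hypothetical; it simply shells out to the ollama --version command shown above and returns None when the binary is not on your PATH.

```python
import shutil
import subprocess
from typing import Optional


def ollama_version(binary: str = "ollama") -> Optional[str]:
    """Return the output of `<binary> --version`, or None if it is unavailable."""
    if shutil.which(binary) is None:
        return None  # binary not found on PATH
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, timeout=10
    )
    if result.returncode != 0:
        return None
    # Fall back to stderr in case the version text is not written to stdout.
    return result.stdout.strip() or result.stderr.strip()
```

A provisioning script could call ollama_version() and stop early with a clear message when it returns None.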

NVIDIA GPU support on Linux: Ollama auto-detects CUDA if your drivers are installed. Run nvidia-smi to confirm your GPU is visible before starting Ollama. If CUDA isn’t detected, reinstall Ollama after installing the NVIDIA drivers.

AMD GPU support on Linux: Ollama supports AMD GPUs via ROCm. The install script detects ROCm if it’s present.


Install Ollama on Windows

  1. Download the Windows installer from ollama.com/download
  2. Run the .exe installer — no admin rights required
  3. Ollama installs as a background service and runs in the system tray
  4. Open a command prompt or PowerShell and run: ollama run llama3

Windows GPU support: NVIDIA CUDA is supported. Ensure you have the latest NVIDIA drivers installed before running Ollama. AMD GPU support on Windows is limited — check the Ollama GitHub releases page for current status.


Does Ollama work on Mac?

Yes. Ollama has native macOS support, and the Mac is one of the best platforms for local LLM inference, particularly on Apple Silicon (M1, M2, M3, M4).

  1. Download the macOS app from ollama.com/download
  2. Move it to your Applications folder and launch it
  3. Ollama runs in the menu bar; open Terminal and run ollama pull gemma

Apple Silicon performance: Ollama uses Metal GPU acceleration on Apple Silicon Macs. The unified memory architecture means the GPU and CPU share RAM — an M2 MacBook Pro with 16 GB unified memory can run a 7B model with fast inference. This is one of the most performant CPU/GPU setups for local LLMs relative to cost.

Intel Mac: Ollama runs in CPU-only mode on Intel Macs. Performance is slower but functional for 7B models with Q4 quantisation.

[IMAGE: Ollama running a local LLM on macOS for coding and development automation tasks]


Best Ollama models for coding

These are the models available through Ollama that deliver the best results for coding tasks in 2026:

Model | Pull command | Min VRAM | Coding focus | Licence
CodeLlama 7B | ollama pull codellama | ~6 GB | ★★★★★ (code completion, FIM) | Llama 2 Community
DeepSeek Coder 7B | ollama pull deepseek-coder | ~6 GB | ★★★★☆ (code-first training) | DeepSeek
Gemma 7B | ollama pull gemma | ~6 GB | ★★★★☆ (coding + reasoning) | Gemma ToU
Mistral 7B | ollama pull mistral | ~6 GB | ★★★☆☆ (general + scripting) | Apache 2.0
Phi-3 Mini | ollama pull phi3 | ~4 GB | ★★★☆☆ (lightweight coding) | MIT
Llama 3 8B | ollama pull llama3 | ~6 GB | ★★★☆☆ (general; good at code) | Llama 3 Community

Coding focus ratings are qualitative. Verify against current community benchmarks and your own workload before treating them as definitive.

Recommendation by use case:
  - Inline code completion / IDE assistant: codellama (7B or 13B)
  - Code + technical Q&A mixed workflow: gemma or llama3
  - Maximum licence permissiveness: mistral (Apache 2.0)
  - Limited hardware (< 6 GB VRAM): phi3

For a broader comparison of the best local LLMs for coding, including models beyond the Ollama library, see the full roundup.


Ollama model comparison for developers

Understanding the key differences between Ollama models helps you make the right choice for your workflow:

Context window: A larger context window means the model can hold more of your codebase in “memory” during a session. Context lengths vary by model and by version, so check the value reported by ollama show <model> rather than assuming one family always handles more than another.

Quantisation levels: Most Ollama models are available in multiple quantisation levels. The default pull gives you a balanced quantisation (typically a 4-bit variant such as Q4_0 or Q4_K_M). To choose a specific size and quantisation, append a tag. For lower VRAM usage at reduced quality:

ollama pull codellama:7b-code-q4_0

For higher quality at the cost of more VRAM:

ollama pull codellama:13b-code-q5_K_M
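
To get a feel for what these tags mean for memory, here is a back-of-the-envelope estimate. It is a sketch under stated assumptions: the bits-per-weight figures are approximate values for llama.cpp quantisation formats (the K-quant numbers are rough averages), and the flat overhead_gb allowance for the KV cache and runtime buffers is a hypothetical placeholder, since real overhead grows with context length.

```python
# Approximate bits per weight for some llama.cpp quantisation formats.
# Q4_0 and Q8_0 follow from their block layouts; the K-quant values are
# rough averages and should be treated as assumptions.
APPROX_BITS_PER_WEIGHT = {
    "q4_0": 4.5,
    "q4_K_M": 4.85,
    "q5_K_M": 5.7,
    "q8_0": 8.5,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the weights in gigabytes (10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # billions of parameters * bits per weight / 8 bits per byte = GB
    return params_billion * bits / 8


def estimated_memory_gb(params_billion: float, quant: str,
                        overhead_gb: float = 1.5) -> float:
    """Weights plus a flat, hypothetical allowance for KV cache and buffers."""
    return weights_gb(params_billion, quant) + overhead_gb


for quant in APPROX_BITS_PER_WEIGHT:
    print(f"7B  {quant:7s} ~{estimated_memory_gb(7, quant):.1f} GB")
    print(f"13B {quant:7s} ~{estimated_memory_gb(13, quant):.1f} GB")
```

Treat the numbers as a first filter only; the model sizes listed in the Ollama library and a trial run on your own hardware are the real test.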

Model tags: Run ollama show <model> to see the model’s parameters, quantisation, and context window. Run ollama list to see all downloaded models.
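
The same information is available programmatically. The sketch below assumes Ollama's native /api/tags route, which lists downloaded models much like ollama list, and reads the parameter_size and quantization_level fields from each entry's details object. The function names and the sample payload are illustrative, not taken from a real response.

```python
import json
import urllib.request


def summarize_models(tags_payload: dict) -> list[tuple[str, str, str]]:
    """Return (name, parameter_size, quantization_level) for each local model."""
    rows = []
    for model in tags_payload.get("models", []):
        details = model.get("details", {})
        rows.append((
            model.get("name", "?"),
            details.get("parameter_size", "?"),
            details.get("quantization_level", "?"),
        ))
    return rows


def fetch_local_models(base_url: str = "http://localhost:11434") -> list[tuple[str, str, str]]:
    """Query a running Ollama service for its downloaded models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as reply:
        return summarize_models(json.load(reply))


# Illustrative payload, trimmed to the fields used above (values are made up).
sample = {"models": [{"name": "codellama:latest",
                      "details": {"parameter_size": "7B",
                                  "quantization_level": "Q4_0"}}]}
print(summarize_models(sample))  # [('codellama:latest', '7B', 'Q4_0')]
```

Missing fields fall back to "?" so the summary still prints if a model entry lacks details.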


Using Ollama for internal automation

Ollama’s REST API makes it practical to build internal automation tools that use a local LLM as a reasoning layer. Because the API mirrors the OpenAI API schema, existing code written against the OpenAI SDK often needs only a base_url change:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by library; ignored by Ollama
)

response = client.chat.completions.create(
    model="codellama",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Review this function and suggest improvements: def add(a,b): return a+b"}
    ]
)
print(response.choices[0].message.content)

Practical automation use cases:
  - Automated code review comments: pipe a diff to the model and get structured feedback
  - Commit message generation: send a git diff and ask for a conventional commit message
  - Documentation generation: pass a function signature and docstring skeleton; get it filled in
  - Internal Slack bot: route questions to the local Ollama API instead of the OpenAI API
  - Pre-commit hooks: lint code against a style guide using the model as a reasoning layer
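
As one possible shape for the commit message use case, the sketch below reads the staged diff with git diff --staged and sends it to the OpenAI-compatible chat endpoint using only the standard library. The function names, the prompt wording, and the MAX_DIFF_CHARS cap are all assumptions made for illustration; tune them for your model and repository.

```python
import json
import subprocess
import urllib.request

MAX_DIFF_CHARS = 8000  # hypothetical cap so the diff fits a small context window


def build_commit_messages(diff: str, max_chars: int = MAX_DIFF_CHARS) -> list[dict]:
    """Turn a diff into chat messages asking for a conventional commit message."""
    if len(diff) > max_chars:
        diff = diff[:max_chars] + "\n[diff truncated]"
    return [
        {"role": "system",
         "content": "You write concise conventional commit messages. "
                    "Reply with the commit message only."},
        {"role": "user",
         "content": f"Write a commit message for this diff:\n\n{diff}"},
    ]


def suggest_commit_message(model: str = "codellama",
                           base_url: str = "http://localhost:11434/v1") -> str:
    """Send the staged diff to a running Ollama service and return its suggestion."""
    diff = subprocess.run(["git", "diff", "--staged"], capture_output=True,
                          text=True, check=True).stdout
    body = json.dumps({"model": model, "messages": build_commit_messages(diff)})
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        data = json.load(reply)
    return data["choices"][0]["message"]["content"].strip()
```

Run inside a repository with staged changes, suggest_commit_message() returns a draft that a human should still review before committing.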

For the full local AI infrastructure setup guide covering UI setup, Docker deployment, and multi-user configurations, see the infrastructure guide.

To set up Gemma specifically, see the guide to running Gemma through Ollama in your dev workflow.


Frequently asked questions

What is Ollama used for?

Ollama is used to run open-weights LLMs locally on your own hardware. Developers use it for coding assistance, internal chat assistants, automation pipelines, and LLM application prototyping. It provides a CLI and a REST API, making it easy to integrate with IDEs, scripts, and applications without sending data to cloud providers.

What models does Ollama support?

Ollama supports most major open-weights models including Llama 3 (Meta), CodeLlama (Meta), Gemma (Google), Mistral, Phi-3 (Microsoft), DeepSeek Coder, and dozens more. Run ollama list to see locally downloaded models, or browse the full registry at ollama.com/library.

Does Ollama work on Mac and Windows?

Yes to both. Ollama has a native macOS application with Metal GPU acceleration, which is particularly fast on Apple Silicon. On Windows, Ollama installs as a background service with NVIDIA CUDA support. Both platforms can run 7B models at interactive speeds on consumer hardware.


Last updated: 2026.
