Skip to content

Run Locally with ollama

Use this guide to review your local changes with a local ollama model — zero API cost, zero egress, no keys required. The CLI reviews your git diff and prints the findings; to post reviews on real pull requests, use the GitHub Action.

Prerequisites

  • lgtmaybe installed (pip install lgtmaybe)
  • ollama installed and running
  • A local git repository with changes to review

Pull the model you want

ollama pull qwen3.6:27b        # strong all-round coding model
ollama pull gemma4:e4b         # smaller — for devices with limited RAM

List available models:

ollama list

Run the review

From inside the repo, on the branch you want reviewed:

lgtmaybe review \
  --provider ollama \
  --model qwen3.6:27b \
  --api-base http://localhost:11434

This diffs your current branch against the default branch and prints the findings. Add --working to review only your uncommitted edits, or --base <ref> to diff against a different base.

Use a remote ollama instance

If ollama runs on another machine (e.g. a Tailscale peer):

lgtmaybe review \
  --provider ollama \
  --model qwen3.6:27b \
  --api-base http://100.x.x.x:11434

No authentication is added — ollama has no built-in auth. Ensure network access is restricted at the host or firewall level.

Inside the GitHub Action's container

The Action runs lgtmaybe in a container, so ollama on the runner host is reached at host.docker.internal rather than localhost. Set it in .lgtmaybe.yml, since the Action reads its provider settings from config:

provider: ollama
model: qwen3.6:27b
api_base: http://host.docker.internal:11434

Get findings as JSON

The CLI prints a readable listing by default and never posts anywhere. Add --json for a machine-readable array you can pipe into other tooling:

lgtmaybe review \
  --provider ollama \
  --model qwen3.6:27b \
  --api-base http://localhost:11434 \
  --json

Let an AI agent apply the fixes

--format agent prints the findings as correction instructions an AI coding agent (such as Claude Code) can read and apply, so you can review and fix a branch locally before opening a PR. See Fix findings with an AI agent.

Slow models and timeouts

Local models are slow, especially large ones on CPU, so lgtmaybe gives ollama a long default per-request timeout (300 seconds) automatically — you don't need to set anything for a normal run. (Cloud providers default to 60 s.)

If a big model still times out — you'll see litellm.Timeout: Connection timed out after 300.0 seconds — raise it explicitly:

# CLI flag (seconds):
lgtmaybe review --provider ollama --model qwen3.6:35b \
  --api-base http://localhost:11434 --timeout 900
# or in .lgtmaybe.yml (also how the GitHub Action picks it up):
provider: ollama
model: qwen3.6:35b
timeout: 900

The review fans out one call per category. lgtmaybe runs those serially for ollama (a single ollama instance serves one request at a time, so firing them concurrently would only make each wait and time out). The trade-off is wall-clock time — a slow model takes roughly categories × per-call time. To go faster, narrow the lenses with categories: in .lgtmaybe.yml (e.g. just security and correctness), use a smaller model, or give ollama more GPU. If you have the VRAM to truly serve requests in parallel, raise OLLAMA_NUM_PARALLEL on the ollama server — lgtmaybe still issues ollama calls one at a time, but a faster server shortens each.

Troubleshooting

Connection refused on port 11434 — ensure ollama serve is running and the --api-base URL is reachable.

Model not found — run ollama pull <model> before using it.

review incomplete — the model returned no usable output — every category call timed out or returned output that wasn't valid JSON. Raise --timeout, try a model that follows instructions more reliably, or check LITELLM_LOG=DEBUG output for the underlying error. lgtmaybe reports this (and exits non-zero) rather than pretending the PR is clean.

For a large diff this can mean the prompt plus the findings don't fit in ollama's context window and the output gets truncated. lgtmaybe runs ollama with a generous context (num_ctx of 16384) and structured JSON output (it also disables "thinking" so reasoning models like qwen3.x emit the findings directly), which covers most reviews.

For a big multi-file change ("vibe-coded" commits across many files), raise the context window with --num-ctx so the whole diff and the findings fit — this is ollama-only (hosted providers manage their context window server-side and ignore it):

# A large multi-file diff on a local model — more time and more context:
lgtmaybe review --provider ollama --model qwen3.6:35b \
  --api-base http://localhost:11434 --timeout 900 --num-ctx 32768
# or in .lgtmaybe.yml (also how the GitHub Action picks it up):
provider: ollama
model: qwen3.6:35b
timeout: 900
num_ctx: 32768

--num-ctx needs enough RAM/VRAM on the ollama host — a bigger window costs memory, so size it to your machine. The token budget that decides when lgtmaybe splits a diff into separate model calls is --max-input-tokens (default 100000), which applies to any provider — raise it to send a large diff in fewer calls, lower it for a small-context model. If a very large diff still truncates, narrow it with include_paths / exclude_paths or a lower max_files in .lgtmaybe.yml, or run a model with a bigger context window.

Review is empty or truncated — the diff may exceed the model's context window. Add a path filter in .lgtmaybe.yml to reduce diff size, or set max_files to a lower value.