Context
I’ve been curious about running local LLMs for development tasks — code review, summarisation, drafting — without relying on cloud APIs. Ollama makes this straightforward to set up, but most benchmarks assume beefy hardware. I wanted to know: what’s the experience like on a standard development machine?
My setup: a laptop with a 12th-gen Intel i7, 16GB RAM, integrated graphics. No discrete GPU. This is the kind of machine most developers actually use.
Setup
I installed Ollama on Ubuntu 22.04 and pulled three models of varying sizes:
- Phi-3 Mini (3.8B) — Microsoft’s compact model
- Llama 3.1 8B — Meta’s general-purpose model
- Mistral 7B — Mistral AI’s base model
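The pulls can also be scripted against Ollama's REST API, which listens on localhost:11434 by default. A minimal sketch — the model tags are my assumption of the registry names at the time of writing, so check them against `ollama list`:

```python
import json
import urllib.request

# Tags assumed to match the Ollama model library; verify before relying on them.
MODELS = ["phi3", "llama3.1:8b", "mistral"]

def pull(name, url="http://localhost:11434/api/pull"):
    """Pull one model through Ollama's REST API (same effect as `ollama pull`)."""
    payload = json.dumps({"name": name}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # progress updates stream back as JSON lines
            print(name, json.loads(line).get("status", ""))

# Usage (needs a running Ollama server):
# for m in MODELS:
#     pull(m)
```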
Each model was tested with three tasks:
- Summarise a 500-word technical document
- Review a 50-line Python function for bugs
- Generate a commit message from a diff
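The commit-message task is the easiest to wire up as a script: feed the staged diff to the generate endpoint and take the response. A sketch assuming a local Ollama server; the `build_prompt` wording and helper names are my own, not taken from the post:

```python
import json
import subprocess
import urllib.request

def build_prompt(diff):
    """Illustrative prompt wrapper; the exact wording is an assumption."""
    return (
        "Write a one-line commit message for the following diff. "
        "Reply with the message only.\n\n" + diff
    )

def commit_message(model="phi3", url="http://localhost:11434/api/generate"):
    """Draft a commit message for the staged diff via a local Ollama server."""
    diff = subprocess.run(
        ["git", "diff", "--staged"], capture_output=True, text=True, check=True
    ).stdout
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(diff), "stream": False}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
```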
I measured time-to-first-token and total generation time for each task.
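One way to take these measurements is to read Ollama's streaming endpoint and timestamp the first fragment as it arrives. A minimal sketch using only the standard library — the `stream_tokens`/`measure` helpers are illustrative, not the harness actually used:

```python
import json
import time
import urllib.request

def stream_tokens(model, prompt, url="http://localhost:11434/api/generate"):
    """Yield response fragments from a local Ollama server as they arrive."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # Ollama streams one JSON object per line
            chunk = json.loads(line)
            if not chunk.get("done"):
                yield chunk.get("response", "")

def measure(token_iter):
    """Return (time_to_first_token, total_time) in seconds for any token iterator."""
    start = time.perf_counter()
    ttft = None
    for _ in token_iter:
        if ttft is None:
            ttft = time.perf_counter() - start
    return ttft, time.perf_counter() - start

# Usage (needs a running Ollama server):
# ttft, total = measure(stream_tokens("phi3", "Summarise this document: ..."))
```

Separating `measure` from the network call keeps the timing logic testable against any iterator.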
Observations
Phi-3 Mini was the clear winner for constrained hardware. Time-to-first-token was under 2 seconds for all tasks, and total generation completed in 5-15 seconds depending on output length. Quality was surprisingly good for summarisation and commit messages. Code review was acceptable but occasionally missed subtle issues.
Llama 3.1 8B was usable but slow. Time-to-first-token ranged from 4-8 seconds, and longer outputs could take 30-45 seconds. The quality improvement over Phi-3 was noticeable for code review but marginal for simpler tasks. Memory usage peaked at around 12GB, leaving little headroom for other applications.
Mistral 7B performed similarly to Llama 3.1 in speed. Quality was comparable, with slightly better performance on natural language tasks and slightly worse on code-specific tasks.
All three models ran entirely on CPU. The lack of a GPU wasn’t a dealbreaker for interactive use, but it does rule out batch processing over many documents or real-time streaming at conversational speeds.
Takeaways
For constrained hardware, smaller models are the practical choice. Phi-3 Mini strikes a balance of quality and speed that makes it genuinely useful for development workflows.
Key findings:
- 3-4B parameter models are the sweet spot for 16GB machines. Anything larger and you’re fighting for memory.
- Task complexity matters more than model size for most development work. A small model with a clear prompt often beats a large model with a vague one.
- CPU inference is viable for interactive use — not for streaming chat, but for fire-and-forget tasks like “review this function” or “summarise this doc.”
- Ollama’s model management is excellent — switching between models is trivial, which makes it easy to use the right size for each task.
Next, I want to test quantised versions of larger models to see if I can get 13B-class quality at 7B-class speed.