Why Self-Host?
Hardware Requirements
| Model | Quantization | Min RAM | M-Series Chip | Tokens/sec (M4) |
|-------|-------------|---------|---------------|-----------------|
| Llama 3 8B | Q4_K | 8GB | M3+ | ~25 t/s |
| Mistral 7B | Q4_K | 8GB | M3+ | ~22 t/s |
| Qwen 3 0.6B | FP16 | 2GB | M1+ | ~40 t/s |
| Phi-4 3B | Q4_K | 4GB | M2+ | ~35 t/s |
For anything beyond 7B, you'll want 16GB or more of unified memory. 8B models run on 8GB, but it is tight: the weights share that memory with macOS and every other open app.
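The RAM figures above follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below is a rough estimate only. The function name `estimate_weights_gb` is illustrative, and the 4.5 bits per weight used for Q4_K is an assumed approximate figure. It counts the weights alone, so the KV cache and runtime overhead come on top.

```bash
# Rough weight-memory estimate in GB:
#   parameters (billions) x bits per weight / 8
# Weights only; context (KV cache) and runtime overhead are NOT included.
estimate_weights_gb() {
  local params_billion="$1" bits_per_weight="$2"
  awk -v p="$params_billion" -v b="$bits_per_weight" \
    'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_weights_gb 8 4.5    # 8B model, ~4.5 bits/weight assumed for Q4_K: prints 4.5
estimate_weights_gb 0.6 16   # 0.6B model at FP16: prints 1.2
```

Roughly 4.5GB of weights for an 8B model is why it fits on an 8GB machine but leaves little room for anything else.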
---
Step 1: Install Ollama
```bash
brew install ollama
```
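Or download the app from ollama.com. Either way, Ollama runs as a local server at `localhost:11434`; with the Homebrew install you may need to start it yourself with `ollama serve`.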
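Before pulling models, it can help to confirm the server is actually listening. The following is a minimal sketch, assuming the default port and Ollama's `/api/version` endpoint; the function name `check_ollama` is just an illustrative choice.

```bash
# Prints the server's version JSON if Ollama answers, otherwise a hint on stderr.
# Takes an optional base URL; defaults to the standard local address.
check_ollama() {
  local base_url="${1:-http://localhost:11434}"
  if curl -fsS --max-time 2 "$base_url/api/version"; then
    echo
    echo "Ollama is reachable at $base_url"
  else
    echo "Ollama is not reachable at $base_url (is 'ollama serve' running?)" >&2
    return 1
  fi
}
```

Call `check_ollama` with no arguments to test the default address, or pass another base URL if you changed the port.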
---
Step 2: Pull a Model
```bash
# Small, fast model for daily tasks
ollama pull qwen3:0.6b
# Mid-range for coding
ollama pull qwen3:4b
# Full capability
ollama pull llama3:8b
```
---
Step 3: Run and Query
```bash
ollama run qwen3:0.6b
```
This drops you into an interactive REPL. For API access:
```bash
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:0.6b",
"prompt": "Explain quantum entanglement in simple terms"
}'
```
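By default `/api/generate` streams its answer as a series of JSON objects, one per line, which is awkward to use in scripts. Setting `"stream": false` returns a single JSON object whose `response` field holds the full text. The helper below is a sketch under that assumption; `build_payload` and `ask` are hypothetical names, and the payload builder does no JSON escaping.

```bash
# Build a non-streaming request body for /api/generate.
# NOTE: no JSON escaping is done, so keep double quotes and backslashes
# out of the prompt (or build the payload with a real JSON tool instead).
build_payload() {
  local model="$1" prompt="$2"
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$model" "$prompt"
}

# Send the prompt to the local server and print the raw JSON reply.
ask() {
  curl -s http://localhost:11434/api/generate -d "$(build_payload "$1" "$2")"
}
```

For example, `ask qwen3:0.6b "Explain quantum entanglement in simple terms"` prints one JSON object; if you have `jq` installed, piping it through `jq -r .response` leaves just the answer text.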
---
Real Use Cases
Code Review
A 4B model running locally is enough for quick code reviews. It is not as capable as GPT-4o or Claude, but it is fast and free.
**Setup:** `ollama run qwen3:4b` + VS Code extension
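The same review can also be driven from the terminal without an editor extension, since `ollama run` accepts a prompt as an argument and exits after answering. This is one possible sketch, not the setup described above: the function names and the prompt wording are made up for illustration.

```bash
# Prepend a short instruction to whatever diff arrives on stdin.
build_review_prompt() {
  printf 'Review the following diff. List likely bugs first, then style issues.\n\n'
  cat
}

# Review the currently staged changes with a local model (default: qwen3:4b).
review_staged() {
  local model="${1:-qwen3:4b}"
  ollama run "$model" "$(git diff --staged | build_review_prompt)"
}
```

Inside a repository, run `git add` on the files you care about and then `review_staged`. Large diffs may exceed the model's context window, so this works best on small, focused changes.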
Note Summarization
Phi-4 3B handles long-document summarization well enough for personal knowledge management.
Brainstorming
Any 7B+ model is fine for structured brainstorming. The quality gap with frontier models matters less for generative, exploratory tasks.
---
Limitations
---
Conclusion
For Mac users, self-hosting is now genuinely viable for daily use. The sweet spot is Qwen 3 4B or Llama 3 8B on an M3 or M4. It is not a replacement for frontier models on hard problems, but it is excellent for high-volume, privacy-sensitive, or cost-sensitive tasks.
Next step: set up an Open WebUI instance to get a browser-based chat interface on top of the local server.