Self-Hosting LLMs in 2025: Complete Setup Guide for Mac (M-Series)

A practical guide to running Llama 3, Mistral, and Qwen on an M-series Mac. Includes performance benchmarks and real use cases.

Why Self-Host?

Hardware Requirements

| Model | Quantization | Min RAM | M-Series Chip | Tokens/sec (M4) |
|-------|-------------|---------|---------------|-----------------|
| Llama 3 8B | Q4_K | 8GB | M3+ | ~25 t/s |
| Mistral 7B | Q4_K | 8GB | M3+ | ~22 t/s |
| Qwen 3 0.6B | FP16 | 2GB | M1+ | ~40 t/s |
| Phi-4 3B | Q4_K | 4GB | M2+ | ~35 t/s |

For anything beyond 7B, you'll want 16GB+ unified memory. 8B models run on 8GB RAM, but it's tight: unified memory is shared with macOS and every other open app.
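The RAM figures above can be sanity-checked with simple arithmetic: weights take roughly parameters × bits per weight ÷ 8 bytes, plus headroom for the KV cache and runtime buffers. Here is a minimal sketch of that estimate. The `estimate_gb` helper and its 1.2× overhead factor are assumptions for illustration, not measured values, and Q4_K is treated as a flat 4 bits per weight (real Q4_K files run somewhat larger).

```bash
# Hypothetical helper: rough memory estimate in GB for a quantized model.
#   $1 = parameter count in billions (e.g. 8)
#   $2 = bits per weight (e.g. 4 for Q4_K, 16 for FP16)
estimate_gb() {
  local params_b="$1"
  local bits="$2"
  # weights (GB) = params_b * bits / 8; the 1.2 overhead factor is an assumed margin
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}
```

For example, `estimate_gb 8 4` prints `4.8`, which fits in 8GB only if little else is running.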

---

Step 1: Install Ollama

```bash
brew install ollama
```

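Or download the desktop app from ollama.com. Either way, Ollama runs as a local server at localhost:11434. With the Homebrew install, start it with `ollama serve` or `brew services start ollama`.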
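Before pulling models, it can help to confirm the server is actually listening. A small sketch, assuming the default port and Ollama's `/api/version` endpoint; the `ollama_up` function name is made up for this example.

```bash
# Hypothetical helper: succeeds (exit 0) if an Ollama server answers at the given base URL.
#   $1 = base URL (defaults to the standard local address)
ollama_up() {
  local base_url="${1:-http://localhost:11434}"
  # -f: treat HTTP errors as failures, -s: no progress output,
  # --max-time 2: give up after two seconds instead of hanging
  curl -fs --max-time 2 "$base_url/api/version" > /dev/null
}
```

Usage: `ollama_up && echo "server is up" || echo "server is not reachable"`.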

---

Step 2: Pull a Model

```bash
# Small, fast model for daily tasks
ollama pull qwen3:0.6b

# Mid-range for coding
ollama pull qwen3:4b

# Full capability
ollama pull llama3:8b
```
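In setup scripts you may want to pull a model only when it isn't already on disk. One way to sketch that, assuming `ollama list` prints a header row followed by one model per line with the name in the first column; `ensure_model` is a hypothetical helper, not part of the Ollama CLI.

```bash
# Hypothetical helper: pull a model only if `ollama list` doesn't already show it.
#   $1 = model tag, e.g. qwen3:4b
ensure_model() {
  # Skip the header row, take the NAME column, and look for an exact match.
  ollama list | awk 'NR > 1 { print $1 }' | grep -Fqx "$1" || ollama pull "$1"
}
```

Usage: `ensure_model qwen3:4b`.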

---

Step 3: Run and Query

```bash
ollama run qwen3:0.6b
```

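This drops you into an interactive REPL (type `/bye` to exit). For API access, POST to the local server. By default the response streams back as one JSON object per line: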

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:0.6b",
  "prompt": "Explain quantum entanglement in simple terms"
}'
```
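For scripting, a single JSON object is usually easier to handle than a stream. The sketch below sets `"stream": false` in the request body, which makes the API return one complete response. `generate_once` is a hypothetical wrapper, and its escaping is deliberately minimal: it handles quotes and backslashes but not newlines or control characters.

```bash
# Hypothetical helper: one-shot, non-streaming request to the local Ollama server.
#   $1 = model tag, $2 = prompt text
generate_once() {
  local model="$1"
  local prompt="$2"
  local escaped
  # Escape backslashes and double quotes so the prompt stays valid JSON.
  # Newlines and control characters are NOT handled; use a JSON tool for those.
  escaped=$(printf '%s' "$prompt" | sed 's/\\/\\\\/g; s/"/\\"/g')
  curl -s http://localhost:11434/api/generate \
    -d "$(printf '{"model": "%s", "prompt": "%s", "stream": false}' "$model" "$escaped")"
}
```

Usage: `generate_once qwen3:0.6b 'Explain quantum entanglement in simple terms'`. The generated text is in the `response` field of the returned JSON.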

---

Real Use Cases

Code Review

A 4B model running locally is enough for quick code reviews. It's not as capable as GPT-4o or Claude, but it's fast and free.

**Setup:** `ollama run qwen3:4b` + VS Code extension
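Without an editor extension, the same review can be run from the terminal by passing a diff to `ollama run` as a one-shot prompt. A rough sketch; the `review_diff` function and the prompt wording are illustrative assumptions, and very large diffs may exceed the model's context window or the shell's argument-length limit.

```bash
# Hypothetical helper: ask a local model to review the currently staged changes.
#   $1 = model tag (defaults to qwen3:4b)
review_diff() {
  local model="${1:-qwen3:4b}"
  local diff_text
  diff_text="$(git diff --staged)"
  # Nothing staged means nothing to review.
  [ -n "$diff_text" ] || { echo "No staged changes to review." >&2; return 1; }
  # Passing a prompt argument makes `ollama run` answer once and exit.
  ollama run "$model" "Review this diff for bugs and unclear naming. Be concise.

$diff_text"
}
```

Usage: `git add -p && review_diff`.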

Note Summarization

Phi-4 3B handles long-document summarization well enough for personal knowledge management.

Brainstorming

Any 7B+ model is fine for structured brainstorming. The quality gap with frontier models matters less for generative, exploratory tasks.

---

Limitations

  • **Context window:** Most quantized models are capped at 8K–32K tokens
  • **Reasoning quality:** Self-hosted models lag behind GPT-4o/Claude on complex multi-step reasoning
  • **Multimodality:** Most self-hosted setups are text-only
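Because of the context-window cap, it is worth checking whether a document will fit before sending it. A rough sketch using the common rule of thumb of about 4 bytes per token for English text. This is an approximation, not a real tokenizer, and both function names are made up for this example.

```bash
# Hypothetical helper: approximate token count of a file (~4 bytes per token, rounded up).
#   $1 = path to a text file
approx_tokens() {
  echo $(( ($(wc -c < "$1") + 3) / 4 ))
}

# Hypothetical helper: succeeds if the file's estimated tokens fit in the window.
#   $1 = path to a text file, $2 = context window in tokens (defaults to 8192)
fits_context() {
  [ "$(approx_tokens "$1")" -le "${2:-8192}" ]
}
```

Usage: `fits_context notes.md 8192 || echo "too long, split it first"`. The window itself can be set per request through the `num_ctx` field under `options` in the API body, at the cost of more memory.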

---

Conclusion

For Mac users, self-hosting is now genuinely viable for daily-use cases. The sweet spot is Qwen 3 4B or Llama 3 8B on M3/M4. Not a replacement for frontier models on hard problems, but excellent for high-volume, privacy-sensitive, or cost-sensitive tasks.

Next step: set up an OpenWebUI instance to get a web interface.
