# Experimenting with Local LLMs on macOS

By Fatih Altinok, September 08, 2025 · 9 minutes to read

## Overview

This post explores the author's skeptical yet experimental approach to running large language models (LLMs) locally on a Mac. It covers the motivations, insights, and practical steps for using local LLMs, focusing on their potential, limitations, and use cases.

## Author's Perspective on LLMs

- LLMs are essentially advanced next-word predictors with complex emergent behavior, not sentient beings.
- They are useful for summarizing text, giving advice drawn from common knowledge, or serving as a personal "brain dump".
- The author warns against anthropomorphizing LLMs or relying too heavily on their output, since they tend to hallucinate.
- He is skeptical of AI hype and of unethical practices by AI companies, preferring open-weight models run locally for privacy and control.

## Why Run LLMs Locally?

- Experimentation: the enjoyment of exploring new tech and marveling at what LLMs can do on personal hardware.
- Privacy: avoid sending sensitive data to cloud services, where it may be retained or used for training.
- Ethical concerns: avoid supporting AI companies with questionable ethics and business practices.
- Control: open-weight models cannot be revoked, leaving you free to customize them and keep using them.

## How to Run LLMs on macOS

The author recommends two main options.

### Llama.cpp (open source)

- Created by Georgi Gerganov; supports many platforms.
- Offers many configuration options, supports model downloads, and includes a basic web UI.
- Installation via Nix (example command): `nix profile install nixpkgs#llama-cpp`
- Recommended starter model: Gemma 3 4B QAT, run with `llama-server -hf ggml-org/gemma-3-4b-it-qat-GGUF`
- Access the interface at http://127.0.0.1:8080 (a scripted example against this server is sketched after the model recommendations below).

### LM Studio (closed source, user-friendly)

- Provides a polished UI for browsing models, managing downloads, and chatting, with guardrails against crashes.
- Supports two runtimes on macOS: llama.cpp and Apple's MLX (faster but less configurable).
- Features:
  - Switch between and branch conversations
  - Regenerate and edit messages (including the assistant's)
  - Create reusable system-prompt presets
  - Configure model settings (e.g., context-overflow behavior)
- Downloadable as a DMG installer from lmstudio.ai.

## How to Choose a Suitable LLM

Key considerations:

- Model size: RAM is the hard limit (e.g., with 16 GB of RAM, models should ideally stay under 12 GB to avoid starving the rest of the system).
- Runtime compatibility: GGUF models for llama.cpp, MLX models for the MLX runtime.
- Quantization: models are typically distributed in 16-bit precision; quantizing to 4-bit (Q4) reduces size with acceptable quality. For example, a 12B-parameter model needs roughly 24 GB at 16-bit but only about 7–8 GB at Q4.
- Vision models: some accept image inputs for OCR or object recognition, but dedicated OCR tools often perform better.
- Reasoning: some models do inference-time reasoning, trading longer response times for better answers.
- Tool use: models can call external tools via MCP servers for code execution, web search, long-term memory, and more.
- Agents: LLMs that combine reasoning and tool use, calling functions repeatedly to work through complex tasks.

## Recommended Models to Try

- Gemma 3 12B QAT: vision support and fast, good-quality output from a non-reasoning model.
- Qwen3 4B 2507 Thinking: small, fast model with reasoning capabilities.
- GPT-OSS 20B: the largest model that runs on typical machines; multiple reasoning levels, very capable but slow.
- Phi-4 (14B): a previously favored model, also available in reasoning variants.
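Both runtimes can also serve the loaded model over a local HTTP API in the OpenAI chat-completions format, which makes scripting against them straightforward. The snippet below is a minimal sketch, assuming the `llama-server` instance from the Llama.cpp section above is still running on port 8080; the prompt and sampling settings are purely illustrative.

```sh
# Minimal sketch: query the locally running llama-server
# (http://127.0.0.1:8080) through its OpenAI-compatible endpoint.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Explain in two sentences what quantization means for local LLMs."}
    ],
    "temperature": 0.7
  }'
```

LM Studio can serve models in the same way (its local server typically listens on port 1234), so the same request shape should work there once its server is enabled, with only the port changed.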
## Additional Tips and Insights

- LM Studio helps monitor context-window usage; summarizing a conversation as it nears the context limit helps carry the important information forward.
- Use MCPs sparingly, as they increase context pollution; they enable advanced functions such as JavaScript execution or web search.
- Always fact-check AI output and avoid blindly accepting responses, especially on topics you cannot verify.

## Final Thoughts