Introduction
The future of AI is agentic. Rather than relying on a single, monolithic model in the cloud, the next generation of intelligent applications will be powered by ecosystems of specialized agents running directly on user devices. For clinical innovators and low-code builders, this shift is transformative: on-device AI keeps sensitive data on the device, removes per-request cloud inference costs, and enables specialized models that can be certified for regulated healthcare domains.
This guide provides everything you need to select the right models, understand extreme quantization techniques (including 1-bit models), and plan hardware requirements for building multi-agent systems on Apple devices using the MLX framework.
Why Apple Silicon for On-Device AI?
Apple’s unified memory architecture is uniquely suited for local AI inference. Unlike traditional PCs, where the CPU and GPU have separate memory pools, Apple Silicon shares a single memory pool across all components. This eliminates the need to copy model weights between system RAM and VRAM, significantly accelerating inference.
MLX, Apple’s open-source machine learning framework, is purpose-built for this architecture. It supports a wide spectrum of quantization techniques, from standard 8-bit down to extreme 1-bit and 2-bit formats, allowing builders to compress models by over 85% while maintaining production-ready capabilities.
The Quantization Dimension: Balancing Quality and Size
Quantization is the process of reducing the precision of a model’s weights (e.g., from 16-bit floating point down to 4-bit or even 1-bit integers). This drastically reduces the memory footprint, allowing larger models to run on smaller devices.
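To make the idea concrete, here is a minimal sketch of round-to-nearest affine quantization applied to one small group of weights. It is an illustration only: the function names and the sample weights are hypothetical, and production formats (including those used by MLX or GGUF) use more elaborate group layouts and packing.

```python
def quantize_group(weights, bits):
    """Map a group of floats to integer codes in [0, 2**bits - 1] (min/max affine scheme)."""
    levels = (1 << bits) - 1                      # e.g. 15 steps for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [-0.42, -0.10, 0.03, 0.27, 0.55, -0.31, 0.12, 0.48]

for bits in (8, 4, 2):
    codes, scale, lo = quantize_group(weights, bits)
    restored = dequantize_group(codes, scale, lo)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: max reconstruction error = {max_err:.4f}")
```

Fewer bits mean fewer representable levels, so the reconstruction error grows as precision drops. That is the quality-versus-size trade-off the rest of this section describes.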
When building agents, Small Language Models (SLMs) should be used as reasoning and retrieval engines, not as raw knowledge bases. Because their primary job is logic and tool use, they can often survive aggressive quantization without losing their core utility.
Standard Quantization Spectrum (Post-Training)
Most models on Hugging Face use post-training quantization (PTQ). The table below outlines the trade-offs:
| Quantization Level | Memory per Parameter | Quality Trade-off | Best Use Case |
|---|---|---|---|
| FP16 (16-bit) | 2.00 bytes | Baseline | Unrestricted server environments |
| Q8_0 (8-bit) | 1.06 bytes | Excellent (Near lossless) | High-precision medical reasoning |
| Q6_K (6-bit) | 0.81 bytes | Very Good | Balance of high quality and size |
| Q4_K_M (4-bit) | 0.64 bytes | Acceptable | The standard default for most local agents |
| Q3_K_M (3-bit) | 0.44 bytes | Noticeable loss | Highly constrained mobile environments |
| Q2_K (2-bit) | 0.31 bytes | Significant loss | Not recommended for complex reasoning |
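The "Memory per Parameter" column can be turned into a rough weight footprint by multiplying it by the parameter count. The sketch below does that arithmetic; the helper name is hypothetical, and it assumes 1 GB = 10^9 bytes and ignores runtime overhead such as the KV cache.

```python
# Bytes per parameter, copied from the table above.
BYTES_PER_PARAM = {
    "FP16": 2.00,
    "Q8_0": 1.06,
    "Q6_K": 0.81,
    "Q4_K_M": 0.64,
    "Q3_K_M": 0.44,
    "Q2_K": 0.31,
}


def weight_footprint_gb(num_params, level):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[level] / 1e9


# Example: a 7-billion-parameter model at each quantization level.
for level in BYTES_PER_PARAM:
    print(f"{level:7s} {weight_footprint_gb(7e9, level):6.2f} GB")
```

For a 7B model this gives about 4.5 GB at Q4_K_M and 14 GB at FP16.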
The Breakthrough: Extreme and Native 1-Bit Quantization
Work published in 2025 and 2026 has pushed quantization close to its practical limits, changing the economics of on-device AI:
1. Native 1-Bit Models (e.g., Bonsai 8B): Unlike PTQ models that degrade at 1-bit, models like PrismML’s Bonsai 8B are trained natively from scratch as 1-bit networks. Bonsai delivers an 8.2B parameter model in just 1.15 GB of storage. Despite being 14x smaller than its 16-bit equivalent, it scores a 70.5 average on benchmarks, outperforming standard 8B models that are vastly larger.
2. TurboQuant (KV Cache Compression): Google’s TurboQuant algorithm compresses the Key-Value (KV) cache, the memory used to store conversation context, down to 3 bits with zero accuracy loss. This allows agents to maintain massive context windows (up to 4M tokens) without exhausting device RAM.
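To see why KV cache precision matters, the sketch below estimates cache size from the standard formula: two tensors (keys and values) per layer, each holding `num_kv_heads * head_dim` values per token. The architecture numbers are assumptions chosen to resemble a typical 8B model with grouped-query attention. The calculation covers only raw value storage and ignores any metadata a real compression scheme would add. It is size arithmetic, not an implementation of TurboQuant.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bits_per_value):
    """Raw KV cache size in bytes for a single sequence."""
    # Keys and values: 2 tensors per layer, one vector per KV head per token.
    values = 2 * num_layers * num_kv_heads * head_dim * context_tokens
    return values * bits_per_value / 8


# Assumed architecture (hypothetical, for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (8_192, 131_072):
    fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens, 16)
    q3 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens, 3)
    print(f"{tokens:>7} tokens: 16-bit = {fp16 / 2**30:.2f} GiB, 3-bit = {q3 / 2**30:.2f} GiB")
```

Under these assumptions, moving from 16-bit to 3-bit values shrinks the cache by a factor of 16/3, or roughly 5.3x, at any context length.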
Quick Reference: Model Size to Storage
The table below shows approximate storage needs across different quantization levels.
| Model Size | Q8 (8-bit) Storage | Q4 (4-bit) Storage | Native 1-bit Storage | Example Models |
|---|---|---|---|---|
| 1.7B | ~1.8 GB | ~1.0 GB | ~0.25 GB | SmolLM-1.7B |
| 3B | ~3.2 GB | ~1.8 GB | ~0.45 GB | Phi-3-mini |
| 7B / 8B | ~8.0 GB | ~4.5 GB | 1.15 GB | Llama-3-8B, Bonsai 8B |
| 14B | ~15.0 GB | ~8.5 GB | ~2.0 GB | Qwen-14B |
| 70B | ~74.0 GB | ~40.0 GB | ~10.0 GB | Llama-3-70B |
Note: Storage is for model weights only. Active inference requires additional RAM for the KV cache.
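As a sanity check on the native 1-bit figures, the following sketch derives the effective bits per weight and the compression ratio against FP16 from a storage size and a parameter count. The helper names are hypothetical, and 1 GB is assumed to be 10^9 bytes.

```python
def effective_bits_per_weight(storage_gb, num_params):
    """Average number of bits stored per parameter."""
    return storage_gb * 1e9 * 8 / num_params


def compression_vs_fp16(storage_gb, num_params):
    """How many times smaller the model is than a 2-bytes-per-parameter copy."""
    fp16_gb = num_params * 2 / 1e9
    return fp16_gb / storage_gb


# Figures quoted in this guide for Bonsai 8B: 8.2B parameters in 1.15 GB.
bits = effective_bits_per_weight(1.15, 8.2e9)
ratio = compression_vs_fp16(1.15, 8.2e9)
print(f"effective bits per weight: {bits:.2f}")
print(f"compression vs FP16: {ratio:.1f}x")
```

With the quoted figures this works out to roughly 1.12 bits per weight and about 14x compression. The small excess over 1 bit presumably reflects scales and other parts of the model kept at higher precision.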
Device Specifications and Multi-Agent Capacity
The primary constraint for on-device AI is unified memory (RAM). With extreme quantization, such as 1-bit models, you can run several times as many concurrent agents on a single device as you can with 4-bit models.
Mobile Devices (iPhone / iPad)
| Device | RAM | Usable for AI | Max Q4 8B Agents | Max 1-bit 8B Agents | Recommended Setup |
|---|---|---|---|---|---|
| iPhone 16 (All) | 8 GB | ~4 GB | 0 | 2 to 3 | Bonsai 8B (1-bit) |
| iPhone 17 Pro | 12 GB | ~7 GB | 1 | 4 to 5 | Bonsai 8B + Qwen3-0.6B |
| iPad Pro M4 (Base) | 8 GB | ~5 GB | 1 | 3 to 4 | Bonsai 8B (1-bit) |
| iPad Pro M4/M5 (1TB+) | 16 GB | ~11 GB | 2 | 7 to 8 | Llama-3-8B (Q4) or Swarm of 1-bits |
Mac Computers
| Device | RAM Options | Usable for AI | Max Q4 8B Agents | Max 1-bit 8B Agents |
|---|---|---|---|---|
| MacBook Air M4 | 16 GB, 24 GB, 32 GB | 10–24 GB | 2 to 5 | 8 to 18 |
| Mac Mini M4 | 24 GB, 32 GB | 17–24 GB | 3 to 5 | 13 to 18 |
| MacBook Pro M4 Pro | 24 GB, 48 GB | 17–38 GB | 3 to 8 | 13 to 28 |
| Mac Studio M4 Max | 36 GB, 64 GB, 128 GB | 28–105 GB | 6 to 23 | 20 to 75+ |
Usable memory assumes ~6–8 GB reserved for macOS and applications. Mobile devices reserve ~3–5 GB.
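The agent counts in the tables above follow from a simple division of usable memory by per-agent footprint. The sketch below shows that arithmetic under two stated assumptions: each agent holds its own copy of the weights, and each agent needs a fixed KV cache allowance. The 0.25 GB allowance is a hypothetical value for illustration, not a measured one.

```python
import math


def max_agents(usable_gb, weights_gb, kv_cache_gb_per_agent=0.0):
    """Number of agents that fit when each needs its own weights plus KV cache."""
    per_agent = weights_gb + kv_cache_gb_per_agent
    return math.floor(usable_gb / per_agent)


# iPhone 16 row: ~4 GB usable for AI.
print(max_agents(4.0, 4.5))         # Q4 8B model (~4.5 GB)        -> 0
print(max_agents(4.0, 1.15))        # 1-bit 8B model, weights only -> 3
print(max_agents(4.0, 1.15, 0.25))  # with assumed 0.25 GB KV each -> 2
```

Reserving memory for context is what turns a single number into a range such as "2 to 3": the longer the conversations each agent must hold, the fewer agents fit.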
Orchestrating Multi-Agent Systems
When building systems with multiple specialized agents, adopt the “Router and Specialist” pattern to maximize efficiency:
1. Hot Router Model: A small, highly compressed model (e.g., a 1-bit 8B model like Bonsai) stays permanently loaded. It classifies incoming requests in under 300 ms and handles general queries directly. At just 1.15 GB, it leaves plenty of headroom.
2. Cold Specialist Models: Specialized models (e.g., a Q8 medical reasoning model or a Q4 coding model) load on demand when the router dispatches a task. They are evicted when memory pressure requires it.
This pattern ensures memory is only consumed by high-precision models when actively performing complex work.
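The pattern can be sketched in plain Python as follows. Everything here is a stand-in: the `route` function uses keyword rules where a real system would call the hot router model, the `loader` callback returns a placeholder string instead of loading weights, and the specialist names, sizes, and memory budget are hypothetical. Eviction is least-recently-used, which is one reasonable policy rather than the only one.

```python
from collections import OrderedDict


class SpecialistPool:
    """Keeps cold specialists in memory up to a budget, evicting the least recently used."""

    def __init__(self, budget_gb, loader):
        self.budget_gb = budget_gb
        self.loader = loader
        self.loaded = OrderedDict()  # name -> (model, size_gb), oldest first

    def used_gb(self):
        return sum(size for _, size in self.loaded.values())

    def get(self, name, size_gb):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as recently used
            return self.loaded[name][0]
        if size_gb > self.budget_gb:
            raise MemoryError(f"{name} does not fit in the {self.budget_gb} GB budget")
        # Evict until the new specialist fits.
        while self.loaded and self.used_gb() + size_gb > self.budget_gb:
            self.loaded.popitem(last=False)
        model = self.loader(name)
        self.loaded[name] = (model, size_gb)
        return model


# Hypothetical specialists: label -> (model name, approximate size in GB).
SPECIALISTS = {"medical": ("medical-q8", 8.0), "coding": ("coding-q4", 4.5)}


def route(request):
    """Stand-in for the hot router model: keyword rules instead of an LLM call."""
    text = request.lower()
    if "dosage" in text or "symptom" in text:
        return "medical"
    if "python" in text or "bug" in text:
        return "coding"
    return "general"


def handle(request, pool):
    label = route(request)
    if label == "general":
        return "router answers directly"
    name, size_gb = SPECIALISTS[label]
    return f"dispatched to {pool.get(name, size_gb)}"


pool = SpecialistPool(budget_gb=10.0, loader=lambda name: f"<{name}>")
print(handle("What dosage is safe for this patient?", pool))
print(handle("Fix this Python bug", pool))  # evicts the medical model to make room
print(handle("Hello there", pool))
```

With a 10 GB budget, the 8 GB medical specialist and the 4.5 GB coding specialist cannot coexist, so the second request evicts the first model before loading the second.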
The Future: On-Device AI by 2030
As chip manufacturing moves to 2 nm processes, the compute capability of mobile devices is expected to rise sharply. Combined with 1-bit quantization and TurboQuant cache compression, this is projected to make models as capable as today’s state-of-the-art centralized models fully runnable on a standard smartphone by 2030.
This hardware evolution breaks the linear relationship between AI usage and cloud compute costs, enabling freemium business models and establishing on-device inference as a critical competitive advantage for clinical innovators.