Introduction

The future of AI is agentic. Rather than relying on a single, monolithic model in the cloud, the next generation of intelligent applications will be powered by ecosystems of specialized agents running directly on user devices. For clinical innovators and low-code builders, this shift is transformative: on-device AI keeps sensitive data on the device, eliminates cloud inference costs, and enables specialized models that can be certified for regulated healthcare domains. This guide covers how to select models, how extreme quantization techniques (including 1-bit models) work, and how to plan hardware requirements for building multi-agent systems on Apple devices using the MLX framework.

Why Apple Silicon for On-Device AI?

Apple’s unified memory architecture is well suited to local AI inference. Unlike traditional PCs, where the CPU and GPU have separate memory pools, Apple Silicon shares a single memory pool across all components. This eliminates the need to copy model weights between system RAM and VRAM, which significantly accelerates inference. MLX, Apple’s open-source machine learning framework, is purpose-built for this architecture. It supports a wide spectrum of quantization techniques, from standard 8-bit down to extreme 1-bit and 2-bit formats, allowing builders to compress models by over 85% relative to 16-bit weights while maintaining production-ready capabilities.

The Quantization Dimension: Balancing Quality and Size

Quantization reduces the precision of a model’s weights (e.g., from 16-bit floating point down to 4-bit or even 1-bit integers). This drastically reduces the memory footprint, allowing larger models to run on smaller devices. When building agents, Small Language Models (SLMs) should be used as reasoning and retrieval engines, not as raw knowledge bases. Because their primary job is logic and tool use, they can often survive aggressive quantization without losing their core utility.
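To make the idea concrete, here is a minimal sketch of symmetric round-to-nearest quantization for one small group of weights. It is a simplified illustration, not the exact scheme used by MLX or by any particular model format; the function names and sample weights are assumptions chosen for the example.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit signed integers
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Each weight becomes a small integer code; the group shares one scale.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate weights from integer codes and the shared scale."""
    return [c * scale for c in codes]


# Illustrative weights only (not taken from a real model).
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.48, -0.02, 0.26]
codes, scale = quantize_group(weights, bits=4)
restored = dequantize_group(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", scale)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, so fewer bits (a larger scale) means coarser weights. That is the quality-versus-size trade-off discussed in this section.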

Standard Quantization Spectrum (Post-Training)

Most quantized models on Hugging Face use post-training quantization (PTQ). The table below outlines the trade-offs:

| Quantization Level | Memory per Parameter | Quality Trade-off | Best Use Case |
| --- | --- | --- | --- |
| FP16 (16-bit) | 2.00 bytes | Baseline | Unrestricted server environments |
| Q8_0 (8-bit) | 1.06 bytes | Excellent (near lossless) | High-precision medical reasoning |
| Q6_K (6-bit) | 0.81 bytes | Very good | Balance of high quality and size |
| Q4_K_M (4-bit) | 0.64 bytes | Acceptable | The standard default for most local agents |
| Q3_K_M (3-bit) | 0.44 bytes | Noticeable loss | Highly constrained mobile environments |
| Q2_K (2-bit) | 0.31 bytes | Significant loss | Not recommended for complex reasoning |
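As a rough illustration of how the per-parameter figures translate into a weight footprint, the sketch below multiplies a parameter count by the bytes-per-parameter values from the table. The helper name `weight_memory_gb` is an assumption for this example, and the result is only an estimate: real model files vary with metadata and with how individual layers are quantized.

```python
# Bytes per parameter, copied from the table above.
BYTES_PER_PARAM = {
    "FP16": 2.00,
    "Q8_0": 1.06,
    "Q6_K": 0.81,
    "Q4_K_M": 0.64,
    "Q3_K_M": 0.44,
    "Q2_K": 0.31,
}


def weight_memory_gb(params_billions, level):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[level]


# Estimate the footprint of an 8B-parameter model at each level.
for level in BYTES_PER_PARAM:
    print(f"{level:7s} {weight_memory_gb(8, level):5.2f} GB")
```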

The Breakthrough: Extreme and Native 1-Bit Quantization

Breakthroughs in 2025 and 2026 have pushed quantization toward its limits, changing the economics of on-device AI:

1. Native 1-Bit Models (e.g., Bonsai 8B): Unlike PTQ models, which degrade at 1-bit, models like PrismML’s Bonsai 8B are trained natively from scratch as 1-bit networks. Bonsai delivers an 8.2B parameter model in just 1.15 GB of storage. Despite being 14x smaller than its 16-bit equivalent, it scores a 70.5 average on benchmarks, outperforming standard 8B models that are vastly larger.
2. TurboQuant (KV Cache Compression): Google’s TurboQuant algorithm compresses the Key-Value (KV) cache, the memory used to store conversation context, down to 3 bits, reportedly with zero accuracy loss. This allows agents to maintain massive context windows (up to 4M tokens) without exhausting device RAM.
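To see why KV cache compression matters, the sketch below estimates cache size from model shape and context length. It is back-of-the-envelope arithmetic, not an implementation of TurboQuant. The default shape (32 layers, 8 KV heads, head dimension 128) is an assumption matching a Llama-3-8B-style model, and the 3-bit figure ignores any per-block overhead a real compression scheme would add.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bits=16):
    """Approximate KV cache size in bytes for a given context length."""
    # Factor 2: one key vector and one value vector per layer, KV head, and token.
    return 2 * layers * kv_heads * head_dim * tokens * bits / 8


# Compare an uncompressed 16-bit cache with a 3-bit cache at two context lengths.
for tokens in (8_192, 131_072):
    fp16_gb = kv_cache_bytes(tokens, bits=16) / 1e9
    q3_gb = kv_cache_bytes(tokens, bits=3) / 1e9
    print(f"{tokens:>7} tokens: {fp16_gb:6.2f} GB at 16-bit, {q3_gb:6.2f} GB at 3-bit")
```

Under these assumptions each token costs 128 KiB at 16-bit precision, so long contexts can outgrow the weights themselves. Going from 16 bits to 3 bits shrinks the cache by a factor of about 5.3.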

Quick Reference: Model Size to Storage

The table below shows approximate storage needs across different quantization levels.

| Model Size | Q8 (8-bit) Storage | Q4 (4-bit) Storage | Native 1-bit Storage | Example Models |
| --- | --- | --- | --- | --- |
| 1.7B | ~1.8 GB | ~1.0 GB | ~0.25 GB | SmolLM-1.7B |
| 3B | ~3.2 GB | ~1.8 GB | ~0.45 GB | Phi-3-mini |
| 7B / 8B | ~8.0 GB | ~4.5 GB | 1.15 GB | Llama-3-8B, Bonsai 8B |
| 14B | ~15.0 GB | ~8.5 GB | ~2.0 GB | Qwen-14B |
| 70B | ~74.0 GB | ~40.0 GB | ~10.0 GB | Llama-3-70B |

Note: Storage is for model weights only. Active inference requires additional RAM for the KV cache.

Device Specifications and Multi-Agent Capacity

The primary constraint for on-device AI is unified memory (RAM). Extreme quantization, such as 1-bit models, dramatically increases the number of agents that can run concurrently on a single device.

Mobile Devices (iPhone / iPad)

| Device | RAM | Usable for AI | Max Q4 8B Agents | Max 1-bit 8B Agents | Recommended Setup |
| --- | --- | --- | --- | --- | --- |
| iPhone 16 (All) | 8 GB | ~4 GB | 0 | 2 to 3 | Bonsai 8B (1-bit) |
| iPhone 17 Pro | 12 GB | ~7 GB | 1 | 4 to 5 | Bonsai 8B + Qwen3-0.6B |
| iPad Pro M4 (Base) | 8 GB | ~5 GB | 1 | 3 to 4 | Bonsai 8B (1-bit) |
| iPad Pro M4/M5 (1TB+) | 16 GB | ~11 GB | 2 | 7 to 8 | Llama-3-8B (Q4) or Swarm of 1-bits |

Mac Computers

| Device | RAM Options | Usable for AI | Max Q4 8B Agents | Max 1-bit 8B Agents |
| --- | --- | --- | --- | --- |
| MacBook Air M4 | 16 GB, 24 GB, 32 GB | 10–24 GB | 2 to 5 | 8 to 18 |
| Mac Mini M4 | 24 GB, 32 GB | 17–24 GB | 3 to 5 | 13 to 18 |
| MacBook Pro M4 Pro | 24 GB, 48 GB | 17–38 GB | 3 to 8 | 13 to 28 |
| Mac Studio M4 Max | 36 GB, 64 GB, 128 GB | 28–105 GB | 6 to 23 | 20 to 75+ |

Usable memory assumes ~6–8 GB reserved for macOS and applications. Mobile devices reserve ~3–4 GB.
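Agent counts like those in the device tables can be approximated by dividing usable memory by the per-agent footprint. The sketch below is one way to do that, not the method used to produce the tables. The helper name `max_agents` and the 0.5 GB per-agent KV cache allowance are assumptions for illustration; the weight sizes (4.5 GB for a Q4 8B model, 1.15 GB for a 1-bit 8B model) come from the storage table.

```python
import math


def max_agents(usable_gb, weights_gb, kv_cache_gb=0.5):
    """Number of independent model copies that fit in usable memory.

    kv_cache_gb is an assumed per-agent allowance for context memory.
    """
    return math.floor(usable_gb / (weights_gb + kv_cache_gb))


# Usable memory taken from the mobile device table above.
for device, usable_gb in (("iPhone 16", 4), ("iPhone 17 Pro", 7)):
    q4 = max_agents(usable_gb, weights_gb=4.5)
    one_bit = max_agents(usable_gb, weights_gb=1.15)
    print(f"{device}: {q4} Q4 8B agents, {one_bit} 1-bit 8B agents")
```

With these assumptions, a device with 4 GB usable fits no Q4 8B agents and two 1-bit agents, in line with the lower end of the ranges in the table. Agents that share one set of loaded weights would need less memory than this estimate suggests.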

Orchestrating Multi-Agent Systems

When building systems with multiple specialized agents, adopt the “Router and Specialist” pattern to maximize efficiency:

1. Hot Router Model: A small, highly compressed model (e.g., a 1-bit 8B model like Bonsai) stays permanently loaded. It classifies incoming requests in under 300ms and handles general queries directly. At just 1.15 GB, it leaves plenty of headroom.
2. Cold Specialist Models: Specialized models (e.g., a Q8 medical reasoning model or a Q4 coding model) load on demand when the router dispatches a task. They are evicted when memory pressure requires it.

This pattern ensures memory is only consumed by high-precision models when actively performing complex work.
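The pattern can be sketched as a small memory-budgeted pool with least-recently-used eviction. This is a minimal simulation under stated assumptions, not a production orchestrator: the `route` function is a keyword check standing in for the hot router model, model loading is simulated, and the specialist names (`medical-q8`, `coder-q4`) and sizes are hypothetical.

```python
from collections import OrderedDict


class SpecialistPool:
    """Keeps cold specialists within a memory budget, evicting least recently used."""

    def __init__(self, budget_gb):
        self.budget_gb = budget_gb
        self.loaded = OrderedDict()  # name -> size in GB, least recently used first

    def acquire(self, name, size_gb):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return name
        if size_gb > self.budget_gb:
            raise MemoryError(f"{name} ({size_gb} GB) exceeds the pool budget")
        # Evict least recently used specialists until the new one fits.
        while sum(self.loaded.values()) + size_gb > self.budget_gb:
            self.loaded.popitem(last=False)
        self.loaded[name] = size_gb  # a real system would load weights here
        return name


# Hypothetical specialists and sizes, for illustration only.
SPECIALISTS = {"medical": ("medical-q8", 8.0), "coding": ("coder-q4", 4.5)}


def route(request):
    """Stand-in for the hot router model: a keyword check instead of an LLM call."""
    text = request.lower()
    if "diagnos" in text or "dosage" in text:
        return "medical"
    if "code" in text or "function" in text:
        return "coding"
    return "general"  # handled directly by the router


def handle(request, pool):
    """Dispatch a request to the router itself or to a specialist from the pool."""
    label = route(request)
    if label == "general":
        return "router"
    name, size_gb = SPECIALISTS[label]
    return pool.acquire(name, size_gb)


pool = SpecialistPool(budget_gb=10.0)
print(handle("What dosage is typical?", pool))
print(handle("Write a function to parse CSV", pool))
print(list(pool.loaded))
```

In this run the 10 GB budget cannot hold both specialists, so dispatching the coding request evicts the medical model. The router itself sits outside the pool because it stays permanently loaded.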

The Future: On-Device AI by 2030

As chip manufacturing advances to 2nm processes, the compute capabilities of mobile devices will grow substantially. Combined with breakthroughs in 1-bit quantization and TurboQuant cache compression, this is projected to make models as capable as today’s state-of-the-art centralized models fully runnable on a standard smartphone by 2030. This hardware evolution breaks the linear relationship between AI usage and cloud compute costs, enabling freemium business models and establishing on-device inference as a critical competitive advantage for clinical innovators.