This post details the architecture and implementation of a custom API gateway on a Mac Studio M2 Max (32GB RAM). The gateway serves three models, **Gemma 4 E4B (7.5B)**, **Qwen 2.5 (14B)**, and **Phi-4 (14B)**, all at Q8_0 quantization, from a single endpoint. Because the three cannot fit in memory together, it automatically swaps them in and out of VRAM on demand based on browser-specific routing.
1. The Challenge: 38.6 GB Memory Request vs. 32GB Physical Limit
Here are the VRAM/RAM footprints of the models loaded in our environment:
- Gemma 4 E4B Q8_0: ~8.2 GB
- Qwen 2.5 14B Q8_0: ~15.2 GB
- Phi-4 14B Q8_0: ~15.2 GB
The total allocation is ~38.6 GB, which exceeds the physical 32GB RAM of the system and the ~21.8 GB Metal VRAM allocation budget. Running all three servers concurrently triggers GPU out-of-memory errors or aggressive SSD swapping, reducing speed to less than 1 token/second.
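The arithmetic behind that limit can be checked in a few lines. This is only a sketch using the approximate figures quoted above; the dictionary keys are illustrative labels, not the identifiers the servers actually use.

```python
from itertools import combinations

# Approximate Q8_0 footprints in GB, taken from the list above.
FOOTPRINTS_GB = {"gemma-4-e4b": 8.2, "qwen2.5-14b": 15.2, "phi-4-14b": 15.2}
PHYSICAL_RAM_GB = 32.0
METAL_BUDGET_GB = 21.8  # approximate Metal allocation budget quoted above

total_gb = sum(FOOTPRINTS_GB.values())
largest_single_gb = max(FOOTPRINTS_GB.values())
# Cheapest way to keep two models resident at the same time.
smallest_pair_gb = min(a + b for a, b in combinations(FOOTPRINTS_GB.values(), 2))

print(f"all three: {total_gb:.1f} GB (RAM {PHYSICAL_RAM_GB} GB, Metal {METAL_BUDGET_GB} GB)")
print(f"largest single model fits in Metal budget: {largest_single_gb <= METAL_BUDGET_GB}")
print(f"smallest pair fits in Metal budget: {smallest_pair_gb <= METAL_BUDGET_GB}")
```

With these numbers, even the smallest pair of models exceeds the Metal budget, so keeping exactly one model resident at a time is the only arrangement that fits.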
Since we only type in one browser at any given moment, we realized we don't need these models loaded simultaneously. The optimal solution was to dynamically load the requested model into VRAM and immediately unload inactive ones when switching contexts.
2. The Architecture: Automatic Lifecycle Control via API Gateway
We built a lightweight Python API Gateway proxy running on port 11430, registered as a macOS LaunchAgent to run continuously in the background.
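As a rough illustration of the LaunchAgent registration, the job definition can be generated with Python's standard plistlib. The label and both file paths below are hypothetical placeholders, not the values used in our setup; Label, ProgramArguments, RunAtLoad, and KeepAlive are standard launchd keys.

```python
import plistlib


def build_launch_agent(label: str, program_args: list[str]) -> bytes:
    """Return an XML property list describing a always-on per-user launchd job."""
    job = {
        "Label": label,
        "ProgramArguments": program_args,
        "RunAtLoad": True,   # start when the agent is loaded (e.g. at login)
        "KeepAlive": True,   # restart the process if it exits
    }
    return plistlib.dumps(job)


# Hypothetical label and paths, for illustration only.
plist_bytes = build_launch_agent(
    "com.example.llm-gateway",
    ["/usr/bin/python3", "/Users/me/llm-gateway/gateway.py"],
)
print(plist_bytes.decode("utf-8"))
```

Per-user agents conventionally live in ~/Library/LaunchAgents/, so the output would be saved there under a name matching the label.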
When a browser's Page Assist extension sends a chat completion request, the gateway inspects the "model" parameter in the HTTP payload. It then invokes macOS's native service controller (launchctl) to swap the model daemons in the background:
The Switching Logic:
A request from Brave (targeting Gemma 4) arrives ➡️ The gateway calls launchctl unload on Qwen and Phi-4 services (freeing up to 30 GB VRAM instantly) ➡️ Calls launchctl load on the Gemma 4 service.
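The switching step might look roughly like the sketch below. It is not the gateway's actual code: the plist paths are hypothetical, and the model identifiers are assumptions about what each browser sends in the "model" field. The planning function only builds the launchctl commands, so the logic can be inspected without touching any running service.

```python
import json
import os
import subprocess

# Hypothetical mapping from the request's "model" field to a LaunchAgent plist.
SERVICES = {
    "model.gguf": "~/Library/LaunchAgents/com.example.llama.gemma4.plist",
    "Qwen 2.5": "~/Library/LaunchAgents/com.example.llama.qwen25.plist",
    "phi-4": "~/Library/LaunchAgents/com.example.llama.phi4.plist",
}


def requested_model(body: bytes) -> str:
    """Extract the "model" parameter from a chat completion request body."""
    return json.loads(body)["model"]


def plan_swap(target: str) -> list[list[str]]:
    """Unload every other model service first, then load the requested one."""
    if target not in SERVICES:
        raise KeyError(f"unknown model: {target}")
    # subprocess does not expand "~" without a shell, so expand it here.
    commands = [
        ["launchctl", "unload", os.path.expanduser(plist)]
        for name, plist in SERVICES.items()
        if name != target
    ]
    commands.append(["launchctl", "load", os.path.expanduser(SERVICES[target])])
    return commands


def apply_swap(target: str, run=subprocess.run) -> None:
    """Execute the plan; failures are ignored here because this sketch does not
    track which services are currently loaded (an assumption, not a recommendation)."""
    for command in plan_swap(target):
        run(command, check=False)


for command in plan_swap("model.gguf"):
    print(" ".join(command))
```

Unloading before loading matters: it releases the outgoing model's memory before the incoming model starts allocating its own.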
3. Smooth Warm-up with "/health" Polling
At startup, llama-server immediately opens its network port, but is not ready to serve requests until it completes loading the weights into memory. Forwarding the API call prematurely results in a 503 Loading model error.
To solve this, the gateway polls the target server's /health endpoint every 200ms. It intercepts and holds the client's request until the endpoint returns {"status":"ok"} (HTTP 200).
This allows the ~10-second model loading phase (moving 15 GB from the fast internal SSD to VRAM) to resolve safely and transparently, resuming the chat stream seamlessly once the model is fully initialized.
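A minimal sketch of the polling loop, assuming the backend answers /health with HTTP 200 and {"status":"ok"} once it is ready, as described above. The 60-second overall timeout is our own assumption for the example (the loop needs some upper bound), and the function names are illustrative.

```python
import json
import time
import urllib.error
import urllib.request


def probe_health(base_url: str, timeout: float = 1.0) -> bool:
    """Return True only when /health answers 200 with {"status": "ok"}."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            data = json.load(resp)
            return resp.status == 200 and isinstance(data, dict) and data.get("status") == "ok"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, 503 while loading, timeout, or a non-JSON body.
        return False


def wait_until_ready(probe, interval: float = 0.2, timeout: float = 60.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() every `interval` seconds until it succeeds or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example call (the backend port here is hypothetical):
# ready = wait_until_ready(lambda: probe_health("http://127.0.0.1:8081"))
```

Passing the probe in as a callable keeps the waiting logic separate from the HTTP details, which also makes it easy to exercise without a running server.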
4. "readline()" Streaming Optimization for Real-Time Flow
During initial testing, the text output in the browser arrived in large, stuttering chunks every few seconds. We traced this to the gateway's proxy loop, which read the upstream socket response in large blocks via response.read(8192).
Since Server-Sent Events (SSE) stream text line-by-line, we rewrote the reading loop to use response.readline(), flushing immediately with self.wfile.flush(). Each SSE line is now forwarded to the browser as soon as it arrives instead of waiting for a full 8192-byte block to accumulate. The text streams smoothly at 22-47 tokens/sec, which easily keeps up with the user's silent reading speed.
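The revised loop can be sketched as follows. This is a simplified stand-in rather than the gateway's handler: `upstream` and `downstream` are any binary file-like objects (in the gateway they would be the upstream response and self.wfile), and the demo payload is dummy data.

```python
import io


def relay_lines(upstream, downstream) -> int:
    """Copy the stream one line at a time, flushing after every line.

    Returns the number of lines forwarded.
    """
    forwarded = 0
    while True:
        line = upstream.readline()  # returns b"" at end of stream
        if not line:
            break
        downstream.write(line)
        downstream.flush()  # push this line to the client right away
        forwarded += 1
    return forwarded


# Demo with in-memory streams and dummy SSE-style events.
demo_upstream = io.BytesIO(b"data: first\n\ndata: second\n\ndata: [DONE]\n\n")
demo_downstream = io.BytesIO()
relay_lines(demo_upstream, demo_downstream)
print(demo_downstream.getvalue().decode("utf-8"))
```

Blank lines are forwarded as well, since in SSE an empty line is what terminates each event.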
5. Results and Summary
This tailored, Docker-free local AI environment runs cleanly with maximum Apple Silicon hardware acceleration:
- Brave (Port 11430 / model.gguf): Dedicated to Gemma 4 Q8_0 for ultra-fast general queries (~47 tokens/s).
- Chrome (Port 11430 / Qwen 2.5): Dedicated to Qwen 2.5 14B Q8_0 for nuanced Japanese tasks (~25 tokens/s).
- Safari (Port 11430 / phi-4): Accessing Phi-4 14B Q8_0 via a standalone web page for programming and logic tasks (~22 tokens/s).