by raullenchai
The fastest local AI engine for Apple Silicon. 4.2x faster than Ollama, 0.08s cached TTFT, 100% tool calling. 17 tool parsers, prompt cache, reasoning separation, cloud routing. Drop-in OpenAI replacement. Works with Claude Code, Cursor, Aider.
```bash
# Add to your Claude Code skills
git clone https://github.com/raullenchai/Rapid-MLX
```

Last scanned: 5/6/2026
```json
{
  "issues": [],
  "status": "PASSED",
  "scannedAt": "2026-05-06T06:29:36.289Z",
  "semgrepRan": false,
  "npmAuditRan": true,
  "pipAuditRan": true
}
```

| Your Mac | Model | Speed | What works |
|---|---|---|---|
| 16 GB MacBook Air | Qwen3.5-4B | 147 tok/s | Chat, coding, tools |
| 24 GB MacBook Pro | Qwen3.5-9B | 101 tok/s | Great all-rounder |
| 32+ GB Mac Mini / Studio | 🆕 Gemma 4 12B | 64 tok/s | Vision-capable + tools |
| 32+ GB Mac Mini / Studio | GPT-OSS 20B | 119 tok/s | Harmony-native, 100% tools |
| 32+ GB Mac Mini / Studio | Qwen3.6-35B-A3B | 93 tok/s | 256 MoE experts, 262K context |
| 48+ GB Mac Mini / Studio | Qwen3.5-35B-A3B 8bit | 80 tok/s | Best balance of smart + fast |
| 96+ GB Mac Studio / Pro | Qwen3.5-122B | 57 tok/s¹ | Frontier-level intelligence |
| 128+ GB Mac Studio Ultra | DeepSeek V4 Flash 158B-A13B | 31-56 tok/s¹ | Day-0 frontier MoE, 1M context |
Single-user end-to-end throughput (B=1: one request at a time, 256 max output tokens, output_tokens / wall-clock time including first-token latency), median of 3 rounds. `chat_template_kwargs.enable_thinking=False` is passed where the engine honours it. Tested on an M3 Ultra 256 GB with rapid-mlx v0.6.83 (fused top-p sampler). ¹ Carried over from the 2026-04 bench; disk-constrained on this refresh.
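To reproduce the throughput column on your own machine, the arithmetic is simple. The sketch below is a hypothetical helper, not part of rapid-mlx: it assumes you have already timed each B=1 request yourself, recording output tokens and wall-clock seconds from sending the request to receiving the last token (so first-token latency is included), and it reports the median over rounds plus the speedup ratio used in the comparisons.

```python
from statistics import median

# Hypothetical benchmarking helpers (not part of rapid-mlx).


def end_to_end_tok_per_s(output_tokens: int, wall_clock_s: float) -> float:
    """Throughput for one B=1 request: output tokens / total wall-clock time.

    wall_clock_s must cover the whole request, including time to first token.
    """
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    return output_tokens / wall_clock_s


def median_throughput(rounds: list[tuple[int, float]]) -> float:
    """Median tok/s over (output_tokens, wall_clock_s) rounds, e.g. 3 rounds."""
    return median(end_to_end_tok_per_s(n, t) for n, t in rounds)


def speedup(baseline_tok_s: float, candidate_tok_s: float) -> float:
    """Ratio reported as 'Nx faster': candidate throughput / baseline throughput."""
    return candidate_tok_s / baseline_tok_s


if __name__ == "__main__":
    # Made-up timings for three rounds of a 256-token completion.
    rounds = [(256, 1.80), (256, 1.75), (256, 1.90)]
    print(f"{median_throughput(rounds):.1f} tok/s")
```

With those made-up timings the script prints `142.2 tok/s` (the median of 256/1.80, 256/1.75 and 256/1.90). Taking the median rather than the mean keeps a single outlier round from skewing the result.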
qwen3:Nb) Rapid-MLX leads 1.7–2.4x. The Gemma 4 row is at parity with Ollama's Gemma 3 (different architectures, 1.0x). Against mlx-lm serve (same MLX weights), Rapid-MLX is 1.2–1.5x faster. Full caveats in Benchmarks.

Step 1 — Install (pick one):
```bash
# uv (recommended: one command, isolated env, auto-manages Python)
uv tool install rapid-mlx@latest
# Don't have uv yet? Install it first: curl -LsSf https://astral.sh/uv/install.sh | sh

# Or one-liner with auto-setup (installs Python if needed)
curl -fsSL https://raullenchai.github.io/Rapid-MLX/install.sh | bash

# Homebrew (Mac-native; needs tap + trust before install on Homebrew 4.x)
brew tap raullenchai/rapid-mlx
brew trust raullenchai/rapid-mlx
brew install rapid-mlx

# pip (requires Python 3.10+; macOS ships 3.9, so install Python first if needed)
pip install rapid-mlx
```
Upgrade later: uv tool upgrade rapid-mlx / brew upgrade rapid-mlx / pip install -U rapid-mlx.
Vision/multimodal models (Gemma 4, Qwen-VL, etc.) need extras: `pip install 'rapid-mlx[vision]'`. The text-only install is ~460 MB; vision adds ~322 MB. See Optional Extras for the full list.

"No matching distribution" error? Your Python is too old. Run `python3 --version`; if it says 3.9, install a newer Python with `brew install python@3.12`, then run `python3.12 -m pip install rapid-mlx`.
Refusing to load formula ... from untrusted tap? Homebrew 4.x requires third-party taps to be explicitly trusted before install. The `brew trust raullenchai/rapid-mlx` line above is what marks the tap as trusted; without it, the install is refused even after `brew tap`. Trust is per-machine and persists across upgrades.
`Tapping homebrew/core` / `Operation not permitted` during `brew install`? Brew 5.x's install sandbox can't auto-tap `homebrew/core` mid-install. Pre-tap it once, then retry:

```bash
brew tap homebrew/core --force   # ~1.3 GB, one-time
brew tap raullenchai/rapid-mlx
brew trust raullenchai/rapid-mlx
brew install rapid-mlx
```
Step 2 — Talk to a model right now (one command, no second terminal):
```bash
rapid-mlx chat
```

Defaults to `qwen3.5-4b-4bit`. The first run downloads the model (~2.5 GB) and shows a progress bar, then drops you into a REPL once it's ready. Type `/help` for slash commands and `/exit` to quit. Pass `--think` to surface chain-of-thought.
Step 2b — Or serve a model for use from other apps:
```bash
rapid-mlx serve qwen3.5-4b-4bit
```

Same model, same download, but this starts an OpenAI-compatible HTTP server instead of a REPL. Wait for `Ready: http://localhost:8000/v1`.
Want vision? Run `pip install 'rapid-mlx[vision]'`, then `rapid-mlx serve gemma-4-26b-4bit` (~14 GB).
Step 3 — Hit the API (from a second terminal tab):
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Say hello"}]}'
```
That's it — you now have an OpenAI-compatible AI server on localhost:8000. Point any app at http://localhost:8000/v1 and it just works.
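The same call works from Python's standard library if you'd rather not install the OpenAI SDK. This is a minimal sketch, assuming the server from Step 2b is listening on the default `localhost:8000`; `build_chat_request`, `extract_reply` and `chat` are hypothetical helper names. Building the request and parsing the reply are kept apart from the network call so each piece can be checked without a running server.

```python
import json
import urllib.request

# Assumption: default address printed by `rapid-mlx serve`.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, model: str = "default",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST /chat/completions request as the curl example."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return response_json["choices"][0]["message"]["content"]


def chat(prompt: str) -> str:
    """Send the request and return the reply (needs the server to be running)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        return extract_reply(json.load(resp))
```

Once the server prints `Ready`, `chat("Say hello")` returns the assistant's text. Parsing relies only on the standard OpenAI chat-completion shape (`choices[0].message.content`), which is why any OpenAI-compatible client can stand in here.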
Step 4 — Share it publicly (optional — get a https:// URL anyone can hit):
```bash
rapid-mlx share qwen3.6-27b-8bit
```
This spawns the same local server as `rapid-mlx serve` and tunnels it through rapidserver.quicksilverpro.io over a WebSocket. Your terminal prints a public OpenAI-compatible endpoint plus a bearer key; point any chat UI or OpenAI SDK at it. Bearer auth, a locked-down CORS allowlist, and a default 120 RPM rate limit are enabled on the spawned child process, and closing the terminal tears the tunnel down.
The default chat surface is our hosted Big-AGI fork (tool calling, personas, voice; no signup). Any OpenAI-compatible client also works, e.g. `OPENAI_API_BASE_URL=<share-url>/v1 OPENAI_API_KEY=<bearer> open-webui serve`.
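Because the shared endpoint enforces bearer auth and, by default, 120 requests per minute, a script that calls it in a loop should send the key and pace itself. The sketch below is an illustrative client-side limiter, not something rapid-mlx ships: `SlidingWindowLimiter` and `auth_headers` are hypothetical names, the 120 RPM figure is the default quoted above, and the clock and sleep functions are injectable so the pacing logic can be checked without actually waiting.

```python
import time
from collections import deque
from typing import Callable

# Hypothetical client-side helpers for a `rapid-mlx share` endpoint.


def auth_headers(bearer_key: str) -> dict[str, str]:
    """Headers for an OpenAI-compatible endpoint protected by a bearer key."""
    return {
        "Authorization": f"Bearer {bearer_key}",
        "Content-Type": "application/json",
    }


class SlidingWindowLimiter:
    """Allow at most `max_calls` per `window_s` seconds (client-side pacing)."""

    def __init__(self, max_calls: int = 120, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock
        self._calls: deque[float] = deque()  # timestamps of recent calls

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0 if allowed now)."""
        now = self.clock()
        # Forget calls that have slid out of the window.
        while self._calls and now - self._calls[0] >= self.window_s:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.window_s - (now - self._calls[0])

    def record(self) -> None:
        """Mark that a request was just sent."""
        self._calls.append(self.clock())

    def acquire(self, sleep: Callable[[float], None] = time.sleep) -> None:
        """Block until a request is allowed, then record it."""
        delay = self.wait_time()
        while delay > 0:
            sleep(delay)
            delay = self.wait_time()
        self.record()
```

Call `limiter.acquire()` before each request to the share URL and send `auth_headers(key)` with it, where `key` is the bearer key that `rapid-mlx share` printed. Client-side pacing only avoids tripping the server's default limit; the server still enforces its own.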
Pick a 27B-class model or larger for a usable share experience. 4B is fine for local dev but too small for live chat (`rapid-mlx models` lists all aliases).
Want a Claude Code-like TUI? Rapid-MLX is the backend; pair it with an open-source agent CLI like OpenCode or Codex for the full slash-command / tool-use / multi-turn experience. Run `rapid-mlx agents opencode --setup` (or `codex --setup`) to wire it up automatically.
Tip: Run `rapid-mlx models` to see all available model aliases. For a smaller/faster model, try `rapid-mlx serve qwen3.5-9b-4bit` (~5 GB).
From source (for development):

```bash
git clone https://github.com/raullenchai/Rapid-MLX.git
cd Rapid-MLX && pip install -e .
```
Vision models (adds mlx-vlm + opencv + torch, ~322 MB extra):

```bash
pip install 'rapid-mlx[vision]'
```

Audio (TTS/STT via mlx-audio):

```bash
pip install 'rapid-mlx[audio]'
```
Not into the terminal? Rapid-MLX Desktop is a Mac app that bundles the same `rapid-mlx` engine inside a one-click GUI: drag to Applications, pick a model, chat. No Python, no `pip`, no `brew`. The CLI here is still the source of truth for serving and scripting; the desktop app is the friendlier on-ramp.
Try it with Python (make sure the server is running, then `pip install openai`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # any value works, no real key needed
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```
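For chat UIs you usually want tokens as they arrive. The OpenAI SDK supports this with `stream=True`, which yields chunks whose `choices[0].delta.content` holds the next piece of text (or `None`). This is a minimal sketch, assuming Rapid-MLX's OpenAI-compatible endpoint streams the same way; `iter_deltas` and `stream_reply` are hypothetical helpers, and `stream_reply` takes the client as an argument so it can be exercised with a stand-in object.

```python
from typing import Iterable, Iterator

# Hypothetical streaming helpers; pass in an `openai.OpenAI` client.


def iter_deltas(chunks: Iterable) -> Iterator[str]:
    """Yield the text pieces from a stream of chat-completion chunks."""
    for chunk in chunks:
        if not chunk.choices:  # e.g. usage-only chunks carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # role-only and final chunks carry None or ""
            yield piece


def stream_reply(client, prompt: str, model: str = "default") -> str:
    """Print the reply as it streams in and return the full text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for piece in iter_deltas(stream):
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)
```

Against the local server this would be called as `stream_reply(OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed"), "Say hello")`.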
| Harness | Type | Notes |
|---|---|---|
| Hermes Agent | Agent | 62 tools, multi-turn (test) |
| PydanticAI | Framework | Typed agents, structured output ([test](tests/integrations/te |
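The agent harnesses above rely on OpenAI-style tool calling: the request carries a `tools` list of JSON-schema function definitions, and the model answers with `tool_calls` whose `function.arguments` field is a JSON string. The sketch below shows the client side of that loop with a hypothetical `get_weather` tool. It works on plain dicts in the OpenAI response shape so it runs without a server, and it assumes Rapid-MLX's tool parsers emit that standard shape.

```python
import json
from typing import Any, Callable

# Hypothetical tool definition in the OpenAI `tools` format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub; a real tool would call a weather API


# Map tool names the model may request to local Python callables.
TOOLS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(message: dict) -> list[dict]:
    """Execute each tool call in an assistant message and build the
    `role: "tool"` messages to append before the next request."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"] or "{}")
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": str(fn(**args)),
        })
    return results
```

In a full loop you would pass `tools=[WEATHER_TOOL]` to `client.chat.completions.create`, append the assistant message and the returned tool messages to `messages`, and call the model again so it can answer using the tool output.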
Rapid-MLX is an open-source testing skill for AI coding assistants such as Claude Code, Codex CLI, and ChatGPT, built by raullenchai. It has 2,995 GitHub stars.
Rapid-MLX passed SkillsLLM's automated security scan (a dependency vulnerability audit plus prompt-injection heuristics) with no high-severity issues. You can read the full report in the Security Report section on this page.
Clone the repository with `git clone https://github.com/raullenchai/Rapid-MLX` and add it to your Claude Code skills directory (see the clone command at the top of this page).
Rapid-MLX is primarily written in Python. It is open-source under raullenchai on GitHub, so you can review or fork the full source.