The fastest local AI engine for Apple Silicon. 4.2x faster than Ollama, 0.08s cached TTFT, 100% tool calling. 17 tool parsers, prompt cache, reasoning separation, cloud routing. Drop-in OpenAI replacement. Works with Claude Code, Cursor, Aider.
# Add to your Claude Code skills
git clone https://github.com/raullenchai/Rapid-MLX
| Your Mac | Model | Speed (tokens/sec) | What works |
|:---|:---:|:---:|:---:|
| 16 GB MacBook Air | Qwen3.5-4B | 160 tok/s | Chat, coding, tools |
| 32+ GB Mac Mini / Studio | Nemotron-Nano 30B | 141 tok/s | 🆕 Fastest 30B, 100% tools |
| 32+ GB Mac Mini / Studio | Qwen3.6-35B | 95 tok/s | 256 experts, 262K context |
| 64 GB Mac Mini / Studio | Qwen3.5-35B | 83 tok/s | Best balance of smart + fast |
| 96+ GB Mac Studio / Pro | Qwen3.5-122B | 57 tok/s | Frontier-level intelligence |
| 128+ GB Mac Studio Ultra | 🆕 DeepSeek V4 Flash 158B-A13B | 31-56 tok/s | Day-0 frontier MoE, 1M context |
Step 1 — Install (pick one):
# Homebrew (recommended — just works, no Python version issues)
brew install raullenchai/rapid-mlx/rapid-mlx
# pip (requires Python 3.10+ — macOS ships 3.9, so install Python first if needed)
pip install rapid-mlx
# Or one-liner with auto-setup (installs Python if needed)
curl -fsSL https://raullenchai.github.io/Rapid-MLX/install.sh | bash
Vision/multimodal models (Gemma 4, Qwen-VL, etc.) need extras: `pip install 'rapid-mlx[vision]'`. Text-only install is ~460 MB; vision adds ~322 MB. See Optional Extras for the full list.
"No matching distribution" error? Your Python is too old. Run
python3 --version— if it says 3.9, install a newer Python:brew install python@3.12thenpython3.12 -m pip install rapid-mlx
Step 2 — Serve a model:
rapid-mlx serve qwen3.5-4b
First run downloads the model (~2.5 GB) — you'll see a progress bar. Wait for Ready: http://localhost:8000/v1.
Want vision?
`pip install 'rapid-mlx[vision]'`, then `rapid-mlx serve gemma-4-26b` (~14 GB).
Step 3 — Chat (open a second terminal tab):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"default","messages":[{"role":"user","content":"Say hello"}]}'
That's it — you now have an OpenAI-compatible AI server on localhost:8000. Point any app at http://localhost:8000/v1 and it just works.
Tip: Run `rapid-mlx models` to see all available model aliases. For a smaller/faster model, try `rapid-mlx serve qwen3.5-9b` (~5 GB).
From source (for development):
git clone https://github.com/raullenchai/Rapid-MLX.git
cd Rapid-MLX && pip install -e .
Vision models (adds torch + torchvision, ~2.5 GB extra):
pip install 'rapid-mlx[vision]'
Audio (TTS/STT via mlx-audio):
pip install 'rapid-mlx[audio]'
Try it with Python (make sure the server is running, then pip install openai):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed") # any value works, no real key needed
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
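Tool calling goes through the same OpenAI-compatible endpoint. A minimal sketch (the get_weather tool and its schema are hypothetical, added here only for illustration; the tools parameter uses the standard OpenAI function-calling format):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hypothetical tool definition in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

# If the model decided to call the tool, the parsed call shows up here.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)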
| Harness | Type | Notes |
|---------|------|-------|
| Hermes Agent | Agent | 62 tools, multi-turn (test) |
| PydanticAI | Framework | Typed agents, structured output (test) |
| LangChain | Framework | ChatOpenAI, tools, streaming (test) |
| smolagents | Framework | CodeAgent + ToolCallingAgent (test) |
| OpenClaude (Anthropic SDK) | Agent | CLAUDE_CODE_USE_OPENAI=1 (test) |
| Aider | Agent | CLI edit-and-commit, architect mode (test) |
| Goose | Agent | Ollama provider via OLLAMA_HOST |
| Claw Code | Agent | OpenAI & Anthropic endpoints |
| Client | Status | Setup |
|--------|--------|-------|
| Cursor | Compatible | Settings → OpenAI Base URL |
| Continue.dev | Compatible | VS Code / JetBrains extension |
| LibreChat | Tested | Docker (test) |
| Open WebUI | Tested | Docker (test) |
| Any OpenAI-compatible app | Compatible | Point at http://localhost:8000/v1 |
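For the frameworks above, setup is usually just a base URL swap. A minimal LangChain sketch (assumes the langchain-openai package is installed; the model name default matches the server started earlier):
from langchain_openai import ChatOpenAI

# Point LangChain's standard OpenAI chat model at the local server.
llm = ChatOpenAI(
    model="default",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # any value works, no real key needed
)

print(llm.invoke("Say hello").content)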
MHI measures how well a model works with a specific agent harness. It combines three dimensions:
| Dimension | Weight | What it measures | Source |
|---|---|---|---|
| Tool Calling | 50% | Can the model+harness execute function calls correctly? | rapid-mlx agents --test |
| HumanEval | 30% | Can the model generate correct code? | HumanEval (10 tasks) |
| MMLU | 20% | Does the harness degrade base knowledge? | tinyMMLU (10 tasks) |
MHI = 0.50 × ToolCalling + 0.30 × HumanEval + 0.20 × MMLU (scale 0-100)
| Model | Best MHI | Best Harness | Tool Calling |
|---|---|---|---|
| Qwopus 27B | 92 | All (Hermes, PydanticAI, LangChain, smolagents) | 100% |
| Qwen3.5 27B | 82 | Hermes / PydanticAI / LangChain | 100% |
| Llama 3.3 70B | 83 | smolagents (text-based) | 100% |
| Nemotron Nano 30B | 59 | PydanticAI / LangChain | 91-93% |
| Gemma 4 26B | 62 | Hermes / smolagents | 100% |
Run `rapid-mlx agents` to see all supported agents and `python3 scripts/mhi_eval.py` to compute MHI on your own setup.
| Model + Harness | Tool Calling | HumanEval | MMLU | MHI |
|---|---|---|---|---|
| Qwopus 27B + Hermes | 100% | 80% | 90% | 92 |
| Qwopus 27B + PydanticAI | 100% | 80% | 90% | 92 |
| Qwen3.5 27B + Hermes | 100% | 40% | 100% | 82 |
| Llama 3.3 70B + smolagents | 100% | 50% | 90% | 83 |
| DeepSeek-R1 32B + smolagents | 100% | 30% | 100% | 79 |
| Gemma 4 26B + Hermes | 100% | 0% | 60% | 62 |
| Nemotron Nano 30B + PydanticAI | 93% | 0% | 60% | 59 |
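As a worked example of the formula, here is the top row recomputed (numbers taken from the table above):
# Qwopus 27B + Hermes: Tool Calling 100, HumanEval 80, MMLU 90
mhi = 0.50 * 100 + 0.30 * 80 + 0.20 * 90
print(mhi)  # 50.0 + 24.0 + 18.0 = 92.0, matching the MHI column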
Quick setup for popular apps:
Cursor: Settings → Models → Add Model:
OpenAI API Base: http://localhost:8000/v1
API Key: not-needed
Model name: default (or qwen3.5-9b — either works)
Cursor's agent/composer mode uses tool calls automatically — Rapid-MLX handles them natively with Qwen3.5 models, no extra flags needed.
Claw Code:
export OPENAI_BASE_URL=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
claw --model "openai/default" prompt "summarize this repo"
OpenClaude:
CLAUDE_CODE_USE_OPENAI=1 OPENAI_BASE_URL=http://localhost:8000/v1 \
OPENAI_API_KEY=not-needed OPENAI_MODEL=default openclaude -p "hello"
Hermes Agent (~/.hermes/config.yaml):
model:
provider: "custom"
default: "default"
base_url: "http://localhost:8000/v1"
context_length: 32768
Goose:
GOOSE_PROVIDER=ollama OLLAMA_HOST=http://localhost:8000 \
GOOSE_MODEL=default goose run --text "hello"
Claude Code:
OPENAI_BASE_URL=http://localhost:8000/v1 claude