by waybarrios
OpenAI- and Anthropic-compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s on small models. Works with Claude Code.
# Add to your Claude Code skills
git clone https://github.com/waybarrios/vllm-mlx
Last scanned: 5/1/2026
{
"issues": [],
"status": "PASSED",
"scannedAt": "2026-05-01T06:39:51.820Z",
"semgrepRan": false,
"npmAuditRan": true,
"pipAuditRan": true
}
Read this in other languages: English · Español · Français · 中文
Continuous batching + OpenAI + Anthropic APIs in one server. Native Apple Silicon inference.
A vLLM-style inference server for Apple Silicon Macs. Unlike Ollama or mlx-lm used directly, it ships continuous batching, paged KV cache, prefix caching, and SSD-tiered cache, and exposes both OpenAI /v1/* and Anthropic /v1/messages from a single process. Run LLMs, vision models, audio, and embeddings on Metal with unified memory, no conversion step.
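To make the continuous-batching claim concrete, here is a toy step-count simulation. It is not vllm-mlx's scheduler: the function names and the one-token-per-step model are assumptions made purely for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)   # tokens still to generate, per queued request
    running = []               # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free batch slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [8, 2, 2, 2]  # hypothetical output lengths in tokens
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
```

With one long request and three short ones, the static schedule needs 10 steps and the continuous one needs 8, because short requests no longer wait for the long one to drain.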
pip install vllm-mlx
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching
OpenAI SDK:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(model="default", messages=[{"role": "user", "content": "Hi!"}])
print(r.choices[0].message.content)
Anthropic SDK / Claude Code:
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
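For callers that do not go through the Anthropic SDK, the request can be sketched with the standard library. The header names follow the public Anthropic Messages API; reusing "default" as the model name and "not-needed" as the key on this endpoint is an assumption carried over from the examples above, not something this README confirms.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same value as ANTHROPIC_BASE_URL above


def build_messages_request(prompt, max_tokens=256):
    """Build (but do not send) a POST to the Anthropic-style /v1/messages endpoint."""
    body = {
        "model": "default",  # assumption: same placeholder as the OpenAI example
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "x-api-key": "not-needed",
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


req = build_messages_request("Hi!")
# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```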
- /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/rerank, /v1/responses
- /v1/messages (streaming, tool use, system prompts)
- response_format (lm-format-enforcer)
- --ssd-cache-dir (SSD-tiered cache)
- --warm-prompts for 1.3-2.25x TTFT
- audio_url content blocks
- --reasoning-parser
- --moe-top-k for +7-16% on Qwen3-30B-A3B
- --mtp for Qwen3-Next
- --spec-prefill for TTFT reduction
- /metrics endpoint with --metrics
- vllm-mlx bench-serve for prompt sweeps with CSV/JSON output

LLM decode (M4 Max, 128 GB, greedy, single stream):
| Model | Tok/s | Memory |
|-------|------:|-------:|
| Qwen3-0.6B-8bit | 417.9 | 0.7 GB |
| Llama-3.2-3B-Instruct-4bit | 205.6 | 1.8 GB |
| Qwen3-30B-A3B-4bit | 127.7 | ~18 GB |
Audio speech-to-text (M4 Max, RTF = real-time factor):
| Model | RTF | Use case |
|-------|----:|----------|
| whisper-tiny | 197x | Real-time / low latency |
| whisper-large-v3-turbo | 55x | Quality + speed |
| whisper-large-v3 | 24x | Highest accuracy |
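As a quick reading aid for the RTF column, the arithmetic below converts each figure into wall-clock time for one hour of audio. It assumes "197x" means audio is processed 197 times faster than real time, and the helper name is hypothetical.

```python
# Speed-up factors copied from the table above (M4 Max).
RTF_SPEEDUP = {
    "whisper-tiny": 197,
    "whisper-large-v3-turbo": 55,
    "whisper-large-v3": 24,
}


def transcription_seconds(audio_seconds, speedup):
    """Wall-clock seconds to transcribe audio at the given real-time multiple."""
    return audio_seconds / speedup


for model, speedup in RTF_SPEEDUP.items():
    secs = transcription_seconds(3600, speedup)
    print(f"{model}: {secs:.1f} s per hour of audio")
```

Under that reading, an hour of audio takes roughly 18 seconds with whisper-tiny and 150 seconds with whisper-large-v3.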
See docs/benchmarks/ for continuous-batching results, KV-cache quantization (4-bit / 8-bit / fp16), and MoE top-k sweeps.
vllm-mlx serve mlx-community/Qwen3-8B-4bit --port 8000
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3
r = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "What is 17 * 23?"}],
)
print("Thinking:", r.choices[0].message.reasoning)
print("Answer:", r.choices[0].message.content)
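Conceptually, a reasoning parser separates the model's thinking from its final answer. The sketch below shows the idea for text that wraps its reasoning in `<think>...</think>` tags, as Qwen3 does; it is an illustration only, not the code behind `--reasoning-parser qwen3`, and `split_reasoning` is a hypothetical name.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw):
    """Return (reasoning, answer) from raw model text containing <think>...</think>."""
    match = THINK_BLOCK.search(raw)
    if match is None:
        # No thinking block: everything is the answer.
        return None, raw.strip()
    reasoning = match.group(1).strip()
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, answer


# Hypothetical raw output for the question above.
raw = "<think>17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391</think>\n391"
reasoning, answer = split_reasoning(raw)
print("Thinking:", reasoning)
print("Answer:", answer)
```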
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000
r = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
]}],
)
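For a local file rather than a remote URL, one common approach with OpenAI-style APIs is to inline the image as a base64 data URL. Whether vllm-mlx accepts data URLs is an assumption here, and `image_message` is a hypothetical helper.

```python
import base64


def image_message(prompt, image_bytes, mime="image/jpeg"):
    """Build a user message whose image part is an inline base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# Placeholder bytes keep the sketch runnable; in practice read a real file,
# e.g. open("cat.jpg", "rb").read().
msg = image_message("What is in this image?", b"\xff\xd8\xff")
# With a live server: client.chat.completions.create(model="default", messages=[msg])
```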
r = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "List 3 colors."}],
response_format={
"type": "json_schema",
"json_schema": {
"schema": {"type": "object", "properties": {"colors": {"type": "array", "items": {"type": "string"}}}}
},
},
)
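Even with schema-constrained decoding, it is worth parsing the reply defensively on the client. A minimal sketch, assuming the reply arrives as a JSON string in `r.choices[0].message.content`; the check is slightly stricter than the schema above, which does not mark `colors` as required.

```python
import json


def parse_colors(content):
    """Parse the JSON reply and check it has a list of strings under "colors"."""
    data = json.loads(content)
    colors = data.get("colors") if isinstance(data, dict) else None
    if not isinstance(colors, list) or not all(isinstance(c, str) for c in colors):
        raise ValueError(f"reply does not match the colors schema: {content!r}")
    return colors


# With a live server: parse_colors(r.choices[0].message.content)
sample = '{"colors": ["red", "green", "blue"]}'  # hypothetical reply
print(parse_colors(sample))
```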
Rerank (/v1/rerank):
curl http://localhost:8000/v1/rerank -H 'Content-Type: application/json' -d '{
"model": "default",
"query": "apple silicon inference",
"documents": ["MLX is Apples framework", "Metal kernels on M-series", "CUDA on NVIDIA"]
}'
The built-in MLX reranker forward path supports standard BERT/XLM-RoBERTa
sequence-classification weights with gelu, gelu_new/gelu_fast, relu, or
silu/swish hidden_act values. Other activations fail explicitly so custom
reranker architectures can add a dedicated adapter instead of silently using the
wrong activation.
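The fail-explicitly behaviour described above can be pictured as a lookup table that raises on unknown names. This is a scalar sketch, not the project's MLX code; mapping both gelu_new and gelu_fast to the tanh approximation is an assumption based on how those names are commonly defined.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def relu(x):
    return max(0.0, x)


def silu(x):
    """SiLU, also known as swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


ACTIVATIONS = {
    "gelu": gelu,
    "gelu_new": gelu_tanh,
    "gelu_fast": gelu_tanh,
    "relu": relu,
    "silu": silu,
    "swish": silu,
}


def get_activation(hidden_act):
    """Look up an activation by its config name; unknown names raise instead of falling back."""
    try:
        return ACTIVATIONS[hidden_act]
    except KeyError:
        raise ValueError(f"unsupported hidden_act: {hidden_act!r}") from None
```

Raising on an unknown name is the point of the design: a silent fallback to gelu would load the weights and return plausible but wrong scores.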
vllm-mlx serve <llm-model> --embedding-model mlx-community/all-MiniLM-L6-v2-4bit
emb = client.embeddings.create(model="mlx-community/all-MiniLM-L6-v2-4bit", input=["Hello", "World"])
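A typical next step is comparing the returned vectors. The sketch below computes cosine similarity in plain Python; with the OpenAI SDK the vectors are available as `emb.data[i].embedding`. The toy vectors stand in for real embeddings so the example runs without a server.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# With a live server:
#   cosine_similarity(emb.data[0].embedding, emb.data[1].embedding)
print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 4))
```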
pip install 'vllm-mlx[audio]' # quotes keep zsh from globbing the brackets
brew install espeak-ng # macOS, needed for non-English TTS
python examples/tts_example.py "Hello, how are you?" --play
python examples/tts_multilingual.py "Hola mundo" --lang es --play
vllm-mlx bench-serve --url http://localhost:8000 --concurrency 5 --prompts prompts.txt --output results.csv
# Product-style workload with quality checks and metrics deltas
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --output results.json
# Append workload rows into SQLite for longitudinal comparisons
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --format sqlite --output bench.db
# Inspect repo metadata, file sizes, config, and rough fit before downloading weights
vllm-mlx model inspect mlx-community/Llama-3.2-3B-Instruct-4bit
# Acquire with resumable Hugging Face transfer and write a local artifact manifest
vllm-mlx model acquire mlx-community/Llama-3.2-3B-Instruct-4bit --target-dir ./models/llama-3b-4bit
# Wrap mlx-lm conversion and record the exact recipe in the converted artifact
vllm-mlx model convert meta-llama/Llama-3.2-3B-Instruct --output ./models/llama-3b-mlx-q4 --quantize --q-bits 4 --q-group-size 64 --q-mode affine
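The rough-fit idea can be approximated by hand from the quantization settings. The sketch below is not how `vllm-mlx model inspect` computes its estimate. It assumes group-wise affine quantization stores one 16-bit scale and one 16-bit bias per group, that every parameter is quantized, and that Llama-3.2-3B has about 3.21e9 parameters. KV cache and activations are not counted.

```python
def quantized_weight_gb(n_params, q_bits=4, q_group_size=64, scale_bias_bits=32):
    """Rough weight size in GB (10^9 bytes) for group-wise affine quantization."""
    # Each group of q_group_size weights carries scale_bias_bits of extra metadata.
    bits_per_weight = q_bits + scale_bias_bits / q_group_size
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: about 3.21e9 parameters for Llama-3.2-3B, all of them quantized.
estimate = quantized_weight_gb(3.21e9)
print(f"{estimate:.2f} GB")
```

Under these assumptions the estimate comes out near the 1.8 GB reported for Llama-3.2-3B-Instruct-4bit in the decode table.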
vllm-mlx serve <model> --metrics
curl http://localhost:8000/metrics
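If the endpoint serves the Prometheus text exposition format (an assumption; the README does not state the format), the output can be read with a few lines of Python. The metric names below are hypothetical placeholders, not real vllm-mlx metrics.

```python
def parse_metrics(text):
    """Parse 'series value' lines into a dict, skipping blanks and # comments.

    Assumes no trailing timestamp field after the value.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; the rest is the series.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


sample = """\
# HELP example_requests_total Hypothetical counter, not a real vllm-mlx metric.
# TYPE example_requests_total counter
example_requests_total 42
example_tokens_total{phase="decode"} 1234
"""
metrics = parse_metrics(sample)
print(metrics)
```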
Using uv (recommended):
uv tool install vllm-mlx # CLI, system-wide
# or in a project
uv pip install vllm-mlx
Using pip:
pip install vllm-mlx
# Audio extras
pip install 'vllm-mlx[audio]'
brew install espeak-ng
python -m spacy download en_core_web_sm
From source:
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .
See Installation Guide for full options.
┌─────────────────────────────────────────────────────────────────────────┐
│ vllm-mlx Server │
│ OpenAI /v1/* · Anthropic /v1/messages · /v1/rerank · /metrics │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Continuous batching · Paged KV cache ·