# vllm-mlx

vLLM-like inference for Apple Silicon - GPU-accelerated Text, Image, Video & Audio on Mac.

OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code.
vllm-mlx brings native Apple Silicon GPU acceleration to vLLM by integrating the MLX model stack (mlx-lm, mlx-vlm, mlx-audio, mlx-embeddings), and exposes:

- /v1/messages endpoint for Claude Code and OpenCode
- /v1/embeddings endpoint with mlx-embeddings

Install using uv (recommended):
```bash
# Install as CLI tool (system-wide)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git

# Or install in a project/virtual environment
uv pip install git+https://github.com/waybarrios/vllm-mlx.git
```
Using pip:
```bash
# Install from GitHub
pip install git+https://github.com/waybarrios/vllm-mlx.git

# Or clone and install in development mode
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .
```
Start a server:

```bash
# Simple mode (single user, max throughput)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-secret-key
```
Then query it with the OpenAI SDK:

```python
from openai import OpenAI

# Without API key (local development)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# With API key (production)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
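Since the endpoint is OpenAI-compatible, token streaming should work through the standard SDK mechanism as well (a minimal sketch; `stream=True` is ordinary OpenAI SDK usage, not a vllm-mlx-specific flag):

```python
# Stream tokens as they are generated, reusing the client from above
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about Metal."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g. role headers) carry no text
        print(delta, end="", flush=True)
print()
```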
vllm-mlx exposes an Anthropic-compatible /v1/messages endpoint, so tools like Claude Code and OpenCode can connect directly.
```python
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

response = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.content[0].text)
```
To use with Claude Code:
```bash
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
```
See the Anthropic Messages API docs for streaming, tool calling, system messages, and token counting.
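For instance, streaming through the Anthropic SDK's standard helper looks like this (a sketch, assuming the endpoint implements streaming as those docs describe):

```python
# Standard Anthropic SDK streaming against the local /v1/messages endpoint
with client.messages.stream(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()
```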
Vision-language models:

```bash
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000
```

```python
response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
```
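For local files, the usual OpenAI-compatible pattern is a base64 data URL (a sketch; `photo.jpg` is a placeholder path, not a file shipped with the repo):

```python
import base64

# Encode a local image as a data URL, the standard OpenAI-compatible format
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```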
Audio (text-to-speech):

```bash
# Install audio dependencies
pip install "vllm-mlx[audio]"
python -m spacy download en_core_web_sm
brew install espeak-ng  # macOS, for non-English languages
```

```bash
# Text-to-Speech (English)
python examples/tts_example.py "Hello, how are you?" --play

# Text-to-Speech (Spanish)
python examples/tts_multilingual.py "Hola mundo" --lang es --play

# List available models and languages
python examples/tts_multilingual.py --list-models
python examples/tts_multilingual.py --list-languages
```
Supported TTS models:

| Model | Languages | Description |
|-------|-----------|-------------|
| Kokoro | EN, ES, FR, JA, ZH, IT, PT, HI | Fast, 82M params, 11 voices |
| Chatterbox | 15+ languages | Expressive, voice cloning |
| VibeVoice | EN | Realtime, low latency |
| VoxCPM | ZH, EN | High-quality Chinese/English |
Extract the thinking process from reasoning models like Qwen3 and DeepSeek-R1:
```bash
# Start server with reasoning parser
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3
```
```python
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 × 23?"}],
)

# Access reasoning separately from the answer
print("Thinking:", response.choices[0].message.reasoning)
print("Answer:", response.choices[0].message.content)
```
Supported Parsers:
| Parser | Models | Description |
|--------|--------|-------------|
| `qwen3` | Qwen3 series | Requires both `<think>` and `</think>` tags |
| `deepseek_r1` | DeepSeek-R1 | Handles implicit `<think>` tag |
Generate text embeddings for semantic search, RAG, and similarity:
```bash
# Start server with an embedding model pre-loaded
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --embedding-model mlx-community/all-MiniLM-L6-v2-4bit
```
```python
# Generate embeddings using the OpenAI SDK
embeddings = client.embeddings.create(
    model="mlx-community/all-MiniLM-L6-v2-4bit",
    input=["Hello world", "How are you?"],
)
print(f"Dimensions: {len(embeddings.data[0].embedding)}")
```
See the Embeddings Guide for details on supported models and lazy loading.
For full documentation, see the docs directory:

- Getting Started
- User Guides
- Reference
- Benchmarks
Architecture:

```
┌─────────────────────────────────────────────────────────────────────────┐
│                             vLLM API Layer                              │
│                      (OpenAI-compatible interface)                      │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                               MLXPlatform                               │
│                (vLLM platform plugin for Apple Silicon)                 │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
        ┌─────────────────┬──────────┴┬──────┬─────────────────┐
        ▼                 ▼           │      ▼                 ▼
┌───────────────┐ ┌───────────────┐   │  ┌───────────────┐ ┌───────────────┐
│    mlx-lm     │ │    mlx-vlm    │   │  │   mlx-audio   │ │mlx-embeddings │
│(LLM inference)│ │ (Vision+LLM)  │   │  │  (TTS + STT)  │ │ (Embeddings)  │
└───────────────┘ └───────────────┘   │  └───────────────┘ └───────────────┘
        │                 │           │          │                 │
        └─────────────────┴───────────┼──────────┴─────────────────┘
                                      ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                   MLX                                   │
│                  (Apple ML Framework - Metal kernels)                   │
└─────────────────────────────────────────────────────────────────────────┘
```
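Everything bottoms out in MLX's Metal kernels, so GPU execution can be sanity-checked directly with the mlx package (a quick standalone check, not part of vllm-mlx itself):

```python
import mlx.core as mx

# MLX dispatches to the Apple GPU via Metal by default
print(mx.metal.is_available())  # True on Apple Silicon

a = mx.random.normal((1024, 1024))
b = (a @ a).sum()
mx.eval(b)  # force the lazy computation to actually run on the GPU
print(b)
```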
LLM Performance (M4 Max, 128GB):
| Model | Speed | Memory |
|-------|-------|--------|
| Qwen3-0.6B-8bit | 402 tok/s | 0.7 GB |
| Llama-3.2-1B-4bit | 464 tok/s | 0.7 GB |
| Llama-3.2-3B-4bit | 200 tok/s | 1.8 GB |
Continuous Batching (5 concurrent requests):
| Model | Single | Batched | Speedup |
|-------|--------|---------|---------|
| Qwen3-0.6B-8bit | 328 tok/s | 1112 tok/s | 3.4x |
| Llama-3.2-1B-4bit | 299 tok/s | 613 tok/s | 2.0x |
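The batched figures come from 5 concurrent requests; a sketch of how to exercise this against a server started with --continuous-batching, reusing the OpenAI client from the quick start (the prompt and request count here are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor

# Five concurrent requests let the scheduler batch decode steps together
def ask(i: int) -> str:
    r = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": f"Tell me fact #{i} about the Moon."}],
    )
    return r.choices[0].message.content

with ThreadPoolExecutor(max_workers=5) as pool:
    for answer in pool.map(ask, range(5)):
        print(answer[:80])
```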
Audio - Speech-to-Text (M4 Max, 128GB):
| Model | RTF* | Use Case |
|-------|------|----------|
| whisper-tiny | **