OpenAI- and Anthropic-compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code.
# Add to your Claude Code skills
git clone https://github.com/waybarrios/vllm-mlx

vLLM-like inference for Apple Silicon - GPU-accelerated Text, Image, Video & Audio on Mac
vllm-mlx brings native Apple Silicon GPU acceleration to vLLM by integrating:
- /v1/messages endpoint for Claude Code and OpenCode
- /v1/embeddings endpoint with mlx-embeddings
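For example, once a server is running (see the install and serve commands below), the /v1/messages route can be exercised with the Anthropic Python SDK. This is only a sketch: the base URL, placeholder API key, and model name are assumptions borrowed from the other examples in this README, not documented client setup.

```python
from anthropic import Anthropic

# Sketch: point the Anthropic SDK at the local vllm-mlx server, assuming it
# exposes the standard Anthropic Messages shape at /v1/messages.
client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

message = client.messages.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello from Apple Silicon!"}],
)
print(message.content[0].text)
```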
Using uv (recommended):

# Install as CLI tool (system-wide)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git
# Or install in a project/virtual environment
uv pip install git+https://github.com/waybarrios/vllm-mlx.git
Using pip:
# Install from GitHub
pip install git+https://github.com/waybarrios/vllm-mlx.git
# Or clone and install in development mode
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .
# Simple mode (single user, max throughput)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000
# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching
# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-secret-key
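Once a server is up, one quick smoke test is listing models over the OpenAI-compatible REST surface. The /v1/models route and the Bearer-token header are the usual conventions for OpenAI-compatible servers and are assumed here rather than taken from this project's docs:

```python
import requests

# Sketch: query the conventional OpenAI-compatible model listing.
# Drop the Authorization header if the server was started without --api-key.
resp = requests.get(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer your-secret-key"},
)
resp.raise_for_status()
print(resp.json())
```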
from openai import OpenAI
# Without API key (local development)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder value; no --api-key set
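From there, requests follow the standard OpenAI client pattern. A minimal sketch against the model served above (the prompt is illustrative only):

```python
# Sketch of a chat completion against the local server; the model name matches
# the serve commands above.
response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Summarize what MLX is in one sentence."}],
)
print(response.choices[0].message.content)
```

If the server was started with --api-key, pass that same key as api_key when constructing the client instead of the placeholder.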