by jjang-ai
vMLX - JANGTQ Uber Compressed MLX Models - L2 Disk Cache (survives restart) + L1 Paged (super fast ttft) + Hybrid SSM Scheduler + Cont Batching + etc!
# Add to your Claude Code skills
git clone https://github.com/jjang-ai/vmlx

JANG 2-bit destroys MLX 4-bit on MiniMax M2.5:

| Quantization | MMLU (200q) | Size |
|---|---|---|
| JANG_2L (2-bit) | 74% | 89 GB |
| MLX 4-bit | 26.5% | 120 GB |
| MLX 3-bit | 24.5% | 93 GB |
| MLX 2-bit | 25% | 68 GB |

Adaptive mixed-precision keeps critical layers at higher precision. Scores at jangq.ai. Models at JANGQ-AI.
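
For context when reading the table: MMLU is a four-option multiple-choice benchmark, so random guessing scores about 25%. The sketch below is a reader-side illustration, not part of vMLX. It uses only the numbers from the table, and the helper names are hypothetical.

```python
import math

# Scores and sizes copied from the table above; nothing here is newly measured.
RESULTS = {
    "JANG_2L (2-bit)": (74.0, 89),
    "MLX 4-bit": (26.5, 120),
    "MLX 3-bit": (24.5, 93),
    "MLX 2-bit": (25.0, 68),
}

CHANCE_PCT = 25.0  # four answer options per MMLU question


def margin_over_chance(score_pct: float) -> float:
    """Percentage points above the random-guess baseline."""
    return score_pct - CHANCE_PCT


def chance_std_error_pts(num_questions: int) -> float:
    """Standard error, in percentage points, of a pure guesser's score."""
    p = CHANCE_PCT / 100
    return 100 * math.sqrt(p * (1 - p) / num_questions)


noise = chance_std_error_pts(200)
for name, (score, size_gb) in RESULTS.items():
    margin = margin_over_chance(score)
    verdict = "above chance" if margin > 2 * noise else "within noise of chance"
    print(f"{name:16s} {margin:+5.1f} pts, {size_gb:3d} GB: {verdict}")
```

With 200 questions the guessing baseline has a standard error of roughly 3 points, so on this sample the three MLX rows cannot be told apart from guessing, while JANG_2L sits far above it.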
Published on PyPI as vmlx -- install and run in one command:
```bash
# Recommended: uv (fast, no venv hassle)
brew install uv
uv tool install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or: pipx (isolates from system Python)
brew install pipx
pipx install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or: pip in a virtual environment
python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```
Note: On macOS 14+, bare `pip install` fails with "externally-managed-environment". Use `uv`, `pipx`, or a venv.
The vMLX inference server is now running at http://0.0.0.0:8000 with an OpenAI + Anthropic compatible API. Works with any model from mlx-community -- thousands of models ready to go.
Get MLX Studio -- a native macOS app with chat UI, model management, image generation, and developer tools. No terminal required. Just download the DMG and drag to Applications.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
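
If the full reply is needed as one string rather than printed incrementally, the streamed deltas can be accumulated. This is a minimal sketch: `collect_stream` and `_fake_chunk` are hypothetical helper names, and the demo uses stand-in objects shaped like the SDK's stream chunks so it runs without a server.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some servers send a final chunk with no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # role-only and final chunks carry None
            parts.append(delta)
    return "".join(parts)


def _fake_chunk(text):
    """Stand-in with the same attribute shape as a stream chunk (demo only)."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


demo = [
    _fake_chunk(None),
    _fake_chunk("Hel"),
    _fake_chunk("lo!"),
    SimpleNamespace(choices=[]),
]
print(collect_stream(demo))  # prints: Hello!
```

Against a running server, the `response` object from `client.chat.completions.create(..., stream=True)` would be passed in place of `demo`.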
```python
import anthropic

# The Anthropic SDK appends /v1/messages to base_url itself,
# so the base URL is given without a /v1 suffix.
client = anthropic.Anthropic(base_url="http://localhost:8000", api_key="not-needed")
message = client.messages.create(
    model="local",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
```
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
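
With `"stream": true`, an OpenAI-style endpoint replies with server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The sketch below assumes vMLX follows that convention (the source only says the API is OpenAI compatible) and shows how the text deltas could be pulled out of such lines. `iter_sse_deltas` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style 'data: ...' stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        for choice in event.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    'data: {"choices": [{"delta": {"content": " there"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_deltas(sample)))  # prints: Hi there
```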
vMLX runs any MLX model. Point it at a HuggingFace repo or local path and go.
| Type | Models |
|------|--------|
| Text LLMs | Qwen 2/2.5/3/3.5/3.6, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral, Mistral-Medium-3.5 (ministral3), Mistral-Small-4, Gemma 3/4, Phi-4, DeepSeek V2/V3/V4, GLM-4/5, MiniMax M2.5/M2.7, Nemotron, Laguna (poolside), ZAYA (CCA + MoE), Kimi K2.5/K2.6, StepFun, and any mlx-lm model |
| Vision LLMs | Qwen-VL, Qwen3.5-VL / Qwen3.6-VL, Pixtral, InternVL, LLaVA, Gemma 3n / 4-VL, Mistral-Medium-3.5 (PIXTRAL) |
| Multimodal Omni | Nemotron-3-Nano-Omni (text + image + audio + video) — Parakeet audio encoder + RADIO ViT vision tower; routed via OmniMultimodalDispatcher across /v1/chat/completions, /v1/messages, /v1/responses, /api/chat |
| MoE Models | Qwen 3.5/3.6 MoE (A3B/A10B), Mixtral, DeepSeek V2/V3/V4, MiniMax M2.5/M2.7, Llama 4, Laguna (256 routed experts top-8) |
| Hybrid SSM | Nemotron-H, Jamba, GatedDeltaNet (Mamba + Attention), Qwen3.5-A3B hybrid, Granite MoE Hybrid, LFM2 |
| Image Gen | Flux Schnell/Dev, Z-Image Turbo (via mflux) |
| Image Edit | Qwen Image Edit (via mflux) |
| Embeddings | Any mlx-lm compatible embedding model |
| Reranking | Cross-encoder reranking models |
| Audio | Kokoro TTS, Whisper STT (via mlx-audio) |

| Feature | Description |
|---------|-------------|
| Continuous Batching | Handle multiple concurrent requests efficiently |
| Prefix Cache | Reuse KV states for repeated prompts -- makes follow-up messages instant |
| Paged KV Cache | Block-based caching with content-addressable deduplication |
| KV Cache Quantization | Compress cached states to q4/q8 for 2-4x memory savings |
| Disk Cache (L2) | Persist prompt caches to SSD -- survives server restarts |
| Block Disk Cache | Per-block persistent cache paired with paged KV cache |
| Speculative Decoding | Small draft model proposes tokens for 20-90% speedup |
| Prompt Lookup Decoding | No draft model needed — reuses n-gram matches from the prompt/context. Best for structured or repetitive output (code, JSON, schemas). Enable with --enable-pld. |
| JIT Compilation | mx.compile Metal kernel fusion (experimental) |
| Hybrid SSM Support | Mamba/GatedDeltaNet layers handled correctly alongside attention |
| Distributed Compute | Pipeline parallelism across multiple Macs via Thunderbolt 5 / Ethernet / WiFi |
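
To make the Prompt Lookup Decoding row concrete, here is a minimal sketch of the general technique: take the last few tokens of the context, find an earlier occurrence of the same n-gram, and propose the tokens that followed it as a draft for the model to verify. This is not vMLX's implementation. The function name, `ngram_size`, and `max_draft` are illustrative assumptions.

```python
from typing import List, Sequence


def propose_draft(
    tokens: Sequence[int], ngram_size: int = 3, max_draft: int = 5
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the context."""
    if len(tokens) <= ngram_size:
        return []
    suffix = list(tokens[-ngram_size:])
    # Scan from the most recent candidate backwards, excluding the suffix itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == suffix:
            follow = start + ngram_size
            return list(tokens[follow:follow + max_draft])
    return []  # no earlier match: fall back to normal decoding


context = [1, 2, 3, 4, 5, 6, 1, 2, 3]
print(propose_draft(context, ngram_size=3, max_draft=3))  # prints: [4, 5, 6]
```

In a real decoder the drafted tokens are checked by the model in a single forward pass and only the agreeing prefix is kept, which is why repetitive output such as code or JSON benefits most.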
Run models too large for a single Mac across 2+ machines. Each Mac loads a subset of transformer layers and they communicate hidden states over the network.
```bash
# On worker Macs:
pip install vmlx
vmlx-worker --secret mysecret

# On coordinator Mac (runs the server):
vmlx serve JANGQ-AI/Qwen3.5-Coder-Rerank-397B-A27B-JANG_2L --distributed --cluster-secret mysecret
```
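
The idea of each Mac loading a subset of layers can be sketched as a partition of layer indices into contiguous ranges. The even split below is an assumption for illustration only: vMLX may weight the assignment by each node's capability, and `split_layers` is a hypothetical name.

```python
from typing import List, Tuple


def split_layers(num_layers: int, num_nodes: int) -> List[Tuple[int, int]]:
    """Return contiguous [start, end) layer ranges whose sizes differ by at most one."""
    if num_nodes < 1 or num_layers < num_nodes:
        raise ValueError("need at least one layer per node")
    base, extra = divmod(num_layers, num_nodes)
    ranges = []
    start = 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)  # first `extra` nodes get one more
        ranges.append((start, start + size))
        start += size
    return ranges


print(split_layers(60, 3))  # prints: [(0, 20), (20, 40), (40, 60)]
print(split_layers(62, 3))  # prints: [(0, 21), (21, 42), (42, 62)]
```

Node 0 would run its range and pass the hidden state to node 1, and so on, which is the sequential flow pipeline parallelism relies on.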
| Feature | Description |
|---------|-------------|
| Pipeline Parallelism | Split layers across nodes -- hidden state (~8KB/step) flows sequentially |
| Auto-Discovery | Bonjour mDNS, UDP broadcast, HTTP probes, Tailscale, cached peers, manual IP |
| Capability-Scored Election | Most powerful Mac becomes coordinator automatically |
| Any Network Works | TB5 (120 Gbps), 10GbE, 1GbE, WiFi, Tailscale -- PP is not bandwidth-bound |
| JANG Support | Each worker loads its layer range from JANG safetensors (mmap) |
| Live Node List | Desktop app shows discovered nodes, link type, latency, layer assignments |
| Cluster API | /v1/cluster/status, /v1/cluster/nodes, /v1/cluster/scan REST endpoints |
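
The "~8KB/step" figure and the claim that pipeline parallelism is not bandwidth-bound can be checked with back-of-the-envelope arithmetic. The sketch assumes a hidden size of 4096 stored as float16, which gives exactly 8192 bytes per token. That hidden size is an example value, not taken from a specific model, and the function names are hypothetical.

```python
def hidden_state_bytes(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Size of one token's hidden state; 2 bytes per value corresponds to float16."""
    return hidden_size * bytes_per_value


def transfer_microseconds(num_bytes: int, link_gbps: float) -> float:
    """Serialization time only; ignores network latency and protocol overhead."""
    return num_bytes * 8 / (link_gbps * 1e9) * 1e6


# hidden_size=4096 is an assumed example value.
payload = hidden_state_bytes(4096)
print(payload)  # prints: 8192

# Link speeds from the table above, in Gbps.
for name, gbps in [("TB5", 120), ("10GbE", 10), ("1GbE", 1)]:
    print(f"{name}: {transfer_microseconds(payload, gbps):.2f} us per step")
```

Even on 1GbE the payload takes well under a tenth of a millisecond to serialize, so per-hop latency rather than bandwidth is the more likely cost on slower links such as WiFi.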
```text
Request -> Tokens
    |
L1: Memory-Aware Prefix Cache (or Paged Cache)
    | miss
L2: Disk Cache (or Block Disk Store)
    | miss
Inference -> float16 KV states
    |
KV Quantization -> q4/q8 for storage
    |
Store back into L1 + L2
```
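
The lookup order in the diagram can be sketched as a toy two-level cache: check memory, then disk, then compute and store back into both. This is a simplified illustration, not vMLX's code. It uses exact-match keys instead of longest-prefix matching, skips quantization and eviction, and all names are hypothetical.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Sequence, Tuple


class TwoLevelCache:
    """Toy L1 (dict in RAM) plus L2 (files on disk) cache keyed by token sequence."""

    def __init__(self, disk_dir: Path) -> None:
        self.l1: Dict[str, Any] = {}
        self.disk_dir = disk_dir
        self.disk_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def key(tokens: Sequence[int]) -> str:
        return hashlib.sha256(repr(list(tokens)).encode()).hexdigest()

    def get_or_compute(
        self, tokens: Sequence[int], compute: Callable[[Sequence[int]], Any]
    ) -> Tuple[Any, str]:
        k = self.key(tokens)
        if k in self.l1:
            return self.l1[k], "l1"
        path = self.disk_dir / f"{k}.pkl"
        if path.exists():
            value = pickle.loads(path.read_bytes())
            self.l1[k] = value  # promote the disk hit into memory
            return value, "l2"
        value = compute(tokens)  # stands in for running inference
        self.l1[k] = value
        path.write_bytes(pickle.dumps(value))  # store back into L2
        return value, "miss"


def demo() -> list:
    fake_prefill = lambda toks: [t * 2 for t in toks]  # stand-in for real KV states
    with tempfile.TemporaryDirectory() as tmp:
        cache = TwoLevelCache(Path(tmp))
        sources = [cache.get_or_compute([1, 2, 3], fake_prefill)[1] for _ in range(2)]
        restarted = TwoLevelCache(Path(tmp))  # simulated restart: empty L1, same disk
        sources.append(restarted.get_or_compute([1, 2, 3], fake_prefill)[1])
    return sources


print(demo())  # prints: ['miss', 'l1', 'l2']
```

The third lookup hitting `l2` is the "survives server restarts" behavior: the in-memory level is gone, but the disk level still holds the entry.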
Auto-detected tool-call parsers for every major model family:
qwen - llama - mistral - hermes - deepseek - glm47 - minimax - nemotron - granite - functionary - xlam - kimi - step3p5
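
As an illustration of what one such parser has to do, the sketch below handles the Hermes-style convention, where the model wraps a JSON object in `<tool_call>` tags. It is an assumption that vMLX's `hermes` parser targets this format, and the code is a simplified stand-in rather than the project's implementation. `parse_hermes_tool_calls` is a hypothetical name.

```python
import json
import re
from typing import Any, Dict, List

_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Extract {'name', 'arguments'} dicts from <tool_call>...</tool_call> blocks."""
    calls = []
    for match in _TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # ignore malformed blocks instead of failing the whole reply
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls


reply = (
    "Let me check.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
print(parse_hermes_tool_calls(reply))
```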
Auto-detected reasoning parsers that extract <think> blocks:
qwen3 (Qwen3, QwQ, MiniMax, StepFun) - deepseek_r1 (DeepSeek R1, Gemma 3, GLM, Phi-4) - openai_gptoss (GLM Flash, GPT-OSS)
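
Extracting `<think>` blocks amounts to separating the reasoning text from the final answer. The following is a minimal sketch of that idea for complete, non-streamed output. The real parsers presumably also handle streaming and model-specific variations, and `split_reasoning` is a hypothetical name.

```python
import re
from typing import Tuple

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from a completed model response."""
    reasoning = "\n".join(block.strip() for block in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 is 4.</think>The answer is 4.")
print(reasoning)  # prints: 2 + 2 is 4.
print(answer)     # prints: The answer is 4.
```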