# vMLX

by jjang-ai

vMLX -- Home of JANG_Q: continuous batching, prefix cache, paged KV cache, KV cache quantization, and vision-language support. Powers MLX Studio. Image generation/editing with an OpenAI/Anthropic-compatible API.
# Add to your Claude Code skills

```shell
git clone https://github.com/jjang-ai/vmlx
```

JANG 2-bit destroys MLX 4-bit on MiniMax M2.5:
| Quantization | MMLU (200q) | Size |
|---|---|---|
| JANG_2L (2-bit) | 74% | 89 GB |
| MLX 4-bit | 26.5% | 120 GB |
| MLX 3-bit | 24.5% | 93 GB |
| MLX 2-bit | 25% | 68 GB |
Adaptive mixed-precision keeps critical layers at higher precision. Scores at jangq.ai. Models at JANGQ-AI.
Published on PyPI as vmlx -- install and run in one command:
```shell
# Recommended: uv (fast, no venv hassle)
brew install uv
uv tool install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or: pipx (isolates from system Python)
brew install pipx
pipx install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or: pip in a virtual environment
python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```
Note: On macOS 14+, bare `pip install` fails with "externally-managed-environment". Use `uv`, `pipx`, or a venv.
Your local AI server is now running at http://0.0.0.0:8000 with an OpenAI + Anthropic compatible API. Works with any model from mlx-community -- thousands of models ready to go.
Get MLX Studio -- a native macOS app with chat UI, model management, image generation, and developer tools. No terminal required. Just download the DMG and drag to Applications.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
```python
import anthropic

client = anthropic.Anthropic(base_url="http://localhost:8000/v1", api_key="not-needed")

message = client.messages.create(
    model="local",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
```
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
vMLX runs any MLX model. Point it at a HuggingFace repo or local path and go.
| Type | Models |
|------|--------|
| Text LLMs | Qwen 2/2.5/3/3.5, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral, Gemma 3, Phi-4, DeepSeek, GLM-4, MiniMax, Nemotron, StepFun, and any mlx-lm model |
| Vision LLMs | Qwen-VL, Qwen3.5-VL, Pixtral, InternVL, LLaVA, Gemma 3n |
| MoE Models | Qwen 3.5 MoE (A3B/A10B), Mixtral, DeepSeek V2/V3, MiniMax M2.5, Llama 4 |
| Hybrid SSM | Nemotron-H, Jamba, GatedDeltaNet (Mamba + Attention) |
| Image Gen | Flux Schnell/Dev, Z-Image Turbo (via mflux) |
| Image Edit | Qwen Image Edit (via mflux) |
| Embeddings | Any mlx-lm compatible embedding model |
| Reranking | Cross-encoder reranking models |
| Audio | Kokoro TTS, Whisper STT (via mlx-audio) |
| Feature | Description |
|---------|-------------|
| Continuous Batching | Handle multiple concurrent requests efficiently |
| Prefix Cache | Reuse KV states for repeated prompts -- makes follow-up messages instant |
| Paged KV Cache | Block-based caching with content-addressable deduplication |
| KV Cache Quantization | Compress cached states to q4/q8 for 2-4x memory savings |
| Disk Cache (L2) | Persist prompt caches to SSD -- survives server restarts |
| Block Disk Cache | Per-block persistent cache paired with paged KV cache |
| Speculative Decoding | Small draft model proposes tokens for 20-90% speedup |
| Prompt Lookup Decoding | No draft model needed -- reuses n-gram matches from the prompt/context. Best for structured or repetitive output (code, JSON, schemas). Enable with `--enable-pld`. |
| JIT Compilation | mx.compile Metal kernel fusion (experimental) |
| Hybrid SSM Support | Mamba/GatedDeltaNet layers handled correctly alongside attention |
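The idea behind prompt lookup decoding can be sketched in a few lines. The snippet below is an illustrative toy, not vMLX's actual implementation: it finds the longest recent n-gram that also occurred earlier in the sequence and proposes the tokens that followed that earlier occurrence as draft candidates.

```python
def prompt_lookup_draft(tokens, max_ngram=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram of `tokens`
    against earlier occurrences in the same sequence (toy sketch)."""
    for n in range(max_ngram, 0, -1):  # prefer longer matches
        if len(tokens) < n + 1:
            continue
        tail = tokens[-n:]
        # Scan earlier positions, most recent first.
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == tail:
                # Propose the tokens that followed the earlier match.
                return tokens[i + n:i + n + num_draft]
    return []  # no match: fall back to normal decoding

# Repetitive context -> the draft reuses the earlier continuation.
seq = [1, 2, 3, 4, 5, 1, 2, 3]
print(prompt_lookup_draft(seq))  # -> [4, 5, 1, 2, 3]
```

The proposed tokens are then verified in a single forward pass, which is why the speedup is largest on repetitive output where matches are long and usually correct.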
```
Request -> Tokens
      |
L1: Memory-Aware Prefix Cache (or Paged Cache)
      | miss
L2: Disk Cache (or Block Disk Store)
      | miss
Inference -> float16 KV states
      |
KV Quantization -> q4/q8 for storage
      |
Store back into L1 + L2
```
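The q8 storage step can be illustrated with a simple per-tensor symmetric quantizer. This is a sketch of the idea only -- the real KV cache quantization runs on MLX arrays, typically with per-group scales:

```python
def quantize_q8(values):
    """Symmetric 8-bit quantization: int8 codes plus one float scale
    (illustrative sketch, not vMLX's kernel)."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_q8(codes, scale):
    return [c * scale for c in codes]

kv = [0.5, -1.27, 0.03, 1.27]
codes, scale = quantize_q8(kv)
restored = dequantize_q8(codes, scale)
# Each code is 1 byte vs 2 for float16 -> ~2x smaller in storage,
# at the cost of a bounded rounding error of at most scale/2 per value.
print(max(abs(a - b) for a, b in zip(kv, restored)) <= scale / 2)  # True
```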
Auto-detected tool-call parsers for every major model family:
qwen - llama - mistral - hermes - deepseek - glm47 - minimax - nemotron - granite - functionary - xlam - kimi - step3p5
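For reference, the hermes-style format wraps a JSON object in `<tool_call>` tags. A minimal parser for that one format might look like the sketch below; vMLX's actual parsers also handle streaming output and the other formats listed above:

```python
import json
import re

# Hermes-style: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_hermes_tool_calls(text):
    """Extract hermes-style tool calls from model output (toy sketch)."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

out = 'Sure. <tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
print(parse_hermes_tool_calls(out))
# -> [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```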
Auto-detected reasoning parsers that extract <think> blocks:
qwen3 (Qwen3, QwQ, MiniMax, StepFun) - deepseek_r1 (DeepSeek R1, Gemma 3, GLM, Phi-4) - openai_gptoss (GLM Flash, GPT-OSS)
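The core of these parsers is easy to sketch: split the response into reasoning and final answer at the `<think>` tags. This is a minimal illustration; the real parsers also handle streaming and models that emit variations of the tags:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer) from a <think>-style response
    (illustrative sketch, not vMLX's streaming parser)."""
    m = THINK_RE.search(text)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

raw = "<think>2+2 is 4.</think>The answer is 4."
print(split_reasoning(raw))  # -> ('2+2 is 4.', 'The answer is 4.')
```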
| Feature | Description |
|---------|-------------|
| Text-to-Speech | Kokoro TTS via mlx-audio -- multiple voices, streaming output |
| Speech-to-Text | Whisper STT via mlx-audio -- transcription and translation |
Generate and edit images locally with Flux models via mflux.
```shell
pip install vmlx[image]   # zsh users: quote it -- pip install "vmlx[image]"

# Image generation
vmlx serve schnell   # or dev, z-image-turbo
vmlx serve ~/.mlxstudio/models/image/flux1-schnell-4bit

# Image editing
vmlx serve qwen-image-edit   # instruction-based editing
```
```shell
curl http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "schnell",
    "prompt": "A cat astronaut floating in space with Earth in the background",
    "size": "1024x1024",
    "n": 1
  }'
```
```python
# Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.images.generate(
    model="schnell",
    prompt="A cat astronaut floating in space",
    size="1024x1024",
    n=1,
)
```
```shell
# Edit an image with a text prompt (Flux Kontext / Qwen Image Edit)
curl http://localhost:8000/v1/images/edits \
  -H "Content-Type: application/json" \
  -d '{
    "model": "flux-kontext",
    "prompt": "Change the background to a sunset",
    "image": "<base64-encoded-image>",
    "size": "1024x1024",
    "strength": 0.8
  }'
```
```python
# Python
import base64

import requests

with open("source.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = requests.post("http://localhost:8000/v1/images/edits", json={
    "model": "flux-kontext",
    "prompt": "Make the sky purple",
    "image": image_b64,
    "size": "1024x1024",
    "strength": 0.8,
})
```
Generation Models:
| Model | Steps | Speed | Memory |
|-------|-------|-------|--------|
| Flux Schnell | 4 | Fastest | ~6-24 GB |
| Z-Image Turbo | 4 | Fast | ~6-24 GB |
| Flux Dev | 20 | Slow | ~6-24 GB |
Editing Models:
| Model | Steps | Type | Memory |
|-------|-------|------|--------|