TurboLLM

Name: TurboLLM
Author: mohitsoni48

Run any local LLM engine, auto-tuned to your GPU — polished web UI + OpenAI/Anthropic-compatible API. Point Claude Code at your own machine in one command. No Electron, no Python, offline-first.

51 stars

7 forks

TypeScript

Installation

# Add to your Claude Code skills
git clone https://github.com/mohitsoni48/TurboLLM

npx turbollm

That one command starts a local daemon, opens a browser UI, and serves your models over an API any tool can talk to. TurboLLM is the performance & bleeding-edge layer for local LLMs — built for people who today hand-compile forks and hunt forums for the right flags.

Why TurboLLM
Speed: TurboLLM vs LM Studio
Features
Quick start
⭐ Bring any engine — the headline feature
Run Claude Code on your own GPU
Use it from any device on your network
Command-line reference
Configuration & data
Requirements
Privacy
How TurboLLM compares
Troubleshooting
Develop from source
License

Why TurboLLM

Local-LLM tools make two choices for you, and both cost you performance:

They pick the engine. LM Studio ships one blessed runtime; Ollama hides the engine entirely. The fastest community innovations — new quant formats, speculative decoding, low-bit KV cache — land in forks first, and you can't use them without compiling.
They don't tell you what speed to expect, and they don't tune the dozens of launch flags (-c, -ngl, --n-cpu-moe, KV type, threads, flash-attn, draft models) that make the difference between 20 and 80 tokens/sec on the same hardware.

TurboLLM does the opposite:

🔌 Any engine, including forks. Point it at any llama-server-compatible binary — a build you compiled, a community fork, or the one it auto-provisions for your GPU. It probes the binary's real capabilities and adapts the UI to them. This is the whole point.
⚡ Auto-tuned to your hardware. It benchmarks on load, derives fast defaults, and shows a VRAM-fit verdict before you load — no more flag guessing.
📊 Real tokens/sec, never faked. Speed in the model list is measured on your machine from actual generation — live while you chat, and remembered per model.
🪶 Lightweight. A ~0.3 MB npm package on Node — no Electron, no bundled Chromium, no Python. It downloads only the engine your GPU actually needs (Vulkan ≈ 38 MB).
🔌 Drop-in APIs. OpenAI and Anthropic-compatible — so Claude Code and every existing tool work unchanged.
🔀 A gateway that loads models for you. Name any model in your API request and TurboLLM loads it on demand, keeping your favorites hot in a small pool — so an agent that hops between models just works, with nothing to pre-wire.
🔒 Offline-first & private. No account, no backend, no internet, no telemetry.
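
As a rough illustration of the drop-in API point above, the sketch below builds a request in the public OpenAI Chat Completions format, which an OpenAI-compatible server is expected to accept. The base URL, port, and model name are placeholders chosen for illustration and are not taken from TurboLLM's documentation; check the app for the address it actually serves on.

```typescript
// Hypothetical base URL: the port TurboLLM listens on is not stated here.
const BASE_URL = "http://localhost:8080";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build an OpenAI-style chat completion request. The path and body fields
// follow the public OpenAI Chat Completions format.
function buildChatRequest(
  baseUrl: string,
  model: string,
  messages: ChatMessage[],
): ChatRequest {
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: `${baseUrl.replace(/\/+$/, "")}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages, stream: false }),
    },
  };
}

// Send one user prompt and return the first choice's text.
// "model" is whatever name the server knows the model by (placeholder here).
async function chat(model: string, prompt: string): Promise<string> {
  const { url, init } = buildChatRequest(BASE_URL, model, [
    { role: "user", content: prompt },
  ]);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = (await res.json()) as {
    choices: { message: { content: string } }[];
  };
  return data.choices[0].message.content;
}
```

Because the gateway is described as loading whichever model a request names, a client like this would only need to change the `model` string to switch models, assuming the name matches one the server can find.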

Speed: TurboLLM vs LM Studio

Same GPU (RTX 5070 Ti 16 GB), same model, same 200K context — measured generation speed. TurboLLM is faster than LM Studio on the very same official llama.cpp, and faster still when you run a community fork LM Studio can't.

① On official llama.cpp, TurboLLM is faster. It auto-provisions a GPU-native engine build (CUDA 13 for Blackwell here) and tunes expert-offload to the layer, so at the same KV-cache quant it beats LM Studio's bundled runtime:

| Qwen3.6-35B-A3B · 200K | TurboLLM | LM Studio | Speed-up |
|---|---|---|---|
| official llama.cpp, `q4_0` | 74.7 t/s | 61.0 t/s | 1.2× |
| official llama.cpp, `q8_0` | 72.3 t/s | ~66 t/s* | 1.1× |

② Run a faster engine and pull far ahead. Because TurboLLM runs any engine, you can drop in the TurboQuant fork — a llama.cpp fork with a low-bit turbo4 KV cache that LM Studio simply can't load — in one click. On a large-KV model it delivers q8_0-level quality at more than double the speed:

| Qwen3.6-27B · 200K · matched quality | TurboLLM + TurboQuant | LM Studio | Speed-up |
|---|---|---|---|
| `turbo4` vs `q8_0` | 24.6 t/s | 11.4 t/s | 2.2× |

Same run, 1.7× faster prefill too (1288 vs 757 tok/s).

*LM Studio's q8_0 mildly spilled VRAM at its best offload. A low-bit KV cache helps most when the cache is large; TurboLLM's auto-tuner and on-screen measured t/s pick the fastest engine + config for each model, so you don't have to.
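
The speed-up figures quoted above are plain ratios of the measured tokens/sec, rounded to one decimal place. The short sketch below reproduces that arithmetic from the numbers in this section; the function and variable names are illustrative, and the ~66 t/s figure is approximate in the source.

```typescript
// Ratio of two throughput figures, rounded to one decimal place.
function speedUp(fasterTps: number, slowerTps: number): number {
  if (slowerTps <= 0) throw new RangeError("slowerTps must be positive");
  return Math.round((fasterTps / slowerTps) * 10) / 10;
}

// Figures copied from the benchmark section above (tokens/sec).
const benchmarkRows = [
  { label: "35B-A3B, official llama.cpp, q4_0", turbo: 74.7, lmStudio: 61.0 },
  { label: "35B-A3B, official llama.cpp, q8_0", turbo: 72.3, lmStudio: 66 },
  { label: "27B, turbo4 vs q8_0 (generation)", turbo: 24.6, lmStudio: 11.4 },
  { label: "27B, turbo4 vs q8_0 (prefill)", turbo: 1288, lmStudio: 757 },
];

for (const row of benchmarkRows) {
  console.log(`${row.label}: ${speedUp(row.turbo, row.lmStudio).toFixed(1)}x`);
}
```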

Features

The headline feature, running any engine including community forks, has its own section below. Everything else is grouped here:

Use the folders you already have. Point TurboLLM at any directory of GGUFs — your existing LM Studio / Ollama / manual downloads — no re-downloading. It parses GGUF metadata (arch, params, quant, context, vision) for every file.
Browse & download from Hugging Face, in-app: search, see the file tree, pick a quant, and download with resume + SHA-256 verification. Gated models (Llama, Gemma) work via your own HF token, which never leaves your machine.
Import from any URL — not just Hugging Face. Paste a direct .gguf link (model-author sites, mirrors, private servers); it disk-space-checks and downloads through the same manager.
Quant recommendation per GPU and a VRAM-fit verdict so you pick a quant that actually fits before you commit.
Primary download folder, real-time measured t/s per model, and delete-from-disk.
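
To make the VRAM-fit idea concrete, here is a minimal back-of-the-envelope sketch of how KV-cache size scales with context length and cache type for a standard transformer with grouped-query attention. It is not TurboLLM's actual verdict logic, which is not described here. The bytes-per-element values follow the ggml block layouts (`q8_0` stores 32 values in 34 bytes, `q4_0` stores 32 values in 18 bytes); the model shape, weight size, and headroom are hypothetical.

```typescript
type KvType = "f16" | "q8_0" | "q4_0";

// Average bytes per cached element for each cache type.
const BYTES_PER_ELEMENT: Record<KvType, number> = {
  f16: 2,
  q8_0: 34 / 32, // 32 int8 values + one fp16 scale per block
  q4_0: 18 / 32, // 32 4-bit values + one fp16 scale per block
};

interface AttentionShape {
  layers: number;  // attention layers that keep a KV cache
  kvHeads: number; // key/value heads per layer
  headDim: number; // dimension of each head
}

// K and V each store layers * kvHeads * headDim values per token.
function kvCacheBytes(shape: AttentionShape, contextTokens: number, kvType: KvType): number {
  const elements = 2 * shape.layers * shape.kvHeads * shape.headDim * contextTokens;
  return elements * BYTES_PER_ELEMENT[kvType];
}

// Simplified verdict: weights + KV cache + fixed headroom must fit in VRAM.
function fitsInVram(
  weightsBytes: number,
  kvBytes: number,
  vramBytes: number,
  headroomBytes: number = 1024 * 1024 * 1024,
): boolean {
  return weightsBytes + kvBytes + headroomBytes <= vramBytes;
}

// Hypothetical model shape and sizes, for illustration only.
const GIB = 1024 * 1024 * 1024;
const exampleShape: AttentionShape = { layers: 48, kvHeads: 8, headDim: 128 };
for (const kvType of ["f16", "q8_0", "q4_0"] as KvType[]) {
  const kv = kvCacheBytes(exampleShape, 200_000, kvType);
  const verdict = fitsInVram(10 * GIB, kv, 16 * GIB) ? "fits" : "does not fit";
  console.log(`${kvType}: ${(kv / GIB).toFixed(1)} GiB KV cache, ${verdict} in 16 GiB`);
}
```

The estimate ignores compute buffers, partial CPU offload, and architectures that cache fewer layers, so treat it as an upper-level sanity check and not a replacement for a measured verdict.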

Auto-benchmark on load derives fast defaults for your exact GPU.
Recommended sampling from the model card — auto-tune reads the model's Hugging Face card (falling back to the original model behind a requant) and prefills the author's recommended temperature / top_k / top_p / min_p. No recommendation → your sampling is left untouched.
Real measured tokens/sec in the model list — live while generating, last-session when idle (never a synthetic estimate).
Full load-parameter UI, a superset of what other tools expose: context length, GPU offload (-ngl), MoE CPU-offload (--n-cpu-moe), parallel slots, KV-cache quant type (incl. low-bit on supporting forks), CPU threads, flash attention, and speculative decoding (NextN / MTP / draft).
Fast by default: flash attention on, NextN self-speculative decoding on for models that carry a draft head, threads auto — safely gated to what your engine actually accepts.
Multi-GPU, per model — split a model across cards (layer/row split + main-GPU pick on llama.cpp, tensor-parallel on vLLM). Defaults are no-ops, so single-GPU rigs are untouched.
Saved per-model profiles — tune once, and it loads that way every time.
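
The load parameters listed above correspond to command-line flags on a llama-server-compatible binary. The sketch below shows one way a saved profile could be turned into an argument list; the `LoadProfile` shape and `toLlamaServerArgs` helper are hypothetical and do not describe how TurboLLM builds its command line. The flags themselves (`-m`, `-c`, `-ngl`, `--n-cpu-moe`, `--cache-type-k`, `--cache-type-v`, `--threads`, `--parallel`) are llama-server options, though forks and older builds may not accept all of them. Flash attention and speculative decoding are left out because their flag syntax varies between builds.

```typescript
// Hypothetical per-model profile; field names are illustrative.
interface LoadProfile {
  modelPath: string;
  contextLength: number;  // -c
  gpuLayers: number;      // -ngl
  cpuMoeLayers?: number;  // --n-cpu-moe
  kvCacheType?: string;   // --cache-type-k / --cache-type-v
  threads?: number;       // --threads
  parallelSlots?: number; // --parallel
}

// Translate a profile into llama-server arguments, emitting optional
// flags only when the profile sets them.
function toLlamaServerArgs(p: LoadProfile): string[] {
  const args = [
    "-m", p.modelPath,
    "-c", String(p.contextLength),
    "-ngl", String(p.gpuLayers),
  ];
  if (p.cpuMoeLayers !== undefined) {
    args.push("--n-cpu-moe", String(p.cpuMoeLayers));
  }
  if (p.kvCacheType !== undefined) {
    // Use the same cache type for keys and values.
    args.push("--cache-type-k", p.kvCacheType, "--cache-type-v", p.kvCacheType);
  }
  if (p.threads !== undefined) {
    args.push("--threads", String(p.threads));
  }
  if (p.parallelSlots !== undefined) {
    args.push("--parallel", String(p.parallelSlots));
  }
  return args;
}

// Example profile with placeholder values.
const exampleProfile: LoadProfile = {
  modelPath: "/models/example.gguf",
  contextLength: 32768,
  gpuLayers: 99,
  cpuMoeLayers: 12,
  kvCacheType: "q8_0",
};
console.log(["llama-server", ...toLlamaServerArgs(exampleProfile)].join(" "));
```

Keeping the profile as data and deriving the flags from it is one way to get the "tune once, load the same way every time" behavior described above, since the profile can be serialized and reloaded unchanged.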

Streaming with a stop button, live tokens/sec, prompt-processing % and prefill t/s, time-to-first-token, total time, exact token counts, and a context-usage meter (filled / max) on every reply.
Thinking control — toggle reasoning off for a direct answer, or leave it on with collapsible, timed "thought for N s" blocks.
Markdown + syntax-highlighted code with one-click copy — plus inline Unicode charts the model draws when a comparison, trend, or hierarchy is genuinely worth a visual.
Personas — pick a style (Concise · Detailed · Blunt · Formal · Tutor · Creative · Default) per conversation, no prompt-wrangling required.
Edit, regenerate, delete, copy any message; persistent, searchable conversations with rename, delete, and auto-generated titles.
Per-chat system prompt and per-chat sampling overrides — temperature, top-p/k, min-p, repeat/presence/frequency penalties, and stop strings.
Image input for vision models, and TurboLLM Expert — a built-in assistant that knows the app and your hardware for onboarding and troubleshooting without leaving the UI.
Agentic tools — built-in web_search (Tavily), fetch_url, and sandboxed run_code,