# llama.cpp

by ggml-org

LLM inference in C/C++
## Hot topics

- Models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools
- The gpt-oss model with native MXFP4 format has been added (PR, in collaboration with NVIDIA)
- llama-server: #12898 (see documentation)

## Quick start

Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:

- Install llama.cpp using brew, nix or winget

Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more.
Example commands:

```sh
# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```
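Once `llama-server` is running, any OpenAI-compatible client can talk to it. Below is a minimal sketch using only the Python standard library, assuming the server is listening on its default port 8080 (the `model` field and the helper names here are illustrative; `llama-server` serves whichever model it was launched with):

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "default") -> dict:
    # Build an OpenAI-style chat-completions payload.
    # llama-server largely ignores the model field and uses the
    # model it was started with.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt: str, base_url: str = "http://localhost:8080") -> str:
    # POST to the OpenAI-compatible endpoint exposed by llama-server.
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Standard OpenAI response shape: first choice's message content.
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Say hello in one word."))
```

Because the endpoint follows the OpenAI API shape, existing OpenAI client libraries can also be pointed at the same base URL instead of hand-rolling requests.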
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
range of hardware - locally and in the cloud.
The llama.cpp project is the main playground for developing new features for the ggml library.
Typically, finetunes of the supported base models are supported as well.
Instructions for adding support for new models: HOWTO-add-model.md