by skye-harris
Home Assistant LLM integration for local OpenAI-compatible services (llamacpp, vllm, etc)
# Add to your Claude Code skills
git clone https://github.com/skye-harris/hass_local_openai_llmGuides for using ai agents skills like hass_local_openai_llm.
Last scanned: 5/30/2026
{
"issues": [],
"status": "PASSED",
"scannedAt": "2026-05-30T15:46:50.230Z",
"npmAuditRan": true,
"pipAuditRan": false
}No comments yet. Be the first to share your thoughts!
30 days in the Featured rail · terms & refunds
Allows use of generic OpenAI-compatible LLM services, such as (but not limited to):
This integration has been forked from Home Assistants OpenRouter integration, with the following changes:
<think> tags from responsesHave HACS installed, this will allow you to update easily.
Adding Tools for Assist to HACS can be using this button:
[!NOTE] If the button above doesn't work, add
https://github.com/skye-harris/hass_local_openai_llmas a custom repository of type Integration in HACS.
Local OpenAI LLM integration.local_openai folder from latest release to the
custom_components folder in your config directory.After installation, configure the integration through Home Assistant's UI:
Settings → Devices & Services.Add Integration.Local OpenAI LLM./v1 but may differ depending on your server configuration.chat_template_kwargs request parameterWhen the server type is set to DeepSeek Cloud, both conversation and AI task agents show a new DeepSeek Configuration section with a Reasoning Effort option. This option controls whether thinking is enabled, and what level of reasoning to perform on the request.
When enabled, thinking content returned by the model is also fed back into the conversation as reasoning content on supported Home Assistant versions (2026.4+).
When the server type is set to llama.cpp, both conversation and AI task agents show a llama.cpp Configuration section with the following options.
Passes enable_thinking=true via chat_template_kwargs to enable reasoning on supported models.
Note: This option completely overrides any existing enable_thinking option in your Chat Template Arguments.
Controls whether thinking/reasoning content from prior conversation turns is sent back in new completion requests.
Some reasoning models require this enabled, and others require it disabled. Check the documentation for your model if unsure.
thinking_content is passed as reasoning_content in the next request, allowing the model to see its own prior reasoning.Pins requests to a specific llama.cpp server slot for prompt-cache reuse. Leave empty to allow any slot to be used.
llama.cpp exposes the value supplied via its --alias flag on the model object. When an alias is set it is used as the model's display name; otherwise the raw model id (typically the full model file path) is used, with the path and .gguf
extension stripped for a cleaner name.
These options control how llama.cpp selects tokens during text generation. Please refer to the llama.cpp documentation for further information and usage.
| Parameter | Description | Range |
|---|---|---|
| Top-P | Restricts sampling to the top-p probability mass of tokens. | 0–1 |
| Min-P | Minimum probability threshold for nucleus sampling, providing additional control when combined with top-p. | 0–1 |
| Top-K | Limits sampling to the k highest-probability tokens. | 1–1000 |
| Repeat Penalty | Penalizes repeat sequences of tokens. | -2–2 |
| Presence Penalty | Penalizes tokens already present in the context. | -2–2 |
This integration supports injecting some dynamic content, presently the date and time, into the active Conversation Agent prompt when making a request. This was added as it is beneficial for the model to be grounded with this context in its role as an assistant, and was previously added to the system prompt by Home Assistant itself before later being removed due to negative effects on prompt caching and performance.
This was previously always-on but has been extracted as an experimental configuration option as this is not a once-size-fits-all for all models. To this end I have provided a number of options so that users can try them out and select the one that works best, or disable entirely if none work well, for their chosen model.
The available options are:
The date and time are inserted as a Tool Call Result message to the model, before the current user message.
As long as the model does not reject it, this is the recommended method to use and produces the most reliable results during testing.
The date and time are inserted as an additional Assistant message to the model, before the current user message.
In cases where the Tool Call Result role method does not work for a model, this is the next recommended to test with.
The date and time are inserted as an additional User message to the model, before the current user message.
Recommended only where neither the System nor Assistant injection methods work for the model, but may not produce desirable results.
Some models have been known to repeat the date/time back to the user without request.
If your model simply refuses to work well with any method, simply remove the value from the configuration option to disable this again.
Retrieval Augmented Generation is used to pre-feed yo