by mitulgarg
Diagnose and fix CUDA / GPU environment compatibility issues locally, in Docker, and in CI/CD. CLI + MCP server available.
```bash
# Add to your Claude Code skills
git clone https://github.com/mitulgarg/env-doctor
```

> "Why does my PyTorch crash with CUDA errors when I just installed it?"

Because your driver supports CUDA 11.8, but `pip install torch` gave you CUDA 12.4 wheels.
Env-Doctor diagnoses and fixes the #1 frustration in GPU computing: mismatched CUDA versions between your NVIDIA driver, system toolkit, cuDNN, and Python libraries.
It takes 5 seconds to find out if your environment is broken - and exactly how to fix it.
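The core rule behind that diagnosis can be sketched in a few lines. This is an illustrative sketch only, not env-doctor's actual API: the function names are made up, and real compatibility logic has more edge cases (minor-version forward compatibility, per-library quirks).

```python
# Illustrative sketch (not env-doctor's API): a CUDA runtime bundled in a
# wheel must not exceed the maximum CUDA version the installed NVIDIA
# driver supports.

def parse_version(v: str) -> tuple[int, int]:
    """'12.4' -> (12, 4)"""
    major, minor = v.split(".")[:2]
    return int(major), int(minor)

def wheel_compatible(driver_max_cuda: str, wheel_cuda: str) -> bool:
    """True if the wheel's CUDA runtime is within what the driver supports."""
    return parse_version(wheel_cuda) <= parse_version(driver_max_cuda)

# A driver that tops out at CUDA 11.8 cannot run CUDA 12.4 wheels:
print(wheel_compatible("11.8", "12.4"))  # False -> crash at import/runtime
print(wheel_compatible("12.4", "11.8"))  # True  -> driver is new enough
```

The asymmetry is the whole point: a newer driver can run older CUDA wheels, but not the reverse.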

| Feature | What It Does |
|---------|--------------|
| One-Command Diagnosis | Check compatibility: GPU Driver → CUDA Toolkit → cuDNN → PyTorch/TensorFlow/JAX |
| Compute Capability Check | Detect GPU architecture mismatches — catches why torch.cuda.is_available() returns False on new GPUs (e.g. Blackwell) even when driver and CUDA are healthy |
| Python Version Compatibility | Detect Python version conflicts with AI libraries and dependency cascade impacts |
| CUDA Auto-Installer | Execute CUDA Toolkit installation directly with --run; CI-friendly with --yes; preview with --dry-run |
| Safe Install Commands | Get the exact pip install command that works with YOUR driver |
| Extension Library Support | Install compilation packages (flash-attn, SageAttention, auto-gptq, apex, xformers) with CUDA version matching |
| AI Model Compatibility | Check if LLMs, Diffusion, or Audio models fit on your GPU before downloading |
| WSL2 Support | Validate GPU forwarding and detect driver conflicts in WSL2 environments for Windows users |
| Installation Audit | Find multiple installations, PATH issues, environment misconfigurations |
| Dockerfile Validation | Catch GPU config errors in Dockerfiles before you build |
| MCP Server | Expose diagnostics to AI assistants (Claude Desktop, Zed) via Model Context Protocol |
| CI/CD Integration | JSON output, proper exit codes, and CI-aware env-var persistence (GitHub Actions, GitLab CI, CircleCI, Azure Pipelines, Jenkins) |
| Fleet Dashboard | Web UI for monitoring multiple GPU machines: aggregate status, drill-down diagnostics, history timeline. Install with `pip install "env-doctor[dashboard]"` |
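The compute-capability check above catches a subtler failure than a version mismatch. A minimal sketch of the idea, with an illustrative supported-architecture set (real wheels may also ship PTX for forward compatibility, which this sketch ignores):

```python
# Illustrative: a wheel compiled for a fixed set of compute capabilities
# has no kernels for a GPU with a newer architecture, so CUDA can look
# "unavailable" even when the driver and toolkit are healthy.

def gpu_supported(gpu_capability: tuple[int, int],
                  wheel_archs: set[str]) -> bool:
    """Check whether a GPU's compute capability (e.g. (9, 0) -> sm_90)
    is covered by the wheel's compiled architectures."""
    return f"sm_{gpu_capability[0]}{gpu_capability[1]}" in wheel_archs

# Example arch list for a wheel built before Blackwell (sm_120):
older_wheel = {"sm_70", "sm_80", "sm_86", "sm_89", "sm_90"}
print(gpu_supported((9, 0), older_wheel))   # True:  Hopper kernels exist
print(gpu_supported((12, 0), older_wheel))  # False: no Blackwell kernels
```

This is why `torch.cuda.is_available()` can return False on a brand-new GPU with a perfectly healthy driver.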
The core CLI has no heavy dependencies and installs in seconds:

```bash
pip install env-doctor

# Or with uv (faster, isolated)
uv tool install env-doctor
uvx env-doctor check
```
If you manage a distributed fleet of GPU nodes, the dashboard helps from an observability standpoint. It adds a web UI for monitoring multiple GPU machines and has no effect on the core CLI. Soon you will be able to take action directly from the dashboard via the distributed env-doctor CLI instances on each VM!
```bash
pip install "env-doctor[dashboard]"
```

This adds: fastapi, uvicorn, sqlalchemy, aiosqlite
| | `pip install env-doctor` | `pip install "env-doctor[dashboard]"` |
|---|---|---|
| `env-doctor check` | ✅ | ✅ |
| All CLI commands | ✅ | ✅ |
| MCP server | ✅ | ✅ |
| `env-doctor check --report-to` | ✅ | ✅ |
| `env-doctor report install/status` | ✅ | ✅ |
| `env-doctor dashboard` (web UI) | ✗ | ✅ |
Env-Doctor includes a built-in Model Context Protocol (MCP) server that exposes diagnostic tools to AI assistants like Claude Code and Claude Desktop.
Install env-doctor:
```bash
pip install env-doctor
```
Add to your Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "env-doctor": {
      "command": "env-doctor-mcp"
    }
  }
}
```
Restart Claude Desktop - the tools will be available automatically.
Available tools:

- `env_check` - Full GPU/CUDA environment diagnostics
- `env_check_component` - Check a specific component (driver, CUDA, cuDNN, etc.)
- `python_compat_check` - Check Python version compatibility with installed AI libraries
- `cuda_info` - Detailed CUDA toolkit information
- `cudnn_info` - Detailed cuDNN library information
- `cuda_install` - Step-by-step CUDA installation instructions
- `install_command` - Get safe pip install commands for AI libraries
- `model_check` - Analyze whether AI models fit on your GPU
- `model_list` - List all available models in the database
- `dockerfile_validate` - Validate Dockerfiles for GPU issues
- `docker_compose_validate` - Validate docker-compose.yml GPU configuration

Ask your AI assistant:
Learn more: MCP Integration Guide
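To make the Dockerfile-validation idea concrete, here is a toy sketch of one such rule. The rule and the message text are examples invented for illustration, not env-doctor's actual checks or output (installing a CPU-only torch wheel in a non-CUDA image is legitimate; a real validator would distinguish the cases):

```python
# Toy sketch of a Dockerfile GPU lint rule (illustrative only; not
# env-doctor's implementation or message format).

def check_dockerfile(lines: list[str]) -> list[str]:
    """Flag a common GPU mistake: installing torch on a base image
    that ships no CUDA runtime."""
    issues = []
    base = next((l for l in lines if l.startswith("FROM ")), "")
    installs_torch = any("pip install" in l and "torch" in l for l in lines)
    if installs_torch and "cuda" not in base.lower():
        issues.append("torch installed but base image has no CUDA runtime")
    return issues

print(check_dockerfile([
    "FROM python:3.11-slim",
    "RUN pip install torch",
]))  # ['torch installed but base image has no CUDA runtime']
```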
The core CLI works standalone. The dashboard is an observability layer for teams running multiple GPU machines.
`pip install "env-doctor[dashboard]"` unlocks a web UI that aggregates diagnostic results from every machine in your fleet into a single view, with no SSH required.
There are two roles — the dashboard host (receives and displays reports) and the GPU machines (run checks and send results). They communicate over a simple REST API.
```
  Dashboard Host                            GPU Machine 1
┌──────────────────────┐                  ┌───────────────────────────────┐
│ env-doctor dashboard │ ◄──── POST ────  │ env-doctor check --report-to  │
│ (React UI + SQLite)  │   /api/report    │ (runs locally, POSTs result)  │
└──────────────────────┘                  └───────────────────────────────┘
           ▲                                GPU Machine 2
           │                              ┌───────────────────────────────┐
           └──────────── POST ────────────│ env-doctor check --report-to  │
                      /api/report         └───────────────────────────────┘
```
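The wire protocol is a plain HTTP POST. As a sketch of what a report request could look like, using only the standard library (the payload fields here are assumptions for illustration, not env-doctor's documented schema):

```python
import json
import urllib.request

# Illustrative: build (but do not send) the POST a GPU machine would
# issue to the dashboard's /api/report endpoint. Payload fields are
# assumed, not env-doctor's actual schema.

def make_report_request(dashboard_url: str, payload: dict) -> urllib.request.Request:
    return urllib.request.Request(
        url=f"{dashboard_url}/api/report",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = make_report_request("http://dashboard:8765",
                          {"hostname": "gpu-node-1", "status": "healthy"})
print(req.full_url)      # http://dashboard:8765/api/report
print(req.get_method())  # POST
```

Because it is just REST over HTTP, any machine that can reach the dashboard host's port can report, with no agent or SSH tunnel involved.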
Step 1 — Start the dashboard on any machine with a reachable IP (a cheap CPU instance is enough, no GPU needed):
```bash
pip install "env-doctor[dashboard]"
env-doctor dashboard
# → Serving at http://localhost:8765
```
Step 2 — Report from each GPU machine (only needs the core CLI, not the [dashboard] extra):
```bash
pip install env-doctor

# One-time report
env-doctor check --report-to http://<dashboard-host>:8765

# Or: set up automatic reporting every 2 minutes
env-doctor report install --url http://<dashboard-host>:8765 --interval 2m
```
`report install` creates a scheduled task on the GPU machine: a cron job on Linux/macOS or a Task Scheduler entry on Windows. That task runs `env-doctor check --report-to <url>` on the configured interval.
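On the cron side, an interval flag like `--interval 2m` maps naturally onto a step value in the schedule field. A sketch of that mapping (the helper name is hypothetical, not an env-doctor internal, and only minute/hour intervals are handled):

```python
# Illustrative: translate a human interval like "2m" into a cron
# schedule prefix. Helper is hypothetical, not env-doctor's code.

def interval_to_cron(interval: str) -> str:
    """'2m' -> '*/2 * * * *', '1h' -> '0 */1 * * *'."""
    value, unit = int(interval[:-1]), interval[-1]
    if unit == "m":
        return f"*/{value} * * * *"
    if unit == "h":
        return f"0 */{value} * * *"
    raise ValueError(f"unsupported interval: {interval}")

line = f"{interval_to_cron('2m')} env-doctor check --report-to http://dash:8765"
print(line)  # */2 * * * * env-doctor check --report-to http://dash:8765
```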
The scheduled task fires every 2 minutes, but it does not POST every 2 minutes. Each run:

1. Runs `env-doctor check` locally on the GPU machine
2. Compares the result against the state cached in `~/.env-doctor/report-state.json`
3. Acts on the comparison:

| Condition | Action |
|-----------|--------|
| Hash changed (driver updated, library installed, new issue) | POST full report immediately |
| Hash unchanged, 30 min since last POST | POST lightweight heartbeat (confirms machine is alive) |
| Hash unchanged, heartbeat not due | Skip: no network call, sub-second no-op |
On a stable machine, that means one POST roughly every 30 minutes (~48 per day) instead of 720 per day.
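The change-detection logic described above can be sketched as follows. This is a simplified illustration (in-memory state, made-up report fields), not env-doctor's implementation:

```python
import hashlib
import json

# Illustrative sketch of hash-based change detection with a heartbeat
# fallback. State handling is simplified; not env-doctor's code.

HEARTBEAT_EVERY = 30 * 60  # seconds

def decide(report: dict, state: dict, now: float) -> str:
    """Return 'post', 'heartbeat', or 'skip' for this scheduled run."""
    digest = hashlib.sha256(
        json.dumps(report, sort_keys=True).encode()).hexdigest()
    if digest != state.get("hash"):
        state.update(hash=digest, last_post=now)
        return "post"                       # environment changed: full report
    if now - state.get("last_post", 0) >= HEARTBEAT_EVERY:
        state["last_post"] = now
        return "heartbeat"                  # unchanged, but prove we're alive
    return "skip"                           # no network call at all

state: dict = {}
report = {"driver_cuda": "12.4", "torch_cuda": "12.4"}
print(decide(report, state, now=0))     # post
print(decide(report, state, now=60))    # skip
print(decide(report, state, now=1800))  # heartbeat
```

Sorting the JSON keys before hashing keeps the digest stable across runs that produce the same facts in a different order.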
Except where noted, these commands run on each GPU machine, not on the dashboard host:
| Command | Where | What it does |
|---------|-------|--------------|
| `env-doctor dashboard` | Dashboard host | Starts the web UI and API server |
| `env-doctor check --report-to URL` | GPU machine | Runs the check locally, POSTs the result to the dashboard |
| `env-doctor report install --url URL` | GPU machine | Creates a cron job / scheduled task on this machine |
| `env-doctor report status` | GPU machine | Reads this machine's local config and last report time |
| `env-doctor report uninstall` | GPU machine | Removes this machine's scheduled reporting task |