by kubeflow
MCP Server and CLI for Apache Spark History Server. Debug Spark applications from AI agents, scripts, or the terminal.
Connect AI agents and engineers to Apache Spark History Server for intelligent job analysis, performance monitoring, and investigation
> [!IMPORTANT]
> ✨ NEW: Spark History Server CLI is now available
>
> A standalone Go binary that queries Spark History Server directly from your terminal, with no MCP, no AI framework, and no daemon process. Inspect jobs, compare runs, investigate failures, and script against the Spark REST API.

This project provides two interfaces to your Spark History Server data:
| | 🛠️ SHS CLI (shs) | ⚡ MCP Server |
|---|---|---|
| For | Engineers, shell scripts, CI/CD, coding agents | AI agents and MCP-compatible clients |
| Mental model | "I know the command I want to run" | "Agent, investigate this Spark app" |
| Install | Single static binary — no dependencies | Python 3.12+, uv |
| Get started | CLI docs → | MCP docs → |
graph TB
subgraph Clients
A[🤖 AI Agent / LLM]
B[👩💻 Engineer / Script / CI]
C[🔧 Coding Agent - Claude Code / Kiro]
end
subgraph "Kubeflow Spark AI Toolkit"
D[⚡ MCP Server]
E[🛠️ CLI - shs]
end
subgraph "Spark History Servers"
F[🔥 Production]
G[🔥 Staging / Dev]
end
A -->|MCP Protocol| D
B -->|Terminal commands| E
C -->|shs skill file| E
D -->|REST API| F
D -->|REST API| G
E -->|REST API| F
E -->|REST API| G
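
Both interfaces in the diagram ultimately talk to the Spark History Server over its REST API, which is served under `/api/v1`. As a rough illustration of what such a request looks like, the sketch below builds the URL for listing applications. The helper name `applications_url` is hypothetical and is not part of this project; only the `/api/v1/applications` endpoint and its `status` and `limit` query parameters come from the Spark monitoring API.

```python
from typing import Optional
from urllib.parse import urlencode


def applications_url(base_url: str,
                     status: Optional[str] = None,
                     limit: Optional[int] = None) -> str:
    """Build the Spark History Server REST URL that lists applications.

    Hypothetical helper for illustration; not taken from this project.
    """
    params = {}
    if status is not None:
        params["status"] = status  # Spark accepts "completed" or "running"
    if limit is not None:
        params["limit"] = str(limit)
    # Tolerate a trailing slash on the configured server URL
    url = base_url.rstrip("/") + "/api/v1/applications"
    return f"{url}?{urlencode(params)}" if params else url


print(applications_url("http://your-spark-history-server:18080",
                       status="completed", limit=5))
```

The resulting URL can be fetched with any HTTP client; the MCP server and the `shs` CLI wrap requests of this kind so that callers do not have to assemble them by hand.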
SHS CLI (shs): For Engineers & Scripts

A standalone Go binary: no MCP, no dependencies, no running daemon. Query your Spark History Server directly from the terminal, shell scripts, or CI/CD pipelines. Also works as a skill for coding agents like Claude Code and Kiro.
# Auto-detect latest version, OS, and architecture
VERSION=$(curl -s https://api.github.com/repos/kubeflow/mcp-apache-spark-history-server/releases | grep -m1 '"tag_name": "cli/' | cut -d'"' -f4 | sed 's|cli/||')
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)
[ "$ARCH" = "x86_64" ] && ARCH="amd64"
[ "$ARCH" = "aarch64" ] && ARCH="arm64"
curl -sSL "https://github.com/kubeflow/mcp-apache-spark-history-server/releases/download/cli%2F${VERSION}/shs-${VERSION}-${OS}-${ARCH}.tar.gz" | tar xz
sudo mv shs /usr/local/bin/
# Generate a config file
shs setup config > config.yaml # then set your Spark History Server URL
# Explore applications
shs apps
shs jobs -a APP_ID --status failed
shs stages -a APP_ID --sort duration
shs compare apps --app-a APP1 --app-b APP2
# Use as a skill with Claude Code or Kiro
shs setup skill > ~/.claude/skills/spark-history.md
See the CLI documentation for full usage, or check out a real-world example of Claude Code comparing two TPC-DS 3TB benchmark runs.
An MCP (Model Context Protocol) server that exposes Spark History Server data as tools for AI agents. Agents query your Spark infrastructure using natural language — the server handles tool selection, multi-server routing, and structured data retrieval.
Use the MCP server when you want an AI agent to conduct multi-step investigations, synthesize findings across tools, or answer natural-language questions about your Spark applications.
# Run directly with uvx (no install needed)
uvx --from mcp-apache-spark-history-server spark-mcp
# Or install with pip
pip install mcp-apache-spark-history-server
spark-mcp
The package is published to PyPI.
Edit config.yaml:
servers:
local:
default: true
url: "http://your-spark-history-server:18080"
auth: # optional
username: "user"
password: "pass"
include_plan_description: false # include SQL plans by default (default: false)
mcp:
transports:
- streamable-http # or: stdio
port: "18888"
debug: false
Environment variable overrides:
SHS_MCP_PORT Port for MCP server (default: 18888)
SHS_MCP_TRANSPORT Transport mode: streamable-http or stdio
SHS_MCP_DEBUG Enable debug mode (default: false)
SHS_MCP_ADDRESS Bind address (default: localhost)
SHS_SERVERS_*_URL URL for a specific server
SHS_SERVERS_*_AUTH_USERNAME
SHS_SERVERS_*_AUTH_PASSWORD
SHS_SERVERS_*_AUTH_TOKEN
SHS_SERVERS_*_VERIFY_SSL
SHS_SERVERS_*_TIMEOUT
SHS_SERVERS_*_EMR_CLUSTER_ARN
SHS_SERVERS_*_INCLUDE_PLAN_DESCRIPTION
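
To make the override behaviour concrete, here is a minimal sketch of how the `SHS_MCP_*` variables could be layered over values from `config.yaml`. It assumes that an environment variable wins over the file and that the file wins over the documented default; the function `resolve_mcp_settings` and the precedence order are illustrative assumptions, not the server's actual implementation.

```python
import os
from typing import Mapping, Optional

# Environment variable names from the list above
ENV_KEYS = {
    "port": "SHS_MCP_PORT",
    "transport": "SHS_MCP_TRANSPORT",
    "debug": "SHS_MCP_DEBUG",
    "address": "SHS_MCP_ADDRESS",
}
# Documented defaults; no default is listed for the transport, so none is set
DEFAULTS = {"port": "18888", "debug": "false", "address": "localhost"}


def resolve_mcp_settings(file_values: Mapping[str, str],
                         env: Optional[Mapping[str, str]] = None) -> dict:
    """Merge settings: environment first, then config file, then defaults.

    Hypothetical helper; the precedence order is an assumption.
    """
    env = os.environ if env is None else env
    resolved = {}
    for key, env_name in ENV_KEYS.items():
        if env_name in env:
            resolved[key] = env[env_name]
        elif key in file_values:
            resolved[key] = file_values[key]
        elif key in DEFAULTS:
            resolved[key] = DEFAULTS[key]
    return resolved


settings = resolve_mcp_settings(
    {"transport": "streamable-http", "port": "18888"},
    env={"SHS_MCP_PORT": "19000"},
)
print(settings)
```

In this example the port comes from the environment, the transport from the file, and the debug flag and bind address from the defaults.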
Configure multiple Spark History Servers and route queries to specific ones:
servers:
production:
default: true
url: "http://prod-spark-history:18080"
auth:
username: "user"
password: "pass"
staging:
url: "http://staging-spark-history:18080"
Agents can target a specific server per query:
"Get application <app_id> from the production server"
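
The routing idea can be sketched in a few lines: use the named server when the query specifies one, otherwise fall back to the entry marked `default: true`. The `pick_server` function below is a hypothetical illustration over a dictionary that mirrors the YAML above (auth omitted); it is not the server's actual routing code.

```python
from typing import Optional

# Mirrors the multi-server YAML above as a plain dictionary
SERVERS = {
    "production": {"default": True, "url": "http://prod-spark-history:18080"},
    "staging": {"url": "http://staging-spark-history:18080"},
}


def pick_server(servers: dict, name: Optional[str] = None) -> dict:
    """Return the named server, or the default one when no name is given.

    Hypothetical helper for illustration only.
    """
    if name is not None:
        if name not in servers:
            raise KeyError(f"unknown server: {name}")
        return servers[name]
    for cfg in servers.values():
        if cfg.get("default"):
            return cfg
    raise ValueError("no default server configured")


print(pick_server(SERVERS)["url"])             # no name given: default server
print(pick_server(SERVERS, "staging")["url"])  # explicit server name
```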
| Agent | Transport | Guide |
|-------|-----------|-------|
| Claude Desktop | stdio | Setup → |
| Amazon Q CLI | stdio | Setup → |
| Kiro | streamable-http | Setup → |
| LangGraph | streamable-http | Setup → |
| Strands Agents | streamable-http | Setup → |
| Local / Inspector | streamable-http | Setup → |

| Tool | Description |
|------|-------------|
| list_applications | List applications with optional status, date, and limit filters |
| get_application | Get application detail: status, resources, duration, attempts |
| Tool | Description |
|------|-------------|
| list_jobs | List jobs with status filtering |
| list_slowest_jobs | Top N slowest jobs |
| Tool | Description |
|------|-------------|
| list_stages | List stages with status filtering |
| list_slowest_stages | Top N slowest stages |
| get_stage | Stage detail with attempt and summary metrics |
| get_stage_task_summary | Task metric distributions (execution time, memory, I/O, spill) |
| Tool | Description |
|------|-------------|
| list_executors | List executors (active and optionally inactive) |
| get_executor | Executor detail: resources, task stats, performance |
| get_executor_summary | Aggregate metrics across all executors |
| get_resource_usage_timeline | Chronological executor add/remove with resource totals |
| Tool | Description |
|------|-------------|
| get_environment | Spark config, JVM info, system properties, classpath |
| Tool | Description |
|------|-------------|
| list_slowest_sql_queries | Top N slowest SQL executions with metrics |
| get_sql_execution | SQL execution detail with optional plan and node metrics |
| compare_sql_execution_plans | Compare SQL plans and metrics between two jobs |
| Tool | Description |
|------|-------------|
| get_job_bottlenecks | Identify bottlenecks across stages, tasks, and executors |
| Tool | Description |
|------|-------------|
| compare_job_environments | Diff Spark configs between two applications |
| compare_job_performance | Diff performance metrics between two applications |
| Tool | Description |
|------|-------------|
| aws_analyze_spark_workload | One-shot root cause analysis of failed/slow Spark workloads |
| aws_spark_code_recommendation | Code fix recommendations for identified Spark issues |
The AWS tools (aws_analyze_spark_workload and aws_spark_code_recommendation) are automatically available when AWS credentials and a region are configured. See the IAM setup guide.
- get_job_bottlenecks + list_slowest_stages + compare_job_performance
- list_jobs + get_stage + get_stage_task_summary
- compare_job_performance + compare_job_environments
- list_slowest_sql_queries + get_sql_execution + compare_sql_execution_plans
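
Under the hood, an MCP client invokes each tool in a combination like these with a JSON-RPC 2.0 `tools/call` request. The sketch below builds the request payloads for two of the tools listed above. The `tool_call` helper and the `app_id` argument name are assumptions made for illustration; the real argument names are defined by each tool's schema, which a client can discover with `tools/list`.

```python
import json
from itertools import count

_ids = count(1)  # simple incrementing JSON-RPC request ids


def tool_call(name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message.

    Hypothetical helper; only the message shape follows the MCP spec.
    """
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# "app_id" is an assumed argument name, not confirmed by the tool schemas
args = {"app_id": "<app_id>"}
requests = [tool_call(tool, args)
            for tool in ("get_job_bottlenecks", "list_slowest_stages")]

for request in requests:
    print(json.dumps(request))
```

In practice an agent framework or MCP client library sends these messages over the configured transport (stdio or streamable-http), so this shape is mainly useful for debugging and for understanding what the agent is doing.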