Turn your scattered AI coding sessions into a queryable knowledge graph. Multi-platform (Claude Code, ChatGPT, DeepSeek, Grok, Warp), W3C ontology, Wikidata entity linking, SPARQL.
```bash
# Add to your Claude Code skills
git clone https://github.com/robertoshimizu/session-graph
```
Developers use 5+ AI tools every day -- Claude Code, ChatGPT, Cursor, Copilot, Grok, DeepSeek, Warp. Each session is an isolated silo. Knowledge dies when the tab closes.
You have solved the same problem three times across different tools and cannot find any of them. You debugged a Supabase auth flow in Claude Code last Tuesday, discussed the same pattern in ChatGPT a month ago, and asked Grok about JWT refresh tokens somewhere in between. None of these tools talk to each other.
Existing solutions are single-platform and flat-file. They give you search over one tool's history, not structured relationships across all of them. A grep over session logs does not tell you that FastAPI uses Pydantic or that Neo4j is a type of graph database. It just gives you walls of text.
session-graph fixes this.
session-graph extracts structured knowledge triples -- (subject, predicate, object) -- from all your AI coding sessions, links entities to Wikidata for universal disambiguation, and loads everything into a SPARQL-queryable triplestore with full provenance back to the source conversation.
```
"What technologies have I used across all sessions?"  -->  SPARQL query  -->  structured answer
"How does FastAPI relate to Pydantic?"                -->  FastAPI --uses--> Pydantic
"What sessions discussed authentication?"             -->  3 sessions across Claude Code + DeepSeek
```
The key insight: a knowledge graph without relationships is just a tag cloud. The minimum viable extraction unit is (subject, predicate, object), not [topic1, topic2, topic3].
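The difference is concrete: triples form a graph you can traverse, while a topic list can only be searched. A minimal sketch with hypothetical data (not the project's actual API):

```python
from collections import defaultdict

# Hypothetical triples of the kind the extractor produces.
triples = [
    ("FastAPI", "uses", "Pydantic"),
    ("Neo4j", "isA", "graph database"),
    ("session-42", "mentions", "FastAPI"),
]

# Index by subject so relationships are traversable, not just greppable.
index = defaultdict(list)
for s, p, o in triples:
    index[s].append((p, o))

def relate(subject, obj):
    """Return the predicates linking subject to obj, if any."""
    return [p for p, o in index[subject] if o == obj]

print(relate("FastAPI", "Pydantic"))  # ['uses']
```

A tag cloud of `[FastAPI, Pydantic, Neo4j]` could never answer the `relate` question; the predicate is the payload.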
Entity aliases resolve via owl:sameAs: "k8s", "kubernetes", and "K8s" all resolve to Q22661306.

From real-world usage across 52 sessions:
| Metric | Value |
|--------|-------|
| Total triples in Fuseki | 1,334,432 |
| Sessions indexed | 607+ |
| Knowledge triples extracted | 47,868+ |
| Distinct entities | ~8,000+ |
| Wikidata-linked entities | 4,774 (~33%) |
| Curated predicates | 24 (with <1% relatedTo fallback) |
| Platforms supported | 4 (Claude Code, DeepSeek, Grok, Warp) |
| Entity linking precision | 7/7 (agentic ReAct linker) |
| Cost per 600 sessions | ~$0.60 (Vertex AI batch pricing) |
Real data from SPARQL — technologies, concepts, and session provenance linked across multiple Claude Code sessions:

Hub nodes (large blue) are highly connected technologies. Green nodes are concepts/outputs. Purple rectangles are session IDs with dashed provenance edges. The "W" badge indicates entities linked to Wikidata.
```
Scattered Sources         Adapter Layer                 Knowledge Graph
-----------------         -------------                 ---------------
Claude Code (.jsonl) --+
DeepSeek (.json zip) --+  triple_extraction.py
Grok (.json zip)     --+--> (LLM extracts s,p,o  -->    Apache Jena Fuseki
Warp (SQLite)        --+   from each assistant          (SPARQL endpoint)
ChatGPT (planned)    --+   message using 24                   |
Cursor (planned)     --+   curated predicates)                |
                               |                              v
                               v                        SPARQL Queries
                        link_entities.py                (14 local templates
                        (LangGraph ReAct                 + 6 Wikidata templates)
                         agent links to                       |
                         Wikidata QIDs)                       v
                                                        Claude Code Skill
                                                        (natural language -> SPARQL)
```
Real-time Loop (Claude Code):

```
Session pause → stop_hook.sh → RabbitMQ → pipeline-runner → Fuseki
(triple cache: 0 API calls for seen messages)
```
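The triple cache can be sketched as a content-hash lookup; the table name, keying scheme, and function names here are illustrative assumptions, not the project's actual implementation:

```python
import hashlib
import json
import sqlite3

# Hypothetical cache: key each assistant message by a content hash so
# re-processing a session makes zero LLM API calls for already-seen messages.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triple_cache (msg_hash TEXT PRIMARY KEY, triples TEXT)")

def extract_with_cache(message: str, llm_extract) -> list:
    key = hashlib.sha256(message.encode()).hexdigest()
    row = db.execute(
        "SELECT triples FROM triple_cache WHERE msg_hash = ?", (key,)
    ).fetchone()
    if row:                            # cache hit: no API call
        return json.loads(row[0])
    triples = llm_extract(message)     # cache miss: call the LLM once
    db.execute("INSERT INTO triple_cache VALUES (?, ?)", (key, json.dumps(triples)))
    db.commit()
    return triples

calls = []
def fake_llm(msg):
    calls.append(msg)
    return [["FastAPI", "uses", "Pydantic"]]

extract_with_cache("msg-1", fake_llm)
extract_with_cache("msg-1", fake_llm)  # second pass hits the cache
print(len(calls))  # 1
```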
```
1. SOURCE PARSING (per platform --> RDF Turtle)
   Each parser reads a platform-specific format and produces
   PROV-O + SIOC session structure plus knowledge triples.

2. TRIPLE EXTRACTION (LLM-powered)
   Each assistant message --> LLM --> top 10 (subject, predicate, object) triples
   24 curated predicates | capped at 10 triples/message (prioritizes architecture)
   Closed-world vocabulary (deviations fuzzy-matched) | retry on JSON truncation
```
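The fuzzy-matching step could look roughly like this, using a small assumed subset of the curated vocabulary (the real predicate list and matcher may differ):

```python
import difflib

# Illustrative subset of the 24 curated predicates (names assumed).
CURATED = ["uses", "dependsOn", "implements", "configures", "debuggedWith", "relatedTo"]

def normalize_predicate(raw: str, cutoff: float = 0.6) -> str:
    """Snap an off-vocabulary predicate onto the closed-world vocabulary,
    falling back to relatedTo when nothing is close enough."""
    match = difflib.get_close_matches(raw, CURATED, n=1, cutoff=cutoff)
    return match[0] if match else "relatedTo"

print(normalize_predicate("depends_on"))  # dependsOn
print(normalize_predicate("xyzzy"))       # relatedTo
```

Keeping the vocabulary closed is what makes the graph queryable: a SPARQL query for `dependsOn` would silently miss triples if LLM variants like `depends_on` leaked through.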
```
3. ENTITY FILTERING (two-level)
   Level 1: is_valid_entity() in triple_extraction.py -- rejects garbage at extraction
   Level 2: is_linkable_entity() in link_entities.py  -- pre-filters before Wikidata
   Catches: filenames (*.py), hex colors (#8776f6), CLI flags (--force),
            ICD codes (j458), snake_case identifiers, DOM selectors, etc.
   48 whitelisted short terms bypass filters (ai, api, llm, rdf, sql, etc.)
```
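An illustrative sketch of this kind of two-level filter; the patterns and whitelist below are assumptions, not the actual `is_valid_entity()`:

```python
import re

# Assumed subset of the 48 whitelisted short terms that bypass the length filter.
SHORT_WHITELIST = {"ai", "api", "llm", "rdf", "sql"}

# Hypothetical reject patterns mirroring the categories listed above.
REJECT = [
    re.compile(r".*\.\w{1,4}$"),         # filenames: utils.py, app.ts
    re.compile(r"^#[0-9a-f]{3,8}$"),     # hex colors: #8776f6
    re.compile(r"^--\w[\w-]*$"),         # CLI flags: --force
    re.compile(r"^[a-z]\d{2,3}$"),       # ICD-style codes: j458
    re.compile(r"^[a-z]+(_[a-z]+)+$"),   # snake_case identifiers
]

def is_valid_entity(label: str) -> bool:
    lowered = label.lower()
    if lowered in SHORT_WHITELIST:       # whitelist wins before any other rule
        return True
    if len(lowered) < 3:
        return False
    return not any(p.match(lowered) for p in REJECT)

print([e for e in ["FastAPI", "utils.py", "#8776f6", "--force", "api"]
       if is_valid_entity(e)])  # ['FastAPI', 'api']
```

Filtering before linking matters for cost as well as precision: every entity that survives is a candidate for a Wikidata lookup.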
```
4. ENTITY LINKING (context-aware, agentic)
   For each entity:
   +-- Normalize via entity_aliases.json (161 mappings: k8s --> kubernetes, etc.)
   +-- Frequency filter: --min-sessions 2 (default) -- only links entities
   |   appearing in 2+ sessions (~77% reduction)
   +-- Check SQLite cache
   +-- If miss --> LangGraph ReAct agent (LLM + Wikidata API tool)
   +-- Confidence threshold 0.7 --> owl:sameAs link
   +-- Entity dedup: same QID --> owl:sameAs between aliases
```
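The first two stages, alias normalization and the frequency filter, can be sketched like this, with an assumed subset of entity_aliases.json (the real file has 161 mappings):

```python
# Illustrative subset of entity_aliases.json (mappings assumed).
ALIASES = {"k8s": "kubernetes", "K8s": "kubernetes", "postgres": "postgresql"}

def normalize(label: str) -> str:
    """Resolve a surface form to its canonical label, trying exact then
    lowercase lookup, and falling back to the label itself."""
    return ALIASES.get(label, ALIASES.get(label.lower(), label))

def linkable(entity_sessions: dict, min_sessions: int = 2) -> list:
    """Keep only entities that appear in >= min_sessions distinct sessions
    (the --min-sessions frequency filter)."""
    return sorted(e for e, s in entity_sessions.items() if len(s) >= min_sessions)

seen = {"kubernetes": {"s1", "s2"}, "fastapi": {"s1"}}
print(linkable(seen))      # ['kubernetes']
print(normalize("k8s"))    # kubernetes
```

Normalizing before the frequency count is the important ordering: "k8s" in one session and "kubernetes" in another should count as two sessions for the same entity.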
```
5. LOAD  --> Apache Jena Fuseki (SPARQL endpoint)

6. QUERY --> SPARQL (via Claude Code skill or directly)
```
| Platform | Parser | Format | Status |
|----------|--------|--------|--------|
| Claude Code | jsonl_to_rdf.py | JSONL | Production |
| DeepSeek | deepseek_to_rdf.py | JSON zip export | Production |
| Grok | grok_to_rdf.py | JSON (MongoDB export) | Production |
| Warp | warp_to_rdf.py | SQLite | Production |
| ChatGPT | -- | JSON export | Planned |
| Cursor | -- | SQLite / Markdown | Planned |
| VS Code Copilot | -- | JSON | Planned |
All parsers produce the same RDF schema. Entities merge by label across platforms.
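A sketch of that common output shape as a tiny Turtle emitter; the prefixes, class choices, and property names are assumptions for illustration, not the exact schema the parsers emit:

```python
def message_to_turtle(session_id: str, msg_id: str, platform: str, text: str) -> str:
    """Emit a PROV-O + SIOC shaped Turtle fragment for one message.
    Illustrative only: real parsers share a richer schema."""
    return f"""\
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .
@prefix ex:   <http://example.org/sessions/> .

ex:{session_id} a sioc:Thread ;
    prov:wasAttributedTo ex:{platform} .

ex:{msg_id} a sioc:Post ;
    sioc:has_container ex:{session_id} ;
    sioc:content '''{text}''' .
"""

print(message_to_turtle("s1", "s1-m3", "claude_code", "FastAPI uses Pydantic"))
```

Because every parser emits the same shape, a single SPARQL query over `sioc:has_container` spans Claude Code, DeepSeek, Grok, and Warp sessions alike.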
```bash
git clone https://github.com/robertoshimizu/session-graph.git
cd session-graph
./setup.sh
```
The setup script checks prerequisites, creates .env with your LLM provider, installs Python dependencies, starts Docker services (Fuseki + RabbitMQ), and runs a smoke test — all interactively.
After setup: http://localhost:3030 (Fuseki SPARQL UI) and http://localhost:15672 (RabbitMQ, devkg/devkg).
```bash
# 1. Configure
cp .env.example .env
# Edit .env with your LLM provider API key (see Provider Support below)

# 2. Install
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 3. Create output directories
mkdir -p output/claude output/deepseek output/grok output/warp logs

# 4. Start all services (Fuseki + RabbitMQ + pipeline-runner)
docker compose up -d
# Fuseki SPARQL UI: http://localhost:3030
# RabbitMQ Management UI: http://localhost:15672 (devkg/devkg)

# 5. Process a single session (manual)
python -m pipeline.jsonl_to_rdf path/to/session.jsonl output/claude/session.ttl

# 6. Link entities to Wikidata
PYTHONUNBUFFERED=1 python -m pipeline.link_entities \
  --input output/*.ttl --output output/wikidata_links.ttl

# 7. Load into Fuseki (--auth required for Docker Fuseki)
python -m pipeline.load_fuseki output/*.ttl --auth admin:admin

# 8. Query at http://localhost:3030
```
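Besides the Fuseki UI, you can query the endpoint from the standard library; the dataset path `/ds/sparql` and the query pattern below are assumptions (check the Fuseki UI for your actual dataset name and the schema's real predicates):

```python
import urllib.parse
import urllib.request

# Dataset name 'ds' is an assumption; substitute your own.
FUSEKI = "http://localhost:3030/ds/sparql"

# Illustrative query shape, not the project's real schema.
QUERY = """
SELECT ?tech (COUNT(DISTINCT ?session) AS ?sessions)
WHERE { ?session ?p ?tech . }
GROUP BY ?tech ORDER BY DESC(?sessions) LIMIT 10
"""

def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Prepare a SPARQL-over-POST request (form-encoded 'query' parameter);
    send it with urllib.request.urlopen(req) once Fuseki is running."""
    data = urllib.parse.urlencode({"query": query}).encode()
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )

req = build_request(FUSEKI, QUERY)
print(req.get_method())  # POST
```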
With Docker Compose running, every Claude Code session is automatically processed:
```
Claude Code session ends
  → stop_hook.sh publishes to RabbitMQ (~33ms, non-blocking)
  → pipeline-runner container picks up the job
  → Extracts triples, generates .ttl, uploads to Fuseki
  → Failed jobs go to dead-letter queue for inspection
```
Configure the hook in ~/.claude/settings.json:
```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {"type": "command", "command": "/path/to/hooks/stop_hook.sh", "timeout": 5}
        ]
      }
    ]
  }
}
```
Once automatic processing is running, it only captures new sessions going forward. But you likely have weeks or months of past Claude Code sessions already sitting on disk — and that's where most of the value is.
Claude Code stores every session as a .jsonl file under ~/.claude/projects/. Each project directory contains one file per session. A typical developer accumulates hundreds of sessions over a few months. Bulk processing lets you backfill all of them into the knowledge graph in one shot.
This is optional but highly recommended. The more sessions in the graph, the richer the connections — you'll find patterns and relationships you didn't know existed across your past work.