by 0xchamin
Transform YouTube videos into a compounding knowledge base with transcripts, vision analysis, and agentic search. Works as an MCP server for Claude, Copilot & more.
```bash
# Add to your Claude Code skills
git clone https://github.com/0xchamin/mcptube
```

YouTube video knowledge engine: transcripts, vision, and a persistent wiki.
mcptube-vision transforms YouTube videos into a persistent, structured knowledge base using both transcripts and visual frame analysis. Built on the Karpathy LLM Wiki pattern: knowledge compounds with every video you add.
Evolved from mcptube v0.1: mcptube-vision replaces semantic chunk search with a persistent wiki that gets smarter with every video ingested.
Traditional video tools re-discover knowledge from scratch on every query. mcptube-vision is different:
```
mcptube v0.1                    mcptube-vision
┌────────────────────────┐      ┌───────────────────────────┐
│ Query → vector search  │      │ Video ingested → LLM      │
│ → raw chunks → LLM     │      │ extracts knowledge →      │
│ → answer (from scratch │      │ wiki pages created →      │
│   every time)          │      │ cross-references built    │
└────────────────────────┘      │                           │
                                │ Query → FTS5 + agent      │
                                │ → reasons over compiled   │
                                │   knowledge → answer      │
                                └───────────────────────────┘
```
| v0.1 (Video Search Engine) | vision (Video Knowledge Engine) |
|---|---|
| Chunk transcript, embed in vector DB | LLM watches + reads, writes wiki pages |
| Find similar chunks | Agent reasons over compiled knowledge |
| Timestamp or keyword extraction | Scene-change detection + vision model |
| Re-search all chunks each time | Connections already in the wiki |
| Library of isolated videos | Compounding knowledge base |
mcptube-vision is built around a core insight: video knowledge should compound, not be re-discovered. Every architectural decision flows from this principle.
```mermaid
flowchart TD
YT[YouTube URL] --> EXT[YouTubeExtractor\ntranscript + metadata]
EXT --> FRAMES[SceneFrameExtractor\nffmpeg scene-change detection]
FRAMES --> VISION[VisionDescriber\nLLM vision model]
VISION --> WIKI_EXT[WikiExtractor\nLLM knowledge extraction]
EXT --> WIKI_EXT
WIKI_EXT --> WIKI_ENG[WikiEngine\nmerge + update]
WIKI_ENG --> FILE[FileWikiRepository\nJSON pages on disk]
WIKI_ENG --> FTS[SQLite FTS5\nsearch index]
FILE --> AGENT[Ask Agent\nFTS5 → LLM reasoning]
FTS --> AGENT
FILE --> CLI[CLI / MCP Server]
FTS --> CLI
subgraph Ingestion Pipeline
EXT
FRAMES
VISION
WIKI_EXT
end
subgraph Knowledge Store
WIKI_ENG
FILE
FTS
end
subgraph Retrieval
AGENT
end
```
The system overview shows three distinct subsystems connected by a unidirectional data flow. The Ingestion Pipeline (left) transforms a raw YouTube URL into structured knowledge through four stages: transcript extraction, scene-change frame detection, vision-model description, and LLM-powered knowledge extraction. Each stage enriches the signal: raw video becomes text, and text becomes typed knowledge objects.
The Knowledge Store (center) is the persistent layer. The WikiEngine applies merge semantics (deciding whether to create new pages or append to existing ones), then writes JSON files to disk and updates the FTS5 search index in parallel. These two stores serve different access patterns: files for full-page reads and exports, FTS5 for sub-millisecond keyword retrieval.
The Retrieval layer (right) combines both stores. The Ask Agent first narrows via FTS5, then loads full pages from disk, and finally reasons over the candidates with structural awareness from the wiki TOC. The CLI and MCP Server sit alongside as thin presentation layers; they never contain business logic.
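To make the dual store concrete, here is a minimal sketch of the persistence layer, assuming a `wiki/` directory of JSON pages with an adjacent FTS5 index; the repository's actual schema and field names may differ.

```python
import json
import sqlite3
from pathlib import Path

WIKI_DIR = Path("wiki")           # assumption: one JSON file per wiki page
INDEX_DB = Path("wiki/index.db")  # assumption: FTS5 index lives next to the pages

def init_index() -> sqlite3.Connection:
    """Create (or open) the FTS5 virtual table used for keyword retrieval."""
    WIKI_DIR.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(INDEX_DB)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages "
        "USING fts5(slug, title, page_type, body)"
    )
    return conn

def save_page(conn: sqlite3.Connection, page: dict) -> None:
    """Write the full page to disk, then mirror its text into the search index."""
    (WIKI_DIR / f"{page['slug']}.json").write_text(json.dumps(page, indent=2))

    # Files serve full-page reads and exports; FTS5 serves fast keyword lookups.
    conn.execute("DELETE FROM pages WHERE slug = ?", (page["slug"],))
    conn.execute(
        "INSERT INTO pages (slug, title, page_type, body) VALUES (?, ?, ?, ?)",
        (page["slug"], page["title"], page["type"], page["synthesis"]),
    )
    conn.commit()
```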
```mermaid
sequenceDiagram
participant User
participant CLI
participant YouTubeExtractor
participant SceneFrameExtractor
participant VisionDescriber
participant WikiExtractor
participant WikiEngine
participant FileRepo
participant FTS5
User->>CLI: mcptube add <url>
CLI->>YouTubeExtractor: fetch transcript + metadata
YouTubeExtractor-->>CLI: segments, duration, channel
CLI->>SceneFrameExtractor: extract scene frames (ffmpeg)
SceneFrameExtractor-->>CLI: frame images (scene_000x.jpg)
CLI->>VisionDescriber: describe frames (LLM vision)
VisionDescriber-->>CLI: frame descriptions (prose)
CLI->>WikiExtractor: extract knowledge\n(transcript + frame descriptions)
WikiExtractor-->>CLI: entities, topics, concepts, video page
CLI->>WikiEngine: merge into wiki
WikiEngine->>FileRepo: write/update JSON pages\n(append entities, rewrite synthesis)
WikiEngine->>FTS5: update search index
FileRepo-->>WikiEngine: ✓
FTS5-->>WikiEngine: ✓
WikiEngine-->>CLI: wiki processed
CLI-->>User: ✓ Added + Wiki: full_analysis
```
The ingestion flow is a write-once pipeline: LLM-heavy at ingest time, but never repeated for the same video. This is the key cost tradeoff: invest tokens upfront to build compiled knowledge so that retrieval is cheap.
The sequence shows two critical branching points. First, after transcript extraction, the pipeline forks into vision processing (scene frames → LLM vision descriptions) and feeds both streams into the WikiExtractor. This dual-signal approach means the LLM sees both what was said and what was shown, which is critical for content like coding tutorials or slide-based lectures where the transcript alone misses visual information.
Second, the WikiEngine merge step is where knowledge compounding happens. Rather than blindly writing new pages, it checks for existing entities, topics, and concepts, appending new video contributions to existing pages and rewriting synthesis summaries. This is why ingesting video #10 also makes the wiki smarter about videos #1-9: shared concepts get richer synthesis with each new source.
The final FTS5 index update runs synchronously after the file write, ensuring search consistency. There is no eventual-consistency window: once add_video returns, all new knowledge is immediately searchable.
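A minimal sketch of that merge step, using plain dicts and a stubbed synthesis rewrite; the real WikiEngine's interface, field names, and LLM prompt are not shown here and may differ.

```python
def rewrite_synthesis(page: dict) -> str:
    """Stand-in for the LLM call that rewrites a page's synthesis from all sources."""
    return " ".join(c["text"] for c in page["contributions"])

def merge_extraction(wiki: dict[str, dict], extraction: dict, video_id: str) -> None:
    """Fold one video's extracted knowledge into the wiki: create or append, never overwrite."""
    items = extraction["entities"] + extraction["topics"] + extraction["concepts"]
    for item in items:
        page = wiki.get(item["slug"])
        if page is None:
            # First appearance of this entity/topic/concept: create a fresh page.
            page = {
                "slug": item["slug"],
                "title": item["title"],
                "type": item["type"],
                "synthesis": "",
                "contributions": [],
            }
            wiki[item["slug"]] = page

        # Append this video's contribution, then rewrite the synthesis so that
        # pages shared with earlier videos get richer with every new source.
        page["contributions"].append({"video_id": video_id, "text": item["contribution"]})
        page["synthesis"] = rewrite_synthesis(page)
        # In the real pipeline, each updated page is written to disk and the
        # FTS5 index is refreshed synchronously before add_video returns.
```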
```mermaid
sequenceDiagram
participant User
participant CLI
participant FTS5
participant FileRepo
participant Agent
User->>CLI: mcptube ask "What is RLHF?"
CLI->>FTS5: keyword search (sanitized query)
FTS5-->>CLI: candidate page slugs (ranked)
CLI->>FileRepo: load candidate pages (JSON)
FileRepo-->>CLI: wiki pages (entities, topics, concepts)
CLI->>FileRepo: load wiki TOC
FileRepo-->>CLI: table of contents (all page titles + types)
CLI->>Agent: candidates + TOC + question
Agent-->>CLI: reasoned answer with source citations
CLI-->>User: answer + (source-slug) citations
```
The retrieval flow is deliberately two-stage to balance cost and intelligence. The first stage, FTS5 keyword search, runs entirely locally with zero LLM tokens, narrowing thousands of wiki pages to a ranked handful in milliseconds. Query sanitization strips special characters (e.g. ? and !) that would break FTS5 syntax, ensuring robustness for natural-language questions.
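A sketch of that first stage, reusing the `pages` FTS5 table from the earlier persistence sketch; the sanitizer shown is illustrative rather than the project's exact rule.

```python
import re
import sqlite3

def sanitize(query: str) -> str:
    """Keep only word characters so ?, !, quotes, etc. can't break FTS5 syntax."""
    return " ".join(re.findall(r"\w+", query))

def keyword_search(conn: sqlite3.Connection, question: str, limit: int = 8) -> list[str]:
    """Stage 1: rank candidate page slugs with FTS5, spending zero LLM tokens."""
    terms = sanitize(question)
    if not terms:
        return []
    rows = conn.execute(
        "SELECT slug FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (terms, limit),
    ).fetchall()
    return [slug for (slug,) in rows]
```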
The second stage loads two types of context for the agent: the candidate pages (full detail: summaries, contributions, entity references) and the wiki TOC (a compact structural map of all knowledge). The TOC is critical: it gives the agent awareness of what it doesn't know. Without it, the agent would hallucinate answers from weak matches. With it, the agent can reason: "The wiki has pages on RLHF and scaling laws, but nothing on quantum computing, so I should say I don't have that information."
In CLI mode (BYOK), the agent is an LLM call that synthesizes the final answer with source citations. In MCP server mode (passthrough), this stage returns the raw candidates and TOC to the client, letting the client's own model (Copilot, Claude, Gemini) do the reasoning. This dual-mode design means the server never requires an API key when used via MCP.
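A sketch of the second stage and the mode split, continuing the earlier sketches (`WIKI_DIR`, `keyword_search`); the `complete` callable and the prompt wording are hypothetical.

```python
import json
import sqlite3

def ask(conn: sqlite3.Connection, question: str, complete=None):
    """Stage 2: load candidates + TOC, then reason locally (BYOK) or pass through (MCP).

    `complete` is a hypothetical callable wrapping the BYOK LLM; when it is None,
    the function behaves like MCP passthrough and returns the raw material instead.
    """
    slugs = keyword_search(conn, question)
    candidates = [json.loads((WIKI_DIR / f"{s}.json").read_text()) for s in slugs]

    # Compact structural map of everything the wiki knows (titles + types only).
    toc = [
        {"title": p["title"], "type": p["type"]}
        for p in (json.loads(f.read_text()) for f in sorted(WIKI_DIR.glob("*.json")))
    ]

    if complete is None:
        # MCP passthrough: the client's own model (Copilot, Claude, Gemini) reasons.
        return {"candidates": candidates, "toc": toc}

    prompt = (
        f"Question: {question}\n\n"
        f"Wiki table of contents: {json.dumps(toc)}\n\n"
        f"Candidate pages: {json.dumps(candidates)}\n\n"
        "Answer only from the candidate pages and cite their slugs. "
        "If the TOC shows no relevant page, say the wiki does not cover this."
    )
    return complete(prompt)  # BYOK LLM call in CLI mode
```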
YouTubeExtractor pulls transcript segments via youtube-transcript-api and video metadata via yt-dlp. Transcripts are chunked by natural segment boundaries rather than fixed token windows, preserving semantic coherence.
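A sketch of that extraction with the two libraries named above; `get_transcript` is the classic youtube-transcript-api entry point (newer releases use an instance-based `fetch`), and the metadata keys are standard yt-dlp fields, though the project's actual wrapper may differ.

```python
import yt_dlp
from youtube_transcript_api import YouTubeTranscriptApi

def extract(url: str) -> dict:
    """Fetch transcript segments and basic metadata for one video."""
    with yt_dlp.YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        info = ydl.extract_info(url, download=False)

    # Segments keep their natural boundaries: each has text, start, and duration.
    segments = YouTubeTranscriptApi.get_transcript(info["id"])

    return {
        "video_id": info["id"],
        "title": info.get("title"),
        "channel": info.get("channel"),
        "duration": info.get("duration"),
        "segments": segments,
    }
```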
SceneFrameExtractor uses ffmpeg's perceptual scene-change filter (select='gt(scene,{threshold})') rather than fixed-interval sampling. This is deliberate: fixed intervals waste tokens on static frames (slides held for 30 s), while scene-change detection captures transitions, the moments of highest information density. The threshold (default 0.4) is configurable.
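The underlying ffmpeg invocation, wrapped in a minimal Python sketch; the output naming and flags beyond the select filter are assumptions.

```python
import subprocess
from pathlib import Path

def extract_scene_frames(video_path: str, out_dir: str, threshold: float = 0.4) -> list[Path]:
    """Write one JPEG per detected scene change instead of sampling at fixed intervals."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-hide_banner", "-i", video_path,
            # Keep only frames whose scene-change score exceeds the threshold.
            "-vf", f"select='gt(scene,{threshold})'",
            # Variable frame rate output so dropped frames aren't duplicated.
            "-vsync", "vfr",
            str(out / "scene_%04d.jpg"),
        ],
        check=True,
    )
    return sorted(out.glob("scene_*.jpg"))
```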
VisionDescriber sends detected frames to a vision-capable LLM (GPT-4o, Claude, Gemini, auto-detected via API key priority). Frame descriptions are plain prose, not structured JSON, to maximise the LLM's descriptive latitude.
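A sketch of a frame-description call using the OpenAI SDK's image-input format; the model choice and prompt are placeholders, and the real VisionDescriber selects whichever provider's API key is available.

```python
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; other providers are analogous

def describe_frame(frame_path: str) -> str:
    """Ask a vision-capable model for a plain-prose description of one scene frame."""
    b64 = base64.b64encode(Path(frame_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe what this video frame shows: code, slides, diagrams, UI, anything informative."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    # Prose output, not structured JSON, so the model keeps full descriptive latitude.
    return response.choices[0].message.content
```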
Why this matters: A transcript of a coding tutorial misses the code on screen. Scene-change vision capture recovers that signal without the token cost of dense fixed-interval sampling.
Inspired by the Karpathy LLM Wiki pattern, this is the most architecturally distinctive component.
WikiExtractor t