by 0xchamin
Transform YouTube videos into a compounding knowledge base with transcripts, vision analysis, and agentic search. Works as an MCP server for Claude, Copilot & more.
```bash
# Add to your Claude Code skills
git clone https://github.com/0xchamin/mcptube
```

YouTube video knowledge engine: transcripts, vision, and a persistent wiki.
mcptube-vision transforms YouTube videos into a persistent, structured knowledge base using both transcripts and visual frame analysis. Built on the Karpathy LLM Wiki pattern: knowledge compounds with every video you add.
Evolved from mcptube v0.1: mcptube-vision replaces semantic chunk search with a persistent wiki that gets smarter with every video ingested.
Traditional video tools re-discover knowledge from scratch on every query. mcptube-vision is different:
```
mcptube v0.1                    mcptube-vision
┌────────────────────────┐      ┌───────────────────────────┐
│ Query → vector search  │      │ Video ingested → LLM      │
│ → raw chunks → LLM     │      │ extracts knowledge →      │
│ → answer (from scratch │      │ wiki pages created →      │
│   every time)          │      │ cross-references built    │
└────────────────────────┘      │                           │
                                │ Query → FTS5 + agent      │
                                │ → reasons over compiled   │
                                │   knowledge → answer      │
                                └───────────────────────────┘
```
| v0.1 (Video Search Engine) | vision (Video Knowledge Engine) |
|---|---|
| Chunk transcript, embed in vector DB | LLM watches + reads, writes wiki pages |
| Find similar chunks | Agent reasons over compiled knowledge |
| Timestamp or keyword extraction | Scene-change detection + vision model |
| Re-search all chunks each time | Connections already in the wiki |
| Library of isolated videos | Compounding knowledge base |
mcptube-vision is built around a core insight: video knowledge should compound, not be re-discovered. Every architectural decision flows from this principle.
```mermaid
flowchart TD
YT[YouTube URL] --> EXT[YouTubeExtractor\ntranscript + metadata]
EXT --> FRAMES[SceneFrameExtractor\nffmpeg scene-change detection]
FRAMES --> VISION[VisionDescriber\nLLM vision model]
VISION --> WIKI_EXT[WikiExtractor\nLLM knowledge extraction]
EXT --> WIKI_EXT
WIKI_EXT --> WIKI_ENG[WikiEngine\nmerge + update]
WIKI_ENG --> FILE[FileWikiRepository\nJSON pages on disk]
WIKI_ENG --> FTS[SQLite FTS5\nsearch index]
FILE --> AGENT[Ask Agent\nFTS5 → LLM reasoning]
FTS --> AGENT
FILE --> CLI[CLI / MCP Server]
FTS --> CLI
subgraph Ingestion Pipeline
EXT
FRAMES
VISION
WIKI_EXT
end
subgraph Knowledge Store
WIKI_ENG
FILE
FTS
end
subgraph Retrieval
AGENT
end
```
The system overview shows three distinct subsystems connected by a unidirectional data flow. The Ingestion Pipeline (left) transforms a raw YouTube URL into structured knowledge through four stages: transcript extraction, scene-change frame detection, vision-model description, and LLM-powered knowledge extraction. Each stage enriches the signal: raw video becomes text, and text becomes typed knowledge objects.
The Knowledge Store (center) is the persistent layer. The WikiEngine applies merge semantics (deciding whether to create new pages or append to existing ones), then writes JSON files to disk and updates the FTS5 search index in parallel. These two stores serve different access patterns: files for full-page reads and exports, FTS5 for sub-millisecond keyword retrieval.
The Retrieval layer (right) combines both stores. The Ask Agent first narrows via FTS5, then loads full pages from disk, and finally reasons over the candidates with structural awareness from the wiki TOC. The CLI and MCP Server sit alongside as thin presentation layers; they never contain business logic.
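To make the dual store concrete, here is a minimal sketch of the persistence layer, assuming a `wiki/` directory of JSON pages with an adjacent FTS5 index; the repository's actual schema and field names may differ.

```python
import json
import sqlite3
from pathlib import Path

WIKI_DIR = Path("wiki")           # assumption: one JSON file per wiki page
INDEX_DB = Path("wiki/index.db")  # assumption: FTS5 index lives next to the pages

def init_index() -> sqlite3.Connection:
    """Create (or open) the FTS5 virtual table used for keyword retrieval."""
    WIKI_DIR.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(INDEX_DB)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages "
        "USING fts5(slug, title, page_type, body)"
    )
    return conn

def save_page(conn: sqlite3.Connection, page: dict) -> None:
    """Write the full page to disk, then mirror its text into the search index."""
    (WIKI_DIR / f"{page['slug']}.json").write_text(json.dumps(page, indent=2))

    # Files serve full-page reads and exports; FTS5 serves fast keyword lookups.
    conn.execute("DELETE FROM pages WHERE slug = ?", (page["slug"],))
    conn.execute(
        "INSERT INTO pages (slug, title, page_type, body) VALUES (?, ?, ?, ?)",
        (page["slug"], page["title"], page["type"], page["synthesis"]),
    )
    conn.commit()
```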
```mermaid
sequenceDiagram
participant User
participant CLI
participant YouTubeExtractor
participant SceneFrameExtractor
participant VisionDescriber
participant WikiExtractor
participant WikiEngine
participant FileRepo
participant FTS5
User->>CLI: mcptube add <url>
CLI->>YouTubeExtractor: fetch transcript + metadata
YouTubeExtractor-->>CLI: segments, duration, channel
CLI->>SceneFrameExtractor: extract scene frames (ffmpeg)
SceneFrameExtractor-->>CLI: frame images (scene_000x.jpg)
CLI->>VisionDescriber: describe frames (LLM vision)
VisionDescriber-->>CLI: frame descriptions (prose)
CLI->>WikiExtractor: extract knowledge\n(transcript + frame descriptions)
WikiExtractor-->>CLI: entities, topics, concepts, video page
CLI->>WikiEngine: merge into wiki
WikiEngine->>FileRepo: write/update JSON pages\n(append entities, rewrite synthesis)
WikiEngine->>FTS5: update search index
FileRepo-->>WikiEngine: ✓
FTS5-->>WikiEngine: ✓
WikiEngine-->>CLI: wiki processed
CLI-->>User: ✓ Added + Wiki: full_analysis
```
The ingestion flow is a write-once pipeline: LLM-heavy at ingest time, but never repeated for the same video. This is the key cost tradeoff: invest tokens upfront to build compiled knowledge so that retrieval is cheap.
The sequence shows two critical branching points. First, after transcript extraction, the pipeline forks into vision processing (scene frames → LLM vision descriptions) and feeds both streams into the WikiExtractor. This dual-signal approach means the LLM sees both what was said and what was shown, which is critical for content like coding tutorials or slide-based lectures where the transcript alone misses visual information.
Second, the WikiEngine merge step is where knowledge compounding happens. Rather than blindly writing new pages, it checks for existing entities, topics, and concepts, appending new video contributions to existing pages and rewriting synthesis summaries. This is why ingesting video #10 also makes the wiki smarter about videos #1-9: shared concepts get richer synthesis with each new source.
The final FTS5 index update runs synchronously after the file write, ensuring search consistency. There is no eventual-consistency window: once add_video returns, all new knowledge is immediately searchable.
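A minimal sketch of that merge step, using plain dicts and a stubbed synthesis rewrite; the real WikiEngine's interface, field names, and LLM prompt are not shown here and may differ.

```python
def rewrite_synthesis(page: dict) -> str:
    """Stand-in for the LLM call that rewrites a page's synthesis from all sources."""
    return " ".join(c["text"] for c in page["contributions"])

def merge_extraction(wiki: dict[str, dict], extraction: dict, video_id: str) -> None:
    """Fold one video's extracted knowledge into the wiki: create or append, never overwrite."""
    items = extraction["entities"] + extraction["topics"] + extraction["concepts"]
    for item in items:
        page = wiki.get(item["slug"])
        if page is None:
            # First appearance of this entity/topic/concept: create a fresh page.
            page = {
                "slug": item["slug"],
                "title": item["title"],
                "type": item["type"],
                "synthesis": "",
                "contributions": [],
            }
            wiki[item["slug"]] = page

        # Append this video's contribution, then rewrite the synthesis so that
        # pages shared with earlier videos get richer with every new source.
        page["contributions"].append({"video_id": video_id, "text": item["contribution"]})
        page["synthesis"] = rewrite_synthesis(page)
        # In the real pipeline, each updated page is written to disk and the
        # FTS5 index is refreshed synchronously before add_video returns.
```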
```mermaid
sequenceDiagram
participant User
participant CLI
participant FTS5
participant FileRepo
participant Agent
User->>CLI: mcptube ask "What is RLHF?"
CLI->>FTS5: keyword search (sanitized query)
FTS5-->>CLI: candidate page slugs (ranked)
CLI->>FileRepo: load candidate pages (JSON)
FileRepo-->>CLI: wiki pages (entities, topics, concepts)
CLI->>FileRepo: load wiki TOC
FileRepo-->>CLI: table of contents (all page titles + types)
CLI->>Agent: candidates + TOC + question
Agent-->>CLI: reasoned answer with source citations
CLI-->>User: answer + (source-slug) citations
```
The retrieval flow is deliberately two-stage to balance cost and intelligence. The first stage, FTS5 keyword search, runs entirely locally with zero LLM tokens, narrowing thousands of wiki pages to a ranked handful in milliseconds. Query sanitization strips special characters (e.g. ? and !) that would break FTS5 syntax, ensuring robustness for natural-language questions.
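A sketch of that first stage, reusing the `pages` FTS5 table from the earlier persistence sketch; the sanitizer shown is illustrative rather than the project's exact rule.

```python
import re
import sqlite3

def sanitize(query: str) -> str:
    """Keep only word characters so ?, !, quotes, etc. can't break FTS5 syntax."""
    return " ".join(re.findall(r"\w+", query))

def keyword_search(conn: sqlite3.Connection, question: str, limit: int = 8) -> list[str]:
    """Stage 1: rank candidate page slugs with FTS5, spending zero LLM tokens."""
    terms = sanitize(question)
    if not terms:
        return []
    rows = conn.execute(
        "SELECT slug FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (terms, limit),
    ).fetchall()
    return [slug for (slug,) in rows]
```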
The second stage loads two types of context for the agent: the candidate pages (full detail: summaries, contributions, entity references) and the wiki TOC (a compact structural map of all knowledge). The TOC is critical: it gives the agent awareness of what it doesn't know. Without it, the agent would hallucinate answers from weak matches. With it, the agent can reason: "The wiki has pages on RLHF and scaling laws, but nothing on quantum computing, so I should say I don't have that information."
In CLI mode (BYOK), the agent is an LLM call that synthesizes the final answer with source citations. In MCP server mode (passthrough), this stage returns the raw candidates and TOC to the client, letting the client's own model (Copilot, Claude, Gemini) do the reasoning. This dual-mode design means the server never requires an API key when used via MCP.
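A sketch of the second stage and the mode split, continuing the earlier sketches (`WIKI_DIR`, `keyword_search`); the `complete` callable and the prompt wording are hypothetical.

```python
import json
import sqlite3

def ask(conn: sqlite3.Connection, question: str, complete=None):
    """Stage 2: load candidates + TOC, then reason locally (BYOK) or pass through (MCP).

    `complete` is a hypothetical callable wrapping the BYOK LLM; when it is None,
    the function behaves like MCP passthrough and returns the raw material instead.
    """
    slugs = keyword_search(conn, question)
    candidates = [json.loads((WIKI_DIR / f"{s}.json").read_text()) for s in slugs]

    # Compact structural map of everything the wiki knows (titles + types only).
    toc = [
        {"title": p["title"], "type": p["type"]}
        for p in (json.loads(f.read_text()) for f in sorted(WIKI_DIR.glob("*.json")))
    ]

    if complete is None:
        # MCP passthrough: the client's own model (Copilot, Claude, Gemini) reasons.
        return {"candidates": candidates, "toc": toc}

    prompt = (
        f"Question: {question}\n\n"
        f"Wiki table of contents: {json.dumps(toc)}\n\n"
        f"Candidate pages: {json.dumps(candidates)}\n\n"
        "Answer only from the candidate pages and cite their slugs. "
        "If the TOC shows no relevant page, say the wiki does not cover this."
    )
    return complete(prompt)  # BYOK LLM call in CLI mode
```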
YouTubeExtractor pulls transcript segments via youtube-transcript-api and video metadata via yt-dlp. Transcripts are chunked by natural segment boundaries rather than fixed token windows, preserving semantic coherence.
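A sketch of that extraction with the two libraries named above; `get_transcript` is the classic youtube-transcript-api entry point (newer releases use an instance-based `fetch`), and the metadata keys are standard yt-dlp fields, though the project's actual wrapper may differ.

```python
import yt_dlp
from youtube_transcript_api import YouTubeTranscriptApi

def extract(url: str) -> dict:
    """Fetch transcript segments and basic metadata for one video."""
    with yt_dlp.YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        info = ydl.extract_info(url, download=False)

    # Segments keep their natural boundaries: each has text, start, and duration.
    segments = YouTubeTranscriptApi.get_transcript(info["id"])

    return {
        "video_id": info["id"],
        "title": info.get("title"),
        "channel": info.get("channel"),
        "duration": info.get("duration"),
        "segments": segments,
    }
```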
SceneFrameExtractor uses ffmpeg's perceptual scene-change filter (select='gt(scene,{threshold})') rather than fixed-interval sampling. This is deliberate: fixed intervals waste tokens on static frames (slides held for 30 s), while scene-change detection captures transitions, the moments of highest information density. The threshold (default 0.4) is configurable.
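The underlying ffmpeg invocation, wrapped in a minimal Python sketch; the output naming and flags beyond the select filter are assumptions.

```python
import subprocess
from pathlib import Path

def extract_scene_frames(video_path: str, out_dir: str, threshold: float = 0.4) -> list[Path]:
    """Write one JPEG per detected scene change instead of sampling at fixed intervals."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-hide_banner", "-i", video_path,
            # Keep only frames whose scene-change score exceeds the threshold.
            "-vf", f"select='gt(scene,{threshold})'",
            # Variable frame rate output so dropped frames aren't duplicated.
            "-vsync", "vfr",
            str(out / "scene_%04d.jpg"),
        ],
        check=True,
    )
    return sorted(out.glob("scene_*.jpg"))
```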
VisionDescriber sends detected frames to a vision-capable LLM (GPT-4o, Claude, Gemini, auto-detected via API key priority). Frame descriptions are plain prose, not structured JSON, to maximise the LLM's descriptive latitude.
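A sketch of a frame-description call using the OpenAI SDK's image-input format; the model choice and prompt are placeholders, and the real VisionDescriber selects whichever provider's API key is available.

```python
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; other providers are analogous

def describe_frame(frame_path: str) -> str:
    """Ask a vision-capable model for a plain-prose description of one scene frame."""
    b64 = base64.b64encode(Path(frame_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe what this video frame shows: code, slides, diagrams, UI, anything informative."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    # Prose output, not structured JSON, so the model keeps full descriptive latitude.
    return response.choices[0].message.content
```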
Why this matters: A transcript of a coding tutorial misses the code on screen. Scene-change vision capture recovers that signal without the token cost of dense fixed-interval sampling.
Inspired by the Karpathy LLM Wiki pattern, this is the most architecturally distinctive component.
WikiExtractor t