Give Claude the ability to watch any video — scene-change frames + transcript + a structured report, with a 0-10s hook microscope and optional Obsidian auto-save.
# Add to your Claude Code skills
git clone https://github.com/taoufik123-collab/claude-watchGuides for using ide extensions skills like claude-watch.
claude-watch is an open-source ide extensions skill for AI coding assistants such as Claude Code, Codex CLI, and ChatGPT, built by taoufik123-collab. Give Claude the ability to watch any video — scene-change frames + transcript + a structured report, with a 0-10s hook microscope and optional Obsidian auto-save. It has 304 GitHub stars.
claude-watch's catalog security scan is still queued. You can run an instant dependency and prompt-injection check now with the "Scan for vulnerabilities" button above.
Clone the repository with "git clone https://github.com/taoufik123-collab/claude-watch" and add it to your Claude Code skills directory (see the Installation section above). claude-watch ships a SKILL.md manifest, so compatible agents can discover and load it automatically.
claude-watch is primarily written in Python. It is open-source under taoufik123-collab on GitHub, so you can review or fork the full source.
Yes. SkillsLLM lists many other IDE Extensions skills you can browse and compare side by side. Open the IDE Extensions category from the badge at the top of this page, or use the Related Skills and comparison links further down to weigh claude-watch against similar tools.
No comments yet. Be the first to share your thoughts!
Unlocks once the catalog security scan passes (runs nightly).
The deep catalog scan for this skill is still queued. Run an instant dependency check now instead.
report.md and, after answering the user, optionally auto-ingests the analysis into your Obsidian vault (configurable via $WATCH_VAULT_DIR) — tied to why the user watched it.
argument-hint: " [why you're watching it]"
allowed-tools: Bash, Read, AskUserQuestion
homepage: https://github.com/taoufik123-collab/claude-watch
repository: https://github.com/taoufik123-collab/claude-watch
author: taoufik
license: MIT
user-invocable: trueYou don't have a video input; this skill gives you one. A Python script downloads the video, extracts frames as JPEGs (one per detected shot via scene-change), gets a timestamped transcript (native captions first, then Whisper API as fallback), runs editorial pacing metrics, and microscopes the first 10 seconds at higher density. You then Read each frame path to see the images, combine them with the transcript to answer the user, fill the structured report.md, and offer to ingest the analysis into Taoufik's Second Brain.
report.md — every watch emits an ingest-shaped report at <workdir>/report.md with TL;DR, key moments, hook breakdown, editorial profile, quotable moments, entities, concepts, and transcript. Narrative sections are emitted as <!-- pending Claude fill: ... --> markers — you fill them in before offering ingest.$VAULT_DIR/CLAUDE.md (if it exists) and run that vault's Ingest op against the report.None of the above add new dependencies — pure ffmpeg + stdlib + the existing Whisper backend.
Steps 4.4 and 4.5 stage the report inside an Obsidian vault so the user can read it where they read everything else. Resolve the vault directory in this order — first hit wins, and the result is what $VAULT_DIR refers to everywhere below:
$WATCH_VAULT_DIR env var — if set and the path exists, use it. This is the user-controlled override.~/Second brain/ — if it exists as a directory.~/Documents/Obsidian/ — if it exists as a directory.~/Obsidian/ — if it exists as a directory.📄 Report (no vault detected): <workdir>/report.md. Suggest they set WATCH_VAULT_DIR if they want auto-ingest.A quick way to resolve it in bash inside the skill:
VAULT_DIR="${WATCH_VAULT_DIR:-}"
if [ -z "$VAULT_DIR" ] || [ ! -d "$VAULT_DIR" ]; then
for candidate in "$HOME/Second brain" "$HOME/Documents/Obsidian" "$HOME/Obsidian"; do
if [ -d "$candidate" ]; then VAULT_DIR="$candidate"; break; fi
done
fi
The vault's URL-name (for the obsidian:// URL scheme in Step 4.4) is the final path component — e.g. $HOME/Second brain → Second brain. URL-encode spaces as %20.
/watch invocation, silent on success)Python interpreter: every python3 ... command in this skill is for macOS/Linux. On Windows, substitute python — the python3 command on Windows is the Microsoft Store stub and will not run the script.
Before every /watch run, verify that dependencies and an API key are in place:
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" --check
This is a <100ms lookup. On exit 0, the script emits nothing — proceed to Step 1 without comment. Do NOT announce "setup is complete" to the user — they don't need a status message on every turn. The only acceptable user-visible output from Step 0 is when remediation is required.
On non-zero exit, follow the table:
| Exit | Meaning | Action |
|---|---|---|
2 |
Missing binaries (ffmpeg / ffprobe / yt-dlp) |
Run installer |
3 |
No Whisper API key | Run installer to scaffold .env, then ask user for a key |
4 |
Both missing | Run installer, then ask for a key |
The installer is idempotent — safe to re-run:
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py"
On macOS with Homebrew, it auto-installs ffmpeg and yt-dlp. On Linux/Windows, it prints the exact install commands for the user to run. It scaffolds ~/.config/watch/.env with commented placeholders at 0600 perms, and writes SETUP_COMPLETE=true once deps + a key are in place so the next session knows this user has already been through the wizard.
If an API key is still missing after install: use AskUserQuestion to ask the user whether they have a Groq API key (preferred — cheaper, faster) or an OpenAI key. Then write it into ~/.config/watch/.env — set the matching GROQ_API_KEY=... or OPENAI_API_KEY=... line. If they don't want to set up Whisper, proceed with --no-whisper and tell them videos without native captions will come back frames-only.
Structured mode (optional): python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" --json emits {status, first_run, missing_binaries, whisper_backend, has_api_key, config_file, platform} where status is one of ready | needs_install | needs_key | needs_install_and_key. Use this when you need to branch on specifics (e.g. "is this the user's very first run?" → first_run: true).
Within a single session, you can skip Step 0 on follow-up /watch calls — once --check returned 0, nothing about the environment changes between turns.
.mp4, .mov, .mkv, .webm, etc.) and asks about it./watch <url-or-path> [question].Step 1 — parse the user input. Separate the video source from any question the user asked. The question (or the user's prior stated interest) IS the intent — pass it to the script via --intent. Example: /watch https://youtu.be/abc what's the hook pattern? → source = https://youtu.be/abc, intent = what's the hook pattern?. If no question is given, use a brief inferred intent ("general summary") so the report's TL;DR has a lens. The intent shapes how the report's TL;DR and entity/concept sections get filled at Step 4 — same video with intent "pricing tactics" vs "editing style" produces different reports.
Step 2 — run the watch script. Pass the source verbatim. Do not shell-escape it yourself beyond normal quoting:
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "<source>" --intent "<intent string>"
Pass --intent whenever you have any signal from the user about why they want this video — the question they asked, a stated goal, or a brief inferred summary. Empty --intent works but produces less-targeted report sections.
Optional flags:
--start T / --end T — focus on a section. Accepts SS, MM:SS, or HH:MM:SS. When either is set, fps auto-scales denser (see "Focusing on a section" below).--max-frames N — lower the cap for tighter token budget (e.g. --max-frames 40)--resolution W — change frame width in px (default 512; bump to 1024 only if the user needs to read on-screen text)--fps F — override auto-fps (clamped to 2 fps max). Setting --fps disables scene-change sampling.--out-dir DIR — keep working files somewhere specific (default: an auto-generated tmp dir)--whisper groq|openai — force a specific Whisper backend (default: prefer Groq if both keys exist)--no-whisper — disable the Whisper fallback entirely (frames-only if no captions)--no-scene-change — force uniform frame sampling (debug only; usually leave on)--no-hook-microscope — skip the 0-10s dense pass (saves ~1 Whisper call)When the user asks about a specific moment — "what happens at the 2 minute mark?", "zoom into 0:45 to 1:00", "the first 10 seconds" — pass --start and/or --end. The script switches to focused-mode budgets, which are denser than full-video budgets (still capped at 2 fps):
Focused mode is the right call for:
Transcript is auto-filtered to the same range. Frame timestamps are absolute (real video timeline, not offset-from-start).
Examples:
# Last 10 seconds of a 1 minute video
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" video.mp4 --start 50 --end 60
# Zoom into 2:15 → 2:45 at 3 fps (90 frames)
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "$URL" --start 2:15 --end 2:45 --fps 3
# From 1h12m to the end of the video
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "$URL" --start 1:12:00
Step 3 — Read every frame path the script lists. The Read tool renders JPEGs directly as images for you. Read all frames in a single message (parallel tool calls) so you see them together. The frames are in chronological order with a t=MM:SS timestamp so you can align them to the transcript.
Step 4 — answer the user, then fill the report. You now have three streams of evidence:
report.md — structured artifact at <workdir>/report.md with <!-- pending Claude fill: ... --> markersFirst, answer the user's question in chat citing timestamps.
Then, fill in the pending markers in report.md using the Edit tool. Walk every <!-- pending Claude fill: ... --> in order:
[[wikilink]] style.The fully-filled report.md is what gets ingested at Step 4.5. Do not skip the fill — empty markers won't ingest cleanly.
Step 4.4 — Stage to the Obsidian vault and open in Obsidian (when a vault is detected). After filling every marker, resolve $VAULT_DIR per the Configuration section. If no vault is detected, skip this step and emit 📄 Report (no vault detected): <workdir>/report.md in chat instead.
When $VAULT_DIR resolves:
report.md frontmatter, slugify (lowercase, ASCII-only, hyphens, max 60 chars), append -YYYY-MM-DD. Example: karpathy-claude-md-43k-installs-2026-05-24.mkdir -p "$VAULT_DIR/raw/watched/<slug>".report.md + every hero frame (filenames in the report frontmatter under hero_frames:) into that dir. The report MUST live inside the vault for Obsidian to open it.$VAULT_DIR with spaces URL-encoded as %20:
VAULT_NAME=$(basename "$VAULT_DIR" | sed 's/ /%20/g')
open "obsidian://open?vault=${VAULT_NAME}&file=raw/watched/<slug>/report.md"
The file= value is the path relative to the vault root, no leading slash. Don't ask permission — the user has already opted in by running /watch.📄 Report (open in Obsidian): raw/watched/<slug>/report.md. So if Obsidian was closed / the URL handler missed, the user can still navigate to it manually inside the vault.Rationale: the report is the leverage point of /watch. If the user reads everything in Obsidian, opening in Preview or VS Code defeats the purpose. Staging at 4.4 also means Step 4.5's "Yes / Stage" branches are no-ops on the copy step (the file is already in the vault); they only differ in whether the Ingest op runs.
Cleanup implication for Step 4.5: if the user picks "No, drop it" at 4.5 AND a vault was staged at 4.4, ALSO rm -rf "$VAULT_DIR/raw/watched/<slug>" since we pre-staged. Do NOT drop the vault copy if they picked Yes or Stage.
Step 4.5 — Offer ingest into the Obsidian vault. Skip this step entirely if no vault was detected at Step 4.4. Otherwise use AskUserQuestion once, with these options (do NOT skip if a vault was found unless the user explicitly said "don't ingest" before /watch ran):
Question: "Want to ingest this into your Obsidian vault?"
- Yes — same angle ("")
- Yes — different angle (user specifies in the notes field)
- Stage to
raw/watched/for later- No, drop it
Routing based on response:
A. Yes (same or different angle):
report.md frontmatter, slugify (lowercase, ASCII-only, hyphens, max 60 chars), append -YYYY-MM-DD. Example: me-at-the-zoo-2026-05-24.$VAULT_DIR/raw/watched/<slug>/ (Step 4.4 already created it).$VAULT_DIR/CLAUDE.md exists: Read it to refresh the Ingest op definition — that file is authoritative; this skill must not duplicate its steps. Execute the Ingest op against raw/watched/<slug>/report.md exactly as $VAULT_DIR/CLAUDE.md defines it.$VAULT_DIR/CLAUDE.md exists: Run a generic ingest — read the report, identify entities + concepts, append a one-line entry to $VAULT_DIR/log.md (create if missing), and tell the user the report is staged at raw/watched/<slug>/report.md and they can wire up an Ingest op of their own.log.md entry written.B. Stage to raw/watched/ for later:
log.md.$(basename $VAULT_DIR)/raw/watched/<slug>/. Run an Ingest op against it when you're ready."C. No, drop it: proceed to Step 5 (cleanup) — and per the cleanup-implication note in Step 4.4, rm -rf "$VAULT_DIR/raw/watched/<slug>" to undo the pre-staging.
The "different angle" path is what makes /watch truly plug-and-play — the user can watch a video for one reason, then on the way out decide it's actually more useful for a different concept, and the resulting wiki entry reframes accordingly.
Step 5 — clean up. The script prints a working directory at the end. If you ingested (Step 4.5 path A), the hero frames + report.md are already copied to Second Brain — you can rm -rf the original workdir. If you staged (path B), same — the workdir copy is no longer needed. If the user picked "no, drop it" (path C) and isn't going to ask follow-ups, delete with rm -rf <dir>. If they might ask follow-ups, leave it in place.
The script gets a timestamped transcript in one of two ways:
ffmpeg -vn -ac 1 -ar 16000 -b:a 64k, ~0.5 MB/min) and uploads it to whichever Whisper API has a key configured:
whisper-large-v3. Preferred default: cheaper, faster. Get a key at console.groq.com/keys.whisper-1. Fallback. Get a key at platform.openai.com/api-keys.Both keys live in ~/.config/watch/.env. The script prefers Groq when both are set; override with --whisper openai to force OpenAI. Use --no-whisper to skip the fallback entirely.
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" (auto-installs ffmpeg/yt-dlp via brew on macOS, scaffolds the .env). For API key, ask the user via AskUserQuestion and write it to ~/.config/watch/.env.--start/--end rather than a sparse full-video scan.--whisper openai if Groq failed (or vice versa).<!-- pending Claude fill: ... --> markers → you skipped Step 4. Go back, read the report, fill every marker via Edit, then offer ingest. Never ingest a half-filled report — the Second Brain Ingest op will produce sparse/wrong entity pages.raw/watched/<slug>/, and they can re-run by saying "ingest the staged report at <slug>".This skill burns tokens primarily on frames. Order of magnitude:
--resolution to 1024 roughly quadruples the image tokens per frame. Only do it when necessary.If you already watched a video this session and the user asks a follow-up, do not re-run the script — you already have the frames and transcript in context. Just answer from what you have.
What this skill does:
yt-dlp locally to download the video and pull native captions when the source supports them (public data; the request goes directly to whatever host the URL points at)ffmpeg / ffprobe locally to extract frames as JPEGs and, when Whisper is needed, a mono 16 kHz audio clipapi.groq.com/openai/v1/audio/transcriptions) when GROQ_API_KEY is set (preferred — cheaper, faster)api.openai.com/v1/audio/transcriptions) when OPENAI_API_KEY is set and Groq is not, or when --whisper openai is forced--out-dir if specified) so Claude can Read them~/.config/watch/.env (mode 0600) to store the Whisper API key(s) and a SETUP_COMPLETE marker. As a fallback, also reads .env in the current working directory$VAULT_DIR/CLAUDE.md at orchestration time (only when ingest is requested and the file exists) to follow that vault's Ingest operation definitionreport.md plus copies of hero frames into $VAULT_DIR/raw/watched/<slug>/ when a vault is detected at Step 4.4$VAULT_DIR/wiki/ (entities, concepts, sources, index.md) and appends to $VAULT_DIR/log.md — following the actions defined by the vault's Ingest op (or a generic fallback if no CLAUDE.md is present)What this skill does NOT do:
--no-whisperapi.groq.com, OpenAI key only goes to api.openai.com)~/.config/watch/.env (and Second Brain when ingest is consented to) — clean up the working directory when you're done (Step 5)Bundled scripts: scripts/watch.py (entry point), scripts/download.py (yt-dlp wrapper), scripts/frames.py (ffmpeg uniform + scene-change extraction + hero selection), scripts/pacing.py (editorial metrics), scripts/hook.py (0-10s microscope), scripts/report.py (structured report emitter), scripts/transcribe.py (caption selection + Whisper orchestration), scripts/whisper.py (Groq / OpenAI clients, supports word-level timestamps), scripts/setup.py (preflight + installer)
Review scripts before first use to verify behavior.
Give Claude the ability to watch any video.
Paste a URL or a local file and Claude watches it — scene-change frame extraction (one frame per cut instead of every-N-seconds), a 0-10s hook microscope (dense frames + word-level Whisper on the opening, where every video earns or loses your attention), and optional Obsidian auto-save so a watched video becomes a connected wiki entry without copy-paste.
Claude Code:
/plugin marketplace add taoufik123-collab/claude-watch
/plugin install watch@claude-watch
claude.ai (web): download watch.skill and drop it into Settings → Capabilities → Skills.
Codex / generic skills:
git clone https://github.com/taoufik123-collab/claude-watch.git ~/.codex/skills/watch
Zero config to start — yt-dlp and ffmpeg install on first run via brew on macOS (Linux/Windows print exact commands). Captions cover most public videos for free. Whisper API key is only needed when a video has no captions. Set $WATCH_VAULT_DIR to point at your Obsidian vault for auto-save, or leave it unset and the skill skips the ingest step quietly.
scripts/frames.py grabs one frame per detected shot via ffmpeg's select=gt(scene,...), not a uniform tick every N seconds. Token cost stays flat on long videos because the frame count is bounded by the number of cuts, not the duration.scripts/hook.py runs a denser 2 fps pass on the opening 10 seconds plus a word-level Whisper transcript, so the report tells you what was on screen as each word landed. The first 10 seconds is where every video either earns your attention or loses it.report.md with Claude-fill markers — scripts/report.py emits a fixed-schema report (TL;DR, key moments, hook breakdown, editorial profile, quotable moments, entities, concepts, transcript) where narrative sections are explicit <!-- pending Claude fill: ... --> markers. Claude has a job-list to walk before ingest, not a blank doc.$VAULT_DIR/raw/watched/<slug>/ and opens it via the obsidian:// URL scheme. Step 4.5 offers ingest into the vault's wiki. Both steps skip cleanly when no vault is detected. Vault path is resolved from $WATCH_VAULT_DIR or auto-detected from ~/Second brain/, ~/Documents/Obsidian/, ~/Obsidian/.The core pipeline — yt-dlp download, ffmpeg frames, Groq/OpenAI Whisper backends, the --start/--end focused mode, the SessionStart hook, the multi-surface install — comes from the original claude-video project and works unchanged (see Credits).
Claude can read a webpage, run a script, browse a repo. What it can't do, out of the box, is watch a video. You paste a YouTube link and it has to either guess from the title or pull a transcript that's missing 90% of what's on screen.
With Claude Video /watch you can paste a URL or a local path, ask a question, and Claude downloads the video, extracts frames at an auto-scaled rate, pulls a timestamped transcript (free captions when available, Whisper API as fallback), and Reads every frame as an image. By the time it answers, it has seen the video and heard the audio.
/watch https://youtu.be/dQw4w9WgXcQ what happens at the 30 second mark?
I built this because I'm constantly using video to keep up with content. If I see a YouTube video that's blowing up, I want to know how the creator structured the hook — what's on screen in the first 3 seconds, what they said, why it worked. That used to mean watching it myself with a notepad. Now I just paste the URL and ask.
The other half is summarization. Most YouTube videos don't deserve 20 minutes of my attention. I hand the URL to Claude, it pulls the transcript, and tells me what actually happened. If the visual matters, frames come along too. If it's a podcast or a talking head, transcript is enough.
Claude is great at reading and synthesizing — but until now, video was the one input I couldn't hand it. Pasting a YouTube link got you nothing useful. /watch closes that gap.
Analyze someone else's content. /watch https://youtu.be/<viral-video> what hook did they open with? Claude looks at the first frames, reads the opening transcript, breaks down the structure. Same for ad creative, competitor launches, podcast intros, anything where the how matters as much as the what.
Diagnose a bug from a video. Someone sends you a screen recording of something broken. /watch bug-repro.mov what's going wrong? Claude watches the recording, finds the frame where the issue appears, describes what's on screen, often catches the cause without you ever opening the file.
Summarize a video. /watch https://youtu.be/<long-thing> summarize this does the obvious thing — pulls the structure, the key moments, what was actually said and shown. Faster than watching at 2x.
.mp4, .mov, .mkv, .webm).yt-dlp downloads it. For URLs, into a temp working directory. For local files, no download — just probed in place.ffmpeg extracts frames at an auto-scaled rate. The frame budget is duration-aware: ≤30s gets ~30 frames, 30-60s gets ~40, 1-3min gets ~60, 3-10min gets ~80, longer gets 100 sparsely. Hard ceilings: 2 fps, 100 frames. JPEGs at 512px wide by default — bump with --resolution 1024 if Claude needs to read on-screen text.yt-dlp pulls native captions (manual or auto-generated) from the source. Free, instant, accurate-ish. Fallback: extract a mono 16 kHz audio clip and ship it to Whisper — Groq's whisper-large-v3 (preferred — cheaper and faster) or OpenAI's whisper-1.t=MM:SS markers and the transcript with timestamps. Claude Reads each frame in parallel — JPEGs render directly as images in its context.Token cost is dominated by frames. Every frame is an image; image tokens add up fast. The script's auto-fps logic exists so you don't blow your context budget on a sparse scan of a 30-minute video that would have been better answered by a focused 30-second window.
| Duration | Default frame budget | What you get |
|---|---|---|
| ≤30 s | ~30 frames | Dense — basically every key moment |
| 30 s - 1 min | ~40 frames | Still dense |
| 1 - 3 min | ~60 frames | Comfortable |
| 3 - 10 min | ~80 frames | Sparse but workable |
| > 10 min | 100 frames | "Sparse scan" warning — re-run focused |
When the user names a moment ("around 2:30", "the last 30 seconds", "from 0:45 to 1:00"), pass --start / --end. Focused mode gets denser per-second budgets, capped at 2 fps. Far more useful than a sparse pass over the whole thing.
| Surface | Install |
|---|---|
| Claude Code | /plugin marketplace add taoufik123-collab/claude-watch then /plugin install watch@claude-watch |
| claude.ai (web) | Download watch.skill → Settings → Capabilities → Skills → + |
| Codex | git clone https://github.com/taoufik123-collab/claude-watch.git ~/.codex/skills/watch |
| Manual / dev | git clone https://github.com/taoufik123-collab/claude-watch.git ~/.claude/skills/watch |
| Configuration | Optional: export WATCH_VAULT_DIR=/path/to/your/obsidian/vault to enable auto-save. Auto-detects ~/Second brain/, ~/Documents/Obsidian/, ~/Obsidian/. |
/plugin marketplace add taoufik123-collab/claude-watch
/plugin install watch@claude-watch
Update later with /plugin update watch@claude-watch.
watch.skill from the latest release.+ and drop the file in.Enable "Code execution and file creation" under Capabilities first — the skill shells out to ffmpeg and yt-dlp, so it won't run without it.
git clone https://github.com/taoufik123-collab/claude-watch.git ~/.codex/skills/watch
git clone https://github.com/taoufik123-collab/claude-watch.git ~/.claude/skills/watch
On the first /watch call, the skill runs scripts/setup.py --check. If ffmpeg / yt-dlp aren't on your PATH, or no Whisper API key is set, it walks you through fixing it:
brew install ffmpeg yt-dlp.apt / dnf / pipx commands.winget / pip commands.~/.config/watch/.env (mode 0600) with commented placeholders for GROQ_API_KEY (preferred) and OPENAI_API_KEY.After setup, preflight is silent and /watch just works. The check is a sub-100ms lookup, so it doesn't slow you down on subsequent runs.
Captions cover the majority of public videos for free. The Whisper fallback only kicks in when a video genuinely has no caption track — typically local files, TikToks, some Vimeos, and the occasional caption-less YouTube upload.
| Capability | What you need |