Teaching Claude to See
Custom Skills for Visual Context
Claude Code is a terminal tool. It can read files, run commands, and write code. What it cannot do, by default, is see what you’re looking at. Your screen, your clipboard, that YouTube video someone just sent you: all invisible.
I fixed that.
In about an hour, I built three custom skills that bridge the gap between my visual world and Claude’s text-based reality. Now when I say “look at this thing I just copied,” it actually looks.
The Problem
Here’s the scenario that kept happening: I’m debugging something, I take a screenshot of the error, I want to ask Claude about it. The normal workflow is to save the file somewhere, find the path, paste it into the conversation. Or worse: manually transcribe what’s on my screen like some kind of medieval scribe.
Same with the clipboard. I copy some code, want to ask “what’s wrong with this,” and have to paste it into the chat first. And videos? Forget it. “Can you explain what happened in this screen recording” was a non-starter.
The terminal is powerful, but it’s blind.
The Solution: Three Skills
Claude Code supports custom skills: markdown files that define reusable behaviors. Skills can run bash commands, read files, and chain operations together. I built three of them.
/clipboard
Reads the Windows clipboard from inside WSL. Text or images.
The implementation is straightforward: a PowerShell command checks what type of content is in the clipboard. If it’s text, another PowerShell command retrieves it. If it’s an image, it saves to a temp file, and I read that file visually.
# Check clipboard type
powershell.exe -command "Add-Type -AssemblyName System.Windows.Forms; \
if ([System.Windows.Forms.Clipboard]::ContainsImage()) { Write-Output 'IMAGE' } \
elseif ([System.Windows.Forms.Clipboard]::ContainsText()) { Write-Output 'TEXT' } \
else { Write-Output 'EMPTY' }"
For images, the save-and-read dance:
# Save clipboard image to temp
powershell.exe -command "Add-Type -AssemblyName System.Windows.Forms; \
\$img = [System.Windows.Forms.Clipboard]::GetImage(); \
if (\$img) { \$img.Save('C:\\Users\\$USER\\AppData\\Local\\Temp\\clipboard_img.png') }"
Then the WSL path /mnt/c/Users/$USER/AppData/Local/Temp/clipboard_img.png gets read as an image. Claude sees it. Done.
Use case: Copy an error message, run /clipboard what does this mean. Copy a diagram, run /clipboard explain this architecture. The clipboard becomes a universal input buffer.
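Stitched together, the skill's dispatch might look like the sketch below. The two PowerShell probes mirror the ones above and `Get-Clipboard` is a real cmdlet, but the wrapper itself is illustrative, not the plugin's actual source:

```shell
# Sketch only: dispatch on clipboard content type from inside WSL.
clip_type=$(powershell.exe -command \
  "Add-Type -AssemblyName System.Windows.Forms; \
   if ([System.Windows.Forms.Clipboard]::ContainsImage()) { 'IMAGE' } \
   elseif ([System.Windows.Forms.Clipboard]::ContainsText()) { 'TEXT' } \
   else { 'EMPTY' }" 2>/dev/null | tr -d '\r')   # strip Windows CRLF

case "${clip_type:-EMPTY}" in
  IMAGE)
    # Run the GetImage().Save step shown earlier, then read the temp PNG.
    echo "image saved; read /mnt/c/Users/$USER/AppData/Local/Temp/clipboard_img.png" ;;
  TEXT)
    powershell.exe -command "Get-Clipboard" 2>/dev/null | tr -d '\r' ;;
  *)
    echo "clipboard is empty (or not running under WSL)" ;;
esac
```

The `tr -d '\r'` matters: PowerShell output arrives with Windows line endings, which otherwise leak into comparisons and file paths.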
/screenshot
Grabs the latest screenshot from the Windows Screenshots folder.
Even simpler than clipboard. List the directory sorted by modification time, grab the newest .png, read it. The skill auto-detects Windows paths and converts them to WSL mounts, so I can just paste C:\Users\...\Screenshots and it figures out the /mnt/c/... translation.
# Find newest screenshot
ls -t "/mnt/c/Users/$USER/OneDrive/Pictures/Screenshots 1/"*.png | head -1
Then read and respond.
Use case: Take a screenshot with Win+Shift+S, immediately ask /screenshot what's the error here. No hunting for file paths, no manual copying. The most recent screenshot is always one command away.
It handles multiple screenshots too. /screenshot last 5 summarize these grabs the five most recent and analyzes them together. Useful for “here’s the sequence of steps I took” or “what changed between these screens.”
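The multi-screenshot variant is the same `ls -t` trick with a larger `head`. A minimal sketch, assuming the folder path from above (yours will differ):

```shell
# Illustrative: list the N most recent screenshots, newest first.
DIR="/mnt/c/Users/$USER/OneDrive/Pictures/Screenshots 1"
N=5
ls -t "$DIR"/*.png 2>/dev/null | head -n "$N"
```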
/video
This one’s the heavy hitter. Analyzes videos by extracting frames and, when available, subtitles.
For YouTube videos, yt-dlp handles the download and auto-caption extraction:
yt-dlp -f "best[height<=720]" \
--write-subs --write-auto-subs --sub-lang en \
--convert-subs srt \
-o "/tmp/video_dl_$$/video.%(ext)s" "<url>"
For local videos (screen recordings, etc.), Whisper in a conda environment handles transcription.
Frame extraction uses ffmpeg with scene detection. Instead of grabbing a frame every N seconds (which gives you 50 frames of someone’s static IDE), scene detection grabs frames only when the visual content actually changes:
ffmpeg -i "<video_path>" \
-vf "select='gt(scene,0.3)',showinfo" \
-vsync vfr -q:v 2 /tmp/frames/frame_%04d.jpg
The gt(scene,0.3) filter fires when ffmpeg's scene-change score between consecutive frames exceeds 0.3, i.e. when a large fraction of the frame changes at once. A 10-minute video with someone clicking through tabs gives you 20 meaningful frames instead of 600 redundant ones.
I cap at 30 frames max. Beyond that, token limits become a problem and you’re not getting much additional value anyway.
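One simple way to enforce that cap is to prune extras after extraction, keeping the first 30 in order. This is an assumption about the mechanism, not the skill's actual code:

```shell
# Keep the first MAX_FRAMES extracted frames, delete the rest.
MAX_FRAMES=30
ls /tmp/frames/frame_*.jpg 2>/dev/null | sort \
  | tail -n +$((MAX_FRAMES + 1)) | xargs -r rm -f
```

A fancier policy would sample evenly across the overflow instead of truncating, but for scene-detected frames the early ones usually carry the setup context you want anyway.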
Use case: /video https://youtube.com/watch?v=xyz summarize this for YouTube content. /video what went wrong for the latest screen recording. Temporal context from video becomes accessible.
Video Context Caching
The /video skill now caches analyzed videos. Here’s the problem with one-shot analysis: you watch a 40-minute video, ask one question, and all that context evaporates. Want to ask a follow-up? Time to re-download and re-analyze the whole thing.
The solution: cache everything.
~/.claude/video-cache/<hash>/
├── metadata.json # URL, title, timestamp
├── subtitles.srt # transcript
├── frames/ # extracted frames
└── analysis.md # summary
Now the workflow changes completely:
- First analysis caches everything locally
- Follow-up questions read from cache instantly
- /video follow-up <question> uses the last analyzed video
- /video --list shows what's cached
- /video --clear cleans up
This turns video analysis from a one-shot query into a conversation. “Summarize this video” followed by “What did they say about X?” followed by “Show me that part about Y.” All using the same cached context. No re-downloading, no re-extracting frames, no waiting.
It’s basically a video RAG system. Analyze once, query forever.
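A cache lookup keyed on the URL might be sketched like this. The hashing scheme and file names are assumptions modeled on the layout above, not the plugin's actual implementation:

```shell
# Sketch: hash the URL to a cache key, then check for a prior analysis.
url="https://youtube.com/watch?v=xyz"
key=$(printf '%s' "$url" | sha256sum | cut -c1-16)
cache_dir="$HOME/.claude/video-cache/$key"

if [ -f "$cache_dir/analysis.md" ]; then
  echo "cache hit: answering from $cache_dir"
else
  echo "cache miss: download, extract frames, transcribe"
  mkdir -p "$cache_dir/frames"
fi
```

Hashing the URL makes the key stable across sessions, so a follow-up question days later still finds the same directory.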
Packaging It: claude-vision
These skills are packaged as a Claude Code plugin called claude-vision. The distribution challenge: the skills need ffmpeg, yt-dlp, and optionally whisper. Nobody wants to install a bunch of dependencies manually.
The solution: Docker + MCP.
The plugin includes an MCP server that runs in a container with all dependencies pre-installed. No ffmpeg on your system? No problem. The container has it. The skills talk to the MCP server, the server does the heavy lifting.
claude-vision/
├── skills/
│ ├── clipboard/ # Local (PowerShell interop)
│ ├── screenshot/ # Local (filesystem)
│ ├── video/ # MCP server or local fallback
│ └── setup/ # Interactive config wizard
├── mcp-server/
│ ├── Dockerfile # ffmpeg, yt-dlp, whisper pre-installed
│ └── server.py # MCP tools for video processing
└── docker-compose.yml
Installation
From the plugin marketplace:
claude plugin add ellyseum/claude-vision
Or clone and run locally:
git clone https://github.com/ellyseum/claude-vision.git
claude --plugin-dir ./claude-vision
The source is at github.com/ellyseum/claude-vision.
On first run, /claude-vision:setup walks you through the options:
- Docker mode: Container handles all video processing
- Local mode: Uses your installed ffmpeg/yt-dlp
- Auto mode: Tries Docker, falls back to local
The Prerequisite Problem, Solved
The only thing users need: Docker. That’s it.
Run docker-compose up, install the plugin, run setup. Everything else is containerized. The plugin detects what’s available and adapts.
For those who hate Docker (I see you), local mode works if you install the deps yourself. The setup wizard tells you exactly what’s missing.
The Technical Bits
A few implementation notes for anyone building similar tools:
WSL + PowerShell interop is the key to clipboard access. WSL can call PowerShell directly, and PowerShell can access the Windows clipboard APIs. The path translation between Windows (C:\Users\...) and WSL (/mnt/c/Users/...) happens in the skill definition.
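WSL ships a `wslpath` utility that does this translation natively (`wslpath -u 'C:\Users\...'`). The sed fallback below just shows the shape of the mapping; it uses GNU sed's `\L` extension to lowercase the drive letter:

```shell
# Translate a Windows path to its WSL mount point.
win='C:\Users\me\OneDrive\Pictures\Screenshots 1\shot.png'
printf '%s\n' "$win" | sed -e 's|\\|/|g' -e 's|^\([A-Za-z]\):|/mnt/\L\1|'
# -> /mnt/c/Users/me/OneDrive/Pictures/Screenshots 1/shot.png
```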
Scene detection vs. time-based sampling: I tried both. Time-based sampling (fps=1/5 for one frame every 5 seconds) works fine for some content but gives you garbage for screen recordings. Scene detection is smarter. It captures transitions, which is usually what you actually care about.
Whisper transcription: For local videos without subtitles, audio context matters. The Whisper model runs in a conda environment. It’s not fast, but it’s local and free. The skill checks for existing subtitles first (YouTube auto-captions, SRT files) before falling back to transcription.
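The subtitle-first fallback order might be sketched as below. The whisper flags (`--model`, `--output_format`) exist in the openai-whisper CLI, but treat the exact invocation as an assumption and check `whisper --help` for your install:

```shell
# Sketch: prefer existing subtitles, fall back to Whisper transcription.
transcribe() {
  video=$1
  srt="${video%.*}.srt"
  if [ -f "$srt" ]; then
    echo "using existing subtitles: $srt"
  elif command -v conda >/dev/null 2>&1; then
    # Slow but local and free; runs inside the dedicated conda env.
    conda run -n whisper whisper "$video" --model base --output_format srt
  else
    echo "no subtitles found and whisper unavailable"
  fi
}
```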
Hot-reload: Skills are just markdown files. Edit them, save, and the next invocation picks up the changes. No restart required. This made iteration fast.
Why This Matters
I’ve had the same impulse since I was a kid: poke at a system until you understand how it works, then bend it into doing something new. The kid who parsed raw ANSI escape sequences on a 2400 baud modem is now teaching Claude to watch YouTube videos. Different scale. Same energy.
Claude Code wasn’t designed to see my screen. But the pieces were all there: bash access, PowerShell interop, file reading, image understanding. The skill system just needed the right glue.
This is what “working with AI” actually looks like in practice. Not prompting a chatbot, but extending its capabilities to fit your workflow. The AI becomes more useful the more context it has. These skills give it the same visual context I have.
When I debug now, Claude sees what I see. When someone sends me a YouTube video, Claude can summarize it. When I copy an error message, Claude already knows what we’re talking about before I finish typing the question.
The terminal stopped being blind. Took about an hour to build.
The Hook for the Non-Technical
If you’re reading this on LinkedIn and your eyes glazed over at “ffmpeg”: the point is simple.
AI assistants are constrained by their inputs. They can only help with what they can perceive. The default perception is text in a chat box. That’s a narrow window.
These skills widened the window. Now my AI assistant can see my clipboard, my screenshots, and my videos. It went from “paste your error message here” to “I already saw it.”
The video caching takes it further. Instead of one-shot analysis where you ask a question and lose the context, it’s now a video RAG system. Analyze once, query many times. The video’s transcript, frames, and metadata all persist locally. Follow-up questions are instant. That 40-minute conference talk becomes a searchable knowledge base.
That’s the automation instinct at work. Every time I caught myself doing a manual step to get information to Claude, I wrote a skill to eliminate that step. The clipboard-paste workflow? Automated. The screenshot-path-hunting workflow? Automated. The video-transcription workflow? Automated. Re-analyzing the same video because I had a follow-up question? Also automated.
If I keep doing something manually, I shouldn’t be doing it manually.
That impulse has been with me since I was a kid writing MUD scripts. The tools change. The instinct doesn’t.