I just finished recording a demo video for a developer tool I’ve been building with Claude. The tool is specifically designed to make application state legible to AI assistants. I recorded it, dropped the file in the folder, and realized: Claude can’t watch it.
Which is funny. And also exactly the point.
The problem
Video is the format humans reach for when they want to show something that’s hard to describe. A three-minute demo communicates things that would take pages of prose. But most AI assistants — the ones that are actively useful in your development workflow right now — can’t watch video. They can’t scrub a timeline, read a facial expression, or follow a cursor.
So you make this great demo, and your most capable collaborator is blind to it.
This is the same problem developer tools have always had with AI: the thing you built to show what’s happening exists in a format the AI can’t consume. Runtime state lives in a browser panel. Video lives in an mp4. Both are useful to humans and invisible to AI.
The fix is the same in both cases: build artifacts around the content that the AI can consume.
What actually helps
Transcripts — the obvious one, and still underused. A verbatim transcript of any voiceover or narration is trivially useful. If you’re narrating a demo, your words are already structured for an AI to read. The transcript is just… the words, written down.
For my video there’s no voiceover (I didn’t want to talk, and I needed a haircut). So a transcript wouldn’t capture much. Which brings me to the actually useful thing:
Timecode-aligned descriptions — not a transcript, not chapter markers, but a structured description of what’s happening at each moment. Something like:
0:00-0:15 App loads. Looks like a normal Angular app. Nothing special visible.
0:15-0:30 Stellar overlay opens (bottom-right button). Store picker appears.
0:30-1:00 User interacts with counter. State updates live in overlay. Diff highlighted.
1:00-1:20 HTTP call made. Response links causally to state change. Badge appears.
An AI handed this description can reason about the video’s content, answer questions about it, write a summary, suggest edits, or generate a blog post from it. It doesn’t need to see the footage.
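A description like this can live in a plain text file next to the video. As a minimal sketch — assuming a hypothetical one-entry-per-line format matching the example above, not any standard — here is what turning it into structured data might look like:

```python
import re

# Hypothetical format: "M:SS-M:SS description" per line,
# matching the example entries above.
LINE = re.compile(r"(\d+):(\d\d)-(\d+):(\d\d)\s+(.*)")

def parse_timecodes(text):
    """Turn timecode-aligned lines into (start_sec, end_sec, description) tuples."""
    entries = []
    for line in text.strip().splitlines():
        m = LINE.match(line.strip())
        if m:
            start = int(m.group(1)) * 60 + int(m.group(2))
            end = int(m.group(3)) * 60 + int(m.group(4))
            entries.append((start, end, m.group(5)))
    return entries

demo = """\
0:00-0:15 App loads. Looks like a normal Angular app.
0:15-0:30 Stellar overlay opens (bottom-right button).
"""
print(parse_timecodes(demo))
```

Once it's structured, the same file serves the AI ("what happens at 0:20?") and any future tooling — chapter markers, captions, a generated shot list.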
Closed captions — if you’re publishing to YouTube, generate them. Not for your human audience (though that too), but because caption files are structured, timestamped text. Any AI that can access a .vtt or .srt file can parse a video’s spoken content exactly as well as a transcript, with timing.
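Pulling cues out of a caption file doesn't require anything exotic. A simplified sketch — it handles only the plain single-line cues that auto-captioners tend to emit, not the full WebVTT grammar (cue identifiers, settings, multi-line payloads):

```python
# Minimal .vtt cue extraction (sketch; assumes simple cues with no
# identifiers or cue settings, as in auto-generated captions).
def parse_vtt(text):
    cues = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        if lines and " --> " in lines[0]:
            start, end = lines[0].split(" --> ")
            cues.append((start, end, " ".join(lines[1:])))
    return cues

sample = """WEBVTT

00:00:15.000 --> 00:00:30.000
Stellar overlay opens.

00:00:30.000 --> 00:01:00.000
User interacts with counter.
"""
print(parse_vtt(sample))
```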
Structured metadata — title, description, tags, the thing you’re showing and why it matters. YouTube descriptions are an underrated AI surface. A dense, accurate description of what’s in the video — not marketing copy — is what lets an AI connect the video to a question someone asks it.
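A sidecar file is one way to make that metadata machine-readable without depending on any platform. A minimal sketch — the filename and field values here are hypothetical, drawn from the demo described above:

```python
import json

# Hypothetical sidecar metadata for the demo video: dense and factual,
# not marketing copy, so an AI can connect the video to a question.
metadata = {
    "title": "Stellar: live runtime state for AI assistants (demo)",
    "description": (
        "Silent demo of an Angular app instrumented with Stellar. "
        "Shows the overlay opening, live state diffs on counter "
        "interaction, and an HTTP call causally linked to a state change."
    ),
    "tags": ["angular", "devtools", "ai-accessibility", "state-management"],
}

with open("demo.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

The same text can then be pasted into the YouTube description, so the human-facing surface and the AI-facing surface stay in sync.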
The principle
This is AI Accessibility applied to video. The same rule that says “don’t make the AI infer what it could be told” applies here. A runtime state snapshot should carry the store name, the trigger, the diff, and type hints — so the AI doesn’t have to ask. A video should have a transcript or timecode description — so the AI doesn’t have to watch.
The content isn’t the artifact. The artifacts around the content are what make it legible.
What I’m doing
For this demo: timecode-aligned description, handed to Claude when I want to talk about edits, pacing, or what to change in the next version. Closed captions generated by YouTube’s auto-captioner and corrected before publishing (they’re usually close; technical terms are where they go wrong).
Not glamorous. Takes maybe twenty minutes after the edit. But it means the AI that helped build the tool can actually engage with the demo that shows it off — which feels like the least I can do.