Veo3 Prompting Framework | Billy McDermott

tl;dr

CHARACTER BLOCKS, not screenplays. JSON, not vibes.

AI video generators don’t read scripts the way humans do. They scan for signal — who’s in frame, what they’re doing, what they look like, what they say. Feed them a screenplay and you get chaos. Feed them structured data with explicit subject-object relationships and you get something you can actually iterate on.

This framework came out of hundreds of hours of prompting Veo3 and Runway, trying to get consistent characters across scenes in short films. The core discovery: models need character-blocked action sequences, not intercut editing. One character’s full block of actions, then the next. Never interleaved.

The structure

CHARACTER BLOCKS define every person in the scene as a self-contained object — description, physique, wardrobe, appearance, position, voice. The model gets an unambiguous picture of who it’s rendering before it sees a single action line.
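A minimal sketch of what one such block might look like when built in Python and serialized to JSON. The six attributes come from the list above; the exact key names, the extra name field, and every value other than Rex's backstory are illustrative assumptions, not the framework's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CharacterBlock:
    """One self-contained character. Key names are assumed, not official."""
    name: str          # assumed identifier field, not in the original list
    description: str
    physique: str
    wardrobe: str
    appearance: str
    position: str
    voice: str


# Illustrative values only; invented for this example.
rex = CharacterBlock(
    name="Rex",
    description="retired Delta Force operative",
    physique="broad-shouldered, heavily muscled",
    wardrobe="sleeveless denim vest, combat boots",
    appearance="weathered face, short grey hair",
    position="standing frame left, facing camera",
    voice="low, gravelly, unhurried",
)

# Serialize the block so it can be dropped into a larger prompt object.
character_json = json.dumps(asdict(rex), indent=2)
print(character_json)
```

Using a dataclass means a missing attribute fails at construction time rather than silently producing an under-specified character.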

MAPS (environment descriptors) pin the scene to a specific place, time, lighting condition, and atmosphere. Props and set pieces are enumerated, not implied.
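As a rough illustration, a MAP could be assembled with a small helper that refuses an empty prop list, so props are always enumerated rather than implied. The function name build_map, the key names, and the sample values are all assumptions made for this sketch.

```python
import json
from typing import Dict, List


def build_map(place: str, time: str, lighting: str,
              atmosphere: str, props: List[str]) -> Dict[str, object]:
    """Build an environment descriptor. Key names are hypothetical."""
    if not props:
        # Props must be listed explicitly; nothing is left implied.
        raise ValueError("props must be enumerated explicitly")
    return {
        "place": place,
        "time": time,
        "lighting": lighting,
        "atmosphere": atmosphere,
        "props": list(props),
    }


# Sample values are invented for illustration.
scene_map = build_map(
    place="row-house kitchen, Kensington, Philadelphia",
    time="1960s, late evening",
    lighting="single overhead bulb",
    atmosphere="smoky, tense",
    props=["kitchen table", "ashtray", "street map"],
)
print(json.dumps(scene_map, indent=2))
```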

Action sequences are written procedurally per character. Rex does his full sequence. Then Solly does his. The model doesn’t have to track intercut actions between two characters — it processes one block at a time. This was the breakthrough that took output from random to repeatable.
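The blocking step can be sketched as a small transform: take beats written in intercut order and regroup them so each character's actions form one contiguous block. The helper name block_by_character and the sample action lines are invented for this example; only the character names come from the text above.

```python
from typing import Dict, List, Tuple


def block_by_character(beats: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Group (character, action) beats into one block per character.

    Characters appear in order of first appearance; each character's
    actions keep their original relative order.
    """
    blocks: Dict[str, List[str]] = {}
    for character, action in beats:
        blocks.setdefault(character, []).append(action)
    return [{"character": name, "actions": actions}
            for name, actions in blocks.items()]


# Intercut order, the way a screenplay would write it (invented lines).
intercut = [
    ("Rex", "steps into the doorway"),
    ("Solly", "looks up from the table"),
    ("Rex", "cracks his knuckles"),
    ("Solly", "rises slowly from his chair"),
]

blocked = block_by_character(intercut)
for block in blocked:
    print(block["character"], block["actions"])
```

Rex's two actions now sit together, followed by Solly's two, which is the shape the paragraph above describes.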

Dialogue is inline within action blocks. No quotation marks — the models get confused by them. Speaker identity is established by the CHARACTER BLOCK, not by attribution.
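One way to enforce the no-quotation-marks rule is to strip double quotes (straight and curly) before folding a spoken line into its action line. This is only a sketch: the helper names and the "and says:" joining phrase are assumptions, and apostrophes are deliberately left alone so contractions survive.

```python
# Straight and curly double quotes; apostrophes are intentionally kept.
QUOTE_CHARS = '"\u201c\u201d'
_STRIP_TABLE = {ord(ch): None for ch in QUOTE_CHARS}


def strip_quotes(text: str) -> str:
    """Remove double quotation marks from a line of dialogue."""
    return text.translate(_STRIP_TABLE).strip()


def inline_dialogue(action: str, line: str) -> str:
    """Fold a spoken line into an action line (joining phrase is assumed)."""
    return f"{action} and says: {strip_quotes(line)}"


action_line = inline_dialogue("Rex lowers his fists", '"Not today, I\'m retired."')
print(action_line)
```

Because the line lives inside Rex's own action block, no separate speaker attribution is needed.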

The whole thing sits on top of Google’s official 9-element prompt structure (subject, action, scene, mood, camera, lighting, style, color, sound) but replaces the freeform text with typed JSON fields.
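To make the nine elements concrete, here is a hedged sketch of a completeness check over a prompt object. The nine key names are taken from the list above; the missing_elements helper, the flat one-key-per-element layout, and the sample values are assumptions.

```python
from typing import Dict, List

# The nine elements named above, used here as top-level JSON keys.
ELEMENTS = ("subject", "action", "scene", "mood", "camera",
            "lighting", "style", "color", "sound")


def missing_elements(prompt: Dict[str, object]) -> List[str]:
    """Return the elements that are absent or empty in the prompt."""
    return [key for key in ELEMENTS if not prompt.get(key)]


# Sample prompt with invented values; "sound" is left out on purpose.
prompt = {
    "subject": "Rex",
    "action": "steps into the doorway",
    "scene": "warehouse interior",
    "mood": "tense",
    "camera": "slow push-in",
    "lighting": "hard side light",
    "style": "80s action film",
    "color": "teal and orange",
}

print(missing_elements(prompt))  # ['sound']
```

Running a check like this before every generation makes an omitted element a visible error instead of something the model fills in on its own.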

Films made with the framework

K&A Gang — a noir set in 1960s Kensington, Philadelphia, based on the real K&A burglar gang. Personal connection: my great-grandmother’s kitchen table was reportedly used for planning jobs.

Hurricane Sandy — a drama about the Red Hook, Brooklyn artist community during the storm. I lived in Red Hook during 2006–2008; the neighborhood’s since-gentrified reality is the subtext.

Rex Thunder Armstrong — an 80s action film where a retired Delta Force operative gets into a confrontation with a 400-pound crime boss who fights with a roasted chicken. This one was mostly for testing the limits of character consistency and physical comedy.

Why JSON

Natural language prompts are lossy. You write “a muscular man on a beach at dawn” and the model interprets every word independently with no guaranteed relationships between them. JSON forces explicit structure: this character has this physique, wears this, stands here, does this action to that person. Every field is a constraint. More constraints means less drift.
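The "does this action to that person" relationship can be made checkable by having actions refer to characters by ID. The sketch below flags any action whose actor or target does not match a defined character. The id, actor, verb, and target keys and the unresolved_references helper are hypothetical, chosen only to illustrate the idea.

```python
from typing import Dict, List, Optional


def unresolved_references(characters: List[Dict[str, str]],
                          actions: List[Dict[str, Optional[str]]]
                          ) -> List[Dict[str, Optional[str]]]:
    """Return actions whose actor or target is not a defined character."""
    known = {character["id"] for character in characters}
    bad = []
    for action in actions:
        target = action.get("target")
        if action["actor"] not in known or (target is not None and target not in known):
            bad.append(action)
    return bad


characters = [{"id": "rex"}, {"id": "solly"}]
actions = [
    {"actor": "rex", "verb": "shoves", "target": "solly"},
    {"actor": "solly", "verb": "glares at", "target": "rexx"},  # typo on purpose
    {"actor": "rex", "verb": "exhales", "target": None},        # no target needed
]

print(unresolved_references(characters, actions))
```

A prose prompt has no equivalent of this check: a misnamed character simply becomes a new, unintended person in frame.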

The framework is portable: it worked on Veo3 and adapted cleanly to Runway with minor changes to how camera movement is specified. The underlying principle (structured data over prose, blocked actions over intercut, explicit over implied) transfers to any video generation model.