The 6-Part AI Video Prompt Formula That Works Across Every Model
You write a prompt, hit generate, and get something that looks nothing like what you imagined. The subject drifts, the camera does something random, and the lighting looks flat. The problem isn’t the AI model — it’s the prompt structure.
After testing thousands of prompts across Sora 2, Veo 3.1, Runway Gen-4.5, and Kling 3.0, a consistent pattern emerges. The prompts that produce usable video all follow the same six-part structure, regardless of which model you’re using.
The Universal Formula
Every effective AI video prompt contains these six elements in roughly this order:
Camera + Subject + Action + Setting + Lighting + Style
That’s it. The models differ in what they’re best at, but they all parse prompts looking for these same components. Miss one, and the model fills in the gap with a random guess.
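If you generate prompts programmatically, the formula reduces to simple string assembly. A minimal Python sketch (the `VideoPrompt` class and its field names are illustrative, not any model's API):

```python
from dataclasses import dataclass

@dataclass
class VideoPrompt:
    """One field per element of the six-part formula, in order."""
    camera: str
    subject: str
    action: str
    setting: str
    lighting: str
    style: str

    def render(self) -> str:
        # Join the six elements in the recommended order; the model
        # reads them as one continuous description.
        parts = [self.camera, self.subject, self.action,
                 self.setting, self.lighting, self.style]
        return " ".join(p.strip().rstrip(".") + "." for p in parts)

prompt = VideoPrompt(
    camera="Close-up, slow push-in",
    subject="A weathered fisherman in his sixties, salt-stained cap",
    action="mends a torn net with practiced hands",
    setting="on a wooden dock, morning fog rolling across the harbor",
    lighting="warm golden sidelight from the rising sun",
    style="documentary realism, 35mm film grain",
)
print(prompt.render())
```

Keeping each element in its own field makes it easy to swap one component (say, the lighting) while holding the rest of the prompt constant between generations.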
Here’s what each element does and how to write it.
1. Camera and Cinematography
This tells the model what the “virtual camera” is doing. Use standard film vocabulary — every major model has been trained on it.
Shot types: wide shot, medium shot, close-up, extreme close-up, bird’s eye view, low angle, Dutch angle
Camera movement: dolly forward, pan left, tracking shot, whip pan, slow push-in, crane up, Steadicam follow
Lens specification: “shot on 24mm anamorphic lens, f/2.8, shallow depth of field”
Always specify speed: “slow dolly” and “rapid pan” produce very different results. If you skip camera direction entirely, the model defaults to a static medium shot — which is rarely what you want.
Good: “Slow push-in, close-up”
Bad: “Camera moves toward the subject” (vague, no standard term)
2. Subject and Character
Who or what is the focus of the shot. Be specific about physical details — the more concrete your description, the more consistent the output.
Good: “A woman in her late twenties with wavy brown hair, light freckles, wearing a tailored navy blazer over a white t-shirt”
Bad: “A woman”
The first 20-30 words of your prompt carry the most weight. Front-load your subject description because models prioritize the beginning of the prompt. If the model doesn’t know who the scene is about in the first sentence, faces drift, clothing shifts, and proportions change mid-generation.
3. Action and Motion
What the subject is doing. This is where most prompts fail — they describe a scene but forget to describe movement, which is the entire point of video.
Use specific verbs. “Sprints,” “drifts,” “pivots,” and “reaches” give the model precise motion to render. “Moves” and “goes” produce generic results.
For physical interactions, describe the physics: “Water sloshes with visible surface tension as the glass tilts” instead of “water moves in the glass.”
Good: “She turns slowly to face the camera, rain intensifying around her, expression shifting from neutral to determined”
Bad: “She looks at the camera in the rain”
4. Environment and Setting
Where the scene takes place. Use sensory language that gives the model environmental context — not just a location name, but what the location looks, feels, and sounds like.
Good: “A smoky jazz club at 2 AM, amber light filtering through haze, half-empty tables with candles guttering, exposed brick walls”
Bad: “A jazz club”
Include spatial relationships: foreground, midground, background. “A crowded market stretches behind him, out of focus” tells the model exactly where to place the depth.
5. Lighting
Name the source and behavior, not just the brightness level. Lighting is the single biggest factor in whether a generated video looks cinematic or flat.
Specific lighting descriptions that work:
- “Golden hour, warm backlight creating rim-lit silhouettes”
- “Harsh midday sun, deep shadows, high contrast”
- “Neon-lit, blue and pink reflections pooling on wet asphalt”
- “Overcast softbox lighting, even and diffused, no hard shadows”
- “Warm key light from upper-left, soft fill from below”
What doesn’t work: “Good lighting” or “bright” — these give the model nothing to work with.
Naming the physical light source helps models that simulate physics (Sora and Veo especially). “Warm sidelight from a rising sun” gives the model a direction, color temperature, and intensity all in one phrase.
6. Style and Mood
The overall aesthetic and quality markers. This shapes everything from color grading to texture.
Medium: “realistic,” “stop-motion,” “claymation,” “film noir,” “VHS aesthetic,” “anime”
Color: “muted desaturated tones,” “vibrant saturated colors,” “teal and orange color grading,” “black and white”
Quality markers: “cinematic 4K,” “shallow depth of field,” “ultra-realistic textures,” “35mm film grain”
Reference styles: “Wes Anderson symmetry,” “Netflix documentary quality,” “music video pacing” — these act as shorthand for complex aesthetic packages the models understand.
Putting It Together
Here’s the formula applied to a complete prompt:
Close-up, slow push-in. A weathered fisherman in his sixties, deep sun-creased skin, salt-stained cap, mends a torn net with practiced hands on a wooden dock. Early morning fog rolls across a still harbor behind him. Warm golden sidelight from the rising sun, cool blue shadows. Documentary realism, 35mm film grain, shallow depth of field.
That’s all six elements in five sentences. Camera (close-up, slow push-in), subject (fisherman with specific details), action (mends a net), setting (wooden dock, foggy harbor), lighting (golden sidelight, blue shadows), style (documentary, film grain). The model knows exactly what to render.
Model-Specific Adjustments
The formula works universally, but each model has quirks worth knowing:
Sora 2
Sora thinks in physics and cause-and-effect. Describe causal chains: “the wind catches the fabric, pulling it taut” rather than “fabric blowing.” Sora also handles native audio — add a seventh element for sound design: ambient noise, dialogue, music style. Best for cinematic realism up to 60 seconds.
Veo 3.1
Veo interprets technical cinematography terms precisely. Terms like “dolly zoom,” “timelapse,” and “slow push-in” translate directly into camera behavior. Front-load the subject more aggressively than with other models — Veo needs to know who the scene is about in the first few words. Best for commercial and corporate quality with ultra-clean output.
Runway Gen-4.5
Runway works best when you describe motion, not appearance — especially with image-to-video workflows where the image already carries the visual information. Focus your text prompt on forces and movement: “wind pulls the curtain toward the open window” instead of “curtain near a window.” Best for quick iteration and social media content with its motion brush and Director Mode tools.
Kling 3.0
Kling excels at long-form generation (2+ minutes) and handles camera motion as a narrative tool. Specify start and end points for camera moves: “pan from left to right across the city skyline.” Use weighted elements with ++ for critical components: "++sleek red convertible++ driving along a coastal highway." Best for longer TikTok-style content and dialogue scenes with lip-sync.
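If you template prompts for Kling, the ++ weighting can be applied mechanically. A minimal sketch (the `emphasize` helper is illustrative and assumes each critical phrase appears verbatim in the prompt):

```python
def emphasize(prompt: str, critical: list[str]) -> str:
    """Wrap critical phrases in ++ ... ++, Kling's weighting syntax."""
    for phrase in critical:
        prompt = prompt.replace(phrase, f"++{phrase}++")
    return prompt

print(emphasize(
    "sleek red convertible driving along a coastal highway",
    ["sleek red convertible"],
))
```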
The Optimal Prompt Length
Aim for 50-150 words (3-6 sentences). Single-sentence prompts almost always produce generic results. Prompts over 200 words tend to confuse models — they start ignoring later instructions or blending conflicting elements.
The sweet spot is 3-4 sentences that each address 1-2 of the six elements. No sentence should try to cover everything at once.
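These length guidelines are easy to check before you spend a generation. A rough Python sketch (the thresholds mirror the numbers above; sentence counting is approximate):

```python
import re

def check_prompt_length(prompt: str) -> list[str]:
    """Flag prompts outside the 50-150 word / 3-6 sentence sweet spot."""
    words = len(prompt.split())
    # Rough sentence count: split on runs of ., !, or ?.
    sentences = len([s for s in re.split(r"[.!?]+", prompt) if s.strip()])
    warnings = []
    if words < 50:
        warnings.append(f"only {words} words; likely too generic (aim for 50-150)")
    elif words > 200:
        warnings.append(f"{words} words; models may ignore later instructions")
    if sentences < 3:
        warnings.append(f"only {sentences} sentence(s); spread elements across 3-6")
    elif sentences > 6:
        warnings.append(f"{sentences} sentences; consider tightening")
    return warnings
```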
Common Mistakes
Negative language. Never write “no text overlays” or “don’t show buildings.” Most models can’t process negation — they’ll often generate exactly what you told them to avoid. Describe what you want instead. (Exception: Kling supports explicit negative prompts.)
Multiple scene changes. One prompt = one scene. If you need a sequence, generate multiple clips with consistent descriptions and edit them together.
Vague motion. “Moving” is not a motion description. “Drifting slowly left to right while rotating” is. The more specific your verbs, the more intentional the output.
Conflicting styles. “Golden hour” plus “studio lighting” breaks the model’s understanding of where light comes from. Pick one lighting scenario and commit.
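A simple lint pass can catch the negative-language mistake automatically. A rough sketch (the marker list is illustrative, not exhaustive):

```python
import re

# Common negation markers most video models mishandle (illustrative list).
NEGATION_PATTERN = re.compile(r"\b(no|not|don't|never|without|avoid)\b")

def find_negations(prompt: str) -> list[str]:
    """Return negation words found in the prompt so they can be rephrased."""
    return NEGATION_PATTERN.findall(prompt.lower())
```

Anything this returns is a candidate for rewriting as a positive description of what you do want in the frame.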
Build Your Prompt Library
Once you find prompts that produce great results, save them. Most successful AI video creators maintain a library of working prompts organized by style, subject type, and model. Modify proven prompts for new subjects rather than starting from scratch every time.
LzyPrompt is built around this exact workflow — storing, organizing, and iterating on video prompts that work. If you’re generating video regularly, a prompt library saves hours of trial and error.
For more model-specific techniques, check out our guides on writing prompts for Sora and Veo 3 prompt engineering.
FAQ
What is the best AI video prompt structure?
Use the six-part formula: Camera + Subject + Action + Setting + Lighting + Style. This structure works across Sora, Veo, Runway, Kling, and most other models. Aim for 50-150 words with specific details in each element.
How long should an AI video prompt be?
3-6 sentences (50-150 words) is the sweet spot. Single-sentence prompts produce generic results. Prompts over 200 words tend to confuse models, causing them to ignore later instructions or blend conflicting elements.
Which AI video generator is best in 2026?
It depends on your use case. Sora 2 leads for cinematic realism (up to 60 seconds). Veo 3.1 excels at commercial quality with precise camera control. Runway Gen-4.5 is best for quick iteration and social media. Kling 3.0 handles the longest generations (2+ minutes) with built-in dialogue.
Why does my AI video look different from my prompt?
The most common causes: vague motion descriptions (use specific verbs), missing lighting direction (name the light source), subject not front-loaded in the first sentence, or prompt length outside the 50-150 word range. Models weight the beginning of prompts heavily — put your most important details first.
Do I need different prompts for different AI video models?
The six-part structure works universally. Adjust emphasis per model: Sora responds well to physics descriptions, Veo interprets cinematography terms precisely, Runway prioritizes motion over appearance, and Kling supports weighted elements and negative prompts. Start with the universal formula and tweak based on results.
Bank K.
Founder, LzyPrompt
Builder of LzyPrompt. Creates AI video prompts to help content creators save time generating professional videos for YouTube Shorts and Facebook Reels.
@ifourth on X
Related Articles
AI Video Prompt Engineering: Complete Guide to Writing Better Prompts
Master AI video prompt engineering with this comprehensive guide. Learn proven frameworks, advanced techniques, and real examples for creating professional AI videos across all platforms.
AI Video Prompts for TikTok and Reels: Short-Form Video Guide
Prompt formulas and examples for creating short-form AI videos that work on TikTok, Instagram Reels, and YouTube Shorts.