What This Is
This is a voiceover script generator for property tour videos. It’s designed specifically for real estate agents who need professional, timed narration that matches their video footage. The prompt works step by step:
- It asks you for your raw video footage and a few simple details (property description, tone, call to action, etc.).
- It analyzes the video second by second to determine what spaces and features appear on screen.
- It builds a walkthrough outline and then generates a narration script that’s perfectly timed to the video’s length.
- The script is formatted in XML, a markup language similar to HTML. XML here adds instructions for pacing, emphasis, and pauses—so the voiceover sounds natural, not robotic.
- You’ll take that XML script and paste it directly into ElevenLabs, a text-to-speech platform known for realistic voice cloning. You can generate narration in your own cloned voice or in a professional-sounding AI voice.
The end result: a done-for-you voiceover that syncs seamlessly with your property tour, ready to drop into your video editor (CapCut, Descript, Premiere, etc.).
Update — November 11, 2025
ChatGPT’s environment has changed, and it can no longer analyze videos frame by frame. Use this prompt in Claude instead, which can still sample a video file at one-second intervals to understand what is visually happening at each moment.
By doing so, Claude can recognize scene changes (like moving from the foyer to the kitchen to the backyard) and generate a script that follows the real sequence of your video. The one-second analysis instruction in this prompt now applies specifically to Claude, not ChatGPT.
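To make the one-second analysis concrete, here is a minimal Python sketch of the grouping step the prompt asks for: consecutive seconds that show the same space are merged into one timestamped segment. It does not describe how Claude works internally. The `labels` list is made-up sample data, and the helper names `fmt` and `group_seconds` are hypothetical.

```python
from itertools import groupby


def fmt(seconds: int) -> str:
    """Format a second count as MM:SS."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def group_seconds(labels: list[str]) -> list[tuple[int, int, str]]:
    """Collapse per-second labels into (start, end, label) segments."""
    segments = []
    second = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((second, second + length - 1, label))
        second += length
    return segments


# Hypothetical labels, one per second of footage (assumed, not real output).
labels = ["Exterior"] * 5 + ["Entry/Foyer"] * 5 + ["Kitchen"] * 7

for start, end, label in group_seconds(labels):
    print(f"{fmt(start)}-{fmt(end)}  {label}")
```

Each printed line corresponds to one row of the walkthrough outline, which is why a room that stays on screen for several seconds appears as a single entry.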
How To Use It
- Copy the Prompt
  - Copy the entire prompt exactly as provided. Do not change or edit anything.
- Paste into Claude
  - Start a new chat and paste the prompt in.
  - Hit enter to run it.
- Answer the Questions
  - The prompt will ask you questions one at a time:
    - Upload your exact video footage (final cut).
    - Provide a short property description.
    - Choose where the video will be posted (Reels, TikTok, YouTube, etc.).
    - Select a vibe/tone (professional, warm, cinematic, etc.).
    - Decide if you want a call to action.
    - Specify anything to include or avoid (e.g., number of bedrooms, avoid Fair Housing issues).
- Review the Walkthrough Outline
  - Claude will create a timestamped outline of what’s happening in the video.
  - This ensures the narration matches the flow of the footage.
  - Confirm you’re ready to move forward.
- Get Your Script in XML
  - Claude will generate a voiceover script in XML, inside a preformatted text block.
  - The XML includes instructions like pauses, emphasis, and tone, so the speech sounds natural.
  - Copy the entire XML script as is.
- Paste into ElevenLabs
  - Go to Elevenlabs.io.
  - Use the Text-to-Speech feature.
  - Paste the XML script directly into the text box.
  - Choose your cloned voice or another voice option.
  - Generate and download the audio file.
- Add to Your Video
  - Import the downloaded voiceover into your video editor (CapCut, Descript, Premiere, etc.).
  - Align it with your footage (it should already match the timing).
  - Export your finished video with a professional narration track.
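Before pasting the script into the text-to-speech box, it can help to confirm that the XML is well formed and uses only the tags the prompt allows. The following is a minimal sketch using Python's standard library. The `check_script` helper and the `sample` script are hypothetical, and the tag list is copied from the prompt, not from any ElevenLabs documentation.

```python
import xml.etree.ElementTree as ET

# Tags the prompt tells the model to restrict itself to.
ALLOWED_TAGS = {"speak", "p", "s", "break", "emphasis", "prosody"}


def check_script(xml_text: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    problems = []
    if root.tag != "speak":
        problems.append(f"root element is <{root.tag}>, expected <speak>")
    for element in root.iter():
        if element.tag not in ALLOWED_TAGS:
            problems.append(f"unexpected tag <{element.tag}>")
    return problems


# Made-up sample script for illustration only.
sample = """<speak>
  <p>
    <s>Welcome to a home with <emphasis level="strong">marble counters</emphasis>.</s>
    <break time="400ms"/>
    <s><prosody rate="slow">Upstairs, the primary bedroom is filled with light.</prosody></s>
  </p>
</speak>"""

print(check_script(sample) or "no problems found")
```

A parse error here often means attribute values were wrapped in curly quotes instead of straight quotes during copy and paste, so the script needs plain `"` characters around values such as `400ms`.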
Prompt
# ElevenLabs Ready XML Prompt
**Role:** You are tasked with writing expressive narration in **ElevenLabs Ready XML**.
**Rule:** Do **not** produce XML until all questions are answered **and** the exact video cut is attached.
**Output:** All outlines and scripts must be returned in **preformatted code blocks (``` … ```).** XML must contain *only* narration and supported SSML tags.
---
## Intake (ask one at a time)
1. Please attach the **SUBJECT VIDEO FOOTAGE** (the exact cut we’re voicing to).
2. Provide a short **PROPERTY DESCRIPTION** (style, key features, neighborhood, etc.).
3. Where will this video be **posted**? (Instagram Reel, TikTok, YouTube Long-Form, Widescreen Cinematic, Social Media Fun, etc.)
4. What **VIBE / TONE** should the narration carry? (Professional, Warm, Excited, Cinematic, Energetic.)
5. Do you want to include a **CALL TO ACTION** at the end? If yes, what should it say?
6. Are there any specific things you **DO or DON’T want mentioned**? (e.g., must-say facts like beds/baths/square footage, brand-voice preferences, or restrictions such as avoiding Fair Housing issues or certain phrases.)
---
## After Inputs + Footage Are Received
A. **Target Length:** Infer automatically from the video runtime. The primary script must match this length.
B. **1-Second Analysis (optical recognition required):**
   - Analyze the uploaded video at one-second intervals.
   - Identify what appears in the frame: the room, space, feature, or transition.
   - Group consecutive seconds that show the same space into clusters.
   - Use **logic + context** to ensure accuracy. Example: if footage shows moving upstairs, then subsequent segments must describe spaces consistent with being upstairs.
   - Avoid low-level descriptors (colors, brightness, textures). Focus on semantic property details: rooms, finishes, features, flow.
   - Incorporate user-provided facts (e.g., user says “marble” not “granite”). Defer to the user if visual inference conflicts.
C. **Walkthrough Outline (semantic + contextual):**
   - Return a timestamped outline like a table of contents for the tour.
   - Each segment must describe what the viewer is seeing in property-tour terms.
   - Examples:
```
WALKTHROUGH OUTLINE
00:00–00:04 — Exterior approach: curb appeal, two-car garage, modern stucco
00:05–00:09 — Entry/Foyer: glass door, sidelights, staircase in view
00:10–00:16 — Kitchen: large island, stainless appliances, marble counters
00:17–00:22 — Living Room: fireplace, built-ins, sliding doors to patio
00:23–00:28 — Staircase Up: ascending to second floor
00:29–00:34 — Primary Bedroom: accent wall, large windows
```
   - The length of descriptions should be flexible. They should be **detailed enough** to support natural narration that will fit the video timing.
   - Ask the user if they want you to proceed to the narration script.
---
## Once User Confirms to Proceed
1. Produce **ONE complete XML script** that matches the video length.
2. Return the XML inside a **single preformatted code block (``` … ```)** so it can be copy/pasted directly into ElevenLabs.
3. The XML must contain **only narration and supported SSML tags** — no timestamps, no outline notes, no comments.
4. Use **only** ElevenLabs-supported tags:
   - `<speak>` (root)
   - `<p>` and `<s>` (structure)
   - `<break time="###ms"/>` (pauses)
   - `<emphasis level="moderate|strong">` (key highlights)
   - `<prosody rate="slow|medium|fast" pitch="+/-N%" volume="+/-N dB">` (tone, pacing, inflection)
5. **Narration must be expressive.**
   - Punch key property highlights with `<emphasis>`.
   - Use `<prosody>` to vary pacing and tone: faster for energy, slower for luxury details.
   - Insert `<break>` for scene changes or dramatic beats.
   - Match the requested vibe/tone (cinematic, energetic, professional, etc.).
   - Ensure pacing fits runtime: do the math so word counts align with video length.
6. **Voiceover Script Rules:**
   - Narration must follow the **Walkthrough Outline** exactly.
   - Describe the property tour footage (the spaces and features the viewer is seeing) in logical order.
   - Use context: if the outline says “stairs up,” narration should naturally lead to a bedroom, not back to the kitchen.
   - Language must be conversational, vivid, and aligned with the user’s inclusions/exclusions.
---
## After Outputting the Primary Script
- Ask the user if they would like alternate versions (**Shorter ~-10%**, **Longer ~+10%**).
- Only generate alternates if the user confirms.
- If generated, each alternate must also be in its **own preformatted block** and must be **pure XML narration only**.
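
Note (not part of the prompt): the pacing rule above asks the model to "do the math" so the word count fits the runtime. The sketch below shows one way to sanity-check a returned script yourself. The speaking rate of 150 words per minute is an assumption rather than a measured ElevenLabs value, and all function names are hypothetical.

```python
import re

# Assumed speaking rate; adjust to the voice you actually use.
WORDS_PER_MINUTE = 150

BREAK_TAG = re.compile(r'<break\s+time="(\d+)ms"\s*/>')
ANY_TAG = re.compile(r"<[^>]+>")


def pause_seconds(xml_text: str) -> float:
    """Total length of all <break time="###ms"/> pauses, in seconds."""
    return sum(int(ms) for ms in BREAK_TAG.findall(xml_text)) / 1000


def spoken_word_count(xml_text: str) -> int:
    """Count narration words, ignoring the markup."""
    return len(ANY_TAG.sub(" ", xml_text).split())


def word_budget(runtime_seconds: float, pauses: float = 0.0,
                words_per_minute: int = WORDS_PER_MINUTE) -> int:
    """Words that fit in the runtime once pauses are subtracted."""
    speaking_time = max(runtime_seconds - pauses, 0.0)
    return int(speaking_time * words_per_minute / 60)


# Made-up two-sentence script for illustration only.
script = '<speak><s>Welcome home.</s><break time="500ms"/><s>Step inside.</s></speak>'
pauses = pause_seconds(script)
print(spoken_word_count(script), "words written")
print(word_budget(35, pauses), "words fit in a 35-second cut")
```

If the written count is well above the budget, ask for the shorter alternate. If it is well below, ask for the longer one. Sections wrapped in `<prosody rate="slow">` or `rate="fast"` will shift the real timing, so treat the budget as a rough guide.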