12 — Long-Form Video Pipeline
The Long-Form Pipeline is designed for creating multi-segment videos from a single long audio recording — for example, turning a 30-minute podcast into a series of talking-head clips, or generating a full-length video course from a recorded script.
Access this feature under Long-Form in the left panel or by clicking + Long-Form in the toolbar.
Requirements: a Replicate API key and configured cloud storage (for reference image uploads).
Step 1: Upload Audio
- Click Upload Audio and select your source audio file (WAV, MP3, M4A).
- A waveform preview appears with playback controls.
- Review the audio to confirm it’s the correct file before proceeding.
- Click Next.
Step 2: Segment
Break the audio into chunks that will each become a separate video clip.
Auto-Segmentation
Click Auto-Segment to let the app analyze the audio and propose logical split points.
Configuration options:
| Option | Description |
|---|---|
| Target duration | Desired length per segment (seconds) |
| Segment variance tolerance | How much segments can deviate from the target duration |
| Prefer silence detection | When enabled, splits are placed at quiet moments rather than at fixed intervals |
| Silence threshold (dB) | Audio level below which a moment is classified as “silence” |
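The silence-preference heuristic above can be sketched in Python. This is an illustration of the idea only, not the app's actual algorithm: given a per-frame loudness series in dB, each split lands on the quietest frame inside the tolerance window around the target duration, falling back to the window center when nothing there is below the silence threshold.

```python
# Illustrative sketch of silence-preferring segmentation (not the app's
# real implementation). levels_db is one loudness reading per frame.

def find_splits(levels_db, frame_sec, target_sec, tolerance_sec, silence_db=-40.0):
    """Return split indices (frame numbers) for a loudness series."""
    splits = []
    start = 0
    n = len(levels_db)
    while True:
        center = start + int(target_sec / frame_sec)
        lo = max(start + 1, center - int(tolerance_sec / frame_sec))
        hi = min(n - 1, center + int(tolerance_sec / frame_sec))
        if lo >= n - 1:
            break  # remaining audio becomes the final segment
        # Prefer the quietest frame in the tolerance window; fall back to
        # the window center if nothing there qualifies as silence.
        window = range(lo, hi + 1)
        quietest = min(window, key=lambda i: levels_db[i])
        cut = quietest if levels_db[quietest] <= silence_db else min(center, n - 1)
        splits.append(cut)
        start = cut
    return splits
```

With a 15-second target, a 3-second tolerance, and a single quiet dip near each window, the splits snap to the dips; windows with no silence fall back to evenly spaced cuts.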
Manual Adjustment
After auto-segmentation, you can:
- Drag segment boundaries on the waveform to move split points
- Add a split point by clicking on the waveform where you want a new break
- Remove a split point by clicking the × on any existing marker
- Preview any segment by clicking the play button on that segment’s entry
When your segments look correct, click Next.
Step 3: Generate
This step sends all segments to an AI video generation model simultaneously.
Reference Image
Upload a portrait image of the person who will appear in the talking-head videos. This same image is used for all segments in the batch.
- Follow the same best practices as in Talking Head: clear, forward-facing, well-lit, plain background.
- The image is uploaded to your configured cloud storage so the AI model can access it.
Model Selection
Choose from models optimized for long-form batch generation:
| Model | Notes |
|---|---|
| MultiTalk | Reliable audio-driven lip sync |
| OmniHuman | High realism, body crop options |
| Seedance | Fast generation |
| Runway Gen 4.5 | Top quality |
| Sora 2 | High quality |
| Veo 3.1 | Excellent motion |
| Kling V3 / Kling V3 Omni | Strong realism |
| Hailuo 2.3 | Cost-efficient |
| (and more) | 13+ models total |
Prompt
Write a scene description that applies to all segments. For example:
“Professional presenter in a modern office, natural lighting, looking directly at camera”
Optional Settings
| Setting | Description |
|---|---|
| Frame continuity | Uses the last frame of each generated clip as the first frame of the next — produces seamless transitions between segments |
| Turbo mode | Requests faster generation (may reduce quality on some models) |
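Frame continuity can be pictured as a simple chaining loop. In this sketch, `generate_clip` and `last_frame` are hypothetical stand-ins for the model call and a frame extractor; the real pipeline's internals are not documented here.

```python
# Sketch of frame continuity: the last frame of each generated clip seeds
# the next request. generate_clip and last_frame are hypothetical stand-ins.

def generate_with_continuity(segments, reference_image, generate_clip, last_frame):
    """Generate clips in order, seeding each from the previous clip's last frame."""
    clips = []
    first_frame = reference_image  # segment 1 starts from the portrait
    for seg in segments:
        clip = generate_clip(seg, first_frame)
        clips.append(clip)
        first_frame = last_frame(clip)  # seed for the next segment
    return clips
```

Note that chaining makes each segment depend on the previous one, which is why continuity trades away some of the parallelism of plain batch generation.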
Starting Generation
Click Generate All. All segments are submitted to the job queue simultaneously. A progress indicator shows the generation status per segment (Submitted → Running → Succeeded / Failed).
You can monitor progress on this screen or navigate to Job Queue to see the full list.
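The fan-out pattern behind Generate All can be sketched as below. This is a simplified illustration: `submit_segment` is a stand-in for the real model API call (for example, via the Replicate client), and the intermediate Submitted/Running states are collapsed into the two terminal outcomes.

```python
# Simplified sketch of batch submission with per-segment status tracking.
# submit_segment is a hypothetical stand-in for the actual model API call.

from concurrent.futures import ThreadPoolExecutor

def generate_all(segments, submit_segment):
    """Submit every segment concurrently; return {segment_index: status}."""
    statuses = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(submit_segment, seg): i
                   for i, seg in enumerate(segments)}
        for fut, i in futures.items():
            try:
                fut.result()  # blocks until this segment finishes
                statuses[i] = "Succeeded"
            except Exception:
                statuses[i] = "Failed"
    return statuses
```

Tracking statuses per segment index is what makes the later per-segment regeneration in the Merge step possible.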
Step 4: Merge
When all (or most) segments have completed:
- The completed video clips are listed with preview thumbnails.
- Review each clip — if any failed, you can regenerate that segment individually.
- Click Merge.
- The app concatenates all clips in order, syncing the original audio track automatically.
- Transitions are applied between clips (using the same transition settings as the main editor).
- The merged video is exported as an MP4 to your configured export folder.
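The concatenate-and-resync step can be sketched with ffmpeg's concat demuxer. This assumes ffmpeg is the underlying tool, which this guide does not specify, and it omits the transition effects the app applies; paths here are examples only.

```python
# Sketch of a clip merge via ffmpeg's concat demuxer, muxing the original
# audio back over the joined video. Assumes ffmpeg; omits transitions.

def build_merge_command(clips, audio_path, out_path, list_path="concat.txt"):
    """Write the concat list file and return the ffmpeg argument vector."""
    with open(list_path, "w") as f:
        for clip in clips:
            f.write(f"file '{clip}'\n")
    return [
        "ffmpeg",
        "-f", "concat", "-safe", "0", "-i", list_path,  # joined video clips
        "-i", audio_path,                               # original source audio
        "-map", "0:v", "-map", "1:a",                   # video from clips, audio from source
        "-c:v", "copy", "-shortest",
        out_path,
    ]
```

Mapping the audio from the original recording rather than from the clips is what keeps lip sync anchored to the source track even if individual clips drifted slightly.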
Tips for Long-Form Projects
- Keep segments under 30 seconds. Most models perform better at shorter durations; if a section runs long (for example, a 5-minute passage), the pipeline will automatically chunk it further as needed.
- Use consistent lighting in your reference image. Since all segments use the same portrait, consistency in the reference photo produces a more cohesive final video.
- Expect some failed segments. AI generation at scale can have occasional failures. The merge step lets you regenerate only the failed ones before merging.
- Frame continuity works best with talking-head models. For image/video generation models, it may look disjointed since the “last frame” of one clip becomes the “first frame” of a very different generated clip.
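The automatic re-chunking rule from the first tip amounts to splitting an over-long duration into roughly equal pieces under the cap. This sketch mirrors the described behavior; it is not the app's actual code.

```python
# Sketch of the re-chunking rule: split a long segment's duration into
# equal pieces no longer than the cap. Illustration only.

import math

def rechunk(duration_sec, cap_sec=30.0):
    """Split one segment's duration into equal chunks of at most cap_sec."""
    pieces = math.ceil(duration_sec / cap_sec)
    return [duration_sec / pieces] * pieces
```

Under this rule, a 5-minute (300-second) section becomes ten 30-second chunks, while a 95-second section becomes four chunks of 23.75 seconds each.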
Previous: AI Audio & Music | Next: Job Queue →