10 — Talking Head / Lip-Sync Video

Talking Head generation takes an audio clip and a reference portrait image and produces a video of the person appearing to speak. It’s useful for creating avatar-driven explainer videos, spokesperson content, or AI presenter clips without recording on camera.

Access this feature under AI Generate → Talking Head.

A Replicate API key and cloud storage (for the reference image upload) are required.

Available Models

Model	Best For	Max Duration
MultiTalk	High-quality audio-driven lip sync	Varies
OmniHuman (ByteDance)	Realistic full-body or portrait; 3 crop modes	Varies
Wan 2.2 S2V	Cost-efficient; good for short clips	Varies
Fabric 1.0 (VEED)	Up to 60 seconds; resolution control	60 seconds

Preparing Your Reference Image

The reference image is the face or body that will be animated. The quality of this image directly affects the quality of the generated video.

Best practices:
– Use a clear, high-resolution photo (minimum 512×512px)
– The face should be well-lit, looking forward or slightly angled
– Avoid sunglasses, heavy shadows, or blurry photos
– A plain or simple background helps the model focus on the face
– Portrait framing (head and shoulders) works best for most models
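The minimum-size guideline can be checked before uploading. The sketch below is an illustration only, not part of the application. It reads the width and height from a PNG file's header using the Python standard library and compares the shorter side against 512 pixels. The function names (`png_dimensions`, `meets_minimum_size`) are hypothetical, and other formats such as JPEG would need a different reader.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MIN_SIDE_PX = 512  # minimum from the best-practices list above


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes.

    Only the first 24 bytes of the file are needed.
    """
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR stores width and height as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def meets_minimum_size(width: int, height: int, minimum: int = MIN_SIDE_PX) -> bool:
    """True when the shorter side is at least `minimum` pixels (hypothetical helper)."""
    return min(width, height) >= minimum
```

To check a file, read its first 24 bytes in binary mode, pass them to `png_dimensions`, and pass the result to `meets_minimum_size`. This covers only pixel dimensions; lighting, sharpness, and framing still need a visual check.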

OmniHuman-Specific Options

OmniHuman supports three crop modes that determine how much of the body is shown:

Mode	Description
Portrait	Head and shoulders only
Half body	Torso visible
Full body	Full person visible

Select the mode that matches your reference image framing.

Fabric 1.0 Options

Setting	Description
Resolution	Choose standard or high-resolution output
Max duration	Up to 60 seconds per clip
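
Given the 60-second limit per clip, one way to handle longer narration is to split it across several clips. The arithmetic can be sketched as below. This is an illustrative helper under that assumption, not a feature of the application, and `plan_clips` is a hypothetical name.

```python
import math

FABRIC_MAX_SECONDS = 60.0  # per-clip limit from the table above


def plan_clips(total_seconds: float,
               max_clip: float = FABRIC_MAX_SECONDS) -> list[tuple[float, float]]:
    """Split a narration length into consecutive (start, end) ranges,
    each no longer than max_clip seconds."""
    if total_seconds <= 0 or max_clip <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_seconds / max_clip)
    return [
        (i * max_clip, min((i + 1) * max_clip, total_seconds))
        for i in range(count)
    ]
```

For a 150-second narration, `plan_clips(150)` gives three ranges: 0 to 60, 60 to 120, and 120 to 150 seconds. In practice the cut points would be better placed at pauses in the speech than at exact 60-second marks.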

Generating a Talking Head Video

1. In the Talking Head tab, select your model.
2. Upload your reference image. The image is uploaded to your configured cloud storage (S3 or R2) to generate an accessible URL for the AI model. Make sure cloud storage is configured in Settings.
3. Select or upload your audio. You can:
   – Upload an audio file (MP3, WAV)
   – Use a synthesized audio clip from Voice Studio
   – Select an audio clip already on your timeline
4. Configure any model-specific settings (body crop, resolution, duration).
5. Click Generate.
6. The job is added to the Job Queue. Talking head generation typically takes 1–4 minutes.
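
The constraints in the steps above can be summarized as a pre-flight check. The sketch below is a hypothetical model of those rules and does not reflect how the application works internally. `TalkingHeadRequest` and `preflight` are invented names. The extension check assumes the audio is an uploaded file, and treating the OmniHuman crop mode as mandatory is also an assumption.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional

# Values taken from the tables and steps in this section.
MODELS = {"MultiTalk", "OmniHuman", "Wan 2.2 S2V", "Fabric 1.0"}
AUDIO_EXTENSIONS = {".mp3", ".wav"}
CROP_MODES = {"Portrait", "Half body", "Full body"}
FABRIC_MAX_SECONDS = 60.0


@dataclass
class TalkingHeadRequest:
    """Hypothetical container for the choices made in the Talking Head tab."""
    model: str
    image_path: str
    audio_path: str
    crop_mode: Optional[str] = None        # OmniHuman only
    audio_seconds: Optional[float] = None  # length of the audio, if known


def preflight(req: TalkingHeadRequest) -> list[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if req.model not in MODELS:
        problems.append(f"unknown model: {req.model}")
    if not req.image_path:
        problems.append("reference image is missing")
    # Assumes an uploaded file; Voice Studio or timeline audio is not covered.
    if PurePath(req.audio_path).suffix.lower() not in AUDIO_EXTENSIONS:
        problems.append("audio must be an MP3 or WAV file")
    # Assumption: a crop mode must be chosen explicitly for OmniHuman.
    if req.model == "OmniHuman" and req.crop_mode not in CROP_MODES:
        problems.append("OmniHuman needs a crop mode: Portrait, Half body, or Full body")
    if (req.model == "Fabric 1.0" and req.audio_seconds is not None
            and req.audio_seconds > FABRIC_MAX_SECONDS):
        problems.append("Fabric 1.0 clips are limited to 60 seconds")
    return problems
```

Returning a list of messages instead of raising on the first failure lets every problem be reported at once, which suits a form with several independent settings.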

Using the Result

When the job completes:
– Click Download to save the MP4 to your computer
– Click Add to Timeline to insert the clip as a video segment

The resulting video is a lip-synced clip that can be trimmed, composited with other segments, and captioned like any other video segment.
