10 — Talking Head / Lip-Sync Video

Talking Head generation takes an audio clip and a reference portrait image and produces a video of the person appearing to speak. It’s useful for creating avatar-driven explainer videos, spokesperson content, or AI presenter clips without recording on camera.

Access this feature under AI Generate → Talking Head.

A Replicate API key and cloud storage (for the reference image upload) are required.


Available Models

Model                    Best For                                         Max Duration
MultiTalk                High-quality audio-driven lip sync               Varies
OmniHuman (ByteDance)    Realistic full-body or portrait; 3 crop modes    Varies
Wan 2.2 S2V              Cost-efficient; good for short clips             Varies
Fabric 1.0 (VEED)        Up to 60 seconds; resolution control             60 seconds

Preparing Your Reference Image

The reference image is the face or body that will be animated. Quality here directly affects result quality.

Best practices:
– Use a clear, high-resolution photo (minimum 512×512px)
– The face should be well lit and facing the camera, either straight on or at a slight angle
– Avoid sunglasses, heavy shadows, or blurry photos
– A plain or simple background helps the model focus on the face
– Portrait framing (head and shoulders) works best for most models
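Before uploading, you can check the size guideline locally. The sketch below is a minimal, standalone Python example (not part of this app) that reads a PNG file's width and height from its IHDR header and flags images smaller than the 512×512px minimum recommended above. It assumes PNG input only; JPEG and other formats would need a different parser or an imaging library, and the function names are illustrative.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MIN_SIDE = 512  # minimum side length recommended in this guide


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from a PNG file's IHDR chunk."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR data starts at byte 16: width and height are big-endian uint32 values.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def check_reference_png(data: bytes, min_side: int = MIN_SIDE) -> list[str]:
    """Return a list of warnings; an empty list means the size check passed."""
    width, height = png_dimensions(data)
    warnings = []
    if width < min_side or height < min_side:
        warnings.append(
            f"image is {width}x{height}px; at least {min_side}x{min_side}px is recommended"
        )
    return warnings
```

A size check only covers the first guideline; lighting, sunglasses, and background still need a visual review.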


OmniHuman-Specific Options

OmniHuman supports three crop modes that determine how much of the body is shown:

Mode         Description
Portrait     Head and shoulders only
Half body    Torso visible
Full body    Full person visible

Select the mode that matches your reference image framing.


Fabric 1.0 Options

Setting         Description
Resolution      Choose standard or high-resolution output
Max duration    Up to 60 seconds per clip
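Because Fabric 1.0 caps each clip at 60 seconds, it can help to check an audio file's length before generating. This hypothetical helper uses Python's standard-library wave module, so it only handles uncompressed WAV files; MP3 duration would need a separate decoder. The function names are illustrative and not part of the app.

```python
import io
import wave

FABRIC_MAX_SECONDS = 60.0  # Fabric 1.0 per-clip limit stated in this guide


def wav_duration_seconds(data: bytes) -> float:
    """Duration of a WAV file: total frame count divided by sample rate."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_fabric_limit(data: bytes, max_seconds: float = FABRIC_MAX_SECONDS) -> bool:
    """True if the clip is no longer than the per-clip maximum."""
    return wav_duration_seconds(data) <= max_seconds
```

If a narration runs longer than the limit, splitting it into several clips and generating each one separately keeps every job within range.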

Generating a Talking Head Video

  1. In the Talking Head tab, select your model.
  2. Upload your reference image. The image is uploaded to your configured cloud storage (S3 or R2) to generate an accessible URL for the AI model, so make sure cloud storage is configured in Settings.
  3. Select or upload your audio. You can:
     – Upload an audio file (MP3 or WAV)
     – Use a synthesized audio clip from Voice Studio
     – Select an audio clip already on your timeline
  4. Configure any model-specific settings (body crop, resolution, duration).
  5. Click Generate.
  6. The job is added to the Job Queue. Talking head generation typically takes 1–4 minutes.
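Generation runs asynchronously: the job waits in the queue until the model finishes. The sketch below shows the general polling pattern a client might use while waiting, assuming the job is backed by a Replicate prediction whose status moves through starting and processing to succeeded, failed, or canceled. get_status is a hypothetical callback you supply, and the 10-minute default timeout is an assumption chosen to leave headroom over the typical 1 to 4 minute run time.

```python
import time
from typing import Callable

# Terminal prediction statuses (assumption: the job maps to a Replicate prediction).
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_job(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns a terminal status or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s:.0f}s")
        sleep(interval_s)
```

Passing sleep and clock in as parameters keeps the loop testable without real waiting; a fixed interval is the simplest choice, though a backoff that grows the interval would reduce requests on long jobs.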

Using the Result

When the job completes:
– Click Download to save the MP4 to your computer
– Click Add to Timeline to insert the clip as a video segment

The resulting video is a lip-synced clip that can be trimmed, composited with other segments, and captioned like any other video segment.