06 — Voice Studio

Voice Studio is a four-step pipeline for replacing or re-voicing audio from any source. A typical workflow: extract audio from a video → transcribe it to text → edit and polish the script → synthesize a new voice.

Access Voice Studio from the left panel or by clicking + Voice Studio in the toolbar.

Step 1: Extract

Get the audio you want to work with.

Three input methods:

From video file — Upload a video file. The app strips the audio track and prepares it for transcription. A waveform preview is shown.
From audio file — Upload a WAV, MP3, or M4A file directly.
Record — Record from your microphone in real time.

After the source is ready, click Next to proceed to transcription.
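
How the app extracts audio internally is not documented here. As a rough illustration of what pulling the audio track out of a video involves, the sketch below builds an ffmpeg command line in Python. The function name, the .wav output naming, and the 16 kHz mono format are assumptions made for the example, not the app's actual settings.

```python
from pathlib import Path


def build_extract_command(video_path: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg argument list that writes the audio track as mono WAV.

    Hypothetical helper: the output name and audio format are illustrative.
    """
    source = Path(video_path)
    target = source.with_suffix(".wav")  # e.g. clip.mp4 -> clip.wav
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(source),        # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the requested rate
        "-c:a", "pcm_s16le",      # encode as 16-bit PCM
        str(target),
    ]


command = build_extract_command("clip.mp4")
```

Passing the returned list to subprocess.run(command, check=True) would perform the extraction, provided ffmpeg is installed and on the PATH.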

Step 2: Transcribe

Convert speech to text using one of three speech-to-text (STT) engines:

Engine	API Key Required	Best For
OpenAI Whisper	OpenAI	Highest accuracy; supports most languages
Google Cloud STT	Google Cloud	Fast and reliable; good for clean audio
Google Gemini	Gemini	Good general accuracy; uses the Gemini API

Select your preferred engine.
Click Transcribe.
The result is a word-level transcript: each word carries timestamps marking when it was spoken.

This timestamp data powers the word-by-word animated captions in the Captions panel.
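
To make the idea of a word-level transcript concrete, here is a minimal Python sketch of one way such data could be represented and queried. The Word class, its field names, and the active_word helper are hypothetical; the app's real transcript format is not described in this guide.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


def active_word(words: list[Word], t: float) -> Optional[Word]:
    """Return the word being spoken at time t, or None during silence.

    Assumes the words are sorted by start time and do not overlap.
    """
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and t < words[i].end:
        return words[i]
    return None


transcript = [
    Word("Welcome", 0.00, 0.42),
    Word("to", 0.42, 0.55),
    Word("Voice", 0.60, 0.95),
    Word("Studio", 0.95, 1.50),
]
```

A caption renderer could call active_word on every frame to decide which word to highlight, which is the kind of lookup that word-by-word animation depends on.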

Click Next to review and edit the transcript.

Step 3: Edit / Polish

Manual Editing

The transcript is fully editable. Click any word to correct a transcription error. This matters because the edited text drives voice synthesis in Step 4, so any uncorrected error will be spoken.

AI Polish

Use AI to rewrite the transcript for a different tone or style:

Click Polish with AI (requires a Gemini API key).
Select a tone:
Professional — formal, clear business language
Conversational — natural, relaxed phrasing
Engaging — energetic and attention-holding
Marketing — persuasive, benefit-focused language
Casual — informal, friendly tone
The AI rewrites the text while preserving the core meaning.

You can run Polish with AI multiple times, or mix manual edits with AI polishing.

Adding Pause Markers

Control how the synthesized voice breathes and pauses. Insert pause markers directly into the transcript text using this syntax:

<#0.8#>

The number represents a pause duration in seconds. For example, <#1.2#> inserts a 1.2-second pause at that point in the speech.
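
The marker syntax is simple enough to parse with a regular expression. The Python sketch below splits a script into text and pause segments; it only illustrates the syntax and is not the app's own parser. The function name and the segment format are assumptions.

```python
import re
from typing import Union

# Matches markers such as <#0.8#> or <#2#> and captures the number.
PAUSE = re.compile(r"<#(\d+(?:\.\d+)?)#>")

Segment = tuple[str, Union[str, float]]


def split_pauses(script: str) -> list[Segment]:
    """Split a script into ("text", str) and ("pause", seconds) segments."""
    segments: list[Segment] = []
    pos = 0
    for match in PAUSE.finditer(script):
        text = script[pos:match.start()].strip()
        if text:
            segments.append(("text", text))
        segments.append(("pause", float(match.group(1))))
        pos = match.end()
    tail = script[pos:].strip()  # any text after the last marker
    if tail:
        segments.append(("text", tail))
    return segments


parts = split_pauses("Hello there. <#0.8#> Let's begin.<#1.2#>")
total_pause = sum(value for kind, value in parts if kind == "pause")
```

Summing the pause segments, as in the last line, gives a quick estimate of how much silence the markers add to the finished audio.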

Tip: Add pauses after sentences, before topic changes, or around critical points you want to emphasize. Natural breathing makes synthesized audio sound significantly more human.
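
If you prepare long scripts outside the app, the first part of this tip can be automated. The following Python sketch inserts a pause marker after each sentence-ending punctuation mark that is followed by more text. It is a hypothetical helper, not an app feature, and the 0.6-second default is an arbitrary starting value.

```python
import re

# Whitespace that follows ., ! or ? and is not already followed by a marker.
SENTENCE_GAP = re.compile(r"(?<=[.!?])\s+(?!\s*<#)")


def add_sentence_pauses(script: str, seconds: float = 0.6) -> str:
    """Insert a pause marker between sentences, skipping existing markers."""
    return SENTENCE_GAP.sub(f" <#{seconds}#> ", script)


example = add_sentence_pauses("Hello. World. Done.")
```

The pattern is deliberately naive: abbreviations such as "e.g." followed by a space are treated as sentence endings, so review the result before synthesizing.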

When you are satisfied with the transcript, click Next.

Step 4: Synthesize

Convert your edited text to a new AI voice.

Voice Settings

Setting	Description
Voice	Choose from 25+ pre-trained voices (Alex, Amy, Deep, Energetic, and more)
Pitch	Adjust from -2.0 (deeper) to +2.0 (higher)
Speed	Adjust from -2.0 (slower) to +2.0 (faster)
Language Boost	Enhance the accent of a specific language (Mandarin, Japanese, Korean, etc.)
Emotion	Preset emotional tone: Neutral, Happy, Sad, Angry, Fearful, Disgusted, or Surprised
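
The ranges and presets in the table can be captured in a small settings object. The Python sketch below is one possible model: the class, its field names, the default values, and the choice to clamp out-of-range numbers rather than reject them are all assumptions for illustration, not the app's documented behavior.

```python
from dataclasses import dataclass
from typing import Optional

# Emotion presets as listed in the settings table.
EMOTIONS = {"Neutral", "Happy", "Sad", "Angry", "Fearful", "Disgusted", "Surprised"}


def clamp(value: float, low: float = -2.0, high: float = 2.0) -> float:
    """Limit value to the documented -2.0 to +2.0 range."""
    return max(low, min(high, value))


@dataclass
class VoiceSettings:
    voice: str = "Alex"                    # assumed default
    pitch: float = 0.0                     # -2.0 (deeper) to +2.0 (higher)
    speed: float = 0.0                     # -2.0 (slower) to +2.0 (faster)
    language_boost: Optional[str] = None   # e.g. "Mandarin"
    emotion: str = "Neutral"

    def __post_init__(self) -> None:
        self.pitch = clamp(self.pitch)
        self.speed = clamp(self.speed)
        if self.emotion not in EMOTIONS:
            raise ValueError(f"Unknown emotion: {self.emotion!r}")


settings = VoiceSettings(voice="Amy", pitch=3.5, speed=-0.5, emotion="Happy")
```

Here the requested pitch of 3.5 falls outside the supported range and is stored as 2.0, while an unsupported emotion name raises an error.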

Generating

Configure the settings above.
Click Synthesize.
The request is sent to the MiniMax voice synthesis model via Replicate. Generation typically takes 10–30 seconds.
A waveform preview appears when complete.

Synthesis History

Every synthesis attempt is saved to your history. You can:
– Play back any previous attempt
– Compare voices by listening to each attempt in turn

Adding to Timeline

Once you have a result you’re happy with:

Click Add to Timeline.
The audio clip is automatically probed for duration and inserted into the audio track at the current playhead position.
From there, you can reposition, trim, or mix it like any other audio clip.
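
The probe-and-insert step can be pictured with a short Python sketch. It reads a WAV file's duration with the standard-library wave module and places a clip at the playhead. The Clip class, the function names, and the list-based track are hypothetical, and the app may probe other formats by different means.

```python
import io
import wave
from dataclasses import dataclass


@dataclass
class Clip:
    start: float     # timeline position in seconds
    duration: float  # clip length in seconds

    @property
    def end(self) -> float:
        return self.start + self.duration


def probe_wav_duration(source) -> float:
    """Return the length in seconds of a WAV given as a path or file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def add_to_timeline(track: list[Clip], source, playhead: float) -> Clip:
    """Probe the audio, place it at the playhead, and keep the track ordered."""
    clip = Clip(start=playhead, duration=probe_wav_duration(source))
    track.append(clip)
    track.sort(key=lambda c: c.start)
    return clip


# Demo: two seconds of silence (8 kHz, mono, 16-bit) held in memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

track: list[Clip] = [Clip(start=10.0, duration=3.0)]
new_clip = add_to_timeline(track, buffer, playhead=5.0)
```

Because the duration is known at insert time, the clip's end position can be computed immediately, which is what later repositioning and trimming rely on.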
