06 — Voice Studio
Voice Studio is a 4-step pipeline that lets you replace or re-voice audio from any source. The typical use case is: extract audio from a video → transcribe it to text → edit and polish the script → synthesize a new voice.
Access Voice Studio from the left panel or by clicking + Voice Studio in the toolbar.
Step 1: Extract
Get the audio you want to work with.
Three input methods:
- From video file — Upload a video file. The app strips the audio track and prepares it for transcription. A waveform preview is shown.
- From audio file — Upload a WAV, MP3, or M4A file directly.
- Record — Record from your microphone in real time.
After the source is ready, click Next to proceed to transcription.
Step 2: Transcribe
Convert speech to text using one of three speech-to-text (STT) engines:
| Engine | API Key Required | Best For |
|---|---|---|
| OpenAI Whisper | OpenAI | Highest accuracy; supports most languages |
| Google Cloud STT | Google Cloud | Fast and reliable; good for clean audio |
| Google Gemini | Gemini | Good general accuracy; uses the Gemini API |
- Select your preferred engine.
- Click Transcribe.
- The result is a word-level transcript with timestamps: each word is tagged with the time at which it was spoken.
This timestamp data is what powers the word-by-word animated captions in the Captions panel.
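To make the idea of a word-level transcript concrete, here is a minimal sketch of how timestamped words could be used to highlight the word currently being spoken. The `Word` class, its `start`/`end` fields, and the `active_word` helper are illustrative assumptions; the app's actual transcript format is not documented here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """One transcribed word (hypothetical shape, not the app's real format)."""
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def active_word(words: List[Word], t: float) -> Optional[Word]:
    """Return the word being spoken at time t, or None during silence."""
    for word in words:
        if word.start <= t < word.end:
            return word
    return None


transcript = [
    Word("Welcome", 0.00, 0.42),
    Word("to", 0.42, 0.55),
    Word("Voice", 0.60, 0.95),
    Word("Studio", 0.95, 1.50),
]

current = active_word(transcript, 0.70)
print(current.text if current else "(silence)")  # prints "Voice"
```

A caption renderer could call a lookup like this once per frame to decide which word to emphasize.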
Click Next to review and edit the transcript.
Step 3: Edit / Polish
Manual Editing
The transcript is fully editable. Click on any word to correct a transcription error. This is important because the polished text will drive voice synthesis in Step 4.
AI Polish
Use AI to rewrite the transcript for a different tone or style:
- Click Polish with AI (requires a Gemini API key).
- Select a tone:
  - Professional: formal, clear business language
  - Conversational: natural, relaxed phrasing
  - Engaging: energetic and attention-holding
  - Marketing: persuasive, benefit-focused language
  - Casual: informal, friendly tone
- The AI rewrites the text while preserving the core meaning.
You can run AI Polish multiple times and combine it with manual edits.
Adding Pause Markers
Control how the synthesized voice breathes and pauses. Insert pause markers directly into the transcript text using this syntax:
<#0.8#>
The number represents a pause duration in seconds. For example, <#1.2#> inserts a 1.2-second pause at that point in the speech.
Tip: Add pauses after sentences, before topic changes, or around key points you want to emphasize. Well-placed pauses make synthesized speech sound more natural.
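The marker syntax is simple enough to check by hand, but a short script can help when reviewing a long transcript. The sketch below splits a script into text segments and pause durations. The function names are hypothetical, and the pattern assumes that whole numbers such as `<#1#>` are also accepted, which the app may or may not allow.

```python
import re

# Matches markers such as <#0.8#> or <#1.2#>; integer values are an assumption.
PAUSE_MARKER = re.compile(r"<#(\d+(?:\.\d+)?)#>")


def split_pauses(script):
    """Split a script into ("text", str) and ("pause", seconds) parts, in order."""
    parts = []
    pos = 0
    for match in PAUSE_MARKER.finditer(script):
        text = script[pos:match.start()].strip()
        if text:
            parts.append(("text", text))
        parts.append(("pause", float(match.group(1))))
        pos = match.end()
    tail = script[pos:].strip()
    if tail:
        parts.append(("text", tail))
    return parts


def total_pause(script):
    """Sum of all pause durations in the script, in seconds."""
    return sum(value for kind, value in split_pauses(script) if kind == "pause")


script = "Welcome back.<#0.8#> Today we cover three things.<#1.2#> Let's begin."
for kind, value in split_pauses(script):
    print(kind, value)
```

Summing the pauses with `total_pause` gives a rough idea of how much silence the markers add to the final clip.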
When you are satisfied with the transcript, click Next.
Step 4: Synthesize
Convert your edited text to a new AI voice.
Voice Settings
| Setting | Description |
|---|---|
| Voice | Choose from 25+ pre-trained voices (Alex, Amy, Deep, Energetic, and more) |
| Pitch | Adjust from -2.0 (deeper) to +2.0 (higher) |
| Speed | Adjust from -2.0 (slower) to +2.0 (faster) |
| Language Boost | Enhance the accent for a specific language (Mandarin, Japanese, Korean, etc.) |
| Emotion | Preset emotional tone: Neutral, Happy, Sad, Angry, Fearful, Disgusted, or Surprised |
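As an illustration of the ranges in the table, the following sketch models the settings as a small data class that clamps Pitch and Speed to the documented -2.0 to +2.0 range and rejects unknown emotions. The class name, field names, and default values are assumptions made for this example; they do not describe the app's internal representation.

```python
from dataclasses import dataclass
from typing import Optional

# Emotion presets as listed in the settings table.
EMOTIONS = {"Neutral", "Happy", "Sad", "Angry", "Fearful", "Disgusted", "Surprised"}


def clamp(value, low=-2.0, high=2.0):
    """Limit value to the documented slider range."""
    return max(low, min(high, value))


@dataclass
class VoiceSettings:
    """Hypothetical container for the Step 4 settings (defaults are assumed)."""
    voice: str = "Alex"
    pitch: float = 0.0
    speed: float = 0.0
    language_boost: Optional[str] = None
    emotion: str = "Neutral"

    def __post_init__(self):
        # Out-of-range values are pulled back to the nearest limit.
        self.pitch = clamp(self.pitch)
        self.speed = clamp(self.speed)
        if self.emotion not in EMOTIONS:
            raise ValueError(f"Unknown emotion: {self.emotion!r}")


settings = VoiceSettings(voice="Amy", pitch=3.5, speed=-0.5, emotion="Happy")
print(settings.pitch, settings.speed)  # prints "2.0 -0.5"
```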
Generating
- Configure the settings above.
- Click Synthesize.
- The request is sent to the MiniMax voice synthesis model via Replicate. Generation typically takes 10–30 seconds.
- A waveform preview appears when complete.
Synthesis History
Every synthesis attempt is saved to your history. You can:
- Play back any previous attempt
- Compare voices side-by-side by listening to each
Adding to Timeline
Once you have a result you’re happy with:
- Click Add to Timeline.
- The audio clip is automatically probed for duration and inserted into the audio track at the current playhead position.
- From there, it can be repositioned, trimmed, or mixed with other audio clips like any other clip.
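The probe-and-insert step can be pictured roughly as follows. This is a simplified sketch, not the app's implementation: it reads duration from a WAV file with Python's standard `wave` module (the app may use a different prober and other formats), and the `Clip`, `AudioTrack`, and `add_to_timeline` names are hypothetical. Overlap with existing clips is not handled.

```python
import os
import tempfile
import wave
from dataclasses import dataclass, field
from typing import List


def probe_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


@dataclass
class Clip:
    name: str
    start: float     # position on the timeline, in seconds
    duration: float  # length of the clip, in seconds

    @property
    def end(self):
        return self.start + self.duration


@dataclass
class AudioTrack:
    clips: List[Clip] = field(default_factory=list)

    def insert(self, clip):
        """Add a clip and keep the track ordered by start time."""
        self.clips.append(clip)
        self.clips.sort(key=lambda c: c.start)


def add_to_timeline(track, path, playhead):
    """Probe the file, then place it on the track at the playhead position."""
    clip = Clip(os.path.basename(path), playhead, probe_duration(path))
    track.insert(clip)
    return clip


# Demo: write one second of silence, then add it at a 12.5 s playhead.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "take.wav")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(8000)
        wav.writeframes(b"\x00\x00" * 8000)
    track = AudioTrack()
    clip = add_to_timeline(track, path, playhead=12.5)

print(clip.start, clip.end)  # prints "12.5 13.5"
```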