Resemble Now Available on fal

Team fal

Jun 2, 2025 • 3 min read

Today, we’re excited to announce our partnership with Resemble AI, bringing their powerful suite of real-time, high-fidelity text-to-speech and voice cloning models to the fal platform.

Introducing Resemble AI on fal

Resemble AI sets a new benchmark in TTS by combining zero-shot voice cloning, expressive speech synthesis, and real-time inference. This new integration unlocks advanced voice capabilities for creators, developers, and product teams—all accessible through fal’s simple, developer-friendly APIs.

Key features include:

Zero-Shot Voice Cloning: Instantly clone any voice with just a few seconds of reference audio—no additional training required.
Emotion Exaggeration Control: Adjust the emotional intensity of the synthesized voice from monotone to highly expressive using a single parameter.
Alignment-Informed Generation: Ensures ultra-stable and coherent speech output.
Real-Time Inference: Achieve faster-than-real-time voice synthesis, ideal for interactive applications.
Easy Voice Conversion: Supports voice conversion out of the box.

Try out the models

Chatterbox (OSS) - Text to Speech: Generate expressive, natural speech with Resemble AI's Chatterbox. Features unique emotion control, instant voice cloning from short audio, and built-in watermarking.

Chatterbox (OSS) - Speech to Speech: Convert audio to new voices or your own samples, with expressive results and built-in perceptual watermarking.

Chatterbox HD - Text to Speech: Generate expressive, natural speech with Resemble AI's Chatterbox. A higher-sampling-rate version of the OSS model (OSS is 24kHz; HD supports up to 48kHz), with newer weights for improved zero-shot accent capture and pre-built voices.

Chatterbox HD - Speech to Speech: Convert audio to new voices or your own samples, with expressive results and built-in perceptual watermarking. A higher-sampling-rate version of the OSS model (OSS is 24kHz; HD supports up to 48kHz), with newer weights for improved zero-shot accent capture and pre-built voices.

Chatterbox Pro: Coming Soon!

Hear It in Action

Example #1:

Text Input: "My name is Maximus Decimus Meridius, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. Father to a murdered son, husband to a murdered wife. And I will have my vengeance, in this life or the next."

Voice Cloning Target Voice:

[Audio clip, approx. 14 seconds]

Generated Text-To-Speech:

[Audio clip, approx. 18 seconds]

Example #2:

Text Input: "Introducing the next generation of refreshment. Duff Beer just got bolder, smoother, and brewed to perfection. Whether you're kicking back or barfing up, it's the taste that never quits. Crack open a classic. Duff is back, and better than ever."

Voice Cloning Target Voice:

[Audio clip, approx. 30 seconds]

Generated Text-To-Speech:

[Audio clip, approx. 18 seconds]

Getting Started with High-Quality Results

Creating High-Quality Voice Clones

The model mimics the audio fidelity and voice characteristics of the input. For the best results, use clean audio featuring a single speaker, ideally recorded with a professional microphone. Quick tips:

Use at least 10 seconds of clean speech from a single speaker.
Record at 24kHz or higher, ideally with a professional microphone and no background noise.
Match the speaking style of your reference audio to your intended output (e.g., use audiobook-style speech for audiobook generation).
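
The first two tips can be checked automatically before a reference clip is uploaded. The following is a minimal sketch, assuming the reference is an uncompressed WAV file. The helper name `check_reference_audio` and the two constants are illustrative and not part of any fal or Resemble AI API. It only reads the file header with Python's standard `wave` module, so it cannot detect background noise, multiple speakers, or speaking style.

```python
import wave

# Thresholds taken from the tips above.
MIN_SECONDS = 10.0
MIN_SAMPLE_RATE = 24_000


def check_reference_audio(path):
    """Return a list of warnings for a WAV reference clip (empty if none)."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate

    warnings = []
    if duration < MIN_SECONDS:
        warnings.append(
            f"clip is {duration:.1f}s long; use at least {MIN_SECONDS:.0f}s"
        )
    if sample_rate < MIN_SAMPLE_RATE:
        warnings.append(
            f"sample rate is {sample_rate} Hz; record at {MIN_SAMPLE_RATE} Hz or higher"
        )
    return warnings
```

Calling `check_reference_audio("reference.wav")` on a clip that meets both thresholds returns an empty list; otherwise each entry describes one problem to fix before cloning.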

Generating Conversational Speech

For natural, conversational voices—ideal for voice agents—adjust the model settings:

Default values (exaggeration = 0.5, cfg = 0.5) perform well in most cases.
If your reference speaker talks quickly, reduce cfg to ~0.3 to improve pacing.
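
As a rough illustration, these settings could be passed to the model through fal's Python client. This is a sketch under stated assumptions: the endpoint ID `fal-ai/chatterbox/text-to-speech` and the parameter names `text` and `audio_url` are assumptions that should be checked against the model page in the fal gallery, and `conversational_arguments` and `synthesize` are hypothetical helper names. Only the `exaggeration` and `cfg` values come from the guidance above.

```python
def conversational_arguments(text, reference_audio_url, fast_speaker=False):
    """Build request arguments for conversational speech."""
    return {
        "text": text,                      # assumed parameter name
        "audio_url": reference_audio_url,  # assumed parameter name
        "exaggeration": 0.5,               # default from the guidance above
        # Reduce cfg to ~0.3 when the reference speaker talks quickly.
        "cfg": 0.3 if fast_speaker else 0.5,
    }


def synthesize(arguments, endpoint="fal-ai/chatterbox/text-to-speech"):
    """Send the request with fal's Python client.

    Requires `pip install fal-client` and a FAL_KEY environment variable.
    The default endpoint ID is an assumption.
    """
    import fal_client  # imported lazily so the helper above works without it

    return fal_client.subscribe(endpoint, arguments=arguments)


arguments = conversational_arguments(
    "Thanks for calling. How can I help you today?",
    "https://example.com/reference.wav",  # placeholder URL
    fast_speaker=True,
)
```

Passing `arguments` to `synthesize` would submit the request; the shape of the response is not shown here because it depends on the endpoint's schema.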

Generating Expressive or Dramatic Speech

For dramatic or emotional speech—great for film, games, or ads:

Lower cfg to ~0.3 and increase exaggeration to 0.7 or higher.
Higher exaggeration speeds up delivery; reducing cfg helps balance the pacing.
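
One way to keep these recommendations in one place is a small preset table. The sketch below is illustrative only: `PRESETS` and `settings_for` are hypothetical names, the preset labels are invented for this example, and the numbers are the starting points suggested in this post rather than limits enforced by the model.

```python
# Starting points suggested in this post; tune by ear for each voice.
PRESETS = {
    "conversational": {"exaggeration": 0.5, "cfg": 0.5},
    "fast_reference_speaker": {"exaggeration": 0.5, "cfg": 0.3},
    "dramatic": {"exaggeration": 0.7, "cfg": 0.3},
}


def settings_for(style, exaggeration=None):
    """Return a copy of a preset, optionally overriding exaggeration."""
    settings = dict(PRESETS[style])  # copy so the preset table stays unchanged
    if exaggeration is not None:
        settings["exaggeration"] = exaggeration
    return settings


# A more intense read for a trailer line: raise exaggeration, keep cfg low.
trailer = settings_for("dramatic", exaggeration=0.9)
```

The returned dictionary can be merged into whatever request arguments the chosen endpoint expects.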

What to Know About Model Constraints

Accent Preservation: Zero-shot cloning may favor American or British English accents and may weaken regional accents in the output.
Language Support: Currently, the model supports English text only.

Try It Now

All of this is ready to use on fal’s GenAI platform—just plug and play via API. Get started with Resemble AI today and bring expressive, real-time voice to your creative workflows.

Explore the models in the fal model gallery.