Cartesia Sonic

Also known as: Sonic TTS, Sonic 3.5, Cartesia TTS

Cartesia Sonic is a real-time text-to-speech API from Cartesia AI, built for voice agent applications. According to Cartesia Docs, the current production model, Sonic 3.5, generates natural-sounding speech in under 90 milliseconds across 42 languages, using a streaming architecture optimized for interactive, latency-sensitive deployments.

What It Is

Most text-to-speech systems were built for a world where audio generation was a post-processing step: write the text, send it to a service, wait for the audio file, then play it. That model breaks in conversational AI, where waiting half a second for a voice response destroys the feeling of a real conversation.

Cartesia Sonic works differently. It streams audio output while still processing the input text, so the first chunk of speech reaches the user’s speaker before the full sentence has been converted. Think of it like a live conference interpreter who starts whispering the translation in your ear before the speaker finishes each sentence — staying just a beat behind without waiting for the full utterance to end. According to Cartesia Docs, this approach produces audio that begins playing in under 90 milliseconds from the first characters received, which falls below the threshold most people perceive as a delay in normal conversation.

According to Cartesia Docs, the current production model is Sonic 3.5, released in May 2026, with a pinned version available as sonic-3.5-2026-05-04 for stable deployments. The older model identifiers — sonic, sonic-english, and sonic-multilingual — were removed from the API in June 2026. Developers still sending requests to those legacy identifiers will receive errors; migration to sonic-3.5 is required.
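As a minimal sketch of that migration, the helper below rewrites the removed identifiers to a supported one before a request is built. The model IDs come from the paragraph above; the function name and the `pin` flag are hypothetical and not part of any Cartesia SDK.

```python
# Model identifiers as named in the text; everything else here is illustrative.
LEGACY_MODEL_IDS = {"sonic", "sonic-english", "sonic-multilingual"}
CURRENT_ALIAS = "sonic-3.5"
PINNED_VERSION = "sonic-3.5-2026-05-04"


def migrate_model_id(model_id: str, pin: bool = True) -> str:
    """Return a supported model ID, replacing removed legacy identifiers.

    Hypothetical helper: legacy IDs map to the pinned version by default,
    or to the floating alias when pin=False. Other IDs pass through unchanged.
    """
    if model_id in LEGACY_MODEL_IDS:
        return PINNED_VERSION if pin else CURRENT_ALIAS
    return model_id
```

Running a helper like this over stored configuration is one way to find every place a legacy identifier is still in use before requests start failing.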

According to Cartesia Docs, the model covers 42 languages. On quality benchmarks, according to Artificial Analysis, Sonic 3.5 holds the top quality Elo score on their text-to-speech leaderboard. Elo is a rating system that ranks competitors by the outcomes of head-to-head comparisons rather than by absolute scores. Alongside quality and speed, the model supports instant voice cloning: submit a short audio sample and the system generates a voice profile in seconds, without a long training run.
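To make the rating idea concrete, here is a small sketch of the standard Elo update for one head-to-head comparison. The 400-point scale and K-factor of 32 are the conventional chess defaults and are assumptions here; the source does not state which parameters the leaderboard uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise comparison.

    k controls how far a single result moves the ratings (assumed value).
    """
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

With two equally rated voices at 1500, a single win moves the winner to 1516 and the loser to 1484. A win against a much higher-rated opponent moves the ratings further, which is why the ranking reflects who beat whom rather than any absolute quality score.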

Cartesia offers a free tier for prototyping, with paid plans adding higher usage limits, a commercial license, and instant voice cloning for production deployments.

How It’s Used in Practice

The most common place you encounter Cartesia Sonic is in AI-powered voice agents: customer support bots that speak their replies aloud, voice assistants embedded in mobile apps, or interactive voice response systems where an LLM generates responses dynamically. In these pipelines, TTS is the last step before the user hears anything, so its latency is added on top of every upstream delay in the pipeline.

A typical setup feeds a language model’s text output directly into Cartesia’s streaming API. The API receives tokens as the LLM generates them and begins returning audio data almost immediately, so there is no need to buffer the full reply before playback starts. The user begins hearing the answer while the model is still generating the rest of it.
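The interleaving described above can be illustrated with plain Python generators. This is a sketch only: `fake_llm`, `fake_streaming_tts`, and `play` are stand-ins invented for illustration, and no real Cartesia or LLM API is called. The `events` log records the order in which tokens are generated and audio is produced.

```python
from typing import Iterable, Iterator, List

events: List[str] = []  # ordered log of what happened, for inspection


def fake_llm() -> Iterator[str]:
    """Stand-in for an LLM that emits a reply token by token."""
    for token in ["Your ", "order ", "ships ", "today."]:
        events.append(f"llm:{token.strip()}")
        yield token


def fake_streaming_tts(tokens: Iterable[str]) -> Iterator[bytes]:
    """Stand-in for a streaming TTS call: one audio chunk per token received."""
    for token in tokens:
        audio = token.encode("utf-8")  # placeholder bytes, not real audio
        events.append(f"audio:{token.strip()}")
        yield audio


def play(audio_chunks: Iterable[bytes]) -> int:
    """Stand-in for playback; returns the number of chunks played."""
    played = 0
    for _ in audio_chunks:
        played += 1
    return played


# Chain the stages lazily: nothing is buffered between them.
chunks_played = play(fake_streaming_tts(fake_llm()))
```

Because each stage pulls from the previous one lazily, the log shows the first audio chunk being produced before the final token has been generated. That ordering is the property a streaming pipeline relies on.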

Pro Tip: In production, always use the pinned model version (sonic-3.5-2026-05-04) rather than the floating sonic-3.5 alias. The alias always resolves to the latest model release, which may change without notice when Cartesia ships updates. Pinning locks your production voice to a specific behavior — evaluate new releases separately on a staging environment before switching.
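One way to enforce this in code is to resolve the model ID from the deployment environment rather than hard-coding it at call sites. The sketch below assumes hypothetical environment names ("production", "staging") and a hypothetical `resolve_model` helper; only the two model identifiers come from the text.

```python
from typing import Optional

PINNED_MODEL = "sonic-3.5-2026-05-04"  # pinned version named in the text
FLOATING_ALIAS = "sonic-3.5"           # floating alias named in the text


def resolve_model(environment: str, candidate: Optional[str] = None) -> str:
    """Pick the model ID for a deployment environment (illustrative policy)."""
    if environment == "production":
        # Production ignores any candidate and never follows the alias.
        return PINNED_MODEL
    # Non-production environments may evaluate a candidate release,
    # falling back to the floating alias to see the latest behavior.
    return candidate or FLOATING_ALIAS
```

Promoting a new release then becomes an explicit change to `PINNED_MODEL` after staging evaluation, not something that happens implicitly when the alias moves.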

When to Use / When Not

Scenario | Use | Avoid
Real-time voice agent with LLM-generated replies | ✓ |
Batch narration of pre-written articles with no latency constraints | | ✓
Multilingual customer support bot across many languages | ✓ |
High-volume, cost-optimized bulk audio generation | | ✓
Rapid voice prototyping with custom voice clones | ✓ |
Long-form audiobook production where audio quality is manually reviewed | | ✓

Common Misconception

Myth: Cartesia Sonic is a general-purpose TTS tool you can drop in anywhere you need speech output.

Reality: Sonic is optimized for low-latency, streaming, interactive scenarios. It is not necessarily the most cost-efficient choice for bulk or asynchronous audio generation where latency is irrelevant. Choosing it for batch workloads means paying a premium for a streaming architecture you are not using — other TTS providers may offer better economics for non-real-time workloads.

One Sentence to Remember

Cartesia Sonic is the TTS layer for the AI agent era — it starts speaking before it finishes reading, which is what makes a voice bot feel like a conversation rather than a menu system.

FAQ

Q: What happened to the “sonic-multilingual” model ID? A: Cartesia removed the sonic-multilingual, sonic-english, and original sonic model IDs from the API in June 2026. All requests now require sonic-3.5 or the pinned version sonic-3.5-2026-05-04 — the old identifiers return errors.

Q: Can I use Cartesia Sonic to clone a voice for a commercial product? A: Voice cloning for commercial use requires a paid plan. The free tier is for prototyping only and has monthly credit limits. Check Cartesia’s current terms for licensing specifics before deploying a cloned voice in a production product.

Q: How does Sonic compare to other TTS options in the 2026 market? A: According to Artificial Analysis, Sonic 3.5 holds the top quality Elo ranking on their TTS leaderboard. For latency-sensitive voice agent work, Sonic’s sub-90ms delivery is among the fastest available. Other providers optimize for different trade-offs, such as cost, voice variety, or offline capability.

Expert Takes

Cartesia Sonic’s speed comes from streaming audio in chunks rather than waiting for a complete text sequence. The model generates short audio segments as tokens arrive, trading some prosodic optimality for dramatically lower perceived latency. The result falls below the threshold most humans notice as a conversational pause — which is what separates a live-feeling voice agent from a pre-recorded response. Long, melodically complex sentences are where the trade-off shows most clearly.

The critical implementation decision with Sonic in a voice agent pipeline is where to split LLM output into TTS input chunks. Send too few characters and you get choppy audio with unnatural pauses; send full sentences and you lose the latency advantage. Most teams settle on sentence-boundary splitting with a small character minimum per chunk. Pin the specific model version in production rather than the floating alias — your output voice should not change silently between deployments.
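The splitting strategy described here can be sketched as a small generator that buffers LLM tokens and releases a chunk only at a sentence boundary, once a minimum length is reached. The function name, the punctuation rule, and the default of 20 characters are assumptions chosen for illustration, not values recommended by Cartesia.

```python
import re
from typing import Iterable, Iterator

# A chunk may end only where the buffer ends in ., ! or ? (plus whitespace).
SENTENCE_END = re.compile(r"[.!?]\s*$")


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Group streamed tokens into sentence-aligned chunks of at least min_chars."""
    buffer = ""
    for token in tokens:
        buffer += token
        long_enough = len(buffer.strip()) >= min_chars
        if long_enough and SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the LLM stream ends, even if it is short.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Hi. ", "Your order ", "shipped today. ", "Anything ", "else?"]
chunks = list(chunk_for_tts(tokens))
```

Here the short opening "Hi." is held back and merged into the next sentence instead of being sent alone, which is the purpose of the character minimum. A simple punctuation rule like this one will also split after abbreviations such as "Dr.", so a production splitter would likely need a more careful boundary check.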

The TTS market is fragmenting, and Sonic is staking out the real-time voice agent segment before others realize that is the high-value territory. Batch TTS is becoming a commodity. The real endgame is voice agents that handle entire customer service interactions without escalation, and latency is the gate — nobody tolerates a noticeable pause from a support bot. Companies building on that stack today are locking in voice identity and infrastructure choices that will be costly to change later.

Voice cloning at this speed creates a genuine verification gap. If a voice agent can replicate any speaker from a short audio clip in seconds, the question of consent becomes harder to enforce at deployment time. The feature is marketed as convenient personalization, but the same capability that lets a company create a consistent brand voice also makes it trivial to imitate a real person without their knowledge. The policy frameworks here are well behind the technology.