Gradium
Also known as: Gradium AI, Gradium TTS, Gradium Voice API
Gradium is a Paris-based voice-AI startup, spun out of Kyutai, that provides real-time text-to-speech, speech-to-text, and voice cloning through a single API. As of 2026, its text-to-speech model leads independent real-time latency benchmarks among voice generation providers.
What It Is
A voice assistant that takes half a second to start speaking back feels broken, even though half a second sounds instant on paper. The illusion of a real conversation depends on that first sliver of response time — the gap between a user finishing a sentence and the AI’s voice starting to answer. Gradium is one of the companies racing to shrink that gap to almost nothing, and right now it’s the one doing it fastest according to an outside benchmark.
Gradium spun out of Kyutai, a French AI research lab known for open audio language models, in September 2025. Its founders carry a research background in neural audio codecs: technology that compresses speech into sequences of tokens, so a model can process audio the way a language model processes text instead of working on raw sound waves. The founding team includes Kyutai scientist Neil Zeghidour, researchers Laurent Mazaré and Alexandre Défossez, and ex-Google engineer Olivier Teboul.
Gradium ships text-to-speech (TTS, turning text into spoken audio), speech-to-text (STT, transcribing audio into text), and voice cloning through one API, instead of requiring three separate vendors stitched together. The metric it competes on is time-to-first-audio (TTFA): how long after a request the first chunk of audio starts streaming back, as opposed to how long the entire reply takes to generate. Think of it like a chef sending out the first plate the moment it’s ready instead of holding the whole order until every dish is cooked. According to the Gradium blog, on an independent benchmark run by Coval, Gradium’s TTS model led the field on time-to-first-audio among major real-time voice providers in a snapshot published in 2026.
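To make the metric concrete, here is a minimal sketch of how time-to-first-audio differs from total generation time. The `fake_tts_stream` generator is a hypothetical stand-in with made-up delays, not Gradium’s API or the benchmark’s actual harness; `measure_latency` works on any iterable of audio chunks.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(n_chunks: int = 5, first_delay: float = 0.05,
                    chunk_delay: float = 0.02) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (not a real API)."""
    time.sleep(first_delay)            # time before the first chunk is ready
    yield b"\x00" * 320
    for _ in range(n_chunks - 1):
        time.sleep(chunk_delay)        # later chunks keep arriving afterwards
        yield b"\x00" * 320


def measure_latency(stream: Iterable[bytes]) -> Tuple[float, float]:
    """Return (time_to_first_audio, total_time) in seconds for one stream."""
    start = time.perf_counter()
    ttfa = None
    for chunk in stream:
        if ttfa is None and chunk:     # first non-empty chunk marks TTFA
            ttfa = time.perf_counter() - start
    total = time.perf_counter() - start
    if ttfa is None:
        raise ValueError("stream produced no audio")
    return ttfa, total


if __name__ == "__main__":
    ttfa, total = measure_latency(fake_tts_stream())
    print(f"TTFA: {ttfa * 1000:.0f} ms, full reply: {total * 1000:.0f} ms")
```

The two numbers diverge as replies get longer: TTFA stays roughly constant, while total time grows with the amount of audio generated.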
Speed alone isn’t the whole story: a model that talks instantly but garbles words isn’t useful. According to the Gradium blog, the same benchmark also tracks word accuracy, and its results show Gradium holding competitive accuracy while leading on latency. That is the harder combination, since most providers trade one for the other.
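One common way to score word accuracy is word error rate (WER): transcribe the generated audio, then count the word-level edits needed to turn the transcript back into the original text. Whether the benchmark uses exactly this formula is an assumption; the sketch below only illustrates the general idea.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: simple lowercase + whitespace tokenization. Real evaluations
    usually normalize punctuation and numbers as well.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the quick brown fox", "the quick brown box"))
```

Lower is better: 0.0 means every word came through intact, and word accuracy is often reported as one minus this value.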
How It’s Used in Practice
Most people encounter Gradium through its API when building a voice agent — a customer support bot, an AI phone assistant, or an in-app voice feature that needs to sound like it’s actually listening, not playing back a buffered clip. The API streams audio over a persistent connection (a websocket, which keeps one connection open instead of a new request per reply), so the first words begin playing while the rest of the sentence still generates.
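The streaming pattern described above can be sketched without any network at all. In this minimal example, `fake_synthesize` is a hypothetical stand-in for a streaming connection (it is not Gradium’s API), and "playback" just records an event; the point is only that playing starts while later chunks are still being generated.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_synthesize(text: str, delay: float = 0.01) -> AsyncIterator[bytes]:
    """Hypothetical streaming synthesizer: yields one 'audio' chunk per word."""
    for word in text.split():
        await asyncio.sleep(delay)        # stands in for model + network time
        yield word.encode("utf-8")


async def speak(text: str) -> List[Tuple[str, str]]:
    """Run synthesis and playback concurrently; return the ordered event log."""
    events: List[Tuple[str, str]] = []
    queue: asyncio.Queue = asyncio.Queue()

    async def producer() -> None:
        async for chunk in fake_synthesize(text):
            events.append(("generated", chunk.decode()))
            await queue.put(chunk)
        await queue.put(None)             # sentinel: no more audio

    async def player() -> None:
        while (chunk := await queue.get()) is not None:
            # A real app would write the chunk to an audio device here.
            events.append(("played", chunk.decode()))

    await asyncio.gather(producer(), player())
    return events


if __name__ == "__main__":
    for kind, word in asyncio.run(speak("hello there how are you")):
        print(kind, word)
```

In the event log, the first word is played before the last word has been generated, which is the behavior a batch (generate everything, then play) design cannot offer.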
A second common scenario is voice cloning: instead of picking from a small library of stock voices, a product team can create a custom voice from a short sample, or commission a higher-fidelity “Pro Voice Clone” for production use, then route both TTS and STT requests through the same account.
Pro Tip: Latency numbers from any benchmark, including the one Gradium cites, are measured under controlled conditions. Before committing to a provider, test time-to-first-audio over the actual network and device your users will be on. A mobile connection adds delay a lab benchmark never sees.
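When running that kind of test, a single measurement says little; it helps to collect many samples and look at the tail as well as the middle. The helper below is a minimal sketch with a hypothetical name (`summarize_ttfa`) that takes your own measured TTFA values in milliseconds and uses a nearest-rank 95th percentile.

```python
import statistics
from typing import Dict, Sequence


def summarize_ttfa(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize repeated time-to-first-audio measurements (milliseconds)."""
    n = len(samples_ms)
    if n < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples_ms)
    # Nearest-rank p95: the smallest sample with at least 95% of samples
    # at or below it. Integer arithmetic avoids float rounding surprises.
    rank = (95 * n + 99) // 100
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Illustrative values only, not measurements of any real provider.
    print(summarize_ttfa([210.0, 190.0, 205.0, 650.0, 198.0, 202.0]))
```

A provider with a good median but a poor p95 will still feel broken to a noticeable share of users, so compare both across the networks you care about.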
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Building a real-time voice agent where response delay breaks the conversation | ✅ | |
| Wanting text-to-speech, speech-to-text, and voice cloning from one vendor instead of three separate tools | ✅ | |
| Producing pre-recorded narration or audiobooks where generation happens offline and latency doesn’t matter | | ❌ |
| Locking in a single voice vendor for a long enterprise contract before any extended track record exists | | ❌ |
| Prototyping a voice feature and benchmarking against the current latency leader | ✅ | |
| Needing a language outside Gradium’s current launch set (English, French, German, Spanish, Portuguese) | | ❌ |
Common Misconception
Myth: A model that tops a latency benchmark today will always be the fastest option.
Reality: Independent latency benchmarks like the one Gradium leads are live snapshots, refreshed regularly as providers ship updates and competitors retune their own models. The benchmark measures time-to-first-audio specifically (how quickly the first sound starts), not the total time to generate a full response, and not the network delay a real user actually experiences. A leaderboard position from one snapshot doesn’t guarantee the same ranking later.
One Sentence to Remember
Gradium’s bet is that in voice AI, shaving the delay between a request and the first sound of a reply is the product, not a footnote — and for now, independent benchmarks back that up, making it worth a serious look if your application needs a conversation to feel live.
FAQ
Q: What is Gradium used for?
A: Gradium provides a single API for real-time text-to-speech, speech-to-text, and voice cloning, mainly used to build voice agents, customer support bots, and in-app voice features that respond without noticeable delay.
Q: Who founded Gradium?
A: Gradium was founded in September 2025 by a team including Kyutai scientist Neil Zeghidour, researchers Laurent Mazaré and Alexandre Défossez, and ex-Google engineer Olivier Teboul, after spinning out of French AI lab Kyutai.
Q: Is Gradium the fastest text-to-speech provider?
A: According to the Gradium blog, its TTS model led time-to-first-audio among major real-time voice providers in a recent published benchmark snapshot. Rankings on live benchmarks can shift as providers update their models.
Sources
- TechCrunch: Paris-based AI voice startup Gradium nabs $70M seed - Company founding, founding team background, and seed funding round.
- Gradium Blog: TTS Latency Benchmark 2026: TTFA Compared Across Gradium, ElevenLabs, Cartesia and Deepgram - Independent latency benchmark showing Gradium’s TTS model leading on time-to-first-audio.
Expert Takes
Gradium’s lead isn’t a flashy parameter count or a clever training trick — it’s systems engineering layered on audio language models that already encode speech as tokens, the same family of models its founders helped build. Streaming the first sound while the rest of the sentence is still generating turns text-to-speech from a batch job into a pipeline. Not a bigger model. A faster pipe around the same kind of model.
If you’re wiring a voice agent into a product, a latency-leading benchmark matters less than whether the integration matches how your app actually talks to it — streaming over a persistent connection, handling interruptions mid-sentence, falling back gracefully when the network stalls. A spec sheet that just says “fastest” tells you nothing about behavior under your own traffic pattern. Test it inside your own pipeline before trusting someone else’s lab numbers.
Voice is becoming the next interface battleground, and the winning move isn’t the best-sounding voice — it’s the fastest one that still sounds convincing. Gradium spun out of a respected research lab with investors willing to bet early, and it’s already forcing established voice-AI players to defend their latency numbers in public. Either the incumbents close the gap fast, or a new default voice provider gets crowned this year.
A voice that responds instantly, sounds natural, and can be cloned from a short sample raises a question nobody in this latency race seems eager to answer: how does a listener tell a live cloned voice from the real person on the other end of the call? Speed and realism are being optimized together, while the safeguards — disclosure, consent for cloning, audio provenance — are arriving as an afterthought, not a launch requirement.