Kokoro TTS
Also known as: Kokoro-82M, Kokoro text-to-speech, hexgrad Kokoro
Kokoro TTS is a compact open-weight text-to-speech model with 82 million parameters. It converts text to natural speech using 54 fixed built-in voices across eight languages and runs faster than real-time on standard CPU hardware, with no cloud access and no voice cloning capability.
What It Is
Most text-to-speech integrations push text to a remote server, pay per character, and hand control to a third party. Kokoro TTS is a local alternative: a model you download once, run entirely on your own hardware, and call as many times as needed with no network requirement, no ongoing cost, and no data leaving your system.
According to Local AI Master, Kokoro-82M (v1.0, released January 27, 2025) weighs approximately 350 MB, runs faster than real-time on CPU, and is licensed under Apache 2.0, which permits commercial use. Think of it as a fixed roster of voice actors rather than a voice-capture tool: every voice was built into the model during training, and none can be captured from a real person on demand.
The underlying architecture is a decoder-only StyleTTS 2 model. StyleTTS 2 treats speech generation as style transfer: the model learns not just how to map text to sound, but how to replicate the prosodic characteristics — rhythm, pitch contour, emphasis — that make speech sound natural rather than robotic. Earlier pipeline architectures like Tacotron pass through an intermediate mel-spectrogram (a frequency-domain representation of audio) before synthesizing a waveform. Kokoro generates audio more directly, which contributes to its speed advantage.
The model ships with 54 built-in voices, according to Local AI Master. The voice list on the Hugging Face model card covers American and British English, Spanish, French, Hindi, Italian, Japanese, Mandarin Chinese, and Brazilian Portuguese. Critically for the ethical context of voice synthesis: Kokoro has no voice cloning capability. You select from the bundled voices at inference time; there is no mechanism to feed it a recording of a real person and produce speech in that person’s voice. This design makes it structurally different from models like XTTS or Cartesia Sonic that explicitly support voice adaptation from audio samples.
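Because the voices are a fixed list, an application can validate a voice name before it ever reaches the model. The sketch below assumes the naming pattern used in the model card's voice list, where an ID such as `af_heart` starts with a language letter and a gender letter. The `describe_voice` helper is hypothetical, and the language codes are an assumption to check against the model card.

```python
# Assumed convention from the Kokoro-82M voice list: "<language><gender>_<name>".
# The codes below are assumptions for illustration; verify them against the model card.
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id: str) -> tuple[str, str, str]:
    """Split a voice ID into (language, gender, name) or raise ValueError."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    lang_code, gender_code = prefix
    if lang_code not in LANGUAGES or gender_code not in GENDERS:
        raise ValueError(f"unknown language or gender prefix: {prefix!r}")
    return LANGUAGES[lang_code], GENDERS[gender_code], name
```

A check like this only catches malformed or unknown prefixes; whether a particular name is actually bundled still has to be confirmed against the published voice list.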
How It’s Used in Practice
The most common use case is an application that needs local text-to-speech output without cloud dependency — a reading assistant, an accessibility layer in a desktop tool, an offline documentation reader, or a podcast production workflow where text narration should stay on the creator’s machine. Because Kokoro runs on CPU without a GPU requirement, the deployment profile is broad: laptops, headless servers, containers, even single-board computers.
A typical integration loads the model file once per session, specifies a voice from the bundled list, passes a text string, and receives an audio array that can be written to a file or piped directly to a speaker. Subsequent calls within a session are fast enough for interactive workflows.
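The flow described above can be sketched with the standard library alone. This is a minimal sketch rather than the official integration: `write_wav`, `speak_to_file`, and `fake_synthesize` are hypothetical helpers, and the 24 kHz sample rate is an assumption taken from the model card's example. The `make_kokoro_synthesizer` adapter assumes the `kokoro` Python package exposes `KPipeline` as shown on the model card; it is imported lazily so the rest of the sketch runs without it.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # assumption: output rate used in the model card's example


def write_wav(samples, path, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] as 16-bit PCM; return the frame count."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # two bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2


def speak_to_file(synthesize, text, voice, path):
    """Run any synthesize(text, voice) callable and save its audio to a WAV file."""
    return write_wav(synthesize(text, voice), path)


def fake_synthesize(text, voice):
    """Stand-in for the model: 0.1 s of a 440 Hz tone, ignoring its inputs."""
    count = SAMPLE_RATE // 10
    return [0.3 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(count)]


def make_kokoro_synthesizer(lang_code="a"):
    """Assumption: the kokoro package provides KPipeline as on the model card."""
    from kokoro import KPipeline  # lazy import; requires the kokoro package

    pipeline = KPipeline(lang_code=lang_code)  # load the model once per session

    def synthesize(text, voice):
        samples = []
        # The pipeline is assumed to yield (graphemes, phonemes, audio) per chunk.
        for _graphemes, _phonemes, audio in pipeline(text, voice=voice):
            samples.extend(float(value) for value in audio)
        return samples

    return synthesize
```

With the package installed, `speak_to_file(make_kokoro_synthesizer(), "Hello there.", "af_heart", "hello.wav")` would be the real call; swapping in `fake_synthesize` exercises the file-writing path without the model.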
Pro Tip: Kokoro’s CPU performance and small model footprint make it a realistic choice for edge deployment — on embedded hardware or in a Docker container with minimal resource allocation. If you are building a voice output feature and want to avoid cloud TTS costs as usage scales, evaluate Kokoro against a production TTS API before committing to the cloud path.
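One way to ground that evaluation is to measure the real-time factor (RTF) on the target hardware: synthesis time divided by the duration of the audio produced, where a value below 1.0 means faster than real-time. The helpers below are a hypothetical sketch. `synthesize` stands for any callable that returns a sequence of samples, and the 24 kHz default is an assumption to adjust to the engine under test.

```python
import time


def real_time_factor(elapsed_seconds, sample_count, sample_rate=24_000):
    """Return synthesis time divided by audio duration (below 1.0 is faster than real-time)."""
    if sample_count <= 0:
        raise ValueError("no audio was produced")
    audio_seconds = sample_count / sample_rate
    return elapsed_seconds / audio_seconds


def measure_rtf(synthesize, text, sample_rate=24_000):
    """Time one synthesize(text) call and convert the result to a real-time factor."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)
```

Running `measure_rtf` over a few representative texts, after one warm-up call so model loading is excluded, gives a figure that can be compared with the round-trip latency of a cloud API on the same inputs.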
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Local app needing offline voice output without cloud costs | ✅ | |
| Reproducing or approximating a specific person’s voice | | ❌ |
| Multilingual TTS across the eight supported languages | ✅ | |
| Building a product that requires a custom brand voice identity | | ❌ |
| Privacy-sensitive workflows where text cannot leave the device | ✅ | |
| Real-time voice streaming under tight low-latency requirements | | ❌ |
Common Misconception
Myth: Kokoro TTS can clone voices from audio samples, making it a tool for producing unauthorized speech in someone else’s voice.
Reality: Kokoro has no voice cloning mechanism. Its voices are fixed and bundled with the model weights — you choose from what was included at training time, and you cannot add external audio as a speaker reference. If you encounter a cloning workflow described as “using Kokoro,” other tooling is doing the cloning. Kokoro itself only synthesizes from its 54 built-in voices.
One Sentence to Remember
Kokoro TTS lets you add speech output to any application without a cloud subscription or privacy exposure — and because it cannot clone voices, it sidesteps the consent questions that make other TTS tools ethically complicated.
FAQ
Q: Is Kokoro TTS free for commercial use? A: Yes. Kokoro TTS is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution. The main conditions are that you include a copy of the license, preserve the existing copyright and attribution notices, and mark any files you modified.
Q: Can Kokoro TTS reproduce a specific person’s voice? A: No. According to Local AI Master, Kokoro provides only its 54 fixed built-in voices. The model has no speaker conditioning mechanism or adapter for external audio, so it cannot be directed at a real person’s voice.
Q: Does Kokoro TTS require a GPU to run at acceptable speed? A: No. According to Local AI Master, Kokoro runs faster than real-time on CPU hardware with model weights of approximately 350MB, making it practical on standard laptops and edge devices without a dedicated GPU.
Sources
- Hugging Face: hexgrad/Kokoro-82M Model Card — official model repository with voice list and usage documentation
- Local AI Master: Kokoro TTS Local Setup (2026): Tiny 82M Open Voice Model — setup guide with verified model specifications
Expert Takes
Kokoro’s architecture tells an interesting story about the efficiency frontier in speech synthesis. StyleTTS 2 reframes voice generation as style transfer: the model learns not just phoneme-to-waveform mapping but the prosodic envelope that makes speech sound human. Kokoro demonstrates that natural-sounding output is achievable at a fraction of cloud TTS model sizes. The constraint is flexibility: fixed voices trade adaptability for consistency, and the 24 kHz output sample rate is adequate for most applications but below the 48 kHz typical of broadcast audio.
When specifying local TTS in a privacy-sensitive workflow, Kokoro’s fixed-voice model is actually a feature, not a limitation. It eliminates consent infrastructure around voice collection because there is no collection. The integration contract is simple — model file, voice name, text in, audio out — and the permissive open-source license means no vendor lock-in or usage audits. For internal tools or edge deployment where internet access is unreliable, this is a well-scoped tool for a clear job.
Compact open-weight TTS models that run on laptop CPUs are compressing a market that cloud providers assumed they owned. Kokoro proves that offline voice synthesis no longer requires a data center, which changes the economics for indie developers and enterprise apps that want to avoid per-character billing. Fixed voices are a current ceiling, not a permanent design commitment — watch what happens when fine-tuning support arrives at models of this size.
Kokoro’s lack of voice cloning is worth naming explicitly — the ethical weight of TTS lives almost entirely in that feature. A model that synthesizes from built-in voices raises no consent questions about copying someone’s voice. But we should resist calling this “ethical TTS.” The question is never whether one model supports cloning; it is whether the tools built on top of it prevent misuse. Kokoro is fine. The ecosystem still needs scrutiny.