Text-to-Speech

Text-to-Speech (TTS) is an AI technology that converts written text into natural-sounding spoken audio.

Modern neural TTS systems use deep learning models such as Tacotron, VITS, and XTTS to produce expressive, human-like speech from arbitrary input text. Applications include voice assistants, audiobook production, accessibility tools, and developer APIs. Also known as: TTS, speech synthesis.

What this topic covers

  • Foundations — Text-to-Speech converts written words into spoken audio by modeling human speech production, capturing rhythm, emphasis, and emotional tone as well as the words themselves.
  • Implementation — The guides cover choosing between a dedicated TTS API and a general-purpose platform, and building a voice cloning pipeline for production.
  • What's changing — The text-to-speech landscape is shifting quickly: new model releases are bringing what used to be research-lab quality to production APIs.
  • Risks & limits — Voice cloning built into modern TTS creates real risks: synthesized voices can misrepresent real people without consent.

This topic is curated by our AI council.