Fish Audio

Also known as: Fish Speech, Fish Audio S2, Fish Audio S2.1 Pro

Fish Audio is an open-weight text-to-speech platform built on a 4B-parameter dual-autoregressive model. It clones voices from short audio clips, supports over 80 languages with fine-grained emotion control, and is available both as a commercial API and as downloadable model weights for self-hosting.

What It Is

Most enterprise TTS options lock you into a proprietary API — you send text, get audio, and have no insight into how the model works or any ability to self-host. Fish Audio positions itself as a counterpoint to that: an open-weight TTS platform where the model architecture and weights are publicly available, alongside a commercial API for teams that don’t want to manage GPU infrastructure.

According to the Fish Audio blog, the current platform model is a 4-billion-parameter, Qwen3-based design using a dual-autoregressive (Dual-AR) architecture. This approach generates speech in two separate passes: one handles the high-level semantic structure of what’s being said, and the other handles fine-grained acoustic detail like timbre, rhythm, and pronunciation. Separating these two concerns is what allows voice cloning to work from a short reference clip: according to Fish Audio, a 10-second sample provides enough acoustic signal to reproduce the key characteristics of a voice.

Think of it like a printmaker working from a single proof. You hand them one page of the original print, and they extract the ink behavior, pressure, and texture from that sample — then reproduce that print across any content. Fish Audio works the same way with voice: one short clip, then generation in that voice across any text.

According to Fish Audio, the platform model supports over 80 languages, and voice style can be adjusted at generation time through more than 15,000 fine-grained emotion and delivery tags — such as [whisper], [excited], or [laughing] — inserted directly into the text input. This tag-based control is more direct than most commercial TTS systems, which typically expose emotion through separate API parameters or coarser preset categories.
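As a minimal sketch of the tag-based control described above, the Python helpers below prepend a delivery tag to a sentence and list the tags found in a string. The function names with_tag and find_tags are hypothetical, and the bracket syntax is assumed from the three example tags. The full tag vocabulary and any placement rules are defined by Fish Audio, not by this code.

```python
import re

# Assumed tag syntax: a lowercase name in square brackets, e.g. [whisper].
TAG_PATTERN = re.compile(r"\[([a-z][a-z _-]*)\]")


def with_tag(text: str, tag: str) -> str:
    """Prefix text with an inline delivery tag such as [whisper]."""
    return f"[{tag}] {text}"


def find_tags(text: str) -> list[str]:
    """Return the tag names that appear in text, in order of appearance."""
    return TAG_PATTERN.findall(text)


line = with_tag("Don't wake the baby.", "whisper") + " " + with_tag("We won!", "excited")
print(line)             # [whisper] Don't wake the baby. [excited] We won!
print(find_tags(line))  # ['whisper', 'excited']
```

Because the tags travel inside the text itself, the same string can be stored, reviewed, and versioned like any other script, with no separate emotion parameters to keep in sync.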

In the 2026 TTS market, where Sonic 3.5 and Gemini TTS represent the closed commercial tier, Fish Audio occupies the open-weight end. According to OfflineTTS, Fish Audio ranks as the top open-weight model on the TTS Arena leaderboard based on public human preference evaluations.

How It’s Used in Practice

The most common way to use Fish Audio is through its REST API: you send a text string along with a reference audio URL (the voice to clone), and the API returns synthesized audio. For teams building content tools — podcast automation, e-learning narration, audiobook production — this means capturing a narrator’s voice once and then generating unlimited audio without re-recording sessions.
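The request flow described above can be sketched with Python's standard library. This is an illustration under stated assumptions, not the documented Fish Audio API: the endpoint URL, the JSON field names, and the bearer-token header are all hypothetical placeholders, and the request is only built, never sent.

```python
import json
import urllib.request

# Hypothetical endpoint, for illustration only; use the URL from the real API reference.
API_URL = "https://api.example.com/v1/tts"


def build_tts_request(text: str, reference_audio_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying text and a voice reference."""
    # Field names below are assumptions, not the service's actual schema.
    payload = {"text": text, "reference_audio_url": reference_audio_url}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_tts_request(
    "Welcome to episode one.",
    "https://example.com/narrator.wav",
    "YOUR_API_KEY",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request would be a call to urllib.request.urlopen(req) followed by writing the response bytes to an audio file. Check the official API reference for the actual endpoint, field names, and authentication scheme before relying on any of these.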

The API’s pricing is byte-based rather than character-based. This matters for multilingual content: Japanese text typically takes three UTF-8 bytes per character and Arabic two, while English takes one, so a string of the same character length costs more to process in those scripts. Teams running multilingual pipelines should account for this when modeling API costs.
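The gap between character count and byte count is easy to check directly. The short Python sketch below compares three five-character greetings. The helper name utf8_stats and the sample strings are illustrative choices, not anything taken from Fish Audio.

```python
def utf8_stats(text: str) -> tuple[int, int]:
    """Return (character count, UTF-8 byte count) for text."""
    return len(text), len(text.encode("utf-8"))


# Three greetings, each five characters long.
samples = {
    "English": "hello",
    "Arabic": "مرحبا",
    "Japanese": "こんにちは",
}

for language, text in samples.items():
    chars, size = utf8_stats(text)
    print(f"{language}: {chars} characters, {size} bytes")

# English: 5 characters, 5 bytes
# Arabic: 5 characters, 10 bytes
# Japanese: 5 characters, 15 bytes
```

Under byte-based billing, the Japanese string is charged for three times as many units as the English one, even though both are five characters long.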

For self-hosting, the model weights are available on GitHub under the Fish Audio Research License.

Pro Tip: When building a voice cloning pipeline, record the reference clip in the same acoustic environment you plan to use for production, and keep that environment quiet. A clean reference gives noticeably better cloning output than a clip recorded in a noisy room, because the model reproduces what it hears, including the background noise.

When to Use / When Not

Scenario | Use | Avoid
Multilingual audio generation with voice cloning | ✓ |
Rapid voice prototyping from a short reference clip | ✓ |
Commercial product deployment without a commercial license | | ✓
English-only high-volume workload where byte-based cost is a concern | | ✓
Research or personal experimentation on self-hosted infrastructure | ✓ |
Enterprise use requiring contractual uptime SLAs or on-premise data guarantees | | ✓

Common Misconception

Myth: Fish Audio is fully open-source and free to use in any product.

Reality: The GitHub repository is released under the Fish Audio Research License, not Apache 2.0 or MIT. According to the Fish Audio GitHub repository, this license restricts commercial use. The platform at fish.audio is a commercial service. Check the license terms before building a product on the self-hosted model weights.

One Sentence to Remember

Fish Audio is the TTS option with the strongest benchmark ranking among open-weight models, but “open-weight” describes the visibility of the model parameters, not the freedom to commercialize them, which requires a separate licensing conversation.

FAQ

Q: How does Fish Audio differ from Kokoro or XTTS for voice cloning?

A: Fish Audio clones a voice from a short clip without fine-tuning the model, while models like XTTS require longer reference audio. Fish Audio also supports more than 80 languages and ranks higher on the TTS Arena than other open-weight alternatives.

Q: Does Fish Audio require a GPU to run locally?

A: Self-hosting the Fish Audio model requires GPU infrastructure. The cloud API at fish.audio handles compute on Fish Audio’s servers, making it accessible without any local GPU setup.

Q: What languages does Fish Audio support?

A: According to Fish Audio, the platform supports more than 80 languages. Because pricing is byte-based (UTF-8), non-Latin scripts — which encode to more bytes per character — cost more per request than English text of the same word count.

Expert Takes

The dual-autoregressive design separates semantic structure from acoustic detail into two generation passes. This is the architectural reason why voice cloning from a short clip works — the model doesn’t need to relearn how the language sounds in general; it only needs enough signal to characterize the speaker’s acoustic identity. Earlier voice cloning architectures mixed these concerns, which is why they required far more reference audio to produce comparable results.

Fish Audio’s byte-based API pricing changes the cost structure for multilingual content pipelines in a way teams often miss until billing arrives. Budget estimates based on word count or character count will be wrong for non-Latin scripts, because Japanese, Arabic, and Chinese characters each encode to multiple UTF-8 bytes. Estimate costs from the actual byte counts of a representative sample of your content before committing the API integration to production specs.
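A budget check along these lines can be sketched in a few lines of Python. The rate used here is a placeholder, not Fish Audio's actual price, and the function names estimate_cost and underestimate_factor are hypothetical. The point is only to show how far a character-based estimate can drift from a byte-based bill.

```python
def total_utf8_bytes(texts: list[str]) -> int:
    """Sum the UTF-8 encoded size of every string in the sample."""
    return sum(len(t.encode("utf-8")) for t in texts)


def estimate_cost(texts: list[str], rate_per_million_bytes: float) -> float:
    """Projected cost from byte counts; the rate is a placeholder value."""
    return total_utf8_bytes(texts) * rate_per_million_bytes / 1_000_000


def underestimate_factor(texts: list[str]) -> float:
    """How many times larger the byte count is than the character count."""
    chars = sum(len(t) for t in texts)
    return total_utf8_bytes(texts) / chars if chars else 1.0


PLACEHOLDER_RATE = 10.0  # hypothetical price per million bytes

sample = ["東京へようこそ", "Welcome to Tokyo"]
print(f"bytes in sample: {total_utf8_bytes(sample)}")
print(f"byte/char ratio: {underestimate_factor(sample):.2f}")
print(f"projected cost:  {estimate_cost(sample, PLACEHOLDER_RATE):.6f}")
```

Running underestimate_factor over a representative content sample gives the multiplier to apply to any existing character-based budget. A ratio near 1.0 means the estimate already holds, while a ratio near 3.0 means the bill will be roughly triple.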

The TTS Arena gives Fish Audio a clear ranking: top open-weight model. For any team that won’t route voice data through a closed commercial API — for privacy, compliance, or cost reasons — that ranking answers the “which open model do we use?” question. The research license on the weights is the only real constraint. Know it, plan around it, and stop treating open-weight as a synonym for unrestricted commercial use.

Fish Audio markets itself as open and accessible, but “open-weight” and “open-source” are different things, and the research license makes that difference commercially meaningful. The pattern is common in AI model releases: weights ship under a custom license that restricts commercial use, while the product documentation leads with the word “open.” Developers who build production features on these weights without a licensing conversation are taking on risk that the branding quietly obscures.