Fish Speech S2 Pro: Open-Source Voice AI That Beats Closed Models
Hands-on with Fish Speech S2 Pro — what it sounds like, what it costs to run locally, and where it actually wins.
Full transcript (from the video)
Fish Audio just released S2 Pro, and this model changes what open-source text-to-speech can do. It is a 5 billion parameter model trained on over 10 million hours of audio across more than 80 languages, and the results are not incremental. On the Seed-TTS evaluation benchmark, S2 Pro scores a 0.54% word error rate. Those numbers beat every closed-source competitor tested, including models from major labs. The weights are on Hugging Face right now under a research license. If you work with voice synthesis, voice cloning, or any pipeline that turns text into speech, this is the model to understand.

Before we get into the architecture, it helps to understand why text-to-speech is hard. The job is not just converting characters to sound. A good TTS model has to handle prosody, which is the rhythm and melody of speech. It has to handle emotion and pacing, and it has to do that across languages with completely different phonetic systems.
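Since the weights are public, pulling them down for a local test is the natural first step. Below is a minimal sketch using huggingface_hub; the repository ID is an assumption, so check Fish Audio's Hugging Face page for the real name and accept the research license there first.

```python
# Sketch: fetch the S2 Pro weights for local experiments.
# The repo_id is a placeholder guess, not confirmed by the video.
# Look up the actual repository on Fish Audio's Hugging Face page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="fishaudio/fish-speech-s2-pro",  # hypothetical repo id
    local_dir="checkpoints/s2-pro",
)
print("Weights downloaded to:", local_dir)
```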
Most TTS systems attack those prosody and language problems by building a separate phoneme pipeline, which means more engineering and more failure points. That is the baseline Fish Speech is working against. The jump from Fish Speech version 1 to S2 Pro is not just a version bump. Version 1.4 was the first to use a large language model backbone, and it worked reasonably well for basic multilingual synthesis. But S2 Pro changes three fundamental things. First, it scales to 5 billion parameters with a completely redesigned architecture. Second, it replaces pure supervised training with reinforcement learning alignment. Third, the control surface went from a handful of emotion presets to thousands of tags plus free-form natural language instructions.
Each of those changes matters on its own. Together, they move Fish Speech from a promising experiment to a production-grade system. The core innovation in S2 Pro is the dual-AR design: instead of one giant transformer doing everything, S2 Pro splits the job between two models.
The slow AR has 4 billion parameters and works along the time axis, one frame at a time. It predicts the primary semantic codebook, which captures what is being said and how it should sound at a high level. The fast AR has 400 million parameters and works within each frame. It generates the remaining nine residual codebooks that encode texture and precise waveform shape. The 10:1 parameter ratio is deliberate: the semantic prediction is the hard part.
Once you know what the speech should sound like, filling in the acoustic detail is a much simpler problem. This split keeps inference fast without giving up quality. The audio codec is the bridge between the transformers and the waveform. S2 Pro uses a residual vector quantization codec, which means that for every second of audio, the slow model makes roughly 21 prediction steps. At each step, the first codebook captures the semantic content; the slow AR is responsible for that. Then codebooks 2 through 10 add progressively finer acoustic detail, and the fast AR handles all of those. This layered approach means you can think of speech generation in two stages: first deciding what to say and how to say it, then filling in exactly what it sounds like. The 21 Hz frame rate keeps the sequence length manageable.
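To make the layered decode concrete, here is a toy sketch of the control flow just described. It is not Fish Audio's code; the stand-in functions, codebook size, and loop structure are illustrative assumptions.

```python
# Toy sketch of a dual-AR decode loop: a slow model picks one semantic
# token per 21 Hz frame, then a fast model fills in the 9 residual
# codebooks for that frame. Placeholder functions stand in for the real
# 4B and 400M transformers; this only illustrates the control flow.
import random

FRAME_RATE_HZ = 21      # codec frames per second of audio
NUM_CODEBOOKS = 10      # 1 semantic + 9 residual codebooks
CODEBOOK_SIZE = 1024    # hypothetical vocabulary size per codebook

def slow_ar_step(text, history):
    """Stand-in for the 4B slow transformer: predict the semantic token."""
    return random.randrange(CODEBOOK_SIZE)

def fast_ar_step(semantic_token, residuals_so_far):
    """Stand-in for the 400M fast transformer: predict one residual token."""
    return random.randrange(CODEBOOK_SIZE)

def generate(text, seconds=1.0):
    frames = []
    for _ in range(int(seconds * FRAME_RATE_HZ)):
        semantic = slow_ar_step(text, frames)       # what to say, coarsely
        frame = [semantic]
        for _ in range(NUM_CODEBOOKS - 1):          # how it sounds, precisely
            frame.append(fast_ar_step(semantic, frame[1:]))
        frames.append(frame)
    return frames  # would be fed to the codec decoder to produce a waveform

print(len(generate("Hello there", seconds=1.0)), "frames for one second of audio")
```

The point of the sketch is the shape of the loop: the expensive model runs once per frame, while the cheap model runs nine times per frame, which is what makes the 10:1 parameter split pay off.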
Compare that 21 Hz frame rate to raw audio at 44,100 samples per second: the codec compresses the sequence length by roughly 2,000 times (44,100 / 21 is about 2,100). Here is the part that makes S2 Pro practical: the dual-AR architecture is structurally isomorphic to a standard decoder-only language model.
That means it is not just inspired by LLMs. It is literally compatible with LLM serving infrastructure.
Fish Audio uses SGLang to serve S2 Pro, which gives them continuous batching, paged KV cache, CUDA graph replay, and RadixAttention prefix caching. On a single H200 GPU, this architecture delivers a real-time factor of roughly 0.2, meaning it generates speech about five times faster than real time. Time to first audio is around 100 milliseconds, and throughput hits over 3,000. Those are production-grade numbers, achieved by reusing the LLM serving ecosystem.
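If the real-time-factor framing is unfamiliar, the arithmetic is simple. The timings below are illustrative placeholders, not measurements from the video.

```python
# Real-time factor (RTF) = time spent generating / duration of audio produced.
# An RTF of 0.2 means 1 second of speech takes about 0.2 s to generate,
# i.e. the system runs about 5x faster than real time.
audio_seconds = 10.0        # length of the clip we asked for
generation_seconds = 2.0    # wall-clock generation time (hypothetical)

rtf = generation_seconds / audio_seconds
speedup = 1.0 / rtf

print(f"RTF: {rtf:.2f}  ({speedup:.0f}x faster than real time)")
# For interactive use, time-to-first-audio (around 100 ms in the video's
# numbers) matters as much as RTF, since it sets when playback can begin.
```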
The training pipeline is where S2 Pro makes its biggest departure. Fish Speech version 1 used straightforward supervised training: you have text and audio pairs, and the model learns to predict one from the other. S2 Pro adds reinforcement learning on top, using a reward model that evaluates four dimensions: word accuracy, which checks whether the speech matches the text; instruction adherence, which checks whether the model followed the control tags; acoustic quality, which rates how natural the audio sounds; and timbre similarity, which measures how well it matches the reference voice. The clever part is that Fish Audio reuses the same model they use for data filtering and annotation. That eliminates the distribution mismatch problem that plagues many reward-model setups, because the reward model already understands the data.
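As a worked illustration of how four reward dimensions can collapse into one training signal, here is a minimal sketch; the scorer names, scales, and equal weighting are assumptions for illustration, not Fish Audio's published recipe.

```python
# Sketch: combining the four reward dimensions described in the video into a
# single scalar for RL alignment. Each scorer would be a learned model in
# practice; the equal weights are an illustrative assumption.
from dataclasses import dataclass

@dataclass
class RewardBreakdown:
    word_accuracy: float          # does the speech match the text?
    instruction_adherence: float  # did it follow the control tags?
    acoustic_quality: float       # how natural does it sound?
    timbre_similarity: float      # how close is it to the reference voice?

def composite_reward(r: RewardBreakdown,
                     weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    parts = (r.word_accuracy, r.instruction_adherence,
             r.acoustic_quality, r.timbre_similarity)
    return sum(w * p for w, p in zip(weights, parts))

example = RewardBreakdown(0.95, 0.80, 0.90, 0.85)
print(f"composite reward: {composite_reward(example):.3f}")
```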
The control system in S2 Pro is unlike anything else in open-source TTS. You embed tags directly in the text using square brackets, and the model follows them. But this is not a fixed set of 20 or 30 predefined emotions. S2 Pro supports over 15,000 of these markers, and it accepts free-form natural language descriptions. You can write "whisper in a small voice," "professional broadcast tone," "speak slowly," or "sarcastic and dry," and the model will adjust its output accordingly. That changes the interface for controlling speech: you are essentially directing the model the way you would direct a voice actor. Tell it how the line should feel, and it adjusts prosody, pacing, volume, and emotion to match. This is a direct result of the reinforcement learning alignment; the model learned to follow instructions, not just imitate labels.

Voice cloning in S2 Pro works the same zero-shot way. You provide a reference audio sample of ten seconds or so, and the model captures the timbre, speaking style, and emotional tendencies of that voice. No fine-tuning, no training loop, no per-voice model. The reference audio goes through the same codec, and the model conditions its generation on those tokens. This means you can clone a voice and immediately start generating new speech in that voice, with full control over emotion and prosody through the same tag system. The practical result is that content creators, game developers, and accessibility tools can generate custom speech on demand. You just need a clean audio sample and the text you want spoken.
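None of this is shown as code in the video, so here is a hedged sketch of what the cloning-plus-direction workflow could look like. S2ProClient, its methods, and the bracket-tag syntax are hypothetical stand-ins, not Fish Audio's documented API.

```python
# Hedged sketch of the zero-shot cloning workflow described in the video.
# S2ProClient is a hypothetical wrapper; check the official Fish Speech
# repository and docs for the real interface and tag syntax.

class S2ProClient:
    """Hypothetical stand-in for however you run S2 Pro locally."""

    def __init__(self, checkpoint_dir):
        self.checkpoint_dir = checkpoint_dir

    def synthesize(self, text, reference_audio=None):
        # A real implementation would encode the reference clip with the codec,
        # condition the dual-AR transformers on those tokens, generate codec
        # tokens for `text`, and decode them back into a waveform. This stub
        # returns empty bytes so the sketch runs end to end.
        return b""

client = S2ProClient("checkpoints/s2-pro")

# Zero-shot cloning: one clean reference clip, no fine-tuning, then direct
# the delivery with a free-form instruction embedded in the text.
audio = client.synthesize(
    text="[whisper in a small voice] I wasn't supposed to tell you this.",
    reference_audio="samples/narrator_clip.wav",
)
```

The point is the shape of the workflow: one reference clip, one tagged string, no per-voice training.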
The benchmarks tell a clear story. On the Seed-TTS evaluation, which is a standard benchmark for speech synthesis quality, S2 Pro scores a 0.54% word error rate; that beats Qwen3 TTS at 0.77%. In English, it scores 0.99%, again the best overall. On the Audio Turing Test, which measures how well humans can distinguish synthesized speech from real recordings, S2 Pro achieves a posterior mean that is 24% better than competing systems. And on EmergentTTS-Eval, which tests naturalness and expressiveness, S2 Pro wins 81.88% of comparisons, with a staggering 91.61% win rate in specific categories. These are not cherry-picked results; they represent consistent performance across multiple independent benchmarks.

One of the most important architectural decisions in Fish Speech is that it skips phonemes entirely. Traditional TTS systems convert text to phonemes first, which means you need separate pronunciation rules for each language. That is fragile engineering: every new language requires linguistic expertise. Fish Speech operates directly on raw text, and the transformer learns the relationship between characters and sound. This is why it scales to 80-plus languages without a proportional engineering effort. Japanese, English, and Chinese get tier-one support; Korean, Spanish, Portuguese, and German are tier two; and then there are over 70 additional languages. The phoneme-free design is what makes that breadth possible.

The performance numbers are production-ready. On a single Nvidia H200 GPU, S2 Pro achieves a real-time factor below 0.2, which means it generates 1 second of speech roughly five times faster than real time. Time to first audio is around 100 milliseconds, which is fast enough for interactive applications, and throughput stays above 3,000 while maintaining that real-time factor. All of this comes from the SGLang integration. Because the model architecture mirrors a standard decoder-only LLM, it inherits continuous batching for handling multiple requests at once, paged KV cache for memory efficiency, CUDA graph replay for reduced kernel launch overhead, and prefix caching for faster repeated generation. These are mature optimizations from the language model world, applied to speech synthesis without modification.

S2 Pro also handles multi-speaker generation. You use speaker tokens to switch between voices within a single generation, and each speaker maintains their own timbre and style throughout. But the more interesting part is the multi-turn capability. Context from earlier in the conversation influences how the next line is delivered: if the conversation starts calm and becomes heated, the model adjusts naturally.
Emotion tags work per speaker.
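To picture what a multi-speaker, per-speaker-emotion prompt might look like, here is a hedged sketch; the [S1]/[S2] speaker tokens and the bracketed emotion tags are assumed syntax for illustration, since the video does not show the exact format.

```python
# Hedged sketch of a multi-speaker, multi-turn prompt. The [S1]/[S2]
# speaker tokens and [tag] emotion markers are assumed syntax; the real
# token format comes from the Fish Speech tokenizer and docs.
dialogue = "\n".join([
    "[S1][calm] So, did you look at the logs from last night?",
    "[S2][nervous] I did. There's something you need to see.",
    "[S1][tense] Show me.",
    "[S2][urgent, hushed] Not here. Too many people are listening.",
])

# Each speaker keeps its own timbre and style across turns, and the
# emotional arc of earlier lines can inform how later lines are delivered.
print(dialogue)
```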