How Fish-S2 Plus ClearVoice Made The Voice Feel Real
An AI voice pipeline is only useful if it can reject bad takes before they ship. Fish-S2 Plus generates the narration, ClearVoice cleans each slide after the mix, and NIQA plus Audio Flamingo score artifacts, speaker match, and expression so cleanup improves rough audio without sanding the voice flat.
Watch (5:35)
Full transcript (from the video)
This voice was generated entirely with AI. In this video, I will show how we built the pipeline behind it, why the workflow matters, and why it is worth getting ahead of this now. These models are going to keep improving, and the people who learn the loop early will have a real advantage. Here is the stack.
Fish-S2 generates the voice, ClearVoice Studio cleans the take, and NIQA and Audio Flamingo score the sound. The hard part was inconsistency. A few slides sounded warm and close.
Other slides carried a raspy edge or a faint digital crackle that made the microphone feel worse than the voice model really was. The obvious move was to keep changing generation settings. We tried lower temperature, different seeds. That helped individual takes, but it did not create a dependable workflow.
The better answer was to generate a candidate, clean it, score it, and only accept it if the voice stays natural, similar to the speaker, and free of obvious artifacts. The scoring loop had to match what I actually hear. One check looks for artifacts: crackle, gaps, clipped sounds, or broken audio. Another check asks whether the voice still has expression. A clean waveform is not enough if the delivery turns flat or robotic.
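The generate, clean, score, accept loop described above could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `Scores` fields, the 0..1 scale, the thresholds, and the `generate`, `clean`, and `score` callables are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Scores:
    quality: float       # overall quality score, 0..1 (hypothetical scale)
    similarity: float    # speaker similarity to the reference clip, 0..1
    expression: float    # how natural the delivery sounds, 0..1
    artifact_hits: int   # detected crackle, gaps, clipped or broken audio


def is_acceptable(s: Scores, min_quality: float = 0.7,
                  min_similarity: float = 0.8,
                  min_expression: float = 0.6) -> bool:
    """Accept only if every signal passes; thresholds are illustrative."""
    return (s.artifact_hits == 0
            and s.quality >= min_quality
            and s.similarity >= min_similarity
            and s.expression >= min_expression)


def first_accepted_take(generate: Callable[[int], str],
                        clean: Callable[[str], str],
                        score: Callable[[str], Scores],
                        max_attempts: int = 5) -> Optional[str]:
    """Generate, clean, and score takes; return the first that passes, else None."""
    for attempt in range(max_attempts):
        candidate = clean(generate(attempt))
        if is_acceptable(score(candidate)):
            return candidate
    return None
```

Returning `None` instead of the least-bad take keeps a failed slide visible, so a person can listen and decide rather than having a weak take ship silently.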
The quality score is only one signal. Simple checks look for clipping. We compare the voice against the reference clip. The goal is not to replace human listening.
It is to keep bad takes from becoming the default. Resemble Enhance was the first tool that made the repair path feel real. A denoise-only pass on the first test clip was immediately useful. It kept the timing aligned, reduced the raspy edge, and did not break the slide duration.
When we combined denoise and enhancement, the score jumped from weak to excellent. The measurement still showed the tradeoff. Speaker similarity fell by several points. Resemble could make the audio cleaner, but sometimes by pulling the voice away from itself.
That made it useful for repair, but risky as the automatic default. The strongest lesson was that more processing is not automatically better. The early pass sounded useful, so we pushed it. Three passes, five passes. Duration still looked stable and peak level stayed below [...]. A basic file inspection would not scream failure. The judge loop did.
After repeated passes, the voice picked up artifact hits and the score dropped. That is exactly why the loop exists. A cleanup model can slowly smear consonants, flatten room tone, or add a plastic edge while still producing a normal-looking WAV file. The ear hears the damage before a simple file check does.
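The gap between a basic file inspection and the judge loop can be illustrated with a small stress test. This is a sketch under stated assumptions: `basic_file_check`, `first_failing_pass`, the peak limit, and the `(score, artifact_hits)` tuple returned by the judge are hypothetical names and shapes, not the pipeline's real interface.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def basic_file_check(samples: Sequence[float], sample_rate: int,
                     expected_seconds: float, tolerance: float = 0.05,
                     peak_limit: float = 0.99) -> bool:
    """Naive inspection: duration near the expected value, peak under the limit."""
    duration = len(samples) / sample_rate
    peak = max((abs(x) for x in samples), default=0.0)
    return abs(duration - expected_seconds) <= tolerance and peak < peak_limit


def first_failing_pass(clip: List[float],
                       clean: Callable[[List[float]], List[float]],
                       judge: Callable[[List[float]], Tuple[float, int]],
                       max_passes: int = 5) -> Optional[int]:
    """Apply `clean` repeatedly and return the 1-based pass at which the judge
    first reports artifact hits or a score lower than the previous pass.
    Returns None if every pass holds up."""
    previous_score, _ = judge(clip)
    for n in range(1, max_passes + 1):
        clip = clean(clip)
        score_value, artifact_hits = judge(clip)
        if artifact_hits > 0 or score_value < previous_score:
            return n
        previous_score = score_value
    return None
```

A clip can keep passing `basic_file_check` after every pass while `first_failing_pass` still flags it, which is the situation the transcript describes.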
ClearVoice changed the decision because it improved the weak clip without pulling the voice away from the speaker. The score moved from weak to strong. Speaker similarity stayed very close to the original. Expression improved, and the loop found no hard reject and no artifact hits.
That balance mattered more than winning one headline score. The model behaved like a voice-preserving cleaner. It helped the weak take while keeping the speaker close enough to remain believable, which is exactly what an automatic cleanup step needs to do. The second test was just as important.
A default enhancer must not damage the strong takes. We ran ClearVoice on a slide that already sounded solid and the score barely moved. The model did not force a dramatic change where the audio was already clean. It showed restraint.
That is the behavior we want. Improve rough slides and avoid turning the deck into a collection of differently processed voices. The best cleanup model is not the one that changes everything. It is the one that knows when to leave the performance alone.
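The "leave strong slides alone" test could be expressed as a small regression check. This is a sketch only: the function names, the tolerance values, and the shape of the results dictionary are assumptions for illustration, not measurements or code from the actual build.

```python
from typing import Dict, List, Tuple


def shows_restraint(score_before: float, score_after: float,
                    similarity_after: float,
                    max_score_shift: float = 0.05,
                    min_similarity: float = 0.9) -> bool:
    """True if cleaning an already strong clip barely moves its score
    and keeps the speaker close to the original."""
    return (abs(score_after - score_before) <= max_score_shift
            and similarity_after >= min_similarity)


def slides_changed_too_much(
        results: Dict[str, Tuple[float, float, float]]) -> List[str]:
    """Given slide_id -> (score_before, score_after, similarity_after) for
    slides that were already strong, list the slides the cleaner disturbed."""
    return sorted(slide_id for slide_id, r in results.items()
                  if not shows_restraint(*r))
```

Running a check like this over a handful of known-good slides before adopting a cleaner as the default gives an early warning if the model starts reshaping audio that did not need help.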
Once ClearVoice won, the implementation boundary mattered. The cleanup step runs after each slide is mixed and before that mixed file is marked complete in the cache. That means a resumed build does not silently reuse old audio from before the cleaner existed. The batch path and the individual slide generation path both call the same cleanup step.
The final music mix is left alone. Background music ducking and export happen after the narration. The result is one enhanced voice WAV per slide, easy to inspect, easy to score, and easy to replace. Automatic cleanup is only half of the workflow.
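The cache boundary described above, clean first and mark complete last, might look something like this. It is a minimal sketch: the file naming, the `.done` marker, the `CLEANER_TAG` value, and the `mix` and `clean` callables are hypothetical and stand in for whatever the real build uses.

```python
from pathlib import Path
from typing import Callable

CLEANER_TAG = "cleaner-v1"  # hypothetical marker; change it when the cleanup step changes


def build_slide(slide_id: str, cache_dir: Path,
                mix: Callable[[str], bytes],
                clean: Callable[[bytes], bytes]) -> Path:
    """Mix the slide, clean it, then mark it complete. The marker is written
    last, so audio that never went through cleanup is never reused."""
    voice_path = cache_dir / f"{slide_id}.voice.wav"
    marker = cache_dir / f"{slide_id}.done"
    if (voice_path.exists() and marker.exists()
            and marker.read_text() == CLEANER_TAG):
        return voice_path  # resumed build: this file already went through cleanup
    voice_path.write_bytes(clean(mix(slide_id)))
    marker.write_text(CLEANER_TAG)
    return voice_path
```

Because both a batch build and a single-slide rebuild would go through the same function, there is only one place where cleanup can be skipped by mistake.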
The repair loop now has a Codex command line wrapper for text fixes. When the cloud path is unavailable, the wrapper starts a fresh small Codex model process and asks for the smallest narration change that fixes the failed speech check. The audio subgraph stays narrow, too. It takes one input clip, one output path, one model choice.
We can generate one audition candidate, put it in project media, listen in the editor, score it, and only then decide whether it should replace the slide. The rule is simple. Do not chase one number. Audio quality is only one signal. If a candidate gets cleaner while losing speaker similarity, that may be worse for this channel.
If a repeated denoise chain looks stable but picks up artifact hits, it should lose immediately. If a model improves a bad slide but damages a good slide, it is not a safe default. ClearVoice won because it balanced the measurements. It improved the rough clip, avoided hard rejects, kept similarity high, and barely touched an already clean slide. That is the behavior we want in every build.
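The "do not chase one number" rule can be written down as a filter that runs before any ranking. The sketch below is illustrative only: the `Trial` fields, the thresholds, and the model names in the usage note are made up for the example and are not results from the video.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    model: str
    weak_gain: float        # score change on the rough clip (positive is better)
    strong_change: float    # score change on an already solid clip
    similarity_drop: float  # speaker similarity lost on the rough clip
    artifact_hits: int      # artifact hits reported by the judge
    hard_reject: bool       # judge rejected the output outright


def is_safe_default(t: Trial, max_similarity_drop: float = 0.03,
                    max_strong_loss: float = 0.02) -> bool:
    """A cleaner qualifies only if it passes every rule, not just one score."""
    if t.hard_reject or t.artifact_hits > 0:
        return False   # artifact hits lose immediately
    if t.similarity_drop > max_similarity_drop:
        return False   # cleaner audio, but drifting away from the speaker
    if t.strong_change < -max_strong_loss:
        return False   # damages a slide that was already good
    return t.weak_gain > 0


def pick_default(trials: List[Trial]) -> Optional[str]:
    """Among safe candidates, prefer the largest gain on the rough clip."""
    safe = [t for t in trials if is_safe_default(t)]
    return max(safe, key=lambda t: t.weak_gain).model if safe else None
```

For example, a hypothetical `cleaner_a` with the biggest gain but a large similarity drop would be filtered out before ranking, so a more modest but voice-preserving `cleaner_b` would be picked.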
Generate with character first, then clean with restraint. Fish-S2 gives us the speaker identity and natural timing. ClearVoice cleans the slide after it becomes a mixed file. The loop protects against two failures: keeping raspy audio just because the speaker matches, and keeping over-processed audio just because the waveform looks clean. The target is simple: realistic voice, close speaker match, and a final clip that sounds ready to ship.