The faster model was the wrong call

I had two noise-suppression models to choose between for a real-time voice pipeline. One was clearly faster: about 2.5× cheaper per audio frame, and at high concurrency it ran circles around the alternative. On a throughput dashboard it was an obvious win. I almost shipped it.

The easy benchmark was wrong

Speed benchmarks are easy to reach for, and they produce a single big number. But “frames per second” wasn’t the thing I actually cared about. The denoiser exists to make downstream steps work better, specifically to help the system tell speech from silence in a noisy call. A model can be fast and still make that job harder.

Benchmark the outcome, not the step

So I built a second, more annoying harness: take clean speech, mix in real-world noise at controlled signal-to-noise ratios, run each denoiser, and measure how well voice-activity detection lined up with the ground truth afterward. Not “how fast?” but “did this actually help the next stage do its job?”
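For the curious, the core of a harness like this can be sketched in a few lines of NumPy. This is a minimal illustration, not the code I actually ran: the names `mix_at_snr`, `frame_f1`, and `evaluate` are made up for this post, the denoisers and the voice-activity detector are assumed to be plain callables you supply, and F1 over per-frame decisions is just one reasonable way to score "lined up with the ground truth."

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech, scaled so the mix has the requested SNR in dB.

    Assumes both inputs are 1-D float arrays and the noise is not all zeros.
    """
    noise = np.resize(noise, clean.shape)  # repeat or trim noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return clean + scale * noise


def frame_f1(predicted, truth):
    """F1 score of per-frame speech/non-speech decisions against ground truth."""
    predicted = np.asarray(predicted, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(predicted & truth)    # speech frames correctly detected
    fp = np.sum(predicted & ~truth)   # silence flagged as speech
    fn = np.sum(~predicted & truth)   # speech frames missed
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 1.0


def evaluate(denoisers, vad, clean, noise, truth, snrs_db):
    """Return {denoiser_name: {snr_db: F1}}.

    `denoisers` maps a name to a callable (audio -> audio); `vad` is a callable
    (audio -> per-frame booleans). Both are hypothetical stand-ins for whatever
    models are under test.
    """
    results = {}
    for name, denoise in denoisers.items():
        results[name] = {
            snr: frame_f1(vad(denoise(mix_at_snr(clean, noise, snr))), truth)
            for snr in snrs_db
        }
    return results
```

One design choice worth copying: always include a do-nothing entry such as `{"passthrough": lambda audio: audio}` in `denoisers`. Without that baseline you can rank two models against each other, but you can't see whether either of them is actually making voice detection better or worse than no denoising at all.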

The answer flipped the decision. At comfortable noise levels both were fine. But in the moderately noisy conditions that matter most, the ones where you actually need denoising, the faster model degraded voice detection, while the slower one clearly improved it. Shipping the fast one would have quietly broken voice detection on exactly the calls that needed help, and no throughput dashboard would ever have told me.

What I keep relearning

When you’re choosing a component, benchmark the property the larger system depends on, even when that benchmark is harder to build than the easy one sitting right there. The cost of the good benchmark is a day of work. The cost of the easy benchmark is a subtle regression you discover in production, months later, on someone’s real phone call.

The faster model really was faster. Speed just wasn’t the property the system depended on.