
The faster model was the wrong call

Audio · Benchmarking · Machine learning


I had two noise-suppression models to choose between for a real-time voice pipeline. One was clearly faster — about 2.5× cheaper per audio frame, and at high concurrency it ran circles around the alternative. On a throughput dashboard it was an obvious win. I almost shipped it.

The benchmark that lies

Speed benchmarks are seductive because they’re easy and they produce a single big number. But “frames per second” wasn’t the thing I actually cared about. The denoiser exists to make downstream steps work better — specifically, to help the system tell speech from silence in a noisy call. A model can be fast and still make that job harder.

Benchmark the outcome, not the step

So I built a second, more annoying harness: take clean speech, mix in real-world noise at controlled signal-to-noise ratios, run each denoiser, and measure how well voice-activity detection lined up with the ground truth afterward. Not “how fast?” but “did this actually help the next stage do its job?”
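The mixing step is the part most worth getting right, so here is a minimal sketch of it, assuming audio is held as plain lists of float samples and that SNR is defined over the whole clip's RMS (a real harness might measure speech level over active speech only). The names `rms` and `mix_at_snr` and the SNR levels in the loop are illustrative, not taken from the original harness.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a non-empty sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise RMS ratio equals `snr_db`, then add it.

    Assumption: SNR is computed over the full clip, silence included.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")
    # SNR_dB = 20 * log10(rms(speech) / rms(gain * noise)), solved for gain.
    gain = rms(speech) / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


# Stand-in signals: a 220 Hz tone as "speech" and Gaussian noise.
# A real harness would load clean speech and recorded noise instead.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

# Arbitrary example levels, from fairly clean to quite noisy.
mixtures = {snr_db: mix_at_snr(speech, noise, snr_db) for snr_db in (20, 10, 5, 0)}
```

Each entry in `mixtures` would then go through each denoiser and on to the voice-activity detector, with the clean speech's labels serving as ground truth.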

The answer flipped the decision. At low noise levels both models were fine. But in the moderately noisy conditions that matter most, the ones where you actually need denoising, the faster model degraded voice detection, while the slower one clearly improved it. Shipping the fast one would have quietly broken the detection of speech edges on exactly the calls that needed help, and no throughput dashboard would ever have told me.
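"Degraded" and "improved" only mean something relative to running the detector with no denoiser at all. Here is a sketch of that comparison, assuming per-frame boolean speech labels and frame-level F1 as the agreement score. The metric is my stand-in for illustration, and `vad_f1` and `denoiser_effect` are hypothetical names.

```python
def vad_f1(predicted, truth):
    """Frame-level F1 between predicted and ground-truth speech labels."""
    if len(predicted) != len(truth):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    if tp == 0:
        return 0.0  # no correctly detected speech frames
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def denoiser_effect(truth, vad_on_noisy, vad_on_denoised):
    """Change in VAD F1 from inserting the denoiser.

    Positive means the denoiser helped the detector; negative means it hurt.
    """
    return vad_f1(vad_on_denoised, truth) - vad_f1(vad_on_noisy, truth)


# Toy labels for illustration only; these are not measurements.
truth = [False, True, True, True, False, False]
vad_on_noisy = [True, True, True, False, False, True]
vad_on_denoised = [False, True, True, True, False, False]
delta = denoiser_effect(truth, vad_on_noisy, vad_on_denoised)
```

Computing this delta per model and per SNR level gives a table where a negative entry flags a denoiser that makes the next stage worse, which a frames-per-second number cannot show.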

The rule I keep relearning

When you’re choosing a component, benchmark the property the larger system depends on, even when that benchmark is harder to build than the easy one sitting right there. The cost of the good benchmark is a day of work. The cost of the easy benchmark is a subtle regression you discover in production, months later, on someone’s real phone call.

Faster was real. It just wasn’t the question.