An order of magnitude more calls per machine

A voice AI call is a pipeline: audio comes in, gets cleaned up, gets recognised as speech, goes to a model, comes back as speech, goes out. Run one of those per phone call and you discover that the language you wrote the pipeline in suddenly matters a great deal.

Why the Python version hit a wall

Python is a fine place to prototype a voice pipeline: the ecosystem is right there. But at real concurrency it runs into structural limits. The global interpreter lock (GIL) serialises CPU-bound work within a process, a single-threaded async event loop saturates, and memory per stream climbs faster than you’d like. You can scale out by throwing more processes and machines at it, but the per-machine efficiency is poor, and for a workload measured in concurrent live calls that efficiency is the whole ballgame.

The Go rewrite

I rebuilt the pipeline in Go around a frame-based model: each stage of the pipeline is a goroutine, audio frames flow between stages over channels, and the stages compose. This plays to exactly what Go is good at: cheap concurrency, predictable memory, real parallelism across cores. The result was roughly an order of magnitude more useful throughput per machine, at a fraction of the memory footprint. In operational terms: many more simultaneous calls on the same hardware, at the same quality bar.
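
To make the frame-based model concrete, here is a minimal sketch of the shape, not the production code. The names Frame, Stage, Compose and gain are hypothetical, and a real stage (noise suppression, recognition, synthesis) would carry state and handle cancellation rather than being a pure function.

```go
package main

import "fmt"

// Frame is a hypothetical unit of audio: a sequence number plus PCM samples.
type Frame struct {
	Seq     int
	Samples []int16
}

// Stage transforms one frame. Real stages would be stateful; a plain
// function keeps the sketch small.
type Stage func(Frame) Frame

// run starts one goroutine for the stage and returns its output channel.
// The output closes once the input is closed and drained, so shutdown
// propagates down the pipeline.
func run(in <-chan Frame, s Stage) <-chan Frame {
	out := make(chan Frame, 8) // small buffer absorbs jitter between stages
	go func() {
		defer close(out)
		for f := range in {
			out <- s(f)
		}
	}()
	return out
}

// Compose chains stages left to right, one goroutine per stage.
func Compose(in <-chan Frame, stages ...Stage) <-chan Frame {
	for _, s := range stages {
		in = run(in, s)
	}
	return in
}

// gain is a stand-in stage that scales every sample (no clipping protection).
func gain(k int16) Stage {
	return func(f Frame) Frame {
		out := make([]int16, len(f.Samples))
		for i, v := range f.Samples {
			out[i] = v * k
		}
		return Frame{Seq: f.Seq, Samples: out}
	}
}

func main() {
	src := make(chan Frame)
	go func() {
		defer close(src)
		for i := 0; i < 3; i++ {
			src <- Frame{Seq: i, Samples: []int16{1, 2}}
		}
	}()
	// Each stage is a single goroutine and channels are FIFO, so frames
	// come out in the order they went in.
	for f := range Compose(src, gain(2), gain(3)) {
		fmt.Println(f.Seq, f.Samples)
	}
}
```

Because every stage has the same signature, adding or reordering stages is a change to the Compose call rather than to the stages themselves. The bounded channels also give backpressure for free: a slow stage makes its upstream neighbours block instead of letting queues grow without limit.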

The part that was actually hard

The rewrite is the headline; the subtleties are the substance. The trickiest one turned out to be a deceptively simple question: when has the agent actually finished speaking?

You need to know this precisely: it’s the instant the agent should start listening again, and getting it wrong means either talking over the caller or leaving dead air. And the answer depends entirely on how the audio is leaving the system. Over one transport, “done” means the local send buffer has drained. Over another, it means the far end acknowledged the audio. Over a third, it means you padded out a tail of silence and heard a marker echo back. The same logical event, “the bot stopped talking”, resolves three completely different ways depending on the path the audio took.
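
One way to picture this is as a lookup from transport to the single event that counts as “done” on that path. The sketch below is illustrative only: the transport and event names are invented labels for the three cases described above, not identifiers from any real library or from the actual system.

```go
package main

import "fmt"

// Transport is a hypothetical label for the path the audio takes out.
type Transport int

const (
	LocalBuffer Transport = iota // done when the local send buffer drains
	AckedStream                  // done when the far end acknowledges the audio
	MarkerEcho                   // done when a marker sent after trailing silence echoes back
)

// Event is a signal observed from the transport layer.
type Event int

const (
	BufferDrained Event = iota
	FarEndAck
	MarkerReturned
)

// doneEvent maps each transport to the one event that means
// "the bot stopped talking" on that path.
var doneEvent = map[Transport]Event{
	LocalBuffer: BufferDrained,
	AckedStream: FarEndAck,
	MarkerEcho:  MarkerReturned,
}

// Finished reports whether ev completes the agent's turn on transport t.
func Finished(t Transport, ev Event) bool {
	want, ok := doneEvent[t]
	return ok && ev == want
}

func main() {
	// A drained buffer ends the turn on one path but not on another.
	fmt.Println(Finished(LocalBuffer, BufferDrained))
	fmt.Println(Finished(AckedStream, BufferDrained))
}
```

The point of the table is that the rest of the pipeline only ever asks one question, “has the turn finished?”, and never needs to know which transport answered it.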

So the pipeline grew a small set of completion strategies, plus a deadman fallback for when none of the expected signals arrive, because in a real call, eventually one of them doesn’t. Most of the reliability of a voice agent lives in unglamorous edges like this: not in the model, but in knowing exactly when one turn ends and the next begins.
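
A deadman fallback of this kind can be sketched as a timeout wrapped around whichever strategy is active. This is an assumption-laden illustration, not the real implementation: Strategy, onSignal, WaitDone, ErrDeadman and the durations are all hypothetical, and the only library calls used are from Go's standard context, errors and time packages.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Strategy blocks until the transport-specific "playback finished" signal
// arrives, or until ctx is cancelled.
type Strategy func(ctx context.Context) error

// ErrDeadman reports that the expected signal never arrived in time.
var ErrDeadman = errors.New("completion signal missing; deadman fired")

// onSignal builds a Strategy from a channel that is closed when the
// transport reports completion (buffer drained, ack received, marker echoed).
func onSignal(done <-chan struct{}) Strategy {
	return func(ctx context.Context) error {
		select {
		case <-done:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// WaitDone runs s under a deadman timer. If the timer expires first it
// returns ErrDeadman, so the caller can resume listening instead of hanging.
// Cancellation of the parent context is passed through unchanged.
func WaitDone(parent context.Context, s Strategy, deadman time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, deadman)
	defer cancel()
	err := s(ctx)
	if err != nil && parent.Err() == nil && errors.Is(err, context.DeadlineExceeded) {
		return ErrDeadman
	}
	return err
}

// simulate models one turn in which the completion signal either arrives
// after 5ms or never arrives at all.
func simulate(signalArrives bool) error {
	done := make(chan struct{})
	if signalArrives {
		time.AfterFunc(5*time.Millisecond, func() { close(done) })
	}
	return WaitDone(context.Background(), onSignal(done), 250*time.Millisecond)
}

func main() {
	fmt.Println(simulate(true))  // signal arrives before the deadman
	fmt.Println(simulate(false)) // signal never arrives; deadman fires
}
```

Returning a distinct error rather than silently succeeding is a deliberate choice in this sketch: the caller still moves on to listening, but the missing signal can be logged and counted, which is how you find out which transport is misbehaving.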

The two takeaways

First: for a concurrency-bound real-time workload, the language and concurrency model aren’t an implementation detail. They set your per-machine ceiling, and Go’s goroutines-and-channels model is a genuinely good fit. Second, and more durable: the hard part of real-time voice isn’t the headline pipeline. It’s the seams: when a turn is over, when playback has truly finished, and what happens when an expected signal never comes. Getting the seams right is what makes the rest of the system reliable; getting them wrong undermines even a good model.