StrawGo: a real-time voice AI framework in Go

StrawGo is an open-source framework I’m building for real-time conversational voice AI in Go. If you’ve ever wired up a voice agent in Python, you know the ceiling you eventually hit: at real call concurrency, the runtime starts fighting you. I wanted to see how far a goroutine-native design could push that ceiling.

What it is

StrawGo is a frame-based pipeline: every stage of a call (audio in, noise suppression, voice-activity detection, turn-taking, the model, speech out) is a composable processor, and audio frames flow between stages over Go channels. It speaks the common telephony and WebRTC transports and plugs into a wide range of speech-to-text, text-to-speech, and language-model providers.
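
To make the shape concrete, here is a minimal sketch of a frame pipeline built from goroutines and channels. It is an illustration under stated assumptions, not StrawGo's actual API: the names Frame, Processor, Stage, Chain, and Run are hypothetical, and the two toy stages (an attenuator standing in for a denoiser, an amplitude gate standing in for voice detection) are placeholders for real models.

```go
package main

import "fmt"

// Frame is a hypothetical unit of audio flowing through the pipeline.
type Frame struct {
	Seq     int
	Samples []int16
}

// Processor is a hypothetical stage: it reads frames from in and
// returns a channel carrying its output frames.
type Processor func(in <-chan Frame) <-chan Frame

// Stage wraps a per-frame function as a Processor that runs in its own
// goroutine. Returning false from fn drops the frame.
func Stage(fn func(Frame) (Frame, bool)) Processor {
	return func(in <-chan Frame) <-chan Frame {
		out := make(chan Frame)
		go func() {
			defer close(out) // closing propagates end-of-stream downstream
			for f := range in {
				if g, ok := fn(f); ok {
					out <- g
				}
			}
		}()
		return out
	}
}

// Chain composes stages left to right over a source channel.
func Chain(src <-chan Frame, stages ...Processor) <-chan Frame {
	for _, s := range stages {
		src = s(src)
	}
	return src
}

// Run pushes frames through a toy two-stage pipeline and returns the
// sequence numbers of the frames that survive.
func Run(frames []Frame) []int {
	src := make(chan Frame)
	go func() {
		defer close(src)
		for _, f := range frames {
			src <- f
		}
	}()

	// Placeholder for a denoiser: halve every sample.
	attenuate := Stage(func(f Frame) (Frame, bool) {
		out := make([]int16, len(f.Samples))
		for i, s := range f.Samples {
			out[i] = s / 2
		}
		f.Samples = out
		return f, true
	})

	// Placeholder for voice detection: drop frames with no loud sample.
	gate := Stage(func(f Frame) (Frame, bool) {
		for _, s := range f.Samples {
			if s > 100 || s < -100 {
				return f, true
			}
		}
		return f, false
	})

	var seqs []int
	for f := range Chain(src, attenuate, gate) {
		seqs = append(seqs, f.Seq)
	}
	return seqs
}

func main() {
	frames := []Frame{
		{Seq: 0, Samples: []int16{0, 10, -10}},
		{Seq: 1, Samples: []int16{4000, -3000, 500}},
		{Seq: 2, Samples: []int16{50, 60, 70}},
		{Seq: 3, Samples: []int16{-900, 0, 0}},
	}
	fmt.Println(Run(frames)) // prints "[1 3]"
}
```

Each stage owns one goroutine and one output channel, so frame order is preserved and backpressure comes for free: a slow stage simply blocks the sender upstream. A real pipeline would also need cancellation (for example via context.Context) and buffered channels tuned per stage, which this sketch leaves out.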

What’s interesting about it

An order of magnitude more throughput per machine, at a fraction of the memory, compared with the Python framework it takes inspiration from. For a workload measured in concurrent live calls, that is the whole game.
An in-process audio pipeline (denoise → voice detection → turn detection) running ONNX models directly, rather than shelling out to separate services.
Honest benchmarking baked in, including a denoiser comparison where the faster model was deliberately rejected because it hurt voice detection in noisy conditions. (I wrote about that here.)
The hard edges done on purpose: a dedicated guard for stale audio, a rate-limited pacer, and a set of strategies for the tricky question of when the agent has actually finished speaking.

It’s the kind of engineering I find most interesting: real-time, resource-bound, and judged by whether a live phone call feels natural.