Real-time voice AI agents

A phone call is the most unforgiving place to put an AI. There’s no spinner, no “thinking…”: if the system pauses for even a second, the caller talks over it, hangs up, or just senses something is off. I build the layer that has to get this right on real customer calls, in a half-dozen Indian languages and the code-switched mix people actually speak.

What it does

A live call flows through a tight loop: recognise speech, decide what to say, and speak it back, all fast enough that the conversation feels human. On top of that sits a layer of judgement: is this a person, a voicemail, an IVR menu? Should the agent take this call or hand it to a human? Each of those decisions has to be made mid-sentence, in the language the caller is using.
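To make the loop concrete, here is a minimal sketch of a single turn checked against a latency budget. It is an illustration, not the production code: `run_turn`, the three stage callables, the injected `clock`, and the budget value are all hypothetical names chosen for the example, and a real pipeline would stream audio rather than run the stages strictly one after another.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    recognise: Callable[[bytes], str],   # hypothetical speech-to-text stage
    decide: Callable[[str], str],        # hypothetical small-model decision stage
    speak: Callable[[str], bytes],       # hypothetical text-to-speech stage
    clock: Callable[[], float],          # returns seconds; injected so it can be faked
    budget_ms: float,
) -> tuple[bytes, dict[str, float], bool]:
    """Run one recognise -> decide -> speak turn and time each stage.

    Returns the reply audio, per-stage timings in milliseconds, and
    whether the whole turn stayed within the budget.
    """
    t0 = clock()
    text = recognise(audio)
    t1 = clock()
    reply = decide(text)
    t2 = clock()
    out = speak(reply)
    t3 = clock()

    timings = {
        "recognise": (t1 - t0) * 1000,
        "decide": (t2 - t1) * 1000,
        "speak": (t3 - t2) * 1000,
    }
    within_budget = (t3 - t0) * 1000 <= budget_ms
    return out, timings, within_budget


def fake_clock(ticks: list[float]) -> Callable[[], float]:
    """Build a clock that replays fixed timestamps, for deterministic tests."""
    it = iter(ticks)
    return lambda: next(it)
```

Injecting the clock keeps the timing logic testable without real sleeps; in a live system the same function would be handed `time.perf_counter`.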

What I worked on

Tuning small language models for live calls. The interesting work isn’t the big model. It’s getting a small, fast model to make reliable decisions under a strict latency budget. I ran a measured prompt-engineering loop on a multilingual evaluation set and improved classification accuracy by double digits while nearly halving both the prompt size and the response latency. Most of the gains came from treating the prompt as an interface: putting the decisive instruction first, flattening taxonomies into concrete cues, and reframing fuzzy “judgement” questions as lookups the model can’t get wrong.
Killing mid-call stalls. A single bad turn-completion decision could send the model into multi-second retry loops while a caller waited in silence. I rebuilt the turn logic and developed a repeatable way to keep instructions alive inside very long prompts, validated across dozens of multilingual test cases until the stalls were gone.
Audio that survives the cloud. A real dropout turned out to have nothing to do with audio code: Kubernetes CPU throttling was starving the event loop and draining the playout buffer. I traced it through pod telemetry and the waveform of one specific call, then added a lookahead pacing cushion with an explicit, documented latency-vs-reliability tradeoff.
A faster pipeline in Go. When the Python runtime became the bottleneck, I started StrawGo, an open-source Go framework for the same real-time voice pipeline, built to fit far more concurrent calls on a single machine.
Observability you can debug with. Custom latency percentiles, provider failover events, and audio-pacing histograms feed dashboards and traces, with enough detail to take a customer complaint and walk it back to the exact millisecond it went wrong.
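The lookahead pacing cushion from the audio item above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the actual fix: it assumes fixed-size audio frames, one clock tick per frame, and a sender that simply skips the ticks during which its event loop is stalled. The function name and parameters are hypothetical.

```python
def count_underruns(total_ticks: int, stalled_ticks: set[int], lookahead_frames: int) -> int:
    """Simulate a paced sender feeding a playout buffer, one tick per frame.

    On every tick the playout side consumes one frame. The sender, when it
    is not stalled, tops the stream up so it is `lookahead_frames` ahead of
    the playout clock. Returns how many ticks found the buffer empty.
    """
    buffered = 0   # frames waiting in the playout buffer
    sent = 0       # total frames the sender has pushed so far
    underruns = 0

    for tick in range(total_ticks):
        if tick not in stalled_ticks:
            # Send this tick's frame plus enough to restore the cushion.
            target = tick + 1 + lookahead_frames
            buffered += target - sent
            sent = target

        # Playout consumes one frame per tick, or drops out if none is ready.
        if buffered > 0:
            buffered -= 1
        else:
            underruns += 1

    return underruns


# A three-tick stall (60 ms at an assumed 20 ms per frame):
stall = {10, 11, 12}
no_cushion = count_underruns(50, stall, lookahead_frames=0)
small_cushion = count_underruns(50, stall, lookahead_frames=2)
full_cushion = count_underruns(50, stall, lookahead_frames=3)
```

The tradeoff shows up directly: a cushion of N frames absorbs a stall of up to N ticks, but it also means N frames of audio are already committed ahead of real time, which is latency the agent pays whenever it needs to stop or change what it is saying.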

Why it’s hard

Everything is difficult at once: telephony-grade audio is low quality, the models are small on purpose, the languages mix freely, and the whole thing runs on shared infrastructure that can stutter at any moment. The work is making something that feels effortless to a caller out of all of that.