The bug wasn't in your code, it was in the scheduler

Real-time systems · Kubernetes · Debugging


A user reported that the voice agent went silent mid-call: a clean gap, then it came back. The obvious suspect was the audio code. The obvious suspect was innocent.

The trap

When real-time audio glitches, every instinct points at the audio path: the encoder, the buffer, the network. I spent a little while there and found nothing wrong. The audio code was doing exactly what it was told — it just wasn’t being given a chance to run.

Following the silence

The breakthrough was refusing to theorise and instead lining up three things in time: the waveform of the exact call that dropped, the process’s view of its own event loop, and the container’s CPU accounting from the platform. They told one story. The event loop had frozen for a stretch — not because it was busy, but because the kernel had taken the CPU away.
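One way to get the process's own view of its event loop is a lag probe: sleep for a short fixed interval and record how late each wake-up is. The sketch below assumes a Python asyncio service, which the story above does not specify. The function names, the 10 ms interval, and the 100 ms threshold are all illustrative. It stores wall-clock timestamps so each stall can be lined up against the call recording and the platform's CPU metrics.

```python
import asyncio
import time


async def measure_loop_stalls(interval_s, threshold_s, duration_s):
    """Sleep for interval_s repeatedly and record wake-ups that arrive late.

    Returns a list of (wall_clock_timestamp, stall_seconds) tuples.
    """
    stalls = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        before = time.monotonic()
        await asyncio.sleep(interval_s)
        # How much longer than requested did the sleep take?
        lag = time.monotonic() - before - interval_s
        if lag > threshold_s:
            stalls.append((time.time(), lag))
    return stalls


async def demo():
    """Run the probe while something holds the loop for about 200 ms."""

    async def freeze():
        await asyncio.sleep(0.05)
        # A blocking sleep stands in for the loop not getting to run.
        time.sleep(0.2)

    monitor = asyncio.create_task(measure_loop_stalls(0.01, 0.1, 0.5))
    await freeze()
    return await monitor


if __name__ == "__main__":
    for when, lag in asyncio.run(demo()):
        print(f"{when:.3f} loop stalled for {lag * 1000:.0f} ms")
```

A probe like this cannot tell a busy loop from a frozen process: both show up as a late wake-up. That is why the stall timestamps only become evidence once they are compared against the CPU accounting.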

On shared infrastructure, a container gets a CPU quota, and when it uses up that quota within an accounting window the scheduler throttles it: the process is frozen until the next window begins. For a batch job, a few milliseconds of throttle is nothing. For a loop that has to emit an audio frame every few milliseconds, it’s a hole in someone’s sentence. The metric that named the culprit was a counter of how many times the process had been throttled, quietly climbing into the tens of thousands.
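On Linux, the kernel exposes this kind of counter in the cgroup's `cpu.stat` file, where `nr_periods` counts accounting windows and `nr_throttled` counts the windows in which the group was throttled. The sketch below parses that file. The default path is an assumption (it is typical inside a container on cgroup v2, while cgroup v1 setups keep the file under the `cpu` controller directory), and the sample numbers are made up for illustration, not taken from the incident.

```python
from pathlib import Path


def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(stats):
    """Fraction of accounting windows in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


def read_cpu_stat(path="/sys/fs/cgroup/cpu.stat"):
    """Read live counters. The path is an assumption and varies by setup."""
    return parse_cpu_stat(Path(path).read_text())


# Illustrative sample in cgroup v2 format (invented values).
SAMPLE = """usage_usec 9000000
user_usec 7000000
system_usec 2000000
nr_periods 50000
nr_throttled 12500
throttled_usec 400000000
"""

if __name__ == "__main__":
    sample_stats = parse_cpu_stat(SAMPLE)
    print(sample_stats["nr_throttled"], throttle_ratio(sample_stats))
```

The counter only ever goes up, so the useful signal is its rate of change. Sampling it periodically and plotting the difference shows when the throttling happened, which is what makes it comparable to the other two timelines.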

The fix, and the honest tradeoff

You can’t always stop the throttle, so the fix was to make the audio loop resilient to it: keep a small lookahead cushion of already-paced audio, so a brief freeze drains the cushion instead of the call. When the cushion is low, yield and immediately refill; when it’s full, pace normally.
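The cushion logic can be sketched as a small function plus a simulation, under assumptions of my own: 20 ms frames, a 240 ms cushion, and a listener that plays audio in real time from the moment the first frame is sent. None of these numbers or names come from the real implementation. The function decides how many frames to send at each wake-up, and the simulation reports the lowest amount of buffered audio the listener ever had.

```python
def frames_to_send(elapsed_ms, sent_frames, frame_ms=20, cushion_ms=240):
    """Frames to emit now so that sent audio stays cushion_ms ahead of playback."""
    buffered_ms = sent_frames * frame_ms - elapsed_ms
    deficit_ms = cushion_ms - buffered_ms
    if deficit_ms <= 0:
        return 0  # cushion is full: nothing to send, pace normally
    return -(-deficit_ms // frame_ms)  # ceiling division: refill in one go


def simulate(total_ms, freeze_start_ms, freeze_ms, frame_ms=20, cushion_ms=240):
    """Return the lowest buffered audio (ms) seen; negative means an audible gap."""
    sent = frames_to_send(0, 0, frame_ms, cushion_ms)  # prefill at t = 0
    lowest = sent * frame_ms
    t = 0
    while t < total_ms:
        # The loop normally wakes once per frame; a throttle makes one wake-up late.
        t += frame_ms + (freeze_ms if t == freeze_start_ms else 0)
        lowest = min(lowest, sent * frame_ms - t)
        sent += frames_to_send(t, sent, frame_ms, cushion_ms)
    return lowest


if __name__ == "__main__":
    # A 150 ms freeze one second into the call.
    print("with cushion:   ", simulate(2000, 1000, 150))
    print("just-in-time:   ", simulate(2000, 1000, 150, cushion_ms=20))
```

In this model the 150 ms freeze leaves 70 ms of audio in hand when the cushion is 240 ms, while a just-in-time sender ends up 150 ms short. In a real asyncio loop, the refill branch would send the missing frames back to back, yielding with `await asyncio.sleep(0)` between them so other tasks still get to run.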

That cushion isn’t free — it adds a bit of latency before the agent can react to someone interrupting. So I wrote the tradeoff down explicitly in the code: roughly a quarter-second of extra barge-in latency, deliberately accepted, because a smooth call matters more than shaving milliseconds off interruptions. A tradeoff you chose and documented is engineering; one you stumbled into is a future incident.
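As an illustration of what writing the tradeoff down in code can look like, here is a hypothetical constant with the reasoning attached and a check that fails loudly if the cushion grows past the accepted budget. The 20 ms frame size, the 12-frame cushion, and the 250 ms budget are assumptions chosen to match "roughly a quarter-second". They are not the real values.

```python
FRAME_MS = 20  # assumed frame duration, not stated in the write-up

# Lookahead cushion, in frames of already-paced audio.
# TRADEOFF (deliberate): audio in the cushion has already been handed off,
# so when the caller interrupts, up to this much speech can still play out.
# We accept that extra barge-in latency to ride out brief CPU throttles.
LOOKAHEAD_FRAMES = 12

# Upper bound on the barge-in latency we agreed to accept (assumed budget).
MAX_ACCEPTED_BARGE_IN_MS = 250


def worst_case_barge_in_ms(lookahead_frames=LOOKAHEAD_FRAMES, frame_ms=FRAME_MS):
    """Extra interruption latency added by the cushion, in milliseconds."""
    return lookahead_frames * frame_ms


# Fails at import time if someone raises the cushion past the budget.
assert worst_case_barge_in_ms() <= MAX_ACCEPTED_BARGE_IN_MS

if __name__ == "__main__":
    print(worst_case_barge_in_ms(), "ms of extra barge-in latency")
```

Keeping the comment, the number, and the check together means anyone who changes the cushion has to see the cost they are changing.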

What I took from it

Two things. First, in a real-time system the question “is my code correct?” is incomplete — the real question is “is my code getting to run when it needs to?” Second, the fastest way out of a hard bug is to stop reasoning from the symptom and start correlating independent sources of truth until they agree. The waveform, the event loop, and the scheduler each told a partial story. Only together did they point at the door.