The deploy that quietly dropped database queries

Every so often, during a routine rolling deploy, a handful of requests would die with a connection-refused error talking to the database, and then everything was fine again. Intermittent, deploy-correlated, and exactly the kind of bug that’s easy to shrug off because it “heals itself.”

What was actually happening

The app talks to Postgres through a local connection pooler running as a sidecar in the same pod. When Kubernetes rolls a pod, it sends the shutdown signal to all the containers in that pod at the same moment. The app does the right thing: it stops taking new work and tries to finish the requests already in flight, which takes a few seconds. But the pooler, getting the same signal at the same time, just exits. So for the length of the drain window, the app is still trying to run queries through a pooler that’s already gone. Connection refused.

The trap is that nothing is wrong with either component. They’re both behaving correctly. The bug is in the ordering: the dependency (the pooler) tore down before the thing that depends on it (the app) had finished.

The one-line fix

Kubernetes lets every container declare a preStop hook, which runs before that container is sent its shutdown signal, and it gives the pod a termination grace period before it escalates to a hard kill. So the fix is to make the pooler simply wait:

lifecycle:
  preStop:
    exec:
      command: ["sleep", "30"]

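Thirty seconds: comfortably more than the app’s drain, and comfortably less than the pod’s grace period, provided that period (terminationGracePeriodSeconds) is set above its 30-second default. Kubernetes runs the hook first and delivers the shutdown signal to the pooler only after the hook returns, so the pooler stays up through the window during which the app is still draining, and only then exits. No code change, no new dependency, and the only cost is that the pod takes longer to terminate. The race condition is gone.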
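
For context, here is a rough sketch of where that hook sits in a pod spec, next to the grace period it has to fit inside. The container names, images, and the 60-second value are placeholders I made up for illustration, not the real manifest.

```yaml
# Hypothetical pod template fragment; names, images, and numbers are placeholders.
spec:
  # Pod-wide budget. The default is 30 seconds, so it has to be raised
  # for a 30-second preStop sleep to fit with room to spare.
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: example.com/app:1.0        # hypothetical image
      # No hook: receives SIGTERM right away and drains in-flight requests.
    - name: pooler
      image: example.com/pooler:1.0     # hypothetical image
      lifecycle:
        preStop:
          exec:
            # Needs a sleep binary inside the pooler image.
            command: ["sleep", "30"]
```

The grace period clock starts when termination begins, so it has to cover both the sleep and whatever time the pooler needs to exit after it finally gets its signal.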

The general lesson

The pattern generalises to anything a slow-draining process depends on inside the same pod: a buffering log shipper, a local cache, a proxy. If component B can’t do its job without component A, and they receive the shutdown signal together, you have to hold A open long enough for B to finish. preStop: sleep is the crude, reliable way to do that.
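
Newer Kubernetes releases also have a built-in way to express this ordering: native sidecar containers, declared as init containers with restartPolicy: Always, which the kubelet stops only after the regular containers have exited. This is a different technique from the preStop sleep above and not what this deploy used. The sketch assumes a cluster recent enough to have the feature enabled (on by default from 1.29, as far as I know), and the names and images are placeholders.

```yaml
# Hypothetical pod template fragment; names and images are placeholders.
spec:
  initContainers:
    - name: pooler
      image: example.com/pooler:1.0   # hypothetical image
      # restartPolicy: Always marks this init container as a sidecar:
      # it keeps running alongside the app and is stopped after it.
      restartPolicy: Always
  containers:
    - name: app
      image: example.com/app:1.0      # hypothetical image
```

The pod’s grace period still bounds the whole sequence, so the app’s drain plus the sidecar’s own shutdown has to fit inside it.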

But the real lesson is the debugging one: “is each component correct?” is the wrong question for a distributed shutdown. The right question is “in what order do these things stop, and who is still depending on whom while it happens?” Once I drew the shutdown sequence on paper, the fix was obvious. Before that, the connection code looked correct and told me nothing.