The four hours we spent chasing a missing log line.
One async consumer, no structured logs, a queue silently filling up — and the logging pattern we now require for every async job.
Summary
On May 2nd at 18:14 BRT, a Celery consumer on the analysis-worker service
silently stopped acknowledging tasks. The queue grew from ~30 messages to
~38,000 before paging triggered. Total customer impact: 4h12m of delayed
analyses, no data loss, no failed payments.
This document is the kind of postmortem we publish openly: what broke, why we didn't see it sooner, and what we changed.
Timeline
- 18:14 — A code deploy lands a new pre-fetch step in the worker.
- 18:17 — The first task hits the new code path, raises a
KeyError, and the worker's outertry/exceptswallows it. The task is not re-queued. - 18:21 — Queue length crosses 1,000.
- 20:09 — On-call gets paged on
queue.depth > 5000. - 22:26 — Root cause identified, hotfix deployed, drain begins.
Why the page came late
We had a queue-depth alert. It triggered. But by the time it triggered, the queue had been growing for almost two hours. The right alert was not "queue deep" — it was "consumer not consuming."
We didn't have that alert because we couldn't write it: the worker had no log line at the boundary of the new code path. The exception was swallowed silently. There was nothing for Datadog to count.
What we changed
Three durable changes:
- Every async handler now emits a
task.startedandtask.finishedlog line, structured (structlog), with the task name and id. This gives us a "consumer alive" metric we can alert on independently of queue depth. - Outer
try/exceptblocks in workers must log the exception and re-raise. We removed the silent swallow. - A Horion rule now flags any handler that catches
Exceptionwithout alog.exception(...)call.
What we did not change
We didn't blame the deploy or the engineer who wrote the code path. The bug was a one-character typo. The system failure was that a typo could silently take down a consumer for four hours.
If a typo can stall production for four hours, the system is wrong, not the typist.
The new rule plus the boundary logs mean the next typo gets caught at the first task, not at task 38,000.