Stradiva
Engineering

On-call is a design problem, not an engineering problem.

You can't fix on-call by paging more carefully. You fix it by giving the on-call engineer fewer reasons to open the laptop.

Daniel Okafor
May 4, 2026 · 6 min read

Most on-call programs are designed by the people who have to be on-call. This produces predictable outcomes — careful runbooks, polite escalation policies, thoughtful schedules. None of it solves the underlying problem, which is that someone is being asked to be functional at 3am.

I used to think the answer was better paging. Fewer false positives, more context, smarter routing. It helped. It didn't fix the problem.

The pages that matter

Every page falls into one of three buckets:

  1. The system noticed something wrong and can't recover. This is the page that should wake you.
  2. The system noticed something wrong and is already recovering. This is the page that shouldn't exist.
  3. The system noticed something is going to be wrong eventually. This is the page that should arrive on Monday.
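
If you can move 80% of your pages out of bucket 1, either into bucket 2 (where the system recovers and the page can be suppressed) or into bucket 3 (where the warning can wait until Monday), on-call stops being a problem.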

The infrastructure for this isn't dashboards. It's automatic recovery you trust, and a separate channel for the kind of warning that doesn't need a human at 3am.
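
As a rough sketch of that routing, assuming each alert carries two flags that say whether something is failing right now and whether automation has already started recovering it. The names `Alert`, `Route`, and `route` are hypothetical and not taken from any particular paging tool.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE_NOW = "page_now"  # bucket 1: broken and not recovering, wake a human
    SUPPRESS = "suppress"  # bucket 2: broken but already recovering, no page
    DEFER = "defer"        # bucket 3: not broken yet, send to the Monday queue


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape; real alerts would carry more context.
    service: str
    failing_now: bool      # is something actually wrong right now?
    auto_recovering: bool  # has automation already started recovery?


def route(alert: Alert) -> Route:
    """Map an alert onto one of the three buckets."""
    if not alert.failing_now:
        # Nothing is broken yet, so this is a warning about the future.
        return Route.DEFER
    if alert.auto_recovering:
        # The system noticed and is healing itself; no human is needed.
        return Route.SUPPRESS
    # Broken and not recovering: the only case that should wake someone.
    return Route.PAGE_NOW
```

The design choice worth noticing is that paging is the fall-through case: an alert has to fail both cheaper checks before it reaches a person.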

Build trust in the automation

The hardest part is the trust. The first time you tell an engineer "you don't need to wake up; the system will heal", they don't believe you. They shouldn't believe you until they've watched it heal twenty times in a row without paging them.

This is why incremental rollout of self-healing matters. You build it for one service, watch it work for a quarter, then turn it on for the next one. The trust compounds. Six months in, on-call stops looking like an emergency room and starts looking like a librarian's desk.
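
One way to make that trust explicit is to track it per service, sketched here under assumptions: the service keeps paging until its automation has healed it a set number of times in a row, and any failed or human-assisted recovery resets the count. The threshold of twenty comes from the text above; the class and method names are hypothetical.

```python
from dataclasses import dataclass

# "Twenty times in a row", as described above. Tune per service.
REQUIRED_STREAK = 20


@dataclass
class HealingTrust:
    """Tracks consecutive unattended recoveries for one service."""

    service: str
    streak: int = 0

    def record(self, healed_without_human: bool) -> None:
        # A clean self-heal extends the streak; anything else resets it.
        self.streak = self.streak + 1 if healed_without_human else 0

    @property
    def trusted(self) -> bool:
        return self.streak >= REQUIRED_STREAK

    def should_page_during_recovery(self) -> bool:
        # Until the automation has earned trust, keep paging even while
        # it is recovering, so engineers can watch it work.
        return not self.trusted
```

Rolling out service by service then amounts to creating one `HealingTrust` per service and only suppressing recovery pages for the ones that have earned it.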

Author
Daniel Okafor
Director of Platform

Daniel has spent ten years shipping firmware and grid-scale software. He writes about reliability culture and the math of operational risk.
