How we handle on-call rotations
On-call ensures that every production reliability issue has a clear owner. The goals are to restore service quickly, escalate early, communicate clearly, and improve the systems involved.
Weekly rotation
Each week has one primary and one secondary on-call engineer.
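As an illustration of the weekly pairing, here is a minimal sketch of a schedule generator. The round-robin ordering (this week's secondary becomes next week's primary), the function name `weekly_rotation`, and the engineer names are assumptions for the example; the policy itself only requires one primary and one secondary per week.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Return a list of (week_start, primary, secondary) tuples.

    Assumption: simple round-robin, where the secondary for one week
    is the primary for the next. The policy does not prescribe this.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    schedule = []
    for i in range(weeks):
        primary = engineers[i % len(engineers)]
        secondary = engineers[(i + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=i), primary, secondary))
    return schedule


if __name__ == "__main__":
    # Hypothetical team and start date, for illustration only.
    for week_start, primary, secondary in weekly_rotation(
        ["ana", "ben", "chi"], date(2024, 1, 1), 4
    ):
        print(week_start, "primary:", primary, "secondary:", secondary)
```

Having the secondary roll into the primary slot is one way to give each engineer a week of context before carrying the pager; other orderings work equally well with the policy.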
Primary
- Acknowledge alerts quickly
- Assess impact and severity
- Start mitigation
- Escalate early
- Declare an incident whenever there is impact
Secondary
- Remain reachable
- Cover planned gaps
- Take over when the primary is unavailable
- Assist with diagnosis, mitigation, and decisions
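The split of responsibilities above can be sketched as a small ownership rule: the primary owns a page unless they are unavailable or have not acknowledged it in time. The `Page` type, the `current_owner` function, and the 10-minute timeout are all hypothetical; the policy does not state a specific acknowledgement deadline.

```python
from dataclasses import dataclass

# Assumed value for illustration; the policy only says "quickly".
ACK_TIMEOUT_MINUTES = 10


@dataclass
class Page:
    alert: str
    minutes_since_fired: int
    acked_by_primary: bool


def current_owner(page, primary, secondary, primary_available=True):
    """Return who should be working the page right now."""
    # The secondary takes over outright when the primary is unavailable.
    if not primary_available:
        return secondary
    # An unacknowledged page past the timeout falls through to the secondary.
    if not page.acked_by_primary and page.minutes_since_fired >= ACK_TIMEOUT_MINUTES:
        return secondary
    return primary
```

For example, `current_owner(Page("api_5xx", 12, False), "ana", "ben")` returns `"ben"` under the assumed timeout, while the same page acknowledged by the primary stays with `"ana"`.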
Expectations
Coverage must be explicit. Arrange swaps in advance for travel or PTO, and as soon as possible in case of illness.
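One way to keep coverage explicit is to check the schedule against known time off before each rotation starts. The sketch below assumes shifts and time off are recorded as inclusive date ranges; the function name `coverage_gaps` and the data shapes are hypothetical, not part of any existing tool.

```python
from datetime import date


def coverage_gaps(shifts, time_off):
    """Find on-call shifts that overlap the assigned engineer's time off.

    shifts:   list of (start, end, engineer), dates inclusive
    time_off: dict mapping engineer -> list of (start, end), dates inclusive
    Returns a list of (engineer, gap_start, gap_end) needing a swap.
    """
    gaps = []
    for shift_start, shift_end, engineer in shifts:
        for off_start, off_end in time_off.get(engineer, []):
            # Two inclusive ranges overlap when each starts before the other ends.
            if off_start <= shift_end and shift_start <= off_end:
                gaps.append(
                    (engineer, max(shift_start, off_start), min(shift_end, off_end))
                )
    return gaps
```

Each returned tuple is a span that needs a swap arranged before the shift begins.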
Response principles
Prefer safe, reversible mitigations. Follow runbooks, roll back where needed, avoid long stretches of solo debugging, and leave clear notes for anyone who joins to help.
Post-incident duties
Capture context, create follow-up work items, improve alerts, runbooks, and tooling, and include any residual risks in the handoff.
Sustainability
Pager stress is a systems issue, not a personal one. Repeated alert noise should trigger system improvements rather than being accepted as normal.
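To make repeated noise visible, a rotation could review its pages at handoff and flag alerts that fire often but rarely need action. This is a minimal sketch: the function name `noisy_alerts`, the input shape, and both thresholds are assumptions chosen for illustration, not values from the policy.

```python
from collections import Counter


def noisy_alerts(pages, min_pages=3, max_actionable_ratio=0.5):
    """Return alert names that paged repeatedly but were mostly not actionable.

    pages: iterable of (alert_name, was_actionable) pairs for one rotation.
    min_pages and max_actionable_ratio are assumed thresholds.
    """
    total = Counter()
    actionable = Counter()
    for name, was_actionable in pages:
        total[name] += 1
        if was_actionable:
            actionable[name] += 1
    return sorted(
        name
        for name, count in total.items()
        if count >= min_pages and actionable[name] / count < max_actionable_ratio
    )
```

Alerts returned by this check are candidates for follow-up work such as tuning thresholds, fixing the underlying cause, or removing the alert.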