How we handle on-call rotations

On-call ensures that every production reliability issue has a clear owner. The goals are to restore service quickly, escalate early, communicate clearly, and improve the systems involved.

Weekly rotation

Each week has one primary and one secondary on-call engineer.
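
As an illustration only, the weekly assignment can be sketched as a simple round-robin. The roster, the start date, and the rule that next week's primary serves as this week's secondary are assumptions made for the example, not part of this policy; on_call_for is a hypothetical helper.

```python
from datetime import date


def on_call_for(day: date, engineers: list[str], rotation_start: date) -> tuple[str, str]:
    """Return (primary, secondary) for the rotation week containing `day`.

    Assumption: weeks are counted in 7-day blocks from `rotation_start`,
    and the secondary is the next engineer in the roster.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    # Floor division keeps the mapping consistent for dates before the start, too.
    weeks_elapsed = (day - rotation_start).days // 7
    primary = engineers[weeks_elapsed % len(engineers)]
    secondary = engineers[(weeks_elapsed + 1) % len(engineers)]
    return primary, secondary


# Hypothetical roster and start date, for illustration only.
roster = ["ana", "ben", "chi"]
start = date(2024, 1, 1)
print(on_call_for(date(2024, 1, 10), roster, start))
```

A real schedule would also need to account for swaps; this sketch only covers the default order.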

Primary

Acknowledge alerts quickly
Assess impact and severity
Start mitigation
Escalate early
Declare an incident whenever there is impact

Secondary

Remain reachable
Cover planned gaps
Take over when the primary is unavailable
Assist with diagnosis, mitigation, and decisions
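
The split between primary and secondary can be sketched as a small routing rule. This is a minimal sketch under stated assumptions: the 10-minute acknowledgment timeout, the AlertState fields, and responder_for are all hypothetical and are not defined by this document.

```python
from dataclasses import dataclass


@dataclass
class AlertState:
    primary_available: bool  # False during a planned gap or known absence
    primary_acked: bool      # True once the primary has acknowledged the alert
    minutes_unacked: float   # minutes since the alert fired without an acknowledgment


def responder_for(state: AlertState, ack_timeout_minutes: float = 10.0) -> str:
    """Return which role should own the alert right now.

    Assumption: the timeout value is illustrative; this document does not set one.
    """
    if not state.primary_available:
        # Planned gap or absence: the secondary covers.
        return "secondary"
    if not state.primary_acked and state.minutes_unacked >= ack_timeout_minutes:
        # The primary has not responded in time: the secondary takes over.
        return "secondary"
    return "primary"


print(responder_for(AlertState(primary_available=True, primary_acked=False, minutes_unacked=12.0)))
```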

Expectations

Coverage must be explicit. Arrange swaps proactively for travel, PTO, or illness.

Response principles

Prefer safe, reversible mitigations. Follow the runbooks, roll back where needed, avoid long stretches of single-threaded debugging, and leave clear notes for anyone who joins to help.

Post-incident duties

After an incident, capture context, create follow-up work, improve alerts, runbooks, and tooling, and include any residual risks in the handoff.

Sustainability

Pager stress is a systems issue. Repeated alert noise should lead to system improvements, not be accepted as normal.
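
One possible way to make repeated noise visible is to count non-actionable pages per alert and flag the repeat offenders for follow-up work. The sketch below assumes each page is recorded as an (alert name, was actionable) pair and uses an arbitrary threshold of three; the record shape, the threshold, the alert names, and noisy_alerts are hypothetical.

```python
from collections import Counter


def noisy_alerts(pages: list[tuple[str, bool]], min_noise_pages: int = 3) -> list[str]:
    """Return alert names with at least `min_noise_pages` non-actionable pages.

    `pages` holds (alert_name, was_actionable) pairs; the result is ordered
    from most to least noisy. The threshold is an illustrative assumption.
    """
    noise = Counter(name for name, actionable in pages if not actionable)
    return [name for name, count in noise.most_common() if count >= min_noise_pages]


# Hypothetical page log for one rotation week.
week = (
    [("disk_full", False)] * 4
    + [("latency_p99", False)] * 3
    + [("db_down", True)] * 5
    + [("cert_expiry", False)]
)
print(noisy_alerts(week))
```

Alerts returned by a check like this are candidates for tuning, rerouting, or removal.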