Case: P1 incident response with structured runbook and postmortem
Production outage, SRE + on-call + engineering lead + product — 99.9% SLA commitment, customer-facing impact, executive visibility.
Before
- Response coordinated via Slack — who was doing what was unclear at minute 15
- Rollback executed without documented confirmation that it was safe
- Postmortem written as a Google Doc two days later — action items assigned informally in the meeting, never followed up
After
- Incident runbook opened as soon as the page fires: detection, triage, containment, and resolution as explicit steps, each with a named owner
- Rollback required documented confirmation from on-call lead before execution
- Postmortem structured as a flow run: root cause fields required, action items assigned to named engineers with due dates
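The three controls above can be sketched as simple checks in code. This is a minimal illustration, not the tooling used in the case: the four step names come from the runbook, while `Incident`, `ActionItem`, `postmortem_errors`, and every field name are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Step names are taken from the case; all other names are hypothetical.
RUNBOOK_STEPS = ("detection", "triage", "containment", "resolution")


@dataclass
class Incident:
    owners: dict[str, str] = field(default_factory=dict)  # step -> owner
    rollback_confirmed_by: Optional[str] = None

    def unowned_steps(self) -> list[str]:
        """Return runbook steps that still have no owner, in runbook order."""
        return [step for step in RUNBOOK_STEPS if not self.owners.get(step)]

    def confirm_rollback(self, on_call_lead: str) -> None:
        """Record who confirmed that the rollback is safe."""
        self.rollback_confirmed_by = on_call_lead

    def execute_rollback(self) -> str:
        """Refuse to roll back until a confirmation has been recorded."""
        if self.rollback_confirmed_by is None:
            raise PermissionError(
                "rollback requires documented confirmation from the on-call lead"
            )
        return f"rollback executed (confirmed by {self.rollback_confirmed_by})"


@dataclass
class ActionItem:
    description: str
    owner: str
    due: Optional[date]


def postmortem_errors(root_cause: str, items: list[ActionItem]) -> list[str]:
    """List what blocks the postmortem from being closed; empty means complete."""
    errors = []
    if not root_cause.strip():
        errors.append("root cause is required")
    for item in items:
        if not item.owner.strip():
            errors.append(f"no owner: {item.description}")
        if item.due is None:
            errors.append(f"no due date: {item.description}")
    return errors
```

Raising on an unconfirmed rollback, rather than logging a warning, makes the confirmation a hard gate. Returning a list of errors for the postmortem lets a reviewer see every missing field at once.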