<aside>
💡
A no-fluff, chronological checklist for incident response and postmortems.
Use during live incidents or when facilitating a review.
</aside>
<aside>
⚠️
This process is blameless.
Focus on systems, not individuals.
Draft the postmortem within 48 hours.
</aside>
Incident Response (During Incident)
- [ ] Acknowledge and classify the incident
- Immediately acknowledge the alert/page.
- Determine severity (e.g. Sev-1, Sev-2).
- Mobilize the appropriate on-call responders.
- [ ] Declare the incident & open a channel
- Create a dedicated Slack channel: `#inc-sev1-checkout` (Sev-1), `#inc-sev2-api` (Sev-2), etc.
- Post an announcement: what's broken, severity, Incident Commander.
- [ ] Assign roles
- Incident Commander (IC)
- Tech Lead (for troubleshooting)
- Comms Lead (for updates)
- Scribe (for notes/timeline)
- [ ] Notify stakeholders
- Internal updates in Slack/email.
- External comms via status page if customers impacted.
- [ ] Gather the response team
- Ensure all relevant engineers join the Slack channel.
- Pin dashboards, logs, runbooks, and the Linear ticket.
- 💡 Escalate as needed: bring in SMEs, involve leadership, or rotate shifts if the incident is long-running.
- [ ] Log a timeline in real time
- Record facts only:
"18:42 - Deployed version 5.2."
"18:47 - CPU usage 90% on DB cluster."
- [ ] Mitigate and resolve
- IC coordinates who does what.
- Implement rollback/mitigation.
- [ ] Maintain communication cadence
- Post updates every 15–30 minutes.
- Note major milestones.
- [ ] Confirm resolution
- Declare resolved in Slack:
"Resolved at 19:30 - reverted LB config."
- Thank the team.
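
The timeline and update-cadence steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `Timeline` class, the `update_overdue` helper, and the calendar date are hypothetical. Only the two sample entries and the 30-minute upper bound come from the checklist.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Timeline:
    """Hypothetical scribe helper: keeps timestamped facts in chronological order."""
    entries: list[tuple[datetime, str]] = field(default_factory=list)

    def log(self, when: datetime, fact: str) -> None:
        # Record a fact, then re-sort so late entries land in the right place.
        self.entries.append((when, fact))
        self.entries.sort(key=lambda entry: entry[0])

    def render(self) -> list[str]:
        # One "HH:MM - fact" line per entry, matching the checklist examples.
        return [f"{when:%H:%M} - {fact}" for when, fact in self.entries]


def update_overdue(last_update: datetime, now: datetime,
                   max_gap_minutes: int = 30) -> bool:
    """True when more than max_gap_minutes have passed since the last update."""
    return now - last_update > timedelta(minutes=max_gap_minutes)


# Illustrative date; the entries are the checklist's own examples.
timeline = Timeline()
timeline.log(datetime(2025, 9, 13, 18, 47), "CPU usage 90% on DB cluster")
timeline.log(datetime(2025, 9, 13, 18, 42), "Deployed version 5.2")
print("\n".join(timeline.render()))
```

Sorting on every `log` call is a deliberate simplification: it lets the scribe add a fact after the moment has passed without breaking chronological order.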
Post-Incident Documentation (Within 24–48h)
- [ ] Apply the 48-hour rule
- Start the postmortem draft within 48h, while details are fresh.
- [ ] Open a postmortem doc
- Title: `Postmortem - <Incident Name> - <Date>`
- Link the Slack transcript, dashboards, and Linear issue.
- [ ] Fill in incident facts
- Summary: what happened, when, impact.
Example: "2025-09-13, Sev-1 checkout outage, 80% of users impacted for 50 minutes."
- [ ] Document impact
- Quantify: percentage of failed requests, number of customers affected, revenue loss, support tickets.
- [ ] Construct timeline
- From Slack notes. Chronological, factual, no blame.
- [ ] Identify root causes
- 💡 Ask: "How did our system allow this?" not "Who messed up?"
- 💡 Use 5 Whys.
- 💡 Focus on systems/process, not individuals.
- 💡 Capture what went well (e.g. automated failover worked).
- [ ] List action items
- Assign each item a single owner and a deadline. Track in Linear/Jira. Aim to close more than 85% of items.
- Make each item specific and verifiable.
Bad: "Improve monitoring."
Good: "Add alert for DB replication lag > 5s."
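
To make the closure target concrete, here is a minimal Python sketch of how the closure rate and overdue items could be computed. The `ActionItem` type, the helper names, the owners, and the dates are hypothetical; only the 85% target and the sample item titles come from the checklist. In practice these numbers would come from Linear or Jira rather than a hand-built list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical record for one postmortem action item."""
    title: str
    owner: str        # exactly one owner per item
    deadline: date
    closed: bool = False


def closure_rate(items: list[ActionItem]) -> float:
    """Fraction of items that are closed; 0.0 for an empty list (an assumption)."""
    if not items:
        return 0.0
    return sum(item.closed for item in items) / len(items)


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose deadline has already passed."""
    return [item for item in items if not item.closed and item.deadline < today]


# Illustrative data: owners and dates are made up for the example.
items = [
    ActionItem("Add alert for DB replication lag > 5s", "alice", date(2025, 9, 20), closed=True),
    ActionItem("Improve monitoring", "bob", date(2025, 9, 27)),
]
rate = closure_rate(items)
meets_target = rate > 0.85  # checklist target: more than 85% closed
late = overdue(items, today=date(2025, 10, 1))
print(f"closure rate: {rate:.0%}, meets target: {meets_target}, overdue: {len(late)}")
```

The vague "Improve monitoring" item appears here as the open, overdue one on purpose: items without a verifiable end state tend to be the ones that never close.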