<aside> 💡

A no-fluff, chronological checklist for incident response and postmortems.

Use during live incidents or when facilitating a review.

</aside>

<aside> ⚠️

This process is blameless.

Focus on systems, not individuals.

Draft the postmortem within 48 hours.

</aside>

Incident Response (During Incident)

[ ] Acknowledge and classify the incident
- Immediately acknowledge the alert/page.
- Determine severity (e.g. Sev-1, Sev-2).
- Mobilize the appropriate on-call responders.
[ ] Declare the incident & open a channel
- Create a dedicated Slack channel: #inc-sev1-checkout (Sev-1), #inc-sev2-api, etc.
- Post announcement: what’s broken, severity, Incident Commander.
[ ] Assign roles
- Incident Commander (IC)
- Tech Lead (for troubleshooting)
- Comms Lead (for updates)
- Scribe (for notes/timeline)
[ ] Notify stakeholders
- Internal updates in Slack/email.
- External comms via status page if customers are impacted.
[ ] Gather the response team
- Ensure all relevant engineers join the Slack channel.
- Pin dashboards, logs, runbooks, Linear ticket.
- 💡 Escalate as needed: bring in SMEs, involve leadership, or rotate shifts if long-running.
[ ] Log a timeline in real-time
- Record facts only: “18:42 – Deployed version 5.2”, “18:47 – CPU usage 90% on DB cluster.”
[ ] Mitigate and resolve
- IC coordinates who does what.
- Implement rollback/mitigation.
[ ] Maintain communication cadence
- Post updates every 15–30 minutes.
- Note major milestones.
[ ] Confirm resolution
- Declare resolved in Slack: “Resolved at 19:30 – reverted LB config.”
- Thank the team.
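
The channel naming and timeline conventions above can be sketched in Python. This is a minimal illustration, not a prescribed tool: the helper names `channel_name` and `timeline_entry` are hypothetical, and the output formats simply mirror the examples given in the checklist.

```python
from datetime import datetime


def channel_name(severity: int, service: str) -> str:
    """Build a channel name such as '#inc-sev1-checkout' (hypothetical helper)."""
    # Lowercase the service name and replace spaces so it is a valid channel slug.
    slug = service.strip().lower().replace(" ", "-")
    return f"#inc-sev{severity}-{slug}"


def timeline_entry(when: datetime, fact: str) -> str:
    """Format one factual timeline line as 'HH:MM – fact'."""
    return f"{when:%H:%M} – {fact.strip()}"


if __name__ == "__main__":
    print(channel_name(1, "checkout"))
    print(timeline_entry(datetime(2025, 9, 13, 18, 42), "Deployed version 5.2"))
```

Keeping the format in one small helper makes it easier for the Scribe to produce consistent, fact-only entries that can later be copied into the postmortem timeline.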

Post-Incident Documentation (Within 24–48h)

[ ] 48-hour rule
- Start drafting the postmortem within 48h, while details are fresh.
[ ] Open a postmortem doc
- Title: Postmortem – <Incident Name> – <Date>
- Link Slack transcript, dashboards, Linear issue.
[ ] Fill in incident facts
- Summary: what happened, when, impact. Example: “2025-09-13, Sev-1 checkout outage, 80% of users impacted for 50 mins.”
[ ] Document impact
- % of failed requests, # of customers, revenue loss, support tickets.
[ ] Construct timeline
- From Slack notes. Chronological, factual, no blame.
[ ] Identify root causes
- 💡 Ask: "How did our system allow this?" not "Who messed up?"
- 💡 Use 5 Whys.
- 💡 Focus on systems/process, not individuals.
- 💡 Capture what went well (e.g. automated failover worked).
[ ] List action items
- Assign a single owner and a deadline to each item. Track in Linear/Jira. Aim for a closure rate above 85%.
- Make each item specific and verifiable. Bad: “Improve monitoring.” Good: “Add alert for DB replication lag > 5s.”
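
As a rough sketch of how the action-item guidance could be checked, the Python below models an item with one owner and a deadline, then computes the closure rate and lists overdue items. The `ActionItem` fields and the `closure_rate` and `overdue` helpers are assumptions for illustration only. In practice this data would come from Linear or Jira, and this sketch does not use either tool's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str          # specific and verifiable, e.g. an alert threshold
    owner: str          # exactly one owner per item
    deadline: date
    closed: bool = False


def closure_rate(items: list[ActionItem]) -> float:
    """Return the fraction of items that are closed (0.0 when there are none)."""
    if not items:
        return 0.0
    return sum(item.closed for item in items) / len(items)


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose deadline has already passed."""
    return [item for item in items if not item.closed and item.deadline < today]


if __name__ == "__main__":
    # Illustrative data only; owners and dates are made up.
    items = [
        ActionItem("Add alert for DB replication lag > 5s", "alice", date(2025, 9, 20), closed=True),
        ActionItem("Add rollback step to deploy runbook", "bob", date(2025, 9, 27)),
    ]
    rate = closure_rate(items)
    print(f"Closure rate: {rate:.0%} (target: above 85%)")
    for item in overdue(items, today=date(2025, 10, 1)):
        print(f"Overdue: {item.title} (owner: {item.owner})")
```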