
Post-incident reviews: how to avoid repeat incidents

2024-12-10
8 min read
Davoox Team
incident-management, rca, continuous-improvement, best-practices

The most valuable phase of incident management happens after the crisis ends. Learn how to run effective post-incident reviews.

If incident response is about restoring service, post-incident review (PIR) is about earning back time—by ensuring you don’t fight the same fire next month.

And yet, many PIRs fail in predictable ways:

  • They devolve into blame (“who broke it?”).
  • They generate vague action items (“improve monitoring”).
  • They happen weeks later, when context is gone.
  • They focus on “the root cause” while ignoring systemic contributors.

This article is a practical playbook to run PIRs that prevent repeats and measurably improve reliability.

What a PIR is (and isn’t)

A PIR is:

  • A structured learning session to understand what happened, why it happened, and how to reduce future risk.

A PIR is not:

  • A performance review
  • A legal interrogation
  • A place to “prove” a single root cause when reality is multi-factor

Timing: run it while memory is fresh

Recommended:

  • SEV1: PIR within 3–5 business days
  • SEV2: within 1–2 weeks
  • SEV3: optional (or do a lightweight review if it’s recurring)

If you wait too long, you lose the details that matter (decision points, ambiguous signals, human context).

Inputs: gather evidence before the meeting

Great PIRs are evidence-led. Collect:

  • Incident ticket(s)
  • Chat transcript and key threads
  • Timeline doc (events + decisions with timestamps)
  • Monitoring graphs and alerts
  • Deployments/changes near the start time
  • Vendor status updates (if relevant)

Assign someone (often the IC or a facilitator) to produce a first-pass timeline before the meeting.

A simple PIR structure (60–90 minutes)

1) Context (5 minutes)

  • Severity, duration, impact
  • A short statement of what users experienced

2) Timeline (15–25 minutes)

Walk through:

  • Detection: how did we learn about it?
  • Declaration: how quickly did we mobilize?
  • Mitigation: what reduced impact?
  • Resolution: what fixed the underlying issue?
  • Stabilization: what confirmed recovery?

Keep the timeline factual. Avoid interpretation until the next section.

3) Contributing factors (20–30 minutes)

Instead of hunting for one “root cause,” identify contributing factors across categories:

  • Technical: defects, capacity limits, brittle dependencies
  • Process: risky change practices, missing runbooks, unclear ownership
  • People: fatigue, handoff gaps, knowledge silos
  • Signals: missing alerts, noisy dashboards, unclear logs
  • Third parties: vendor outages, unclear escalation paths

Ask “what conditions made this incident possible or prolonged?”

4) Improvements and actions (15–25 minutes)

Turn findings into specific changes with owners and due dates.

5) Close (5 minutes)

  • Confirm the final narrative
  • Confirm action owners
  • Decide what to share and with whom

How to keep it blameless (and still accountable)

Blameless doesn’t mean “nobody is responsible.” It means:

  • We assume people made reasonable decisions with the information they had.
  • We examine the system that shaped those decisions.
  • We still assign owners to improvements.

Practical facilitator moves:

  • Replace “why did you…” with “what did you see that led to…”
  • Ask “what made the right action hard?”
  • Capture uncertainty explicitly (“At 10:12 we believed X based on Y”)

Action items that prevent repeats

The best action items:

  • Reduce likelihood (prevention) or reduce impact/time to mitigate (resilience)
  • Are testable and verifiable
  • Have a single owner and a due date

Examples of strong action items

  • “Add alert when payment success rate drops below 98% for 5 minutes; on-call runbook includes immediate rollback steps. Owner: A. Due: Feb 12.”
  • “Introduce feature flag kill switch for new checkout flow; integrate into incident checklist. Owner: B. Due: Mar 1.”
  • “Automate DB connection pool saturation dashboard + page on threshold; add synthetic checkout test every 2 minutes. Owner: C. Due: Feb 20.”
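The first action item above is verifiable precisely because it encodes a threshold and a window. As a minimal sketch (the 98%/5-minute numbers come from the example; the per-minute sampling is an assumption, not a real monitoring API):

```python
from collections import deque

def make_success_rate_monitor(threshold=0.98, window_minutes=5):
    """Return a recorder for per-minute success-rate samples that
    fires when every sample in the window is below the threshold."""
    samples = deque(maxlen=window_minutes)

    def record(rate):
        samples.append(rate)
        # Alert only once the window is full and all samples breach.
        return len(samples) == window_minutes and all(r < threshold for r in samples)

    return record

check = make_success_rate_monitor()
readings = [0.99, 0.97, 0.96, 0.95, 0.97, 0.96]  # one sample per minute
alerts = [check(r) for r in readings]  # fires only on the last sample
```

In a real stack this would live in your monitoring system as an alert rule rather than application code; the point is that the condition is concrete enough to test.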

Weak action items (avoid)

  • “Improve monitoring”
  • “Refactor service”
  • “Be more careful”

If an action can’t be verified, it won’t stick.

Tracking: the difference between “we learned” and “we improved”

If you don’t track actions to completion, PIRs become performative.

Minimum system:

  • Put all PIR actions in the same tracker as other work (Jira/Linear/etc.)
  • Tag them (e.g., reliability, PIR)
  • Review them in an engineering leadership cadence (weekly/biweekly)

Consider a rule: no PIR is “closed” until actions are accepted or deliberately rejected with rationale.
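That closure rule can be made mechanical. A sketch of a gate you might run before marking a PIR closed (the field names are illustrative, not a real tracker schema):

```python
def can_close_pir(actions):
    """A PIR may close only when every action is either accepted
    (with an owner and a due date) or rejected with a written
    rationale. Anything else blocks closure."""
    for action in actions:
        status = action.get("status")
        if status == "accepted" and action.get("owner") and action.get("due"):
            continue
        if status == "rejected" and action.get("rationale"):
            continue
        return False
    return True

ok = can_close_pir([
    {"status": "accepted", "owner": "A", "due": "2025-02-12"},
    {"status": "rejected", "rationale": "superseded by platform migration"},
])
blocked = can_close_pir([{"status": "accepted"}])  # no owner or due date
```

Wiring a check like this into your tracker's workflow makes "deliberately rejected with rationale" an explicit state rather than a silent drop.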

Metrics to validate PIR effectiveness

You’re not measuring PIRs to create bureaucracy. You’re measuring whether they reduce risk.

Good metrics:

  • Repeat incident rate (same class of incident)
  • MTTR trend for top services
  • Action completion rate within 30/60/90 days
  • Change failure rate (if incidents are change-driven)
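The first three metrics fall out of data you already have in your incident tracker. A minimal sketch, assuming each incident record carries a class label and start/resolve timestamps (an illustrative schema, not a specific tool's export format):

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to resolve, in minutes."""
    durations = [(i["resolved"] - i["started"]).total_seconds() / 60 for i in incidents]
    return sum(durations) / len(durations)

def repeat_rate(incidents):
    """Share of incidents whose class label has already occurred,
    walking the list in chronological order."""
    seen, repeats = set(), 0
    for i in incidents:
        if i["class"] in seen:
            repeats += 1
        seen.add(i["class"])
    return repeats / len(incidents)

incidents = [
    {"class": "db-saturation", "started": datetime(2025, 1, 3, 10, 0),
     "resolved": datetime(2025, 1, 3, 11, 0)},
    {"class": "checkout-bug", "started": datetime(2025, 1, 9, 9, 0),
     "resolved": datetime(2025, 1, 9, 9, 30)},
    {"class": "db-saturation", "started": datetime(2025, 2, 1, 14, 0),
     "resolved": datetime(2025, 2, 1, 14, 30)},
]

avg = mttr_minutes(incidents)      # 40.0 minutes
rep = repeat_rate(incidents)       # 1 of 3 incidents is a repeat
```

Note that repeat rate depends on honest, consistent class labeling during the PIR itself; that is worth a minute of the facilitator's time at closing.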

Common PIR failure modes (and fixes)

  • Too much narrative, not enough action
    • Fix: cap discussion time; force action-writing with owners.
  • Focus on the last error (“human error”)
    • Fix: ask what made the error possible; add guardrails.
  • No senior sponsorship
    • Fix: leadership attends periodically and reinforces blameless culture.
  • No time allocated for reliability work
    • Fix: reserve capacity for PIR actions; treat them as risk reduction.

A PIR template you can copy

Summary

  • Impact:
  • Duration:
  • Customer/user effect:
  • Severity:

Timeline (UTC/local)

  • 10:02 Detection:
  • 10:07 Declared SEV1:
  • 10:15 Mitigation applied:
  • 10:40 Impact reduced:
  • 11:05 Resolved:

What went well

What went poorly

Contributing factors

  • Technical:
  • Process:
  • Signals/observability:
  • People/coordination:
  • Third-party:

Action items

  • Action | Owner | Due | Verification method

Final thought

Incident response restores service. PIRs restore confidence—and create a compounding improvement loop. If you run PIRs quickly, keep them blameless and evidence-led, and track actions to closure, you’ll see fewer repeats and shorter outages over time.
