Disaster Recovery Runbook Template
A ready-to-adapt runbook structure for recovering critical systems under pressure — covering roles, recovery objectives, step-by-step procedures, and testing.
What a runbook is for
A disaster recovery runbook is the document your team follows when something has already gone wrong and the people who built the system may not be available. It exists so that recovery does not depend on memory, heroics, or a single person's phone being on.
The best runbooks are boring: specific, step-by-step, and tested. This template gives you the structure to build one for each critical system. Adapt the headings to your environment and keep it where it can be reached even when your primary site is down.
1. Scope and recovery objectives
Define, per system, the two numbers that drive every decision below.
| Term | Question it answers | Example |
|---|---|---|
| RTO (Recovery Time Objective) | How quickly must this be back? | 4 hours |
| RPO (Recovery Point Objective) | How much data can we afford to lose? | 15 minutes |
If you cannot state the RTO and RPO, you cannot judge whether your backups and architecture are good enough. Agree these with the business owner, not just IT.
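One way to make that judgement concrete is to compare the worst-case data loss implied by your backup schedule against the RPO, and your estimated restore time against the RTO. The sketch below is illustrative only: the names `RecoveryObjectives` and `meets_objectives`, the hourly backup interval, and the three-hour restore estimate are all assumptions, not part of this template.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # how quickly the system must be back
    rpo: timedelta  # how much data loss is tolerable


def meets_objectives(objectives, backup_interval, estimated_restore_time):
    """Return a list of gaps; an empty list means both targets look achievable."""
    gaps = []
    # Worst case, the failure happens just before the next backup runs,
    # so up to one full backup interval of data is lost.
    if backup_interval > objectives.rpo:
        gaps.append(
            f"RPO gap: backups every {backup_interval} can lose more than {objectives.rpo}"
        )
    if estimated_restore_time > objectives.rto:
        gaps.append(
            f"RTO gap: restore takes about {estimated_restore_time}, target is {objectives.rto}"
        )
    return gaps


# Example values from the table above: RTO 4 hours, RPO 15 minutes.
targets = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))

# Hypothetical system: hourly backups, restore rehearsed at roughly 3 hours.
gaps = meets_objectives(targets, timedelta(hours=1), timedelta(hours=3))
for gap in gaps:
    print(gap)
```

With these assumed figures the restore time fits inside the RTO, but hourly backups cannot satisfy a 15-minute RPO, which is the kind of mismatch worth raising with the business owner.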
2. Roles and contacts
Recovery fails when no one knows who decides. Name people, not just job titles.
- Incident commander — declares the disaster, owns communication, makes the call to fail over.
- Technical lead — runs the recovery steps and confirms each one.
- Communications lead — updates stakeholders and, where relevant, customers.
- Business owner — confirms when the system is genuinely usable again.
Include primary and backup contacts with at least two channels each. Store this list somewhere that survives the outage — not only in the system you are recovering.
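If you keep the contact list as structured data, a small check can flag roles that lack a backup or a second channel. This is a minimal sketch under assumptions: the `Contact` class, the `contact_gaps` function, and every entry in the example roster are hypothetical placeholders.

```python
from dataclasses import dataclass, field

# The four roles listed above.
REQUIRED_ROLES = (
    "incident commander",
    "technical lead",
    "communications lead",
    "business owner",
)


@dataclass
class Contact:
    name: str
    # Channel name -> address or number, e.g. {"mobile": "...", "email": "..."}
    channels: dict = field(default_factory=dict)


def contact_gaps(roster):
    """roster maps a role to a list of Contact objects, primary first."""
    problems = []
    for role in REQUIRED_ROLES:
        people = roster.get(role, [])
        if len(people) < 2:
            problems.append(f"{role}: needs a primary and a backup contact")
        for person in people:
            if len(person.channels) < 2:
                problems.append(f"{role}: {person.name} has fewer than two channels")
    return problems


# Placeholder entries only; replace with real people and real details.
complete = [
    Contact("Primary (placeholder)", {"mobile": "placeholder", "email": "placeholder"}),
    Contact("Backup (placeholder)", {"mobile": "placeholder", "email": "placeholder"}),
]
roster = {
    "incident commander": complete,
    "technical lead": [Contact("Primary TL (placeholder)", {"mobile": "placeholder"})],
    "communications lead": complete,
    "business owner": complete,
}

problems = contact_gaps(roster)
for problem in problems:
    print(problem)
```

Running a check like this whenever the list changes is one way to catch a role that has quietly lost its backup.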
3. Declaration and escalation
- Define the criteria that constitute a "disaster" versus a routine incident.
- State who is authorised to declare one and how they are reached out of hours.
- Record the time of declaration — recovery clocks start here.
- Open a single shared channel for the incident to avoid scattered updates.
4. Recovery procedures
This is the heart of the runbook. Write each procedure as numbered, literal steps that someone unfamiliar with the system could follow.
Example structure per system
1. Verify the failure. Confirm the system is actually down and not a monitoring false positive.
2. Protect data. Stop processes that could corrupt or overwrite recoverable data.
3. Restore infrastructure. Provision the recovery environment (from infrastructure as code where possible).
4. Restore data. Recover from the most recent valid backup or replica; record which point-in-time was used.
5. Validate. Run the agreed health checks and a short functional test.
6. Cut over. Repoint DNS, load balancers, or integrations to the recovered system.
7. Confirm. Have the business owner confirm the system is usable.
What every step needs
- The exact command, console path, or tool to use.
- The expected result, so the operator knows it worked.
- What to do if it fails, including who to escalate to.
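The three requirements above can be sketched as a record per step, plus a check that no field has been left blank. The `Step` class, the `incomplete_steps` function, and the example step contents are assumptions for illustration; the angle-bracket text stands in for the real commands of your own system.

```python
from dataclasses import dataclass


@dataclass
class Step:
    title: str
    action: str      # the exact command, console path, or tool to use
    expected: str    # what the operator should see if it worked
    on_failure: str  # what to do next, including who to escalate to


def incomplete_steps(steps):
    """Return (step number, missing field names) for every step with a blank field."""
    report = []
    for number, step in enumerate(steps, start=1):
        missing = [
            name
            for name in ("action", "expected", "on_failure")
            if not getattr(step, name).strip()
        ]
        if missing:
            report.append((number, missing))
    return report


# Hypothetical procedure fragment; the second step is deliberately unfinished.
procedure = [
    Step(
        title="Verify the failure",
        action="<health check command for this system>",
        expected="Check fails from at least two independent locations",
        on_failure="If the check passes, treat as a monitoring false positive and stop",
    ),
    Step(
        title="Protect data",
        action="<command that stops the writer processes>",
        expected="No writer processes remain",
        on_failure="",
    ),
]

for number, missing in incomplete_steps(procedure):
    print(f"Step {number} is missing: {', '.join(missing)}")
```

A check of this kind does not prove the steps are correct, only that none of them leaves the operator without an expected result or a fallback.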
5. Communication plan
- Pre-write templates for internal updates and, if needed, customer notices.
- Set an update cadence (e.g. an update every 30 minutes even if there is no change).
- Name a single source of truth (status page or channel).
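A fixed cadence is easy to turn into a list of due times once the declaration time is known. The sketch below assumes a hypothetical `update_schedule` helper and an example declaration time; the 30-minute interval is the example cadence mentioned above.

```python
from datetime import datetime, timedelta


def update_schedule(declared_at, cadence, count):
    """Return the times at which the next `count` updates are due."""
    return [declared_at + cadence * n for n in range(1, count + 1)]


# Hypothetical declaration time; cadence taken from the example above.
declared_at = datetime(2024, 1, 15, 2, 10)
due_times = update_schedule(declared_at, timedelta(minutes=30), 3)
for due in due_times:
    print(due.strftime("%H:%M"))
```

Posting the due times in the incident channel makes a missed update visible, including when the update is only "no change".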
6. Post-incident review
Recovery is not finished when the system is up. Within a few days:
- Reconstruct the timeline from declaration to full recovery.
- Compare actual recovery time against the RTO, and actual data loss against the RPO.
- Capture what slowed you down and assign owners to fix it.
- Update this runbook with anything you learned.
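The comparison against targets can be sketched from four timestamps in the reconstructed timeline. This is an illustration under assumptions: the function name `review_against_targets` and all timestamps are hypothetical, recovery time is measured from declaration as described in section 3, and data loss is taken as the gap between the restore point used and the moment of failure.

```python
from datetime import datetime, timedelta


def review_against_targets(failure_at, declared_at, recovered_at, restore_point, rto, rpo):
    """Compare what actually happened with the agreed objectives."""
    recovery_time = recovered_at - declared_at  # clock starts at declaration
    data_loss = failure_at - restore_point      # writes after the restore point are lost
    return {
        "recovery_time": recovery_time,
        "rto_met": recovery_time <= rto,
        "data_loss": data_loss,
        "rpo_met": data_loss <= rpo,
    }


# Hypothetical incident, measured against the example targets (4 hours, 15 minutes).
result = review_against_targets(
    failure_at=datetime(2024, 1, 15, 2, 0),
    declared_at=datetime(2024, 1, 15, 2, 10),
    recovered_at=datetime(2024, 1, 15, 5, 40),
    restore_point=datetime(2024, 1, 15, 1, 30),
    rto=timedelta(hours=4),
    rpo=timedelta(minutes=15),
)
for key, value in result.items():
    print(f"{key}: {value}")
```

In this assumed incident the RTO is met but the RPO is not, so the review would focus on backup frequency rather than on restore speed.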
7. Testing schedule
An untested runbook is a hopeful document, not a recovery plan.
- Schedule at least one tabletop walkthrough per quarter.
- Schedule a full or partial live recovery test at least once a year, ideally twice.
- Treat every test as a chance to find gaps before a real incident does.
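A schedule is easier to keep when something flags overdue tests. The sketch below assumes a hypothetical `overdue_tests` helper, example dates, and day limits that approximate "quarterly" and "yearly"; adjust them to whatever cadence you commit to.

```python
from datetime import date, timedelta

# Assumed limits: roughly one quarter for tabletops, one year for live tests.
MAX_AGE = {
    "tabletop": timedelta(days=92),
    "live": timedelta(days=366),
}


def overdue_tests(last_run, today):
    """Return the test types that have never run or whose last run is too old."""
    overdue = []
    for kind, max_age in MAX_AGE.items():
        last = last_run.get(kind)
        if last is None or today - last > max_age:
            overdue.append(kind)
    return overdue


# Hypothetical history for one system.
last_run = {"tabletop": date(2024, 4, 15), "live": date(2023, 3, 1)}
overdue = overdue_tests(last_run, today=date(2024, 6, 1))
print(overdue)
```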
Keep this runbook short enough that people will actually read it, and specific enough that they could act on it at 3 a.m. Review it whenever the underlying system changes — a runbook that describes last year's architecture is worse than none, because it inspires false confidence.