Disaster recovery testing: how often, what to test, and lessons learned

The testing gap

Ask any business leader whether they have a disaster recovery plan and most will say yes. Ask when it was last tested and the room goes quiet.

DR plans are written with good intentions and then filed away. Staff change, systems are migrated, new applications are deployed - and the plan drifts further from reality. When a real incident occurs, teams discover that the backup doesn’t restore, the failover path no longer exists, or nobody knows the sequence of steps.

Testing is the only way to close the gap between what you think will happen and what actually happens.

Why DR testing is neglected

The reasons are predictable:

  • Fear of disruption. Testing requires taking systems offline or simulating failures, which feels risky in a production environment.
  • Resource constraints. DR tests take time and involve staff who already have full workloads.
  • Perceived cost. Full failover tests can be expensive, especially if they require cloud burst capacity or temporary infrastructure.
  • False confidence. “We back up every night” is not a recovery strategy. Without a tested restore process, a backup is just a file.
  • No mandate. Unless compliance or a client contract demands testing, it falls to the bottom of the priority list.

None of these reasons survive contact with a real disaster. The cost of a failed recovery always exceeds the cost of testing.

Types of DR tests

Not every test needs to be a full-scale production failover. Different test types serve different purposes, and a mature DR programme uses all of them.

Tabletop exercise

The team gathers - physically or virtually - and walks through a disaster scenario step by step. There is no technical execution; the goal is to validate decision-making, communication, and role clarity.

When to use it: Quarterly. Tabletops are cheap, fast, and remarkably effective at exposing gaps in process and ownership.

What it reveals: Missing contact details, unclear escalation paths, assumptions about who does what, and dependencies that nobody documented.

Walkthrough test

Teams carry out their assigned recovery tasks hands-on, without actually failing over production systems. This might include logging into the backup console, verifying that restore scripts exist and are current, or confirming that alternate network paths are configured.

When to use it: Semi-annually. A walkthrough builds confidence that procedures are documented correctly and that staff know where to find them.

What it reveals: Outdated runbooks, changed passwords, expired certificates, and missing access permissions.
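
Parts of a walkthrough lend themselves to automation. As a minimal sketch (the hostnames and runbook directory below are placeholder assumptions, not anything from your environment), a short script can flag certificates nearing expiry and runbooks that have not been updated in months:

```python
# walkthrough_checks.py - illustrative sketch; hostnames and paths are placeholders.
import ssl
import socket
from datetime import datetime, timedelta, timezone
from pathlib import Path

DR_HOSTS = ["dr-portal.example.com"]      # hypothetical DR endpoints to check
RUNBOOK_DIR = Path("/srv/dr/runbooks")    # hypothetical runbook location
CERT_WARNING = timedelta(days=30)
RUNBOOK_STALE = timedelta(days=180)

def cert_expiry(host: str, port: int = 443) -> datetime:
    """Fetch the expiry date of the TLS certificate served at host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)

now = datetime.now(timezone.utc)
for host in DR_HOSTS:
    expires = cert_expiry(host)
    if expires - now < CERT_WARNING:
        print(f"WARN: certificate for {host} expires {expires:%Y-%m-%d}")

for runbook in RUNBOOK_DIR.glob("*.md"):
    age = now - datetime.fromtimestamp(runbook.stat().st_mtime, tz=timezone.utc)
    if age > RUNBOOK_STALE:
        print(f"WARN: {runbook.name} last updated {age.days} days ago")
```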

Simulation test

A realistic scenario is introduced - “the primary server room flooded overnight” - and teams respond as if it were real. Systems may be partially involved (e.g., restoring a non-production copy of a database), but production remains untouched.

When to use it: Annually. Simulations test both technical recovery and human coordination under pressure.

What it reveals: Time-to-recovery gaps, communication breakdowns, and whether your infrastructure actually supports the failover architecture you designed.

Full failover test

Production workloads are switched to backup or secondary infrastructure. This is the only test that truly validates your Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

When to use it: Annually for critical systems. Schedule during a maintenance window with stakeholder buy-in.

What it reveals: Whether your DR environment can actually handle production load, whether data replication is working as expected, and how long recovery genuinely takes.
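
Because this is the test that proves or disproves your RTO, it is worth capturing the timing mechanically rather than by recollection. A minimal sketch, assuming a hypothetical health endpoint on the DR side and a four-hour target by way of example:

```python
# rto_timer.py - sketch for timing recovery during a failover test;
# the health URL and the four-hour target are illustrative assumptions.
import time
import urllib.request

HEALTH_URL = "https://dr.example.com/health"  # hypothetical DR health endpoint
RTO_TARGET_SECONDS = 4 * 60 * 60              # the RTO your plan commits to

start = time.monotonic()
while True:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            if resp.status == 200:
                break  # service is answering; recovery confirmed
    except OSError:
        pass  # still down or unreachable; keep polling
    time.sleep(30)

elapsed = time.monotonic() - start
verdict = "PASS" if elapsed <= RTO_TARGET_SECONDS else "FAIL"
print(f"{verdict}: service reachable after {elapsed / 60:.1f} minutes")
```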

What to test

A common mistake is testing only the easy parts. A thorough DR programme covers:

  • Backup restoration - can you actually restore from your most recent backup? How long does it take? Is the data complete? (A scripted version of this check is sketched after this list.)
  • Application recovery - restoring a VM or database is one thing; getting an application stack running end-to-end (web server, application server, database, integrations) is another.
  • Network failover - if your primary link goes down, does traffic route correctly to the secondary? Do DNS changes propagate in time?
  • Authentication and access - can users log in to recovered systems? Are directory services, SSO, and MFA functional in the DR environment?
  • Third-party dependencies - if your payment gateway, CRM, or cloud platform is down, what is the impact and what is your fallback?
  • Communication channels - if email is part of the outage, how do you coordinate the recovery team?
  • Data integrity - after restoration, is the data consistent? Are transactions complete? Do audit logs match?
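
To make the first and last items on this list concrete, here is a minimal sketch of a scripted restore check. The restore command, manifest path, and scratch directory are placeholders for whatever your backup product actually provides; the pattern is what matters: restore to a non-production target, time it, and verify checksums captured at backup time.

```python
# restore_check.py - sketch of a scripted restore-and-verify run; the restore
# command, manifest path, and scratch directory are placeholders.
import hashlib
import subprocess
import time
from pathlib import Path

MANIFEST = Path("/srv/dr/manifests/latest.sha256")  # checksums captured at backup time
TARGET = Path("/restore-test")                      # scratch area, never production

start = time.monotonic()
# Placeholder: substitute the restore command for your backup tool here.
subprocess.run(["restore-tool", "--latest", "--target", str(TARGET)], check=True)
print(f"Restore completed in {(time.monotonic() - start) / 60:.1f} minutes")

# Compare every restored file against the checksum recorded when it was backed up.
failures = 0
for line in MANIFEST.read_text().splitlines():
    expected, name = line.split(maxsplit=1)
    actual = hashlib.sha256((TARGET / name).read_bytes()).hexdigest()
    if actual != expected:
        failures += 1
        print(f"MISMATCH: {name}")

print("Integrity: PASS" if failures == 0 else f"Integrity: FAIL ({failures} files)")
```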

Frequency recommendations

| Test type | Frequency | Duration | Involvement |
| --- | --- | --- | --- |
| Tabletop exercise | Quarterly | 1–2 hours | Incident response team, department leads |
| Walkthrough test | Semi-annually | Half day | IT operations, backup administrators |
| Simulation test | Annually | Full day | IT, operations, management |
| Full failover test | Annually (critical systems) | Maintenance window | IT, vendor support, management sign-off |

Adjust frequency based on risk. If your organisation has experienced a real incident, increase testing cadence for the affected systems. If you’ve made significant infrastructure changes - a migration, a new backup solution, a new data centre - test before and after.

Documenting results

Every test should produce a written report that captures:

  1. Scenario description - what was tested and why.
  2. Participants - who was involved.
  3. Timeline - when the test started, key milestones, and when recovery was confirmed.
  4. Outcome - pass or fail against defined RTO and RPO targets.
  5. Issues discovered - what went wrong, what was unexpected, and what was missing.
  6. Remediation actions - specific steps to fix each issue, with owners and deadlines.
  7. Lessons learned - broader observations about process, tooling, or culture.

This documentation serves as evidence for auditors, input for plan updates, and a baseline for measuring improvement over time.
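
The report does not need heavyweight tooling. One lightweight option, shown here as an illustrative sketch rather than a prescribed format, is to capture the same seven fields as structured data so results can be compared across tests:

```python
# dr_test_report.py - one lightweight way to keep reports consistent and
# machine-readable; the field names are illustrative, not a standard.
from dataclasses import dataclass, field

@dataclass
class DRTestReport:
    scenario: str                        # what was tested and why
    participants: list[str]              # who was involved
    timeline: dict[str, str]             # milestone -> timestamp
    rto_target_minutes: int
    rto_actual_minutes: int
    rpo_target_minutes: int
    rpo_actual_minutes: int
    issues: list[str] = field(default_factory=list)             # what went wrong
    remediation: dict[str, str] = field(default_factory=dict)   # action -> owner, deadline
    lessons: list[str] = field(default_factory=list)            # broader observations

    @property
    def passed(self) -> bool:
        """Outcome against the defined RTO and RPO targets."""
        return (self.rto_actual_minutes <= self.rto_target_minutes
                and self.rpo_actual_minutes <= self.rpo_target_minutes)

report = DRTestReport(
    scenario="Simulated flood in primary server room",
    participants=["incident lead", "IT operations", "DBA"],
    timeline={"test start": "08:00", "database restored": "10:45",
              "recovery confirmed": "11:30"},
    rto_target_minutes=240, rto_actual_minutes=210,
    rpo_target_minutes=60, rpo_actual_minutes=15,
)
print("PASS" if report.passed else "FAIL")  # -> PASS
```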

Common failures found during testing

Years of DR testing across South African businesses have revealed a consistent set of problems:

  • Backups exist but can’t be restored. The backup job completes successfully every night, but nobody has tried to restore from it in months. When they do, the backup is corrupt, incomplete, or incompatible with the current system version.
  • Recovery takes three times longer than expected. The plan says RTO is four hours, but the actual restore takes twelve. Network bandwidth, data volume, and manual steps were underestimated.
  • Credentials are missing or expired. The DR environment requires a service account password that was changed six months ago and never updated in the runbook.
  • DNS and network configuration is wrong. Systems restore successfully, but users can’t reach them because DNS records, firewall rules, or load balancer configurations weren’t updated (a smoke test for this is sketched after this list).
  • The plan assumes people who have left. The primary contact for database recovery left the company a year ago. Nobody was assigned as a replacement.
  • Dependencies are invisible. The application restores, but it depends on an API that runs on a different server - one that wasn’t included in the DR scope.
  • Communication fails. The incident response team tries to coordinate via email, but email is part of the outage.
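
Several of these failures, the DNS one in particular, are cheap to catch with an automated smoke test run immediately after restoration. A minimal sketch, with hypothetical hostnames and a documentation-range address standing in for your DR site:

```python
# post_restore_smoke.py - illustrative sketch for the DNS failure mode above;
# hostnames and the expected DR-site addresses are assumptions, not real values.
import socket

# Hostname -> address it should resolve to once failover DNS changes are live.
EXPECTED = {
    "app.example.com": "203.0.113.10",
    "api.example.com": "203.0.113.11",
}

for host, dr_address in EXPECTED.items():
    try:
        resolved = socket.gethostbyname(host)
    except socket.gaierror as exc:
        print(f"FAIL: {host} does not resolve ({exc})")
        continue
    if resolved == dr_address:
        print(f"OK:   {host} -> {resolved}")
    else:
        print(f"FAIL: {host} -> {resolved}, expected {dr_address}")
```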

Building a testing culture

DR testing should not feel like an audit or a punishment. Frame it as a learning exercise. Celebrate the issues found - each one is a problem you won’t face during a real disaster.

Involve business stakeholders, not just IT. The people who depend on recovered systems should understand the process and the limitations. This builds realistic expectations and ensures that recovery priorities reflect actual business needs.

ITHQ helps organisations design, execute, and improve their DR testing programmes. Our business continuity and disaster recovery services cover everything from backup architecture to full failover testing. We build on proven infrastructure and cloud engineering to ensure your recovery environment is ready when you need it.

Contact us to schedule a DR test or review your current recovery posture.
