Why Multi-Cloud Backups Fail Differently Than Single-Cloud
Multi-cloud backup has gotten complicated with all the “just add redundancy” advice flying around. I spent three years managing backups across AWS, Azure, and Google Cloud before a Tuesday morning taught me that redundancy and recoverability are not the same thing. We had an RDS instance in us-east-1 with snapshots in all three providers. Retention policies were set. Everything looked healthy on the dashboard. Then production data got corrupted, and during the restore test we found out the VPC peering config connecting our application to that database had never been backed up. Not once. The instance came back fine. The application couldn’t reach it. Six hours gone, chasing a configuration layer nobody knew we’d skipped.
That’s what makes multi-cloud backup genuinely different from single-cloud work. Cross-provider latency creates invisible gaps. Snapshot APIs behave differently on each platform — sometimes subtly, sometimes catastrophically. RPO windows that look clean on a dashboard can hide silent sync failures for weeks. Partial restores sometimes complete without a single error code, leaving you absolutely convinced the data is safe when it isn’t. These aren’t edge cases. They’re what happens when teams treat multi-cloud backup as the same game played on a bigger board.
Map Your Recovery Dependencies Before You Back Up Anything
Before you touch a backup tool — any of them — you need a dependency map. Not a diagram rotting in Confluence that nobody’s opened since Q3. An actual working document that answers three things: which workloads need each other to function, where data actually lives, and what configuration ties everything together across providers.
Start with stateful data stores. RDS, DynamoDB, Firestore, SQL Server instances — obvious starting points. But then map sideways. If your web application runs in AWS but pulls static assets from Azure Storage, that’s a dependency. If your data pipeline uses Pub/Sub in Google Cloud to trigger Lambda functions in AWS, that’s a dependency too. The application layer is where most recovery plans fall apart. Teams back up databases without backing up the routing rules, secrets managers, or network policies that allow applications to actually use those databases.
Here’s a concrete example. I worked with a team that had perfectly replicated their PostgreSQL database across three clouds — honestly impressive work. During a test restore, they spun up the RDS instance in Azure. The application sitting in AWS couldn’t connect. Why? The cross-cloud VPN tunnel config wasn’t part of the backup. Neither was the credential in AWS Secrets Manager pointing to the old database endpoint. They had backed up the data. They had not backed up the recovery path. Don’t make my mistake — or theirs.
Your dependency map should include:
- Compute resources and their data store locations
- Network configurations that enable cross-cloud communication
- Secrets, API keys, and credential stores
- IAM roles and trust relationships
- Load balancer rules and DNS records pointing to data sources
- Configuration files stored outside of databases
Teams that own this level of detail before something breaks don’t panic during recovery. They pull up a checklist and execute.
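One way to keep that map from rotting is to store it as code in version control and validate it. Here’s a minimal sketch in Python; the resource names are hypothetical placeholders, not a real stack. The same traversal that computes restore order also catches the exact failure from the story above: a dependency that’s referenced but never actually mapped.

```python
from dataclasses import dataclass, field

@dataclass
class Resource:
    """One node in the recovery dependency map."""
    name: str
    provider: str                       # "aws", "azure", "gcp"
    kind: str                           # "database", "network", "secret", ...
    depends_on: list[str] = field(default_factory=list)

# Hypothetical example workload; the names below are placeholders.
RESOURCES = [
    Resource("orders-db", "aws", "database",
             depends_on=["cross-cloud-vpn", "orders-db-credentials"]),
    Resource("cross-cloud-vpn", "aws", "network"),
    Resource("orders-db-credentials", "aws", "secret"),
    Resource("web-app", "azure", "compute", depends_on=["orders-db"]),
]

def restore_order(resources: list[Resource]) -> list[str]:
    """Topologically sort resources so dependencies restore first."""
    by_name = {r.name: r for r in resources}
    ordered: list[str] = []
    visiting: set[str] = set()

    def visit(name: str) -> None:
        if name in ordered:
            return
        if name in visiting:
            raise ValueError(f"circular dependency at {name}")
        if name not in by_name:
            # The VPN-config failure mode: something the map relies on
            # that nobody ever recorded, let alone backed up.
            raise ValueError(f"{name} is referenced but never mapped")
        visiting.add(name)
        for dep in by_name[name].depends_on:
            visit(dep)
        visiting.discard(name)
        ordered.append(name)

    for r in resources:
        visit(r.name)
    return ordered

print(restore_order(RESOURCES))
# ['cross-cloud-vpn', 'orders-db-credentials', 'orders-db', 'web-app']
```

The output is your restore checklist, in order. Anything that raises instead of sorting is a gap you found for free, before an outage found it for you.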
Choose the Right Replication Pattern for Each Workload
Probably should have opened with this section, honestly. I’ve watched teams pick tools before understanding what pattern they actually needed, and it never ends well. Pattern first, tooling second.
Three patterns dominate multi-cloud backup: active-active, active-passive, and backup-only cold storage. Each trades cost against recovery speed. Most teams default to active-active for everything because it sounds resilient. It is resilient. It’s also expensive and often unnecessary for workloads that don’t justify it.
Active-active replication synchronously writes data to multiple clouds simultaneously. Millisecond failover, instant recoverability. You also pay for storage, compute, and egress across all three providers at once — every month, indefinitely. Use this for mission-critical workloads that cannot tolerate downtime: payment processing, real-time customer-facing applications, compliance workflows with strict uptime SLAs. Not for test environments. Not for batch jobs that can tolerate four hours of recovery.
Active-passive keeps a primary copy in one cloud and maintains standby replicas in others. Data syncs asynchronously — typically once per hour or once per day depending on your RPO tolerance. Recovery takes longer than active-active, but costs far less because idle secondary resources aren’t running at full price. This fits most production workloads: databases that need recovery options but aren’t processing transactions at microsecond latency, file systems, application state stores. The cost-to-recovery-time ratio here is where most teams should actually live.
Backup-only cold storage is exactly what it sounds like. Periodic snapshots in low-cost archives — S3 Glacier, Azure Archive tier — restored only during disaster scenarios. Recovery takes hours, sometimes longer. Cost per terabyte runs roughly 90% cheaper than active-passive. Use this for dev environments, historical data, log archives, and compliance backups you genuinely hope never to touch.
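If you want the pattern choice to be mechanical instead of a matter of taste, it mostly reduces to your RTO and RPO targets. A rough sketch; the thresholds here are illustrative assumptions to tune against your own SLAs, not industry constants:

```python
def pick_replication_pattern(rto_minutes: float, rpo_minutes: float) -> str:
    """Map recovery targets to a replication pattern.

    Thresholds are illustrative assumptions, not standards;
    tune them to your own SLAs and budget.
    """
    if rto_minutes < 5 and rpo_minutes < 1:
        # Cannot tolerate downtime or data loss: pay for synchronous copies.
        return "active-active"
    if rto_minutes <= 240:
        # Warm standby is enough; async sync (hourly or daily) is acceptable.
        return "active-passive"
    # Dev environments, archives, compliance retention: restore from cold.
    return "backup-only cold storage"

print(pick_replication_pattern(rto_minutes=2, rpo_minutes=0.5))     # active-active
print(pick_replication_pattern(rto_minutes=60, rpo_minutes=60))     # active-passive
print(pick_replication_pattern(rto_minutes=1440, rpo_minutes=1440)) # backup-only cold storage
```

Run every workload through a function like this once and you’ll usually find that a minority of them actually justify active-active pricing.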
Native tools handle each pattern differently. AWS Backup manages snapshots across regions and accounts but requires separate handling for cross-cloud replication. Azure Backup Vault works well for Azure workloads but is weaker at multi-cloud orchestration. Google Cloud Backup and Disaster Recovery is newer and still Google Cloud-first in focus. Veeam and Commvault handle multi-cloud natively, but they cost significantly more and require agents or appliances to manage. I’ve ended up a Veeam person; it works for me, while GCP-native tooling never quite clicked for cross-provider work. Your mileage may vary depending on workload diversity and how much operational bandwidth your team actually has.
Test Cross-Cloud Restores Before You Actually Need Them
Restore drills are where most multi-cloud backup strategies collapse in practice. Teams test file-level recovery from a single backup location and call it done. They don’t test the thing that actually matters: recovering a complete workload from one cloud to another during a full provider outage.
Your drills need to simulate realistic failure modes. Not “restore a single file from backup.” Instead: the entire AWS us-east-1 region failed at 2:47 AM — restore the application stack to Azure using only what exists in cold backups. That scenario reveals the real problems. RTO estimates that look fine in a spreadsheet become impossible when you discover encryption keys lived only in the failed region. Restore sequences that assume service A finishes before service B starts fail because B had a dependency on a configuration file sitting in C.
Run these drills quarterly at minimum. Treat a failed drill as a production incident, because that’s exactly what it’s previewing. Document what broke, why it broke, and what you’re fixing before the next drill. Track three metrics specifically (a small scoring sketch follows the list):
- RTO accuracy: How long did recovery actually take versus your documented estimate? A gap of more than 10% means your numbers are wrong.
- Restore sequencing errors: Did applications come up in the correct order? Did any service fail to connect because a dependency hadn’t finished restoring yet?
- Credential and key expiry: Did recovery credentials stored with the backup still carry valid permissions? Cloud credentials rotate. Backup credentials sometimes don’t follow.
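The first metric is simple arithmetic, but writing it down keeps drills honest. A tiny sketch using the 10% threshold from the list above and made-up drill numbers:

```python
def rto_gap(documented_minutes: float, actual_minutes: float) -> float:
    """How far actual recovery time diverged from the documented RTO."""
    return (actual_minutes - documented_minutes) / documented_minutes

documented, actual = 40.0, 55.0    # hypothetical drill numbers
gap = rto_gap(documented, actual)
print(f"RTO gap: {gap:+.0%}")      # RTO gap: +38%
if abs(gap) > 0.10:
    print("Documented RTO is wrong; update the runbook before the next drill.")
```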
I’ve watched teams complete restore drills successfully — clean results, everyone relieved — only to discover in an actual production incident that the IAM role used for restore operations had been deleted six months earlier because it looked unused in a cleanup audit. The drill used cached credentials that still worked. Production didn’t have them. Test using only what’s actually inside the backup. Not what’s cached locally on someone’s workstation.
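On AWS, one way to approximate “test with only what’s in the backup” is to verify that the restore role still exists and still carries the permissions you think it does, rather than trusting whatever credentials happen to be cached. A boto3 sketch; the role name and action list are assumptions for illustration:

```python
import boto3
from botocore.exceptions import ClientError

iam = boto3.client("iam")

# Hypothetical restore role and the actions a cross-cloud restore needs.
ROLE_NAME = "multi-cloud-restore"
REQUIRED_ACTIONS = ["rds:RestoreDBInstanceFromDBSnapshot", "kms:Decrypt"]

try:
    role_arn = iam.get_role(RoleName=ROLE_NAME)["Role"]["Arn"]
except ClientError as err:
    # The cleanup-audit failure mode: the role is simply gone.
    raise SystemExit(f"restore role missing: {err}")

# Dry-run the permissions instead of trusting a cached credential.
results = iam.simulate_principal_policy(
    PolicySourceArn=role_arn,
    ActionNames=REQUIRED_ACTIONS,
)
for res in results["EvaluationResults"]:
    if res["EvalDecision"] != "allowed":
        print(f"DENIED: {res['EvalActionName']} (fix before the real outage)")
```

Wire a check like this into the quarterly drill and the “role deleted six months ago” surprise becomes an alert instead of an incident.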
Three Backup Misconfigurations That Cause Silent Data Loss
Encryption key mismatches between providers
You back up an encrypted RDS database from AWS using a KMS key. The backup replicates to Azure. During restore, Azure Backup cannot decrypt the snapshot — the KMS key is regional to AWS and Azure has no access path to it. Here’s the insidious part: the backup completes without an error. Restore silently fails or hangs indefinitely with no alert firing. Fix this by ensuring all cross-cloud backups use provider-agnostic encryption keys, or replicate encryption keys alongside the backup data itself. Then test this specifically during drills — not as an afterthought.
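You can catch the regional-key problem before a restore attempt by walking your encrypted snapshots and inspecting the keys they actually use. A boto3 sketch that treats AWS multi-Region keys as a rough proxy for portability; if you replicate keys some other way, adapt the check accordingly:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")
kms = boto3.client("kms", region_name="us-east-1")

# Walk encrypted RDS snapshots and flag keys that can't leave the region.
paginator = rds.get_paginator("describe_db_snapshots")
for page in paginator.paginate():
    for snap in page["DBSnapshots"]:
        if not snap.get("Encrypted"):
            continue
        key = kms.describe_key(KeyId=snap["KmsKeyId"])["KeyMetadata"]
        if not key.get("MultiRegion"):
            # Single-Region key: a copy of this snapshot restored in
            # another region or provider can't decrypt, and nothing
            # in the backup pipeline will warn you.
            print(f"{snap['DBSnapshotIdentifier']}: single-region key "
                  f"{key['KeyId']}; replicate the key or re-encrypt")
```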
Retention policy conflicts creating silent gaps
AWS Backup applies a 30-day retention rule. Azure Backup applies 90 days. You assume you have 90 days of history. In reality, the oldest snapshots are deleted at 30 days because AWS cleans them before replication fully completes. Your actual retention window is 30 days — not 90. You don’t notice until you need a 60-day-old recovery point that simply doesn’t exist anymore. Fix this by reconciling retention policies across all providers at least monthly. A simple script counting snapshot objects in each cloud and alerting on divergence takes maybe two hours to write. That’s two hours well spent.
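Here’s roughly what that two-hour script looks like, assuming your backups land as objects under comparable prefixes in an S3 bucket and an Azure blob container. Bucket, container, and prefix names below are placeholders:

```python
import boto3
from azure.storage.blob import ContainerClient

# Hypothetical locations; replace with your real backup destinations.
S3_BUCKET = "backups-aws"
AZURE_CONTAINER_URL = "https://acct.blob.core.windows.net/backups"
PREFIX = "rds/orders-db/"

def count_s3(bucket: str, prefix: str) -> int:
    s3 = boto3.client("s3")
    total = 0
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix):
        total += page.get("KeyCount", 0)
    return total

def count_azure(container_url: str, prefix: str) -> int:
    # Pass credential=... to from_container_url for private containers.
    container = ContainerClient.from_container_url(container_url)
    return sum(1 for _ in container.list_blobs(name_starts_with=prefix))

aws_count = count_s3(S3_BUCKET, PREFIX)
azure_count = count_azure(AZURE_CONTAINER_URL, PREFIX)
if aws_count != azure_count:
    # Divergence means retention policies are fighting each other:
    # your real window is whichever side holds fewer recovery points.
    print(f"ALERT: {aws_count} snapshots in AWS vs {azure_count} in Azure")
```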
Clock skew in distributed databases creating inconsistent snapshots
PostgreSQL in AWS runs 3 minutes behind PostgreSQL in Google Cloud — slight NTP drift, easy to miss. You take snapshots to both providers simultaneously, but timestamps are offset. A transaction that should exist in both appears only in one. Recovery looks complete. Data is missing. No error was thrown anywhere in the pipeline. Fix this by synchronizing system clocks across all provider regions to within 100 milliseconds before taking snapshots. Use the managed NTP services each cloud vendor already provides. After any recovery, run data validation queries checking transaction counts and checksums before you declare the restore successful.
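The post-recovery validation can be as blunt as comparing row counts and a primary-key checksum per table between the restored copy and a reference. A psycopg2 sketch; the DSNs and table names are placeholders, and it assumes each table has an `id` primary key:

```python
import psycopg2

# Hypothetical connection strings for the two copies being compared.
PRIMARY_DSN = "host=primary.example.com dbname=orders user=audit"
RESTORED_DSN = "host=restored.example.com dbname=orders user=audit"
TABLES = ["orders", "payments"]  # fixed allowlist, never user input

def table_fingerprint(dsn: str, table: str) -> tuple:
    """Row count plus an order-stable checksum of primary keys."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            f"SELECT count(*), "
            f"coalesce(md5(string_agg(id::text, ',' ORDER BY id)), '') "
            f"FROM {table}"
        )
        return cur.fetchone()

for table in TABLES:
    if table_fingerprint(PRIMARY_DSN, table) != table_fingerprint(RESTORED_DSN, table):
        print(f"MISMATCH in {table}: do not declare the restore successful")
```

A transaction lost to clock skew shows up here as a count or checksum mismatch, which is exactly the error the snapshot pipeline never threw.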
Multi-cloud backup strategies that actually prevent data loss are built on four things: mapping dependencies before anything else, matching the replication pattern to the workload rather than defaulting to whatever sounds most resilient, testing recovery ruthlessly against realistic failure scenarios, and closing the configuration gaps that automated tools simply don’t see. Start with the dependency map. Close the three silent failure modes before they find you in production. Schedule the quarterly drills and treat failed drills as incidents. That’s the checklist that separates teams who recover cleanly from teams who discover their gaps during actual outages — at 2 AM, six hours into a restore that should have taken forty minutes.