“`html
How to spot database replication lag in multi-cloud setups
Your customer places an order on AWS us-east-1. Thirty seconds later, they check their order status from a read replica in GCP us-central1, and the data isn’t there yet. Transaction fails. Support ticket arrives. That’s multicloud database replication lag in its most visible form — stale reads are always the first clue something’s broken.
I spent six months chasing this exact problem across AWS, GCP, and Azure before I figured out what was actually happening. The lag wasn’t always in the database itself — it was bleeding through the network layer, the connection pools, and the batch commit logic all at once.
The thing is, replication lag manifests completely differently depending on your setup. AWS-to-GCP? You’ll see lag in write-ahead logs (WAL) stacking up during cross-region replication. Azure-to-AWS tends to show lag as timeout errors on secondary reads rather than actual stale data. GCP’s Firestore to DynamoDB surfaces as inconsistent transaction ordering, not just raw latency. That’s what makes diagnosing this so endearing to us infrastructure folks — there’s no one signal.
So, without further ado, let’s dive in. The easiest diagnostic is running a simple lag query right now. If you’re using PostgreSQL with logical replication across clouds:
SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots;
The gap between restart_lsn and confirmed_flush_lsn tells you how much uncommitted WAL is backlogged. For MySQL, check this instead:
SHOW SLAVE STATUS\G | grep Seconds_Behind_Master;
That single metric — seconds behind master — is where most teams start looking. Anything above 1 second in a synchronous replication setup means your pipeline has a hard constraint somewhere. Anything above 5 seconds in async? Your customers are already seeing stale data.
Check network latency between your cloud providers first
Probably should have opened with this section, honestly.
Most replication lag isn’t the database’s fault at all. The network between your clouds is the first bottleneck to diagnose — and it’s almost always worse than you think. AWS to GCP across regions adds 80–150ms of baseline latency. AWS to Azure from the same regions? 40–60ms. That’s just the physical distance. Add VPN encryption, NAT gateways, and cross-provider peering delays, and you’re looking at 200–400ms of overhead before your database even attempts a write.
Start with a ping test from each cloud to every other. Open your terminal:
ping -c 10 [target-cloud-endpoint]
You’re hunting for packet loss (anything above 0.1% is bad) and latency variance. If one direction shows 45ms and the reverse shows 120ms, you’ve found asymmetric routing — a common culprit that absolutely breaks replication expectations.
MTR (My Traceroute) gives you the full hop-by-hop picture instead of just guessing. Install it with brew install mtr on macOS or apt-get install mtr on Linux, then run:
mtr -r -c 100 [target-ip]
I’ve caught VPN gateways adding 80ms of jitter and misconfigured security group rules dropping packets at the 7th hop using this tool. It’s free and works cross-cloud, which beats the alternatives.
Acceptable latency thresholds differ by replication mode. For synchronous replication — where writes wait for confirmation — anything over 50ms adds real write latency to your application. Most teams tolerate 20–30ms before users complain. Asynchronous replication hides network latency from writes but can drift weeks behind if backlog piles up. Acceptable thresholds there depend entirely on your use case, not the network.
Check your inter-cloud bandwidth next. Many teams cap VPN throughput at 1 Gbps by default. A 100MB transaction replicated synchronously over a 1 Gbps link takes 0.8 seconds, plus network latency. Scale that to 10 concurrent replication streams and you’re over 5 seconds of lag before the database does anything. Verify your actual provisioned bandwidth:
iperf3 -c [remote-endpoint] -t 30
Run it from each direction. Asymmetric bandwidth is another hidden killer — you might have 10 Gbps one way and 2 Gbps the other.
Identify bottlenecks in your replication pipeline
Once the network checks out, the database itself becomes the suspect.
Connection pool exhaustion is the first place to look. Your primary database opens one connection per replication worker. Most teams run 2–4 workers — laughably low for multi-cloud setups. Check your current worker count in PostgreSQL:
SHOW max_wal_senders;
How many active replication connections are actually running right now?
SELECT count(*) FROM pg_stat_replication;
If that’s maxed out, replication stalls entirely. Increase max_wal_senders to at least 8 for serious multi-cloud setups. The tradeoff: more memory on the primary, more CPU load during peak hours.
WAL log backlog is next on the list. The write-ahead log is the source of truth for replication. Overflow it and replicas can’t keep up. Check the current WAL queue size:
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') as wal_size;
Also check how far behind your slowest replica sits:
SELECT slot_name, restart_lsn, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) as backlog_bytes FROM pg_replication_slots WHERE slot_type='physical';
If backlog exceeds 500MB, your replicas can’t consume writes fast enough. This usually means a CPU or disk I/O bottleneck on the replica side, not the primary.
CPU during replication applies changes to the replica. Heavy indexes, complex foreign keys, or triggers slow this down considerably. Monitor replica CPU during peak write hours. If it’s above 60%, index the replica heavily or reduce write volume to that replica.
Disk I/O throughput limits replication speed on replicas — at least if you want consistent performance. On AWS RDS, check BinLogDiskUsage and ReadLatency in CloudWatch. On GCP Cloud SQL, check the database/replication/replica_lag metric. On Azure, check Replication lag (seconds) in Azure Monitor. If I/O wait (iowait in iostat) exceeds 30%, upgrade your storage class.
Network bandwidth caps are often invisible to teams. Many managed databases cap replication at 50–100 Mbps by default. I once spent three days optimizing a replication pipeline before realizing the RDS instance had its “enhanced monitoring” bandwidth limited to 25 Mbps. Check your provider’s documentation for your specific instance type.
Quick fixes that reduce lag without re-architecture
You’ve diagnosed the bottleneck. Now fix it without redesigning everything.
Increase replication worker threads. In PostgreSQL, boost max_parallel_apply_workers_per_subscription from 2 to 8 — this parallelizes write application on the replica. Tradeoff: higher CPU on the replica, potential deadlocks if you have row-level locking contention (rare). Impact: 30–50% lag reduction in most cases.
Adjust batch sizes for network efficiency. Most database replication sends small batches across the network, incurring overhead per batch. If your cloud link is high-latency but high-bandwidth — like a 100ms, 10 Gbps link — batch 1000 writes instead of 100. In PostgreSQL logical replication, tune logical_decoding_work_mem up to 256MB. Tradeoff: uses more memory on the primary. Impact: reduces round-trips, cuts lag by 20–40% on high-latency links.
Enable compression on the replication stream itself. PostgreSQL logical replication doesn’t compress by default, but you can compress at the WAL transport layer using wal_compression = on. Tradeoff: adds 5–10% CPU overhead on the primary, saves 30–60% bandwidth. Worth it if your cloud link is saturated. Impact: 10–30% lag reduction on bandwidth-constrained setups.
Tune network buffer sizes for stability. Increase wal_sender_timeout from the default 60 seconds to 300 seconds if you’re replicating across high-latency links. This prevents the primary from killing connections during slow network blips. Tradeoff: slower failure detection. Impact: prevents lag spikes from transient network issues. Set it conservatively — 180 seconds is usually safe.
Priority order: 1) Fix connection pool exhaustion first (highest impact, lowest effort). 2) Increase replication workers (15 minutes to apply, 30% reduction common). 3) Enable compression if bandwidth is constrained. 4) Tune batch sizes if you have latency but not bandwidth constraints.
When to accept lag vs when to redesign
Not all replication lag is fixable with tuning alone. Sometimes your architecture needs to change.
For analytics workloads, lag over 24 hours is acceptable. Your data warehouse refreshes nightly anyway. If you’re replicating between clouds for reporting, a 10-second lag is fine — don’t even bother with synchronous replication. Use asynchronous batch jobs instead. Lower cost, simpler operations.
For transactional workloads, the threshold sits at 500ms. Anything under that and most users won’t notice the difference. If you’re consistently above 500ms after tuning, you have a utilization problem — your primary database is overloaded or your replica is underpowered. Scale vertically before you redesign.
For real-time use cases — financial transactions, ride-sharing, anything where milliseconds matter — anything over 2 seconds breaks the application entirely. Stop trying to replicate across multiple clouds synchronously at that point. Use asynchronous replication to each cloud instead, keep a local cache layer (Redis, Memcached) updated in real-time, and accept that different clouds have slightly stale copies. The application reads from the local cache. Replicas sync in the background. This costs more (extra caching layer, more complex logic) but eliminates lag spikes.
Make the decision after a week of tuning: if lag is under 500ms, you’re done. If it’s between 500ms and 2 seconds, your architecture is the limit — upgrade instances or reduce replication volume. If it’s over 2 seconds consistently, redesign with async plus cache.
“`
Stay in the loop
Get the latest multicloud hosting updates delivered to your inbox.