Multi-Cloud Networking Mistakes That Inflate Your Latency

Why Multi-Cloud Latency Is Usually a Routing Problem

Multi-cloud networking has gotten complicated, with vendor promises and architectural diagrams flying around everywhere. I spent eighteen months in the trenches debugging latency issues across AWS and GCP environments, and this article covers what actually breaks, based on that experience.

Here’s the thing — it’s almost never raw cloud performance. AWS and GCP have bandwidth to spare. The problem lives in how you route traffic between them, and most teams don’t find that out until users are already angry.

Here’s the pattern I keep seeing. A team deploys workloads across two or three providers — redundancy, compliance, avoiding vendor lock-in, whatever the reason. Everything works. Then the complaints start. Network engineers scramble, checking instance types, verifying bandwidth allocation, running packet captures. And they miss it entirely: traffic is hairpinning.

So what is hairpinning? It’s when your data takes a path that makes no geographic sense, a silent performance killer baked into your architecture by default. A request leaves AWS in us-east-1, travels to a central hub in London, then bounces back across the Atlantic to GCP in us-central1. Two ocean crossings when it should be one hop. This happens because teams default to a single VPN tunnel or centralized security appliance for “visibility,” and everything funnels through it by design.

Mistake 1 — Routing All Traffic Through One Region

This is the most common thing I’ve debugged, by a wide margin.

Most teams establish their first multi-cloud connection using a single egress point. Maybe a VPN terminating in London. Maybe a NAT gateway in us-west-2. The reasoning makes sense on paper: centralize control, simplify firewall rules, run security tooling in one place. Works fine. Until it doesn’t.

I worked with a fintech company running exactly this setup. Primary app in AWS us-east-1. Real-time pricing database on GCP us-central1. All traffic between them routed through a VPN hub in London — because that’s where the security team had deployed their inline inspection appliances back in 2021. A user in New York was seeing 240ms latency on a database query that should have taken 8ms. The physical path was roughly 3,500 miles longer than it needed to be.

The fix is regional peering. Instead of one tunnel, you establish direct peering or dedicated interconnects in the regions where your workloads actually sit. AWS Direct Connect in us-east-1 talking directly to GCP Dedicated Interconnect in us-central1. Keep traffic on the shortest possible path between endpoints. That’s it.

This means:

  • Map your workload distribution across regions first. Don’t guess.
  • Establish peering at each region where you have active compute.
  • Use traffic policies or service mesh configuration to route intra-cloud calls directly — bypassing central hubs entirely.
  • Reserve the central hub for genuinely cross-cutting concerns, like backup traffic or compliance mirroring.
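
On the AWS side, repointing a route from the central hub to regional connectivity can be as small as one CLI call. This is a sketch with placeholder IDs, assuming a virtual private gateway already attached to the VPC fronts the regional Direct Connect circuit:

```shell
# Placeholder IDs and CIDR; substitute your own. Assumes a virtual
# private gateway attached to this VPC fronts the regional Direct
# Connect toward GCP us-central1.
RTB=rtb-0123456789abcdef0   # route table for the us-east-1 workload subnets
VGW=vgw-0123456789abcdef0   # gateway fronting the regional interconnect
GCP_CIDR=10.20.0.0/16       # CIDR of the GCP us-central1 workloads

# Swap the existing route (previously aimed at the central VPN hub)
# for one that keeps GCP-bound traffic in-region.
aws ec2 replace-route \
  --route-table-id "$RTB" \
  --destination-cidr-block "$GCP_CIDR" \
  --gateway-id "$VGW"
```

The GCP side needs the mirror-image route toward the interconnect’s Cloud Router; the principle is the same either way: the most specific route points at the nearest regional attachment, not the hub.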

The latency improvement shows up fast. Regional direct connections typically land at 1–5ms between major cloud providers. Hairpinned traffic through a distant hub? You’re looking at 80–200ms. User-facing metrics reflect the difference within hours of cutting over.

Mistake 2 — Using Site-to-Site VPN Instead of Direct Interconnect

VPN feels fine at first. It’s cheap. Simple. You spin up a connection, configure routes, data flows. Then you pull latency percentiles and realize p95 is sitting around 45ms for traffic that should be doing 12ms.

Two things are working against you. Encryption overhead — roughly 2–5ms of CPU cost per direction under heavy throughput, worse on older instance generations — and the public internet routing problem. A VPN packet doesn’t take the shortest path between clouds. It takes whatever path BGP routing tables decide on that particular millisecond. Sometimes optimal. Usually not.

AWS Direct Connect and GCP Dedicated Interconnect are different animals entirely. Physical circuits — 1 Gbps, 10 Gbps, or 100 Gbps — that bypass the public internet completely. Traffic takes the same path every single time. Consistent latency. No encryption overhead, because you control the entire path end to end.

The cost objection comes up immediately, and I get it. Direct Connect runs roughly $0.30 per hour for a 1 Gbps port, which works out to about $220 per month, plus around $0.02 per GB transferred. Move 100 GB monthly between clouds and you’re looking at roughly $222 total. Compare that against one senior engineer spending a week chasing performance issues: $3,000–$5,000 in loaded cost, conservatively. Suddenly the circuit looks cheap.

I’ve tried the workarounds. In my experience, VPN-only setups never actually solve this problem long-term; Direct Connect does. Don’t make my mistake of waiting too long to switch.

You don’t need a colo presence or exotic hardware to get started. AWS and GCP both connect through partner networks — Equinix and Megaport specifically. Megaport lets you provision cross-cloud circuits without touching physical infrastructure yourself. I’ve set one up in under two hours from a laptop.

If budget is genuinely tight, use Direct Connect for your highest-traffic paths and keep VPN as a fallback. Real-time queries go over Direct Connect. Batch jobs and logging go over VPN. That hybrid approach still gets you 60–70% latency improvement on the paths that actually matter.
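
Here is a sketch of that split, with placeholder IDs: the latency-sensitive prefix rides the Direct Connect gateway, while the batch prefix stays on a VPN carried by a transit gateway attachment (a VPC can only have one virtual private gateway attached, so the VPN fallback goes through a transit gateway in this layout):

```shell
# Placeholder IDs and CIDRs; substitute your own. Assumes a virtual
# private gateway fronts Direct Connect and a transit gateway
# attachment carries the site-to-site VPN.
RTB=rtb-0123456789abcdef0
DX_VGW=vgw-0123456789abcdef0    # Direct Connect path
VPN_TGW=tgw-0123456789abcdef0   # VPN fallback path

# Real-time pricing queries: shortest, most consistent path.
aws ec2 replace-route --route-table-id "$RTB" \
  --destination-cidr-block 10.20.1.0/24 --gateway-id "$DX_VGW"

# Batch jobs and log shipping: VPN is fine.
aws ec2 replace-route --route-table-id "$RTB" \
  --destination-cidr-block 10.20.2.0/24 --transit-gateway-id "$VPN_TGW"
```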

Mistake 3 — Ignoring DNS Resolution Latency Across Clouds

DNS is the most underrated source of multi-cloud latency. Full stop. Services making cross-cloud API calls depend on DNS resolution to find endpoints — and if that lookup is slow or flaky, every single request starts with a 30–100ms penalty before real network traffic even begins.

Here’s what happens in the wild. You deploy an app on AWS with internal service discovery. You deploy a backend API on GCP. The AWS app calls that API using a public DNS name. Every request first queries a public resolver — which may not have the record cached, which may not route to your nearest nameserver. That’s 50–150ms of latency just to resolve an IP address. Before you’ve sent a single byte of actual payload.

Worse: under load, that resolver becomes a bottleneck. I’ve watched teams experience cascading failures during peak traffic because DNS queries piled up faster than public resolvers could answer. Response times didn’t degrade gradually. They collapsed all at once.

The fix is cloud-native private DNS zones. AWS Route 53 private hosted zones. GCP Cloud DNS. Azure Private DNS if you’re three-cloud. Stand up a private hosted zone — call it internal.company.com — that exists only within your VPC networks across both clouds. Services query the private zone, which resolves in 2–5ms consistently, regardless of what load looks like elsewhere.
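
Standing up the zone on both sides is a few commands. A sketch with placeholder names and IDs, assuming the AWS CLI and gcloud are already configured:

```shell
# Placeholder zone name, VPC ID, and network name; substitute your own.

# AWS: private hosted zone scoped to a VPC (associate additional VPCs
# later with `aws route53 associate-vpc-with-hosted-zone`).
aws route53 create-hosted-zone \
  --name internal.company.com \
  --caller-reference "pdns-$(date +%s)" \
  --vpc VPCRegion=us-east-1,VPCId=vpc-0123456789abcdef0 \
  --hosted-zone-config Comment="cross-cloud private zone",PrivateZone=true

# GCP: matching private Cloud DNS zone scoped to a VPC network.
gcloud dns managed-zones create internal-company \
  --description="cross-cloud private zone" \
  --dns-name="internal.company.com." \
  --visibility=private \
  --networks=your-vpc-network
```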

To test what you’re dealing with right now, three commands:

  • dig command: dig your-api.example.com @8.8.8.8 — measures public DNS lookup time directly.
  • nslookup with timing: nslookup -debug your-service.internal.company.com — shows the exact query time at each step.
  • Loop timing: for i in {1..100}; do time dig your-api.example.com; done — surfaces outliers and inconsistency that averages hide.
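
The loop’s raw output is noisy; pulling out dig’s Query time stat makes the 20ms threshold check mechanical. The snippet below parses a captured sample so the logic runs offline; in practice, pipe it live dig output:

```shell
# Parse dig's "Query time" stat and apply a 20ms threshold. The sample
# is captured output so this runs offline; pipe real `dig` output here.
sample=';; Query time: 47 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)'
qt=$(printf '%s\n' "$sample" | awk '/Query time/ {print $4}')
echo "query time: ${qt} ms"   # prints: query time: 47 ms
if [ "$qt" -gt 20 ]; then
  echo "above 20ms: candidate for private DNS"
fi
```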

Seeing public lookups above 20ms? Switch to private DNS. The improvement is immediate — often a 40–100ms reduction in end-to-end request time. That’s one of the quickest wins available in multi-cloud networking, and it costs almost nothing to implement.

One more thing: don’t build hybrid DNS setups where some services resolve via public DNS and others via private. That inconsistency creates debugging situations that will make you question your life choices. Pick one resolution mechanism per service and stick with it.

How to Diagnose and Benchmark Your Multi-Cloud Network

Suspecting latency but not sure where it lives? Run these diagnostics in order. Don’t skip ahead.

Step 1: Baseline throughput testing. Use iPerf3 to measure raw throughput between VMs on each cloud. Spin up a t3.large on AWS and an n1-standard-2 on GCP in your target regions — those instance types are cheap enough to run for an hour and then terminate. Run iPerf3 for 60 seconds in both directions.

Command: iperf3 -c [target-ip] -t 60 -R (reverse mode). Seeing less than 800 Mbps on a 1 Gbps circuit? Something is wrong with your path. That’s your starting signal.
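
To evaluate a run without eyeballing it, extract the Mbits/sec figure from iperf3’s summary line and compare it against the 800 Mbps floor. This parses a captured sample line so it runs offline; point it at real iperf3 output in practice:

```shell
# Captured iperf3 summary line; in practice, use the "receiver" line
# from a real `iperf3 -c <target-ip> -t 60 -R` run.
line='[  5]   0.00-60.00  sec  5.27 GBytes   719 Mbits/sec   receiver'
mbps=$(printf '%s\n' "$line" | awk '{for (i = 1; i <= NF; i++) if ($i == "Mbits/sec") print $(i-1)}')
echo "measured: ${mbps} Mbps"
if [ "${mbps%.*}" -lt 800 ]; then
  echo "below 800 Mbps on a 1 Gbps circuit: investigate the path"
fi
```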

Step 2: Hop analysis. Use mtr — not just traceroute — to visualize the actual path your traffic takes. mtr -c 100 [target-ip] sends 100 packets and shows you each hop with latency and packet loss percentages. Look for hops where latency spikes unexpectedly. That spike is your culprit — usually a routing table misconfiguration or a peering point that’s geographically distant from where you thought you were connecting.
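
The spike is easier to catch programmatically than by scanning fifteen hops. This sketch flags any hop whose average latency jumps more than 30ms over the previous hop, run against a trimmed sample in `mtr --report` shape (hop, host, loss, average):

```shell
# Trimmed sample in `mtr --report` shape: hop, host, loss%, avg(ms).
# In practice, pipe `mtr -c 100 --report <target-ip>` output instead.
sample='1. gw.local          0.0%   1.2
2. edge.aws          0.0%   2.1
3. peering.london    0.0%  78.4
4. gcp.us-central1   0.0%  80.2'
printf '%s\n' "$sample" | awk '
  NR > 1 && $NF - prev > 30 {
    print "latency spike at hop " $1 " (" $2 "): " $NF " ms"
  }
  { prev = $NF }'
```

Against this sample, the distant London peering point is the hop that gets flagged.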

Step 3: Cloud-native diagnostics. AWS Reachability Analyzer verifies connectivity paths between EC2 instances, even across accounts. GCP Network Intelligence Center shows routing paths and bottlenecks visually; it’s genuinely useful and underused. Run both tools to confirm your peering configuration is actually being used the way you think it is.
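
Both tools are scriptable. On the AWS side, a Reachability Analyzer run is two CLI calls; the instance IDs below are placeholders:

```shell
# Placeholder instance IDs; substitute your own. Creates a path and
# kicks off an analysis whose results show whether traffic actually
# traverses the peering you configured.
path_id=$(aws ec2 create-network-insights-path \
  --source i-0123456789abcdef0 \
  --destination i-0fedcba9876543210 \
  --protocol tcp \
  --query 'NetworkInsightsPath.NetworkInsightsPathId' --output text)

aws ec2 start-network-insights-analysis \
  --network-insights-path-id "$path_id"
```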

Benchmark targets to work from: AWS to GCP in the same region pair should land at 3–8ms. Different regions on the same continent, expect 8–25ms. Cross-continent is 30–90ms depending on distance. Outside those ranges, your routing is broken somewhere.

That’s the appeal of benchmarking to us networking folks: the numbers don’t lie, even when the architecture diagrams do. Go run the diagnostics. One final point before you do: measure at the 50th, 95th, and 99th percentiles. Average latency lies. A connection showing a 10ms average with a 150ms p99 creates a terrible user experience, even if the mean looks perfectly healthy on a dashboard.
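
The percentile readout doesn’t need special tooling; sort and awk cover it. Synthetic 1–100ms samples are used here purely for illustration; feed it real per-request latencies (ms, one per line) from your own measurements:

```shell
# Print the p-th (nearest-rank) percentile of latency samples read
# from stdin, one value in ms per line.
percentile() {
  sort -n | awk -v p="$1" \
    '{ a[NR] = $1 } END { i = int(NR * p / 100); if (i < 1) i = 1; print "p" p ": " a[i] " ms" }'
}

# Synthetic samples 1..100 ms, for illustration only.
for p in 50 95 99; do
  seq 1 100 | percentile "$p"   # prints p50: 50 ms, p95: 95 ms, p99: 99 ms
done
```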

Marcus Chen
