Cloud Latency — Practical Fixes That Cut Response Times Without Rebuilding Your Stack
Cloud latency optimization has gotten complicated with all the half-baked advice flying around. The problem isn’t that the advice is wrong, exactly. It’s that everyone skips straight to “add a CDN” or “throw a cache at it” without doing a single diagnostic step first. I spent three years running infrastructure for a SaaS platform handling around 4 million API requests daily, and I learned the hard way what happens when you optimize the wrong layer. Spoiler: it doesn’t just waste time. It can actively make things worse. We spent six weeks tuning application servers before someone finally ran a distributed trace and found that roughly 80% of our P99 latency had been sitting in database I/O the entire time. Don’t make my mistake.
This guide runs on concrete numbers. Not “CDNs reduce latency” but “a CDN cache hit serves content at roughly 20ms versus 150–300ms for an origin round-trip from the US West Coast to us-east-1.” Every technique here has a measurable delta attached. If you can’t measure the improvement, you’re guessing — and guessing is how you end up six weeks deep in the wrong problem.
Where Latency Actually Comes From in Cloud Applications
Before touching a single config file, you need to know which bucket your latency lives in. There are three — and they behave completely differently from each other:
- Network transit time — the round-trip between client and server, including DNS resolution, TCP handshake, TLS negotiation, and physical routing across the internet
- Application processing time — compute and memory operations inside your service: serialization, business logic, framework overhead
- Storage I/O time — database queries, file reads, cache lookups, anything touching persistent storage
The core mistake is treating these three buckets as one problem. That is more than a categorization error: a fix that helps network transit time does nothing for storage I/O, and throwing more CPU at application processing won’t touch a slow sequential table scan. These are genuinely separate problems that need separate solutions.
Run a distributed trace before you do anything else. AWS X-Ray, Datadog APM, OpenTelemetry — all of them give you a waterfall breakdown of a request. Look at your P99 latency — not P50, because P99 is what users actually hit during peak load — then split it into those three buckets. The biggest one gets your attention first.
On the platform I mentioned, our P99 sat at 1,400ms. Trace breakdown: network transit around 85ms, application processing around 180ms, storage I/O around 1,135ms. We had been optimizing application processing. Probably should have opened with this section, honestly.
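As a rough illustration of the bucket split, the sketch below takes the three durations quoted above and ranks them by share of the request. The `percentile` helper uses the nearest-rank method, and the dictionary keys are hypothetical labels, not field names from X-Ray, Datadog, or OpenTelemetry.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def bucket_shares(span_ms):
    """Return each bucket's fraction of the total request time."""
    total = sum(span_ms.values())
    return {bucket: ms / total for bucket, ms in span_ms.items()}

# Durations (ms) from the trace breakdown described in the text.
p99_trace = {"network": 85, "application": 180, "storage": 1135}

shares = bucket_shares(p99_trace)
worst = max(shares, key=shares.get)  # the bucket that deserves attention first

for bucket, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{bucket:<12} {p99_trace[bucket]:>5} ms  {share:.0%}")
```

In practice you would feed `percentile` the per-request durations exported from your tracing tool, pick the request sitting at P99, and run its spans through `bucket_shares`.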
CDN Placement — The Fastest Win for Most Applications
Burned by months of ignoring distributed tracing, I got obsessive about measuring CDN performance correctly. Most teams deploy a CDN the same wrong way: add CloudFront or Fastly, watch P50 drop, declare victory, and never once look at cache hit ratio.
Cache hit ratio is the only number that matters for CDN performance. A CDN sitting at 40% cache hit ratio means 60% of requests still hammer your origin. You’ve added a network hop and a complexity layer for negative benefit on those requests. Bump that to 85% and you’re serving 85% of requests from an edge node at roughly 20ms. The remaining 15% that miss cache and travel to origin from the US West Coast to us-east-1? That’s 150–300ms depending on routing.
Moving from 40% to 85% cache hit ratio on a typical API workload cuts average latency by approximately 60%. (The tail is a different story: as long as misses are slower than hits, the P95 request is still a cache miss until hit ratio passes 95%.) No application code changes. No infrastructure overhaul, just configuration. That’s what makes CDN optimization so appealing to infrastructure engineers who’d rather not rebuild the whole stack on a Tuesday.
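A back-of-envelope way to sanity-check the hit-ratio math is a weighted average of hit and miss latency. The sketch below assumes a 20ms cache hit and a 225ms miss (the midpoint of the 150–300ms range above, which is my assumption rather than a measured value), and it models mean latency only, not percentiles.

```python
def mean_latency_ms(hit_ratio, hit_ms=20.0, miss_ms=225.0):
    """Expected latency when a fraction hit_ratio of requests is served from the edge."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

before = mean_latency_ms(0.40)   # 40% of requests served from cache
after = mean_latency_ms(0.85)    # 85% of requests served from cache
reduction = 1.0 - after / before

print(f"before: {before:.1f} ms, after: {after:.1f} ms, reduction: {reduction:.0%}")
```

Under these assumptions the mean drops from 143ms to about 51ms, roughly a 65% reduction. Swap in your own measured hit and miss latencies before trusting the result.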
Here’s how to actually push cache hit ratio up:
- Cache-Control headers, set correctly. Static assets (JavaScript bundles, CSS, images, fonts) get Cache-Control: public, max-age=31536000, immutable. One year. The file hash in the filename handles cache busting. User-specific API responses get Cache-Control: private, no-store. The common mistake is setting short TTLs on static assets “just to be safe,” which quietly destroys your hit ratio.
- Vary header management. A Vary: Accept-Encoding header means the CDN treats gzip and non-gzip as separate cache entries, which is fine and expected. A Vary: Cookie header means the CDN caches a separate copy for every unique cookie value, effectively bypassing the cache entirely for authenticated users.
- Origin Shield. CloudFront’s Origin Shield adds a regional aggregation layer between edge nodes and your origin. When multiple edge nodes miss cache simultaneously, they collapse into a single origin request instead of thundering-herding. For origins that take 200ms or more to generate a response, this reduces both origin load and the blast radius of a cold cache.
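The header rules above could be applied at the origin with something like the following sketch. The hashed-filename pattern and the 300-second fallback for unhashed files are assumptions made for illustration, and `cache_headers` is a hypothetical helper, not part of any framework.

```python
import re

# Assumed convention: the build tool emits names like app.3f9a1c2b.js,
# so a changed file always gets a new URL.
HASHED_ASSET = re.compile(r"\.[0-9a-f]{8,}\.(js|css|png|jpg|svg|woff2)$")

def cache_headers(path, user_specific=False):
    """Pick response headers for a path following the rules described above."""
    if user_specific:
        # Never let a shared cache store per-user responses.
        return {"Cache-Control": "private, no-store"}
    if HASHED_ASSET.search(path):
        # Safe to cache for a year because the hash in the name busts the cache.
        return {
            "Cache-Control": "public, max-age=31536000, immutable",
            "Vary": "Accept-Encoding",
        }
    # Assumption: unhashed public files get a short TTL since their URL never changes.
    return {"Cache-Control": "public, max-age=300"}

print(cache_headers("/static/app.3f9a1c2b.js"))
print(cache_headers("/api/me", user_specific=True))
```

Note that nothing here emits Vary: Cookie. Keeping cookies out of the cache key is what lets authenticated users share cached static assets.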
One specific number worth keeping in your back pocket: CloudFront’s regional edge caches in Frankfurt, Tokyo, and São Paulo sit roughly 5–15ms from end users in those regions. Your EC2 instance in us-east-1 sits 100–250ms away from those same users. Every cache hit you add is 85–245ms shaved from a real person’s experience.
Database Query Latency — Index Before You Optimize Anything Else
Across maybe a dozen production systems, the pattern holds: most application latency lives in database query time. Not network. Not compute. Queries.
Run EXPLAIN ANALYZE on your slowest queries before touching anything else. The output tells you whether PostgreSQL or MySQL is doing a sequential scan — Seq Scan in Postgres output — on a large table. Sequential scans on tables above a few hundred thousand rows are almost always a bug. A sequential scan on a 10-million-row table might cost 800ms. The same query with a proper B-tree index on the filter column costs 2–5ms. That’s not a rounding error.
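To show the before-and-after shape of a query plan without needing a database server, the sketch below uses SQLite’s EXPLAIN QUERY PLAN as a stand-in for EXPLAIN ANALYZE. The output format differs from PostgreSQL and MySQL, and the table and index names are hypothetical, but the scan-versus-index distinction is the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

def plan(sql, params=()):
    """Return SQLite's plan description for a query (4th column holds the detail text)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)

query = "SELECT * FROM orders WHERE user_id = ?"

before = plan(query, (42,))   # full table scan: no index on user_id yet
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
after = plan(query, (42,))    # index search on the filter column

print("before:", before)
print("after: ", after)
```

In PostgreSQL the equivalent check is looking for Seq Scan turning into Index Scan or Bitmap Index Scan in the EXPLAIN ANALYZE output.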
Three patterns that show up constantly:
- Missing index on foreign key columns. You have an index on the primary key. You probably don’t have one on user_id in your orders table, so every query filtering by user triggers a sequential scan. Add a composite index on your most common query patterns.
- N+1 query patterns. One query fetches 50 records, then 50 individual queries grab a related record for each. In ORMs like ActiveRecord or SQLAlchemy this is invisible until you log query counts. The fix is eager loading: includes, joinedload, etc. Measured impact on one endpoint I worked on: dropping from 51 queries to 1 cut response time from 340ms to 18ms.
- RDS storage autoscaling during peak traffic. Amazon RDS expands storage automatically when you’re approaching capacity, and that expansion causes brief I/O latency spikes. I’ve personally watched 2–5 second query times during a storage expansion event at 2pm on a Wednesday. Pre-provision your production database storage with headroom. Do not rely on autoscaling for your primary production instance.
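The N+1 pattern from the list above is easy to reproduce outside an ORM. This sketch uses plain sqlite3 with a hypothetical users/orders schema and a counter around every query, so the 51-versus-1 difference is visible. It only counts round-trips and makes no claim about timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id), total REAL);
""")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 51)])
conn.executemany("INSERT INTO orders (id, user_id, total) VALUES (?, ?, ?)",
                 [(i, i, 10.0 * i) for i in range(1, 51)])

counter = {"queries": 0}

def run(sql, params=()):
    """Execute a query and count it, the way ORM query logging would."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()

def n_plus_one():
    # 1 query for the orders, then 1 more per order for its user.
    orders = run("SELECT id, user_id FROM orders ORDER BY id")
    return [(oid, run("SELECT name FROM users WHERE id = ?", (uid,))[0][0])
            for oid, uid in orders]

def eager():
    # Same result in a single round-trip, which is what eager loading generates.
    return run("SELECT o.id, u.name FROM orders o "
               "JOIN users u ON u.id = o.user_id ORDER BY o.id")

counter["queries"] = 0
slow = n_plus_one()
slow_queries = counter["queries"]

counter["queries"] = 0
fast = eager()
fast_queries = counter["queries"]

print(slow_queries, fast_queries)
```

Against a networked database each of those extra queries pays a round-trip, which is where the response-time gap comes from.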
On EBS volume types: if you’re still on gp2, switch to gp3. This is essentially a free performance upgrade. The gp2 IOPS model ties baseline performance to volume size: 3 IOPS per GB, so a 100 GB volume gets 300 IOPS and a 1 TB volume gets 3,000 IOPS. The gp3 baseline is a flat 3,000 IOPS regardless of volume size, with 125 MB/s throughput, and you can provision up to 16,000 IOPS independently of storage size for about $0.005 per provisioned IOPS-month above baseline (us-east-1 list price at the time of writing). For most OLTP workloads, the consistent gp3 baseline eliminates the burst credit depletion spikes that smaller gp2 volumes hit under sustained load. Those spikes are maddening to diagnose.
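The gp2 sizing rule is simple enough to check in a few lines. The sketch below applies the 3 IOPS per GB formula together with gp2’s documented floor of 100 IOPS and ceiling of 16,000 IOPS (the floor and ceiling come from AWS’s volume documentation, not from the text above) and compares it with the flat gp3 baseline.

```python
GP3_BASELINE_IOPS = 3000

def gp2_baseline_iops(size_gb):
    """gp2 baseline: 3 IOPS per GB, clamped to the 100..16,000 range."""
    return max(100, min(16_000, 3 * size_gb))

for size_gb in (20, 100, 500, 1000):
    gp2 = gp2_baseline_iops(size_gb)
    verdict = "gp3 baseline is higher" if GP3_BASELINE_IOPS > gp2 else "roughly equal or gp2 higher"
    print(f"{size_gb:>5} GB  gp2={gp2:>5} IOPS  gp3={GP3_BASELINE_IOPS} IOPS  {verdict}")
```

Anything under 1 TB on gp2 sits below the gp3 baseline and depends on burst credits to reach 3,000 IOPS, which is why small gp2 volumes are the usual source of the spikes.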
For high-frequency OLTP (transaction processing systems doing thousands of small reads and writes per second), io2 Block Express may be the better option. A workload that needs something like 64,000 IOPS with sub-millisecond latency is well past the 16,000 IOPS that gp3 can provision. Expensive. Worth it if traces confirm IOPS is actually your bottleneck.
gRPC vs REST — When the Protocol Switch Is Worth It
gRPC uses HTTP/2 multiplexing and Protocol Buffers binary serialization. On identical endpoints with identical application logic, gRPC typically delivers 30–50% latency reduction and 60–80% payload size reduction compared to REST/JSON for data-heavy responses. I’ve reproduced those numbers internally on microservice-to-microservice calls with large response objects — they hold up.
But here’s the honest assessment: it’s not worth switching for most use cases.
Simple CRUD APIs with small JSON payloads — under 1 KB — see minimal improvement. The serialization overhead difference between JSON and protobuf at that scale lands in single-digit milliseconds. Not worth the operational cost of maintaining .proto schema files and regenerating client stubs every time something changes. Web browser clients add another layer of friction: gRPC-Web requires a proxy layer, usually Envoy, which adds operational complexity that frequently cancels out the latency benefit for typical web applications.
Switch to gRPC when you have all of these:
- Microservice-to-microservice communication where you control both sides
- High call frequency — above 100 requests per second per instance, where HTTP/2 multiplexing reduces connection overhead meaningfully
- Response payloads above 5 KB where protobuf binary encoding saves real bandwidth and parse time
- Data pipelines where large structured objects move between services constantly
The schema-first development model of protobuf is genuinely a benefit for team coordination — but it’s a real upfront cost. Evaluate whether a 35% latency improvement on your microservice calls justifies two weeks of migration work. Sometimes it absolutely does. Often it doesn’t. Measure first.
DNS and Connection Management
Two things here that feel minor and aren’t.
DNS TTL Strategy
DNS TTL controls how long resolvers cache your records before re-querying authoritative servers. Long TTLs — 3,600 seconds or more — mean resolver lookup time is near zero on repeated requests. Short TTLs — 30 to 60 seconds — mean faster propagation when you update records, but more frequent resolver queries add 10–50ms of lookup latency on cold requests.
The practical approach: drop TTL to 60 seconds starting 24–48 hours before a planned migration or deployment that changes IP addresses. After things stabilize, push TTL back to 300–3,600 seconds. You get fast propagation when you need it and low resolver overhead during steady state. A lot of teams set low TTLs permanently — apparently out of general anxiety about deployments — and then eat the resolver overhead on every single cold connection forever.
Connection Pooling — Especially for Serverless
Each new database connection costs 20–50ms. TCP handshake, TLS negotiation, database authentication — the whole ceremony. On a traditional application server holding persistent connections, this is a one-time cost per server startup. On AWS Lambda, it’s potentially a cost per invocation, since Lambda creates fresh execution environments that each establish new connections from scratch.
RDS Proxy sits between Lambda functions and RDS, maintaining a warm connection pool. Lambda invocations connect to RDS Proxy — roughly 1ms since it’s internal VPC traffic — and RDS Proxy multiplexes those onto a smaller pool of persistent RDS connections. The measured reduction in database connection overhead for serverless architectures runs 60–70%. On a function spending 35ms per invocation just on connection establishment, that’s 21–24ms back in your budget. Per invocation. At scale that adds up fast.
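Independent of RDS Proxy, the usual first step in a Lambda function is to create the connection outside the handler so warm invocations reuse it. The sketch below shows that pattern with sqlite3 standing in for a real database driver. `open_connection` and the event shape are hypothetical, and the counter exists only to make the reuse visible.

```python
import functools
import sqlite3

connect_calls = {"count": 0}

def open_connection():
    """Stand-in for the expensive TCP + TLS + auth ceremony of a real driver."""
    connect_calls["count"] += 1
    return sqlite3.connect(":memory:")

@functools.lru_cache(maxsize=1)
def get_connection():
    # Cached at module level, so it survives for as long as the
    # execution environment stays warm.
    return open_connection()

def handler(event, context):
    conn = get_connection()
    row = conn.execute("SELECT ?", (event["value"],)).fetchone()
    return {"result": row[0]}

# Three invocations in the same (warm) environment share one connection.
results = [handler({"value": i}, None) for i in range(3)]
print(connect_calls["count"], results)
```

This only helps warm invocations. Every new execution environment still opens its own connection, which is the gap a pooling layer such as RDS Proxy is meant to cover.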
For non-serverless applications, TCP keepalive prevents a similar problem — idle connections dropping and requiring a full TLS handshake, roughly 100ms, on the next request. Configure keepalive at the OS level (net.ipv4.tcp_keepalive_time in Linux, typically 60–120 seconds for cloud environments) and at the application level in your HTTP client or database driver. Most frameworks have this disabled or set to overly conservative defaults — check yours.
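At the application level, keepalive is a socket option. The sketch below enables it on a plain TCP socket and sets the per-socket timers where the platform exposes them. The 60-second idle value follows the range mentioned above, while the probe interval and count are illustrative assumptions. Most HTTP clients and database drivers expose their own setting for this, so check those first.

```python
import socket

def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive and, where supported, tune its timers for this socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timer constants are platform-specific (present on Linux),
    # so each one is guarded rather than assumed.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print("keepalive enabled:", keepalive_on)
```

Per-socket values override the system-wide net.ipv4.tcp_keepalive_time default, which is useful when you cannot change sysctl settings on the host.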
Putting It Together — Diagnose First, Then Stack These
Run a distributed trace. Break your P99 into the three buckets. Work through this list in order of impact for your specific workload. For a typical web API, the ranking I’d suggest: database queries first, CDN cache hit ratio second, connection management third, protocol selection fourth if microservices are in the picture.
You won’t need to rebuild your entire stack, but you will need a handful of reliable measurement tools: distributed tracing, query analyzers, and something tracking cache hit ratios over time. Establish a real baseline first, or you won’t know whether your changes are actually doing anything. The numbers in this guide are real and repeatable, but your stack will produce different deltas. An application with tiny JSON responses and minimal caching opportunity won’t see 60% latency reduction from CDN work. One with Lambda hitting RDS directly will see dramatic improvement from RDS Proxy. Measure your baseline, apply one change, measure again. That’s the only approach that actually works.