Multi-Cloud Container Orchestration Failures and Fixes

Why Container Orchestration Breaks Across Multiple Clouds

Multi-cloud container orchestration has gotten messy with all the version incompatibilities flying around. Kubernetes is supposed to be portable — deploy once, run anywhere — that’s literally the entire pitch. Yet I’ve spent countless hours debugging pods that run perfectly on GCP but fail silently on AWS, or watching Azure mysteriously drop connections that work everywhere else. It shouldn’t happen in 2024, but it does.

The real issue isn’t one failure mode. It’s several working together.

Start with Kubernetes version skew. AWS EKS, Azure AKS, and GCP GKE all run Kubernetes, but they don’t always run the same version at the same time. You might have 1.27.4 on EKS and 1.28.1 on GKE. Even that one-minor-version gap can introduce API deprecations, feature gate changes, and scheduler behavior changes that cascade through your deployments. A pod spec that’s valid on one cluster can quietly misbehave on another, with no obvious error message or warning. It just stops working.
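To make that skew visible before it bites, a small shell sketch like the one below can compare versions across clusters. The helper names are hypothetical, and `server_version` assumes each cluster is reachable as a kubeconfig context; it reads the version the API server reports through `kubectl get --raw /version`.

```shell
# Extract the minor version from a string such as "v1.27.4" or "1.28.1-gke.100".
minor_of() {
  printf '%s\n' "$1" | sed -E 's/^v?[0-9]+\.([0-9]+).*/\1/'
}

# Print the absolute minor-version gap between two Kubernetes versions.
minor_skew() {
  a=$(minor_of "$1")
  b=$(minor_of "$2")
  if [ "$a" -ge "$b" ]; then echo $((a - b)); else echo $((b - a)); fi
}

# Read the API server's gitVersion for one kubeconfig context.
# Defined but not called here, because it needs a reachable cluster.
server_version() {
  kubectl --context "$1" get --raw /version |
    sed -n 's/.*"gitVersion": *"\([^"]*\)".*/\1/p'
}
```

With two contexts available, something like `minor_skew "$(server_version eks-prod)" "$(server_version gke-prod)"` would print the gap; the context names here are placeholders.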

Then there are the managed service differences layered on top. On EKS, AWS runs the control plane, but the VPC, subnets, and security groups around your nodes are yours to configure. AKS handles more of that for you. GKE is the most opinionated about it. These differences mean CNI plugins behave differently. The Calico or Flannel network that worked on one cloud can introduce latency or packet loss on another. I learned this the hard way running a three-cloud failover test: our egress traffic policy worked on GKE’s native networking but got filtered on EKS because we hadn’t configured the security group correctly. Obvious in hindsight, brutal to debug at 2 AM.

Image registry authentication is another gotcha. ECR on AWS relies on IAM roles, plus short-lived registry secrets if you pull from outside AWS. Azure Container Registry uses managed identities or service principals. Artifact Registry on GCP uses workload identity. Same container images. Three entirely different authentication chains. One misconfiguration and your pods get ImagePullBackOff errors on a specific cloud while the others succeed.

And then there’s storage class mapping. EKS provisions EBS volumes, gp2 or gp3 depending on how your StorageClass is defined. AKS defaults to managed disks with different IOPS characteristics. GKE has persistent disks with their own performance tiers. A StatefulSet tuned for gp3 behavior can stall or time out when it hits AKS’s different I/O patterns, which is what makes multi-cloud storage endearing to nobody.

Diagnosing Pod Deployment Failures Across Clusters

When pods fail across clouds, you need systematic diagnostics. Probably should have opened with this section, honestly — skipping these steps costs hours.

Start with node status across all three clusters:

kubectl get nodes -o wide

Check the output. You’re looking for NotReady status, missing resources, or kernel version differences. On GKE, you’ll often see nodes with specific machine types — n2-standard-4. On AKS, they’re listed as agent pools. On EKS, they’re worker node EC2 instances. Same responsibility, different labels.
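Running the same command by hand against three clusters gets tedious, so a thin wrapper can help. This is a sketch that assumes each cluster is a kubeconfig context listed in a `CONTEXTS` variable; the function name and the `KUBECTL` override are hypothetical conveniences, not part of kubectl.

```shell
# Binary to invoke; overridable so the wrapper can be dry-run with `echo`.
KUBECTL="${KUBECTL:-kubectl}"

# Run the same kubectl arguments against every context named in $CONTEXTS,
# printing a header line before each cluster's output.
across_clusters() {
  for ctx in $CONTEXTS; do
    echo "== $ctx =="
    "$KUBECTL" --context "$ctx" "$@"
  done
}
```

For example, `CONTEXTS="eks-prod aks-prod gke-prod" across_clusters get nodes -o wide` would print the node list for each cluster in turn (the context names are placeholders).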

Next, validate resource allocation:

kubectl describe node <node-name>

This shows allocated CPU, memory, and ephemeral storage. Compare the same metric across clouds. I once found that AKS was allocating 1Gi more memory per node for system pods than GKE, causing our deployment to fit on GKE but fail on AKS with insufficient resources — even though we’d sized both clusters identically.
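Comparing allocatable memory is easier when every cluster reports it in the same unit. The sketch below normalizes quantities to MiB; it only handles plain bytes and the Ki, Mi, and Gi suffixes, which is an assumption about what your nodes report. The function names are hypothetical.

```shell
# Convert a Kubernetes memory quantity (bytes, Ki, Mi, or Gi only) to whole MiB.
to_mib() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *)   echo $(( $1 / 1048576 )) ;;
  esac
}

# Read "name cpu memory" lines on stdin and print memory normalized to MiB.
format_allocatable() {
  while read -r name cpu mem; do
    printf '%s cpu=%s mem=%sMi\n' "$name" "$cpu" "$(to_mib "$mem")"
  done
}

# Produce those lines from a live cluster.
# Defined but not called here, because it needs kubectl access.
allocatable_for() {
  kubectl --context "$1" get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory |
    format_allocatable
}
```

Calling `allocatable_for` once per context gives rows you can diff directly, which is how a per-node gap like the 1Gi difference described above would show up.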

For actual pod failures, check events first:

kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Events tell you if scheduling failed, image pulls failed, or readiness probes are hanging. See ImagePullBackOff? Check registry authentication immediately. See FailedScheduling with “Insufficient cpu”? Your resource requests exceed what the nodes on that cloud can actually allocate.
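The triage rules above can be written down as a small classifier so the first pass is mechanical. This is a sketch: the category names are made up for illustration, and the patterns cover only the cases mentioned here, matched against the text of an event line.

```shell
# Map the text of an event line to a rough "where to look first" category.
triage_event() {
  case "$1" in
    *ImagePullBackOff*|*ErrImagePull*|*"Failed to pull image"*) echo "registry-auth" ;;
    *FailedScheduling*|*Insufficient*)                          echo "resources" ;;
    *Unhealthy*|*"probe failed"*)                               echo "probes" ;;
    *)                                                          echo "other" ;;
  esac
}

# List only Warning events for a namespace, oldest first.
# Defined but not called here, because it needs kubectl access.
warnings_for() {
  kubectl get events -n "$1" --field-selector type=Warning --sort-by='.lastTimestamp'
}
```

Piping `warnings_for <namespace>` through a `while read -r line` loop that calls `triage_event "$line"` gives a quick per-cloud tally of which failure class dominates.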

Then describe the specific pod:

kubectl describe pod <pod-name> -n <namespace>

Look at the Events section at the bottom. It’ll tell you exactly where the pod got stuck. If you see “pulling image” taking 3+ minutes on one cloud but 30 seconds on another, that’s a registry latency or authentication issue.

Finally, check logs — but be aware they might be empty if the container never started:

kubectl logs <pod-name> -n <namespace> --previous

The --previous flag gets logs from the last container incarnation if the pod has crashed and restarted.

Fixing Common Orchestration Issues in Each Cloud

AWS EKS — Managing IAM and Network Policies

EKS pods fail most often because of IAM role misconfigurations. Your cluster has a role, your nodes have a role, and individual pods can get their own roles via IRSA (IAM Roles for Service Accounts). Image pulls from ECR are made by the kubelet, so it is the node instance role, not the pod’s service account role, that needs the ECR read policy attached:

aws iam attach-role-policy --role-name <node-role-name> --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

For pods that call AWS APIs themselves, annotate the service account with its IRSA role:

kubectl annotate serviceaccount <sa-name> -n <namespace> eks.amazonaws.com/role-arn=arn:aws:iam::<account>:role/<role-name>

Restart your deployment so the pods pick up the change:

kubectl rollout restart deployment <name> -n <namespace>
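A typo in the role ARN annotation is easy to miss, so it can be worth checking what actually landed on the service account. The sketch below reads the annotation back and sanity-checks its shape. The helper names are hypothetical, and the pattern assumes the standard `aws` partition with a 12-digit account ID.

```shell
# Succeed if the argument looks like an IAM role ARN in the standard "aws" partition.
is_role_arn() {
  printf '%s\n' "$1" | grep -Eq '^arn:aws:iam::[0-9]{12}:role/.+$'
}

# Print the IRSA annotation on a service account (arguments: name, namespace).
# Defined but not called here, because it needs cluster access.
irsa_role_of() {
  kubectl get serviceaccount "$1" -n "$2" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}
```

A check such as `is_role_arn "$(irsa_role_of <sa-name> <namespace>)" || echo "annotation missing or malformed"` catches an empty or unfilled placeholder value before you restart anything.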

EKS also enforces security groups on the pod network. If your pods can’t reach services in other clusters, check the worker node security group. It needs to allow inbound/outbound on the ports your services use — typically TCP/443 for API communication. Don’t make my mistake of assuming this is automatic.

Azure AKS — Handling Managed Identity and Storage

AKS uses managed identities instead of IAM roles. If your pod can’t pull from ACR, you need either a managed identity attached to the node pool or a docker-registry secret:

kubectl create secret docker-registry <secret-name> --docker-server=<registry>.azurecr.io --docker-username=<username> --docker-password=<password> -n <namespace>

Add this to your pod spec:

imagePullSecrets:
  - name: <secret-name>

AKS managed disks behave differently than other clouds. If you have a StatefulSet using storage, the default storage class might have different IOPS characteristics. Create a custom storage class matching your actual requirements:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
  cachingmode: ReadWrite
EOF

Reference this in your PVC: storageClassName: fast-ssd
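For context, here is roughly what a claim against that class could look like. This is a sketch: the `render_pvc` helper, the claim name, and the `ReadWriteOnce` access mode are assumptions for illustration, not settings taken from a real workload.

```shell
# Print a PersistentVolumeClaim manifest that uses the fast-ssd storage class.
# Arguments: claim name, requested size (for example 100Gi).
render_pvc() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: $2
EOF
}
```

The output can be piped straight into the cluster with `render_pvc data-claim 100Gi | kubectl apply -n <namespace> -f -`.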

GCP GKE — Configuring Workload Identity and Service Mesh

GKE uses Workload Identity for pod authentication. It’s elegant, and probably the best option of the three: pod authentication always needs some bridge between Kubernetes and cloud IAM, and this one is a direct binding between the two kinds of service account. Bind your Kubernetes service account to a Google service account:

gcloud iam service-accounts add-iam-policy-binding <gsa-name>@<project>.iam.gserviceaccount.com --role roles/iam.workloadIdentityUser --member "serviceAccount:<project>.svc.id.goog[<namespace>/<ksa-name>]"

Annotate the Kubernetes service account:

kubectl annotate serviceaccount <ksa-name> -n <namespace> iam.gke.io/gcp-service-account=<gsa-name>@<project>.iam.gserviceaccount.com

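Many GKE clusters don’t enforce NetworkPolicy resources unless you turn enforcement on, which sometimes masks connectivity issues that show up on stricter clusters. If your microservices work on GKE but fail on EKS or AKS, enable network policy enforcement to match. On an existing cluster the add-on usually has to be enabled first:

gcloud container clusters update <cluster-name> --update-addons=NetworkPolicy=ENABLED

gcloud container clusters update <cluster-name> --enable-network-policy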
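Once enforcement is on, one way to make GKE behave like a stricter cluster is a default-deny ingress policy in a test namespace, so any missing allow rule fails loudly. The sketch below only renders the manifest; the `render_default_deny` helper and the policy name are hypothetical.

```shell
# Print a NetworkPolicy that denies all ingress to every pod in a namespace.
# Argument: namespace name.
render_default_deny() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
}
```

Applying it with `render_default_deny <namespace> | kubectl apply -f -` and rerunning your service-to-service checks shows which connections were only working because nothing was being enforced.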

Testing Orchestration Consistency Before Production

Before deploying production workloads across clouds, validate consistency. Deploy a test application with specific resource requirements and observe behavior:

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-cloud-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
        - name: test
          image: nginx:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 250m
              memory: 256Mi
EOF

Watch rollout timing across all three clusters:

kubectl rollout status deployment multi-cloud-test

Compare completion times. Significant variance indicates scheduler or network differences. Then test failover — delete pods on one cluster and verify they restart. Cross-cluster service discovery, if you’re using it, should handle the transition seamlessly. If it doesn’t, you’ve found a configuration gap before production.
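To compare completion times without a stopwatch, the rollout can be wrapped in a small timer. This is a sketch with second-level resolution; the helper names and the 300-second timeout are assumptions.

```shell
# Run a command and print how many whole seconds it took.
# The command's own output goes to stderr so stdout carries only the number.
time_cmd() {
  start=$(date +%s)
  "$@" >&2
  end=$(date +%s)
  echo $((end - start))
}

# Time the test rollout on one kubeconfig context.
# Defined but not called here, because it needs cluster access.
rollout_seconds() {
  time_cmd kubectl --context "$1" rollout status deployment multi-cloud-test --timeout=300s
}
```

Running `rollout_seconds` once per context right after applying the test Deployment gives three numbers you can put side by side.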

Monitoring Container Health Across All Clouds

Observability prevents orchestration drift from becoming production incidents. You won’t need every monitoring tool out there, but you will need a handful of observability resources — at least if you care about reliability.

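Aggregate logs from all three clusters into a single system. Datadog, New Relic, or open-source stacks like Loki all work; what matters more is that every log line can be traced back to its cluster.

Tag all logs with cluster identity. One way is to label the nodes and have your log agent attach those labels. On EKS, for example, pass this to the node bootstrap script:

--kubelet-extra-args="--node-labels=cloud-provider=aws,cluster=prod-east"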

This makes filtering and correlation across clouds trivial. Set up alerts on pod restart rates per cloud. If GKE restarts pods 5x more often than EKS for the same workload, something’s wrong with that cluster’s configuration or capacity.
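Before a full alerting pipeline exists, a rough restart comparison can come straight from `kubectl get pods`. The sketch below sums the RESTARTS column for one namespace; it assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE), and the function names are hypothetical.

```shell
# Sum the RESTARTS column (4th field) of `kubectl get pods --no-headers` output on stdin.
# Values such as "3 (5m ago)" still count as 3, because only the 4th field is read.
sum_restarts() {
  awk '{ total += $4 } END { print total + 0 }'
}

# Total restarts for one namespace on one kubeconfig context.
# Defined but not called here, because it needs cluster access.
restarts_for() {
  kubectl --context "$1" get pods -n "$2" --no-headers | sum_restarts
}
```

Comparing `restarts_for <context> <namespace>` across clusters for the same workload is a quick way to spot the kind of 5x gap described above.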

Monitor the kubelet itself. Different clouds sometimes ship different versions or configurations. Check kubelet logs directly on nodes for permission errors, volume mount issues, or network plugin problems — those signal cloud-specific issues before they hit application pods.

Finally, validate that your container runtime versions match. Run this across clusters:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'

Container runtime skew between Docker, containerd, and CRI-O introduces subtle behavioral differences that manifest under load. I’m apparently the type who runs this check monthly while others never bother until things break.
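With more than a handful of nodes, a per-runtime count is easier to compare than the raw list. This sketch summarizes tab-separated node and runtime pairs; the function names are hypothetical, and the context argument assumes each cluster is a kubeconfig context.

```shell
# Count nodes per container runtime from "node<TAB>runtime" lines on stdin.
runtime_summary() {
  awk -F '\t' '{ count[$2]++ } END { for (r in count) print count[r], r }' | LC_ALL=C sort
}

# Summarize runtimes for one kubeconfig context.
# Defined but not called here, because it needs cluster access.
runtimes_for() {
  kubectl --context "$1" get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}' |
    runtime_summary
}
```

If `runtimes_for` prints more than one line for a cluster, or different versions between clusters, that is the skew worth investigating.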

Marcus Chen
