Azure to GCP: Pragmatic Migration for Cloud Engineering Teams
Avoiding unnecessary downtime and keeping costs contained are the two priorities that turn an Azure-to-GCP migration from a theoretical exercise into a technical achievement. Many teams face the migration question because of a merger, product fit, or simply a desire to better leverage Google’s stream-processing and ML toolchain. Whatever the driver, the underlying engineering problem is the same: how to move real, running workloads without business interruption.
Azure vs. GCP: Key Engineering Drivers
Any team considering migration should review core differences, not marketing claims. Here’s a fact-based shortlist:
- ML/AI stack: Google Vertex AI, AutoML, and TPUs are fully managed, with smooth integration—Azure’s equivalents (e.g., Azure ML) often lag in ecosystem maturity.
- Analytics at scale: BigQuery’s serverless columnar engine outperforms Synapse for “big scan” workloads.
- Kubernetes: GKE is generally acknowledged to have faster rollout velocity and lower operational toil than AKS, especially post-1.22.
- Pricing dynamics: GCP’s custom VM types and sustained use discounts genuinely impact long-term cost structure, as real TCO analyses confirm.
- Backbone network: GCP’s global load balancing can be a differentiator in transregional architectures.
Note: Most organizations don’t need every GCP feature, but missing a critical mapping results in expensive rework.
Step 1: Baseline, Inventory, and Stakeholder Buy-In
Inventory your assets—there’s no shortcut:
- Pull an inventory export of your Azure resources: `az resource list --output tsv > ./azure-inventory.tsv` (a quick classification one-liner follows this list).
- Classify: By workload type (stateless, stateful), regulatory zone, and uptime requirement.
- Dependencies: Gather network diagrams to untangle service mesh, Azure-specific PaaS (e.g., Service Bus), and external integrations.
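A quick way to size the effort before classifying is to count resources by type straight from the CLI; a minimal sketch, assuming the az CLI is logged in to the right subscription:

```bash
# Count Azure resources by type to get a rough picture of migration scope
az resource list --query "[].type" --output tsv | sort | uniq -c | sort -rn
```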
Few migrations stall on compute; most derail on untracked dependencies or hidden data gravity. Review all DNS, identity providers, and hardwired endpoints.
Define goals with engineering leadership: is this a “lift-and-shift” (minimize code changes), or do you want to re-platform critical workloads for scalability and cost (e.g., moving a VM-based API to Cloud Run)? Document this up front. Skimp here, regret later.
Step 2: Service and API Mapping
Not all services translate one-to-one from Azure to GCP. Avoid assuming feature parity.
Azure Service | GCP Equivalent | Caveats |
---|---|---|
Virtual Machines (VM) | Compute Engine | Disk types, GPU availability differs |
Blob Storage | Cloud Storage | Lifecycle, inventory API differences |
SQL Database | Cloud SQL / Spanner | Spanner is not a direct SQL substitute |
AKS (Kubernetes) | Google Kubernetes Engine (GKE) | Ingress controllers, versioning |
Azure Functions | Cloud Functions | Timeout limits, triggers vary |
Cosmos DB | Firestore / Bigtable | Consistency, partitioning differences |
Azure AD | Cloud Identity, IAM | SAML/OIDC integration varies |
Gotcha: Azure’s Managed Identity does not directly map to GCP—service accounts must be architected and legacy code may require patching.
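In practice, each Managed Identity usually becomes a dedicated, narrowly scoped service account. A minimal sketch, with placeholder project, account, and role names (not tied to any specific workload):

```bash
# Dedicated service account replacing an Azure Managed Identity (names are placeholders)
gcloud iam service-accounts create app-backend-sa \
  --display-name="app-backend (replaces Managed Identity)"

# Grant only the roles the workload actually needs
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:app-backend-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
```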
Step 3: GCP Environment Construction
Standard approach:
- Project creation with billing constraints: Use `gcloud projects create`, followed by an explicit `gcloud billing projects link` (see the sketch after this list).
- IAM modeling: Establish custom roles and avoid Owner-wide permissions in production. Consider mapping existing RBAC patterns.
- Network topology: Replicate VNET structure (subnets, peering, VPN, firewall) using Terraform for repeatability. Sample snippet below:
resource "google_compute_network" "core" {
name = "azure-migrated-net"
auto_create_subnetworks = false
}
- API enablement: Automate with `gcloud services enable compute.googleapis.com storage.googleapis.com ...`.
- Service account hygiene: Create per-app service accounts. Rotate keys prior to cutover.
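For the first bullet, the project/billing step looks roughly like this (project ID and billing account ID are placeholders):

```bash
# Create the landing project and attach it to a billing account explicitly
gcloud projects create my-migrated-proj --name="Migrated workloads"
gcloud billing projects link my-migrated-proj \
  --billing-account=000000-AAAAAA-BBBBBB
```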
Note: Don’t assume Azure’s NSG rules match GCP’s VPC firewall semantics—test with staging firewalls ahead of time.
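Azure NSGs attach rules to subnets or NICs, while GCP firewall rules live at the VPC network level and are scoped via target tags or service accounts. A minimal sketch of an HTTPS allow rule on the network defined above (tag and source range are illustrative):

```bash
# Allow inbound HTTPS only to instances tagged "frontend" on the migrated network
gcloud compute firewall-rules create allow-https-frontend \
  --network=azure-migrated-net \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:443 \
  --source-ranges=0.0.0.0/0 \
  --target-tags=frontend
```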
Step 4: Migrating Compute—VMs and Kubernetes Workloads
VM Hosting: Practical Methods
Approach: Export each VM disk as a VHD, upload it to Cloud Storage, and import via `gcloud compute images import`.
Example Windows Server migration (tested with 2019 and 2022):
- Azure: Export via the portal, or deallocate the VM, grant temporary read access to its OS disk, and download the VHD, e.g. `az disk grant-access -n <os-disk-name> -g <rg> --access-level Read --duration-in-seconds 3600` (the returned SAS URL is what you download).
- GCP: Upload with `gsutil cp`, then:

```bash
gcloud compute images import legacy-sql-2019 \
  --source-file gs://my-bucket/vm-disk.vhd \
  --os windows-2019 \
  --timeout=6h
```
Known issue: Some Azure VMs with nested hypervisors fail import with the error `FAILED_PRECONDITION: osconfig agent not available`.
Trade-off: Recreating the image from scratch on GCP sometimes yields better network driver performance than importing existing disks.
Container Workloads
Images:
Push images from Azure Container Registry to Google Artifact Registry:
```bash
# Registry and project names are placeholders; authenticate to both registries first
docker login myregistry.azurecr.io
gcloud auth configure-docker us-central1-docker.pkg.dev
docker pull myregistry.azurecr.io/myteam/app:v2
docker tag myregistry.azurecr.io/myteam/app:v2 us-central1-docker.pkg.dev/myproj/myrepo/app:v2
docker push us-central1-docker.pkg.dev/myproj/myrepo/app:v2
```
Manifests:
Update the storage class spec (`standard-rwo` in GKE is not equivalent to `managed-premium` on AKS). For Ingress, rewrite manifests for the GKE Ingress controller, or use the NGINX ingress controller as a bridge while you test.
Helm charts may need patching for cloud-provider-specific annotations. Don’t forget to audit for embedded Azure resource URIs.
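One cheap way to catch stragglers is to grep the rendered manifests; a minimal sketch, assuming the chart lives under ./charts/app:

```bash
# Render the chart and flag Azure-specific storage classes, annotations, and registry URIs
helm template ./charts/app | grep -inE 'azurecr\.io|managed-premium|azure' \
  || echo "No Azure references found"
```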
Step 5: Data and Database Migration
Here’s where most migrations hit risk.
Databases:
If running SQL Server/Azure SQL Database, use Database Migration Service (DMS):
- Target Cloud SQL (MySQL/Postgres/SQL Server).
- Set up source as read-replica if possible.
- DMS configuration may require custom flags for collation or encryption (e.g., `--enable_sqlserver_ae`).
Real-World Example
Migrated a 3 TB Azure SQL database to Cloud SQL (PostgreSQL 14). The initial DMS full transfer took ~8 hours; change data capture (CDC) during cutover applied ~15 minutes of delta.
Common error: `peer authentication failed for user "migration_user"`. Caused by a role mapping mismatch; verify with `SELECT * FROM pg_roles;`.
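If the role simply does not exist on the target instance, creating it before the CDC phase avoids a second sync. A minimal sketch, with placeholder instance name and password:

```bash
# Create the migration user on the target Cloud SQL instance ahead of cutover
gcloud sql users create migration_user \
  --instance=my-cloudsql-instance \
  --password=CHANGE_ME
```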
Blob Storage:
Storage Transfer Service is recommended for >1 TB. Parallelization is critical: set `--max-worker-count` to at least 10 for large buckets (a job sketch follows the tip below).
Tip: Validate ACL and signed URL expiration semantics; they are subtly different from Azure to GCP.
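For the transfer itself, a minimal sketch using a one-off Storage Transfer Service job; storage account, container, and bucket names are placeholders, and creds.json is assumed to hold the Azure SAS token in the format the service expects:

```bash
# One-off Storage Transfer Service job: Azure Blob container -> GCS bucket
gcloud transfer jobs create \
  https://mystorageaccount.blob.core.windows.net/app-assets gs://my-gcp-bucket \
  --source-creds-file=creds.json
```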
Step 6: Validation and Pre-Cutover Testing
- End-to-end smoke tests; automate via your CI/CD (e.g., GitHub Actions, Cloud Build). A minimal example appears at the end of this step.
- Benchmark key workloads, and review process lists (`ps`) and kernel logs for errors or leftover Hyper-V driver messages such as `kernel: [ 4.712395] hv_balloon: Max. dynamic memory size: 13056 MB`.
- Validate all external APIs—especially OAuth redirect URIs—against GCP endpoints.
Some failures will only emerge under production load. Plan a full blue-green or shadow deployment if possible.
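For the smoke tests, even a trivial script run from CI after each deployment catches the obvious regressions; the endpoints below are placeholders:

```bash
#!/usr/bin/env bash
# Fail fast if any critical endpoint stops returning 2xx after the move
set -euo pipefail
for url in "https://api.example.com/healthz" "https://api.example.com/v1/status"; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  [[ "$code" =~ ^2 ]] || { echo "FAIL $url -> $code"; exit 1; }
  echo "OK $url -> $code"
done
```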
Step 7: Cutover, Observability, and Post-Launch Optimization
- DNS switch: Reduce TTL to 60 seconds at least 48 hours before cutover (see the sketch after this list).
- Production switchover: Use maintenance windows for major service transitions.
- Instrumentation: Enable Cloud Monitoring (Stackdriver) and set up alerting on error rates, latency, and spending spikes.
- Review IAM: Identify any lingering global roles. Least privilege, always.
- Cost controls: Activate Budgets and Recommendations API.
Known cost pitfall: orphaned disks and reserved IPs driving up untracked spend.
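If the zone is already hosted in Cloud DNS, the TTL drop from the first bullet is a one-liner; zone and record names are placeholders (if DNS still lives in Azure at this point, make the equivalent change there):

```bash
# Lower the TTL on the public A record well before cutover so the switch propagates quickly
gcloud dns record-sets update api.example.com. \
  --zone=prod-zone --type=A --ttl=60 \
  --rrdatas=203.0.113.10
```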
Advanced Tips
- Use Terraform or Google’s Deployment Manager for repeatable infra creation; version configs in your source repo.
- For hybrid workloads, evaluate Anthos. Not always straightforward—egress charges and ops complexity can negate benefits for some teams.
- Security: Disable automatic mounting of the default service account token for GKE workloads (`automountServiceAccountToken: false`) unless it is explicitly needed.
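The setting can go on the pod spec or on the ServiceAccount itself; a minimal sketch patching a namespace's default service account (namespace name is a placeholder):

```bash
# Stop auto-mounting the default SA token for pods that don't explicitly opt in
kubectl patch serviceaccount default -n my-namespace \
  -p '{"automountServiceAccountToken": false}'
```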
Not a Silver Bullet
No migration is perfect on the first pass. Expect at least one major adjustment post-migration, often in IAM or DNS. Some scripts and error patterns (e.g., transient `429: RESOURCE_EXHAUSTED` during DMS cutover) are unavoidable in larger moves.
The process above emphasizes standard control points, but leaves room for engineering judgement. New features crop up; no checklist stays perfectly current. For SQL workloads, one caution: the downtime window during CDC sync is never exactly what the docs claim. Test everything.
Integrate lessons learned into your runbooks—because odds are, this won’t be your last cloud migration.
For code samples, nuanced errors, or questions about mixed-platform networking, bring them to the comments or DM.