#Cloud #Migration #Storage #AWS #GCP

Migrating Data from AWS S3 to Google Cloud Storage: Field Experience

Modern cloud strategies often start with object storage in AWS S3, only to shift later to Google Cloud Storage (GCS) for reasons ranging from tighter integration with GCP-native analytics, to cost controls, to stricter sovereignty requirements. Regardless of the motivation, S3-to-GCS migrations introduce their own operational headaches: bandwidth costs, ACL mismatches, and unpredictable performance bottlenecks.

Concrete, low-friction data transfer matters. Here’s what works in practice—details, warts, and edge cases included.


Pre-Migration Checklist

  • Data Sizing: Know your total object count and cumulative size. S3’s Inventory Reports help, especially for multi-million object buckets; for smaller buckets, the quick CLI check after this list is usually enough.
  • IAM Permissions:
    • AWS: s3:GetObject and s3:ListBucket permissions are required on all source data.
    • GCP: Grant roles/storage.objectCreator on destination GCS buckets; avoid broader roles unless necessary.
  • Bucket Configuration:
    • Create target GCS buckets in your target region. Configure Uniform bucket-level access if future ACL simplification is a goal.
  • Network Planning:
    • Budget for AWS egress and GCP ingress. Beware: inter-region or inter-continental transfers get expensive, fast.
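
For the data-sizing step, a quick AWS CLI pass is often enough when S3 Inventory feels like overkill (a small sketch; the bucket name is a placeholder):

# Print total object count and cumulative size for the source bucket
aws s3 ls s3://source-bucket --recursive --summarize --human-readable | tail -2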

Option 1: Direct gsutil Transfer

gsutil (tested: SDK v440.0.0, gsutil 5.27) supports S3 natively, so objects can be copied straight from S3 to GCS without staging to local disk, which cuts both transfer time and failure points.

Sample .boto configuration (by default ~/.boto; gsutil also honors the BOTO_CONFIG environment variable):

[Credentials]
aws_access_key_id = AKIA<REDACTED>
aws_secret_access_key = <REDACTED>

Confirm that gsutil picks up your boto config (the output lists the config path and boto version):

$ gsutil version -l | grep boto

Transfer Command:

gsutil -m cp -r "s3://source-bucket/data/*" "gs://dest-bucket/data/"
  • -m: Parallel (multi-threaded, multi-process) copy. Omit it if your link is bursty.
  • Errors like NoSuchUpload or SlowDown from AWS? Retry with -o "GSUtil:parallel_process_count=2"; S3 throttling is common under load (see the retry example below).
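
If throttling persists, a retried run with reduced parallelism might look like this (a sketch; the same values can also live in the .boto file's [GSUtil] section):

gsutil -m \
  -o "GSUtil:parallel_process_count=2" \
  -o "GSUtil:parallel_thread_count=4" \
  cp -r "s3://source-bucket/data/*" "gs://dest-bucket/data/"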

Note: Metadata such as custom headers or ACLs may drop in transit; post-process adjustments may be required with gsutil setmeta.
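
For instance, re-applying a Content-Type and a custom metadata key after the copy could look like this (illustrative values; the x-goog-meta- prefix is how GCS stores custom metadata):

gsutil -m setmeta \
  -h "Content-Type:application/json" \
  -h "x-goog-meta-migrated-from:s3" \
  "gs://dest-bucket/data/**"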


Option 2: rclone — When You Need Synchronization or Fine Control

When dealing with >100 million small objects or complex key hierarchies, rclone (1.62+ recommended) is often faster and more resilient than gsutil.

Configure both remotes:

rclone config
# Define 's3src' for AWS S3 and 'gcstdest' for GCS
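
The resulting rclone.conf entries might look roughly like this (values are placeholders; exact options depend on your auth setup):

[s3src]
type = s3
provider = AWS
access_key_id = AKIA<REDACTED>
secret_access_key = <REDACTED>
region = us-east-1

[gcstdest]
type = google cloud storage
service_account_file = /path/to/service-account.json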

Copy Command:

rclone copy s3src:bucket/key-prefix gcstdest:bucket --progress --transfers=32

Gotchas:

  • Performance: --transfers above 32 can starve local resources or encounter API rate limits.
  • Sync: Use rclone sync instead to mirror source to destination exactly. Beware: it deletes destination objects that are not in the source (a dry-run example follows this list).
  • Debugging: Watch for throttling:
    ERROR : ...: Failed to copy: RequestError: send request failed
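
Before letting rclone sync delete anything, a dry run shows what it would change (bucket names are placeholders):

# Report copies and deletions without performing them
rclone sync s3src:bucket/key-prefix gcstdest:bucket --dry-run --progress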
    

Option 3: Programmatic — When Custom Logic or Filtering is Required

Python-based migrations using boto3 + google-cloud-storage offer maximum flexibility. Example: filtering by last-modified date.

import boto3
from google.cloud import storage

s3 = boto3.client('s3')
gcs = storage.Client()
gcs_bucket = gcs.bucket('my-gcs-bucket')

def migrate_s3_to_gcs(s3_key):
    s3_obj = s3.get_object(Bucket='source-s3', Key=s3_key)
    content = s3_obj['Body'].read()
    blob = gcs_bucket.blob(s3_key)
    blob.upload_from_string(content)
    print(f"-- Migrated {s3_key}")

# NOTE: Inefficient for objects >250 MB due to memory consumption.

Critically: For multi-GB objects, switch to chunked streaming; otherwise memory use grows with object size and a MemoryError is only a matter of time.
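
One way to do that, reusing the client and bucket handles above, is to stream the S3 body into a GCS resumable upload so only one chunk sits in memory at a time (a sketch; blob.open requires a reasonably recent google-cloud-storage release):

def migrate_s3_to_gcs_streaming(s3_key, chunk_size=8 * 1024 * 1024):
    s3_obj = s3.get_object(Bucket='source-s3', Key=s3_key)
    blob = gcs_bucket.blob(s3_key)
    blob.chunk_size = chunk_size  # resumable-upload chunk size (multiple of 256 KiB)
    with blob.open('wb') as gcs_file:
        # StreamingBody.iter_chunks yields the object in fixed-size pieces
        for chunk in s3_obj['Body'].iter_chunks(chunk_size=chunk_size):
            gcs_file.write(chunk)
    print(f"-- Streamed {s3_key}")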


Post-Migration: Verification and Tune-Up

  • Data Integrity: Use rclone check to compare checksums across clouds, or spot-check MD5s with gsutil ls -L (see the example after this list). Note that S3 ETags for multipart uploads are not plain MD5s.
  • ACL Normalization: S3’s bucket policies do not translate to GCS IAM. Use gsutil iam set or the Cloud Console to update.
  • Versioning: S3’s versioned objects need custom handling. GCS supports object versioning, but it must be enabled per bucket.
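
A one-way integrity pass with rclone, for example, compares sizes and hashes without touching the data (bucket names are placeholders):

# --one-way ignores objects that exist only at the destination
rclone check s3src:bucket/key-prefix gcstdest:bucket --one-way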

Known Issue: Object Name Pitfalls

S3 permits characters and key structures (e.g., trailing slashes, non-printable ASCII) that can break GCS workflows. gsutil may quietly skip, error, or rename such objects. Always examine logs for:

CommandException: File not found: ...
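
A quick pre-flight scan for such keys, reusing the boto3 client from Option 3, might look like this (a sketch; it also flags legal non-ASCII keys, which are merely worth reviewing):

import re

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='source-s3'):
    for obj in page.get('Contents', []):
        key = obj['Key']
        # Trailing slashes ("folder" placeholders) and non-printable or
        # non-ASCII characters are the usual offenders
        if key.endswith('/') or re.search(r'[^\x20-\x7e]', key):
            print(f"Suspicious key: {key!r}")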

Quick Summary Table

Approach     Large Scale   Metadata Retention   Fine Filtering   Notes
gsutil       Good          Partial              No               Easiest for basic copy
rclone       Great         Good                 No               Resumes, resilient
Python SDK   Poor          Full                 Yes              For complex/filtered tasks

Final Thoughts

No single tool handles every edge case. For bulk, rclone usually wins. For ACL or versioning precision, scripting is required. AWS and GCP rate limits will bite—test with a small sample before any significant transfer.

For ongoing bi-directional sync, revisit architectural assumptions. Point-to-point sync isn’t always worth the operational complexity.

Got a region-crossing transfer? If network cost or transfer time is prohibitive, check your providers’ data transfer programs or offline appliances like AWS Snowball and GCP Transfer Appliance.