Migrating Data from AWS S3 to Google Cloud Storage: Field Experience
Modern cloud strategies often start with object storage in AWS S3, only to shift later to Google Cloud Storage (GCS) for reasons ranging from tighter integration with GCP-native analytics, to cost controls, to stricter sovereignty requirements. Regardless of the motivation, S3-to-GCS migrations introduce their own operational headaches: bandwidth costs, ACL mismatches, and unpredictable performance bottlenecks.
Concrete, low-friction data transfer matters. Here’s what works in practice—details, warts, and edge cases included.
Pre-Migration Checklist
- Data Sizing: Know your total object count and cumulative size. S3's Inventory Reports help, especially for multi-million-object buckets (a quick sizing sketch follows this checklist).
- IAM Permissions:
  - AWS: `s3:GetObject` and `s3:ListBucket` (read and list) permissions are required on all source data.
  - GCP: grant `roles/storage.objectCreator` on the destination GCS buckets; avoid broader roles unless necessary.
- Bucket Configuration:
  - Create target GCS buckets in your target region. Enable `Uniform bucket-level access` if future ACL simplification is a goal.
- Network Planning:
  - Budget for AWS egress charges (GCS ingress is generally free). Beware: inter-region or inter-continental transfers get expensive, fast.
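If Inventory Reports are not enabled on the source bucket, a quick (if slow for huge buckets) way to size it is to walk the listing with the `boto3` paginator. A minimal sketch; the bucket name is hypothetical:

```python
# Rough bucket sizing via the S3 ListObjectsV2 paginator.
# Hypothetical bucket name; slow for very large buckets -- prefer S3 Inventory there.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total_objects = 0
total_bytes = 0
for page in paginator.paginate(Bucket="source-bucket"):
    for obj in page.get("Contents", []):
        total_objects += 1
        total_bytes += obj["Size"]

print(f"{total_objects} objects, {total_bytes / 1e12:.2f} TB")
```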
Option 1: Direct gsutil Transfer
Since `gsutil` (tested: Google Cloud SDK 440.0.0, gsutil 5.27) supports S3 natively, you can copy bucket-to-bucket, skipping the intermediate disk and reducing both transfer time and failure points.
Sample `.boto` configuration:
[Credentials]
aws_access_key_id = AKIA<REDACTED>
aws_secret_access_key = <REDACTED>
Check that the `boto` credentials are being picked up by your `gsutil`:
$ gsutil version -l | grep boto
Transfer Command:
gsutil -m cp -r "s3://source-bucket/data/*" "gs://dest-bucket/data/"
- `-m`: multi-threaded/parallel copy. Omit it if your link is bursty.
- Errors like `NoSuchUpload` or `SlowDown` from AWS? Retry with `-o "GSUtil:parallel_process_count=2"`. S3 throttles are common under load.
Note: Metadata such as custom headers or ACLs may be dropped in transit; post-processing with `gsutil setmeta` may be required.
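If you would rather script that fix-up than run `gsutil setmeta` object by object, a minimal sketch with the `google-cloud-storage` client follows; the bucket name, key, and header values here are hypothetical, and in practice you would drive it from metadata captured from S3 before the copy.

```python
# Re-apply metadata that was dropped during the copy.
# Hypothetical bucket, key, and header values.
from google.cloud import storage

gcs = storage.Client()
bucket = gcs.bucket("dest-bucket")

blob = bucket.get_blob("data/report.csv")      # fetch current object state
blob.content_type = "text/csv"
blob.cache_control = "public, max-age=3600"
blob.metadata = {"x-source": "s3-migration"}   # custom x-goog-meta-* headers
blob.patch()                                   # send only the changed fields
```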
Option 2: rclone — When You Need Synchronization or Fine Control
When dealing with >100 million small objects or complex key hierarchies, `rclone` (1.62+ recommended) is often faster and more resilient than `gsutil`.
Configure both remotes:
rclone config
# Define 's3src' for AWS S3 and 'gcstdest' for GCS
Copy Command:
rclone copy s3src:bucket/key-prefix gcstdest:bucket --progress --transfers=32
Gotchas:
- Performance: `--transfers` above 32 can starve local resources or run into API rate limits.
- Sync: use `rclone sync` instead to match source and destination exactly. Beware: this deletes destination objects that are not in the source.
- Debugging: watch the logs for throttling, e.g. `ERROR : ...: Failed to copy: RequestError: send request failed`.
Option 3: Programmatic — When Custom Logic or Filtering is Required
Python-based migrations using `boto3` + `google-cloud-storage` offer maximum flexibility. Example: filtering by last-modified date (the per-object copy is below; a date-filtered driver loop follows it).
import boto3
from google.cloud import storage
s3 = boto3.client('s3')
gcs = storage.Client()
gcs_bucket = gcs.bucket('my-gcs-bucket')
def migrate_s3_to_gcs(s3_key):
s3_obj = s3.get_object(Bucket='source-s3', Key=s3_key)
content = s3_obj['Body'].read()
blob = gcs_bucket.blob(s3_key)
blob.upload_from_string(content)
print(f"-- Migrated {s3_key}")
# NOTE: Inefficient for objects >250 MB due to memory consumption.
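To actually apply the advertised last-modified filter, wrap the function above in a listing loop. A minimal sketch, with a hypothetical cutoff date and prefix:

```python
# Only migrate objects modified on or after a (hypothetical) cutoff date.
from datetime import datetime, timezone

cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket='source-s3', Prefix='data/'):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:  # LastModified is timezone-aware
            migrate_s3_to_gcs(obj["Key"])
```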
Critically: for multi-GB objects, switch to chunked streaming. Otherwise, a `MemoryError` is inevitable.
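One way to stream, assuming a reasonably recent `google-cloud-storage` (for the `Blob.open()` writer) and `botocore`, is to pipe the S3 response body into a GCS blob writer in fixed-size chunks, so memory stays bounded by the chunk sizes. A sketch reusing the clients defined above:

```python
# Chunked S3 -> GCS streaming; memory bounded by the chunk sizes.
def migrate_s3_to_gcs_streaming(s3_key):
    s3_obj = s3.get_object(Bucket='source-s3', Key=s3_key)
    blob = gcs_bucket.blob(s3_key)
    # chunk_size must be a multiple of 256 KiB for resumable uploads
    with blob.open("wb", chunk_size=32 * 1024 * 1024) as out:
        for chunk in s3_obj['Body'].iter_chunks(chunk_size=8 * 1024 * 1024):
            out.write(chunk)
    print(f"-- Streamed {s3_key}")
```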
Post-Migration: Verification and Tune-Up
- Data Integrity: use `rclone check` or `gsutil hash` to confirm content MD5 across clouds (a scripted spot-check sketch follows this list).
- ACL Normalization: S3 bucket policies do not translate to GCS IAM. Use `gsutil iam set` or the Cloud Console to update permissions.
- Versioning: S3's versioned objects need custom handling. GCS supports object versioning, but it is disabled by default.
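For a scripted spot-check (as opposed to a full `rclone check` run), comparing sizes plus the S3 ETag against the GCS `md5_hash` works for objects that were not multipart-uploaded to S3; multipart ETags are not plain MD5s, so fall back to a size check or a re-hash for those. A sketch with hypothetical bucket names:

```python
# Spot-check one migrated object: size always, MD5 only when the S3 ETag is a plain MD5.
import base64
import boto3
from google.cloud import storage

s3 = boto3.client("s3")
gcs_bucket = storage.Client().bucket("my-gcs-bucket")

def verify(key):
    head = s3.head_object(Bucket="source-s3", Key=key)
    blob = gcs_bucket.get_blob(key)
    assert blob is not None, f"missing in GCS: {key}"
    assert blob.size == head["ContentLength"], f"size mismatch: {key}"

    etag = head["ETag"].strip('"')
    if "-" not in etag:  # multipart ETags contain '-' and are not MD5 digests
        gcs_md5 = base64.b64decode(blob.md5_hash).hex()
        assert gcs_md5 == etag, f"md5 mismatch: {key}"
```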
Known Issue: Object Name Pitfalls
S3 permits characters and key structures (e.g., trailing slashes, non-printable ASCII) that can break GCS workflows. `gsutil` may quietly skip, error on, or rename such objects. Always examine logs for lines like `CommandException: File not found: ...`.
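Before blaming the transfer tool, it can help to pre-scan the source bucket for the keys most likely to cause trouble. A minimal sketch; the character checks are illustrative, not exhaustive:

```python
# Flag S3 keys likely to misbehave in GCS or in gsutil/rclone:
# trailing slashes ("directory" placeholders), control characters, CR/LF.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

suspect = []
for page in paginator.paginate(Bucket="source-bucket"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/") or any(ord(c) < 0x20 for c in key):
            suspect.append(key)

print(f"{len(suspect)} suspect keys")
```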
Quick Summary Table
| Approach | Large Scale | Metadata Retention | Fine Filtering | Notes |
|---|---|---|---|---|
| gsutil | Good | Partial | No | Easiest for basic copy |
| rclone | Great | Good | Limited (filter flags) | Resumes, resilient |
| Python SDK | Poor | Full | Yes | For complex/filtered tasks |
Final Thoughts
No single tool handles every edge case. For bulk, `rclone` usually wins. For ACL or versioning precision, scripting is required. AWS and GCP rate limits will bite; test with a small sample before any significant transfer.
For ongoing bi-directional sync, revisit architectural assumptions. Point-to-point sync isn’t always worth the operational complexity.
Got a region-crossing transfer? Look into provider data-transfer programs or offline appliances such as AWS Snowball or Google's Transfer Appliance if network cost or time constraints are prohibitive.