Transfer Files To Google Cloud Storage

#Cloud #Storage #Data #GCP #GoogleCloud #gsutil

Optimizing Google Cloud Storage File Transfers: Engineer's Guide

Moving terabytes to Google Cloud Storage (GCS) shouldn’t be an afterthought. Inefficient transfers inflate bandwidth bills, increase processing times, and degrade pipeline reliability—especially when dealing with analytics ingest, disaster recovery, or CI/CD artifacts.

Below: field-tested strategies, tool-specific configurations, and operational gotchas for fast, cost-effective GCS uploads.


Upload Inefficiency: Signs and Causes

  • Surging bandwidth costs on the sending side after repeated large dataset uploads.
  • Unpredictable delays syncing artifacts during nightly CI jobs.
  • Increased local CPU/memory usage due to unoptimized transfer tools.
  • Network saturation impacting non-transfer workloads.

The culprits: default settings (often single-threaded), uploading uncompressed data, or poor regional choices.


Tooling: gsutil, Storage Transfer Service, API—What Fits?

Tool | Strengths | Weaknesses | Use Case
gsutil (v5.25+) | Scripting, ad hoc use, composite/parallel uploads | Local compute cost, basic error handling | DevOps automation
Storage Transfer Service | Scalable, scheduled, cross-cloud migration | Higher setup, limited local-filesystem flexibility | Bulk data migration
API / SDK | Full integration, fine-tuned control | More engineering effort, per-call cost | App-level ingest/egress

Reality: large bulk migrations go through Storage Transfer Service; day-to-day DevOps work still leans on gsutil.
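
For the Storage Transfer Service route, a minimal sketch of a bucket-to-bucket job via gcloud; this assumes the Storage Transfer API is enabled in the project and the bucket names are placeholders:

gcloud transfer jobs create gs://legacy-archive gs://archive-prod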


Parallel Composite Uploads: gsutil’s Key Flag

For files over 150 MB (sometimes lower; test against your link characteristics), enable parallel composite uploads:

gsutil -o "GSUtil:parallel_composite_upload_threshold=150M" cp VM-disk.img gs://project-bucket/

Why:

  • Splits the file into components and uploads them in parallel, using all available local CPU and network sockets, then composes them server-side.
  • Often 2-5x faster on 10 Gbps+ pipes.
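
To make the threshold persistent instead of passing -o on every invocation, the same setting can live in the [GSUtil] section of ~/.boto; a minimal sketch, reusing the 150M threshold from the example above:

[GSUtil]
parallel_composite_upload_threshold = 150M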

Known Issue:

  • Downloading composite objects from GCS with certain libraries (notably older Java SDKs) can trigger:
    400 Bad Request: Can not perform get on a composite object.
    
    Always check downstream tooling compatibility. If a consumer can't handle composite objects, fall back to a standard (non-composite) upload for those files by omitting the threshold override.
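
A quick way to confirm whether an object was stored as composite before handing it to downstream consumers: composite objects report a Component-Count field in gsutil stat output (object name taken from the earlier example):

gsutil stat gs://project-bucket/VM-disk.img | grep -i component-count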

Compression: Shrink Before You Send

Raw logs? Application snapshots? Compress before transit. Even fast networks choke on millions of small files.

Efficient workflow:

tar -czf /tmp/logs-$(date +%F).tar.gz /var/log/an-app/
gsutil cp /tmp/logs-*.tar.gz gs://logs-bucket/

  • .tar.gz for hierarchical unstructured data
  • .zip when targeting Windows/unzip interoperability

Trade-off:
Local disk/CPU usage spikes during compression, but ~40–90% network savings for text-heavy datasets.
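
For text assets that will be served straight from GCS rather than unpacked, gsutil can also gzip during upload and store the objects with Content-Encoding: gzip; a sketch, with illustrative paths and extensions:

gsutil -m cp -z js,css,html /var/www/static/* gs://assets-bucket/static/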


Network Bottlenecks & System Tuning

Throughput Suffering?
Check host-side TCP tuning:

  • Set larger buffers:
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    
  • Confirm MTU: mismatched packet sizes cause retransmissions (a quick path check follows this list).
  • Watch for lurking IDS/IPS, NAT, or limited corporate proxies: silent throttling is common.
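
A path-MTU sanity check from the upload host, assuming a standard 1500-byte MTU (1472 bytes of ICMP payload plus 28 bytes of headers); substitute any reachable endpoint:

ping -M do -s 1472 -c 4 storage.googleapis.com

If the path can't carry full-size frames unfragmented, the command fails instead of silently fragmenting, which points at an MTU mismatch.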

Quota Pain:
Hitting 429s (Too Many Requests) or GCS-side throttling?

  • Review GCS rate limits.
  • Split large uploads across buckets/projects when hitting hard maximums.
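
gsutil already retries throttled requests with exponential backoff; if 429s persist, the retry budget can be widened in the [Boto] section of ~/.boto. A sketch; the values are illustrative, not recommendations:

[Boto]
num_retries = 10
max_retry_delay = 64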

Bucket Placement: Data Gravity in Action

Don’t ignore physical locality:

  • Choose a specific region (europe-west3 for Frankfurt, not the default US multi-region); see the creation example below.
  • Reduces round-trip times, avoids transcontinental egress fees.
  • Combine with VPC Peering or Private Google Access for secure, direct paths within GCP.
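
A minimal creation sketch for a region-pinned bucket (bucket name is a placeholder):

gsutil mb -c standard -l europe-west3 gs://archive-prod-eu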

Gotcha: Changing a bucket’s region post-creation is non-trivial—requires migration and ACL remapping.


Schedule Intelligently

ISP links and VPN tunnels aren’t equally loaded (or billed) at all hours. When possible, shift heavy transfers to the 01:00–05:00 local-time window.

Linux cron automation (example):

0 2 * * * /usr/local/bin/upload-logs.sh >> /var/log/gcs-upload.log 2>&1

Pair with gsutil’s -m flag for multi-threaded transfers:

gsutil -m cp /data/exports/*.csv gs://data-dump/

Automation: Robust Bash Upload Script

Handles both local compression and upload. Trigger it manually or from cron:

#!/bin/bash
set -e
BUCKET="gs://archive-prod"
SRC="/mnt/vol/backups"
TGT="daily-backup-$(date +%F).tar.gz"

# Compress the source directory into a dated archive
tar -czf "$TGT" --directory="$SRC" . || { echo 'tar failed'; exit 1; }

# Parallel composite upload with gsutil
gsutil -o "GSUtil:parallel_composite_upload_threshold=100M" cp "$TGT" "$BUCKET/" || {
    echo "Upload failed: $(date)" >> /var/log/gcs-upload-errors.log
    exit 2
}

# Optional: record md5sum for audit logging
md5sum "$TGT" >> /var/log/gcs-upload-audit.log

rm "$TGT"

Tip:
Add trap handlers for signal cleanup (SIGINT, SIGTERM); a minimal sketch follows.
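
This sketch reuses the $TGT variable from the script above so an interrupted run doesn't leave a partial archive behind:

trap 'rm -f "$TGT"' INT TERM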


Monitoring, Error Handling, and Cost Tracking

Leverage GCP’s built-in metrics:

  • Cloud Monitoring:
    Create dashboards on the storage.googleapis.com metrics (for example, api/request_count and network/received_bytes_count).
  • Cloud Logging:
    Regex scan for "error": patterns in transfer logs.
  • Budgets & Alerts:
    Automate spending alerts on GCS usage (Billing → Budgets & alerts).

Non-obvious: gsutil cp exits with 1 if any file fails, but doesn't tell you which ones; parse stderr or use cp's -L manifest flag for an upload audit trail.
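
A sketch of the manifest approach: cp -L writes a CSV with a per-file Result column, so failures can be surfaced after the fact (paths are placeholders):

gsutil -m cp -L /var/log/gcs-manifest.csv /data/exports/*.csv gs://data-dump/
grep -v ',OK,' /var/log/gcs-manifest.csv

The second command prints the CSV header plus any rows whose Result column isn't OK.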


Conclusion

Raw bandwidth and GCS’s guarantees don’t substitute for engineering discipline:

  • Compress first, upload in parallel, batch where possible.
  • Co-locate buckets, automate for repeatability.
  • Monitor, alert, and optimize with real-world data.

There’s always another edge case. Multi-GB files? Test chunk size. Tens of millions of small files? Prefer storage transfer jobs over recursive cp. Not perfect, but good enough to keep pipelines humming, costs sane, and troubleshooting to a minimum.