Upload To Google Cloud

Reading time: 1 min
#Cloud #Storage #Python #GoogleCloud #ParallelUploads #GCS

Efficient File Uploads to Google Cloud Storage: Parallel Composite Objects in Practice

Data engineering pipelines frequently choke on single-threaded uploads of large datasets. Standard linear uploads to Google Cloud Storage (GCS) don’t scale—particularly with files north of several gigabytes. Slow links, random disconnects, and monotonous progress bars. There’s a better way.


The Bottleneck: Single-Session Uploads

Traditional blob.upload_from_filename() pushes the entire file over a single sequential connection. On anything bigger than a few hundred megabytes, these issues emerge:

  • Latency sensitivity. Long uploads magnify every network hiccup.
  • Retry overhead. Mid-transfer failure? Start the full upload again—or implement a fragile resume strategy.
  • Bandwidth underutilization. Single connection ≠ full link saturation.

Experienced teams avoid this pattern with composite object workflows.
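
For reference, the linear pattern is a single call. A minimal sketch, reusing the bucket and file names from the example further down purely for illustration:

from google.cloud import storage

# Single-session upload: one sequential stream from start to finish.
client = storage.Client()
bucket = client.bucket("my-prod-bucket")
blob = bucket.blob("archive/bigdata-2024-06.zip")
blob.upload_from_filename("/mnt/data/bigdata-2024-06.zip")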


The Approach: Parallel Composite Objects

GCS exposes a compose API, allowing you to splice together previously uploaded "component" objects. The pattern:

  1. Split a file locally (e.g., 100 MiB per part; tune as needed).
  2. Upload all parts in parallel as temporary objects. Temporary means naming with UUIDs or a prefix you’ll clean up.
  3. Compose those parts into a final GCS object via a single API call.
  4. Delete temporary component objects immediately post-compose to avoid extra cost.
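
Stripped to its essentials, steps 3 and 4 come down to a single Blob.compose() call. A minimal sketch with placeholder bucket and object names; the full workflow follows below:

from google.cloud import storage

# Assumes the component objects have already been uploaded (step 2).
client = storage.Client()
bucket = client.bucket("my-bucket")
parts = [bucket.blob(f"staging/archive.zip.part{i}") for i in range(4)]

final = bucket.blob("archive.zip")
final.compose(parts)   # server-side concatenation; nothing is re-transferred
for part in parts:
    part.delete()      # step 4: drop the temporary components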

This approach offers three tangible benefits:

Requirement          Linear Upload   PCO Workflow
Network resilience   Low             High (small retries)
Throughput           Low             High (parallelism)
Storage overhead     None            Moderate (transient)

Example: Python Upload with Parallel Composite Objects (google-cloud-storage >=2.6.0)

Scenario: uploading a 2.5 GB ZIP file to a bucket over a flaky WiFi link, from a host with 8 CPU cores.

import os
import concurrent.futures
from uuid import uuid4
from google.cloud import storage

BUCKET_NAME = "my-prod-bucket"
CHUNK_SIZE = 100 * 1024 * 1024  # 100 MiB
CLIENT = storage.Client()
BUCKET = CLIENT.bucket(BUCKET_NAME)

def _upload_chunk(local_path, remote_name):
    blob = BUCKET.blob(remote_name)
    try:
        blob.upload_from_filename(local_path)
    except Exception as e:
        print(f"[WARN] Failed uploading {remote_name}: {e}")
        raise
    return remote_name

def parallel_upload_and_compose(file_path, dest_blob_name):
    part_names = []
    chunk_files = []
    # Step 1: Split file, write local parts
    with open(file_path, 'rb') as src:
        idx = 0
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            chunk_path = f"/tmp/gcs_part_{uuid4().hex}_{idx}"
            with open(chunk_path, "wb") as outf:
                outf.write(chunk)
            part_name = f"{dest_blob_name}_part_{uuid4().hex}_{idx}"
            part_names.append(part_name)
            chunk_files.append(chunk_path)
            idx += 1
    if idx == 0:
        raise RuntimeError("Zero-sized files not supported")

    # Step 2: Upload parts in parallel; re-raise the first failure, if any
    with concurrent.futures.ThreadPoolExecutor(max_workers=min(8, idx)) as pool:
        futures = [pool.submit(_upload_chunk, p, n) for p, n in zip(chunk_files, part_names)]
        for future in concurrent.futures.as_completed(futures):
            future.result()  # wait() would swallow upload exceptions; result() re-raises them

    # Step 3: Compose in stages if >32 parts (a single compose call accepts at most 32 sources)
    temp_names = list(part_names)  # track every temporary object for cleanup later
    blobs = [BUCKET.blob(n) for n in part_names]
    while len(blobs) > 32:
        grouped = []
        for i in range(0, len(blobs), 32):
            group = blobs[i:i+32]
            intermediate_name = f"{dest_blob_name}_compose_{uuid4().hex}_{i}"
            intermediate_blob = BUCKET.blob(intermediate_name)
            intermediate_blob.compose(group)
            grouped.append(intermediate_blob)
            temp_names.append(intermediate_name)  # intermediates need cleanup too
        blobs = grouped

    # Step 4: Final compose (at most 32 blobs remain)
    dest_blob = BUCKET.blob(dest_blob_name)
    dest_blob.compose(blobs)

    # Step 5: Clean up local chunks and all remote temporaries (parts and intermediates)
    for fp in chunk_files:
        try:
            os.remove(fp)
        except FileNotFoundError:
            pass

    for n in temp_names:
        try:
            BUCKET.blob(n).delete()
        except Exception:
            pass  # typically NotFound if already removed

    print(f"Upload complete: gs://{BUCKET_NAME}/{dest_blob_name}")

if __name__ == "__main__":
    # Real-world usage: adjust accordingly.
    parallel_upload_and_compose("/mnt/data/bigdata-2024-06.zip", "archive/bigdata-2024-06.zip")

Known issue: if a compose call fails with 400: too many component objects, check len(blobs) and fall back to staged composition as in the loop above; never compose more than 32 objects at once. That limit holds as of GCS API v1, June 2024.


Observations from Live Deployments

  • Component objects are billed as individual objects until deletion. Compose, then delete—script both.
  • Failure recovery: only failed components need a re-upload, unless composition itself fails mid-flight.
  • Tuning CHUNK_SIZE affects both throughput and memory footprint on the upload host. For egress-limited links, test several values.
  • Filesystem space for split chunks can become a bottleneck when many large files are transferred in a burst. Consider the tempfile module with automatic cleanup (see the sketch after this list).
  • For CLI use:
    gsutil -o "GSUtil:parallel_composite_upload_threshold=150M" cp largefile gs://bucket/target

    This enables PCO for files larger than 150 MiB; the component (chunk) size is controlled separately via the GSUtil:parallel_composite_upload_component_size option.
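
One way to get that automatic cleanup is to split into a tempfile.TemporaryDirectory, which removes the chunks even if an upload raises. A sketch under that assumption; the helper name and callback are illustrative, not part of the example above:

import os
import tempfile

CHUNK_SIZE = 100 * 1024 * 1024  # 100 MiB, as in the example above

def split_and_handle(file_path, handle_chunk):
    # Split file_path into CHUNK_SIZE pieces inside a temporary directory and
    # call handle_chunk(chunk_path, index) for each piece (e.g. an upload).
    # The directory, and every part in it, is removed automatically on exit,
    # even if handle_chunk raises.
    with tempfile.TemporaryDirectory(prefix="gcs_parts_") as tmpdir:
        with open(file_path, "rb") as src:
            idx = 0
            while True:
                chunk = src.read(CHUNK_SIZE)
                if not chunk:
                    break
                chunk_path = os.path.join(tmpdir, f"part_{idx:05d}")
                with open(chunk_path, "wb") as outf:
                    outf.write(chunk)
                handle_chunk(chunk_path, idx)
                idx += 1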

Alternate Approaches and Considerations

  • Resumable uploads (in the Python client, enabled by setting blob.chunk_size before calling upload_from_file) can be robust for single files, but a single stream rarely saturates a fast link, no matter how many cores are available.
  • PCO requires the storage.objects.create permission on the destination bucket; add storage.objects.delete if you overwrite existing objects or clean up the temporary components.
  • Streaming uploads using Python generators on-the-fly split can reduce disk I/O, but complicate error recovery.
  • If strong object integrity is critical, validate with CRC32C after compose; composite objects carry a CRC32C checksum but no MD5 hash (a minimal check is sketched below). Occasionally observed: mismatch on spotty WiFi with certain gsutil versions (<5.13).
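
A minimal post-compose check, assuming the google-crc32c package is available; blob.crc32c is the base64-encoded CRC32C that GCS records for every object, including composites:

import base64
import google_crc32c
from google.cloud import storage

def crc32c_matches(local_path, bucket_name, blob_name):
    # Compute the CRC32C of the local file in 8 MiB blocks.
    checksum = google_crc32c.Checksum()
    with open(local_path, "rb") as f:
        for block in iter(lambda: f.read(8 * 1024 * 1024), b""):
            checksum.update(block)
    local_crc = base64.b64encode(checksum.digest()).decode("ascii")

    # Compare against the checksum GCS stores on the composed object.
    blob = storage.Client().bucket(bucket_name).get_blob(blob_name)
    return blob is not None and blob.crc32c == local_crc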

Field note: Some storage monitoring tools lag on inventory updates after batch compose+delete cycles. Don’t trust gsutil ls output until eventual consistency catches up—usually within seconds, but not always.