Efficient File Uploads to Google Cloud Storage: Parallel Composite Objects in Practice
Data engineering pipelines frequently choke on single-threaded uploads of large datasets. Standard linear uploads to Google Cloud Storage (GCS) don't scale well, particularly with files north of several gigabytes: slow links, random disconnects, and monotonous progress bars. There's a better way.
The Bottleneck: Single-Session Uploads
The traditional blob.upload_from_filename() call pushes the entire file through a single sequential upload session. On anything bigger than a few hundred megabytes, three issues emerge:
- Latency sensitivity. Long uploads magnify every network hiccup.
- Retry overhead. Mid-transfer failure? Start the full upload again—or implement a fragile resume strategy.
- Bandwidth underutilization. Single connection ≠ full link saturation.
Experienced teams avoid this pattern with composite object workflows.
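For reference, the linear baseline being replaced boils down to a single call; this minimal sketch reuses the bucket and file names from the full example later in this post:

from google.cloud import storage

# Single-session baseline: the whole file moves through one sequential transfer.
client = storage.Client()
bucket = client.bucket("my-prod-bucket")
blob = bucket.blob("archive/bigdata-2024-06.zip")
blob.upload_from_filename("/mnt/data/bigdata-2024-06.zip")  # blocks until it finishes, or raises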
The Approach: Parallel Composite Objects
GCS exposes a compose API that stitches previously uploaded "component" objects into a single object, up to 32 components per call. The pattern:
- Split a file locally (e.g., 100 MiB per part; tune as needed).
- Upload all parts in parallel as temporary objects. Temporary means naming with UUIDs or a prefix you’ll clean up.
- Compose those parts into a final GCS object via a single API call (a minimal sketch follows this list).
- Delete temporary component objects immediately post-compose to avoid extra cost.
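Before the full example, here is a minimal sketch of the core compose call, assuming two component objects have already been uploaded; the object names are illustrative:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-prod-bucket")

# Components must already exist in the bucket; compose accepts at most 32 sources per call.
parts = [bucket.blob("tmp/part_0"), bucket.blob("tmp/part_1")]
final = bucket.blob("archive/final.bin")
final.compose(parts)   # server-side concatenation; no bytes are re-uploaded
for part in parts:
    part.delete()      # components keep accruing storage cost until deleted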
This approach trades modest, transient storage overhead for resilience and throughput:

| Property | Linear Upload | PCO Workflow |
| --- | --- | --- |
| Network resilience | Low | High (small retries) |
| Throughput | Low | High (parallelism) |
| Storage overhead | None | Moderate (transient) |
Example: Python Upload with Parallel Composite Objects (google-cloud-storage >=2.6.0)
Scenario: uploading a 2.5 GB ZIP file to a bucket over a flaky Wi-Fi link, from a host with 8 CPU cores.
import os
import concurrent.futures
from uuid import uuid4

from google.cloud import storage

BUCKET_NAME = "my-prod-bucket"
CHUNK_SIZE = 100 * 1024 * 1024  # 100 MiB
CLIENT = storage.Client()
BUCKET = CLIENT.bucket(BUCKET_NAME)

def _upload_chunk(local_path, remote_name):
    """Upload one local chunk file as a temporary GCS object."""
    blob = BUCKET.blob(remote_name)
    try:
        blob.upload_from_filename(local_path)
    except Exception as e:
        print(f"[WARN] Failed uploading {remote_name}: {e}")
        raise
    return remote_name

def parallel_upload_and_compose(file_path, dest_blob_name):
    part_names = []
    chunk_files = []
    temp_names = []  # every temporary GCS object (parts and intermediates) to delete at the end

    # Step 1: Split the file into local parts
    with open(file_path, "rb") as src:
        idx = 0
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            chunk_path = f"/tmp/gcs_part_{uuid4().hex}_{idx}"
            with open(chunk_path, "wb") as outf:
                outf.write(chunk)
            part_name = f"{dest_blob_name}_part_{uuid4().hex}_{idx}"
            part_names.append(part_name)
            chunk_files.append(chunk_path)
            idx += 1
    if idx == 0:
        raise RuntimeError("Zero-sized files not supported")
    temp_names.extend(part_names)

    # Step 2: Upload parts in parallel; re-raise the first failure instead of ignoring it
    with concurrent.futures.ThreadPoolExecutor(max_workers=min(8, idx)) as pool:
        futures = [pool.submit(_upload_chunk, p, n) for p, n in zip(chunk_files, part_names)]
        for future in concurrent.futures.as_completed(futures):
            future.result()

    # Step 3: Compose in stages while more than 32 components remain
    blobs = [BUCKET.blob(n) for n in part_names]
    while len(blobs) > 32:
        grouped = []
        for i in range(0, len(blobs), 32):
            group = blobs[i:i + 32]
            intermediate_name = f"{dest_blob_name}_compose_{uuid4().hex}_{i}"
            intermediate_blob = BUCKET.blob(intermediate_name)
            intermediate_blob.compose(group)
            grouped.append(intermediate_blob)
            temp_names.append(intermediate_name)  # intermediates need cleanup, too
        blobs = grouped

    # Step 4: Final compose (at most 32 components left)
    dest_blob = BUCKET.blob(dest_blob_name)
    dest_blob.compose(blobs)

    # Step 5: Clean up local chunk files and all temporary GCS objects
    for fp in chunk_files:
        try:
            os.remove(fp)
        except FileNotFoundError:
            pass
    for n in temp_names:
        try:
            BUCKET.blob(n).delete()
        except Exception:
            pass  # typically NotFound
    print(f"Upload complete: gs://{BUCKET_NAME}/{dest_blob_name}")

if __name__ == "__main__":
    # Real-world usage: adjust paths and names accordingly.
    parallel_upload_and_compose("/mnt/data/bigdata-2024-06.zip", "archive/bigdata-2024-06.zip")
Known issue: if a compose call fails with 400: too many component objects, check len(blobs) and fall back to staged composition as in the loop above; a reusable helper is sketched below. Never pass more than 32 sources to a single compose call. This limit is as of the GCS JSON API v1, June 2024.
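A reusable version of that staged composition, as a sketch; it assumes the component objects already exist in the bucket, and the helper and variable names are mine rather than anything from the client library:

from uuid import uuid4

def compose_in_batches(bucket, part_names, dest_blob_name):
    """Compose any number of existing components into dest_blob_name,
    never passing more than 32 sources to a single compose call."""
    blobs = [bucket.blob(n) for n in part_names]
    intermediates = []
    while len(blobs) > 32:
        next_level = []
        for i in range(0, len(blobs), 32):
            staged = bucket.blob(f"{dest_blob_name}_stage_{uuid4().hex}_{i}")
            staged.compose(blobs[i:i + 32])   # collapse one group of <=32 components
            next_level.append(staged)
            intermediates.append(staged)
        blobs = next_level
    dest = bucket.blob(dest_blob_name)
    dest.compose(blobs)                       # final compose, <=32 sources
    for staged in intermediates:              # intermediates are temporary; delete after the final compose
        staged.delete()
    return dest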
Observations from Live Deployments
- Component objects are billed as individual objects until deletion. Compose, then delete; script both.
- Failure recovery: only faulty components need a re-upload, unless composition itself fails mid-flight.
- Tuning CHUNK_SIZE affects both throughput and memory footprint on the upload host. For egress-limited links, test several values.
- Filesystem space for split chunks can become a bottleneck when many large files are transferred in a burst. Consider the tempfile module with automatic cleanup (see the sketch after this list).
- For CLI use: gsutil -o "GSUtil:parallel_composite_upload_threshold=150M" cp largefile gs://bucket/target. This auto-enables PCO for files >150 MiB; chunk size defaults to 150 MiB here.
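A sketch of the tempfile-based splitting suggested above; TemporaryDirectory removes the chunk files even if a later upload step raises (the helper name is illustrative):

import tempfile
from pathlib import Path

CHUNK_SIZE = 100 * 1024 * 1024  # 100 MiB, matching the example above

def split_into_tempdir(file_path):
    """Split file_path into CHUNK_SIZE pieces inside a TemporaryDirectory.
    Keep the returned tmpdir object alive until uploads finish; its cleanup()
    (or context exit / garbage collection) removes all chunk files."""
    tmpdir = tempfile.TemporaryDirectory(prefix="gcs_parts_")
    chunk_paths = []
    with open(file_path, "rb") as src:
        idx = 0
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            path = Path(tmpdir.name) / f"part_{idx:05d}"
            path.write_bytes(chunk)
            chunk_paths.append(str(path))
            idx += 1
    return tmpdir, chunk_paths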
Alternate Approaches and Considerations
- Resumable uploads (the client library switches to a resumable session automatically for large files, or when blob.chunk_size is set) are robust for single files, but a single stream rarely saturates the link on multi-core hosts.
- PCO needs permission to create objects in the bucket (storage.objects.create covers both the component uploads and the compose destination) and to delete the temporary components afterward (storage.objects.delete).
- Streaming uploads that split the file on the fly with Python generators can reduce disk I/O, but they complicate error recovery.
- If strong object integrity is critical, validate with CRC32C (or a SHA-256 you compute yourself) after compose; a verification sketch follows this list. Mismatches have occasionally been observed on spotty Wi-Fi with certain gsutil versions (<5.13).
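A minimal verification sketch, assuming the google-crc32c package is installed; it compares a locally computed CRC32C against the checksum GCS reports for the composed object (composite objects carry a crc32c value but no MD5):

import base64

import google_crc32c
from google.cloud import storage

def verify_crc32c(local_path, bucket_name, blob_name):
    """Raise if the local file's CRC32C does not match the composed object's."""
    checksum = google_crc32c.Checksum()
    with open(local_path, "rb") as f:
        for block in iter(lambda: f.read(8 * 1024 * 1024), b""):
            checksum.update(block)
    local_b64 = base64.b64encode(checksum.digest()).decode("ascii")

    blob = storage.Client().bucket(bucket_name).get_blob(blob_name)
    if blob is None or blob.crc32c != local_b64:
        remote = getattr(blob, "crc32c", None)
        raise ValueError(f"CRC32C mismatch for {blob_name}: local {local_b64}, remote {remote}")
    return True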
Field note: some storage monitoring tools lag on inventory updates after batch compose+delete cycles. Don't trust gsutil ls output until eventual consistency catches up: usually within seconds, but not always.