S3 To Redshift

Reading time: 1 min
#Cloud #Data #AWS #Redshift #S3 #DataPipeline #Parquet #Compression #ETL

Optimizing S3 to Redshift Data Loads: Engineering Beyond COPY

Redshift’s COPY command is the entry point, not the finish line. Actual performance hinges on file format, compression, file sizing, and thorough monitoring—critical for pipelines handling terabytes daily and budgets that can’t spike without warning.


Parquet and ORC: Always Prefer Columnar

Loading raw CSV or JSON from S3 into Redshift isn’t just slow; it wastes both compute and storage. For both throughput and cost, use columnar formats. Experience shows:

  • Parquet (Apache, v1.12+) and ORC enable column pruning, vectorized reads, and effective schema handling.
  • Built-in compression schemes slash file sizes by up to 10x compared to uncompressed CSV.
  • Migrating from CSV to Parquet reduces load time by 60–80%, especially for wide tables (hundreds of columns).

Conversion can be done at ETL time. For Hadoop/Spark/Glue pipelines:

# PySpark 3.3.0, AWS Glue job snippet
# Read the raw CSV (inferring column types so the Parquet isn't all strings),
# then rewrite it as Snappy-compressed Parquet
df = spark.read.csv("s3://bucket/raw/data.csv", header=True, inferSchema=True)
df.write.parquet("s3://bucket/refined/data/", mode="overwrite", compression="snappy")

Redshift COPY usage:

COPY analytics.orders
FROM 's3://bucket/refined/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RS-LOAD'
FORMAT AS PARQUET;

Note: Redshift’s support for Parquet is robust as of RA3 node types (>=1.0.37251), but always test with your schema first.


File Size and Count: Don’t Starve Slices

Redshift parallelizes loads by assigning files to slices. Too few files choke concurrency; too many create S3 throttling and high planning overhead. The “Goldilocks” range is:

Per-file size: 100MB–1GB
Target file count: 10–20× the number of slices

If you run a 4-node ra3.16xlarge cluster (16 slices per node, 64 in total), aim for roughly 640 files at ~250MB each. Miss this and your COPY can bottleneck on just a handful of slices, dragging throughput down.

Practical pain point: When data engineers push a single 30GB Parquet file, COPY uses just 1/64th of total capacity; disk usage spikes on that node, the cache churns, and the load stalls.
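
A rough sketch of applying this rule inside the Glue job from earlier; the slice count, data volume, and output path below are illustrative assumptions, not measured values:

# Rough sketch: pick an output file count so each Parquet file lands near the
# target size; for this cluster it also satisfies the 10-20x-slices guideline.
TOTAL_SLICES = 64          # e.g. 4-node ra3.16xlarge (assumed)
TARGET_FILE_MB = 250       # inside the 100MB-1GB sweet spot
input_size_mb = 160_000    # ~160GB of refined data (assumed)

num_files = max(TOTAL_SLICES, input_size_mb // TARGET_FILE_MB)  # 640 here

# Repartition before writing so S3 ends up with num_files similarly sized objects
df.repartition(num_files).write.parquet(
    "s3://bucket/refined/data/", mode="overwrite", compression="snappy"
)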


Compression: Gzip, Snappy, Zstd (But Not All Equal)

CSV/JSON: Gzip is almost always best—decent compression, fast decompression, and full support in COPY.

Parquet: Prefer Snappy or Zstd. Snappy is lightweight (great for analytics, almost no CPU penalty); Zstd (supported since Redshift 1.0.35791+) favors cold storage due to higher compression ratios at some decompression cost.
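
If you convert with the Glue job above, switching codecs is a one-line change. A minimal sketch, assuming a Spark build with zstd Parquet support (Glue 3.0+ / Spark 3.2+) and a hypothetical archive path:

# Same write as before, but Zstandard for colder, rarely scanned partitions
df.write.parquet("s3://bucket/archive/data/", mode="overwrite", compression="zstd")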

-- FROM takes a key prefix, not a wildcard: every object whose key starts
-- with 'events-' under this path gets loaded
COPY sales.raw_events
FROM 's3://bucket/events/2024/05/11/events-'
IAM_ROLE 'arn:aws:iam::123456789012:role/RS-LOAD'
FORMAT AS PARQUET;

Known issue: Bzip2 is slower and rarely justified unless maximizing space is critical. Also, avoid mixing CSV compression codecs for a single load; COPY only supports one codec flag per execution.


Manifests: Control, Idempotency, and Cost

Ad-hoc file ingestion risks duplicates (retries, mis-named files) and partial loads. Manifests enumerate S3 objects to load—useful for auditability and explicitness.

Sample manifest structure:

{
  "entries": [
    { "url":"s3://bucket/refined/data/file-01.snappy.parquet", "mandatory":true },
    { "url":"s3://bucket/refined/data/file-02.snappy.parquet", "mandatory":true }
  ]
}

COPY user_sessions
FROM 's3://bucket/refined/data/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/RS-LOAD'
FORMAT AS PARQUET
MANIFEST;

Tip: Manifests allow selective reprocessing. Combine with atomic S3 directory renames (or Glue catalog versions) for rock-solid pipelines.
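
A minimal sketch of generating that manifest with boto3; bucket and prefix names are placeholders:

import boto3
import json

s3 = boto3.client("s3")
bucket, prefix = "bucket", "refined/data/"   # placeholders

# List every Parquet object under the prefix; mandatory=true makes COPY fail
# loudly if a listed file has gone missing instead of loading partially.
paginator = s3.get_paginator("list_objects_v2")
entries = [
    {"url": f"s3://{bucket}/{obj['Key']}", "mandatory": True}
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
    for obj in page.get("Contents", [])
    if obj["Key"].endswith(".parquet")
]

s3.put_object(
    Bucket=bucket,
    Key=f"{prefix}manifest.json",
    Body=json.dumps({"entries": entries}, indent=2).encode("utf-8"),
)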


Parallelization: Exploit Slices and Node Types

A common misconfiguration is underutilizing node slices. Each node exposes multiple slices, and each slice can ingest a file in parallel. Example: an 8-node ds2.xlarge cluster (2 slices per node) = 16 slices.

Optimal: Number of files loaded ≈ 10–20× total slices. But don't over-provision to thousands of tiny files—see: “Too many S3 requests” in CloudTrail logs.

Side note: Redshift Spectrum (external tables on S3) can offload massive ad hoc queries, but isn’t suited for high-frequency inserts; ideal for append-only, batch-scanned workloads.


Monitoring: STL/SVL Views and Real-World Diagnostics

After COPY, check system views for anomalies:

SELECT filename, line_number, colname, type, raw_line
  FROM stl_load_errors
 WHERE starttime > SYSDATE - INTERVAL '1 hour';

Look for error signatures:

Malformed line: unexpected character found in column "event_time"

For performance profiling:

-- Files loaded per slice for the most recent COPY in this session
SELECT slice, COUNT(*) AS files_loaded
  FROM stl_load_commits
 WHERE query = pg_last_copy_id()
 GROUP BY slice
 ORDER BY slice;

Watch for a wide spread in files_loaded across slices; that means load skew, so revisit file sizing and count.


Automate ETL and COPY: Glue, Lambda, Data API

  • Schedule AWS Glue ETL jobs to preprocess, partition, and convert source data.
  • Use AWS Lambda to trigger Redshift COPY via the Data API (v1.0.15+); a minimal sketch follows after this list.
  • Metrics and errors should push to CloudWatch; failed jobs optionally send notifications (SNS, PagerDuty integration).
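
A minimal Lambda sketch using the Data API via boto3; the cluster, database, and user identifiers are placeholders, not values from this pipeline:

# Python 3.11 Lambda handler (boto3 ships with the Lambda runtime)
import boto3

redshift_data = boto3.client("redshift-data")

COPY_SQL = """
    COPY analytics.orders
    FROM 's3://bucket/refined/data/manifest.json'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RS-LOAD'
    FORMAT AS PARQUET
    MANIFEST;
"""

def handler(event, context):
    # execute_statement is asynchronous: it queues the COPY and returns at once
    response = redshift_data.execute_statement(
        ClusterIdentifier="analytics-cluster",  # placeholder
        Database="analytics",                   # placeholder
        DbUser="etl_loader",                    # placeholder
        Sql=COPY_SQL,
        StatementName="s3-to-redshift-copy",
        WithEvent=True,  # emit an EventBridge event when the statement finishes
    )
    return {"statement_id": response["Id"]}

Because execute_statement is asynchronous, poll describe_statement (or subscribe to the EventBridge completion event) before kicking off downstream steps.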

Summary Checklist

  • Prefer columnar formats (Parquet, ORC) for all but transient loads.
  • Compress using Snappy (Parquet) or Gzip (CSV); avoid Bzip2 unless mandatory.
  • File size: 100MB–1GB; file count: 10–20× cluster slice count.
  • Use manifest files for determinism and safe retries.
  • Exploit parallelism—don’t leave slices idle.
  • Monitor with STL/SVL logs, automate anomaly detection.
  • Redshift Spectrum is viable for cold/archival data—not frequent, incremental writes.
  • Always test COPY at target scale; edge cases (quoting, nulls, type inference) still crop up post-migration.

Moving petabyte-scale datasets while trimming a daily load from 55 minutes to 9 is possible, but only by addressing every layer from S3 object layout to cluster slice utilization. COPY by itself is just orchestration; the real levers are in the engineering details above.

No approach is flawless; continual load monitoring and periodic pipeline rebalancing are part of Redshift ops. Other shops may patch with more nodes—costly, and rarely the real fix.

If you uncover edge-case COPY bugs, or have benchmarks showing different trade-offs, drop a line or compare logs. There’s always a trick or two waiting in the STL tables.