S3 To Redshift

#Cloud #Data #AWS #Redshift #S3 #DataPipeline #Parquet #Compression #ETL

Optimizing Data Transfers from S3 to Redshift: Beyond the Basics

Most guides stop at the COPY command, but optimizing your S3 to Redshift data pipeline requires mastering file formats, compression, and parallel loading strategies—here's what no one tells you about squeezing every millisecond and dollar from your transfers.

Efficiently moving data from Amazon S3 into Amazon Redshift is the backbone of many modern data architectures. Businesses rely on this pipeline for near-real-time analytics and cost-effective warehousing, but the default COPY command with out-of-the-box settings can leave significant performance gains and cost savings on the table. In this post, I’ll share practical, battle-tested tips to optimize your S3 to Redshift data transfers and get the most bang for your buck.


1. Choose the Right File Format: Parquet and ORC Over CSV/JSON

The most common beginner mistake is loading plain CSV or JSON files. While simple to generate, these formats are inefficient to parse and load.

Why?

  • Columnar formats like Parquet and ORC store data by column, which Redshift, itself a columnar store, can ingest more efficiently.
  • They embed schema information, reducing guesswork and mistakes during COPY.
  • Smaller file sizes due to built-in compression reduce S3 retrieval time and storage cost.

Example

Convert your data to Parquet before upload:

# Using Apache Spark (e.g. in an AWS Glue job) with an active SparkSession `spark`,
# read the raw CSV (with headers) and rewrite it as Parquet
spark.read.csv('s3://your-bucket/raw/data.csv', header=True, inferSchema=True) \
    .write.mode('overwrite').parquet('s3://your-bucket/optimized/parquet/')

Then load it with COPY specifying the format:

COPY your_table
FROM 's3://your-bucket/optimized/parquet/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;

2. Leverage Compression Wisely

Compressed files reduce network I/O and S3 storage costs, but there is a balance between compression ratio and CPU overhead during decompression.

Recommended compressions:

  • For CSV/JSON: Gzip or Bzip2; Gzip compresses less aggressively but decompresses much faster.
  • For Parquet/ORC: a codec like Snappy (lightweight and fast) or Zstd (higher compression at a moderate CPU cost).

Example COPY with Gzip CSV:

COPY your_table
FROM 's3://your-bucket/raw/data.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
GZIP;

Generally, use columnar formats with native compression instead of compressed CSVs when possible.
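
If you control the write side, you can choose the codec explicitly when producing the Parquet files. A minimal PySpark sketch, assuming an active SparkSession named spark and placeholder paths (Zstd support for Parquet requires a reasonably recent Spark build):

# Read the raw CSV and write Parquet with an explicit compression codec.
# 'snappy' is Spark's default for Parquet; 'zstd' trades a little CPU for smaller files.
df = spark.read.csv('s3://your-bucket/raw/data.csv', header=True, inferSchema=True)

df.write \
    .option('compression', 'zstd') \
    .mode('overwrite') \
    .parquet('s3://your-bucket/optimized/parquet-zstd/')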


3. Optimize File Size and Number of Files: The Goldilocks Zone

Too many tiny files mean overhead due to job startup and manifest processing. Too few huge files reduce parallelism and can cause memory issues on Redshift clusters.

Best practices:

  • Aim for files between 100 MB and 1 GB.
  • Distribute the data evenly across files to maximize parallel COPY operations.

If you write your own ETL jobs, partition and size your output files accordingly, as in the sketch below.
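
Here is a rough PySpark sketch of one way to land in that range, again assuming an active SparkSession named spark; the target size, average row-size estimate, and paths are assumptions to tune for your own data:

# Rough sizing: estimate how many output files land near the target size,
# then repartition before writing so the files come out evenly sized.
TARGET_FILE_BYTES = 256 * 1024 * 1024   # aim for roughly 256 MB per file
APPROX_ROW_BYTES = 200                  # rough average row size estimate (assumption)

df = spark.read.parquet('s3://your-bucket/optimized/parquet/')
num_files = max(1, int(df.count() * APPROX_ROW_BYTES / TARGET_FILE_BYTES))

df.repartition(num_files) \
    .write.mode('overwrite') \
    .parquet('s3://your-bucket/optimized/sized/')

The extra df.count() pass is the price of the estimate; if your upstream job already knows the row count or raw byte size, reuse that instead.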


4. Use Manifest Files for Controlled, Reliable Loads

Manifest files explicitly list the files to load, so Redshift knows exactly what to ingest and you avoid accidental partial loads or duplicate files.

How to create a manifest:

A JSON manifest file looks like this:

{
  "entries": [
    {"url":"s3://your-bucket/optimized/file1.parquet","mandatory":true},
    {"url":"s3://your-bucket/optimized/file2.parquet","mandatory":true}
  ]
}

Load with the COPY command pointing to the manifest:

COPY your_table
FROM 's3://your-bucket/optimized/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET
MANIFEST;
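
Hand-writing the manifest gets old quickly, so most pipelines generate it from an S3 listing. A minimal sketch using boto3; the bucket, prefix, and key names are placeholders:

import json
import boto3

s3 = boto3.client('s3')

def write_manifest(bucket, prefix, manifest_key):
    # Collect every Parquet object under the prefix into manifest entries
    paginator = s3.get_paginator('list_objects_v2')
    entries = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            if obj['Key'].endswith('.parquet'):
                entries.append({'url': f"s3://{bucket}/{obj['Key']}", 'mandatory': True})
    # Upload the manifest next to the data so COPY can reference it
    body = json.dumps({'entries': entries}, indent=2)
    s3.put_object(Bucket=bucket, Key=manifest_key, Body=body.encode('utf-8'))

write_manifest('your-bucket', 'optimized/', 'optimized/manifest.json')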

5. Parallelize COPY by Leveraging Redshift’s Multiple Slices

Amazon Redshift distributes workload across slices (each node has multiple slices), so you want enough files to load concurrently.

Suppose your cluster has:

  • 2 nodes,
  • each with 16 slices = 32 slices total.

Aim for a file count that is a multiple of the slice count, for example roughly 10x the slices (here, 320 files), so every slice has parallel load work to do.
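
Not sure how many slices you have? The stv_slices system view lists one row per slice, so a quick count tells you:

-- Count the slices available for parallel COPY work
SELECT COUNT(*) AS total_slices
FROM stv_slices;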


6. Use Redshift Spectrum or Partitioned External Tables (If Real-Time Isn't a Must)

If immediate ingestion is not critical, use Redshift Spectrum to query the data in S3 directly, or set up partitioned external tables so Redshift scans only what it needs in place. This keeps rarely queried data out of the cluster entirely and offloads the scanning to the Spectrum layer.
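
As a rough sketch, registering an external schema backed by the AWS Glue Data Catalog and defining a partitioned Parquet table looks like this; the schema, database, table, and column names are placeholders:

-- Register a Glue Data Catalog database as an external schema
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'your_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define a partitioned external table over the Parquet files in S3
CREATE EXTERNAL TABLE spectrum_schema.events (
    event_id BIGINT,
    payload  VARCHAR(1024)
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://your-bucket/optimized/parquet/';

Remember that partitions still have to be registered (ALTER TABLE ... ADD PARTITION) or maintained by a Glue crawler before Spectrum will see them.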


7. Monitor and Tune with STL and SVL System Views

Use the system tables to check for load errors and per-slice skew after a COPY:

SELECT query, trim(filename) AS filename, line_number, err_reason, starttime
FROM stl_load_errors
WHERE query = pg_last_copy_id();

and

SELECT slice, trim(filename) AS filename, lines_scanned, curtime
FROM stl_load_commits
WHERE query = pg_last_copy_id()
ORDER BY slice;

Look for errors in the first query and for slices doing far more work than the rest in the second; SVL_QUERY_REPORT breaks the work down further per slice and step if you need to dig deeper.


Bonus: Automate with AWS Glue and Redshift Data API

  • Use AWS Glue jobs to convert and compress files automatically.
  • Trigger loads via a Lambda function calling the Redshift Data API with optimized COPY commands (a sketch follows below).
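
As an illustrative sketch rather than a drop-in implementation, here is what such a Lambda handler can look like with boto3; the cluster identifier, database, secret ARN, and COPY statement are placeholders:

import boto3

redshift_data = boto3.client('redshift-data')

# Placeholder COPY statement; reuse the manifest-based COPY from above
COPY_SQL = """
    COPY your_table
    FROM 's3://your-bucket/optimized/manifest.json'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS PARQUET
    MANIFEST;
"""

def handler(event, context):
    # Submit the COPY asynchronously; the Data API returns a statement Id immediately
    response = redshift_data.execute_statement(
        ClusterIdentifier='your-cluster',   # placeholder
        Database='your_db',                 # placeholder
        SecretArn='arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds',  # placeholder
        Sql=COPY_SQL,
    )
    return {'statement_id': response['Id']}

Poll describe_statement (or have the Data API publish completion events to EventBridge) if the rest of your pipeline needs to know when the load finishes.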

Summary Checklist

  • ✅ Use columnar file formats (Parquet / ORC)
  • ✅ Compress files smartly (Snappy/Zstd for Parquet)
  • ✅ Optimize file sizes (~100 MB to 1 GB) and count (a multiple of the slice count, ~10x)
  • ✅ Use manifest files for managed COPY jobs
  • ✅ Parallelize loads to exploit slices
  • ✅ Monitor COPY command metrics for bottlenecks
  • ✅ Consider Redshift Spectrum for less latency-sensitive workloads

By moving beyond the basic COPY command and addressing file format, compression, parallelization, and job management, you can cut S3 to Redshift transfers from minutes to seconds, all while lowering your AWS costs.

Got your own optimization wins or puzzling failures? Drop a comment below — let’s crack the performance code together!