Mastering Efficient Data Export: How to Seamlessly Transfer Large Datasets from Snowflake to Amazon S3
Think exporting data is just running a simple unload command? Think again. This guide dives deep into optimizing data movement from Snowflake to S3, debunking common myths and exposing costly pitfalls that can blindside your data engineering team.
Why Efficient Data Export Matters
When building scalable data lakes or preparing datasets for downstream processing, transferring data from Snowflake to Amazon S3 is a core step. It’s easy to overlook the complexity here; after all, Snowflake’s COPY INTO command looks straightforward enough.
But in reality, inefficient exports can lead to:
- Longer job runtimes
- Higher AWS egress charges
- Increased Snowflake compute credits usage
- Fragmented files causing slower downstream reading
Mastering this export process means saving both time and budget while making your overall data pipeline more robust.
Step 1: Understand the Basics — The COPY INTO Command
Snowflake’s workhorse for exports is the COPY INTO SQL command. Here’s a simple example that unloads a table into S3:
COPY INTO 's3://my-bucket/my-export-folder/'
FROM my_database.my_schema.my_table
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' COMPRESSION = GZIP)
HEADER = TRUE
MAX_FILE_SIZE = 50000000; -- Max file size of ~50MB per unload file
This command tells Snowflake:
- Where to unload (s3://my-bucket/my-export-folder/)
- What data to export (my_table)
- How to format files (CSV with gzip compression)
- How big the export files should be (~50 MB max)
Pro Tip: Defining file sizes can drastically impact export runtime and downstream usability.
Step 2: Use Storage Integrations for Secure, Scalable Transfers
Never embed AWS keys directly in your SQL commands! Instead, use a Storage Integration, which provides role-based access with fine-grained permissions.
To create one:
CREATE STORAGE INTEGRATION my_s3_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-snowflake-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/my-export-folder/');
Using integrations not only enhances security but also helps reduce authorization overhead for large or frequent exports.
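After creating the integration, you still have to complete the trust handshake on the AWS side. A quick sketch (the exact IAM policy details depend on your AWS setup):
-- Retrieve the IAM user ARN and external ID that Snowflake will use;
-- add them to the trust policy of the IAM role referenced above
DESC STORAGE INTEGRATION my_s3_integration;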
Step 3: Pick the Right File Format and Compression
The export file format has a significant impact on export speed, file size, and ease of later consumption.
Common picks:
| Format | Pros | Cons |
|---|---|---|
| CSV | Universal, easy debugging | Larger files, no schema or type information |
| Parquet | Columnar compression, fast analytical reads | Slightly more complex setup |
| JSON | Semi-structured flexibility | Larger files, parsing overhead |
For large exports, Parquet with Snappy compression is usually best because it reduces storage costs and accelerates reads in analytics engines like Athena or Redshift Spectrum.
Example file format creation for Parquet:
CREATE FILE FORMAT my_parquet_format
TYPE = PARQUET
COMPRESSION = SNAPPY;
Reference it in your unload:
COPY INTO 's3://my-bucket/my-export-folder/'
FROM my_database.my_schema.my_table
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format');
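Optionally, to avoid repeating the integration and format in every statement, you can bundle them into a named external stage (my_export_stage is an illustrative name):
-- A named external stage that bundles location, integration, and file format
CREATE STAGE my_export_stage
  URL = 's3://my-bucket/my-export-folder/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format');

-- Unload through the stage instead of a raw S3 URL
COPY INTO @my_export_stage
FROM my_database.my_schema.my_table;
You can then run LIST @my_export_stage to confirm the unloaded files landed.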
Step 4: Optimize Parallelism & File Sizes
Snowflake unloads in parallel, splitting output into multiple files whose size is capped by MAX_FILE_SIZE. Small files hurt both performance and cost (each S3 PUT request costs money), while enormous files can slow downstream processing.
Aim for 50–100 MB per unload file and adjust it with MAX_FILE_SIZE.
Alternatively, for immense datasets you can control the output explicitly by splitting the export into several smaller unload statements:
COPY INTO 's3://my-bucket/my-export-folder/bucket_0/'
FROM (
  SELECT * FROM my_database.my_schema.my_table WHERE MOD(ABS(HASH(id)), 10) = 0
)
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
MAX_FILE_SIZE = 100000000;
Run the same statement for each hash bucket (0 through 9), changing the WHERE condition and the target prefix each time, to partition the export workload manually and produce controlled, chunked output files.
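If the table has a natural partitioning column, Snowflake’s PARTITION BY option on COPY INTO can produce similar chunking without manual hash buckets. A minimal sketch, assuming a created_date column that exists only for illustration:
-- Partitioned unload: one S3 prefix per created_date value
-- (created_date is an assumed column, used here for illustration)
COPY INTO 's3://my-bucket/my-export-folder/'
FROM my_database.my_schema.my_table
PARTITION BY ('dt=' || TO_VARCHAR(created_date))
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
MAX_FILE_SIZE = 100000000;
Downstream engines like Athena can then treat the dt= prefixes as partitions instead of scanning everything.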
Step 5: Monitor Costs — Bandwidth & Compute Credits
Snowflake charges compute credits based on warehouse size and runtime during unload jobs.
Tips to minimize credits:
- Use appropriately sized warehouses (Medium or Large instead of XL or bigger)
- Suspend warehouses immediately after exports finish via automation scripts or tasks
- Monitor warehouse credit usage with Snowflake’s UI or queries against SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY, as shown below
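Here’s a minimal monitoring sketch; it assumes your role can read the SNOWFLAKE.ACCOUNT_USAGE share, and export_wh is a placeholder warehouse name:
-- Credits consumed per warehouse over the last 7 days
SELECT warehouse_name,
       SUM(credits_used) AS credits_last_7_days
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_7_days DESC;

-- Make the export warehouse suspend itself quickly when idle
ALTER WAREHOUSE export_wh SET AUTO_SUSPEND = 60;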
On the AWS side, bandwidth costs matter when moving large datasets out of Snowflake’s region or between regions. To minimize egress charges:
- Keep Snowflake and S3 buckets within the same AWS region.
- Avoid excessive read/write traffic caused by retries or inefficient transfers.
Step 6: Automate & Monitor Your Exports
Large-scale production pipelines require automation and alerting around data exports.
Sample shell script using SnowSQL CLI:
#!/usr/bin/env bash
# Unload to S3 and alert by email if the export fails
snowsql -a <account> -u <user> -w <warehouse> -d <db> -s <schema> \
  -o exit_on_error=true -q "
COPY INTO 's3://my-bucket/my-export-folder/'
FROM my_table
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = (FORMAT_NAME='my_parquet_format')
MAX_FILE_SIZE=83886080;
"
if [ $? -ne 0 ]; then
  echo "Export failed at $(date)" | mail -s "Snowflake Export Alert" your-email@example.com
fi
Build orchestration around this with Airflow, Dagster, or Prefect tasks that call the script (or connect to Snowflake directly), making sure retries and logging are in place. For lightweight scheduling you can also run the unload as a native Snowflake task, as sketched below.
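A minimal sketch of that native-task route; the task name, warehouse, and cron schedule are placeholders:
-- Schedule the unload inside Snowflake with a task
-- (nightly_s3_export, my_warehouse, and the schedule are illustrative)
CREATE OR REPLACE TASK nightly_s3_export
  WAREHOUSE = my_warehouse
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  COPY INTO 's3://my-bucket/my-export-folder/'
  FROM my_database.my_schema.my_table
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
  MAX_FILE_SIZE = 83886080;

-- Tasks start out suspended; resume to activate the schedule
ALTER TASK nightly_s3_export RESUME;
If you go this route, watch for failures via ACCOUNT_USAGE.TASK_HISTORY (or the INFORMATION_SCHEMA.TASK_HISTORY table function), since errors won’t surface in an external orchestrator.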
Summary Checklist for Efficient Snowflake → S3 Export
- Use COPY INTO with Storage Integrations for secure access.
- Select optimal file formats; Parquet + Snappy preferred.
- Tune MAX_FILE_SIZE to 50–100 MB per file to balance parallelism & costs.
- Size warehouses smartly; suspend them when idle.
- Keep Snowflake & S3 buckets in same AWS region where possible.
- Automate exports with scripting & monitoring.
- Consider manual partitioned queries for very large datasets.
Exporting from Snowflake doesn’t have to be a mysterious black-box operation. By mastering these practical steps you can save hours of wait time, reduce cloud bills, and keep your pipelines humming smoothly.
Got special cases or tips around exporting? Share your experiences in the comments!
Happy exporting!