Snowflake To S3

Reading time: 1 min
#Cloud #Data #AWS #Snowflake #DataExport #S3

Mastering Efficient Data Export: Snowflake to Amazon S3

Transferring large datasets from Snowflake to Amazon S3 efficiently is a frequent pain point when scaling data lakes or supplying data to external consumers. Poorly engineered exports show up as high egress bills, bloated compute usage, and S3 buckets littered with tiny or inconsistently structured files: bottlenecks that surface downstream in Athena queries or Spark jobs.


The Core: COPY INTO for S3 Exports

Snowflake’s COPY INTO command serves as the operational workhorse for data exports. Basic CSV dumps are trivial:

COPY INTO 's3://analytics-bucket/unloads/order_table/'
FROM analytics.public.orders
STORAGE_INTEGRATION = sf_s3_prod_int
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP)
HEADER = TRUE
MAX_FILE_SIZE = 52428800; -- 50 MB

But default COPY usage leaves plenty of room for error: the wrong format, poor file chunking, and security missteps. Check real file counts and the actual file size distribution in S3 after each unload; expect more variation than the docs suggest.

Gotcha: Files often deviate from MAX_FILE_SIZE due to row boundaries and partitioning logic. If precise file sizing is business critical, pre-aggregate data into batches first.
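
A quick way to do that check, assuming the AWS CLI has read access to the target bucket (paths reuse the example above):

aws s3 ls s3://analytics-bucket/unloads/order_table/ \
  --recursive --human-readable --summarize
# The trailing "Total Objects" and "Total Size" lines give file and byte
# counts to compare against what the COPY statement reported.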


Secure, Operationally Sound Storage Integrations

Embedding AWS Access Keys in scripts is a hard anti-pattern. Instead:

CREATE STORAGE INTEGRATION sf_s3_prod_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = S3
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_s3_writer_v2'
  STORAGE_ALLOWED_LOCATIONS = ('s3://analytics-bucket/unloads/');

Audit role permissions: scope access narrowly and attach only the minimal S3 actions required (for example s3:PutObject and s3:GetObject on the unload prefix). Snowflake generates an external ID under the hood for the trust boundary; validate it in the IAM role's trust relationship.
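
One way to do that validation: describe the integration and cross-check the values Snowflake generated against the role's trust policy (the role name below is the one from the example above).

DESC INTEGRATION sf_s3_prod_int;
-- STORAGE_AWS_IAM_USER_ARN should appear as the Principal in the trust policy
-- of snowflake_s3_writer_v2, and STORAGE_AWS_EXTERNAL_ID should match the
-- sts:ExternalId condition on that policy.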

Note: A region mismatch between your Snowflake account’s cloud region and the S3 bucket results in non-trivial egress fees. Always match regions unless legal constraints dictate otherwise.
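
Checking both sides takes a few seconds. A sketch, assuming snowsql and the AWS CLI are already configured and reusing the example bucket name:

# Snowflake side: cloud region of the current account
snowsql -a $SNOWFLAKE_ACCOUNT -u $DB_USER -q "SELECT CURRENT_REGION();"

# AWS side: region of the target bucket (a null LocationConstraint means us-east-1)
aws s3api get-bucket-location --bucket analytics-bucket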


File Formats and Compression: Trade-offs in Production

Format  | Strengths                                         | Weaknesses
Parquet | Columnar, efficient, supports schema evolution    | Slight learning curve, less transparent for ad-hoc debugging
CSV     | Universally supported, trivial to read            | Bulky, slower, type issues
JSON    | Schema-flexible, feeds semi-structured pipelines  | Inefficient for volume, high parsing cost

Production environments typically favor Parquet with Snappy due to payload size reduction and broad compatibility with Spark/Athena/Trino.

CREATE FILE FORMAT orders_parquet 
  TYPE = PARQUET 
  COMPRESSION = SNAPPY;

Parquet’s columnar layout and compression typically produce files that are a fraction of the size of the equivalent CSV exports. Still, if downstream targets include old BI tools, be realistic: sometimes CSV is “good enough.”


Throughput Optimization: Parallelism and File Chunks

Small files (hundreds of KB to a few MB) stifle S3 performance at scale, manifesting as the classic “small files problem” in most Hadoop or Spark jobs. Oversized files (over 1 GB) become unwieldy for batch processing, especially given Lambda’s memory and timeout constraints or downstream pipeline timeouts.

Empirically, 50–150 MB per file is the functional sweet spot in mid-size data platforms (~1–10 TB). Tune this via MAX_FILE_SIZE, but monitor actual versus expected output.
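
To set expectations before tuning, a rough file-count estimate can be derived from the table's stored size. A sketch against the example table; treat it as an upper bound, since compressed Parquet output usually lands well below the stored BYTES figure:

-- Rough estimate of how many ~100 MB files to expect from the orders table.
SELECT TABLE_NAME,
       BYTES,
       CEIL(BYTES / 104857600) AS approx_files_at_100mb
FROM ANALYTICS.INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'PUBLIC'
  AND TABLE_NAME = 'ORDERS';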

Manual partitioning can provide deterministic file groupings. Example: for a 2 billion row dataset partitioned by hash:

COPY INTO 's3://analytics-bucket/unloads/orders/'
FROM (
  SELECT * FROM analytics.public.orders
  WHERE MOD(ABS(HASH(order_id)), 16) = 2
)
STORAGE_INTEGRATION = sf_s3_prod_int
FILE_FORMAT = (FORMAT_NAME = 'orders_parquet')
MAX_FILE_SIZE = 104857600; -- 100 MB

Automate sweeps over the hash remainder (0–15 here) to fan the work out into parallel, independently retryable tasks in Airflow, as sketched below.
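
A minimal shell sketch of that sweep; the per-slice prefix (slice_N/) is just one way to keep outputs separate, and Airflow would normally run these as parallel tasks rather than a serial loop:

# Unload all 16 hash slices of the orders table.
for slice in $(seq 0 15); do
  snowsql -a $SNOWFLAKE_ACCOUNT -u $DB_USER -w EXPORT_WH -d analytics -s public -q "
    COPY INTO 's3://analytics-bucket/unloads/orders/slice_${slice}/'
    FROM (SELECT * FROM analytics.public.orders
          WHERE MOD(ABS(HASH(order_id)), 16) = ${slice})
    STORAGE_INTEGRATION = sf_s3_prod_int
    FILE_FORMAT = (FORMAT_NAME = 'orders_parquet')
    MAX_FILE_SIZE = 104857600;"
done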


Monitoring and Cost Controls

Snowflake: Compute Credits

  • Warehouse Sizing: Use XSMALL or SMALL for test exports; upsize only for heavy transforms.

  • Auto-Suspend: Set a short AUTO_SUSPEND on the export warehouse, or suspend it explicitly in a post-export SQL step, e.g.:

    ALTER WAREHOUSE EXPORT_WH SET AUTO_SUSPEND = 60; -- suspend after 60 idle seconds
    ALTER WAREHOUSE EXPORT_WH SUSPEND;               -- or suspend explicitly once the export finishes
    

    Or leverage Snowflake Tasks to control start/stop logic.

  • Usage Logging: Query credit consumption with:

    SELECT START_TIME, END_TIME, CREDITS_USED
    FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
    WHERE WAREHOUSE_NAME = 'EXPORT_WH'
          AND START_TIME >= CURRENT_DATE - INTERVAL '1 DAY';
    

    Track peaks for trending.

AWS: Egress and API Costs

  • Keep S3 targets in the same region as your Snowflake account whenever possible.
  • Batch API calls—avoid repeated micro-writes. Each S3 PutObject is an API charge.

Known issue: Unplanned retries from flaky network conditions will balloon costs if not managed. Ensure transfer scripts include robust backoff and deduplication logic.
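
A bare-bones sketch of that pattern (connection flags mirror the snowsql example in the next section; exit_on_error is assumed to be enabled so SQL failures surface as non-zero exit codes):

# Retry the unload up to 5 times with exponential backoff. OVERWRITE = TRUE
# keeps retries from piling up duplicate files under the target prefix.
attempt=0
until snowsql -a $SNOWFLAKE_ACCOUNT -u $DB_USER -w EXPORT_WH -d analytics -s public \
      -o exit_on_error=true -q "
      COPY INTO 's3://analytics-bucket/unloads/orders/'
      FROM orders
      STORAGE_INTEGRATION = sf_s3_prod_int
      FILE_FORMAT = (FORMAT_NAME = 'orders_parquet')
      OVERWRITE = TRUE;"
do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge 5 ]; then
    echo "Export failed after $attempt attempts" >&2
    exit 1
  fi
  sleep $((2 ** attempt))   # back off 2, 4, 8, 16 seconds
done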


Automation in Practice: Shell Script + SnowSQL

Orchestration can run via cron, Airflow, Lambda, or other tools. A simple shell automation example for audit-logged exports:

snowsql -a $SNOWFLAKE_ACCOUNT -u $DB_USER -w EXPORT_WH -d analytics -s public -q "
COPY INTO 's3://analytics-bucket/unloads/orders/'
FROM orders
STORAGE_INTEGRATION = sf_s3_prod_int
FILE_FORMAT = (FORMAT_NAME='orders_parquet')
MAX_FILE_SIZE=134217728; -- 128 MB" 2>&1 | tee /tmp/export_orders.log

if grep -qi 'error' /tmp/export_orders.log; then
  mail -s "Snowflake Export Failure" dataops-team@example.com < /tmp/export_orders.log
fi

Automate log parsing: Snowflake errors may surface only as messages in the output rather than as hard process exits.

Non-obvious tip: for regular, massive exports, add an S3 lifecycle rule that aborts incomplete multipart uploads left behind by rare job interruptions; those orphaned parts are otherwise hard to spot.
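
For example, a rule like the following cleans up stale parts after a week (the rule ID, prefix scope, and 7-day window are arbitrary choices here):

# Abort multipart uploads that never completed within 7 days.
aws s3api put-bucket-lifecycle-configuration \
  --bucket analytics-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "abort-stale-multipart-uploads",
      "Status": "Enabled",
      "Filter": {"Prefix": "unloads/"},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    }]
  }'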


Checklist: Production-Grade Snowflake-to-S3 Exports

  • Use Storage Integrations. Never use static AWS credentials.
  • Explicit File Formats: Prefer Parquet + Snappy for analytics. CSV if wider compatibility is required.
  • Tune MAX_FILE_SIZE: Target 50–150 MB per object.
  • Optimize Warehouse Sizing: Right-size, suspend immediately after.
  • S3 and Snowflake in Same Region: Eliminate cross-region bandwidth where possible.
  • Partition Exports for Scalability: Hash-based partitioning for massive tables.
  • Automate and Alert: Export runs must be alert-monitored for both success and partial failure.
  • Post-process Verification: Scripted count of files and bytes to ensure consistency after transfer.

Data export engineering goes well beyond SQL syntax. For sustained operational reliability—and fewer 3 a.m. alerts—refine configuration iteratively, measure real costs, and automate everything you can. Exporting is never just a one-liner in production.

Questions, edge cases, or battle stories? Reach out—healthy paranoia prevents outages.