How to Seamlessly Migrate Complex Workloads from Redshift to Snowflake Without Breaking Queries
Mass migrations between data warehouses can derail analytics if handled naively. Redshift and Snowflake share similarities but have fundamental architectural divergences—these can cripple existing pipelines, introduce subtle data quality issues, or degrade report performance overnight.
Below is a workflow for preserving business logic and minimizing risk throughout a Redshift-to-Snowflake migration. The focus is on realistic scenarios: if your org has 50–500+ tables, mixed workloads, and critical dashboards in play, take note.
Key Architectural Divergences (Know Your Target)
Start by mapping the gaps—not just on paper, but in behavior and tuning.
- Separation of Compute and Storage: Provisioned Redshift ties scalability largely to cluster sizing (RA3 managed storage loosens the coupling, but compute still scales by resizing or adding clusters); in Snowflake, any virtual warehouse can query any data on demand, and warehouses can be resized or multiplied independently of storage.
- SQL Dialect & Procedural Differences: Redshift stored procedures are written in PL/pgSQL (UDFs in SQL or Python); Snowflake offers Snowflake Scripting (SQL), JavaScript, and other languages for procedures and UDFs, with different exception handling, cursor behavior, and transaction semantics. Expect rewrites rather than line-for-line translations.
- Physical Optimization Artifacts: DISTKEY and SORTKEY in Redshift influence data layout heavily. Snowflake’s micro-partitioning is opaque and automatically maintained—manual tuning is rare.
- Data Types: Redshift’s SUPER for semi-structured data ≠ Snowflake’s VARIANT. Numeric handling, timestamp precision, and handling of empty strings vs NULL differ subtly.
Copy-pasting DDLs or SQL? Expect silent breakage or degraded query plans unless these are handled upfront.
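To see how the physical-tuning DDL falls away in practice, here is a minimal sketch of translating a hypothetical sales table (the column names and the clustering choice are illustrative assumptions, not a recommendation):
-- Redshift: distribution and sort keys are declared explicitly
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id INT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);

-- Snowflake: no DISTKEY/SORTKEY; micro-partitioning is automatic.
-- A clustering key is optional and usually only worthwhile on very large tables.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id INT,
    sale_date   DATE,
    amount      NUMBER(12,2)
)
CLUSTER BY (sale_date);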
Audit and Catalog: The Foundation Phase
Attempting migration without a precise inventory is reckless. Begin with metadata extraction and usage profiling.
- Extract Schema Definitions: Use pg_dump -s for basic DDL. For full fidelity—including encoding, comments, and custom datatypes—prefer the AWS Schema Conversion Tool (SCT) or psql scripting.
- Inventory Dynamic Assets: List all stored procedures, UDFs, scheduled jobs (e.g., AWS Data Pipeline, Airflow DAGs), and non-standard ETL patterns.
- Extract Historical Workloads: Pull query logs from the STL_QUERY and STL_WLM_QUERY system tables (STL tables retain only a few days of history, so export them early), then filter for long-running or business-critical queries; a ranking sketch follows this list. Example:
SELECT userid, query, database, querytxt, starttime, endtime FROM stl_query WHERE starttime >= '2024-03-01';
- Note Edge Configurations: E.g., WLM queue assignments; user-level privilege hacks.
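One way to surface those long-running queries is to rank the log by total runtime. A rough sketch; the date cutoff and the 50-row limit are arbitrary choices:
-- Rank historical queries by total runtime to prioritize validation targets
SELECT querytxt,
       COUNT(*) AS executions,
       SUM(DATEDIFF(second, starttime, endtime)) AS total_seconds
FROM stl_query
WHERE starttime >= '2024-03-01'
GROUP BY querytxt
ORDER BY total_seconds DESC
LIMIT 50;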
Caution: Relying solely on automated schema converters—for large codebases (>100 SQL scripts) or multi-workspace environments—can leave coverage gaps that go unnoticed until queries fail.
Migration Approach: Big Bang or Phased Rollout?
Neither approach is perfect.
Full Lift-and-Shift:
- Data Export: Use Redshift UNLOAD to S3 in Parquet—a necessity for >1TB datasets or schema evolution downstream.
- Bulk Load: Leverage Snowflake’s COPY INTO from an external stage. Parquet or ORC are superior to CSV for retaining types (particularly decimals/booleans).
- Mass Syntax Conversion: Use SCT for the initial conversion; manually patch corner cases (ordering semantics, date math, leftover DISTKEY/SORTKEY clauses).
Phased/Segmented:
- Selective Table Refactoring: Prioritize by data freshness or analytics dependencies.
- Query-by-Query Validation: Rewrite, test, and rollout individual dashboards.
- Hybrid Data Model: Temporarily sync data between warehouses for parallel validation.
Trade-off: Big bang risks a long freeze window; phased rollouts double maintenance. For mixed workloads (~500GB, multiple ETL pipelines), a staged approach is safer.
Data Export & Import: Precision Steps
Batch data exports as follows:
-- Redshift export (ensure Redshift UNLOAD permission and IAM role attached)
UNLOAD ('SELECT * FROM sales')
TO 's3://data-migration-bucket/sales/'
IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftS3ExportRole'
PARQUET;
Note: Prefer disabling parallelism (PARALLEL OFF) if you require deterministic file splits (some downstream loaders have issues).
Load to Snowflake:
COPY INTO sales
FROM @ext_stage/sales/
PATTERN = '.*\.parquet'
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE; -- maps Parquet columns into a typed (non-VARIANT) table by name
Gotcha: Whitespace handling can differ between Redshift unloads and Snowflake file formats; if whitespace sensitivity is required, set TRIM_SPACE = FALSE explicitly rather than relying on defaults.
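The @ext_stage referenced above has to exist before the COPY runs. A minimal setup sketch, assuming a storage integration named s3_int has already been configured for the migration bucket (all names here are placeholders):
-- One-time setup in Snowflake; integration and bucket names are placeholders
CREATE OR REPLACE FILE FORMAT parquet_migration
  TYPE = PARQUET;

CREATE OR REPLACE STAGE ext_stage
  URL = 's3://data-migration-bucket/'
  STORAGE_INTEGRATION = s3_int
  FILE_FORMAT = (FORMAT_NAME = 'parquet_migration');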
SQL Syntax & Function Mapping: Where Migrations Fail Quietly
Feature | Redshift | Snowflake | Remarks |
---|---|---|---|
Distribution keys | DISTKEY(col) | No direct equivalent | Use clustering keys sparingly |
Sort keys | SORTKEY(col) | No direct equivalent | Relies on micro-partition pruning |
JSON queries | SUPER, json_extract_path_text | VARIANT, colon (:) syntax | Incompatible syntax |
Procedures | PL/pgSQL | Snowflake Scripting / JavaScript | Rewrites common |
Window functions | Full ANSI | Full ANSI | Comparable |
Example: LISTAGG is one of the many functions that port unchanged.
-- Runs identically on Redshift and Snowflake
SELECT listagg(category, ',') WITHIN GROUP (ORDER BY category) AS categories FROM items;
Most functions are portable as-is, but inspect timezone math and JSON operations by hand; JSON access in particular always needs a rewrite, as sketched below.
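A sketch of the kind of rewrite JSON access requires, assuming a hypothetical attributes column stored as a JSON string in Redshift and as VARIANT in Snowflake:
-- Redshift: function-style extraction from a VARCHAR column holding JSON
SELECT json_extract_path_text(attributes, 'color') AS color FROM items;
-- Snowflake: colon path syntax on a VARIANT column, with an explicit cast
SELECT attributes:color::STRING AS color FROM items;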
Validation: Dual-Run Everything
No matter how robust your migration tooling, always dual-run business-critical queries:
- Row Count Comparison: Sample queries, full table counts.
- Aggregate Consistency: SUM(), AVG(), and friends, including their handling of NULLs (a runnable probe follows this list).
- Edge Type Checking: Watch for subtle float rounding, timezone skew, or string encoding.
- Explain Plans: Execution time parity isn’t guaranteed, but logic flow should roughly align.
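A minimal consistency probe that runs verbatim on both engines and can be diffed as a single row; the amount and sale_date columns on sales are illustrative assumptions:
-- Run on Redshift and Snowflake, then compare the output row
SELECT COUNT(*)                 AS row_count,
       SUM(amount)              AS total_amount,
       COUNT(*) - COUNT(amount) AS null_amounts,
       MIN(sale_date)           AS min_sale_date,
       MAX(sale_date)           AS max_sale_date
FROM sales;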
Automate diffs (Python, dbt, or Great Expectations). Sample assertion—Snowflake vs Redshift row count:
# Example using snowflake-connector-python and redshift_connector;
# sf_count and rs_count hold single-value COUNT(*) results fetched from each connection.
assert sf_count == rs_count, f"Row count mismatch after migration for table sales: {rs_count} vs {sf_count}"
Known issue: Redshift UNLOAD sometimes strips trailing whitespace, which isn't flagged by direct type comparison.
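To catch that class of issue, compare string lengths rather than types. A rough probe to run on both sides (the category column on items is an assumption):
-- A lower total on the Snowflake side points to stripped trailing whitespace
SELECT SUM(LENGTH(category)) AS total_chars,
       SUM(CASE WHEN category <> RTRIM(category) THEN 1 ELSE 0 END) AS rows_with_trailing_space
FROM items;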
Post-Migration: Automate Testing, Monitor, and Educate
- Deploy synthetic query monitors on the new Snowflake workloads. Alert on anomalous error spikes (e.g., SQL compilation error: syntax error line 1...).
- Empower analysts to self-check with migration “cheat sheets” (lists of changed functions and new error surfaces).
- Schedule post-migration audits for incremental loads—early failures propagate silently if only initial full-load was verified.
If issues like increased query latency emerge, first review warehouse sizing and auto-suspend configs (Snowflake’s default can lead to “cold start” delays after periods of inactivity).
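If cold starts turn out to be the cause, lengthening the suspend window is a one-line change. A sketch with a placeholder warehouse name and an arbitrary 10-minute window:
-- Keep the warehouse warm for 10 minutes of inactivity instead of suspending almost immediately
ALTER WAREHOUSE reporting_wh SET AUTO_SUSPEND = 600 AUTO_RESUME = TRUE;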
Don’t Trust Schema Converters Blindly
Skipping custom validation in favor of “automated conversion” is an anti-pattern. Real assurance depends on layered checks—DDL comparison, sample queries, real workloads—before switching BI traffic.
Minimal Example: Table Migration and Query Alignment
-- Step 1: Redshift UNLOAD with Parquet for type fidelity
UNLOAD ('SELECT * FROM orders')
TO 's3://my-data-bucket/redshift/unload/orders_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftExportRole'
FORMAT PARQUET;
-- Step 2: Snowflake import from S3 stage
COPY INTO orders
FROM @s3_stage/redshift/unload/orders_
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
Query adjustment:
-- Redshift: DATE_TRUNC always returns a TIMESTAMP
SELECT DATE_TRUNC('month', order_date) FROM orders;
-- Snowflake: DATE_TRUNC returns the input type (a DATE stays a DATE); cast if downstream logic expects a timestamp
SELECT DATE_TRUNC('month', order_date)::TIMESTAMP_NTZ FROM orders;
Note: The date-part name is case-insensitive in both engines; it is the return type that changes silently and can break downstream comparisons.
Pragmatic tip: For massive migrations, script “canary queries” to run against every imported table; they catch misaligned columns and broken defaults (one missed default can break downstream ETL). A schema-level canary is sketched below. For procedures spanning >500 lines, rewrite piecemeal; automated tools often mangle error-handling blocks.
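One such canary compares column counts per table between the two systems; it assumes the migrated objects live in the public schema and normalizes identifier case so the outputs can be diffed directly:
-- Run on Redshift and Snowflake, then diff the result sets
SELECT LOWER(table_name) AS table_name,
       COUNT(*)          AS column_count
FROM information_schema.columns
WHERE LOWER(table_schema) = 'public'
GROUP BY 1
ORDER BY 1;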
Redshift-to-Snowflake isn't a one-click journey. Treat migration as a controlled, iterative refactor: catalog, export, import, validate, repeat. That’s how you keep analytics running and stakeholders off your back.