Mastering Efficient Data Migration: How to Seamlessly Move Data from Aurora to Redshift
Critical reporting workloads periodically cripple transactional databases—especially once product usage scales and Aurora’s primary becomes wastefully over-provisioned just to support aggregation queries. Analytics belongs elsewhere. Amazon Redshift is designed for OLAP, featuring a columnar engine, result set caching, and large result streaming.
Yet, data migration from Amazon Aurora (PostgreSQL compatible, tested with v13.x) to Amazon Redshift (tested with ra3 nodes, v1.0.45213, June 2024) raises familiar operational headaches: schema design divergence, CDC, data type mismatches, and data consistency under ongoing change.
Strategic Pitfalls When Migrating Aurora to Redshift
- Schema Drift and Data Modeling: Aurora tends toward highly normalized OLTP schemas, while Redshift delivers its best performance with denormalized models (e.g., flattened star schemas; see the data modeling example below). Simply copying tables is rarely enough. Side note: Redshift has limited support for complex data types; its SUPER type handles some JSON, but with caveats.
- Downtime and Lag: Bulk data loads ("lift-and-shift") lock source tables or create replication lag. Attempting a "zero-downtime" migration without robust CDC typically introduces data gaps.
- Data Consistency: Consistency issues spike during periods of high write IOPS to Aurora, as DMS or homegrown CDC can miss last-minute in-flight updates. Neither AWS DMS (as of 3.5.2) nor Glue guarantees atomic cutover without special handling.
- Performance and Cost: COPY into Redshift is fast (parallelized import, up to ~250k rows/sec/node for CSV/Parquet), but S3 transfer costs and Redshift slot contention become bottlenecks under wide table loads.
Migration Blueprint: Aurora to Redshift
1. Data Modeling
- Schema Audit: Identify key analytical tables and critical dimensions.
  - Most queries benefit from pre-joined, wide tables (orders_fact, customer_dim).
  - Example: Map Aurora's normalized orders, order_items, and products into a single Redshift orders_fact with embedded item arrays (as SUPER/JSON if needed); see the sketch after this list.
- Type Mapping:
  - Standard types (INTEGER, VARCHAR) port directly.
  - Problem areas: UUID in Aurora maps to VARCHAR(36) in Redshift (no native UUID type or uniqueness constraint); JSONB/JSON in Aurora maps to Redshift SUPER (requires RA3/DS2 nodes).
  - Known issue: Aurora's arrays, enums, and extensions (citext, postgis) have no direct Redshift equivalent.
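As a sketch of that mapping (table and column names here are illustrative assumptions, not a real schema), the Aurora side can pre-flatten line items into a JSON array and the Redshift target can store them in a SUPER column:

-- Aurora (PostgreSQL): flatten order_items into one embedded JSON array per order
SELECT o.order_id,
       o.customer_id,
       o.order_date,
       o.status AS order_status,
       o.total  AS order_total,
       json_agg(json_build_object(
           'product_id', oi.product_id,
           'qty',        oi.quantity,
           'price',      oi.unit_price)) AS items
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.order_id, o.customer_id, o.order_date, o.status, o.total;

-- Redshift: denormalized target; the item array lands in a SUPER column
CREATE TABLE orders_fact (
    order_id      BIGINT        NOT NULL,
    customer_id   BIGINT        NOT NULL,
    order_date    DATE          NOT NULL,
    order_status  VARCHAR(32),
    order_total   DECIMAL(12,2),
    items         SUPER
)
DISTKEY (customer_id)
SORTKEY (order_date);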
2. Initial Bulk Data Movement
Use S3 as a data interchange layer, not direct streaming.
- Aurora Export:
  Aurora PostgreSQL has no UNLOAD statement (UNLOAD is Redshift syntax); export query results to S3 with the aws_s3 extension:
  SELECT * FROM aws_s3.query_export_to_s3(
      'SELECT * FROM orders',
      aws_commons.create_s3_uri('acme-data-migration', 'orders_', 'us-east-1'),
      options := 'format csv'
  );
  The aws_s3 extension writes text/CSV; if you need Parquet, use an Aurora snapshot export to S3 instead. For large tables, split the export with LIMIT/OFFSET or, better, by primary key range to parallelize (see the sketch after this list).
  Gotcha: the export query sees a snapshot as of statement start; avoid running it during bulk INSERTs.
- Redshift Import:
  COPY orders_fact FROM 's3://acme-data-migration/orders_' IAM_ROLE 'arn:aws:iam::1234567890:role/redshift-unload-role' FORMAT AS PARQUET;
  (Use FORMAT AS CSV for CSV exports.) MAXERROR 5 occasionally helps skip corrupt rows during initial loads. For very large tables, consider COPY ... REGION 'us-east-1' STATUPDATE ON COMPUPDATE ON for adaptive stats.
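A minimal sketch of the key-range split mentioned above, assuming a numeric order_id primary key and reusing the bucket and prefix from the earlier example; each chunk can run concurrently, ideally against separate Aurora readers:

-- Chunk 1 of N: export a single primary-key range
SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT * FROM orders WHERE order_id >= 1 AND order_id < 30000000',
    aws_commons.create_s3_uri('acme-data-migration', 'orders_part_0001', 'us-east-1'),
    options := 'format csv'
);
-- Repeat for the next ranges (30000000-60000000, ...), writing to orders_part_0002, orders_part_0003, ...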
3. Change Data Capture (CDC)
To keep Redshift in sync after the bulk load:
- AWS DMS (Database Migration Service):
  - Set up a replication instance (tested: dms.r5.large, with multi-AZ for reliability).
  - DMS reads Aurora's WAL via logical replication (the binlog on Aurora MySQL) and applies INSERT/UPDATE/DELETE changes to Redshift.
  - DMS CDC setup sample config:
    { "TargetMetadata": { "TargetSchema": "public", "FullLobMode": false, "LobChunkSize": 64, "LimitedSizeLobMode": true, "LobMaxSize": 32 } }
  - Known issue: DMS CDC may lag behind on high-throughput tables (~5-10 min) and will error on unsupported DDL operations (e.g., changing a column type).
  - Always define table mapping rules for exclusions and avoid system tables (see the selection-rule sketch after this list).
- Alternative: AWS Glue or a custom Debezium/Kinesis pipeline, streaming WAL change events to S3 and batch-loading them into Redshift. Pro: more custom transformations are possible. Con: operational complexity.
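A minimal table-mapping sketch for that last point (the audit_log table is a made-up placeholder): include the application schema wholesale, then explicitly exclude tables you don't want replicated.

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-public",
      "object-locator": { "schema-name": "public", "table-name": "%" },
      "rule-action": "include"
    },
    {
      "rule-type": "selection",
      "rule-id": "2",
      "rule-name": "exclude-audit-log",
      "object-locator": { "schema-name": "public", "table-name": "audit_log" },
      "rule-action": "exclude"
    }
  ]
}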
4. Validation & Incremental Sync
- Checksums: Periodically verify row counts and checksums over key columns (e.g., md5 of a delimited concatenation such as md5(concat_ws(',', ...)) on the Aurora side) across Aurora and Redshift; see the sketch after this list.
- Lag Tracking: Monitor the DMS CDCLatencySource metric and Redshift's stl_load_commits for ingestion delay.
- Error Monitoring: Watch for typical DMS errors, such as:
  ERROR Error executing statement for table `public.orders`: Record size exceeds the Redshift column limit.
  Adjust column widths or data slices accordingly.
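A minimal cross-check sketch, assuming the illustrative order_id and order_status columns from the data modeling sketch; the concatenation is written with || so the same expression works on both engines, and the two queries should return matching counts and checksums:

-- Aurora (PostgreSQL): row count plus an order-independent checksum over key columns
SELECT count(*) AS row_count,
       sum(('x' || substring(md5(order_id::text || ',' || coalesce(order_status, '')), 1, 7))::bit(28)::int) AS checksum
FROM orders;

-- Redshift: same logic; strtol converts the hex prefix of the md5 to a number
SELECT count(*) AS row_count,
       sum(strtol(substring(md5(order_id::varchar || ',' || coalesce(order_status, '')), 1, 7), 16)) AS checksum
FROM orders_fact;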
5. Post-Migration Table Optimization
- Sort/dist keys: Orders analytics expect frequent date-range scans and customer aggregations, for example:
  CREATE TABLE orders_fact ( ... ) DISTKEY(customer_id) SORTKEY(order_date);
- Vacuum: Schedule a nightly vacuum; skipping it leads to table bloat and slow queries.
- Analyze: Run ANALYZE after large loads or schema changes (see the maintenance sketch below).
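A minimal maintenance sketch for those two items, using the orders_fact table from the examples above:

-- Reclaim space from deleted rows and restore sort order after heavy churn
VACUUM FULL orders_fact;
-- Refresh planner statistics after large loads or schema changes
ANALYZE orders_fact;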
Practical Example
Scenario:
Production Aurora PostgreSQL (v13.8) running an ecommerce dashboard; the orders table holds 120 million rows (~50 GB) with continuous writes.
Steps Taken:
- Full-table export to S3 as Parquet (runtime: ~55 min, 3 r5.4xlarge Aurora readers).
- COPY into Redshift orders_fact (runtime: ~35 min, 4 ra3.4xlarge nodes, 8-way parallel).
- Set up DMS CDC with batch apply every 5 minutes.
- Consistency validated via record counts and rolling aggregations.
- Found that batch CDC failed to handle Aurora's order_updated_at touch updates on high-velocity tables; mitigated by switching to WAL slot-based (logical replication) capture.
- Week 1: end users reported query slowdowns; performance improved by changing DISTSTYLE from EVEN to KEY (on customer_id) due to frequent customer segmentation analyses (see the ALTER sketch below).
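A sketch of that distribution-style change, using the table and column from the scenario; Redshift allows altering the distribution style in place:

ALTER TABLE orders_fact ALTER DISTSTYLE KEY DISTKEY customer_id;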
Non-Obvious Tip
If you have columns that are updated frequently but don't matter for analytics (e.g., last_viewed_at), exclude them from the CDC process. This reduces update noise and Redshift load. Use a DMS transformation rule:
{
  "rules": [
    {
      "rule-type": "transformation",
      "rule-id": "1",
      "rule-name": "exclude-cols",
      "rule-target": "column",
      "object-locator": {
        "schema-name": "public",
        "table-name": "orders",
        "column-name": "last_viewed_at"
      },
      "rule-action": "remove-column"
    }
  ]
}
| Challenge | Practical Solution |
|---|---|
| Schema mismatch | Model the denormalized target schema up front; anticipate Redshift column limits (e.g., max 1,600 columns/table). |
| Large, slow initial load | Use parallel export plus COPY with multiple files, batched by PK range. |
| Ongoing updates | Use DMS CDC; minimize scope to critical analytics fields. |
| Validation | Implement periodic record count/checksum cross-checks. |
| Slow queries post-migration | Revisit dist/sort key strategy; run ANALYZE and VACUUM. |
Migration between Aurora and Redshift will always involve friction—balancing load windows, cost, and ongoing change. Accept imperfection; CDC will lag under heavy write churn. Prioritize a sound initial model and robust operational monitoring over “perfect” automation. There are new DMS features (e.g., per-table parallelism) worth watching, but operational reliability trumps novelty.
Ready to break analytics traffic out of your Aurora stack? Audit your schema, prepare an S3-based pipeline, and test partial loads early—it’s rarely smooth, but it beats troubleshooting 2 a.m. slow queries in production.