Optimizing Data Migration from MySQL to Redshift: An Engineer’s Framework for Performance and Cost
Massive tables, sluggish dashboards, escalating RDS costs: at some point, most scaling orgs hit the wall with MySQL's OLTP roots. Redshift offers a columnar, distributed analytics engine purpose-built for complex queries across billions of rows. But migrating is not just a technical lift-and-shift; it's a rearchitecture, with trade-offs. Here's a field-tested process for moving analytics workloads from MySQL 5.7+ to Amazon Redshift (tested up to RA3 node types), balancing throughput, cost, and downtime. To be clear: this isn't one-size-fits-all, and your edge cases may differ.
1. Assessing Source Data: Inventory, Volume, Query Profile
Start with a query audit, not a schema export. Target analytic workloads first:
- Table Inventory: Filter to reporting/BI tables. Ignore transactional logs, ephemeral state, or entities feeding operational apps.
- Growth Analysis: Daily/weekly ingest rates. Check for outlier tables exceeding 500M rows—Redshift behaves differently with truly “big” tables.
- Query Scan Patterns: Gather slow logs; grep for SELECT statements scanning or joining on large tables.
- Temporal Relevance: Often, 80% of queries need only a sliding six-month window. Consider staging history in S3 or a separate Redshift schema ("archive").
Practical tip: Percona Toolkit (pt-table-usage) helps catalog which columns get filtered or joined on.
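For the inventory and growth analysis, a quick pass over information_schema gives approximate row counts and on-disk sizes per table. A minimal sketch, assuming a single schema named analytics_db (hypothetical) and keeping in mind that table_rows is only an InnoDB estimate:
SELECT table_name,
       table_rows,                                          -- approximate for InnoDB
       ROUND(data_length / POW(1024, 3), 2)  AS data_gb,    -- table data on disk
       ROUND(index_length / POW(1024, 3), 2) AS index_gb    -- index data on disk
FROM information_schema.TABLES
WHERE table_schema = 'analytics_db'
ORDER BY data_length DESC
LIMIT 20;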
2. Redesigning Schema for Analytics
Avoid 1:1 schema mapping. Instead:
- Compression Encodings: Use ZSTD for most columns; don't accept the default RAW. Run ANALYZE COMPRESSION post-load for tuning (a quick example follows the DDL below).
- Sort and Dist Keys:
- SORTKEY: Choose a timestamp or dimension that queries filter on most frequently.
- DISTSTYLE: Align on high-cardinality join keys for join-heavy workloads (DISTKEY), but beware of skew. Sometimes EVEN is less risky for sparsely joined tables.
- Schema Simplification: Flatten out lookup tables—denormalize where joins outnumber updates.
Gotcha: Overusing the ALL distribution style wrecks storage; reserve it for tables under 2GB that are referenced everywhere.
Example DDL:
CREATE TABLE orders (
order_id BIGINT ENCODE ZSTD,
customer_id INT ENCODE ZSTD,
order_date DATE ENCODE ZSTD,
total_amount DECIMAL(10,2) ENCODE ZSTD
)
DISTKEY(customer_id)
SORTKEY(order_date);
This covers most BI queries filtering on recent orders per customer.
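To verify the encodings after the first bulk load, ANALYZE COMPRESSION samples the table and reports a suggested encoding per column. A minimal sketch; the ALTER statement assumes a recent Redshift release that supports changing encodings in place:
-- Sample the loaded table and report suggested encodings per column.
ANALYZE COMPRESSION orders;

-- Apply a suggestion in place (recent Redshift releases only).
ALTER TABLE orders ALTER COLUMN total_amount ENCODE az64;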
3. Data Extraction: Batch, Parallelism, Consistency
Avoid mysqldump for multi-TB datasets; it's slow and risks lock contention.
- AWS DMS:
- Perfect for ongoing replication or large-volume initial loads. Enables CDC (Change Data Capture), keeping source and target loosely synchronized.
- Manual Batching:
- Split by temporal windows (per month, quarter).
- Use SELECT INTO OUTFILE with explicit WHERE clauses:
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
INTO OUTFILE '/tmp/orders_Q1_2024.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';
- Compress outputs (gzip) and upload directly to S3.
Note: For tables with write spikes, snapshot at the lowest business usage window to avoid partial data or FK constraint issues.
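If writes can't be fully quiesced, one option for manual batching is to run every per-window extract inside a single consistent-read transaction so all batches see the same point in time. A minimal sketch, InnoDB only, with a hypothetical output path; note that it holds the read view open for the duration of the export:
START TRANSACTION WITH CONSISTENT SNAPSHOT;  -- freeze a point-in-time read view
SELECT * FROM orders
WHERE order_date < '2024-04-01'
INTO OUTFILE '/tmp/orders_through_Q1_2024.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';
-- ...repeat for other tables/windows inside the same transaction...
COMMIT;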
4. Loading to Redshift: COPY Command and Parallelization
Redshift's COPY handles bulk ingest. Key parameters overlooked in less-experienced shops:
- Use CSV GZIP for simple schemas; prefer PARQUET for wide tables to exploit columnar benefits.
- Set MAXERROR, but monitor load errors (STL_LOAD_ERRORS). UTF-8 character issues and invisible NULLs trip up many initial migrations.
- Load data in parallel across multiple files; don't serially copy 1TB through a single node.
Example:
COPY orders
FROM 's3://analytics-bucket/orders_Q1_2024.csv.gz'
IAM_ROLE 'arn:aws:iam::<account>:role/RedshiftLoader'
CSV GZIP TIMEFORMAT 'auto'
MAXERROR 50;
Tip: Use manifest files for >1000 files to ensure consistency.
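A manifest points COPY at an explicit file list instead of a key prefix, so a partial upload can't silently load fewer files than intended. A minimal sketch; the manifest path and IAM role are placeholders:
-- The manifest is a small JSON file in S3, e.g.:
-- {"entries": [{"url": "s3://analytics-bucket/orders/part-0001.csv.gz", "mandatory": true}, ...]}
COPY orders
FROM 's3://analytics-bucket/manifests/orders_Q1_2024.manifest'
IAM_ROLE 'arn:aws:iam::<account>:role/RedshiftLoader'
CSV GZIP
MANIFEST;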
5. Post-Load: Data Integrity Checks and Optimization
After ingest:
- Row count validation (difference should be <=1% due to soft deletes, but zero is ideal).
- Query benchmarking: Run production-grade report queries.
- EXPLAIN plans: Look for DS_DIST_NONE in joins for optimal distribution. Skew can be spotted in SVL_QUERY_SUMMARY.
- Re-analyze stats: VACUUM and a full ANALYZE are required after the initial load (commands below).
Gotcha: Old Redshift versions (<1.0.38113) sometimes skip statistics collection on newly loaded large tables. Run ANALYZE manually on them.
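Concretely, the post-load pass usually boils down to a per-window count comparison plus a sort and statistics refresh. A minimal sketch against the orders table from the earlier DDL:
-- Compare against the source extract's row count for the same window.
SELECT COUNT(*) FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';

-- Reclaim deleted space, re-sort rows, and refresh planner statistics.
VACUUM FULL orders;
ANALYZE orders;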
6. Operational Tuning: Scaling and Cost Controls
Budget overruns usually hit in the first month. Avoid unnecessary spend:
| Feature | What It Does | When to Enable |
|---|---|---|
| Concurrency Scaling | Adds compute nodes for query spikes | Only for peak BI hours |
| Reserved Instances | Up to 75% savings over on-demand after 3+ months of stability | Once the baseline is predictable |
| Spectrum External Tables | Query S3 directly for archival or cold data | For rarely queried partitions |
Side note: Short-lived ad hoc queries can be routed to a dedicated WLM queue so they don't starve standard reporting jobs.
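To spot queue contention before users notice, the in-flight WLM state can be inspected directly. A minimal sketch; service_class numbers map to the queues defined in your WLM configuration:
-- Queries currently queued or executing, grouped by WLM queue.
SELECT service_class, state, COUNT(*) AS queries
FROM stv_wlm_query_state
GROUP BY service_class, state
ORDER BY service_class;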
7. Incremental Loads and Automation
Set up periodic ETL between production MySQL and Redshift. DMS CDC pipelines or tools like Airflow or dbt are preferred. Incremental logic should checksum new data; avoid full reloads.
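For the incremental path, a common Redshift pattern is to COPY each batch into a staging table and then delete-and-insert into the target inside one transaction. A minimal sketch; the S3 path and IAM role are placeholders, and order_id is assumed to be the business key:
CREATE TEMP TABLE orders_staging (LIKE orders);  -- same columns and keys as the target

COPY orders_staging
FROM 's3://analytics-bucket/incremental/orders_batch.csv.gz'
IAM_ROLE 'arn:aws:iam::<account>:role/RedshiftLoader'
CSV GZIP;

BEGIN;
DELETE FROM orders
USING orders_staging
WHERE orders.order_id = orders_staging.order_id;  -- drop rows being replaced
INSERT INTO orders SELECT * FROM orders_staging;
COMMIT;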
Example Airflow DAG fragment (pseudo):
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

load_orders = BashOperator(
    task_id='load_orders',
    bash_command="""
        aws s3 cp ... &&
        psql -c "COPY ..."
    """,
)
Monitor ETL pipeline failures via CloudWatch or custom SLAs in Airflow.
8. Non-Obvious Caveats
- MySQL ENUM data types break Redshift COPY silently; convert them to VARCHAR (see the extraction sketch after this list).
- Redshift has a 64-table-per-query join limit. Merge/flatten if necessary.
- Timestamps: always convert to UTC before loading. Avoid downstream timezone surprises in BI.
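Both the ENUM and timezone issues are easiest to handle on the MySQL side at extraction time. A minimal sketch; status and created_at are illustrative column names, and CONVERT_TZ requires the MySQL timezone tables to be loaded:
SELECT order_id,
       CAST(status AS CHAR(20)) AS status,                                  -- ENUM exported as plain text
       CONVERT_TZ(created_at, 'America/New_York', 'UTC') AS created_at_utc  -- normalize to UTC
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';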
Closing Notes
A successful MySQL to Redshift migration is neither trivial nor formulaic. Schema tuning, careful extraction routines, and systematic validation are mandatory. Done well, analytics latency drops, and reporting pipelines become future-proof, but operational diligence persists beyond day one. Most importantly: don’t over-index on “best practices” at the expense of your team’s actual workload patterns.
Further details, error logs, or pipeline orchestration issues? Raise them below; someone's tripped over it before.