Optimizing Data Migration from MySQL to Redshift: An Engineer’s Framework for Performance and Cost
Massive tables, sluggish dashboards, escalating RDS costs: at some point, most scaling orgs hit the wall with MySQL's OLTP roots. Redshift offers a columnar, distributed analytics engine purpose-built for complex queries across billions of rows. But migrating is not just a technical lift-and-shift; it's a rearchitecture, with trade-offs. Here's a field-tested process for moving analytics workloads from MySQL 5.7+ to Amazon Redshift (tested up to RA3 node types), balancing throughput, cost, and downtime. To be clear: this isn't one-size-fits-all, and your edge cases may differ.
1. Assessing Source Data: Inventory, Volume, Query Profile
Start with a query audit, not a schema export. Target analytic workloads first:
- Table Inventory: Filter to reporting/BI tables. Ignore transactional logs, ephemeral state, or entities feeding operational apps.
- Growth Analysis: Daily/weekly ingest rates. Check for outlier tables exceeding 500M rows—Redshift behaves differently with truly “big” tables.
- Query Scan Patterns: Gather slow logs; grep for SELECT statements scanning or joining on large tables.
- Temporal Relevance: Often, 80% of queries need only a sliding six-month window. Consider staging history in S3 or a separate Redshift schema ("archive").
Practical tip: Percona Toolkit (pt-table-usage) helps catalog which columns get filtered or joined on.
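For the inventory and growth analysis, a quick pass over information_schema gives approximate row counts and on-disk sizes per table. A minimal sketch, assuming a single schema named analytics_db (hypothetical) and keeping in mind that table_rows is only an InnoDB estimate:
SELECT table_name,
       table_rows,                                          -- approximate for InnoDB
       ROUND(data_length / POW(1024, 3), 2)  AS data_gb,    -- table data on disk
       ROUND(index_length / POW(1024, 3), 2) AS index_gb    -- index data on disk
FROM information_schema.TABLES
WHERE table_schema = 'analytics_db'
ORDER BY data_length DESC
LIMIT 20;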
2. Redesigning Schema for Analytics
Avoid 1:1 schema mapping. Instead:
- Compression Encodings: Use ZSTD for most columns; don't accept the default RAW. Run ANALYZE COMPRESSION post-load for tuning (a quick example follows the DDL below).
- Sort and Dist Keys:
- SORTKEY: Choose a timestamp or dimension that queries filter on most frequently.
- DISTSTYLE: Align on high-cardinality join keys for join-heavy workloads (DISTKEY), but beware of skew. Sometimes EVEN is less risky for sparsely joined tables.
- Schema Simplification: Flatten out lookup tables—denormalize where joins outnumber updates.
Gotcha: Overusing the ALL distribution style wrecks storage; reserve it for tables under 2GB that are referenced everywhere.
Example DDL:
CREATE TABLE orders (
order_id BIGINT ENCODE ZSTD,
customer_id INT ENCODE ZSTD,
order_date DATE ENCODE ZSTD,
total_amount DECIMAL(10,2) ENCODE ZSTD
)
DISTKEY(customer_id)
SORTKEY(order_date);
This covers most BI queries filtering on recent orders per customer.
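To verify the encodings after the first bulk load, ANALYZE COMPRESSION samples the table and reports a suggested encoding per column. A minimal sketch; the ALTER statement assumes a recent Redshift release that supports changing encodings in place:
-- Sample the loaded table and report suggested encodings per column.
ANALYZE COMPRESSION orders;

-- Apply a suggestion in place (recent Redshift releases only).
ALTER TABLE orders ALTER COLUMN total_amount ENCODE az64;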
3. Data Extraction: Batch, Parallelism, Consistency
Avoid mysqldump for multi-TB datasets; it's slow and risks lock contention.
- AWS DMS:
- Perfect for ongoing replication or large-volume initial loads. Enables CDC (Change Data Capture), keeping source and target loosely synchronized.
- Manual Batching:
- Split by temporal windows (per month, quarter).
- Use SELECT INTO OUTFILE with explicit WHERE clauses:
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
INTO OUTFILE '/tmp/orders_Q1_2024.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';
- Compress outputs (gzip) and upload directly to S3.
Note: For tables with write spikes, snapshot at the lowest business usage window to avoid partial data or FK constraint issues.
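If writes can't be fully quiesced, one option for manual batching is to run every per-window extract inside a single consistent-read transaction so all batches see the same point in time. A minimal sketch, InnoDB only, with a hypothetical output path; note that it holds the read view open for the duration of the export:
START TRANSACTION WITH CONSISTENT SNAPSHOT;  -- freeze a point-in-time read view
SELECT * FROM orders
WHERE order_date < '2024-04-01'
INTO OUTFILE '/tmp/orders_through_Q1_2024.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';
-- ...repeat for other tables/windows inside the same transaction...
COMMIT;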
4. Loading to Redshift: COPY Command and Parallelization
Redshift's COPY handles bulk ingest. Key parameters overlooked in less-experienced shops:
- Use CSV GZIP for simple schemas; prefer PARQUET for wide tables to exploit columnar benefits.
- Set MAXERROR, but monitor load errors (STL_LOAD_ERRORS). UTF-8 character issues and invisible NULLs trip up many initial migrations.
- Load data in parallel across multiple files; don't serially copy 1TB through a single node.
Example:
COPY orders
FROM 's3://analytics-bucket/orders_Q1_2024.csv.gz'
IAM_ROLE 'arn:aws:iam::<account>:role/RedshiftLoader'
CSV GZIP TIMEFORMAT 'auto'
MAXERROR 50;
Tip: Use manifest files for >1000 files to ensure consistency.
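A manifest points COPY at an explicit file list instead of a key prefix, so a partial upload can't silently load fewer files than intended. A minimal sketch; the manifest path and IAM role are placeholders:
-- The manifest is a small JSON file in S3, e.g.:
-- {"entries": [{"url": "s3://analytics-bucket/orders/part-0001.csv.gz", "mandatory": true}, ...]}
COPY orders
FROM 's3://analytics-bucket/manifests/orders_Q1_2024.manifest'
IAM_ROLE 'arn:aws:iam::<account>:role/RedshiftLoader'
CSV GZIP
MANIFEST;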
5. Post-Load: Data Integrity Checks and Optimization
After ingest:
- Row count validation (difference should be <=1% due to soft deletes, but zero is ideal).
- Query benchmarking: Run production-grade report queries.
- EXPLAIN plans: Look for DS_DIST_NONE in joins for optimal distribution. Skew can be spotted in SVL_QUERY_SUMMARY.
- Re-analyze stats: VACUUM and a full ANALYZE are required after the initial load (commands below).
Gotcha: Old Redshift versions (<1.0.38113) sometimes skip statistics collection on newly loaded large tables. Run ANALYZE manually on them.
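Concretely, the post-load pass usually boils down to a per-window count comparison plus a sort and statistics refresh. A minimal sketch against the orders table from the earlier DDL:
-- Compare against the source extract's row count for the same window.
SELECT COUNT(*) FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';

-- Reclaim deleted space, re-sort rows, and refresh planner statistics.
VACUUM FULL orders;
ANALYZE orders;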
6. Operational Tuning: Scaling and Cost Controls
Budget overruns usually hit in the first month. Avoid unnecessary spend:
| Feature | What It Does | When to Enable |
|---|---|---|
| Concurrency Scaling | Adds compute nodes for query spikes | Only for peak BI hours |
| Reserved Instances | Up to 75% savings over on-demand after 3+ months of stability | Once the baseline is predictable |
| Spectrum External Tables | Query S3 directly for archival or cold data | For rarely queried partitions |
Side note: Short-lived ad hoc queries can be routed to a dedicated WLM queue so they don't starve standard reporting jobs.
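To spot queue contention before users notice, the in-flight WLM state can be inspected directly. A minimal sketch; service_class numbers map to the queues defined in your WLM configuration:
-- Queries currently queued or executing, grouped by WLM queue.
SELECT service_class, state, COUNT(*) AS queries
FROM stv_wlm_query_state
GROUP BY service_class, state
ORDER BY service_class;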
7. Incremental Loads and Automation
Set up periodic ETL between production MySQL and Redshift. DMS CDC pipelines or tools like Airflow or dbt are preferred. Incremental logic should checksum new data; avoid full reloads.
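For the incremental path, a common Redshift pattern is to COPY each batch into a staging table and then delete-and-insert into the target inside one transaction. A minimal sketch; the S3 path and IAM role are placeholders, and order_id is assumed to be the business key:
CREATE TEMP TABLE orders_staging (LIKE orders);  -- same columns and keys as the target

COPY orders_staging
FROM 's3://analytics-bucket/incremental/orders_batch.csv.gz'
IAM_ROLE 'arn:aws:iam::<account>:role/RedshiftLoader'
CSV GZIP;

BEGIN;
DELETE FROM orders
USING orders_staging
WHERE orders.order_id = orders_staging.order_id;  -- drop rows being replaced
INSERT INTO orders SELECT * FROM orders_staging;
COMMIT;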
Example Airflow DAG fragment (pseudo):
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

load_orders = BashOperator(
    task_id='load_orders',
    bash_command="""
        aws s3 cp ... &&
        psql -c "COPY ..."
    """,
)
Monitor ETL pipeline failures via CloudWatch or custom SLAs in Airflow.
8. Non-Obvious Caveats
- MySQL ENUM data types break Redshift COPY silently; convert them to VARCHAR (see the extraction sketch after this list).
- Redshift has a 64-table-per-query join limit. Merge/flatten if necessary.
- Timestamps: always convert to UTC before loading. Avoid downstream timezone surprises in BI.
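Both the ENUM and timezone issues are easiest to handle on the MySQL side at extraction time. A minimal sketch; status and created_at are illustrative column names, and CONVERT_TZ requires the MySQL timezone tables to be loaded:
SELECT order_id,
       CAST(status AS CHAR(20)) AS status,                                  -- ENUM exported as plain text
       CONVERT_TZ(created_at, 'America/New_York', 'UTC') AS created_at_utc  -- normalize to UTC
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';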
Closing Notes
A successful MySQL to Redshift migration is neither trivial nor formulaic. Schema tuning, careful extraction routines, and systematic validation are mandatory. Done well, analytics latency drops, and reporting pipelines become future-proof, but operational diligence persists beyond day one. Most importantly: don’t over-index on “best practices” at the expense of your team’s actual workload patterns.
Further details, error logs, or pipeline orchestration issues? Raise them below; someone's tripped over it before.