AWS Glue: Automating RDS to Redshift Migration
Manually extracting data from RDS and loading it into Redshift is error-prone and, once your data size grows beyond a few gigabytes, simply doesn’t scale. Maintenance-heavy ETL scripts, ad hoc exports—these methods break down under operational pressure. Glue addresses this by removing most of the custom scripting, providing a resilient, auditable, and repeatable migration pipeline.
Below is a pragmatic blueprint based on recent production deployments. The target is MySQL 8.0 on RDS and Amazon Redshift Serverless (as of 2023-09), but the approach applies to other engines; substitute appropriately.
Pre-Flight Checklist
- RDS instance must reside in a VPC with inbound rules allowing traffic from Glue's managed (serverless) Spark environment. Outbound 5439/tcp to Redshift is required.
- Use IAM roles granting `glue:CreateJob`, `rds-db:connect`, and `redshift:CopyFromS3` as needed.
- No overlapping table names in the destination schema unless using overwrite/staging load logic.
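If you script the role setup, a minimal boto3 sketch for attaching these permissions as an inline policy might look like the following. The role and policy names are placeholders, and the action list simply mirrors the checklist above; a real job will also need S3 and Data Catalog access, so treat it as a starting point rather than a complete policy.

```python
import json
import boto3

iam = boto3.client("iam")

# Inline policy mirroring the checklist above; actions are illustrative, not exhaustive
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateJob",
                "rds-db:connect",
                "redshift:CopyFromS3",
            ],
            "Resource": "*",  # scope down to specific ARNs in production
        }
    ],
}

iam.put_role_policy(
    RoleName="GlueMigrationRole",              # placeholder role name
    PolicyName="rds-to-redshift-migration",    # placeholder policy name
    PolicyDocument=json.dumps(policy),
)
```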
1. Establishing Glue Connections
Glue relies on Connections for JDBC access. Go to AWS Glue Console → "Connections" → "Add connection".
- Engine: Choose JDBC (PostgreSQL, MySQL, etc.).
- JDBC URL pattern, e.g.: `jdbc:mysql://rds-instance.abcdefghijk.us-east-1.rds.amazonaws.com:3306/mydb`
- Credentials: Never store plaintext passwords. Integrate with AWS Secrets Manager and pass the secret ARN rather than embedding credentials in the connection config.
Network configuration requires selection of the correct VPC, subnet, and security group. Failure here yields cryptic timeouts:
`An error occurred while calling o45.getCatalogSource: Could not connect to database server`
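For repeatable setups, the same connection can be created with boto3 instead of the console. This is a minimal sketch; the connection name, secret ARN, subnet, security group, and availability zone are placeholders for your own values.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_connection(
    ConnectionInput={
        "Name": "rds-mysql-prod",          # placeholder connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://rds-instance.abcdefghijk.us-east-1.rds.amazonaws.com:3306/mydb",
            "JDBC_ENFORCE_SSL": "true",
            # Reference a Secrets Manager secret instead of embedding USERNAME/PASSWORD
            "SECRET_ID": "arn:aws:secretsmanager:us-east-1:123456789012:secret:rds-mysql-prod-abc123",
        },
        # Must match the RDS instance's VPC networking
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```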
2. Cataloging RDS Schemas with Glue Crawler
A Glue Crawler introspects RDS schemas and tables, updating the Glue Data Catalog.
Steps:
- Set source as the JDBC connection above, targeting intended schemas/tables. Limit scope; do not crawl system tables.
- Configure output database in the Data Catalog. Avoid the default; isolation supports auditing and downstream reuse.
- Run the crawler. Typical errors: `connection refused` (networking misconfigured); `permission denied` (IAM role missing `glue:GetConnection`).
- Output: cataloged tables (schema, types). Non-obvious tip: if datetime columns carry fractional seconds, ensure the Redshift schema supports equivalent precision (use `TIMESTAMP(6)` in Redshift).
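The crawler itself can also be scripted. A minimal boto3 sketch follows; the crawler name, role, catalog database, and JDBC path are placeholders to adapt to your environment.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="rds-mysql-orders-crawler",      # placeholder crawler name
    Role="GlueMigrationRole",             # role with glue:GetConnection etc.
    DatabaseName="prod_rds_catalog",      # dedicated Data Catalog database, not the default
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "rds-mysql-prod",
                # Limit scope to the application database; skip system tables
                "Path": "mydb/orders%",
            }
        ]
    },
)

glue.start_crawler(Name="rds-mysql-orders-crawler")
```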
3. Preparing Redshift
- Schema setup: create target schemas if absent.
- Table creation: Either generate via Glue job (dynamic frame's schema), or predefine tables for type/constraint control.
- Grants: grant `INSERT` and `SELECT` on the schema to the Redshift user mapped to the Glue execution role.

Note: the load stages intermediate files in S3; make sure the bucket referenced by `redshift_tmp_dir` exists and the Glue role has `s3:PutObject` rights.
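If you prefer to script the schema prep, the Redshift Data API is one option. A rough sketch, assuming a Serverless workgroup, an admin secret, and an illustrative `dw.orders_staging` layout and `glue_etl_user` (all placeholders):

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

ddl = [
    "CREATE SCHEMA IF NOT EXISTS dw;",
    # Predefining the table gives explicit control over types and precision
    """CREATE TABLE IF NOT EXISTS dw.orders_staging (
           id BIGINT,
           customer_id BIGINT,
           order_date TIMESTAMP,
           amount DECIMAL(12,2)
       );""",
    # glue_etl_user is the Redshift user mapped to the Glue execution role
    "GRANT INSERT, SELECT ON ALL TABLES IN SCHEMA dw TO glue_etl_user;",
]

for stmt in ddl:
    rsd.execute_statement(
        WorkgroupName="dw-serverless",   # or ClusterIdentifier=... for provisioned clusters
        Database="dw",
        SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-admin-abc123",
        Sql=stmt,
    )
```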
4. Building a Glue ETL Job
Within Glue Console → "Jobs" → "Add job".
- Language: Spark (Python 3). Jobs tested with Glue version 4.0.
- Source: point to Data Catalog entries (from Step 2).
- Transform (optional): cleaning, type conversions, SCD if relevant.
Actual job code (extract–transform–load pattern):
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # needed for job bookmarks (see caveats below)

# Extraction: read the cataloged RDS table
src = glueContext.create_dynamic_frame.from_catalog(
    database="prod_rds_catalog",
    table_name="orders",
    transformation_ctx="src"
)

# Simple transformation (e.g., drop rows with null ids)
cleaned = Filter.apply(
    frame=src,
    f=lambda rec: rec["id"] is not None
)

# Load: Redshift via JDBC, staged through S3
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=cleaned,
    catalog_connection="redshift-jdbc-prod",
    connection_options={
        "dbtable": "dw.orders_staging",
        "database": "dw",
        "preactions": "TRUNCATE TABLE dw.orders_staging"  # Not transactional; beware data gaps
    },
    redshift_tmp_dir="s3://etl-temp-bucket/glue-redshift/",
    transformation_ctx="rs_sink"
)

job.commit()  # advances job bookmarks
```
A few caveats:
- `preactions` is a blunt tool; ensure downstream readers are insulated during reloads.
- For incremental loads, filter source data on an updated timestamp or synthetic column (watermark pattern); see the sketch after this list.
- Job retries: enable up to 3 and set job bookmarks appropriately to avoid reprocessing.
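A minimal sketch of the watermark filter mentioned above, assuming the `src` DynamicFrame from the job code plus a hypothetical `updated_at` column and `--watermark` job parameter (ISO-8601 strings, so lexical comparison matches chronological order):

```python
import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions

# '--watermark' is a hypothetical job parameter carrying the max updated_at
# value loaded by the previous run (e.g. "2023-09-01T00:00:00").
args = getResolvedOptions(sys.argv, ["JOB_NAME", "watermark"])

# Keep only rows newer than the watermark
incremental = Filter.apply(
    frame=src,  # the DynamicFrame read from the catalog in the job above
    f=lambda rec: rec["updated_at"] is not None
                  and str(rec["updated_at"]) > args["watermark"],
)
```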
Monitoring and Scheduling
- Scheduling: Configure via Glue Triggers or AWS EventBridge. Cron expressions enable fine control, e.g. `cron(0 3 * * ? *)` for nightly jobs; a trigger sketch follows below.
- Monitoring: CloudWatch integration provides logs/alerts. Watch for memory errors: if you see `ExecutorLostFailure (executor 18 exited caused by one of the running tasks)`, increase job DPUs or filter data further upstream.
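A scheduled Glue trigger wired to the cron expression above might be created like this; the trigger and job names are placeholders:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_trigger(
    Name="nightly-rds-to-redshift",                    # placeholder trigger name
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",                      # 03:00 UTC nightly, as above
    Actions=[{"JobName": "rds-to-redshift-orders"}],   # placeholder job name
    StartOnCreation=True,
)
```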
Table: Common Design Pitfalls
| Issue | Symptom | Mitigation |
|---|---|---|
| VPC misconfiguration | Timeouts/crawler failures | Validate security group, NACL rules |
| Data type mismatches | Load errors in Glue/Redshift | Explicit mapping in ApplyMapping |
| Password in script | Security risk | Use Secrets Manager/ENV vars |
| S3 temp bucket not versioned | Data loss risk | Enable bucket versioning |
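For the type-mismatch row above, an explicit mapping can be declared in the job. A sketch with illustrative column names and types, applied to the `cleaned` frame from Step 4:

```python
from awsglue.transforms import ApplyMapping

# Explicit source -> target mappings; columns and types here are illustrative
mapped = ApplyMapping.apply(
    frame=cleaned,
    mappings=[
        ("id", "int", "id", "long"),
        ("order_date", "string", "order_date", "timestamp"),
        ("amount", "string", "amount", "double"),
    ],
    transformation_ctx="mapped",
)
```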
Non-Obvious Tips
- For very large tables, use Glue's read partitioning (by date or key, e.g. splitting on `order_date`) to parallelize extraction and loading. This can slash migration times by orders of magnitude; see the sketch after this list.
- Staging via S3 (`redshift_tmp_dir`) introduces latency and potential object churn. Periodically lifecycle old files to control costs.
- Glue job bookmarks (driven by `transformation_ctx` on DynamicFrames) allow seamless incremental loads if source rows have monotonically increasing IDs or timestamps.
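The partitioned extraction from the first tip can be expressed through Glue's JDBC read-splitting options. A sketch assuming the Step 4 job context and `order_date` as the split column (both illustrative):

```python
# Parallelize the JDBC read by splitting on a column
src_parallel = glueContext.create_dynamic_frame.from_catalog(
    database="prod_rds_catalog",
    table_name="orders",
    additional_options={
        "hashfield": "order_date",   # column used to split the read
        "hashpartitions": "10",      # number of parallel JDBC reads
    },
    transformation_ctx="src_parallel",
)
```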
Summary
AWS Glue enables scalable, fault-tolerant migration from RDS to Redshift, outclassing script-based ETL approaches in both reliability and auditability. Most real-world issues involve IAM/connection permissions or type mismatches; always validate with test loads before production scheduling. Don’t ignore monitoring—production loads can fail silently if misconfigured.
Alternatives like AWS DMS may be preferable for change data capture, but for periodic analytical snapshots, Glue shines.
Known Issue (as of Glue 4.0):
Cross-region RDS-to-Redshift writes incur increased S3 temp costs and transient 5xx errors—prefer region-local processing if at all possible.
For production, keep Glue jobs idempotent and test against a dev Redshift schema before turning on large-scale schedules.