AWS Glue: Automating RDS to Redshift Migration
Manually extracting data from RDS and loading it into Redshift is error-prone and, once your data size grows beyond a few gigabytes, simply doesn’t scale. Maintenance-heavy ETL scripts, ad hoc exports—these methods break down under operational pressure. Glue addresses this by removing most of the custom scripting, providing a resilient, auditable, and repeatable migration pipeline.
Below is a pragmatic blueprint based on recent production deployments. The target is MySQL 8.0 on RDS and Amazon Redshift Serverless (as of 2023-09), but the approach applies to other engines; substitute appropriately.
Pre-Flight Checklist
- RDS instance must reside in a VPC with inbound rules allowing traffic from Glue's managed (serverless) Spark environment. Outbound 5439/tcp to Redshift is required.
- Use IAM roles granting `glue:CreateJob`, `rds-db:connect`, and `redshift:CopyFromS3` as needed.
- No overlapping table names in the destination schema unless using overwrite/staging load logic.
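If you script the role setup, a minimal boto3 sketch for attaching these permissions as an inline policy might look like the following. The role and policy names are placeholders, and the action list simply mirrors the checklist above; a real job will also need S3 and Data Catalog access, so treat it as a starting point rather than a complete policy.

```python
import json
import boto3

iam = boto3.client("iam")

# Inline policy mirroring the checklist above; actions are illustrative, not exhaustive
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateJob",
                "rds-db:connect",
                "redshift:CopyFromS3",
            ],
            "Resource": "*",  # scope down to specific ARNs in production
        }
    ],
}

iam.put_role_policy(
    RoleName="GlueMigrationRole",              # placeholder role name
    PolicyName="rds-to-redshift-migration",    # placeholder policy name
    PolicyDocument=json.dumps(policy),
)
```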
1. Establishing Glue Connections
Glue relies on Connections for JDBC access. Go to AWS Glue Console → "Connections" → "Add connection".
- Engine: Choose JDBC (PostgreSQL, MySQL, etc.).
- JDBC URL pattern, e.g.: `jdbc:mysql://rds-instance.abcdefghijk.us-east-1.rds.amazonaws.com:3306/mydb`
- Credentials: Never store plaintext passwords. Integrate with AWS Secrets Manager and pass the secret ARN rather than embedding credentials in the connection config.
Network configuration requires selection of the correct VPC, subnet, and security group. Failure here yields cryptic timeouts:
`An error occurred while calling o45.getCatalogSource: Could not connect to database server`
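For repeatable setups, the same connection can be created with boto3 instead of the console. This is a minimal sketch; the connection name, secret ARN, subnet, security group, and availability zone are placeholders for your own values.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_connection(
    ConnectionInput={
        "Name": "rds-mysql-prod",          # placeholder connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://rds-instance.abcdefghijk.us-east-1.rds.amazonaws.com:3306/mydb",
            "JDBC_ENFORCE_SSL": "true",
            # Reference a Secrets Manager secret instead of embedding USERNAME/PASSWORD
            "SECRET_ID": "arn:aws:secretsmanager:us-east-1:123456789012:secret:rds-mysql-prod-abc123",
        },
        # Must match the RDS instance's VPC networking
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```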
2. Cataloging RDS Schemas with Glue Crawler
A Glue Crawler introspects RDS schemas and tables, updating the Glue Data Catalog.
Steps:
- Set source as the JDBC connection above, targeting intended schemas/tables. Limit scope; do not crawl system tables.
- Configure output database in the Data Catalog. Avoid the default; isolation supports auditing and downstream reuse.
- Run the crawler. Typical errors: `connection refused` (networking misconfigured); `permission denied` (IAM role missing `glue:GetConnection`).
- Output: cataloged tables (schema, types). Non-obvious tip: if datetime columns carry fractional seconds, ensure the Redshift schema supports equivalent precision (use `TIMESTAMP(6)` in Redshift).
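The crawler itself can also be scripted. A minimal boto3 sketch follows; the crawler name, role, catalog database, and JDBC path are placeholders to adapt to your environment.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="rds-mysql-orders-crawler",      # placeholder crawler name
    Role="GlueMigrationRole",             # role with glue:GetConnection etc.
    DatabaseName="prod_rds_catalog",      # dedicated Data Catalog database, not the default
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "rds-mysql-prod",
                # Limit scope to the application database; skip system tables
                "Path": "mydb/orders%",
            }
        ]
    },
)

glue.start_crawler(Name="rds-mysql-orders-crawler")
```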
3. Preparing Redshift
- Schema setup: create target schemas if absent.
- Table creation: Either generate via Glue job (dynamic frame's schema), or predefine tables for type/constraint control.
- Grants: grant `INSERT` and `SELECT` on the schema to the Redshift user mapped to the Glue execution role.

Note: the load stages intermediate files in S3; make sure the bucket referenced by `redshift_tmp_dir` exists and the Glue role has `s3:PutObject` rights.
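If you prefer to script the schema prep, the Redshift Data API is one option. A rough sketch, assuming a Serverless workgroup, an admin secret, and an illustrative `dw.orders_staging` layout and `glue_etl_user` (all placeholders):

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

ddl = [
    "CREATE SCHEMA IF NOT EXISTS dw;",
    # Predefining the table gives explicit control over types and precision
    """CREATE TABLE IF NOT EXISTS dw.orders_staging (
           id BIGINT,
           customer_id BIGINT,
           order_date TIMESTAMP,
           amount DECIMAL(12,2)
       );""",
    # glue_etl_user is the Redshift user mapped to the Glue execution role
    "GRANT INSERT, SELECT ON ALL TABLES IN SCHEMA dw TO glue_etl_user;",
]

for stmt in ddl:
    rsd.execute_statement(
        WorkgroupName="dw-serverless",   # or ClusterIdentifier=... for provisioned clusters
        Database="dw",
        SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-admin-abc123",
        Sql=stmt,
    )
```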
4. Building a Glue ETL Job
Within Glue Console → "Jobs" → "Add job".
- Language: Spark (Python 3). Jobs tested with Glue version 4.0.
- Source: point to Data Catalog entries (from Step 2).
- Transform (optional): cleaning, type conversions, SCD if relevant.
Actual job code (extract–transform–load pattern):
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # needed for job bookmarks (see caveats below)

# Extraction: read the cataloged RDS table
src = glueContext.create_dynamic_frame.from_catalog(
    database="prod_rds_catalog",
    table_name="orders",
    transformation_ctx="src"
)

# Simple transformation (e.g., drop rows with null ids)
cleaned = Filter.apply(
    frame=src,
    f=lambda rec: rec["id"] is not None
)

# Load: Redshift via JDBC, staged through S3
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=cleaned,
    catalog_connection="redshift-jdbc-prod",
    connection_options={
        "dbtable": "dw.orders_staging",
        "database": "dw",
        "preactions": "TRUNCATE TABLE dw.orders_staging"  # Not transactional; beware data gaps
    },
    redshift_tmp_dir="s3://etl-temp-bucket/glue-redshift/",
    transformation_ctx="rs_sink"
)

job.commit()  # advances job bookmarks
```
A few caveats:
- `preactions` is a blunt tool; ensure downstream readers are insulated during reloads.
- For incremental loads, filter source data on an updated timestamp or synthetic column (watermark pattern); see the sketch after this list.
- Job retries: enable up to 3 and set job bookmarks appropriately to avoid reprocessing.
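A minimal sketch of the watermark filter mentioned above, assuming the `src` DynamicFrame from the job code plus a hypothetical `updated_at` column and `--watermark` job parameter (ISO-8601 strings, so lexical comparison matches chronological order):

```python
import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions

# '--watermark' is a hypothetical job parameter carrying the max updated_at
# value loaded by the previous run (e.g. "2023-09-01T00:00:00").
args = getResolvedOptions(sys.argv, ["JOB_NAME", "watermark"])

# Keep only rows newer than the watermark
incremental = Filter.apply(
    frame=src,  # the DynamicFrame read from the catalog in the job above
    f=lambda rec: rec["updated_at"] is not None
                  and str(rec["updated_at"]) > args["watermark"],
)
```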
Monitoring and Scheduling
- Scheduling: Configure via Glue Triggers or AWS EventBridge. Cron expressions enable fine control, e.g. `cron(0 3 * * ? *)` for nightly jobs; a trigger sketch follows below.
- Monitoring: CloudWatch integration provides logs/alerts. Watch for memory errors: if you see `ExecutorLostFailure (executor 18 exited caused by one of the running tasks)`, increase job DPUs or filter data further upstream.
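A scheduled Glue trigger wired to the cron expression above might be created like this; the trigger and job names are placeholders:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_trigger(
    Name="nightly-rds-to-redshift",                    # placeholder trigger name
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",                      # 03:00 UTC nightly, as above
    Actions=[{"JobName": "rds-to-redshift-orders"}],   # placeholder job name
    StartOnCreation=True,
)
```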
Table: Common Design Pitfalls
| Issue | Symptom | Mitigation |
|---|---|---|
| VPC misconfiguration | Timeouts/crawler failures | Validate security group, NACL rules |
| Data type mismatches | Load errors in Glue/Redshift | Explicit mapping in ApplyMapping |
| Password in script | Security risk | Use Secrets Manager/ENV vars |
| S3 temp bucket not versioned | Data loss risk | Enable bucket versioning |
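For the type-mismatch row above, an explicit mapping can be declared in the job. A sketch with illustrative column names and types, applied to the `cleaned` frame from Step 4:

```python
from awsglue.transforms import ApplyMapping

# Explicit source -> target mappings; columns and types here are illustrative
mapped = ApplyMapping.apply(
    frame=cleaned,
    mappings=[
        ("id", "int", "id", "long"),
        ("order_date", "string", "order_date", "timestamp"),
        ("amount", "string", "amount", "double"),
    ],
    transformation_ctx="mapped",
)
```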
Non-Obvious Tips
- For very large tables, use Glue's read partitioning (by date or key, e.g. splitting on `order_date`) to parallelize extraction and loading. This can slash migration times by orders of magnitude; see the sketch after this list.
- Staging via S3 (`redshift_tmp_dir`) introduces latency and potential object churn. Periodically lifecycle old files to control costs.
- Glue job bookmarks (driven by `transformation_ctx` on DynamicFrames) allow seamless incremental loads if source rows have monotonically increasing IDs or timestamps.
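The partitioned extraction from the first tip can be expressed through Glue's JDBC read-splitting options. A sketch assuming the Step 4 job context and `order_date` as the split column (both illustrative):

```python
# Parallelize the JDBC read by splitting on a column
src_parallel = glueContext.create_dynamic_frame.from_catalog(
    database="prod_rds_catalog",
    table_name="orders",
    additional_options={
        "hashfield": "order_date",   # column used to split the read
        "hashpartitions": "10",      # number of parallel JDBC reads
    },
    transformation_ctx="src_parallel",
)
```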
Summary
AWS Glue enables scalable, fault-tolerant migration from RDS to Redshift, outclassing script-based ETL approaches in both reliability and auditability. Most real-world issues involve IAM/connection permissions or type mismatches; always validate with test loads before production scheduling. Don’t ignore monitoring—production loads can fail silently if misconfigured.
Alternatives like AWS DMS may be preferable for change data capture, but for periodic analytical snapshots, Glue shines.
Known Issue (as of Glue 4.0):
Cross-region RDS-to-Redshift writes incur increased S3 temp costs and transient 5xx errors—prefer region-local processing if at all possible.
For production, keep Glue jobs idempotent and test against a dev Redshift schema before turning on large-scale schedules.