Efficient Data Migration: Moving S3 Data into Amazon RDS with AWS Glue
Manual ETL jobs to move S3 data into RDS are brittle, slow to adapt, and tend to create operational debt. If scalable analytics is the goal, AWS Glue offers a pragmatic alternative—reducing repetitive work and tightly integrating with AWS-native services.
Why AWS Glue for S3-to-RDS ETL?
The typical alternatives—script-based data pumps or scheduled Lambda imports—usually require babysitting and struggle with scale or evolving schemas. By contrast, AWS Glue (v3.0+ recommended as of 2024) provides:
- Serverless compute: No cluster management, and scaling is handled automatically.
- Metadata governance: Glue Data Catalog enables discoverable, versioned, and queryable schemas.
- Seamless integration: Built-in connectors for S3 as well as JDBC-compliant RDS engines.
- ETL job versioning and monitoring: Restart, resume, and debug using job bookmarks and logs.
- Cost granularity: You’re billed for DPU time; optimize jobs to control costs.
Known issue: Initial JDBC connection times to RDS from Glue jobs can be significant, especially in environments with VPC and NAT Gateways. Expect cold starts.
Prerequisites
Prepare and verify the following:
| Resource | Purpose | Notes |
|---|---|---|
| S3 Bucket | Source data (CSV, Parquet, etc.) | Uniform file schema |
| RDS | Target database (MySQL, PostgreSQL, etc.) | Reachable from Glue's VPC |
| IAM Role | S3 read, RDS connect, Glue job execution | Least privilege |
| Network | VPC configured for Glue/RDS connectivity | Subnet, security group |
You'll also need the AWS CLI (v2.x) or Console access, plus minimal experience with Glue ETL setup.
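If you want to sanity-check these prerequisites from code rather than the Console, a minimal boto3 sketch might look like the following; the bucket name, DB instance identifier, and role name are placeholders, not values defined in this walkthrough.

```python
import boto3

# Placeholders; substitute your own resource names.
BUCKET = "my-data-bucket"
DB_INSTANCE_ID = "my-rds-instance"
ROLE_NAME = "GlueS3ToRdsRole"

s3 = boto3.client("s3")
rds = boto3.client("rds")
iam = boto3.client("iam")

# Raises an error if the bucket is missing or inaccessible
s3.head_bucket(Bucket=BUCKET)

# Confirms the RDS instance exists and shows the endpoint you'll need later
instance = rds.describe_db_instances(DBInstanceIdentifier=DB_INSTANCE_ID)["DBInstances"][0]
print("RDS endpoint:", instance["Endpoint"]["Address"])

# Confirms the IAM role exists (policy contents still need a manual review)
print("Role ARN:", iam.get_role(RoleName=ROLE_NAME)["Role"]["Arn"])
```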
1. S3 Data Organization
Dumping files arbitrarily in S3 is a recipe for crawl failures. Organize by prefix (e.g., `s3://my-data-bucket/sales/2024/`). Consistent schemas across files are mandatory. If ingestion fails, check for mixed column types or missing headers; errors look like:
`AnalysisException: Found duplicate column(s) in the data schema[...]`
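Before running a crawler, it can help to list the prefix and spot stray file types; a quick boto3 sketch (bucket and prefix match the example path above, and the CSV/Parquet filter is just an assumption based on the prerequisites table):

```python
import boto3

s3 = boto3.client("s3")

# List everything under the sales/2024/ prefix and flag unexpected file types
resp = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="sales/2024/")
for obj in resp.get("Contents", []):
    key = obj["Key"]
    if not key.endswith((".csv", ".parquet")):
        print("Unexpected object in prefix:", key)
```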
2. Cataloging: Create a Glue Data Catalog Database
In AWS Glue Console:
- Navigate: Data Catalog → Databases → Add database
- Example name: `sales_glue_db`
- Store all related tables for this pipeline in a single namespace
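If you script the setup instead of clicking through the Console, the equivalent boto3 call is roughly:

```python
import boto3

glue = boto3.client("glue")

# Create the catalog database that acts as the namespace for this pipeline
glue.create_database(DatabaseInput={
    "Name": "sales_glue_db",
    "Description": "Tables for the S3 sales to RDS pipeline",
})
```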
3. Crawl S3 for Schema Discovery
Set up a Glue Crawler:
- Crawler name: `sales_s3_crawler`
- Data store: S3 (`s3://my-data-bucket/sales/2024/`)
- IAM Role: Must cover S3 and Data Catalog actions
- Target database: `sales_glue_db`
- Scheduling: On-demand, or periodic if incoming data is frequent
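For repeatable environments, the crawler can be created and started with boto3 as well; the role ARN below is a placeholder.

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that writes discovered schemas into sales_glue_db
glue.create_crawler(
    Name="sales_s3_crawler",
    Role="arn:aws:iam::123456789012:role/GlueS3ToRdsRole",  # placeholder ARN
    DatabaseName="sales_glue_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-bucket/sales/2024/"}]},
)

# Run it on demand; add a schedule later if new data arrives frequently
glue.start_crawler(Name="sales_s3_crawler")
```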
Post-crawl, inspect the generated Glue table (`sales_2024` or similar). Take note if nested columns are not inferred as expected; Parquet works best for accurate schema detection.
4. RDS Networking and Security
Don’t proceed until RDS is reachable by Glue jobs.
- Ensure RDS is in a VPC with the right subnet group
- Allow incoming TCP traffic on the database port (e.g., 5432 for PostgreSQL) from the Glue job’s security group
- Confirm the RDS endpoint, database name, and credentials
- Double-check DNS resolution—if using PrivateLink, add appropriate Route53 entries
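The ingress rule from the second bullet can also be added programmatically; a sketch with placeholder security group IDs (PostgreSQL port shown):

```python
import boto3

ec2 = boto3.client("ec2")

# Allow the Glue connection's security group to reach RDS on the database port
ec2.authorize_security_group_ingress(
    GroupId="sg-0rds00000000000000",  # placeholder: RDS security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,
        "ToPort": 5432,
        "UserIdGroupPairs": [{"GroupId": "sg-0glue0000000000000"}],  # placeholder: Glue SG
    }],
)
```

Keep in mind that the security group attached to the Glue connection itself typically also needs a self-referencing inbound rule for TCP traffic.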
Quick test from a bastion in the same VPC:
psql -h <rds-endpoint> -U glue_test -d target_db
If this fails, Glue connectivity will too.
5. Configure Glue Connection (JDBC) to RDS
- AWS Glue Console → Connections → Add connection
- Type: JDBC
- Specify:
  - Name: `conn_rds_sales`
  - JDBC URL: For MySQL, `jdbc:mysql://<rds-endpoint>:3306/<db-name>`; for PostgreSQL, `jdbc:postgresql://<rds-endpoint>:5432/<db-name>`
  - Network: Select matching VPC, subnets, security groups
  - Auth: Enter username/password, but for production use AWS Secrets Manager with reference via `${secretName}`
Test and save. A failed network test here usually means subnet or security group misconfiguration.
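The same connection can be defined with boto3; the subnet, security group, availability zone, and credentials below are placeholders, and in production the password would come from Secrets Manager as noted above.

```python
import boto3

glue = boto3.client("glue")

# JDBC connection the ETL job uses to reach RDS inside the VPC
glue.create_connection(ConnectionInput={
    "Name": "conn_rds_sales",
    "ConnectionType": "JDBC",
    "ConnectionProperties": {
        "JDBC_CONNECTION_URL": "jdbc:postgresql://<rds-endpoint>:5432/<db-name>",
        "USERNAME": "glue_etl",  # placeholder user
        "PASSWORD": "use-secrets-manager-in-production",  # placeholder
    },
    "PhysicalConnectionRequirements": {
        "SubnetId": "subnet-00000000000000000",            # placeholder
        "SecurityGroupIdList": ["sg-0glue0000000000000"],  # placeholder
        "AvailabilityZone": "us-east-1a",                  # placeholder
    },
})
```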
6. Build the ETL Job
- Job name: `s3_sales_to_rds_etl`
- Role: Grants access to S3, Glue Catalog, RDS network
- Glue version: Prefer 3.0+ for improved Spark/AWS SDK support
- Script source: Let Glue auto-generate, then modify as needed
Template ETL logic (PySpark/Glue):
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Input: Read S3 data via the Glue Data Catalog table created by the crawler
df = glueContext.create_dynamic_frame.from_catalog(
    database="sales_glue_db",
    table_name="sales_2024"
)

# Optional transformation (side note: always handle nulls explicitly)
df = df.resolveChoice(specs=[('amount', 'cast:double')])

# Output: Write to RDS through the JDBC connection configured earlier
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=df,
    catalog_connection="conn_rds_sales",
    connection_options={
        "dbtable": "sales_import",
        "database": "<db-name>"
    },
    transformation_ctx="datasink"
)

job.commit()
```
Non-obvious tip: By default, batch inserts can struggle with very wide tables or large blobs. If you encounter JDBC errors like `server closed the connection unexpectedly`, tune the JDBC driver's batch size (or chunk the input data).
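One way to get explicit control over batching is to drop from the DynamicFrame down to a Spark DataFrame and use Spark's JDBC writer, which exposes a `batchsize` option. This is a sketch, not the Glue-connection path used above: the URL and credentials have to be supplied directly (placeholders below), ideally fetched from Secrets Manager.

```python
# Alternative sink: Spark's JDBC writer with an explicit batch size.
# The URL, user, and password are placeholders, not values from this walkthrough.
(df.toDF()  # DynamicFrame -> Spark DataFrame
   .write
   .format("jdbc")
   .option("url", "jdbc:postgresql://<rds-endpoint>:5432/<db-name>")
   .option("dbtable", "sales_import")
   .option("user", "glue_etl")
   .option("password", "use-secrets-manager-in-production")
   .option("batchsize", 500)  # smaller batches for wide rows or large blobs
   .mode("append")
   .save())
```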
7. Running and Monitoring the Job
Trigger via AWS Console or CLI:
aws glue start-job-run --job-name s3_sales_to_rds_etl
Monitor from the Jobs > Runs section. For hangs or failures, fetch logs:
- CloudWatch log group: `/aws-glue/jobs/output`
- Search for serialization errors or timeouts
Verify target table population using your SQL client; if performance is poor, review Glue’s DPU usage and job parallelism settings.
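For scripted monitoring, the run can be started and polled with boto3 (job name from step 6):

```python
import time

import boto3

glue = boto3.client("glue")

# Kick off the job and poll until it reaches a terminal state
run_id = glue.start_job_run(JobName="s3_sales_to_rds_etl")["JobRunId"]
while True:
    run = glue.get_job_run(JobName="s3_sales_to_rds_etl", RunId=run_id)
    state = run["JobRun"]["JobRunState"]
    print("Job state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```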
Production Readiness — Gotchas and Pro Tips
- Partition Pruning: For big S3 datasets, configure partition indexes on your source; crawler and job both benefit.
- Job Bookmarks: Enable to incrementally load only new data; this reduces redundant reads and writes (see the sketch after this list).
- Automatic Table Creation: RDS tables must exist before Glue writes; auto-creation is not supported for all engines.
- Sensitive data: Use Secrets Manager rather than hardcoding credentials. Link Secrets Manager secret to the Glue connection.
- Error Handling: RDS constraints (e.g., unique keys) can lead to silent dropped rows or cryptic JDBC error logs.
Alternative: For real-time or very high throughput, consider AWS DMS or direct COPY into Redshift instead.
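Job bookmarks, mentioned in the list above, are controlled through a job argument; a small boto3 sketch for enabling them at run time:

```python
import boto3

glue = boto3.client("glue")

# Enable bookmarks for this run so only new S3 data is processed
glue.start_job_run(
    JobName="s3_sales_to_rds_etl",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```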
Summary
AWS Glue makes S3-to-RDS migration significantly more maintainable than legacy scripting or ad-hoc Lambda functions. Automated schema discovery, managed orchestration, and robust monitoring reduce operational friction.
Not perfect—JDBC overhead, limitations on transactionality, and cold starts all apply. But for periodic, batch ingestion into RDS, this pattern is hard to beat.
Example Job Flow (ASCII Diagram)
[S3 Bucket] --(Glue Crawler)--> [Glue Data Catalog] --(Glue ETL Job)--> [RDS Instance]
Note: For cross-region S3/RDS setups, expect higher latency and possible VPC Endpoint adjustments.
Have experience optimizing Glue for heavy S3-to-RDS loads, or ever hit a puzzling crash? Reach out or drop a note—always interested in field-level war stories or unconventional solutions.