Efficient Data Migration: Moving S3 Data into Amazon RDS with AWS Glue
Manual ETL jobs to move S3 data into RDS are brittle, slow to adapt, and tend to create operational debt. If scalable analytics is the goal, AWS Glue offers a pragmatic alternative—reducing repetitive work and tightly integrating with AWS-native services.
Why AWS Glue for S3-to-RDS ETL?
The typical alternatives—script-based data pumps or scheduled Lambda imports—usually require babysitting and struggle with scale or evolving schemas. By contrast, AWS Glue (v3.0+ recommended as of 2024) provides:
- Serverless compute: No cluster management, and scaling is handled automatically.
- Metadata governance: Glue Data Catalog enables discoverable, versioned, and queryable schemas.
- Seamless integration: Built-in connectors for S3 as well as JDBC-compliant RDS engines.
- ETL job versioning and monitoring: Restart, resume, and debug using job bookmarks and logs.
- Cost granularity: You’re billed for DPU time; optimize jobs to control costs.
Known issue: Initial JDBC connection times to RDS from Glue jobs can be significant, especially in environments with VPC and NAT Gateways. Expect cold starts.
Prerequisites
Prepare and verify the following:
| Resource | Purpose | Notes |
|---|---|---|
| S3 Bucket | Source data (CSV, Parquet, etc.) | Uniform file schema |
| RDS | Target database (MySQL, PostgreSQL, etc.) | Reachable from Glue's VPC |
| IAM Role | S3 read, RDS connect, Glue job execution | Least privilege |
| Network | VPC configured for Glue/RDS connectivity | Subnet, security group |
You'll also need the AWS CLI (v2.x) or Console access, plus minimal experience with Glue ETL setup.
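If you want to sanity-check these prerequisites from code rather than the Console, a minimal boto3 sketch might look like the following; the bucket name, DB instance identifier, and role name are placeholders, not values defined in this walkthrough.

```python
import boto3

# Placeholders; substitute your own resource names.
BUCKET = "my-data-bucket"
DB_INSTANCE_ID = "my-rds-instance"
ROLE_NAME = "GlueS3ToRdsRole"

s3 = boto3.client("s3")
rds = boto3.client("rds")
iam = boto3.client("iam")

# Raises an error if the bucket is missing or inaccessible
s3.head_bucket(Bucket=BUCKET)

# Confirms the RDS instance exists and shows the endpoint you'll need later
instance = rds.describe_db_instances(DBInstanceIdentifier=DB_INSTANCE_ID)["DBInstances"][0]
print("RDS endpoint:", instance["Endpoint"]["Address"])

# Confirms the IAM role exists (policy contents still need a manual review)
print("Role ARN:", iam.get_role(RoleName=ROLE_NAME)["Role"]["Arn"])
```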
1. S3 Data Organization
Dumping files arbitrarily in S3 is a recipe for crawl failures. Organize by prefix (e.g., `s3://my-data-bucket/sales/2024/`). Consistent schemas across files are mandatory. If ingestion fails, check for mixed column types or missing headers; errors look like:
`AnalysisException: Found duplicate column(s) in the data schema[...]`
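Before running a crawler, it can help to list the prefix and spot stray file types; a quick boto3 sketch (bucket and prefix match the example path above, and the CSV/Parquet filter is just an assumption based on the prerequisites table):

```python
import boto3

s3 = boto3.client("s3")

# List everything under the sales/2024/ prefix and flag unexpected file types
resp = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="sales/2024/")
for obj in resp.get("Contents", []):
    key = obj["Key"]
    if not key.endswith((".csv", ".parquet")):
        print("Unexpected object in prefix:", key)
```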
2. Cataloging: Create a Glue Data Catalog Database
In AWS Glue Console:
- Navigate: Data Catalog → Databases → Add database
- Example name: `sales_glue_db`
- Store all related tables for this pipeline in a single namespace
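If you script the setup instead of clicking through the Console, the equivalent boto3 call is roughly:

```python
import boto3

glue = boto3.client("glue")

# Create the catalog database that acts as the namespace for this pipeline
glue.create_database(DatabaseInput={
    "Name": "sales_glue_db",
    "Description": "Tables for the S3 sales to RDS pipeline",
})
```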
3. Crawl S3 for Schema Discovery
Set up a Glue Crawler:
- Crawler name: `sales_s3_crawler`
- Data store: S3 (`s3://my-data-bucket/sales/2024/`)
- IAM Role: Must cover S3 and Data Catalog actions
- Target database: `sales_glue_db`
- Scheduling: On-demand, or periodic if incoming data is frequent
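For repeatable environments, the crawler can be created and started with boto3 as well; the role ARN below is a placeholder.

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that writes discovered schemas into sales_glue_db
glue.create_crawler(
    Name="sales_s3_crawler",
    Role="arn:aws:iam::123456789012:role/GlueS3ToRdsRole",  # placeholder ARN
    DatabaseName="sales_glue_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-bucket/sales/2024/"}]},
)

# Run it on demand; add a schedule later if new data arrives frequently
glue.start_crawler(Name="sales_s3_crawler")
```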
Post-crawl, inspect the generated Glue table (`sales_2024` or similar). Take note if nested columns are not inferred as expected; Parquet works best for accurate schema detection.
4. RDS Networking and Security
Don’t proceed until RDS is reachable by Glue jobs.
- Ensure RDS is in a VPC with the right subnet group
- Allow incoming TCP traffic on the database port (e.g., 5432 for PostgreSQL) from the Glue job’s security group
- Confirm the RDS endpoint, database name, and credentials
- Double-check DNS resolution—if using PrivateLink, add appropriate Route53 entries
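The ingress rule from the second bullet can also be added programmatically; a sketch with placeholder security group IDs (PostgreSQL port shown):

```python
import boto3

ec2 = boto3.client("ec2")

# Allow the Glue connection's security group to reach RDS on the database port
ec2.authorize_security_group_ingress(
    GroupId="sg-0rds00000000000000",  # placeholder: RDS security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,
        "ToPort": 5432,
        "UserIdGroupPairs": [{"GroupId": "sg-0glue0000000000000"}],  # placeholder: Glue SG
    }],
)
```

Keep in mind that the security group attached to the Glue connection itself typically also needs a self-referencing inbound rule for TCP traffic.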
Quick test from a bastion in the same VPC:
psql -h <rds-endpoint> -U glue_test -d target_db
If this fails, Glue connectivity will too.
5. Configure Glue Connection (JDBC) to RDS
- AWS Glue Console → Connections → Add connection
- Type: JDBC
- Specify:
  - Name: `conn_rds_sales`
  - JDBC URL: For MySQL, `jdbc:mysql://<rds-endpoint>:3306/<db-name>`; for PostgreSQL, `jdbc:postgresql://<rds-endpoint>:5432/<db-name>`
  - Network: Select matching VPC, subnets, security groups
  - Auth: Enter username/password, but for production use AWS Secrets Manager with reference via `${secretName}`
Test and save. A failed network test here usually means subnet or security group misconfiguration.
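The same connection can be defined with boto3; the subnet, security group, availability zone, and credentials below are placeholders, and in production the password would come from Secrets Manager as noted above.

```python
import boto3

glue = boto3.client("glue")

# JDBC connection the ETL job uses to reach RDS inside the VPC
glue.create_connection(ConnectionInput={
    "Name": "conn_rds_sales",
    "ConnectionType": "JDBC",
    "ConnectionProperties": {
        "JDBC_CONNECTION_URL": "jdbc:postgresql://<rds-endpoint>:5432/<db-name>",
        "USERNAME": "glue_etl",  # placeholder user
        "PASSWORD": "use-secrets-manager-in-production",  # placeholder
    },
    "PhysicalConnectionRequirements": {
        "SubnetId": "subnet-00000000000000000",            # placeholder
        "SecurityGroupIdList": ["sg-0glue0000000000000"],  # placeholder
        "AvailabilityZone": "us-east-1a",                  # placeholder
    },
})
```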
6. Build the ETL Job
- Job name: `s3_sales_to_rds_etl`
- Role: Grants access to S3, Glue Catalog, RDS network
- Glue version: Prefer 3.0+ for improved Spark/AWS SDK support
- Script source: Let Glue auto-generate, then modify as needed
Template ETL logic (PySpark/Glue):
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Input: Read S3 data via the Glue Data Catalog table created by the crawler
df = glueContext.create_dynamic_frame.from_catalog(
    database="sales_glue_db",
    table_name="sales_2024"
)

# Optional transformation (side note: always handle nulls explicitly)
df = df.resolveChoice(specs=[('amount', 'cast:double')])

# Output: Write to RDS through the JDBC connection configured earlier
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=df,
    catalog_connection="conn_rds_sales",
    connection_options={
        "dbtable": "sales_import",
        "database": "<db-name>"
    },
    transformation_ctx="datasink"
)

job.commit()
```
Non-obvious tip: By default, batch inserts can struggle with very wide tables or large blobs. If you encounter JDBC errors like `server closed the connection unexpectedly`, tune the JDBC driver's batch size (or chunk the input data).
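One way to get explicit control over batching is to drop from the DynamicFrame down to a Spark DataFrame and use Spark's JDBC writer, which exposes a `batchsize` option. This is a sketch, not the Glue-connection path used above: the URL and credentials have to be supplied directly (placeholders below), ideally fetched from Secrets Manager.

```python
# Alternative sink: Spark's JDBC writer with an explicit batch size.
# The URL, user, and password are placeholders, not values from this walkthrough.
(df.toDF()  # DynamicFrame -> Spark DataFrame
   .write
   .format("jdbc")
   .option("url", "jdbc:postgresql://<rds-endpoint>:5432/<db-name>")
   .option("dbtable", "sales_import")
   .option("user", "glue_etl")
   .option("password", "use-secrets-manager-in-production")
   .option("batchsize", 500)  # smaller batches for wide rows or large blobs
   .mode("append")
   .save())
```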
7. Running and Monitoring the Job
Trigger via AWS Console or CLI:
aws glue start-job-run --job-name s3_sales_to_rds_etl
Monitor from the Jobs > Runs section. For hangs or failures, fetch logs:
- CloudWatch log group: `/aws-glue/jobs/output`
- Search for serialization errors or timeouts
Verify target table population using your SQL client; if performance is poor, review Glue’s DPU usage and job parallelism settings.
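For scripted monitoring, the run can be started and polled with boto3 (job name from step 6):

```python
import time

import boto3

glue = boto3.client("glue")

# Kick off the job and poll until it reaches a terminal state
run_id = glue.start_job_run(JobName="s3_sales_to_rds_etl")["JobRunId"]
while True:
    run = glue.get_job_run(JobName="s3_sales_to_rds_etl", RunId=run_id)
    state = run["JobRun"]["JobRunState"]
    print("Job state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```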
Production Readiness — Gotchas and Pro Tips
- Partition Pruning: For big S3 datasets, configure partition indexes on your source; crawler and job both benefit.
- Job Bookmarks: Enable to incrementally load only new data; this reduces redundant reads and writes (see the sketch after this list).
- Automatic Table Creation: RDS tables must exist before Glue writes; auto-creation is not supported for all engines.
- Sensitive data: Use Secrets Manager rather than hardcoding credentials. Link Secrets Manager secret to the Glue connection.
- Error Handling: RDS constraints (e.g., unique keys) can lead to silent dropped rows or cryptic JDBC error logs.
Alternative: For real-time or very high throughput, consider AWS DMS or direct COPY into Redshift instead.
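Job bookmarks, mentioned in the list above, are controlled through a job argument; a small boto3 sketch for enabling them at run time:

```python
import boto3

glue = boto3.client("glue")

# Enable bookmarks for this run so only new S3 data is processed
glue.start_job_run(
    JobName="s3_sales_to_rds_etl",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```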
Summary
AWS Glue makes S3-to-RDS migration significantly more maintainable than legacy scripting or ad-hoc Lambda functions. Automated schema discovery, managed orchestration, and robust monitoring reduce operational friction.
Not perfect—JDBC overhead, limitations on transactionality, and cold starts all apply. But for periodic, batch ingestion into RDS, this pattern is hard to beat.
Example Job Flow (ASCII Diagram)
[S3 Bucket] --(Glue Crawler)--> [Glue Data Catalog] --(Glue ETL Job)--> [RDS Instance]
Note: For cross-region S3/RDS setups, expect higher latency and possible VPC Endpoint adjustments.
Have experience optimizing Glue for heavy S3-to-RDS loads, or ever hit a puzzling crash? Reach out or drop a note—always interested in field-level war stories or unconventional solutions.