Optimizing AWS Glue ETL Workflows for High-Throughput Data Transfer from RDS to S3
Out-of-the-box AWS Glue jobs for RDS-to-S3 migration exhibit unnecessary latency and resource overhead, especially at scale. Engineers routinely encounter bottlenecks tied to poorly partitioned JDBC reads, misconfigured networking, and S3 output layouts that undermine downstream query performance.
Getting straight to the point: this guide covers hard-learned patterns for maximizing Glue efficiency in production-grade RDS-to-S3 ETL pipelines, touching on VPC wiring, parallelized extraction, Spark executor tuning, and data partitioning.
Network and Permissions: RDS ↔ Glue Connectivity
Before scripting the first ETL job, double-check fundamental prerequisites. Skipping these often results in connection errors such as:
com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
Checklist:
- VPC configuration: Ensure Glue jobs run inside the same VPC as RDS, or have the correct VPC endpoint for cross-VPC access. Route tables and NACLs matter—miss one, and you're debugging for hours.
- Security groups: Permit inbound traffic on the proper port (default 3306 MySQL, 5432 PostgreSQL) from Glue’s job subnets.
- IAM role: Attach policies for S3 (read/write), Glue, Secrets Manager (for DB creds; plain text is a liability), and CloudWatch.
Gotcha: Lambda-style default roles lack S3 permissions—verify via IAM policy simulator.
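That simulator check can be scripted. A minimal boto3 sketch; the role ARN, actions, and bucket below are placeholders for your setup:
import boto3

iam = boto3.client("iam")

# Ask IAM whether the Glue job role may read/write the target bucket.
# Role ARN and bucket ARN are placeholders.
response = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::123456789012:role/glue-etl-role",
    ActionNames=["s3:PutObject", "s3:GetObject"],
    ResourceArns=["arn:aws:s3:::data-warehouse/*"],
)
for result in response["EvaluationResults"]:
    print(result["EvalActionName"], result["EvalDecision"])  # expect "allowed"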
JDBC Connection: Secure and Maintainable
Establish a reusable JDBC connection in Glue. For real environments, avoid embedding credentials; instead, create a secret in AWS Secrets Manager. Example secret JSON:
{
  "username": "etl_user",
  "password": "*******"
}
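In the job script, fetch and parse that secret at runtime rather than passing credentials as job parameters. A short sketch; the secret name is a placeholder:
import json
import boto3

# Fetch DB credentials at runtime; the secret name below is a placeholder.
secrets = boto3.client("secretsmanager")
secret = secrets.get_secret_value(SecretId="rds/etl_user")
creds = json.loads(secret["SecretString"])

# creds["username"] / creds["password"] can now feed the JDBC options below.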
Glue Console Steps:
- Add connection → JDBC
- Supply RDS endpoint, port, DB name
- Select VPC, subnets, and security group (as above)
- Reference secret via ARN
Verify connectivity: Test the connection from the console. A failure here (e.g., Timeout: failed to connect to database) almost always points to VPC or security group misconfiguration.
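When the console test fails, a quick TCP probe run from inside the same subnets (for example a bastion host or a Glue Python shell job) separates routing problems from credential problems. The endpoint and port below are placeholders:
import socket

host, port = "mydb.xxxxxx.us-east-1.rds.amazonaws.com", 3306  # placeholders

try:
    # If the TCP handshake succeeds, routing and security groups are fine;
    # any remaining failures are auth- or driver-level.
    with socket.create_connection((host, port), timeout=5):
        print("TCP reachable")
except OSError as exc:
    print(f"Unreachable: {exc} (check security groups, route tables, NACLs)")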
Extraction: Partitioned Reads for Throughput
Without explicit partitioning, the Spark engine underneath Glue pulls the whole table through a single JDBC connection. On a 10M-row table, you'll wait, sometimes for hours.
Sample PySpark job for MySQL (Glue 4.0, Python 3.10):
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
import sys

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
db_url = "jdbc:mysql://rds.internal:3306/appdb"
options = {
    "url": db_url,
    "driver": "com.mysql.cj.jdbc.Driver",  # explicit driver class for Spark's JDBC source
    "dbtable": "transactions",
    "user": "<username>",         # inject from Secrets Manager where possible
    "password": "<password>",     # don't hardcode; rotate and fetch at runtime
    "partitionColumn": "txn_id",  # must be numeric, indexed
    "lowerBound": "1",
    "upperBound": "8000000",
    "numPartitions": "16"         # tune up to RDS max connections
}

# partitionColumn/lowerBound/upperBound/numPartitions are Spark JDBC options,
# so read through Spark's JDBC source, then wrap the result as a DynamicFrame
# for the Glue S3 sink below.
df = spark.read.format("jdbc").options(**options).load()
dyf = DynamicFrame.fromDF(df, glueContext, "transactions")
# Write Parquet, partitioned by month (critical for Athena cost control)
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://data-warehouse/transactions/",
        "partitionKeys": ["transaction_month"]
    },
    format="parquet"
)
job.commit()
Considerations:
- partitionColumn should be uniformly distributed; avoid columns with heavy skew.
- RDS max connections must cover (Glue workers * partitions per worker) plus a buffer for the database's own overhead; see the budget sketch after this list. Exceed it and you'll see:
[ERROR] SQLNonTransientConnectionException: Too many connections
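A back-of-the-envelope budget; the worker and buffer numbers below are illustrative assumptions, not recommendations:
# Illustrative numbers only; plug in your own job sizing.
glue_workers = 8            # workers allocated to the job
partitions_per_worker = 2   # concurrent JDBC partitions each worker may open
db_overhead_buffer = 20     # headroom for the application and admin sessions

required = glue_workers * partitions_per_worker + db_overhead_buffer
print(f"RDS max_connections should be at least {required}")  # here: 36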
Incremental Load Tip: For recurring jobs, apply pushdown predicates:
"dbtable": "(SELECT * FROM transactions WHERE updated_at > '2024-06-01T00:00:00Z') AS recent"
Glue’s bookmark feature tracks high-water marks but isn’t always sufficient under schema evolution.
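A minimal manual high-water-mark sketch, assuming the previous run's output already sits under the same S3 prefix and carries an updated_at column; it reuses the options dict and Spark session from the extraction job above:
# Derive the high-water mark from the last run's output, then push the filter
# into the JDBC subquery. Assumes at least one prior successful run.
prev = spark.read.parquet("s3://data-warehouse/transactions/")
last_ts = prev.agg({"updated_at": "max"}).collect()[0][0]

options["dbtable"] = (
    f"(SELECT * FROM transactions WHERE updated_at > '{last_ts}') AS recent"
)
incremental_df = spark.read.format("jdbc").options(**options).load()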
Spark Resource Tuning
Poor Spark executor configuration leads to wasted money or OOM failures. Glue 3.0+ supports these parameters:
| Param | Example Value | Comment |
|---|---|---|
| spark.executor.memory | '4g' | More for big tables; watch the container cap |
| spark.executor.cores | '2' | More cores = fewer parallel executors per node |
| spark.sql.shuffle.partitions | '64' | Excess partitions = S3 write pressure |
Set via Job parameters or inline. Note that executor memory and cores are fixed once the SparkContext starts, so pass those as job parameters; only runtime-adjustable settings such as shuffle partitions can be changed inline:
# Executor sizing (memory/cores) must be supplied before the SparkContext
# exists, e.g. as job parameters; shuffle partitions can be tuned at runtime.
spark.conf.set('spark.sql.shuffle.partitions', '64')
Known Issue: Excessive executor count can hit RDS throttling—balance against database throughput, not just Spark speed.
Output Partitioning: S3 Considerations
Dumping all data into one Parquet file cripples Athena and Redshift Spectrum scans. Partition output on time-based fields or other commonly filtered columns of moderate cardinality (very high-cardinality keys produce an explosion of tiny objects); a derivation sketch follows the example layout below.
Example output structure:
s3://data-warehouse/transactions/
├─ year=2024/
│ ├─ month=06/
│ │ └─ part-*.parquet
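To produce that layout, the write step needs explicit year and month columns. A minimal sketch, assuming the source table exposes a created_at timestamp column (hypothetical name) and reusing the DataFrame df from the extraction job above:
from pyspark.sql import functions as F

# Derive the partition columns so S3 objects land under year=YYYY/month=MM
# prefixes; created_at is a hypothetical timestamp column.
partitioned = (
    df.withColumn("year", F.date_format("created_at", "yyyy"))
      .withColumn("month", F.date_format("created_at", "MM"))
)

(partitioned.write
    .mode("append")
    .partitionBy("year", "month")
    .parquet("s3://data-warehouse/transactions/"))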
Side note: For streaming ingestion, avoid daily partitions (tiny S3 files); consolidate post-facto via compacting jobs.
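One way to run that consolidation is a small follow-up Spark job that rewrites a partition into fewer, larger files. A rough sketch; the staging prefix is an assumption:
# Read one small-file partition and rewrite it as a handful of larger files.
# Writing to a staging prefix avoids overwriting data that is still being read.
small = spark.read.parquet("s3://data-warehouse/transactions/year=2024/month=06/")
(small.coalesce(8)          # target file count; tune to partition size
      .write.mode("overwrite")
      .parquet("s3://data-warehouse/transactions_compacted/year=2024/month=06/"))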
Monitoring, Error Handling, and Cost Control
Glue emits metrics to CloudWatch by default. Create alarms for:
- Job failures (metric: Glue Jobs Failed); see the alarm sketch after this list
- Elapsed time above the steady-state norm (metric: Glue Job Elapsed Time)
- S3 partition size anomalies (custom metrics via AWS Lambda or Athena CTAS on S3 Inventory)
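A hedged sketch of a failure alarm using boto3. The metric name and dimensions follow Glue's documented job metrics, but treat them as assumptions and verify them in the CloudWatch console for your job; the job name is a placeholder:
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the aggregate failed-task count for the job rises above zero.
# Metric/dimension names are assumptions based on Glue job metrics.
cloudwatch.put_metric_alarm(
    AlarmName="rds-to-s3-etl-failures",
    Namespace="Glue",
    MetricName="glue.driver.aggregate.numFailedTasks",
    Dimensions=[
        {"Name": "JobName", "Value": "rds-to-s3-etl"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "count"},
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)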
Log Parsing Example:
Look for "Data read completed" and "Job completed successfully" to time the ETL, or "JDBC connection closed prematurely" when hitting DB throttling.
Enable job bookmarks where possible to avoid double-processing and introduce idempotence.
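Bookmarks are toggled through the --job-bookmark-option job argument; a minimal boto3 sketch that kicks off a bookmarked run, reusing the job name from the example above:
import boto3

glue = boto3.client("glue")

# Enable bookmarks for this run so previously processed data is skipped.
glue.start_job_run(
    JobName="rds-to-s3-etl",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)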
Final Thoughts
Highly tuned Glue workflows demand investment upfront—understanding both the cloud network mesh and the internals of partitioned JDBC reads. Trade-offs abound: Most RDS instances can’t handle as much concurrency as Spark would like; S3 partition design for your BI pattern will differ from “simple backups.”
Still, a typical optimization journey looks like this: the first run takes 4.5 hours and costs $20; after the steps above, runtime drops to 20 minutes at $2, all while outputting Athena-ready partitions.
Alternatives exist (EMR, DMS, custom Lambdas), but for PySpark-compatible ETL orchestrated natively on AWS, Glue remains hard to beat.
Recommended: Revisit CloudWatch logs after every significant schema or partitioning change—early detection of problems saves more time than clever code.