Reading time: 1 min
#Cloud #BigData #DataEngineering #AWSGlue #ETL #RDS

Optimizing AWS Glue ETL Workflows for High-Throughput Data Transfer from RDS to S3

Out-of-the-box AWS Glue jobs for RDS-to-S3 migration exhibit unnecessary latency and resource overhead, especially at scale. Engineers routinely encounter bottlenecks tied to poorly partitioned JDBC reads, misconfigured networking, and S3 output layouts that undermine downstream query performance.

Straight to the point: this guide covers hard-won patterns for maximizing Glue efficiency in production-grade RDS-to-S3 ETL pipelines, touching on VPC wiring, parallelized extraction, Spark executor tuning, and data partitioning.


Network and Permissions: RDS ↔ Glue Connectivity

Before scripting the first ETL job, double-check fundamental prerequisites. Skipping these often results in connection errors such as:

com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure

Checklist:

  • VPC configuration: Ensure Glue jobs run inside the same VPC as RDS, or have the correct VPC endpoint for cross-VPC access. Route tables and NACLs matter—miss one, and you're debugging for hours.
  • Security groups: Permit inbound traffic on the proper port (default 3306 MySQL, 5432 PostgreSQL) from Glue’s job subnets.
  • IAM role: Attach policies for S3 (read/write), Glue, Secrets Manager (for DB creds; plain text is a liability), and CloudWatch.

Gotcha: Lambda-style default roles lack S3 permissions—verify via IAM policy simulator.
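As a reference point, the job role's permissions might look like the following minimal sketch; the bucket, secret name pattern, and policy name are placeholders, and the Glue/CloudWatch statement should be scoped down for production:

import json

import boto3

# Hypothetical policy for the Glue job role -- adjust resources to your account.
ETL_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Read from and write to the target data lake bucket
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::data-warehouse",
                         "arn:aws:s3:::data-warehouse/*"],
        },
        {   # Fetch DB credentials at runtime instead of hardcoding them
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": ["arn:aws:secretsmanager:*:*:secret:rds-etl-*"],
        },
        {   # Glue access and CloudWatch logging
            "Effect": "Allow",
            "Action": ["glue:*", "logs:CreateLogGroup",
                       "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": ["*"],
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(PolicyName="glue-rds-to-s3-etl",
                  PolicyDocument=json.dumps(ETL_POLICY))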


JDBC Connection: Secure and Maintainable

Establish a reusable JDBC connection in Glue. For real environments, avoid embedding credentials; instead, create a secret in AWS Secrets Manager. Example secret JSON:

{
  "username": "etl_user",
  "password": "*******"
}
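At job runtime, pull those credentials with boto3 rather than baking them into the script. A minimal sketch, assuming the secret is named rds-etl-credentials (a placeholder) and lives in the job's region:

import json

import boto3

def get_db_credentials(secret_name="rds-etl-credentials", region="us-east-1"):
    """Fetch the JSON secret shown above and return (username, password)."""
    client = boto3.client("secretsmanager", region_name=region)
    payload = client.get_secret_value(SecretId=secret_name)["SecretString"]
    secret = json.loads(payload)
    return secret["username"], secret["password"]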

Glue Console Steps:

  • Add connection → JDBC
  • Supply RDS endpoint, port, DB name
  • Select VPC, subnets, and security group (as above)
  • Reference secret via ARN

Verify connectivity with Test connection in the console. A failure here (e.g., Timeout: failed to connect to database) almost always points to VPC or security group misconfiguration.
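The same connection can also be created programmatically. A rough boto3 sketch, assuming the SECRET_ID connection property for the Secrets Manager reference and placeholder subnet and security group IDs:

import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "rds-appdb-jdbc",                    # hypothetical connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://rds.internal:3306/appdb",
            "SECRET_ID": "rds-etl-credentials",      # assumes Secrets Manager integration
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",  # placeholder
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)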


Extraction: Partitioned Reads for Throughput

The underlying Spark engine in Glue defaults to single-threaded reads if you don’t specify partitioning. On a 10M-row table, you’ll wait—sometimes hours.

Sample PySpark job for MySQL (Glue 4.0, Python 3.10). The partitioned extract uses Spark's native JDBC reader, which honors partitionColumn/lowerBound/upperBound/numPartitions, and the result is converted to a DynamicFrame for the Glue S3 writer:

import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init('rds-to-s3-etl', {})

db_url = "jdbc:mysql://rds.internal:3306/appdb"

# Partitioned JDBC read: each of the 16 partitions issues its own bounded
# query against RDS, so extraction runs in parallel.
df = (
    spark.read.format("jdbc")
    .option("url", db_url)
    .option("dbtable", "transactions")
    .option("user", "<username>")            # inject from Secrets Manager where possible
    .option("password", "<password>")        # don't hardcode—rotate and fetch
    .option("partitionColumn", "txn_id")     # must be numeric, indexed
    .option("lowerBound", "1")
    .option("upperBound", "8000000")
    .option("numPartitions", "16")           # tune up to RDS max connections
    .load()
)

dyf = DynamicFrame.fromDF(df, glueContext, "transactions")

# Write Parquet, partitioned by month (critical for Athena cost control)
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://data-warehouse/transactions/",
        "partitionKeys": ["transaction_month"]
    },
    format="parquet"
)

job.commit()

Considerations:

  • partitionColumn should be uniformly distributed; avoid columns with heavy skew.
  • Each read partition opens its own JDBC connection, so concurrent connections ≈ numPartitions (roughly Glue workers × partitions per worker). Keep that figure, plus a buffer for other database clients, below the RDS max_connections limit; 16 partitions is comfortable on most instance classes, while 64+ on a small instance often is not. Exceed the limit and you’ll see [ERROR] SQLNonTransientConnectionException: Too many connections.

Incremental Load Tip: For recurring jobs, push the filter down to the database by reading from a bounded subquery instead of the full table:

"dbtable": "(SELECT * FROM transactions WHERE updated_at > '2024-06-01T00:00:00Z') AS recent"

Glue’s bookmark feature tracks high-water marks but isn’t always sufficient under schema evolution.
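A minimal sketch of wiring that watermark into the partitioned read, continuing the job script above and assuming the cutoff arrives as a hypothetical job argument named watermark:

from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'watermark'])   # e.g. 2024-06-01T00:00:00Z

# The subquery executes on the RDS side, so only recent rows cross the wire.
incremental_query = (
    "(SELECT * FROM transactions "
    f"WHERE updated_at > '{args['watermark']}') AS recent"
)

recent_df = (
    spark.read.format("jdbc")
    .option("url", db_url)
    .option("dbtable", incremental_query)
    .option("user", "<username>")
    .option("password", "<password>")
    .option("partitionColumn", "txn_id")
    .option("lowerBound", "1")
    .option("upperBound", "8000000")
    .option("numPartitions", "16")
    .load()
)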


Spark Resource Tuning

Poor Spark executor configuration leads to wasted money or OOM failures. Glue 3.0+ supports these parameters:

Param                          Example Value   Comment
spark.executor.memory          '4g'            More for big tables; watch the container memory cap
spark.executor.cores           '2'             More cores = fewer parallel executors per node
spark.sql.shuffle.partitions   '64'            Excess partitions = S3 write pressure

spark.sql.shuffle.partitions can be set at runtime from the script, but executor memory and cores are fixed before the SparkContext starts, so supply those as Glue job parameters (typically via the --conf special parameter) rather than inline:

# Runtime-settable from inside the job:
spark.conf.set('spark.sql.shuffle.partitions', '64')

# Executor sizing belongs in the job definition, e.g. as job parameters:
#   --conf spark.executor.memory=4g --conf spark.executor.cores=2

Known Issue: Excessive executor count can hit RDS throttling—balance against database throughput, not just Spark speed.


Output Partitioning: S3 Considerations

Dumping all data into one Parquet file cripples Athena and Redshift Spectrum scans. Partition output on time-based or high-cardinality filter fields.

Example output structure:

s3://data-warehouse/transactions/
  ├─ year=2024/
  │   ├─ month=06/
  │   │    └─ part-*.parquet

Side note: For streaming ingestion, avoid writing straight into fine-grained daily partitions, which leaves swarms of tiny S3 files; consolidate them after the fact with periodic compaction jobs.
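A minimal compaction sketch, assuming the month layout above and a hypothetical staging prefix (writing back to the same prefix being read would race with the read):

# Re-read one month partition and rewrite it as a handful of larger files.
source = "s3://data-warehouse/transactions/year=2024/month=06/"
staging = "s3://data-warehouse/transactions_compacted/year=2024/month=06/"   # hypothetical

(
    spark.read.parquet(source)
    .coalesce(8)                   # aim for a few large files (~128-512 MB each)
    .write
    .mode("overwrite")
    .parquet(staging)
)

# Swap or copy the compacted objects into place only after the write succeeds,
# so the read and the overwrite never target the same prefix.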


Monitoring, Error Handling, and Cost Control

Glue emits metrics to CloudWatch by default. Create alarms for:

  • Job failures (Glue Job State Change events in EventBridge, or the glue.driver.aggregate.numFailedTasks job metric); see the sketch after this list
  • Elapsed time above the steady-state norm (glue.driver.aggregate.elapsedTime)
  • S3 partition size anomalies (custom metrics via AWS Lambda or Athena CTAS on S3 Inventory)
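For the failure alarm, a minimal EventBridge sketch, assuming an existing SNS topic (the ARN below is a placeholder):

import json

import boto3

events = boto3.client("events")

# Fire whenever any Glue job run ends in a FAILED or TIMEOUT state.
events.put_rule(
    Name="glue-etl-job-failed",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "TIMEOUT"]},
    }),
)

events.put_targets(
    Rule="glue-etl-job-failed",
    Targets=[{"Id": "notify-oncall",
              "Arn": "arn:aws:sns:us-east-1:123456789012:etl-alerts"}],   # placeholder topic
)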

Log Parsing Example:
Look for Data read completed and Job completed successfully entries to time the ETL stages, or JDBC connection closed prematurely messages when the database is being throttled.

Enable job bookmarks where possible to avoid double-processing and introduce idempotence.
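Bookmarks only track sources read through Glue with a transformation_ctx, not raw Spark JDBC reads, and the job must be started with --job-bookmark-option job-bookmark-enable. A minimal sketch against a hypothetical Data Catalog table over the RDS source:

from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job.init(args['JOB_NAME'], args)            # required so bookmark state is recorded

bookmarked_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="appdb",                       # hypothetical Data Catalog database
    table_name="appdb_transactions",        # hypothetical catalog table for the RDS source
    transformation_ctx="transactions_src",  # bookmark state is keyed on this name
    additional_options={
        "jobBookmarkKeys": ["txn_id"],
        "jobBookmarkKeysSortOrder": "asc",
    },
)

# ... transform and write as before ...
job.commit()                                # persists the bookmark high-water mark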


Final Thoughts

Highly tuned Glue workflows demand investment upfront: understanding both the networking between Glue and RDS and the internals of partitioned JDBC reads. Trade-offs abound: most RDS instances cannot handle as much concurrency as Spark would like, and the S3 partition design that suits your BI query patterns differs from what a simple backup needs.

Still, a typical optimization journey looks like this: the first run takes 4.5 hours and costs $20; after the steps above, runtime drops to 20 minutes at $2, all while producing Athena-ready partitions.

Alternatives exist (EMR, DMS, custom Lambdas), but for PySpark-compatible ETL orchestrated natively on AWS, Glue remains hard to beat.


Recommended: Revisit CloudWatch logs after every significant schema or partitioning change—early detection of problems saves more time than clever code.