AWS Glue to RDS

Reading time: 1 min
#Cloud #Data #AWS #Glue #RDS #ETL

Optimizing Data Workflows: Efficiently Loading AWS Glue ETL Outputs into Amazon RDS

Bulk-moving data from S3 through AWS Glue into an Amazon RDS database is efficient, but it is also a source of common pitfalls: jitter from misconfigured networking, slow inserts due to tiny batch sizes, entire jobs failing because of incorrect JDBC drivers, and transient network blips that pile up into operational debt. Below is a practical workflow used in production for loading daily processed web logs into RDS MySQL with minimal friction.


Direct Glue-to-RDS Loading: Problem, Not Just Method

Most ETL jobs land their output in S3 for cost and durability. But analytics teams frequently demand relational access—ad hoc JOINs, real-time dashboards—sometimes within seconds of data landing. Middleman workflows (Lambda bridges, custom import scripts) add complexity and waste time. Glue supports JDBC out of the box, but getting Glue jobs to write efficiently and securely to RDS, especially across subnet boundaries or with credential constraints, isn't plug-and-play.


Environment Setup

Assumptions:

  • AWS Glue 4.0 or later (Spark 3.3, Python 3.10)
  • Amazon RDS MySQL 8.0.33 in a private subnet
  • AWS Secrets Manager holding DB credentials

1. RDS Instance Configuration

Key requirements:

  • The RDS instance must be in the same VPC as the Glue connection and share at least one subnet/AZ with it; otherwise connections drop (Communications link failure in the logs).
  • The security group attached to RDS must allow inbound TCP on 3306 from your Glue job's security group (see the sketch below).
  • Parameter group: set max_connections according to your ETL throughput needs; batch writes consume connections in bursts.
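
A quick way to wire up the security-group rule, as a minimal boto3 sketch; both group IDs are hypothetical placeholders:

import boto3

ec2 = boto3.client("ec2")

# Allow MySQL traffic (TCP 3306) into the RDS security group
# from the Glue connection's security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123rds",  # hypothetical: the RDS instance's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3306,
        "ToPort": 3306,
        "UserIdGroupPairs": [{"GroupId": "sg-0456glue"}],  # hypothetical: the Glue connection's SG
    }],
)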

2. Table Schema Preparation

Example table:

CREATE TABLE web_logs (
  id INT PRIMARY KEY AUTO_INCREMENT,
  user_agent VARCHAR(255) NOT NULL,
  request_time DATETIME(6),
  response_code INT,
  bytes_sent INT
);

Note: Avoid nullable columns unless necessary; PySpark's schema inference turns missing fields into nulls, which creates subtle bugs at insert time.
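
One way to sidestep inference surprises is to declare the schema explicitly when reading, instead of letting Spark infer it. A minimal sketch; the field names assume the JSON layout used by the job script below:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

log_schema = StructType([
    StructField("user_agent", StringType(), nullable=False),
    StructField("timestamp", StringType()),       # parsed to DATETIME later
    StructField("response_code", IntegerType()),
    StructField("bytes_sent", IntegerType()),
])

# spark is the SparkSession created in the job script below
df = spark.read.schema(log_schema).json("s3://data-prod-logs/web/2024-06-01/*.json")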


Glue Connection: Setting Up JDBC

  • Glue Console → Connections → Add connection
    • Type: JDBC
    • URL: jdbc:mysql://<rds-endpoint>:3306/<dbname>?serverTimezone=UTC&useSSL=true
    • Credentials: Instead of embedding, reference a Secrets Manager secret ARN
    • Network: Assign VPC/subnet matching RDS, test connection before proceeding

Gotcha: Glue sometimes caches DNS; recreate the connection if the RDS endpoint changes.
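
Once the connection exists, the job script can resolve the JDBC settings from it at runtime rather than hardcoding credentials. A minimal sketch; the connection name rds-mysql-appdb is hypothetical, and glueContext is the GlueContext created in the job script below:

# Resolve URL and credentials from the Glue connection at runtime.
jdbc_conf = glueContext.extract_jdbc_conf("rds-mysql-appdb")

conn_options = {
    "user": jdbc_conf["user"],
    "password": jdbc_conf["password"],
    "driver": "com.mysql.cj.jdbc.Driver",
}
jdbc_url = jdbc_conf["url"]  # verify the returned keys on your Glue version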


ETL Job Script: PySpark Example

PySpark Glue job preview (truncated for clarity):

from awsglue.context import GlueContext
from pyspark.sql.functions import to_timestamp, col
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Input: S3 JSON logs (example path)
df = spark.read.json("s3://data-prod-logs/web/2024-06-01/*.json")

# Transform: clean up data
out_df = (
  df
  .select(
    col("user_agent"),
    to_timestamp(col("timestamp"), "yyyy-MM-dd'T'HH:mm:ss").alias("request_time"),
    col("response_code").cast("int"),
    col("bytes_sent").cast("int"),
  )
)

# JDBC write configuration
jdbc_url = "jdbc:mysql://rds-prod.cluster-123456.us-west-2.rds.amazonaws.com:3306/appdb"
conn_options = {
  "user": "glue-loader",
  "password": "<REDACTED>",  # Use AWS Glue connection or parametrize
  "driver": "com.mysql.cj.jdbc.Driver",
}

# Key performance knob: batchsize
out_df.write \
    .jdbc(
      url=jdbc_url,
      table="web_logs",
      mode="append",
      properties={**conn_options, "batchsize": "1000", "rewriteBatchedStatements": "true"}
    )

Common error: if you see java.sql.SQLNonTransientConnectionException: Public Key Retrieval is not allowed, add allowPublicKeyRetrieval=true to your JDBC URL (e.g., jdbc:mysql://<rds-endpoint>:3306/<dbname>?serverTimezone=UTC&useSSL=true&allowPublicKeyRetrieval=true).


Security Notes

  • Always reference credentials via AWS Secrets Manager or Glue Connection, never hardcode.
  • Glue jobs require an IAM role with access to S3 input/output locations, connection secrets, and the relevant VPC.
  • Session policies can limit scope if multiple jobs share the same role.
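
For jobs that don't use a Glue connection, fetching the secret directly also works. A minimal boto3 sketch; prod/rds/appdb is a hypothetical secret name, assumed to hold a JSON blob with username and password keys:

import json
import boto3

# Fetch and parse the database credentials at job start.
secret = boto3.client("secretsmanager").get_secret_value(SecretId="prod/rds/appdb")
creds = json.loads(secret["SecretString"])

conn_options = {
    "user": creds["username"],
    "password": creds["password"],
    "driver": "com.mysql.cj.jdbc.Driver",
}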

Performance & Reliability Tips

  • Batch Inserts: Tune batchsize—start at 1000 (the Spark JDBC writer's default) and avoid going above ~8000 for RDS—and monitor CPU spikes on your RDS instance.
  • Network Placement: RDS and Glue jobs should be in the same region and subnet type; cross-AZ traffic adds latency.
  • Retries: Implement error-catching and rerun logic; the JDBC driver does not retry on Communications link failure by default.
  • Monitoring: Enable CloudWatch logs for Glue jobs (/aws-glue/jobs/output) and RDS enhanced monitoring. Watch for throttling or table locks.
  • Staging: For heavy upserts, load into a temp table, then use an atomic RENAME TABLE or INSERT ... ON DUPLICATE KEY UPDATE (see the sketch after this list).
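
A minimal sketch of the staging pattern, assuming a web_logs_staging table with the same schema, a unique key on web_logs for the upsert to act on, and pymysql available in the job (e.g., via --additional-python-modules):

import pymysql

# 1) Land the batch in the staging table (same JDBC write as above).
out_df.write.jdbc(url=jdbc_url, table="web_logs_staging", mode="append",
                  properties=conn_options)

# 2) Merge staged rows into the live table in a single statement.
conn = pymysql.connect(
    host="rds-prod.cluster-123456.us-west-2.rds.amazonaws.com",
    user=conn_options["user"],
    password=conn_options["password"],
    database="appdb",
)
try:
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO web_logs (user_agent, request_time, response_code, bytes_sent)
            SELECT user_agent, request_time, response_code, bytes_sent
            FROM web_logs_staging
            ON DUPLICATE KEY UPDATE
              response_code = VALUES(response_code),
              bytes_sent = VALUES(bytes_sent)
        """)
        cur.execute("TRUNCATE TABLE web_logs_staging")
    conn.commit()
finally:
    conn.close()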

Example: Failure Case and Resolution

If you stream more than 500K records/hour and see MySQL server has gone away errors:

  • Lower batchsize, bump wait_timeout in the RDS parameter group, or paginate inserts in application logic (a simple retry/backoff sketch follows below).
  • Alternatively, offload large initial syncs to S3→RDS data import (e.g., MySQL's LOAD DATA FROM S3 for multi-GB batches) before using Glue for incremental loads.
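
As noted in the tips above, the driver won't retry for you. A minimal retry/backoff sketch around the JDBC write; since a failed attempt may already have committed some partitions' batches, pair this with idempotent or deduplicated loads:

import time

from py4j.protocol import Py4JJavaError

def write_with_retry(df, url, table, properties, attempts=3, backoff_s=30):
    # Retry the append on transient JDBC failures, with linear backoff.
    for attempt in range(1, attempts + 1):
        try:
            df.write.jdbc(url=url, table=table, mode="append", properties=properties)
            return
        except Py4JJavaError as err:
            if attempt == attempts:
                raise
            print(f"JDBC write failed (attempt {attempt}): {err}")
            time.sleep(backoff_s * attempt)

write_with_retry(out_df, jdbc_url, "web_logs",
                 {**conn_options, "batchsize": "1000"})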

Non-obvious Tip

Glue's JDBC write is synchronous: a failure in any partition halts the job but may leave partial records inserted. Explicitly repartitioning the DataFrame before the write (try repartition(4) for a small RDS instance) helps control parallel writes and lowers lock contention, as shown below.
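
In practice that looks like this (a sketch reusing the job's out_df, jdbc_url, and conn_options):

# Cap write parallelism at 4 concurrent JDBC connections to ease
# lock contention on a small RDS instance.
out_df.repartition(4).write.jdbc(
    url=jdbc_url,
    table="web_logs",
    mode="append",
    properties={**conn_options, "batchsize": "1000"},
)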


(Side Note)

Aurora users: Prefer the Aurora Data API for serverless or Lambda-integrated patterns. Classic JDBC from Glue is high-throughput but less suited for connectionless workloads.


Summary

Direct ETL from AWS Glue to RDS—when tuned with correct network, batch sizing, and secure credential management—removes unnecessary hops and enables rapid analytical workflows atop operational data. Not perfect for massive loads (consider S3 staging + RDS import for those), but for incremental high-frequency jobs, this approach keeps pipelines clean and latency low. Alternatives (e.g., using DMS for CDC, or streaming via Kinesis Data Firehose) exist, but add moving pieces. Sometimes, less is more.

Questions? Dive into AWS Glue docs for the latest supported connector versions. For deeper schema-evolution or job orchestration scenarios, consider Step Functions or Airflow as supplementary glue (pun intended).