RDS to S3

Reading time: 1 min
#Cloud #AWS #Data #RDS #S3 #DataArchival

Automating Data Archival: RDS to S3 for Scalable, Cost-Efficient Storage

Keeping multi-terabyte cold datasets in Amazon RDS usually comes down to inertia. “It’s easier” is a common refrain, but each gigabyte left to age out on RDS storage consumes IOPS and dollars—sometimes aggressively. Cold data doesn’t belong on expensive transactional storage. The better approach: automate its export to Amazon S3, unlocking lower costs and easier access for analytics, compliance, or backup.

Critically, keeping cold data inside RDS increases storage costs and slows down queries, index rebuilds, and snapshot operations. Even minor maintenance windows expand. Offloading relic data to S3 can trim costs by an order of magnitude while freeing up operational capacity.

Below are several approaches for building a robust data archival pipeline from RDS to S3. They assume access to AWS managed services, IAM best practices, and at least moderate SQL/ETL familiarity.


Use Cases & Benefits

Consideration | RDS (Baseline)           | S3 (After Export)
Storage cost  | $0.115–0.23/GB-mo (gp2)  | $0.026/GB-mo (Standard)
Performance   | Degraded by table bloat  | Only live data in RDS
Growth        | Up to instance limit     | Virtually unlimited
Querying      | SQL only, rigid schema   | Athena, Glue, Spark, etc.
Durability    | Single region, backup    | 11 nines (99.999999999%), multi-region option

Direct Export to S3: Aurora PostgreSQL Example

Best fit for Aurora PostgreSQL 12.4+, where no external ETL is required: the aws_s3 extension's query_export_to_s3 function writes query results straight to S3 using a role attached to the cluster (the UNLOAD syntax belongs to Redshift, not Aurora). The following exports all records older than a year to CSV:

-- One-time: install the export extension
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

-- Export rows older than one year as CSV parts under the given prefix
SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT * FROM archive_table WHERE created_at < NOW() - INTERVAL ''1 year''',
    aws_commons.create_s3_uri('prod-data-archive', 'snapshots/table_202406/part', 'eu-west-1'),
    options := 'format csv'
);

Notes:

  • The export role is attached to the DB cluster (feature name s3Export) rather than referenced in the query, as sketched below.
  • The role's policy needs s3:PutObject and s3:AbortMultipartUpload on the target bucket and prefix.
  • Exports larger than roughly 6 GB are split into multiple part files automatically; the options string accepts PostgreSQL COPY arguments (text, CSV, or binary output).
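
Attaching the role is a one-time cluster operation; here is a minimal boto3 sketch, where the cluster identifier and role ARN are placeholders:

import boto3

rds = boto3.client("rds", region_name="eu-west-1")

# Associate the export role with the Aurora cluster for the s3Export feature.
# Cluster identifier and role ARN are placeholders.
rds.add_role_to_db_cluster(
    DBClusterIdentifier="prod-aurora-pg",
    RoleArn="arn:aws:iam::123456789012:role/AuroraS3Export",
    FeatureName="s3Export",
)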

Gotcha: Exports are plain CSV/text with no schema attached; table constraints and indexes aren’t transferred, so downstream data lakes need their own schema handling. If Parquet is required, convert downstream or use the Glue route in the next section.
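
One hedged way to close that schema gap is a Glue crawler over the export prefix; the crawler, role, and database names below are illustrative, not part of the original setup:

import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# One-time: create a crawler that infers table schemas from the exported files.
# Name, Role, and DatabaseName are placeholders.
glue.create_crawler(
    Name="archive-snapshots-crawler",
    Role="AWSGlueServiceRole-Archive",
    DatabaseName="archive_db",
    Targets={"S3Targets": [{"Path": "s3://prod-data-archive/snapshots/"}]},
)

# Run after each export so Athena sees new tables and partitions.
glue.start_crawler(Name="archive-snapshots-crawler")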


For RDS MySQL, SQL Server: ETL with AWS Glue

Standard RDS MySQL (8.x/5.7) and RDS SQL Server lack a built-in unload-to-S3 (Aurora MySQL has SELECT INTO OUTFILE S3; see the side note further down). Glue ETL jobs or PySpark scripts running on Glue, ECS, or EC2 fill the gap. Example for RDS MySQL using Glue and a JDBC read:

import sys

from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Set up the Glue job context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Connection details; in production, pull credentials from Secrets Manager
# or a Glue connection rather than hardcoding them.
connection_options = {
    "url": "jdbc:mysql://rds-prod-db.cluster-xyz.eu-west-1.rds.amazonaws.com:3306/finance",
    "user": "etl_user",
    "password": "retr0pass!"
}
# Query historic records (~15M rows per run)
df = glueContext.spark_session.read.jdbc(
    url=connection_options['url'],
    table="(SELECT * FROM invoices WHERE created_at < DATE_SUB(NOW(), INTERVAL 366 DAY)) AS src",
    properties={
        "user": connection_options['user'],
        "password": connection_options['password']
    }
)

df.write.mode("overwrite").parquet("s3://prod-data-archive/snapshots/invoices_202406/")

job.commit()

Trade-off: Spark on Glue handles large volumes, but a single-threaded JDBC pull can be slow; test with production-size data. If reads strain RDS or the job crawls, partition the JDBC query and size Glue workers accordingly, as sketched below.
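
A minimal sketch of a partitioned read that drops into the job above, assuming invoices.id is a roughly uniform numeric key (the bounds and partition count are illustrative):

# Parallel JDBC read: Spark issues one query per partition across the id range
df = glueContext.spark_session.read.jdbc(
    url=connection_options['url'],
    table="(SELECT * FROM invoices WHERE created_at < DATE_SUB(NOW(), INTERVAL 366 DAY)) AS src",
    column="id",                # numeric column to split on
    lowerBound=1,               # assumed smallest id in the archive range
    upperBound=15000000,        # assumed largest id in the archive range
    numPartitions=16,           # 16 concurrent readers against RDS
    properties={
        "user": connection_options['user'],
        "password": connection_options['password']
    }
)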


Lightweight Scripting: Python + Lambda/EC2/Fargate

For smaller tables or when AWS Glue is overkill, cron-style scheduled scripts suffice. Example with PyMySQL and boto3; the full result set is buffered in memory and uploaded as a single CSV object:

import pymysql, boto3, csv, io, datetime

s3_bucket = "prod-data-archive"
s3_key = "exports/orders_{:%Y%m%d}.csv".format(datetime.date.today())
rds = pymysql.connect(
    host="rds-prod-instance",
    user="archiver",
    password="5Tr0ngKey",
    db="erp"
)
cursor = rds.cursor()
cursor.execute(
    "SELECT * FROM orders WHERE created_at < NOW() - INTERVAL 370 DAY"
)
csv_buf = io.StringIO()
csv_writer = csv.writer(csv_buf)
csv_writer.writerow([desc[0] for desc in cursor.description])
csv_writer.writerows(cursor.fetchall())
cursor.close()
rds.close()

# Upload the buffered CSV as a single S3 object
boto3.client('s3').put_object(
    Bucket=s3_bucket,
    Key=s3_key,
    Body=csv_buf.getvalue()
)

  • Scheduling: Wrap in Lambda (on an EventBridge schedule) for datasets under ~512 MB, or use EC2/Fargate for larger volumes; see the handler sketch below.
  • Error case: If the dataset exceeds Lambda's memory or 15-minute limit, the logs will show Task timed out after 900.00 seconds.
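
A minimal handler sketch for the Lambda route, assuming the script above is wrapped in a function named export_orders() in the same module (the function name and the EventBridge cron are illustrative):

import json

def lambda_handler(event, context):
    # EventBridge invokes this on the archival schedule, e.g. cron(0 2 1 * ? *)
    exported_key = export_orders()  # hypothetical wrapper around the export script above
    return {
        "statusCode": 200,
        "body": json.dumps({"exported_key": exported_key})
    }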

Side Note: SELECT INTO OUTFILE S3 is an Aurora MySQL feature; it is not available in standard RDS MySQL.


Robustness & Automation Best Practices

  • IAM Least Privilege: Assign only required s3:PutObject and RDS read permissions.
  • Incremental Loads: Track latest archive date/ID and only export deltas to avoid duplicate storage.
  • Partitioning: Write files partitioned by month or year (s3://bucket/table/year=2023/month=01/) so Athena and the Glue catalog can prune partitions; see the sketch after this list.
  • Monitoring: Send logs to CloudWatch and trigger SNS alerts on exceptions (see the alert sketch below); validate the S3 object count after each run.
  • File Formats: Use Parquet unless downstream needs raw CSV. Parquet compresses 3-5x and is preferred in Athena/Redshift Spectrum.
  • S3 Object Lifecycle: Apply S3 lifecycle rules to transition older files to Glacier Deep Archive if compliance allows.
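
A sketch of the partitioned write for the Glue/Spark route, assuming the DataFrame still carries a created_at column (the derived year/month columns and the output prefix are illustrative):

from pyspark.sql import functions as F

# Derive partition columns from the timestamp, then write Hive-style partitions
(df.withColumn("year", F.year("created_at"))
   .withColumn("month", F.month("created_at"))
   .write.mode("append")
   .partitionBy("year", "month")
   .parquet("s3://prod-data-archive/invoices/"))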

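And a minimal alerting hook for the monitoring point, assuming an SNS topic already exists (the topic ARN is a placeholder) and the export scripts call it from their exception handlers:

import boto3

sns = boto3.client("sns")

def notify_failure(error: Exception) -> None:
    # Publish a short failure notice; wire this into the export script's except block
    sns.publish(
        TopicArn="arn:aws:sns:eu-west-1:123456789012:data-archival-alerts",
        Subject="RDS-to-S3 archival failed",
        Message=str(error),
    )
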
Implementation Notes

No single pipeline fits every workload. Direct export via the aws_s3 extension covers most Aurora PostgreSQL use cases. Large RDS MySQL tables with frequent writes run best via Glue ETL. For tactical one-off exports, Python with boto3 is quickest, though not transactional.

Trade-offs remain: Parquet conversion isn’t perfect for nested or heavily normalized schemas. Network egress becomes a bottleneck if exporting petabyte-scale archives—consider VPC endpoints or Direct Connect.

Sample S3 Lifecycle Policy JSON:

{
  "Rules": [
    {
      "ID": "Archive-old-backups",
      "Prefix": "snapshots/",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ]
    }
  ]
}
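
Applying the policy is a single call with boto3 (or aws s3api put-bucket-lifecycle-configuration); the sketch below assumes the JSON above is saved as lifecycle.json alongside the script, with the 365-day DEEP_ARCHIVE tier matching the Glacier Deep Archive guidance earlier:

import boto3, json

# Load the lifecycle policy shown above and attach it to the archive bucket
with open("lifecycle.json") as f:
    lifecycle = json.load(f)

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="prod-data-archive",
    LifecycleConfiguration=lifecycle,
)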

Summary

Automated archival from RDS to S3 is essential for both cost management and operational agility. Strip cold data from transactional stores and the benefits compound: reduced RDS instance size, faster backups, and easier compliance audits. Run a single table migration first. Revisit retention and export schedule quarterly. Not all exports will work flawlessly on day one—expect a few iterations.

Further optimization: catalog S3 data using AWS Glue Data Catalog for direct Athena queries, or integrate Redshift Spectrum for federated analytics.

If you need full CloudFormation pipelines or error handling patterns, those can be added per requirement (known issue: direct bucket notification triggers for Athena partitions can lag on high-object-count exports).

Consider this pipeline a maintenance task—write it, monitor it, then let it run itself.