Automating Data Archival: Seamlessly Exporting Amazon RDS Data to S3 for Cost-Effective and Scalable Storage
Most teams keep large amounts of cold data locked in RDS because “it’s just easier.” But what if your data archive could be as agile and accessible as your live database? Keeping historical or infrequently accessed data inside Amazon RDS not only racks up costs but can also drag down your database's performance. The solution? Automate exporting your RDS data to Amazon S3, a cost-effective, scalable, and durable storage service, and unlock new possibilities for analytics without burdening your primary database.
In this post, I’ll walk you through why and how to set up an automated pipeline to export your Amazon RDS data to S3 reliably, using practical tools and examples. By the end, you’ll have a clear blueprint to reduce cost, improve performance, and keep your archived data accessible.
Why Export from Amazon RDS to S3?
- Cost savings: Storing massive historical datasets in RDS can be expensive since it charges for provisioned storage and IOPS. S3 is far cheaper per GB/month.
- Performance: Large tables with infrequently accessed historical records slow down queries, backups, and replication.
- Durability & Scalability: S3 offers 11 nines of durability and allows virtually infinite storage growth.
- Flexibility: Archived data in S3 can feed analytical jobs with AWS Glue, Athena, or Redshift Spectrum — unlocking new insights without loading that data back into your operational DB.
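As a quick illustration of that last point: once the exported files are cataloged (for example by a Glue crawler), they can be queried in place with Athena. A minimal boto3 sketch, assuming a hypothetical catalog database archive_db with a table defined over the exported files and an S3 results location:
import boto3

athena = boto3.client("athena")

# Assumption: a Glue Data Catalog database "archive_db" containing a table
# "archive_table" that points at the exported files in S3.
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM archive_table WHERE created_at < current_date - interval '1' year",
    QueryExecutionContext={"Database": "archive_db"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)
print("Started Athena query:", response["QueryExecutionId"])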
Common Patterns for Exporting RDS Data to S3
1. Native Export Functions (Aurora PostgreSQL)
If you’re running Amazon Aurora PostgreSQL, you can offload query results directly into CSV or text files on S3 using the aws_s3.query_export_to_s3 function from the aws_s3 extension.
-- One-time setup: install the export extension
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

-- Export rows older than one year to S3 as CSV
SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT * FROM archive_table WHERE created_at < current_date - interval ''1 year''',
    aws_commons.create_s3_uri('your-bucket', 'archive-data/archive_table', 'us-east-1'),
    options := 'format csv'
);
Benefits:
- Direct export from inside the database with no extra ETL infrastructure
- Output lands in S3 as CSV/text; if you need Parquet for analytics, use the Glue approach below or the RDS snapshot-export-to-S3 feature, which writes Parquet
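Before the export function can write to S3, the cluster needs an IAM role with write access to the target bucket associated for the S3 export feature. A minimal boto3 sketch, assuming the cluster identifier and the MyAuroraS3AccessRole role already exist (s3Export is the feature name used for Aurora PostgreSQL S3 exports):
import boto3

rds = boto3.client("rds")

# Associate an existing IAM role with the Aurora cluster for S3 exports.
# The cluster identifier and role ARN below are placeholders.
rds.add_role_to_db_cluster(
    DBClusterIdentifier="your-aurora-cluster",
    RoleArn="arn:aws:iam::123456789012:role/MyAuroraS3AccessRole",
    FeatureName="s3Export",
)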
2. Export with AWS Glue Jobs
For other RDS engines (MySQL, SQL Server) or more complex export logic:
- Use AWS Glue Spark jobs to connect to RDS via JDBC
- Extract records that meet archival criteria (e.g., created_at older than X)
- Write datasets out as Parquet/CSV files on S3
Example PySpark snippet:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv,['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Connection details (in production, pull credentials from AWS Secrets Manager
# or a Glue connection instead of hardcoding them in the job script)
jdbc_url = "jdbc:mysql://your-rds-endpoint:3306/yourdb"
connection_props = {
    "user": "username",
    "password": "password"
}

# Read the table over JDBC and keep only rows older than one year;
# Spark can push a simple comparison like this down to the database
df = spark.read.jdbc(url=jdbc_url, table="archive_table", properties=connection_props)\
    .filter("created_at < date_sub(current_date(), 365)")

# Write the archived rows to S3 as Parquet
df.write.mode("overwrite").parquet("s3://your-bucket/archive/older_than_1_year/")
job.commit()
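If you plan to query the archive with Athena or Redshift Spectrum, it is often worth partitioning the output by date at write time (this also lines up with the partitioning best practice later in this post). A small variant of the write step above; the year/month column names are just an illustration:
from pyspark.sql import functions as F

# Derive partition columns from the timestamp and write one folder per year/month
partitioned = (df
    .withColumn("year", F.year("created_at"))
    .withColumn("month", F.month("created_at")))

(partitioned.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3://your-bucket/archive/older_than_1_year/"))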
3. Custom Scripts + AWS Lambda, Triggered on a Schedule or by Events
You can also build a simple Python script that connects via psycopg2 or pymysql (depending on engine), queries archivable data in chunks, converts it to CSV or Parquet locally, then uploads it to S3 using boto3.
Wrap this script in an AWS Lambda function triggered by an Amazon EventBridge schedule rule (e.g., monthly); a minimal handler wrapper follows the script below. For large datasets, consider running it on EC2 or Fargate instead, since Lambda caps execution time at 15 minutes.
Basic flow example:
import csv
import io
from datetime import date

import boto3
import pymysql

# Connection and destination settings (in production, load these from
# environment variables or AWS Secrets Manager instead of hardcoding)
rds_host = "your-rds-endpoint"
db_user = "username"
db_password = "password"
db_name = "yourdb"
s3_bucket = "your-bucket"
s3_key_prefix = "archive/"

s3_client = boto3.client("s3")

def export_data():
    conn = pymysql.connect(host=rds_host, user=db_user,
                           password=db_password, db=db_name)
    cursor = conn.cursor()

    query = "SELECT * FROM archive_table WHERE created_at < NOW() - INTERVAL 1 YEAR"
    cursor.execute(query)

    # Write the result set to an in-memory CSV buffer.
    # For very large tables, fetch in batches with cursor.fetchmany() instead.
    csv_buffer = io.StringIO()
    writer = csv.writer(csv_buffer)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor.fetchall())

    # Upload the CSV to S3 under a date-stamped key
    s3_client.upload_fileobj(io.BytesIO(csv_buffer.getvalue().encode()),
                             s3_bucket,
                             f"{s3_key_prefix}archive_{date.today()}.csv")

    cursor.close()
    conn.close()

if __name__ == "__main__":
    export_data()
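To run this from Lambda on a schedule, the only addition needed is a handler entry point. A minimal sketch (the hardcoded settings above would move to environment variables or AWS Secrets Manager in a real function):
def lambda_handler(event, context):
    # Entry point invoked by the EventBridge schedule rule
    export_data()
    return {"status": "export complete"}
An EventBridge schedule expression such as cron(0 3 1 * ? *) would then run the export at 03:00 UTC on the first day of each month.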
Best Practices for Reliable Automation
- IAM Roles & Security
  - If you use the native Aurora export functions, make sure the cluster has the required IAM role associated for S3 export.
  - For scripts, Glue jobs, and Lambda functions, use least-privilege IAM roles.
- Incremental Exports & ID Tracking
  - Track the last exported timestamp or ID to avoid duplicates (see the sketch after this list).
  - Partition exported files by date for easier management.
- File Formats
  - Prefer Parquet or ORC over CSV: they compress better and speed up analytic queries.
- Scheduling
  - Use Amazon EventBridge (formerly CloudWatch Events) rules or AWS Step Functions to orchestrate jobs.
- Monitoring & Alerts
  - Track success and failure with CloudWatch Logs plus SNS notifications so you know when an archival run fails.
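Here is one way to implement the incremental-export bookmark mentioned above, using SSM Parameter Store as a tiny state store; the parameter name and default value are just placeholders:
import boto3

ssm = boto3.client("ssm")
WATERMARK_PARAM = "/archival/archive_table/last_exported_at"  # hypothetical parameter name

def get_watermark(default="1970-01-01 00:00:00"):
    """Return the timestamp of the last successful export (default on first run)."""
    try:
        resp = ssm.get_parameter(Name=WATERMARK_PARAM)
        return resp["Parameter"]["Value"]
    except ssm.exceptions.ParameterNotFound:
        return default

def set_watermark(new_value):
    """Record the new high-water mark after the exported file lands in S3."""
    ssm.put_parameter(Name=WATERMARK_PARAM, Value=new_value,
                      Type="String", Overwrite=True)

# Usage in the export query: only pull rows newer than the last run
# but still older than the one-year archival cutoff.
# query = ("SELECT * FROM archive_table "
#          "WHERE created_at > %s AND created_at < NOW() - INTERVAL 1 YEAR")
# cursor.execute(query, (get_watermark(),))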
Wrapping Up
If “it’s just easier” is what keeps your cold data locked inside RDS, costing you more than you realize, these automated pipelines will save you time, money, and headaches. Moving stale data out of your live transactional store into low-cost object storage like S3 offloads cost and complexity without sacrificing accessibility.
Whether you’re running Aurora PostgreSQL with its handy native export functions or orchestrating more complex ETL with AWS Glue or Lambda scripts for other engines like MySQL or SQL Server, there’s a solution ready out of the box.
Start small: choose one table of historical data that can be archived weekly or monthly. Build your pipeline incrementally until it fits your production needs perfectly. Before long you'll have a scalable archive solution where cold data doesn’t drag down hot workloads — but instead fuels future insights easily accessible in the cloud.
Happy archiving! If you want code examples or sample CloudFormation stacks for this setup, drop me a comment below!