How to Copy Data from DynamoDB to S3 Using AWS Glue: A Practical Guide
Working with data stored in AWS DynamoDB often means you want to analyze or archive that data efficiently. A common pattern is exporting DynamoDB tables to Amazon S3, where the data can be further processed or queried using tools like Athena or Redshift Spectrum.
In this post, I’ll walk you through a step-by-step practical guide on how to use AWS Glue to move data from DynamoDB to S3, all without writing tons of code. By the end of this guide, you'll have a working Glue job that extracts your DynamoDB table and stores the output in an S3 bucket — ready for your analytics!
Why Export DynamoDB Data to S3?
Before diving into the steps, a quick note on why this is useful:
- Analytical processing: S3 acts as a central data lake suited for running SQL queries via Athena.
- Backups and archival: Long-term storage of your data in cost-effective storage.
- Data transformation: Prepare the data into formats like Parquet or CSV for better processing downstream.
- Integration: Connect with other AWS services or third-party tools from S3.
AWS Glue provides an elegant managed ETL (Extract, Transform, Load) platform with built-in connectors for DynamoDB and S3.
Prerequisites
Make sure you have:
- AWS account access
- IAM permissions to read from DynamoDB and write to S3
- An existing DynamoDB table with some items
- An existing S3 bucket where you want to export the data
- AWS Glue service role with necessary permissions
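If you still need to set up that Glue service role, here is a rough boto3 sketch of attaching the required permissions; the role name, account ID, table, and bucket below are hypothetical placeholders to adapt to your own resources.
import json
import boto3
iam = boto3.client("iam")
# Hypothetical names -- replace with your own role, table ARN, and bucket.
ROLE_NAME = "my-glue-service-role"
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/your-dynamodb-table-name"
BUCKET = "your-bucket"
# Glue needs its standard service permissions plus read access to the
# DynamoDB table and write access to the target S3 bucket.
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
)
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="dynamodb-to-s3-export",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["dynamodb:DescribeTable", "dynamodb:Scan"],
             "Resource": TABLE_ARN},
            {"Effect": "Allow",
             "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
             "Resource": ["arn:aws:s3:::" + BUCKET, "arn:aws:s3:::" + BUCKET + "/*"]}
        ]
    })
)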
Step 1: Create an AWS Glue Crawler for DynamoDB (Optional)
If you want Glue to automatically infer your table schema from DynamoDB:
- Navigate to AWS Glue Console > Crawlers > Add crawler.
- Name it (e.g., dynamodb-crawler).
- For data store, select DynamoDB, then pick your target table.
- Choose or create an IAM role with permissions for Glue.
- Configure the output database in the Glue Data Catalog.
- Run the crawler; it will create a metadata table representing your DynamoDB table.
This step lets you leverage schema metadata, which simplifies creating the ETL job.
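Prefer scripting this step instead of clicking through the console? A minimal boto3 sketch could look like the following; the crawler, role, and database names are hypothetical placeholders.
import boto3
glue = boto3.client("glue")
# Hypothetical names -- substitute your own role and catalog database.
glue.create_crawler(
    Name="dynamodb-crawler",
    Role="my-glue-service-role",
    DatabaseName="dynamodb_exports",
    Targets={"DynamoDBTargets": [{"Path": "your-dynamodb-table-name"}]}
)
# Kick off a crawl; the resulting metadata table appears in the chosen database.
glue.start_crawler(Name="dynamodb-crawler")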
Step 2: Create an AWS Glue ETL Job
Now let's set up the job that reads from DynamoDB and writes to S3.
- Go to AWS Glue Console > Jobs > Add job.
- Name your job (e.g., dynamodb-to-s3-export).
- Choose the same IAM role you used earlier.
- Select “A proposed script generated by AWS Glue” or write your own if desired.
- Under Data source, select your DynamoDB table (or the catalog table created by the crawler).
- Under Data target, choose Amazon S3 and specify a path (e.g., s3://your-bucket/dynamodb-export/).
- Select the output format: JSON is simple for a raw export, while Parquet is better for analytics.
- Save the job.
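The console stores the job definition for you, but if you manage infrastructure as code, a rough boto3 equivalent is sketched below; the role name, script location, Glue version, and worker settings are assumptions to adapt to your environment.
import boto3
glue = boto3.client("glue")
# Hypothetical role and script location -- the console normally manages these for you.
glue.create_job(
    Name="dynamodb-to-s3-export",
    Role="my-glue-service-role",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/dynamodb_to_s3.py",
        "PythonVersion": "3"
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2
)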
Step 3: Customize and Review the Code (Example Python Script)
Here is an example AWS Glue ETL PySpark script generated for exporting DynamoDB content to JSON files in S3:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Load data from DynamoDB via catalog table or directly
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "your-dynamodb-table-name",
        "dynamodb.throughput.read.percent": "0.5"
    }
)
# Optional: apply transformations here if needed
# Write output as JSON files into S3 bucket folder
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={
        "path": "s3://your-bucket/dynamodb-export/"
    },
    format="json"
)
job.commit()
Replace "your-dynamodb-table-name"
and "s3://your-bucket/dynamodb-export/"
with actual names.
The "dynamodb.throughput.read.percent"
controls how much capacity is consumed during reads; lowering it minimizes impact on your live DB but may slow transfers.
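Rather than hard-coding those values, you can pass them in as job parameters and read them with getResolvedOptions. Here is a sketch of that pattern, replacing the corresponding lines in the script above; the --TABLE_NAME and --OUTPUT_PATH parameter names are illustrative choices, not required names.
# Resolve extra job arguments; set --TABLE_NAME and --OUTPUT_PATH in the
# job's parameters (names here are illustrative).
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'TABLE_NAME', 'OUTPUT_PATH'])
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": args['TABLE_NAME'],
        "dynamodb.throughput.read.percent": "0.5"
    }
)
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": args['OUTPUT_PATH']},
    format="json"
)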
Step 4: Run Your Job and Verify Output
Back in the console:
- Start the job manually or via triggers/schedules.
- Monitor logs in CloudWatch if needed.
- Once complete, check the S3 prefix you specified — you should see JSON files representing the exported items.
You can now query these files using Athena by pointing it at that bucket location!
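Prefer to start and watch the run from code rather than the console? Here is a small boto3 sketch using the job name from Step 2:
import time
import boto3
glue = boto3.client("glue")
# Start the job created in Step 2 and poll until it reaches a terminal state.
run = glue.start_job_run(JobName="dynamodb-to-s3-export")
run_id = run["JobRunId"]
while True:
    status = glue.get_job_run(JobName="dynamodb-to-s3-export", RunId=run_id)
    state = status["JobRun"]["JobRunState"]
    print("Job state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)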
Bonus Tips
- Partitioning: If you export large tables periodically, partition the output by date/time attributes during the write phase in your Glue script (see the sketch after the Parquet example below).
- Parquet format: Use Parquet over JSON/CSV for better querying performance and compression.
Example snippet replacing JSON with Parquet:
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/dynamodb-parquet-export/"},
    format="parquet"
)
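And here is a sketch of the partitioning idea from the first tip. It assumes your items carry year, month, and day attributes (hypothetical names used for illustration); the partitionKeys option makes the S3 sink write year=/month=/day= prefixes that Athena can prune at query time.
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={
        "path": "s3://your-bucket/dynamodb-partitioned-export/",
        # Assumes the items contain these attributes; adjust to your schema.
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet"
)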
- Incremental exports: Maintain an attribute like an updatedAt timestamp on each item and have the Glue job filter for recently changed items for incremental loads. Note that the DynamoDB connector still scans the whole table; the filter is applied after the read. A sketch follows below.
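A minimal sketch of that filter, building on the Step 3 script and assuming every item carries an ISO-8601 updatedAt string (a hypothetical attribute name):
from awsglue.transforms import Filter
# Hypothetical cutoff; in practice you might pass it in as a job parameter
# or persist it between runs.
cutoff = "2024-01-01T00:00:00Z"
# The DynamoDB scan still reads the whole table; this only narrows what gets written.
changed = Filter.apply(
    frame=datasource0,
    f=lambda row: row["updatedAt"] > cutoff
)
glueContext.write_dynamic_frame.from_options(
    frame=changed,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/dynamodb-incremental-export/"},
    format="parquet"
)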
Conclusion
Using AWS Glue to export DynamoDB tables to Amazon S3 is straightforward and powerful once set up properly:
- Minimal code required thanks to managed connections.
- Automatically handles schema inference when using crawlers.
- Flexible output formats suited to analytics use cases.
If you have been manually copying data out or relying on Lambda scripts before — give this approach a try next time. It scales well as datasets grow and meshes perfectly into modern serverless architectures.
Happy ETLing!
If you'd like me to create sample CloudFormation templates or more advanced transformation examples next time, just ask!