DynamoDB to S3 Export Using AWS Glue: Practical Walkthrough
Archiving, analytics, and batch processing—these requirements often push DynamoDB users to offload table data to Amazon S3. Running queries directly on DynamoDB is expensive and operationally constrained; S3, especially paired with Athena or Redshift Spectrum, eliminates those limits.
Below is a streamlined workflow for exporting DynamoDB tables to S3, using AWS Glue. Emphasis: no hand-cranking code unless you want advanced transformations, and no Lambda glue code maintenance.
Typical Use Cases
- Daily backups of transactional data.
- Bulk analytics via Athena.
- Off-site archiving for compliance.
- Downstream processing with Spark, Redshift, or external warehouses.
Prerequisites
- AWS account with billing and resource creation enabled.
- IAM roles:
  - Role for AWS Glue (AWSGlueServiceRole or custom with DynamoDB read/S3 write).
- Resources:
  - At least one DynamoDB table populated (ideally >1000 items for scale).
  - Target S3 bucket (verify the bucket policy allows the Glue role to call s3:PutObject).
- [Optional] AWS CLI (aws-cli/2.x) for ad-hoc checks (a boto3 equivalent is sketched below).
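For those ad-hoc checks, a boto3 sketch along these lines works just as well as the CLI; the orders table and mybucket bucket names are placeholders for your own resources:

import boto3

TABLE_NAME = "orders"     # placeholder table name
BUCKET_NAME = "mybucket"  # placeholder bucket name

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

# ItemCount is refreshed roughly every six hours, so treat it as approximate.
table = dynamodb.describe_table(TableName=TABLE_NAME)["Table"]
print(f"{TABLE_NAME}: ~{table['ItemCount']} items, status {table['TableStatus']}")

# head_bucket raises ClientError if the bucket is missing or not accessible.
s3.head_bucket(Bucket=BUCKET_NAME)
print(f"s3://{BUCKET_NAME} is reachable with the current credentials")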
Glue Crawler (Optional but Efficient)
Schema awareness in ETL jobs saves time and mistakes. Let Glue infer it via a crawler:
- Create Crawler (a boto3 sketch appears at the end of this section):
  - Console: AWS Glue > Crawlers > Add crawler
  - Data store: Select DynamoDB, choose the table.
  - Permissions: Assign or create a service role with this basic policy:
{
"Version": "2012-10-17",
"Statement": [
{"Effect": "Allow", "Action": ["dynamodb:Scan"], "Resource": "*"},
{"Effect": "Allow", "Action": ["s3:PutObject"], "Resource": "*"}
]
}
- Run crawler. Inspect the Data Catalog: the schema may flatten nested fields or use types like string for DynamoDB JSON blobs.
Known issue: Complex nested attributes can be mis-inferred—double-check the catalog for correctness.
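If you would rather script the crawler than click through the console, a minimal boto3 sketch might look like this; the crawler name, role ARN, catalog database, and table path are all placeholders:

import boto3

glue = boto3.client("glue")

# Placeholders: adjust the name, role ARN, catalog database, and table path.
glue.create_crawler(
    Name="orders-ddb-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="ddb_exports",
    Targets={"DynamoDBTargets": [{"Path": "orders"}]}
)

# Kick off a run; progress shows up under AWS Glue > Crawlers.
glue.start_crawler(Name="orders-ddb-crawler")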
ETL Job: DynamoDB Table → S3
Many engineers skip the crawler for straight table copies, but you’ll want it if downstream consumers are picky about schema.
Glue Job Configuration
- Job Setup (a boto3 create_job sketch follows this configuration list)
  - AWS Glue > Jobs > Add job
  - Name: dynamodb-to-s3-archive
  - IAM Role: Choose the role created above
  - Type: Spark (double-check this; some regions default to Python shell, which won't work here)
- Source Configuration
  - Data source: DynamoDB
  - Table name: Exact name from the console (CaseSensitiveName)
  - If using the catalog: point at the Glue Catalog table instead
- Target
  - S3 URI: s3://bucket/path/ (use trailing slash)
  - Format: Parquet recommended; JSON allowed for schema-less loads.
- Script Customization
  - Let Glue generate the boilerplate, then review the PySpark for non-trivial tables.
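If you prefer to script the job creation itself, a rough boto3 sketch is below; the role ARN, script location, Glue version, and worker sizing are placeholders to adjust for your account:

import boto3

glue = boto3.client("glue")

# Placeholders: role ARN and the S3 location of your generated/edited script.
glue.create_job(
    Name="dynamodb-to-s3-archive",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",  # Spark ETL job, not "pythonshell"
        "ScriptLocation": "s3://mybucket/scripts/dynamodb-to-s3-archive.py",
        "PythonVersion": "3"
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=10
)

# Run it once to verify the export end to end.
glue.start_job_run(JobName="dynamodb-to-s3-archive")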
Example Script Snippet (Parquet)
import sys
from awsglue.transforms import Filter
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Set up the Spark/Glue context that the generated script normally provides.
sc = SparkContext()
glueContext = GlueContext(sc)

# Read the DynamoDB table, capping consumption of provisioned read capacity.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "orders",
        "dynamodb.throughput.read.percent": "0.35"
    }
)

# Edge case: Filter only recent records
# datasource0 = Filter.apply(frame=datasource0, f=lambda x: x["updatedAt"] > "2023-10-01T00:00:00Z")

# Write the frame to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://mybucket/exports/orders/"},
    format="parquet"
)
Side note: "dynamodb.throughput.read.percent"
limits the impact to your provisioned read capacity. Default is 0.5 (50%). For live production tables, start lower; ramp up after testing. Accidentally over-provision and you may see:
An error occurred (ProvisionedThroughputExceededException) when calling the Scan operation: Rate exceeded for table
Running and Monitoring
- Trigger manually, or via Glue Workflows/cron schedule for regular exports (a boto3 scheduling sketch follows at the end of this section).
- Monitor progress:
  - Glue Console for job duration and errors.
  - CloudWatch Logs: Inspect the /aws-glue/jobs/output log group (streams are named per job run) for PySpark driver logs.
- Output files appear in S3 under the configured path; partition them yourself if needed (e.g., a run-timestamp prefix in the path, or partitionKeys in the sink options).
Gotcha: Glue's default output splits into multiple files (parallel writers). Combine externally if you need a single output.
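For the scheduled route mentioned above, one option is a Glue scheduled trigger created via boto3. A minimal sketch, assuming the job is named dynamodb-to-s3-archive and you want a nightly 02:00 UTC run:

import boto3

glue = boto3.client("glue")

# Nightly run at 02:00 UTC; the cron string follows the AWS scheduled-expression format.
glue.create_trigger(
    Name="dynamodb-to-s3-archive-nightly",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "dynamodb-to-s3-archive"}],
    StartOnCreation=True
)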
Advanced: Incremental (Delta) Exports
Exporting full tables daily wastes resources once data volumes grow. Timestamps are your friend.
- Add or maintain a field (e.g., an updatedAt ISO timestamp) on each item.
- Use Glue predicate pushdown (not always intuitive; some bugs in pre-2023-09 Glue versions).
- Example (a parameterized variant is sketched after this section):
incremental = Filter.apply(
frame=datasource0,
f=lambda x: x["updatedAt"] > "2024-06-10T00:00:00"
)
Not perfect: Large tables with high write rates can still bottleneck at DynamoDB Scan speed.
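Rather than hard-coding the cutoff, it can be passed as a job argument and resolved with getResolvedOptions. A sketch, assuming a custom --cutoff parameter (a name chosen here for illustration) and the datasource0 frame from the earlier script:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.transforms import Filter

# Expects the job to be started with an extra argument, e.g. --cutoff 2024-06-10T00:00:00
args = getResolvedOptions(sys.argv, ["JOB_NAME", "cutoff"])
cutoff = args["cutoff"]

# datasource0 is the DynamicFrame read from DynamoDB in the earlier snippet.
# The filter still runs after a full table scan; it only trims what gets written out.
incremental = Filter.apply(
    frame=datasource0,
    f=lambda x: x["updatedAt"] > cutoff
)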
Non-obvious Tips
- Schema drift: DynamoDB allows a "schema-less" model; Glue tries to infer. Columns can disappear or appear run-to-run. Locking schema via catalog table helps with downstream apps.
- Compression: Parquet with snappy is the sweet spot for Athena performance (a combined sketch follows this list).
- Glue Version: Use Glue 3.0+ for Spark 3.x support and better Parquet compatibility. Old jobs might run Spark 2.x; check the job config.
- Testing: Export a subset via DynamoDB scan filters (not natively exposed in Glue script—requires manual script edit).
- Alternative: AWS Data Pipeline offers similar export, but is legacy.
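Tying the schema-drift and compression tips together: the job can read through the Data Catalog table the crawler created and write snappy-compressed Parquet. A rough sketch, reusing glueContext from the earlier script and assuming a catalog database ddb_exports with table orders (both placeholders):

# glueContext comes from the earlier script snippet.
# Read via the Data Catalog table so downstream consumers see a stable schema.
frame = glueContext.create_dynamic_frame.from_catalog(
    database="ddb_exports",   # placeholder catalog database
    table_name="orders"       # placeholder catalog table created by the crawler
)

# Write snappy-compressed Parquet for Athena-friendly scans.
glueContext.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://mybucket/exports/orders/"},
    format="parquet",
    format_options={"compression": "snappy"}
)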
Quick Reference Table
| Step | Action | Command/Config/Note |
|---|---|---|
| Crawler | Optional schema inference | Console or boto3; target the DynamoDB table |
| Glue Job | Create job, set source/target | DynamoDB → S3 (Parquet preferred) |
| IAM Role | Attach read/write permissions | AmazonDynamoDBReadOnlyAccess plus S3 write access (e.g., AmazonS3FullAccess or a scoped policy) |
| Output Format | Parquet (preferred), JSON for raw dumps | Specify in job config or script |
| Monitoring | Glue Console, CloudWatch Logs | Log group /aws-glue/jobs/output |
Engineers routinely rely on this pattern to operationalize data extracts and keep analytic pipelines flowing. AWS Glue, when configured with attention to resource limits and schema details, does most of the heavy lifting. No solution is perfect—large tables can still challenge Glue’s parallel scan performance—but for 80% of cases, this approach is solid.
If you hit an edge case or want CloudFormation snippets or cross-account export guidance, drop a request.