DynamoDB to S3 Export Using AWS Glue: Practical Walkthrough
Archiving, analytics, and batch processing—these requirements often push DynamoDB users to offload table data to Amazon S3. Running queries directly on DynamoDB is expensive and operationally constrained; S3, especially paired with Athena or Redshift Spectrum, eliminates those limits.
Below is a streamlined workflow for exporting DynamoDB tables to S3, using AWS Glue. Emphasis: no hand-cranking code unless you want advanced transformations, and no Lambda glue code maintenance.
Typical Use Cases
- Daily backups of transactional data.
- Bulk analytics via Athena.
- Off-site archiving for compliance.
- Downstream processing with Spark, Redshift, or external warehouses.
Prerequisites
- AWS account with billing and resource creation enabled.
- IAM roles:
  - Role for AWS Glue (AWSGlueServiceRole or custom with DynamoDB read/S3 write).
- Resources:
  - At least one DynamoDB table populated (ideally >1000 items for scale).
  - Target S3 bucket (verify the bucket policy allows the Glue role to call s3:PutObject).
- [Optional] AWS CLI (aws-cli/2.x) for ad-hoc checks (a boto3 equivalent is sketched below).
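For those ad-hoc checks, a boto3 sketch along these lines works just as well as the CLI; the orders table and mybucket bucket names are placeholders for your own resources:

import boto3

TABLE_NAME = "orders"     # placeholder table name
BUCKET_NAME = "mybucket"  # placeholder bucket name

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

# ItemCount is refreshed roughly every six hours, so treat it as approximate.
table = dynamodb.describe_table(TableName=TABLE_NAME)["Table"]
print(f"{TABLE_NAME}: ~{table['ItemCount']} items, status {table['TableStatus']}")

# head_bucket raises ClientError if the bucket is missing or not accessible.
s3.head_bucket(Bucket=BUCKET_NAME)
print(f"s3://{BUCKET_NAME} is reachable with the current credentials")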
Glue Crawler (Optional but Efficient)
Schema awareness in ETL jobs saves time and mistakes. Let Glue infer it via a crawler:
- Create Crawler (a boto3 sketch appears at the end of this section):
  - Console: AWS Glue > Crawlers > Add crawler
  - Data store: Select DynamoDB, choose the table.
  - Permissions: Assign or create a service role with this basic policy:
{
"Version": "2012-10-17",
"Statement": [
{"Effect": "Allow", "Action": ["dynamodb:Scan"], "Resource": "*"},
{"Effect": "Allow", "Action": ["s3:PutObject"], "Resource": "*"}
]
}
- Run crawler. Inspect the Data Catalog: the schema may flatten nested fields or use types like string for DynamoDB JSON blobs.
Known issue: Complex nested attributes can be mis-inferred—double-check the catalog for correctness.
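If you would rather script the crawler than click through the console, a minimal boto3 sketch might look like this; the crawler name, role ARN, catalog database, and table path are all placeholders:

import boto3

glue = boto3.client("glue")

# Placeholders: adjust the name, role ARN, catalog database, and table path.
glue.create_crawler(
    Name="orders-ddb-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="ddb_exports",
    Targets={"DynamoDBTargets": [{"Path": "orders"}]}
)

# Kick off a run; progress shows up under AWS Glue > Crawlers.
glue.start_crawler(Name="orders-ddb-crawler")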
ETL Job: DynamoDB Table → S3
Many engineers skip the crawler for straight table copies, but you’ll want it if downstream consumers are picky about schema.
Glue Job Configuration
- Job Setup (a boto3 create_job sketch follows this configuration list)
  - AWS Glue > Jobs > Add job
  - Name: dynamodb-to-s3-archive
  - IAM Role: Choose the role created above
  - Type: Spark (double-check this; some regions default to Python shell, which won't work here)
- Source Configuration
  - Data source: DynamoDB
  - Table name: Exact name from the console (CaseSensitiveName)
  - If using the catalog: point at the Glue Catalog table instead
- Target
  - S3 URI: s3://bucket/path/ (use trailing slash)
  - Format: Parquet recommended; JSON allowed for schema-less loads.
- Script Customization
  - Let Glue generate the boilerplate, then review the PySpark for non-trivial tables.
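If you prefer to script the job creation itself, a rough boto3 sketch is below; the role ARN, script location, Glue version, and worker sizing are placeholders to adjust for your account:

import boto3

glue = boto3.client("glue")

# Placeholders: role ARN and the S3 location of your generated/edited script.
glue.create_job(
    Name="dynamodb-to-s3-archive",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",  # Spark ETL job, not "pythonshell"
        "ScriptLocation": "s3://mybucket/scripts/dynamodb-to-s3-archive.py",
        "PythonVersion": "3"
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=10
)

# Run it once to verify the export end to end.
glue.start_job_run(JobName="dynamodb-to-s3-archive")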
Example Script Snippet (Parquet)
import sys
from awsglue.transforms import Filter
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Set up the Spark/Glue context that the generated script normally provides.
sc = SparkContext()
glueContext = GlueContext(sc)

# Read the DynamoDB table, capping consumption of provisioned read capacity.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "orders",
        "dynamodb.throughput.read.percent": "0.35"
    }
)

# Edge case: Filter only recent records
# datasource0 = Filter.apply(frame=datasource0, f=lambda x: x["updatedAt"] > "2023-10-01T00:00:00Z")

# Write the frame to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://mybucket/exports/orders/"},
    format="parquet"
)
Side note: "dynamodb.throughput.read.percent"
limits the impact to your provisioned read capacity. Default is 0.5 (50%). For live production tables, start lower; ramp up after testing. Accidentally over-provision and you may see:
An error occurred (ProvisionedThroughputExceededException) when calling the Scan operation: Rate exceeded for table
Running and Monitoring
- Trigger manually, or via Glue Workflows/cron schedule for regular exports (a boto3 scheduling sketch follows at the end of this section).
- Monitor progress:
  - Glue Console for job duration and errors.
  - CloudWatch Logs: Inspect the /aws-glue/jobs/output log group (streams are named per job run) for PySpark driver logs.
- Output files appear in S3 under the configured path; partition them yourself if needed (e.g., a run-timestamp prefix in the path, or partitionKeys in the sink options).
Gotcha: Glue's default output splits into multiple files (parallel writers). Combine externally if you need a single output.
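For the scheduled route mentioned above, one option is a Glue scheduled trigger created via boto3. A minimal sketch, assuming the job is named dynamodb-to-s3-archive and you want a nightly 02:00 UTC run:

import boto3

glue = boto3.client("glue")

# Nightly run at 02:00 UTC; the cron string follows the AWS scheduled-expression format.
glue.create_trigger(
    Name="dynamodb-to-s3-archive-nightly",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "dynamodb-to-s3-archive"}],
    StartOnCreation=True
)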
Advanced: Incremental (Delta) Exports
Exporting full tables daily wastes resources once data volumes grow. Timestamps are your friend.
- Add or maintain a field (e.g., an updatedAt ISO timestamp) on each item.
- Use Glue predicate pushdown (not always intuitive; some bugs in pre-2023-09 Glue versions).
- Example (a parameterized variant is sketched after this section):
incremental = Filter.apply(
frame=datasource0,
f=lambda x: x["updatedAt"] > "2024-06-10T00:00:00"
)
Not perfect: Large tables with high write rates can still bottleneck at DynamoDB Scan speed.
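Rather than hard-coding the cutoff, it can be passed as a job argument and resolved with getResolvedOptions. A sketch, assuming a custom --cutoff parameter (a name chosen here for illustration) and the datasource0 frame from the earlier script:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.transforms import Filter

# Expects the job to be started with an extra argument, e.g. --cutoff 2024-06-10T00:00:00
args = getResolvedOptions(sys.argv, ["JOB_NAME", "cutoff"])
cutoff = args["cutoff"]

# datasource0 is the DynamicFrame read from DynamoDB in the earlier snippet.
# The filter still runs after a full table scan; it only trims what gets written out.
incremental = Filter.apply(
    frame=datasource0,
    f=lambda x: x["updatedAt"] > cutoff
)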
Non-obvious Tips
- Schema drift: DynamoDB allows a "schema-less" model; Glue tries to infer. Columns can disappear or appear run-to-run. Locking schema via catalog table helps with downstream apps.
- Compression: Parquet with snappy is the sweet spot for Athena performance (a combined sketch follows this list).
- Glue Version: Use Glue 3.0+ for Spark 3.x support and better Parquet compatibility. Old jobs might run Spark 2.x; check the job config.
- Testing: Export a subset via DynamoDB scan filters (not natively exposed in Glue script—requires manual script edit).
- Alternative: AWS Data Pipeline offers similar export, but is legacy.
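Tying the schema-drift and compression tips together: the job can read through the Data Catalog table the crawler created and write snappy-compressed Parquet. A rough sketch, reusing glueContext from the earlier script and assuming a catalog database ddb_exports with table orders (both placeholders):

# glueContext comes from the earlier script snippet.
# Read via the Data Catalog table so downstream consumers see a stable schema.
frame = glueContext.create_dynamic_frame.from_catalog(
    database="ddb_exports",   # placeholder catalog database
    table_name="orders"       # placeholder catalog table created by the crawler
)

# Write snappy-compressed Parquet for Athena-friendly scans.
glueContext.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://mybucket/exports/orders/"},
    format="parquet",
    format_options={"compression": "snappy"}
)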
Quick Reference Table
| Step | Action | Command/Config/Note |
|---|---|---|
| Crawler | Optional schema inference | Console or boto3; target the DynamoDB table |
| Glue Job | Create job, set source/target | DynamoDB → S3 (Parquet preferred) |
| IAM Role | Attach read/write permissions | AmazonDynamoDBReadOnlyAccess plus S3 write access (e.g., AmazonS3FullAccess or a scoped policy) |
| Output Format | Parquet (preferred), JSON for raw dumps | Specify in job config or script |
| Monitoring | Glue Console, CloudWatch Logs | Log group /aws-glue/jobs/output |
Engineers routinely rely on this pattern to operationalize data extracts and keep analytic pipelines flowing. AWS Glue, when configured with attention to resource limits and schema details, does most of the heavy lifting. No solution is perfect—large tables can still challenge Glue’s parallel scan performance—but for 80% of cases, this approach is solid.
If you hit an edge case or want CloudFormation snippets or cross-account export guidance, drop a request.