DynamoDB to S3

Reading time: 1 min
#Cloud #AWS #BigData #DynamoDB #S3 #DataPipeline

Efficient DynamoDB Offloads to S3 for Analytics

DynamoDB excels at operational workloads—low latency, high throughput, time-to-market. But any engineer facing ad-hoc analytics or cross-entity aggregations quickly finds the limits: poor native query capability, high cost at scale, no materialized views, minimal full-text options.

Teams operating in high-volume environments frequently ask: How can we get this data to S3, daily or near real-time, for flexible downstream processing and BI? Here’s the truth—DynamoDB was never designed as an analytics backend. S3 is. The transfer is not just routine, but foundational to robust AWS data architectures.


DynamoDB Offload Patterns to S3

Three canonical approaches surface, each with distinct pros and caveats:

  1. Native DynamoDB Export to S3
    Optimized for full-table, consistent snapshots. Lowest operational burden.

  2. Streams & Lambda (Change Data Capture, CDC)
    Delivers incremental updates. Enables near real-time extract pipelines.

  3. Manual Full Table Scan (SDK, e.g., boto3)
    Fallback for unsupported regions or custom filtering. Higher operational risk.


1. Native Export to S3 (Good for Snapshots)

Since late 2020, any DynamoDB table with point-in-time recovery (PITR) enabled supports server-side exports to S3 in DynamoDB JSON or Amazon Ion format (gzip-compressed). The operation is truly asynchronous: it reads from the PITR backup, so no table read capacity is consumed and production traffic is unaffected. Typical turnaround for a multi-million-item export: about 10–30 minutes for ~50 GB (empirically, but expect variance).

How it’s done:

  • In AWS Console, go to: DynamoDB → [Table] → Export to S3
  • Specify S3 bucket and prefix (s3://my-analytics-bucket/dynamodb_exports/mytable/)
  • Choose DynamoDB JSON (the default) or Amazon Ion; if downstream Athena/Spark pipelines expect Parquet, plan a conversion step (e.g., a Glue job) after landing.
  • Launch the export (or trigger it programmatically, as sketched below).
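
The same export can be started from code, which is handy for scheduled pipelines. A minimal boto3 sketch, assuming PITR is already enabled; the table ARN, bucket, and prefix are placeholders:

# Trigger a server-side export of a PITR-enabled table (sketch; ARN, bucket, prefix are placeholders)
import boto3

ddb = boto3.client("dynamodb")

resp = ddb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:eu-west-1:123456789012:table/mytable",
    S3Bucket="my-analytics-bucket",
    S3Prefix="dynamodb_exports/mytable/",
    ExportFormat="DYNAMODB_JSON",  # or "ION"
)
export_arn = resp["ExportDescription"]["ExportArn"]

# The export runs asynchronously; poll until the status leaves IN_PROGRESS
status = ddb.describe_export(ExportArn=export_arn)["ExportDescription"]["ExportStatus"]
print(export_arn, status)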

Under the hood:
DynamoDB serves the export from the table's point-in-time recovery data rather than the live table, writing to S3 via an internal batch process. Under the chosen prefix you'll find an AWSDynamoDB/<export-id>/ directory containing manifest files plus gzip-compressed data objects:

s3://my-analytics-bucket/dynamodb_exports/mytable/AWSDynamoDB/<export-id>/data/<file>.json.gz

Advantages:

  • No impact on production read/write capacity; the export does not consume table RCUs.
  • No Lambda, no SDK, no IAM glue required.
  • Outputs can be queried with Athena, typically after a Glue crawler or a light transform over the DynamoDB JSON structure.

Limitations:

  • No “export just partition X where attribute=Y”. It’s all or nothing.
  • Incremental export (changed items within a time window) has only been available since 2023; older tooling assumes full-table re-exports, and there is still no per-item change filtering.
  • S3 object key layout is fixed.

Tip:
Integrate scheduled exports with Step Functions for pipeline orchestration; handle object arrival in S3 via EventBridge triggers for downstream ETL.
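
For the S3-arrival side, a hedged EventBridge sketch (this assumes the bucket has S3-to-EventBridge notifications enabled; the rule name is hypothetical and the target wiring is left out):

# EventBridge rule matching new export objects under the prefix
# (sketch; assumes S3 -> EventBridge notifications are enabled on the bucket)
import boto3, json

events = boto3.client("events")

pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["my-analytics-bucket"]},
        "object": {"key": [{"prefix": "dynamodb_exports/mytable/"}]},
    },
}

events.put_rule(
    Name="dynamodb-export-arrived",  # hypothetical rule name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)
# events.put_targets(...) would then point the rule at a Step Functions state machine or an ETL Lambda.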

Cost Note:
Charges apply for the data volume exported plus the resulting S3 storage. As of 2024, roughly $0.10 per GB exported (region-dependent; refer to the AWS pricing docs), so a ~50 GB table costs on the order of $5 per full export. Gzip compression keeps the footprint modest, but plan for growth before month-end analytics jobs.


2. Incremental Loads via DynamoDB Streams & Lambda

For engineering teams requiring sub-minute latency on data replication—and those who cannot tolerate full-table reprocessing—DynamoDB Streams combined with Lambda remain the gold standard.

Pattern:

  • Enable Streams (NEW_IMAGE or NEW_AND_OLD_IMAGES generally preferred).
  • Attach a Lambda to the stream; batch size tuning is critical (the default is 100, but 500 can be viable for high throughput).
  • Inside Lambda: transform stream payload (DynamoDB JSON) into analytics or raw format. Write to S3, ideally in compressed JSON Lines, Parquet, or (rarely) CSV.

Lambda Chunk Example

# Python 3.11 runtime, AWS Lambda
import boto3, json, gzip
from datetime import datetime
from io import BytesIO
from boto3.dynamodb.types import TypeDeserializer

s3 = boto3.client("s3")
BUCKET, PREFIX = "my-analytics-bucket", "cdc/"
deserializer = TypeDeserializer()

def ddb_json_to_plain(image):
    # Convert DynamoDB-typed JSON ({"S": "...", "N": "..."} etc.) into plain Python values
    return {k: deserializer.deserialize(v) for k, v in image.items()}

def lambda_handler(event, context):
    batch = []
    for rec in event["Records"]:
        new = rec["dynamodb"].get("NewImage")
        if new:
            batch.append(ddb_json_to_plain(new))
    if batch:
        buf = BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
            for row in batch:
                # default=str keeps non-JSON types (Decimal, Binary, sets) from raising
                gz.write(json.dumps(row, default=str).encode("utf-8") + b"\n")
        key = f"{PREFIX}cdc_{datetime.utcnow().isoformat()}Z.json.gz"
        s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())
    return {"records_written": len(batch)}

Notable gotchas:

  • Error handling: Unhandled exceptions in Lambda will cause retries, possibly duplicate events. Deduplication downstream, or idempotent consumers, are mandatory for high-scale streams.
  • Limits: AWS Lambda execution caps apply (15-minute timeout, 10 GB memory as of 2024). Very high stream throughput can overwhelm per-shard Lambda processing; routing changes through Kinesis Data Streams plus Firehose is a common buffered alternative.
  • Checkpointing: a failing batch is retried and blocks its shard until it succeeds or the records expire (~24 h on DynamoDB Streams), so events can be lost; configure retry limits and an on-failure SQS destination to contain the fallout (see the tuning sketch below).

Practical tip:
Batch records ~2–5MB per S3 object for optimal Athena scan efficiency. Avoid micro-objects, or Athena queries will crawl.
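
Batch size, batching window, and failure handling all live on the event source mapping rather than in the function itself. A hedged boto3 sketch; the function name, stream ARN, and queue ARN are placeholders:

# Tune the DynamoDB Streams -> Lambda event source mapping (sketch; ARNs and names are placeholders)
import boto3

lam = boto3.client("lambda")

lam.create_event_source_mapping(
    EventSourceArn="arn:aws:dynamodb:eu-west-1:123456789012:table/mytable/stream/2024-06-10T00:00:00.000",
    FunctionName="cdc-to-s3",             # hypothetical function name
    StartingPosition="LATEST",
    BatchSize=500,                        # default is 100
    MaximumBatchingWindowInSeconds=30,    # trade latency for larger S3 objects
    BisectBatchOnFunctionError=True,      # isolate poison records
    MaximumRetryAttempts=3,
    DestinationConfig={                   # failed batches land here instead of blocking the shard forever
        "OnFailure": {"Destination": "arn:aws:sqs:eu-west-1:123456789012:cdc-dlq"}
    },
)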


3. Full Table Scan & Push (SDK)

When exports must be filtered, or the native export is unavailable (certain AWS regions, or when a custom, derived view of the data is needed), a table scan via the AWS SDK (parallelizable with Segment/TotalSegments) is the fallback.

import boto3, json
from boto3.dynamodb.types import TypeDeserializer

TABLE = "CustomerOrders"; BUCKET = "my-analytics-bucket"; KEY = "historical/orders-2024-06-10.json"
ddb = boto3.client("dynamodb"); s3 = boto3.client("s3")
deserializer = TypeDeserializer()

def run():
    paginator = ddb.get_paginator("scan")
    items = []
    for page in paginator.paginate(TableName=TABLE):
        # Convert DynamoDB-typed JSON into plain Python dicts
        items.extend({k: deserializer.deserialize(v) for k, v in item.items()}
                     for item in page["Items"])
    # default=str keeps non-JSON types (Decimal, Binary, sets) from raising
    result = json.dumps(items, default=str)
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=result)
    print(f"Exported {len(items)} items → s3://{BUCKET}/{KEY}")

if __name__ == "__main__":
    run()

Caveats:

  • Table scans are expensive; throttling or ProvisionedThroughputExceededException is common. Use Limit (page size) and a targeted ProjectionExpression to keep each request small (see the sketch after this list).
  • Exponential backoff and error handling are essential; an unthrottled scan loop effectively DDoSes your own table.
  • Scans burn read capacity (twice as much with ConsistentRead=True); run them in off-peak windows.
  • This pattern is strictly for ad-hoc or emergency migration, not repeatable pipelines.
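
A hedged sketch of the throttling-aware variant, combining adaptive client-side retries with a narrow projection; the attribute names are placeholders:

# Scan with adaptive retries and a narrow projection (sketch; attribute names are placeholders)
import boto3
from botocore.config import Config

ddb = boto3.client(
    "dynamodb",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),  # backs off automatically on throttling
)

paginator = ddb.get_paginator("scan")
pages = paginator.paginate(
    TableName="CustomerOrders",
    ProjectionExpression="order_id, customer_id, order_total",  # hypothetical attributes
    PaginationConfig={"PageSize": 500},  # maps to the Scan Limit parameter
)
for page in pages:
    print(f"page of {len(page['Items'])} items")  # replace with the deserialize-and-write step above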

Best Practices & Non-Obvious Tips

  • Serialization: Use boto3.dynamodb.types.TypeDeserializer for reliable conversion. Avoid custom JSON if not necessary—edge cases arise on binary/map/set types.
  • Partitioning: Organize S3 output by event time (dt=YYYY-MM-DD/). Enables effective data pruning in Athena or Presto.
  • Compression: Always prefer gzip or native Parquet—avoid raw JSON in production pipelines.
  • Lifecycle Management: Tag S3 objects for automated archival (e.g., S3 Glacier) or deletion; unmanaged exports balloon storage costs over time (see the lifecycle sketch after this list).
  • Monitoring: Export jobs lack strong retry semantics. Integrate with CloudWatch and test failover cases (missed exports, incomplete files).
  • Schema Drift: Downstream ETL may silently fail if DynamoDB schema changes mid-export. Glue crawlers help auto-detect; always validate critical columns.
  • Side note: Even native DynamoDB exports, as of mid-2024, do not include items from recently deleted Global Tables regions. Plan for reconciliation if using cross-region replication.
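
A hedged sketch of such a lifecycle rule; the bucket, prefix, and retention periods are placeholders:

# Transition old CDC objects to Glacier and expire them later (sketch; values are placeholders)
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-dynamodb-cdc",
                "Filter": {"Prefix": "cdc/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)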

Reference Table

Scenario                                | Approach                       | Notes
Daily full-table export                 | Native DynamoDB Export to S3   | Automatable, no custom code
Near-real-time CDC                      | Streams + Lambda to S3         | Supports sub-minute latency pipelines
Unsupported region or custom filtering  | SDK-based scan + manual S3 put | High risk, operationally complex

Key Takeaways

Moving transactional data from DynamoDB to S3 is essential for scalable analytics on AWS. Native exports are preferable for snapshots; Streams-based pipelines are mandatory for freshness. Manual scans only belong in edge scenarios.

Known issue: Downstream Athena/Glue schema evolution must be tracked proactively—fields missing from older exports will not retroactively appear in lake queries unless full table rewrites are scheduled.
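
One way to keep drift visible is a scheduled Glue crawler over the lake prefix. A hedged boto3 sketch; the crawler name, IAM role, database, and schedule are placeholders:

# Glue crawler that periodically re-infers the schema of exported data (sketch; names are placeholders)
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="dynamodb-export-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="analytics_lake",
    Targets={"S3Targets": [{"Path": "s3://my-analytics-bucket/dynamodb_exports/mytable/"}]},
    Schedule="cron(0 6 * * ? *)",  # daily, after the export window
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
)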

For production pipelines, orchestration is as important as the data transfer itself. Automate failure handling, validate result consistency, and model cost growth as dataset sizes scale.


For deeper detail—especially on schema enforcement and real-world CDC deduplication patterns—see the AWS DynamoDB docs. Discrepancies between theoretical and empirical export speeds remain; there’s little substitute for periodic end-to-end validation in pre-production environments.