S3 to DynamoDB

#AWS #Cloud #Data #S3 #DynamoDB #Streaming

How to Efficiently Stream Data from AWS S3 into DynamoDB for Real-Time Analytics

Many data teams still move data between S3 and DynamoDB in batches, missing out on the advantages of event-driven, streaming-style architectures. This guide flips the script by showing how to build an efficient streaming pipeline that unlocks near-instant insights from your data lake.

Why Stream Data from S3 to DynamoDB?

AWS S3 offers virtually unlimited, cost-effective storage for massive amounts of data, while DynamoDB provides low-latency read and write performance—perfect for real-time analytics use cases. By streaming relevant data changes or new files from S3 directly into DynamoDB, you avoid costly batch jobs and long refresh intervals. This approach enables near-instant decision-making and analytics.

In this post, I’ll walk you through how to set up such an architecture step-by-step, using AWS native services and Python examples.


Overview of the Architecture

Here’s the typical flow you’ll want to build:

  1. Data lands in S3 — New or updated files arrive in an S3 bucket.
  2. Trigger EventBridge or Lambda — An event notification fires when new objects are created.
  3. Parse & transform — Lambda reads the file content, extracts the required records.
  4. Write to DynamoDB — The processed data writes as individual items into a DynamoDB table.
  5. Consume data instantly — Your applications, dashboards, or streaming analytics services query DynamoDB for the latest info.
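
For orientation, the event notification that S3 delivers to the Lambda function in steps 2 and 3 looks roughly like this; it is trimmed here to the fields used later in this post, and the bucket name and object key are placeholders:

{
  "Records": [
    {
      "eventSource": "aws:s3",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": { "name": "your-bucket-name" },
        "object": { "key": "incoming/sample_data.csv", "size": 1024 }
      }
    }
  ]
}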

Step 1: Set Up Your S3 Bucket with Event Notifications

First, you need an S3 bucket configured to emit events when new objects are created.

  • Go to your AWS Console → S3 → Select/Create your bucket.
  • Under Properties, find Event Notifications.
  • Add a notification for object create events (s3:ObjectCreated:*, which also covers multipart uploads and copies).
  • Choose a destination: For our pipeline, trigger an AWS Lambda function.

Example:

Event Name: NewFileCreated
Event Types: All object create events (s3:ObjectCreated:*)
Filter (optional): prefix or suffix rules if you only want to process certain files, such as a `.json` or `.csv` suffix
Send To: the Lambda function you will create next
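
If you prefer to script this instead of clicking through the console, the same notification can be set with boto3. A minimal sketch, assuming a placeholder bucket name and Lambda ARN, and that the function already allows S3 to invoke it (via lambda add-permission or the console trigger setup):

import boto3

s3 = boto3.client('s3')

s3.put_bucket_notification_configuration(
    Bucket='your-bucket-name',  # placeholder bucket
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'Id': 'NewFileCreated',
                # placeholder ARN of the function created in Step 2
                'LambdaFunctionArn': 'arn:aws:lambda:eu-west-1:123456789012:function:s3-to-dynamodb',
                'Events': ['s3:ObjectCreated:*'],
                'Filter': {
                    'Key': {
                        'FilterRules': [
                            {'Name': 'suffix', 'Value': '.csv'}  # only process CSV files
                        ]
                    }
                }
            }
        ]
    }
)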

Step 2: Create a Lambda Function to Process the Incoming Objects

Our Lambda will act as the glue—reading file contents and writing items to DynamoDB.

Permissions Needed

  • Read access to your S3 bucket (s3:GetObject)
  • Write access to your DynamoDB table (dynamodb:PutItem and dynamodb:BatchWriteItem, since the batch writer used below issues batched writes; add dynamodb:UpdateItem if you also update existing items)

Attach an IAM role with these permissions before proceeding.
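
As a rough sketch of what that role needs, here is one way to attach an inline policy with boto3; the role name, policy name, bucket, region, account ID, and table name are all placeholders. The role also needs the usual CloudWatch Logs permissions (for example the AWSLambdaBasicExecutionRole managed policy).

import json
import boto3

iam = boto3.client('iam')

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # read newly created objects from the bucket
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::your-bucket-name/*"
        },
        {
            # write items; BatchWriteItem backs the batch writer used below
            "Effect": "Allow",
            "Action": ["dynamodb:PutItem", "dynamodb:UpdateItem", "dynamodb:BatchWriteItem"],
            "Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/YourDynamoDBTable"
        }
    ]
}

iam.put_role_policy(
    RoleName='s3-to-dynamodb-lambda-role',  # placeholder execution role
    PolicyName='S3ToDynamoDBAccess',
    PolicyDocument=json.dumps(policy)
)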

Sample Python Lambda Code

import json
import boto3
import csv
from urllib.parse import unquote_plus

s3_client = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
TABLE_NAME = 'YourDynamoDBTable'

def lambda_handler(event, context):
    table = dynamodb.Table(TABLE_NAME)

    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])
        
        print(f"Processing file {key} from bucket {bucket}")
        
        # Fetch the object from S3 and decode it as UTF-8 text
        response = s3_client.get_object(Bucket=bucket, Key=key)
        lines = response['Body'].read().decode('utf-8').splitlines()

        # Assume CSV format for this example; adjust the parser for other formats
        reader = csv.DictReader(lines)

        # Write each row as an item into DynamoDB; batch_writer() groups puts
        # into BatchWriteItem calls and retries unprocessed items automatically
        with table.batch_writer() as batch:
            for row in reader:
                # Assumes the table's partition key is 'id' and every row has one
                item = {
                    'id': row['id'],
                    'name': row.get('name', ''),
                    'value': int(row.get('value') or 0),
                    'timestamp': row.get('timestamp', '')
                }
                batch.put_item(Item=item)
                
    return {
        "statusCode": 200,
        "body": json.dumps("Successfully processed all records.")
    }

This example assumes:

  • Your S3 files are CSV-formatted.
  • Each file contains rows with columns: id, name, value, timestamp.
  • DynamoDB's primary key is on the id attribute.

You can customize the parsing logic for other input formats such as JSON or Parquet (though Parquet needs a dedicated reader such as pyarrow rather than the standard library); a JSON Lines variant is sketched below.
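
For instance, if your files were newline-delimited JSON (JSON Lines) instead of CSV, the parsing part of the handler could be swapped for something like this sketch, which reuses response and table from the handler above and assumes each line carries the same id/name/value/timestamp fields:

# Replace the csv.DictReader block with a JSON Lines parser
lines = response['Body'].read().decode('utf-8').splitlines()

with table.batch_writer() as batch:
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)  # one JSON object per line
        batch.put_item(Item={
            'id': str(row['id']),
            'name': row.get('name', ''),
            'value': int(row.get('value', 0)),
            'timestamp': row.get('timestamp', '')
        })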


Step 3: Configure DynamoDB Table

Make sure your DynamoDB table is optimized for frequent writes and reads.

  • Define proper partition keys for even data distribution.
  • Enable on-demand capacity, or provisioned throughput with auto-scaling, sized to match your write volume.
  • Optionally enable DAX if ultra-low latency reads are necessary.
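
For reference, a table matching the assumptions above can be created with boto3; this sketch uses on-demand capacity so you don't have to guess throughput up front:

import boto3

dynamodb = boto3.client('dynamodb')

dynamodb.create_table(
    TableName='YourDynamoDBTable',
    AttributeDefinitions=[
        {'AttributeName': 'id', 'AttributeType': 'S'}  # string partition key
    ],
    KeySchema=[
        {'AttributeName': 'id', 'KeyType': 'HASH'}  # partition key only, no sort key
    ],
    BillingMode='PAY_PER_REQUEST'  # on-demand capacity
)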

Step 4: Testing Your Pipeline

  1. Upload a sample CSV file manually or via scripts into your target S3 bucket.

Example CSV content (saved as sample_data.csv):

id,name,value,timestamp
101,Alice,23,2024-06-01T12:00:00Z
102,Bob,45,2024-06-01T12:05:00Z
103,Charlie,30,2024-06-01T12:10:00Z
  2. Confirm that your Lambda triggers upon upload.
  3. Check the function's CloudWatch logs (linked from the Lambda console) to ensure no errors occurred.
  4. Use the AWS CLI or Console to query your DynamoDB table:
aws dynamodb get-item --table-name YourDynamoDBTable --key '{"id": {"S": "101"}}'

You should see Alice’s item appear almost instantly after upload.
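
If you prefer to script the test, a small sketch with boto3 (bucket name, key prefix, and table name are placeholders) uploads the sample file and reads Alice's item back:

import time
import boto3

s3 = boto3.client('s3')
table = boto3.resource('dynamodb').Table('YourDynamoDBTable')

# Uploading the file emits the ObjectCreated event and invokes the Lambda
s3.upload_file('sample_data.csv', 'your-bucket-name', 'incoming/sample_data.csv')

# Give the function a few seconds to run, then read one item back
time.sleep(5)
print(table.get_item(Key={'id': '101'}).get('Item'))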


Additional Enhancements & Considerations

Handle Larger Files with Multiple Lambdas or Step Functions

For very large files, consider using AWS Step Functions to orchestrate the work: split the file into chunks and fan the chunks out across multiple Lambda invocations so that no single invocation runs up against the 15-minute Lambda timeout.

Schema Evolution & Data Validation

Implement version checks and schema validation directly inside Lambda functions to avoid malformed data ingestion.
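
A lightweight way to do that is a small guard in the handler that rejects rows missing required fields before they reach the batch writer. A sketch, assuming the id/name/value/timestamp schema used throughout this post:

REQUIRED_FIELDS = ('id', 'name', 'value', 'timestamp')

def validate_row(row):
    """Return True only if every required field is present and non-empty and 'value' is numeric."""
    if any(not row.get(field) for field in REQUIRED_FIELDS):
        return False
    try:
        int(row['value'])
    except (TypeError, ValueError):
        return False
    return True

Inside the loop, skip or dead-letter rows that fail validation rather than letting one malformed row fail the whole batch.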

Use Kinesis Firehose for Enriched Streaming

If your transformation needs grow more complex, or you want buffering and batching before data lands, consider putting Kinesis Data Firehose in front of the pipeline: producers send records to Firehose, which can invoke a transformation Lambda and deliver batched output to S3, while the S3-triggered Lambda above still handles the DynamoDB writes (Firehose has no native DynamoDB destination).


Conclusion

Streaming data from AWS S3 into DynamoDB unlocks powerful real-time analytics capabilities that traditional batch approaches cannot match.

By combining:

  • S3's scalable storage,
  • Lambda's serverless event processing,
  • and DynamoDB’s ultra-fast NoSQL performance,

you build a responsive system that feeds instant business insights directly from your data lake with minimal delay.

Give it a try today with this simple pipeline! And as always—tweak based on your specific dataset size, throughput requirements, and analytics goals to get optimal results.


Feel free to reach out or comment below if you'd like help customizing this setup! Happy streaming 🚀