DynamoDB to BigQuery

#Cloud #Analytics #BigData #DynamoDB #BigQuery #RealTime

Efficiently Syncing DynamoDB Data to BigQuery for Real-Time Analytics

Syncing operational NoSQL workloads into an analytics warehouse without hour-long batch ETL cycles remains a common challenge. Many teams keep their transactional data in DynamoDB and rely on BigQuery for heavy analytics. The friction: DynamoDB is optimized for low-latency single-item access, while BigQuery expects denormalized, columnar data. Bridging this architectural gap efficiently, and in (near) real time, requires careful integration of AWS and Google primitives.


Real-Time Sync: Core Drivers

Heavy ETL isn't just a cost center; it's a latency bottleneck. When the business depends on up-to-the-minute dashboards or alerting, asynchronous batch loads are not enough.

Key principles driving a real-time sync architecture:

  • Incremental data capture: Move only what changed (not the entire table).
  • Low operational burden: Serverless, event-driven, minimal infrastructure.
  • Controlled cost: Batching, error handling, and stream throttling.
  • Schema adaptation: DynamoDB’s flexible schemas vs. BigQuery’s rigidity.

Typical scenario: A finance team needs real-time risk computed off transactions stored in DynamoDB. The analytics pipeline can’t lag by hours. Loading periodic S3 exports introduces unacceptable staleness.


Plumbing: From DynamoDB Streams to BigQuery Streaming

High-Level Architecture

  1. DynamoDB Streams captures inserts, updates, and deletes in the table.
  2. AWS Lambda triggers on stream events, transforms and prepares JSON payloads.
  3. BigQuery Streaming Insert API ingests rows into a target table in seconds.

ASCII sketch:

[DynamoDB Table]
      |
      V
 [DynamoDB Streams]
      |
     (Lambda Trigger)
      |
      V
 [AWS Lambda] ---> [BigQuery Streaming Insert]

Step 1: Enable DynamoDB Streams

Enable streams in the AWS Console:

  • Go to your table > Exports and streams > enable DynamoDB Streams.
  • Choose NEW_IMAGE or NEW_AND_OLD_IMAGES; NEW_IMAGE suffices for pure inserts/updates.

Streams provide an ordered, near-real-time log of every write to the table. Note that each table's stream has its own ARN; point each environment's Lambda trigger at the correct stream rather than reusing an ARN across environments.
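
If you prefer scripting over the console, the same setting can be applied with boto3; a minimal sketch, assuming a table named orders:

import boto3

dynamodb = boto3.client("dynamodb")

# Enable a stream that emits the new item image on every write.
# "orders" is a placeholder table name.
response = dynamodb.update_table(
    TableName="orders",
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_IMAGE",
    },
)

# Keep the stream ARN; it is needed when wiring the Lambda trigger.
print(response["TableDescription"]["LatestStreamArn"])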


Step 2: Lambda to Consume Streams, Write to BigQuery

Create the Lambda function in Python 3.11 (the minimum recommended runtime for future library compatibility). Install dependencies (google-cloud-bigquery) by bundling them from a requirements.txt, or package them as a ZIP/layer. Runtime memory: 512–1024 MB keeps latency low.

Sample core handler, adapted for practical pitfalls (e.g., DynamoDB data types):

from google.cloud import bigquery
import os
import json

bq_client = bigquery.Client()

def convert_ddb_img(ddb_img):
    row = {}
    for k, v in ddb_img.items():
        # Simple types; see AWS docs for possible types (S, N, BOOL etc)
        if 'S' in v:
            row[k] = v['S']
        elif 'N' in v:
            row[k] = float(v['N'])
        elif 'BOOL' in v:
            row[k] = v['BOOL']
        else:
            # Known issue: lists and maps are not handled here
            row[k] = None
    return row

def lambda_handler(event, context):
    dataset = os.environ['BQ_DATASET_ID']
    table = os.environ['BQ_TABLE_ID']
    rows = []
    for record in event['Records']:
        if record['eventName'] in ['INSERT', 'MODIFY']:
            new_img = record['dynamodb']['NewImage']
            rows.append(convert_ddb_img(new_img))
        # Optionally handle DELETE events here

    if rows:
        # "dataset.table" resolves against the client's default project
        table_id = f"{dataset}.{table}"
        errors = bq_client.insert_rows_json(table_id, rows)
        if errors:
            print("BigQuery insert errors:", errors)
            # Good practice: integrate with CloudWatch alarms or a Dead Letter Queue
    return {"inserted": len(rows)}

Note: Lambda cold starts (~500 ms) can add initial delay. For a stable pipeline, configure provisioned concurrency to keep warm instances available.
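
Wiring the stream to the function can be scripted as well; a minimal boto3 sketch, with the stream ARN and function name as placeholders:

import boto3

lambda_client = boto3.client("lambda")

# Attach the DynamoDB stream as the event source for the sync function.
# BatchSize controls how many stream records each invocation receives.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:dynamodb:...:table/orders/stream/...",  # placeholder stream ARN
    FunctionName="ddb-to-bigquery-sync",  # placeholder function name
    StartingPosition="LATEST",
    BatchSize=100,
    MaximumRetryAttempts=3,
)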


Step 3: Authentication—AWS ↔ Google

Lambda requires a Google service account JSON key. Do not hardcode secrets. Store the service account JSON in AWS Secrets Manager or Systems Manager Parameter Store, and decrypt at runtime:

from google.oauth2 import service_account

creds_json = json.loads(os.environ['GCP_SA_KEY_JSON'])
bq_client = bigquery.Client(credentials=service_account.Credentials.from_service_account_info(creds_json))

The IAM role attached to the Lambda must permit reading the secret from Secrets Manager, and the function needs outbound access to Google APIs: AWS VPC endpoints won't help here, so if the function runs inside a VPC, route its traffic through a NAT gateway.
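
If the key lives in Secrets Manager as suggested above, retrieval at cold start might look like the following sketch (GCP_SA_SECRET_ARN is an assumed environment variable holding the secret's ARN):

import json
import os

import boto3
from google.cloud import bigquery
from google.oauth2 import service_account

# Fetch the service account key once per container, outside the handler.
# GCP_SA_SECRET_ARN is an assumed environment variable name.
secret = boto3.client("secretsmanager").get_secret_value(
    SecretId=os.environ["GCP_SA_SECRET_ARN"]
)
creds = service_account.Credentials.from_service_account_info(
    json.loads(secret["SecretString"])
)
bq_client = bigquery.Client(credentials=creds, project=creds.project_id)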


Step 4: Schema in BigQuery

BigQuery expects strict schemas. DynamoDB, on the other hand, allows sparse attributes and nested documents. Define your BigQuery table with appropriate types—STRING, FLOAT, TIMESTAMP, etc. Example schema:

Field         Type        Mode
id            STRING      REQUIRED
created_at    TIMESTAMP   NULLABLE
total_value   FLOAT       NULLABLE
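
If you manage the table from code, the same schema can be created with the Python client; a minimal sketch (project, dataset, and table IDs below are placeholders):

from google.cloud import bigquery

bq_client = bigquery.Client()

# Mirror the schema above; field names and types must match what Lambda inserts.
schema = [
    bigquery.SchemaField("id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("created_at", "TIMESTAMP", mode="NULLABLE"),
    bigquery.SchemaField("total_value", "FLOAT", mode="NULLABLE"),
]

# "my-project.analytics.transactions" is a placeholder table ID.
table = bigquery.Table("my-project.analytics.transactions", schema=schema)
bq_client.create_table(table, exists_ok=True)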

Gotcha: If you use sets or nested maps in DynamoDB and stream them unmodified, you’ll get BigQuery schema errors (Invalid JSON payload received. Unknown name ...). Flatten complex types in Lambda before insert.
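
One pragmatic approach is to serialize nested maps and lists to JSON strings and land them in a STRING column; a minimal sketch, with the flattening strategy itself an assumption to adapt to your schema:

import json

def ddb_to_plain(v):
    # Convert a DynamoDB-typed attribute value into a plain Python value.
    if 'S' in v:
        return v['S']
    if 'N' in v:
        return float(v['N'])
    if 'BOOL' in v:
        return v['BOOL']
    if 'M' in v:
        return {k: ddb_to_plain(val) for k, val in v['M'].items()}
    if 'L' in v:
        return [ddb_to_plain(item) for item in v['L']]
    return None  # NULL, binary, and set types are not handled in this sketch

def flatten_for_bq(ddb_img):
    # Scalars pass through unchanged; maps and lists are serialized to JSON
    # strings so they fit a STRING column instead of breaking the insert.
    row = {}
    for k, v in ddb_img.items():
        value = ddb_to_plain(v)
        row[k] = json.dumps(value) if isinstance(value, (dict, list)) else value
    return row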


Reliability & Scaling in Practice

  • Batching: Insert up to 500 rows per call for throughput and cost efficiency. Too many parallel inserts may trip quotaExceeded errors in BigQuery (see the quota documentation).

  • Retries: On insert failure, parse the error payloads. For transient BigQuery errors (e.g., rateLimitExceeded), retry with backoff; a minimal batching-and-retry sketch follows this list. For unrecoverable schema issues, send the records to a DLQ (e.g., SQS).

  • Error Handling Example:

    BigQuery insert errors: [{'index': 0, 'errors': [{'reason': 'invalid', 'message': 'Invalid value for ...'}]}]
    
  • Monitoring: Emit Lambda metrics, parse BigQuery insert response for failed rows, and log record IDs. Don’t rely solely on AWS metrics; cross-check with BigQuery streaming usage and error logs.

  • Cold start impact: For sustained high loads, consider a provisioned concurrency Lambda or a lightweight Kinesis consumer (for >1,000 writes/sec).
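
A minimal sketch of the batching-plus-backoff pattern referenced above; the set of error reasons treated as retryable is an assumption to tune:

import time

MAX_ROWS_PER_REQUEST = 500  # stay within BigQuery's recommended batch size
RETRYABLE = {"rateLimitExceeded", "backendError", "internalError"}  # assumed retryable reasons

def insert_with_retry(bq_client, table_id, rows, max_attempts=5):
    # Split rows into batches of at most 500 and retry transient errors
    # with exponential backoff; anything else is raised for DLQ handling.
    for start in range(0, len(rows), MAX_ROWS_PER_REQUEST):
        batch = rows[start:start + MAX_ROWS_PER_REQUEST]
        for attempt in range(max_attempts):
            errors = bq_client.insert_rows_json(table_id, batch)
            if not errors:
                break
            reasons = {e["reason"] for item in errors for e in item["errors"]}
            if reasons <= RETRYABLE and attempt < max_attempts - 1:
                time.sleep(2 ** attempt)  # back off before retrying the batch
            else:
                raise RuntimeError(f"BigQuery insert failed: {errors}")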


Trade-Offs & Alternatives

  • This pattern is robust for moderate write volumes (hundreds per second). For >10k/s sustained or heavy nested structures, using a Dataflow (Apache Beam) pipeline or third-party CDC tools (e.g., Debezium) may be more appropriate.
  • Lambda concurrency and BigQuery streaming quotas can limit throughput. Plan capacity for traffic spikes.

Tip: CI/CD and IaC

Automate deployment using Terraform modules or AWS SAM templates. Parameterize the Lambda environment variables, stream ARN, and secret ARNs. Before rollout, verify that the Google service account holds the required BigQuery roles (for example, by inspecting gcloud projects get-iam-policy output).


Final Thoughts

Real-time DynamoDB-to-BigQuery sync avoids the pitfalls of traditional ETL and enables immediate analytics workflows. While Lambda + streaming inserts covers many use cases, always benchmark end-to-end latency and watch for schema drift as your DynamoDB data model evolves.

Debugging sync failures? Check the Lambda logs for Resource not found: Table errors and review BigQuery streaming insert quotas first. No system is perfect; iterate as your use case grows.