How to Build a Reliable, Real-Time Data Pipeline from Pub/Sub to BigQuery Without Losing a Single Message
Ingesting streaming data reliably into BigQuery is critical for real-time analytics and decision-making. Ensuring zero data loss and maintaining throughput under load distinguishes scalable pipelines from fragile ones.
Simplistic tutorials tend to gloss over failure modes. This guide digs into the robust strategies that matter in production: exactly-once delivery, error handling, backpressure, and monitoring for Pub/Sub to BigQuery pipelines.
Why Pub/Sub to BigQuery Pipelines Are So Important
Google Cloud Pub/Sub is a fully-managed real-time messaging service designed to ingest massive amounts of streaming data. BigQuery, Google’s serverless data warehouse, provides lightning-fast SQL analytics at scale.
Together, they form a backbone for streaming analytics, powering fraud detection, IoT telemetry, user behavior tracking, and much more. But the apparent simplicity of wiring them together is deceptive. Without careful planning:
- Messages can be lost or duplicated.
- Backpressure can overwhelm downstream systems.
- Error handling can become a nightmare.
- Monitoring and alerting become unreliable.
Building a reliable real-time pipeline means guaranteeing exactly-once message delivery from Pub/Sub into BigQuery, handling backpressure gracefully, and quickly surfacing failures for remediation.
Core Challenges with Pub/Sub to BigQuery Pipelines
Before we jump into the "how", let's quickly cover the common pitfalls:
- At-least-once delivery semantics: Pub/Sub inherently guarantees at-least-once delivery, not exactly-once. That means message duplication is possible.
- Idempotency: Your pipeline must tolerate duplicates to prevent data skew.
- Backpressure and scaling: BigQuery streaming inserts have quotas and rate limits; overwhelming them leads to errors.
- Error handling: Transient network or service errors should be retried; poison messages (bad data) should be quarantined without blocking the entire pipeline.
- Monitoring: Visibility into pipeline health and message lag is essential to prevent silent failures.
Step 1: Setting Up Pub/Sub and BigQuery
Pub/Sub Topic and Subscription
Create a Pub/Sub topic to receive your streaming data:
gcloud pubsub topics create my-events-topic
Create a subscription for your pipeline to consume messages:
gcloud pubsub subscriptions create my-events-subscription \
--topic=my-events-topic \
--ack-deadline=60
Setting an ack-deadline higher than the expected processing time reduces premature message redelivery.
BigQuery Table Setup
Prepare a BigQuery dataset and table with a schema matching your event data. Here’s an example schema:
- event_id (STRING) [unique message ID for deduplication]
- event_timestamp (TIMESTAMP)
- user_id (STRING)
- event_type (STRING)
- event_payload (STRING)
Creating a table:
bq mk --table my_dataset.my_events_table \
event_id:STRING,event_timestamp:TIMESTAMP,user_id:STRING,event_type:STRING,event_payload:STRING
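If you prefer creating the table from code rather than the bq CLI, here is an equivalent sketch using the BigQuery Python client (same dataset and table names as above; the project is taken from your default credentials):
from google.cloud import bigquery

client = bigquery.Client()

# Schema mirroring the bq mk command above.
schema = [
    bigquery.SchemaField("event_id", "STRING"),
    bigquery.SchemaField("event_timestamp", "TIMESTAMP"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("event_payload", "STRING"),
]

table = bigquery.Table(f"{client.project}.my_dataset.my_events_table", schema=schema)
client.create_table(table)  # Raises if the table already exists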
Step 2: Achieving Exactly-Once Delivery with Deduplication
Because Pub/Sub can deliver duplicates, you must handle idempotency on the BigQuery side.
- Use a unique event_id (a UUID or hash) generated when the message is produced.
- Pass that value as BigQuery's insertId during streaming inserts; BigQuery uses it for best-effort deduplication of rows that arrive with the same insertId within a short window.
Example: Streaming Insert with insertId
Using the BigQuery client libraries (Python example):
from google.cloud import bigquery
from google.cloud import pubsub_v1
import json

bq_client = bigquery.Client()
dataset_id = 'my_dataset'
table_id = 'my_events_table'

def callback(message):
    data = json.loads(message.data)
    row = {
        "event_id": data['event_id'],
        "event_timestamp": data['event_timestamp'],
        "user_id": data['user_id'],
        "event_type": data['event_type'],
        "event_payload": json.dumps(data['event_payload'])
    }
    errors = bq_client.insert_rows_json(
        f"{dataset_id}.{table_id}",
        [row],
        row_ids=[data['event_id']]  # insertId for deduplication
    )
    if errors:
        print(f"Failed to insert row: {errors}")
        message.nack()  # Let Pub/Sub redeliver
    else:
        message.ack()  # Acknowledge only after a successful insert

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('your-project', 'my-events-subscription')
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
print("Listening for messages...")

# Block the main thread so the streaming pull keeps running.
streaming_pull_future.result()
Key points:
- The row_ids parameter sets the insertId to your unique event_id.
- Only ack a message after the BigQuery insert succeeds.
- Nack messages so failed inserts are retried.
Step 3: Handling Backpressure and Quotas
BigQuery streaming inserts are subject to quotas that depend on region and on whether you populate insertId, and that change over time, so check the current BigQuery documentation for exact figures. Representative limits:
- Roughly 100,000 rows per second per table when insertId is populated.
- Roughly 1,000,000 rows per second per project.
- A maximum number of rows per request; batches of about 500 rows are recommended.
If your pipeline exceeds these quotas, inserts are throttled or rejected.
Strategies:
- Batch inserts: Buffer messages and insert in batches of ~500 rows.
- Flow control in the subscriber: Limit the number of messages being processed concurrently (see the sketch after this list).
- Exponential backoff: Retry transient errors with increasing delays.
- Use Dataflow: For high-volume pipelines, Google Cloud Dataflow manages scaling, retries, checkpoints, and batch vs streaming writes under the hood.
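For the flow-control point above, the Pub/Sub client library can cap how many messages are outstanding at once. A minimal sketch, reusing the callback from Step 2 (the limit of 100 is an illustrative value, not a recommendation):
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('your-project', 'my-events-subscription')

# Cap outstanding (unacked) messages so slow BigQuery inserts apply backpressure
# to Pub/Sub instead of piling up in memory.
flow_control = pubsub_v1.types.FlowControl(max_messages=100)

streaming_pull_future = subscriber.subscribe(
    subscription_path,
    callback=callback,          # callback defined in Step 2
    flow_control=flow_control,
)
streaming_pull_future.result()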
Example: Batch insert with manual ack:
import json

buffer = []        # holds (row, message) pairs awaiting insertion
batch_size = 500   # recommended batch size for streaming inserts

def callback(message):
    data = json.loads(message.data)
    row = {
        "event_id": data['event_id'],
        "event_timestamp": data['event_timestamp'],
        "user_id": data['user_id'],
        "event_type": data['event_type'],
        "event_payload": json.dumps(data['event_payload'])
    }
    # NOTE: the subscriber invokes callbacks from multiple threads,
    # so guard the shared buffer with a lock in production code.
    buffer.append((row, message))
    if len(buffer) >= batch_size:
        rows = [r for r, _ in buffer]
        messages = [m for _, m in buffer]
        errors = bq_client.insert_rows_json(
            f"{dataset_id}.{table_id}",
            rows,
            row_ids=[r['event_id'] for r in rows]  # insertIds for deduplication
        )
        if errors:
            print(f"Batch insert errors: {errors}")
            # Do not ack; Pub/Sub will redeliver these messages for retry.
        else:
            for m in messages:
                m.ack()  # Ack every message in the successful batch
        buffer.clear()

# Implement a timer to flush the buffer every X seconds as well (see the sketch below).
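One way to implement that periodic flush is a background timer. A minimal sketch, assuming the buffer, bq_client, dataset_id, and table_id from the batch example above; in production the callback should append to the buffer under the same lock:
import threading

FLUSH_INTERVAL_SECONDS = 10   # assumed flush interval
buffer_lock = threading.Lock()

def flush_buffer():
    """Insert whatever is buffered and ack the corresponding messages on success."""
    with buffer_lock:
        if not buffer:
            return
        rows = [row for row, _ in buffer]
        messages = [msg for _, msg in buffer]
        errors = bq_client.insert_rows_json(
            f"{dataset_id}.{table_id}",
            rows,
            row_ids=[row["event_id"] for row in rows],
        )
        if errors:
            print(f"Flush errors: {errors}")  # leave messages unacked so they are redelivered
        else:
            for msg in messages:
                msg.ack()
        buffer.clear()

def periodic_flush():
    flush_buffer()
    # Re-arm the timer so partially filled batches never sit indefinitely.
    threading.Timer(FLUSH_INTERVAL_SECONDS, periodic_flush).start()

periodic_flush()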
Step 4: Graceful Error Handling and Poison Message Management
- Transient errors: Retry streaming inserts with exponential backoff (a sketch follows this list).
- Serialization or schema errors: These may be poison messages that can't be processed.
- Dead-letter topics (DLQ): Configure a dead-letter Pub/Sub topic for messages that repeatedly fail.
- Logging and alerts: Log errors with context and send alerts on failure spikes.
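For the transient-error point above, here is a minimal retry sketch with exponential backoff around insert_rows_json. The attempt counts and delays are illustrative assumptions, and google-cloud-bigquery also ships its own retry helpers you can use instead:
import time

def insert_with_backoff(rows, max_attempts=5, base_delay=1.0):
    """Retry streaming inserts with exponentially increasing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            errors = bq_client.insert_rows_json(
                f"{dataset_id}.{table_id}",
                rows,
                row_ids=[row["event_id"] for row in rows],
            )
            if not errors:
                return True  # all rows inserted
            print(f"Attempt {attempt}: row-level errors: {errors}")
        except Exception as exc:  # network / service errors
            print(f"Attempt {attempt}: insert failed: {exc}")
        time.sleep(base_delay * (2 ** (attempt - 1)))  # 1s, 2s, 4s, ...
    return False  # give up; let Pub/Sub redeliver or dead-letter the messages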
Set up a dead-letter topic and configure your subscription:
gcloud pubsub topics create my-events-dead-letter-topic
gcloud pubsub subscriptions update my-events-subscription \
--dead-letter-topic=my-events-dead-letter-topic \
--max-delivery-attempts=5
After 5 failed delivery attempts, messages land in the dead-letter topic for manual inspection. Note that the Pub/Sub service account for your project must be granted publisher permission on the dead-letter topic and subscriber permission on the source subscription, or dead-letter forwarding will not work.
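To inspect dead-lettered messages, attach a subscription to the dead-letter topic and pull from it. A minimal sketch, assuming a hypothetical subscription named my-events-dead-letter-sub on that topic:
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical subscription attached to the dead-letter topic.
dlq_subscription = subscriber.subscription_path('your-project', 'my-events-dead-letter-sub')

# Synchronously pull a handful of failed messages for inspection.
response = subscriber.pull(request={"subscription": dlq_subscription, "max_messages": 10})

for received in response.received_messages:
    print("Dead-lettered payload:", received.message.data)

# Ack only after you have recorded or reprocessed the messages.
ack_ids = [received.ack_id for received in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={"subscription": dlq_subscription, "ack_ids": ack_ids})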
Step 5: Monitoring and Alerting
Keep an eye on:
- Subscription ack latency: High latency means slow processing.
- Pub/Sub errors or throttling.
- BigQuery streaming insert error rates.
- Subscription message backlog.
- Dead-letter topic messages.
Use Google Cloud Console dashboards or build custom alerting with:
- Cloud Monitoring built-in and custom metrics (see the backlog-check sketch below).
- Log-based metrics with alerts on error patterns.
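As a starting point for watching backlog, here is a minimal sketch that reads the built-in num_undelivered_messages metric with the Cloud Monitoring client library. The project ID and the five-minute window are assumptions:
from google.cloud import monitoring_v3
import time

project = "your-project"  # assumed project ID
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 300},  # last 5 minutes
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{project}",
        "filter": (
            'metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" '
            'AND resource.labels.subscription_id="my-events-subscription"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

# Print backlog samples; alert if these climb without recovering.
for series in results:
    for point in series.points:
        print("Undelivered messages:", point.value.int64_value)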
Bonus: Use Dataflow for Simplified, Scalable Pipelines
Google Cloud Dataflow automates many of these concerns:
- Built-in exactly-once processing semantics.
- Auto-scaling to handle load.
- Native BigQuery streaming inserts with deduplication.
- Dead-letter sinks for failed messages.
Here’s a very simple Dataflow pipeline in Python (Apache Beam):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ParseAndFormat(beam.DoFn):
    def process(self, element):
        # Imported inside process() so the DoFn stays picklable on workers.
        import json
        data = json.loads(element)
        yield {
            "event_id": data['event_id'],
            "event_timestamp": data['event_timestamp'],
            "user_id": data['user_id'],
            "event_type": data['event_type'],
            "event_payload": json.dumps(data['event_payload'])
        }

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "Read From PubSub" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/my-events-subscription"
       ).with_output_types(bytes)
     | "Decode" >> beam.Map(lambda x: x.decode("utf-8"))
     | "Parse and format" >> beam.ParDo(ParseAndFormat())
     | "Write to BigQuery" >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.my_events_table',
           schema='event_id:STRING,event_timestamp:TIMESTAMP,user_id:STRING,event_type:STRING,event_payload:STRING',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR')
    )
Final Thoughts
Building a perfectly reliable pipeline from Pub/Sub to BigQuery is challenging but absolutely achievable with a combination of:
- Idempotent writes using BigQuery insertId.
- Proper flow control and batched ingestion.
- Thoughtful error handling with retries and dead-letter topics.
- Continuous monitoring with alerting.
- Optionally offloading complexity to managed services like Dataflow.
This infrastructure supercharges data-driven decision-making by keeping your analytical warehouse in sync with live event streams — no lost messages, no gaps in insight.
If you found this tutorial useful or have questions about implementing these patterns, feel free to reach out or leave a comment below!