Reliable Automated Data Sync: Cloud SQL to BigQuery (No CSV Detours)
Moving transactional data out of Cloud SQL and into BigQuery should not involve manual exports, CSV juggling, or brittle script chains. Yet many teams still do it the hard way out of habit—or because they don’t know how to build streaming, zero-intermediate pipelines with GCP’s managed services.
Below is a pragmatic approach, based on production use with GCP-native tools, to continuously (or on-demand) replicate data from Cloud SQL (MySQL/PostgreSQL) to BigQuery without passing through file dumps or manual intervention.
Problem Framing
Typical ETL workflows for reporting or analytics look like this:
[Cloud SQL] -> (Export CSV) -> [Cloud Storage] -> (Load/import) -> [BigQuery]
This has several well-known pain points:
- Batch latency: hourly or daily loads lag behind the business.
- Manual steps or fragile automation (missed transfers, CSV format drift).
- Storage bloat from intermediate dump files.
- No change tracking; upsert/merge logic gets complicated.
For operational analytics, none of these are acceptable.
Approaches: Pros and Cons
| Method | Strengths | Caveats |
|---|---|---|
| Manual CSV or scheduled SQL export/load | Minimal initial setup | Not scalable, error-prone, slow |
| GCS-based Dataflow batch pipeline | Automatable | Latency, extra temp storage, cost |
| Datastream-based Change Data Capture (CDC) | Near-real-time, robust | Setup/maintenance, permissions, quota |
| BigQuery federated queries (Cloud SQL connection) | Zero data duplication | Live DB read latency, max 100GB, quotas |
Rule of thumb: batch exports may suffice for QA or rarely-changing data. For production analytics, use CDC (Datastream or equivalent).
Architecture 1: Continuous Sync via Datastream + Dataflow
For most production-grade systems, Datastream’s CDC with Dataflow streaming pipelines achieves the required reliability and latency. See the high-level flow:
```
+-----------+  CDC   +------------+  Avro  +-----+        +----------+        +----------+
| Cloud SQL |------->| Datastream |------->| GCS |------->| Dataflow |------->| BigQuery |
+-----------+        +------------+        +-----+        +----------+        +----------+
```
Note: Datastream also offers a direct BigQuery destination; the pattern shown here deliberately lands change events as Avro files in GCS and applies them to BigQuery via Dataflow, which gives more control over how changes are merged and transformed.
Minimum Requirements
- Cloud SQL (MySQL ≥5.7, PostgreSQL ≥13)
- Configure binary logging (for MySQL), wal_level=logical (for PostgreSQL)
- Enable: Datastream API, Dataflow API, BigQuery API
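If the project is new, the APIs listed above can be enabled from the CLI; a minimal sketch (the project ID is a placeholder):

```bash
# Enable the services used by the CDC pipeline (replace the project ID).
gcloud services enable \
    datastream.googleapis.com \
    dataflow.googleapis.com \
    bigquery.googleapis.com \
    --project=my-analytics-project
```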
Step 1: Configure Cloud SQL for CDC
For MySQL:

- Enable binary logging.
- Set `binlog_format=ROW`.
- Ensure sufficient retention (`expire_logs_days`).
- Grant a replication user with access:

```sql
CREATE USER 'datastream'@'%' IDENTIFIED BY 'STRONG_PASSWORD';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'datastream'@'%';
```
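On Cloud SQL for MySQL, these settings are applied through the instance configuration rather than a my.cnf file. A hedged sketch using gcloud (instance name and retention value are assumptions; Cloud SQL uses ROW binlog format by default once binary logging is on):

```bash
# Enable binary logging on the Cloud SQL for MySQL instance.
gcloud sql instances patch my-mysql-instance --enable-bin-log

# Set binlog retention so Datastream can catch up after an outage.
# Note: --database-flags replaces the full flag list, so include any
# flags the instance already has.
gcloud sql instances patch my-mysql-instance \
    --database-flags=expire_logs_days=7
```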
For PostgreSQL:

- Set `wal_level=logical`.
- Adjust `max_replication_slots` and `max_wal_senders`.
- Create a logical replication slot.
Known issue: If you forget to adjust these parameters, Datastream will fail with an error like:
Error: logical replication slot not available (sqlstate 55006)
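On Cloud SQL for PostgreSQL, these parameters are controlled by database flags, and the slot itself is created from a SQL session. A hedged sketch (instance name, database, user, and the slot/publication names are assumptions):

```bash
# Turn on logical decoding (wal_level=logical) and size the replication machinery.
# Note: --database-flags replaces the full flag list, so include existing flags too.
gcloud sql instances patch my-pg-instance \
    --database-flags=cloudsql.logical_decoding=on,max_replication_slots=10,max_wal_senders=10

# Create the publication and logical replication slot that Datastream will read.
psql "host=<instance-ip> dbname=main user=datastream" <<'SQL'
CREATE PUBLICATION datastream_pub FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('datastream_slot', 'pgoutput');
SQL
```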
Step 2: Configure Datastream Stream
- Source: Cloud SQL instance
- Destination: GCS bucket (ideally in the same region as the source instance)
The GCP console wizard (or Terraform, for repeatable infrastructure) walks you through networking, access control, and source/destination configuration.
Datastream writes event files (Avro format) to the GCS prefixes you specify.
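Before wiring up Dataflow, it is worth confirming that events are actually landing under that prefix; a quick check (the bucket name follows the placeholder used in the next step):

```bash
# List a sample of the Avro event files Datastream has written so far.
gsutil ls -r 'gs://<datastream-bucket>/data/**' | head -n 20
```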
Step 3: Dataflow CDC Application
Launch the official Dataflow streaming template to load CDC changes to BigQuery. This requires specifying source GCS, schema mapping (JSON), and target tables.
CLI Example (minimal; the exact template path and parameter names depend on the template version, so check the template documentation before running):

```bash
gcloud dataflow jobs run ds-bq-cdc-prod \
  --gcs-location gs://dataflow-templates/latest/Datastream_to_BigQuery \
  --region=us-central1 \
  --parameters \
inputFilePattern=gs://<datastream-bucket>/data/**,\
inputSchemaPattern=gs://<datastream-bucket>/schema/**,\
gcsTempLocation=gs://<tmp-bucket>/tmp/,\
bigQueryDataset=<target_dataset>
```
- The Dataflow job consumes new Avro events and applies them as INSERT/UPDATE/DELETE operations to the BigQuery tables. Most teams also set up table partitioning and retention, as sketched below.
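Partitioning and retention can be declared up front with BigQuery DDL; a minimal sketch, assuming a `customers` table with a `created_at` timestamp column and a 400-day retention requirement (all assumptions):

```bash
# Create a date-partitioned target table with automatic partition expiry.
bq query --use_legacy_sql=false '
CREATE TABLE IF NOT EXISTS `<target_dataset>.customers`
(
  id INT64,
  name STRING,
  created_at TIMESTAMP
)
PARTITION BY DATE(created_at)
OPTIONS (partition_expiration_days = 400);
'
```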
Tip: Monitor Dataflow logs for errors such as:
BigQuery error in load operation: Provided schema does not match table
This usually stems from an upstream schema change in Cloud SQL; see the schema-drift discussion under "Keeping Schema in Sync" below.
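The same errors surface in Cloud Logging, so a scheduled check can catch them before report consumers do; a sketch (the job name matches the CLI example above, the filter is an assumption):

```bash
# Pull recent error-level log entries for the CDC Dataflow job.
gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_name="ds-bq-cdc-prod" AND severity>=ERROR' \
  --freshness=1h \
  --limit=20 \
  --format='value(timestamp, textPayload)'
```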
Architecture 2: Direct Federated Querying
For ad hoc access, smaller datasets, or pilot projects, connect BigQuery directly to Cloud SQL via an external connection.
Example (a view wrapping `EXTERNAL_QUERY`, which is how BigQuery federates to a Cloud SQL connection):

```sql
CREATE OR REPLACE VIEW `reporting.ad_customer` AS
SELECT id, name, created_at
FROM EXTERNAL_QUERY(
  'my-region.my-sql-conn',
  'SELECT id, name, created_at FROM customers');
```

Then:

```sql
SELECT id, name, created_at FROM `reporting.ad_customer`
WHERE created_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
```
Trade-offs:
- Latency and query limits (each query hits the live database and is subject to Cloud SQL and federation quotas).
- Not ideal for large analytical queries (>100GB/table).
- Useful for quick troubleshooting or prototyping.
When to Use Which
| Use Case | Toolchain |
|---|---|
| Near-real-time analytics, high volume | Datastream + Dataflow |
| Reporting on fresh but smaller datasets | Federated query |
| Simple batch ETL, non-critical | Scheduled export/load |
Note: There are hybrid patterns (e.g., scheduled view materialization) but in practice, most pipelines settle on one of these models.
Keeping Schema in Sync (Non-Obvious Tip)
Schema drift kills CDC jobs. Use Flyway or Liquibase migrations both for Cloud SQL and, critically, for the downstream BigQuery table DDL. Integrate the migration steps into your CI/CD so that upstream table changes are applied before the Dataflow pipeline is redeployed, as sketched below.
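As a concrete example of that ordering, a CI job might apply the Cloud SQL migration, apply the matching BigQuery DDL, and only then drain and relaunch the CDC job; a hedged sketch (connection details, file layout, and migration names are assumptions):

```bash
# 1. Apply the source schema change to Cloud SQL via Flyway.
flyway -url="jdbc:mysql://<cloudsql-ip>/main" -user=flyway -password="$FLYWAY_PASSWORD" \
       -locations=filesystem:migrations/cloudsql migrate

# 2. Apply the matching DDL to the BigQuery target tables.
bq query --use_legacy_sql=false < migrations/bigquery/V42__add_column.sql

# 3. Drain the running CDC job, then relaunch it with the gcloud command shown earlier.
JOB_ID=$(gcloud dataflow jobs list --region=us-central1 --status=active \
           --filter="name=ds-bq-cdc-prod" --format="value(id)")
gcloud dataflow jobs drain "$JOB_ID" --region=us-central1
```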
Optionally, include schema checks in your pipeline to detect and halt on incompatible changes. BigQuery will error on missing columns or type mismatches.
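One way to implement that check is to diff the live BigQuery schema against a schema file committed alongside the migrations and fail the deploy on mismatch; a minimal sketch (paths and table name are assumptions):

```bash
# Fail the deploy if the live table schema has drifted from the expected one.
bq show --schema --format=prettyjson '<target_dataset>.customers' > /tmp/live_schema.json
diff -u expected_schemas/customers.json /tmp/live_schema.json \
  || { echo "Schema drift detected; halting deploy." >&2; exit 1; }
```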
Side Notes and Gaps
- GCP IAM for Dataflow and Datastream often needs extra tuning; grant the minimal roles (`roles/datastream.admin`, `roles/dataflow.worker`) and verify the service accounts.
- Cost can spike if CDC event volume is high; review retention settings on the GCS bucket (a lifecycle sketch follows this list).
- Dataflow pipeline restarts sometimes reapply historical events; design idempotency in downstream BI tools.
- For testing, always validate with a non-critical database. Residual state in GCS or Dataflow can be hard to debug post-failure.
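For the GCS retention point above, a lifecycle rule that deletes already-consumed CDC files after a grace period keeps the bucket from growing without bound; a sketch (the 30-day age is an assumption, size it to your replay needs):

```bash
# Delete Datastream event files older than 30 days.
cat > /tmp/lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30}
    }
  ]
}
EOF
gsutil lifecycle set /tmp/lifecycle.json 'gs://<datastream-bucket>'
```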
References
- Cloud Datastream docs
- Dataflow CDC template
- BigQuery federated query
- Cloud SQL server parameters
- Flyway schema migration
If repeatability and low-latency analytics matter, invest in a CDC pipeline—manual exports are a relic. For quick looks, federated queries suffice. In either case, tune for schema evolution and error handling to avoid late-night surprises.