
Reliable Automated Data Sync: Cloud SQL to BigQuery (No CSV Detours)

Moving transactional data out of Cloud SQL and into BigQuery should not involve manual exports, CSV juggling, or brittle script chains. Yet many teams still do it the hard way out of habit—or because they don’t know how to build streaming, zero-intermediate pipelines with GCP’s managed services.

Below is a pragmatic approach, based on production use with GCP-native tools, to continuously (or on-demand) replicate data from Cloud SQL (MySQL/PostgreSQL) to BigQuery without passing through file dumps or manual intervention.


Problem Framing

Typical ETL workflows for reporting or analytics look like this:

[Cloud SQL] -> (Export CSV) -> [Cloud Storage] -> (Load/import) -> [BigQuery]

This has several well-known pain points:

  • Introduces batch latency (hourly/daily updates lag behind the business).
  • Manual steps or fragile automation (missed transfers, CSV format drift).
  • Storage bloat from intermediate dump files.
  • No change tracking; upsert/merge logic gets complicated.

For operational analytics, none of these are acceptable.


Approaches: Pros and Cons

Method                                            | Strengths               | Caveats
Manual CSV or scheduled SQL export/load           | Minimal initial setup   | Not scalable, error-prone, slow
GCS-based Dataflow batch pipeline                 | Automatable             | Latency, extra temp storage, cost
Datastream-based Change Data Capture (CDC)        | Near-real-time, robust  | Setup/maintenance, permissions, quota
BigQuery federated queries (Cloud SQL connection) | Zero data duplication   | Live DB read latency, max 100GB, quotas

Rule of thumb: batch exports may suffice for QA or rarely-changing data. For production analytics, use CDC (Datastream or equivalent).


Architecture 1: Continuous Sync via Datastream + Dataflow

For most production-grade systems, Datastream CDC combined with a Dataflow streaming pipeline achieves the required reliability and latency. The high-level flow:

+-------------+   (CDC)   +-------------+   (Avro)   +-------------+
|  Cloud SQL  |---------->|  Datastream |----------->|     GCS     |
+-------------+           +-------------+            +-------------+
                                                             |
                                                             v
                                                      +-------------+
                                                      |  Dataflow   |
                                                      +-------------+
                                                             |
                                                             v
                                                      +-------------+
                                                      |  BigQuery   |
                                                      +-------------+

Note: Datastream can also stream directly into BigQuery via its built-in BigQuery destination; the GCS + Dataflow path shown here is the option to reach for when you need custom transformation, merge, or partitioning logic before events land in BigQuery.

Minimum Requirements

  • Cloud SQL (MySQL ≥5.7, PostgreSQL ≥13)
  • Configure binary logging (for MySQL), wal_level=logical (for PostgreSQL)
  • Enable: Datastream API, Dataflow API, BigQuery API
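
If the APIs are not yet enabled on the project, a single command covers all three (assumes gcloud is authenticated and the default project is set):

# Enable the services required by the CDC pipeline.
gcloud services enable \
  datastream.googleapis.com \
  dataflow.googleapis.com \
  bigquery.googleapis.com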

Step 1: Configure Cloud SQL for CDC

For MySQL:

  • Enable binary logging

  • Set binlog_format=ROW

  • Ensure sufficient binlog retention (expire_logs_days on MySQL 5.7, binlog_expire_logs_seconds on MySQL 8.0; see the gcloud sketch below)

  • Grant a replication user with access:

    CREATE USER 'datastream'@'%' IDENTIFIED BY 'STRONG_PASSWORD';
    GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'datastream'@'%';
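
On Cloud SQL, binary logging and the retention flag are set at the instance level rather than inside MySQL. A minimal gcloud sketch; the instance name my-mysql-instance is a placeholder, and note that --database-flags replaces the full flag set, so include any flags you already rely on:

# Enable binary logging and set binlog retention on the Cloud SQL instance.
gcloud sql instances patch my-mysql-instance \
  --enable-bin-log \
  --database-flags=expire_logs_days=7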
    

For PostgreSQL:

  • Enable logical decoding (wal_level=logical; on Cloud SQL this is set via the cloudsql.logical_decoding flag)
  • Adjust max_replication_slots and max_wal_senders
  • Create a logical replication slot and a publication (see the sketch below)
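
On Cloud SQL for PostgreSQL, logical decoding is enabled through an instance flag, while the publication, slot, and replication user are created inside the database. A minimal sketch; the instance, publication, slot, user, and table names are placeholders, and the same caveat about --database-flags replacing the existing flag set applies:

gcloud sql instances patch my-pg-instance \
  --database-flags=cloudsql.logical_decoding=on

Then, inside the source database:

-- List the tables you want Datastream to replicate.
CREATE PUBLICATION datastream_pub FOR TABLE public.customers;
-- Create the slot with the pgoutput logical decoding plugin.
SELECT pg_create_logical_replication_slot('datastream_slot', 'pgoutput');
-- Dedicated user for Datastream with replication and read access.
CREATE USER datastream WITH LOGIN PASSWORD 'STRONG_PASSWORD';
ALTER USER datastream WITH REPLICATION;
GRANT USAGE ON SCHEMA public TO datastream;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datastream;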

Known issue: If you forget to adjust these parameters, Datastream will fail with an error like:

Error: logical replication slot not available (sqlstate 55006)

Step 2: Configure Datastream Stream

  • Source: Cloud SQL instance
  • Destination: GCS bucket (align the bucket region with the source instance)

The GCP console wizard (or Terraform, if you want repeatable infrastructure) walks you through connectivity, access controls, and source/destination configuration.

Datastream writes event files (Avro format) to GCS prefixes you specify.

Step 3: Dataflow CDC Application

Launch the official Datastream-to-BigQuery Dataflow streaming template to apply CDC changes to BigQuery. It requires the source GCS paths, a schema mapping (JSON), and the target tables.

CLI Example (minimal):

gcloud dataflow jobs run ds-bq-cdc-prod \
  --gcs-location=gs://dataflow-templates/latest/Datastream_to_BigQuery \
  --region=us-central1 \
  --parameters="inputFilePattern=gs://<datastream-bucket>/data/**,inputSchemaPattern=gs://<datastream-bucket>/schema/**,gcsTempLocation=gs://<tmp-bucket>/tmp/,bigQueryDataset=<target_dataset>"

The Dataflow job consumes new Avro events and applies them as INSERT/UPDATE/DELETE operations to the BigQuery tables. Most users also set up table partitioning and retention.
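
For example, a partitioned target table with an expiration policy could be declared as follows. This is a sketch that assumes you manage the target DDL yourself; the dataset, table, and column names are placeholders:

CREATE TABLE IF NOT EXISTS target_dataset.customers (
  id INT64,
  name STRING,
  created_at TIMESTAMP
)
-- Partition on the business timestamp and expire old partitions automatically.
PARTITION BY DATE(created_at)
OPTIONS (partition_expiration_days = 90);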

Tip: Monitor Dataflow logs for errors such as:

BigQuery error in load operation: Provided schema does not match table

This usually stems from an upstream schema change in Cloud SQL; see the schema-drift discussion under ‘Keeping Schema in Sync’ below.


Architecture 2: Direct Federated Querying

For ad hoc access, smaller datasets, or pilot projects, connect BigQuery directly to Cloud SQL via an external connection.
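
The external connection itself is a one-time, per-region setup; a bq CLI sketch, where the instance connection name, database, credentials, and connection ID are placeholders:

# Create a Cloud SQL connection resource that BigQuery can query against.
bq mk --connection \
  --connection_type='CLOUD_SQL' \
  --properties='{"instanceId":"my-project:my-region:my-sql-instance","database":"main","type":"MYSQL"}' \
  --connection_credential='{"username":"bq_reader","password":"STRONG_PASSWORD"}' \
  --location=my-region \
  my-sql-conn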

Example (Cloud SQL federation goes through the EXTERNAL_QUERY table function; the inner query runs in the source database's SQL dialect):

CREATE OR REPLACE VIEW `reporting.ad_customer` AS
SELECT *
FROM EXTERNAL_QUERY(
  'my-region.my-sql-conn',
  'SELECT id, name, created_at FROM customers'
);

Then:

SELECT id, name, created_at FROM `reporting.ad_customer`
WHERE created_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

Trade-offs:

  • Latency and query limits (queries hit the live database and are subject to Cloud SQL quotas).
  • Not ideal for large analytical queries (>100GB/table).
  • Useful for quick troubleshooting or prototyping.

When to Use Which

Use Case                                | Toolchain
Near-real-time analytics, high volume   | Datastream + Dataflow
Reporting on fresh but smaller datasets | Federated Query
Simple batch ETL, non-critical          | Scheduled export/load

Note: There are hybrid patterns (e.g., scheduled view materialization) but in practice, most pipelines settle on one of these models.


Keeping Schema in Sync (Non-Obvious Tip)

Schema drift kills CDC jobs. Use Flyway or Liquibase migrations for Cloud SQL and, critically, for the downstream BigQuery table DDL as well. Integrate the migration steps into your CI/CD so that BigQuery DDL changes are applied after the upstream table change and before the Dataflow pipeline is redeployed.

Optionally, include schema checks in your pipeline to detect and halt on incompatible changes. BigQuery will error on missing columns or type mismatches.
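
A lightweight check is to compare the live column list against what your migrations expect before (re)deploying the pipeline; a sketch in BigQuery SQL, with dataset and table names as placeholders:

-- Current schema of the replicated table as BigQuery sees it.
SELECT column_name, data_type
FROM `target_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'customers'
ORDER BY ordinal_position;

Diff the result against the schema tracked in your migrations and fail the CI step on any mismatch.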


Side Notes and Gaps

  • GCP IAM for Dataflow and Datastream often needs extra tuning; grant the minimal roles (roles/datastream.admin, roles/dataflow.worker), and verify service accounts.
  • Cost can spike if CDC event volume is high; review retention settings on GCS buckets.
  • Dataflow pipeline restarts sometimes reapply historical events; design idempotency into downstream models and BI tools (see the deduplication sketch after this list).
  • For testing, always validate with a non-critical database. Residual state in GCS or Dataflow can be hard to debug post-failure.
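
A common idempotency pattern is a deduplicating view over the replica table that keeps only the latest version of each row. A sketch that assumes an id key and an updated_at change timestamp; adjust to whatever ordering column your pipeline actually writes:

CREATE OR REPLACE VIEW `reporting.customers_latest` AS
SELECT *
FROM `target_dataset.customers`
WHERE TRUE  -- BigQuery requires WHERE/GROUP BY/HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
  -- Keep the most recent row per primary key.
  PARTITION BY id
  ORDER BY updated_at DESC
) = 1;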

If repeatability and low-latency analytics matter, invest in a CDC pipeline—manual exports are a relic. For quick looks, federated queries suffice. In either case, tune for schema evolution and error handling to avoid late-night surprises.