BigQuery to Redshift

Reading time: 1 min
#Cloud #Data #AWS #BigQuery #Redshift #DataMigration

BigQuery to Redshift: Pragmatic Data Migration for AWS-Centric Architectures

Shifting analytical workloads from Google BigQuery to Amazon Redshift often comes down to cost modeling, ecosystem integration, or compliance. The theory sounds simple; reality introduces edge cases and migration snags. Here’s how a working engineer handles the process.


When Redshift Outpaces BigQuery

Redshift regularly wins in AWS-native deployments, particularly for consistent workloads, custom distribution, and deep integration with S3, Glue, or downstream lakes. It’s also an easier sell when regulatory teams insist on data residency within specific, approved regions.

Typical drivers:

  • Predictable Usage: Redshift’s reserved-instance or capacity-based pricing undercuts BigQuery’s on-demand, per-query billing at scale.
  • Tight AWS Integration: Direct COPY from S3, Athena federation, and Data Sharing models play well with broader AWS data estate.
  • Manual Tuning: Complex joins or heavy time-series queries benefit from explicit distkeys, sortkeys, and column encodings.

1. Assess and Profile Existing BigQuery Assets

Don’t clone blindly. Identify:

  • Critical tables (by business usage, not just storage size).
  • Update cadences—static vs. streaming, hourly, daily, etc.
  • Data types—watch for unsupported BQ types (e.g., STRUCT, ARRAY).
  • Query access patterns:
    • Example: A 1TB sales_data table filtered by order_date and joined on customer_id—those patterns drive Redshift key choices.

Usually, a snapshot of the BQ INFORMATION_SCHEMA plus a daily access-log review is sufficient. Skipping this step leads to poorly chosen distribution and sort keys, and in turn to performance bottlenecks.
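
Both checks can come from a couple of BigQuery SQL queries; a hedged sketch (the dataset name and region qualifier are placeholders):

-- Columns that will need flattening or remapping before the move
SELECT table_name, column_name, data_type
FROM mydataset.INFORMATION_SCHEMA.COLUMNS
WHERE data_type LIKE 'STRUCT%' OR data_type LIKE 'ARRAY%';

-- Rough access profile over the last 30 days of query jobs
SELECT t.dataset_id, t.table_id, COUNT(*) AS query_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, UNNEST(referenced_tables) AS t
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY t.dataset_id, t.table_id
ORDER BY query_count DESC;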


2. Design the Redshift Schema—Mind the Gaps

Redshift SQL diverges from BigQuery in subtle (and not-so-subtle) ways.

Migrating schema:

  • Flatten nested/repeated data; Redshift tables have no native ARRAY or STRUCT columns. Do this before export, not after, by applying the UNNEST logic to BigQuery views first (see the flattening sketch after this list).
  • Map types:
    • BQ STRING → Redshift VARCHAR(n) or TEXT
    • BQ TIMESTAMP → Redshift TIMESTAMP (sometimes maps cleanly, but double-check time zone semantics)
  • Define DISTKEY and SORTKEY per workload profile; these materially impact join speed and vacuuming.
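
For the flattening step above, a minimal BigQuery sketch (the nested source table sales_data_nested and its repeated items field are hypothetical):

-- Flatten a repeated/nested field into one row per item before export
CREATE OR REPLACE VIEW mydataset.sales_data_flat AS
SELECT
  s.order_id,
  s.customer_id,
  item.product_id,
  s.order_date,
  item.quantity,
  item.amount
FROM mydataset.sales_data_nested AS s,
     UNNEST(s.items) AS item;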

Practical example (Redshift DDL):

CREATE TABLE sales_data (
  order_id VARCHAR(50) NOT NULL,
  customer_id VARCHAR(50) NOT NULL,
  product_id VARCHAR(50),
  order_date DATE,
  quantity INT,
  amount DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY(customer_id)
SORTKEY(order_date);

Note: If the schema in BigQuery evolves dynamically, Redshift DDL will be less forgiving—explicitly manage versioning.


3. Exporting from BigQuery: No Direct Path

No one-click solution exists. Best flows involve exporting to Google Cloud Storage (GCS) in Parquet—avoid CSV unless you need human readability, since you’ll lose compression gains and metadata fidelity.

Parquet export:

bq extract --destination_format=PARQUET \
project:dataset.sales_data \
gs://my-gcs-bucket/sales_data_*.parquet

For large datasets: Batch extracts by date partition or use bq extract wildcards. Expect transient errors if exceeding BQ or GCS write quotas.

Move GCS → S3:
Either Storage Transfer Service (for managed throughput) or, if time-constrained, gsutil + awscli.

gsutil -m cp gs://my-gcs-bucket/sales_data_*.parquet /tmp/bq-export/
aws s3 cp /tmp/bq-export/ s3://my-s3-bucket/sales_data/ --recursive

Known issue: Permissions propagation across org boundaries. Set BQ export files to world-readable if cross-cloud IAM gets tangled, but restrict access ASAP after.

Alternatively, rclone handles multi-threaded sync, preserving timestamps.


4. Redshift: Efficient Data Loading from S3

Redshift’s COPY leverages parallelism. Poorly tuned settings destroy performance.

  1. Set up an IAM Role:
    Grant the Redshift cluster’s role s3:GetObject on the relevant prefix and s3:ListBucket on the bucket, then attach the role to the Redshift cluster configuration.
  2. COPY command:
COPY sales_data
FROM 's3://my-s3-bucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Access'
FORMAT AS PARQUET
REGION 'us-east-1'; -- Region must match S3 bucket

Practical tip:
For Parquet ingest, confirm Redshift node version (SELECT version())—older clusters may need explicit column mapping or fail silently on type mismatches. Review STL_LOAD_ERRORS for diagnostics.
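
A minimal diagnostic query against STL_LOAD_ERRORS (column list trimmed to the usual suspects):

-- Most recent COPY errors, newest first
SELECT starttime, TRIM(filename) AS filename, TRIM(colname) AS colname, err_code, TRIM(err_reason) AS err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;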

Row count mismatches?
Check for invalid rows (bad data, schema drift during migration), not just failures.


5. Validate and Compare: Trust Nothing

Check:

  • Row counts (run SELECT COUNT(*) on both sides).
  • Checksums of key columns (e.g., md5(order_id || amount::varchar); see the SQL sketch after this list).
  • Timestamps and floating-point handling (a known source of discrepancies; timestamp precision and time zone semantics differ between the two engines, so spot-check converted values).
  • Spot-query for outliers/NULL propagation.
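
For the checksum bullet, a hedged Redshift-side sketch (run the analogous aggregate in BigQuery, where MD5 returns BYTES and needs TO_HEX):

-- Order-independent fingerprint over key columns; cast before concatenating
SELECT COUNT(*) AS row_count,
       SUM(STRTOL(LEFT(MD5(order_id || '|' || amount::varchar), 8), 16)) AS checksum
FROM sales_data;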

Small Python sketch for a one-table row-count compare (connection strings are placeholders; adapt to your environment):

from google.cloud import bigquery  # assumes application-default credentials for BQ
import psycopg2  # Redshift speaks the Postgres wire protocol
bq_rows = next(iter(bigquery.Client().query("SELECT COUNT(*) FROM dataset.sales_data").result()))[0]
with psycopg2.connect("host=... dbname=... user=... password=... port=5439") as conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM sales_data")
    rs_rows = cur.fetchone()[0]
print("BigQuery:", bq_rows, "Redshift:", rs_rows, "match:", bq_rows == rs_rows)

Automate these checks; manual inspection is rarely worth it beyond ~1M rows.


6. Post-Migration: Performance Tuning and Cleanup

Redshift exposes substantial optimization knobs post-load:

  • Analyze tables: ANALYZE sales_data;
  • Vacuum (esp. after bulk load): VACUUM FULL sales_data; (but beware, large vacuums can block/queue)
  • Revisit DISTKEY/SORTKEY if queries underperform (trade-off: more disk I/O vs. wasted CPU); the health-check query after this list helps spot skew and unsorted rows.
  • Compression encodings:
ANALYZE COMPRESSION sales_data;
  • Monitor WLM queues and slot allocation—misconfiguration here defeats COPY parallelism.
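
The health check referenced above, as a hedged sketch against SVV_TABLE_INFO:

-- Skew and unsorted fraction per table after the load
SELECT "table", diststyle, sortkey1, tbl_rows, skew_rows, unsorted
FROM svv_table_info
ORDER BY unsorted DESC NULLS LAST;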

Side note:
If moving incrementally, orchestrate DMS or AWS Glue pipelines for delta capture. Streaming is rarely worth the complexity unless near real-time is critical.
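
If deltas land in a staging table, the classic Redshift apply pattern looks roughly like this (sales_data_staging is a hypothetical staging table keyed on order_id):

-- Upsert-style delta apply: delete changed rows, then insert the fresh versions
BEGIN;
DELETE FROM sales_data USING sales_data_staging s WHERE sales_data.order_id = s.order_id;
INSERT INTO sales_data SELECT * FROM sales_data_staging;
COMMIT;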


Practical Migration Details & Caveats

  • Version drift: Redshift syntax changes with revs; confirm COPY ... FORMAT AS PARQUET is valid for your cluster (>= 1.0.14613).
  • Schema drift management: If the source BigQuery schema evolves rapidly, plan for periodic ALTER TABLE scripts on the Redshift side (a minimal sketch follows this list).
  • Backup strategy: Always snapshot datasets before starting. Restoring from S3 is fast; re-extracting from BQ rarely is.
  • Audit trails: Redshift audit logs (system_table: STL_LOAD_COMMITS) can be essential during the cutover window.
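
Sketches for the last two bullets (discount_code is a hypothetical new column; column lists are trimmed):

-- Drift handling: a new nullable column appeared upstream in BigQuery
ALTER TABLE sales_data ADD COLUMN discount_code VARCHAR(20);

-- Cutover audit: which files recent COPY runs actually committed
SELECT query, TRIM(filename) AS filename, lines_scanned, curtime
FROM stl_load_commits
ORDER BY curtime DESC
LIMIT 20;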

Reference Table: Core Steps

Stage         Tool/Method                   Notes
Profiling     BQ console, SQL queries       Export INFORMATION_SCHEMA
Extraction    bq extract, GCS               Parquet, partitioned
Transfer      GCS → S3 sync / rclone        IAM permissions
Load          Redshift COPY from S3         Check cluster version
Validation    SQL, Python checks            Row count, checksums
Tuning        ANALYZE, VACUUM, encoding     Monitor STL logs

Non-obvious tip:
If BigQuery produced partitioned Parquet files, preserve the partition directories in S3 and use Redshift Spectrum during the migration: query the data as external tables before the final ingest. This reduces downtime and lets you validate in place without copying the data twice.
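
A minimal Spectrum sketch (assumes the Glue Data Catalog and the IAM role from the COPY step; schema, database, and bucket names are placeholders):

-- External schema backed by the Glue catalog
CREATE EXTERNAL SCHEMA IF NOT EXISTS bq_staging
FROM DATA CATALOG DATABASE 'bq_staging_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Access'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table over the exported Parquet files (add PARTITIONED BY if the export kept partition directories)
CREATE EXTERNAL TABLE bq_staging.sales_data (
  order_id VARCHAR(50),
  customer_id VARCHAR(50),
  product_id VARCHAR(50),
  order_date DATE,
  quantity INT,
  amount DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://my-s3-bucket/sales_data/';

-- Validate in place before the final ingest
SELECT COUNT(*) FROM bq_staging.sales_data;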


Migrating analytic workloads between clouds is never seamless. BigQuery to Redshift involves design decisions with distinct cost, performance, and operational risk profiles. The process isn’t a single script, but a repeatable set of checks and trade-offs.

Questions around automated schema evolution, real-time migration, or hybrid external-table strategies inevitably arise. There is no one-size-fits-all answer; expect to adapt per use case.

Questions will remain; there are always edge cases the docs don’t cover. Engineers learn fastest by failing fast: duplicate the data, test, and catalog every difference.