BigQuery to Redshift: Pragmatic Data Migration for AWS-Centric Architectures
Shifting analytical workloads from Google BigQuery to Amazon Redshift often comes down to cost modeling, ecosystem integration, or compliance. The theory sounds simple; reality introduces edge cases and migration snags. Here’s how a working engineer handles the process.
When Redshift Outpaces BigQuery
Redshift regularly wins in AWS-native deployments, particularly for consistent workloads, custom distribution, and deep integration with S3, Glue, or downstream lakes. It’s also an easier sell when regulatory teams insist on keeping data resident in specific regions.
Typical drivers:
- Predictable usage: Redshift’s reserved-node or capacity-based pricing undercuts BigQuery’s per-query billing at scale.
- Tight AWS integration: direct COPY from S3, Athena federation, and data sharing play well with the broader AWS data estate.
- Manual tuning: complex joins and heavy time-series queries benefit from explicit distkeys, sortkeys, and column encodings.
1. Assess and Profile Existing BigQuery Assets
Don’t clone blindly. Identify:
- Critical tables (by business usage, not just storage size).
- Update cadences—static vs. streaming, hourly, daily, etc.
- Data types—watch for unsupported BQ types (e.g., STRUCT, ARRAY).
- Query access patterns. Example: a 1TB sales_data table filtered by order_date and joined on customer_id; those patterns drive Redshift key choices.
Usually, a snapshot of the BQ INFORMATION_SCHEMA plus a daily access log review is sufficient. Skipping this step leads to poorly distributed Redshift tables and performance bottlenecks.
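As a starting point, a query along these lines (a sketch; the project, region, and dataset names are placeholders) snapshots table sizes and row counts from BigQuery's INFORMATION_SCHEMA:
-- Sketch: rank tables by logical size and row count to decide what actually moves.
SELECT table_name,
       total_rows,
       total_logical_bytes / POW(1024, 3) AS logical_gb
FROM `myproject.region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
WHERE table_schema = 'analytics'
ORDER BY total_logical_bytes DESC;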
2. Design the Redshift Schema—Mind the Gaps
Redshift SQL diverges from BigQuery in subtle (and not-so-subtle) ways.
When migrating the schema:
- Flatten nested/repeated data: Redshift doesn’t support ARRAY or STRUCT. Do this before export, not after. (Gotcha: the UNNEST logic must be applied in BigQuery views first; a sketch follows this list.)
- Map types:
  - BQ STRING → Redshift VARCHAR(n) or TEXT
  - BQ TIMESTAMP → Redshift TIMESTAMP (usually maps cleanly, but double-check time zone semantics; BigQuery TIMESTAMP values are always UTC)
- Define DISTKEY and SORTKEY per workload profile; these materially impact join speed and vacuuming.
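For the flattening step, a view along these lines works in BigQuery (a sketch: the repeated items field and the sales_data_flat name are hypothetical):
-- Hypothetical flattening view: explode a repeated `items` STRUCT into one row
-- per line item so the export contains only scalar columns Redshift can ingest.
CREATE OR REPLACE VIEW analytics.sales_data_flat AS
SELECT
  o.order_id,
  o.customer_id,
  o.order_date,
  item.product_id,
  item.quantity,
  item.amount
FROM analytics.sales_data AS o,
     UNNEST(o.items) AS item;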
Practical example (target table DDL):
CREATE TABLE sales_data (
order_id VARCHAR(50) NOT NULL,
customer_id VARCHAR(50) NOT NULL,
product_id VARCHAR(50),
order_date DATE,
quantity INT,
amount DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY(customer_id)
SORTKEY(order_date);
Note: If the schema in BigQuery evolves dynamically, Redshift DDL will be less forgiving—explicitly manage versioning.
3. Exporting from BigQuery: No Direct Path
No one-click solution exists. Best flows involve exporting to Google Cloud Storage (GCS) in Parquet—avoid CSV unless you need human readability, since you’ll lose compression gains and metadata fidelity.
Parquet export:
bq extract --destination_format=PARQUET \
project:dataset.sales_data \
gs://my-gcs-bucket/sales_data_*.parquet
For large datasets: batch extracts by date partition or use bq extract wildcards. Expect transient errors if exceeding BQ or GCS write quotas.
Move GCS → S3:
Either Storage Transfer Service (for managed throughput) or, if time-constrained, gsutil + awscli:
gsutil -m cp gs://my-gcs-bucket/sales_data_*.parquet /tmp/bq-export/
aws s3 cp /tmp/bq-export/ s3://my-s3-bucket/sales_data/ --recursive
Known issue: Permissions propagation across org boundaries. Set BQ export files to world-readable if cross-cloud IAM gets tangled, but restrict access ASAP after.
Alternatively, rclone handles multi-threaded sync and preserves timestamps.
4. Redshift: Efficient Data Loading from S3
Redshift’s COPY leverages parallelism. Poorly tuned settings destroy performance.
- Set up an IAM role: grant the Redshift cluster s3:GetObject (plus s3:ListBucket) on the relevant bucket/prefix, and attach the role to the cluster configuration.
- Run the COPY command:
COPY sales_data
FROM 's3://my-s3-bucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Access'
FORMAT AS PARQUET
REGION 'us-east-1'; -- Region must match S3 bucket
Practical tip:
For Parquet ingest, confirm the cluster version (SELECT version()); older clusters may need explicit column mapping or fail silently on type mismatches. Review STL_LOAD_ERRORS for diagnostics.
Row count mismatches?
Check for invalid rows (bad data, schema drift during migration), not just failures.
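A triage query along these lines (a sketch against Redshift's standard system tables) surfaces recent load failures with the offending file, column, and reason:
-- Sketch: most recent COPY errors, newest first.
SELECT starttime, filename, colname, type, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;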
5. Validate and Compare: Trust Nothing
Check:
- Row counts (run SELECT COUNT(*) on both sides).
- Checksums over key columns (e.g., md5(concat(order_id, amount))); a lighter-weight alternative is sketched after this list.
- Timestamps and floating-point handling (a known source of discrepancies; precision and time zone semantics can shift when values round-trip through Parquet).
- Spot-query for outliers and NULL propagation.
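Row-by-row hashing gets expensive at scale; a lighter-weight alternative (a sketch using the sales_data example, runnable with minor dialect tweaks on both engines) is to compare bucketed aggregates:
-- Sketch: per-day counts and sums from BigQuery and Redshift; any bucket that
-- disagrees narrows the hunt for missing or mangled rows.
SELECT order_date,
       COUNT(*)      AS row_count,
       SUM(quantity) AS total_quantity,
       SUM(amount)   AS total_amount
FROM sales_data
GROUP BY order_date
ORDER BY order_date;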
A minimal Python sketch for a one-table compare (connection details and dataset/table names are placeholders):
import psycopg2
from google.cloud import bigquery

# Pull the BigQuery count, then the Redshift count, and compare.
bq_count = list(bigquery.Client().query("SELECT COUNT(*) FROM analytics.sales_data").result())[0][0]
with psycopg2.connect("host=<redshift-endpoint> dbname=analytics user=<user> password=<password>") as conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM sales_data"); print("BigQuery:", bq_count, "Redshift:", cur.fetchone()[0])
Automate these checks; manual inspection is rarely worth it beyond a million rows.
6. Post-Migration: Performance Tuning and Cleanup
Redshift exposes substantial optimization knobs post-load:
- Analyze tables: ANALYZE sales_data;
- Vacuum (especially after bulk loads): VACUUM FULL sales_data; (but beware, large vacuums can block or queue)
- Revisit DISTKEY/SORTKEY if queries underperform; redistributing a table costs disk I/O and CPU now in exchange for faster joins later (a quick health check is sketched after this list).
- Compression encodings: ANALYZE COMPRESSION sales_data;
- Monitor WLM queues and slot allocation; misconfiguration here defeats COPY parallelism.
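One way to sanity-check the physical layout after loading (a sketch, assuming access to Redshift's standard system views):
-- Sketch: distribution skew, unsorted regions, and stale stats for one table.
SELECT "table", diststyle, skew_rows, unsorted, stats_off
FROM svv_table_info
WHERE "table" = 'sales_data';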
Side note:
If moving incrementally, orchestrate DMS or AWS Glue pipelines for delta capture. Streaming is rarely worth the complexity unless near real-time is critical.
Practical Migration Details & Caveats
- Version drift: Redshift syntax changes across releases; confirm COPY ... FORMAT AS PARQUET is valid for your cluster (>= 1.0.14613).
- Schema drift management: if BigQuery is schema-on-read and evolves rapidly, Redshift will need periodic ALTER TABLE scripts.
- Backup strategy: Always snapshot datasets before starting. Restoring from S3 is fast; re-extracting from BQ rarely is.
- Audit trails: Redshift system tables such as STL_LOAD_COMMITS can be essential during the cutover window.
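During cutover, a query like this (a sketch) confirms which files recent loads actually committed:
-- Sketch: most recently committed load files and the lines scanned from each.
SELECT query, filename, curtime, lines_scanned
FROM stl_load_commits
ORDER BY curtime DESC
LIMIT 20;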
Reference Table: Core Steps
Stage | Tool/Method | Notes
---|---|---
Profiling | BQ console, SQL queries | Export INFORMATION_SCHEMA
Extraction | bq extract, GCS | Parquet, partitioned
Transfer | GCS → S3 sync / rclone | IAM permissions
Load | Redshift COPY from S3 | Check cluster version
Validation | SQL, Python checks | Row counts, checksums
Tuning | ANALYZE, VACUUM, encoding | Monitor STL logs
Non-obvious tip:
If BigQuery produced partitioned Parquet files, preserve the partition directories in S3 and use Redshift Spectrum for part of the migration, querying the external tables before final ingest. This can reduce downtime and lets you validate the data in place without copying it into the cluster.
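A sketch of that setup, reusing the earlier (hypothetical) IAM role and bucket layout and assuming a Glue Data Catalog database named bq_staging:
-- Sketch: expose the exported Parquet as an external table and query it in place.
CREATE EXTERNAL SCHEMA bq_staging
FROM DATA CATALOG DATABASE 'bq_staging'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Access'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE bq_staging.sales_data (
    order_id    VARCHAR(50),
    customer_id VARCHAR(50),
    product_id  VARCHAR(50),
    quantity    INT,
    amount      DECIMAL(12,2)
)
PARTITIONED BY (order_date DATE)
STORED AS PARQUET
LOCATION 's3://my-s3-bucket/sales_data/';
-- Partitions then need registering (ALTER TABLE ... ADD PARTITION) before
-- SELECTs against bq_staging.sales_data return data.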
Migrating analytic workloads between clouds is never frictionless. BigQuery to Redshift involves design decisions with distinct cost, performance, and operational risk profiles. The process isn’t a single script; it’s a repeatable set of checks and trade-offs.
Questions around automated schema evolution, real-time migration, or external-table hybrid strategies inevitably arise. There is no one-size-fits-all answer; expect to adapt per use case.
Questions remain? There are always edge cases the docs don’t cover. Engineers learn fastest by failing fast: duplicate the data, test, and catalog every difference.