I have recently migrated 23 Terabytes of BigQuery datasets from the US region to Sydney which includes numerous Google Analytics datasets. Many would think that in this world of driverless cars, space travel and AI powered chatbots that this feat is merely a voice command away — but beware my friends, the road ahead is not short of perils.
Your best bet before embarking on such a journey would be to understand the various options that you have and to understand them well. Some of the options that I’ve consulted included:
- Using Apache Airflow in Cloud Composer as written by David Sabater Dinter of Google.
- Using Cloud Dataflow as outlined by my dear colleague and master magician when it comes to Data and Analytics, Graham Polley.
- Using the BigQuery Data Transfer Service for Cross Region Copy which is still in alpha at the time of writing.
- Using the bq command line tool and some shell scripting mania to perform what options 1 and 2 above does, which is to loop through tables in the dataset, export each table to a Google Cloud Storage bucket in the same region as the source dataset, copy the exported data to another bucket in the desired region, recreating the tables and loading the data back into those tables.
- Using massively parallel programming languages like golang and BigQuery libraries to achieve the concepts of option 4.
The following requirements were carefully considered when selecting the most suitable approach:
- The resulting dataset should have the same name as the original dataset as there are hundreds of Tableau dashboards and BigQuery Saved Queries that depends on these datasets. A name change would require many more days or weeks of effort to reconfigure each dashboard and saved query. Not to mention how boring and monotonous work would be! As such, option 3 above was not ideal because the BigQuery Data Transfer Service for Cross Region Copy would only copy 1000 tables at a time and multiple runs had to be triggered. The copy process would also have to be repeated twice as datasets cannot have the same name even if they reside in different regions. Thus, the first copy attempt will copy the tables into a dataset of a different name before the original dataset can be dropped and the final dataset recreated with the original name and tables copied back into it. It would just take forever to go through the hundred or so datasets.
- There must not be any historical data loss. Each table in the dataset should have a checksum generated before and after the migration to ensure that the integrity of data is still intact. Options 1, 2 and 3 above did not automate the generation of the checksum and would have involved the creation of a custom script to perform the task. The following query was used to generate the checksum of a table:
SELECT BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS checksum
FROM `dataset.table` t
- There must not be more than 24h loss of Google Analytics data. Google Analytics exports data to BigQuery in a batched or streaming mode. In batched mode, intraday batches are performed 3 times a day. Each intraday batch overrides the previous batch entirely. As the Google Analytics to BigQuery export will be disabled prior to migration and re-established after the migration, data will not be lost so long as the link is re-established prior to the last intraday batch. There will however be data loss in streaming mode as deltas are sent in 15 minute intervals.
- All metadata and IAM bindings must be preserved during the migration. Metadata such as table descriptions and labels as well as custom IAM bindings on the dataset level are not copied across by code in options 1 and 2.
- Views, scheduled queries and authorised views must be migrated at the same time. Neither option 1, 2 or 3 above copies views, scheduled queries and authorised views.
Based on the requirements above, I settled on creating a golang program to perform the migration as it checks all the requirements, is highly performant, and is just a pure joy for me to write.
In writing my migration program, the following pitfalls were thoroughly considered. Please note that these considerations may apply to other approaches as well.
- Quota and limits for BigQuery Extract and Load. There is a 10TB extract limit per day across all projects. So if your dataset is larger than 10TB, you may have to either consider breaking the migration into batches, or use BigQuery Storage API.
- Don’t forget about API rate limits especially the general request per second and concurrent requests per user limits! Sometimes I get a bit too trigger happy with parallel processing especially when goroutines make it so simple on golang.
- The extracted file format should also be carefully considered. Tables can be extracted as CSV, AVRO, Newline Delimited JSON. Before deciding on the red or blue pill, study the advantages and disadvantages of each format. I went with the Newline Delimited JSON to preserve data formats and nested records.
- When copying large amounts of exported data across regional buckets, performance is a concern. gsutil even with the `-m` argument and from a VM in the GCP network was not suitable. I leveraged on the Google Cloud Storage Transfer Service for better performance
- Don’t forget about pricing considerations. The following operations should be included into your cost analysis of the data migration activity:
- Executing queries to generate checksums performs full table scans
- Exporting data to Google Cloud Storage buckets incurs storage costs
- Creating backup datasets incurs extra storage costs
- Creating new datasets negates long term storage discounts
- Copying data to buckets across regions incurs network egress charges
- Another kicker is view, and scheduled queries interdependencies. Views and scheduled queries can reference tables and views from other datasets making them interdependent and must be migrated together. To discover the relationships, I used tools like Graphviz. There are also numerous online version of this tool. The order to which these views and scheduled queries are restored is also paramount.
The time has come for you to finally run the finely tuned migration approach that you have designed. You have dotted all the Is and crossed all the Ts. You brewed a large pot of coffee and have slouched into your comfy chair — but WAIT … there is one last piece of advice.
Stagger your migrations. Divide your datasets into groups based on criticality to the business. Migrate datasets that are of lowest importance to the business first before moving onto more complex and riskier datasets that you cannot afford to get wrong. This gives you a chance to learn from the experience and further fine tune the migration approach.
All the best!