Google BigQuery is a modern, cloud-based data warehouse designed to augment the data handling capabilities of Big Data management systems. With very high data storage and processing capacity, it easily eclipses the power of traditional data warehouses for running complex analytical workloads.
When dealing with Big Data, companies are forever playing the catchup game. The combination of velocity and volume makes it difficult to predict future data handling capacity for enterprise IT infrastructure. With over 36% of IT decision-makers facing this reality, it is a real problem. Google realized this early on and thankfully built BigQuery.
In this post, we will explore the unique capabilities of Google BigQuery and identify the best practices for integrating it within the enterprise Big Data workflow.
What is BigQuery?
Initially released in 2010, Google BigQuery is a serverless data warehousing platform. It is highly scalable and can handle data in multiples of petabytes. It is also performant, with a highly parallel architecture that delivers very fast query responses. As a result, it is a superior alternative to traditional data warehouses.
There are essentially four phases to the typical workflow of a Big Data pipeline:
With traditional data warehouses, the processing and analysis phase causes a major bottleneck when the ingested data soars beyond a certain limit. Google BigQuery expedites these phases so that the data is processed with little overhead.
In Google BigQuery, data is organized within a top-level container known as the BigQuery Dataset. Within a BigQuery Dataset, the data is arranged in tables. Data from different tables can be logically combined into views for easier querying.
What public datasets does BigQuery support?
Google BigQuery hosts a few important public datasets, made available through the Google Cloud Public Dataset Program to foster innovation around data.
Some of the notable datasets made available through this program are patents, crime, and COVID-19-related data, and maps. These datasets can be searched in the Google Cloud marketplace and exported to the Google BigQuery console after signing in.
What are the benefits of using BigQuery datasets?
One characteristic of Big Data, apart from volume and velocity, is the variety and veracity of data. Variety is because of different structuring, resulting in structured, semi-structured, and unstructured data, interspersed across data sources. Veracity in Big Data is caused due to anomalies in raw data or inconsistencies in processed data, causing duplicates, errors, or other forms of abnormalities.
A Google BigQuery dataset supports all types of data structuring. Therefore, structured data as tables, and semi-structured or unstructured data in the form of CSV, JSON, and other formats, can be stored in the same dataset, and combined into views and queried together. As a result, data engineers do not have to set up separate pipelines for handling each structural type of data.
Additionally, Google BigQuery provides fine-grained control over a dataset, down to the column and row level. This mechanism ensures a single source of truth. It also alleviates the need for an additional wrapper of tools for data governance around the datasets to trace veracity issues in data.
With the rise of Artificial Intelligence (AI), Big Data pipelines are expected to perform data pre-processing tasks.
Rather than setting up a separate ETL pipeline, Google BigQuery enables data scientists and data analysts to build and operationalize ML models right within the dataset.
What are the key features of BigQuery?
Besides its resilience in handling Big Data, Google BigQuery also offers some significant features that make it worth leveraging.
Google BigQuery natively supports standard SQL. This means that the data engineering teams have a familiar and well-known query language to work with the datasets. The SQL dialect is ANSI 2011 compliant and supports additional constructs for building and working with ML models.
Google BigQuery has built-in support for streaming data analytics. Streaming data can be ingested via BigQuery Streaming API, providing a low-latency, high-throughput access to the datasets. In addition, it also supports third-party streaming services and Pub/Sub messaging platforms. The built-in query acceleration ensures that the ingested streaming data is immediately available for querying in real-time.
Apart from these features, BigQuery also has native support for geospatial data. With this feature, data teams can perform analytics with spatial data to build location intelligence. They can also explore newer ways of presenting analytics reports within the context of geospatial data.
7 Essential Practices for Working with BigQuery Datasets
Google BigQuery is a great choice for developers and DataOps teams. Thanks to its free-tier options, on-demand availability, and flexible pricing, it is quite easy to get started with BigQuery.
However, Google BigQuery is a hosted platform and not the usual open-source tool that someone can spin off in a local environment. Consequently, working with Google BigQuery datasets requires some restraint and discipline.
Here are the seven vital practices to make the most of Google BigQuery datasets, in terms of practicality, performance, and price.
1. Optimize queries for column-based access
Google BigQuery is a columnar database. The data within the datasets are all stored separately in columns, instead of rows. As a result, it is always advisable to run queries with column names, instead of using the standard wildcard ‘*” for selecting all columns. For example, this query returns the publication_number column from the patents.publications dataset:
SELECT publication_number FROM `patents-public-data.patents.publications` LIMIT 10
Whereas, this query returns all the columns, which will make the query response size many times more than the data returned in the former case:
SELECT * FROM `patents-public-data.patents.publications` LIMIT 10
Additionally, dividing the tables in a dataset based on partitions and clusters can reduce the querying time and increase the performance.
It is important to remember that every query sent to Google BigQuery and every query response returned from it gets metered. Hence, care must be taken to prune the queries to limit the columns in the query response. This saves a lot of costs. Similarly, partitioning the table increases the query performance, which saves time.
2. Optimize the queries for Machine Learning (ML)
Google BigQuery datasets support direct machine learning interventions. By leveraging BigQuery ML, a built-in machine learning tool in BigQuery, data scientists can create and train ML models without the need to move data to a separate machine learning environment.
Using the same BigQuery Dataset, the data can be split into training, validation, and test sets, to train the model on one set, tune the hyperparameters on another set, and test the performance of the model on a third set. All of this is possible with custom SQL keywords for building and executing ML models. This has direct time and cost savings.
3. Configure change data capture (CDC)
Google BigQuery supports many options for data ingestion. Batch loading is a suitable choice for exporting a table as a first-time operation. However, for ingesting data as part of subsequent data updates, batch processing is inefficient.
Production data pipelines and real-time analytics jobs are better served through CDC technology. It captures the data updates from the data source, as it happens, with minimum latency.
Google BigQuery supports CDC. It also integrates with third-party data integration solution providers that facilitate better CDC orchestration with multiple data sources.
4. Maximize the analytics outcome
Google BigQuery is purpose-built for analytics. However, certain analytics tasks involve repeated data wrangling operations to access the same data. Much like query performance, the analytics performance must also be optimized for saving time and costs for such repeated operations. Here are a few ways:
- Data schema: All tables must follow a data schema to ensure that data types and indexes are assigned appropriately. This is paramount for unstructured data, which is initially ingested as a table with a STRING data type. Before analytical processing, it is a must to transform such tables containing unstructured or semi-structured data.
- Materialized views: For specific analytics outputs that are accessed frequently, it is better to have materialized views of data. Materialized views are stored as physical subsets of the dataset and are faster to query compared to views.
- Query cache: BigQuery offers a caching feature. All query results are written to a table for instant access on the subsequent trigger of the same query. For analytics queries with very large response data, it pays to tweak the cache configuration for improving the repeat query execution time.
5. Watch over data security
BigQuery Datasets should always be secured for access control using Google Cloud Identity and Access Management (IAM). This is an often overlooked practice when starting out, but must be enforced, especially for securing the data used to train the ML models.
Even otherwise, from a data governance perspective, access control must be always in place such that permissions are granted only to those who need them. Similarly, all users must have access only to the data they need to perform their tasks.
6. Enable data lineage
Data lineage allows DataOps teams to trace data path from ingestion to consumption, recording all the actions taken along the way. In this way, it is possible to check for any alterations or transformations performed on data.
Google BigQuery recently added this feature, which is currently available in preview mode. Once enabled, a “Data Lineage” tab is available on the BigQuery console as a visual depiction. The recommended incorporation is into the audit workflow for easy exploration of data assets’ usage.
7. Always be monitoring the costs
Similar to other hosted platforms, Google BigQuery follows a pay-as-you-go model. Therefore, any increase in storage capacity and query operations runs the risk of cost overruns. Effectively managing the trade-offs between these two factors is the secret to controlling the costs. Accordingly, it is advisable to tune the BigQuery datasets based on a few options:
- Table expiration: Tables within a dataset can automatically be deleted after a certain period. This helps reduce storage costs by removing data that is no longer needed.
- Storage tiers: BigQuery supports the concept of active and long-term storage tiers. The active tier is the default for accessing the dataset tables. The long-term storage tier is a lower-cost storage option for data that is accessed less frequently. This tier is designed for infrequently accessed or queried tables and is optimized for cost savings.
- Flat-rate pricing: Google BigQuery reserves a certain capacity by choosing a flat-rate commitment. This can help reduce the cost of running queries that are run frequently or require a lot of computing resources.
Apart from these options, all the query and analytics optimization practices mentioned above have an indirect contribution to cost savings. However, as an additional measure, Google BigQuery also offers budgets and quota limits configuration to keep a tab on costs.
Bringing BigQuery Closer to Data Sources with Equalum
Google BigQuery is a worthy solution to all the heavy lifting associated with the data pre-processing and ML training. It is also possible to augment its capabilities with in-flight ETL transformations. This approach is particularly helpful when dealing with streaming data.
By leveraging a data integration platform, such as Equalum, that has streaming ETL capabilities, data teams can build federated, real-time data streams from multiple data sources to Google BigQuery datasets. This is made possible with CDC. Additionally, with in-flight ETL, a virtual ETL pipeline can be built to enrich, aggregate, and cleanse the data before it reaches the BigQuery dataset. Equalum also enables you to bulk load to popular cloud data warehouses and lakes, helping you reduce costs while still improving the performance of the load and maintaining low latency. Get a free demo today.
Equalum’s CDC technology seamlessly integrates with DataOps and data engineers’ workflows by providing a low-code, visual interface to design their ETL pipelines. Along with its enterprise-grade reliability, Equalum is an ideal choice for building a future-proof data pipeline with Google BigQuery.