A Little Background on Hadoop
With Hadoop and many related types of large distributed clustered systems, managing hundreds if not thousands of CPUs, cores and disks is a serious system administration challenge for any enterprise, big or small. Cloud-based Hadoop engines like Amazon EMR and Google Hadoop make this a little easier, but these cloud solutions are not ideal for typical long-running data analytics because of the time it takes to set up the virtual instances and spray the data out of S3 and into the virtual data nodes. Then you have to tear everything down after you are done with your MapReduce/HDFS instances to avoid paying big dollars for long-running VMs, and copy your data back out of HDFS and into S3 before your ephemeral data nodes are shut down - not ideal for any serious Big Data analytics. Then there is the fact that Hadoop and MapReduce are batch oriented and thus not ideal for real-time analytics. So while we have taken many steps forward in technology evolution, the system administration challenges of managing large Hadoop clusters are still a problem, and cloud-based Hadoop has the limitations and restrictions already mentioned. In its current form, cloud-based Hadoop solutions are too expensive for long-running cluster processing and not ideal for long-term distributed data storage. Not to mention the fact that virtualization and Hadoop are not a great fit just yet given the current state of virtualization and public cloud hardware and software technology - but that is a separate discussion.
The BigQuery Alternative
So if I want to build a serious enterprise-scale Big Data Warehouse, it sounds like I have to build it myself and manage it on my own. Now enter into the picture Google BigQuery and Dremel. BigQuery is a serious game changer in a number of ways. First, it truly pushes big data into the clouds and, even more importantly, it pushes the system administration of the cluster (basically a multi-tenant Google super cluster) into the clouds and leaves this type of admin work to people (like Google) who are very good at this sort of thing. Second, it is truly multi-tenant from the ground up, so efficient utilization of system resources is greatly improved, something Hadoop is currently weak at.
Put your Data Warehouse in the Cloud
So now, given all this, what if you could build your data warehouse and analytics engine in the clouds with BigQuery? BigQuery gives you massive data storage to house your data sets and a powerful SQL-like query language, Dremel, for building your analytics and reports. Think of BigQuery as one of your datamarts, where you can store both fast and slow changing dimensions of your data warehouse in BigQuery's cloud storage tables. Then, using Dremel, you can build near real-time and complex analytical queries and run all this against terabytes of data. And all of this is available to you without buying or managing any Big Data hardware clusters!
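For example, an analytical report over billions of rows is just a Dremel query. Here is a minimal sketch, assuming a hypothetical sales table with Region, Revenue and SaleDate columns:
SELECT Region, SUM(Revenue) AS TotalRevenue FROM MySalesTable WHERE SaleDate >= DATE_OF_INTEREST GROUP BY Region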
Modeling Your Data
In a classical Data Warehouse (DW), you organize your schema around a set of fact tables and dimension tables using some sort of snowflake schema or perhaps a simplified star schema. This is what is typically done for RDBMS-based data warehouses. But for anyone who has worked with HDFS, HBase and other columnar or NoSQL data stores, this relational model of a DW no longer applies. Modeling a DW in a NoSQL or columnar data store requires a different approach, and this is what is needed when modeling your DW in BigQuery's data tables.
Slow Changing Dimensions
Slow Changing Dimensions (SCD) are straightforward to implement with a BigQuery data warehouse, since in a SCD model you are typically inserting new records into your DW each time. SCD models are common when you are creating periodic, fixed point-in-time snapshots from your operational data stores. For example, quarterly sales data is always inserted into the DW tables with some kind of time stamp or date dimension. With a BigQuery data store you would put each record into each BigQuery table with a date/time stamp. So your ETL would look something like this:
Nothing special here with this ETL diagram other than the data is moving between your enterprise and the Google Cloud. The output of the ETL is directed to BigQuery for storage in one or more BigQuery tables (note this can be staged via Google Cloud Storage). But keep in mind that when creating a Big Data Warehouse, you are typically storing your data in a NoSQL, columnar or HDFS type data store, and thus you don't have a full RDBMS and all the related SQL join capability, so you must typically design your schemas to be much more denormalized than what is normally done in a DW. BigQuery, however, is a hybrid type data store, so it does allow for joins and provides rich aggregate functions. How you model the time dimension is of particular importance - more on this later. So your schema for a SCD table might look something like this:
Key(s)... | Columns... | EffectiveDate
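For example, a quarterly sales SCD table might use concrete columns like these (the column names are hypothetical):
ProductKey | Region | Revenue | EffectiveDate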
The time dimension in this case is collapsed directly into what would normally be your fact table, and you would want, as much as possible, to denormalize the tables so your queries require minimal joins. As noted, Dremel allows for joins but requires that at least one of the tables in the join be "small", where small means less than 8MB of compressed data. So now, in Dremel's SQL language, to select a specific record for a particular point in time, you would simply perform a normal-looking SQL statement such as this:
SELECT Column1 FROM MyTable WHERE EffectiveDate=DATE_OF_INTEREST
This query will select a record at a known date. With this approach, you can, for example, query quarterly sales data where you know the records must exist for that particular date. But what if you want the most "current" record as of any given point in time? This is actually something Dremel and BigQuery excel at, because they give you SQL functionality, such as subselects, that is not typically found in NoSQL type storage engines. The query would look like this:
SELECT Column1 FROM MyTable WHERE EffectiveDate = (SELECT MAX(EffectiveDate) FROM MyTable WHERE EffectiveDate <= DATE_OF_INTEREST)
A query like this can sometimes be considered bad practice in a standard RDBMS (especially for very large tables) because of the performance cost of the subselect. With Dremel, however, this is not a problem, given the way Dremel queries scale out and the fact that they do not rely on indexes.
Fast Changing Dimensions
Fast Changing Dimensions (FCD) require a bit more effort to create in a typical DW, and this is no different with BigQuery. In a FCD, you are often capturing frequent or near real-time changes from your operational data stores and, through your ETL, moving the new data into your DW. Your ETL engine must normally decide when to insert a new fact or time dimension record, and this often involves "terminating" the previously current record in the lineage of a record's history set. But by leveraging the power of Dremel, FCD can be supported in BigQuery by just inserting a new record whenever the on-premises ETL engine detects a change, without terminating existing current records. And because you can perform the effective-date-based subselect noted above, there is no longer any reason to maintain effective/termination date fields for each record. You only need the effective date.
This makes the FCD schema model stored in BigQuery identical to the SCD model for managing the time dimension. However, there is a catch: the ETL process must maintain a "Staging DW" of the records that exist on the BigQuery side. This Staging DW only holds the most current records of the tables that exist in BigQuery, which keeps it lean so that it will not grow larger over time.
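Because only the effective date is stored, the current version of each record can still be retrieved with a single query. Here is a sketch, assuming a hypothetical CustomerKey business key column (and subject to the Dremel join size note above):
SELECT t.CustomerKey, t.Column1, t.EffectiveDate
FROM MyTable t
JOIN (SELECT CustomerKey, MAX(EffectiveDate) AS MaxEffectiveDate FROM MyTable GROUP BY CustomerKey) latest
ON t.CustomerKey = latest.CustomerKey AND t.EffectiveDate = latest.MaxEffectiveDate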
So with this model your ETL will only send changes to the Google Cloud. This overall approach for FCD is useful for modeling ERP type data, for example, where records have effective and termination dates and where tracking changes is critical. Here is a diagram of the FCD ETL flow:
Note, in the case of a FCD model that is not ERP-centric (i.e. the data model does not depend on effective/termination date semantics), the Staging DW is not required. This is typically the case when you are just blasting high-volume, loosely structured data into BigQuery, such as log events or other timestamped action/event data. In this case, you don't have to detect changes and can just send the data to BigQuery for storage as it comes in.
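Reporting on this kind of event data is then just another Dremel query. A minimal sketch, assuming a hypothetical EventLog table with EventType and EventTimestamp columns:
SELECT EventType, COUNT(*) AS EventCount FROM EventLog WHERE EventTimestamp >= TIMESTAMP_OF_INTEREST GROUP BY EventType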
Put your Data Warehouse in the Cloud
At Grand Logic we offer a powerful new way to build and augment your internal data warehouse with a BigQuery datamart in the Google cloud. Leveraging our real-time and batch capable ETL engines, we can move your fast or slow moving dimensional data into unlimited-capacity BigQuery tables and let you run real-time Dremel SQL queries for rich reporting that will scale. And you can do all this with little upfront cost and infrastructure compared to managing your own HDFS and HBase cluster in Hadoop, for example.
With our flagship automation and ETL engine, JobServer, we can help you build a powerful data warehouse with rich analytics in the Google cloud, with little upfront investment, that will scale to massive levels. Pay as you go, with full control over your data and your reporting.
Stay tuned to this blog for more details on how Grand Logic can help you build your Data Warehouse in the clouds. We will be discussing more details of our JobServer product and how our consulting services can get you going with BigQuery.
Contact us to learn how our JobServer product can help you scale your ETL and Data Warehousing into the cloud.