Wednesday, September 26, 2012

BigQuery: Data Warehouse in the Clouds

There are a lot of changes occurring these days with the Big Data revolution: cloud computing, NoSQL, columnar stores, and virtualization, to mention just a few of the fast moving technologies that are transforming how we manage our data and run our IT operations. Big Data, powered by technologies such as Hadoop and NoSQL, is changing how many enterprises manage their data warehousing and scale their analytics reporting. Storing terabytes of data, and even petabytes, is now within the reach of any enterprise that can afford to spend the money on potentially hundreds or thousands of commodity cores and disks to run parallel and distributed processing engines like MapReduce. But is Hadoop the right fit for everyone? Are there alternatives, especially for those that want more real-time big data analytics? Read on.

A Little Background on Hadoop

With Hadoop and many related types of large distributed clustered systems, managing hundreds if not thousands of CPUs, cores and disks is a serious system administration challenge for any enterprise, big or small. Cloud based Hadoop engines like Amazon EMR and Google Hadoop make this a little easier, but these cloud solutions are not ideal for typical long-running data analytics because of the time it takes to set up the virtual instances and spray the data out of S3 and into the virtual data nodes. And then you have to tear everything down after you are done with your MapReduce/HDFS instances to avoid paying big dollars for long running VMs. Not to mention that you have to copy your data back out of HDFS and into S3 before your ephemeral data nodes are shut down - not ideal for any serious Big Data analytics.

Then there is the fact that Hadoop and MapReduce are batch oriented and thus not ideal for real-time analytics. So while we have taken many steps forward in technology evolution, the system administration challenges of managing large Hadoop clusters are still a problem, and cloud based Hadoop has many limitations and restrictions as already mentioned. In their current form, cloud based Hadoop solutions are too expensive for long running cluster processing and not ideal for long-term distributed data storage. Not to mention the fact that virtualization and Hadoop are not a great fit just yet given the current state of virtualization and public cloud hardware and software technology - but that is a separate discussion.

The BigQuery Alternative

So if I want to build a serious enterprise scale Big Data Warehouse, it sounds like I have to build it myself and manage it on my own. Now enter into the picture Google BigQuery and Dremel. BigQuery is a serious game changer in a number of ways. First, it truly pushes big data into the clouds and, even more importantly, it pushes the system administration of the cluster (basically a multi-tenant Google super cluster) into the clouds, leaving that type of admin work to people (like Google) who are very good at this sort of thing. Second, it is truly multi-tenant from the ground up, so efficient utilization of system resources is greatly improved, something Hadoop is currently weak at.

Put your Data Warehouse in the Cloud

So now, given all this, what if you could build your data warehouse and analytics engine in the clouds with BigQuery? BigQuery gives you massive data storage to house your data sets and a powerful SQL-like query language, Dremel, for building your analytics and reports. Think of BigQuery as one of your datamarts, where you can store both fast and slow changing dimensions of your data warehouse in BigQuery's cloud storage tables. Then, using Dremel, you can build near real-time and complex analytical queries and run all of this against terabytes of data. And all of this is available to you without buying or managing any Big Data hardware clusters!
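
To give a flavor of what this looks like, here is a minimal, hypothetical Dremel-style query against an illustrative sales datamart table. The dataset, table and column names are made up for this sketch, not part of any real schema:

-- Hypothetical example: total sales by region for a single quarter.
-- mydataset.sales_facts and its columns are illustrative names only.
SELECT region, SUM(sales_amount) AS total_sales
FROM mydataset.sales_facts
WHERE quarter = '2012-Q2'
GROUP BY region
ORDER BY total_sales DESC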

Modeling Your Data

In a classical Data Warehouse (DW), you organize your schema around a set of fact tables and dimension tables using some sort of snowflake schema or perhaps a simplified star schema. This is what is typically done for RDBMS based data warehouses. But for anyone who has worked with HDFS, HBase and other columnar or NoSQL data stores, this relational model of a DW no longer applies. Modeling a DW in a NoSQL or columnar data store requires a different approach. And this is what is needed when modeling your DW in BigQuery's data tables.

Slow Changing Dimensions

Slow Changing Dimensions (SCD) are straightforward to implement in a BigQuery data warehouse, since in a SCD model you are typically inserting new records into your DW each time. SCD models are common when you are creating periodic, fixed point-in-time snapshots from your operational data stores. For example, quarterly sales data is always inserted into the DW tables with some kind of time stamp or date dimension. With a BigQuery data store you would put each record into the appropriate BigQuery table with a date/time stamp. So your ETL would look something like this:


Nothing special here with this ETL diagram, other than that the data is moving from your enterprise to the Google Cloud. The output of the ETL is directed to BigQuery for storage in one or more BigQuery tables (note that this can be staged via Google Cloud Storage). But keep in mind that when creating a Big Data Warehouse, you are typically storing your data in a NoSQL, columnar or HDFS type data store, and thus you don't have a full RDBMS and all the related SQL join capability, so you typically must design your schemas to be much more denormalized than what is normally done in a DW. BigQuery, however, is a hybrid type data store, so it does allow for joins and provides rich aggregate functions. How you model the time dimension is of particular importance - more on this later. So your schema for a SCD table might look something like this:

Key(s)... | Columns... | EffectiveDate

The time dimension in this case is collapsed directly into what would normally be your fact table, and you would want, as much as possible, to denormalize the tables so your queries require minimal joins. As noted, Dremel allows for joins but requires that at least one of the tables in the join be "small", where small means less than 8MB of compressed data.
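
For example, here is a hypothetical sketch of such a join, where a large denormalized fact table is joined against a small product dimension table that stays under that compressed size limit (all table and column names are illustrative only):

-- Hypothetical sketch: large fact table joined to a small dimension table.
-- The smaller side of the join must stay under ~8MB compressed.
SELECT f.ProductId, d.ProductName, SUM(f.SalesAmount) AS TotalSales
FROM mydataset.SalesFacts f
JOIN mydataset.ProductDim d ON f.ProductId = d.ProductId
WHERE f.EffectiveDate = '2012-06-30'
GROUP BY f.ProductId, d.ProductName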

So now, in Dremel's SQL language, to select a specific record for a particular point in time, you would simply write a normal looking SQL statement such as this:

SELECT Column1 FROM MyTable WHERE EffectiveDate=DATE_OF_INTEREST

This query will select a record at a known date. With this approach you can, for example, query for quarterly sales data where you know the records must exist for that particular date. But what if you want the most "current" record as of any given point in time? This is actually something Dremel and BigQuery excel at, because they give you SQL functionality, such as subselects, that is not typically found in NoSQL type storage engines. The query would look like this:

SELECT Column1 FROM MyTable WHERE EffectiveDate = (SELECT MAX(EffectiveDate) FROM MyTable WHERE EffectiveDate <= DATE_OF_INTEREST)

This kind of query can be considered bad practice in a standard RDBMS (especially for very large tables) because of the performance cost of the subselect. With Dremel, however, this is not a problem, given the way Dremel queries scale out and the fact that they do not rely on indexes.
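
If each key carries its own history, the same idea can be extended to pull the most current record per key by joining the table back against a grouped subselect. This is only a hypothetical sketch (names like MyTable and CustomerId are made up), and it assumes the grouped result is small enough to satisfy the join size limit noted above:

-- Hypothetical sketch: most current record per CustomerId as of a given date.
-- DATE_OF_INTEREST is a placeholder for the point in time you care about.
SELECT t.CustomerId, t.Column1, t.EffectiveDate
FROM mydataset.MyTable t
JOIN (
  SELECT CustomerId, MAX(EffectiveDate) AS MaxDate
  FROM mydataset.MyTable
  WHERE EffectiveDate <= DATE_OF_INTEREST
  GROUP BY CustomerId
) latest
  ON t.CustomerId = latest.CustomerId AND t.EffectiveDate = latest.MaxDate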

Fast Changing Dimensions

Fast Changing Dimensions (FCD) require a bit more effort to create in a typical DW, and this is no different with BigQuery. In a FCD, you are often capturing frequent or near real-time changes from your operational data stores and moving the new data into your DW through your ETL. Your ETL engine must normally keep track of when to insert a new fact or time dimension record, and it often involves "terminating" the previously current record in the lineage of a record's history. But by leveraging the power of Dremel, FCD can be supported in BigQuery by just inserting a new record whenever the on-premises ETL engine detects a change, without terminating existing current records. And because you can perform the effective date based subselect noted above, there is no reason to maintain effective/termination date fields for each record. You only need the effective date.

This makes the FCD schema model, stored in BigQuery, identical to the SCD model for managing the time dimension; however, there is a catch. The ETL process must maintain a "Staging DW" of the records that exist on the BigQuery side. This Staging DW only holds the most current version of each record in your BigQuery table, which keeps it lean so it will not grow larger over time.
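
To make the idea concrete, here is a hypothetical sketch of the change detection step, expressed as standard SQL run against the on-premises staging database. The incoming_batch and staging_current tables and their columns are illustrative names only, not part of any particular product:

-- Hypothetical sketch: select only new or changed rows for shipment to BigQuery.
SELECT i.CustomerId, i.Column1, i.ExtractDate AS EffectiveDate
FROM incoming_batch i
LEFT JOIN staging_current s ON i.CustomerId = s.CustomerId
WHERE s.CustomerId IS NULL    -- brand new record
   OR i.Column1 <> s.Column1  -- value changed since the last load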

So with this model your ETL will only send changes to the Google Cloud. This overall approach to FCD is useful for modeling ERP type data, for example, where records have effective and termination dates and where tracking changes is critical. Here is a diagram of the FCD ETL flow:



Note that for an FCD model that is not ERP centric (i.e., the data model does not depend on effective/termination date semantics), the Staging DW is not required. This is typically the case when you are just blasting high volume, loosely structured data into BigQuery, such as log events or other timestamped action/event data. In this case, you don't have to detect changes and can just send the data to BigQuery for storage as it comes in.
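
For this append-only case, reporting usually boils down to simple rollups over the timestamped events. Here is a hypothetical sketch, where mydataset.app_events and its columns are illustrative names and the event date is assumed to be stored as a YYYY-MM-DD string:

-- Hypothetical sketch: daily event counts by type over an append-only event table.
SELECT event_date, event_type, COUNT(*) AS event_count
FROM mydataset.app_events
WHERE event_date >= '2012-09-01'
GROUP BY event_date, event_type
ORDER BY event_date, event_type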

Put your Data Warehouse in the Cloud

At Grand Logic we offer a powerful new way to build and augment your internal data warehouse with a BigQuery datamart in the Google cloud. Leveraging our real-time and batch capable ETL engines, we can move your fast or slow moving dimensional data into unlimited capacity BigQuery tables and allow you to run real-time Dremel SQL queries for rich reporting that will scale. And you can do all this with little upfront cost and infrastructure compared to managing your own HDFS and HBase cluster in Hadoop, for example.

With our flagship automation and ETL engine, JobServer, we can help you build a powerful data warehouse with rich analytics in the Google cloud, with little upfront investment, that will scale to massive levels. Pay as you go, with full control over your data and your reporting.

Stay tuned to this blog for more details on how Grand Logic can help you build your Data Warehouse in the clouds. We will be discussing more details of our JobServer product and how our consulting services can get you going with BigQuery.

Contact us to learn how our JobServer product can help you scale your ETL and Data Warehousing into the cloud.

Tuesday, September 18, 2012

The Big Data Evolution Will Continue - No Kidding

Big Data is very much about discovering information locked in the mountains of data that come out of your production center, IT operations, enterprise systems, and back office databases. Information is in the eye of the beholder, so one person's junk is another person's gold. These days, with the volumes of social data and device data growing at astronomical levels, there is a lot of data to sift through and make sense of.

While it is true that the more data you can capture, the more information you can potentially discover, there is a limit to this. I think we are going through a cycle where capturing and trying to make sense of vast volumes of data (social data, sensor data, etc.) is becoming more economical and somewhat mainstream with respect to technology and tools. However, I believe this is cyclical; at some point businesses will realize that maybe they are getting diminishing returns on all this data they are capturing and storing. For example, will I really care, 20 years from now, what I tweeted 20 years ago? I probably will never have the time to go back and look at it, and I am not sure it is valuable to any marketing person (but who knows).

There is definitely gold to be mined in many data sets that now go untapped, and technologies like Hadoop, BigQuery, and Storm, to name a few, are good tools to use, but not everything fits into the Big Data tent either.

There has been a lot of hype around Big Data these days, and I see a lot of people trying to fit problems into Hadoop that really have no business being shoehorned into it, other than it being the cool thing to do. You could do the data crunching in easier ways, for example. However, the tool sets are expanding to give developers, scientists and business people more options when deciding how to store and analyze their data.

When thinking of Big Data, first ask yourself the following questions:

1) How much data do I want to capture and store (do I need to persist detailed records/data)?
2) How fast is this data being created (velocity)?
3) How long do I want to keep it (forever)?
4) How long am I willing to wait to get "information" when I run my analysis (batch/hourly/daily or real-time)?
5) What will it cost me to keep all this data around, and do I have the system admin muscle to do this?

Answering these will help you determine which of the emerging Big Data technology buckets your problem best fits into and which approach to take (cloud cluster, on-premises cluster, etc.).