Cloud Analytics & ML with Sam Taha

ETL is a common computing paradigm used in a variety of data movement and data management scenarios. As demand for more insight into business data as grown, ETL has been used to move more data from operational data stores into OLAP and data warehousing environments. This has expanded the need for analytics and other solutions that rely on data being reconstituted into easier to consume forms or data models more efficient to solve specific problems.

So nothing special going on here, but as data volumes have grown and sources of data have exploded, the transformation part of ETL (the "T") is becoming more of a challenge, especially as organizations demand more near real-time analytics and up to date information. Transforming the volumes of operational data is becoming a computing bottleneck and often limits what you can do with data after it has been transformed and loaded into downstream data marts. See a typical ETL data flow diagram below.

Big Data to the Rescue
With the evolution of big data and Hadoop, new tools have been brought to bear that can provide help in the overall ETL computing process. However, with Hadoop, the ETL model needs to be revisited. Hadoop can bring tremendous computing resources to more efficiently transform data into target models. While Hadoop can serve as part of your overall processing fabric and can be leverage directly for OLAP and itself be used for data warehousing (e.g. HBase data store), it can also serve as a intermediate staging area that can be used to populate traditional relational data marts.

Using Hadoop in this way allows it to be used as an intermediate store for data until it can later be transformed into target models. We can accomplish this "load first" approach using Hadoop, by changing the ETL model around a bit. Instead of extracting and transforming data first, we can instead extract and load data into Hadoop storage, for staging, and then take full advantage of the Hadoop compute infrastructure to transform (using Map Reduce, Impala, Drill…etc) the data into target models that can feed traditional relational data marts and OLAP engines. See diagram for example:

Hadoop for Transformation
This essentially allows organizations to use Hadoop as the transformation platform that allows developers to perform more complex transformations that were not practical in the normal ETL universe. So think of Hadoop as the new super charged "T" in the "ELT" paradigm, where data is moved as efficiently as possible from operational stores and loaded ("L") into HDFS (and HBASE or Cassandra) as fast as possible. Then the "T" can be performed within the Hadoop ecosystem. This allows Hadoop to be a powerful intermediary layer that can drive new analytics and allow existing analytics to keep up with the deluge of data. This also allows existing OLAP and data warehouses to continue to consume data out of Hadoop for existing analytics.

So let us start getting used to the concept of "ELT" as the new big data cousin of ETL. Hadoop is more than just a historical archive or dumping ground for unstructured data. It can be a powerful transform computing layer that can drive better data warehousing for new and existing analytics solutions.

Grand Logic delivers software solutions that automate your business processes and tame your IT operations & Big Data analytics. Grand Logic delivers data and job automation software, Hadoop consulting services and that maximize your Big Data investment.

Tuesday, May 14, 2013

Hadoop the New 'T' in ETL