Monday, February 29, 2016

Essbase Analytics with Tableau, Cassandra and Spark

Using Essbase? Looking to get some of that financial, accounting, sales, and marketing data locked in your Essbase cube into something more accessible? Essbase is a very powerful platform, but it was built quite a while back (in tech time), when multi-dimensional modeling and domain-specific languages like MDX were a new frontier for data modeling and analytics.

Essbase was also built when requirements for analytics, reporting and visualization were much more constrained, and expectations for realtime were not as demanding as they are now (not to mention data volumes). Many organizations use Essbase for critical business functions, so streamlining the path to quicker decision making and more robust what-if analysis is critical to staying competitive and optimizing the operational performance of your business.

Oracle Essbase has a number of supporting tools for reporting and business intelligence that give business analysts and developers ways to visualize and drill down into the data within the cube. But with the evolution of Big Data and modern analytical and visualization tools, wouldn't you like to make the data locked up in your Essbase cube accessible to technologies such as Tableau for rich and rapid visualization? Or to have your terabytes of cube data in Cassandra, available to Apache Spark for powerful Big Data style ETL, machine learning, and mashing up and correlating with other data sources?

Well, while there is no easy out-of-the-box solution to accomplish all of this, the dream of turning your Essbase cube into another data lake in your Big Data ocean, available for rich analytics, predictive modeling and visualization, is achievable with a little work.

Let's describe how you can do this. The first step, and probably the most difficult, is getting your data out of Essbase. There are a number of ways to access data in Essbase, and it starts with understanding what "information" you want to extract. You typically don't want to directly extract the raw data in the Essbase cube (though you could do that as well). Such data is often too granular (one of the reasons it is in a cube), so you might need to perform some aggregations (across the dimensions) and apply some business logic as you extract it. This is an ETL step that more or less denormalizes the data out of the cube, flattens it into a format that will be ideal for Tableau further downstream, and applies the business logic needed to turn the data into consumable information. Tableau is ideal at consuming such "flattened" information, given how it extracts dimensionality out of denormalized input.
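To make the flattening idea concrete, here is a minimal, illustrative sketch in plain Python of denormalizing cube-style cells (dimension member tuples mapped to measure values) into flat rows. The dimension names and data are hypothetical, not from a real Essbase cube; a real extract would come through Smart View or the Java API.

```python
# Hypothetical sketch: denormalize cube-style cells (dimension tuples ->
# measure values) into flat rows, the shape Tableau consumes naturally.

def flatten_cube(cells, dim_names):
    """Turn {(member1, member2, ...): value} cells into flat row dicts."""
    rows = []
    for members, value in cells.items():
        row = dict(zip(dim_names, members))  # one column per dimension
        row["Amount"] = value                # the measure becomes a column
        rows.append(row)
    return rows

# Illustrative cell data, keyed by (Period, Region, Account) members.
cells = {
    ("2016-Q1", "East", "Revenue"): 1200.0,
    ("2016-Q1", "West", "Revenue"): 950.0,
}
flat = flatten_cube(cells, ["Period", "Region", "Account"])
# Each row now carries every dimension as a plain column.
```

Aggregation and business logic would be applied either before this step (inside the cube via member formulas) or on the flat rows afterward.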

What is stored in Essbase dimensions and cells is often detailed data elements (financial, sales, etc.) that need some business transformation applied before extraction out of the cube, so this ETL process prepares the data for ultimate consumption by Tableau. This is part of the art of the design: you must understand what class of information you are after from the raw source data in the cube. It is part of the modeling exercise you go through, and it is critical to get it right in order for the data to end up in a structure that Tableau can visualize.

Now for the actual mechanics of extracting data from Essbase, you have a few options. Essbase provides several ways to get data out of the cube.



The diagram above shows two options for extracting data from Essbase. Smart View is one option that leverages a spreadsheet approach for extracting, transforming and flattening data out of the cube in preparation for channeling it further downstream. While Smart View is not a pure programmatic API, its Excel spreadsheet capabilities allow for a lot of ad-hoc exploration and proof-of-concept work when getting data out of the cube, and what can be done with Smart View and the supported Essbase APIs available through Excel should not be underestimated.

The second option shown in the diagram is using the Essbase Java API. The Java API allows for directly querying the Essbase database and gives very dynamic and flexible access to the cube. This can be the most robust way to get at data in the cube, but it is also the most development intensive.

One thing to note is that Smart View and the Java API are not mutually exclusive. Behind the scenes, Smart View uses the Java API and functions as a middleman service that allows Excel to interface with Essbase. There is a Smart View server (aka Analytics Provider Services, or APS for short) which exposes web services accessed by Smart View; APS in turn uses the Essbase Java API to talk natively to the Essbase server.

The main goal of this step, whether using Smart View or the Java API, is to extract the cube data that we ultimately want to see in Tableau.

The next step is storing the data extracted in the first step. The goal here is to store the flattened data in Cassandra tables. This requires a custom loader app to take the flattened data and load it into Cassandra. What is critical to consider in the design up front is whether the load process will be purge-and-reload, time-series DW loading (fast changing dimensional data) or change-data DW loading (slowly changing dimensional data). See diagram below.
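The change-data (slowly changing dimension) option deserves a sketch, since it is the subtlest of the three. Below is an illustrative, in-memory type-2 style upsert in plain Python; the key and column names are hypothetical, and a real loader would issue the equivalent writes against Cassandra.

```python
# Illustrative sketch (not a real Cassandra client): a type-2 style load
# that expires the current record slice and appends the new one.

def scd2_upsert(table, key, new_value, load_date):
    """Expire the open record for `key` (if changed) and append a new slice."""
    current = [r for r in table if r["key"] == key and r["end_date"] is None]
    if current and current[0]["value"] == new_value:
        return table  # value unchanged, nothing to do
    for r in current:
        r["end_date"] = load_date  # close out the old slice
    table.append({"key": key, "value": new_value,
                  "start_date": load_date, "end_date": None})
    return table

table = []
scd2_upsert(table, "acct-100", 500, "2016-01-01")
scd2_upsert(table, "acct-100", 750, "2016-02-01")
# table now holds one expired slice and one open slice for acct-100
```

Purge-and-reload, by contrast, simply truncates the target tables and reloads everything, which is simpler but loses history.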



Storing the data in Cassandra sets us up for the final stage of the process: creating the Tableau Data Extract that delivers the final data processing stage. Note that in preparing the data for loading into Cassandra, Spark can be used to aid the ETL process. One often overlooked feature of Apache Spark is that it is an excellent ETL tool; in fact, Spark deployment efforts often end up performing quite a bit of ETL logic in order to prepare data for the final stage of modeling and machine learning. Apache Spark is a great tool for the emerging field of realtime ETL that is powering the next evolution of data movement in Big Data environments.

The next step in the process is putting the Cassandra structured data in an environment where the Cassandra tables can be made visible to Tableau for realtime extraction and modeling. This is where Apache Spark comes into the picture. Normally, if you set up Cassandra as a direct data source for Tableau, you will hit processing limitations: Cassandra can't perform the joins and aggregations Tableau needs, so with Cassandra alone the analytics would have to occur on the Tableau client side. With Spark in the picture, however, this processing can happen within the Spark cluster.
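As a hypothetical example (table and column names are made up), once the Cassandra tables are exposed through Spark SQL, a Tableau query like the following has its join and aggregation executed inside the Spark cluster rather than on the Tableau client:

```sql
-- Illustrative query: Cassandra alone cannot execute this join/GROUP BY,
-- but Spark SQL over the Cassandra tables can.
SELECT f.region,
       d.fiscal_quarter,
       SUM(f.amount) AS total_amount
FROM   sales_facts f
JOIN   period_dim d ON f.period_key = d.period_key
GROUP BY f.region, d.fiscal_quarter;
```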

Here is a final picture of the major components in the workflow and processing flow:



While there are some pitfalls to be wary of, that is the case in any Big Data build out, and using products like Essbase and Tableau doesn't make the build out any easier. It would be nice to have fewer moving parts, but with a sound deployment and infrastructure this architecture can be made to scale out, and it is viable for supporting smaller footprint deployments as well.

Here are a couple of useful links that describe in more detail how the Spark, Cassandra and Tableau integration works:

With this architecture you get the scalability of Spark and Cassandra for both data processing and storage scale out. In addition, with this approach you avoid a common requirement with Tableau to create TDEs (Tableau Data Extracts) that are cached/stored on Tableau Server, because oftentimes source systems such as Essbase, and even traditional RDBMS environments, don't scale to support Tableau Server/Desktop needs for realtime aggregations and transformations. Apache Spark steps in to provide the Big Data computational backbone needed to drive the Tableau realtime visualizations and modeling. While Tableau Server is great at serving the Tableau web UI and helping with some of the data governance (note this is an area Tableau is improving in), Tableau's server-side storage and processing capabilities are somewhat limited.

To sum things up, Essbase cubes and related reporting services are not very scalable or accessible beasts, so this is where the combination of Cassandra and Spark can help out and give Tableau a better compute backbone to drive interactive data visualization of your Essbase cube. Hopefully this information inspires you to look at using Tableau with Essbase and helps you ultimately unlock the potential of your Essbase data!

Tableau and Essbase can be a great combination for building rich reporting and dashboards without the overhead and complexity of traditional data warehousing and BI. Get your financial data out of Essbase and into Tableau, and into the hands of your executives and decision makers. Contact Grand Logic to learn more.

Tuesday, February 2, 2016

Spark Processing for Low Latency Interactive Applications

Apache Spark is typically thought of as a replacement for Hadoop MapReduce for batch job processing. While it is true that Spark is often used for efficient, large scale, distributed cluster processing of compute intensive jobs, it can also be used for low latency operations in more interactive applications.

Note this is different from Spark Streaming and micro-batching. What we are talking about here is using Spark's traditional batch, memory centric MapReduce functionality and its powerful Scala (or Java/Python/R) APIs for low-latency, short duration, interactive processing via REST APIs integrated directly into application code.

The Spark processing API is very powerful and expressive for rich processing, and the Spark compute engine is efficient at optimizing data processing and access to memory and workers/executors. Leveraging this in your interactive CRUD applications can be a boon for application developers. Spark makes this possible with a number of capabilities available to developers once you have tuned your Spark cluster for this type of computing scenario.

First, latency can be reduced by caching Spark contexts and even (when appropriate) caching RDDs. The Job Server open source project is a Spark related project that lets you manage a pool of Spark contexts, essentially creating cached connections to a running Spark cluster. By leveraging Job Server's cached Spark contexts and REST API, application developers can access Spark with lower latency and enable multi-user shared resources and processing on the Spark cluster. Another interesting project that can be useful for interactive applications is Apache Toree - check it out as well.
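To give a feel for the integration, here is a small sketch that composes a job submission request against Job Server's REST API. The endpoint shape follows the spark-jobserver documentation, but the host, app and class names are hypothetical, and this snippet only builds the URL rather than making the HTTP call.

```python
# Sketch: build the POST /jobs URL that submits a job to a shared,
# pre-created Spark context via spark-jobserver's REST API.
from urllib.parse import urlencode

def job_request_url(host, app_name, class_path, context, sync=True):
    """Compose the job-submission URL against a cached/shared context."""
    params = urlencode({
        "appName": app_name,       # jar previously uploaded to Job Server
        "classPath": class_path,   # job class implementing the Spark job
        "context": context,        # reuse a long-lived context = low latency
        "sync": "true" if sync else "false",  # wait for the result inline
    })
    return "http://%s:8090/jobs?%s" % (host, params)

url = job_request_url("jobserver", "my-app",
                      "com.example.LowLatencyJob", "shared-context")
```

In application code you would POST to this URL (e.g. with your HTTP client of choice) and, with `sync=true`, receive the job's result in the response body.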

Secondly, you can set up a standalone Spark cluster adjacent to your traditional application server cluster (a Tomcat servlet engine cluster, for example) that is optimized for handling concurrent application requests. Spark has a number of configuration options that allow a cluster to be tuned for concurrent, short duration job processing. This can be done by sharing Spark contexts as described above, by using the Spark fair scheduler, and by tuning RDD partition sizing for the given set of worker executors so that partition shuffling is kept to a minimum. You can learn more from this video presentation on optimizing Job Server for low-latency and shared concurrent processing.
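For reference, enabling the fair scheduler is a matter of setting `spark.scheduler.mode=FAIR` and pointing `spark.scheduler.allocation.file` at a pool definition file. A hypothetical allocation file (the pool names here are illustrative) might look like:

```xml
<!-- fairscheduler.xml: give interactive requests a dedicated, weighted
     pool so short jobs are not starved by long-running batch work. -->
<allocations>
  <pool name="interactive">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="batch">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```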

By leveraging and tuning a multi-user friendly Spark cluster, application developers are freed to use Spark's powerful Scala, Java, Python and R APIs in ways not available to them in the past. With this capability you can enhance traditional CRUD application development with low-latency MapReduce style functionality to create applications that were not imaginable before.


With this type of architecture, where your traditional application servers use an interactive low-latency Spark cluster via a REST API, you can integrate a variety of data sources and data/analytics services together using Spark. You can, for example, mash up data from your relational database and Cassandra or MongoDB to create processing and data mashups you could not do easily with hand written application code. This approach opens up a bountiful world of powerful Spark APIs to application developers. Keep in mind, of course, that if your Spark operations require execution on a large set of workers/nodes and RDD partitions, response times will likely suffer. But any operation with a reasonable number of stages, and that can be configured to process RDDs with one or a few partitions, has the potential to fit this scenario - again, something for you as the developer to quantify.

Running a Spark cluster tuned for servicing interactive CRUD applications is achievable and one of the next frontiers that Spark is opening up for application developers. It opens the door for data integrations and no-ETL computing that were not feasible or imaginable in the past. Meshing data from multiple data stores and leveraging Spark's powerful processing APIs is now accessible to application developers and no longer solely the realm of backend batch processing developers. Get started today: stand up a Spark cluster, tune it for low-latency processing, set up Job Server and then create some amazing interactive services!


Monday, February 1, 2016

Temporal Database Design with NoSQL


Managing data as a function of time in a database is a common requirement for many applications and data warehousing systems. Knowing when a data element or group of elements has changed, and over what period of time the data is valid, is often a required feature in many applications and analytical systems.

While not easy compared to traditional CRUD database development, supporting this type of bitemporal management functionality in a traditional RDBMS such as MySQL or Oracle is fairly well understood by data modelers and database designers. Such temporal data modeling can be done in a variety of ways in a relational database for both OLTP and OLAP style applications. For example, Oracle and IBM DB2 have built-in extensions for managing bitemporal dimensionality at the table and schema level. It is also possible to roll your own solution with any of the major RDBMS engines by applying time dimension columns (very carefully) to your schema and then, with the appropriate DML and transactions, managing the updating and insertion of new change records. To do this precisely and 100% consistently, the database is required to support durable ACID transactions, something all RDBMSs have in spades. See the Wikipedia links for background on temporal database models.
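A roll-your-own bitemporal table carries both time dimensions as explicit columns. A hypothetical sketch (table and column names are illustrative) might look like:

```sql
-- Bitemporal table sketch: valid time tracks when the fact is true in
-- the real world; transaction time tracks when the row was recorded.
CREATE TABLE account_balance (
  account_id  INT           NOT NULL,
  balance     DECIMAL(12,2) NOT NULL,
  valid_from  DATE          NOT NULL,  -- start of real-world validity
  valid_to    DATE,                    -- NULL = still valid
  txn_from    TIMESTAMP     NOT NULL,  -- when this row was recorded
  txn_to      TIMESTAMP,               -- NULL = current knowledge
  PRIMARY KEY (account_id, valid_from, txn_from)
);
```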

Now this is all great: temporal and bitemporal table/schema design is an understood concept among data architects in the RDBMS world. But how do you do this if you are on the Big Data and NoSQL bandwagon? To begin with, most NoSQL databases lack support for ACID transactions, which is a prerequisite for handling temporal operations on slowly changing dimensions (temporal data) and bitemporal dimensions (a valid time dimension plus a transaction time dimension). ACID transactions are required in order to properly mark expired records as new records are appended. Records must never overlap, and they must be properly and precisely expired as new valid time and transaction time record slices are added to the database.
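The expire-and-append step is exactly where atomicity matters. Continuing the hypothetical roll-your-own schema, the two statements below must commit or fail as a unit, or overlapping record slices can appear:

```sql
-- Expire the current record and append the new slice atomically.
BEGIN;

UPDATE account_balance
SET    txn_to = CURRENT_TIMESTAMP
WHERE  account_id = 100
  AND  txn_to IS NULL;

INSERT INTO account_balance
  (account_id, balance, valid_from, valid_to, txn_from, txn_to)
VALUES
  (100, 750.00, DATE '2016-02-01', NULL, CURRENT_TIMESTAMP, NULL);

COMMIT;
```

Without a transaction, a failure between the UPDATE and the INSERT would leave no current record, and a concurrent writer could create two open slices: the overlap most NoSQL stores cannot prevent on their own.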

NoSQL databases such as Cassandra and Couchbase are powerful database engines that can be leveraged for a wide segment of data processing and storage needs. NoSQL databases offer many benefits, including built-in distributed storage/processing, flexible schema modeling and efficient sparse data management. Many of these benefits come at a price, though, that limits NoSQL database applicability in cases where durable ACID transactions are required, such as managing multi-row, multi-table transactions for both OLTP and OLAP data processing.

To address this limitation, a NoSQL database such as Couchbase or Cassandra can be paired with an ACID database in such a way (the pairing is both operational and at the schema design level) as to allow using the NoSQL database for what it is best at while supporting bitemporal operations via the RDBMS. Under the hood this is done seamlessly by a data serialization and deserialization API that synchronizes and coordinates DML operations between the RDBMS and the NoSQL database. The schema design provides a polyglot database framework that supports temporal and bitemporal data modeling, along with a data access and query API that supports durable bitemporal operations while preserving the flexibility and advantages of NoSQL data modeling (document, key/value, etc.).
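As a very rough, hypothetical sketch of such a pairing (everything here is illustrative: the stores are in-memory stand-ins, and real code would need a transaction or compensation scheme spanning the two writes), the write path keeps the bitemporal bookkeeping on the ACID side while the document lives on the NoSQL side:

```python
# Hypothetical polyglot write path: ACID store tracks bitemporal record
# slices; NoSQL store holds the flexible document payload.

rdbms_log = []   # stands in for the ACID bitemporal tables
doc_store = {}   # stands in for Couchbase/Cassandra

def bitemporal_put(key, document, valid_from, txn_time):
    # 1) close out the current bitemporal record in the ACID store
    for rec in rdbms_log:
        if rec["key"] == key and rec["txn_to"] is None:
            rec["txn_to"] = txn_time
    # 2) append the new record slice, then write the NoSQL document
    rdbms_log.append({"key": key, "valid_from": valid_from,
                      "txn_from": txn_time, "txn_to": None})
    doc_store[key] = document

bitemporal_put("cust-1", {"tier": "gold"}, "2016-01-01", "t1")
bitemporal_put("cust-1", {"tier": "platinum"}, "2016-02-01", "t2")
```

Queries "as of" a given transaction time would consult the ACID log to find the right slice, then fetch the corresponding document version from the NoSQL side.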

This approach can be applied to NoSQL databases in OLTP, data warehousing and Big Data environments. So leverage your favorite NoSQL database and get the best of both worlds! Get your polyglot engines going - your favorite NoSQL database just got bitemporal! Contact Grand Logic to learn how we can help you build your next bitemporal database environment.